
Bulk Uploader Design

Sandeep Dolia edited this page Sep 14, 2015 · 21 revisions


Introduction

The HMIS Bulk Uploader process enables importing large HMIS CSV, XML, and (in the future) JSON files into the HMIS schema. The bulk uploader stages the data, dedupes it, and loads the records into the operational database, keeping logs of the activity so that an import can be rolled back.

Design

The Bulk Uploader process receives files either via FTP or through uploads of XML/CSV files from the HMIS Admin screens. It is important to note that the files must be in the HMIS-approved format. Once the files are in the FTP location in accordance with the documentation, they are processed by a Worker process.

First, the file under processing is backed up to a separate location (or, eventually, into our big-data warehouse). The contents of the file are then unmarshalled and validated. Once the validations are complete, the data in the file is persisted into the HMIS Staging schema. Please note that the Staging schema is not where the data is eventually stored; it is only where the data is staged. The worker process also makes REST calls to the "Client Dedup Microservice", which uses a locally hosted "OpenEMPI" application to determine a unique client (homeless person). In this way the entire HMIS data from the file is stored in our Staging schema. A "Bulk_Upload" table keeps an audit trail of all activities performed within a bulk upload process.

Once the data has been properly stored in the Staging schema, we replicate it from the "Staging" schema to the "Live" schema, which is where the real HMIS data is stored. Once the data has reached the "Live" schema, it is synced via the Sync process into Hadoop (HBase).
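The dedup call the worker makes might look like the sketch below. The endpoint URL, payload fields, and response field name are assumptions for illustration; the actual Client Dedup Microservice API is not documented on this page.

```python
import json
import urllib.request

# Hypothetical endpoint for the locally hosted Client Dedup Microservice
# (backed by OpenEMPI); the URL, payload fields, and response field name
# are assumptions, not the documented API.
DEDUP_URL = "http://localhost:8080/client-dedup/api/match"

def build_match_payload(first_name, last_name, dob, ssn=None):
    """Identifying fields sent to the matcher; keys with no value are dropped."""
    fields = {"firstName": first_name, "lastName": last_name,
              "dob": dob, "ssn": ssn}
    return {k: v for k, v in fields.items() if v is not None}

def find_unique_client(first_name, last_name, dob, ssn=None):
    """POST the client's identifiers; return the canonical client id,
    or None when the service reports no match."""
    data = json.dumps(build_match_payload(first_name, last_name, dob, ssn))
    req = urllib.request.Request(
        DEDUP_URL, data=data.encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("dedupClientId")  # assumed response field
```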

Files --FTP--> FTP Server (EC2 instance) --Worker--> Staging schema --Worker--> Live schema --Sync--> HBase
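The flow above can be sketched as a single worker pass. This is a minimal illustration, not the real implementation: the dict "schemas" stand in for the Staging and Live databases, and the validation step is a placeholder for the real format checks.

```python
import hashlib
import shutil
from pathlib import Path

def process_upload(path: Path, backup_dir: Path, staging: dict, live: dict):
    """One worker pass: back the file up, unmarshal and validate,
    stage the rows, then replicate the staged rows to the live store."""
    # 1. Back the raw file up before touching it
    #    (or, eventually, ship it to the big-data warehouse).
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, backup_dir / path.name)

    # 2. Unmarshal + validate (the real format checks would go here).
    rows = [line.strip() for line in path.read_text().splitlines() if line.strip()]

    # 3. Stage under a key derived from the file contents, so re-running
    #    the same file does not create a second staging copy.
    upload_id = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
    staging[upload_id] = rows

    # 4. Replicate Staging -> Live once staging succeeded.
    live[upload_id] = list(staging[upload_id])
    return upload_id
```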

Bulk Upload Stages

  • INITIAL - When the Worker picks up a file, it creates a record in the Bulk_Upload table with the INITIAL status.
  • ERROR - When an error is encountered during file validation or staging, the status is updated to ERROR. A "Message" column will be added to the Bulk_Upload table to hold additional information or error messages related to the bulk upload process.
  • STAGING - When the data from a file reaches the Staging schema successfully, the status is changed to STAGING.
  • LIVE - When the data from the Staging schema is replicated to the Live schema, the status is changed from STAGING to LIVE.
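The four statuses above form a small state machine. A sketch follows; the allowed transitions are inferred from the stage descriptions (ERROR is reachable from the pre-LIVE states), so treat the table as an assumption rather than a documented rule.

```python
from enum import Enum

class UploadStatus(Enum):
    INITIAL = "INITIAL"
    ERROR = "ERROR"
    STAGING = "STAGING"
    LIVE = "LIVE"

# Transitions inferred from the stage descriptions: the worker records
# INITIAL, moves to STAGING and then LIVE on success, and may drop to
# ERROR while validating or staging.
TRANSITIONS = {
    UploadStatus.INITIAL: {UploadStatus.STAGING, UploadStatus.ERROR},
    UploadStatus.STAGING: {UploadStatus.LIVE, UploadStatus.ERROR},
    UploadStatus.ERROR: set(),
    UploadStatus.LIVE: set(),
}

def advance(current: UploadStatus, nxt: UploadStatus) -> UploadStatus:
    """Return the new status, refusing transitions the table does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```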

Bulk Upload Staging & Validations

The HMIS bulk uploader process is designed to be fault tolerant. When a system receives data from external users or applications via FTP, the data can contain various abnormalities, so it needs to be staged and validated before it is processed further.

  • Duplicate file validation: the process is terminated immediately when a duplicate file is received.
  • Data is hydrated into the "Staging" schema, where the client dedup logic actually runs.
  • Each HMIS table has an Export_ID associated with it, which makes rolling back a bulk upload easy if errors are encountered.
  • A bulk upload typically requires multiple upload iterations depending on the quality of the data, so staging the data in the "Staging" schema before passing it to the "Live" environment keeps the data in the "Live" schema clean.
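The duplicate-file check in the first bullet is commonly implemented with a content hash recorded alongside each upload. A minimal sketch, assuming that approach: the in-memory set of seen hashes stands in for a lookup against the Bulk_Upload audit table.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks so large CSV/XML
    exports do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_duplicate(path: Path, seen_hashes: set) -> bool:
    """True if this exact file was uploaded before; otherwise record it.
    'seen_hashes' stands in for a lookup against the Bulk_Upload table."""
    fp = file_fingerprint(path)
    if fp in seen_hashes:
        return True
    seen_hashes.add(fp)
    return False
```

A renamed copy of the same file hashes identically, so the check catches re-submissions regardless of filename.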