Original Design Notes: Big Data within the Open HMIS Data Warehouse Project
Big Data is an integral part of this project for the following reasons:
- We need a way to store de-identified data sets (with homeless clients' personally identifying information removed or obfuscated) separately, so that researchers and policymakers can mine them with analytic tools such as Tableau.
- All reports will run against HBase, not PostgreSQL. This keeps PostgreSQL's transactional workload as fast as possible by offloading all reporting to the HBase store. HBase will contain all years of data, not just the last 5 years, so reports can span more years than the PostgreSQL relational database holds.
- The amount of data generated by users, their notifications, and application logs will be massive. We are currently planning to use HBase, built on top of Hadoop, which allows MapReduce processing of the data. We do not want these operations to affect the performance of the operational databases.
- We also want to be able to "roll off" inactive records from the relational operational database, and keep only the data "of record" there. In other words, if data is deleted, made inactive, or superseded by more current data for the same reportable purpose, the superseded/inactive data will be persisted in the Big Data table and linked back to the operational database by client and/or enrollment ID.
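The roll-off described in the last point can be sketched as follows. This is only a minimal illustration in Python, using in-memory collections in place of PostgreSQL and HBase. The `Record` fields, the status values, and the row-key layout are assumptions made for the example, not part of the project's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical operational record; field names are illustrative."""
    record_id: int
    client_id: str
    enrollment_id: str
    status: str      # assumed values: "active", "inactive", "deleted", "superseded"
    updated_at: str  # ISO 8601 timestamp


def archive_row_key(rec: Record) -> str:
    """Build an archive key that links back to the operational database
    by client ID and enrollment ID (assumed key layout)."""
    return f"{rec.client_id}|{rec.enrollment_id}|{rec.updated_at}|{rec.record_id}"


def roll_off(records: List[Record]) -> Tuple[List[Record], Dict[str, Record]]:
    """Split records into those kept in the operational store ("of record")
    and those moved to the archive store."""
    of_record: List[Record] = []
    archived: Dict[str, Record] = {}
    for rec in records:
        if rec.status == "active":
            of_record.append(rec)
        else:
            # Deleted, inactive, and superseded rows are persisted, not dropped.
            archived[archive_row_key(rec)] = rec
    return of_record, archived


records = [
    Record(1, "C100", "E1", "superseded", "2014-01-05T09:00:00"),
    Record(2, "C100", "E1", "active", "2014-06-01T09:00:00"),
    Record(3, "C200", "E7", "deleted", "2014-02-10T15:30:00"),
]
kept, archived = roll_off(records)
```

Leading the assumed key with the client ID and enrollment ID keeps all archived versions of one enrollment adjacent, so they could be fetched with a single prefix scan in a sorted key-value store such as HBase.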
Other clarifying points:
- Some of the summary data could be report totals for each year.
- Access control and other logs would go straight into HBase, and never into PostgreSQL.
- The PostgreSQL database would keep many years of historical data "of record", not interim or inactivated records. It will be a "pristine and clean" set of HMIS reportable data covering perhaps 10 years. We could use a sync process to load data into HBase, but the data would first have to be committed to PostgreSQL, so that it has relational integrity for operational purposes.
- If data is deleted from PostgreSQL (perhaps via the API), the corresponding record is eventually inactivated in HBase as well. PostgreSQL therefore "leads" HBase by less than 24 hours.
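The sync behaviour in the last two points can be sketched as a single pass that runs after data is committed to the operational store. This is a hedged illustration in Python: plain dictionaries stand in for PostgreSQL and HBase, and the `sync` function, the `active` flag, and the record layout are assumptions for the example rather than the project's actual design.

```python
from typing import Dict


def sync(operational: Dict[int, dict], warehouse: Dict[int, dict]) -> None:
    """One sync pass: copy committed rows into the warehouse and inactivate
    warehouse rows whose source row has been deleted."""
    # Upsert every row currently committed in the operational store.
    for record_id, row in operational.items():
        warehouse[record_id] = {**row, "active": True}
    # Rows deleted upstream are kept for history but flagged inactive.
    for record_id, row in warehouse.items():
        if record_id not in operational:
            row["active"] = False


operational = {1: {"client_id": "C100"}, 2: {"client_id": "C200"}}
warehouse: Dict[int, dict] = {}

sync(operational, warehouse)   # first pass loads both rows
del operational[2]             # e.g. the record is deleted via the API
sync(operational, warehouse)   # next pass inactivates row 2 in the warehouse
```

Under these assumptions, running a pass at least once a day would keep the warehouse less than 24 hours behind the operational database, and a deleted record remains available for historical reporting because it is flagged rather than removed.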