
Q1: Why test data at rest rather than before it gets to Hadoop?

Answer: data should be tested in both places, but it can be tested most reliably & effectively within its target location: Hadoop has more ways of expressing a test against data at rest than ETL processes have, and those tests have access to all data that has been loaded, not just a single batch. Specifically:

  • Testing of uniqueness & completeness doesn't work as well in the ETL process since a tested file may get loaded twice or dropped before load, and uniqueness may require analysis of multiple batches.
  • Testing of data distributions can be problematic within ETL since the distribution problem may not be revealed until enough data has been processed that some has already been loaded.
  • Policy testing doesn't even apply to ETL at all. This includes directory, file & table naming conventions, ownership, file types, compression, security, completeness & freshness of stats, etc.
  • Certain types of tests, such as foreign key constraints across two tables, are difficult to write and to run with good performance within ETL.
  • Consistency testing between two peer tables (ex: for failover across two separate clusters) isn't possible at all within a single ETL process that loads data into only one of them.

For this reason HadoopInspector can be used either after final loading or earlier in the ETL pipeline by targeting staging tables or interim files.
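
As an illustration, here is a minimal sketch of the kind of uniqueness check that is natural against data at rest but awkward inside an ETL batch. The table and column names (sales.orders, order_id) are hypothetical, and the sketch simply shells out to the hive CLI; HadoopInspector's own check mechanics may differ.

```python
"""Hypothetical sketch: a uniqueness check run against data at rest.

The table (sales.orders) and key column (order_id) are illustrative only.
"""
import subprocess

# Count surplus rows: every key that appears N times contributes N - 1 violations.
CHECK_SQL = """
    SELECT COALESCE(SUM(cnt - 1), 0)
    FROM (SELECT order_id, COUNT(*) AS cnt
          FROM sales.orders
          GROUP BY order_id
          HAVING COUNT(*) > 1) dups
"""

def run_check(sql):
    """Run the SQL via the hive CLI in silent mode and return the violation count."""
    out = subprocess.check_output(['hive', '-S', '-e', sql])
    return int(out.strip() or 0)

if __name__ == '__main__':
    print('duplicate order_id rows: %d' % run_check(CHECK_SQL))
```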

Q2: What types of tests should still be performed within the ETL process?

Answer: since the cost of repair is generally lower when invalid data is caught by the ETL process, whatever can be effectively tested at that stage should be. This includes, for example (see the sketch after this list):

  • testing of logic conditions completely contained within a single batch of data
  • testing of field formats, lengths, valid values, unknown values, min & max values, etc
  • testing of foreign key relationships when practical
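
As a sketch of the batch-level checks listed above, the following hypothetical Python snippet validates field formats, valid values, and min & max ranges for a single delimited batch file; the field names, value sets, and limits are illustrative only.

```python
"""Illustrative batch-level field checks inside an ETL step.

Field names, value sets, and limits are hypothetical; the point is that
these tests need only the single batch of data in hand.
"""
import csv

VALID_STATUSES = {'new', 'shipped', 'cancelled'}

def validate_row(row):
    """Return a list of problems found in one row of the batch."""
    problems = []
    if not row.get('customer_id'):
        problems.append('missing customer_id')
    if row.get('status') not in VALID_STATUSES:
        problems.append('invalid status: %r' % row.get('status'))
    try:
        amount = float(row.get('amount') or '')
        if not 0 <= amount <= 1000000:
            problems.append('amount out of range: %s' % amount)
    except ValueError:
        problems.append('non-numeric amount: %r' % row.get('amount'))
    return problems

def validate_batch(path):
    """Yield (line_number, problems) for each invalid row in one batch file."""
    with open(path, newline='') as f:
        for num, row in enumerate(csv.DictReader(f), start=2):
            problems = validate_row(row)
            if problems:
                yield num, problems
```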

Q3: Why not write all test results directly into Hadoop rather than SQLite?

Answer: the data volumes are generally small, SQLite is ubiquitous, and there is some need for transactions, for example when annotating bad data.
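
As an illustration of that transactional need, the following sketch records a check result and an annotation in SQLite within a single transaction. The two-table layout and column names are assumptions made for the example, not HadoopInspector's actual results schema.

```python
"""Sketch of recording check results and annotations in SQLite.

The table layout and column names below are assumptions for the example.
"""
import sqlite3

conn = sqlite3.connect('inspector_results.sqlite')
conn.execute("""CREATE TABLE IF NOT EXISTS check_results
                (run_id TEXT, table_name TEXT, check_name TEXT,
                 violation_cnt INTEGER, run_ts TEXT)""")
conn.execute("""CREATE TABLE IF NOT EXISTS annotations
                (run_id TEXT, table_name TEXT, check_name TEXT, note TEXT)""")

# Writing a result and its annotation atomically is the small, transactional
# workload SQLite handles well; the with-block commits on success or rolls back.
with conn:
    conn.execute("INSERT INTO check_results VALUES (?, ?, ?, ?, datetime('now'))",
                 ('20150930_01', 'sales.orders', 'order_id_uniqueness', 12))
    conn.execute("INSERT INTO annotations VALUES (?, ?, ?, ?)",
                 ('20150930_01', 'sales.orders', 'order_id_uniqueness',
                  'duplicates traced to a re-run batch; flagged for reload'))
conn.close()
```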