
StorageLoader

Anton Parkhomenko edited this page Aug 24, 2017 · 3 revisions


StorageLoader has been deprecated since the R90 Lascaux release and replaced by RDB Loader

An overview of how the StorageLoader orchestrates the loading of data from S3 into Redshift:

  • Data from the enriched Snowplow event files generated by the Scalding process on EMR is read and loaded into Amazon Redshift
  • The enriched event files are then moved from the in-bucket (which was the archive bucket for the EmrEtlRunner) to the archive bucket (for the StorageLoader)
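The archiving step above amounts to re-keying each enriched file from the in-bucket to the StorageLoader's archive bucket while preserving the run folder. A minimal sketch, assuming an illustrative bucket layout and `run=...` folder convention (not the tool's actual configuration):

```python
# Sketch of the archiving step: after loading, each enriched event file is
# moved from the in-bucket to the StorageLoader's archive bucket, keeping
# the run folder structure. Bucket names and layout are illustrative.

def archive_key(src_key: str, in_prefix: str, archive_prefix: str) -> str:
    """Map an enriched file's key in the in-bucket to its archive key."""
    if not src_key.startswith(in_prefix):
        raise ValueError(f"key {src_key!r} is not under {in_prefix!r}")
    return archive_prefix + src_key[len(in_prefix):]

src = "enriched/good/run=2017-08-24-10-30-00/part-00000"
dst = archive_key(src, "enriched/good/", "archive/enriched/")
print(dst)  # archive/enriched/run=2017-08-24-10-30-00/part-00000
```

In practice the move is an S3 copy followed by a delete of the original object; the function above only computes the destination key.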

The StorageLoader is configured via the configuration file shared with EmrEtlRunner. For more information, see the guide to setting up the StorageLoader.
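To make the shared-configuration idea concrete, the sketch below models a fragment of such a config as a Python dict and resolves bucket paths from it. The field names and bucket URLs are illustrative assumptions, not the actual `config.yml` schema; consult the setup guide for the real fields:

```python
# A minimal sketch of configuration shared between EmrEtlRunner and
# StorageLoader, shown as a Python dict approximating a YAML structure.
# All field names and S3 paths below are illustrative only.
CONFIG = {
    "aws": {
        "s3": {
            "buckets": {
                "enriched": {
                    "good": "s3://acme-snowplow/enriched/good",
                    "archive": "s3://acme-snowplow/enriched/archive",
                },
                "shredded": {
                    "good": "s3://acme-snowplow/shredded/good",
                },
            }
        }
    }
}

def bucket(path: str) -> str:
    """Resolve a dotted path like 'enriched.good' inside the buckets section."""
    node = CONFIG["aws"]["s3"]["buckets"]
    for part in path.split("."):
        node = node[part]
    return node

print(bucket("enriched.archive"))  # s3://acme-snowplow/enriched/archive
```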

The StorageLoader's role in the ETL process

The enriched files contain the tab-separated values that feed the atomic.events table and the custom tables. The shredding process

  1. reads Snowplow enriched events from the enriched good files (produced and temporarily stored in HDFS as a result of the enrichment process);
  2. extracts any unstructured (self-describing) event JSONs and contexts JSONs found;
  3. validates that these JSONs conform to the corresponding schemas located in Iglu registry;
  4. adds metadata to these JSONs to track their origins;
  5. writes these JSONs out to nested folders on S3 dependent on their schema.
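Steps 2–5 above can be sketched for a single self-describing JSON. The `iglu:vendor/name/format/version` URI shape is Snowplow's real schema-addressing convention, but the metadata field and the output folder layout below are simplified assumptions, not the shredder's exact output:

```python
# Sketch of shredding one self-describing JSON: validate its Iglu schema
# URI, attach origin metadata, and derive a schema-based output folder.
# The metadata shape and folder layout are simplified assumptions.
import json
import re

IGLU_URI = re.compile(r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+-\d+-\d+)$")

def shred(self_describing_json: str, etl_timestamp: str) -> tuple:
    data = json.loads(self_describing_json)
    match = IGLU_URI.match(data.get("schema", ""))
    if match is None:                                  # step 3: URI must be valid
        raise ValueError("not a valid Iglu schema URI")
    vendor, name, fmt, version = match.groups()
    data["hierarchy"] = {"rootTstamp": etl_timestamp}  # step 4: origin metadata
    folder = f"{vendor}/{name}/{fmt}/{version}"        # step 5: nested S3 path
    return folder, data

event = '{"schema": "iglu:com.acme/link_click/jsonschema/1-0-0", "data": {}}'
folder, shredded = shred(event, "2017-08-24 10:30:00")
print(folder)  # com.acme/link_click/jsonschema/1-0-0
```

Note that full schema validation (step 3 proper) means checking the JSON body against the schema fetched from an Iglu registry; the regex here only checks that the URI is well-formed.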

As a result, the enriched good file is "shredded" into several shredded good files (provided the event file contained data from at least one of the following: custom self-describing events, custom contexts, or configurable enrichments):

  1. a TSV formatted file containing the data for atomic.events table;
  2. possibly one or more JSON files containing custom, user-specific (self-describing) events extracted from the unstruct_event field of the enriched good file;
  3. possibly one or more JSON files containing custom contexts extracted from the contexts field of the enriched good file;
  4. possibly one or more JSON files containing configurable enrichments (if any were enabled) extracted from the derived_contexts field of the enriched good file.

Those files end up in S3 and, under the StorageLoader's orchestration, are used to load the data into the Redshift tables dedicated to each of the above file types.
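The load itself is performed with Redshift COPY statements, one per shredded file type: a delimiter-based COPY for the atomic.events TSV and a JSONPaths-based COPY for each shredded JSON type. A sketch of the statements being built, with illustrative table names, S3 paths, and IAM role (the real tool derives these from its configuration and the Iglu schemas):

```python
# Sketch of the load step: build Redshift COPY statements for the
# shredded files. Tables, paths, role ARN, and options are illustrative.

def copy_tsv(table: str, s3_path: str, role_arn: str) -> str:
    """COPY for the tab-separated atomic.events file."""
    return (f"COPY {table} FROM '{s3_path}' "
            f"CREDENTIALS 'aws_iam_role={role_arn}' "
            f"DELIMITER '\\t' MAXERROR 0 EMPTYASNULL;")

def copy_json(table: str, s3_path: str, jsonpaths: str, role_arn: str) -> str:
    """COPY for a shredded JSON file, mapped via a JSONPaths file."""
    return (f"COPY {table} FROM '{s3_path}' "
            f"CREDENTIALS 'aws_iam_role={role_arn}' "
            f"JSON '{jsonpaths}' MAXERROR 0;")

role = "arn:aws:iam::123456789012:role/RedshiftLoad"
print(copy_tsv("atomic.events",
               "s3://acme-snowplow/shredded/good/run=2017-08-24-10-30-00/atomic-events",
               role))
```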

The whole process can be depicted with the following dataflow diagram.
