EmrEtlRunner: allow shredding without prior enrichment #927
Comments
Great, thanks Alex. To give you a little more context on how it could be used here: we plan to use Snowplow for tracking our server-side events. We will track all typed events happening in our business platform as unstructured events. We are talking about 100 to 1000 different types of events. So in a first step we want to track them all; then in a second step, if we decide to analyse some of them in detail, we will activate shredding for those.
Right - yes, that makes a lot of sense. Just as a note - at the moment validation of typed events against their schemas only happens in the shredding step. We will eventually add it to the enrichment step as well, but in the meantime, when you launch a new event type, you probably want to shred a sample just to make sure that your instances fit your schema...
Also added #928 based on use case ^^
Thinking about this some more. If a user does
We also need to decide: does --skip enrich imply --skip staging,archive as well? @yalisassoon what would be your preferred behaviour for running shredding without enrichment?
Let's do this in 0.9.7
Instead of 'skip shred', what about a 'shred only' option that takes a
Sounds good @yalisassoon! Let's do it that way...
Renamed --process-bucket option to --process-enrich (fixes snowplow/snowplow#972)
Now allows shredding without prior enrichment (fixes snowplow/snowplow#927)
Suggestion from @epantera:
The shredding Hadoop job is completely self-contained, but at the moment the orchestration assumes you want to run the enrichment job followed by the shredding job. So it would be a little fiddly just to do the shredding job.
By way of example, the --skip emr CLI arg skips enrichment plus shredding, while --skip shred skips the shredding. There is no --skip enrich to just skip the enrichment step.
Also, the shredding job by default looks for enriched event files on HDFS, but this is easily circumventable with --skip s3distcp.
Likely work-steps in this ticket:
- Add a --skip enrich CLI arg. --skip emr just becomes an umbrella for --skip enrich,shred
- With --skip enrich but still running shred, come up with a sensible way for enriched events to land on HDFS (this may work already without code changes; needs testing)
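To make the umbrella idea concrete, here is a rough sketch in Python of how a --skip value could be expanded so that emr is treated as shorthand for enrich,shred. This is purely illustrative and is not EmrEtlRunner's actual code: the names ALL_STEPS, UMBRELLAS, expand_skips and shred_only are hypothetical, and the step names are simply the ones mentioned in this ticket.

```python
# Hypothetical sketch only; not EmrEtlRunner's actual implementation.
# Step names are taken from the flags mentioned in this ticket.
ALL_STEPS = {"staging", "s3distcp", "enrich", "shred", "archive"}

# Umbrella names expand to the concrete steps they cover.
UMBRELLAS = {"emr": {"enrich", "shred"}}


def expand_skips(raw):
    """Turn a comma-separated --skip value into a set of concrete steps."""
    skips = set()
    for name in filter(None, (part.strip() for part in raw.split(","))):
        if name in UMBRELLAS:
            skips |= UMBRELLAS[name]
        elif name in ALL_STEPS:
            skips.add(name)
        else:
            raise ValueError(f"unknown step to skip: {name!r}")
    return skips


def shred_only(raw):
    """True when enrichment is skipped but shredding still runs."""
    skips = expand_skips(raw)
    return "enrich" in skips and "shred" not in skips
```

Under this sketch, --skip emr and --skip enrich,shred resolve to the same set, and --skip enrich on its own is the case this ticket wants to support: enrichment is skipped while shredding still runs.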