EmrEtlRunner: allow shredding without prior enrichment #927

Closed
alexanderdean opened this issue Jul 28, 2014 · 7 comments

@alexanderdean
Member

Suggestion from @epantera:

One question about that: is there a way to catch up on unstructured events already
processed by the ETL?

Let's say, for example, that I fire many kinds of unstructured events and at first I
don't want to shred them. If one day it becomes necessary, it would be nice to be
able to re-process them.

The shredding Hadoop job is completely self-contained, but at the moment the orchestration assumes you want to run the enrichment job followed by the shredding job. So it would be a little fiddly just to do the shredding job.

By way of example, the --skip emr CLI arg skips both enrichment and shredding, while --skip shred skips only the shredding. There is no --skip enrich to skip just the enrichment step.

Also, the shredding job by default looks for enriched event files on HDFS, but this is easily circumventable with --skip s3distcp.

Likely work-steps in this ticket:

  • Add a --skip enrich CLI arg. --skip emr just becomes an umbrella for --skip enrich,shred
  • If the user passes --skip enrich but still runs shred, come up with a sensible way for enriched events to land on HDFS (this may work already without code changes - needs testing)
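
To make the first work-step concrete, here is a minimal sketch of how an umbrella skip value could expand into individual steps. It is written in Python purely for illustration and is not EmrEtlRunner's actual code; the names `UMBRELLAS`, `expand_skips` and `jobs_to_run` are hypothetical, and `--skip enrich` is the proposed argument, which does not exist yet at the time of this ticket.

```python
# Hypothetical sketch only: not EmrEtlRunner's real implementation.
# "emr" is treated as an umbrella for the two Hadoop jobs, as proposed above.
UMBRELLAS = {"emr": {"enrich", "shred"}}


def expand_skips(skip_arg):
    """Expand a comma-separated --skip value into a set of step names."""
    steps = set()
    for name in skip_arg.split(","):
        name = name.strip()
        if not name:
            continue  # ignore empty entries such as a trailing comma
        # An umbrella name expands to its members; anything else is kept as-is.
        steps |= UMBRELLAS.get(name, {name})
    return steps


def jobs_to_run(skip_arg):
    """Return the Hadoop jobs that would still run, in pipeline order."""
    skipped = expand_skips(skip_arg)
    return [job for job in ("enrich", "shred") if job not in skipped]


if __name__ == "__main__":
    for value in ("", "shred", "enrich", "emr"):
        print(repr(value), "->", jobs_to_run(value))
```

With this shape, `--skip enrich` leaves only the shredding job to run, which is the case this ticket is about.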
@alexanderdean alexanderdean self-assigned this Jul 28, 2014
@epantera

Great, thanks Alex.

To give you a little bit more context of how it could be used here:

We plan to use Snowplow for tracking our server-side events. We will track all typed events happening in our business platform as unstructured events. We are talking about 100 to 1000 different types of events.

So as a first step we want to track them all - then as a second step, if we decide to analyse some of them in detail, we will activate shredding for them.

@alexanderdean
Member Author

Right - yes that makes a lot of sense. Just as a note - at the moment validation of typed events against their schemas only happens in the shredding step. We will eventually add it to the enrichment step as well, but in the meantime when you launch a new event type you probably want to be shredding a sample just to make sure that your instances fit your schema...

@alexanderdean
Member Author

Also added #928 based on use case ^^

@alexanderdean alexanderdean changed the title from "EmrEtlRunner: make it possible to shred existing events (i.e. don't run enrichment)" to "EmrEtlRunner: allow shredding without prior enrichment" Aug 28, 2014
@alexanderdean
Member Author

Thinking about this some more. If a user does --skip enrich, implying that they want to run the shred but not the enrich, then we need to decide where to get the input for the shred step from. Two options would be:

  1. The processing location in S3 - but this is really where raw events are found, not enriched events
  2. The enriched event location in S3 - again a little confusing, as the user probably doesn't want to shred all enriched events, just the run that failed somehow

We also need to decide: does --skip enrich imply --skip staging,archive as well?

@yalisassoon what would be your preferred behaviour for running shredding without enrichment?

@alexanderdean alexanderdean added this to the Version 0.9.7 milestone Aug 28, 2014
@alexanderdean
Member Author

Let's do this in 0.9.7

@yalisassoon
Member

Instead of 'skip shred', what about a 'shred only' option that takes a
command line argument pointing at the enriched output in s3, which would be
the input for this step?

@alexanderdean
Member Author

Sounds good @yalisassoon ! Let's do it that way...
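
For illustration, the "shred only" idea could look roughly like the sketch below. It is a Python mock-up under stated assumptions, not the real EmrEtlRunner CLI: the flag name `--shred-only`, the `plan` function and the example S3 path are all hypothetical, and the thread does not specify any of them.

```python
import argparse


def build_parser():
    """Build a toy parser with a hypothetical --shred-only option."""
    parser = argparse.ArgumentParser(prog="runner-sketch")
    parser.add_argument(
        "--shred-only",
        metavar="S3_LOCATION",
        default=None,
        help="(hypothetical) S3 location of enriched output to use as shred input",
    )
    return parser


def plan(argv):
    """Decide which jobs run and where the shred step reads its input from."""
    args = build_parser().parse_args(argv)
    if args.shred_only is None:
        # Default orchestration: enrichment followed by shredding.
        return {"jobs": ["enrich", "shred"], "shred_input": None}
    if not args.shred_only.startswith("s3://"):
        raise ValueError("expected an s3:// location, got %r" % args.shred_only)
    # Shred only: skip enrichment and read the enriched events the user points at.
    return {"jobs": ["shred"], "shred_input": args.shred_only}


if __name__ == "__main__":
    # The bucket and path are made up for the example.
    print(plan(["--shred-only", "s3://example-bucket/enriched/good/"]))
```

Passing the input location explicitly sidesteps the question raised earlier of which S3 location the shred step should read from when enrichment is skipped.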

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 26, 2020
Renamed --process-bucket option to --process-enrich (fixes snowplow/snowplow#972)
Now allows shredding without prior enrichment (fixes snowplow/snowplow#927)
peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020
Renamed --process-bucket option to --process-enrich (fixes snowplow/snowplow#972)
Now allows shredding without prior enrichment (fixes snowplow/snowplow#927)