EmrEtlRunner: treat archive_enriched and archive_shredded as separate steps #3401

alexanderdean · 2017-08-21T08:54:22Z

At the moment we confusingly use the archive_enriched step to refer to archiving both enriched and shredded data. This is problematic if a user is running Snowplow without shredding & loading data into Redshift, because:

- If we disable archive_enriched, then the enriched events are left in that folder and the next run can't start
- If we don't disable archive_shredded, then the S3DistCp trying to move the shredded data will fail due to no data being present

Note that I am open to other suggestions (e.g. hardening the S3DistCp step), but the solution of treating archive_enriched and archive_shredded as separate steps seems fairly clean and simple.

alexanderdean · 2017-08-21T08:54:40Z

/cc @stdfalse

BenFradet · 2017-08-21T14:52:10Z

The current behavior is:

- if we skip archive_enriched, we don't do any archiving of either enrich or shred
- otherwise:
  - if enrich, we archive enrich and shred for the current run id
  - otherwise, we archive enrich and shred for the last run id
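
As a rough illustration of the coupled behavior listed above, here is a minimal Ruby sketch. It is not the actual runner.rb code: the method name `current_archive_plan`, the `skip` array of step names, and the returned symbols are all hypothetical and only mirror the wording of the list.

```ruby
# Hypothetical sketch, not the real EmrEtlRunner code.
# `skip` is an array of skipped step names, e.g. ['enrich', 'shred'].
# Returns, for each data set, where the archived run comes from:
# :none (not archived), :current_run or :last_run.
def current_archive_plan(skip)
  # Skipping archive_enriched disables archiving of BOTH enriched and shredded data
  return { enriched: :none, shredded: :none } if skip.include?('archive_enriched')

  # If enrich ran, the current run id is used; otherwise the last run id
  source = skip.include?('enrich') ? :last_run : :current_run
  { enriched: source, shredded: source }
end
```

The point of the sketch is that a single flag and a single condition drive both data sets, which is where the problem described in the issue comes from.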

Do we want to convert that to:

- if we skip archive_enriched, we don't archive enrich
- otherwise:
  - if enrich, we archive enrich for the current run id
  - otherwise, we archive enrich for the last run id

and

- if we skip archive_shredded, we don't archive shred
- otherwise:
  - if shred, we archive shred for the current run id
  - otherwise, we archive shred for the last run id

?
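
To make the proposal concrete, here is a minimal Ruby sketch of the two independent decisions. It is only an illustration under assumptions: `archive_source`, `proposed_archive_plan`, the `skip` array and the returned symbols are hypothetical names, not taken from runner.rb.

```ruby
# Hypothetical sketch, not the real EmrEtlRunner code.
# Decides where one data set is archived from, based on its own archive step
# and on the step that produces it.
def archive_source(skip, archive_step, producing_step)
  return :none if skip.include?(archive_step)

  # If the producing step ran, the current run id is used; otherwise the last one
  skip.include?(producing_step) ? :last_run : :current_run
end

# `skip` is an array of skipped step names.
def proposed_archive_plan(skip)
  {
    enriched: archive_source(skip, 'archive_enriched', 'enrich'),
    shredded: archive_source(skip, 'archive_shredded', 'shred')
  }
end
```

With this shape, a user who does not shred could skip `shred` and `archive_shredded` while still archiving enriched events, for example `proposed_archive_plan(['shred', 'archive_shredded'])`.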

alexanderdean · 2017-08-21T14:53:47Z

Can you clarify the nested booleans, "if enrich" and "if shred"? What conditions are you referring to here?

BenFradet · 2017-08-21T14:58:00Z

I'm referring to https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb#L70-L76

ping @chuwy

alexanderdean · 2017-08-21T15:02:31Z

I think @chuwy wrote this recovery code? I'm not familiar with it.

BenFradet · 2017-08-21T15:02:55Z

yup

chuwy · 2017-08-21T15:31:31Z

Sorry for the delay here.

Regarding the code: if we're in recovery mode, we don't know the exact run id to archive (because the timestamp is lost), so S3DistCp archives the latest directory found; pipeline mode means we're in a usual pipeline run and the run id is known.

EmrEtlRunner decides we're in recovery mode iff archive_enriched is present and enrich is absent.

Does that answer the question?
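
The recovery/pipeline distinction described here could be sketched in Ruby roughly as follows. This is an assumption-laden illustration rather than the code in runner.rb: the method names, the `skip` array, and the `run=YYYY-MM-DD-hh-mm-ss` directory naming (which makes the lexicographic maximum the latest run) are all assumed for the example.

```ruby
# Hypothetical sketch, not the real EmrEtlRunner code.
# Recovery mode: archive_enriched is present (not skipped) and enrich is absent (skipped).
def recovery_mode?(skip)
  !skip.include?('archive_enriched') && skip.include?('enrich')
end

# Picks the run directory to archive.
# Assumes directories are named like 'run=2017-08-21-08-54-22', so that
# string ordering matches chronological ordering.
def run_dir_to_archive(run_dirs, skip, current_run_id = nil)
  if recovery_mode?(skip)
    # The run id is lost, so fall back to the latest directory found
    run_dirs.max
  else
    # Usual pipeline run: the run id is known
    "run=#{current_run_id}"
  end
end
```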

alexanderdean · 2017-08-21T15:33:43Z

Would you guys mind jumping on a Zoom together? Can be tomorrow morning of course, but we can't afford to introduce a regression here...

alexanderdean added the 3. Enrich label Aug 21, 2017

alexanderdean added this to the R92 [BAT] Maiden Castle milestone Aug 21, 2017

alexanderdean assigned BenFradet Aug 21, 2017

BenFradet added a commit that referenced this issue Aug 21, 2017

EmrEtlRunner: treat archive_enriched and archive_shredded as separate steps (closes #3401)

da8e2ee

BenFradet added a commit that referenced this issue Aug 21, 2017

EmrEtlRunner: treat archive_enriched and archive_shredded as separate steps (closes #3401)

6ab9426

BenFradet closed this as completed in 768e021 Sep 11, 2017

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 26, 2020

Treat archive_enriched and archive_shredded as separate steps (closes snowplow/snowplow#3401)

3f5de9e

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Treat archive_enriched and archive_shredded as separate steps (closes snowplow/snowplow#3401)

2112524

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Treat archive_enriched and archive_shredded as separate steps (closes snowplow/snowplow#3401)

09fbcc3

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Treat archive_enriched and archive_shredded as separate steps (close snowplow/snowplow#3401)

aace381
