EmrEtlRunner: treat archive_enriched and archive_shredded as separate steps #3401

Closed
alexanderdean opened this issue Aug 21, 2017 · 8 comments

@alexanderdean
Member

At the moment we confusingly use the archive_enriched step to refer to archiving both enriched and shredded data. This is problematic if a user is running Snowplow without shredding and loading data into Redshift, because:

  1. If we disable archive_enriched, then the enriched events are left in that folder and the next run can't start
  2. If we don't disable archive_enriched, then the S3DistCp step trying to move the shredded data will fail due to no data being present

Note that I am open to other suggestions (e.g. hardening the S3DistCp step), but the solution of treating archive_enriched and archive_shredded as separate steps seems fairly clean and simple.

@alexanderdean
Member Author

/cc @stdfalse

@BenFradet
Contributor

The current behavior is:

if we skip archive_enriched, we don't do any archiving of either enrich or shred
otherwise:
  if enrich we archive enrich and shred for the current run id
  otherwise we archive enrich and shred for the last run id

Do we want to convert that to:

if we skip archive_enriched, we don't archive enrich
otherwise:
  if enrich we archive enrich for the current run id
  otherwise we archive enrich for the last run id

and

if we skip archive_shredded, we don't archive shred
otherwise:
  if shred we archive shred for the current run id
  otherwise we archive shred for the last run id

?
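To make the proposed split concrete, here is a minimal sketch of the two independent decisions. It is not EmrEtlRunner's actual code (which is Ruby); `plan_archiving`, the `skip` set and the `"current"`/`"last"` labels are hypothetical names used only for illustration.

```python
from typing import Dict, Set


def plan_archiving(skip: Set[str], enrich: bool, shred: bool) -> Dict[str, str]:
    """Decide, per archive step, whether it runs and which run id it targets.

    skip   -- names of the steps the user asked to skip (hypothetical input)
    enrich -- True if the enrich step runs in this invocation
    shred  -- True if the shred step runs in this invocation
    """
    plan: Dict[str, str] = {}
    # Each archive step is now gated by its own skip flag only.
    if "archive_enriched" not in skip:
        plan["archive_enriched"] = "current" if enrich else "last"
    if "archive_shredded" not in skip:
        plan["archive_shredded"] = "current" if shred else "last"
    return plan


# A user who never shreds skips shred and archive_shredded,
# and the enriched events still get archived.
print(plan_archiving({"shred", "archive_shredded"}, enrich=True, shred=False))
# prints {'archive_enriched': 'current'}
```

With the steps decoupled like this, skipping archive_shredded no longer leaves enriched events behind, and no S3DistCp move is scheduled for shredded data that does not exist.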

@alexanderdean
Member Author

Can you clarify the nested booleans, "if enrich" and "if shred"? What conditions are you referring to here?

@BenFradet
Contributor

@alexanderdean
Member Author

I think @chuwy wrote this recovery code? I'm not familiar with it.

@BenFradet
Contributor

yup

@chuwy
Contributor

chuwy commented Aug 21, 2017

Sorry for the delay here.

Regarding the code: if we're in recovery mode, we don't know the exact run id to archive (because the timestamp is lost), so S3DistCp archives the latest directory it finds; pipeline mode means we're in a usual pipeline run and the run id is known.

EmrEtlRunner decides we're in recovery mode iff archive_enriched is present and enrich is absent.
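A rough sketch of that rule, for illustration only and not the actual EmrEtlRunner (Ruby) code. It assumes "present" means the step is not skipped and "absent" means it is skipped, and that run directories are named `run=<zero-padded timestamp>` so the lexicographically greatest name is the latest; `is_recovery_mode` and `run_dir_to_archive` are hypothetical helpers.

```python
from typing import List, Optional, Set


def is_recovery_mode(skip: Set[str]) -> bool:
    # Assumption: archive_enriched "present" = not skipped,
    # enrich "absent" = skipped.
    return "archive_enriched" not in skip and "enrich" in skip


def run_dir_to_archive(
    skip: Set[str], current_run_id: str, existing_run_dirs: List[str]
) -> Optional[str]:
    """Pick the run directory that the archive step should move."""
    if is_recovery_mode(skip):
        # The original timestamp is lost, so fall back to the latest
        # directory found. Assumes zero-padded timestamps, so string
        # ordering matches chronological ordering.
        return max(existing_run_dirs) if existing_run_dirs else None
    # Pipeline mode: the run id of the current run is known.
    return f"run={current_run_id}"


dirs = ["run=2017-08-20-01-00-00", "run=2017-08-21-01-00-00"]
print(run_dir_to_archive({"enrich"}, "2017-08-22-01-00-00", dirs))
# prints run=2017-08-21-01-00-00
```

Returning `None` when no directory exists is one possible way to avoid launching a move over missing data; whether EmrEtlRunner should behave that way is a separate design question.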

Does that answer the question?

@alexanderdean
Member Author

alexanderdean commented Aug 21, 2017

Would you guys mind jumping on a Zoom together? Can be tomorrow morning of course, but we can't afford to introduce a regression here...
