EmrEtlRunner: use S3DistCp not Sluice for staging step #276

alexanderdean · 2013-06-03T12:43:58Z

We may be able to use S3DistCp for all collector_formats (including tricky ones like Urban Airship, which was authored by @ninjabear). S3DistCp's manifests option may be useful here.

If we can't, we still need to move this step out of EmrEtlRunner. One option would be to build a tiny file-moving binary and call that as a jobflow step: a kind of poor man's, master-node-only S3DistCp, specific to one or more of our collector_formats. But anything that can move to S3DistCp should.
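For illustration, here is a minimal sketch of what a file move expressed as an S3DistCp jobflow step could look like. It only builds the step definition in the shape EMR expects and does not submit anything. The bucket names, the step name and the `s3distcp_step` helper are hypothetical; `--src`, `--dest`, `--srcPattern` and `--deleteOnSuccess` are standard S3DistCp options. This is not the code EmrEtlRunner uses.

```python
from typing import Dict, List, Optional


def s3distcp_step(name: str, src: str, dest: str,
                  src_pattern: Optional[str] = None,
                  delete_on_success: bool = False) -> Dict:
    """Build an EMR step definition that runs S3DistCp (hypothetical helper)."""
    args: List[str] = ["s3-dist-cp", "--src", src, "--dest", dest]
    if src_pattern is not None:
        # Only copy objects whose full path matches this regular expression.
        args += ["--srcPattern", src_pattern]
    if delete_on_success:
        # Turns the copy into a move: sources are deleted after a successful copy.
        args.append("--deleteOnSuccess")
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical buckets, for illustration only.
step = s3distcp_step(
    "Stage raw logs",
    src="s3://example-in/raw/",
    dest="s3://example-processing/raw/",
    src_pattern=r".*\.gz",
    delete_on_success=True,
)
print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape is what an EMR client would take as one entry in a list of steps, so the same helper could cover each collector_format by varying only the source pattern.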

alexanderdean · 2015-06-15T07:42:13Z

Blocked by #1775

BenFradet · 2017-02-24T15:56:46Z

Here's the brute force way of doing things I came up with:

- keep the original folder structures (no flattening): as a result there wouldn't be any overwrite and we would have to glob the input path of the enrich step
- handle the --end and --start flags on a per collector format basis

Advantages:

- everything is s3distcp
- fairly generic (if we wish to add another format, we don't have to write and support another script or binary)

Drawbacks:

- no renaming; correct me if I'm wrong, but since our folder structure is not flattened, that shouldn't be an issue

Would love feedback, as I don't know whether the file renaming serves other purposes.

alexanderdean · 2017-02-24T16:09:17Z

Hey @BenFradet - TBH I am happy to drop the --end and --start arguments - we have never used these (although I know a few in the community did) and I think they have outlived their purpose and are unnecessarily complicated.

I don't think the renaming is essential either. We just need to be careful that the sub-folder structure is preserved through the pipeline to prevent accidental overwrites.
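To make the overwrite concern concrete, here is a small sketch comparing a flattening move with one that preserves the sub-folder structure. The key layout (`in/<instance>/<file>`) and all function names are assumptions made up for the example; the real collector layouts are not described in this thread.

```python
from collections import Counter
from typing import List


def flattened_key(key: str, dest_prefix: str) -> str:
    """Drop all sub-folders and keep only the file name."""
    return dest_prefix.rstrip("/") + "/" + key.rsplit("/", 1)[-1]


def preserved_key(key: str, src_prefix: str, dest_prefix: str) -> str:
    """Keep everything below src_prefix, including sub-folders."""
    if not key.startswith(src_prefix):
        raise ValueError(f"{key!r} is not under {src_prefix!r}")
    relative = key[len(src_prefix):].lstrip("/")
    return dest_prefix.rstrip("/") + "/" + relative


def overwritten(dest_keys: List[str]) -> List[str]:
    """Destination keys that more than one source file would be written to."""
    return sorted(k for k, n in Counter(dest_keys).items() if n > 1)


# Two hypothetical instances writing a file with the same name.
keys = ["in/i-aaa/access.log.gz", "in/i-bbb/access.log.gz"]
flat = [flattened_key(k, "processing") for k in keys]
kept = [preserved_key(k, "in", "processing") for k in keys]

print(overwritten(flat))  # the two files collide on one destination key
print(overwritten(kept))  # no collisions when sub-folders are preserved
```

Without renaming, flattening is only safe if file names are already unique across sub-folders, which is why the structure has to survive every move, including the archive step.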

BenFradet · 2017-02-24T16:13:43Z

The sub-folder structure will be fairly short-lived as it'll only be persisted up to enrich which will have a flat output just like right now.

I'll create a ticket for removing --end and --start then 👍 .

alexanderdean · 2017-02-24T16:14:42Z

Right - but the sub-folder structure needs to be persisted when archiving the raw files out of staging...

BenFradet · 2017-02-24T16:15:09Z

True, that's something I haven't investigated yet but since it's S3DistCp-based that shouldn't be an issue.

rbolkey · 2017-07-18T19:13:22Z

Hi. Was asked to leave a comment here about file moves in S3. We're currently using the Clojure collector, and have run into an issue where the file naming scheme for files placed in the processing folder does not provide enough precision to uniquely identify a published file. As a result, the log files overwrite each other in the processing folder, and we lose data.

The root of the problem is that we need to generate logs more often than once per hour per instance, but the file name only keeps hour precision on the timestamp. If the file name retained minute and second precision as well, we would not need a custom staging script.
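The collision described above can be reproduced with a short sketch. The file name pattern below is hypothetical (the actual naming scheme is not quoted in this thread); it only shows that two logs from the same instance within one hour map to the same name at hour precision and to different names at second precision.

```python
from datetime import datetime


def staged_name(instance_id: str, written_at: datetime, fmt: str) -> str:
    """Hypothetical staged file name: instance id plus a formatted timestamp."""
    return f"{instance_id}.{written_at.strftime(fmt)}.log.gz"


HOUR_PRECISION = "%Y-%m-%d-%H"
SECOND_PRECISION = "%Y-%m-%d-%H%M%S"

# Two logs published by the same instance within the same hour.
first = datetime(2017, 7, 18, 19, 5, 0)
second = datetime(2017, 7, 18, 19, 35, 30)

hourly = {staged_name("i-aaa", t, HOUR_PRECISION) for t in (first, second)}
precise = {staged_name("i-aaa", t, SECOND_PRECISION) for t in (first, second)}

print(len(hourly))   # one distinct name: the second file overwrites the first
print(len(precise))  # two distinct names: both files survive
```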

ghost assigned alexanderdean Jun 3, 2013

rgabo mentioned this issue Aug 10, 2013

Using EmrEtlRunner to process the archive bucket works for S3DistCp but fails on the subsequent ETL step #317

Closed

alexanderdean changed the title ~~Move file copies to use S3DistCp if possible~~ Snowplow CLI: port copy to staging to S3DistCp Jun 15, 2015

alexanderdean changed the title ~~Snowplow CLI: port copy to staging to S3DistCp~~ Snowplow CLI: port move to processing to S3DistCp Jun 15, 2015

alexanderdean changed the title ~~Snowplow CLI: port move to processing to S3DistCp~~ Snowplow CLI: re-implement S3 file moves to Processing using S3DistCp Jun 15, 2015

alexanderdean added this to the snowplow CLI #2 milestone Jun 15, 2015

This was referenced Jun 15, 2015

Snowplow CLI: re-implement S3 raw event file moves to Archive using S3DistCp #1776

Closed

EmrEtlRunner: add S3DistCp step to move enriched and shredded files to archive #1777

Closed

alexanderdean modified the milestones: snowplow CLI #3, snowplow CLI #2 Aug 20, 2015

alexanderdean mentioned this issue Aug 21, 2015

Snowplow CLI: support manifest for S3DistCp to Staging #1983

Closed

alexanderdean changed the title ~~Snowplow CLI: re-implement S3 file moves to Processing using S3DistCp~~ EmrEtlRunner: use S3DistCp not Sluice for staging step Feb 9, 2017

alexanderdean modified the milestones: R88 Angkor Wat, snowplowctl #5, R9x [HAD] EmrEtlRunner robustness Feb 9, 2017

alexanderdean assigned BenFradet and unassigned alexanderdean Feb 9, 2017

BenFradet mentioned this issue Feb 13, 2017

EmrEtlRunner: add "no logs to process" message on exit with return code 3 #2644

Closed

BenFradet modified the milestones: R9x [HAD] EmrEtlRunner robustness, R9x [HAD] Spark port Feb 20, 2017

BenFradet added a commit that referenced this issue Feb 24, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

d7cb619

BenFradet added a commit that referenced this issue Feb 27, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

47885f4

BenFradet added a commit that referenced this issue Mar 2, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

beb099c

BenFradet added a commit that referenced this issue Mar 13, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

146bf96

BenFradet added a commit that referenced this issue Mar 27, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

65481f7

BenFradet added a commit that referenced this issue Mar 27, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

c57f278

alexanderdean mentioned this issue Aug 1, 2017

EmrEtlRunner: turn empty S3 directory detections through Sluice into EMR steps #3130

Closed

BenFradet added a commit that referenced this issue Aug 2, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

730ce09

BenFradet added a commit that referenced this issue Aug 2, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

0535ae5

BenFradet added a commit that referenced this issue Aug 3, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

600b2c8

BenFradet added a commit that referenced this issue Aug 3, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

b38fd7e

BenFradet added a commit that referenced this issue Aug 4, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

f55f412

BenFradet added a commit that referenced this issue Aug 7, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

e528e1b

This was referenced Aug 7, 2017

Staging process fails for buckets in multiple regions #2823

Closed

EmrEtlRunner: race condition overwriting Clojure Collector files during staging step #3085

Closed

BenFradet added a commit that referenced this issue Aug 9, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

8e6b7d1

BenFradet added a commit that referenced this issue Aug 11, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

41e178f

BenFradet added a commit that referenced this issue Aug 14, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

1c12fca

BenFradet added a commit that referenced this issue Aug 15, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

bfa24a0

BenFradet added a commit that referenced this issue Aug 15, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

a4b7467

BenFradet added a commit that referenced this issue Aug 15, 2017

EmrEtlRunner: use S3DistCp not Sluice for staging step (closes #276)

4ab2054

BenFradet closed this as completed in d8046e2 Aug 17, 2017

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 26, 2020

Use S3DistCp not Sluice for staging step (closes snowplow/snowplow#276)

e09958b

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Use S3DistCp not Sluice for staging step (closes snowplow/snowplow#276)

7435481

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Use S3DistCp not Sluice for staging step (closes snowplow/snowplow#276)

8143a95

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020

Use S3DistCp not Sluice for staging step (close snowplow/snowplow#276)

73b14b5
