EmrEtlRunner: use S3DistCp not Sluice for staging step #276

Closed
alexanderdean opened this issue Jun 3, 2013 · 7 comments

@alexanderdean
Member

alexanderdean commented Jun 3, 2013

We may be able to use S3DistCp for all collector_formats (including tricky ones like Urban Airship, which was authored by @ninjabear). S3DistCp's manifests option may be useful here.

If we can't, then we still need to move this step out of EmrEtlRunner, so one option would be to build a tiny file-moving binary and call that as a jobflow step: a kind of poor man's, master-node-only S3DistCp, specific to one or more of our collector_formats. But anything that can move to S3DistCp should.
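
As a rough sketch of what such a jobflow step could look like, the snippet below builds an S3DistCp argument list for moving raw logs into a processing location. The function name `s3distcp_move_args` and the bucket paths are hypothetical. `--src`, `--dest`, `--srcPattern`, `--deleteOnSuccess` and `--outputManifest` are S3DistCp options, but how EmrEtlRunner would actually wire them together is an assumption here.

```python
from typing import List, Optional


def s3distcp_move_args(src: str,
                       dest: str,
                       src_pattern: Optional[str] = None,
                       manifest: Optional[str] = None) -> List[str]:
    """Build the argument list for a hypothetical S3DistCp 'move' step.

    --deleteOnSuccess removes the source files once they have been copied,
    which turns the copy into a move.
    """
    args = ["s3-dist-cp", "--src", src, "--dest", dest, "--deleteOnSuccess"]
    if src_pattern is not None:
        # Regular expression filtering which source files are copied.
        args += ["--srcPattern", src_pattern]
    if manifest is not None:
        # Name of the gzipped manifest listing the files that were copied.
        args += ["--outputManifest", manifest]
    return args


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    print(" ".join(s3distcp_move_args(
        "s3://example-in/raw/",
        "s3://example-processing/raw/",
        src_pattern=r".*\.gz",
    )))
```

The resulting list could be passed as the arguments of an EMR step that invokes S3DistCp; a format-specific `--srcPattern` is one possible way to handle each collector_format without a separate binary.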

@ghost ghost assigned alexanderdean Jun 3, 2013
@alexanderdean alexanderdean changed the title Move file copies to use S3DistCp if possible Snowplow CLI: port copy to staging to S3DistCp Jun 15, 2015
@alexanderdean alexanderdean changed the title Snowplow CLI: port copy to staging to S3DistCp Snowplow CLI: port move to processing to S3DistCp Jun 15, 2015
@alexanderdean alexanderdean changed the title Snowplow CLI: port move to processing to S3DistCp Snowplow CLI: re-implement S3 file moves to Processing using S3DistCp Jun 15, 2015
@alexanderdean alexanderdean added this to the snowplow CLI #2 milestone Jun 15, 2015
@alexanderdean
Member Author

Blocked by #1775

@alexanderdean alexanderdean modified the milestones: snowplow CLI #3, snowplow CLI #2 Aug 20, 2015
@alexanderdean alexanderdean changed the title Snowplow CLI: re-implement S3 file moves to Processing using S3DistCp EmrEtlRunner: use S3DistCp not Sluice for staging step Feb 9, 2017
@alexanderdean alexanderdean modified the milestones: R88 Angkor Wat, snowplowctl #5, R9x [HAD] EmrEtlRunner robustness Feb 9, 2017
@BenFradet BenFradet modified the milestones: R9x [HAD] EmrEtlRunner robustness, R9x [HAD] Spark port Feb 20, 2017
@BenFradet
Contributor

Here's the brute force way of doing things I came up with:

  • keep the original folder structures (no flattening): as a result there wouldn't be any overwrite and we would have to glob the input path of the enrich step
  • handle the --end and --start flags on a per collector format basis

Advantages:

  • everything is s3distcp
  • fairly generic (if we wish to add another format, we don't have to write and support another script or binary)

Drawbacks:

  • no renaming; correct me if I'm wrong, but since our folder structure is not flattened, that shouldn't be an issue

Would love feedback, as I don't know whether the file renaming serves other purposes.
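
To illustrate why keeping the folder structure avoids overwrites, here is a minimal sketch comparing a flattened move with a structure-preserving one. The key names, prefixes and helper functions are all hypothetical and only model the key mapping; they do not call S3 or S3DistCp.

```python
import posixpath
from collections import defaultdict
from typing import Dict, List


def flattened_targets(keys: List[str], dest_prefix: str) -> Dict[str, List[str]]:
    """Map each destination key to the source keys that would land on it
    when only the file name is kept (sub-folders dropped)."""
    targets: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        dest = posixpath.join(dest_prefix, posixpath.basename(key))
        targets[dest].append(key)
    return dict(targets)


def preserved_targets(keys: List[str], src_prefix: str,
                      dest_prefix: str) -> Dict[str, List[str]]:
    """Same mapping, but keeping each key's path relative to src_prefix."""
    targets: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        if not key.startswith(src_prefix):
            raise ValueError(f"{key!r} is not under {src_prefix!r}")
        relative = key[len(src_prefix):].lstrip("/")
        targets[posixpath.join(dest_prefix, relative)].append(key)
    return dict(targets)


def collisions(targets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Destination keys that more than one source key maps onto."""
    return {dest: srcs for dest, srcs in targets.items() if len(srcs) > 1}


if __name__ == "__main__":
    # Two instances writing a log with the same file name (hypothetical keys).
    keys = ["raw/i-aaa/log.2017-02-24-10.gz", "raw/i-bbb/log.2017-02-24-10.gz"]
    print(collisions(flattened_targets(keys, "processing")))
    print(collisions(preserved_targets(keys, "raw", "processing")))
```

With the flattened mapping both source keys collapse onto one destination key, whereas the preserved mapping keeps them apart, which is why renaming becomes unnecessary in this model.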

@alexanderdean
Member Author

alexanderdean commented Feb 24, 2017

Hey @BenFradet - TBH I am happy to drop the --end and --start arguments - we have never used these (although I know a few in the community did) and I think they have outlived their purpose and are unnecessarily complicated.

I don't think the renaming is essential either. We just need to be careful that the sub-folder structure is preserved through the pipeline to prevent accidental overwrites.

@BenFradet
Contributor

The sub-folder structure will be fairly short-lived as it'll only be persisted up to enrich which will have a flat output just like right now.

I'll create a ticket for removing --end and --start then 👍 .

@alexanderdean
Member Author

Right - but the sub-folder structure needs to be persisted when archiving the raw files out of staging...

@BenFradet
Contributor

BenFradet commented Feb 24, 2017

True, that's something I haven't investigated yet but since it's S3DistCp-based that shouldn't be an issue.

@rbolkey
Contributor

rbolkey commented Jul 18, 2017

Hi. I was asked to leave a comment here about file moves in S3. We're currently using the Clojure collector, and have run into an issue: the file naming scheme for files placed in the processing folder does not provide enough precision to uniquely identify a published file. As a result, the log files overwrite each other in the processing folder, and we lose data.

The root of the problem is that we need to generate logs more frequently than once per hour per instance, but the file name only keeps hour precision on the timestamp. For us, if the file name could retain both minute and second precision, that would remove our need for a custom staging script.
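
A small sketch of the collision described above, assuming a made-up file name pattern (this is not the collector's or EmrEtlRunner's real naming scheme): two logs rotated by the same instance within one hour get the same name at hour precision, but distinct names at second precision.

```python
from datetime import datetime

# Hypothetical timestamp formats, for illustration only.
HOUR_PRECISION = "%Y-%m-%d-%H"
SECOND_PRECISION = "%Y-%m-%d-%H-%M-%S"


def staged_name(instance_id: str, rotated_at: datetime, fmt: str) -> str:
    """Build a hypothetical staged file name from the instance ID and the
    rotation timestamp rendered with the given strftime format."""
    return f"{instance_id}.{rotated_at.strftime(fmt)}.txt.gz"


if __name__ == "__main__":
    first = datetime(2017, 7, 18, 10, 5, 0)
    second = datetime(2017, 7, 18, 10, 35, 30)  # same hour, later rotation

    # Hour precision: both rotations produce the same name, so one overwrites the other.
    print(staged_name("i-aaa", first, HOUR_PRECISION))
    print(staged_name("i-aaa", second, HOUR_PRECISION))

    # Second precision: the names differ, so both files survive.
    print(staged_name("i-aaa", first, SECOND_PRECISION))
    print(staged_name("i-aaa", second, SECOND_PRECISION))
```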

peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 26, 2020
peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020
peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020
peel pushed a commit to snowplow/emr-etl-runner that referenced this issue May 28, 2020