EmrEtlRunner: use S3DistCp not Sluice for staging step #276
Comments
Blocked by #1775
Here's the brute force way of doing things I came up with:
Advantages:
Drawbacks:
Would love feedback, as I don't know whether the file renaming serves other purposes.
Hey @BenFradet - TBH I am happy to drop the I don't think the renaming is essential either. We just need to be careful that the sub-folder structure is preserved through the pipeline to prevent accidental overwrites.
The sub-folder structure will be fairly short-lived, as it will only be persisted up to enrich, which will have a flat output just as it does now. I'll create a ticket for removing --end and --start then 👍.
Right - but the sub-folder structure needs to be persisted when archiving the raw files out of staging...
True, that's something I haven't investigated yet, but since it's S3DistCp-based that shouldn't be an issue.
Hi. I was asked to leave a comment here about file moves in S3. We're currently using the Clojure collector and have run into an issue: the naming scheme for files placed in the processing folder does not provide enough precision to uniquely identify a published file. As a result, log files overwrite each other in the processing folder and we lose data. The root of the problem is that we need to generate logs more frequently than once per hour per instance, but the file name only keeps hour precision on the timestamp. For us, if the file name retained both minute and second precision, we would no longer need a custom staging script.
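To make the overwrite problem above concrete, here is a minimal sketch. The `staged_name` helper and the file-name layout are hypothetical, chosen only for illustration, and are not the actual naming scheme used by the staging step. It shows that two logs rotated by the same instance within the same hour map to one name at hour precision, but to distinct names once minutes and seconds are kept.

```python
from datetime import datetime


def staged_name(instance_id: str, ts: datetime, fmt: str) -> str:
    """Hypothetical staged file name: a timestamp plus the instance id."""
    return f"access_log.{ts.strftime(fmt)}.{instance_id}.txt.gz"


HOUR_FMT = "%Y-%m-%d-%H"           # hour precision only
SECOND_FMT = "%Y-%m-%d-%H-%M-%S"   # keeps minutes and seconds

# Two logs rotated by the same (made-up) instance within the same hour.
rotations = [
    datetime(2017, 3, 1, 10, 5, 0),
    datetime(2017, 3, 1, 10, 35, 30),
]

hour_names = {staged_name("i-abc123", ts, HOUR_FMT) for ts in rotations}
second_names = {staged_name("i-abc123", ts, SECOND_FMT) for ts in rotations}

print(len(hour_names))    # 1: both files share a name, so one overwrites the other
print(len(second_names))  # 2: each file keeps its own name
```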
We may be able to use S3DistCp for all collector_formats (including tricky ones like Urban Airship, which was authored by @ninjabear). S3DistCp's manifests option may be useful here.
If we can't, then we still need to move this step out of EmrEtlRunner, so one option would be to build a tiny file-moving binary and call that as a jobflow step. So kind of like a poor man's, master-node-only S3DistCp, specific to one or more of our collector_formats. But anything that can move to S3DistCp should.
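As a rough sketch of what a staging move expressed as a jobflow step might look like, the snippet below builds an S3DistCp argument list and wraps it in a step definition shaped like the one the EMR API accepts. The bucket paths, the step name, and the `s3distcp_args` helper are assumptions for illustration only. Using `command-runner.jar` assumes a recent EMR release, and this is not how EmrEtlRunner currently assembles its steps.

```python
def s3distcp_args(src, dest, src_pattern=None, delete_on_success=False):
    """Build the argument list for an S3DistCp invocation (hypothetical helper)."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if src_pattern:
        # Regex that restricts which source keys are copied.
        args += ["--srcPattern", src_pattern]
    if delete_on_success:
        # Turns the copy into a move by deleting sources after a successful copy.
        args.append("--deleteOnSuccess")
    return args


# Made-up bucket paths; a real pipeline would read these from its config.
staging_step = {
    "Name": "Stage raw logs with S3DistCp",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": s3distcp_args(
            "s3://example-collector-logs/",
            "s3://example-etl/processing/",
            src_pattern=r".*\.gz",
            delete_on_success=True,
        ),
    },
}

print(" ".join(staging_step["HadoopJarStep"]["Args"]))
```

A dictionary of this shape could then be submitted alongside the other jobflow steps, which keeps the file move on the cluster rather than in the process that launches it.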