EmrEtlRunner: fix srcPattern for copying stream enriched data to HDFS #3722

Closed
chuwy opened this issue Apr 13, 2018 · 15 comments

@chuwy
Contributor

chuwy commented Apr 13, 2018

Related to #3717

When we're resuming from shred in stream enrich mode, S3DistCp tries to copy data from enriched/good, not from enriched/good/run=2018... and fails because of that.

Due to this bug, we can recover R102 stream enrich mode only by re-staging enriched data back to enriched.stream.

@alexanderdean
Member

What's our chance of getting this into R103?

@chuwy
Contributor Author

chuwy commented Apr 13, 2018

This is already implemented, and I'm starting to test it. So the question is for @BenFradet.

@BenFradet
Contributor

103 has been in code freeze for over a week, so I'm 👎

@BenFradet
Contributor

plus it'd make sense to integrate at least #3719 with this

@alexanderdean
Member

Makes sense - let's start sketching out the next high priority batch release. @chuwy can you compose that milestone please?

@chuwy
Contributor Author

chuwy commented Apr 13, 2018

Ok, will do.

@chuwy chuwy added this to the R104 TBC (Stream enrich mode fixes) milestone Apr 13, 2018
@chuwy
Contributor Author

chuwy commented Apr 13, 2018

Turns out the fix should (can) be different from the one described in the title. When we're resuming from shred in batch-enrich mode, S3DistCp knows nothing about the run folder and also uses enriched/good, but with --srcPattern .*part-.*, which seems to handle files from subfolders. In stream-enrich mode we use --srcPattern .+, which somehow doesn't handle subfolders in the same way.

@chuwy
Contributor Author

chuwy commented Apr 13, 2018

Don't understand why, but --srcPattern .*\.gz did the trick.
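
For reference, here is a small sketch of which object keys each pattern mentioned so far would select under plain regex semantics. It rests on several assumptions: that S3DistCp requires the pattern to match the whole path, that Python's `re` behaves like Java regex for these simple patterns, and that the keys look roughly like the made-up ones below (bucket, run folder and file names are hypothetical). Under these assumptions `.+` matches every key, so the sketch does not explain the failure seen on EMR; it only shows what each pattern would be expected to pick up.

```python
import re

# Hypothetical keys: bucket, run folder and file names are made up.
KEYS = [
    "s3://my-bucket/enriched/good/run=EXAMPLE/part-00000",
    "s3://my-bucket/enriched/good/run=EXAMPLE/enriched-0001.gz",
    "s3://my-bucket/enriched/good/run=EXAMPLE_$folder$",
]


def select(pattern, keys=KEYS):
    """Return the keys that the pattern matches in full.

    Full-path matching is an assumption about S3DistCp, not verified here.
    """
    regex = re.compile(pattern)
    return [key for key in keys if regex.fullmatch(key)]


for pattern in [r".*part-.*", r".+", r".*\.gz"]:
    print(pattern, "->", select(pattern))
```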

@chuwy chuwy changed the title from "EmrEtlRunner: stage data from run folder in stream enrich mode" to "EmrEtlRunner: fix srcPattern for copying stream enriched data to HDFS" Apr 13, 2018
@BenFradet
Contributor

this is locking us down to a particular file format (gz) which I don't really like :(

@chuwy
Contributor Author

chuwy commented Apr 16, 2018

Do you mean because we might support other formats in the future? AFAIK gz is currently the only format for enriched data produced by S3 Loader.

@chuwy
Contributor Author

chuwy commented Apr 16, 2018

I'll try to test it with --srcPattern .* (which I guess can behave differently from the initial .+), but I'm also not sure this is a more bullet-proof option.

@chuwy
Contributor Author

chuwy commented Apr 16, 2018

Ok, .* also worked. @BenFradet can you confirm .* seems like a better option for you than .*\.gz?

@BenFradet
Contributor

As long as there are no other files being unnecessarily moved, yes

@chuwy
Contributor Author

chuwy commented Apr 16, 2018

Ok, no unnecessary files were moved. Pushing an RC then.

@chuwy
Contributor Author

chuwy commented Apr 16, 2018

Sorry, I was wrong: this regex also matches $folder$ files and tries to move them from enriched/good to HDFS for shredding. I cannot find any evidence that S3DistCp ignores empty files, so I think it would still be better to stick with .gz.
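
The trade-off between the two candidate patterns can be sketched as follows. This is an illustration only: it assumes full-path matching, uses Python's `re` in place of Java regex, and the listing (including the `_$folder$` marker name) is hypothetical rather than taken from a real bucket.

```python
import re

# Hypothetical listing of enriched/good: names are made up for illustration.
LISTING = [
    "enriched/good/run=EXAMPLE_$folder$",
    "enriched/good/run=EXAMPLE/enriched-0001.gz",
    "enriched/good/run=EXAMPLE/enriched-0002.gz",
]


def matched(pattern, listing=LISTING):
    """Return the keys the pattern matches in full (assumed behaviour)."""
    return [key for key in listing if re.fullmatch(pattern, key)]


# Marker files picked up by each candidate pattern.
markers_with_star = [k for k in matched(r".*") if k.endswith("_$folder$")]
markers_with_gz = [k for k in matched(r".*\.gz") if k.endswith("_$folder$")]

print("'.*' picks up markers:", markers_with_star)
print("'.*\\.gz' picks up markers:", markers_with_gz)
```

In this sketch `.*` selects the marker along with the data files, while `.*\.gz` selects only the data files, at the cost of tying the pattern to the gz format.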


3 participants