EmrEtlRunner: race condition overwriting Clojure Collector files during staging step #3085
Comments
The converted datetime doesn't always match the original timestamp, so a "reverse conversion" is done afterwards. It retries several times on the same filename/timestamp and sends the file to a retry folder if the conversion keeps failing. The rejected file is processed in the following run.
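For readers following along, the check-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the EmrEtlRunner or Sluice code: the function names (`to_hour_stamp`, `hour_stamp_to_epoch`, `check_or_retry`), the epoch-seconds input and the `max_attempts` value are all assumptions made for the example.

```python
from datetime import datetime, timezone

def to_hour_stamp(epoch_seconds: int) -> str:
    """Format an epoch timestamp as a UTC 'YYYY-MM-DD-HH' stamp (hypothetical helper)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d-%H")

def hour_stamp_to_epoch(stamp: str) -> int:
    """Reverse conversion: epoch of the start of the UTC hour named by the stamp."""
    dt = datetime.strptime(stamp, "%Y-%m-%d-%H").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def check_or_retry(epoch_seconds: int, convert=to_hour_stamp, max_attempts: int = 3):
    """Convert, verify by reverse conversion, and give up after max_attempts.

    Returns ("ok", stamp) when the round trip agrees with the original hour,
    or ("retry", None) when the file should go to a retry folder instead.
    """
    expected_hour_start = epoch_seconds - epoch_seconds % 3600
    for _ in range(max_attempts):
        stamp = convert(epoch_seconds)
        if hour_stamp_to_epoch(stamp) == expected_hour_start:
            return ("ok", stamp)
    return ("retry", None)
```

Passing a `convert` callable keeps the check independent of whichever conversion is under suspicion, so a misbehaving (for example, non-thread-safe) conversion can be swapped in to see the retry path.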
Hi @vceron - thanks for the detailed bug report. The problem must be something to do with threads or similar, given that the vanilla unthreaded code works fine:
If the problem is indeed jruby/jruby#3670, then hopefully it will be fixed when we release R87 very soon. Can you re-test once this has been released?
Yes, I now truly believe it is linked to thread usage in Sluice. I no longer have problems processing raw files, only with storing them in the archive.
I'm looking forward to installing R87.
@vceron Have you had a chance to update to EmrEtlRunner 0.23 (or 0.24)? We updated JRuby to 9.1.6 in both of those releases.
Not yet @BenFradet. We'll update the stack to the latest version in a couple of weeks.
Will be fixed by #3136.
The JRuby update to 9.1.6 in r87-chichen-itza helped with, or even fixed, this issue. From the graph:
Thanks for the additional detail @vceron. Glad the problem stopped for you from R87...
So will the changes allow everything (raw, enriched, shredded) to be staged from the EMR cluster, as is done for the raw files, @BenFradet?
The changes discussed here will move the staging step away from Sluice to S3DistCp.
Why did you remove the labels, @BenFradet?
Since it's been treated in another issue, it prevents people from going back to e.g. the data-loss label x months from now and looking at what happened in this ticket.
I disagree - a bug report is immutable - it intrinsically relates to data loss, and that doesn't change with the fix being in another ticket. Removing the work assignment metadata, by contrast, is fine. X months from now, I want to be able to go back and review bugs which relate to data loss. The ticket that resolved the problem is uninteresting in comparison.
Some raw files are misnamed during the CloudFront-like conversion process, causing 2 root problems of missing files.
In the example below, the timestamp in the filename of both files is always renamed to 2017-01-13-03 (UTC). A 2017-01-13-04 file is missing.
EmrEtlRunner output:
The timestamp conversion should be:
So, when the file is archived, the "2017-01-13-03" file could be either the one at 5:01 or the one at 4:01, and no file is named "2017-01-13-04".
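To make the expected outcome concrete, here is a small Python sketch of what a deterministic conversion would produce for two such files, plus a check for colliding names. The thread does not state the time zone of the 4:01 and 5:01 files, so the UTC+1 offset below is purely an assumption for illustration, and `utc_hour_stamp` and `find_collisions` are hypothetical helpers, not part of EmrEtlRunner.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption for illustration only: the 4:01 and 5:01 files are stamped in UTC+1.
LOCAL = timezone(timedelta(hours=1))

def utc_hour_stamp(local_dt: datetime) -> str:
    """Convert an aware datetime to a UTC 'YYYY-MM-DD-HH' filename stamp."""
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")

def find_collisions(stamps):
    """Return the stamps that were assigned to more than one file."""
    return sorted(s for s, n in Counter(stamps).items() if n > 1)

files = [
    datetime(2017, 1, 13, 4, 1, tzinfo=LOCAL),
    datetime(2017, 1, 13, 5, 1, tzinfo=LOCAL),
]
stamps = [utc_hour_stamp(f) for f in files]
```

Under that assumption the two files get distinct stamps, one per UTC hour. Running `find_collisions` over the stamps produced in a staging run would flag the situation reported here, where two different source files end up with the same name and one overwrites the other.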
Additional info:
Maybe the bump to 9.1.6.0 could fix the issue, but I wasn't able to get snowplow-emr-etl-runner and snowplow-storage-loader to run with this version.
Bonus:
Maybe keeping the timestamp in the filename would be useful for validating the converted CloudFront format.
Related: