
EmrEtlRunner: race condition overwriting Clojure Collector files during staging step #3085

Closed
sl-victorceron opened this issue Jan 30, 2017 · 13 comments

Comments

@sl-victorceron

sl-victorceron commented Jan 30, 2017

Some raw files are misnamed during the CloudFront-like conversion process, causing two files to collide on the same name and one of them to go missing.

In the example below, the timestamp in the filename of both files is rendered as 2017-01-13-03 (UTC). A 2017-01-13-04 file is missing.

EmrEtlRunner output:

[Fri Jan 13 06:15:05 UTC 2017] (t3)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484280061.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz
[Fri Jan 13 06:15:05 UTC 2017] (t5)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484276462.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz

Time-stamp conversion should be:

  • 1484280061 = 13/1/2017 at 5:01:01 CET
  • 1484276462 = 13/1/2017 at 4:01:02 CET

So, when the files are archived, the one named "2017-01-13-03" could be either the 5:01 file or the 4:01 file, and no file named "2017-01-13-04" exists at all.
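For reference, here is a minimal sketch of the conversion the staging step is expected to perform (the helper name is hypothetical; the real code lives in Sluice/EmrEtlRunner):

```ruby
# Hypothetical helper mirroring the CloudFront-like renaming: turn the
# epoch suffix of a rotated Tomcat log into the YYYY-MM-DD-HH (UTC)
# portion of the staged filename.
def cloudfront_hour(epoch_string)
  Time.at(epoch_string.to_i).utc.strftime("%Y-%m-%d-%H")
end

cloudfront_hour("1484280061")  # expected "2017-01-13-04"
cloudfront_hour("1484276462")  # expected "2017-01-13-03"
```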


Additional info:

  • We have more than 30,000 files in our raw/in bucket (remaining Elastic Beanstalk logs).
  • I wasn't able to verify whether the issue comes from Ruby or JRuby (I suspect a bug in JRuby).
    Maybe the bump to 9.1.6.0 could fix the issue, but I wasn't able to run snowplow-emr-etl-runner & snowplow-storage-loader with that version.
  • We hit this issue at least once every two days.

Bonus:
Keeping the original timestamp in the filename could make it possible to validate the converted CloudFront format.

@sl-victorceron sl-victorceron changed the title EmrEtlRunner - misnamed after CloudFront-like conversion EmrEtlRunner : misnamed after CloudFront-like conversion Jan 30, 2017
sl-victorceron pushed a commit to sl-victorceron/snowplow that referenced this issue Jan 30, 2017
The converted datetime doesn't always match the original timestamp, so a "reverse conversion" is done afterwards to check it.

It retries the same filename/timestamp a few times and sends the file to a retry folder if the conversion keeps failing.

The rejected file is processed in the following run.
@sl-victorceron sl-victorceron changed the title EmrEtlRunner : misnamed after CloudFront-like conversion EmrEtlRunner : misnamed CloudFront-like conversion Jan 30, 2017
@alexanderdean
Member

alexanderdean commented Jan 30, 2017

Hi @vceron - thanks for the detailed bug report. The problem must be something to do with threads or similar, given that the vanilla unthreaded code works fine:

vagrant@snowplow:~$ irb
jruby-9.1.6.0 :003 > require 'date'
 => true
jruby-9.1.6.0 :004 > Time.at("1484280061".to_i).utc.to_datetime.strftime("%Y-%m-%d-%H")
 => "2017-01-13-04"
jruby-9.1.6.0 :005 > Time.at("1484276462".to_i).utc.to_datetime.strftime("%Y-%m-%d-%H")
 => "2017-01-13-03"

If the problem is indeed jruby/jruby#3670, then hopefully it will be fixed when we release R87 very soon. Can you re-test once this has been released?

@alexanderdean alexanderdean changed the title EmrEtlRunner : misnamed CloudFront-like conversion EmrEtlRunner: misnamed CloudFront-like conversion Jan 30, 2017
@sl-victorceron
Author

sl-victorceron commented Jan 31, 2017

The problem must be something to do with threads

Yes, now I truly believe it is linked to thread usage in Sluice.

I no longer have problems with processing the raw files, but I do with moving them to the archive.
This morning I identified 4 missing files from yesterday's runs (1h, 9h, 12h, 21h).

grep -i "archive/raw" | grep MOVE | egrep "30-01|30-09|30-12|30-21"
Running the grep above on the EmrEtlRunner output, I found that some files were overwritten, independently of the instance.

[Mon Jan 30 08:15:05 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 08:15:05 UTC 2017] (t7)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-050292a4.txt.gz
[Mon Jan 30 08:15:05 UTC 2017] (t0)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-050292a4.txt.gz

[Mon Jan 30 09:52:04 UTC 2017] (t8)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-d5bf5d5z.txt.gz
[Mon Jan 30 09:52:04 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 09:52:04 UTC 2017] (t6)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-096c6fe1.txt.gz

[Mon Jan 30 12:52:05 UTC 2017] (t6)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-d5bf5d5z.txt.gz
[Mon Jan 30 12:52:05 UTC 2017] (t4)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-050292a4.txt.gz
[Mon Jan 30 12:52:05 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-d5bf5d5z.txt.gz

[Mon Jan 30 23:15:05 UTC 2017] (t4)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 23:15:05 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 23:15:05 UTC 2017] (t6)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-d5bf5d5z.txt.gz

I'm looking forward to installing R87.
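To illustrate the suspected bug class (this is a sketch, not Sluice's actual code): a destination name must be derived only from the value each worker pops off the shared queue; computing it from state shared between threads is what lets two threads emit the same name, as in the logs above.

```ruby
require 'thread'

# Sketch of a thread-safe rename plan: each worker pops a source key from
# a Queue and derives the destination purely from that popped value, so
# no two workers can compute their names from the same shared variable.
def plan_renames(sources, workers: 4)
  queue = Queue.new
  sources.each { |s| queue << s }
  results = Queue.new
  Array.new(workers) do
    Thread.new do
      loop do
        src = begin
          queue.pop(true)          # non-blocking pop; raises when empty
        rescue ThreadError
          break
        end
        epoch = src[/(\d{10})\.gz\z/, 1]
        hour  = Time.at(epoch.to_i).utc.strftime("%Y-%m-%d-%H")
        results << [src, "access_log.#{hour}.txt.gz"]
      end
    end
  end.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end
```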

@alexanderdean alexanderdean added this to the R9x [HAD] EmrEtlRunner robustness milestone May 30, 2017
@BenFradet
Contributor

@vceron Have you had a chance to update to EmrEtlRunner 0.23 (or 0.24)? We updated JRuby to 9.1.6 in both those releases.

@sl-victorceron
Author

Not yet @BenFradet. We'll update the stack to the latest version within the next couple of weeks.

@alexanderdean alexanderdean changed the title EmrEtlRunner: misnamed CloudFront-like conversion EmrEtlRunner: race condition misnaming files Aug 1, 2017
@alexanderdean alexanderdean changed the title EmrEtlRunner: race condition misnaming files EmrEtlRunner: race condition overwriting files during staging step Aug 1, 2017
@BenFradet BenFradet removed their assignment Aug 7, 2017
@BenFradet
Contributor

Will be fixed by #3136.

@sl-victorceron
Author

The JRuby update to 9.1.6 in r87-chichen-itza helped with, or may even have fixed, this issue.

[image: graph of daily raw file counts]

From the graph:

  • Before mid-July we had r86, with the raw-file copying issue.
  • From mid-July until the start of August the indicator was broken, as the key pattern in the S3 bucket changed (to run=xxxxx).
  • From August onwards the indicator was fixed and we can see that all the raw files are OK. Here is a gist to monitor the raw files if needed.
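As a sketch of what such monitoring can check (the gist itself is not reproduced here; this helper is hypothetical): count archived keys per YYYY-MM-DD-HH bucket, so a missing hour like the 2017-01-13-04 file above stands out as an absent entry.

```ruby
# Hypothetical monitoring helper: given a list of archived raw keys,
# count files per YYYY-MM-DD-HH so a missing hour shows up as an
# absent (or unexpectedly low) entry.
def files_per_hour(keys)
  keys.each_with_object(Hash.new(0)) do |key, counts|
    hour = key[/\.(\d{4}-\d{2}-\d{2}-\d{2})\./, 1]
    counts[hour] += 1 if hour
  end
end
```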

@BenFradet
Contributor

@vceron we have experienced this bug post-r87 too; that's why we're going forward with the fix in #276 and not #3136. My bad.

@alexanderdean
Member

Thanks for the additional detail @vceron. Glad the problem stopped for you from R87...

@sl-victorceron
Author

So the changes will allow staging everything (raw, enriched, shredded) from the EMR cluster, just as for the raw files, @BenFradet?
At least for the last 8 days it has been OK @alexanderdean; I'll keep monitoring it for some weeks before concluding it is solved.

@BenFradet
Contributor

The changes discussed here will move the staging step away from Sluice to S3DistCp.
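For context, S3DistCp runs as a single Hadoop job on the cluster, so there is no client-side threaded renaming left to race. An illustrative invocation (bucket names and the pattern are placeholders, not the actual EmrEtlRunner-generated step):

```shell
# Run on the EMR master: stage raw collector logs in one Hadoop job
# instead of threaded client-side renames.
s3-dist-cp \
  --src  s3://collector-logs/ \
  --dest s3://snowplow-processing/processing/ \
  --srcPattern '.*localhost_access_log.*\.gz' \
  --deleteOnSuccess
```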

@alexanderdean
Member

Why did you remove the labels @BenFradet ?

@alexanderdean alexanderdean changed the title EmrEtlRunner: race condition overwriting files during staging step EmrEtlRunner: race condition overwriting Clojure Collector files during staging step Aug 17, 2017
@BenFradet
Contributor

Since it's been treated in another issue, removing the labels prevents people from going back to e.g. the data-loss label x months from now and looking at what happened in this ticket.

@alexanderdean
Member

alexanderdean commented Aug 17, 2017

I disagree - a bug report is immutable - it intrinsically relates to data loss, and that doesn't change with the fix being in another ticket. Removing the work assignment metadata by contrast is fine.

In x months from now, I want to be able to go back and review bugs which relate to data loss. The ticket that resolved the problem is uninteresting in comparison.
