
EmrEtlRunner: race condition overwriting Clojure Collector files during staging step #3085

Closed
sl-victorceron opened this issue Jan 30, 2017 · 13 comments

Comments

@sl-victorceron

sl-victorceron commented Jan 30, 2017

Some raw files are misnamed during the CloudFront-like conversion process, causing two files to collide on the same name and one of them to go missing.

In the example below, the timestamp in the filename of both files is rendered as 2017-01-13-03 (UTC). A 2017-01-13-04 file is missing.

EmrEtlRunner output:

[Fri Jan 13 06:15:05 UTC 2017] (t3)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484280061.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz
[Fri Jan 13 06:15:05 UTC 2017] (t5)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484276462.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz

Time-stamp conversion should be:

  • 1484280061 = 13/1/2017 at 5:01:01 CET
  • 1484276462 = 13/1/2017 at 4:01:02 CET

So, when the files are archived, the one named "2017-01-13-03" could be either the 5:01 file or the 4:01 file, and no file named "2017-01-13-04" exists at all.
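For reference, here is a minimal sketch of the conversion the staging step is expected to perform (the helper name is hypothetical; the real code lives in Sluice/EmrEtlRunner):

```ruby
# Hypothetical helper mirroring the CloudFront-like renaming: turn the
# epoch suffix of a rotated Tomcat log into the YYYY-MM-DD-HH (UTC)
# portion of the staged filename.
def cloudfront_hour(epoch_string)
  Time.at(epoch_string.to_i).utc.strftime("%Y-%m-%d-%H")
end

cloudfront_hour("1484280061")  # expected "2017-01-13-04"
cloudfront_hour("1484276462")  # expected "2017-01-13-03"
```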


Additional info:

  • We have more than 30,000 files in our raw/in bucket (remaining Elastic Beanstalk logs).
  • I wasn't able to verify whether the issue comes from Ruby or JRuby (I suspect a bug in JRuby).
    Maybe the bump to 9.1.6.0 could fix the issue, but I wasn't able to run snowplow-emr-etl-runner & snowplow-storage-loader with that version.
  • We hit this issue at least once every two days.

Bonus:
Keeping the original timestamp in the filename could make it possible to validate the converted CloudFront format.

@sl-victorceron sl-victorceron changed the title EmrEtlRunner - misnamed after CloudFront-like conversion EmrEtlRunner : misnamed after CloudFront-like conversion Jan 30, 2017
sl-victorceron pushed a commit to sl-victorceron/snowplow that referenced this issue Jan 30, 2017
The converted datetime doesn't always match the original timestamp, so a "reverse conversion" is done afterwards to check it.

It retries the same filename/timestamp a few times and sends the file to a retry folder if the conversion keeps failing.

The rejected file is processed in the following run.
@sl-victorceron sl-victorceron changed the title EmrEtlRunner : misnamed after CloudFront-like conversion EmrEtlRunner : misnamed CloudFront-like conversion Jan 30, 2017
@alexanderdean
Member

alexanderdean commented Jan 30, 2017

Hi @vceron - thanks for the detailed bug report. The problem must be something to do with threads or similar, given that the vanilla unthreaded code works fine:

vagrant@snowplow:~$ irb
jruby-9.1.6.0 :003 > require 'date'
 => true
jruby-9.1.6.0 :004 > Time.at("1484280061".to_i).utc.to_datetime.strftime("%Y-%m-%d-%H")
 => "2017-01-13-04"
jruby-9.1.6.0 :005 > Time.at("1484276462".to_i).utc.to_datetime.strftime("%Y-%m-%d-%H")
 => "2017-01-13-03"

If the problem is indeed jruby/jruby#3670, then hopefully it will be fixed when we release R87 very soon. Can you re-test once this has been released?

@alexanderdean alexanderdean changed the title EmrEtlRunner : misnamed CloudFront-like conversion EmrEtlRunner: misnamed CloudFront-like conversion Jan 30, 2017
@sl-victorceron
Author

sl-victorceron commented Jan 31, 2017

The problem must be something to do with threads

Yes, now I truly believe it is linked to thread usage in Sluice.

I no longer have problems with processing the raw files, but I do with moving them to the archive.
This morning I identified 4 missing files from yesterday's runs (1h, 9h, 12h, 21h).

grep -i "archive/raw" | grep MOVE | egrep "30-01|30-09|30-12|30-21"
Running the grep above on the EmrEtlRunner output, I found that some files were overwritten, independently of the instance.

[Mon Jan 30 08:15:05 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 08:15:05 UTC 2017] (t7)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-050292a4.txt.gz
[Mon Jan 30 08:15:05 UTC 2017] (t0)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-01.us-east-1.i-050292a4.txt.gz

[Mon Jan 30 09:52:04 UTC 2017] (t8)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-d5bf5d5z.txt.gz
[Mon Jan 30 09:52:04 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 09:52:04 UTC 2017] (t6)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-09.us-east-1.i-096c6fe1.txt.gz

[Mon Jan 30 12:52:05 UTC 2017] (t6)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-d5bf5d5z.txt.gz
[Mon Jan 30 12:52:05 UTC 2017] (t4)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-050292a4.txt.gz
[Mon Jan 30 12:52:05 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-12.us-east-1.i-d5bf5d5z.txt.gz

[Mon Jan 30 23:15:05 UTC 2017] (t4)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-050292a4.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 23:15:05 UTC 2017] (t5)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-096c6fe1.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-096c6fe1.txt.gz
[Mon Jan 30 23:15:05 UTC 2017] (t6)    MOVE snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-d5bf5d5z.txt.gz -> snowplow-archive/raw/2017-01-30/var_log_tomcat8_rotated_localhost_access_log.2017-01-30-21.us-east-1.i-d5bf5d5z.txt.gz

I'm looking forward to installing R87.
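To illustrate the suspected bug class (this is a sketch, not Sluice's actual code): a destination name must be derived only from the value each worker pops off the shared queue; computing it from state shared between threads is what lets two threads emit the same name, as in the logs above.

```ruby
require 'thread'

# Sketch of a thread-safe rename plan: each worker pops a source key from
# a Queue and derives the destination purely from that popped value, so
# no two workers can compute their names from the same shared variable.
def plan_renames(sources, workers: 4)
  queue = Queue.new
  sources.each { |s| queue << s }
  results = Queue.new
  Array.new(workers) do
    Thread.new do
      loop do
        src = begin
          queue.pop(true)          # non-blocking pop; raises when empty
        rescue ThreadError
          break
        end
        epoch = src[/(\d{10})\.gz\z/, 1]
        hour  = Time.at(epoch.to_i).utc.strftime("%Y-%m-%d-%H")
        results << [src, "access_log.#{hour}.txt.gz"]
      end
    end
  end.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end
```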

@alexanderdean alexanderdean added this to the R9x [HAD] EmrEtlRunner robustness milestone May 30, 2017
@BenFradet
Contributor

@vceron Have you had a chance to update to EmrEtlRunner 0.23 (or 0.24)? We updated JRuby to 9.1.6 in both those releases.

@sl-victorceron
Author

Not yet @BenFradet. We'll update the stack to the latest version within the next couple of weeks.

@alexanderdean alexanderdean changed the title EmrEtlRunner: misnamed CloudFront-like conversion EmrEtlRunner: race condition misnaming files Aug 1, 2017
@alexanderdean alexanderdean changed the title EmrEtlRunner: race condition misnaming files EmrEtlRunner: race condition overwriting files during staging step Aug 1, 2017
@BenFradet BenFradet removed their assignment Aug 7, 2017
@BenFradet
Contributor

Will be fixed by #3136.

@sl-victorceron
Author

The JRuby update to 9.1.6 in r87-chichen-itza helped with, or may even have fixed, this issue.

[image: graph of daily raw file counts]

From the graph:

  • Before mid-July we had r86, with the raw-file copying issue.
  • From mid-July until the start of August the indicator was broken, as the key pattern in the S3 bucket changed (to run=xxxxx).
  • From August onwards the indicator was fixed and we can see that all the raw files are OK. Here is a gist to monitor the raw files if needed.
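As a sketch of what such monitoring can check (the gist itself is not reproduced here; this helper is hypothetical): count archived keys per YYYY-MM-DD-HH bucket, so a missing hour like the 2017-01-13-04 file above stands out as an absent entry.

```ruby
# Hypothetical monitoring helper: given a list of archived raw keys,
# count files per YYYY-MM-DD-HH so a missing hour shows up as an
# absent (or unexpectedly low) entry.
def files_per_hour(keys)
  keys.each_with_object(Hash.new(0)) do |key, counts|
    hour = key[/\.(\d{4}-\d{2}-\d{2}-\d{2})\./, 1]
    counts[hour] += 1 if hour
  end
end
```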

@BenFradet
Contributor

@vceron we have experienced this bug post-r87 too; that's why we're going forward with the fix in #276 and not #3136. My bad.

@alexanderdean
Member

Thanks for the additional detail @vceron. Glad the problem stopped for you from R87...

@sl-victorceron
Author

So the changes will allow staging everything (raw, enriched, shredded) from the EMR cluster, just as for the raw files, @BenFradet?
At least for the last 8 days it has been OK @alexanderdean; I'll keep monitoring it for some weeks before concluding it is solved.

@BenFradet
Contributor

The changes discussed here will move the staging step away from Sluice to S3DistCp.
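For context, S3DistCp runs as a single Hadoop job on the cluster, so there is no client-side threaded renaming left to race. An illustrative invocation (bucket names and the pattern are placeholders, not the actual EmrEtlRunner-generated step):

```shell
# Run on the EMR master: stage raw collector logs in one Hadoop job
# instead of threaded client-side renames.
s3-dist-cp \
  --src  s3://collector-logs/ \
  --dest s3://snowplow-processing/processing/ \
  --srcPattern '.*localhost_access_log.*\.gz' \
  --deleteOnSuccess
```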

@alexanderdean
Member

Why did you remove the labels @BenFradet ?

@alexanderdean alexanderdean changed the title EmrEtlRunner: race condition overwriting files during staging step EmrEtlRunner: race condition overwriting Clojure Collector files during staging step Aug 17, 2017
@BenFradet
Contributor

Since it's been treated in another issue, removing the labels prevents people from going back to e.g. the data-loss label x months from now and looking at what happened in this ticket.

@alexanderdean
Member

alexanderdean commented Aug 17, 2017

I disagree - a bug report is immutable - it intrinsically relates to data loss, and that doesn't change with the fix being in another ticket. Removing the work assignment metadata by contrast is fine.

In x months from now, I want to be able to go back and review bugs which relate to data loss. The ticket that resolved the problem is uninteresting in comparison.
