Add final check to EmrEtlRunner to check no dupes #82

Closed
alexanderdean opened this Issue Nov 11, 2012 · 13 comments

@alexanderdean
Member

alexanderdean commented Nov 11, 2012

On a massive run, I had a problem where at the end there were a few files left in the In Bucket which had also been copied to the Processing Bucket. This could lead to dupes in the output data.

Worth checking, after the In Bucket -> Processing Bucket move, that none of the files left in the In Bucket are duplicates (i.e. also present in the Processing Bucket), and deleting them if they are.
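For illustration only, a minimal sketch of what such a check could look like, written in Python with boto3 (EmrEtlRunner itself is Ruby, so this is not the actual implementation; the function names are made up, and keys are compared by basename on the assumption that the move preserves filenames):

    import boto3

    s3 = boto3.client("s3")  # assumes AWS credentials are configured in the environment

    def basenames(bucket):
        """Return the set of object basenames in a bucket."""
        names = set()
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                names.add(obj["Key"].rsplit("/", 1)[-1])
        return names

    def delete_already_processed(in_bucket, processing_bucket):
        """Delete files still sitting in the In bucket that already exist in the Processing bucket."""
        processed = basenames(processing_bucket)
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=in_bucket):
            for obj in page.get("Contents", []):
                if obj["Key"].rsplit("/", 1)[-1] in processed:
                    s3.delete_object(Bucket=in_bucket, Key=obj["Key"])

A real version would presumably scope the listings to the run's prefix and batch the deletes.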

@ghost assigned alexanderdean Nov 11, 2012

@mtibben

Contributor

mtibben commented Nov 11, 2012

I've run into this myself, and I believe it could be an issue with S3 consistency. The files do disappear eventually, but for some reason some of them get stuck and take up to 24 hours to go away.

@alexanderdean

Member

alexanderdean commented Nov 11, 2012

Hey @mtibben - thanks, that's good to know. I'm going to leave this ticket open for a while to monitor what's going on.

I've been doing a historical run of 3 months' CloudFront logs for a pretty small site - it's been quite disappointing how long the S3 file moves to Processing and Archiving have taken: over 12 hours each (vs 2 hours for the EMR job with 5 m1.smalls)...

@mtibben

Contributor

mtibben commented Nov 11, 2012

Yeah, this is a problem with big runs. I have been bumping up the S3 concurrency constant when I need to do a big run, which does seem to help.

Something else that @larsyencken has just started doing is aggregating the CloudFront logs hourly, so that there are only 24 files per day. It's a quick and dirty Python script that's running at the moment, but perhaps we could add an aggregation step to EmrEtlRunner?
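To illustrate the idea (this is not the actual script): CloudFront access log names carry an hourly timestamp as their second dot-separated field, e.g. EXXXXXXXXXXXXX.2012-11-03-01.abcdef12.gz, so grouping and concatenating by that field gets you down to at most 24 files per day. A rough Python sketch, with made-up paths:

    import gzip
    from collections import defaultdict
    from pathlib import Path

    def hour_of(name):
        """Second dot-separated field of a CloudFront log name is its YYYY-MM-DD-HH stamp."""
        return name.split(".")[1]

    def aggregate_hourly(log_dir, out_dir):
        """Concatenate the many per-delivery log files into one gzip per hour."""
        by_hour = defaultdict(list)
        for path in Path(log_dir).glob("*.gz"):
            by_hour[hour_of(path.name)].append(path)
        for hour, paths in by_hour.items():
            with gzip.open(Path(out_dir) / (hour + ".gz"), "wb") as out:
                for path in sorted(paths):
                    with gzip.open(path, "rb") as f:
                        out.write(f.read())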

@alexanderdean

Member

alexanderdean commented Nov 11, 2012

That's definitely interesting! Is the Python script pulling the files down from S3, aggregating and re-uploading?

@mtibben

Contributor

mtibben commented Nov 11, 2012

Yes exactly

@larsyencken

larsyencken commented Nov 12, 2012

Some more details.

What it does: It scans an S3 bucket for new CloudFront logs. It keeps a manifest of files already included for each hour, and only fetches logs from S3 which aren't in the manifest for that hour. We then only need to upload hours which have changed.

How it's deployed: Ideally we could just use S3, but because of the consistency issues we've found, we're instead aggregating on an EBS volume attached to an EC2 instance. This is the "master" copy of the aggregated data. We just keep an S3 bucket for hourly logs in sync with the master.
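Not the actual script (that gist is linked later in the thread), but a rough Python sketch of the manifest idea using boto3; the function names are made up, and the per-hour YYYY-MM-DD/HH.gz plus HH.manifest layout is assumed to match what's described further down:

    import boto3
    from pathlib import Path

    s3 = boto3.client("s3")

    def load_manifest(path):
        """Log filenames already folded into this hour's aggregate."""
        return set(path.read_text().split()) if path.exists() else set()

    def sync_hour(bucket, hour, mirror):
        """Download any log for `hour` (e.g. "2012-11-03-01") not yet in the manifest.
        Returns True if the hour changed and needs re-aggregating and re-uploading."""
        day, hh = hour.rsplit("-", 1)
        hour_dir = Path(mirror) / day
        hour_dir.mkdir(parents=True, exist_ok=True)
        manifest_path = hour_dir / (hh + ".manifest")
        seen = load_manifest(manifest_path)
        changed = False
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                name = obj["Key"].rsplit("/", 1)[-1]
                if hour in name and name not in seen:  # new log file for this hour
                    s3.download_file(bucket, obj["Key"], str(hour_dir / name))
                    seen.add(name)
                    changed = True
        if changed:
            manifest_path.write_text("\n".join(sorted(seen)) + "\n")
        return changed

The real workflow would also re-upload the changed hourly aggregates to S3, as described above.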

@mtibben

Contributor

mtibben commented Nov 12, 2012

A massive advantage to doing this step is that there may be no need for the processing and archiving steps. Once the data is aggregated like that, the ETL can be run on any particular day quite easily.

@alexanderdean

Member

alexanderdean commented Nov 14, 2012

Thanks guys. This definitely sounds interesting - could be a good approach for big sites using the CloudFront collector. Would you be able to share a suitably anonymised version of the Python script e.g. in a Gist?

@larsyencken

larsyencken commented Nov 15, 2012

Here's a gist for that script:

https://gist.github.com/4076413

It's actually pretty self-contained. Since our Ruby ETL script was archiving the logs into a second bucket, we set it up to read from both the original and archive bucket. Once configured, you run it like:

fetch_and_combine.py /path/to/local/mirror

Inside /path/to/local/mirror, you get folders by date, and files by hour. For example, you'd get a file 2012-11-03/01.gz containing the aggregate logs and 2012-11-03/01.manifest listing the files it's already included for that hour.

@alexanderdean

Member

alexanderdean commented Nov 15, 2012

Many thanks for sharing, Lars! The script looks good. So the script is building 2012-11-03/01.gz etc. on your EBS volume attached to EC2, with the datestamps taken from the CloudFront filenames.

A couple of questions:

  1. What's your process for then feeding those aggregates into EmrEtlRunner? Presumably you move the files up into an S3 bucket for EmrEtlRunner to operate on - but how do you know when a file like 2012-11-03/01.gz is 'finalised' - given that CloudFront access logs can take some time to arrive?
  2. It looks like you are leaving the ingested files in the CloudFront log bucket. Is that going to scale okay? In a couple of months, aren't you going to have to parse a million filenames in the bucket to find the new filenames? Or am I missing something?

/cc @yalisassoon

@larsyencken

larsyencken commented Nov 19, 2012

We're still working out some of these issues.

For (1), the short answer is that we're planning to add a delay of 24h to our Hive workflow. Last we measured, 95% of logs arrived in S3 within 3h, the remaining 5% within 14h (a single spike in log delays). So 24h feels like a reasonable and conservative length to wait, in order to capture all of a day's data.

The longer answer is that we've had some trouble with Hive, so we're running two workflows. The second workflow is like a dependency graph. If the data for an hour has changed, it reparses just that hour and regenerates all derivative data. So, this second way of doing things is a bit more tolerant to log delays, but it's only efficient at incremental updates. For big batches, it's far less scalable than a working Hive setup.

For (2), you're absolutely right, we'll still need to archive logs to another bucket, and also to limit the scan for updated data. So far this has been done by the Ruby ETL runner, but we may end up adding it to the Python workflow instead.

@alexanderdean

Member

alexanderdean commented Nov 26, 2012

Thanks for clarifying, @larsyencken! Keep us posted on what you find out - we're going to explore a few different options ourselves over the next month or so...

@alexanderdean

Member

alexanderdean commented Feb 25, 2017

Closing this as it hasn't recurred since CloudFront logging was refactored by AWS a few years ago.
