Commit
Completed adding support for handling the boundary issue (COPY the day-after's files to Processing, then DELETE them after a successful run)
1 parent efb00f0 · commit 72ab49c
Showing 2 changed files with 130 additions and 52 deletions.
Hey @alexanderdean ... I was initially confused by this, but I see what you're doing here - I see now it's necessary if you're trying to capture an exact, specific date range.
My use-case is that I'm using it as a rolling ETL. The scripts are run daily, and so I have just been assuming that the overlap will be captured on the next day's run. I modified the etl.q script to load the existing partitions, and for the INSERT, I removed the date range conditions, so that records from a previous day could be inserted.
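For reference, the "load the existing partitions" step can be done with Amazon EMR's `RECOVER PARTITIONS` extension to Hive - a minimal sketch only (the `events` table name is hypothetical, not the actual Snowplow schema):

```sql
-- Scan the table's location in S3 and register any partitions found
-- there with the Hive metastore, so a fresh EMR cluster can see the
-- data written by earlier runs.
-- (RECOVER PARTITIONS is an Amazon EMR extension to Hive.)
ALTER TABLE events RECOVER PARTITIONS;
```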
So I'm trying to decide whether I should add an option to allow the rolling approach, or use your approach here, to process exactly one day.
So a couple of queries:

1. Could the copy of the day-after's files be limited to just the first hour's logs, i.e. files matching `*-yyyy-mm-dd-00-*`?
2. Is it ever possible for a file stamped with one day to contain events from the following day?
Hey @mtibben - yep exactly, I added that boundary code in after @yalisassoon's points (#46 (comment)) in the other thread. I've tried to explain the thinking in the new wiki page for the EmrEtlRunner here:
https://github.com/snowplow/snowplow/wiki/Deploying-the-EMR-ETL-Runner#wiki-usage-warnings
Your rolling approach makes sense as well - I guess the difference is that you only capture the last few events from Monday evening on the Wednesday morning run, rather than the Tuesday morning run. (But you do less file copying as a result.)
I don't know enough about Hive partitioning to quite understand the implications of the HiveQL changes you made for the rolling approach - could you share your current HiveQL script in a gist so that @yalisassoon could take a look?
On your two questions:
Yes - I think we could limit it to, say, the first 2 hours (2 rather than 1 to be safe, as I can't find anything official from Amazon about the upper limit on how long a CloudFront log takes to arrive). Even if this only saves 2 hours' worth of files to copy (if people run it at 4am), I think it's a worthwhile addition - great idea.
No, I can't think of a situation where a file with a Wednesday timestamp has a Thursday event in it. So I think we're safe.
Anyway, let me know your thoughts Mike - it would be great to have a look at your updated HiveQL script. Either way, we're getting close to having all the code needed to support either the exact-one-day approach or a rolling approach - I think the rolling approach becomes quite interesting once you start to reduce the batch size from 24 hours down to maybe 6 hours or even 1 hour (which I know some people, e.g. @ramn, will want to do eventually)...
Hi @mtibben,
I'm intrigued by your rolling approach. The reason we didn't implement one ourselves is that we populate each Hive partition using an `INSERT OVERWRITE` statement. That statement blows away any data already in the partition, so before we execute it we need to be sure that all the raw data we want to load for that time period is available to the query. The partition is then populated once, and we never load any more data into it.
In your rolling approach: if you've run the job on a Monday you'll primarily process data for Sunday. On Tuesday when you rerun the job, you'll process Monday's data, but also potentially some additional rows for Sunday. How do you make sure that that handful of rows does not overwrite all the rows you wrote when you ran the job on Monday? If you wouldn't mind sharing your HiveQL, I'd love to see how you worked around this issue, as a rolling approach is nicer than ours...
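To make the overwrite semantics concrete, here's a minimal sketch of the pattern described above (the table and column names are hypothetical, not our actual schema):

```sql
-- Everything already stored in the dt='2013-01-07' partition is
-- discarded and replaced by this query's results - so a Tuesday
-- rerun that picks up a handful of late Sunday rows would wipe
-- out all the Sunday rows written by Monday's run.
INSERT OVERWRITE TABLE events PARTITION (dt = '2013-01-07')
SELECT user_id, event_type, tstamp
FROM raw_logs
WHERE to_date(tstamp) = '2013-01-07';
```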
Hey @yalisassoon,

Hive 0.8+ supports `INSERT INTO`, which appends instead of overwriting. My HQL is shaped roughly like the sketch below (with hypothetical table and column names):
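```sql
-- Let Hive route rows to partitions based on the dt column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive 0.8+: INSERT INTO appends rather than overwrites, so any
-- late-arriving Sunday rows processed on Tuesday's run are simply
-- added alongside the Sunday rows written on Monday's run. Note
-- there is no date-range condition - stragglers from earlier days
-- flow into their own partitions.
INSERT INTO TABLE events PARTITION (dt)
SELECT user_id, event_type, tstamp, to_date(tstamp) AS dt
FROM raw_logs;
```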
Thanks @mtibben! The fact that Hive now supports `INSERT INTO TABLE` had completely passed me by :-) Now that I understand your approach, I'll have a think about whether we should update ours to utilise the `INSERT INTO TABLE` command (i.e. adopt your rolling approach), or stick to adding in exactly one day's worth of data at a time...
Hey @mtibben - just to let you know that we decided to go with your rolling approach in the end. Many thanks for suggesting it and bearing with us as we tried to understand it :-)
In the end your rolling approach seemed not only more flexible but also less error-prone than ours. So we've updated the codebase to support rolling mode, including fixing the `--start` and `--end` parameters so that they work properly (only processing files within those dates).

Anyway, thanks again - we're doing our final tests and then will cut a release...
Sure, no problems at all, I'm glad I can contribute back in some way :)
Something else I'm working on - I've been playing with using a columnar table format (RCFILE) in Hive, partitioning by date and bucketing by user. I'm hoping that will improve the performance of the table. I'll let you know how I go.
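Roughly this shape of table definition (a sketch only - the names and bucket count are illustrative):

```sql
-- Columnar storage (RCFILE) with date partitions and user buckets.
-- Bucketing by user keeps each user's rows together, which should
-- help per-user queries and sampling; the bucket count is arbitrary.
CREATE TABLE events_rcfile (
  user_id    STRING,
  event_type STRING,
  tstamp     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS RCFILE;
```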
Thanks again - that sounds promising about the RCFILE format! CCing @yalisassoon as I know he'll be interested too...
Awesome. Qubole sounds very interesting. To be honest I've found using Hive on EMR a pretty frustrating experience at times, so I would be interested in what they have to offer.