Duplicated revision pairs when bzip2 input is used #1
I believe it is, although the duplicates shouldn't be too many. Change "<=" in the last assertion in testSplitCompressed() to "==", and it won't pass (while it ideally should). According to the error I get there, the scale of duplicates looks like this: "expected: 93939, found: 93946".
The problem is in the way bzip2 files can be split: splits must be aligned to bzip2 blocks, which might end in the middle of a revision. To avoid losing any revisions, I had to implement it so that some revisions are covered twice.
It might make sense to solve this by adding another Hadoop job to the larger workflow to remove the duplicates. (Looking back, I have a very vague memory of discussing a neater solution, but in the end it wasn't implemented.)
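For illustration only, here is a rough sketch of what such a deduplication step could look like, written as plain Java without any Hadoop dependencies. Everything in it is an assumption rather than code from this project: the `RevisionPair` class, its fields, and the idea that the revision id alone identifies a pair are all hypothetical. In an actual Hadoop job the same logic would roughly correspond to keying the map output by revision id and emitting only one value per key in the reducer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RevisionPairDedup {

    /** Hypothetical stand-in for one emitted revision pair (not the project's real type). */
    public static final class RevisionPair {
        public final long prevRevisionId;
        public final long revisionId;
        public final String payload;

        public RevisionPair(long prevRevisionId, long revisionId, String payload) {
            this.prevRevisionId = prevRevisionId;
            this.revisionId = revisionId;
            this.payload = payload;
        }
    }

    /**
     * Keeps the first pair seen for each revision id and drops later ones.
     * Assumes the revision id uniquely identifies a pair, so two pairs with
     * the same id are duplicates produced by overlapping splits.
     */
    public static List<RevisionPair> dedupe(List<RevisionPair> pairs) {
        // LinkedHashMap preserves the order in which ids were first seen.
        Map<Long, RevisionPair> firstSeen = new LinkedHashMap<>();
        for (RevisionPair pair : pairs) {
            firstSeen.putIfAbsent(pair.revisionId, pair);
        }
        return new ArrayList<>(firstSeen.values());
    }
}
```

With a step like this in place, the count after deduplication should match the expected number of pairs exactly, so the assertion in testSplitCompressed() could in principle be checked with "==" against the deduplicated output.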