Duplicated revision pairs when bzip2 input is used #1

Open
whym opened this Issue Aug 16, 2011 · 2 comments

@whym
Owner

whym commented Aug 16, 2011

Revisions around a page ending can be duplicated in the results when bzip2 input is used.

@ottomata


ottomata commented Oct 9, 2014

Hiya! I'm talking to Aaron Halfaker right now! We are thinking about using this again. Is this still an issue? He seems to remember you guys resolving this.

@whym


Owner

whym commented Oct 10, 2014

I believe it is, although there shouldn't be many duplicates. If you change "<=" to "==" in the last assertion in testSplitCompressed(), the test fails (ideally it should pass). Judging from the error I get there, the scale of the duplication is: "expected: 93939, found: 93946".

The problem is in the way bzip2 files can be split: splits must be aligned to bzip2 blocks, which might end in the middle of a revision. To avoid losing any revisions, I had to make the implementation cover some revisions twice.
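
To make the mechanism concrete, here is a toy model in plain Java. It is only a sketch, not the project's actual record reader: the `Rev` class, the byte offsets, and the "emit everything that overlaps the split" rule are all assumptions chosen for illustration. It shows how a revision that straddles a split boundary ends up being emitted by both neighbouring splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitOverlapDemo {
    // Hypothetical stand-in for a revision: an id plus the byte range
    // [start, end) it occupies in the decompressed stream.
    static final class Rev {
        final int id;
        final long start;
        final long end;

        Rev(int id, long start, long end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    // Conservative reader (an assumption, not the real implementation):
    // emit every revision that overlaps [splitStart, splitEnd) so that
    // no revision is ever lost at a boundary.
    static List<Integer> readSplit(List<Rev> revs, long splitStart, long splitEnd) {
        List<Integer> out = new ArrayList<>();
        for (Rev r : revs) {
            if (r.start < splitEnd && r.end > splitStart) {
                out.add(r.id);
            }
        }
        return out;
    }

    // Read consecutive splits delimited by the given block-aligned boundaries
    // and concatenate what each split emits.
    static List<Integer> readAll(List<Rev> revs, long[] boundaries) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i + 1 < boundaries.length; i++) {
            out.addAll(readSplit(revs, boundaries[i], boundaries[i + 1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // Revision 2 spans bytes 10..25 and therefore straddles the boundary at 20.
        List<Rev> revs = Arrays.asList(
                new Rev(1, 0, 10), new Rev(2, 10, 25), new Rev(3, 25, 40));
        System.out.println(readAll(revs, new long[] {0, 20, 40}));
    }
}
```

In this model, three distinct revisions produce four emitted records, which is the same kind of small surplus as in the "expected: 93939, found: 93946" failure.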

It might make sense to solve this by adding another Hadoop job to the larger workflow to remove duplicates. (Looking back, I have a very vague memory of discussing a neater solution, but in any case it wasn't implemented in the end.)
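
As a rough sketch of what such a deduplication step would do, the plain-Java snippet below keeps the first record seen for each key. Nothing here comes from the project: the `PairRecord` class and the idea that each output record carries a unique string key (for example, built from page and revision ids) are assumptions. In a real Hadoop job the grouping by key would be done by the shuffle, and the reducer would emit one value per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RevisionPairDedup {
    // Hypothetical output record: a key identifying the revision pair
    // plus an opaque payload.
    static final class PairRecord {
        final String key;
        final String payload;

        PairRecord(String key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    // Keep the first record for each key, preserving input order.
    // This mirrors a reducer that emits a single value per key.
    static List<PairRecord> dedupe(List<PairRecord> records) {
        Map<String, PairRecord> firstByKey = new LinkedHashMap<>();
        for (PairRecord r : records) {
            firstByKey.putIfAbsent(r.key, r);
        }
        return new ArrayList<>(firstByKey.values());
    }

    public static void main(String[] args) {
        List<PairRecord> records = Arrays.asList(
                new PairRecord("page1:rev1-rev2", "a"),
                new PairRecord("page1:rev2-rev3", "b"),
                new PairRecord("page1:rev2-rev3", "b"), // duplicate from an overlapping split
                new PairRecord("page2:rev7-rev8", "c"));
        System.out.println(dedupe(records).size());
    }
}
```

This only works if duplicated records really do share an identical key, which would need to be checked against the tool's actual output format.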
