Re-implement MedianIterator to use a proper sequence of medians #88
The previous implementation of MedianIterator didn't actually iterate over medians. It just started from the global median and then flip-flopped between higher and lower elements in sequential order. (I'm not sure how the sign flipping was better than a simple sequential loop.)
The new implementation follows a full sequence of medians in a recursive-like manner. This allows search trees constructed from a sorted wordlist to be well-balanced, providing much better tree search performance. The existing performance unit test shows around a 40% improvement in search speed.
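For readers who want to see the idea in isolation, here is a minimal sketch in Python. It is not the code from this PR: `median_order` and `bst_depths` are hypothetical names, the breadth-first traversal is only one possible reading of "recursive-like", and a plain binary search tree over integers stands in for the real search tree.

```python
from collections import deque


def median_order(n):
    """Yield every index in range(n), each one the median of its sub-range.

    Assumption: ranges are visited breadth-first; the PR's MedianIterator
    may order them differently.
    """
    pending = deque([(0, n)])  # half-open ranges [lo, hi)
    while pending:
        lo, hi = pending.popleft()
        if lo >= hi:
            continue  # empty range, nothing to emit
        mid = (lo + hi) // 2
        yield mid
        pending.append((lo, mid))      # elements below the median
        pending.append((mid + 1, hi))  # elements above the median


def bst_depths(keys):
    """Insert distinct keys into an unbalanced BST; return {key: depth}."""
    left, right, depth = {}, {}, {}
    root = None
    for k in keys:
        if root is None:
            root = k
            depth[k] = 0
            continue
        node, d = root, 0
        while True:
            child = left if k < node else right
            d += 1
            if node not in child:
                child[node] = k  # attach as a new leaf
                depth[k] = d
                break
            node = child[node]
    return depth


if __name__ == "__main__":
    print(list(median_order(7)))
    print(max(bst_depths(range(7)).values()))         # sorted insertion order
    print(max(bst_depths(median_order(7)).values()))  # median insertion order
```

Inserting seven sorted keys sequentially degenerates into a chain, while the median order fills the tree level by level, which is the balancing effect described above.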
If you merge the other PRs from today first, you can also print the tree structure before and after this change to see how much more balanced it looks, or print the node depth histogram to see the numbers.

I've also attached, for your amusement, a plot of the histograms for various iterator implementations (depth of all end-of-word nodes vs. how many times each depth occurs in the tree):

- Orange: the previous implementation.
- Green: a random insertion order.
- Purple: an implementation that follows the medians of a power of two greater than or equal to the data size and skips the indices that are beyond the data size.
- Yellow: the current implementation, which follows the series of medians for the exact data size. It almost exactly coincides with the purple curve but is ever so slightly better, and is more elegant both in implementation and in theory.

Note that even the longest paths in the current implementation are shorter than the average path length in the previous implementation!
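The power-of-two variant (the purple curve) can be sketched as follows. This is an illustration based only on the description above, not the code that produced the plot; `pow2_median_order` is a hypothetical name and the breadth-first visiting order is an assumption.

```python
from collections import deque


def pow2_median_order(n):
    """Yield every index in range(n) by taking medians over a padded range.

    The range is padded to the smallest power of two >= n, and medians
    that fall at or beyond n are skipped instead of emitted.
    """
    size = 1
    while size < n:
        size *= 2  # smallest power of two that covers the data

    pending = deque([(0, size)])  # half-open ranges [lo, hi)
    while pending:
        lo, hi = pending.popleft()
        if lo >= hi or lo >= n:
            continue  # empty range, or entirely past the real data
        mid = (lo + hi) // 2
        if mid < n:
            yield mid  # only emit indices that exist in the data
        pending.append((lo, mid))
        pending.append((mid + 1, hi))


if __name__ == "__main__":
    print(list(pow2_median_order(5)))
    print(list(pow2_median_order(8)))
```

When the data size is itself a power of two, nothing is skipped. Otherwise the skipped indices leave the upper part of the tree slightly sparser, which may be why this variant comes out marginally behind the exact-size series.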
This was fun :-)