Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved union-find data structure implementation #412

Merged
merged 1 commit into from Sep 24, 2020
Merged

Improved union-find data structure implementation #412

merged 1 commit into from Sep 24, 2020

Conversation

kylehofmann
Copy link
Contributor

This patch makes several bugfixes and improvements to the union-find data
structure implemented by the BoruvkaUnionFind class:

(1) union_() did not update is_component. If it merged two components
and no subsequent find() operation updated is_component, then
components() would report an incorrect result.

(2) Formerly, union_() did not recognize when it was passed two objects
in the same set. In this case, it would improperly increment the set's
rank. As a result, later calls to union_() sometimes produced unbalanced
trees, degrading performance.

(3) The path compression algorithm in find() has been replaced by Tarjan
and Van Leeuwen's path splitting algorithm. Path splitting has the same
asymptotic performance as path compression, but it has better constant
factors because it requires only one pass through the tree.

(4) The parent pointer tree and the ranks have been separated into
different arrays. This improves memory locality because parent pointers
are accessed much more frequently than the ranks of the nodes.

(5) Ranks are now stored in a single byte, decreasing memory usage and
again improving memory locality. Ranks are bounded above by
floor(log2(size)), so one byte is plenty (...for now).

This patch makes several bugfixes and improvements to the union-find data
structure implemented by the BoruvkaUnionFind class:

(1) union_() did not update is_component.  If it merged two components
and no subsequent find() operation updated is_component, then
components() would report an incorrect result.

(2) Formerly, union_() did not recognize when it was passed two objects
in the same set.  In this case, it would improperly increment the set's
rank.  As a result, later calls to union_() sometimes produced unbalanced
trees, degrading performance.

(3) The path compression algorithm in find() has been replaced by Tarjan
and Van Leeuwen's path splitting algorithm.  Path splitting has the same
asymptotic performance as path compression, but it has better constant
factors because it requires only one pass through the tree.

(4) The parent pointer tree and the ranks have been separated into
different arrays.  This improves memory locality because parent pointers
are accessed much more frequently than the ranks of the nodes.

(5) Ranks are now stored in a single byte, decreasing memory usage and
again improving memory locality.  Ranks are bounded above by
floor(log2(size)), so one byte is plenty (...for now).
@lmcinnes
Copy link
Collaborator

lmcinnes commented Sep 24, 2020

Thanks Kyle! It has been a long time since I really looked at this chunk of the code, which was written what feels like a very long time ago now. I definitely appreciate the fixes for the bugs, but the better algorithm choices (I agree path splitting is likely better in this case), and data structure optimizations are also very worthwhile. Very happy to get this merged (errors are due to other issues -- not least that I need to update the test infrastructure scripts for new releases of sklearn etc,).

@lmcinnes lmcinnes merged commit d5530fb into scikit-learn-contrib:master Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants