Improved union-find data structure implementation #412

kylehofmann · 2020-09-24T22:27:39Z

This patch makes several bugfixes and improvements to the union-find data
structure implemented by the BoruvkaUnionFind class:

(1) union_() did not update is_component. If it merged two components
and no subsequent find() operation updated is_component, then
components() would report an incorrect result.

(2) Formerly, union_() did not recognize when it was passed two objects
in the same set. In this case, it would improperly increment the set's
rank. As a result, later calls to union_() sometimes produced unbalanced
trees, degrading performance.

(3) The path compression algorithm in find() has been replaced by Tarjan
and Van Leeuwen's path splitting algorithm. Path splitting has the same
asymptotic performance as path compression, but it has better constant
factors because it requires only one pass through the tree.

(4) The parent pointer tree and the ranks have been separated into
different arrays. This improves memory locality because parent pointers
are accessed much more frequently than the ranks of the nodes.

(5) Ranks are now stored in a single byte, decreasing memory usage and
again improving memory locality. Ranks are bounded above by
floor(log2(size)), so one byte is plenty (...for now).

This patch makes several bugfixes and improvements to the union-find data structure implemented by the BoruvkaUnionFind class: (1) union_() did not update is_component. If it merged two components and no subsequent find() operation updated is_component, then components() would report an incorrect result. (2) Formerly, union_() did not recognize when it was passed two objects in the same set. In this case, it would improperly increment the set's rank. As a result, later calls to union_() sometimes produced unbalanced trees, degrading performance. (3) The path compression algorithm in find() has been replaced by Tarjan and Van Leeuwen's path splitting algorithm. Path splitting has the same asymptotic performance as path compression, but it has better constant factors because it requires only one pass through the tree. (4) The parent pointer tree and the ranks have been separated into different arrays. This improves memory locality because parent pointers are accessed much more frequently than the ranks of the nodes. (5) Ranks are now stored in a single byte, decreasing memory usage and again improving memory locality. Ranks are bounded above by floor(log2(size)), so one byte is plenty (...for now).

lmcinnes · 2020-09-24T23:36:21Z

Thanks Kyle! It has been a long time since I really looked at this chunk of the code, which was written what feels like a very long time ago now. I definitely appreciate the fixes for the bugs, but the better algorithm choices (I agree path splitting is likely better in this case), and data structure optimizations are also very worthwhile. Very happy to get this merged (errors are due to other issues -- not least that I need to update the test infrastructure scripts for new releases of sklearn etc,).

lmcinnes merged commit d5530fb into scikit-learn-contrib:master Sep 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved union-find data structure implementation #412

Improved union-find data structure implementation #412

kylehofmann commented Sep 24, 2020

lmcinnes commented Sep 24, 2020 •

edited

Improved union-find data structure implementation #412

Improved union-find data structure implementation #412

Conversation

kylehofmann commented Sep 24, 2020

lmcinnes commented Sep 24, 2020 • edited

lmcinnes commented Sep 24, 2020 •

edited