kmeans elkan vs. lloyd performance #16976

mayujsw · 2020-04-21T03:40:20Z

I tried kmeans fit_predict performance on 0.23-dev0 with mnist_all.csv dataset (http://ratml.org/datasets/mnist_all.csv). It seems elkan cost more time than lloyd (no matter dense or sparse format input), not sure whether this is due to dgemm vectorization. so wondering whether this is as expected, if yes, will we change algorithm parameter 'auto' logic to redirect the algorithem to lloyd ?

And is there any on-going work or plan to improve elkan perf ? Thanks

jeremiedbb · 2020-04-21T11:57:51Z

Both lloyd and elkan were improved recently. Lloyd got a bigger boost than elkan (due to the use of gemm indeed) so elkan is not always faster than lloyd as it kind of was before. However, I observed that which is faster depends on the dataset. When the clusters are well separated elkan tends to still be faster than lloyd because the number of distances to compute quickly drops. So I'm not sure the default should be changed. Maybe running a benchmark on a set of standard datasets would give us a hint.

jim0421 · 2020-04-23T01:51:25Z

When it comes to improving the kmeans elkan, I compared the code of kmeans between 0.22 and 0.23-dev0 and I found that the double-layer for loop with scipy dot function when updating sample-cluster distance was replaced by scipy dgemm function in kmeans lloyd. However, kmeans elkan can not directly use this way for acceleration since the it eliminates many meaningless euclidean distance calculation. So I wonder whether we have chance to use dgemm-like api in kmeans elkan by rearranging the data or some change to the algorithm. Besides, do you have any suggestion for further optimization for kmeans elkan?

jeremiedbb · 2020-04-23T12:09:15Z

So I wonder whether we have chance to use dgemm-like api in kmeans elkan by rearranging the data or some change to the algorithm

When working on it I haven't found a way to do that. Seems unlikely though since for elkan all the distances we need to compute can't be expressed as a pairwise distances between 2 sets of points.

Besides, do you have any suggestion for further optimization for kmeans elkan?

If I had, I would have implemented them when reworking kmeans :) so unfortunately no :/

ogrisel · 2020-12-16T09:48:44Z

We should probably update the documentation of the ELKAN option to emphasize that it can be slower than the default Lloyd solver and that this heavily depends on the natural "blob-iness" of the data.

ogrisel · 2020-12-16T09:50:22Z

Maybe we could even add or update an example that compare the performance of the 2 solvers on different synthetic datasets where the difference is expected to be most extreme (e.g. numpy.random.uniform and sklearn.datasets.make_blobs).

ogrisel · 2020-12-16T09:52:50Z

I put the Moderate label for this issue as interested contributors should first be really familiar with the inner workings of both algorithms (Elkan vs LLoy) before considering submitting a PR to address my above comments. If not, please read the scikit-learn documentation, the source code of the 2 implementations and the references linked in the chapter of the user guide on K-means.

ogrisel · 2021-12-15T13:11:30Z

Because of this observation, the Lloyd solver was enabled by default in #21735.

But the poor performance of the Elkan implementation can be caused by a bad interaction between the iterated calls to 2 distinct threadpools (one from OpenBLAS and one from OpenMP) as diagnosed in #20642.

To confirm this, it might we worth measure the performance of both methods in a conda-forge environment where both the OpenBLAS from scipy and the scikit-learn compiled extensions are linked to a unique OpenMP runtime to make sure that the Operating System scheduler only deals with a single active threadpool when fitting the model as explained here: #20642 (comment)

cmarmo added the module:cluster label Apr 30, 2020

ogrisel added the Documentation label Dec 16, 2020

ogrisel added help wanted Moderate Anything that requires some knowledge of conventions and best practices labels Dec 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kmeans elkan vs. lloyd performance #16976

kmeans elkan vs. lloyd performance #16976

mayujsw commented Apr 21, 2020

jeremiedbb commented Apr 21, 2020

jim0421 commented Apr 23, 2020

jeremiedbb commented Apr 23, 2020

ogrisel commented Dec 16, 2020

ogrisel commented Dec 16, 2020

ogrisel commented Dec 16, 2020

ogrisel commented Dec 15, 2021

kmeans elkan vs. lloyd performance #16976

kmeans elkan vs. lloyd performance #16976

Comments

mayujsw commented Apr 21, 2020

jeremiedbb commented Apr 21, 2020

jim0421 commented Apr 23, 2020

jeremiedbb commented Apr 23, 2020

ogrisel commented Dec 16, 2020

ogrisel commented Dec 16, 2020

ogrisel commented Dec 16, 2020

ogrisel commented Dec 15, 2021