Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kmeans elkan vs. lloyd performance #16976

Open
mayujsw opened this issue Apr 21, 2020 · 7 comments
Open

kmeans elkan vs. lloyd performance #16976

mayujsw opened this issue Apr 21, 2020 · 7 comments
Labels
Documentation help wanted Moderate Anything that requires some knowledge of conventions and best practices module:cluster

Comments

@mayujsw
Copy link

mayujsw commented Apr 21, 2020

I tried kmeans fit_predict performance on 0.23-dev0 with mnist_all.csv dataset (http://ratml.org/datasets/mnist_all.csv). It seems elkan cost more time than lloyd (no matter dense or sparse format input), not sure whether this is due to dgemm vectorization. so wondering whether this is as expected, if yes, will we change algorithm parameter 'auto' logic to redirect the algorithem to lloyd ?

And is there any on-going work or plan to improve elkan perf ? Thanks

@jeremiedbb
Copy link
Member

Both lloyd and elkan were improved recently. Lloyd got a bigger boost than elkan (due to the use of gemm indeed) so elkan is not always faster than lloyd as it kind of was before. However, I observed that which is faster depends on the dataset. When the clusters are well separated elkan tends to still be faster than lloyd because the number of distances to compute quickly drops. So I'm not sure the default should be changed. Maybe running a benchmark on a set of standard datasets would give us a hint.

@jim0421
Copy link
Contributor

jim0421 commented Apr 23, 2020

When it comes to improving the kmeans elkan, I compared the code of kmeans between 0.22 and 0.23-dev0 and I found that the double-layer for loop with scipy dot function when updating sample-cluster distance was replaced by scipy dgemm function in kmeans lloyd. However, kmeans elkan can not directly use this way for acceleration since the it eliminates many meaningless euclidean distance calculation. So I wonder whether we have chance to use dgemm-like api in kmeans elkan by rearranging the data or some change to the algorithm. Besides, do you have any suggestion for further optimization for kmeans elkan?

@jeremiedbb
Copy link
Member

So I wonder whether we have chance to use dgemm-like api in kmeans elkan by rearranging the data or some change to the algorithm

When working on it I haven't found a way to do that. Seems unlikely though since for elkan all the distances we need to compute can't be expressed as a pairwise distances between 2 sets of points.

Besides, do you have any suggestion for further optimization for kmeans elkan?

If I had, I would have implemented them when reworking kmeans :) so unfortunately no :/

@ogrisel
Copy link
Member

ogrisel commented Dec 16, 2020

We should probably update the documentation of the ELKAN option to emphasize that it can be slower than the default Lloyd solver and that this heavily depends on the natural "blob-iness" of the data.

@ogrisel
Copy link
Member

ogrisel commented Dec 16, 2020

Maybe we could even add or update an example that compare the performance of the 2 solvers on different synthetic datasets where the difference is expected to be most extreme (e.g. numpy.random.uniform and sklearn.datasets.make_blobs).

@ogrisel ogrisel added help wanted Moderate Anything that requires some knowledge of conventions and best practices labels Dec 16, 2020
@ogrisel
Copy link
Member

ogrisel commented Dec 16, 2020

I put the Moderate label for this issue as interested contributors should first be really familiar with the inner workings of both algorithms (Elkan vs LLoy) before considering submitting a PR to address my above comments. If not, please read the scikit-learn documentation, the source code of the 2 implementations and the references linked in the chapter of the user guide on K-means.

@ogrisel
Copy link
Member

ogrisel commented Dec 15, 2021

Because of this observation, the Lloyd solver was enabled by default in #21735.

But the poor performance of the Elkan implementation can be caused by a bad interaction between the iterated calls to 2 distinct threadpools (one from OpenBLAS and one from OpenMP) as diagnosed in #20642.

To confirm this, it might we worth measure the performance of both methods in a conda-forge environment where both the OpenBLAS from scipy and the scikit-learn compiled extensions are linked to a unique OpenMP runtime to make sure that the Operating System scheduler only deals with a single active threadpool when fitting the model as explained here: #20642 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation help wanted Moderate Anything that requires some knowledge of conventions and best practices module:cluster
Projects
None yet
Development

No branches or pull requests

5 participants