kmeans elkan vs. lloyd performance #16976
Both lloyd and elkan were improved recently. Lloyd got a bigger boost than elkan (mostly thanks to the use of GEMM), so elkan is no longer always faster than lloyd, as it roughly was before. However, I observed that which one is faster depends on the dataset: when the clusters are well separated, elkan still tends to beat lloyd because the number of distances it has to compute drops quickly. So I'm not sure the default should be changed. Maybe running a benchmark on a set of standard datasets would give us a hint.
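A quick way to get such a hint is to time both solvers on the same data. Here is a minimal sketch (not an official benchmark), assuming a recent scikit-learn release where the solvers are named `"lloyd"` and `"elkan"`; the dataset sizes are arbitrary:

```python
from time import perf_counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with fairly well-separated clusters.
X, _ = make_blobs(n_samples=100_000, centers=50, n_features=20, random_state=0)

for algorithm in ("lloyd", "elkan"):
    km = KMeans(n_clusters=50, algorithm=algorithm, n_init=1, random_state=0)
    tic = perf_counter()
    km.fit(X)
    print(f"{algorithm}: {perf_counter() - tic:.2f}s ({km.n_iter_} iterations)")
```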
When it comes to improving the elkan implementation, is there a way to exploit GEMM there as well?
When working on it, I didn't find a way to do that. It also seems unlikely, since for elkan the distances we need to compute can't be expressed as pairwise distances between two sets of points.
If I had found one, I would have implemented it when reworking kmeans :) so unfortunately no :/
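For context, here is why Lloyd maps so well onto GEMM: the sample-to-center squared distances factor as ||x − c||² = ||x||² − 2 x·c + ||c||², so the expensive cross term for all pairs is a single matrix product handled by the BLAS. A small NumPy sketch of the idea (not scikit-learn's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))  # samples
C = rng.standard_normal((8, 50))     # cluster centers

# Row-wise squared norms of samples and centers.
x_sq = np.einsum("ij,ij->i", X, X)[:, None]
c_sq = np.einsum("ij,ij->i", C, C)[None, :]

# The cross term -2 * X @ C.T is one GEMM call; combined with the norms it
# yields all pairwise squared distances at once.
d_sq = x_sq - 2.0 * X @ C.T + c_sq

labels = d_sq.argmin(axis=1)  # Lloyd's assignment step
```

Elkan, by contrast, uses per-sample bounds to skip most of these distance computations, so the distances that remain to be computed form an irregular subset that doesn't reduce to one dense matrix product.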
We should probably update the documentation of the `elkan` option to emphasize that it can be slower than the default lloyd solver, and that this heavily depends on the natural "blob-iness" of the data.
Maybe we could even add or update an example that compares the performance of the two solvers on different synthetic datasets where the difference is expected to be most extreme (e.g. well-separated blobs vs. heavily overlapping clusters), along the lines of the sketch below.
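A hypothetical skeleton for such an example (the sizes and `cluster_std` values are made up for illustration): sweep the blob overlap and time both solvers at each setting:

```python
from time import perf_counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# From well-separated blobs to heavily overlapping ones.
for cluster_std in (0.5, 2.0, 8.0):
    X, _ = make_blobs(n_samples=50_000, centers=20, n_features=10,
                      cluster_std=cluster_std, random_state=0)
    for algorithm in ("lloyd", "elkan"):
        km = KMeans(n_clusters=20, algorithm=algorithm, n_init=1,
                    random_state=0)
        tic = perf_counter()
        km.fit(X)
        print(f"std={cluster_std}: {algorithm:5s} "
              f"{perf_counter() - tic:.2f}s ({km.n_iter_} iter)")
```

The expectation, per the discussion above, is that elkan's advantage grows as the clusters separate and its distance bounds prune more work.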
I put the Moderate label on this issue because interested contributors should first be really familiar with the inner workings of both algorithms (Elkan vs. Lloyd) before considering a PR to address my comments above. If you are not, please read the scikit-learn documentation, the source code of the two implementations, and the references linked from the K-means chapter of the user guide.
Because of this observation, the Lloyd solver was made the default in #21735. But the poor performance of the Elkan implementation can be caused by a bad interaction between iterated calls into two distinct threadpools (one from OpenBLAS and one from OpenMP), as diagnosed in #20642. To confirm this, it might be worth measuring the performance of both methods in a conda-forge environment where both the OpenBLAS used by scipy and the scikit-learn compiled extensions are linked to a single OpenMP runtime, so that the operating system scheduler only deals with one active threadpool when fitting the model, as explained here: #20642 (comment)
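One way to check which runtimes are actually loaded in a given environment is threadpoolctl, the library scikit-learn itself relies on for threadpool introspection. A small sketch:

```python
from pprint import pprint

from threadpoolctl import threadpool_info

# Import first so the relevant shared libraries are loaded into the process.
import sklearn  # noqa: F401
import scipy    # noqa: F401

# Lists every BLAS / OpenMP threadpool found in the process: its API
# ("blas" or "openmp"), the shared library path, and the current number
# of threads. Seeing both an OpenBLAS pool and a separate OpenMP runtime
# would be consistent with the oversubscription scenario from #20642.
pprint(threadpool_info())
```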
I tried kmeans fit_predict performance on 0.23-dev0 with the mnist_all.csv dataset (http://ratml.org/datasets/mnist_all.csv). It seems elkan takes more time than lloyd (with both dense and sparse input); I'm not sure whether this is due to dgemm vectorization. So I'm wondering whether this is expected, and if so, whether the 'auto' logic of the algorithm parameter will be changed to redirect to lloyd.
And is there any ongoing work or plan to improve elkan performance? Thanks