Benchmark KMeans against pydaal KMeans #9430
Comments
@amueller Which dataset should we use for the benchmark, and should we use the Intel Python distribution to run it?
I am a newcomer to open source contribution. I know Python; can you tell me how to get started with this issue?
Try using random data with varying dimensionality and number of samples. Use data that is large enough that the overhead won't outweigh the computation, so maybe use something that takes on the order of a minute.
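A minimal sketch of that setup, timing scikit-learn's `KMeans` on random data of varying shape (the specific sizes here are illustrative assumptions, scaled down from what a real benchmark would use):

```python
# Sketch: time KMeans.fit on random data of varying shape.
# Sizes are illustrative; a real benchmark should be large enough
# that each fit takes on the order of a minute.
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
results = []
for n_samples, n_features in [(2_000, 10), (5_000, 20)]:
    X = rng.standard_normal((n_samples, n_features))
    km = KMeans(n_clusters=100, n_init=10, random_state=0)
    tic = time.perf_counter()
    km.fit(X)
    results.append((n_samples, n_features, time.perf_counter() - tic))

for n_samples, n_features, elapsed in results:
    print(f"n_samples={n_samples} n_features={n_features}: {elapsed:.2f}s")
```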
Their benchmarks are here: https://github.com/dvnagorny/sklearn_benchs
Can I take this up as my first contribution here?
@amueller I have started the benchmark by making the modifications to the given repository. I will post my results soon.
An Intel Python environment can be created with conda.
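For reference, something along these lines should work; the channel and meta-package names (`intel`, `intelpython3_full`) are assumptions based on how Intel commonly documents its distribution, not a command taken from this thread:

```shell
# Assumed: create a conda env from Intel's channel using the
# intelpython3_full meta-package, then activate it.
conda create -n intelpython -c intel intelpython3_full
conda activate intelpython
```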
ok so it looks like they ignore n_init, which means they only run one random initialization, which gives them a 10x speedup to begin with. That's pretty questionable. Though running 10 random initializations is itself a bit arbitrary, to be fair...
Actually, they might do 5 random restarts? Though it's not entirely obvious.
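For context, `n_init` is the number of random restarts in scikit-learn's `KMeans`; dropping it from 10 to 1 removes most of that cost, which is the roughly 10x head start described above. A sketch (data sizes are illustrative assumptions):

```python
# Sketch: compare fit time with 1 vs 10 random initializations.
# Sizes are illustrative; n_init=1 does ~1/10th the work of n_init=10.
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.standard_normal((5_000, 10))

timings = {}
for n_init in (1, 10):
    km = KMeans(n_clusters=50, n_init=n_init, random_state=0)
    tic = time.perf_counter()
    km.fit(X)
    timings[n_init] = time.perf_counter() - tic

print(timings)
```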
It looks like a lot of their speedups were due to the restarts; a later, better benchmark showed only fractional improvements (I don't remember exactly, something like a 20% or 40% speedup). I think this issue is now well covered by #10744.
pydaal cited speedups of about 40x on a single CPU between their Lloyd k-means and ours.
It would be great to reproduce and identify bottlenecks. Also, we should compare against our elkan and see what the differences are.
I think this was done for several low-dim spaces, n_clusters=100 and varying n_samples.