
Benchmark KMeans against pydaal KMeans #9430

Closed
amueller opened this issue Jul 21, 2017 · 10 comments

Labels: Easy (Well-defined and straightforward way to resolve), help wanted

Comments

@amueller
Member

PyDAAL cited speedups of about 40x on a single CPU between their Lloyd k-means and ours.
It would be great to reproduce this and identify the bottlenecks. We should also compare against our Elkan variant and see what the differences are.
I think this was done for several low-dimensional spaces, with n_clusters=100 and varying n_samples.

@amueller amueller added the Easy and Need Contributor labels Jul 21, 2017
@souravsingh
Contributor

souravsingh commented Jul 22, 2017

@amueller Which dataset should we use for the benchmark, and should we make use of the Intel Python distribution?

@sangeet259

I am a newcomer to open source contribution. I know Python; can you tell me how to get started with this bug?

@amueller
Member Author

amueller commented Aug 1, 2017

Try using random data with varying dimensionality and number of samples. Use data that is large enough that the overhead won't outweigh the computation, so maybe something that takes on the order of a minute.
I think you need to use their Python distribution to use PyDAAL.

@amueller
Member Author

amueller commented Aug 9, 2017

Their benchmarks are here: https://github.com/dvnagorny/sklearn_benchs

@vrishank97
Contributor

Can I take this up as my first contribution here?

@souravsingh
Contributor

@amueller I have started the benchmark by making modifications to the linked repository. I will post my results soonish.

@amueller
Member Author

Creating an Intel Python environment can be done with:

conda create -n idp intelpython3_full python=3 -c intel

@amueller
Member Author

OK, so it looks like they ignore n_init, which means they only run one random initialization, which gives them a 10x speedup to begin with, lol. That's pretty ridiculous. Though running 10 random initializations is a bit arbitrary, to be fair...
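The point above can be made concrete: scikit-learn's `n_init` runs k-means from that many random initializations and keeps the best result, so the work scales roughly linearly with it. A quick sketch (sizes chosen small here so it runs quickly; the ~10x gap is an expectation, not guaranteed on tiny data):

```python
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)

# One restart vs. the classic default of 10 restarts: roughly 10x the work,
# which alone accounts for most of a reported 10x gap if the other library
# skips restarts entirely.
times = {}
for n_init in (1, 10):
    tic = time.perf_counter()
    KMeans(n_clusters=100, n_init=n_init, random_state=0).fit(X)
    times[n_init] = time.perf_counter() - tic
    print(f"n_init={n_init}: {times[n_init]:.2f}s")
```

Note that a fair cross-library benchmark has to pin this parameter (or its equivalent) to the same value on both sides.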

@amueller
Member Author

Actually, they might do 5 random restarts? It's not entirely obvious.

@amueller
Member Author

amueller commented Apr 6, 2018

It looks like a lot of their speedups were due to the restarts, and they did a better benchmark which showed only modest improvements (I don't remember exactly; something like a 20% or 40% speedup).

I think this issue is now well covered by #10744.

@amueller amueller closed this as completed Apr 6, 2018
5 participants