
Benchmark KMeans against pydaal KMeans #9430

Closed
amueller opened this issue Jul 21, 2017 · 10 comments

Labels: Easy (Well-defined and straightforward way to resolve), help wanted

Comments

@amueller
Member

PyDAAL cited speedups of about 40x on a single CPU between their Lloyd k-means and ours.
It would be great to reproduce this and identify the bottlenecks. We should also compare against our Elkan variant and see what the differences are.
I think this was done for several low-dimensional spaces, with n_clusters=100 and varying n_samples.

@amueller amueller added the Easy and Need Contributor labels Jul 21, 2017
@souravsingh
Contributor

souravsingh commented Jul 22, 2017

@amueller Which dataset should we use for the benchmark, and should we make use of the Intel Python distribution?

@sangeet259

I am a newcomer to open source contribution. I know Python; can you tell me how to get started with this bug?

@amueller
Member Author

amueller commented Aug 1, 2017

Try using random data with varying dimensionality and number of samples. Use data that is large enough that the overhead won't outweigh the computation, so maybe something that takes on the order of a minute.
I think you need to use their Python distribution to use PyDAAL.

@amueller
Member Author

amueller commented Aug 9, 2017

Their benchmarks are here: https://github.com/dvnagorny/sklearn_benchs

@vrishank97
Contributor

Can I take this up as my first contribution here?

@souravsingh
Contributor

@amueller I have started the benchmark by making modifications to the linked repository. I will post my results soonish.

@amueller
Member Author

Creating an Intel Python environment can be done with:

conda create -n idp intelpython3_full python=3 -c intel

@amueller
Member Author

OK, so it looks like they ignore n_init, which means they only run one random initialization, which gives them a 10x speedup to begin with, lol. That's pretty ridiculous. Though running 10 random initializations is a bit arbitrary, to be fair...
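The point above can be made concrete: scikit-learn's `n_init` runs k-means from that many random initializations and keeps the best result, so the work scales roughly linearly with it. A quick sketch (sizes chosen small here so it runs quickly; the ~10x gap is an expectation, not guaranteed on tiny data):

```python
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)

# One restart vs. the classic default of 10 restarts: roughly 10x the work,
# which alone accounts for most of a reported 10x gap if the other library
# skips restarts entirely.
times = {}
for n_init in (1, 10):
    tic = time.perf_counter()
    KMeans(n_clusters=100, n_init=n_init, random_state=0).fit(X)
    times[n_init] = time.perf_counter() - tic
    print(f"n_init={n_init}: {times[n_init]:.2f}s")
```

Note that a fair cross-library benchmark has to pin this parameter (or its equivalent) to the same value on both sides.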

@amueller
Member Author

Actually, they might do 5 random restarts? It's not entirely obvious.

@amueller
Member Author

amueller commented Apr 6, 2018

It looks like a lot of their speedups were due to the restarts, and they did a better benchmark which showed only modest improvements (I don't remember exactly; something like a 20% or 40% speedup).

I think this issue is now well covered by #10744.

@amueller amueller closed this as completed Apr 6, 2018
5 participants