
Replacement for wrong k-means++ initialization #99

Closed
wants to merge 3 commits

Conversation

f0k
Contributor

@f0k f0k commented Mar 4, 2011

As noted in Issue #98, the k-means++ initialization in scikits.learn is based on a widespread Python implementation of k-means++ which is short and simple, but wrong (i.e., it is not actually k-means++).

I have now ported and optimized the original C++ implementation by the authors of the 2007 k-means++ paper. Not only does this correctly implement k-means++; it also reduces the computational complexity from O(k * n_samples**2) to O(k * n_samples * numLocalTries).
I therefore removed the max_samples parameter -- it is now fast enough even with large data sets (on my system it takes around a minute to choose 64 centers for 1e6 data points). Alternatively, we could just leave it there (with a high default value) for backwards compatibility.
As scikits.learn depends on scipy anyway, I am using scipy.spatial.distance.cdist for distance calculations. It is a lot faster than scikits.learn.metrics.pairwise.euclidean_distances -- it may pay to make _e_step() use cdist() as well.
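For readers unfamiliar with the algorithm, the greedy D² seeding from the 2007 paper can be sketched roughly as follows. This is a minimal illustration, not the ported code; the function name `kmeanspp_init` and its signature are made up for this sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeanspp_init(X, n_clusters, n_local_trials=None, rng=None):
    """Sketch of greedy k-means++ seeding (illustrative, not the PR code)."""
    rng = np.random.default_rng(rng)
    n_samples = X.shape[0]
    if n_local_trials is None:
        # heuristic similar to the one in the authors' implementation
        n_local_trials = 2 + int(np.log(n_clusters))
    # first center: chosen uniformly at random
    centers = [X[rng.integers(n_samples)]]
    # squared distance from every point to its closest chosen center
    closest_sq = cdist(X, [centers[0]], 'sqeuclidean').ravel()
    for _ in range(1, n_clusters):
        # sample several candidates with probability proportional to D(x)^2
        probs = closest_sq / closest_sq.sum()
        candidates = rng.choice(n_samples, size=n_local_trials, p=probs)
        # greedily keep the candidate that reduces the total potential most
        best_pot, best_idx, best_sq = None, None, None
        for c in candidates:
            sq = np.minimum(closest_sq,
                            cdist(X, X[c:c + 1], 'sqeuclidean').ravel())
            pot = sq.sum()
            if best_pot is None or pot < best_pot:
                best_pot, best_idx, best_sq = pot, c, sq
        centers.append(X[best_idx])
        closest_sq = best_sq
    return np.array(centers)
```

Each of the k rounds touches every sample once per candidate, which is where the O(k * n_samples * numLocalTries) cost comes from, compared with the quadratic cost of recomputing all pairwise distances.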

@mblondel
Member

mblondel commented Mar 4, 2011

Thanks for your work. One of the strengths of open source is that we can have many pairs of eyes to check the source :)

A quick look at the source code tells me that it uses the javaStyle naming convention, but we use python_style.

I would vote for giving up backward compatibility: as far as I know, the sampling step is important to avoid outliers. The default should always be the most sensible choice for the general user.

In your experience, is scipy.spatial.distance.cdist faster than scikits.learn.metrics.pairwise.euclidean_distances even when n_samples and n_features are big? If I remember correctly, scipy.spatial.distance.cdist doesn't use BLAS internally.

If scipy.spatial.distance.cdist is faster, I would vote for using it in scikits.learn.metrics.pairwise. Then we can create sparse equivalents in scikits.learn.metrics.pairwise.sparse.

@ogrisel
Member

ogrisel commented Mar 4, 2011

Thanks for spotting this f0k!

We should also check the compatibility with various versions of scipy (Fabian wants the scikit to work with scipy version 0.6 onwards with compat / wrapper code in the scikits.learn.utils when necessary).

To check the code style please run the pep8 utility on your source code:

$ sudo pip install pep8
$ pep8 /path/to/my/source/file.py

As Mathieu said, could you please also check the performance impact for various n_features, n_samples and n_clusters?

    algorithm is n_samples**2, if n_samples > n_samples_max,
    we use the Niquist strategy, and choose our centers in the
    n_samples_max samples randomly choosen.
    numLocalTries: integer, optional
Member


s/numLocalTries/n_trials/

Please also mention the default value of this parameter in the docstring.

Member


And describe what it does :)

Contributor Author


I renamed all the variables now, and mention the default value.
@Gael: I already described it, you'd just have had to scroll down in the diff ;)

@GaelVaroquaux
Member

In my experience, scipy is faster than the scikit for distance computation only in small dimensions. However, your point is valid: euclidean_distances should have a switch on the dimension to choose which implementation to use, based on benchmarks (and the switch should take into account whether dot is backed by BLAS, or whether numpy hasn't been compiled with one).
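Such a switch could look roughly like the sketch below. The function name `euclidean_distances_switch` and the `dim_threshold` value are hypothetical; the actual threshold would have to come from the benchmarks discussed here:

```python
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_distances_switch(X, Y, dim_threshold=30):
    """Hypothetical dispatch: cdist for low dimension, BLAS-backed
    expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 otherwise.
    The threshold of 30 is a placeholder, not a benchmarked value."""
    if X.shape[1] < dim_threshold:
        # scipy's C loops win in low dimensions and use less memory
        return cdist(X, Y, 'euclidean')
    # high dimensions: let the matrix product go through BLAS
    XX = (X * X).sum(axis=1)[:, None]
    YY = (Y * Y).sum(axis=1)[None, :]
    d2 = XX - 2.0 * (X @ Y.T) + YY
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives from rounding
    return np.sqrt(d2)
```

The BLAS branch trades extra memory (the full n_samples_X x n_samples_Y matrix plus the squared norms) for speed, which matches the swapping effect mentioned later in this thread.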

… replaced tab-indents with space-indents, pep8
@f0k
Contributor Author

f0k commented Mar 4, 2011

@mblondel:
In this case, sampling was just important because of the quadratic runtime. The greedy sampling strategy should not have been sensitive to outliers (unlike a farthest-first heuristic).

@ALL:
You're right, scipy doesn't use BLAS, just custom C routines, and it is not always faster than scikit. However, scipy uses less memory, and I think part of the large runtime difference I saw yesterday was due to swapping...
Now I did some new benchmarks. I cannot do a systematic large-scale test as most of my cores are busy with my Master's thesis and will be for the better part of the next few weeks. Anyway, these are my results:

In [273]: %time foo = cdist(np.random.rand(5000,10),np.random.rand(10000,10),'euclidean')
CPU times: user 1.16 s, sys: 0.17 s, total: 1.33 s
Wall time: 1.56 s
In [275]: %time foo = euclidian_distances(np.random.rand(5000,10),np.random.rand(10000,10))
CPU times: user 1.50 s, sys: 1.21 s, total: 2.71 s
Wall time: 4.72 s
In [277]: %time foo = cdist(np.random.rand(5000,100),np.random.rand(10000,100),'euclidean')
CPU times: user 11.22 s, sys: 0.22 s, total: 11.44 s
Wall time: 11.91 s
In [279]: %time foo = euclidian_distances(np.random.rand(5000,100),np.random.rand(10000,100))
CPU times: user 2.47 s, sys: 1.13 s, total: 3.60 s
Wall time: 7.58 s

So it seems scipy beats scikits only for low-dimensional settings. This also holds for the more relevant setting of comparing a small number of cluster centres to a large number of samples (I should use 1e6 samples instead of 1e5 here, but I don't have enough free memory right now):

In [305]: %time foo = cdist(np.random.rand(50,10),np.random.rand(100000,10),'euclidean')
CPU times: user 0.17 s, sys: 0.03 s, total: 0.20 s
Wall time: 0.26 s
In [307]: %time foo = euclidian_distances(np.random.rand(50,10),np.random.rand(100000,10))
CPU times: user 0.26 s, sys: 0.08 s, total: 0.34 s
Wall time: 0.41 s
In [309]: %time foo = cdist(np.random.rand(50,100),np.random.rand(100000,100),'euclidean')
CPU times: user 1.38 s, sys: 0.10 s, total: 1.48 s
Wall time: 1.67 s
In [311]: %time foo = euclidian_distances(np.random.rand(50,100),np.random.rand(100000,100))
CPU times: user 0.77 s, sys: 0.25 s, total: 1.02 s
Wall time: 1.22 s

Based on these results, I think it would be best to use the scikits distance routine exclusively and rely on it to perform optimally in all cases (via a switch as suggested by Gael). I will adapt my code once more and hope somebody will work on scikits.learn.metrics.pairwise :)

@ogrisel
Member

ogrisel commented Mar 4, 2011

Thanks for the bench. I am +1 for merging this once n_trials is renamed to n_local_trials and calls to cdist are replaced by calls to euclidian_distances.

…ns of x_squared_norms whereever possible. Completion and unification of docstrings.
@f0k
Contributor Author

f0k commented Mar 4, 2011

Renamed n_trials and used euclidean_distances. I also pulled the computation of x_squared_norms out of the loop in k_means() and reuse it in k_init(). Last but not least, I added some missing parameters to the docstrings and unified their format. What I didn't do is test the completed code... is there an easy way to do this without installing the whole package?
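The reuse rests on the identity ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, where the ||x||^2 term is the same for every center, so it can be computed once and passed around. A sketch of the idea (the function name is hypothetical, not the PR's code):

```python
import numpy as np

def squared_distances(X, centers, x_squared_norms):
    """||x - c||^2 via the expansion ||x||^2 - 2 x.c + ||c||^2.
    x_squared_norms = (X * X).sum(axis=1) is computed once by the
    caller and reused across iterations and candidate centers."""
    c_sq = (centers * centers).sum(axis=1)
    d2 = x_squared_norms[:, None] - 2.0 * (X @ centers.T) + c_sq[None, :]
    return np.maximum(d2, 0.0)  # clip tiny negatives from rounding
```

A caller would do `x_squared_norms = (X * X).sum(axis=1)` once before the k-means loop and pass it to every distance evaluation, saving an O(n_samples * n_features) pass per call.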

@ogrisel
Member

ogrisel commented Mar 4, 2011

Just type:

make

in the top-level folder of the source tree.

@f0k
Contributor Author

f0k commented Mar 4, 2011

Thanks, ogrisel. Okay, I think everything works fine.

Old version:
In [1]: import numpy as np
In [2]: import scikits.learn.cluster.k_means_ as km
In [3]: from scikits.learn.cluster import k_means
In [4]: %time x = km.k_init(np.random.rand(5000,10), 15, n_samples_max=5000)
CPU times: user 44.90 s, sys: 0.00 s, total: 44.90 s
Wall time: 44.94 s
In [6]: k_means(np.random.rand(10000,10), 15)[2]
Out[6]: 5481.2078227712773

New version:
In [4]: %time x = km.k_init(np.random.rand(5000,10), 15)
CPU times: user 0.01 s, sys: 0.01 s, total: 0.02 s
Wall time: 0.02 s
In [6]: k_means(np.random.rand(10000,10), 15)[2]
Out[6]: 5478.6292170467568

It's faster and it works, but we will have to see whether it also produces better clusterings on real data.

@GaelVaroquaux
Member

Is there anything remaining to be done on this branch, or should I look into merging it?

@f0k
Contributor Author

f0k commented Mar 6, 2011

It can be merged.

Probably someone should work on using cdist in euclidean_distances in specific cases, but that's a separate issue. (I'm not too interested in it as I'm creating a GPU version anyway.)

@GaelVaroquaux
Member

OK, let's merge your excellent modifications first, and we can work on using cdist in euclidean_distances in a separate task. I'll try and do it this afternoon.

@GaelVaroquaux
Member

Merged.

This pull request was closed.
4 participants