Replacement for wrong k-means++ initialization #99

Closed
wants to merge 3 commits

Conversation

@f0k
Contributor

f0k commented Mar 4, 2011

As noted in issue scikit-learn/scikit-learn#98, the k-means++ initialization in scikits.learn is based on a widespread Python implementation of k-means++ which is short and simple, but wrong (i.e. it is not k-means++).

I have now ported and optimized the original C++ implementation by the authors of the 2007 k-means++ paper. Not only does this correctly implement k-means++, it also reduces the computational complexity from O(k * n_samples**2) to O(k * n_samples * numLocalTries).
I therefore removed the max_samples parameter -- it is now fast enough even with large data sets (on my system it takes around a minute to choose 64 centers for 1e6 data points). Alternatively, we could just leave it there (with a high default value) for backwards compatibility.
As scikits.learn depends on scipy anyway, I am using scipy.spatial.distance.cdist for distance calculations. It is a lot faster than scikits.learn.metrics.pairwise.euclidean_distances -- it may pay to make _e_step() use cdist() as well.
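
For reference, a minimal NumPy sketch of the greedy k-means++ seeding described above (D^2 sampling plus local trials, following the structure of the reference C++ code; the function name and the modern NumPy calls are illustrative, not the PR's exact code):

import numpy as np
from scipy.spatial.distance import cdist

def kmeanspp_init(X, k, n_local_trials=None, rng=None):
    rng = np.random.default_rng(rng)
    n_samples = X.shape[0]
    if n_local_trials is None:
        # Heuristic used by the reference implementation.
        n_local_trials = 2 + int(np.log(k))
    centers = np.empty((k, X.shape[1]))
    # First center: picked uniformly at random.
    centers[0] = X[rng.integers(n_samples)]
    closest_dist_sq = cdist(centers[:1], X, 'sqeuclidean')[0]
    current_pot = closest_dist_sq.sum()
    for c in range(1, k):
        # Sample candidates with probability proportional to the
        # squared distance to the closest existing center.
        rand_vals = rng.random(n_local_trials) * current_pot
        candidate_ids = np.searchsorted(np.cumsum(closest_dist_sq), rand_vals)
        # Keep the candidate that reduces the potential the most.
        dist_to_candidates = cdist(X[candidate_ids], X, 'sqeuclidean')
        new_dist_sq = np.minimum(closest_dist_sq, dist_to_candidates)
        candidate_pots = new_dist_sq.sum(axis=1)
        best = int(np.argmin(candidate_pots))
        centers[c] = X[candidate_ids[best]]
        closest_dist_sq = new_dist_sq[best]
        current_pot = candidate_pots[best]
    return centers

With n_local_trials around 2 + log(k), each of the k seeding steps costs O(n_samples * n_local_trials) distance evaluations, which is where the speed-up over the quadratic variant comes from.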

@mblondel
Member

mblondel commented Mar 4, 2011

Thanks for your work. One of the strengths of open source is that we can have many pairs of eyes to check the source :)

A quick look at the source code tells me that you use javaStyle, but we use the python_style naming convention.

I would vote for giving up backward compatibility: as far as I know, the sampling step is important to avoid outliers. The default choice should always be the most sensible choice for the general user.

In your experience, is scipy.spatial.distance.cdist faster than scikits.learn.metrics.pairwise.euclidean_distances even when n_samples and n_features are big? If I remember correctly, scipy.spatial.distance.cdist doesn't use BLAS internally.

If scipy.spatial.distance.cdist is faster, I would vote for using it in scikits.learn.metrics.pairwise. Then we can create sparse equivalents in scikits.learn.metrics.pairwise.sparse.


@ogrisel
Member

ogrisel commented Mar 4, 2011

Thanks for spotting this, f0k!

We should also check compatibility with various versions of scipy (Fabian wants the scikit to work with scipy 0.6 onwards, with compat/wrapper code in scikits.learn.utils where necessary).

To check the code style, please run the pep8 utility on your source code:

$ sudo pip install pep8
$ pep8 /path/to/my/source/file.py

As Mathieu said, could you please also check the performance impact for various n_features, n_samples and n_clusters?


scikits/learn/cluster/k_means_.py
- algorithm is n_samples**2, if n_samples > n_samples_max,
- we use the Niquist strategy, and choose our centers in the
- n_samples_max samples randomly choosen.
+ numLocalTries: integer, optional

@ogrisel
Member

ogrisel Mar 4, 2011

s/numLocalTries/n_trials/

Please also mention the default value of this param in the docstring.


@GaelVaroquaux
Member

GaelVaroquaux Mar 4, 2011

And describe what it does :)


@f0k
Contributor

f0k Mar 4, 2011

I have now renamed all the variables and mentioned the default value.
@Gael: I already described it, you'd just have had to scroll down in the diff ;)


scikits/learn/cluster/k_means_.py
+ currentPot = closestDistSq.sum()
+
+ # Pick the remaining k-1 points
+ for c in xrange(1, k):

@ogrisel
Member

ogrisel Mar 4, 2011

Indentation is 4 spaces (no tabs), not 2. Only Google has the right to use 2-space indents :)


@f0k
Contributor

f0k Mar 4, 2011

Oops... I remembered to set my editor to use space indentation, but then I guttenberg'ed, err, copied and pasted the tab-indented code from another window. Fixed now.
BTW, GitHub shows tab indentation as 2-space indents -- is this endorsed by Google?


@mblondel
Member

mblondel Mar 6, 2011

Shall we add vim and emacs tags at the beginning of files to automatically set the proper indentation mode?

nipy does that IIRC.


@mblondel
Member

mblondel Mar 6, 2011

Something like:

# emacs: -*- mode: python; py-indent-offset: 4; indent-tabs-mode: nil -*-
# vi: set ft=python sts=4 ts=4 sw=4 et:

@GaelVaroquaux
Member

GaelVaroquaux Mar 6, 2011

If you want. I am -0 on this: I don't think that such lines do any harm, but I don't really use them, as my vimrc is well set. Thus I won't volunteer to maintain them, but I'll be happy if someone adds them.


@ogrisel
Member

ogrisel Mar 6, 2011

I don't have a strong opinion on this either. Maybe we should instead improve the contributor documentation.


@GaelVaroquaux
Member

GaelVaroquaux Mar 6, 2011

I am +10 on improving the contributor documentation.


@GaelVaroquaux
Member

GaelVaroquaux commented Mar 4, 2011

In my experience, scipy is faster than the scikit for distance computation only in small dimensions. However, your point is valid: euclidean_distances should have a switch on the dimension to choose which implementation to use, based on benchmarks (and the switch should take into account whether dot uses BLAS, or whether numpy was compiled without BLAS).

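
A sketch of what such a dimension switch could look like (the function name and the threshold value are illustrative assumptions; the real crossover point would come from benchmarks, as suggested above):

import numpy as np
from scipy.spatial.distance import cdist

def euclidean_distances_switch(X, Y, dim_threshold=20):
    if X.shape[1] < dim_threshold:
        # Low dimension: scipy's plain C loops tend to win
        # and use less memory.
        return cdist(X, Y, 'euclidean')
    # High dimension: the BLAS-backed expansion
    # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 is faster.
    XX = (X * X).sum(axis=1)[:, np.newaxis]
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]
    d2 = XX - 2 * (X @ Y.T) + YY
    np.maximum(d2, 0, out=d2)  # clip tiny negatives from rounding
    return np.sqrt(d2)

Keeping the switch inside the pairwise-distance routine would let callers such as k_init() get the better implementation automatically.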

Extended docstring, renamed variables from javaStyle to python_style, replaced tab-indents with space-indents, pep8
@f0k
Contributor

f0k commented Mar 4, 2011

@mblondel:
In this case, subsampling was only important because of the quadratic runtime. The greedy sampling strategy should not be sensitive to outliers (unlike a farthest-first heuristic).

@ALL:
You're right, scipy doesn't use BLAS, just custom C routines, and it is not always faster than the scikit. However, scipy uses less memory, and I think part of the large runtime difference I saw yesterday was due to swapping...
I have now done some new benchmarks. I cannot do a systematic large-scale test, as most of my cores are busy with my Master's thesis and will be for the better part of the next weeks. Anyway, these are my results:

In [273]: %time foo = cdist(np.random.rand(5000,10),np.random.rand(10000,10),'euclidean')
CPU times: user 1.16 s, sys: 0.17 s, total: 1.33 s
Wall time: 1.56 s
In [275]: %time foo = euclidian_distances(np.random.rand(5000,10),np.random.rand(10000,10))
CPU times: user 1.50 s, sys: 1.21 s, total: 2.71 s
Wall time: 4.72 s
In [277]: %time foo = cdist(np.random.rand(5000,100),np.random.rand(10000,100),'euclidean')
CPU times: user 11.22 s, sys: 0.22 s, total: 11.44 s
Wall time: 11.91 s
In [279]: %time foo = euclidian_distances(np.random.rand(5000,100),np.random.rand(10000,100))
CPU times: user 2.47 s, sys: 1.13 s, total: 3.60 s
Wall time: 7.58 s

So it seems scipy beats scikits only for low-dimensional settings. This also holds for the more relevant setting of comparing a small number of cluster centres to a large number of samples (I should use 1e6 samples instead of 1e5 here, but I don't have enough free memory right now):

In [305]: %time foo = cdist(np.random.rand(50,10),np.random.rand(100000,10),'euclidean')
CPU times: user 0.17 s, sys: 0.03 s, total: 0.20 s
Wall time: 0.26 s
In [307]: %time foo = euclidian_distances(np.random.rand(50,10),np.random.rand(100000,10))
CPU times: user 0.26 s, sys: 0.08 s, total: 0.34 s
Wall time: 0.41 s
In [309]: %time foo = cdist(np.random.rand(50,100),np.random.rand(100000,100),'euclidean')
CPU times: user 1.38 s, sys: 0.10 s, total: 1.48 s
Wall time: 1.67 s
In [311]: %time foo = euclidian_distances(np.random.rand(50,100),np.random.rand(100000,100))
CPU times: user 0.77 s, sys: 0.25 s, total: 1.02 s
Wall time: 1.22 s

Based on these results, I think it would be best to use the scikits distance routine exclusively and rely on it to perform optimally in all cases (via a switch as suggested by Gael). I will adapt my code once more and hope somebody will work on scikits.learn.metrics.pairwise :)
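
To reproduce such comparisons outside an interactive session, a small timing harness along these lines can be used (euclidean_blas is an inline stand-in for the scikit's BLAS-backed routine, and the sizes mirror the runs above; all names here are illustrative):

import time
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_blas(X, Y):
    # Same expansion the scikit uses, with the dot product via BLAS.
    XX = (X * X).sum(axis=1)[:, np.newaxis]
    YY = (Y * Y).sum(axis=1)[np.newaxis, :]
    return np.sqrt(np.maximum(XX - 2 * (X @ Y.T) + YY, 0))

for n_features in (10, 100):
    X = np.random.rand(50, n_features)
    Y = np.random.rand(100000, n_features)
    for name, fn in (('cdist', lambda: cdist(X, Y, 'euclidean')),
                     ('blas ', lambda: euclidean_blas(X, Y))):
        t0 = time.perf_counter()
        fn()
        print(name, n_features, '%.3f s' % (time.perf_counter() - t0))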


@ogrisel
Member

ogrisel commented Mar 4, 2011

Thanks for the benchmarks. I am +1 for merging this once n_trials is renamed to n_local_trials and the calls to cdist are replaced by calls to euclidean_distances.


Use scikits distance functions instead of scipy's. Avoid recomputations of x_squared_norms wherever possible. Completion and unification of docstrings.
@f0k
Contributor

f0k commented Mar 4, 2011

Renamed n_trials and used euclidean_distances. I also pulled the computation of x_squared_norms out of the loop in k_means() and reuse it in k_init(). Last but not least, I added some missing parameters to the docstrings and unified their format. What I didn't do is test the completed code... is there an easy way to do this without installing the whole package?
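
The x_squared_norms refactoring amounts to computing the per-sample squared norms once and reusing them in every distance computation. A minimal sketch (the helper names are illustrative, not the PR's exact API):

import numpy as np

def squared_norms(X):
    # Computed once per data set, O(n_samples * n_features).
    return (X * X).sum(axis=1)

def dist_sq_to_center(X, x_squared_norms, center):
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, reusing the
    # precomputed ||x||^2 instead of recomputing it per iteration.
    return x_squared_norms - 2 * (X @ center) + center @ center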


@ogrisel
Member

ogrisel commented Mar 4, 2011

Just type:

make

in the top-level folder of the source tree.


@f0k
Contributor

f0k commented Mar 4, 2011

Thanks, ogrisel. Okay, I think everything works fine.

Old version:
In [1]: import numpy as np
In [2]: import scikits.learn.cluster.k_means_ as km
In [3]: from scikits.learn.cluster import k_means
In [4]: %time x = km.k_init(np.random.rand(5000,10), 15, n_samples_max=5000)
CPU times: user 44.90 s, sys: 0.00 s, total: 44.90 s
Wall time: 44.94 s
In [6]: k_means(np.random.rand(10000,10), 15)[2]
Out[6]: 5481.2078227712773

New version:
In [4]: %time x = km.k_init(np.random.rand(5000,10), 15)
CPU times: user 0.01 s, sys: 0.01 s, total: 0.02 s
Wall time: 0.02 s
In [6]: k_means(np.random.rand(10000,10), 15)[2]
Out[6]: 5478.6292170467568

It's faster and it works, but we will have to see whether it also produces better clusterings on real data.


@GaelVaroquaux
Member

GaelVaroquaux commented Mar 6, 2011

Is there anything remaining to be done on this branch, or should I look into merging it?


@f0k
Contributor

f0k commented Mar 6, 2011

It can be merged.

Probably someone should work on using cdist in euclidean_distances in specific cases, but that's a separate issue. (I'm not too interested in it, as I'm creating a GPU version anyway.)


@GaelVaroquaux
Member

GaelVaroquaux commented Mar 6, 2011

OK, let's merge your excellent modifications first, and we can work on using cdist in euclidean_distances as a separate task. I'll try and do it this afternoon.


@GaelVaroquaux
Member

GaelVaroquaux commented Mar 6, 2011

Merged.


This issue was closed.
