
Replacement for wrong k-means++ initialization #99

Closed
wants to merge 3 commits

Conversation

f0k
Contributor

@f0k f0k commented Mar 4, 2011

As noted in Issue #98, the k-means++ initialization in scikits.learn is based on a widespread Python implementation of k-means++ which is short and simple, but wrong (i.e., it is not actually k-means++).

I have now ported and optimized the original C++ implementation by the authors of the 2007 k-means++ paper. Not only does this correctly implement k-means++; it also reduces the computational complexity from O(k * n_samples**2) to O(k * n_samples * numLocalTries).
I therefore removed the max_samples parameter -- it is now fast enough even with large data sets (on my system it takes around a minute to choose 64 centers for 1e6 data points). Alternatively, we could just leave it there (with a high default value) for backwards compatibility.
As scikits.learn depends on scipy anyway, I am using scipy.spatial.distance.cdist for distance calculations. It is a lot faster than scikits.learn.metrics.pairwise.euclidean_distances -- it may pay to make _e_step() use cdist() as well.
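For readers unfamiliar with the algorithm, the greedy D² seeding from the 2007 paper can be sketched roughly as follows. This is a minimal illustration, not the ported code; the function name `kmeanspp_init` and its signature are made up for this sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeanspp_init(X, n_clusters, n_local_trials=None, rng=None):
    """Sketch of greedy k-means++ seeding (illustrative, not the PR code)."""
    rng = np.random.default_rng(rng)
    n_samples = X.shape[0]
    if n_local_trials is None:
        # heuristic similar to the one in the authors' implementation
        n_local_trials = 2 + int(np.log(n_clusters))
    # first center: chosen uniformly at random
    centers = [X[rng.integers(n_samples)]]
    # squared distance from every point to its closest chosen center
    closest_sq = cdist(X, [centers[0]], 'sqeuclidean').ravel()
    for _ in range(1, n_clusters):
        # sample several candidates with probability proportional to D(x)^2
        probs = closest_sq / closest_sq.sum()
        candidates = rng.choice(n_samples, size=n_local_trials, p=probs)
        # greedily keep the candidate that reduces the total potential most
        best_pot, best_idx, best_sq = None, None, None
        for c in candidates:
            sq = np.minimum(closest_sq,
                            cdist(X, X[c:c + 1], 'sqeuclidean').ravel())
            pot = sq.sum()
            if best_pot is None or pot < best_pot:
                best_pot, best_idx, best_sq = pot, c, sq
        centers.append(X[best_idx])
        closest_sq = best_sq
    return np.array(centers)
```

Each of the k rounds touches every sample once per candidate, which is where the O(k * n_samples * numLocalTries) cost comes from, compared with the quadratic cost of recomputing all pairwise distances.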

@mblondel
Member

mblondel commented Mar 4, 2011

Thanks for your work. One of the strengths of open source is that we can have many pairs of eyes to check the source :)

A quick look at the source code tells me that it uses the javaStyle naming convention, but we use python_style.

I would vote for giving up backward compatibility: as far as I know, the sampling step is important to avoid outliers. The default should always be the most sensible choice for the general user.

In your experience, is scipy.spatial.distance.cdist faster than scikits.learn.metrics.pairwise.euclidean_distances even when n_samples and n_features are big? If I remember correctly, scipy.spatial.distance.cdist doesn't use BLAS internally.

If scipy.spatial.distance.cdist is faster, I would vote for using it in scikits.learn.metrics.pairwise. Then we can create sparse equivalents in scikits.learn.metrics.pairwise.sparse.

@ogrisel
Member

ogrisel commented Mar 4, 2011

Thanks for spotting this f0k!

We should also check the compatibility with various versions of scipy (Fabian wants the scikit to work with scipy version 0.6 onwards with compat / wrapper code in the scikits.learn.utils when necessary).

To check the code style please run the pep8 utility on your source code:

$ sudo pip install pep8
$ pep8 /path/to/my/source/file.py

As Mathieu said, could you please also check the performance impact for various n_features, n_samples and n_clusters?

    algorithm is n_samples**2, if n_samples > n_samples_max,
    we use the Niquist strategy, and choose our centers in the
    n_samples_max samples randomly choosen.
    numLocalTries: integer, optional
Member


s/numLocalTries/n_trials/

Please also mention the default value of this parameter in the docstring.

Member


And describe what it does :)

Contributor Author


I renamed all the variables now, and mention the default value.
@Gael: I already described it, you'd just have had to scroll down in the diff ;)

@GaelVaroquaux
Member

In my experience, scipy is faster than the scikit for distance computation only in small dimensions. However, your point is valid: euclidean_distances should have a switch on the dimension to choose which implementation to use, based on benchmarks (and the switch should take into account whether dot is backed by BLAS, or whether numpy hasn't been compiled with one).
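Such a switch could look roughly like the sketch below. The function name `euclidean_distances_switch` and the `dim_threshold` value are hypothetical; the actual threshold would have to come from the benchmarks discussed here:

```python
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_distances_switch(X, Y, dim_threshold=30):
    """Hypothetical dispatch: cdist for low dimension, BLAS-backed
    expansion ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 otherwise.
    The threshold of 30 is a placeholder, not a benchmarked value."""
    if X.shape[1] < dim_threshold:
        # scipy's C loops win in low dimensions and use less memory
        return cdist(X, Y, 'euclidean')
    # high dimensions: let the matrix product go through BLAS
    XX = (X * X).sum(axis=1)[:, None]
    YY = (Y * Y).sum(axis=1)[None, :]
    d2 = XX - 2.0 * (X @ Y.T) + YY
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives from rounding
    return np.sqrt(d2)
```

The BLAS branch trades extra memory (the full n_samples_X x n_samples_Y matrix plus the squared norms) for speed, which matches the swapping effect mentioned later in this thread.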

… replaced tab-indents with space-indents, pep8
@f0k
Contributor Author

f0k commented Mar 4, 2011

@mblondel:
In this case, sampling was just important because of the quadratic runtime. The greedy sampling strategy should not have been sensitive to outliers (unlike a farthest-first heuristic).

@ALL:
You're right, scipy doesn't use BLAS, just custom C routines, and it is not always faster than scikit. However, scipy uses less memory, and I think part of the large runtime difference I saw yesterday was due to swapping...
Now I did some new benchmarks. I cannot do a systematic large-scale test as most of my cores are busy with my Master's thesis and will be for the better part of the next few weeks. Anyway, these are my results:

In [273]: %time foo = cdist(np.random.rand(5000,10),np.random.rand(10000,10),'euclidean')
CPU times: user 1.16 s, sys: 0.17 s, total: 1.33 s
Wall time: 1.56 s
In [275]: %time foo = euclidian_distances(np.random.rand(5000,10),np.random.rand(10000,10))
CPU times: user 1.50 s, sys: 1.21 s, total: 2.71 s
Wall time: 4.72 s
In [277]: %time foo = cdist(np.random.rand(5000,100),np.random.rand(10000,100),'euclidean')
CPU times: user 11.22 s, sys: 0.22 s, total: 11.44 s
Wall time: 11.91 s
In [279]: %time foo = euclidian_distances(np.random.rand(5000,100),np.random.rand(10000,100))
CPU times: user 2.47 s, sys: 1.13 s, total: 3.60 s
Wall time: 7.58 s

So it seems scipy beats scikits only for low-dimensional settings. This also holds for the more relevant setting of comparing a small number of cluster centres to a large number of samples (I should use 1e6 samples instead of 1e5 here, but I don't have enough free memory right now):

In [305]: %time foo = cdist(np.random.rand(50,10),np.random.rand(100000,10),'euclidean')
CPU times: user 0.17 s, sys: 0.03 s, total: 0.20 s
Wall time: 0.26 s
In [307]: %time foo = euclidian_distances(np.random.rand(50,10),np.random.rand(100000,10))
CPU times: user 0.26 s, sys: 0.08 s, total: 0.34 s
Wall time: 0.41 s
In [309]: %time foo = cdist(np.random.rand(50,100),np.random.rand(100000,100),'euclidean')
CPU times: user 1.38 s, sys: 0.10 s, total: 1.48 s
Wall time: 1.67 s
In [311]: %time foo = euclidian_distances(np.random.rand(50,100),np.random.rand(100000,100))
CPU times: user 0.77 s, sys: 0.25 s, total: 1.02 s
Wall time: 1.22 s

Based on these results, I think it would be best to use the scikits distance routine exclusively and rely on it to perform optimally in all cases (via a switch as suggested by Gael). I will adapt my code once more and hope somebody will work on scikits.learn.metrics.pairwise :)

@ogrisel
Member

ogrisel commented Mar 4, 2011

Thanks for the bench. I am +1 for merging this once n_trials is renamed to n_local_trials and calls to cdist are replaced by calls to euclidian_distances.

…ns of x_squared_norms whereever possible. Completion and unification of docstrings.
@f0k
Contributor Author

f0k commented Mar 4, 2011

Renamed n_trials and used euclidean_distances. I also pulled the computation of x_squared_norms out of the loop in k_means() and reuse it in k_init(). Last but not least, I added some missing parameters to the docstrings and unified their format. What I didn't do is test the completed code... is there an easy way to do this without installing the whole package?
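The reuse rests on the identity ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, where the ||x||^2 term is the same for every center, so it can be computed once and passed around. A sketch of the idea (the function name is hypothetical, not the PR's code):

```python
import numpy as np

def squared_distances(X, centers, x_squared_norms):
    """||x - c||^2 via the expansion ||x||^2 - 2 x.c + ||c||^2.
    x_squared_norms = (X * X).sum(axis=1) is computed once by the
    caller and reused across iterations and candidate centers."""
    c_sq = (centers * centers).sum(axis=1)
    d2 = x_squared_norms[:, None] - 2.0 * (X @ centers.T) + c_sq[None, :]
    return np.maximum(d2, 0.0)  # clip tiny negatives from rounding
```

A caller would do `x_squared_norms = (X * X).sum(axis=1)` once before the k-means loop and pass it to every distance evaluation, saving an O(n_samples * n_features) pass per call.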

@ogrisel
Member

ogrisel commented Mar 4, 2011

Just type:

make

in the top-level folder of the source tree.

@f0k
Contributor Author

f0k commented Mar 4, 2011

Thanks, ogrisel. Okay, I think everything works fine.

Old version:
In [1]: import numpy as np
In [2]: import scikits.learn.cluster.k_means_ as km
In [3]: from scikits.learn.cluster import k_means
In [4]: %time x = km.k_init(np.random.rand(5000,10), 15, n_samples_max=5000)
CPU times: user 44.90 s, sys: 0.00 s, total: 44.90 s
Wall time: 44.94 s
In [6]: k_means(np.random.rand(10000,10), 15)[2]
Out[6]: 5481.2078227712773

New version:
In [4]: %time x = km.k_init(np.random.rand(5000,10), 15)
CPU times: user 0.01 s, sys: 0.01 s, total: 0.02 s
Wall time: 0.02 s
In [6]: k_means(np.random.rand(10000,10), 15)[2]
Out[6]: 5478.6292170467568

It's faster and it works, but we will have to see whether it also produces better clusterings on real data.

@GaelVaroquaux
Member

Is there anything remaining to be done on this branch, or should I look into merging it?

@f0k
Contributor Author

f0k commented Mar 6, 2011

It can be merged.

Probably someone should work on using cdist in euclidean_distances in specific cases, but that's a separate issue. (I'm not too interested in it as I'm creating a GPU version anyway.)

@GaelVaroquaux
Member

OK, let's merge your excellent modifications first, and we can work on using cdist in euclidean_distances in a separate task. I'll try and do it this afternoon.

@GaelVaroquaux
Member

Merged.

This pull request was closed.
4 participants