Replacement for wrong k-means++ initialization #99
Conversation
Thanks for your work. One of the strengths of open source is that we can have many pairs of eyes checking the source :) A quick look at the source code tells me that you use javaStyle, but we use the python_style naming convention. I would vote for giving up backward compatibility: as far as I know, the sampling step is important to avoid outliers. The default choice should always be the most sensible choice for the general user. In your experience, is scipy.spatial.distances.cdist faster than scikits.learn.metrics.pairwise.euclidean_distances even when n_samples and n_features are big? If I remember correctly, scipy.spatial.distances.cdist doesn't use BLAS internally. If scipy.spatial.distances.cdist is faster, I would vote for using it in scikits.learn.metrics.pairwise. Then we can create sparse equivalents in scikits.learn.metrics.pairwise.sparse.
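As an aside, the javaStyle to python_style renaming asked for here is mechanical. A rough sketch of the conversion (the regex and the function name are mine, purely for illustration, not anything in scikits.learn):

```python
import re

def camel_to_snake(name):
    """Convert a javaStyle identifier to python_style.

    Hypothetical helper illustrating the renaming discussed above.
    """
    # Insert an underscore before each uppercase letter that follows
    # a lowercase letter or digit, then lowercase the whole result.
    return re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name).lower()

print(camel_to_snake("numLocalTries"))  # num_local_tries
```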
Thanks for spotting this, f0k! We should also check compatibility with various versions of scipy (Fabian wants the scikit to work with scipy 0.6 onwards, with compat / wrapper code in scikits.learn.utils when necessary). To check the code style, please run the pep8 utility on your source code.
As Mathieu said, could you please also check the performance impact for various n_features, n_samples and n_clusters?
    algorithm is O(n_samples**2); if n_samples > n_samples_max,
    we use the Nyquist strategy and choose our centers among
    n_samples_max randomly chosen samples.

    numLocalTries: integer, optional
s/numLocalTries/n_trials/
Please also mention the default value of this param in the docstring.
And describe what it does :)
I have now renamed all the variables and mentioned the default value.
@Gael: I already described it, you just had to scroll down in the diff ;)
In my experience, scipy is faster than the scikit for distance computation only in small dimensions. However, your point is valid: we should have a switch on the dimension in euclidean_distances to choose which implementation to use, based on benchmarks (and the switch should take into account whether dot uses a BLAS, i.e. whether numpy was compiled with one).
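The dimension switch could look roughly like the sketch below. The function name and the `dim_threshold` default are made up for illustration; the actual crossover point would have to come from benchmarks:

```python
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_distances_with_switch(X, Y, dim_threshold=20):
    """Sketch of a dimension-based dispatch (hypothetical, not the
    scikits.learn implementation).

    For few features, cdist's plain C loops tend to win; for many
    features, the ||x||^2 - 2*x.y + ||y||^2 expansion benefits from a
    BLAS-backed np.dot.
    """
    if X.shape[1] < dim_threshold:
        return cdist(X, Y, 'euclidean')
    XX = np.einsum('ij,ij->i', X, X)[:, np.newaxis]
    YY = np.einsum('ij,ij->i', Y, Y)[np.newaxis, :]
    d2 = XX - 2.0 * np.dot(X, Y.T) + YY
    np.maximum(d2, 0.0, out=d2)  # clip tiny negative rounding errors
    return np.sqrt(d2)
```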
… replaced tab-indents with space-indents, pep8
@mblondel: @ALL:

In [273]: %time foo = cdist(np.random.rand(5000, 10), np.random.rand(10000, 10), 'euclidean')

So it seems scipy beats the scikit only in low-dimensional settings. This also holds for the more relevant setting of comparing a small number of cluster centres to a large number of samples (I should use 1e6 samples instead of 1e5 here, but I don't have enough free memory right now):

In [305]: %time foo = cdist(np.random.rand(50, 10), np.random.rand(100000, 10), 'euclidean')

Based on these results, I think it would be best to use the scikits distance routine exclusively and rely on it to perform optimally in all cases (via a switch, as suggested by Gael). I will adapt my code once more and hope somebody will work on scikits.metrics.pairwise :)
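The %time calls above can be reproduced outside IPython with timeit. A scaled-down sketch (the sizes are reduced so it runs quickly, and `dot_based` is only a stand-in for the scikit's BLAS-backed euclidean_distances; actual timings depend on the machine and BLAS):

```python
import timeit
import numpy as np
from scipy.spatial.distance import cdist

def dot_based(X, Y):
    # Stand-in for a BLAS-backed euclidean_distances:
    # ||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2
    d2 = (X * X).sum(1)[:, None] - 2.0 * X.dot(Y.T) + (Y * Y).sum(1)[None, :]
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.RandomState(0)
for n_features in (10, 100):
    X = rng.rand(50, n_features)       # e.g. cluster centers
    Y = rng.rand(5000, n_features)     # e.g. samples
    t_cdist = timeit.timeit(lambda: cdist(X, Y, 'euclidean'), number=20)
    t_dot = timeit.timeit(lambda: dot_based(X, Y), number=20)
    print("n_features=%d: cdist %.4fs, dot-based %.4fs" % (n_features, t_cdist, t_dot))
```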
Thanks for the bench. I am +1 for merging this once
…ns of x_squared_norms wherever possible. Completion and unification of docstrings.
Renamed
Just type:
In the toplevel folder of the source tree.
Thanks, ogrisel. Okay, I think everything works fine.

Old version:

New version:

It's faster and it works, but we will have to see whether it also produces better clusterings on real data.
Is there anything remaining to be done on this branch, or should I look into merging it?
It can be merged. Probably someone should work on using cdist in euclidean_distances in specific cases, but that's a separate issue. (I'm not too interested in it, as I'm creating a GPU version anyway.)
OK, let's merge your excellent modifications first, and we can work on using cdist in euclidean_distances in a separate task. I'll try and do it this afternoon.
Merged. |
As noted in Issue #98, the k-means++ initialization in scikits.learn is based on a widespread Python implementation of k-means++ that is short and simple, but wrong (i.e. it is not actually k-means++).
I have now ported and optimized the original C++ implementation by the authors of the 2007 k-means++ paper. This not only implements k-means++ correctly, it also reduces the computational complexity from O(k * n_samples**2) to O(k * n_samples * numLocalTries).
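For reference, the D²-sampling with local trials described in the 2007 paper can be sketched as follows. This is an illustrative reimplementation, not the ported code; the function and parameter names are mine:

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeanspp_init(X, k, n_local_trials=None, seed=None):
    """Sketch of k-means++ seeding with local trials, after Arthur &
    Vassilvitskii (2007). Illustrative only, not the scikits.learn port.
    """
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    if n_local_trials is None:
        n_local_trials = 2 + int(np.log(k))
    centers = np.empty((k, X.shape[1]))
    # first center: chosen uniformly at random
    centers[0] = X[rng.randint(n_samples)]
    # squared distance of every sample to its closest chosen center
    closest_sq = cdist(centers[:1], X, 'sqeuclidean')[0]
    for c in range(1, k):
        # draw candidates with probability proportional to D^2
        probs = closest_sq / closest_sq.sum()
        candidates = rng.choice(n_samples, size=n_local_trials, p=probs)
        cand_sq = cdist(X[candidates], X, 'sqeuclidean')
        # keep the candidate that lowers the total potential the most
        new_sq = np.minimum(closest_sq, cand_sq)
        best = int(np.argmin(new_sq.sum(axis=1)))
        centers[c] = X[candidates[best]]
        closest_sq = new_sq[best]
    return centers
```

Each of the k rounds costs one cdist call over n_local_trials candidates, which is where the O(k * n_samples * numLocalTries) bound comes from.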
I therefore removed the max_samples parameter -- the initialization is now fast enough even for large data sets (on my system it takes around a minute to choose 64 centers for 1e6 data points). Alternatively, we could just leave it in (with a high default value) for backwards compatibility.
As scikits.learn depends on scipy anyway, I am using scipy.spatial.distances.cdist for distance calculations. It is a lot faster than scikits.learn.metrics.pairwise.euclidean_distances -- it may pay off to make _e_step() use cdist() as well.