
[MRG+1] Elkan k-means #2008

Closed
wants to merge 4 commits

8 participants

@amueller
scikit-learn member

This is an old PR for an accelerated K-Means implementation.
I am not sure why I thought it was slow; it actually looks not too shabby:

benchmarking here:


n_clusters = 3
digits
lloyd time: 0.028653 inertia: 1733032.689910
elkan time: 0.017471 inertia: 1733032.689910
iris
lloyd time: 0.002176 inertia: 78.945066
elkan time: 0.002279 inertia: 78.945066
mnist
lloyd time: 21.133714 inertia: 213604863631.214050
elkan time: 3.947992 inertia: 219624625318.684296
mnist.T
lloyd time: 12.684101 inertia: 223006838798.008301
elkan time: 5.028256 inertia: 229492418949.734283

n_clusters = 10
digits
lloyd time: 0.055593 inertia: 1165142.004619
elkan time: 0.020247 inertia: 1165142.004619
iris
lloyd time: 0.001268 inertia: 26.582819
elkan time: 0.001245 inertia: 26.582819
mnist
lloyd time: 134.436917 inertia: 178462558367.345459
elkan time: 3.798663 inertia: 185425965229.923004
mnist.T
lloyd time: 15.768814 inertia: 179957987722.406860
elkan time: 5.269035 inertia: 176726576301.085938

n_clusters = 100
digits
lloyd time: 0.206861 inertia: 590828.045549
elkan time: 0.051845 inertia: 588733.951315
iris
lloyd time: 0.002030 inertia: 2.720000
elkan time: 0.002651 inertia: 2.766667
mnist
lloyd time: 1502.579576 inertia: 127426420029.702515
elkan time: 17.265350 inertia: 128080224680.242188
mnist.T
lloyd time: 172.475294 inertia: 78912667991.498444
elkan time: 22.524726 inertia: 78459207541.870010

There seem to be some slight differences in the results, caused by slightly different stopping criteria, I think.
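(For context, timings of this shape can be reproduced with a small loop over the datasets. The sketch below is only an illustration: the actual benchmark script is not part of this PR, it assumes a KMeans(algorithm=...) switch like the one this branch adds, and mnist loading is omitted.)

import time
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits, load_iris

datasets = {"digits": load_digits().data, "iris": load_iris().data}

for n_clusters in (3, 10, 100):
    print("n_clusters = %d" % n_clusters)
    for name, X in datasets.items():
        print(name)
        for algorithm in ("full", "elkan"):
            km = KMeans(n_clusters=n_clusters, algorithm=algorithm,
                        n_init=1, random_state=0)
            tic = time.time()
            km.fit(X)
            print("%s time: %f inertia: %f"
                  % (algorithm, time.time() - tic, km.inertia_))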

I need to rebase.

@jnothman jnothman and 1 other commented on an outdated diff Jun 19, 2013
sklearn/cluster/_k_means_elkan.pyx
+ cdef double* X_p = getfloatpointer(X_)
+ cdef double* x_p
+ cdef int n_samples = X_.shape[0]
+ cdef int n_features = X_.shape[1]
+ cdef int point_index, center_index, label
+ cdef float upper_bound, distance
+ cdef double[:, :] center_distances = euclidean_distances(centers_) / 2.
+ cdef double[:, :] lower_bounds = np.zeros((n_samples, n_clusters))
+ cdef double[:] distance_next_center
+ labels_ = np.empty(n_samples, dtype=np.int32)
+ cdef int[:] labels = labels_
+ upper_bounds_ = np.empty(n_samples, dtype=np.float)
+ cdef double[:] upper_bounds = upper_bounds_
+ assign_labels(X_p, centers_p, center_distances, labels, upper_bounds, n_samples, n_features, n_clusters)
+ # make bounds tight for current labels
+ for point_index in range(n_samples):
@jnothman
scikit-learn member
jnothman added a note Jun 19, 2013

My reading of the paper (and my intuition) is that lower bounds should be set for all point-center distances calculated, not just for the initial assignment of each point. (It says to update "each time d(x, c) is computed", not "each time d(x, c(x))".) Setting lower_bounds every time d is called in assign_labels cuts 17% off a benchmark with (samples, feats, clusters) = (500, 10000, 20) and 26% off (1000, 10000, 50). Results are identical.
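In pseudo-Python, the suggested change amounts to recording a bound for every distance the assignment step actually computes, not only the winning one. A sketch, with names following the pyx diff above:

# assignment for one sample with index i; center_distances holds the halved
# center-to-center distances, as in the diff above
d_c = d(x, centers[0])
lower_bounds[i, 0] = d_c
c_x = 0
for j in range(1, n_clusters):
    if d_c > center_distances[c_x, j]:   # center j could still be closer
        dist = d(x, centers[j])
        lower_bounds[i, j] = dist        # record a bound for every computed distance
        if dist < d_c:
            d_c, c_x = dist, j
labels[i] = c_x
upper_bounds[i] = d_c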

@jnothman
scikit-learn member
jnothman added a note Jun 19, 2013

For the former parameters, it's about 50% faster than KMeans on my system, and 20% for the latter. So this might well have been the only problem with your implementation. However, it doesn't seem to provide much speed gain for lower-dimensional feature spaces.

@amueller
scikit-learn member
amueller added a note Jul 24, 2013

Thanks a lot. I believe you for now, as I don't have time to check the paper.
That sounds good. No gain for lower-dim spaces seems ok.
Still not sure if this is worth including in sklearn, though...

@jnothman jnothman and 1 other commented on an outdated diff Jun 19, 2013
sklearn/cluster/tests/test_k_means_elkan.py
@@ -0,0 +1,38 @@
+import numpy as np
+
+from sklearn.metrics import euclidean_distances
+from sklearn.datasets import make_blobs
+from sklearn.cluster.k_means_elkan import assign_labels, k_means_elkan
@jnothman
scikit-learn member
jnothman added a note Jun 19, 2013

This non-underscored module is not currently committed.

@amueller
scikit-learn member
amueller added a note Jul 24, 2013

Huh, let's hope I'll find it on my laptop ;)

@jnothman jnothman commented on the diff Jun 19, 2013
sklearn/cluster/_k_means_elkan.pyx
+ for point_index in range(n_samples):
+ lower_bounds[point_index, labels[point_index]] = upper_bounds[point_index]
+ cdef np.uint8_t[:] bounds_tight = np.ones(n_samples, dtype=np.uint8)
+ cdef np.uint8_t[:] points_to_update = np.zeros(n_samples, dtype=np.uint8)
+ for iteration in range(max_iter):
+ print("start iteration")
+ # we could find the closest center in O(n) but
+ # this does not seem to be the bottleneck
+ distance_next_center = np.sort(center_distances, axis=0)[1]
+ print("done sorting")
+ for point_index in range(n_samples):
+ upper_bound = upper_bounds[point_index]
+ label = labels[point_index]
+ if distance_next_center[label] >= upper_bound:
+ continue
+ x_p = X_p + point_index * n_features
@jnothman
scikit-learn member
jnothman added a note Jun 19, 2013

for a tiny tweak, you could use point_lower_bounds = lower_bounds + point_index * n_clusters. lower_bounds[point_index, ...] is referred to much more often than x_p is!

@amueller
scikit-learn member
amueller added a note Apr 9, 2015

Good point, though that is slightly non-trivial because of my awkward mixture of memory views and numpy arrays :-/
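(A possible workaround, not what this PR does: with a typed memoryview you can still grab a raw pointer to a row via &lower_bounds[point_index, 0], provided the array is C-contiguous, and then index it as point_lower_bounds[j].)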

@MechCoder
scikit-learn member
MechCoder added a note Nov 6, 2015

@jnothman Are you sure that this is worth doing?

@amueller
scikit-learn member
amueller commented Apr 9, 2015

I re-ran MNIST with 100 clusters (the case where elkan is substantially faster) with a stricter tol, to see how well we do when trying to reproduce the lloyd algorithm results.
With tol=1e-6 I get
elkan time: 44.1879 inertia: 127430761035.275131
which is not entirely as good, but seems pretty close.

@amueller
scikit-learn member
amueller commented Apr 9, 2015

It might make sense to use a stricter default tol for elkan. It will still result in much faster execution.

@amueller
scikit-learn member

Setting the stopping tolerance to 1e-8 for elkan, I get

n_clusters = 3
digits
lloyd time: 0.028687 inertia: 1733032.689910
elkan time: 0.0169 inertia: 1733032.689910
iris
lloyd time: 0.002376 inertia: 78.945066
elkan time: 0.0023 inertia: 78.945066
mnist
lloyd time: 16.327123 inertia: 213604863631.214050
elkan time: 6.4335 inertia: 213604830311.203827
mnist.T
lloyd time: 9.836743 inertia: 223006838798.008301
elkan time: 5.2511 inertia: 223006838798.007385

n_clusters = 10
digits
lloyd time: 0.055704 inertia: 1165142.004619
elkan time: 0.0217 inertia: 1165142.004619
iris
lloyd time: 0.001475 inertia: 26.582819
elkan time: 0.0014 inertia: 26.582819
mnist
lloyd time: 124.218960 inertia: 178462558367.345459
elkan time: 19.7944 inertia: 178462556140.571777
mnist.T
lloyd time: 15.635896 inertia: 179957987722.406860
elkan time: 6.2990 inertia: 176498041091.190704

n_clusters = 100
digits
lloyd time: 0.230095 inertia: 590828.045549
elkan time: 0.0549 inertia: 588733.951315
iris
lloyd time: 0.002111 inertia: 2.720000
elkan time: 0.0028 inertia: 2.766667
mnist
lloyd time: 1396.242876 inertia: 127426420029.702515
elkan time: 50.8705 inertia: 127426420029.702454
mnist.T
lloyd time: 162.557854 inertia: 78912667991.498444
elkan time: 38.3189 inertia: 77931928897.879990
@amueller
scikit-learn member

There was an error in the tolerance computation. Fixed now.

@amueller amueller changed the title from WIP Elkan k-means to [MRG] Elkan k-means Apr 10, 2015
@GaelVaroquaux
scikit-learn member
@thedrow
thedrow commented Apr 12, 2015

The doctests seem to fail here.
Maybe you need to rebase?

@amueller
scikit-learn member

Nah, I just need to fix the doctest ;)

@amueller
scikit-learn member

Switched the default to "auto", which I introduced because I didn't implement sparse matrix support. This means a somewhat silly deprecation of "auto" once we do implement sparse support, but I don't think I have the time to do that now.

@amueller
scikit-learn member

fixed the doctests.

@amueller
scikit-learn member

The elkan implementation is unhappy if the number of clusters is 1. I fall back to lloyd in this case now.

@zhaipro
zhaipro commented Apr 22, 2015

algorithm : "auto", "lloyd" or "elkan", default="auto"
K-means algorthm to use. The classical EM-style algorithm is "lloyd".
The "elkan" variation is more efficient by using the triangle
inequality, but currently doesn't support sparse data. "auto" chooses
"elkan" for sparse data and "lloyd" for dense data.

@amueller Isn't this documentation saying the opposite?

@amueller
scikit-learn member

Good catch, it should be the other way around.
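So, combined with the n_clusters == 1 fallback mentioned above, the intended dispatch is roughly the following (a sketch of the logic only; the helper name is made up, and the string values follow this branch's naming, which changes later in the PR):

import scipy.sparse as sp

def _resolve_algorithm(algorithm, X, n_clusters):
    if algorithm == "auto":
        # elkan does not support sparse data, so only use it for dense input
        algorithm = "full" if sp.issparse(X) else "elkan"
    if algorithm == "elkan" and n_clusters == 1:
        # the elkan path is unhappy with a single cluster; fall back to lloyd
        algorithm = "full"
    return algorithm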

@amueller
scikit-learn member

I should benchmark with MKL / Anaconda.

@amueller
scikit-learn member

Using MKL numpy with anaconda:


n_clusters = 3
digits
lloyd time: 0.0205 inertia: 1733032.689910
elkan time: 0.0168 inertia: 1733032.689910
iris
lloyd time: 0.0022 inertia: 78.945066
elkan time: 0.0021 inertia: 78.945066
mnist
lloyd time: 12.6178 inertia: 213604863631.214020
elkan time: 6.8844 inertia: 213604855865.778595
mnist.T
lloyd time: 11.9735 inertia: 223006838798.007874
elkan time: 5.2097 inertia: 223006838798.007385

n_clusters = 10
digits
lloyd time: 0.0200 inertia: 1165142.004619
elkan time: 0.0230 inertia: 1165142.004619
iris
lloyd time: 0.0058 inertia: 26.582819
elkan time: 0.0021 inertia: 26.582819
mnist
lloyd time: 38.1610 inertia: 178462558367.345428
elkan time: 23.6535 inertia: 178462556140.571777
mnist.T
lloyd time: 11.4759 inertia: 176498041091.191345
elkan time: 6.1850 inertia: 176498041091.190704

n_clusters = 100
digits
lloyd time: 0.0606 inertia: 590828.045549
elkan time: 0.0705 inertia: 588733.951315
iris
lloyd time: 0.0036 inertia: 2.728333
elkan time: 0.0041 inertia: 2.766667
mnist
lloyd time: 150.7346 inertia: 127426420029.702469
elkan time: 63.5170 inertia: 127426420029.702454
mnist.T
lloyd time: 27.0587 inertia: 77931928897.880417
elkan time: 17.5252 inertia: 77931928897.879990

Still seems to help, and that is using multi-threading on a dual-core (4 threads with hyperthreading) machine.
The 10x speedup on mnist with n_clusters=100 is crazy ^^

@agramfort
scikit-learn member

@amueller should we think of better names than lloyd or elkan? We tend to avoid naming after authors, and names that say what the algorithm does help users. Elkan -> "lazy" | "inequality" | "triangle_inequality"?

what do others think?

@amueller
scikit-learn member

"full" and "triangle_inequality"?

@GaelVaroquaux
scikit-learn member
@agramfort
scikit-learn member
@amueller
scikit-learn member

@GaelVaroquaux that is my conclusion from the benchmarks above. I'll double check memory usage.

@amueller
scikit-learn member

I don't see why elkan should have more memory usage. There is an array of point-to-center distances, but you need that for the lloyd variant, too. According to memory_profiler, the elkan variant takes slightly less memory on mnist and mnist.T.
The results are a bit odd, though: I get 3MB for mnist with 100 clusters, but 100 * 60000 * 4 bytes is about 24MB.
Using htop I get more realistic estimates, but there doesn't seem to be much difference between the methods.
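(The main extra allocation on the elkan path is the lower_bounds array of shape (n_samples, n_clusters); stored as float64 that is n_samples * n_clusters * 8 bytes, so on the order of tens of MB for mnist with 100 clusters, which fits the htop numbers better than the memory_profiler ones.)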

@agramfort
scikit-learn member
@amueller
scikit-learn member

Argh, forgot to rename. Will do that now. Otherwise this should be ready. There are some minor optimizations still possible, but I think we are already faster and they would take some effort.

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 1, 2015
sklearn/cluster/_k_means_elkan.pyx
+
+
+cdef double *getfloatpointer(np.ndarray[np.float64_t, ndim=2, mode='c'] data):
+ return <double *>(data.data)
+
+
+cdef double d(double* a, double* b, int n_features) nogil:
+ cdef double result, tmp
+ result = 0
+ cdef int i
+ for i in range(n_features):
+ tmp = (a[0] - b[0])
+ a += 1
+ b += 1
+ result += tmp * tmp
+ return sqrt(result)
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

Missing empty line.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jun 1, 2015
sklearn/cluster/_k_means_elkan.pyx
+ return <double *>(data.data)
+
+
+cdef double d(double* a, double* b, int n_features) nogil:
+ cdef double result, tmp
+ result = 0
+ cdef int i
+ for i in range(n_features):
+ tmp = (a[0] - b[0])
+ a += 1
+ b += 1
+ result += tmp * tmp
+ return sqrt(result)
+
+cdef assign_labels(double* X, double* centers, double[:, :]
+ center_distances, int[:] labels, double[:, :] lower_bounds, double[:] distances, int n_samples, int n_features, int n_clusters):
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

Line too long.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

This function modifies its inputs in place, doesn't it? I think the name should make that explicit. I always find it very important to stress these things.

@amueller
scikit-learn member
amueller added a note Jun 1, 2015

Well, it updates labels and distances. What would you call it? update_labels_distances?

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

update_labels_distances_inplace :)

@GaelVaroquaux GaelVaroquaux commented on the diff Jun 1, 2015
sklearn/cluster/_k_means_elkan.pyx
+ x = X + sample * n_features
+ d_c = d(x, centers, n_features)
+ lower_bounds[sample, 0] = d_c
+ for j in range(n_clusters):
+ if d_c > center_distances[c_x, j]:
+ c = centers + j * n_features
+ dist = d(x, c, n_features)
+ lower_bounds[sample, j] = dist
+ if dist < d_c:
+ d_c = dist
+ c_x = j
+ labels[sample] = c_x
+ distances[sample] = d_c
+
+
+def k_means_elkan(X_, int n_clusters, init, float tol=1e-4, int max_iter=30, verbose=False):
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

It seems that this function is meant to be used from outside. Thus I think that it should have a docstring.
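A minimal docstring along those lines might look like this (a sketch; the parameter and return descriptions are inferred from the signature above and from the call site in k_means_.py):

def k_means_elkan(X_, int n_clusters, init, float tol=1e-4, int max_iter=30,
                  verbose=False):
    """Run Elkan's k-means on dense, C-contiguous float64 data X_.

    Parameters
    ----------
    X_ : ndarray of shape (n_samples, n_features)
        The observations to cluster.
    n_clusters : int
        Number of clusters.
    init : ndarray of shape (n_clusters, n_features)
        Initial cluster centers.
    tol : float, default=1e-4
        Convergence tolerance on the center updates.
    max_iter : int, default=30
        Maximum number of iterations.
    verbose : boolean, default=False
        Whether to print progress messages.

    Returns
    -------
    centers : ndarray of shape (n_clusters, n_features)
    labels : ndarray of shape (n_samples,)
    n_iter : int
    """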

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jun 1, 2015
sklearn/cluster/_k_means_elkan.pyx
+ centers_ = init
+ cdef double* centers_p = getfloatpointer(centers_)
+ cdef double* X_p = getfloatpointer(X_)
+ cdef double* x_p
+ cdef int n_samples = X_.shape[0]
+ cdef int n_features = X_.shape[1]
+ cdef int point_index, center_index, label
+ cdef float upper_bound, distance
+ cdef double[:, :] center_distances = euclidean_distances(centers_) / 2.
+ cdef double[:, :] lower_bounds = np.zeros((n_samples, n_clusters))
+ cdef double[:] distance_next_center
+ labels_ = np.empty(n_samples, dtype=np.int32)
+ cdef int[:] labels = labels_
+ upper_bounds_ = np.empty(n_samples, dtype=np.float)
+ cdef double[:] upper_bounds = upper_bounds_
+ assign_labels(X_p, centers_p, center_distances, labels, lower_bounds, upper_bounds, n_samples, n_features, n_clusters)
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

Isn't this line too long?

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jun 1, 2015
sklearn/cluster/k_means_.py
@@ -187,6 +188,12 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
If a callable is passed, it should take arguments X, k and
and a random state and return an initialization.
+ algorithm : "auto", "full" or "triangle_inequality", default="auto"
+ K-means algorthm to use. The classical EM-style algorithm is "full".
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

typo

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jun 1, 2015
sklearn/cluster/k_means_.py
+ X = check_array(X, order="C")
+ random_state = check_random_state(random_state)
+ if x_squared_norms is None:
+ x_squared_norms = row_norms(X, squared=True)
+ # init
+ centers = _init_centroids(X, n_clusters, init, random_state=random_state,
+ x_squared_norms=x_squared_norms)
+ if verbose:
+ print('Initialization complete')
+ centers, labels, n_iter = k_means_elkan(X, n_clusters, centers, tol=tol,
+ max_iter=max_iter, verbose=verbose)
+ inertia = np.sum((X - centers[labels]) ** 2)
+ return labels, inertia, centers, n_iter
+
+
+def _kmeans_single(X, n_clusters, max_iter=300, init='k-means++',
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

Maybe this should be renamed to _kmeans_single_lloyd or _k_means_single_em

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jun 1, 2015
sklearn/cluster/k_means_.py
@@ -656,6 +701,12 @@ class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
If an ndarray is passed, it should be of shape (n_clusters, n_features)
and gives the initial centers.
+ algorithm : "auto", "full" or "triangle_inequality", default="auto"
+ K-means algorthm to use. The classical EM-style algorithm is "full".
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a note Jun 1, 2015

typo: algorithm

@GaelVaroquaux
scikit-learn member

I have finished my review: only minor comments. This is pretty much mergeable!

@GaelVaroquaux GaelVaroquaux changed the title from [MRG] Elkan k-means to [MRG+1] Elkan k-means Jun 1, 2015
@GaelVaroquaux
scikit-learn member

+1 to merge. Hurray, hurray.

Needs another review

@amueller
scikit-learn member
amueller commented Jun 1, 2015

Thanks @GaelVaroquaux. maybe @agramfort or @jnothman are interested.

@jnothman jnothman commented on the diff Jun 2, 2015
sklearn/cluster/k_means_.py
@@ -188,6 +189,12 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
If a callable is passed, it should take arguments X, k and
and a random state and return an initialization.
+ algorithm : "auto", "full" or "triangle_inequality", default="auto"
+ K-means algorithm to use. The classical EM-style algorithm is "full".
+ The "triangle_inequality" variation is more efficient by using the triangle
@jnothman
scikit-learn member
jnothman added a note Jun 2, 2015

Does the Elkan solution trade off any quality?

@amueller
scikit-learn member
amueller added a note Jun 2, 2015

No. It is exact.
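(For reference, the exactness follows from the triangle inequality: for the current assignment c(x) and any other center c, d(x, c) >= d(c(x), c) - d(x, c(x)), so whenever d(c(x), c) >= 2 * d(x, c(x)) the center c cannot be strictly closer than c(x). That is exactly the check against the halved center_distances and the upper bound in the code, so pruned distance computations can never change the assignment.)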

@jnothman
scikit-learn member
jnothman added a note Jun 2, 2015

And its results are identical given the same random state, I presume.

@amueller
scikit-learn member
amueller added a note Jun 2, 2015

I just checked, and unfortunately they are not always identical. I will double check what is happening. If they converge, they converge to the same solution, but if I stop after one iteration, they differ on iris. I'm not sure if this is a different choice of argmax or what happens there.

@amueller
scikit-learn member
amueller commented Jun 2, 2015

I am slightly confused by the following:

import numpy as np
from sklearn.cluster.k_means_ import row_norms, _labels_inertia, KMeans
from sklearn.datasets import load_iris

X = load_iris().data
init = 'k-means++'
n_clusters = 10 

km1 = KMeans(algorithm='full', n_clusters=n_clusters, random_state=0, init=init, n_init=1, max_iter=3, verbose=10).fit(X)

km2 = KMeans(algorithm='triangle_inequality', n_clusters=n_clusters, random_state=0, init=init, n_init=1, max_iter=3, verbose=10).fit(X)

# same results
assert np.all(km1.labels_ == km2.labels_)
assert np.all(km1.cluster_centers_ == km2.cluster_centers_)

# different reported inertia
print (km1.inertia_, km2.inertia_)
# (27.324478976530933, 27.24269537480064)

# hand-compute inertia, same as elkan
print(np.sum((X - km1.cluster_centers_[km1.labels_]) ** 2))

# hand compute inertia using _labels_inertia, gives a third number?
x_squared_norms = row_norms(X, squared=True)
print(_labels_inertia(X, x_squared_norms, km1.cluster_centers_)[1])
# 27.213376399731573
@amueller
scikit-learn member
amueller commented Jun 2, 2015

more interestingly

np.all(km1.predict(X) == km1.labels_)
> False
@amueller
scikit-learn member
amueller commented Jun 2, 2015

That is with the lloyd algorithm. Will check if that was already the case in master.

@vene vene commented on the diff Jun 2, 2015
sklearn/cluster/tests/test_k_means.py
@@ -542,9 +556,11 @@ def test_predict():
def test_score():
- km1 = KMeans(n_clusters=n_clusters, max_iter=1, random_state=42)
+ # TODO better test? This doesn't pass with algorithm="triangle_inequality" as
+ # it converges after one iteration (to the same score that full gets after 10)
@vene
scikit-learn member
vene added a note Jun 2, 2015

Greater or equal would make sense, but that wouldn't be the strongest test in the world.

@vene
scikit-learn member
vene commented Jun 2, 2015

How do the inertia/labels disagreements above compare to the "gold" implementation in the tests?

@zhaipro
zhaipro commented Jun 3, 2015
# different reported inertia
print (km1.inertia_, km2.inertia_)
# (27.324478976530933, 27.24269537480064)

# hand-compute inertia, same as elkan
print(np.sum((X - km1.cluster_centers_[km1.labels_]) ** 2))

About the disagreements above, I guess: in kmeans_single_lloyd, inertia is computed in the E-step, and then the M-step computes the new centers, so the hand-computed results do not agree with km1.inertia_.

@amueller
scikit-learn member
amueller commented Jun 3, 2015

@zhaipro Yeah, I agree. The way inertia is computed is a bit inconsistent at the moment, but I'm not sure we should try to make it more consistent, as it might create some overhead.

@amueller
scikit-learn member
amueller commented Jun 3, 2015

So when running until convergence, the results are the same. When stopping after a certain number of iterations, the results may vary, as lloyd does an additional M-step.
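For anyone hitting this, labels and inertia that are consistent with the final centers can always be recomputed after fit; a usage sketch with the estimator from the snippet above:

labels = km1.predict(X)                                    # labels w.r.t. the final centers
inertia = ((X - km1.cluster_centers_[labels]) ** 2).sum()  # inertia w.r.t. the same centers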

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member
amueller commented Jun 3, 2015

yeah we could do that. That would slightly defeat the purpose of the labels_ attribute, but might be better from a usability standpoint.

@amueller
scikit-learn member
amueller commented Jun 3, 2015

I have to double check if that is enough to get consistent results before convergence between the two algorithms, though.

@MechCoder
scikit-learn member

closed in favour of the other PR

@MechCoder MechCoder closed this Nov 5, 2015