[MRG+1] Adding support for sample weights to K-Means #10933

Merged
merged 12 commits into scikit-learn:master on May 22, 2018

Conversation

@jnhansen
Contributor

jnhansen commented Apr 7, 2018

Reference Issues/PRs

See also #3998.

What does this implement/fix? Explain your changes.

This branch adds support for sample weights to the K-Means algorithm (as well as Minibatch K-Means). This is done by adding the optional parameter sample_weights to KMeans.fit, KMeans.partial_fit, KMeans.predict, KMeans.fit_predict, KMeans.fit_transform, as well as k_means.

Full backwards compatibility of the public methods of the KMeans and MiniBatchKMeans classes is maintained.
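
For illustration, a minimal usage sketch of the new keyword (note: during review the parameter was renamed to the singular sample_weight, which is the merged form shown here):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data; weight each sample by an integer "multiplicity"
X, _ = make_blobs(n_samples=1000, n_features=5, centers=3, random_state=42)
w = np.random.RandomState(42).randint(1, 5, size=X.shape[0]).astype(np.float64)

km = KMeans(n_clusters=3, random_state=42)
km.fit(X, sample_weight=w)  # weighted fit; omit sample_weight for the old behaviour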

Any other comments?

@jnhansen jnhansen changed the title from Adding support for sample weights to K-Means to [MRG] Adding support for sample weights to K-Means Apr 7, 2018

@jnothman

Member

jnothman commented Apr 9, 2018

This is exciting! Before a proper review, can you please benchmark the effect on runtime when passed no weights?

@jnhansen

Contributor

jnhansen commented Apr 9, 2018

Yes, will do.

@TomDLT

Just a minor comment.
I need to do a second pass, but this is already very nice.

sample_weights = _check_sample_weights(X, sample_weights)
# verify that the number of samples is equal to the number of weights
if _num_samples(X) != len(sample_weights):

@TomDLT

TomDLT Apr 9, 2018

Member

You can also use sklearn.utils.validation.check_consistent_length.
Also, this check does not seem to be tested.

@jnhansen

jnhansen Apr 9, 2018

Contributor

I could use sklearn.utils.validation.check_consistent_length, but then the error message would be very generic. Would you still prefer that?

I'll add tests for the check method.

@TomDLT

TomDLT Apr 9, 2018

Member

Right, I have no strong feeling about it, you can keep it this way.

@jnhansen

Contributor

jnhansen commented Apr 9, 2018

Here's a first look at the performance impact.

I have this in benchmark_k_means.py:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42)

On branch weighted_k_means:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.8 msec per loop

On branch master:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 36.4 msec per loop

As expected, adding this feature slows things down very slightly when sample_weights is not used. I'll have a look and see whether I can further improve performance for the unweighted case.

centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
for i in range(n_samples):

@TomDLT

TomDLT Apr 9, 2018

Member

I don't know whether this is slow, since the Cython is compiled, but you could check whether np.add.at is faster:

weights_sum_in_cluster = np.zeros((n_clusters, ), dtype=dtype)
np.add.at(weights_sum_in_cluster, labels, sample_weights)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]

@jnhansen

jnhansen Apr 9, 2018

Contributor

Unfortunately, that is considerably slower:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 57.5 msec per loop
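
For reference, per-cluster accumulation like this can also be expressed with np.bincount, which is typically much faster than np.add.at; a sketch assuming the same labels, sample_weights, n_clusters, and dtype as above (this variant was not benchmarked in the thread):

# weighted count of samples per cluster via bincount
# (labels must be non-negative integers, which k-means labels are)
weights_sum_in_cluster = np.bincount(labels, weights=sample_weights,
                                     minlength=n_clusters).astype(dtype)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]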

jnhansen added some commits Apr 9, 2018

@jnhansen

Contributor

jnhansen commented Apr 10, 2018

Given that the performance drop isn't all that noticeable, I'd probably leave it as is. Any further optimization for the case where sample_weights is None would make the code rather ugly, because it would require a lot of code duplication. Any thoughts?

@TomDLT

Member

TomDLT commented Apr 10, 2018

I did a bit of line profiling on _kmeans_single_elkan.
It seems that the extra multiplication in the inertia computation is the main cause of the ~10% slowdown in your benchmark.
To remove it, maybe you could do something like:

checked_sample_weights = _check_sample_weights(X, sample_weights)
centers, labels, n_iter = k_means_elkan(X, checked_sample_weights,
                                        n_clusters, centers, tol=tol,
                                        max_iter=max_iter, verbose=verbose)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= np.expand_dims(checked_sample_weights, axis=-1)
inertia = np.sum(sq_distances, dtype=np.float64)

I did not look at _kmeans_single_lloyd.

@jnhansen

Contributor

jnhansen commented Apr 10, 2018

Good catch. After implementing your suggestion I cannot make out any statistically significant difference.

For algorithm="full" a ~10% difference remains. I thought it was due to _labels_inertia, but the bulk of the computational effort (~95%) there traces back to sklearn.metrics.pairwise.pairwise_distances_argmin_min, which doesn't even involve the sample weights.

-inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
+sq_distances = (X - centers[labels]) ** 2
+if sample_weights is not None:
+    sq_distances *= checked_sample_weights[:, np.newaxis]

@TomDLT

TomDLT Apr 11, 2018

Member

Actually, on second thought, the operation is expensive because of unnecessary broadcasting.
Instead of n_samples * n_features multiplications, we can do only n_samples multiplications with:

if sample_weights is not None:
    sq_distances = np.sum(sq_distances, axis=1, dtype=np.float64)
    sq_distances *= checked_sample_weights
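
A quick sanity check of that equivalence (illustrative only): summing over features first and multiplying by the weights afterwards gives the same total as the broadcasted product:

import numpy as np
rng = np.random.RandomState(0)
sq = rng.rand(10000, 5)  # squared distances, one row per sample
w = rng.rand(10000)      # sample weights
a = np.sum(sq * w[:, np.newaxis], dtype=np.float64)   # n_samples * n_features multiplications
b = np.sum(np.sum(sq, axis=1, dtype=np.float64) * w)  # only n_samples multiplications
np.testing.assert_allclose(a, b)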

@jnhansen

jnhansen Apr 11, 2018

Contributor

You are right that this makes the computation faster if sample_weights is not None. It obviously doesn't make a difference when no sample weights are passed.

I should clarify that when I said above that "I cannot make out any statistically significant difference" I was referring to the difference between the weighted_k_means and master branches, not before and after the change in _kmeans_single_elkan.

@TomDLT

TomDLT Apr 11, 2018

Member

Sure, but removing the unnecessary broadcasting should speed up the sample_weights is not None case.

@jnhansen

jnhansen Apr 11, 2018

Contributor

Definitely, I'm testing it right now.

@jnhansen

jnhansen Apr 11, 2018

Contributor

Okay, new benchmarks, adding:

w = np.ones(X.shape[0])

On branch weighted_k_means:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop

On branch master:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop

All subject to ~ 1 msec variation.

@jnhansen

Contributor

jnhansen commented Apr 14, 2018

In case it gets lost in the collapsed comments above, here are the current benchmarks again.

In benchmark_k_means.py:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
w = np.ones(X.shape[0])
km = KMeans(n_clusters=3, random_state=42)

On branch weighted_k_means:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop

On branch master:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop

All subject to ~ 1 msec variation between benchmarks.

@jnothman

Nice work! I hope I'm able to review this soon, but the next couple of weeks are full up!

    return centers[sort_index, :], sorted_labels

def test_k_means_weighted_vs_repeated():

@jnothman

jnothman Apr 14, 2018

Member

Can we please parametrize or loop these tests to avoid repeating code for KMeans and MiniBatchKMeans?
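
For instance, a minimal pytest-style sketch of such parametrization (illustrative, not the PR's actual test):

import numpy as np
import pytest
from numpy.testing import assert_array_almost_equal
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_unit_weights_vs_no_weights(Estimator):
    # unit weights should reproduce the unweighted result exactly
    X, _ = make_blobs(n_samples=100, n_features=5, centers=3, random_state=42)
    est = Estimator(n_clusters=3, random_state=42).fit(X)
    est_w = Estimator(n_clusters=3, random_state=42).fit(
        X, sample_weight=np.ones(X.shape[0]))
    assert_array_almost_equal(est.cluster_centers_, est_w.cluster_centers_)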

            est_weighted.cluster_centers_, np.repeat(est_weighted.labels_,
                                                     sample_weights))
    assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
    if not isinstance(estimator, MiniBatchKMeans):

@jnhansen

jnhansen Apr 15, 2018

Contributor

In case you are wondering why this test is not done for MiniBatchKMeans, that's because it fails. That is not due to the changes in this PR, however. I did a quick test on the master branch, comparing the cluster_centers_ of KMeans and MiniBatchKMeans, and they are not the same.

@jnhansen

Contributor

jnhansen commented Apr 16, 2018

I have no idea why the latest commit fails the lgtm check. I only updated the tests and they all check out.

@jnothman

Member

jnothman commented Apr 16, 2018

@jhelie

Contributor

jhelie commented Apr 16, 2018

Hi guys, sorry about this - it seems that this particular base commit couldn't be built due to an outdated pip cache. We're working on fixing the "retry analysis" link to provide a way to overcome this in the future.

@jnhansen

Contributor

jnhansen commented May 14, 2018

Hi guys, has anyone had a chance to review this yet?

@jnothman

Basically cosmetic. Good work!

@@ -34,6 +34,7 @@ np.import_array()
 @cython.wraparound(False)
 @cython.cdivision(True)
 cpdef DOUBLE _assign_labels_array(np.ndarray[floating, ndim=2] X,
+                                  np.ndarray[floating, ndim=1] sample_weights,

@jnothman

jnothman May 14, 2018

Member

Please use the singular sample_weight for local and global consistency.

@jnhansen

jnhansen May 14, 2018

Contributor

So just to confirm, sample_weight instead of sample_weights throughout? I had been consistently using the plural everywhere.

@jnothman

jnothman May 14, 2018

Member

Yes, I think so. That is consistent with the rest of the library.

@jnhansen

jnhansen May 14, 2018

Contributor

Alright, will do. Although personally, the singular makes me think that a scalar is requested.

@jnothman

jnothman May 14, 2018

Member

It's quite common to use singular nomenclature for vectors... though we are very inconsistent!

@@ -167,9 +172,10 @@ cpdef DOUBLE _assign_labels_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
 @cython.boundscheck(False)
 @cython.wraparound(False)
 @cython.cdivision(True)
-def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
+def _mini_batch_update_csr(X, np.ndarray[floating, ndim=1] sample_weights,

@jnothman

jnothman May 14, 2018

Member

(argh! Plurals! Do what you like here, I suppose)

dtype = np.float32 if floating is float else np.float64
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)

@jnothman

jnothman May 14, 2018

Member

weight_in_cluster would be a sufficient name

@@ -271,6 +277,7 @@ def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
 @cython.wraparound(False)
 @cython.cdivision(True)
 def _centers_dense(np.ndarray[floating, ndim=2] X,
+                   np.ndarray[floating, ndim=1] sample_weights,

@jnothman

jnothman May 14, 2018

Member

In general, though, we should prefer the singular.

@@ -103,7 +103,9 @@ cdef update_labels_distances_inplace(
         upper_bounds[sample] = d_c
-def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_, int n_clusters,
+def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_,
+                  np.ndarray[floating, ndim=1, mode='c'] sample_weights,

@jnothman

jnothman May 14, 2018

Member

sample_weight

        return np.ones(n_samples, dtype=X.dtype)
    else:
        # verify that the number of samples is equal to the number of weights
        if n_samples != len(sample_weights):

@jnothman

jnothman May 14, 2018

Member

We have a helper called check_consistent_length.

@jnhansen

jnhansen May 14, 2018

Contributor

Tom made the same remark earlier, and I replied that the error message would then be very generic. I'd prefer the more specific error message, but I'm happy to change it.
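
For comparison, the generic helper would be used like this; it raises a correct but unspecific message (wording as in sklearn.utils.validation, roughly "Found input variables with inconsistent numbers of samples: [...]"):

from sklearn.utils.validation import check_consistent_length
# raises a generic ValueError that doesn't mention sample_weight by name
check_consistent_length(X, sample_weight)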

            precompute_distances=self.precompute_distances,
            tol=self.tol, random_state=random_state, copy_x=self.copy_x,
            n_jobs=self.n_jobs, algorithm=self.algorithm,
            return_n_iter=True)
        return self

-    def fit_predict(self, X, y=None):
+    def fit_predict(self, X, y=None, sample_weights=None):

@jnothman

jnothman May 14, 2018

Member

sample_weight

        est_1.cluster_centers_, est_1.labels_)
    centers_2, labels_2 = _sort_cluster_centers_and_labels(
        est_2.cluster_centers_, est_2.labels_)
    assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)

@jnothman

jnothman May 14, 2018

Member

You should be able to get a perfect v_measure even without sorting. The sorting makes for a red herring when reading the code.

@jnhansen

jnhansen May 15, 2018

Contributor

You're right; the label sorting predates my discovery of v_measure_score and is now redundant.
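
Indeed, v_measure_score is invariant under relabeling of the clusters, so no sorting is needed; for example:

from numpy.testing import assert_almost_equal
from sklearn.metrics import v_measure_score
# a pure relabeling of the clusters still scores a perfect 1.0
assert_almost_equal(v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]), 1.0)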

sample_weights = None
checked_sample_weights = _check_sample_weights(X, sample_weights)
assert_equal(_num_samples(X), _num_samples(checked_sample_weights))
assert_equal(X.dtype, checked_sample_weights.dtype)

@jnothman

jnothman May 14, 2018

Member

If you're doing these checks, why is there no check that the output sums to 1?

# repetition of the sample
sample_weights = np.random.randint(1, 5, size=n_samples)
X_repeat = np.repeat(X, sample_weights, axis=0)
for estimator in [KMeans(n_clusters=n_clusters, random_state=42),

@jnothman

jnothman May 14, 2018

Member

Need to test the different init approaches as well
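
Something along these lines might cover that (a sketch only; parameter name as in the merged API):

import numpy as np
import pytest
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

@pytest.mark.parametrize("init", ["k-means++", "random"])
def test_weighted_kmeans_inits(init):
    # smoke test: a weighted fit should work for each init strategy
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
    w = np.random.RandomState(42).randint(1, 5, size=X.shape[0])
    KMeans(n_clusters=3, init=init, n_init=3, random_state=42).fit(
        X, sample_weight=w)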

    # verify that the number of samples is equal to the number of weights
    if n_samples != len(sample_weight):
        raise ValueError("n_samples=%d should be == len(sample_weight)=%d"
                         % (n_samples, len(sample_weight)))

@jnhansen

jnhansen May 15, 2018

Contributor

I didn't change this to check_consistent_length, because that would result in a very generic error message. Let me know if you'd still like me to change it.

@jnothman

LGTM!

@jnothman jnothman changed the title from [MRG] Adding support for sample weights to K-Means to [MRG+1] Adding support for sample weights to K-Means May 15, 2018

@jnhansen

Contributor

jnhansen commented May 16, 2018

@TomDLT

Member

TomDLT commented May 16, 2018

Nice work! I'm a bit busy this week; I'll take a closer look next week.

@TomDLT

TomDLT approved these changes May 22, 2018

LGTM

@TomDLT TomDLT merged commit 4b24fbe into scikit-learn:master May 22, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 99% of diff hit (target 95.09%)
codecov/project 95.11% (+0.02%) compared to bfad4da
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
lgtm analysis: Python No alert changes
@TomDLT

Member

TomDLT commented May 22, 2018

Oops, I forgot the whats_new.
Could you add an entry to doc/whats_new/v0.20.rst in a new PR?

@jnhansen

Contributor

jnhansen commented May 22, 2018

Sure, I'll do that later today.
