
[MRG + 1] KMeans optimisation for array C/F contiguity #10471

Merged: 20 commits merged from gxyd:k-means into scikit-learn:master on Feb 8, 2018

4 participants
gxyd (Contributor) commented Jan 14, 2018

Reference Issues/PRs

Fixes #4582

What does this implement/fix? Explain your changes.

Use order='C' internally in check_array for KMeans optimization.

Any other comments?

I don't think this will require any tests. Does it require benchmarking, apart from what was described on the issue itself?
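The effect of the change can be sketched with plain NumPy: `np.ascontiguousarray` is essentially what `order='C'` in `check_array` triggers on non-C input (the variable names here are illustrative, not from the PR):

```python
import numpy as np

# Sketch of what order='C' does to an F-ordered input:
# it forces a copy into row-major (C) layout.
X_f = np.asfortranarray(np.random.rand(100, 10))  # F-contiguous input
X_c = np.ascontiguousarray(X_f)                   # copy into C order

assert X_f.flags['F_CONTIGUOUS'] and not X_f.flags['C_CONTIGUOUS']
assert X_c.flags['C_CONTIGUOUS']
assert X_c is not X_f  # the conversion made a memory copy
```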

@@ -393,7 +396,7 @@ def _kmeans_single_elkan(X, n_clusters, max_iter=300, init='k-means++',
                          precompute_distances=True):
     if sp.issparse(X):
         raise ValueError("algorithm='elkan' not supported for sparse input X")
-    X = check_array(X, order="C")
+    X = check_array(X, order='C')

glemaitre (Contributor), Jan 14, 2018:

unnecessary change

@@ -170,6 +170,9 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
algorithm="auto", return_n_iter=False):
"""K-means clustering algorithm.
Warning: The data will be converted to C ordering, which might cause a

glemaitre (Contributor), Jan 14, 2018:

I would avoid issuing a warning here. You can add a line of explanation under X, similar to the explanation regarding sparse CSR matrices.

Also, a conversion will definitely make a memory copy, so you should remove "might" :)

@@ -716,6 +719,9 @@ class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
Read more in the :ref:`User Guide <k_means>`.
Warning: The data will be converted to C ordering, which might cause a

glemaitre (Contributor), Jan 14, 2018:

I would avoid issuing a warning here. You can add a line of explanation under X, similar to the explanation regarding sparse CSR matrices.

@@ -1225,6 +1233,9 @@ class MiniBatchKMeans(KMeans):
Read more in the :ref:`User Guide <mini_batch_kmeans>`.
Warning: The data will be converted to C ordering, which might cause a

glemaitre (Contributor), Jan 14, 2018:

I would avoid issuing a warning here. You can add a line of explanation under X, similar to the explanation regarding sparse CSR matrices.

glemaitre commented Jan 14, 2018

I would rerun a small benchmark to ensure that everything is right (you can reuse the one from the issue).

Please add an entry to the change log at doc/whats_new/v0.20.rst under bug fixes. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

jnothman commented Jan 14, 2018

I would put this under enhancements rather than bug fixes.

gxyd added some commits Jan 14, 2018

glemaitre commented Jan 14, 2018

LGTM

@glemaitre glemaitre changed the title [MRG] KMeans optimisation for array C/F contiguity [MRG + 1] KMeans optimisation for array C/F contiguity Jan 14, 2018

gxyd commented Jan 14, 2018

Using the benchmarking script mentioned on the issue:

import numpy as np
from sklearn.cluster import KMeans
from time import time
from itertools import product

def bench_kmeans(name, data, alg, n_clusters):
    start = time()
    km = KMeans(algorithm=alg, n_clusters=n_clusters, init='random', n_init=1, max_iter=1,
                copy_x=False, random_state=42, precompute_distances=False).fit(data)
    print("%d\t%s\t%s\t%f\t%f" % (n_clusters, alg, name, time() - start, km.inertia_))


def test_kmeans(data):
    c_data = data
    f_data = np.asfortranarray(data)
    print('n_clusters\talg\tname\ttime\tinertia:')
    for n_clusters, alg, (name, data) in product(
            (50, 100, 200),
            ('auto', 'full', 'elkan'),
            zip(('C order', 'F order'), (c_data, f_data))):
        bench_kmeans(name=name, data=data, alg=alg, n_clusters=n_clusters)

np.random.seed(0)
data = np.random.random(3000*1000).reshape((3000, 1000))
print('shape:', data.shape)
test_kmeans(data)
data = data.reshape((1000, 3000))
print('shape:', data.shape)
test_kmeans(data)

On the master branch:

('shape:', (3000, 1000))
n_clusters	alg	name	time	inertia:
50	auto	C order	0.403544	244211.650794
50	auto	F order	0.424452	244211.650794
50	full	C order	0.334684	244211.650794
50	full	F order	2.815877	244211.650794
50	elkan	C order	0.363026	244211.650794
50	elkan	F order	0.427056	244211.650794
100	auto	C order	0.729616	239215.757518
100	auto	F order	0.777269	239215.757518
100	full	C order	0.643146	239215.757518
100	full	F order	5.498488	239215.757518
100	elkan	C order	0.685999	239215.757518
100	elkan	F order	0.751356	239215.757518
200	auto	C order	1.355203	229656.843120
200	auto	F order	1.538969	229656.843120
200	full	C order	1.284042	229656.843120
200	full	F order	10.949957	229656.843120
200	elkan	C order	1.349571	229656.843120
200	elkan	F order	1.382669	229656.843120
('shape:', (1000, 3000))
n_clusters	alg	name	time	inertia:
50	auto	C order	0.384128	236247.135296
50	auto	F order	0.468056	236247.135296
50	full	C order	0.343625	236247.135296
50	full	F order	2.672272	236247.135296
50	elkan	C order	0.417920	236247.135296
50	elkan	F order	0.437526	236247.135296
100	auto	C order	0.739238	222766.273428
100	auto	F order	0.796313	222766.273428
100	full	C order	0.647263	222766.273428
100	full	F order	5.291439	222766.273428
100	elkan	C order	0.739860	222766.273428
100	elkan	F order	0.785315	222766.273428
200	auto	C order	1.605269	196655.300444
200	auto	F order	1.668683	196655.300444
200	full	C order	1.353986	196655.300444
200	full	F order	10.681836	196655.300444
200	elkan	C order	1.617238	196655.300444
200	elkan	F order	1.641563	196655.300444

On this branch (the k-means branch):

('shape:', (3000, 1000))
n_clusters	alg	name	time	inertia:
50	auto	C order	0.364319	244211.650794
50	auto	F order	0.391694	244211.650794
50	full	C order	0.338562	244211.650794
50	full	F order	0.362544	244211.650794
50	elkan	C order	0.379989	244211.650794
50	elkan	F order	0.372378	244211.650794
100	auto	C order	0.652812	239215.757518
100	auto	F order	0.659883	239215.757518
100	full	C order	0.605441	239215.757518
100	full	F order	0.612673	239215.757518
100	elkan	C order	0.651492	239215.757518
100	elkan	F order	0.655293	239215.757518
200	auto	C order	1.326843	229656.843120
200	auto	F order	1.343027	229656.843120
200	full	C order	1.233928	229656.843120
200	full	F order	1.192939	229656.843120
200	elkan	C order	1.284317	229656.843120
200	elkan	F order	1.297300	229656.843120
('shape:', (1000, 3000))
n_clusters	alg	name	time	inertia:
50	auto	C order	0.356359	236247.135296
50	auto	F order	0.377453	236247.135296
50	full	C order	0.325774	236247.135296
50	full	F order	0.351796	236247.135296
50	elkan	C order	0.379598	236247.135296
50	elkan	F order	0.382943	236247.135296
100	auto	C order	0.702238	222766.273428
100	auto	F order	0.787363	222766.273428
100	full	C order	0.629510	222766.273428
100	full	F order	0.675536	222766.273428
100	elkan	C order	0.703230	222766.273428
100	elkan	F order	0.701331	222766.273428
200	auto	C order	1.582612	196655.300444
200	auto	F order	1.626843	196655.300444
200	full	C order	1.321152	196655.300444
200	full	F order	1.341884	196655.300444
200	elkan	C order	1.559763	196655.300444
200	elkan	F order	1.558133	196655.300444
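The large slowdown for F-ordered input under `algorithm='full'` in the tables above is a memory-locality effect: a row (one sample) of a C-ordered array is a contiguous slice, while in F order it is strided. A small NumPy check illustrating this (not part of the PR):

```python
import numpy as np

X_c = np.random.rand(2000, 500)
X_f = np.asfortranarray(X_c)

# Each sample X[i] is a contiguous slice in C order ...
assert X_c[0].flags['C_CONTIGUOUS']

# ... but a strided view in F order: consecutive features of one sample
# sit n_samples * itemsize bytes apart, so per-sample distance loops
# touch scattered memory.
assert not X_f[0].flags['C_CONTIGUOUS']
assert X_f[0].strides[0] == X_f.shape[0] * X_f.itemsize
```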
jnothman (Member) posted two comments on Jan 14, 2018; both have been minimized.

gxyd added some commits Jan 15, 2018

gxyd (Author) posted two comments on Jan 15, 2018; both have been minimized.

        if X.shape[0] < self.n_clusters:
            raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
                X.shape[0], self.n_clusters))
        return X

    def _check_test_data(self, X):
        X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES)
        if self.algorithm == "full":

jnothman (Member), Jan 15, 2018:

This is not relevant at test time!

        if self.algorithm == "full":
            order = "C"
        else:
            order = None

jnothman (Member), Jan 15, 2018:

_kmeans_elkan coerces the data to C order anyway, so this turns out to be redundant and weird.

jnothman commented Jan 15, 2018

You've caught the redundancy as well. So always use order='C'.

Several further comments from gxyd and jnothman on Jan 15, 2018 have been minimized.

@@ -175,7 +175,10 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
     Parameters
     ----------
     X : array-like or sparse matrix, shape (n_samples, n_features)
-        The observations to cluster.
+        The observations to cluster. It must be noted that, the data
+        will be converted to C ordering if the classical EM-style algorithm

glemaitre (Contributor), Jan 15, 2018:

I would avoid giving details regarding the algorithm:

... C ordering when ``algorithm="full"``

glemaitre (Contributor), Jan 15, 2018:

Hmm, in the end we always use C ordering, so there is no need for this specificity.

gxyd (Author), Jan 15, 2018:

Ah, I forgot to change this back.

-        Training instances to cluster.
+        Training instances to cluster. It must be noted that, the data
+        will be converted to C ordering if the classical EM-style
+        algorithm i.e `algorithm="full"` is used, which will cause a memory

glemaitre (Contributor), Jan 15, 2018:

I would avoid giving details regarding the algorithm:

... C ordering when ``algorithm="full"``

glemaitre (Contributor), Jan 15, 2018:

Hmm, in the end we always use C ordering, so there is no need for this specificity.

@@ -391,8 +394,6 @@ def _kmeans_single_elkan(X, n_clusters, max_iter=300, init='k-means++',
                          verbose=False, x_squared_norms=None,
                          random_state=None, tol=1e-4,
                          precompute_distances=True):
    if sp.issparse(X):
        raise ValueError("algorithm='elkan' not supported for sparse input X")
    X = check_array(X, order="C")

glemaitre (Contributor), Jan 15, 2018:

Could you mention specifically:

X = check_array(X, accept_sparse=False, order="C")

jnothman (Member), Jan 16, 2018:

You can drop this check_array altogether. It is a waste of time, unless I'm much mistaken.

And I think it would be fine to change the ValueError to a more appropriate TypeError; but the message is more helpful.

@glemaitre glemaitre changed the title [MRG + 1] KMeans optimisation for array C/F contiguity [MRG] KMeans optimisation for array C/F contiguity Jan 15, 2018

gxyd commented Jan 15, 2018

You've caught the redundancy as well.

Well, actually I just happen to point to that line of code :)

@glemaitre glemaitre changed the title [MRG] KMeans optimisation for array C/F contiguity [MRG + 1] KMeans optimisation for array C/F contiguity Jan 15, 2018

glemaitre commented Jan 15, 2018

LGTM

gxyd commented Jan 23, 2018

checked that they didn't think this would break many users by: moving the len(X)<n_clusters validation from the class to the function

I know you meant _num_samples(y) (instead of len(y)). From a machine-learning-theory point of view, does that make sense? I mean, dividing into more clusters than the number of sample points?

jnothman (Member) commented on Jan 23, 2018; the comment has been minimized.

gxyd commented Jan 31, 2018

@agramfort any opinions on this one?

gxyd (Author) commented on Feb 6, 2018; the comment has been minimized.

@@ -175,7 +176,9 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
     Parameters
     ----------
     X : array-like or sparse matrix, shape (n_samples, n_features)
-        The observations to cluster.
+        The observations to cluster. It must be noted that, the data

glemaitre (Contributor), Feb 6, 2018:

that, the -> that the

@@ -885,13 +886,14 @@ def fit(self, X, y=None):
        Parameters
        ----------
        X : array-like or sparse matrix, shape=(n_samples, n_features)
-           Training instances to cluster.
+           Training instances to cluster. It must be noted that, the data

glemaitre (Contributor), Feb 6, 2018:

that, the -> that the

                    order=order, copy=copy_x)
    # verify that the number of samples given is larger than k
    if _num_samples(X) < n_clusters:
        raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

glemaitre (Contributor), Feb 6, 2018:

I see that this is covered, but I don't see by which test, since you did not add a specific one. Can you point out which test is raising this?

glemaitre (Contributor), Feb 7, 2018:

Could you make the change such that we check the message properly?
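A message check could be sketched like this (a pytest-free stand-in; _check_n_clusters is a hypothetical helper mirroring the validation shown in the diff above, not the real k_means function):

```python
def _check_n_clusters(n_samples, n_clusters):
    # stand-in for the n_samples >= n_clusters validation in k_means
    if n_samples < n_clusters:
        raise ValueError("n_samples=%d should be >= n_clusters=%d"
                         % (n_samples, n_clusters))

try:
    _check_n_clusters(2, 3)
except ValueError as exc:
    # checking the message properly, as requested
    assert str(exc) == "n_samples=2 should be >= n_clusters=3"
else:
    raise AssertionError("expected ValueError")
```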

glemaitre commented Feb 7, 2018

Apart from my last comment, still LGTM.

ogrisel commented Feb 8, 2018

moving the len(X)<n_clusters validation from the class to the function

I think this is fine.

ogrisel (Member) approved these changes Feb 8, 2018 and left a comment:

Overall I am fine with the change. Please find some small comments below:

@@ -1522,7 +1525,7 @@ def partial_fit(self, X, y=None):
        """

-        X = check_array(X, accept_sparse="csr")
+        X = check_array(X, accept_sparse="csr", order="C")

ogrisel (Member), Feb 8, 2018:

I think the docstring of partial_fit should be updated to mention that X will be copied if not C-contiguous.

-        Training instances to cluster.
+        Training instances to cluster. It must be noted that the data
+        will be converted to C ordering, which will cause a memory
+        copy if the given data is in fortran order.

ogrisel (Member), Feb 8, 2018:

"copy if the given data is in fortran order" => "copy if the given data is not C-contiguous".

It could be Fortran-contiguous or not contiguous at all.
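Indeed, an array can be neither C- nor F-contiguous, for example a strided view; a quick illustration (not from the PR):

```python
import numpy as np

X = np.random.rand(100, 10)[::2]  # every other row: a strided view
assert not X.flags['C_CONTIGUOUS'] and not X.flags['F_CONTIGUOUS']

# converting to C order makes a copy in this case as well
X_c = np.ascontiguousarray(X)
assert X_c.flags['C_CONTIGUOUS']
```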

gxyd (Author), Feb 8, 2018:

Ah I missed this. Good point to remember.

@@ -280,7 +283,14 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
        raise ValueError('Number of iterations should be a positive number,'
                         ' got %d instead' % max_iter)

    X = as_float_array(X, copy=copy_x)
    # avoid forcing order when copy_x=False
    order = "C" if copy_x else None

ogrisel (Member), Feb 8, 2018:

The docstring of copy_x should be updated, both in the estimator class and in the function, to explain that setting copy_x to False will not ensure that the data is C-contiguous, possibly causing a significant slowdown.
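The copy_x/order interaction can be sketched with plain NumPy stand-ins for check_array (illustrative only):

```python
import numpy as np

X_f = np.asfortranarray(np.random.rand(50, 4))

# copy_x=False -> order=None: keep the caller's buffer, even if it is
# F-ordered (no hidden copy, but possibly slower clustering)
X_keep = np.asarray(X_f)
assert X_keep is X_f

# copy_x=True -> order="C": force the fast row-major layout at the
# price of a memory copy
X_fast = np.ascontiguousarray(X_f)
assert X_fast.flags['C_CONTIGUOUS'] and X_fast is not X_f
```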

@@ -1519,7 +1523,8 @@ def partial_fit(self, X, y=None):
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
-            Coordinates of the data points to cluster.
+            Coordinates of the data points to cluster. It must be noted that
+            'X' will be copied if it is not C-contiguous.

glemaitre (Contributor), Feb 8, 2018:

'X' -> don't use apostrophe or double backquote.

modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
the data first. If copy_x is True (defualt), then the original data is

glemaitre (Contributor), Feb 8, 2018:

default

the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
the data first. If copy_x is True (defualt), then the original data is
not modified, ensuring X is C-contiguous. If False, the original data

glemaitre (Contributor), Feb 8, 2018:

Double space

gxyd (Author), Feb 8, 2018:

For consistency I think we can keep the double space here. I heard that @jnothman was using that somewhere and some talk of lawyer came up, he might have been the first to introduce that in the first place. Or Emacs.

glemaitre (Contributor), Feb 8, 2018:

The first consistency rule should be the grammar rule :). Emacs adds those when compacting several paragraphs together. Anyway, since it was there before, I'll let it go.

modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
the data first. If copy_x is True (defualt), then the original data is

glemaitre (Contributor), Feb 8, 2018:

double white space

modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
the data first. If copy_x is True (defualt), then the original data is

glemaitre (Contributor), Feb 8, 2018:

default

@@ -178,7 +178,7 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
     X : array-like or sparse matrix, shape (n_samples, n_features)
         The observations to cluster. It must be noted that the data
         will be converted to C ordering, which will cause a memory copy
-        if the given data is in fortran order.
+        if the given data is in not C-contiguous.

glemaitre (Contributor), Feb 8, 2018:

if the given data are not C-contiguous.

@@ -888,7 +892,7 @@ def fit(self, X, y=None):
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Training instances to cluster. It must be noted that the data
            will be converted to C ordering, which will cause a memory
-           copy if the given data is in fortran order.
+           copy if the given data is not C-contiguous.

glemaitre (Contributor), Feb 8, 2018:

if the given data are not ...

@@ -1354,7 +1358,7 @@ def fit(self, X, y=None):
        X : array-like or sparse matrix, shape=(n_samples, n_features)
            Training instances to cluster. It must be noted that the data
            will be converted to C ordering, which will cause a memory copy
-           if the given data is in fortran order.
+           if the given data is in not C-contiguous.

glemaitre (Contributor), Feb 8, 2018:

given data are not ...

-        The observations to cluster.
+        The observations to cluster. It must be noted that the data
+        will be converted to C ordering, which will cause a memory copy
+        if the given data is in not C-contiguous.

glemaitre (Contributor), Feb 8, 2018:

data are not ...

-        Training instances to cluster.
+        Training instances to cluster. It must be noted that the data
+        will be converted to C ordering, which will cause a memory copy
+        if the given data is in not C-contiguous.

glemaitre (Contributor), Feb 8, 2018:

given data are not C-contiguous

glemaitre commented Feb 8, 2018

Last comments, and I think we will be good to go.

gxyd (Author) commented on Feb 8, 2018; the comment has been minimized.

glemaitre commented Feb 8, 2018

Could you make a last pass over your changes? "data" will always be plural, and I see that there are still some places where it is singular.

ogrisel commented Feb 8, 2018

I disagree, in common modern usage, data is treated as a mass noun, similar to a word like information or money. Therefore the use of singular is fine. If you want to use the plural I would argue in favor of one of the following expressions "pieces of data" or "samples", "observations", "records"...

glemaitre commented Feb 8, 2018

I disagree, in common modern usage, data is treated as a mass noun, similar to a word like information. Therefore the use of singular is fine. If you want to use the plural I would argue in favor of one of the following expressions "pieces of data" or "samples", "observations", "records"...

Fair enough. This is possibly not the time to get into this controversy ;)

@gxyd could you revert my previous remark and put back the singular for the plural occurrences, to be consistent.

ogrisel commented Feb 8, 2018

(py36) ogrisel@ici:~/code/scikit-learn$ git grep "data are " | wc -l
46
(py36) ogrisel@ici:~/code/scikit-learn$ git grep "data is " | wc -l
250

@gxyd gxyd force-pushed the gxyd:k-means branch from a85234c to a9c914c Feb 8, 2018

gxyd commented Feb 8, 2018

I hard-reset the last commit and undid a few changes.

@glemaitre glemaitre merged commit 0390b28 into scikit-learn:master Feb 8, 2018

0 of 5 checks passed at merge time (CircleCI python2/python3, AppVeyor, Travis CI, and LGTM analysis were still running).
glemaitre commented Feb 8, 2018

@gxyd Thanks!!!

@gxyd gxyd deleted the gxyd:k-means branch Feb 8, 2018

Comments from gxyd and jnothman on Feb 8, 2018 have been minimized.

cperales added a commit to cperales/scikit-learn that referenced this pull request Feb 13, 2018
