
[MRG + 1] KMeans optimisation for array C/F contiguity #10471

Merged Feb 8, 2018 · 20 commits · changes from 17 commits · +51 −30
@@ -125,6 +125,13 @@ Classifiers and regressors
only require X to be an object with finite length or shape.
:issue:`9832` by :user:`Vrishank Bhardwaj <vrishank97>`.

Cluster

- :class:`cluster.KMeans`, :class:`cluster.MiniBatchKMeans` and

@jnothman (Member), Jan 23, 2018:

I think you should say that with algorithm='full', row-major ordering will be forced, improving runtime.

:func:`cluster.k_means` called with ``algorithm='full'`` now enforces
row-major ordering, improving runtime.
:issue:`10471` by :user:`Gaurav Dhingra <gxyd>`.

Preprocessing

- :class:`preprocessing.PolynomialFeatures` now supports sparse input.
@@ -22,6 +22,7 @@
from ..utils.extmath import row_norms, squared_norm, stable_cumsum
from ..utils.sparsefuncs_fast import assign_rows_csr
from ..utils.sparsefuncs import mean_variance_axis
from ..utils.validation import _num_samples
from ..utils import check_array
from ..utils import check_random_state
from ..utils import as_float_array
@@ -175,7 +176,9 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
Parameters
----------
X : array-like or sparse matrix, shape (n_samples, n_features)
The observations to cluster.
The observations to cluster. It must be noted that the data
will be converted to C ordering, which will cause a memory copy
if the given data is not C-contiguous.

@glemaitre (Contributor), Feb 8, 2018:

if the given data are not C-contiguous.

@glemaitre (Contributor), Feb 8, 2018:

data are not ...

n_clusters : int
The number of clusters to form as well as the number of
@@ -230,10 +233,12 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
copy_x : boolean, optional
When pre-computing distances it is more numerically accurate to center
the data first. If copy_x is True, then the original data is not
modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
the data first. If copy_x is True (default), then the original data is

@glemaitre (Contributor), Feb 8, 2018:

double white space

@glemaitre (Contributor), Feb 8, 2018:

default

not modified, ensuring X is C-contiguous. If False, the original data
is modified, and put back before the function returns, but small
numerical differences may be introduced by subtracting and then adding
the data mean. In this case it will also not ensure that the data is
C-contiguous, which may cause a significant slowdown.
n_jobs : int
The number of jobs to use for the computation. This works by computing
@@ -280,7 +285,14 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
raise ValueError('Number of iterations should be a positive number,'
' got %d instead' % max_iter)

X = as_float_array(X, copy=copy_x)
# avoid forcing order when copy_x=False
order = "C" if copy_x else None

@ogrisel (Member), Feb 8, 2018:

The docstring of copy_x should be updated both in the estimator class and in the function to explain that setting copy_x to False will not ensure that the data is C-contiguous possibly causing a significant slowdown.

X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
order=order, copy=copy_x)
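The effect of the ``order`` argument can be sketched with plain NumPy (a sketch only, using ``np.array`` as a stand-in for scikit-learn's ``check_array``): forcing ``"C"`` order copies F-contiguous input, while passing no order leaves the array untouched.

```python
import numpy as np

# Sketch of the copy_x / order interaction. Forcing C (row-major) order
# copies F-ordered input; order=None would leave the layout as-is.
X_f = np.asfortranarray(np.arange(6, dtype=np.float64).reshape(2, 3))

copy_x = True
order = "C" if copy_x else None
X_checked = np.array(X_f, order=order, copy=copy_x)

print(X_f.flags["C_CONTIGUOUS"])        # False: input is F-ordered
print(X_checked.flags["C_CONTIGUOUS"])  # True: a row-major copy was made
```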
# verify that the number of samples given is larger than k
if _num_samples(X) < n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

@glemaitre (Contributor), Feb 6, 2018:

I see that this is covered, but I don't see from which test, since you did not add a specific test.
Can you point out which test is raising this?

@glemaitre (Contributor), Feb 7, 2018:

Could you make the change so that we check the message properly?

_num_samples(X), n_clusters))
tol = _tolerance(X, tol)

# If the distances are precomputed every job will create a matrix of shape
@@ -392,8 +404,7 @@ def _kmeans_single_elkan(X, n_clusters, max_iter=300, init='k-means++',
random_state=None, tol=1e-4,
precompute_distances=True):
if sp.issparse(X):
raise ValueError("algorithm='elkan' not supported for sparse input X")

@jnothman (Member), Jan 15, 2018:

Sorry for confusing things, but I think this error message is better, being more specific and helpful to the user. So please keep this and, as you suggested, remove the check_array.

X = check_array(X, order="C")
raise TypeError("algorithm='elkan' not supported for sparse input X")
random_state = check_random_state(random_state)
if x_squared_norms is None:
x_squared_norms = row_norms(X, squared=True)
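``row_norms(X, squared=True)`` precomputes the per-sample squared norms that k-means reuses across iterations. For dense input it is roughly the following einsum (a sketch only, not scikit-learn's actual implementation, which also handles sparse X):

```python
import numpy as np

# Dense-only sketch of row_norms(X, squared=True): one squared L2 norm
# per row, without materializing an intermediate X * X array.
X = np.array([[3.0, 4.0],
              [1.0, 2.0]])
x_squared_norms = np.einsum("ij,ij->i", X, X)
print(x_squared_norms)  # [25.  5.]
```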
@@ -768,12 +779,14 @@ class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.
copy_x : boolean, default True
copy_x : boolean, optional
When pre-computing distances it is more numerically accurate to center
the data first. If copy_x is True, then the original data is not
modified. If False, the original data is modified, and put back before
the function returns, but small numerical differences may be introduced
by subtracting and then adding the data mean.
the data first. If copy_x is True (default), then the original data is

@glemaitre (Contributor), Feb 8, 2018:

default

not modified, ensuring X is C-contiguous. If False, the original data

@glemaitre (Contributor), Feb 8, 2018:

Double space

@gxyd (Author, Contributor), Feb 8, 2018:

For consistency I think we can keep the double space here. I heard that @jnothman was using that somewhere, and some talk of a lawyer came up; he might have been the first to introduce it. Or Emacs.

@glemaitre (Contributor), Feb 8, 2018:

The first consistency rule should be the grammar rule :). Emacs adds those when compacting several paragraphs together. Anyway, since it was there before, I'll let it go.

is modified, and put back before the function returns, but small
numerical differences may be introduced by subtracting and then adding
the data mean. In this case it will also not ensure that the data is
C-contiguous, which may cause a significant slowdown.
n_jobs : int
The number of jobs to use for the computation. This works by computing
@@ -860,14 +873,6 @@ def __init__(self, n_clusters=8, init='k-means++', n_init=10,
self.n_jobs = n_jobs
self.algorithm = algorithm

def _check_fit_data(self, X):
"""Verify that the number of samples given is larger than k"""
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
if X.shape[0] < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
X.shape[0], self.n_clusters))
return X

def _check_test_data(self, X):
X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES)
n_samples, n_features = X.shape
@@ -885,13 +890,14 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that the data
will be converted to C ordering, which will cause a memory
copy if the given data is not C-contiguous.

@glemaitre (Contributor), Feb 8, 2018:

if the given data are not ...

y : Ignored
"""
random_state = check_random_state(self.random_state)
X = self._check_fit_data(X)

self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \
k_means(
@@ -944,7 +950,6 @@ def fit_transform(self, X, y=None):
# np.array or CSR format already.
# XXX This skips _check_test_data, which may change the dtype;
# we should refactor the input validation.
X = self._check_fit_data(X)
return self.fit(X)._transform(X)

def transform(self, X):
@@ -1351,7 +1356,9 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that the data
will be converted to C ordering, which will cause a memory copy
if the given data is not C-contiguous.

@glemaitre (Contributor), Feb 8, 2018:

given data are not ...

@glemaitre (Contributor), Feb 8, 2018:

given data are not C-contiguous

y : Ignored
@@ -1361,8 +1368,8 @@ def fit(self, X, y=None):
dtype=[np.float64, np.float32])
n_samples, n_features = X.shape
if n_samples < self.n_clusters:
raise ValueError("Number of samples smaller than number "
"of clusters.")
raise ValueError("n_samples=%d should be >= n_clusters=%d"
% (n_samples, self.n_clusters))

n_init = self.n_init
if hasattr(self.init, '__array__'):
@@ -1516,13 +1523,14 @@ def partial_fit(self, X, y=None):
Parameters
----------
X : array-like, shape = [n_samples, n_features]
Coordinates of the data points to cluster.
Coordinates of the data points to cluster. It must be noted that
'X' will be copied if it is not C-contiguous.

@glemaitre (Contributor), Feb 8, 2018:

'X' -> don't use apostrophe or double backquote.

y : Ignored
"""

X = check_array(X, accept_sparse="csr")
X = check_array(X, accept_sparse="csr", order="C")

@ogrisel (Member), Feb 8, 2018:

I think the docstring of partial_fit should be updated to mention that X will be copied if not C-contiguous.

n_samples, n_features = X.shape
if hasattr(self.init, '__array__'):
self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
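``np.ascontiguousarray`` copies only when needed, which is the source of the "X will be copied if it is not C-contiguous" note requested for the partial_fit docstring. A small illustration, independent of scikit-learn:

```python
import numpy as np

# np.ascontiguousarray returns the input unchanged when it is already
# C-ordered, and makes a row-major copy otherwise.
X_c = np.zeros((2, 3))
X_f = np.asfortranarray(X_c)

print(np.shares_memory(np.ascontiguousarray(X_c), X_c))  # True: no copy
print(np.shares_memory(np.ascontiguousarray(X_f), X_f))  # False: copied
```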
@@ -169,7 +169,8 @@ def _check_fitted_model(km):
assert_greater(km.inertia_, 0.0)

# check error on dataset being too small
assert_raises(ValueError, km.fit, [[0., 1.]])
assert_raise_message(ValueError, "n_samples=1 should be >= n_clusters=%d"
% km.n_clusters, km.fit, [[0., 1.]])


def test_k_means_plus_plus_init():
@@ -750,6 +751,11 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)

# kmeans for algorithm='elkan' raises TypeError on sparse matrix
assert_raise_message(TypeError, "algorithm='elkan' not supported for "
"sparse input X", k_means, X=X_csr, n_clusters=2,
algorithm="elkan")
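The stricter message format the tests now assert can be reproduced with a plain guard (a hypothetical standalone check for illustration, not scikit-learn's code):

```python
# Hypothetical standalone version of the n_samples >= n_clusters guard,
# matching the error message format asserted by the updated tests.
def check_enough_samples(n_samples, n_clusters):
    if n_samples < n_clusters:
        raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
            n_samples, n_clusters))

try:
    check_enough_samples(1, 8)
except ValueError as exc:
    message = str(exc)
print(message)  # n_samples=1 should be >= n_clusters=8
```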


def test_x_squared_norms_init_centroids():
"""Test that x_squared_norms can be None in _init_centroids"""