Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] KMeans optimisation for array C/F contiguity #10471

Merged
merged 20 commits into from Feb 8, 2018
Merged
Changes from 14 commits
Commits
File filter...
Filter file types
Jump to…
Jump to file or symbol
Failed to load files and symbols.
+32 −17
Diff settings

Always

Just for now

Copy path View file
@@ -125,6 +125,13 @@ Classifiers and regressors
only require X to be an object with finite length or shape.
:issue:`9832` by :user:`Vrishank Bhardwaj <vrishank97>`.

Cluster

- :class:`cluster.KMeans`, :class:`cluster.MiniBatchKMeans` and

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 23, 2018

Member

I think you should say that with algorithm='full', row-major ordering will be forced, improving runtime.

:func:`cluster.k_means` passed with ``algorithm='full'`` now enforces
row-major ordering, improving runtime.
:issue:`10471` by :user:`Gaurav Dhingra <gxyd>`.

Preprocessing

- :class:`preprocessing.PolynomialFeatures` now supports sparse input.
Copy path View file
@@ -22,6 +22,7 @@
from ..utils.extmath import row_norms, squared_norm, stable_cumsum
from ..utils.sparsefuncs_fast import assign_rows_csr
from ..utils.sparsefuncs import mean_variance_axis
from ..utils.validation import _num_samples
from ..utils import check_array
from ..utils import check_random_state
from ..utils import as_float_array
@@ -175,7 +176,9 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
Parameters
----------
X : array-like or sparse matrix, shape (n_samples, n_features)
The observations to cluster.
The observations to cluster. It must be noted that, the data

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

that, the -> that the

will be converted to C ordering, which will cause a memory copy
if the given data is in fortran order.
n_clusters : int
The number of clusters to form as well as the number of
@@ -280,7 +283,14 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
raise ValueError('Number of iterations should be a positive number,'
' got %d instead' % max_iter)

X = as_float_array(X, copy=copy_x)
# avoid forcing order when copy_x=False
order = "C" if copy_x else None

This comment has been minimized.

Copy link
@ogrisel

ogrisel Feb 8, 2018

Member

The docstring of copy_x should be updated both in the estimator class and in the function to explain that setting copy_x to False will not ensure that the data is C-contiguous possibly causing a significant slowdown.

X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
order=order, copy=copy_x)
# verify that the number of samples given is larger than k
if _num_samples(X) < n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

I see that this is cover but I don't see from which test since that you did not add a specific test.
Can you point out which test is raising this?

This comment has been minimized.

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 7, 2018

Contributor

Could make the change such that we check the message properly.

_num_samples(X), n_clusters))
tol = _tolerance(X, tol)

# If the distances are precomputed every job will create a matrix of shape
@@ -392,8 +402,7 @@ def _kmeans_single_elkan(X, n_clusters, max_iter=300, init='k-means++',
random_state=None, tol=1e-4,
precompute_distances=True):
if sp.issparse(X):
raise ValueError("algorithm='elkan' not supported for sparse input X")

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 15, 2018

Member

Sorry for confusing things, but I think this error message is better, being more specific and helpful to the user. So please keep this and, as you suggested, remove the check_array

X = check_array(X, order="C")
raise TypeError("algorithm='elkan' not supported for sparse input X")
random_state = check_random_state(random_state)
if x_squared_norms is None:
x_squared_norms = row_norms(X, squared=True)
@@ -860,14 +869,6 @@ def __init__(self, n_clusters=8, init='k-means++', n_init=10,
self.n_jobs = n_jobs
self.algorithm = algorithm

def _check_fit_data(self, X):
"""Verify that the number of samples given is larger than k"""
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
if X.shape[0] < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
X.shape[0], self.n_clusters))
return X

def _check_test_data(self, X):
X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES)
n_samples, n_features = X.shape
@@ -885,13 +886,14 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that, the data

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

that, the -> that the

will be converted to C ordering, which will cause a memory
copy if the given data is in fortran order.

This comment has been minimized.

Copy link
@ogrisel

ogrisel Feb 8, 2018

Member

"copy if the given data is in fortran order" => "copy if the given data is not C-contiguous".

It could be Fortran-contiguous or no contiguous at all.

This comment has been minimized.

Copy link
@gxyd

gxyd Feb 8, 2018

Author Contributor

Ah I missed this. Good point to remember.

y : Ignored
"""
random_state = check_random_state(self.random_state)
X = self._check_fit_data(X)

self.cluster_centers_, self.labels_, self.inertia_, self.n_iter_ = \
k_means(
@@ -944,7 +946,6 @@ def fit_transform(self, X, y=None):
# np.array or CSR format already.
# XXX This skips _check_test_data, which may change the dtype;
# we should refactor the input validation.
X = self._check_fit_data(X)
return self.fit(X)._transform(X)

def transform(self, X):
@@ -1351,7 +1352,9 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that, the data
will be converted to C ordering, which will cause a memory copy
if the given data is in fortran order.
y : Ignored
@@ -1522,7 +1525,7 @@ def partial_fit(self, X, y=None):
"""

X = check_array(X, accept_sparse="csr")
X = check_array(X, accept_sparse="csr", order="C")

This comment has been minimized.

Copy link
@ogrisel

ogrisel Feb 8, 2018

Member

I think the docstring of partial_fit should be updated to mention that X will be copied if not C-contiguous.

n_samples, n_features = X.shape
if hasattr(self.init, '__array__'):
self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
@@ -750,6 +750,11 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)

# kmeans for algorithm='elkan' raises TypeError on sparse matrix
assert_raise_message(TypeError, "algorithm='elkan' not supported for "
"sparse input X", k_means, X=X_csr, n_clusters=2,
algorithm="elkan")


def test_x_squared_norms_init_centroids():
"""Test that x_squared_norms can be None in _init_centroids"""
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.