Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] KMeans optimisation for array C/F contiguity #10471

Merged
merged 20 commits into from Feb 8, 2018
Merged
Changes from 7 commits
Commits
File filter...
Filter file types
Jump to…
Jump to file or symbol
Failed to load files and symbols.
+28 −8
Diff settings

Always

Just for now

@@ -125,6 +125,13 @@ Classifiers and regressors
only require X to be an object with finite length or shape.
:issue:`9832` by :user:`Vrishank Bhardwaj <vrishank97>`.

Cluster

- :class:`cluster.KMeans`, :class:`cluster.MiniBatchKMeans` and

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 23, 2018

Member

I think you should say that with algorithm='full', row-major ordering will be forced, improving runtime.

:func:`cluster.k_means` are now optimized using row-major ordering for data
copying.
:issue:`10471` by :user:`Gaurav Dhingra <gxyd>`.

Preprocessing

- :class:`preprocessing.PolynomialFeatures` now supports sparse input.
@@ -175,7 +175,10 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
Parameters
----------
X : array-like or sparse matrix, shape (n_samples, n_features)
The observations to cluster.
The observations to cluster. It must be noted that, the data

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

that, the -> that the

will be converted to C ordering if the classical EM-style algorithm

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

I would avoid to give details regarding the algorithm:

... C ordering when ``algorithm="full"``

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

Uhm at the end we always use C-ordered so no need for this specificity.

This comment has been minimized.

Copy link
@gxyd

gxyd Jan 15, 2018

Author Contributor

Ah, I forgot to change this back.

i.e `algorithm="full"` is used, which will cause a memory copy
if the given data is in fortran order.
n_clusters : int
The number of clusters to form as well as the number of
@@ -299,7 +302,7 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',

# Validate init array
if hasattr(init, '__array__'):
init = check_array(init, dtype=X.dtype.type, copy=True)
init = check_array(init, dtype=X.dtype.type, copy=True, order="C")

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 16, 2018

Member

Do we need to be setting the order of init, or only X?

_validate_center_shape(X, n_clusters, init)

if n_init != 1:
@@ -391,8 +394,6 @@ def _kmeans_single_elkan(X, n_clusters, max_iter=300, init='k-means++',
verbose=False, x_squared_norms=None,
random_state=None, tol=1e-4,
precompute_distances=True):
if sp.issparse(X):
raise ValueError("algorithm='elkan' not supported for sparse input X")

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 15, 2018

Member

Sorry for confusing things, but I think this error message is better, being more specific and helpful to the user. So please keep this and, as you suggested, remove the check_array

X = check_array(X, order="C")

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

Could you mentioned specifically:

X = check_array(X, accept_sparse=False, order="C")

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 16, 2018

Member

You can drop this check_array altogether. It is a waste of time, unless I'm much mistaken.

And I think it would be fine to change the ValueError to a more appropriate TypeError; but the message is more helpful.

random_state = check_random_state(random_state)
if x_squared_norms is None:
@@ -862,7 +863,8 @@ def __init__(self, n_clusters=8, init='k-means++', n_init=10,

def _check_fit_data(self, X):
"""Verify that the number of samples given is larger than k"""
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
order="C")
if X.shape[0] < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
X.shape[0], self.n_clusters))
@@ -885,7 +887,10 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that, the data

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

that, the -> that the

will be converted to C ordering if the classical EM-style
algorithm i.e `algorithm="full"` is used, which will cause a memory

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

I would avoid to give details regarding the algorithm:

... C ordering when ``algorithm="full"``

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

Uhm at the end we always use C-ordered so no need for this specificity.

copy if the given data is in fortran order.

This comment has been minimized.

Copy link
@ogrisel

ogrisel Feb 8, 2018

Member

"copy if the given data is in fortran order" => "copy if the given data is not C-contiguous".

It could be Fortran-contiguous or no contiguous at all.

This comment has been minimized.

Copy link
@gxyd

gxyd Feb 8, 2018

Author Contributor

Ah I missed this. Good point to remember.

y : Ignored
@@ -1351,7 +1356,9 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that, the data
will be converted to C ordering, which will cause a memory copy
if the given data is in fortran order.
y : Ignored
@@ -1522,7 +1529,7 @@ def partial_fit(self, X, y=None):
"""

X = check_array(X, accept_sparse="csr")
X = check_array(X, accept_sparse="csr", order="C")

This comment has been minimized.

Copy link
@ogrisel

ogrisel Feb 8, 2018

Member

I think the docstring of partial_fit should be updated to mention that X will be copied if not C-contiguous.

n_samples, n_features = X.shape
if hasattr(self.init, '__array__'):
self.init = np.ascontiguousarray(self.init, dtype=X.dtype)
@@ -750,6 +750,12 @@ def test_k_means_function():
# to many clusters desired
assert_raises(ValueError, k_means, X, n_clusters=X.shape[0] + 1)

# kmeans for algorithm='elkan' raises standard error on sparse matrix
assert_raise_message(TypeError, "A sparse matrix was passed, but dense "
"data is required. Use X.toarray() to convert to a "
"dense numpy array.", k_means, X=X_csr, n_clusters=2,
algorithm="elkan")


def test_x_squared_norms_init_centroids():
"""Test that x_squared_norms can be None in _init_centroids"""
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.