Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 1] KMeans optimisation for array C/F contiguity #10471

Merged
merged 20 commits into from Feb 8, 2018
Merged
Changes from 6 commits
Commits
File filter...
Filter file types
Jump to…
Jump to file or symbol
Failed to load files and symbols.
+36 −7
Diff settings

Always

Just for now

Copy path View file
@@ -125,6 +125,13 @@ Classifiers and regressors
only require X to be an object with finite length or shape.
:issue:`9832` by :user:`Vrishank Bhardwaj <vrishank97>`.

Cluster

- :class:`cluster.KMeans`, :class:`cluster.MiniBatchKMeans` and

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 23, 2018

Member

I think you should say that with algorithm='full', row-major ordering will be forced, improving runtime.

:func:`cluster.k_means` are now optimized using row-major ordering for data
copying when classical EM-style algorithm is used.
:issue:`10471` by :user:`Gaurav Dhingra <gxyd>`.

Preprocessing

- :class:`preprocessing.PolynomialFeatures` now supports sparse input.
Copy path View file
@@ -175,7 +175,10 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
Parameters
----------
X : array-like or sparse matrix, shape (n_samples, n_features)
The observations to cluster.
The observations to cluster. It must be noted that, the data

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

that, the -> that the

will be converted to C ordering if the classical EM-style algorithm

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

I would avoid to give details regarding the algorithm:

... C ordering when ``algorithm="full"``

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

Uhm at the end we always use C-ordered so no need for this specificity.

This comment has been minimized.

Copy link
@gxyd

gxyd Jan 15, 2018

Author Contributor

Ah, I forgot to change this back.

i.e `algorithm="full"` is used, which will cause a memory copy
if the given data is in fortran order.
n_clusters : int
The number of clusters to form as well as the number of
@@ -299,7 +302,11 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',

# Validate init array
if hasattr(init, '__array__'):
init = check_array(init, dtype=X.dtype.type, copy=True)
if algorithm == "full":
order = "C"
else:
order = None
init = check_array(init, dtype=X.dtype.type, copy=True, order=order)
_validate_center_shape(X, n_clusters, init)

if n_init != 1:
@@ -862,14 +869,24 @@ def __init__(self, n_clusters=8, init='k-means++', n_init=10,

def _check_fit_data(self, X):
"""Verify that the number of samples given is larger than k"""
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
if self.algorithm == "full":
order = "C"
else:
order = None

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 15, 2018

Member

_kmeans_elkan coerces the data to C order anyway, so this turns out to be redundant and weird.

X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
order=order)
if X.shape[0] < self.n_clusters:
raise ValueError("n_samples=%d should be >= n_clusters=%d" % (
X.shape[0], self.n_clusters))
return X

def _check_test_data(self, X):
X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES)
if self.algorithm == "full":

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 15, 2018

Member

This is not relevant at test time!

order = "C"
else:
order = None
X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES,
order=order)
n_samples, n_features = X.shape
expected_n_features = self.cluster_centers_.shape[1]
if not n_features == expected_n_features:
@@ -885,7 +902,10 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that, the data

This comment has been minimized.

Copy link
@glemaitre

glemaitre Feb 6, 2018

Contributor

that, the -> that the

will be converted to C ordering if the classical EM-style
algorithm i.e `algorithm="full"` is used, which will cause a memory

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

I would avoid to give details regarding the algorithm:

... C ordering when ``algorithm="full"``

This comment has been minimized.

Copy link
@glemaitre

glemaitre Jan 15, 2018

Contributor

Uhm at the end we always use C-ordered so no need for this specificity.

copy if the given data is in fortran order.

This comment has been minimized.

Copy link
@ogrisel

ogrisel Feb 8, 2018

Member

"copy if the given data is in fortran order" => "copy if the given data is not C-contiguous".

It could be Fortran-contiguous or no contiguous at all.

This comment has been minimized.

Copy link
@gxyd

gxyd Feb 8, 2018

Author Contributor

Ah I missed this. Good point to remember.

y : Ignored
@@ -1351,13 +1371,15 @@ def fit(self, X, y=None):
Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster.
Training instances to cluster. It must be noted that, the data
will be converted to C ordering, which will cause a memory copy
if the given data is in fortran order.
y : Ignored
"""
random_state = check_random_state(self.random_state)
X = check_array(X, accept_sparse="csr", order='C',
X = check_array(X, accept_sparse="csr",
dtype=[np.float64, np.float32])
n_samples, n_features = X.shape
if n_samples < self.n_clusters:
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.