[MRG+1] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types #6846

Merged: 17 commits merged into scikit-learn:master on Jun 21, 2016

Conversation

5 participants
yenchenlin (Contributor) commented May 31, 2016

This is a follow-up PR for #6430, which uses fused types in Cython so that np.float32 inputs can be handled without wasting memory by converting them internally to np.float64.

Since the sparse function utilities now support Cython fused types, we can avoid the zig-zag memory usage pattern shown in #6430.

Updated on 6/19

  • Dense np.float32 data
    • Before this PR: [memory profile figure: master]
    • After this PR: [memory profile figure: fused_types]
  • Sparse np.float32 data
    • Before this PR: [memory profile figure: master_sparse_32]
    • After this PR: [memory profile figure: fused_types_sparse_32]

Here is the test script used to generate the figures above (thanks @ssaeger):

import numpy as np
from scipy import sparse as sp
from sklearn.cluster import KMeans

# @profile is injected by memory_profiler, e.g. when run via `mprof run script.py`
@profile
def fit_est():
    estimator.fit(X)

np.random.seed(5)
X = np.random.rand(200000, 20)
X = np.float32(X)

# uncomment to benchmark the sparse code path
# X = sp.csr_matrix(X)

estimator = KMeans()
fit_est()
yenchenlin (Contributor) commented May 31, 2016

It seems that we are facing precision issues again ...

jnothman (Member) commented May 31, 2016

Some imprecision in inertia may be tolerable. However, having not looked this over in detail, it seems as if you've changed all the DOUBLEs into floatings. If changing some of these back to DOUBLEs will only increase memory usage in O(n) or O(m) but not O(mn) (n=samples, m=features) I would think it acceptable if it did not also substantially increase runtime. Do you think this is worth investigating?

yenchenlin (Contributor) commented Jun 1, 2016

Hello @jnothman, if I understand you correctly, you mean that changing some of the variables back to DOUBLE may only increase memory usage by O(n) or O(m) without sacrificing precision, right?

jnothman (Member) commented Jun 1, 2016

Yes.

yenchenlin (Contributor) commented Jun 1, 2016

I get it, will do.

However, it's a little bit weird that CI only fails for precision reasons in Python 2.6 & Python 2.7.
Are there any substantial differences in precision between Python 2 and Python 3?

jnothman (Member) commented Jun 1, 2016

Are there any substantial differences in precision between Python2 and Python3?

I recall differences in handling of large ints, but don't recall if there was anything widely publicised about float formats. Feel free to investigate a little for your own knowledge, but it's not essential to getting the job done here.

jnothman (Member) commented Jun 2, 2016

Please ping when you've looked at potential changes, whether or not we can make any beneficial changes.

yenchenlin (Contributor) commented Jun 4, 2016

Hello @jnothman and @MechCoder,

I've fixed the test precision issue.

The only difference I made here is to change the dtype when computing inertia, i.e., in np.sum.

Since inertia is the sum of the distances from points to their centers, using a np.float64 accumulator keeps the precision for np.float32 data.
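
A minimal sketch of that idea (illustrative variable names, not the exact code in k_means_.py): the float32 squared distances are summed with a float64 accumulator, so the extra precision costs no extra O(mn) memory.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(100000, 20).astype(np.float32)
    centers = X[:8].copy()
    labels = rng.randint(8, size=X.shape[0])

    # float32 intermediate, float32 accumulator
    inertia32 = np.sum((X - centers[labels]) ** 2)
    # same float32 intermediate, but accumulated in float64
    inertia64 = np.sum((X - centers[labels]) ** 2, dtype=np.float64)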

yenchenlin changed the title from [WIP] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types to [MRG] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types Jun 4, 2016

jnothman (Member) commented Jun 4, 2016

It's a bit unfortunate that it involves O(mn) conversions to float64, especially since it's done in every iteration. Can you get an estimate of runtime cost, relative to the lower precision version?

jnothman (Member) commented Jun 4, 2016

So cumsum(..., dtype=np.float64) is our best option, perhaps?

sklearn/cluster/k_means_.py
- centers[center_idx] /= counts[center_idx]
+ # Note: numpy >= 1.10 does not support '/=' for the following
+ # expression for a mixture of int and float (see numpy issue #6464)
+ centers[center_idx] = centers[center_idx]/counts[center_idx]

jnothman (Member) commented Jun 4, 2016

space around / please

sklearn/cluster/k_means_.py
@@ -1282,15 +1290,20 @@ def fit(self, X, y=None):
Coordinates of the data points to cluster
"""
random_state = check_random_state(self.random_state)
- X = check_array(X, accept_sparse="csr", order='C', dtype=np.float64)
+ # to handle sparse data which only works as float64 at the moment

jnothman (Member) commented Jun 4, 2016

Aren't we trying to change this?

ogrisel (Member) commented Jun 4, 2016

If this is not trivial to do, I am ok with keeping the sparse case for a second PR.

MechCoder (Member) commented Jun 4, 2016

It is doable. We have refactored all the sparse utils to support float32 data as well. I suggest we do it over here.

ogrisel (Member) commented Jun 4, 2016

Alright then. Does this PR need a rebase?

jnothman (Member) commented Jun 4, 2016

Could you please refactor those tests into one? I think the code duplication doesn't give much value.

- centers = init
+ # ensure that the centers have the same dtype as X
+ # this is a requirement of fused types of cython
+ centers = np.array(init, dtype=X.dtype)

ogrisel (Member) commented Jun 4, 2016

This will also trigger a copy of init even if init is already an array with the right dtype, while this was not the case before.

I think it is good to systematically copy, as it is probably a bad idea to silently mutate the user-provided input array. I think you should add a test to check this behavior.
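
For illustration (a toy sketch, not the PR's code): np.array copies by default even when the dtype already matches, which is what guarantees the user's init stays untouched.

    import numpy as np

    init = np.zeros((3, 2), dtype=np.float32)
    centers = np.array(init, dtype=init.dtype)  # copy=True is the numpy default
    print(centers is init)                      # False: a fresh array
    centers[0, 0] = 42.0
    print(init[0, 0])                           # 0.0: the user's init is unchanged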

jnothman (Member) commented Jun 17, 2016

Has this been addressed?

MechCoder (Member) commented Jun 17, 2016

There is a block of code under the function k_means that starts with if hasattr(init, '__array__') that also explicitly converts init to higher precision.

That should also be fixed, and we should test the attributes cluster_centers_ and inertia_ for precision when init is an ndarray ...

yenchenlin (Contributor) commented Jun 17, 2016

I've added a test to make sure that we do copy init here.

@MechCoder, since the function _init_centroids will always be called after the code block you mentioned and thus makes sure centers has the same dtype as the input, I think maybe we don't need to touch that part of the code?

MechCoder (Member) commented Jun 18, 2016

Ouch, that means a copy is already triggered before _init_centroids is called, right?

Correct me if I'm wrong, but in other words, if X is of dtype np.float32, there are 2 copies made of init:

  1. One in kmeans which makes init of dtype np.float64
  2. And then one in _init_centroids that converts it back to np.float32

yenchenlin (Contributor) commented Jun 18, 2016

You are right!

But I think it is reasonable to copy the data twice because in step 1,
init = check_array(init, dtype=np.float64, copy=True) is used to make sure KMeans.init is a copy of the array provided by the user.

And in step 2, centers = np.array(init, dtype=X.dtype)
is used to make sure the centers we compute won't alter its argument init.

yenchenlin (Contributor) commented Jun 18, 2016

Yeah, but I still changed this line to init = check_array(init, dtype=X.dtype.type, copy=True) since it is more consistent to keep it as the same datatype as X, and it also does no harm to precision! 😃

sklearn/cluster/k_means_.py
+ if sp.issparse(X):
+ X = check_array(X, accept_sparse='csr', dtype=np.float64)
+ else:
+ X = check_array(X, dtype=None)

ogrisel (Member) commented Jun 4, 2016

I think you should do:

X = check_array(X, dtype=[np.float64, np.float32])

to make it possible to pass integer values to fit and have them automatically converted to np.float64. Please add a test for this case as well.
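
A small sketch of what that dtype list does (assuming the current check_array semantics): a dtype already in the list is passed through unchanged, anything else is converted to the first entry.

    import numpy as np
    from sklearn.utils import check_array

    X32 = np.ones((3, 2), dtype=np.float32)
    Xint = np.ones((3, 2), dtype=np.int64)

    print(check_array(X32, dtype=[np.float64, np.float32]).dtype)   # float32, untouched
    print(check_array(Xint, dtype=[np.float64, np.float32]).dtype)  # float64, converted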

sklearn/cluster/k_means_.py
+ X = check_array(X, accept_sparse="csr", order='C',
+ dtype=np.float64)
+ else:
+ X = check_array(X, accept_sparse="csr", order='C')

ogrisel (Member) commented Jun 4, 2016

accept_sparse is useless on this side of the if branch. Instead you should put dtype=[np.float64, np.float32] to make it possible to convert integers to np.float64 by default while letting np.float32 pass through unchanged.

sklearn/cluster/tests/test_k_means.py
@@ -771,4 +774,122 @@ def test_x_squared_norms_init_centroids():
def test_max_iter_error():
km = KMeans(max_iter=-1)
- assert_raise_message(ValueError, 'Number of iterations should be', km.fit, X)
+ assert_raise_message(ValueError,
+ 'Number of iterations should be', km.fit, X)

ogrisel (Member) commented Jun 4, 2016

Cosmetics: I would have found the following more natural to read:

    assert_raise_message(ValueError, 'Number of iterations should be',
                         km.fit, X)

sklearn/cluster/tests/test_k_means.py
+ # compare arrays with low precision since the difference between
+ # 32 and 64 bit sometimes makes a difference up to the 4th decimal place
+ assert_array_almost_equal(inertia32, inertia64, decimal=4)
+ assert_array_almost_equal(X_new32, X_new64, decimal=4)

ogrisel (Member) commented Jun 4, 2016

Can you also please add a check that km.cluster_centers_ in 32 bit is almost equal to its 64 bit counterpart?

ogrisel (Member) commented Jun 4, 2016

Great work @yenchenlin. I added some further comments to address on top of @jnothman's. Looking forward to merging this once they are addressed.

ogrisel added this to the 0.18 milestone Jun 4, 2016

yenchenlin (Contributor) commented Jun 4, 2016

Thanks a lot for your comments, guys!
I've started addressing them.

jnothman (Member) commented Jun 14, 2016

Please avoid amending commits and force-pushing your changes. It's much easier to review changes incrementally, especially weeks after the previous change, if the commits show precisely what has changed. There are other reasons too, but commits with clear messages were designed that way for a reason.

yenchenlin (Contributor) commented Jun 14, 2016

I see, sorry!
Shall I just add more commits while developing and squash them after the core developers agree to merge?

jnothman (Member) commented Jun 14, 2016

Yes.

doc/whats_new.rst
@@ -144,6 +144,12 @@ Enhancements
- The :func: `ignore_warnings` now accept a category argument to ignore only
the warnings of a specified type. By `Thierry Guillemot`_.
+ - :class:`cluster.KMeans` and :class:`cluster.MiniBatchKMeans` now works

jnothman (Member) commented Jun 16, 2016

You need to rebase and manage a conflict here.

sklearn/cluster/k_means_.py
@@ -1282,15 +1289,19 @@ def fit(self, X, y=None):
Coordinates of the data points to cluster
"""
random_state = check_random_state(self.random_state)
- X = check_array(X, accept_sparse="csr", order='C', dtype=np.float64)
+ if sp.issparse(X):

jnothman (Member) commented Jun 16, 2016

Remind me why we need this conditional, i.e. why check_array(X, accept_sparse="csr", order='C', dtype=[np.float64, np.float32]) is insufficient?

yenchenlin (Contributor) commented Jun 17, 2016

It's my fault, thanks for pointing it out!

sklearn/cluster/k_means_.py
@@ -1282,15 +1289,19 @@ def fit(self, X, y=None):
Coordinates of the data points to cluster
"""
random_state = check_random_state(self.random_state)
- X = check_array(X, accept_sparse="csr", order='C', dtype=np.float64)
+ if sp.issparse(X):
+ X = check_array(X, accept_sparse="csr", order='C',

jnothman (Member) commented Jun 16, 2016

order='C' is irrelevant in the sparse-only case

sklearn/cluster/tests/test_k_means.py
+ km.fit, X)
+
+
+def test_kmeans_float32_64():

jnothman (Member) commented Jun 17, 2016

maybe we should use float_precision rather than float32_64

MechCoder (Member) commented Jun 17, 2016

I suppose we should test for X having non-float dtypes as well?

yenchenlin (Contributor) commented Jun 17, 2016

I have now named the refactored test function test_float_precision(), since it tests both KMeans and MiniBatchKMeans.

@MechCoder I guess int input data should be tested in other functions, since its clustering results will not remain the same as for float data here?

yenchenlin (Contributor) commented Jun 17, 2016

I now tested int input data here.

sklearn/cluster/tests/test_k_means.py
+ inertia = {}
+ X_new = {}
+
+ for dtype in [np.float64, np.float32]:

jnothman (Member) commented Jun 17, 2016

Can we just use another for loop to handle the dense and sparse cases?

sklearn/cluster/tests/test_k_means.py
+ assert_array_almost_equal(X_new[np.float32], X_new[np.float64])
+
+
+def test_mb_k_means_float32_64():

jnothman (Member) commented Jun 17, 2016

It looks like this is identical to the one above, except for the estimator. Again, we should probably use a loop (or a generator)
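
A rough sketch of the looped refactor being suggested (illustrative only, not the exact test that was merged): iterate over both estimators and both dtypes, and compare the float32 results against their float64 counterparts.

    import numpy as np
    from numpy.testing import assert_array_almost_equal
    from sklearn.cluster import KMeans, MiniBatchKMeans

    rng = np.random.RandomState(0)
    X64 = rng.rand(40, 3)

    for Est in [KMeans, MiniBatchKMeans]:
        inertia, centers = {}, {}
        for dtype in [np.float64, np.float32]:
            est = Est(n_clusters=3, random_state=0).fit(X64.astype(dtype))
            # cluster centers should keep the input dtype after this PR
            assert est.cluster_centers_.dtype == dtype
            inertia[dtype] = est.inertia_
            centers[dtype] = est.cluster_centers_
        assert_array_almost_equal(inertia[np.float32], inertia[np.float64],
                                  decimal=4)
        assert_array_almost_equal(centers[np.float32], centers[np.float64],
                                  decimal=4)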

jnothman (Member) commented Jun 17, 2016

Do we need to update the benchmarks in the PR description, or has nothing memory- or time-sensitive changed since 2016-05-31?

sklearn/cluster/tests/test_k_means.py
+ # dtype of cluster centers has to be the dtype of the input data
+ assert_equal(km.cluster_centers_.dtype, dtype)
+ inertia[dtype] = km.inertia_
+ X_new[dtype] = km.transform(km.cluster_centers_)

MechCoder (Member) commented Jun 17, 2016

Did you mean X_new[dtype] = km.transform(X)?

km.transform(km.cluster_centers_) just returns an array of zeros...

yenchenlin (Contributor) commented Jun 17, 2016

Yeah ... big thanks!

sklearn/cluster/k_means_.py
+ if sp.issparse(X):
+ X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
+ else:
+ X = check_array(X, dtype=[np.float64, np.float32])

MechCoder (Member) commented Jun 17, 2016

Why this if .. else? Just

X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])

should suffice, right?

yenchenlin (Contributor) commented Jun 17, 2016

You are right! Fixed!

doc/whats_new.rst
+ with ``np.float32`` and ``np.float64`` input data without converting it.
+ This allows to reduce the memory consumption by using ``np.float32``.
+ (`#6430 <https://github.com/scikit-learn/scikit-learn/pull/6430>`_)
+ By `Sebastian Säger`_.

MechCoder (Member) commented Jun 17, 2016

Add your name as well.

MechCoder (Member) commented Jun 17, 2016

That should be it from me for now..

yenchenlin (Contributor) commented Jun 17, 2016

Hello @jnothman,
I used KMeans to fit np.float32 data with shape (200000, 20), and here are the time profiling results (you can find my script here):

This line is the only change I made.

  1. Using inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64):
    63.4521019459
  2. Using inertia = np.sum((X - centers[labels]) ** 2):
    62.4173989296

The solution I used to keep precision seems reasonable.

In my opinion, this line involves O(n) instead of O(mn) work in each iteration because it only converts the dtype of the computation results (i.e., the distance from each point to its corresponding center), and its complexity has nothing to do with the number of features.

yenchenlin (Contributor) commented Jun 17, 2016

Hello guys, I know I'm noisy today but I still want to inform you that I've addressed the comments you gave before, please have a look! 🙏

sklearn/cluster/tests/test_k_means.py
+ for dtype in [np.int32, np.int64, np.float32, np.float64]:
+ X_test = dtype(X_small)
+ init_centers_test = dtype(init_centers)
+ assert_equal(X_test.dtype, init_centers_test.dtype)

MechCoder (Member) commented Jun 18, 2016

This line does not really test anything, does it?

yenchenlin (Contributor) commented Jun 18, 2016

I did it yesterday so as to emphasize that they are of the same type, but it looks stupid to me overnight. Thanks!

sklearn/cluster/tests/test_k_means.py
+ assert_equal(X_test.dtype, init_centers_test.dtype)
+ km = KMeans(init=init_centers_test, n_clusters=3)
+ km.fit(X_test)
+ assert_equal(False, km.cluster_centers_ is init_centers)

MechCoder (Member) commented Jun 18, 2016

Using np.may_share_memory might be better..
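
A tiny illustration of why (toy example, not the PR's test): an identity check with is only catches the exact same object, while np.may_share_memory also flags views that alias the original array.

    import numpy as np

    init = np.zeros((3, 2))
    view = init[:, :]                        # a view onto init, not the same object
    print(view is init)                      # False
    print(np.may_share_memory(view, init))   # True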

MechCoder (Member) commented Jun 18, 2016

We should also test the dtype of km.cluster_centers_, no?

yenchenlin (Contributor) commented Jun 18, 2016

I've tested it for int dtype here and for float dtype here.

sklearn/cluster/tests/test_k_means.py
+ init_int = X_int[:2]
+
+ fitted_models = [
+ KMeans(n_clusters=2).fit(X_list),

MechCoder (Member) commented Jun 18, 2016

You should move the computations related to X_list outside the for loop.

+ # This test is used to check KMeans won't mutate the user provided input array silently
+ # even if input data and init centers have the same type
+ X_small = np.array([[1.1, 1.1], [-7.5, -7.5], [-1.1, -1.1], [7.5, 7.5]])
+ init_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, -5.0]])

MechCoder (Member) commented Jun 18, 2016

You should also check more explicitly that the values of init_centers are not changed after every dtype in the for loop below.
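
A sketch of that check (illustrative, assuming the toy X_small / init_centers data from the diff above): keep a pristine copy of the user-provided init array and assert it is untouched after every fit.

    import numpy as np
    from sklearn.cluster import KMeans

    X_small = np.array([[1.1, 1.1], [-7.5, -7.5], [-1.1, -1.1], [7.5, 7.5]])
    init_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, -5.0]])
    init_copy = init_centers.copy()

    for dtype in [np.float32, np.float64]:
        KMeans(init=init_centers, n_clusters=3, n_init=1).fit(X_small.astype(dtype))
        np.testing.assert_array_equal(init_centers, init_copy)  # init not mutated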

- centers = init
+ # ensure that the centers have the same dtype as X
+ # this is a requirement of fused types of cython
+ centers = np.array(init, dtype=X.dtype)
elif callable(init):
centers = init(X, k, random_state=random_state)

MechCoder (Member) commented Jun 18, 2016

We should also force centers to be the dtype of X here...

MechCoder (Member) commented Jun 19, 2016

bump?

MechCoder (Member) commented Jun 18, 2016

That should be it from me as well. LGTM pending comments. Great work!

MechCoder changed the title from [MRG] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types to [MRG+1] Allows KMeans/MiniBatchKMeans to use float32 internally by using cython fused types Jun 18, 2016

yenchenlin (Contributor) commented Jun 18, 2016

Thanks @MechCoder for the inputs, please have a look when you have time!

jnothman (Member) commented Jun 18, 2016

In terms of benchmarking, I mostly meant to make sure that we're actually reducing memory consumption in the sparse and dense cases from what it is at master...

yenchenlin (Contributor) commented Jun 19, 2016

Hello @jnothman, sorry for the misunderstanding ...
What I wanted to emphasize in this comment is that we can keep precision while not increasing overhead.

About the benchmarking, I've updated the results for sparse input in the main description of this PR, while the results remain the same in the dense case.

@@ -305,7 +305,7 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
X -= X_mean
if hasattr(init, '__array__'):
- init = check_array(init, dtype=np.float64, copy=True)
+ init = check_array(init, dtype=X.dtype.type, copy=True)

jnothman (Member) commented Jun 20, 2016

(I'm surprised that we need this .type here; perhaps check_array should allow a dtype object to be passed.)
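
For context (a toy sketch of the line in question, assuming the check_array behaviour discussed above): X.dtype is a dtype instance such as dtype('float32'), while X.dtype.type is the scalar type np.float32, which is what check_array is given here.

    import numpy as np
    from sklearn.utils import check_array

    X = np.ones((4, 2), dtype=np.float32)
    print(X.dtype)        # float32 (a np.dtype instance)
    print(X.dtype.type)   # <class 'numpy.float32'>

    init = [[0.0, 0.0], [1.0, 1.0]]
    init = check_array(init, dtype=X.dtype.type, copy=True)
    print(init.dtype)     # float32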

yenchenlin (Contributor) commented Jun 20, 2016

I think so.

sklearn/cluster/tests/test_k_means.py
+ assert_array_almost_equal(inertia[np.float32], inertia[np.float64],
+ decimal=4)
+ assert_array_almost_equal(X_new[np.float32], X_new[np.float64], decimal=4)
+ assert_array_almost_equal(centers[np.float32], centers[np.float64], decimal=4)

jnothman (Member) commented Jun 20, 2016

Please adhere to the PEP8 line length here and elsewhere. It's worth using a tool like flake8.

sklearn/cluster/tests/test_k_means.py
+ centers[dtype] = estimator.cluster_centers_
+ # make sure predictions correspond to the correct label
+ assert_equal(estimator.predict(X_test[0]), estimator.labels_[0])
+ if estimator == mb_km:

jnothman (Member) commented Jun 20, 2016

Do you mean if hasattr(estimator, 'partial_fit')?
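
The two conditions behave the same way in this test, but the duck-typed form documents the intent (a hypothetical fragment for illustration):

    # checks "does this estimator support partial_fit?" rather than "is this mb_km?"
    if hasattr(estimator, 'partial_fit'):
        estimator.partial_fit(X_test)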

yenchenlin (Contributor) commented Jun 20, 2016

Don't they have the same meaning?

jnothman (Member) commented Jun 20, 2016

They have the same effect, but different meanings! One is clearer about what you mean to test.

yenchenlin (Contributor) commented Jun 21, 2016

Ah yeah, thanks!
I will take care of this kind of issue in the future.

Code updated.

jnothman (Member) commented Jun 20, 2016

Let us know when you've dealt with those cosmetic things, after which I think this looks good for merge.

yenchenlin (Contributor) commented Jun 20, 2016

@jnothman Thanks for the check, done!

MechCoder (Member) commented Jun 21, 2016

Thanks for the updates. My +1 still holds.

jnothman (Member) commented Jun 21, 2016

Good work, @yenchenlin!

jnothman merged commit 6bea11b into scikit-learn:master Jun 21, 2016

3 of 4 checks passed

continuous-integration/appveyor/pr: AppVeyor build failed
ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.004%) to 94.48%
yenchenlin (Contributor) commented Jun 21, 2016

Thanks for your review!
Also thanks @ssaeger!

ssaeger (Contributor) commented Jun 21, 2016

Thanks @yenchenlin, great work!

imaculate added a commit to imaculate/scikit-learn that referenced this pull request Jun 23, 2016

agramfort added a commit that referenced this pull request Jun 23, 2016

olologin added a commit to olologin/scikit-learn that referenced this pull request Aug 24, 2016

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016