
[MRG+2] Use fused types in sparse mean variance functions #6593

Merged
merged 3 commits into scikit-learn:master Apr 16, 2016

@yenchenlin
Contributor

yenchenlin commented Mar 26, 2016

This is a follow-up PR to #6588, which tries to make the functions in utils/sparsefuncs_fast.pyx support Cython fused types.

In this PR, I focus on the functions listed below:

  • csr_mean_variance_axis0
  • csc_mean_variance_axis0
  • incr_mean_variance_axis0
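
For illustration, here is a minimal Python-level sketch of the behaviour this enables: the output dtype of mean_variance_axis (the public wrapper around these helpers) should follow an np.float32 input instead of always being promoted to np.float64. This is only a usage sketch, not code from this PR:

import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

# A small CSR matrix in np.float32; with fused types the computation and
# the returned means/variances stay in float32 instead of float64.
X_csr = sp.csr_matrix(np.random.rand(1000, 20).astype(np.float32))
means, variances = mean_variance_axis(X_csr, axis=0)
print(means.dtype, variances.dtype)  # expected: float32 float32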

EDIT:
I called the mean_variance_axis function on an np.float32 array of shape (5*10^6, 20).
Here is the memory usage over time:

  • master:

[figure_1: memory usage over time on master]

  • this branch:

[figure_2: memory usage over time on this branch]

The memory usage within the bracketed region indeed decreases.

def _csr_mean_variance_axis0(np.ndarray[floating, ndim=1, mode="c"] X_data,
                             shape,
                             np.ndarray[int, ndim=1] X_indices):

@MechCoder

MechCoder Mar 29, 2016

Member

You could write a small comment so that readers can figure out why you wrote this wrapper. (That is, values that could have different dtypes can only be passed as function arguments; here, floating identifies the dtype of the argument that was passed. The same cannot be done with variables.)

@MechCoder

MechCoder Mar 29, 2016

Member

Ofc in a much better way.
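
(Note for readers unfamiliar with the constraint being discussed: a Cython fused type such as floating is resolved from the dtype of a function argument, not from a local variable, which is why the dtype-dependent work lives in a helper that receives the data arrays as arguments. Below is a rough Python analogy of that pattern, purely illustrative and not the actual scikit-learn code; _mean_axis0 is a made-up name.)

import numpy as np

def _mean_axis0(X_data, shape, X_indices):
    # In the Cython helper, X_data is declared with the fused type `floating`,
    # so one specialization is compiled per dtype; the accumulator below can
    # then use the same dtype as the input instead of always float64.
    means = np.zeros(shape[1], dtype=X_data.dtype)
    np.add.at(means, X_indices, X_data)  # column-wise sums of the nonzeros
    means /= shape[0]                    # implicit zeros still count in the mean
    return means

It would be called as _mean_axis0(X_csr.data, X_csr.shape, X_csr.indices), mirroring how the Cython helper receives the raw CSR arrays.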

@yenchenlin yenchenlin changed the title from [WIP] Use fused types in sparse mean variance functions to [MRG] Use fused types in sparse mean variance functions Apr 2, 2016

@yenchenlin

Contributor

yenchenlin commented Apr 2, 2016

Hello @MechCoder, I've addressed the issues you pointed out.
I've also refactored the tests for the mean-variance functions and made sure they output an np.float32 result when the X passed in is np.float32.

@yenchenlin

Contributor

yenchenlin commented Apr 3, 2016

Would @jnothman please have a look when you have time?
Thanks!

@MechCoder

Member

MechCoder commented Apr 7, 2016

Thanks for the refactoring!

Can you use memory_profiler to run some quick benchmarks and confirm the reduced memory usage with dtype np.float32 between this branch and master?

@yenchenlin

Contributor

yenchenlin commented Apr 8, 2016

Hello @MechCoder,
I've fixed the problems you mentioned.

About memory profiling, I called the mean_variance_axis function on an np.float32 array of shape (5*10^6, 20).

Here is the memory usage over time:

  • master:

[figure_1: memory usage over time on master]

  • this branch:

[figure_2: memory usage over time on this branch]

As you can see, the memory usage within the bracketed region drastically decreases.

@MechCoder

Member

MechCoder commented Apr 8, 2016

I would expect the peak memory usage to decrease drastically. Did you forget to rebuild sklearn?

@yenchenlin

Contributor

yenchenlin commented Apr 8, 2016

Here is my test script:

import numpy as np
import scipy.sparse as sp
from memory_profiler import profile  # provides the @profile decorator
from sklearn.utils.sparsefuncs import mean_variance_axis

X = np.random.rand(5000000, 20)
X = X.astype(np.float32)
X_csr = sp.csr_matrix(X)

@profile
def test():
    X_means, X_vars = mean_variance_axis(X_csr, axis=0)
    print(X_means.dtype)

test()

I think the peak memory usage appears when I initialize
X = np.random.rand(5000000, 20), and that should not change between the two branches.

@MechCoder

Member

MechCoder commented Apr 8, 2016

In that case, you can use the -m flag to verify:

python -m memory_profiler test.py

which will provide line-by-line output.

@MechCoder

Member

MechCoder commented Apr 8, 2016

Oh, the graph looks good; I did not read it properly :/

@yenchenlin

Contributor

yenchenlin commented Apr 8, 2016

Oh okay 😄

@MechCoder

Member

MechCoder commented Apr 8, 2016

LGTM. cc: @jnothman

@MechCoder MechCoder changed the title from [MRG] Use fused types in sparse mean variance functions to [MRG+1] Use fused types in sparse mean variance functions Apr 8, 2016

@yenchenlin

Contributor

yenchenlin commented Apr 12, 2016

Hello @jnothman @MechCoder ,

I've updated the code, please have a look. Thanks!

@MechCoder

Member

MechCoder commented Apr 13, 2016

I will let @jnothman do the honours.

@jnothman

Member

jnothman commented Apr 14, 2016

It's not really an issue with this PR, but I suspect we should have a test there to ensure the result is sensible for integer dtypes. You could just have something like:

for input_dtype, expected_dtype in [(np.float32, np.float32), (np.int32, np.float64), ...]

In fact, the expected_dtype is not really of interest: the enhancement we're making here is not really testable except via memory profiling.

Thinking further about it, this change still copies data for integers, which we could avoid with a more generic fused type while still having float output. Can we assume that the mean of explicitly integer features is not something we're often interested in, so it's not worth the additional compilation time?

We also need a what's new entry boasting about what we've enhanced.
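
A minimal sketch of the kind of dtype-parametrized test being suggested (illustrative only, not the test actually added in this PR; integer input is discussed below, so only floating dtypes are covered here):

import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X = np.random.rand(100, 5)
for input_dtype, expected_dtype in [(np.float32, np.float32),
                                    (np.float64, np.float64)]:
    X_csr = sp.csr_matrix(X.astype(input_dtype))
    means, variances = mean_variance_axis(X_csr, axis=0)
    # For floating input, the output dtype should follow the input dtype.
    assert means.dtype == expected_dtype
    assert variances.dtype == expected_dtype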

@yenchenlin

Contributor

yenchenlin commented Apr 14, 2016

Hello @jnothman, thanks for your review.
I'll add the test and the what's new entry.

Can we assume that the mean of explicitly integer features is not something we're often interested in, so it's not worth the additional compilation time?

Maybe it's my lack of knowledge, but why is the mean of integer features often not something we are interested in?

@jnothman

Member

jnothman commented Apr 14, 2016

I guess integer features, or at least binary ones, are common enough in many problem spaces. So the current code is still performing a copy for integer input. Do we mind?

@ogrisel

Member

ogrisel commented Apr 14, 2016

I guess integer features, or at least binary ones, are common enough in many problem spaces. So the current code is still performing a copy for integer input. Do we mind?

+1 for not delaying this PR because of integer handling.

if floating is float:
    dtype = np.float32
else:
    dtype = np.float64

@ogrisel

ogrisel Apr 14, 2016

Member

style:

dtype = np.float32 if floating is float else np.float64

@jnothman

jnothman Apr 14, 2016

Member

Personally, I find that a bit too much of a tongue-twister!

@ogrisel

ogrisel Apr 14, 2016

Member

:)

as you wish.

@MechCoder

MechCoder Apr 14, 2016

Member

haha

@ogrisel

Member

ogrisel commented Apr 14, 2016

+1 for not delaying this PR because of integer handling.

But +1 for adding a non-regression test to check that integer dtypes are still supported.
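
For illustration, such a non-regression test might look roughly like the sketch below (assuming, per the discussion above, that integer input is still accepted and copied to a floating dtype internally; this is not the test actually added in the PR):

import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

X_int = sp.csr_matrix(np.random.randint(0, 5, size=(100, 5)).astype(np.int64))
means, variances = mean_variance_axis(X_int, axis=0)
# Integer input should still be handled; the results are floating point.
assert np.issubdtype(means.dtype, np.floating)
assert np.issubdtype(variances.dtype, np.floating)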

@yenchenlin

Contributor

yenchenlin commented Apr 16, 2016

Hi @jnothman @ogrisel @MechCoder ,
I've fixed the issues you guys mentioned.

Would you please check again?
Thanks a lot!

@jnothman

Member

jnothman commented Apr 16, 2016

LGTM!

@jnothman jnothman changed the title from [MRG+1] Use fused types in sparse mean variance functions to [MRG+2] Use fused types in sparse mean variance functions Apr 16, 2016

@MechCoder MechCoder merged commit 28758cc into scikit-learn:master Apr 16, 2016

4 checks passed

  • ci/circleci: Your tests passed on CircleCI!
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • coverage/coveralls: Coverage increased (+0.003%) to 94.033%
@MechCoder

Member

MechCoder commented Apr 16, 2016

Thanks, Yen!

mannby pushed a commit to mannby/scikit-learn that referenced this pull request Apr 22, 2016

[MRG+2] Use fused types in sparse mean variance functions (scikit-learn#6593)

* Use fused types in mean variance functions

* Add test for mean-variance functions using fused types

* Add whats_new

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+2] Use fused types in sparse mean variance functions (scikit-learn#6593)

* Use fused types in mean variance functions

* Add test for mean-variance functions using fused types

* Add whats_new