MRG sparse matrix support in univariate feature selection #1025

Closed
wants to merge 13 commits into
from

Conversation

Projects
None yet
3 participants
@amueller
Member

amueller commented Aug 14, 2012

This PR adds sparse matrix support to ``f_classif" and "f_regression" univariate feature selection.

I'm not very good with sparse stuff, so any suggestions are more then welcome.

I added a utility function that computes the element wise square of an array-like or sparse matrix and put it in utils.
I hope this is not nonsense but seemed the simplest thing to do. Is there a better way to to element-wise operations in sparse matrices?
Also I'm not sure if there might be a better place for the function in on of the submodules and whether it should be documented in the developers docs.

sklearn/utils/__init__.py
@@ -283,6 +284,26 @@ def shuffle(*arrays, **options):
return resample(*arrays, **options)
+def safe_sqr(X):

This comment has been minimized.

@agramfort

agramfort Aug 15, 2012

Member

would it be useful to add a copy param to allow inplace squaring?

@agramfort

agramfort Aug 15, 2012

Member

would it be useful to add a copy param to allow inplace squaring?

This comment has been minimized.

@amueller

amueller Aug 15, 2012

Member

I guess it would be useful. Will add.

@amueller

amueller Aug 15, 2012

Member

I guess it would be useful. Will add.

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Aug 15, 2012

Member

do you test that sparse and dense give the same output values?

Member

agramfort commented Aug 15, 2012

do you test that sparse and dense give the same output values?

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 15, 2012

Member

I copied the test of the dense implementation. I also though about doing the dense == sparse test. Do you think that would be better? Then I'd remove the duplicate tests for sparse and just check if the results are equal.

Member

amueller commented Aug 15, 2012

I copied the test of the dense implementation. I also though about doing the dense == sparse test. Do you think that would be better? Then I'd remove the duplicate tests for sparse and just check if the results are equal.

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Aug 15, 2012

Member

yes I think it's better. One test for both sparse and dense et check that is gives the same result

Member

agramfort commented Aug 15, 2012

yes I think it's better. One test for both sparse and dense et check that is gives the same result

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 15, 2012

Member

Done.

Member

amueller commented Aug 15, 2012

Done.

The set of F values
- pval : array of shape(m),
+ pval : array shape = [n_features,],

This comment has been minimized.

@agramfort

agramfort Aug 15, 2012

Member

shouldn't the comma be before the shape ?

@agramfort

agramfort Aug 15, 2012

Member

shouldn't the comma be before the shape ?

The set of regressors that will tested sequentially
y : array of shape(n_samples)
The data matrix
Returns
-------
- F : array of shape (m),
+ F : array shape = [n_features,],

This comment has been minimized.

@agramfort

agramfort Aug 15, 2012

Member

same comma here

@agramfort

agramfort Aug 15, 2012

Member

same comma here

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Aug 15, 2012

Member

besides the docstring commas +1 for merge

Member

agramfort commented Aug 15, 2012

besides the docstring commas +1 for merge

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 15, 2012

Member

Thanks, done.

Member

amueller commented Aug 15, 2012

Thanks, done.

@amueller amueller referenced this pull request Aug 15, 2012

Closed

MRG Sparse rfe #1029

@agramfort

This comment has been minimized.

Show comment
Hide comment
@agramfort

agramfort Aug 16, 2012

Member

+1 for merge

Member

agramfort commented Aug 16, 2012

+1 for merge

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 16, 2012

Member

Thanks. I'll wait for another +1.

Member

amueller commented Aug 16, 2012

Thanks. I'll wait for another +1.

+ if issparse(X):
+ corr = y * X
+ else:
+ corr = np.dot(y, X)

This comment has been minimized.

@mblondel

mblondel Aug 16, 2012

Member

Can you use safe_sparse_dot here?

@mblondel

mblondel Aug 16, 2012

Member

Can you use safe_sparse_dot here?

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

Good point :)

@amueller

amueller Aug 16, 2012

Member

Good point :)

- X = as_float_array(X, copy=False) # copy only if center
+ if issparse(X) and center:
+ raise ValueError("center=True only allowed for dense data")
+ X, y = check_arrays(X, y, dtype=np.float, )

This comment has been minimized.

@mblondel

mblondel Aug 16, 2012

Member

Extra comma + PEP8 ?

@mblondel

mblondel Aug 16, 2012

Member

Extra comma + PEP8 ?

- args = [X[y == k] for k in np.unique(y)]
+ X, y = check_arrays(X, y)
+ if issparse(X):
+ args = [X[np.where(y == k)[0]] for k in np.unique(y)]

This comment has been minimized.

@mblondel

mblondel Aug 16, 2012

Member

Is it possible to use this line in the dense case too? (to get rid of the if / else)

@mblondel

mblondel Aug 16, 2012

Member

Is it possible to use this line in the dense case too? (to get rid of the if / else)

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

It is possible but I thought it might incur some overhead. I'm not sure how big that is, though. On numpy arrays, indexing with indices is slower than boolean masks I think.

@amueller

amueller Aug 16, 2012

Member

It is possible but I thought it might incur some overhead. I'm not sure how big that is, though. On numpy arrays, indexing with indices is slower than boolean masks I think.

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

Also of course the where would be unnecessary.
I could do a safe_boolean_indexing helper function but that feels a bit overkill...

@amueller

amueller Aug 16, 2012

Member

Also of course the where would be unnecessary.
I could do a safe_boolean_indexing helper function but that feels a bit overkill...

This comment has been minimized.

@mblondel

mblondel Aug 16, 2012

Member

There is a safe_mask function already.
On Aug 17, 2012 12:16 AM, "Andreas Mueller" notifications@github.com
wrote:

In sklearn/feature_selection/univariate_selection.py:

     The set of p-values
 """
  • X = array2d(X)
  • y = np.asarray(y).ravel()
  • args = [X[y == k] for k in np.unique(y)]
  • X, y = check_arrays(X, y)
  • if issparse(X):
  •    args = [X[np.where(y == k)[0]] for k in np.unique(y)]
    

Also of course the where would be unnecessary.
I could do a safe_boolean_indexing helper function but that feels a bit
overkill...


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/pull/1025/files#r1393613.

@mblondel

mblondel Aug 16, 2012

Member

There is a safe_mask function already.
On Aug 17, 2012 12:16 AM, "Andreas Mueller" notifications@github.com
wrote:

In sklearn/feature_selection/univariate_selection.py:

     The set of p-values
 """
  • X = array2d(X)
  • y = np.asarray(y).ravel()
  • args = [X[y == k] for k in np.unique(y)]
  • X, y = check_arrays(X, y)
  • if issparse(X):
  •    args = [X[np.where(y == k)[0]] for k in np.unique(y)]
    

Also of course the where would be unnecessary.
I could do a safe_boolean_indexing helper function but that feels a bit
overkill...


Reply to this email directly or view it on GitHubhttps://github.com/scikit-learn/scikit-learn/pull/1025/files#r1393613.

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

Oh great, I'll use that then.

@amueller

amueller Aug 16, 2012

Member

Oh great, I'll use that then.

This comment has been minimized.

@mblondel

mblondel Aug 16, 2012

Member

I only remembered of it after you suggested creating such a function :)

@mblondel

mblondel Aug 16, 2012

Member

I only remembered of it after you suggested creating such a function :)

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

Hm it creates a mask so it is hard to use in the list comprehension. It is not used much, I could change it to directly return the masked array....

@amueller

amueller Aug 16, 2012

Member

Hm it creates a mask so it is hard to use in the list comprehension. It is not used much, I could change it to directly return the masked array....

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

hm ok never mind, I can still use it directly...

@amueller

amueller Aug 16, 2012

Member

hm ok never mind, I can still use it directly...

@@ -90,6 +92,7 @@ def f_oneway(*args):
msb = ssbn / float(dfbn)
msw = sswn / float(dfwn)
f = msb / msw
+ f = np.asarray(f).ravel()

This comment has been minimized.

@mblondel

mblondel Aug 16, 2012

Member

Could you add a comment that explains why this line is needed?

@mblondel

mblondel Aug 16, 2012

Member

Could you add a comment that explains why this line is needed?

This comment has been minimized.

@amueller

amueller Aug 16, 2012

Member

I should.

@amueller

amueller Aug 16, 2012

Member

I should.

@mblondel

This comment has been minimized.

Show comment
Hide comment
@mblondel

mblondel Aug 16, 2012

Member

Apart from my comments, +1.

Member

mblondel commented Aug 16, 2012

Apart from my comments, +1.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 16, 2012

Member

I was just wondering, should I move the safe_sqr to the extmath submodule? at the moment it's directly in utils.

Member

amueller commented Aug 16, 2012

I was just wondering, should I move the safe_sqr to the extmath submodule? at the moment it's directly in utils.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Aug 16, 2012

Member

I think keeping safe_sqr next to safe_mask is as good as anywhere. Merging by rebase.

Member

amueller commented Aug 16, 2012

I think keeping safe_sqr next to safe_mask is as good as anywhere. Merging by rebase.

@amueller amueller closed this Aug 16, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment