
[MRG] Issue #6673: Make a wrapper around functions that score an individual feature #8038

Closed · 38 commits

Conversation

@amanp10 (Contributor) commented Dec 11, 2016

Fixes #6673

It adds a wrapper around scoring functions like `mutual_info_classif`, `f_classif`, etc. It takes (X, y) as input, where y can be None.

I have worked on Issue 2, as I felt that Issue 1 has already been taken care of (please correct me if I am wrong). I have not added any extra tests for this wrapper function. Also, its usage differs slightly from the example in Issue 2.

@jnothman (Member)

I don't get the point of this.

@amanp10 (Contributor, Author) commented Dec 12, 2016

Well, I am looking forward to your instructions on what to do next.

@jnothman (Member)

But what's it for? It doesn't relate to the wrapper described in #6673's second issue, as far as I can tell.

@amanp10 (Contributor, Author) commented Dec 12, 2016

I have made this function take three inputs: a scoring function, X, and y (default None). Feature selectors like SelectKBest take the scoring function as a parameter, and the base class then calls the scoring function through the wrapper.
The usage differs from the Issue 2 example in that SelectKBest takes the scoring function, not the wrapper function, as its parameter; the base class uses the wrapper internally.
Also, negative score values returned by scoring functions are changed to 0 in the wrapper function.

I feel I am missing something. Please guide me on what exactly Issue 2 expects. Thanks a lot for your help.

@jnothman (Member)

> I have made this function which takes 3 inputs, scoring function, X and y(=None). Now the feature selectors like SelectKBest take as parameter the scoring function and then the base class calls the scoring function through the wrapper.

Yes, but can you give me an example of where this is useful?

I think the proposal in issue 2 was that it should allow you to translate a function that operates over a feature vector into one that operates over a matrix. I'm not sure it's entirely necessary, personally.

@amanp10 (Contributor, Author) commented Dec 14, 2016

I am not sure about the necessity of this function myself; I only tried to achieve what was required in the issue, and I think it went wrong. If you say so, I will start working on the issue's proposal as you described above.

@jnothman (Member)

@hlin117 your input would be welcome

@hlin117 (Contributor) commented Dec 17, 2016

> I have made this function which takes 3 inputs, scoring function, X and y(=None). Now the feature selectors like SelectKBest take as parameter the scoring function and then the base class calls the scoring function through the wrapper.

> Yes, but can you give me some example of where this is useful??

> I think the proposal in issue 2 was that it should allow you to translate a function that operates over a feature vector into one that operates over a matrix. I'm not sure it's entirely necessary, personally.

Yes, you're correct about the objective of the issue. As for the necessity: it's not strictly necessary, but it does act as a convenience function.

Otherwise, say my scoring function is scipy.stats.pearsonr and I want to use SelectKBest with this scoring function. With the current scikit-learn framework, I have no way of doing this.
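The situation described here can be sketched concretely (hypothetical helper code, not scikit-learn API): scipy.stats.pearsonr compares two 1-D vectors, so it cannot be passed to SelectKBest directly, but a small wrapper that applies it column by column and returns (scores, pvalues) can. The name `featurewise_scorer` follows the naming settled on later in this thread.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest


def featurewise_scorer(score_func, **kwargs):
    """Hypothetical wrapper: apply a per-feature statistic to each column."""
    def scorer(X, y):
        X = np.asarray(X)
        # Each call scores one column against y; pearsonr returns (r, p).
        scores, pvalues = zip(*(score_func(X[:, i], y, **kwargs)
                                for i in range(X.shape[1])))
        # Absolute values so strong negative correlations also rank highly.
        return np.abs(scores), np.asarray(pvalues)
    return scorer


X, y = make_classification(n_features=20, random_state=0)
skb = SelectKBest(featurewise_scorer(pearsonr), k=10)
X_new = skb.fit_transform(X, y)
print(X_new.shape)  # (100, 10)
```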

@hlin117 (Contributor) left a review:

Got any test cases?

@@ -11,7 +11,7 @@
from scipy.sparse import issparse, csc_matrix

from ..base import TransformerMixin
from ..utils import check_array, safe_mask
from ..utils import check_array, safe_mask, check_X_y
Reviewer comment (Contributor):

Nit: Make this alphabetical, check_X_y should come before safe_mask.



def wrapper_scorer(score_func, X, y=None):
""" A wrapper function around score functions. This function takes as
Reviewer comment (Contributor):

The first line of a docstring should be a one sentence summary. See PEP 257

The target values (class labels in classification, real numbers in
regression).

Notes
Reviewer comment (Contributor):

Reading this docstring, maybe there's a disconnect of how you think this function wrapper would be useful. It would help you - and people using your code - to provide an example of how they should expect to use it.

"""

if not callable(score_func):
raise TypeError("The score function should be a callable, %s (%s) "
Reviewer comment (Contributor):

Scikit-learn typically raises ValueErrors for these kinds of things.

if y is None:
X = check_array(X, ('csr', 'csc'))
else:
X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
Reviewer comment (Contributor):

You can change ['csr', 'csc'] to ('csr', 'csc'). Tuples have less overhead than lists.

Also, in my opinion, you should add in the sparsity test cases later. Try getting this PR approved for dense matrices before you move on to adapt it to sparse matrices.

if y is None:
score_func_ret = score_func(X)
else:
score_func_ret = score_func(X, y)
Reviewer comment (Contributor):

Reviewing this code, it looks like you have a lot of special casing for when the return values of the functions is a pair rather than a single value. I'd suggest separating the logic out into two helper functions.

y : array-like or None, shape = [n_samples]
The target values (class labels in classification, real numbers in
regression).

Reviewer comment (Contributor):

You should add docs of what this function returns.

if pvalues is None:
return scores
else:
return scores, pvalues
Reviewer comment (Contributor):

Hmm, looking at these return values, I don't think you're understanding what issue 2 of #6673 is describing. We're just looking for a way to wrap this scoring function; the output of this function should be another callable, and that callable can then be passed, for example, to SelectKBest.
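In rough form, the design described here is a higher-order function: the wrapper takes a per-feature score function and returns a new callable with the (X, y) signature SelectKBest expects. A minimal sketch (the name `per_feature` is invented for illustration):

```python
import numpy as np
from scipy.stats import spearmanr


def per_feature(score_func):
    # Return a callable rather than computing scores directly; the returned
    # function has the (X, y) -> (scores, pvalues) shape SelectKBest expects.
    def scorer(X, y):
        X = np.asarray(X)
        results = [score_func(X[:, i], y) for i in range(X.shape[1])]
        scores, pvalues = map(np.asarray, zip(*results))
        return scores, pvalues
    return scorer


rng = np.random.RandomState(0)
X, y = rng.rand(50, 3), rng.rand(50)
scorer = per_feature(spearmanr)   # a callable, not a result
scores, pvalues = scorer(X, y)
print(scores.shape, pvalues.shape)  # (3,) (3,)
```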

@@ -17,7 +17,7 @@
safe_mask)
from ..utils.extmath import norm, safe_sparse_dot, row_norms
from ..utils.validation import check_is_fitted
from .base import SelectorMixin
from .base import SelectorMixin, wrapper_scorer
Reviewer comment (Contributor):

What are you doing in this file? I think you're making this PR far too big, and it's diverging in scope.

@amanp10 (Contributor, Author) commented Jan 9, 2017

The work seems to be done; the example mentioned in the issue is working fine. However, I have not used another callable for the output as suggested by @hlin117, as it didn't feel necessary. Please correct me if I am wrong.
Also, should I add tests as well, @jnothman?

@jnothman (Member) commented Jan 9, 2017

Maybe the issue was incorrect. I don't think there's any need for a wrapper scorer. The code already handles the case where the return value is not a tuple, so a wrapper is not needed. See the documentation for score_func in SelectKBest and SelectPercentile.

@amanp10 (Contributor, Author) commented Jan 10, 2017

Yes, @jnothman, but the second part of the issue is the actual one. That is what I have tried to solve here: a wrapper for scipy.stats scoring functions, since they cannot be used directly by SelectKBest and the other selectors.

@jnothman (Member)

Sorry. Got lost over the last few weeks.

@jnothman (Member) commented Jan 10, 2017

I must admit, the name wrapper_scorer vastly diminishes the clarity of what's going on. How about make_per_feature(pearsonr)(X, y)? I'm not coming up with something perfect yet:

  • make_per_feature(pearsonr)(X, y)
  • make_per_column(pearsonr)(X, y)
  • make_columnwise(pearsonr)(X, y)
  • make_featurewise(pearsonr)(X, y)
  • per_feature(pearsonr)(X, y)
  • featurewise(pearsonr)(X, y)
  • stat_per_feature(pearsonr)(X, y)

@amanp10 (Contributor, Author) commented Jan 10, 2017

What about feature_wise_scorer(pearsonr)(X, y) or feature_wise_stat_scorer(pearsonr)(X, y)?

@jnothman (Member) commented Jan 10, 2017 via email

@hlin117 (Contributor) commented Jan 10, 2017

I'm eager to see this issue addressed! Thank you for working on it, @amanp10 .

@jnothman (Member) commented Jan 10, 2017 via email

@amanp10 (Contributor, Author) commented Jan 11, 2017

Should I finally name it feature_wise_scorer(pearsonr)(X, y)? Also, should I add a test, or make the final changes by updating whats_new.rst and committing?

@jnothman (Member) commented Jan 11, 2017 via email

@amanp10 (Contributor, Author) commented Oct 8, 2017

I need some help here. I am unable to figure out why the ci/circleci tests are failing. It reports a problem somewhere in an example.

@amanp10 (Contributor, Author) commented Dec 3, 2017

@jnothman @hlin117 I have removed the option for y=None, as I couldn't find any scoring functions that take only X as input. I wanted your views on it.

@jnothman (Member) left a review:

I'm still not convinced that users will often be helped by this or know to look for it... Some narrative docs under doc/modules/feature_selection.rst giving an example with spearmanr, for example, would help.

if isinstance(score_func_ret, (list, tuple)):
score, p_val = score_func_ret
else:
score = score_func_ret
Reviewer comment (Member):

Never run in tests

Reply (Contributor, Author):

I think I will just scrap the else part, since I couldn't get a function that returns only scores.

Reply (Contributor, Author):

What should I do about this? I am not sure how to test the else part. Also, is it necessary to test it, since ScoreFunction is just a dummy function created for testing purposes?

Reviewer comment (Member):

Can't you just use lambda *args, **kwargs: spearmanr(*args, **kwargs)[0]?
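The suggestion above, spelled out: wrapping spearmanr in a lambda that keeps only element 0 of its result yields a score function with no p-values, which is enough to exercise the scores-only branch in a test.

```python
import numpy as np
from scipy.stats import spearmanr

# Drop the p-value so the wrapped function returns scores only.
scores_only = lambda *args, **kwargs: spearmanr(*args, **kwargs)[0]

rng = np.random.RandomState(0)
x, y = rng.rand(30), rng.rand(30)
r = scores_only(x, y)          # a single correlation, no p-value
r_full, p = spearmanr(x, y)    # the original pair for comparison
print(np.isclose(r, r_full))  # True
```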

for i in six.moves.range(X.shape[1]):
score_func_ret = score_func(X[:, i], y, **kwargs)

if isinstance(score_func_ret, (list, tuple)):
Reviewer comment (Member):

I suspect we should only support tuples here. Lists should have different meaning.

@jnothman (Member) left a review:

Also, add it to doc/modules/classes.rst

@jnothman (Member) commented Jan 9, 2018 via email

:func:`featurewise_scorer` is a wrapper function which wraps around scoring
functions like `spearmanr`, `pearsonr` etc. from the `scipy.stats` module and
makes it usable for feature selection algorithms like :class:`SelectKBest`,
:class:`SelectPercentile` etc.
Reviewer comment (Member):

Clarify that it compares each column of X to y

makes it usable for feature selection algorithms like :class:`SelectKBest`,
:class:`SelectPercentile` etc.

The following example illustrates it's usage:
Reviewer comment (Member):

Drop the apostrophe

SelectKBest(k=10, score_func=...)
>>> new_X = skb.transform(X)

This wrapper function returns the absolute value of the scores i.e. a score of
Reviewer comment (Member):

I'm not sure we should do this without an option

Reply (Contributor, Author):

I think in SelectKBest we are supposed to choose the features having maximum correlation with the target vector. In that case, the magnitude of the scores (since they may be negative) should serve our purpose.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html

Reviewer comment (Member):

Yes, but that may not be true of all appropriate score functions

Reply (Contributor, Author):

In that case, should we add a parameter like absolute_score which takes True or False: if it's True (the default), absolute scores would be considered?
What do you say?
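The absolute_score switch proposed here could look roughly like this (a sketch only; `featurewise_scorer` and the parameter name follow this thread, not a released scikit-learn API):

```python
import numpy as np
from scipy.stats import spearmanr


def featurewise_scorer(score_func, absolute_score=True, **kwargs):
    def scorer(X, y):
        X = np.asarray(X)
        # Score each column; each call returns (correlation, pvalue).
        scores, pvalues = map(np.asarray, zip(*(
            score_func(X[:, i], y, **kwargs) for i in range(X.shape[1]))))
        if absolute_score:
            scores = np.abs(scores)
        return scores, pvalues
    return scorer


rng = np.random.RandomState(0)
X = rng.rand(40, 2)
y = -X[:, 0] + 0.1 * rng.rand(40)   # strongly anti-correlated with column 0
signed, _ = featurewise_scorer(spearmanr, absolute_score=False)(X, y)
absolute, _ = featurewise_scorer(spearmanr)(X, y)
print(signed[0] < 0, absolute[0] > 0)  # True True
```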

@jnothman (Member) commented Jan 10, 2018 via email

@amanp10 (Contributor, Author) commented Jan 10, 2018

abs is a built-in Python function, so I went with absolute_score.

@jnothman (Member) commented Jan 10, 2018 via email

@@ -131,6 +131,8 @@ def featurewise_scorer(score_func, **kwargs):
Function taking arrays X and y, and returning a pair of arrays
(scores, pvalues) or a single array with scores. This function is also
allowed to take other parameters as input.
absolute_score : bool
If True (default), the absolute value of the scores are returned.
Reviewer comment (Member):

Add that "this is useful when using correlation coefficients"

@jnothman (Member) left a review:

I remain +0.5 on this. I understand how it is helpful, but I suspect this kind of helper should not be in the library, but should merely be present as an example. Note that the case in the example here can be implemented as:

def featurewise_spearmanr(X, y):
    scores, pvals = zip(*(spearmanr(x, y) for x in X.T))
    return np.abs(scores), pvals
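Trying the inline helper sketched above end to end (using the `featurewise_spearmanr` spelling; assumes nothing beyond numpy, scipy, and scikit-learn):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest


def featurewise_spearmanr(X, y):
    # Score each column of X against y; spearmanr returns (correlation, pvalue).
    scores, pvals = zip(*(spearmanr(x, y) for x in X.T))
    return np.abs(scores), np.asarray(pvals)


X, y = make_classification(n_features=20, random_state=0)
X_new = SelectKBest(featurewise_spearmanr, k=5).fit_transform(X, y)
print(X_new.shape)  # (100, 5)
```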

-------
scores : array-like, shape (n_features,)
Score values returned by the scoring function.
p_vals : array-like, shape (n_features,)
Reviewer comment (Member):

Mark this dependent on the score function

Parameters
----------
score_func : callable
Function taking arrays X and y, and returning a pair of arrays
Reviewer comment (Member):

Doesn't it just return a pair of numbers?

Reply (Contributor, Author):

I will change the entire description appropriately.

allowed to take other parameters as input.
absolute_score : bool
If True (default), the absolute value of the scores are returned,
which is useful when using correlation coefficients.
Reviewer comment (Member):

Document kwargs also

X, y = make_classification(random_state=0)

# spearmanr from scipy.stats
skb = SelectKBest(featurewise_scorer(spearmanr, axis=0), k=10)
Reviewer comment (Member):

You should really be testing the new function on its own. We have already checked that SelectKBest works.

Reply (Contributor, Author):

I meant to test whether the wrapper works as it is supposed to be used. I will try to change the tests to test the function alone.

@amanp10 (Contributor, Author) commented Jan 14, 2018

@jnothman I am not able to comment on the necessity of this feature or its importance in the library. I think the other core devs might be able to help.

@agramfort (Member)

I would close this one. It adds code to the core for something I would expect a user to be able to write themselves.

I agree with @jnothman here. Feel free to reopen if you disagree.


4 participants