
Univariate feature selection methods should allow passing kwargs to score_func callable #12468

Open
hermidalc opened this issue Oct 26, 2018 · 31 comments


@hermidalc
Contributor

The univariate feature selection methods, SelectKBest/Percentile/Fdr/Fpr/Fwe and GenericUnivariateSelect, do not provide a way to pass kwargs to score_func, some of which are important.

One further thought is that an implementation should be GridSearchCV-compatible, so that score_func kwargs can be included in a parameter search.

@amueller
Member

Agree; for mutual_info_classif, for example, the kwargs seem important.
I think any natural implementation would allow for it being grid-searchable.
Btw, you can work around this using

{'score_func': [partial(mutual_info_classif, n_neighbors=k) for k in [3, 5, 7]]}

but maybe doing it more directly would be nice.
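For concreteness, a rough sketch of that workaround inside a full grid search (the 'fs' step name, the dataset and the n_neighbors values are just illustrative choices, not anything from the issue):

from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('fs', SelectKBest(mutual_info_classif, k=2)),
    ('clf', LinearSVC()),
])
# Each grid entry is a ready-made callable with n_neighbors frozen in.
param_grid = {
    'fs__score_func': [partial(mutual_info_classif, n_neighbors=k)
                       for k in [3, 5, 7]],
}
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
print(search.best_params_)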

@jnothman
Member

jnothman commented Oct 29, 2018 via email

@aishgrt1
Contributor

aishgrt1 commented Nov 1, 2018

Shall I take this?

@jnothman
Member

jnothman commented Nov 4, 2018

I suppose you can take this. I don't consider it high priority, as you can certainly use partial to accomplish the same thing.

@hermidalc
Contributor Author

@amueller @jnothman where do I put the above workaround in GridSearchCV? Also, how do I get it to work with joblib Memory, since I usually wrap feature selection scoring functions like mutual_info_classif in a cache, e.g.:

memory = Memory(cachedir=mkdtemp(dir='/tmp'), verbose=0)
mi_classif_cached = memory.cache(mutual_info_classif)

search = GridSearchCV(
    Pipeline([
        ('slr', StandardScaler()),
        ('fs',  SelectKBest(mi_classif_cached)),
        ('clf', LinearSVC())
    ], memory=memory)
)
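A sketch of how the two might be combined, assuming the joblib-cached callable can be wrapped in functools.partial just like the plain function (the 'fs' step name is the one from the snippet above):

from functools import partial

# Build the candidate score functions over the cached callable, so each
# grid point reuses the cache for identical (X, y, n_neighbors) calls.
param_grid = {
    'fs__score_func': [partial(mi_classif_cached, n_neighbors=k)
                       for k in [3, 5, 7]],
}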

@hermidalc
Contributor Author

@aishgrt1 @amueller @jnothman I implemented it locally by modifying https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/univariate_selection.py. You need to change the _BaseFilter base class and the SelectKBest class as shown in this example (all the other Select* classes would change the same way). I added new score_args and score_kwargs parameters and init code to the _BaseFilter __init__ and SelectKBest __init__ methods, and you will see they are used when calling the score function in _BaseFilter.fit(). I don't think anything else is needed.

class _BaseFilter(BaseEstimator, SelectorMixin):
    """Initialize the univariate feature selection.
    
    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
    
    score_args : list
        Additional args to pass to score_func.

    score_kwargs : dict
        kwargs to pass to score_func.
    """

    def __init__(self, score_func, score_args, score_kwargs):
        self.score_func = score_func
        self.score_args = score_args
        self.score_kwargs = score_kwargs

    def fit(self, X, y):
        """Run score function on (X, y) and get the appropriate features.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples.

        y : array-like, shape = [n_samples]
            The target values (class labels in classification, real numbers in
            regression).

        Returns
        -------
        self : object
        """
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)

        if not callable(self.score_func):
            raise TypeError("The score function should be a callable, %s (%s) "
                            "was passed."
                            % (self.score_func, type(self.score_func)))

        self._check_params(X, y)
        score_func_ret = self.score_func(X, y, *self.score_args, **self.score_kwargs)
        if isinstance(score_func_ret, (list, tuple)):
            self.scores_, self.pvalues_ = score_func_ret
            self.pvalues_ = np.asarray(self.pvalues_)
        else:
            self.scores_ = score_func_ret
            self.pvalues_ = None

        self.scores_ = np.asarray(self.scores_)

        return self

    def _check_params(self, X, y):
        pass


class SelectKBest(_BaseFilter):
    """Select features according to the k highest scores.

    Read more in the :ref:`User Guide <univariate_feature_selection>`.

    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
        Default is f_classif (see below "See also"). The default function only
        works with classification tasks.
    
    score_args : list, optional, default=()
        Additional args to pass to score_func.

    score_kwargs : dict, optional, default={}
        kwargs to pass to score_func.
    
    k : int or "all", optional, default=10
        Number of top features to select.
        The "all" option bypasses selection, for use in a parameter search.

    Attributes
    ----------
    scores_ : array-like, shape=(n_features,)
        Scores of features.

    pvalues_ : array-like, shape=(n_features,)
        p-values of feature scores, None if `score_func` returned only scores.

    Examples
    --------
    >>> from sklearn.datasets import load_digits
    >>> from sklearn.feature_selection import SelectKBest, chi2
    >>> X, y = load_digits(return_X_y=True)
    >>> X.shape
    (1797, 64)
    >>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
    >>> X_new.shape
    (1797, 20)

    Notes
    -----
    Ties between features with equal scores will be broken in an unspecified
    way.

    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    mutual_info_classif: Mutual information for a discrete target.
    chi2: Chi-squared stats of non-negative features for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    mutual_info_regression: Mutual information for a continuous target.
    SelectPercentile: Select features based on percentile of the highest scores.
    SelectFpr: Select features based on a false positive rate test.
    SelectFdr: Select features based on an estimated false discovery rate.
    SelectFwe: Select features based on family-wise error rate.
    GenericUnivariateSelect: Univariate feature selector with configurable mode.
    """

    def __init__(self, score_func=f_classif, score_args=(), score_kwargs={}, k=10):
        super(SelectKBest, self).__init__(score_func, score_args, score_kwargs)
        self.k = k

    def _check_params(self, X, y):
        if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
            raise ValueError("k should be >=0, <= n_features = %d; got %r. "
                             "Use k='all' to return all features."
                             % (X.shape[1], self.k))

    def _get_support_mask(self):
        check_is_fitted(self, 'scores_')

        if self.k == 'all':
            return np.ones(self.scores_.shape, dtype=bool)
        elif self.k == 0:
            return np.zeros(self.scores_.shape, dtype=bool)
        else:
            scores = _clean_nans(self.scores_)
            mask = np.zeros(scores.shape, dtype=bool)

            # Request a stable sort. Mergesort takes more memory (~40MB per
            # megafeature on x86-64).
            mask[np.argsort(scores, kind="mergesort")[-self.k:]] = 1
        return mask
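Under this change, usage would look roughly like the following (score_kwargs is the parameter proposed above, not an existing scikit-learn API; dataset and values are just examples):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
# score_kwargs is the proposed parameter from the patch sketched above
fs = SelectKBest(mutual_info_classif, score_kwargs={'n_neighbors': 5}, k=2)
X_new = fs.fit_transform(X, y)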

@hermidalc
Contributor Author

hermidalc commented Nov 11, 2018

I actually ran into a problem while trying to use the modified code within GridSearchCV, testing it with mutual_info_classif and searching over n_neighbors:

search = GridSearchCV(
    Pipeline([
        ('slr', StandardScaler()),
        ('fs',  SelectKBest(mutual_info_classif)),
        ('clf', LinearSVC())
    ], memory=memory),
    param_grid={
        'fs__score_kwargs__n_neighbors': [5, 10, 15]
    }
)

I get the error:

Traceback (most recent call last):
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 444, in _fit_and_score
    estimator.set_params(**parameters)
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/pipeline.py", line 142, in set_params
    self._set_params('steps', **kwargs)
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 49, in _set_params
    super(_BaseComposition, self).set_params(**params)
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/base.py", line 282, in set_params
    valid_params[key].set_params(**sub_params)
  File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/base.py", line 282, in set_params
    valid_params[key].set_params(**sub_params)
AttributeError: 'dict' object has no attribute 'set_params'

So during the search it cannot set a dict param... any advice would be welcomed!

@hermidalc
Contributor Author

hermidalc commented Nov 11, 2018

A score_params parameter?

@jnothman sorry, I just saw this. I agree this is a better parameter name, as it's more consistent with other similar parameters in the codebase.

Though as mentioned above I'm not sure how to get a dict param to work.

@hermidalc
Contributor Author

hermidalc commented Nov 11, 2018

I looked at base.py set_params(), and it looks like I'm probably missing some code to do additional handling so that this would work, or maybe it just cannot work given the current sklearn structure.

There are two issues. First, independent of grid search, there has to be a way to set score_func params, and as you said that should be done with a score_params dict on _BaseFilter.

Second, score_func params need to be grid searchable, and the only way I see that working, given how these things currently work in sklearn, is for score_func to act like a nested estimator, so that you can do, e.g.:

'fs__score_func__n_neighbors': [3, 5, 10]

So I think, to unify both points, all univariate scoring functions need to be proper classes, just like nested estimators. It's probably a good thing anyway; it makes things nicer.

I see in other parts of the codebase there are params that are dicts, namely metric_params in KNeighborsClassifier and kernel_params in some of the kernel approximation classes, and I guess these are not grid searchable either.

def set_params(self, **params):
    """Set the parameters of this estimator.

    The method works on simple estimators as well as on nested objects
    (such as pipelines). The latter have parameters of the form
    ``<component>__<parameter>`` so that it's possible to update each
    component of a nested object.

    Returns
    -------
    self
    """
    if not params:
        # Simple optimization to gain speed (inspect is slow)
        return self
    valid_params = self.get_params(deep=True)

    nested_params = defaultdict(dict)  # grouped by prefix
    for key, value in params.items():
        key, delim, sub_key = key.partition('__')
        if key not in valid_params:
            raise ValueError('Invalid parameter %s for estimator %s. '
                             'Check the list of available parameters '
                             'with `estimator.get_params().keys()`.' %
                             (key, self))
        if delim:
            nested_params[key][sub_key] = value
        else:
            setattr(self, key, value)
            valid_params[key] = value

    for key, sub_params in nested_params.items():
        valid_params[key].set_params(**sub_params)

    return self
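The failure can be reproduced with a toy estimator holding a plain dict parameter (a hypothetical Filter class, for illustration only): set_params() routes the '__' suffix into nested_params and then calls set_params() on the dict itself.

from sklearn.base import BaseEstimator

class Filter(BaseEstimator):
    def __init__(self, score_kwargs=None):
        self.score_kwargs = score_kwargs

f = Filter(score_kwargs={'n_neighbors': 3})
f.set_params(score_kwargs={'n_neighbors': 5})   # fine: the whole dict is replaced
try:
    f.set_params(score_kwargs__n_neighbors=7)   # '__' routing into a plain dict
except AttributeError as e:
    print(e)                                    # 'dict' object has no attribute 'set_params'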

@hermidalc
Contributor Author

@aishgrt1 @amueller @jnothman I rewrote univariate_selection.py and mutual_info_.py to turn all scoring functions into proper *FeatureScorer classes inheriting from BaseEstimator. When fit() is called, the scoring code runs and stores the results in the scores_ and pvalues_ attributes. I also updated all the Select* and GenericUnivariateSelect filter classes to support the new *FeatureScorer objects and changed the score_func parameter to scorer. With the feature scorers as classes we don't need score_params.

All my tests so far show everything works, including seamless grid searching just as with classes that have nested estimators. This is somewhat of a big change because all the feature scoring functions go away. I would be happy to make a pull request; I just need to write some tests and get your feedback on the code and conventions so that things are consistent with the rest of the codebase.

Old                         New
f_classif                   ANOVAFClassifierFeatureScorer
f_regression                ANOVAFRegressorFeatureScorer
chi2                        Chi2FeatureScorer
mutual_info_classif         MutualInfoClassifierFeatureScorer
mutual_info_regression      MutualInfoRegressorFeatureScorer

If you feel the class names are too long, I guess we could shorten them to *Scorer instead of *FeatureScorer; I just didn't want any confusion with other sklearn scorers that are not for feature selection scoring.
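As a rough illustration of the idea (a sketch only, not the actual branch code), such a scorer class could be a thin BaseEstimator wrapper whose constructor parameters become grid-searchable through the usual '__' notation:

from sklearn.base import BaseEstimator
from sklearn.feature_selection import mutual_info_classif

class MutualInfoScorerClassification(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3,
                 copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        # Delegate to the existing function; mutual information has no p-values
        self.scores_ = mutual_info_classif(
            X, y, discrete_features=self.discrete_features,
            n_neighbors=self.n_neighbors, copy=self.copy,
            random_state=self.random_state)
        self.pvalues_ = None
        return self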

@amueller
Member

If you want the double-underscore notation you need estimators, but if you just want to pass on kwargs you don't.
So SelectKBest could have a score_func_args parameter, a dictionary that is passed to score_func. You can implement that without a refactoring.

@amueller
Member

The way to make it work is

'fs__score_kwargs': [{'n_neighbors': i} for i in [5, 10, 15]]
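This works because score_kwargs is then a top-level parameter and set_params() replaces the whole dict, with no '__' routing into it. In a param_grid it would look roughly like this (a sketch, assuming the proposed score_kwargs parameter and an 'fs' pipeline step):

param_grid = {
    'fs__k': [10, 20],
    'fs__score_kwargs': [{'n_neighbors': k} for k in [5, 10, 15]],
}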

@hermidalc
Contributor Author

@amueller do you think it's a good idea, though, to make a pull request with what I've done? To me it's actually cleaner to have Scorer classes.

@hermidalc
Contributor Author

I'm going to send you the pull request; you'll see it's better than the current code and structure. Also, I've shortened the class names.

@amueller
Member

The grid search might be a bit more natural to write down for the scorer classes, but I'm not sure that's a good incentive for such a big change. The current implementation supports what you're trying to achieve, and imho it's pretty straightforward, if not super elegant. Adding func_kwargs would make things a bit easier without a big rewrite.
Keep in mind that we need to keep all the old behaviors.

@hermidalc
Contributor Author

It's actually not such a big rewrite, you will see.

@amueller
Member

Adding 5 new classes is pretty big.

@hermidalc
Contributor Author

Well, I'll make the pull request; please have a look and you can decide. In the long run you will always make changes that aren't backwards compatible, and this is actually a better way to design it.

@amueller
Member

We basically never make changes that are not backwards compatible.

@amueller
Member

but sure, go ahead, we can have a look. But the PR should be backwards-compatible.

@hermidalc
Contributor Author

Would that be impossible?

@amueller
Member

No, it just means there are two interfaces.

@hermidalc
Contributor Author

Ok will do, working on it now

@hermidalc
Contributor Author

hermidalc commented Nov 12, 2018

@amueller sorry, I need to ask some advice. The new *Scorer classes basically wrap the functions, so there is zero duplication of code. But for the Select* classes, how do you define the constructor in Python so that it can accept either a *Scorer object or a legacy function? Can Python have multiple constructors/signatures? Or do I need to keep the score_func parameter and check whether a *Scorer object or a legacy function was passed?

@hermidalc
Contributor Author

@amueller or, sorry, would I have two named parameters, the legacy score_func and scorer? Though if it's passed positionally, which is likely, I will still need to check what it is.

@hermidalc
Contributor Author

Since, as you said, for backwards compatibility we need to keep the score_func named parameter in its first position, I updated the univariate_selection.py _BaseFilter code to check whether we are getting a callable or a Scorer object. I also updated all the docs.

@hermidalc
Contributor Author

@amueller I've passed the feature selection tests, including new ones written for the new classes, except for three common tests for two of the Scorer classes (which wrap their respective scoring functions).

I imagine that since these new classes are estimators they are subject to additional common tests; the errors trace to lines of code in the legacy scoring functions, which were never tested to this level before. Would you have time to help me troubleshoot the errors below?

Other than these errors everything works, and as you asked, all the legacy code and functionality is still there. The new classes are just additional functionality if people prefer to use them. You can see my branch here: https://github.com/hermidalc/scikit-learn/tree/univariate_scorer_classes

sklearn/tests/test_common.py::test_non_meta_estimators[ANOVAFScorerClassification-ANOVAFScorerClassification-check_estimator_sparse_data] FAILED
Estimator  = <class 'sklearn.feature_selection.univariate_selection.ANOVAFScorerClassification'>
check      = <function check_estimator_sparse_data at 0x7fdbe83e3a60>
estimator  = ANOVAFScorerClassification()
name       = 'ANOVAFScorerClassification'

sklearn/tests/test_common.py:101: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/estimator_checks.py:508: in check_estimator_sparse_data
    estimator.fit(X, y)
sklearn/feature_selection/univariate_selection.py:365: in fit
    self.scores_, self.pvalues_ = f_classif(X, y)
sklearn/feature_selection/univariate_selection.py:148: in f_classif
    args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <iterator object at 0x7fdbc7fe3e48>

>   args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
E   TypeError: only integer scalar arrays can be converted to a scalar index

.0         = <iterator object at 0x7fdbc7fe3e48>
X          = <40x10 sparse matrix of type '<class 'numpy.float64'>'
	with 74 stored elements in COOrdinate format>
k          = 0
y          = array([1, 3, 0, 3, 3, 1, 1, 0, 2, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 2, 3, 3,
       1, 0, 2, 1, 3, 0, 3, 1, 0, 2, 0, 2, 2, 1, 2, 1, 2, 1])

sklearn/feature_selection/univariate_selection.py:148: TypeError
--------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------
Estimator ANOVAFScorerClassification doesn't seem to fail gracefully on sparse data: error message state explicitly that sparse input is not supported if this is not the case.
sklearn/tests/test_common.py::test_non_meta_estimators[ANOVAFScorerRegression-ANOVAFScorerRegression-check_dtype_object] FAILED
Estimator  = <class 'sklearn.feature_selection.univariate_selection.ANOVAFScorerRegression'>
check      = <function check_dtype_object at 0x7fdbe83e3ea0>
estimator  = ANOVAFScorerRegression(center=True)
name       = 'ANOVAFScorerRegression'

sklearn/tests/test_common.py:101: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/testing.py:350: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:627: in check_dtype_object
    estimator.fit(X, y.astype(object))
sklearn/feature_selection/univariate_selection.py:427: in fit
    self.scores_, self.pvalues_ = f_regression(X, y, self.center)
sklearn/feature_selection/univariate_selection.py:304: in f_regression
    pv = stats.f.sf(F, 1, degrees_of_freedom)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <scipy.stats._continuous_distns.f_gen object at 0x7fdbf68322e8>
x = array([561.4483878151119, 0.42515515900446055, 2.883877366500464,
       0.17795476559942922, 0.12501597495086797, 0.0...483376599,
       2.406504550121106, 0.5829428400264207, 0.30688161329297486,
       1.3495053261207668], dtype=object)
args = (array(1), array(38)), kwds = {}, loc = array(0), scale = array(1), dtyp = dtype('O'), cond0 = True
cond1 = array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
cond2 = array([False, False, False, False, False, False, False, False, False,
       False]), cond = array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
output = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=object)

    def sf(self, x, *args, **kwds):
        """
            Survival function (1 - `cdf`) at x of the given RV.
    
            Parameters
            ----------
            x : array_like
                quantiles
            arg1, arg2, arg3,... : array_like
                The shape parameter(s) for the distribution (see docstring of the
                instance object for more information)
            loc : array_like, optional
                location parameter (default=0)
            scale : array_like, optional
                scale parameter (default=1)
    
            Returns
            -------
            sf : array_like
                Survival function evaluated at x
    
            """
        args, loc, scale = self._parse_args(*args, **kwds)
        x, loc, scale = map(asarray, (x, loc, scale))
        args = tuple(map(asarray, args))
        dtyp = np.find_common_type([x.dtype, np.float64], [])
        x = np.asarray((x - loc)/scale, dtype=dtyp)
        cond0 = self._argcheck(*args) & (scale > 0)
        cond1 = self._open_support_mask(x) & (scale > 0)
        cond2 = cond0 & (x <= self.a)
        cond = cond0 & cond1
        output = zeros(shape(cond), dtyp)
>       place(output, (1-cond0)+np.isnan(x), self.badvalue)
E       TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

args       = (array(1), array(38))
cond       = array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
cond0      = True
cond1      = array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
cond2      = array([False, False, False, False, False, False, False, False, False,
       False])
dtyp       = dtype('O')
kwds       = {}
loc        = array(0)
output     = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=object)
scale      = array(1)
self       = <scipy.stats._continuous_distns.f_gen object at 0x7fdbf68322e8>
x          = array([561.4483878151119, 0.42515515900446055, 2.883877366500464,
       0.17795476559942922, 0.12501597495086797, 0.0...483376599,
       2.406504550121106, 0.5829428400264207, 0.30688161329297486,
       1.3495053261207668], dtype=object)

../../../../soft/anaconda3/5.2.0/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1824: TypeError
sklearn/tests/test_common.py::test_non_meta_estimators[MutualInfoScorerClassification-MutualInfoScorerClassification-check_fit2d_1sample] FAILED
Estimator  = <class 'sklearn.feature_selection.univariate_selection.MutualInfoScorerClassification'>
check      = <function check_fit2d_1sample at 0x7fdbe83de620>
estimator  = MutualInfoScorerClassification(copy=True, discrete_features='auto',
                n_neighbors=3, random_state=None)
name       = 'MutualInfoScorerClassification'

sklearn/tests/test_common.py:101: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/testing.py:350: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:847: in check_fit2d_1sample
    raise e
sklearn/utils/estimator_checks.py:844: in check_fit2d_1sample
    estimator.fit(X, y)
sklearn/feature_selection/univariate_selection.py:593: in fit
    self.random_state)
sklearn/feature_selection/mutual_info_.py:450: in mutual_info_classif
    copy, random_state)
sklearn/feature_selection/mutual_info_.py:289: in _estimate_mi
    x, discrete_feature in moves.zip(_iterate_columns(X), discrete_mask)]
sklearn/feature_selection/mutual_info_.py:289: in <listcomp>
    x, discrete_feature in moves.zip(_iterate_columns(X), discrete_mask)]
sklearn/feature_selection/mutual_info_.py:161: in _compute_mi
    return _compute_mi_cd(x, y, n_neighbors)
sklearn/feature_selection/mutual_info_.py:139: in _compute_mi_cd
    nn.fit(c)
sklearn/neighbors/base.py:912: in fit
    return self._fit(X)
sklearn/neighbors/base.py:209: in _fit
    X = check_array(X, accept_sparse='csr')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

array = array([], shape=(0, 1), dtype=float64), accept_sparse = 'csr', accept_large_sparse = True, dtype = None, order = None, copy = False, force_all_finite = True, ensure_2d = True
allow_nd = False, ensure_min_samples = 1, ensure_min_features = 1, warn_on_dtype = False, estimator = None

    def check_array(array, accept_sparse=False, accept_large_sparse=True,
                    dtype="numeric", order=None, copy=False, force_all_finite=True,
                    ensure_2d=True, allow_nd=False, ensure_min_samples=1,
                    ensure_min_features=1, warn_on_dtype=False, estimator=None):
    
        """Input validation on an array, list, sparse matrix or similar.
    
        By default, the input is checked to be a non-empty 2D array containing
        only finite values. If the dtype of the array is object, attempt
        converting to float, raising on failure.
    
        Parameters
        ----------
        array : object
            Input object to check / convert.
    
        accept_sparse : string, boolean or list/tuple of strings (default=False)
            String[s] representing allowed sparse matrix formats, such as 'csc',
            'csr', etc. If the input is sparse but not in the allowed format,
            it will be converted to the first listed format. True allows the input
            to be any format. False means that a sparse matrix input will
            raise an error.
    
            .. deprecated:: 0.19
               Passing 'None' to parameter ``accept_sparse`` in methods is
               deprecated in version 0.19 "and will be removed in 0.21. Use
               ``accept_sparse=False`` instead.
    
        accept_large_sparse : bool (default=True)
            If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by
            accept_sparse, accept_large_sparse=False will cause it to be accepted
            only if its indices are stored with a 32-bit dtype.
    
            .. versionadded:: 0.20
    
        dtype : string, type, list of types or None (default="numeric")
            Data type of result. If None, the dtype of the input is preserved.
            If "numeric", dtype is preserved unless array.dtype is object.
            If dtype is a list of types, conversion on the first type is only
            performed if the dtype of the input is not in the list.
    
        order : 'F', 'C' or None (default=None)
            Whether an array will be forced to be fortran or c-style.
            When order is None (default), then if copy=False, nothing is ensured
            about the memory layout of the output array; otherwise (copy=True)
            the memory layout of the returned array is kept as close as possible
            to the original array.
    
        copy : boolean (default=False)
            Whether a forced copy will be triggered. If copy=False, a copy might
            be triggered by a conversion.
    
        force_all_finite : boolean or 'allow-nan', (default=True)
            Whether to raise an error on np.inf and np.nan in array. The
            possibilities are:
    
            - True: Force all values of array to be finite.
            - False: accept both np.inf and np.nan in array.
            - 'allow-nan': accept only np.nan values in array. Values cannot
              be infinite.
    
            .. versionadded:: 0.20
               ``force_all_finite`` accepts the string ``'allow-nan'``.
    
        ensure_2d : boolean (default=True)
            Whether to raise a value error if array is not 2D.
    
        allow_nd : boolean (default=False)
            Whether to allow array.ndim > 2.
    
        ensure_min_samples : int (default=1)
            Make sure that the array has a minimum number of samples in its first
            axis (rows for a 2D array). Setting to 0 disables this check.
    
        ensure_min_features : int (default=1)
            Make sure that the 2D array has some minimum number of features
            (columns). The default value of 1 rejects empty datasets.
            This check is only enforced when the input data has effectively 2
            dimensions or is originally 1D and ``ensure_2d`` is True. Setting to 0
            disables this check.
    
        warn_on_dtype : boolean (default=False)
            Raise DataConversionWarning if the dtype of the input data structure
            does not match the requested dtype, causing a memory copy.
    
        estimator : str or estimator instance (default=None)
            If passed, include the name of the estimator in warning messages.
    
        Returns
        -------
        array_converted : object
            The converted and validated array.
    
        """
        # accept_sparse 'None' deprecation check
        if accept_sparse is None:
            warnings.warn(
                "Passing 'None' to parameter 'accept_sparse' in methods "
                "check_array and check_X_y is deprecated in version 0.19 "
                "and will be removed in 0.21. Use 'accept_sparse=False' "
                " instead.", DeprecationWarning)
            accept_sparse = False
    
        # store reference to original array to check if copy is needed when
        # function returns
        array_orig = array
    
        # store whether originally we wanted numeric dtype
        dtype_numeric = isinstance(dtype, six.string_types) and dtype == "numeric"
    
        dtype_orig = getattr(array, "dtype", None)
        if not hasattr(dtype_orig, 'kind'):
            # not a data type (e.g. a column named dtype in a pandas DataFrame)
            dtype_orig = None
    
        # check if the object contains several dtypes (typically a pandas
        # DataFrame), and store them. If not, store None.
        dtypes_orig = None
        if hasattr(array, "dtypes") and hasattr(array, "__array__"):
            dtypes_orig = np.array(array.dtypes)
    
        if dtype_numeric:
            if dtype_orig is not None and dtype_orig.kind == "O":
                # if input is object, convert to float.
                dtype = np.float64
            else:
                dtype = None
    
        if isinstance(dtype, (list, tuple)):
            if dtype_orig is not None and dtype_orig in dtype:
                # no dtype conversion required
                dtype = None
            else:
                # dtype conversion required. Let's select the first element of the
                # list of accepted types.
                dtype = dtype[0]
    
        if force_all_finite not in (True, False, 'allow-nan'):
            raise ValueError('force_all_finite should be a bool or "allow-nan"'
                             '. Got {!r} instead'.format(force_all_finite))
    
        if estimator is not None:
            if isinstance(estimator, six.string_types):
                estimator_name = estimator
            else:
                estimator_name = estimator.__class__.__name__
        else:
            estimator_name = "Estimator"
        context = " by %s" % estimator_name if estimator is not None else ""
    
        if sp.issparse(array):
            _ensure_no_complex_data(array)
            array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
                                          dtype=dtype, copy=copy,
                                          force_all_finite=force_all_finite,
                                          accept_large_sparse=accept_large_sparse)
        else:
            # If np.array(..) gives ComplexWarning, then we convert the warning
            # to an error. This is needed because specifying a non complex
            # dtype to the function converts complex to real dtype,
            # thereby passing the test made in the lines following the scope
            # of warnings context manager.
            with warnings.catch_warnings():
                try:
                    warnings.simplefilter('error', ComplexWarning)
                    array = np.asarray(array, dtype=dtype, order=order)
                except ComplexWarning:
                    raise ValueError("Complex data not supported\n"
                                     "{}\n".format(array))
    
            # It is possible that the np.array(..) gave no warning. This happens
            # when no dtype conversion happened, for example dtype = None. The
            # result is that np.array(..) produces an array of complex dtype
            # and we need to catch and raise exception for such cases.
            _ensure_no_complex_data(array)
    
            if ensure_2d:
                # If input is scalar raise error
                if array.ndim == 0:
                    raise ValueError(
                        "Expected 2D array, got scalar array instead:\narray={}.\n"
                        "Reshape your data either using array.reshape(-1, 1) if "
                        "your data has a single feature or array.reshape(1, -1) "
                        "if it contains a single sample.".format(array))
                # If input is 1D raise error
                if array.ndim == 1:
                    raise ValueError(
                        "Expected 2D array, got 1D array instead:\narray={}.\n"
                        "Reshape your data either using array.reshape(-1, 1) if "
                        "your data has a single feature or array.reshape(1, -1) "
                        "if it contains a single sample.".format(array))
    
            # in the future np.flexible dtypes will be handled like object dtypes
            if dtype_numeric and np.issubdtype(array.dtype, np.flexible):
                warnings.warn(
                    "Beginning in version 0.22, arrays of bytes/strings will be "
                    "converted to decimal numbers if dtype='numeric'. "
                    "It is recommended that you convert the array to "
                    "a float dtype before using it in scikit-learn, "
                    "for example by using "
                    "your_array = your_array.astype(np.float64).",
                    FutureWarning)
    
            # make sure we actually converted to numeric:
            if dtype_numeric and array.dtype.kind == "O":
                array = array.astype(np.float64)
            if not allow_nd and array.ndim >= 3:
                raise ValueError("Found array with dim %d. %s expected <= 2."
                                 % (array.ndim, estimator_name))
            if force_all_finite:
                _assert_all_finite(array,
                                   allow_nan=force_all_finite == 'allow-nan')
    
        shape_repr = _shape_repr(array.shape)
        if ensure_min_samples > 0:
            n_samples = _num_samples(array)
            if n_samples < ensure_min_samples:
                raise ValueError("Found array with %d sample(s) (shape=%s) while a"
                                 " minimum of %d is required%s."
                                 % (n_samples, shape_repr, ensure_min_samples,
>                                   context))
E               ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

accept_large_sparse = True
accept_sparse = 'csr'
allow_nd   = False
array      = array([], shape=(0, 1), dtype=float64)
array_orig = array([], shape=(0, 1), dtype=float64)
context    = ''
copy       = False
dtype      = None
dtype_numeric = True
dtype_orig = dtype('float64')
dtypes_orig = None
ensure_2d  = True
ensure_min_features = 1
ensure_min_samples = 1
estimator  = None
estimator_name = 'Estimator'
force_all_finite = True
n_samples  = 0
order      = None
shape_repr = '(0, 1)'
warn_on_dtype = False

sklearn/utils/validation.py:582: ValueError

@jnothman
Member

Please, @hermidalc, open a pull request rather than posting code and test logs here. GitHub will facilitate tangible discussion much better on a pull request where you add commits.

But I really doubt that introducing a Scorer class will result in code that gets merged. Please try the score_params parameter to Select* instead

@hermidalc
Contributor Author

@jnothman @amueller all feature selection tests pass, but the three common tests above still fail. As you can see, I believe they are issues with the legacy scoring function code, which was never tested this way since it wasn't part of an estimator before.

@hermidalc
Contributor Author

hermidalc commented Nov 13, 2018

But I really doubt that introducing a Scorer class will result in code that gets merged. Please try the score_params parameter to Select* instead

While that can fix the immediate limitation, it's still a somewhat ugly and inconsistent way to encode the param grid when searching such parameters. It's a better and more future-proof design to make scorers classes that are estimators, while still supporting the legacy functions for backwards compatibility, though all future scorers should use the new design. I personally think it's a win-win.

@jnothman
Member

jnothman commented Nov 13, 2018 via email
