Univariate feature selection methods should allow passing kwargs to score_func callable #12468
agree, for {'score_func': [partial(mutual_info_classif, n_neighbors=k) for k in [3, 5, 7]]}, but maybe doing it more directly would be nice. |
A score_params parameter?
|
Shall I take this? |
I suppose you can take this. I don't consider it high priority, as you can certainly use partial to accomplish the same thing. |
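The partial approach can be sketched without touching sklearn at all; mi_stub below is a hypothetical stand-in for mutual_info_classif (which accepts an n_neighbors keyword), just to show that each grid candidate becomes a fully configured callable:

```python
from functools import partial

# Hypothetical stand-in for mutual_info_classif; it just echoes the
# n_neighbors setting once per feature column.
def mi_stub(X, y, n_neighbors=3):
    return [float(n_neighbors)] * len(X[0])

# One ready-to-call scorer per n_neighbors value; no score_params needed.
candidates = [partial(mi_stub, n_neighbors=k) for k in [3, 5, 7]]
scores = [f([[0.0, 1.0]], [0]) for f in candidates]
# scores == [[3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
```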
@amueller @jnothman Where do I put the above workaround in GridSearchCV? Also, how do I get it to work with joblib Memory, since I usually wrap feature selection scoring functions like mutual_info_classif in a cache, e.g.:
memory = Memory(cachedir=mkdtemp(dir='/tmp'), verbose=0)
mi_classif_cached = memory.cache(mutual_info_classif)
search = GridSearchCV(
    Pipeline([
        ('slr', StandardScaler()),
        ('fs', SelectKBest(mi_classif_cached)),
        ('clf', LinearSVC())
    ], memory=memory)
) |
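One way to answer the placement question: the partial-wrapped functions themselves become the grid candidates for the fs__score_func parameter, and the cached mi_classif_cached could be wrapped the same way. A minimal sketch without the Memory cache (dataset and hyperparameters are made up for illustration):

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=8, random_state=0)

# Each candidate scorer has n_neighbors pre-bound, so no extra
# score_params plumbing is needed in SelectKBest itself.
search = GridSearchCV(
    Pipeline([('slr', StandardScaler()),
              ('fs', SelectKBest(k=4)),
              ('clf', LinearSVC(random_state=0))]),
    param_grid={'fs__score_func': [partial(mutual_info_classif, n_neighbors=k)
                                   for k in [3, 5]]},
    cv=3)
search.fit(X, y)
best_k = search.best_params_['fs__score_func'].keywords['n_neighbors']
```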
@aishgrt1 @amueller @jnothman I implemented it locally by modifying https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/univariate_selection.py. You need to change the _BaseFilter base class and the SelectKBest class as shown in this example (all the Select* classes change the same way). I added new score_args and score_kwargs parameters and init code to _BaseFilter.__init__ and SelectKBest.__init__, and you will see they are used when calling the score function in _BaseFilter.fit(). I don't think anything else is needed.
class _BaseFilter(BaseEstimator, SelectorMixin):
"""Initialize the univariate feature selection.
Parameters
----------
score_func : callable
Function taking two arrays X and y, and returning a pair of arrays
(scores, pvalues) or a single array with scores.
score_args : tuple
Additional positional args to pass to score_func.
score_kwargs : dict
Keyword args to pass to score_func.
"""
def __init__(self, score_func, score_args, score_kwargs):
self.score_func = score_func
self.score_args = score_args
self.score_kwargs = score_kwargs
def fit(self, X, y):
"""Run score function on (X, y) and get the appropriate features.
Parameters
----------
X : array-like, shape = [n_samples, n_features]
The training input samples.
y : array-like, shape = [n_samples]
The target values (class labels in classification, real numbers in
regression).
Returns
-------
self : object
"""
X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
if not callable(self.score_func):
raise TypeError("The score function should be a callable, %s (%s) "
"was passed."
% (self.score_func, type(self.score_func)))
self._check_params(X, y)
score_func_ret = self.score_func(X, y, *self.score_args, **self.score_kwargs)
if isinstance(score_func_ret, (list, tuple)):
self.scores_, self.pvalues_ = score_func_ret
self.pvalues_ = np.asarray(self.pvalues_)
else:
self.scores_ = score_func_ret
self.pvalues_ = None
self.scores_ = np.asarray(self.scores_)
return self
def _check_params(self, X, y):
pass
class SelectKBest(_BaseFilter):
"""Select features according to the k highest scores.
Read more in the :ref:`User Guide <univariate_feature_selection>`.
Parameters
----------
score_func : callable
Function taking two arrays X and y, and returning a pair of arrays
(scores, pvalues) or a single array with scores.
Default is f_classif (see below "See also"). The default function only
works with classification tasks.
score_args : tuple, optional, default=()
Additional positional args to pass to score_func.
score_kwargs : dict, optional, default={}
Keyword args to pass to score_func.
k : int or "all", optional, default=10
Number of top features to select.
The "all" option bypasses selection, for use in a parameter search.
Attributes
----------
scores_ : array-like, shape=(n_features,)
Scores of features.
pvalues_ : array-like, shape=(n_features,)
p-values of feature scores, None if `score_func` returned only scores.
Examples
--------
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
Notes
-----
Ties between features with equal scores will be broken in an unspecified
way.
See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.
mutual_info_classif: Mutual information for a discrete target.
chi2: Chi-squared stats of non-negative features for classification tasks.
f_regression: F-value between label/feature for regression tasks.
mutual_info_regression: Mutual information for a continuous target.
SelectPercentile: Select features based on percentile of the highest scores.
SelectFpr: Select features based on a false positive rate test.
SelectFdr: Select features based on an estimated false discovery rate.
SelectFwe: Select features based on family-wise error rate.
GenericUnivariateSelect: Univariate feature selector with configurable mode.
"""
def __init__(self, score_func=f_classif, score_args=(), score_kwargs={}, k=10):
super(SelectKBest, self).__init__(score_func, score_args, score_kwargs)
self.k = k
def _check_params(self, X, y):
if not (self.k == "all" or 0 <= self.k <= X.shape[1]):
raise ValueError("k should be >=0, <= n_features = %d; got %r. "
"Use k='all' to return all features."
% (X.shape[1], self.k))
def _get_support_mask(self):
check_is_fitted(self, 'scores_')
if self.k == 'all':
return np.ones(self.scores_.shape, dtype=bool)
elif self.k == 0:
return np.zeros(self.scores_.shape, dtype=bool)
else:
scores = _clean_nans(self.scores_)
mask = np.zeros(scores.shape, dtype=bool)
# Request a stable sort. Mergesort takes more memory (~40MB per
# megafeature on x86-64).
mask[np.argsort(scores, kind="mergesort")[-self.k:]] = 1
return mask |
I actually ran into a problem while trying to use the modified code within GridSearchCV, testing it with mutual_info_classif and trying to search n_neighbors:
search = GridSearchCV(
    Pipeline([
        ('slr', StandardScaler()),
        ('fs', SelectKBest(mutual_info_classif)),
        ('clf', LinearSVC())
    ], memory=memory),
    param_grid={
        'fs__score_kwargs__n_neighbors': [5, 10, 15]
    }
)
I get the error: Traceback (most recent call last):
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 350, in __call__
return self.func(*args, **kwargs)
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 444, in _fit_and_score
estimator.set_params(**parameters)
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/pipeline.py", line 142, in set_params
self._set_params('steps', **kwargs)
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 49, in _set_params
super(_BaseComposition, self).set_params(**params)
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/base.py", line 282, in set_params
valid_params[key].set_params(**sub_params)
File "/home/hermidalc/soft/anaconda3/default/lib/python3.6/site-packages/sklearn/base.py", line 282, in set_params
valid_params[key].set_params(**sub_params)
AttributeError: 'dict' object has no attribute 'set_params'
So during the search it cannot set a dict param... any advice would be welcome! |
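The traceback makes sense given how set_params routes double-underscore keys: each `prefix__rest` key is delegated to the sub-object named `prefix` via that object's own set_params, and a plain dict has no such method. A minimal toy reimplementation (not sklearn code, names hypothetical) reproduces the failure:

```python
class Toy:
    """Minimal stand-in for a BaseEstimator with double-underscore routing."""
    def __init__(self, score_kwargs=None):
        self.score_kwargs = score_kwargs if score_kwargs is not None else {}

    def set_params(self, **params):
        for key, value in params.items():
            name, _, rest = key.partition('__')
            if rest:
                # sklearn delegates to the sub-object's own set_params here;
                # a dict attribute has no set_params, hence the AttributeError.
                getattr(self, name).set_params(**{rest: value})
            else:
                setattr(self, name, value)
        return self

t = Toy()
try:
    t.set_params(score_kwargs__n_neighbors=5)
except AttributeError as exc:
    error = str(exc)  # "'dict' object has no attribute 'set_params'"

# Replacing the whole dict in one set_params call works fine:
t.set_params(score_kwargs={'n_neighbors': 5})
```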
@jnothman sorry, I just saw this. I agree this is a better parameter name, as it's more consistent with other similar parameters in the codebase. Though, as mentioned above, I'm not sure how to get a dict param to work. |
I looked at base.py set_params() and it looks like I'm probably missing some code to do additional handling for this to work, or maybe it just cannot work given the current sklearn structure. There are two issues. First, independent of grid search, there has to be a way to set score_func params, and that, like you said, should be done with a score_params dict on _BaseFilter. Second, score_func params need to be grid searchable, and the only way I see that working, given how these things currently work in sklearn, is for score_func to act like a nested estimator so that you can do, e.g.: fs__score_func__n_neighbors: [3, 5, 10]. So I think, to unify both points, all univariate scoring functions need to be proper classes, just like nested estimators. It's probably a good thing anyway; it makes things nicer. I see in other parts of the codebase there are params that are dicts, namely metric_params in KNeighborsClassifier and kernel_params in some of the kernel approximation classes, and I guess these are not grid searchable either. (Lines 189 to 224 in eb36c28)
|
@aishgrt1 @amueller @jnothman guys, I rewrote univariate_selection.py and mutual_info_.py to turn all scoring functions into proper *FeatureScorer classes inheriting from BaseEstimator. When fit() is called, that is where the scoring code runs and stores results in the scores_ and pvalues_ attributes. I also updated all the Select* and GenericUnivariateSelect filter classes to support the new *FeatureScorer objects, and changed the score_func parameter to scorer. With the feature scorers as classes, we don't need score_params. All my tests so far show everything works seamlessly with grid searching, just like classes that have nested estimators do. This is somewhat of a big change because all the feature scoring functions go away. I would be happy to make a pull request; I just need to write some tests and get your feedback on the code and conventions so that things stay consistent with the rest of the codebase.
If you feel the class names are too long, I guess we could shorten them to *Scorer instead of *FeatureScorer; I just didn't want any confusion with other sklearn scorers that are not for feature selection scoring. |
If you want the double underscore notation you need estimators, but if you just want to pass on kwargs you don't need to. |
The way to make it work is 'fs__score_kwargs': [{'n_neighbors': i} for i in [5, 10, 15]] |
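Concretely, each whole kwargs dict becomes one grid candidate, so set_params only ever replaces the dict and never has to recurse into it (the fs__score_kwargs name assumes the locally modified SelectKBest above):

```python
# One dict per candidate; 'fs__score_kwargs' is set atomically per fit,
# avoiding the 'dict' object has no attribute 'set_params' error.
param_grid = {'fs__score_kwargs': [{'n_neighbors': k} for k in [5, 10, 15]]}
```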
@amueller do you think it's a good idea, though, to make a pull request with what I've done? To me it's actually cleaner to have Scorer classes |
I'm going to send you the pull request; you'll see it's better than the current code and structure. I've also shortened the class names. |
The grid-search might be a bit more natural to write down for the scorer classes, but I'm not sure that's a good incentive for such a big change. The current implementation supports what you're trying to achieve, and IMHO it's pretty straightforward, if not super elegant. |
It's actually not such a big rewrite you will see. |
Adding 5 new classes is pretty big. |
Well, I'll make the pull request; please have a look and you can decide. In the long run you will always make changes that aren't backwards compatible, and this is actually a better way to design it. |
We basically never make changes that are not backwards compatible. |
but sure, go ahead, we can have a look. But the PR should be backwards-compatible. |
Would that be impossible? |
no, it just means there's two interfaces. |
Ok will do, working on it now |
@amueller sorry, I need to ask some advice. The new *Scorer classes basically wrap the functions, so there is zero duplication of code. But for the Select* classes, how do you define the constructor in Python so that it can accept either a *Scorer object or a legacy function? Can Python have multiple constructors/signatures? Or do I need to keep the score_func parameter and check whether a *Scorer object or a legacy function was passed? |
@amueller or, sorry, would I have two named parameters, the legacy score_func and a new scorer? Though if it's passed positionally, which is likely the case, I will still need to check what it is. |
Since, as you said, for backwards compatibility we need to keep the score_func named parameter and its first position, I updated the univariate_selection.py _BaseFilter code to check whether we are getting a callable or a Scorer object. I also updated all the docs. |
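A sketch of the kind of dispatch _BaseFilter.fit could do to accept both interfaces (names and structure are hypothetical, not the actual branch code):

```python
def _run_scoring(score_func, X, y):
    """Run either a new-style Scorer object or a legacy scoring function."""
    if hasattr(score_func, 'fit'):
        # New-style Scorer: fit it, then read the scores_/pvalues_ attributes.
        score_func.fit(X, y)
        return score_func.scores_, getattr(score_func, 'pvalues_', None)
    if callable(score_func):
        # Legacy function: returns (scores, pvalues) or just scores.
        ret = score_func(X, y)
        return ret if isinstance(ret, tuple) else (ret, None)
    raise TypeError("score_func should be a callable or a Scorer, "
                    "%r was passed" % (score_func,))

class FakeScorer:
    """Toy Scorer standing in for e.g. ANOVAFScorerClassification."""
    def fit(self, X, y):
        self.scores_ = [1.0, 2.0]
        return self

scores, pvalues = _run_scoring(FakeScorer(), [[0, 1]], [0])
legacy_scores, legacy_pvalues = _run_scoring(lambda X, y: [3.0], [[0, 1]], [0])
```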
@amueller All feature selection tests pass, including new ones written for the new classes, except for three common tests on two of the Scorer classes (which wrap their respective scoring functions). I imagine that since these new classes are estimators they are subject to additional common tests; the errors trace to lines of code in the legacy scoring functions, so that code was never tested to this level before. Would you have time to help me troubleshoot the errors below? Other than these errors everything works, and, as you asked, all the legacy code and functionality is still there; the new classes are just additional functionality if people prefer to use them. You can see my branch here: https://github.com/hermidalc/scikit-learn/tree/univariate_scorer_classes
sklearn/tests/test_common.py::test_non_meta_estimators[ANOVAFScorerClassification-ANOVAFScorerClassification-check_estimator_sparse_data] FAILED Estimator = <class 'sklearn.feature_selection.univariate_selection.ANOVAFScorerClassification'>
check = <function check_estimator_sparse_data at 0x7fdbe83e3a60>
estimator = ANOVAFScorerClassification()
name = 'ANOVAFScorerClassification'
sklearn/tests/test_common.py:101:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/estimator_checks.py:508: in check_estimator_sparse_data
estimator.fit(X, y)
sklearn/feature_selection/univariate_selection.py:365: in fit
self.scores_, self.pvalues_ = f_classif(X, y)
sklearn/feature_selection/univariate_selection.py:148: in f_classif
args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.0 = <iterator object at 0x7fdbc7fe3e48>
> args = [X[safe_mask(X, y == k)] for k in np.unique(y)]
E TypeError: only integer scalar arrays can be converted to a scalar index
.0 = <iterator object at 0x7fdbc7fe3e48>
X = <40x10 sparse matrix of type '<class 'numpy.float64'>'
with 74 stored elements in COOrdinate format>
k = 0
y = array([1, 3, 0, 3, 3, 1, 1, 0, 2, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 2, 3, 3,
1, 0, 2, 1, 3, 0, 3, 1, 0, 2, 0, 2, 2, 1, 2, 1, 2, 1])
sklearn/feature_selection/univariate_selection.py:148: TypeError
--------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------
Estimator ANOVAFScorerClassification doesn't seem to fail gracefully on sparse data: error message state explicitly that sparse input is not supported if this is not the case. sklearn/tests/test_common.py::test_non_meta_estimators[ANOVAFScorerRegression-ANOVAFScorerRegression-check_dtype_object] FAILED Estimator = <class 'sklearn.feature_selection.univariate_selection.ANOVAFScorerRegression'>
check = <function check_dtype_object at 0x7fdbe83e3ea0>
estimator = ANOVAFScorerRegression(center=True)
name = 'ANOVAFScorerRegression'
sklearn/tests/test_common.py:101:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/testing.py:350: in wrapper
return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:627: in check_dtype_object
estimator.fit(X, y.astype(object))
sklearn/feature_selection/univariate_selection.py:427: in fit
self.scores_, self.pvalues_ = f_regression(X, y, self.center)
sklearn/feature_selection/univariate_selection.py:304: in f_regression
pv = stats.f.sf(F, 1, degrees_of_freedom)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <scipy.stats._continuous_distns.f_gen object at 0x7fdbf68322e8>
x = array([561.4483878151119, 0.42515515900446055, 2.883877366500464,
0.17795476559942922, 0.12501597495086797, 0.0...483376599,
2.406504550121106, 0.5829428400264207, 0.30688161329297486,
1.3495053261207668], dtype=object)
args = (array(1), array(38)), kwds = {}, loc = array(0), scale = array(1), dtyp = dtype('O'), cond0 = True
cond1 = array([ True, True, True, True, True, True, True, True, True,
True])
cond2 = array([False, False, False, False, False, False, False, False, False,
False]), cond = array([ True, True, True, True, True, True, True, True, True,
True])
output = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=object)
def sf(self, x, *args, **kwds):
"""
Survival function (1 - `cdf`) at x of the given RV.
Parameters
----------
x : array_like
quantiles
arg1, arg2, arg3,... : array_like
The shape parameter(s) for the distribution (see docstring of the
instance object for more information)
loc : array_like, optional
location parameter (default=0)
scale : array_like, optional
scale parameter (default=1)
Returns
-------
sf : array_like
Survival function evaluated at x
"""
args, loc, scale = self._parse_args(*args, **kwds)
x, loc, scale = map(asarray, (x, loc, scale))
args = tuple(map(asarray, args))
dtyp = np.find_common_type([x.dtype, np.float64], [])
x = np.asarray((x - loc)/scale, dtype=dtyp)
cond0 = self._argcheck(*args) & (scale > 0)
cond1 = self._open_support_mask(x) & (scale > 0)
cond2 = cond0 & (x <= self.a)
cond = cond0 & cond1
output = zeros(shape(cond), dtyp)
> place(output, (1-cond0)+np.isnan(x), self.badvalue)
E TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
args = (array(1), array(38))
cond = array([ True, True, True, True, True, True, True, True, True,
True])
cond0 = True
cond1 = array([ True, True, True, True, True, True, True, True, True,
True])
cond2 = array([False, False, False, False, False, False, False, False, False,
False])
dtyp = dtype('O')
kwds = {}
loc = array(0)
output = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=object)
scale = array(1)
self = <scipy.stats._continuous_distns.f_gen object at 0x7fdbf68322e8>
x = array([561.4483878151119, 0.42515515900446055, 2.883877366500464,
0.17795476559942922, 0.12501597495086797, 0.0...483376599,
2.406504550121106, 0.5829428400264207, 0.30688161329297486,
1.3495053261207668], dtype=object)
../../../../soft/anaconda3/5.2.0/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1824: TypeError sklearn/tests/test_common.py::test_non_meta_estimators[MutualInfoScorerClassification-MutualInfoScorerClassification-check_fit2d_1sample] FAILED Estimator = <class 'sklearn.feature_selection.univariate_selection.MutualInfoScorerClassification'>
check = <function check_fit2d_1sample at 0x7fdbe83de620>
estimator = MutualInfoScorerClassification(copy=True, discrete_features='auto',
n_neighbors=3, random_state=None)
name = 'MutualInfoScorerClassification'
sklearn/tests/test_common.py:101:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/testing.py:350: in wrapper
return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:847: in check_fit2d_1sample
raise e
sklearn/utils/estimator_checks.py:844: in check_fit2d_1sample
estimator.fit(X, y)
sklearn/feature_selection/univariate_selection.py:593: in fit
self.random_state)
sklearn/feature_selection/mutual_info_.py:450: in mutual_info_classif
copy, random_state)
sklearn/feature_selection/mutual_info_.py:289: in _estimate_mi
x, discrete_feature in moves.zip(_iterate_columns(X), discrete_mask)]
sklearn/feature_selection/mutual_info_.py:289: in <listcomp>
x, discrete_feature in moves.zip(_iterate_columns(X), discrete_mask)]
sklearn/feature_selection/mutual_info_.py:161: in _compute_mi
return _compute_mi_cd(x, y, n_neighbors)
sklearn/feature_selection/mutual_info_.py:139: in _compute_mi_cd
nn.fit(c)
sklearn/neighbors/base.py:912: in fit
return self._fit(X)
sklearn/neighbors/base.py:209: in _fit
X = check_array(X, accept_sparse='csr')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
array = array([], shape=(0, 1), dtype=float64), accept_sparse = 'csr', accept_large_sparse = True, dtype = None, order = None, copy = False, force_all_finite = True, ensure_2d = True
allow_nd = False, ensure_min_samples = 1, ensure_min_features = 1, warn_on_dtype = False, estimator = None
def check_array(array, accept_sparse=False, accept_large_sparse=True,
dtype="numeric", order=None, copy=False, force_all_finite=True,
ensure_2d=True, allow_nd=False, ensure_min_samples=1,
ensure_min_features=1, warn_on_dtype=False, estimator=None):
"""Input validation on an array, list, sparse matrix or similar.
By default, the input is checked to be a non-empty 2D array containing
only finite values. If the dtype of the array is object, attempt
converting to float, raising on failure.
Parameters
----------
array : object
Input object to check / convert.
accept_sparse : string, boolean or list/tuple of strings (default=False)
String[s] representing allowed sparse matrix formats, such as 'csc',
'csr', etc. If the input is sparse but not in the allowed format,
it will be converted to the first listed format. True allows the input
to be any format. False means that a sparse matrix input will
raise an error.
.. deprecated:: 0.19
Passing 'None' to parameter ``accept_sparse`` in methods is
deprecated in version 0.19 "and will be removed in 0.21. Use
``accept_sparse=False`` instead.
accept_large_sparse : bool (default=True)
If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by
accept_sparse, accept_large_sparse=False will cause it to be accepted
only if its indices are stored with a 32-bit dtype.
.. versionadded:: 0.20
dtype : string, type, list of types or None (default="numeric")
Data type of result. If None, the dtype of the input is preserved.
If "numeric", dtype is preserved unless array.dtype is object.
If dtype is a list of types, conversion on the first type is only
performed if the dtype of the input is not in the list.
order : 'F', 'C' or None (default=None)
Whether an array will be forced to be fortran or c-style.
When order is None (default), then if copy=False, nothing is ensured
about the memory layout of the output array; otherwise (copy=True)
the memory layout of the returned array is kept as close as possible
to the original array.
copy : boolean (default=False)
Whether a forced copy will be triggered. If copy=False, a copy might
be triggered by a conversion.
force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in array. The
possibilities are:
- True: Force all values of array to be finite.
- False: accept both np.inf and np.nan in array.
- 'allow-nan': accept only np.nan values in array. Values cannot
be infinite.
.. versionadded:: 0.20
``force_all_finite`` accepts the string ``'allow-nan'``.
ensure_2d : boolean (default=True)
Whether to raise a value error if array is not 2D.
allow_nd : boolean (default=False)
Whether to allow array.ndim > 2.
ensure_min_samples : int (default=1)
Make sure that the array has a minimum number of samples in its first
axis (rows for a 2D array). Setting to 0 disables this check.
ensure_min_features : int (default=1)
Make sure that the 2D array has some minimum number of features
(columns). The default value of 1 rejects empty datasets.
This check is only enforced when the input data has effectively 2
dimensions or is originally 1D and ``ensure_2d`` is True. Setting to 0
disables this check.
warn_on_dtype : boolean (default=False)
Raise DataConversionWarning if the dtype of the input data structure
does not match the requested dtype, causing a memory copy.
estimator : str or estimator instance (default=None)
If passed, include the name of the estimator in warning messages.
Returns
-------
array_converted : object
The converted and validated array.
"""
# accept_sparse 'None' deprecation check
if accept_sparse is None:
warnings.warn(
"Passing 'None' to parameter 'accept_sparse' in methods "
"check_array and check_X_y is deprecated in version 0.19 "
"and will be removed in 0.21. Use 'accept_sparse=False' "
" instead.", DeprecationWarning)
accept_sparse = False
# store reference to original array to check if copy is needed when
# function returns
array_orig = array
# store whether originally we wanted numeric dtype
dtype_numeric = isinstance(dtype, six.string_types) and dtype == "numeric"
dtype_orig = getattr(array, "dtype", None)
if not hasattr(dtype_orig, 'kind'):
# not a data type (e.g. a column named dtype in a pandas DataFrame)
dtype_orig = None
# check if the object contains several dtypes (typically a pandas
# DataFrame), and store them. If not, store None.
dtypes_orig = None
if hasattr(array, "dtypes") and hasattr(array, "__array__"):
dtypes_orig = np.array(array.dtypes)
if dtype_numeric:
if dtype_orig is not None and dtype_orig.kind == "O":
# if input is object, convert to float.
dtype = np.float64
else:
dtype = None
if isinstance(dtype, (list, tuple)):
if dtype_orig is not None and dtype_orig in dtype:
# no dtype conversion required
dtype = None
else:
# dtype conversion required. Let's select the first element of the
# list of accepted types.
dtype = dtype[0]
if force_all_finite not in (True, False, 'allow-nan'):
raise ValueError('force_all_finite should be a bool or "allow-nan"'
'. Got {!r} instead'.format(force_all_finite))
if estimator is not None:
if isinstance(estimator, six.string_types):
estimator_name = estimator
else:
estimator_name = estimator.__class__.__name__
else:
estimator_name = "Estimator"
context = " by %s" % estimator_name if estimator is not None else ""
if sp.issparse(array):
_ensure_no_complex_data(array)
array = _ensure_sparse_format(array, accept_sparse=accept_sparse,
dtype=dtype, copy=copy,
force_all_finite=force_all_finite,
accept_large_sparse=accept_large_sparse)
else:
# If np.array(..) gives ComplexWarning, then we convert the warning
# to an error. This is needed because specifying a non complex
# dtype to the function converts complex to real dtype,
# thereby passing the test made in the lines following the scope
# of warnings context manager.
with warnings.catch_warnings():
try:
warnings.simplefilter('error', ComplexWarning)
array = np.asarray(array, dtype=dtype, order=order)
except ComplexWarning:
raise ValueError("Complex data not supported\n"
"{}\n".format(array))
# It is possible that the np.array(..) gave no warning. This happens
# when no dtype conversion happened, for example dtype = None. The
# result is that np.array(..) produces an array of complex dtype
# and we need to catch and raise exception for such cases.
_ensure_no_complex_data(array)
if ensure_2d:
# If input is scalar raise error
if array.ndim == 0:
raise ValueError(
"Expected 2D array, got scalar array instead:\narray={}.\n"
"Reshape your data either using array.reshape(-1, 1) if "
"your data has a single feature or array.reshape(1, -1) "
"if it contains a single sample.".format(array))
# If input is 1D raise error
if array.ndim == 1:
raise ValueError(
"Expected 2D array, got 1D array instead:\narray={}.\n"
"Reshape your data either using array.reshape(-1, 1) if "
"your data has a single feature or array.reshape(1, -1) "
"if it contains a single sample.".format(array))
# in the future np.flexible dtypes will be handled like object dtypes
if dtype_numeric and np.issubdtype(array.dtype, np.flexible):
warnings.warn(
"Beginning in version 0.22, arrays of bytes/strings will be "
"converted to decimal numbers if dtype='numeric'. "
"It is recommended that you convert the array to "
"a float dtype before using it in scikit-learn, "
"for example by using "
"your_array = your_array.astype(np.float64).",
FutureWarning)
# make sure we actually converted to numeric:
if dtype_numeric and array.dtype.kind == "O":
array = array.astype(np.float64)
if not allow_nd and array.ndim >= 3:
raise ValueError("Found array with dim %d. %s expected <= 2."
% (array.ndim, estimator_name))
if force_all_finite:
_assert_all_finite(array,
allow_nan=force_all_finite == 'allow-nan')
shape_repr = _shape_repr(array.shape)
if ensure_min_samples > 0:
n_samples = _num_samples(array)
if n_samples < ensure_min_samples:
raise ValueError("Found array with %d sample(s) (shape=%s) while a"
" minimum of %d is required%s."
% (n_samples, shape_repr, ensure_min_samples,
> context))
E ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.
accept_large_sparse = True
accept_sparse = 'csr'
allow_nd = False
array = array([], shape=(0, 1), dtype=float64)
array_orig = array([], shape=(0, 1), dtype=float64)
context = ''
copy = False
dtype = None
dtype_numeric = True
dtype_orig = dtype('float64')
dtypes_orig = None
ensure_2d = True
ensure_min_features = 1
ensure_min_samples = 1
estimator = None
estimator_name = 'Estimator'
force_all_finite = True
n_samples = 0
order = None
shape_repr = '(0, 1)'
warn_on_dtype = False
sklearn/utils/validation.py:582: ValueError |
Please, @hermidalc, open a pull request rather than posting code and test logs here. GitHub will facilitate tangible discussion much better on a pull request where you add commits. But I really doubt that introducing a Scorer class will result in code that gets merged. Please try the score_params parameter to Select* instead |
While that can fix the immediate limitation, it's still a somewhat ugly and inconsistent way to encode the param grid when searching such parameters. It's a better and more future-proof design to make scorers classes that are estimators, while still supporting the legacy functions and backwards compatibility, though all future scorers should use the new design. I personally think it's a win-win |
You may personally think it's a win-win, but the cost is not merely in the design but in the maintenance. But in design too, there are several possible options for how to get a feature scorer to return its result. I don't think yours is the only obvious design. |
Univariate feature selection methods (SelectKBest/Percentile/Fdr/Fpr/Fwe and GenericUnivariateSelect) do not provide a way to pass score_func kwargs, some of which are important.
One thought too: an implementation should be GridSearchCV compliant, so that score_func kwargs parameters can be searched.