MeanThreshold, MedianThreshold, and other threshold support in GenericUnivariateSelect #21699
Labels: help wanted, Moderate, module:feature_selection, New Feature
Describe the workflow you want to enable
I would like to select features by thresholding their mean value (i.e., the mean across samples), similar to how VarianceThreshold selects features by thresholding their variance across samples.
Describe your proposed solution
Two possible options:
1. Add a MeanThreshold selector (and similar ones, e.g. MedianThreshold) to sklearn.feature_selection, similar to VarianceThreshold. Example: https://github.com/hermidalc/sklearn-extensions/blob/f9296d0f3ed5d71b7f07779b47d8cf71bbcfa51b/feature_selection/_average_threshold.py#L7-L96
2. Add a mode='threshold' option to GenericUnivariateSelect (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html), which would allow thresholding on other score_funcs as well.
Describe alternatives you've considered, if relevant
Another alternative, although this seems counter to how these functions are designed:
SelectFromModel(DummyRegressor(strategy='mean'), importance_getter='constant_', threshold=min_mean_value)
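For comparison, the first proposed option (a dedicated selector modeled on VarianceThreshold) might look roughly like the sketch below. This is only an illustration, not existing scikit-learn API: the class name MeanThreshold, the threshold parameter, and the means_ attribute are assumptions, and the strict "greater than" comparison is chosen to mirror how VarianceThreshold drops features at or below its threshold.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin
from sklearn.utils.validation import check_array, check_is_fitted


class MeanThreshold(SelectorMixin, BaseEstimator):
    """Hypothetical selector: keep features whose mean exceeds `threshold`."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # y is ignored; the selection is unsupervised, as in VarianceThreshold.
        X = check_array(X)
        self.means_ = X.mean(axis=0)  # per-feature mean across samples
        self.n_features_in_ = X.shape[1]
        return self

    def _get_support_mask(self):
        # SelectorMixin builds get_support() and transform() on top of this mask.
        check_is_fitted(self, "means_")
        return self.means_ > self.threshold


# Toy data (made up for illustration): the column means are 0, 3 and 6.
X = np.array([[0.0, 2.0, 5.0],
              [0.0, 4.0, 7.0]])
selector = MeanThreshold(threshold=1.0).fit(X)
print(selector.get_support())      # boolean mask of the kept features
print(selector.transform(X).shape)
```

Because the class only defines fit and _get_support_mask, it can be dropped into a Pipeline in the same position where VarianceThreshold would go.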
Additional context
Setting a MeanThreshold would be useful when working with non-negative features, such as pixel intensity in images. For example, we might want to exclude pixels that are regularly saturated in our dataset, as they may be less informative.
Specifically, in my research field of neuroscience (single-neuron recordings), our "features" are the (non-negative) action potential counts for each neuron. We often exclude neurons with very low firing rates to minimize discretization error. Here are a few examples of neuroscience papers that set a MeanThreshold per neuron (i.e., feature):
Units with mean firing rates less than 1.5 Hz were excluded from the analysis.
Neurons with firing rates less than 0.5 Hz were excluded.
Firing data were Gaussian smoothed and binned in 0.1 s periods and bins with firing rates less than 0.1 Hz (no spikes) were excluded.
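As a rough illustration of the exclusion criteria quoted above, the following NumPy sketch converts binned spike counts to mean firing rates and drops neurons below a minimum rate. The count matrix, the 0.1 s bin width, and the 1.5 Hz cutoff are toy values picked for the example, not data from any of the cited papers.

```python
import numpy as np

# Hypothetical spike counts, shape (n_bins, n_neurons); each column is one neuron.
counts = np.array([
    [0, 1, 0],
    [0, 2, 0],
    [0, 0, 0],
    [0, 1, 1],
])
bin_width_s = 0.1   # assumed bin width in seconds
min_rate_hz = 1.5   # assumed cutoff: exclude neurons firing below this rate

# Mean count per bin divided by the bin width gives the mean firing rate in Hz.
mean_rate_hz = counts.mean(axis=0) / bin_width_s

# "Less than the cutoff is excluded" means neurons at or above it are kept.
keep = mean_rate_hz >= min_rate_hz
filtered_counts = counts[:, keep]

print(mean_rate_hz)
print(keep)
```

Here the silent first neuron is removed while the other two are kept. A MeanThreshold-style selector would perform the same column filtering inside a scikit-learn pipeline.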