Feature selector instances are too heavy-weight #17816
I'm not sure about the name of the parameter, but I think we could consider a flag for having lighter selectors. We should also document that the corresponding attributes aren't available when the flag is True. If we start storing the selected indices instead of an array of shape n_features, we'll also need to reconsider how …
Thanks for opening the issue. In principle, using a selector here is not really needed, and you'll get a smaller memory footprint if you just don't do selection. Still, it would be nice to support a sparse mask / indices. The linear models have a "sparsify" method: scikit-learn/sklearn/linear_model/_base.py, line 357 in 7cc0177.
We could do the same here, basically. This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint. We might make the …
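For context, the `sparsify` pattern mentioned above can be demonstrated on a linear model; the data and model below are illustrative, not from the thread:

```python
# Illustrative only: a fitted linear model whose coef_ is converted in place
# to a scipy.sparse matrix by the existing sparsify() method.
import numpy as np
import scipy.sparse
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 50)
y = (X[:, 0] > 0.5).astype(int)

clf = SGDClassifier(penalty="l1", random_state=0).fit(X, y)
clf.sparsify()  # converts coef_ to a scipy.sparse matrix in place
print(scipy.sparse.issparse(clf.coef_))  # True
```

With an L1 penalty many coefficients are exactly zero, so the sparse representation can be much smaller; the proposal is to offer selectors a similar opt-in method.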
@amueller you'd be fine with …
How would you not compute the …
Ah indeed. I still find that deleting an attribute would be a surprising consequence of calling a method called …
Yes, I think I wouldn't delete the attributes, but rather replace the values corresponding to dropped features with zeros and make the arrays sparse.
Are we sure this is not just a case where the user really just wants a safe way to export a compact predictive model, i.e. ONNX might be a better pick?
Thank you very much for all the engagement in this ticket! @jnothman Thanks for your response! Agreed that exporting to a compact format like ONNX would probably satisfy the need. I suppose it comes down to a design decision on how much of a focus there is on model size and memory use. I do think these pipeline components in particular are surprisingly heavy.
Definitions
By "selectors", I'm referring to the set of classes that implement sklearn.feature_selection._base.SelectorMixin. Examples are:
- SelectKBest
- VarianceThreshold
Summary
I'd like to suggest either:
…
or
…
These classes save large matrices in their state which, for the most common cases, are not consumed.
Motivation
Consider a fairly typical NLP classifier, like so:
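A sketch of the kind of pipeline described here; the specific estimators, `n_features`, and `k` below are illustrative assumptions, not the original code:

```python
# Hypothetical NLP pipeline: hashing vectorization followed by two
# feature-selection steps and a linear classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    # HashingVectorizer uses FeatureHasher internally for text input
    ("hash", HashingVectorizer(n_features=2**20, alternate_sign=False)),
    ("variance", VarianceThreshold()),       # drop constant columns
    ("kbest", SelectKBest(chi2, k=10_000)),  # keep the top-scoring columns
    ("clf", LogisticRegression()),
])
```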
The FeatureHasher is described as "a low-memory alternative to DictVectorizer and CountVectorizer". The disadvantage of the FeatureHasher is that it will always produce a matrix with n_features columns regardless of the input size; therefore, classes like VarianceThreshold and SelectKBest are commonly used to reduce the size of the feature matrix.
The current implementation of these classes negates much of the low-memory benefit of the FeatureHasher due to the large amount of information saved in their state.
For example:
- VarianceThreshold saves the array variances_ of shape (n_features,), where n_features is the size of the input.
- SelectKBest saves two arrays with this shape: scores_ and pvalues_.
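These shapes can be checked directly on fitted selectors; the data and sizes below are synthetic and purely illustrative:

```python
# The fitted state scales with the input width (n_features), not with the
# number of features actually selected.
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.RandomState(0)
X = rng.rand(200, 1000)
y = rng.randint(0, 2, size=200)

vt = VarianceThreshold().fit(X)
print(vt.variances_.shape)                   # (1000,)

kb = SelectKBest(f_classif, k=10).fit(X, y)
print(kb.scores_.shape, kb.pvalues_.shape)   # (1000,) (1000,)
```

Even though SelectKBest keeps only 10 columns here, its fitted state holds two float arrays of length 1000.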
Describe your proposed solution
I suggest that for this type of transformer, the saved state be a single numpy array containing only the column indices to retain. An array in this form can be retrieved by calling get_support(indices=True) on a fitted selector instance.
The disadvantage of this approach is a decrease in model explainability.
To examine the impact of this approach, I designed the following implementation:
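The original implementation is not reproduced here; one possible sketch of an indices-only selector, with a hypothetical class name, might look like:

```python
# Hypothetical sketch: a selector whose only fitted state is the array of
# retained column indices, instead of per-feature score arrays.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LightweightVarianceThreshold(BaseEstimator, TransformerMixin):
    """Like VarianceThreshold, but stores only the retained indices."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        variances = np.var(X, axis=0)  # computed transiently, not stored
        self.indices_ = np.flatnonzero(variances > self.threshold)
        return self

    def transform(self, X):
        return X[:, self.indices_]
```

The trade-off stated above applies: variances are no longer inspectable after fitting, only the surviving indices.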
Comparing the sizes of the pickled objects shows a dramatic decrease:
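A rough version of such a comparison; the setup below is an assumption and the byte counts are machine-dependent, so this only reconstructs the shape of the benchmark, not its original numbers:

```python
# Compare pickling a fitted VarianceThreshold (which carries variances_ for
# every input column) against pickling only the retained indices.
import pickle
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X = np.zeros((100, 50_000))                          # mostly constant columns
informative = rng.choice(50_000, size=100, replace=False)
X[:, informative] = rng.rand(100, 100)               # 100 informative columns

vt = VarianceThreshold().fit(X)
full = len(pickle.dumps(vt))                              # ~50k float64 variances
lean = len(pickle.dumps(vt.get_support(indices=True)))    # ~100 indices
print(full > 10 * lean)  # True: indices-only state is far smaller
```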