Feature selector instances are too heavy-weight #17816

Open
DomHudson opened this issue Jul 2, 2020 · 8 comments

@DomHudson

DomHudson commented Jul 2, 2020

Definitions

By "selectors", I'm referring to the set of classes that implement sklearn.feature_selection._base.SelectorMixin.

Examples are:

  • SelectKBest
  • VarianceThreshold

Summary

I'd like to suggest either:

  1. modifying these "selectors" to save less information in their state
    or
  2. providing a more minimal set of "selectors" with less information saved in their state

These classes save large arrays in their state which, in the most common use cases, are never consumed.

Motivation

Consider a fairly typical NLP classifier like so:

Pipeline([
    ('feature-hashing', FeatureHasher(n_features=2**20)),
    ('variance', VarianceThreshold(threshold=0.0)),
    ('clf', LogisticRegression())
])

The FeatureHasher is described as "a low-memory alternative to DictVectorizer and CountVectorizer". The disadvantage of the FeatureHasher is that it always produces a matrix with n_features columns regardless of the input size; therefore, classes like VarianceThreshold and SelectKBest are commonly used to reduce the size of the feature matrix.

The current implementation of these classes undermines much of the low-memory benefit of the FeatureHasher because of the large amount of information saved in their state.

For example:

  • VarianceThreshold saves the array variances_ of shape (n_features,), where n_features is the size of the input.
  • SelectKBest saves two arrays of the same shape: scores_ and pvalues_.
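
As a rough illustration of the cost (a sketch, assuming float64 storage), a single per-feature array for a 2**20-column hashing space is already several megabytes on its own:

import numpy as np

# One float64 value per input feature, as stored by e.g. variances_ or scores_.
n_features = 2 ** 20
per_feature_array = np.zeros(n_features)
print(per_feature_array.nbytes / 1e6)  # ~8.4 MB for a single attribute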

Describe your proposed solution

I suggest that, for this type of transformer, the saved state be a single numpy array containing only the column indices to retain. An array in this form can be retrieved by calling get_support(indices=True) on a fitted selector instance.

  • This dramatically reduces the size of the object (both when pickled and in memory) without any decrease in algorithm performance.
  • This implementation also scales much better than the current one, as the size of the state grows with the output shape rather than the input shape.

The disadvantage of this approach is a decrease in model explainability.
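
A minimal sketch of the idea with the existing API, on toy data:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1, 0, 3],
              [0, 2, 0, 1],
              [0, 3, 0, 2]])

selector = VarianceThreshold(threshold=0.0).fit(X)
indices = selector.get_support(indices=True)  # array([1, 3]): only the varying columns
X_reduced = X[:, indices]                     # all that transform() actually needs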

To examine the difference this approach makes, I designed the following implementation:

class NumpyColumnFilter:

    def __init__(self, relevant_columns):
        """ Constructor.

        :param numpy.ndarray relevant_columns: 1D numpy array containing indices of columns to
            retain.
        :return void:
        """
        self._relevant_columns = relevant_columns

    @classmethod
    def from_sklearn_selector(cls, selector):
        """ Produce a NumpyColumnFilter from an implementation of sklearn's SelectorMixin.

        :param mixed selector:
        :return NumpyColumnFilter:
        """
        return cls(selector.get_support(indices=True))

    def apply(self, X):
        """ Select just the relevant columns.

        :param numpy.ndarray X:
        :return numpy.ndarray:
        """
        return X[:, self._relevant_columns]


class FeatureSelector:

    def __init__(self, selector_class, **selector_kwargs):
        """ Constructor.

        :param mixed selector_class: Selector class used to determine which feature indices to keep.
        :param dict selector_kwargs: Keyword arguments to pass when instantiating the selector class
        :return void:
        """
        self._selector_class = selector_class
        self._selector_kwargs = selector_kwargs
        self._column_filter = None

    def _is_fitted(self):
        """ Is the filter fitted?

        :return bool:
        """
        return self._column_filter is not None

    def fit_transform(self, *args, **kwargs):
        """ Fit and transform.

        :return np.ndarray:
        """
        return self.fit(*args, **kwargs).transform(*args, **kwargs)

    def fit(self, *args, **kwargs):
        """ Fit the algorithm.

        :param np.ndarray X:
        :raises Exception:
        :return self:
        """
        fitted_selector = self._selector_class(**self._selector_kwargs).fit(*args, **kwargs)

        self._column_filter = NumpyColumnFilter.from_sklearn_selector(fitted_selector)
        return self

    def transform(self, X, *args, **kwargs):
        """ Select just the relevant columns.

        :param np.ndarray X:
        :raises Exception:
        :return np.ndarray:
        """
        if not self._is_fitted():
            raise Exception('Not fitted!')

        return self._column_filter.apply(X)

Comparing the sizes of the pickled objects shows a dramatic decrease:

Setup code

import pickle
import sys

import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import VarianceThreshold


def pickled_size(item):
    return sys.getsizeof(pickle.dumps(item))


X, y = datasets.make_classification(n_samples = 100, n_features = 10)

# Add one-million features without any variance.
X = np.concatenate((X, np.zeros((100, int(1e6)))), axis = 1)

SelectKBest

sklearn_selector_class = SelectKBest().fit(X, y)
print(pickled_size(sklearn_selector_class))
>>> ~ 16 megabytes

feature_selector = FeatureSelector(selector_class = SelectKBest).fit(X, y)
print(pickled_size(feature_selector))
>>> ~ 500 bytes

VarianceThreshold

sklearn_selector_class = VarianceThreshold().fit(X, y)
print(pickled_size(sklearn_selector_class))
>>> ~ 8 megabytes

feature_selector = FeatureSelector(selector_class = VarianceThreshold).fit(X, y)
print(pickled_size(feature_selector))
>>> ~ 400 bytes
@NicolasHug
Member

I'm not sure about the name of the parameter, but I think we could consider a flag for having lighter selectors. We should also document that the corresponding attributes aren't available when the flag is True.

If we start storing the selected indices instead of an array of shape n_features, we'll also need to reconsider how get_support() and _get_support_mask interact with each other.
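
Purely as a sketch of the kind of flag being discussed (the parameter name here is hypothetical and does not exist in scikit-learn):

selector = SelectKBest(k=10, store_scores=False)  # hypothetical parameter
selector.fit(X, y)
selector.get_support(indices=True)  # still available
selector.scores_                    # would not be available under such a flag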

@amueller
Member

amueller commented Jul 2, 2020

Thanks for opening the issue.
Why is using indices a decrease in model interpretability? This is just storing a dense vs a sparse vector, right?
[edit] I now realize that you also need to drop the existing attributes for this to actually help, which does remove some information. [/edit]

In principle, using a selector here is not really needed, and you'll get a smaller memory footprint if you just don't do selection.

Still, it would be nice to support a sparse mask / indices. The linear models have a "sparsify" method:

def sparsify(self):

We could do basically the same here. This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint. We might make pvalues_ etc. sparse and keep only the entries that are not masked out; that would probably require only minimal changes in the logic.
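
For context, a small sketch of what the existing sparsify() does on a linear model (it converts coef_ to a scipy.sparse matrix in place, without deleting anything):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
clf.sparsify()          # coef_ becomes a scipy.sparse CSR matrix, in place
print(type(clf.coef_))  # a scipy.sparse CSR matrix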

@NicolasHug
Member

This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint.

@amueller you'd be fine with est.sparsify basically deleting the variances_ attribute? I think I'd still prefer having an __init__ parameter for this, because it would also allow us to avoid computing e.g. pvalues_ (which isn't needed to know the selected indices). With a call to sparsify(), the attribute would still be computed even though it would never be used.

@amueller
Member

amueller commented Jul 2, 2020

How would you not compute the pvalues_? They are returned from score_func.
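
For reference, the univariate score functions return both arrays from a single call, e.g.:

from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
scores, pvalues = f_classif(X, y)  # F-statistics and p-values are computed together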

@NicolasHug
Member

Ah indeed.

I still find that deleting an attribute would be a surprising consequence of calling a method called sparsify, though. sparsify() for the linear models only converts coef_ to a sparse matrix but does not delete anything.

@amueller
Member

amueller commented Jul 2, 2020

Yes, I think I wouldn't delete the attributes, but rather replace the values that correspond to dropped features with zeros and make the arrays sparse.
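
A rough sketch of that idea (the helper name is made up and this is not an actual scikit-learn API), assuming scipy is available:

import numpy as np
import scipy.sparse as sp

def sparsify_selector(selector):
    """Zero out the entries for dropped features and store the arrays sparsely."""
    mask = selector.get_support()  # boolean mask of shape (n_features,)
    for attr in ("scores_", "pvalues_", "variances_"):
        values = getattr(selector, attr, None)
        if values is not None:
            zeroed = np.where(mask, values, 0.0)            # zeros for dropped features
            setattr(selector, attr, sp.csr_matrix(zeroed))  # stored as a (1, n_features) sparse row
    return selector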

@jnothman
Member

jnothman commented Jul 2, 2020 via email

@DomHudson
Author

Thank you very much for all the engagement in this ticket!

@jnothman Thanks for your response! Agreed that exporting to a compact format like ONNX would probably satisfy the need. I suppose it comes down to a design decision about how much focus there is on model size and memory use. I do think these pipeline components in particular are surprisingly heavy.
