==================================
SLEP008: Propagating feature names
==================================

:Author: Andreas Mueller, Joris Van den Bossche
:Status: Draft
:Type: Standards Track
:Created: 2019-03-01

Abstract
--------

This SLEP proposes to add a transformative ``get_feature_names()`` method that
gets recursively called on each step of a ``Pipeline`` so that the feature
names get propagated throughout the full ``Pipeline``. This makes it possible
to inspect the input and output feature names of each step of a ``Pipeline``.

Detailed description
--------------------

Motivating example
^^^^^^^^^^^^^^^^^^

We've been making it easier to build complex workflows with the
``ColumnTransformer`` and we expect it will find wide adoption. However, using it
results in very opaque models, even more so than before.

We have a great usage example in the gallery that applies a classifier to the
Titanic data set. This is a very simple, standard use case, but it is still
close to impossible to inspect the names of the features that went into the
final estimator, for example to match them with the coefficients or feature
importances.

The full pipeline construction can be seen at
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html,
but it consists of a final classifier and a preprocessor that scales the
numerical features and one-hot encodes the categorical features. To obtain the
feature names that correspond to what the final classifier receives, we
currently have to do:

.. code-block:: python

    numeric_features = ['age', 'fare']
    categorical_features = ['embarked', 'sex', 'pclass']
    # extract the fitted OneHotEncoder from the pipeline
    onehotencoder = clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
    categorical_features2 = onehotencoder.get_feature_names(
        input_features=categorical_features)
    feature_names = numeric_features + list(categorical_features2)

Note: this only works if the Imputer didn't drop any all-NaN columns.

This SLEP proposes that the input feature names derived at fit time are
propagated through and stored in all the steps of the pipeline, so that we can
access them like this:

.. code-block:: python

    clf.named_steps['classifier'].input_feature_names_

For this specific example, this would give::

    >>> clf.named_steps['classifier'].input_feature_names_
    ['num__age', 'num__fare', 'cat__embarked_C', 'cat__embarked_Q', 'cat__embarked_S', 'cat__embarked_missing', 'cat__sex_female', 'cat__sex_male', 'cat__pclass_1', 'cat__pclass_2', 'cat__pclass_3']

which could then easily be matched with, e.g.::

    >>> clf.named_steps['classifier'].coef_
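
For instance, the names and coefficients could then be paired up like this
(illustrative only, assuming a linear classifier whose ``coef_`` has shape
``(1, n_features)``)::

    >>> names = clf.named_steps['classifier'].input_feature_names_
    >>> coefs = clf.named_steps['classifier'].coef_.ravel()
    >>> sorted(zip(names, coefs), key=lambda nc: abs(nc[1]), reverse=True)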


Propagating feature names through Pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The core idea of this proposal is that all transformers get a transformative
*"update feature names"* method (the exact name is still up for discussion,
see the open questions below) that can determine the output feature names,
optionally given input feature names (specified with the ``input_features``
parameter of this method).

The ``Pipeline`` and all meta-estimators implement this method by calling it
recursively on all their child steps / sub-estimators, and in this way the
input feature names get propagated through the full pipeline. In addition, it
sets the ``input_feature_names_`` attribute on each step of the pipeline, so
that the names can later be inspected on any individual step of a fitted
pipeline.


A ``Pipeline`` calls the method at the end of its own ``fit()`` (using the
DataFrame column names, or generated 'x0', 'x1', ... names for numpy arrays
or sparse matrices, as initial input feature names). This ensures that a
fitted pipeline has the input feature names set, without the user needing to
call the "update feature names" method before those attributes are available.

The "update feature names" method that is added to all transformers does the
following things:

- ability to specify custom input feature names (the ``input_features`` parameter)
- otherwise, if not specified, generate default keyword names (eg "x0", "x1", ...)
- set the ``input_feature_names_`` attribute (only done by ``Pipeline``
and ``MetaEstimator``)
- transform the input feature names into output feature names
- return the output feature names
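
To make this concrete, here is a minimal sketch of how such a method and the
recursive propagation could look. All names are illustrative (the method is
called ``get_feature_names`` for the sake of the example, but see the open
questions below), and the classes are simplified stand-ins, not the actual
implementation of PR #13307:

.. code-block:: python

    class OneToOneTransformer:
        """Stand-in for a one-to-one transformer such as a scaler."""

        def fit(self, X, y=None):
            self.n_features_in_ = X.shape[1]
            return self

        def get_feature_names(self, input_features=None):
            if input_features is None:
                # generate default names for numpy arrays / sparse matrices
                input_features = ['x%d' % i for i in range(self.n_features_in_)]
            # one-to-one transformer: output names equal input names
            return list(input_features)


    class SketchPipeline:
        """Stand-in showing only the feature name propagation logic."""

        def __init__(self, steps):
            self.steps = steps

        def get_feature_names(self, input_features=None):
            names = input_features
            for name, step in self.steps:
                # set the attribute on each step ...
                step.input_feature_names_ = names
                # ... and transform the names for the next step
                names = step.get_feature_names(input_features=names)
            return names

A fitted pipeline would call this at the end of its ``fit()`` with the
DataFrame column names (or ``None`` for numpy input), so that
``input_feature_names_`` is available on each step without further user action.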


Implementation
--------------

There is an implementation of this proposal in PR #13307
(https://github.com/scikit-learn/scikit-learn/pull/13307).

Open design questions
^^^^^^^^^^^^^^^^^^^^^

Name of the "update feature names" method?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently, a few transformers already have a ``get_feature_names`` method
(more specifically the vectorizers, ``PolynomialFeatures``, ``OneHotEncoder``
and ``ColumnTransformer``). Moreover, in some cases this method already does
exactly what we need: accepting ``input_features`` and returning the
transformed names. This is the case for ``PolynomialFeatures`` and
``OneHotEncoder``, while the other estimators don't take input features.

However, there are some downsides about this name: for ``Pipeline`` and
meta-estimators (and potentially also other transformers, see question below),
this method also *sets* the ``input_feature_names_`` attribute on each of the
estimators. Further, it may not directly be clear from this name that it
returns the *output* feature names and not the input feature names.

In addition to re-using ``get_feature_names``, some ideas for other names:
``set_feature_names``, ``get_output_feature_names``, ``transform_feature_names``,
``propagate_feature_names``, ``update_feature_names``, ...


Name of the attribute: ``input_feature_names_`` vs ``input_features_`` vs ``input_names_``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There was initially some discussion about the attribute to store the names.

The most explicit is ``input_feature_names_``, but this is long.
``input_features_`` can probably too easily be confused with the actual input
*features* (the X data). ``input_names_`` could provide a shorter alternative.

``input_features`` is already used as the parameter name in
``get_feature_names``, which is a plus. Probably, whatever name we choose, it
would be good to eventually make it consistent with the keyword used in the
"update feature names" method.

An example code snippet::

    >>> clf.named_steps['classifier'].input_feature_names_

Or more concisely::

    >>> clf['classifier'].input_feature_names_
    >>> clf[-1].input_feature_names_

Other mentioned alternatives: ``feature_names_in_`` and ``feature_names_out_``.


Do we also want an ``output_feature_names_`` attribute?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to setting the ``input_feature_names_`` attribute on each
estimator, we could also set an ``output_feature_names_`` attribute.

This would return the same as the current ``get_feature_names()`` method,
potentially removing the need to have an explicit output feature names *getter
method*. The "update feature names" method would then mainly be used for
setting the input features and making sure they get propagated.
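
With such an attribute, the motivating example could hypothetically be
written as (the attribute name is illustrative)::

    >>> clf.named_steps['preprocessor'].output_feature_names_  # hypothetical
    ['num__age', 'num__fare', 'cat__embarked_C', ...]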


Should all estimators call the "update feature names" method inside ``fit()``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the current implementation, the ``Pipeline`` is responsible for capturing
the input feature names of the data and calling the "update feature names"
method with those names at the end of ``fit()``.

However, that means that if you use a transformer/estimator on its own and
not in a Pipeline, the feature names won't be set automatically. For example
(assuming ``X_df`` is a DataFrame with columns A and B)::

    >>> ohe = OneHotEncoder()
    >>> ohe.fit(X_df)
    >>> ohe.input_feature_names_
    AttributeError: ...
    >>> ohe.get_feature_names()
    ['x0_cat1', 'x0_cat2', 'x1_cat1', 'x1_cat2']

vs

::

    >>> ohe = OneHotEncoder()
    >>> ohe_pipe = Pipeline([('ohe', ohe)])
    >>> ohe_pipe.fit(X_df)
    >>> ohe.input_feature_names_
    ['A', 'B']
    >>> ohe.get_feature_names()
    ['A_cat1', 'A_cat2', 'B_cat1', 'B_cat2']


Currently, the ``input_feature_names_`` attribute of an estimator is set by
the "update feature names" method of the parent estimator (e.g. the Pipeline
from which the estimator is called). But this logic could also be moved into
the "update feature names" method of the estimator itself.

In that case, the ``fit`` method of the estimator could also call the "update
feature names" method at the end, ensuring consistency between standalone
estimators and Pipelines. However, the clear downside of this consistency is
that it would add one line to each ``fit`` method throughout scikit-learn.
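
A minimal sketch of what this could look like, using a hypothetical mixin so
that the addition to each ``fit`` stays a single line (neither the mixin nor
its method name is part of the current proposal):

.. code-block:: python

    class FeatureNamesMixin:
        """Hypothetical helper, for illustration only."""

        def _record_feature_names(self, X):
            # capture DataFrame column names if present, else leave unset
            columns = getattr(X, 'columns', None)
            if columns is not None:
                self.input_feature_names_ = list(columns)


    class SomeTransformer(FeatureNamesMixin):
        def fit(self, X, y=None):
            # ... estimator-specific fitting logic ...
            self._record_feature_names(X)  # the one extra line per fit
            return self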


What happens if one part of the pipeline does not implement "update feature names"?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of raising an error, the current PR sets the output feature names to
``None``, which, when passed to the next step of the pipeline, still allows
that step to generate default feature names.
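
Concretely, the propagation loop from the earlier sketch could handle this
case as follows (illustrative only):

.. code-block:: python

    def propagate_names(steps, input_features=None):
        names = input_features
        for name, step in steps:
            step.input_feature_names_ = names
            if hasattr(step, 'get_feature_names'):
                names = step.get_feature_names(input_features=names)
            else:
                # step doesn't implement the method: pass None along; the
                # next step then generates default 'x0', 'x1', ... names
                names = None
        return names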

What should the "update feature names" method do in the less obvious cases?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The clear cases for how to transform input into output feature names are:

- Transformers that pass names through one-to-one, e.g. ``StandardScaler``
- Transformers that generate new features, e.g. ``OneHotEncoder``, the vectorizers
- Transformers that output a subset of the original features (``SelectKBest``, ``SimpleImputer``)

But what to do with:

- Transformers that create linear combinations of features, e.g. ``PCA``
- Transformers based on arbitrary functions
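
Some purely illustrative possibilities (how names should actually be
transformed is the topic of the separate SLEP referenced below):

.. code-block:: python

    # One-to-one (StandardScaler):  ['age', 'fare'] -> ['age', 'fare']
    # Expansion (OneHotEncoder):    ['sex'] -> ['sex_female', 'sex_male']
    # Subset (SelectKBest):         ['age', 'fare'] -> ['fare']
    # Linear combination (PCA):     ['age', 'fare'] -> ['pca0', 'pca1'] ?
    # Arbitrary function:           ['age', 'fare'] -> None ?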


Should all estimators (including regressors, classifiers, ...) have an "update feature names" method?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the current implementation, only transformers (and ``Pipeline`` and
meta-estimators, which can act as transformers) have an "update feature
names" method.

For consistency, we could also add them to *all* estimators.

For a regressor or classifier, the method could set the
``input_feature_names_`` attribute and return ``None``.


How should feature names be transformed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this question, there is a separate SLEP:
https://github.com/scikit-learn/enhancement_proposals/pull/17


Interaction with column validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another, potentially related, change that has been discussed is to do input
validation at transform/predict time: ensuring that the column names and
their order are identical to those seen at fit time (currently, scikit-learn
silently returns "incorrect" results as long as the number of columns
matches).

To do proper validation, the idea would be to store the column names at fit
time, so they can be compared at transform/predict time. Those stored column
names could be very similar to the ``input_feature_names_`` described in this
SLEP.
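
A minimal sketch of such a check, assuming the column names are stored at fit
time in an attribute like ``input_feature_names_`` (the helper and its name
are hypothetical):

.. code-block:: python

    def check_column_names(estimator, X):
        """Hypothetical transform/predict-time validation helper."""
        fitted_names = getattr(estimator, 'input_feature_names_', None)
        seen_names = getattr(X, 'columns', None)
        if fitted_names is not None and seen_names is not None:
            if list(seen_names) != list(fitted_names):
                raise ValueError(
                    "Column names at transform/predict time do not match "
                    "the ones seen during fit: %r != %r"
                    % (list(seen_names), list(fitted_names)))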

However, if a user calls ``Pipeline.get_feature_names(input_features=[...])``
with a set of custom input feature names that are not identical to the
original DataFrame column names, the stored column names used for validation
and the stored column names used to propagate the feature names would get out
of sync. Or should calling ``get_feature_names`` also affect future
validation in a ``predict()`` call?

One solution is to disallow setting feature names if the original input was a
pandas DataFrame (so ``pipe.get_feature_names(['other', 'names'])`` would
raise an error if ``pipe`` was fitted with a DataFrame). This would prevent
ending up in potentially confusing or ambiguous situations. Calling
``get_feature_names`` with custom input names would of course still be
possible when the input was not a pandas DataFrame.


Backward compatibility
----------------------

This SLEP does not affect backward compatibility, as all described attributes
and methods would be new ones, not affecting existing ones. (If we re-use the
existing ``get_feature_names``, a new ``input_features`` parameter is added
in some cases, but the old behavior keeps working.)

The only remaining compatibility question is: if we decide to use another
name than ``get_feature_names()``, what to do with the existing methods?
Those could in principle be deprecated.


Alternatives
------------

The alternatives described here are alternatives to the combination of the
transformative "update feature names" method and calling it recursively in
the ``Pipeline`` while setting the ``input_feature_names_`` attribute on each
step.

1. Only implement the "update feature names" method and require the user to
   slice the pipeline to call this method manually on the appropriate subset.
   For example::

      >>> clf[:-1].get_feature_names(input_features=[....])

   would then propagate the originally provided names up to the output names
   of the final step of this sliced pipeline (which are the input feature
   names for the last step of the full pipeline, in this example).

   This is what was implemented in `#12627
   <https://github.com/scikit-learn/scikit-learn/pull/12627>`_.

   The main drawback of this more limited proposal is the user interface: the
   user needs to manually slice the pipeline and call ``get_feature_names()``
   to get the output feature names of this subset, in order to obtain the
   input feature names of the final classifier/regressor.

   The main difference with the main proposal is not automatically calling
   this method in the Pipeline ``fit()`` method and not storing the
   ``input_feature_names_`` attributes.

2. Use "pandas in - pandas out" everywhere (also fitted attributes): user does
not need an explicit way to get or set the feature names as they are
included in the output of estimators (e.g. ``coef_`` would be a Series
with the input feature names to the final estimator as the index).

However, this would tie this feature much more to pandas (and eg would not
be available when working with numpy arrays) and would be much more evasive
for the codebase (and raise a lot more general issues about tying to pandas).

3. Implement a more comprehensive feature description language.

4. Leave it to the user.


While we think that alternatives 2) and 3) are valid options for the future,
trying to implement them now would probably result in a gridlock and/or take
too much time. The solution proposed in this SLEP solves the majority of the
use cases relatively easily, and we can create a more elaborate solution
later, in particular since this SLEP doesn't introduce any concepts that are
not in scikit-learn already.

We don't think that doing nothing (4) is a good option. The Titanic example
shown in the introduction is a valid use case, and currently, getting the
feature names is very hard (the example was even simplified, as it does not
take into account features being dropped when they are all-NaN).


Discussion
----------

Discussions have been held at several places:

- https://github.com/scikit-learn/scikit-learn/issues/6424
- https://github.com/scikit-learn/scikit-learn/issues/6425
- https://github.com/scikit-learn/scikit-learn/pull/12627
- https://github.com/scikit-learn/scikit-learn/pull/13307


References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_