SLEP015: Feature Names Propagation (#48)
thomasjpfan committed Nov 30, 2022
1 parent 25edba4 commit 221362b
Showing 2 changed files with 192 additions and 0 deletions.
1 change: 1 addition & 0 deletions index.rst
@@ -40,6 +40,7 @@
:caption: Rejected

slep014/proposal
slep015/proposal

.. toctree::
:maxdepth: 1
191 changes: 191 additions & 0 deletions slep015/proposal.rst
@@ -0,0 +1,191 @@
.. _slep_015:

==================================
SLEP015: Feature Names Propagation
==================================

:Author: Thomas J Fan
:Status: Rejected
:Type: Standards Track
:Created: 2020-10-03

Abstract
########

This SLEP proposes adding the ``get_feature_names_out`` method to all
transformers and the ``feature_names_in_`` attribute for all estimators.
The ``feature_names_in_`` attribute is set during ``fit`` if the input, ``X``,
contains the feature names.

Motivation
##########

``scikit-learn`` is commonly used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface can be cumbersome when used
together with a pipeline with multiple column names::

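    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'distance': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']

    ct = ColumnTransformer(
        [('cat', OneHotEncoder(), orig_cat_cols),
         ('num', StandardScaler(), orig_num_cols)])
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    # Look up the fitted encoder by the name it was given in ``ct`` ('cat').
    cat_names = (pipe['columntransformer']
                 .named_transformers_['cat']
                 .get_feature_names(orig_cat_cols))

    # One-hot columns come first, then the numeric columns, mirroring ``ct``.
    feature_names = np.r_[cat_names, orig_num_cols]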

The ``feature_names`` extracted above correspond to the features passed
directly into ``LogisticRegression``. As demonstrated above, extracting
``feature_names`` requires knowing the order in which the
``ColumnTransformer`` arranges its output columns. Furthermore, if there is
feature selection in the pipeline, such as ``SelectKBest``, the
``get_support`` method would need to be used to infer the column names that
were selected.
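
The feature-selection step mentioned above can be sketched without
scikit-learn. The snippet assumes a boolean mask of the kind ``get_support``
returns; the mask values and feature names are invented for illustration and
do not come from a fitted selector.

```python
# Names of the features entering a hypothetical selection step.
input_names = ["letter_a", "letter_b", "pet_dog", "distance"]

# Stand-in for ``selector.get_support()``: True marks a column that was kept.
# These values are made up for the example.
support_mask = [True, False, True, True]

# Without name propagation, the user has to filter the names by hand.
selected_names = [name for name, keep in zip(input_names, support_mask) if keep]
print(selected_names)
```

Every additional selection or expansion step in a pipeline adds another piece
of manual bookkeeping like this one.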

Solution
########

This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
that will extract the feature names of ``X`` during ``fit``. This will also
be used for validation during non-``fit`` methods such as ``transform`` or
``predict``. If ``X`` is not a recognized container with columns, then
``feature_names_in_`` can be undefined. If ``feature_names_in_`` is undefined,
then it will not be validated.
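
The behaviour described above might look roughly like the following sketch.
``ToyEstimator``, ``Frame`` and ``_check_feature_names`` are hypothetical
names used only here, and treating any object with a ``columns`` attribute as
a "recognized container" is an assumption. The sketch also assumes that
unnamed input at ``transform`` time is simply not validated, which this SLEP
does not specify.

```python
class Frame:
    """Minimal stand-in for a DataFrame: rows plus column names."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns


class ToyEstimator:
    """Hypothetical estimator; not scikit-learn's implementation."""

    def _check_feature_names(self, X, reset):
        # Assumption: a container is "recognized" if it exposes ``columns``.
        names = getattr(X, "columns", None)
        if reset:
            # During fit: store the names if present, else leave undefined.
            if names is not None:
                self.feature_names_in_ = list(names)
            elif hasattr(self, "feature_names_in_"):
                del self.feature_names_in_
            return
        # Outside fit: validate only when both sides carry names.
        fitted = getattr(self, "feature_names_in_", None)
        if fitted is not None and names is not None and list(names) != fitted:
            raise ValueError(
                f"Feature names {list(names)} do not match those seen in fit: {fitted}"
            )

    def fit(self, X, y=None):
        self._check_feature_names(X, reset=True)
        return self

    def transform(self, X):
        self._check_feature_names(X, reset=False)
        return X


est = ToyEstimator().fit(Frame([[1, 2]], ["a", "b"]))
print(est.feature_names_in_)
```

Fitting on a plain list leaves ``feature_names_in_`` undefined, so later calls
skip validation entirely.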

Secondly, this SLEP proposes adding
``get_feature_names_out(input_features=None)`` to all transformers. By
default, the input features will be determined by the ``feature_names_in_``
attribute. The feature names of a pipeline can then be easily extracted as
follows::

    pipe[:-1].get_feature_names_out()
    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
    #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']

Note that ``get_feature_names_out`` does not require ``input_features`` here
because the feature names were stored in the pipeline itself. These names are
passed to each step's ``get_feature_names_out`` method in turn to obtain the
output feature names of the ``Pipeline`` itself.
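
A minimal sketch of that chaining, using toy classes rather than
scikit-learn's ``Pipeline``: ``ToyOneHot``, ``ToyPassthrough`` and
``ToyPipeline`` are hypothetical, and the argument is called
``input_features`` to match the Backward compatibility section.

```python
class ToyOneHot:
    """Hypothetical transformer: one output column per category."""

    def __init__(self, categories):
        self.categories = categories  # e.g. {"pet": ["dog", "snake"]}

    def get_feature_names_out(self, input_features=None):
        return [
            f"{name}_{cat}"
            for name in input_features
            for cat in self.categories[name]
        ]


class ToyPassthrough:
    """Hypothetical transformer that keeps its columns unchanged."""

    def get_feature_names_out(self, input_features=None):
        return list(input_features)


class ToyPipeline:
    """Hypothetical pipeline that only tracks feature names."""

    def __init__(self, steps, feature_names_in):
        self.steps = steps
        self.feature_names_in_ = feature_names_in

    def get_feature_names_out(self, input_features=None):
        # Default to the names recorded during fit.
        names = self.feature_names_in_ if input_features is None else input_features
        # Each step maps the names it receives to the names it emits.
        for step in self.steps:
            names = step.get_feature_names_out(names)
        return names


pipe = ToyPipeline(
    [ToyOneHot({"letter": ["a", "b", "c"], "pet": ["dog", "snake"]}),
     ToyPassthrough()],
    feature_names_in=["letter", "pet"],
)
print(pipe.get_feature_names_out())
```

Because each step only needs the names produced by the previous one, no step
has to see a named container during ``transform``.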

Enabling Functionality
######################

The following enhancements are **not** a part of this SLEP. These features are
made possible if this SLEP gets accepted.

1. This SLEP enables us to implement an ``array_out`` keyword argument to
   all ``transform`` methods to specify the array container returned by
   ``transform``. An implementation of ``array_out`` requires
   ``feature_names_in_`` to validate that the names in ``fit`` and
   ``transform`` are consistent. It also needs a way to map from the input
   feature names to the output feature names, which is provided by
   ``get_feature_names_out``.

2. An alternative to ``array_out``: Transformers in a pipeline may wish to
   receive ``X`` with its feature names attached. This can be enabled by
   adding an ``array_input`` parameter to ``Pipeline``::

       pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
                            array_input='pandas')

   In this case, the pipeline will construct a pandas DataFrame to be passed
   into ``MyTransformer`` and ``LogisticRegression``. The feature names
   will be constructed by calling ``get_feature_names_out`` as data is passed
   through the ``Pipeline``. This feature implies that ``Pipeline`` is
   responsible for constructing the DataFrame.
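
Both ideas above rest on the same step: pairing a transformer's output with
the names from ``get_feature_names_out``. The sketch below shows only that
step. ``wrap_output`` and ``ToySquare`` are hypothetical, and a plain
``dict`` of columns stands in for a pandas DataFrame.

```python
def wrap_output(transformer, X_out, input_features):
    """Hypothetical helper: attach output names to a list of rows."""
    names = transformer.get_feature_names_out(input_features)
    # Transpose rows into columns, then label each column with its name.
    columns = zip(*X_out)
    return {name: list(col) for name, col in zip(names, columns)}


class ToySquare:
    """Hypothetical transformer that squares every value."""

    def transform(self, X):
        return [[v * v for v in row] for row in X]

    def get_feature_names_out(self, input_features=None):
        return [f"{name}^2" for name in input_features]


sq = ToySquare()
wrapped = wrap_output(sq, sq.transform([[1, 2], [3, 4]]), ["a", "b"])
print(wrapped)
```

Whether this wrapping lives in each ``transform`` (``array_out``) or in the
``Pipeline`` (``array_input``) is the design question the two items leave
open.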

Considerations and Limitations
##############################

1. The output of ``get_feature_names_out`` will be constructed using the name
   generation specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, slicing will not work and one
   would need to access the feature names directly::

       pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
       pipe1[:-1].feature_names_in_  # Works

       pipe2 = make_pipeline(LogisticRegression())
       pipe2[:-1].feature_names_in_  # Does not work

   ``pipe2[:-1]`` raises an error because it would produce a pipeline with
   no steps. We can work around this by allowing pipelines with no steps.

3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as a list or
   an ndarray.

4. Meta-estimators will delegate the setting and validation of
   ``feature_names_in_`` to their inner estimators. A meta-estimator will
   define ``feature_names_in_`` by referencing its inner estimators. For
   example, the ``Pipeline`` can use ``steps[0][1].feature_names_in_`` as
   the input feature names. If the inner estimators do not define
   ``feature_names_in_``, then the meta-estimator will not define
   ``feature_names_in_`` either.
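
The delegation in the last item can be sketched with a property. ``ToyMeta``
and ``Inner`` are hypothetical classes, and reading the names from the first
step only is an assumption that mirrors the ``Pipeline`` example.

```python
class Inner:
    """Hypothetical inner estimator; may or may not carry feature names."""


class ToyMeta:
    """Hypothetical meta-estimator holding (name, estimator) pairs."""

    def __init__(self, steps):
        self.steps = steps

    @property
    def feature_names_in_(self):
        # Delegate to the first inner estimator. If it has no names, the
        # AttributeError propagates and the attribute looks undefined here too.
        return self.steps[0][1].feature_names_in_


named = Inner()
named.feature_names_in_ = ["a", "b"]
unnamed = Inner()

print(ToyMeta([("first", named)]).feature_names_in_)
print(hasattr(ToyMeta([("first", unnamed)]), "feature_names_in_"))
```

Using a property avoids storing a second copy of the names on the
meta-estimator, so the two can never disagree.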

Backward compatibility
######################

1. This SLEP is fully backward compatible with previous versions. With the
   introduction of ``get_feature_names_out``, ``get_feature_names`` will
   be deprecated. Note that ``get_feature_names_out``'s signature will
   always contain ``input_features``, which can be used or ignored. This
   helps standardize the interface for retrieving feature names.

2. The inclusion of a ``get_feature_names_out`` method will not introduce any
   overhead to estimators.

3. The inclusion of a ``feature_names_in_`` attribute will increase the size of
   estimators because they would store the feature names. Users can call
   ``del est.feature_names_in_`` to remove the stored names and disable
   validation.
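
One way the deprecation in the first item could be staged is to keep
``get_feature_names`` as a thin wrapper that warns and forwards. This is only
a sketch: ``ToyEncoder`` is hypothetical, and the warning category and
message are assumptions, not text from this SLEP.

```python
import warnings


class ToyEncoder:
    """Hypothetical transformer exposing both the old and the new method."""

    def get_feature_names_out(self, input_features=None):
        return [f"{name}_encoded" for name in input_features]

    def get_feature_names(self, input_features=None):
        # Old entry point: warn, then forward to the new method.
        warnings.warn(
            "get_feature_names is deprecated; use get_feature_names_out instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.get_feature_names_out(input_features)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    names = ToyEncoder().get_feature_names(["pet"])

print(names, caught[0].category.__name__)
```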

Alternatives
############

There have been many attempts to address this issue:

1. ``array_out`` keyword parameter in ``transform``: This approach requires
   third party estimators to unwrap and wrap array containers in
   ``transform``, which places more burden on third party estimator
   maintainers. Furthermore, ``array_out`` with sparse data will introduce an
   overhead when being passed along in a ``Pipeline``. This overhead comes
   from the construction of the sparse data container that has the feature
   names.

2. :ref:`slep_007`: ``SLEP007`` introduces a ``feature_names_out_`` attribute
   while this SLEP proposes a ``get_feature_names_out`` method to accomplish
   the same task. The benefit of the ``get_feature_names_out`` method is that
   it can be used even if the feature names were not passed in ``fit`` with a
   dataframe. For example, in a ``Pipeline`` the feature names are not passed
   through to each step and a ``get_feature_names_out`` method can be used to
   get the names of each step with slicing.

3. :ref:`slep_012`: The ``InputArray`` was developed to work around the
   overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The
   introduction of another data structure into the Python data ecosystem would
   place more burden on third party estimator maintainers.


References and Footnotes
########################

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
#########

This document has been placed in the public domain. [1]_
