[MRG] Add experimental.ColumnTransformer #9012

Merged
Changes from 53 commits
90 commits
1937d56
add heterogeneous ColumnTransformer
amueller Jun 5, 2015
95bf6cb
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 6, 2017
914ba53
Get tests/examples working with current sklearn
jorisvandenbossche Jun 6, 2017
2333e61
Add support for numpy arrays and positional columns in dataframes as …
jorisvandenbossche Jun 6, 2017
464f7e6
add support for selecting multiple columns
jorisvandenbossche Jun 6, 2017
7777e2a
doc corrections
jorisvandenbossche Jun 7, 2017
42ce18c
Change to tuples instead of dict
jorisvandenbossche Jun 7, 2017
4a55b9b
Reimplement as subclass of FeatureUnion
jorisvandenbossche Jun 7, 2017
55a5372
Fix-ups and move tests
jorisvandenbossche Jun 7, 2017
74d0639
update docs
jorisvandenbossche Jun 7, 2017
b6883b9
Support selecting multiple columns from dict + ensure passed subset i…
jorisvandenbossche Jun 7, 2017
1c4f09b
Also support slices for positional subsets
jorisvandenbossche Jun 7, 2017
7cef7df
Fix 2d dict items case
jorisvandenbossche Jun 7, 2017
6ceed19
Refactor column selection based on discussion
jorisvandenbossche Jun 8, 2017
e19e3c1
clean-up + add more tests
jorisvandenbossche Jun 8, 2017
0116ac9
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 9, 2017
c7ea079
Nuke swiss army knife (no dict/recarray support)
jorisvandenbossche Jun 9, 2017
acff9dd
Add catch/reraise error with custom message
jorisvandenbossche Jun 9, 2017
4db243c
update docs
jorisvandenbossche Jun 9, 2017
6ab49a8
undo changes to utils
jorisvandenbossche Jun 9, 2017
2dda954
Move to experimental module
jorisvandenbossche Jun 10, 2017
0d0107f
fixup move to experimental
jorisvandenbossche Jun 10, 2017
267ca85
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 10, 2017
0c7b0d7
Move docs
jorisvandenbossche Jun 10, 2017
c711b55
add support for boolean masks
jorisvandenbossche Jun 10, 2017
0cb9770
Add make_column_transformer factory function
jorisvandenbossche Jun 10, 2017
9d24bb1
doc fixups
jorisvandenbossche Jun 10, 2017
11a5c0c
feedback
jorisvandenbossche Jun 10, 2017
a8efeeb
skip feature_extraction docs if pandas not installed
jorisvandenbossche Jun 10, 2017
20976b1
fix doctests + pep8
jorisvandenbossche Jun 10, 2017
e71a390
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 10, 2017
406b2a9
add to sklearn/setup.py
jorisvandenbossche Jun 14, 2017
ae12bbc
feedback
jorisvandenbossche Jun 14, 2017
70ed541
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 14, 2017
16bfae5
possible fix for get_params / set_params
jorisvandenbossche Jun 15, 2017
7ff02a4
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 16, 2017
a753833
updates for feedback
jorisvandenbossche Jun 16, 2017
bb4d721
Don't subclass FeatureUnion + clone passed transformers
jorisvandenbossche Jun 16, 2017
493116f
add named_transformers_ attribute
jorisvandenbossche Jun 19, 2017
a33ad8c
add test that confirms that transformers now actually get cloned
jorisvandenbossche Jun 19, 2017
18b814d
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jun 26, 2017
6cedbd7
added some more tests
jorisvandenbossche Jun 26, 2017
0229e5b
doc feedback guillaume
jorisvandenbossche Jun 28, 2017
f9d95eb
Merge remote-tracking branch 'origin/master' into amueller/heterogene…
glemaitre Jul 13, 2017
ca1647e
Solve the issue introduce by git during merging
glemaitre Jul 13, 2017
0707319
Addess Joel comments
glemaitre Jul 17, 2017
88ac893
remove validation from init
glemaitre Jul 17, 2017
91a5312
correct comment in example
glemaitre Jul 17, 2017
deb3b78
Do not modify transformer in init
glemaitre Jul 17, 2017
a6d7b77
Factorize _fit_* functions
glemaitre Jul 18, 2017
d287420
minor updates based on feedback
jorisvandenbossche Aug 21, 2017
2920912
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Aug 21, 2017
7b1ce95
refactor try except block to single helper function
jorisvandenbossche Aug 22, 2017
db9b2de
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Oct 27, 2017
e6d81af
move whatsnew + fix bad merge
jorisvandenbossche Oct 27, 2017
733b111
add passthrough kwarg
jorisvandenbossche Oct 27, 2017
6d639f0
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Nov 21, 2017
8d142fd
fixup basic passthrough implementation and tests
jorisvandenbossche Nov 21, 2017
af257e0
fix doctest
jorisvandenbossche Nov 21, 2017
6705233
use pytest setup to skip docs if no pandas
jorisvandenbossche Nov 22, 2017
2b591e4
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Nov 24, 2017
00aef88
move doc fixture to common conftest.py for docs
jorisvandenbossche Nov 24, 2017
9c2df9c
poc of passthrough=True
jorisvandenbossche Dec 5, 2017
4463fa7
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Dec 14, 2017
8d6e034
Update make_column_transformer to accept tuples instead of dict
jorisvandenbossche Dec 14, 2017
82a5697
some clean-up
jorisvandenbossche Dec 15, 2017
04cf4ff
more thoroughly test + fix passthrough
jorisvandenbossche Dec 16, 2017
db2eabd
add test to cover check of transformers
jorisvandenbossche Dec 16, 2017
c402fb2
feedback Joel
jorisvandenbossche Dec 22, 2017
9ae7753
add note on None transformer and 'remainder'
jorisvandenbossche Dec 22, 2017
c222101
small update to the tests
jorisvandenbossche Jan 12, 2018
26bf288
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Jan 12, 2018
8386fae
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Feb 7, 2018
28840ad
flake8
jorisvandenbossche Feb 7, 2018
14c7b1e
Merge remote-tracking branch 'upstream/master' into amueller/heteroge…
jorisvandenbossche Mar 29, 2018
608ba9a
Move ColumnTransformer from experimental to compose
jorisvandenbossche Mar 29, 2018
22c499c
fix sklearn/__init__.py
jorisvandenbossche Mar 29, 2018
333f878
fixup remaining usage of experimental
jorisvandenbossche Mar 29, 2018
c3f8733
fix doctest example
jorisvandenbossche Mar 29, 2018
4804cd8
switch transformers/columns order in make_column_transformer
jorisvandenbossche Apr 10, 2018
3d3e772
Add special-cased 'drop' and 'passthrough'
jorisvandenbossche Apr 18, 2018
3346268
Implement 'drop'/'passthrough' for remainder instead of passthrough k…
jorisvandenbossche May 1, 2018
7ded77a
remainder -> unspecified
jorisvandenbossche May 1, 2018
4835c29
fix doctests + remaining feedback Joel
jorisvandenbossche May 1, 2018
04bcb1e
pep8
jorisvandenbossche May 1, 2018
3d2a9bc
unspecified -> remainder
jorisvandenbossche May 25, 2018
afb7384
update for feedback
jorisvandenbossche May 25, 2018
d298fc3
switch default from 'drop' to 'passthrough' + add transformer ouput v…
jorisvandenbossche May 25, 2018
4098928
Add NotImplementedError for get_feature_names if columns are passed t…
jorisvandenbossche May 29, 2018
9ab27fb
move docs from feature_extraction.rst -> compose.rst
jorisvandenbossche May 29, 2018
24 changes: 24 additions & 0 deletions doc/modules/classes.rst
@@ -1381,6 +1381,30 @@ Low-level methods
utils.validation.column_or_1d
utils.validation.has_fit_parameter


.. _experimental_ref:

:mod:`sklearn.experimental`: Experimental functionality
=======================================================
Member

Add a note that any functionality here can change any time

Member Author

the note from the module docstring gets automatically inserted here


.. automodule:: sklearn.experimental
Member

This is not changed because we haven't agreed on where to put it?

Member Author

This is not changed because we haven't agreed on where to put it?

yes

:no-members:
:no-inherited-members:

.. currentmodule:: sklearn

.. autosummary::
:toctree: generated/
:template: class.rst

experimental.ColumnTransformer

.. autosummary::
:toctree: generated/
:template: function.rst

experimental.make_column_transformer

Recently deprecated
===================

87 changes: 86 additions & 1 deletion doc/modules/feature_extraction.rst
@@ -101,6 +101,91 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.


.. _column_transformer:
Member

This should be in compose.rst, but perhaps noted at the top of this file

Member Author

Yes, I know, but also (related to what I mentioned here: #9012 (comment)):

  • when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
  • we should reference this in preprocessing.rst
  • we should add a better "typical data science use case" example for the example gallery
  • I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.

I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also make a minimal change here and simply move the current docs addition to compose.rst without any of the other mentioned improvements.)


ColumnTransformer for heterogeneous data
========================================

.. warning::

The :class:`experimental.ColumnTransformer <sklearn.experimental.ColumnTransformer>`
class is experimental and the API is subject to change.

Many datasets contain features of different types, say text, floats, and dates,
where each type of feature requires separate preprocessing or feature
extraction steps. Often it is easiest to preprocess data before applying
scikit-learn methods, for example using `pandas <http://pandas.pydata.org/>`__.
Processing your data before passing it to scikit-learn might be problematic for
one of the following reasons:

1. Incorporating statistics from test data into the preprocessors makes
cross-validation scores unreliable (known as *data leakage*).
2. You may want to include the parameters of the preprocessors in a
:ref:`parameter search <grid_search>`.

:class:`~sklearn.experimental.ColumnTransformer` helps perform different
transformations for different columns of the data, within a
:class:`~sklearn.pipeline.Pipeline` that is safe from data leakage and that can
be parametrized. :class:`~sklearn.experimental.ColumnTransformer` works on
arrays, sparse matrices, and
`pandas DataFrames <http://pandas.pydata.org/pandas-docs/stable/>`__.
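
The two motivations listed in this section (leakage-safe preprocessing and searchable preprocessor parameters) can be sketched as follows. This is an illustration only and not part of the PR: it assumes the later `sklearn.compose` import path rather than the `sklearn.experimental` path used in this diff, and the step names `prep`, `num`, and `clf` and the toy data are made up for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer  # assumed path; this diff uses sklearn.experimental
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 20 samples, 3 numeric columns, alternating binary labels.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

pipe = Pipeline([
    # Scale columns 0 and 1 only; column 2 is passed through unchanged.
    ('prep', ColumnTransformer([('num', StandardScaler(), [0, 1])],
                               remainder='passthrough')),
    ('clf', LogisticRegression()),
])

# Preprocessor parameters are addressed as <step>__<transformer>__<param>.
param_grid = {'prep__num__with_std': [True, False]}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)  # the scaler is refit on the training part of each fold
```

Because the scaler lives inside the pipeline, its statistics are never computed on the held-out fold, which is the leakage concern raised in point 1.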

To each column, a different transformation can be applied, such as
preprocessing or a specific feature extraction method::

>>> import pandas as pd
>>> X = pd.DataFrame(
...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
...      'title': ["His Last Bow", "How Watson Learned the Trick",
...                "A Moveable Feast", "The Grapes of Wrath"]})

For this data, we might want to encode the ``'city'`` column as a categorical
variable, but apply a :class:`feature_extraction.text.CountVectorizer
<sklearn.feature_extraction.text.CountVectorizer>` to the ``'title'`` column.
As we might use multiple feature extraction methods on the same column, we give
each transformer a unique name, say ``'city_category'`` and ``'title_bow'``::

>>> from sklearn.experimental import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> column_trans = ColumnTransformer(
... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
Member

I think most users would find this incomprehensible. Can't we come up with an example that avoids this?

Member

+1

Member

Yeah this looks weird. We just need #9151 ;)

Most people would use LabelEncoder but I don't think we should encourage that.

How about not having a categorical variable, or encoding it as integer and using OneHotEncoder?

Member

I think we shouldn't spend too much time on coming up with an example here, as the number one use-case will be CategoricalEncoder which we don't have yet.

Member

We could do character n-grams vs word n-grams?

Member Author

Can we just leave it here as is, and make a clear TODO to update this once CategoricalEncoder is merged? As with that, I think it is a good example.

Member

We could also just one-hot encode some variables with this and scale/quantile-transform others...

Member

can now use CategoricalEncoder ;) and yes, different scaling is also fine. I would probably one-hot encode some and standard scale the rest, or something like that.

Member

Can you change that please?

Member Author

Yes, I have been at the point of changing this, but then realised that CategoricalEncoder does not yet support get_feature_names (which is used some lines below) ... But maybe I can rather remove that part of the example for now?

Member

should this be a CategoricalEncoder now?

Member Author

Yes, the only reason I didn't change it yet is that CategoricalEncoder does not yet support get_feature_names, which is used below.
However, get_feature_names is in general so sparsely supported (only in vectorizers) that in practice you will not be able to use it often at the moment.

Member

I'd also prefer CategoricalEncoder (but also happy to merge as is and fix later).

... ('title_bow', CountVectorizer(), 'title')])
Member

I think we should show an example of retaining an existing column ('price', None, ['price']). Oh no. That notation is unintelligible.

Member

Or am I getting confused and None is not the right way to identity transform...?

Member

It seems that there is nothing to make the identity transform: #9012 (comment)

In the current state, the only way would be to use a FunctionTransformer.
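
A minimal sketch of the FunctionTransformer workaround mentioned in this comment, assuming the transformer is constructed without a `func` argument, in which case it passes its input through unchanged. The array `X` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# With no function supplied, FunctionTransformer acts as an identity transform.
identity = FunctionTransformer()
X_out = identity.fit_transform(X)
```

Such an object could be given as the transformer for any column that should be kept as is.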

Member

No, for FeatureUnion None means "no features" and for Pipeline it means "Identity". So you could make_pipeline(None) ;) [that doesn't work for the wrong reasons iirc.]

Contributor

FWIW, I think a useful and hopefully simple example would be: given a homogeneous N x 4 array X of floats, how do I apply a StandardScaler to columns [0, 1], leaving [2, 3] untouched?
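
The example suggested in this comment (scaling columns [0, 1] of a float array while leaving [2, 3] untouched) could look roughly like the sketch below. It assumes the `remainder='passthrough'` option that the later commits in this PR introduce and the `sklearn.compose` import path of later releases, not the `sklearn.experimental` path shown in this diff. The transformer name `scale` and the data are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer  # assumed path; this diff uses sklearn.experimental
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0, 100.0, 1000.0],
              [3.0, 30.0, 300.0, 3000.0],
              [5.0, 50.0, 500.0, 5000.0]])

ct = ColumnTransformer(
    [('scale', StandardScaler(), [0, 1])],  # standardize columns 0 and 1
    remainder='passthrough',                # keep columns 2 and 3 as they are
)
Xt = ct.fit_transform(X)

# Transformed columns come first, followed by the passed-through remainder.
scaled, untouched = Xt[:, :2], Xt[:, 2:]
```

The `remainder` argument is set explicitly here because its default was changed more than once over the course of this PR.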


>>> column_trans.fit(X) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
ColumnTransformer(n_jobs=1, transformer_weights=None,
transformers=...)

>>> column_trans.get_feature_names()
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
['city_category__London', 'city_category__Paris', 'city_category__Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']

>>> column_trans.transform(X).toarray()
... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)

In the above example, the
Member

I think this makes this not a good first example, but I think it is good to mention this somewhere in the docs, maybe as a second example of why there's support for single-item columns. This is probably a very rare use case.

:class:`~sklearn.feature_extraction.text.CountVectorizer` expects a 1D array as
input and therefore the columns were specified as a string (``'city'``).
However, other transformers generally expect 2D data, and in that case you need
to specify the column as a list of strings (``['city']``).

Apart from a scalar or a single item list, the column selection can be specified
as a list of multiple items, an integer array, a slice, or a boolean mask.
Strings can reference columns if the input is a DataFrame, integers are always
interpreted as the positional columns.
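
The difference between a scalar selection (1D result) and a list selection (2D result) mirrors plain NumPy indexing. The sketch below shows only that analogy for the selection forms listed in this section. It is not the ColumnTransformer implementation, and the array is made up for illustration.

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # 3 samples, 4 columns

one_d = X[:, 0]            # scalar selection -> shape (3,)
two_d = X[:, [0]]          # single-item list -> shape (3, 1)
several = X[:, [0, 2]]     # list of positions -> shape (3, 2)
sliced = X[:, 1:3]         # slice -> shape (3, 2)
masked = X[:, np.array([True, False, False, True])]  # boolean mask -> shape (3, 2)
```

A transformer such as CountVectorizer wants the 1D form, while most other transformers want the 2D form.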

Member

Please mention make_column_transformer here
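
For reference, a hedged sketch of what a `make_column_transformer` call looks like. It assumes the `(transformer, columns)` tuple order and the `sklearn.compose` import path of recent scikit-learn releases. The argument order changed during this PR (see the commit "switch transformers/columns order in make_column_transformer"), so an older version may expect something different.

```python
from sklearn.compose import make_column_transformer  # assumed path; this diff uses sklearn.experimental
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ct = make_column_transformer(
    (StandardScaler(), [0, 1]),   # standardize columns 0 and 1
    (MinMaxScaler(), [2, 3]),     # rescale columns 2 and 3 to [0, 1]
)

# Names are generated from the lowercased class names, so none are passed in.
names = [name for name, _, _ in ct.transformers]
```

Compared with the ColumnTransformer constructor, the only thing saved is having to invent a name for each transformer.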

.. topic:: Examples:

Member

I think it's worth emphasising that one can use a list of fields

Member Author

added some more explanation

* :ref:`sphx_glr_auto_examples_column_transformer.py`


.. _feature_hashing:

Feature hashing
@@ -916,7 +1001,7 @@ Some tips and tricks:
(Note that this will not filter out punctuation.)


The following example will, for instance, transform some British spelling
The following example will, for instance, transform some British spelling
to American spelling::

>>> import re
12 changes: 12 additions & 0 deletions doc/modules/feature_extraction_fixture.py
@@ -0,0 +1,12 @@
"""Fixture module to skip the feature_extraction docs when pandas is not
Member

Does that work for both, nosetests and pytest?

Member Author

I think that is still one of the issues with the complete move to pytest, and there are other fixtures to fix as well (so would leave that for another issue, unless it would already be fixed in master)

#9445 is the issue

installed

"""
from sklearn.utils.testing import SkipTest


def setup(module):
    try:
        import pandas  # noqa
    except ImportError:
        raise SkipTest("pandas not installed")
13 changes: 8 additions & 5 deletions doc/modules/pipeline.rst
@@ -220,9 +220,13 @@ FeatureUnion: composite feature spaces
:class:`FeatureUnion` combines several transformer objects into a new
transformer that combines their output. A :class:`FeatureUnion` takes
a list of transformer objects. During fitting, each of these
is fit to the data independently. For transforming data, the
transformers are applied in parallel, and the sample vectors they output
are concatenated end-to-end into larger vectors.
is fit to the data independently. The transformers are applied in parallel,
and the feature matrices they output are concatenated side-by-side into a
larger matrix.
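
To make "concatenated side-by-side" concrete, here is a small sketch with two scalers applied to the same input. The step names `std` and `minmax` and the data are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])

# Both transformers see the full input; their outputs are stacked column-wise.
union = FeatureUnion([('std', StandardScaler()), ('minmax', MinMaxScaler())])
Xt = union.fit_transform(X)  # 2 columns from 'std' followed by 2 from 'minmax'
```

Every transformer in a FeatureUnion receives all columns, which is the main difference from ColumnTransformer, where each transformer receives only its selected columns.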

When you want to apply different transformations to each field of the data,
Member

Put another blank line to make this a separate paragraph?

see the related class :class:`sklearn.experimental.ColumnTransformer`
(see :ref:`user guide <column_transformer>`).

:class:`FeatureUnion` serves the same purposes as :class:`Pipeline` -
convenience and joint parameter estimation and validation.
@@ -272,5 +276,4 @@ and ignored by setting to ``None``::

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_feature_stacker.py`
* :ref:`sphx_glr_auto_examples_hetero_feature_union.py`
* :ref:`sphx_glr_auto_examples_feature_stacker.py`
34 changes: 34 additions & 0 deletions doc/whats_new.rst
@@ -158,6 +158,36 @@ Other estimators

Model selection and evaluation

- :class:`model_selection.GridSearchCV` and
:class:`model_selection.RandomizedSearchCV` now support simultaneous
evaluation of multiple metrics. Refer to the
:ref:`multimetric_grid_search` section of the user guide for more
information. :issue:`7388` by `Raghav RV`_

- Added the :func:`model_selection.cross_validate` which allows evaluation
of multiple metrics. This function returns a dict with more useful
information from cross-validation such as the train scores, fit times and
score times.
Refer to :ref:`multimetric_cross_validation` section of the userguide
for more information. :issue:`7388` by `Raghav RV`_

- Added :func:`metrics.mean_squared_log_error`, which computes
the mean square error of the logarithmic transformation of targets,
particularly useful for targets with an exponential trend.
:issue:`7655` by :user:`Karan Desai <karandesai-96>`.

- Added :class:`experimental.ColumnTransformer`, which allows applying
Member

This is not correctly merged

different transformers to different columns of arrays or pandas
dataframes. By `Andreas Müller`_ and `Joris Van den Bossche`_.

- Added :func:`metrics.dcg_score` and :func:`metrics.ndcg_score`, which
compute Discounted cumulative gain (DCG) and Normalized discounted
cumulative gain (NDCG).
:issue:`7739` by :user:`David Gasquez <davidgasquez>`.

- Added the :class:`model_selection.RepeatedKFold` and
:class:`model_selection.RepeatedStratifiedKFold`.
:issue:`8120` by `Neeraj Gangwar`_.
- :class:`model_selection.GridSearchCV` and
:class:`model_selection.RandomizedSearchCV` now support simultaneous
evaluation of multiple metrics. Refer to the
@@ -5741,7 +5771,11 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Vincent Pham: https://github.com/vincentpham1991

.. _Denis Engemann: http://denis-engemann.de

.. _Anish Shah: https://github.com/AnishShah

.. _Neeraj Gangwar: http://neerajgangwar.in

.. _Arthur Mensch: https://amensch.fr

.. _Joris Van den Bossche: https://github.com/jorisvandenbossche
79 changes: 17 additions & 62 deletions examples/hetero_feature_union.py → examples/column_transformer.py
@@ -12,12 +12,12 @@
require different processing pipelines.

This example demonstrates how to use
:class:`sklearn.feature_extraction.FeatureUnion` on a dataset containing
:class:`sklearn.experimental.ColumnTransformer` on a dataset containing
different types of features. We use the 20-newsgroups dataset and compute
standard bag-of-words features for the subject line and body in separate
pipelines as well as ad hoc features on the body. We combine them (with
weights) using a FeatureUnion and finally train a classifier on the combined
set of features.
weights) using a ColumnTransformer and finally train a classifier on the
combined set of features.

The choice of features is not particularly helpful, but serves to illustrate
the technique.
@@ -38,50 +38,11 @@
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.experimental import ColumnTransformer
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
"""For data grouped by feature, select subset of data at a provided key.

The data is expected to be stored in a 2D data structure, where the first
index is over features and the second is over samples. i.e.

>> len(data[key]) == n_samples

Please note that this is the opposite convention to scikit-learn feature
matrixes (where the first index corresponds to sample).

ItemSelector only requires that the collection implement getitem
(data[key]). Examples include: a dict of lists, 2D numpy array, Pandas
DataFrame, numpy record array, etc.

>> data = {'a': [1, 5, 2, 5, 2, 8],
'b': [9, 4, 1, 4, 1, 3]}
>> ds = ItemSelector(key='a')
>> data['a'] == ds.transform(data)

ItemSelector is not designed to handle data grouped by sample. (e.g. a
list of dicts). If your data is structured this way, consider a
transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

Parameters
----------
key : hashable, required
The key corresponding to the desired value in a mappable.
"""
def __init__(self, key):
self.key = key

def fit(self, x, y=None):
return self

def transform(self, data_dict):
return data_dict[self.key]


class TextStats(BaseEstimator, TransformerMixin):
"""Extract features from each document for DictVectorizer"""

@@ -104,21 +65,22 @@ def fit(self, x, y=None):
return self

def transform(self, posts):
features = np.recarray(shape=(len(posts),),
dtype=[('subject', object), ('body', object)])
# construct object dtype array with two columns
# first column = 'subject' and second column = 'body'
features = np.empty(shape=(len(posts), 2), dtype=object)
Member

As we are going to be in the "column" namespace, where we support pandas dataframes, should we use a pandas dataframe in this example, rather than an object array?

Member Author

In this example I didn't use pandas, as it seems like a bit of overhead (it would just be for temporarily putting the two columns in a frame to pass it to a next frame). But we certainly need another example with a pandas dataframe (e.g. with adults).
But I can change it here as well if needed.

for i, text in enumerate(posts):
headers, _, bod = text.partition('\n\n')
bod = strip_newsgroup_footer(bod)
bod = strip_newsgroup_quoting(bod)
features['body'][i] = bod
features[i, 1] = bod

prefix = 'Subject:'
sub = ''
for line in headers.split('\n'):
if line.startswith(prefix):
sub = line[len(prefix):]
break
features['subject'][i] = sub
features[i, 0] = sub

return features

@@ -128,37 +90,30 @@ def transform(self, posts):
('subjectbody', SubjectBodyExtractor()),

# Use FeatureUnion to combine the features from subject and body
('union', FeatureUnion(
transformer_list=[
('union', ColumnTransformer(
[
# Pulling features from the post's subject line (first column)
('subject', TfidfVectorizer(min_df=50), 0),

# Pipeline for pulling features from the post's subject line
('subject', Pipeline([
('selector', ItemSelector(key='subject')),
('tfidf', TfidfVectorizer(min_df=50)),
])),

# Pipeline for standard bag-of-words model for body
# Pipeline for standard bag-of-words model for body (second column)
('body_bow', Pipeline([
('selector', ItemSelector(key='body')),
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 1),

# Pipeline for pulling ad hoc features from post's body
('body_stats', Pipeline([
('selector', ItemSelector(key='body')),
('stats', TextStats()), # returns a list of dicts
('vect', DictVectorizer()), # list of dicts -> feature matrix
])),

]), 1),
],

# weight components in FeatureUnion
Contributor

Is this comment still accurate, or should it be ColumnTransformer?

transformer_weights={
'subject': 0.8,
'body_bow': 0.5,
'body_stats': 1.0,
},
}
)),

# Use a SVC classifier on the combined features
7 changes: 4 additions & 3 deletions sklearn/__init__.py
@@ -135,15 +135,16 @@ def config_context(**new_config):
__check_build # avoid flakes unused variable error

__all__ = ['calibration', 'cluster', 'covariance', 'cross_decomposition',
'cross_validation', 'datasets', 'decomposition', 'dummy',
'ensemble', 'exceptions', 'externals', 'feature_extraction',
'cross_validation', 'datasets', 'decomposition',
'discriminant_analysis', 'dummy', 'ensemble', 'exceptions',
'experimental', 'externals', 'feature_extraction',
'feature_selection', 'gaussian_process', 'grid_search',
'isotonic', 'kernel_approximation', 'kernel_ridge',
'learning_curve', 'linear_model', 'manifold', 'metrics',
'mixture', 'model_selection', 'multiclass', 'multioutput',
'naive_bayes', 'neighbors', 'neural_network', 'pipeline',
'preprocessing', 'random_projection', 'semi_supervised',
'svm', 'tree', 'discriminant_analysis',
'svm', 'tree',
# Non-modules:
'clone']

9 changes: 9 additions & 0 deletions sklearn/experimental/__init__.py
@@ -0,0 +1,9 @@
"""
The :mod:`sklearn.experimental` module hosts experimental functionality for
which the API is not yet guaranteed to be stable.
"""

from ._column_transformer import ColumnTransformer, make_column_transformer


__all__ = ['ColumnTransformer', 'make_column_transformer']