[MRG] Add experimental.ColumnTransformer #9012

Merged
merged 90 commits into from May 29, 2018

Conversation

@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 6, 2017

Continuation of @amueller's PR #3886 (for now just rebased and updated for changes in scikit-learn)

Fixes #2034; closes #3886, closes #8540, closes #8539.

@amueller
Member

amueller commented Jun 6, 2017

feel free to squash my commits

examples/column_transformer.py:

('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 'body'),

@vene
Member

vene Jun 6, 2017

Are we sold on the tuple-based API here? I'd like it if this were a bit more explicit... (I'd like it to say column_name='body' somehow)

@vene
Member

vene Jun 6, 2017

While we're at it, why is the outer structure a dict and not an ordered list of tuples like FeatureUnion?

doc/modules/feature_extraction.rst:

Often it is easiest to preprocess data before applying scikit-learn methods, for example using
pandas.
If the preprocessing has parameters that you want to adjust within a
grid-search, however, they need to be inside a transformer. This can be

@vene
Member

vene Jun 6, 2017

they = ?

I'd remove the whole sentence and just say ":class:`ColumnTransformer` is a convenient way to perform heterogeneous preprocessing on data columns within a pipeline."

doc/modules/feature_extraction.rst:

.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.

@vene
Member

vene Jun 6, 2017

Do you mean "will give all feature values for the selected sample, e.g. (X_array[1].shape == (n_features,))"?

@jorisvandenbossche
Contributor

jorisvandenbossche Jun 6, 2017

Yeah, I didn't update the rst docs yet, and I also saw there are still some errors like these.

@vene
Member

vene Jun 6, 2017

sure, don't take my comment personally, just making notes

doc/modules/feature_extraction.rst:

:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
``1`` for each sample (``X_columns[1].shape == (n_samples,)``).

@vene
Member

vene Jun 6, 2017

Are we supporting integer column labels?

doc/modules/feature_extraction.rst:

.. note::
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called

@vene
Member

vene Jun 6, 2017

-> a pandas DataFrame

Also this chapter should probably have actual links to the pandas website or something, for readers who might have no idea what we're talking about.

@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 6, 2017

So the current way to specify a transformer is like this:

ColumnTransformer({"name": (Transformer(), column), ..})

(where "name" is the transformer name, and column is the column on which to apply the transformer).

There was some discussion and back-and-forth about this in the original PR, and other options mentioned are (as far as I read it correctly):

ColumnTransformer([("name", Transformer(), column), ..])  # more similar to the Pipeline interface

or

ColumnTransformer([('column', Transformer()), ..])  # in this case the transformer name and column name have to be identical

BTW, when using dicts, I would actually find this interface more logical:

ColumnTransformer({"column": ("name", Transformer()), ..})

which switches the places of the column and the transformer name. This gives you a tuple of (name, trans) similar to the Pipeline interface, and uses the dict key to select (which mimics how the values are selected from the input data with getitem).
But this has the disadvantage that we cannot extend it to multiple columns with lists (since lists cannot be dict keys).
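To make the trade-offs concrete, here is a minimal sketch of how the list-of-tuples spec could be dispatched. This is an illustration only, not the PR's implementation: `apply_column_transformers` and the `Doubler` stand-in are hypothetical names, and a dict of 1-D arrays stands in for real columnar input.

```python
import numpy as np

def apply_column_transformers(transformers, X):
    """Apply each (name, transformer, column) entry to its column of X
    and concatenate the results side by side.

    ``transformers`` is a list of (name, transformer, column) tuples,
    mirroring the list-based spec above; ``X`` is a dict of 1-D numpy
    arrays keyed by column name.
    """
    outputs = []
    for name, transformer, column in transformers:
        col = X[column].reshape(-1, 1)       # each transformer sees a 2-D block
        outputs.append(transformer.fit_transform(col))
    return np.hstack(outputs)                # output order follows the list order

class Doubler:
    """Toy stand-in for a scikit-learn transformer."""
    def fit_transform(self, X, y=None):
        return X * 2

X = {"a": np.array([1.0, 2.0]), "b": np.array([10.0, 20.0])}
result = apply_column_transformers(
    [("double_a", Doubler(), "a"), ("double_b", Doubler(), "b")], X)
# result is a (2, 2) array: [[2., 20.], [4., 40.]]
```

Unlike a dict-keyed spec, the list form gives a deterministic stacking order for free.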

@amueller
Member

amueller commented Jun 6, 2017

ColumnTransformer({"column": ("name", Transformer()), ..})

The column is a numpy array, right? So it's not hashable.

I think we could use either the list or the dict approach here, and have a helper make_column_transformer or something that does

make_column_transformer({Transformer(): column})

Transformer is hashable, so that works, and we can generate the name from the class name as in make_pipeline.
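The name-generation idea mentioned above can be sketched as follows: derive each key from the estimator's lowercased class name and deduplicate repeated classes with numeric suffixes, similar to what make_pipeline does. `generate_names` is a hypothetical helper, not code from this PR.

```python
def generate_names(estimators):
    """Derive unique lowercase names from estimator class names,
    appending -1, -2, ... when the same class appears more than once
    (mirroring the make_pipeline naming scheme)."""
    names = [type(est).__name__.lower() for est in estimators]
    counts = {}
    for name in names:
        counts[name] = counts.get(name, 0) + 1
    result, seen = [], {}
    for name in names:
        if counts[name] > 1:
            seen[name] = seen.get(name, 0) + 1
            result.append("%s-%d" % (name, seen[name]))
        else:
            result.append(name)
    return result

# Toy stand-ins for real estimator classes:
class StandardScaler: pass
class TfidfVectorizer: pass

print(generate_names([StandardScaler(), TfidfVectorizer(), StandardScaler()]))
# ['standardscaler-1', 'tfidfvectorizer', 'standardscaler-2']
```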

@amueller
Member

amueller commented Jun 6, 2017

Oh, I didn't fully read your comment. I think multiple columns are essential, so we can't do that...

@vene
Member

vene commented Jun 7, 2017

Maybe I'm a little bit confused, but then, does this overlap in scope with FeatureUnion?

If there is more than one transformer, we need to know what order to use for column-stacking their output, right? So if we use a dict with transformers as keys, can we guarantee a consistent order?

@vene
Member

vene commented Jun 7, 2017

Let's try to discuss this with @amueller before proceeding. I think I'll help on this today.

jorisvandenbossche added some commits Jun 7, 2017

doc/modules/pipeline.rst:

transformations to each field of the data, producing a homogeneous feature
matrix from a heterogeneous data source.
The transformers are applied in parallel, and the feature matrices they output
are concatenated side-by-side into a larger matrix.

@vene
Member

vene Jun 7, 2017

Since this PR adds ColumnTransformer, we can say here something like: for data organized in fields with heterogeneous types, see the related class :class:`ColumnTransformer`.

@jnothman jnothman referenced this pull request Jun 7, 2017

Closed

[MRG+1] Stacking classifier with pipelines API #8960

7 of 7 tasks complete
@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 7, 2017

Additional problem that we encountered:

In the meantime (since the original PR was made), transformers need 2D X values. Therefore, I made sure in the ColumnTransformer that I always pass the subset of columns to the transformer as a 2D object, also when you apply the transformer on only a single column.

But, by ensuring this, the example using a TfidfVectorizer fails, because that one expects a 1D object of text samples.

So possible options:

  1. Add a way to specify in the ColumnTransformer that for certain transformers the passed X values should be kept as 1D.
     E.g. we could have

     ColumnTransformer([('tfidf', TfidfVectorizer(), 'text_col'), ...],
                       flatten_column={'tfidf': True})

     where flatten_column (or another name like keep_1d) is False by default (satisfying the normal transformers), but you can specify per transformer if you want to override this default of False.
     The use of a dict here is similar to the transformer_weights keyword.

  2. Adapt the example (which means: letting the user write more boilerplate) to fix it, e.g. by adding a step in the pipeline to select the single column from the 2D object.
     We could add one line to the current pipeline that holds the TF-IDF (this tuple is one of the transformers in the ColumnTransformer):

             # Pipeline for standard bag-of-words model for body
             ('body_bow', Pipeline([
     added -->   ('flatten', FunctionTransformer(lambda x: x[:, 0], validate=False)),
                 ('tfidf', TfidfVectorizer()),
                 ('best', TruncatedSVD(n_components=50)),
             ]), 'body'),

  3. Adapt TfidfVectorizer to e.g. have a keyword that allows specifying that 2D data is expected (which would be False by default for backwards compatibility).
     If we would like to do this one, it might ideally be a separate PR; the second option can then be used as a temporary hack to get the example working, and removed in the other PR.
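The flatten step in option 2 only undoes the 2D packaging. In plain numpy terms (a sketch without the sklearn FunctionTransformer wrapper):

```python
import numpy as np

# ColumnTransformer would hand the text pipeline a 2-D column block ...
body = np.array([["first document"], ["second document"]])

# ... and the proposed 'flatten' step recovers the 1-D sequence of raw
# texts that TfidfVectorizer expects as input:
flatten = lambda x: x[:, 0]
texts = flatten(body)

print(texts.shape)   # (2,)
print(list(texts))   # ['first document', 'second document']
```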

@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 7, 2017

Another option would be:

  4. Make a distinction between specifying the columns as a scalar or as a list: when using a scalar, the data is passed as 1D to the transformer; with a list, as 2D data.
     The disadvantage of this is that for all transformers except the text vectorizers, you will have to specify a single column as a list:

     ColumnTransformer([('tfidf', TfidfVectorizer(), 'text_col'),
                        ('scaler', StandardScaler(), ['other_col'])])
@vene
Member

vene commented Jun 7, 2017

Option 4 seems too magic and not explicit enough...

@amueller
Member

amueller commented Jun 7, 2017

I don't think it's too magic; it mirrors indexing semantics in numpy:

In [2]: import numpy as np

In [7]: x = np.arange(9).reshape(3, 3)

In [8]: x[1]
Out[8]: array([3, 4, 5])

In [9]: x[[1]]
Out[9]: array([[3, 4, 5]])
@amueller
Member

amueller commented Jun 7, 2017

I'm not sure if it's good, though, as it adds (minimal) work in the common case. Though the most common case is probably multiple columns. So maybe it's better than any of the other options?

sklearn/pipeline.py:

class ColumnTransformer(FeatureUnion):
    """Applies transformers to columns of a array / dataframe / dict.

@jnothman
Member

jnothman Jun 8, 2017

a -> an

sklearn/pipeline.py:

List of (name, transformer, column) tuples specifying the transformer
objects to be applied to subsets of the data. The columns can be
specified as a scalar or list (for multiple columns) of integer or
string values. Integers are interpreted as the positional columns,

@jnothman
Member

jnothman Jun 8, 2017

are slice objects allowed?

@jorisvandenbossche
Contributor

jorisvandenbossche Jun 8, 2017

Yes, see two lines below (this was a request).

sklearn/pipeline.py:

objects to be applied to subsets of the data. The columns can be
specified as a scalar or list (for multiple columns) of integer or
string values. Integers are interpreted as the positional columns,
strings as the keys of `X`. In case of positional integers, also

@jnothman
Member

jnothman Jun 8, 2017

What does "keys" mean? It should be clear what datatypes we support for X, e.g. mappings, dataframes, struct arrays. We particularly need to be clear if we do not override fit, fit_transform, transform and their docstrings.

@jorisvandenbossche
Contributor

jorisvandenbossche Jun 8, 2017

Where is the best place to put this more detailed information? (As the input X to fit/transform is not one of the Parameters.) In a Notes section, or in the extended summary?

@vene
Member

vene Jun 8, 2017

Mention it in the extended summary, with the full explanation in Notes?

sklearn/pipeline.py:

@@ -673,13 +683,13 @@ def _validate_transformers(self):
"transform. '%s' (type %s) doesn't" %
(t, type(t)))
def _iter(self):
def _iter(self, X=None, skip_none=True):
"""Generate (name, est, weight) tuples excluding None transformers

@jnothman
Member

jnothman Jun 8, 2017

outdated

@vene
Member

vene Jun 8, 2017

skip_none in particular needs to be documented, because the need for it was quite subtle and will be forgotten.

sklearn/pipeline.py:

and all(isinstance(col, int) for col in column))):
column_names = False
elif (isinstance(column, six.string_types)
or (isinstance(column, list)

@jnothman
Member

jnothman Jun 8, 2017

DataFrame.loc also supports slicing over strings

@jorisvandenbossche
Contributor

jorisvandenbossche Jun 8, 2017

Yes, indeed. For now I didn't support it, for simplicity (as I then have to check the type of the slice start and stop), but it would probably be cleaner (and useful) from a user perspective to also support this if positional slicing is supported.

But, thinking about it now, the problem becomes: what to do with dicts and recarrays? Raise an error? (For recarrays you could in principle determine what the user wants with some convoluted code, but for a dict you cannot, as keys are not sorted.)

We could of course also say in general that we don't support getting multiple keys in the case of a dict.

@amueller
Member

amueller Jun 8, 2017

I don't mind supporting a subset of the common cases at first. This is complex enough as-is.

sklearn/pipeline.py:

if not isinstance(column, list):
    return _ensure_2d(X[column])
else:
    return np.hstack([_ensure_2d(X[col]) for col in column])

@jnothman
Member

jnothman Jun 8, 2017

What if one of the X[col] is a sparse matrix?

@jorisvandenbossche
Contributor

jorisvandenbossche Jun 8, 2017

Good catch. Yes, this will currently not work at all.

We can check for this and use sparse.hstack for sparse arrays (and also update _ensure_2d to work with them). But then you would also need to check that each column of the dict is a sparse one, and not a mixture, because then you would need to convert the sparse ones to dense, ...

Or we can say we don't support getting multiple columns (for one transformer) from dictionaries, to avoid this. Ideas?

@amueller
Member

amueller Jun 8, 2017

How about doing the same as in FeatureUnion (which I think uses sparse stacking if any of the outputs is sparse; it makes slightly more sense to make a dense array sparse than a sparse one dense)?

@amueller
Member

amueller Jun 8, 2017

Sorry, I made no sense.
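The FeatureUnion-style behaviour suggested above (stack sparse if any block is sparse, dense otherwise) can be sketched roughly as follows. This assumes scipy is available; `stack_columns` is an illustrative helper, not the code this PR ended up with.

```python
import numpy as np
from scipy import sparse

def stack_columns(blocks):
    """Column-stack 2-D blocks; if any block is sparse, promote the
    result to sparse rather than densifying the sparse block."""
    if any(sparse.issparse(b) for b in blocks):
        return sparse.hstack(blocks).tocsr()
    return np.hstack(blocks)

dense = np.array([[1.0], [2.0]])
sp = sparse.csr_matrix([[3.0], [4.0]])

out = stack_columns([dense, sp])   # sparse result, shape (2, 2)
```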

sklearn/pipeline.py:

else:
    # dicts or numpy recarrays
    if not isinstance(column, list):
        return _ensure_2d(X[column])

@jnothman
Member

jnothman Jun 8, 2017

This will be a problem for vectorising text (or dicts) coming out of a DataFrame column.

I suspect we'll need different semantics. If one wants to extract a 2D array, one will need to use a slice/list. These are the same semantics as provided by DataFrame and recarray (except that the recarray slice output is not considered a single-typed array).

@jorisvandenbossche
Contributor

jorisvandenbossche Jun 8, 2017

Yes (this was one of the todos). See the comments in the main thread (#9012 (comment) and below). What you suggest is basically option 4, I think.

Apart from the shape (1D vs 2D) problem, there is also the problem of the conversion to an array in the case of text data (e.g. when you have a dict -> list of text data). Just converting to an array is not what we want, because this will convert it to some numpy string dtype. But to do this well, you need to start inferring the type of the data in the list, leading to a whole new can of worms... (a problem that is also encountered in #8793 to support strings in the OneHotEncoder).

If we don't reshape X to a 2D array in the case of a scalar column selection, we could in principle leave such lists intact, avoiding this problem.
But you would still have this conversion problem for text columns when selecting multiple columns.
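The string-dtype issue raised above is easy to demonstrate with plain numpy (an illustration of the point, not code from the PR):

```python
import numpy as np

texts = ["a short text", "a considerably longer text"]

# A naive conversion picks a fixed-width unicode dtype sized to the
# longest element, which truncates any longer string assigned later:
fixed = np.array(texts)
print(fixed.dtype)          # a fixed-width dtype such as <U26

# An object array keeps the Python strings intact, which is what a
# text vectorizer actually wants to receive:
flexible = np.array(texts, dtype=object)
print(flexible.dtype)       # object
```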

@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 8, 2017

In some way I like option 4 the most, as it is indeed consistent with how indexing works ('col' -> 1D, ['col'] -> 2D). But a problem I have with it is that users who want to apply a transformer on a single column will typically first try the scalar case col, and then get the following error from the transformer:

ValueError: Got X with X.ndim=1. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

from which it is totally not clear that you have to use ['col'] to solve this.

@amueller
Member

amueller commented Jun 8, 2017

@jorisvandenbossche as I said IRL, I'm not opposed to detecting this error and giving a better one. That's a slippery slope, but it could really improve usability here.
We still need to do the sniffing for string types. @jnothman, any ideas for that? Maybe I'll ask @GaelVaroquaux ^^

@jnothman
Member

jnothman commented Jun 8, 2017

@jnothman
Member

jnothman commented Jun 8, 2017

@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 8, 2017

What @amueller and I were just discussing is that we could just not convert lists for now (and thus leave the problem to the user). So if you select one column, the list is just passed through to the transformer; if you select multiple columns that consist of lists, we can raise an error saying the user has to convert them to an array to be able to do this (and the most sensible thing is then to use an object array when having strings).

This would prevent the whole conversion-of-list-of-strings-to-array problem for now (we can always reconsider if this problem is tackled in the OneHotEncoder).

@jnothman
Member

jnothman commented Jun 8, 2017

@jorisvandenbossche
Contributor

jorisvandenbossche commented Jun 8, 2017

Yes, that is true. So indeed, maybe better to start with not supporting that at all. The user can always put a 2D array inside a dict to achieve the same result.

@amueller

Member

amueller commented Jun 8, 2017

> But even a dict of arrays will cause problems when the arrays are heterogeneous.

Why? Won't numpy go to the most generic type? Why is that a problem?
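A small illustration of the point under discussion: numpy does go to "the most generic type", but for mixed numeric/string data that type is a string dtype, so the numeric values are silently coerced. This is a minimal sketch of the failure mode, not code from the PR.

```python
import numpy as np

# Heterogeneous data: one numeric column, one string column.
mixed = np.array([[1, "a"], [2, "b"]])

# numpy upcasts to the most generic type -- a fixed-width unicode
# string dtype -- so the numeric column silently becomes strings.
print(mixed.dtype.kind)   # 'U'
print(mixed[0, 0] + "!")  # '1!' -- the number 1 is now the string '1'

# An object-dtype array at least preserves the original Python types:
obj = np.array([[1, "a"], [2, "b"]], dtype=object)
print(type(obj[0, 0]))    # <class 'int'>
```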

@pierreglaser

pierreglaser commented May 17, 2018

I just tried to use ColumnTransformer with LabelBinarizer as a transformer. The code raises an error since the fit_transform signature of LabelBinarizer is fit_transform(self, y) instead of the expected fit_transform(self, X, y=None).

There are obviously workarounds to avoid using LabelBinarizer (e.g. using CategoricalEncoder), but should we keep this as it is, or change the signature of LabelBinarizer?

@jorisvandenbossche

Contributor

jorisvandenbossche commented May 17, 2018

It's true that many people currently use LabelEncoder / LabelBinarizer to (one-hot-)encode their features, but you also cannot use them in a pipeline today (as they are meant for the labels, not the features).
Given that existing limitation, and given that there is now no reason any more to use them on X (as CategoricalEncoder should cover all those use cases), I don't think we should fix that.

Note that if you write a custom transformer nowadays whose fit method only accepts X and not X, y, you will get the same error message if you put it in a normal Pipeline.
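The mechanism behind the error can be sketched without sklearn: composite estimators always pass two positional arguments to fit_transform, while a label transformer only accepts one. The classes below are hypothetical stand-ins for LabelBinarizer and a regular transformer, not the actual sklearn implementations.

```python
class LabelStyle:
    """Stand-in for LabelBinarizer: fit_transform takes only y."""
    def fit_transform(self, y):
        return [[v] for v in y]

class FeatureStyle:
    """Stand-in for a regular transformer: fit_transform(X, y=None)."""
    def fit_transform(self, X, y=None):
        return X

def fit_like_pipeline(trans, X, y=None):
    # Pipeline and ColumnTransformer always pass two positional arguments.
    return trans.fit_transform(X, y)

fit_like_pipeline(FeatureStyle(), [["a"], ["b"]])  # works fine

try:
    fit_like_pipeline(LabelStyle(), [["a"], ["b"]])
except TypeError as exc:
    print("TypeError:", exc)  # too many positional arguments
```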

@amueller

Member

amueller commented May 19, 2018

Just catching up with the discussion and it all looks great! Merge? Or should I do another pass?

@amueller

I'm so excited about this!

doc/modules/feature_extraction.rst
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> column_trans = ColumnTransformer(
... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),

@amueller

amueller May 19, 2018

I'd also prefer CategoricalEncoder (but also happy to merge as is and fix later).

sklearn/compose/_column_transformer.py
Parameters
----------
transformers : list of tuples
List of (name, transformer, column) tuples specifying the transformer

@amueller

amueller May 19, 2018

shouldn't it be column(s)?

sklearn/compose/_column_transformer.py
strings 'drop' and 'passthrough' are accepted as well, to
indicate to drop the columns or to pass them through untransformed,
respectively.
column : string or int, array-like of string or int, slice or boolean \

@amueller

amueller May 19, 2018

columns?

sklearn/compose/_column_transformer.py
Like in FeatureUnion and Pipeline, this allows the transformer and
its parameters to be set using ``set_params`` and searched in grid
search.
transformer : estimator or {'passthrough', 'drop'}

@amueller

amueller May 19, 2018

Should we deprecate None in Pipeline and replace it by 'passthrough'? That might also allow us to encode an empty pipeline, which is a bit awkward now (i.e. one where you want to search over the steps).

@jorisvandenbossche

jorisvandenbossche May 25, 2018

That seems like a good thing to discuss; let's open a separate issue for it.

sklearn/compose/_column_transformer.py
otherwise a 2d array will be passed to the transformer.
unspecified : {'passthrough', 'drop'}, default 'drop'
By default, only the specified columns in `transformers` are

@amueller

amueller May 19, 2018

Single backticks? Is that referring to the glossary now? (Sorry, I'm not caught up.) Or do you mean double backticks?

@jorisvandenbossche

jorisvandenbossche May 23, 2018

The numpydoc format uses single backticks to refer to parameter names; I am not sure what the convention in sklearn is.

@amueller

amueller May 23, 2018

really? Since when? I thought that this would refer to the "default role" that we haven't defined yet. Maybe @jnothman can enlighten me?

@jnothman

jnothman May 23, 2018

Yes, unless we enable a default_role, single backticks do nothing.

@amueller

amueller May 24, 2018

OK. @jorisvandenbossche, where did you get the idea about numpydoc? (@jnothman is probably the most active numpydoc dev, so I kinda trust him on this ;)

@amueller

amueller May 24, 2018

The last three review comments I made were "single backticks do nothing". Maybe we should enable that default role lol. But we could think about using the glossary as the default role instead of code as an alternative.

@jorisvandenbossche

jorisvandenbossche May 25, 2018

Single backticks make the text italic if you do not enable a default role (I think).

In all the examples at https://numpydoc.readthedocs.io/en/latest/format.html#common-rest-concepts, single backticks are used for parameter names. The section I linked to says this should also be done for variable, class, and module names, but I suppose this is with the idea that the default role links to the API docs (which will never work for parameter names).

@jnothman

jnothman May 26, 2018

We have for a long time explicitly disabled the default_role and do not get italics. I agree it would be better to use it.

Input data, of which specified subsets are used to fit the
transformers.
y : array-like, shape (n_samples, ...), optional

@amueller

amueller May 19, 2018

Is this the standard for the shape of y now? I haven't seen that before.

@jorisvandenbossche

jorisvandenbossche May 25, 2018

This is how it is documented in Pipeline. However, in the transformers this is often just left out or indicated as 'ignored'. Not sure what would be best here.

Returns
-------
X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)
hstack of results of transformers. sum_n_components is the

@amueller

amueller May 19, 2018

Is hstack clear enough? Maybe say "Horizontal concatenation (hstack)"?

Parameters
----------
X : array-like or DataFrame of shape [n_samples, n_features]

@amueller

amueller May 19, 2018

round parentheses?

Parameters
----------
X : array-like or DataFrame of shape [n_samples, n_features]

@amueller

amueller May 19, 2018

round parentheses?

@jorisvandenbossche

jorisvandenbossche May 25, 2018

It is done like this in many other examples in preprocessing/data.py, so maybe better to clean all of this up in another PR?

doc/modules/feature_extraction.rst
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)
In the above example, the

@amueller

amueller May 19, 2018

I think this makes it not a good first example, but it is good to mention this somewhere in the docs, maybe as a second example of why there's support for single-item columns. This is probably a very rare use case.

@amueller

Member

amueller commented May 23, 2018

(If anyone wants to push the green button I'm totally not opposed btw, or @jorisvandenbossche can think about my nitpicks)

@jorisvandenbossche

Contributor

jorisvandenbossche commented May 23, 2018

> (If anyone wants to push the green button I'm totally not opposed btw, or @jorisvandenbossche can think about my nitpicks)

Will update the PR tomorrow; I still need to rename unspecified -> remainder and switch the default.

sklearn/compose/_column_transformer.py
``transformer`` expects X to be a 1d array-like (vector),
otherwise a 2d array will be passed to the transformer.
unspecified : {'passthrough', 'drop'}, default 'drop'

@jorisvandenbossche

jorisvandenbossche May 25, 2018

Changed!

sklearn/compose/_column_transformer.py
Like in FeatureUnion and Pipeline, this allows the transformer and
its parameters to be set using ``set_params`` and searched in grid
search.
transformer : estimator or {'passthrough', 'drop'}

@jorisvandenbossche

jorisvandenbossche May 25, 2018

That seems like a good thing to discuss; let's open a separate issue for it.

sklearn/compose/_column_transformer.py
otherwise a 2d array will be passed to the transformer.
unspecified : {'passthrough', 'drop'}, default 'drop'
By default, only the specified columns in `transformers` are

@jorisvandenbossche

jorisvandenbossche May 25, 2018

Single backticks make the text italic if you do not enable a default role (I think).

In all the examples at https://numpydoc.readthedocs.io/en/latest/format.html#common-rest-concepts, single backticks are used for parameter names. The section I linked to says this should also be done for variable, class, and module names, but I suppose this is with the idea that the default role links to the API docs (which will never work for parameter names).

sklearn/compose/_column_transformer.py
... ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
... [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. Therefore, a separate

@jorisvandenbossche

jorisvandenbossche May 25, 2018

removed "therefore"

def named_transformers_(self):
"""Access the fitted transformer by name.
Read-only attribute to access any transformer by given name.

@jorisvandenbossche

jorisvandenbossche May 25, 2018

This is a similar approach to e.g. Pipeline, where it is also only documented as read-only.

I also don't think there is an easy way to actually make it read-only: even if the Bunch itself were read-only, its attributes refer to the transformers, which are still mutable themselves.

sklearn/compose/_column_transformer.py
for name, trans, _, _ in self._iter(fitted=True):
    if trans == 'drop':
        continue
    if not hasattr(trans, 'get_feature_names'):

@jorisvandenbossche

jorisvandenbossche May 25, 2018

I added a NotImplementedError.

Parameters
----------
X : array-like or DataFrame of shape [n_samples, n_features]

@jorisvandenbossche

jorisvandenbossche May 25, 2018

It is done like this in many other examples in preprocessing/data.py, so maybe better to clean all of this up in another PR?

Input data, of which specified subsets are used to fit the
transformers.
y : array-like, shape (n_samples, ...), optional

@jorisvandenbossche

jorisvandenbossche May 25, 2018

This is how it is documented in Pipeline. However, in the transformers this is often just left out or indicated as 'ignored'. Not sure what would be best here.

@jorisvandenbossche

Contributor

jorisvandenbossche commented May 25, 2018

OK, I updated this today, but ended up needing to make a few more changes than I anticipated, so I think it would be good if somebody takes a closer look at them. Diff view of only those changes: https://github.com/scikit-learn/scikit-learn/pull/9012/files/04bcb1ec7292a566e1b6b5f2fa0b7d38d60d9102..d298fc310069273f05d8806428fcfd0d2530b77d

What I changed:

  • Some small changes according to the inline feedback of Andy and Joel (for get_feature_names I added a NotImplementedError in case you call it with a 'passthrough' transformer).
  • Renamed the unspecified keyword to remainder, and switched the default from 'drop' to 'passthrough', as discussed above.
  • However, this change of default uncovered some issues with the current implementation. I hstack the output of the different transformers, but if the transformers do not all return 2D output, the hstack will give a (not very informative) error or even produce wrong results.
    I discussed with Olivier, and we didn't really think of a good use case for a transformer to return 1D output (some of the vectorizers need 1D input, but they all output 2D data), so it probably mainly occurred due to how I wrote the tests. We therefore decided to be strict for now and require 2D output from each transformer, and I added a check for this in fit/fit_transform to provide a nicer error message.
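The 1D/2D problem described above is easy to reproduce with plain numpy. In this sketch, check_2d is a hypothetical, simplified form of the guard added in fit/fit_transform, not the actual code from the PR.

```python
import numpy as np

out_2d = np.array([[1., 2.], [3., 4.]])  # 2 samples, 2 features
out_1d = np.array([5., 6.])              # 2 samples, but 1D by mistake

# Mixing dimensions makes hstack raise a not-very-informative error:
try:
    np.hstack([out_2d, out_1d])
except ValueError as exc:
    print("hstack failed:", exc)

# Worse: two 1D outputs are silently concatenated end to end,
# giving shape (4,) instead of two features side by side.
print(np.hstack([out_1d, out_1d]).shape)  # (4,)

# Hypothetical simplified guard: require 2D output from each transformer.
def check_2d(name, Xt):
    if getattr(Xt, "ndim", 0) != 2:
        raise ValueError(
            "The output of the %r transformer should be 2D "
            "(array, sparse matrix, or DataFrame)." % name)

check_2d("num", out_2d)  # passes silently
```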
@GaelVaroquaux

GaelVaroquaux commented May 25, 2018

@jnothman

Those changes look good

Some things I noticed
doc/modules/feature_extraction.rst
@@ -101,6 +101,105 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.
.. _column_transformer:

@jnothman

jnothman May 27, 2018

This should be in compose.rst, but perhaps noted at the top of this file

@jorisvandenbossche

jorisvandenbossche May 28, 2018

Yes, I know, but also (related to what I mentioned here: #9012 (comment)):

  • when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
  • we should reference this in preprocessing.rst
  • we should add a better "typical data science use case" example for the example gallery
  • I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.

I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also do the minimal thing here and simply move the current docs addition to compose.rst without any of the other improvements.)

feature_names : list of strings
    Names of the features produced by transform.
"""
check_is_fitted(self, 'transformers_')

@jnothman

jnothman May 27, 2018

Shouldn't remainder be handled here too?

@jorisvandenbossche

jorisvandenbossche May 28, 2018

Ideally, yes. I am just not fully sure what to do here currently, given that get_feature_names is in general not well supported.
Ideally I would add the names of the passed-through columns to feature_names: the actual string column names in the case of pandas DataFrames, and the indices into the array as strings (['0', '1', ...]) in the case of numpy arrays.

I can also raise an error for now if there are columns passed through, just to make sure that if we improve get_feature_names in the future, it does not lead to a change in behaviour (only a removal of the error).

@jnothman

jnothman May 28, 2018

Can just raise NotImplementedError in case of remainder != 'drop' for now... Or you can tack the remainder transformer onto the end of _iter.

I agree get_feature_names is not quite the right design.
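The suggested stop-gap can be sketched as follows. MiniColumnTransformer and FakeEncoder are hypothetical stand-ins used only to illustrate the behaviour being discussed; they are not the sklearn classes.

```python
class MiniColumnTransformer:
    """Hypothetical sketch of the suggested get_feature_names behaviour."""

    def __init__(self, transformers, remainder="drop"):
        self.transformers = transformers  # list of (name, transformer, columns)
        self.remainder = remainder

    def get_feature_names(self):
        # Stop-gap: refuse when columns are passed through, since their
        # names cannot be derived reliably yet.
        if self.remainder != "drop":
            raise NotImplementedError(
                "get_feature_names is not yet supported when remainder "
                "!= 'drop'")
        names = []
        for name, trans, _ in self.transformers:
            if not hasattr(trans, "get_feature_names"):
                raise NotImplementedError(
                    "Transformer %s does not provide get_feature_names" % name)
            names.extend("%s__%s" % (name, f)
                         for f in trans.get_feature_names())
        return names

class FakeEncoder:
    """Stand-in for an encoder that knows its output feature names."""
    def get_feature_names(self):
        return ["city=London", "city=Paris"]

ct = MiniColumnTransformer([("cat", FakeEncoder(), ["city"])])
print(ct.get_feature_names())  # ['cat__city=London', 'cat__city=Paris']
```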

@amueller

amueller Jun 1, 2018

Can we have an issue for this?

@jnothman

jnothman commented May 28, 2018

@jorisvandenbossche

Contributor

jorisvandenbossche commented May 29, 2018

Added the last update: an additional error for get_feature_names, and the docs moved to compose.rst.

As far as I am concerned, somebody can push the green button ;-)

@jnothman

Member

jnothman commented May 29, 2018

Indeed: let's see how this flies!

@jnothman jnothman merged commit 0b6308c into scikit-learn:master May 29, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch 99.2% of diff hit (target 95.1%)
Details
codecov/project 95.2% (+0.09%) compared to 0c424ce
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
@jnothman

Member

jnothman commented May 29, 2018

Thanks Joris for some great work on finally making this happen!

@jorisvandenbossche

Contributor

jorisvandenbossche commented May 29, 2018

Woohoo, thanks for merging!

@TomDLT

Member

TomDLT commented May 29, 2018

Great work, congrats!

@GaelVaroquaux

GaelVaroquaux commented May 29, 2018
@glemaitre

Contributor

glemaitre commented May 29, 2018

Really nice!!! Let's stack those columns then ;)

@eyadsibai

eyadsibai commented May 29, 2018

Looking forward to the next release

@jorisvandenbossche jorisvandenbossche deleted the jorisvandenbossche:amueller/heterogeneous_feature_union branch May 30, 2018

@armgilles

armgilles commented May 31, 2018

The next release will be amazing!

@amueller

Member

amueller commented May 31, 2018

OMG this is great! Thank you so much for your work (and patience) on this one @jorisvandenbossche

@partmor

Contributor

partmor commented Jun 1, 2018

Thank you @jorisvandenbossche!! Great stuff.
I have a question regarding this feature (I'm very new to GitHub, not sure if this is the right place... apologies in advance):

Is there going to be an effort (I would like to contribute) to implement get_feature_names in the majority of transformers?

I find that one of the big advantages that DataFrameMapper from sklearn-pandas brought us is the ability to trace the names of derived features (using aliases and df_out=True), with everything this means for interpretability (e.g. getting feature importances for a tree-based model after a fairly complex non-sequential preprocessing pipeline). Having get_feature_names work consistently in ColumnTransformer would be the bomb.

What do you guys think?

Thank you in advance.

@amueller (Member) commented Jun 1, 2018

@partmor See #9606 and #6425. What exactly was working with DataFrameMapper that's not currently working? I feel like the main usecase will be with OneHotEncoder/CategoricalEncoder who will provide a get_feature_names. For most multivariate transformations it's hard to get feature names, so I don't know how DataFrameMapper did that.

@partmor (Contributor) commented Jun 1, 2018

@amueller thank you for the links. For instance, if we want to use StandardScaler on a set of numeric variables, ColumnTransformer raises an exception because StandardScaler does not implement get_feature_names. For "1 to 1" column transformations like standard scaling, DataFrameMapper could just pass the feature names through: ['x0', 'x1', ...] to ['x0', 'x1', ...]. ColumnTransformer.get_feature_names() currently raises an exception whenever a StandardScaler is in it.
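The 1-to-1 passthrough idea described above can be sketched with a small wrapper. NamedTransformer is a hypothetical name for illustration, not scikit-learn API:

```python
# Sketch of the 1-to-1 passthrough idea: wrap any transformer that maps
# each input column to exactly one output column, and remember the input
# names so get_feature_names can return them. NamedTransformer is a
# hypothetical helper, not part of scikit-learn.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class NamedTransformer(BaseEstimator, TransformerMixin):
    """Wrap a 1-to-1 transformer and remember the input column names."""

    def __init__(self, transformer, names):
        self.transformer = transformer
        self.names = names

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        return self.transformer.transform(X)

    def get_feature_names(self):
        # 1-to-1 transform: output names equal input names.
        return list(self.names)

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
nt = NamedTransformer(StandardScaler(), ["x0", "x1"])
nt.fit(X)
print(nt.get_feature_names())  # ['x0', 'x1']
```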

@amueller (Member) commented Jun 2, 2018

@partmor yeah for univariate things like that it would be easy to implement. We should revisit #6425 keeping in mind that ColumnTransformer relies heavily on it.

@jnothman (Member) commented Jun 3, 2018
