New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Adds transformer support in ColumnTransformer.remainder #11315

Merged
merged 31 commits into from Jun 26, 2018

Conversation

Projects
None yet
4 participants
@thomasjpfan
Contributor

thomasjpfan commented Jun 18, 2018

Reference Issues/PRs

Fixes #11210.

What does this implement/fix? Explain your changes.

Adds transformer support in the ColumnTransformer.remainder parameter.

@amueller

This comment has been minimized.

Show comment
Hide comment
@amueller

amueller Jun 18, 2018

Member

awesome! flake8 is unhappy though.

Member

amueller commented Jun 18, 2018

awesome! flake8 is unhappy though.

@jnothman

I wonder if the implementation here could be greatly simplified by appending remainder to transformers_ and to _iter.

Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 20, 2018

Contributor

As long as we can make 'remainder' an invalid transformer name, I think putting remainder into transformers_ would work.

Contributor

thomasjpfan commented Jun 20, 2018

As long as we can make 'remainder' an invalid transformer name, I think putting remainder into transformers_ would work.

@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 20, 2018

Contributor

I updated this PR with an implementation where _iter is used to train remainder. Here are some details about the implementation:

  1. It cleans up fit_transform and transform. self._passthrough does not need to be checked anymore. If remainder is 'passthrough' or an estimator, _iter will iterate through them. If remainder is 'drop', it will be ignored by _iter.

  2. remainder_transformer is used as the name of the remainder transformer in _iter.

  3. remainder_ is updated by looking for the name remainder_transformer (in _update_fitted_transformers).

  4. remainder_ is only added during _iter. It does not change transformers_. This keeps the external interface where: len(transformers) == len(transformers_).

Contributor

thomasjpfan commented Jun 20, 2018

I updated this PR with an implementation where _iter is used to train remainder. Here are some details about the implementation:

  1. It cleans up fit_transform and transform. self._passthrough does not need to be checked anymore. If remainder is 'passthrough' or an estimator, _iter will iterate through them. If remainder is 'drop', it will be ignored by _iter.

  2. remainder_transformer is used as the name of the remainder transformer in _iter.

  3. remainder_ is updated by looking for the name remainder_transformer (in _update_fitted_transformers).

  4. remainder_ is only added during _iter. It does not change transformers_. This keeps the external interface where: len(transformers) == len(transformers_).

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 20, 2018

Member

In master, 'remainder' is already not allowed as a transformer name:

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import OneHotEncoder
>>> ColumnTransformer([('remainder', OneHotEncoder(), [0])])
ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
         transformers=[('remainder', OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True), [0])])
>>> ColumnTransformer([('remainder', OneHotEncoder(), [0])]).fit([[1]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py", line 361, in fit
    self._validate_transformers()
  File "/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py", line 215, in _validate_transformers
    self._validate_names(names)
  File "/Users/joel/repos/scikit-learn/sklearn/utils/metaestimators.py", line 68, in _validate_names
    'arguments: {0!r}'.format(sorted(invalid_names)))
ValueError: Estimator names conflict with constructor arguments: ['remainder']
Member

jnothman commented Jun 20, 2018

In master, 'remainder' is already not allowed as a transformer name:

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import OneHotEncoder
>>> ColumnTransformer([('remainder', OneHotEncoder(), [0])])
ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
         transformers=[('remainder', OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True), [0])])
>>> ColumnTransformer([('remainder', OneHotEncoder(), [0])]).fit([[1]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py", line 361, in fit
    self._validate_transformers()
  File "/Users/joel/repos/scikit-learn/sklearn/compose/_column_transformer.py", line 215, in _validate_transformers
    self._validate_names(names)
  File "/Users/joel/repos/scikit-learn/sklearn/utils/metaestimators.py", line 68, in _validate_names
    'arguments: {0!r}'.format(sorted(invalid_names)))
ValueError: Estimator names conflict with constructor arguments: ['remainder']
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 21, 2018

Member

This keeps the external interface where: len(transformers) == len(transformers_).

I don't think there's any problem changing this as long as ColumnTransformer has not yet been released. If we can get this patch into the upcoming release, I'd much prefer the simplicity of appending the remainder transformer/passthrough to transformers_.

Member

jnothman commented Jun 21, 2018

This keeps the external interface where: len(transformers) == len(transformers_).

I don't think there's any problem changing this as long as ColumnTransformer has not yet been released. If we can get this patch into the upcoming release, I'd much prefer the simplicity of appending the remainder transformer/passthrough to transformers_.

Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

If there are any remaining columns, the remainder will always be added totransformers_. This also includes remainder='drop' or remainder='passthrough'. Tests for the len(transformers_) were added to assert this behavior.

Contributor

thomasjpfan commented Jun 21, 2018

If there are any remaining columns, the remainder will always be added totransformers_. This also includes remainder='drop' or remainder='passthrough'. Tests for the len(transformers_) were added to assert this behavior.

thomasjpfan added some commits Jun 21, 2018

@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

Mind as well fix #11332 as well. 😄

Contributor

thomasjpfan commented Jun 21, 2018

Mind as well fix #11332 as well. 😄

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 21, 2018

Member

Mind as well fix #11332 as well

I wouldn't make a point of doing so here. It's definitely orthogonal to this change.

Member

jnothman commented Jun 21, 2018

Mind as well fix #11332 as well

I wouldn't make a point of doing so here. It's definitely orthogonal to this change.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 21, 2018

Member

RFC: Uses deep in get_params

This requires a test. Although testing ColumnTransformer([...], remainder=SomeEst()).get_params() would, IMO, be sufficient.

Member

jnothman commented Jun 21, 2018

RFC: Uses deep in get_params

This requires a test. Although testing ColumnTransformer([...], remainder=SomeEst()).get_params() would, IMO, be sufficient.

thomasjpfan added some commits Jun 21, 2018

@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

I updated _iter to accept include_remainder, which is set to false when _validate_transformers calls _iter. This excludes the remainder from the set of names to be valided. This change is compatible with set_params and get_params.

Contributor

thomasjpfan commented Jun 21, 2018

I updated _iter to accept include_remainder, which is set to false when _validate_transformers calls _iter. This excludes the remainder from the set of names to be valided. This change is compatible with set_params and get_params.

@jnothman

Rather than add a param to _iter, I'd personally prefer _validate_names(zip(*self.transformers)[0]) and not using _iter to validate names. But I can see why the current version might also be construed as clearer.

Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
@jnothman

Please also update the documentation of the transformers_ attribute.

@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

PR updated with:

  1. _validate_transformers checks self.transformers and does not call _iter anymore.
  2. self._remainder is checked first before adding it to the chain.
  3. transformer_ doc.
Contributor

thomasjpfan commented Jun 21, 2018

PR updated with:

  1. _validate_transformers checks self.transformers and does not call _iter anymore.
  2. self._remainder is checked first before adding it to the chain.
  3. transformer_ doc.
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
@@ -90,7 +93,10 @@ class ColumnTransformer(_BaseComposition, TransformerMixin):
----------
transformers_ : list

This comment has been minimized.

@jnothman

jnothman Jun 21, 2018

Member

perhaps specify the length here as len(transformers) or len(transformers) + 1

@jnothman

jnothman Jun 21, 2018

Member

perhaps specify the length here as len(transformers) or len(transformers) + 1

@jnothman

Do you have a test case where you check the value of transformers_[-1]? Do you have a test case where there are no remaining columns?

Do you think that if there are no remaining columns but a remainder transformer is given, we should raise a warning?

Show outdated Hide outdated sklearn/compose/tests/test_column_transformer.py Outdated
@thomasjpfan

This comment has been minimized.

Show comment
Hide comment
@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

For the case where there is no remaining columns and a remainder estimator, I do not think there needs to be a warning. A warning would have to direct a user to a preferred usage option. In our case it would either be to change remainder to drop or passthrough. I do not think those options are better than the initial estimator.

Contributor

thomasjpfan commented Jun 21, 2018

For the case where there is no remaining columns and a remainder estimator, I do not think there needs to be a warning. A warning would have to direct a user to a preferred usage option. In our case it would either be to change remainder to drop or passthrough. I do not think those options are better than the initial estimator.

@jnothman

Please also see if this can be incorporated into one of the examples in examples/compose or if it should be mentioned in the user guide.

thomasjpfan added some commits Jun 21, 2018

RFC: Does not need to check remainder
The remainder is already a part of `_iter`, thus it will be checked in the for loop.
@jorisvandenbossche

Thanks a lot for this PR!

Really nice PR, only added a bunch of minor comments.

Show outdated Hide outdated doc/modules/compose.rst Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/_column_transformer.py Outdated
@@ -23,7 +23,7 @@ def __init__(self):
pass
def _get_params(self, attr, deep=True):
out = super(_BaseComposition, self).get_params(deep=False)
out = super(_BaseComposition, self).get_params(deep=deep)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Jun 21, 2018

Contributor

This was needed to get the tests passing? (just to be sure this change is covered by the tests)

@jorisvandenbossche

jorisvandenbossche Jun 21, 2018

Contributor

This was needed to get the tests passing? (just to be sure this change is covered by the tests)

This comment has been minimized.

@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

This was updated to support ColumnTransformer.get_params('remainder__{}') where the remainder is a estimator. This is tested with test_column_transformer_get_set_params_with_remainder.

@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

This was updated to support ColumnTransformer.get_params('remainder__{}') where the remainder is a estimator. This is tested with test_column_transformer_get_set_params_with_remainder.

This comment has been minimized.

@jorisvandenbossche
@jorisvandenbossche

jorisvandenbossche Jun 22, 2018

Contributor

OK!

Show outdated Hide outdated sklearn/compose/tests/test_column_transformer.py Outdated
assert_array_equal(ct.fit_transform(X_array), X_res_both)
assert_array_equal(ct.fit(X_array).transform(X_array), X_res_both)
assert len(ct.transformers_) == 2

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Jun 21, 2018

Contributor

should we test here more explicitly the full content of ct.trnasformers_[-1] ? (you already test it partly in many other places, but here it seems appropriate to test the full content)

@jorisvandenbossche

jorisvandenbossche Jun 21, 2018

Contributor

should we test here more explicitly the full content of ct.trnasformers_[-1] ? (you already test it partly in many other places, but here it seems appropriate to test the full content)

This comment has been minimized.

@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

I went through all the tests to fully test all the elements of ct.transformers_[-1].

@thomasjpfan

thomasjpfan Jun 21, 2018

Contributor

I went through all the tests to fully test all the elements of ct.transformers_[-1].

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Jun 22, 2018

Contributor

Personally, I think that doing that everywhere is a bit overkill compared to a specific test to do it, but OK :-)

@jorisvandenbossche

jorisvandenbossche Jun 22, 2018

Contributor

Personally, I think that doing that everywhere is a bit overkill compared to a specific test to do it, but OK :-)

Show outdated Hide outdated sklearn/compose/tests/test_column_transformer.py Outdated
Show outdated Hide outdated sklearn/compose/tests/test_column_transformer.py Outdated

thomasjpfan added some commits Jun 21, 2018

@jorisvandenbossche

Thanks for the updates!

@jnothman

Thanks, @thomasjpfan. Great work!

@jnothman jnothman merged commit 895dfd3 into scikit-learn:master Jun 26, 2018

8 checks passed

LGTM analysis: Python No alert changes
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch 99.41% of diff hit (target 95.26%)
Details
codecov/project 95.27% (+0.01%) compared to bb5110b
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment