
[WIP] Add feature_extraction.ColumnTransformer #3886

Closed
wants to merge 13 commits into master from amueller:heterogeneous_feature_union
Conversation

8 participants
@amueller
Member

amueller commented Nov 25, 2014

Fixes #2034.

Todo:

  • Docstrings
  • Simple example
  • Test
  • documentation
  • test feature names
  • rename to ColumnSelector
  • allow selecting multiple columns
  • don't slice first direction, use iloc for pandas

Also see here for how this would help people.
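A minimal sketch of the idea behind the todo list: one transformer per column/field of a heterogeneous container. The class name, the `transformers` layout, and the plain-callable transformers below are illustrative stand-ins, not the PR's actual API.

```python
import numpy as np

class MiniColumnTransformer:
    """Hypothetical sketch: apply one transform per column/field.

    `transformers` maps an output name to a (column key, callable) pair.
    """

    def __init__(self, transformers):
        self.transformers = transformers

    def fit_transform(self, X):
        # X is anything indexable by the column keys
        # (dict of 1D arrays, pandas DataFrame, structured array, ...)
        blocks = []
        for name, (column, func) in self.transformers.items():
            col = np.asarray(func(X[column]), dtype=float)
            blocks.append(col.reshape(col.shape[0], -1))
        # concatenate the per-column results feature-wise
        return np.hstack(blocks)

X = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
ct = MiniColumnTransformer({"sq": ("a", np.square),
                            "id": ("b", lambda c: c)})
out = ct.fit_transform(X)  # columns: a**2 and b, side by side
```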

)),
# Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
('svc', LinearSVC(dual=False)),


@mrterry

mrterry Nov 25, 2014

Contributor

much more appropriate.

@@ -406,6 +438,17 @@ def _update_transformer_list(self, transformers):
for ((name, old), new) in zip(self.transformer_list, transformers)
]
def _check_fields(self):
if self.fields is not None:
fields = self.fields


@mrterry

mrterry Nov 25, 2014

Contributor

this is probably overly paranoid, but do we care if self.fields is a generator?


@amueller

amueller Nov 25, 2014

Member

Then it'll break, as it can only be evaluated once, right? We could convert it to a list, but the docstring says it should be a list...


@mrterry

mrterry Nov 25, 2014

Contributor

Right. I don't think it is a problem. Just being paranoid.

@mrterry


Contributor

mrterry commented Nov 25, 2014

FeatureUnions from previous versions of sklearn will not be unpickleable after this merges. Is that OK?

return self
def transform(self, data_dict):
return data_dict[self.key]


@mrterry

mrterry Nov 25, 2014

Contributor

good riddance.


@mrterry

mrterry Nov 25, 2014

Contributor

Happy to see this go. Your change makes all this much more elegant.

@mrterry


Contributor

mrterry commented Nov 25, 2014

Looks good to me. I wrote something similar to this, but didn't get around to writing the tests. How do you test the parallel dispatch stays parallel?

@amueller


Member

amueller commented Nov 25, 2014

We don't provide pickle compatibility between versions. That is unfortunate but we don't have the resources / infrastructure for that at the moment, so we just don't worry about it.

I am not sure I understand your question about parallelism. You mean how do we test that joblib actually dispatches? I guess we don't.

@mrterry


Contributor

mrterry commented Nov 25, 2014

You understood my poorly worded question.

@jnothman


Member

jnothman commented Nov 26, 2014

I wrote something similar to this, but didn't get around to writing the tests.

That's what the WIPs are for in PR titles!

@jnothman


Member

jnothman commented Nov 26, 2014

@amueller, is it okay to allow fields to be a list of strings or functions, where the functions are just applied to X?

Why won't previous FeatureUnions be unpickleable?

@amueller


Member

amueller commented Nov 26, 2014

I think there is a pickling problem because old ones don't have a fields attribute, right?

Can you give an application for passing a function?
Also, what would you call the parameter then?

I would rather have data transforming functions in transform objects, I think.

@mrterry


Contributor

mrterry commented Nov 26, 2014

@jnothman Strictly speaking, they will unpickle just fine. A v0.15 pickle hydrated in v0.16 will not have self.fields and will bonk when calling _check_fields()
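A small sketch of the failure mode mrterry describes: pickle restores an object's `__dict__` as it was at save time and does not re-run `__init__`, so an attribute introduced in a later version (here, `fields`) is simply absent on old objects. The `Union` class and `_check_fields` helper below are stand-ins, not the real FeatureUnion.

```python
import pickle

class Union:
    """Stand-in for the pre-change FeatureUnion: no `fields` attribute."""
    def __init__(self):
        self.transformer_list = []

# round-trip through pickle, as if loading a v0.15 artifact in v0.16
old = pickle.loads(pickle.dumps(Union()))

def _check_fields(obj):
    # newer code that assumes the attribute exists
    return obj.fields

try:
    _check_fields(old)
    failed = False
except AttributeError:
    failed = True  # the old-style object has no `fields` -> it "bonks" here
```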

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from b701a58 to edcc143 Nov 26, 2014

@amueller amueller changed the title from WIP Add transform fields option to FeatureUnion to MRG Add transform fields option to FeatureUnion Nov 26, 2014

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from edcc143 to 5f6f7af Nov 26, 2014

@amueller


Member

amueller commented Dec 16, 2014

Can I has reviews?

--------
>>> from sklearn.preprocessing import Normalizer
>>> union = FeatureUnion([("norm1", Normalizer(norm='l1')), \
("norm2", Normalizer(norm='l1'))], \


@jnothman

jnothman Dec 17, 2014

Member

I'm not really sure it's a useful example if both undergo the same column-wise transformation.


@amueller

amueller Dec 17, 2014

Member

It is. If both are histograms this is different than doing it per column ;)


@jnothman

jnothman Dec 18, 2014

Member

Ah. I've never used Normalizer before. I confused it for a feature scaler. It's norming each sample. Thanks...
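The confusion here is easy to reproduce: `Normalizer` rescales each sample (row), not each feature (column). A pure-NumPy sketch of what `Normalizer(norm='l1')` computes:

```python
import numpy as np

def l1_normalize_rows(X):
    # What sklearn's Normalizer(norm='l1') does: rescale each SAMPLE (row)
    # so its absolute values sum to 1 -- it is not a per-column scaler.
    X = np.asarray(X, dtype=float)
    return X / np.abs(X).sum(axis=1, keepdims=True)

out = l1_normalize_rows([[1.0, 3.0], [2.0, 2.0]])
# each row now sums to 1: [[0.25, 0.75], [0.5, 0.5]]
```

This is why applying it to two different column subsets (as in the FeatureUnion example) gives a genuinely different result than applying it once to the full matrix.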

@jnothman


Member

jnothman commented Dec 17, 2014

LGTM!

Changed my mind: can I suggest that this new functionality be noted in the description of FeatureUnion in the narrative docs, and perhaps in the docstring?

@jnothman


Member

jnothman commented Dec 25, 2014

now LGTM! ;)

@jnothman jnothman changed the title from MRG Add transform fields option to FeatureUnion to [MRG+1] Add transform fields option to FeatureUnion Dec 25, 2014

@amueller


Member

amueller commented Dec 29, 2014

Thanks for your help @jnothman :) Any other reviews?

@amueller


Member

amueller commented Jan 7, 2015

@ogrisel a review would be much appreciated ;)

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from 0a3b75a to ddefe62 Jan 15, 2015

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from ddefe62 to f31d5db Jan 22, 2015

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from f31d5db to fec5175 Mar 3, 2015

@amueller


Member

amueller commented Mar 5, 2015

Ping @ogrisel what do you think of this?

@bmabey


bmabey commented Mar 13, 2015

I'm new to the code base but FWIW LGTM.

I'm also biased since I'm excited to see this merged. :)

@GaelVaroquaux GaelVaroquaux changed the title from [MRG+1] Add transform fields option to FeatureUnion to [MRG+1-1] Add transform fields option to FeatureUnion Mar 13, 2015

@GaelVaroquaux


Member

GaelVaroquaux commented Mar 13, 2015

While I understand the problem that this is trying to solve, and I think that it is very important, I am a bit worried by the indexing in the sample direction. The changes are toying with our clear definition that the first direction of indexing should be a sample direction. The implications of such blurring of conventions are probably very deep. In particular, I expect code validation and error reporting to become harder and harder.

I know where this comes from: pandas has very strange indexing logic, and as a result an API that is hard to learn and error messages that are very open-ended. By contrast, scikit-learn has so far had a very strict set of conventions, which made it easier to learn and to give good error messages.

As this change basically introduces a new kind of input data, to capture heterogeneous data, I suggest that it be confined to a new sub-module, in which objects deal only with heterogeneous data and refuse to deal with the standard data matrices. We could document this module as the part of scikit-learn dedicated to heterogeneous data, and define the input data type as anything that, when indexed with a string, returns an array of length n_samples. This would enable us to support pandas DataFrames, dictionaries of 1D arrays, and structured dtypes. It would probably make the documentation, discovery and future evolution of such support easier.

As a side note, the name 'field' is very unclear to me. I understood where it came from after reading half of the pull request, because the pull request has an obvious consistency and story behind it, but looking locally at a bit of the code, I had a hard time understanding why a 'field' was applied in the sample direction.
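The "indexed with a string, returns an array of length n_samples" contract Gaël describes can be illustrated with containers NumPy already supports; a pandas DataFrame satisfies the same contract via `df["age"]`. The field names below are illustrative.

```python
import numpy as np

n_samples = 3

# a dictionary of 1D arrays
as_dict = {"age": np.array([25, 32, 47]),
           "city": np.array(["NY", "SF", "LA"])}

# a structured dtype
as_struct = np.array([(25, "NY"), (32, "SF"), (47, "LA")],
                     dtype=[("age", "i4"), ("city", "U2")])

for X in (as_dict, as_struct):
    col = X["age"]                 # index with a string ...
    assert len(col) == n_samples   # ... get an array of length n_samples
```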

@jnothman


Member

jnothman commented Jun 10, 2015

And by "you" you mean me, or @amueller? If me, I had been proposing to not allow multiple transformers for the same column without nesting a FeatureUnion. But now I'm just tired of looking at lists of tuples and dicts of string keys and tuple values and parallel arrays and other such monsters.

@amueller


Member

amueller commented Jun 10, 2015

I feel that if we make users nest FeatureUnion and ColumnTransformer then we have failed.
The interface as implemented now is not too bad, I think; it doesn't have lists of tuples or parallel arrays.
I was just about to revert this to the parallel array solution.
If we do any other API, there will be very little code reuse with FeatureUnion.

What we could do is use the "name is the column name" thing by default and keep the feature union. And if anyone wants to use multiple transformers on the same column, they have to provide a separate array of column names.

class ColumnTransformer(BaseEstimator, TransformerMixin):
"""Applies transformers to columns of a dataframe / dict.
This estimator applies a transformer objects to each column or field of the


@raghavrv

raghavrv Jun 10, 2015

Member

applies transformer objects :)

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from 5e42880 to 6addec5 Jun 10, 2015

@amueller amueller force-pushed the amueller:heterogeneous_feature_union branch from 6addec5 to c3b6568 Jun 10, 2015

@amueller amueller changed the title from [WIP] Add feature_extraction.ColumnTransformer to [MRG] Add feature_extraction.ColumnTransformer Jun 10, 2015

class ColumnTransformer(BaseEstimator, TransformerMixin):
"""Applies transformers to columns of a dataframe / dict.


@glouppe

glouppe Jul 19, 2015

Member

While I see the point of this transformer for dataframes and dicts, I find it too bad we cannot apply it to NumPy arrays. I would love to see a built-in way to apply transformers on selected columns only.


@glouppe

glouppe Jul 19, 2015

Member

(Coming late to the party, this might have been discussed before...)


@amueller

amueller Jul 30, 2015

Member

That would be pretty easy with the FunctionTransformer #4798


@glouppe

glouppe Aug 30, 2015

Member

Indeed, +1

Parameters
----------
transformers : dict from string to (string, transformer) tuples


@glouppe

glouppe Aug 30, 2015

Member

The implementation expects the dict values to be (transformer, string) tuples, not (string, transformer) as documented here.


@glouppe

glouppe Aug 30, 2015

Member

Also, does the key used to access the column always need to be a string? E.g. what if I use an int to access the n-th column, or even a list to access several columns at once?

Input data, used to fit transformers.
"""
transformers = Parallel(n_jobs=self.n_jobs)(
delayed(_fit_one_transformer)(trans, X[column], y)


@amueller

amueller Mar 31, 2017

Member

Should use .iloc if it exists, otherwise slice in the second direction, and allow multiple columns.
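A sketch of the dispatch this comment suggests; the helper name is hypothetical, and only the "prefer `.iloc` when present, else slice the second axis" rule comes from the comment.

```python
import numpy as np

def get_columns(X, key):
    # Hypothetical helper: prefer positional .iloc when it exists (pandas),
    # otherwise slice the array's second (feature) direction.
    # `key` may be an int or a list of ints, allowing multiple columns.
    if hasattr(X, "iloc"):
        return X.iloc[:, key]
    return np.asarray(X)[:, key]

X = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
one = get_columns(X, 1)          # single column -> shape (3,)
many = get_columns(X, [0, 1])    # multiple columns -> shape (3, 2)
```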

@amueller amueller changed the title from [MRG] Add feature_extraction.ColumnTransformer to [WIP] Add feature_extraction.ColumnTransformer Mar 31, 2017

@amueller amueller added this to needs pr in Andy's pets Mar 31, 2017

@amueller amueller moved this from needs pr to PR phase in Andy's pets Mar 31, 2017

@amueller amueller moved this from PR phase to AJ in Andy's pets May 12, 2017

@jnothman jnothman referenced this pull request May 26, 2017

Closed

[MRG] Support for strings in OneHotEncoder #8793

4 of 4 tasks complete

@amueller amueller removed this from AJ in Andy's pets Jul 21, 2017

@jnothman jnothman added this to In progress in API and interoperability Aug 14, 2017

@GaelVaroquaux


Member

GaelVaroquaux commented Nov 28, 2017

Should we close this, in favor of #9012, to clean the tracker?

@amueller amueller closed this Nov 28, 2017

@jnothman jnothman moved this from In progress to Done in API and interoperability Jul 11, 2018
