Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add feature_extraction.ColumnTransformer #3886

Closed

Conversation

amueller
Copy link
Member

@amueller amueller commented Nov 25, 2014

Fixes #2034.

Todo:

  • Docstrings
  • Simple example
  • Test
  • documentation
  • test feature names
  • rename to ColumnSelector
  • allow selecting multiple columns
  • don't slice first direction, use iloc for pandas

Also see here for how this would help people.

)),

# Use a SVC classifier on the combined features
('svc', SVC(kernel='linear')),
('svc', LinearSVC(dual=False)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much more appropriate.

@mrterry
Copy link
Contributor

mrterry commented Nov 25, 2014

FeatureUnions from previous versions of skleran will not be unpickleable after this merges. Is that OK?

return self

def transform(self, data_dict):
return data_dict[self.key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good riddance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see this go. Your change makes all this much more elegant.

@mrterry
Copy link
Contributor

mrterry commented Nov 25, 2014

Looks good to me. I wrote something similar to this, but didn't get around to writing the tests. How do you test the parallel dispatch stays parallel?

@amueller
Copy link
Member Author

We don't provide pickle compatibility between versions. That is unfortunate but we don't have the resources / infrastructure for that at the moment, so we just don't worry about it.

I am not sure I understand you question about parallelism. You mean how do we test that joblib actually dispatches? I guess we don't.

@mrterry
Copy link
Contributor

mrterry commented Nov 25, 2014

You understood my poorly worded question.

@jnothman
Copy link
Member

I wrote something similar to this, but didn't get around to writing the tests.

That's what the WIPs are for in PR titles!

@jnothman
Copy link
Member

@amueller, is it okay to allow fields to be a list of string or functions, where functions are just applied to X?

Why won't previous FeatureUnions be unplickleable?

@amueller
Copy link
Member Author

I think there is a pickling problem because old ones don't have a fields attribute, right?

Can you give an application for passing a function?
Also, what would you call the parameter then?

I would rather have data transforming functions in transform objects, I think.

@mrterry
Copy link
Contributor

mrterry commented Nov 26, 2014

@jnothman Strictly speaking, they will unpickle just fine. A v.15 pickle hydrated in v.16 will not have self.fields and will bonk when calling _check_fields()

@amueller amueller changed the title WIP Add transform fields option to FeatureUnion MRG Add transform fields option to FeatureUnion Nov 26, 2014
@amueller
Copy link
Member Author

Can I has reviews?

--------
>>> from sklearn.preprocessing import Normalizer
>>> union = FeatureUnion([("norm1", Normalizer(norm='l1')), \
("norm2", Normalizer(norm='l1'))], \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure it's a useful example if both undergo the same column-wise transformation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. If both are histograms this is different than doing it per column ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I've never used Normalizer before. I confused it for a feature scaler. It's norming each sample. Thanks...

@jnothman
Copy link
Member

LGTM!

Changed my mind: can I suggest that this new functionality be noted in the description of FeatureUnion in the narrative docs, and perhaps in the docstring?

@jnothman
Copy link
Member

now LGTM! ;)

@jnothman jnothman changed the title MRG Add transform fields option to FeatureUnion [MRG+1] Add transform fields option to FeatureUnion Dec 25, 2014
@amueller
Copy link
Member Author

Thanks for your help @jnothman :) Any other reviews?

@amueller
Copy link
Member Author

amueller commented Jan 7, 2015

@ogrisel a review would be much appreciated ;)

@amueller
Copy link
Member Author

amueller commented Mar 5, 2015

Ping @ogrisel what do you think of this?

@bmabey
Copy link

bmabey commented Mar 13, 2015

I'm new to the code base but FWIW LGTM.

I'm also biased since I'm excited to see this merged. :)

@GaelVaroquaux GaelVaroquaux changed the title [MRG+1] Add transform fields option to FeatureUnion [MRG+1-1] Add transform fields option to FeatureUnion Mar 13, 2015
@GaelVaroquaux
Copy link
Member

While I understand the problem that this is trying to solve, and I think that it is very important, I am a bit worried by the indexing in the sample direction. The changes are toying with our clear definition that the first direction of indexing should be a sample direction. The implications of such blurring of conventions are probably very deep. In particular, I expect code validation and error reporting to become harder and harder.

I know where this comes from: pandas has very strange indexing logics, and as a result an API hard to learn and error message that are very open ended. On the opposite, scikit-learn has so far had a very strict set of conventions, that made it easier to learn and give good error messages.

As this change basically introduces a new kind of input data, to capture heterogenous data, I suggest that it should be confine to a new sub-module, in which objects only deal with heterogenous data, and refuse to deal with the standard data matrices. We could document this module as the part of scikit-learn dedicated to heterogeneous data and define the input data type as anything that when indexed with a string return a array of length n_samples. This would enable us to support pandas DataFrame, dictionaries of 1D arrays, and structured dtypes. It would probably make the documentation, discovery and future evolution of such support easier.

As a side note, the name 'field' is very unclear to me. I understood where it came from after reading half of the pull request, because the pull request has an obvious consistency and story behind it, but looking locally at a bit of the code, I had a hard time understanding why a 'field' was applied in the sample direction.

@jnothman
Copy link
Member

And by "you" you mean me, or @amueller? If me, I had been proposing to not allow multiple transformers for the same column without nesting a FeatureUnion. But now I'm just being tired of looking at lists of tuples and dicts of string keys and tuple values and parallel arrays and other such monsters.

@amueller
Copy link
Member Author

I feel if we make users nest FeatureUnion and ColumnTransformer then we failed.
I feel the interface as implemented now is not to bad, I think, it doesn't have lists of tuples or parallel arrays.
I was just about to revert this to the parallel array solution.
If we do any other API, there will be very little code reuse with FeatureUnion.

What we could do is use the "name is the column name" thing by default and keep the feature union. And if anyone wants to use multiple transformers on the same column, they have to provide a separate array of column names.

class ColumnTransformer(BaseEstimator, TransformerMixin):
"""Applies transformers to columns of a dataframe / dict.

This estimator applies a transformer objects to each column or field of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applies transformer objects :)

@amueller amueller force-pushed the heterogeneous_feature_union branch from 5e42880 to 6addec5 Compare June 10, 2015 17:32
@amueller amueller force-pushed the heterogeneous_feature_union branch from 6addec5 to c3b6568 Compare June 10, 2015 17:40
@amueller amueller changed the title [WIP] Add feature_extraction.ColumnTransformer [MRG] Add feature_extraction.ColumnTransformer Jun 10, 2015


class ColumnTransformer(BaseEstimator, TransformerMixin):
"""Applies transformers to columns of a dataframe / dict.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I see of point of this transformer on dataframe and dicts, I find it too bad we cannot apply it on Numpy arrays. I would love to have see a built-in to apply transformers on selected columns only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Coming late to the party, this might have been discussed before...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be pretty easy with the FunctionTransformer #4798

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, +1

Input data, used to fit transformers.
"""
transformers = Parallel(n_jobs=self.n_jobs)(
delayed(_fit_one_transformer)(trans, X[column], y)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use .iloc if it exists otherwise slice in second direction, and allow multiple columns.

@amueller amueller changed the title [MRG] Add feature_extraction.ColumnTransformer [WIP] Add feature_extraction.ColumnTransformer Mar 31, 2017
@GaelVaroquaux
Copy link
Member

Should we close this, in favor of #9012, to clean the tracker?

@amueller amueller closed this Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend FeatureUnion to better handle heterogeneous data
8 participants