[WIP] Add feature_extraction.ColumnTransformer #3886

Closed · wants to merge 13 commits

Conversation

8 participants
@amueller
Member

amueller commented Nov 25, 2014

Fixes #2034.

Todo:

  • Docstrings
  • Simple example
  • Test
  • documentation
  • test feature names
  • rename to ColumnSelector
  • allow selecting multiple columns
  • don't slice first direction, use iloc for pandas

Also see here for how this would help people.

)),
# Use a SVC classifier on the combined features
- ('svc', SVC(kernel='linear')),
+ ('svc', LinearSVC(dual=False)),

@mrterry

mrterry Nov 25, 2014

Contributor

much more appropriate.

sklearn/pipeline.py
@@ -406,6 +438,17 @@ def _update_transformer_list(self, transformers):
for ((name, old), new) in zip(self.transformer_list, transformers)
]
+ def _check_fields(self):
+ if self.fields is not None:
+ fields = self.fields

@mrterry

mrterry Nov 25, 2014

Contributor

this is probably overly paranoid, but do we care if self.fields is a generator?

@amueller

amueller Nov 25, 2014

Member

Then it'll break, as it can only be evaluated once, right? We could convert it to a list, but the docstring says it should be a list...

@mrterry

mrterry Nov 25, 2014

Contributor

Right. I don't think it is a problem. Just being paranoid.
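For reference, the one-shot nature of generators is easy to demonstrate; `check_fields` below is a hypothetical stand-in for the PR's `_check_fields`:

```python
def check_fields(fields):
    # Materializing defends against one-shot iterables; harmless for lists/tuples.
    return list(fields)

gen = (name for name in ["body", "subject"])
first = check_fields(gen)   # consumes the generator
second = list(gen)          # nothing is left on a second pass

print(first)   # ['body', 'subject']
print(second)  # []
```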

@mrterry

mrterry commented Nov 25, 2014

Contributor

FeatureUnions from previous versions of scikit-learn will not be unpickleable after this merges. Is that OK?

- return self
-
- def transform(self, data_dict):
- return data_dict[self.key]

@mrterry

mrterry Nov 25, 2014

Contributor

good riddance.

@mrterry

mrterry Nov 25, 2014

Contributor

Happy to see this go. Your change makes all this much more elegant.

@mrterry

mrterry commented Nov 25, 2014

Contributor

Looks good to me. I wrote something similar to this, but didn't get around to writing the tests. How do you test the parallel dispatch stays parallel?

@amueller

amueller commented Nov 25, 2014

Member

We don't provide pickle compatibility between versions. That is unfortunate, but we don't have the resources / infrastructure for that at the moment, so we just don't worry about it.

I am not sure I understand your question about parallelism. You mean how do we test that joblib actually dispatches? I guess we don't.

@mrterry

mrterry commented Nov 25, 2014

Contributor

You understood my poorly worded question.

@jnothman

jnothman commented Nov 26, 2014

Member

I wrote something similar to this, but didn't get around to writing the tests.

That's what the WIPs are for in PR titles!

@jnothman

jnothman commented Nov 26, 2014

Member

@amueller, is it okay to allow fields to be a list of strings or functions, where functions are just applied to X?

Why won't previous FeatureUnions be unpickleable?

@amueller

amueller commented Nov 26, 2014

Member

I think there is a pickling problem because old ones don't have a fields attribute, right?

Can you give an application for passing a function?
Also, what would you call the parameter then?

I would rather have data transforming functions in transform objects, I think.

@mrterry

mrterry commented Nov 26, 2014

Contributor

@jnothman Strictly speaking, they will unpickle just fine. A v0.15 pickle hydrated in v0.16 will not have self.fields and will bonk when calling _check_fields().
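The failure mode is easy to reproduce: pickle restores an instance's `__dict__` without calling `__init__`, so attributes added in a newer release are simply absent. A self-contained sketch with toy classes, not the real `FeatureUnion`:

```python
import pickle

class FeatureUnion:  # stand-in for the 0.15-era class
    def __init__(self):
        self.transformer_list = []

payload = pickle.dumps(FeatureUnion())

class FeatureUnion:  # stand-in for the class after this PR, with a new attribute
    def __init__(self):
        self.transformer_list = []
        self.fields = None

old = pickle.loads(payload)  # class is looked up by name, so the new one is used
print(hasattr(old, "transformer_list"))  # True: it was in the pickled __dict__
print(hasattr(old, "fields"))            # False: __init__ was never called
```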

@amueller amueller changed the title from WIP Add transform fields option to FeatureUnion to MRG Add transform fields option to FeatureUnion Nov 26, 2014

@amueller

amueller commented Dec 16, 2014

Member

Can I has reviews?

sklearn/pipeline.py
+ --------
+ >>> from sklearn.preprocessing import Normalizer
+ >>> union = FeatureUnion([("norm1", Normalizer(norm='l1')), \
+ ("norm2", Normalizer(norm='l1'))], \

@jnothman

jnothman Dec 17, 2014

Member

I'm not really sure it's a useful example if both undergo the same column-wise transformation.

@amueller

amueller Dec 17, 2014

Member

It is. If both are histograms this is different than doing it per column ;)

@jnothman

jnothman Dec 18, 2014

Member

Ah. I've never used Normalizer before. I confused it for a feature scaler. It's norming each sample. Thanks...
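The distinction is easy to see with plain numpy (a sketch of the two operations, not scikit-learn's actual implementations): Normalizer rescales each sample (row) to unit norm, while a feature scaler standardizes each column.

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Sample-wise, what Normalizer(norm='l2') does: each row gets unit L2 norm.
row_normed = X / np.linalg.norm(X, axis=1, keepdims=True)

# Feature-wise, what a scaler like StandardScaler does: zero mean, unit variance per column.
col_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(row_normed)  # rows [0.6, 0.8] and [1.0, 0.0] -- unit norm per sample
```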

@jnothman

jnothman commented Dec 17, 2014

Member

LGTM!

Changed my mind: can I suggest that this new functionality be noted in the description of FeatureUnion in the narrative docs, and perhaps in the docstring?

@jnothman

jnothman commented Dec 25, 2014

Member

now LGTM! ;)

@jnothman jnothman changed the title from MRG Add transform fields option to FeatureUnion to [MRG+1] Add transform fields option to FeatureUnion Dec 25, 2014

@amueller

amueller commented Dec 29, 2014

Member

Thanks for your help @jnothman :) Any other reviews?

@amueller

amueller commented Jan 7, 2015

Member

@ogrisel a review would be much appreciated ;)

@amueller

amueller commented Mar 5, 2015

Member

Ping @ogrisel what do you think of this?

@bmabey

bmabey commented Mar 13, 2015

I'm new to the code base but FWIW LGTM.

I'm also biased since I'm excited to see this merged. :)

@GaelVaroquaux GaelVaroquaux changed the title from [MRG+1] Add transform fields option to FeatureUnion to [MRG+1-1] Add transform fields option to FeatureUnion Mar 13, 2015

@GaelVaroquaux

GaelVaroquaux commented Mar 13, 2015

Member

While I understand the problem that this is trying to solve, and I think that it is very important, I am a bit worried by the indexing in the sample direction. The changes are toying with our clear definition that the first direction of indexing should be a sample direction. The implications of such blurring of conventions are probably very deep. In particular, I expect code validation and error reporting to become harder and harder.

I know where this comes from: pandas has very strange indexing logics, and as a result an API that is hard to learn and error messages that are very open-ended. By contrast, scikit-learn has so far had a very strict set of conventions, which made it easier to learn and to give good error messages.

As this change basically introduces a new kind of input data, to capture heterogeneous data, I suggest that it should be confined to a new sub-module, in which objects only deal with heterogeneous data, and refuse to deal with the standard data matrices. We could document this module as the part of scikit-learn dedicated to heterogeneous data and define the input data type as anything that, when indexed with a string, returns an array of length n_samples. This would enable us to support pandas DataFrames, dictionaries of 1D arrays, and structured dtypes. It would probably make the documentation, discovery and future evolution of such support easier.

As a side note, the name 'field' is very unclear to me. I understood where it came from after reading half of the pull request, because the pull request has an obvious consistency and story behind it, but looking locally at a bit of the code, I had a hard time understanding why a 'field' was applied in the sample direction.

@GaelVaroquaux

GaelVaroquaux commented Mar 13, 2015

Member

Actually, when I think about this more, it seems to me that this class of features is of the same type as text support in scikit-learn, or almost image support. They are 'modality-specific', or 'data-type-specific', support. Thus it makes sense that they are in a separate module, marketed and documented as such.

Maybe sklearn.feature_extraction.heterogeneous_data is too deep :). But maybe that is also telling us that sklearn.feature_extraction.text is too deep too. sklearn.heterogeneous_data, sklearn.text and sklearn.image would feel to me natural and very readable.

What do people think? I think that this would be a pretty important addition to scikit-learn that might help a lot in guiding our users that work with heterogeneous data.

@amueller

amueller commented Mar 16, 2015

Member

I agree with you that working with heterogeneous data is a tricky thing and there are reasons to avoid it in scikit-learn.
However, this PR by no means introduces new kinds of input data. People have been using scikit-learn with heterogeneous data for a while. Look at the example in master. You can use pipelines and feature unions and grid-search happily to do things on dictionaries or pandas dataframes, and many people do that.

What the PR does is make it easier to build pipelines on these kinds of objects.

@amueller

amueller commented Mar 16, 2015

Member

Actually, I'm not sure about the dicts as inputs. Do we have tests for them? Can we use DictVectorizer in a GridSearchCV?

@amueller

amueller commented Mar 16, 2015

Member

Hum, DictVectorizer works on lists of dicts, while here I used a dict of arrays, which indeed is inspired by the pandas structure... hum... then it would introduce a new datatype, which is not great.
I have to check if the current code actually allows for cross_val_score and GridSearchCV, and somehow I imagine it doesn't...

@jnothman

jnothman commented Apr 13, 2015

Member

I know where this comes from: pandas has very strange indexing logics

Some of these strange logics are also incorporated in the numpy struct array, where one can use arr[field_name]. It's also a syntax that can be applied to a scikit-learn Bunch!

sklearn.heterogeneous_data

Uh, no. We'll need a better name than that; one that is easier to spell. I would hazard a guess that most data analysts work with heterogeneous data of some kind, and image processing may be one of the few fields where this is not common.

I agree that this is ultimately feature extraction, and that FeatureUnion is misplaced in Pipeline. However, I am certain that this is an essential preprocessing need, rather than an application-specific tool.

I am far from wedded to the name field, and I would like to see this extended to support functions as field extractors.

The other thing I dislike about this PR's approach is the use of parallel arrays for the set of transformers, their weights and their fields. It's hard to read. It would indeed be nicer to just wrap the transformer in FieldTransformer('body', my_transformer) (relative to the current example this avoids a Pipeline), but without more magic that I'm sure you would find objectionable, @GaelVaroquaux, this introduces too many underscores into parameter names.

Pandas is extremely popular for data analysis, as I was reminded when I had two former non-Python data analysts working on opposite sides of the world rave about it at my dinner table last week. A key reason for this is its ability to work with features that need different processing, which is almost as common as features requiring different scaling parameters, and for which plain old numpy falls short. It is imperative and urgent that we make scikit-learn/pandas integration as seamless as possible if this project is to remain a top choice for modern Python data analysts.
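For reference, the numpy structured-array behaviour mentioned above: a string selects a field while an integer selects a sample, the same overloaded `__getitem__` pandas uses.

```python
import numpy as np

# A structured array supports both styles of indexing:
arr = np.array([(1.0, b"spam"), (2.0, b"eggs")],
               dtype=[("value", "f8"), ("text", "S4")])

print(arr["value"])  # field (column) access: array([1., 2.])
print(arr[0])        # sample (row) access: (1., b'spam')
```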

@amueller

amueller commented Apr 13, 2015

Member

I agree with @jnothman, but I also think we should be careful on the input data types. The dicts I support in this PR are not a valid input type for GridSearchCV.
Re underscores: maybe we should get rid of these anyhow [damn, I had a notebook on how to do that, now I can't find it any more]

@jnothman

jnothman commented Apr 13, 2015

Member

The dicts I support in this PR are not a valid input type for GridSearchCV.

True, that's a problem. But I think by now pandas- and struct-array-style indexing, in which the getitem argument can index the first/series axis or can specify a field/column, is popular enough to be uncontroversial. It's overloaded semantics, but it works well in practice.

Re underscores: I don't know how you propose to get rid of them. It is possible to specify any parameter by the estimator object and its (shallow) parameter name, so set_params(list of such pairs) could work, but obviously not the kwargs stuff currently used. It also more strictly requires that the original parameters be saved as attributes without cloning (in __init__ or later).

I have also previously built metaestimators that pass parameters through without prefixing; and I've built a wrapper that allows you to alias parameters so that you can use the same param_grid even if the nesting of pipelines/feature unions changes.

@amueller

amueller commented Apr 13, 2015

Member

I wrote a wrapper that recreates the pipeline during fit. You have to give a function that creates the aliases of the parameters in the param_grid to the actual parameters. That is clearly not the solution for sklearn.

What I saw some people propose is to attach the parameters to try more directly to the estimator, instead of specifying them in GridSearchCV.
So you'd do something like:

svm = SVC()
svm.grid_params = {'C': np.logspace(-3, 3)}
grid = GridSearchCV(make_pipeline(StandardScaler(), svm))
@jnothman

jnothman commented Apr 13, 2015

Member

That would be nice, except for the need to support other search mechanisms, and that even ParameterGrid can be more expressive than that (i.e. lists of dicts).

@amueller

amueller commented Apr 13, 2015

@jnothman

jnothman commented Apr 14, 2015

Member

I should also note that sklearn_pandas attempts to support this heterogeneous processing. It uses the step name–transformer pair input, but instead of a step name a field or list of fields is supplied. Indeed, it seems not to support set_params/get_params for setting transformer params.

@amueller

amueller commented May 19, 2015

Member

After trying to implement sample properties in #4696, I think the dict type I used here is actually good. In #4696, this would be supported in all slicing operations (also in X).

I feel that it is silly to support these dicts for sample_props but not for X, because I expect that will lead to people doing grid_search.fit(X=np.zeros(n_samples), sample_props=actual_X)

So I came around and think this PR is actually good, with minor additions to safe_indexing from #4696.
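A sketch of what sample-direction slicing over a dict of arrays could look like; `dict_safe_indexing` is a hypothetical helper, not the actual `safe_indexing` from #4696:

```python
import numpy as np

def dict_safe_indexing(X, indices):
    """Index along the sample axis, for arrays and dicts of column arrays alike."""
    if isinstance(X, dict):
        # Apply the same sample indices to every column array.
        return {key: np.asarray(col)[indices] for key, col in X.items()}
    return np.asarray(X)[indices]

X = {"age": np.array([21, 35, 48]),
     "city": np.array(["NY", "SF", "LA"])}
subset = dict_safe_indexing(X, [0, 2])

print(subset["age"])   # [21 48]
print(subset["city"])  # ['NY' 'LA']
```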

@vene

vene commented Jun 3, 2015

Member

Another solution to what this PR is doing is to have a CallableTransformer (#4798) and then users can provide the ItemSelector functionality with a lambda x: x[field].
The solution here would make code using pipelines look nicer. I have used this pattern before, but I kept my data grouped by samples, such that my ItemSelector had to do a loop.
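A minimal sketch of that idea; this `CallableTransformer` is a hypothetical stand-in for the one proposed in #4798:

```python
class CallableTransformer:
    """Wrap an arbitrary function as a stateless transformer."""

    def __init__(self, func):
        self.func = func

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return self.func(X)

# Recovering ItemSelector-like behaviour with a plain lambda, as suggested:
select_body = CallableTransformer(lambda x: x["body"])
data = {"body": ["some text", "more text"], "subject": ["a", "b"]}
print(select_body.fit(data).transform(data))  # ['some text', 'more text']
```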

@jnothman

jnothman commented Jun 6, 2015

Member

Making transformers_ with an OrderedDict seems sensible.

On 7 June 2015 at 05:24, Vlad Niculae wrote:

Yeah, unfortunately the order won't be alphabetical if the user passes a normal dict then. But to have both, the implementation would get messier. I'd choose the OrderedDict version.

@amueller

Member

amueller commented Jun 9, 2015

Was there a good way in doctests to get rid of the u prefix for unicode strings in Python 2?

@amueller


Member

amueller commented Jun 9, 2015

the text docs do something like

>>> something == ['the', 'result', 'without', 'u']
True

Is that the only way / our standard pattern?
@ogrisel ?
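For reference, a small sketch of why that comparison pattern is Python 2/3 agnostic (this is the standard workaround, not something specific to this PR):

```python
# A doctest printing this list would show u'...' prefixes on Python 2
# but not on Python 3. Comparing against plain str literals sidesteps
# the repr difference:
tokens = [u'the', u'result', u'without', u'u']

# True on Python 2 (u'x' == 'x' for ASCII text) and on Python 3
# (all str literals are unicode), so the doctest output is just "True".
print(tokens == ['the', 'result', 'without', 'u'])
```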

@jnothman

Member

jnothman commented Jun 9, 2015

> I agree, code duplication makes refactoring harder. What do you suggest instead @jnothman?

I didn't have a problem with incorporating this functionality into FeatureUnion. Even there we could just add a parameter extract_fields that means the names of transformers are also extraction keys...

@amueller

Member

amueller commented Jun 9, 2015

Well @GaelVaroquaux feels strongly about having anything slicing in the first direction in a general module. If we keep the FeatureUnion interface for this transformer there would be much less duplication.

@jnothman

Member

jnothman commented Jun 9, 2015

Okay. Together or apart aside:

I am tempted to suggest we just have a list of pairs for the transformers list. The first element in each pair would be both the field name and the transformer name, which means that field names must be valid identifiers not containing __ (and should also not be transformers, etc.). Maybe that's a silly set of constraints, but having a different input data structure here and in Pipeline/FeatureUnion seems set to confuse.

@amueller

Member

amueller commented Jun 10, 2015

OK, then I'll go back to the old interface, which would get rid of the sorting issue, give the same interface as in Pipeline / FeatureUnion, and remove the duplication. So that seems like a good idea.

@vene

Member

vene commented Jun 10, 2015

Does this API allow multiple transformers for the same column? Maybe they get automatically aliased to name_1, name_2, ...?

@jnothman

Member

jnothman commented Jun 10, 2015

The current implementation does allow multiple transformers for the same column. I'm suggesting an interface that does not. For such things was FeatureUnion created.

But part of the problem here is that we come to the limits of an interface where all configuration are provided in an object's construction using Python/NumPy primitives. Pipeline-like construction would be much more readable, future-proof and self-documenting if we used a factory idiom, doing away with complicated nested and parallel dict/list/tuple monsters:

myunion = FeatureUnion().append(MyExtractor(), weight=4, getter=operator.itemgetter('body'), use_sample_weight=True)

Similar could be achieved (uglier, IMO) by providing a namedtuple-like class in the API:

myunion = FeatureUnion([UnionEntry(MyExtractor(), weight=4, getter=operator.itemgetter('body'), use_sample_weight=True)])

Indeed, this would likely happen behind the scenes in the former example.

Yes, both of these begin to look like Java, but perhaps when used with discretion they are the right way to make usable APIs and legible code. I actually think the former passes the zen test better than the incumbent on Zen "Flat is better than nested... Readability counts... Practicality beats purity" measures.

@vene

Member

vene commented Jun 10, 2015

By "this" I indeed meant the API you proposed. It's probably an important feature.

@jnothman

Member

jnothman commented Jun 10, 2015

And by "you" you mean me, or @amueller? If me, I had been proposing to not allow multiple transformers for the same column without nesting a FeatureUnion. But now I'm just being tired of looking at lists of tuples and dicts of string keys and tuple values and parallel arrays and other such monsters.

@amueller

Member

amueller commented Jun 10, 2015

I feel if we make users nest FeatureUnion and ColumnTransformer then we failed.
I feel the interface as implemented now is not too bad; it doesn't have lists of tuples or parallel arrays.
I was just about to revert this to the parallel array solution.
If we do any other API, there will be very little code reuse with FeatureUnion.

What we could do is use the "name is the column name" thing by default and keep the feature union. And if anyone wants to use multiple transformers on the same column, they have to provide a separate array of column names.

+class ColumnTransformer(BaseEstimator, TransformerMixin):
+ """Applies transformers to columns of a dataframe / dict.
+
+ This estimator applies a transformer objects to each column or field of the

@raghavrv

raghavrv Jun 10, 2015

Member

applies transformer objects :)

@amueller amueller changed the title from [WIP] Add feature_extraction.ColumnTransformer to [MRG] Add feature_extraction.ColumnTransformer Jun 10, 2015

+
+
+class ColumnTransformer(BaseEstimator, TransformerMixin):
+ """Applies transformers to columns of a dataframe / dict.

@glouppe

glouppe Jul 19, 2015

Member

While I see the point of this transformer on dataframes and dicts, I find it too bad we cannot apply it on NumPy arrays. I would love to see a built-in way to apply transformers on selected columns only.

@glouppe

glouppe Jul 19, 2015

Member

(Coming late to the party, this might have been discussed before...)

@amueller

amueller Jul 30, 2015

Member

That would be pretty easy with the FunctionTransformer #4798
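A sketch of that suggestion, assuming the FunctionTransformer that #4798 added to sklearn.preprocessing (the column choice here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Select columns 0 and 2 of a plain NumPy array as a transformer step;
# validate=False skips input checking so the lambda gets X unchanged.
select_cols = FunctionTransformer(lambda X: X[:, [0, 2]], validate=False)

X = np.arange(12).reshape(4, 3)
print(select_cols.fit_transform(X).shape)  # (4, 2)
```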

@glouppe

glouppe Aug 30, 2015

Member

Indeed, +1

+
+ Parameters
+ ----------
+ transformers : dict from string to (string, transformer) tuples

@glouppe

glouppe Aug 30, 2015

Member

The implementation expects the dict values to be (transformer, string) tuples, and not (string, transformer) as documented here.

@glouppe

glouppe Aug 30, 2015

Member

Also, does the key used to access the column always need to be a string? E.g. what if I use an int to access the n-th column, or even a list to access several columns at once?

+ Input data, used to fit transformers.
+ """
+ transformers = Parallel(n_jobs=self.n_jobs)(
+ delayed(_fit_one_transformer)(trans, X[column], y)

@amueller

amueller Mar 31, 2017

Member

Should use .iloc if it exists, otherwise slice in the second direction, and allow multiple columns.
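A rough sketch of that dispatch (hypothetical helper name, not the PR's code):

```python
import numpy as np


def get_column(X, key):
    """Return column(s) `key` from X.

    Pandas objects are sliced positionally via .iloc, which preserves
    DataFrame/Series semantics; plain 2-D arrays are sliced along the
    second axis. `key` may be an int for one column or a list for several.
    """
    if hasattr(X, 'iloc'):
        return X.iloc[:, key]
    return X[:, key]


X = np.arange(12).reshape(4, 3)
print(get_column(X, [0, 2]).shape)  # (4, 2)
```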

@amueller amueller changed the title from [MRG] Add feature_extraction.ColumnTransformer to [WIP] Add feature_extraction.ColumnTransformer Mar 31, 2017

@amueller amueller added this to needs pr in Andy's pets Mar 31, 2017

@amueller amueller moved this from needs pr to PR phase in Andy's pets Mar 31, 2017

@amueller amueller moved this from PR phase to AJ in Andy's pets May 12, 2017

@jnothman jnothman referenced this pull request May 26, 2017

Closed

[MRG] Support for strings in OneHotEncoder #8793

4 of 4 tasks complete

@amueller amueller removed this from AJ in Andy's pets Jul 21, 2017

@jnothman jnothman added this to In progress in API and interoperability Aug 14, 2017

@GaelVaroquaux

Member

GaelVaroquaux commented Nov 28, 2017

Should we close this, in favor of #9012, to clean the tracker?

@amueller amueller closed this Nov 28, 2017

@jnothman jnothman moved this from In progress to Done in API and interoperability Jul 11, 2018
