
Add DataFrameMapper feature metadata and y-value support. #54

Closed
asford wants to merge 5 commits

Conversation


@asford commented Mar 2, 2016

This pull request adds the features summarized in the commit messages below.

Example Notebook

I'm absolutely open to code-review and discussion of the proposed interfaces before merging.

TODO

  • Add test coverage for added features.
  • Clarify y-value extraction semantics for cases where the y_feature column may be present in both the X and y frame inputs.
  • Clarify acceptable y inputs. Series & DataFrame support?
  • Make the _dataframe_mapper and _final_estimator properties of DataFramePipeline public?
  • Add a helper function to support passing DataFramePipelines into sklearn cross-validation functions with extracted X and y values?

  • Adds an optional 'y_feature' to DataFrameMapper and an 'extract_y' method, equivalent to 'transform', to extract y value arrays from the input dataframe. Updates the 'fit' method to accept optional 'y' values.
  • Adds extract_* interfaces to DataFrameMapper to support extraction of X and y ndarrays for downstream sklearn components.
  • Adds 'feature_indices_' to DataFrameMapper, tracking the indices of features in the output feature array. Updates the class docstring to sklearn standard format.
  • Adds DataFramePipeline to support extraction of fitting targets via the input DataFrameMapper. 'fit', 'fit_predict', 'fit_transform', and 'score' are updated to extract fitting targets from the input DataFrame via the DataFrameMapper 'extract_y' interface. The pipeline requires its first step to be a DataFrameMapper instance.
  • Adds 'X_*' properties to DataFrameMapper, allowing association of source column metadata with output feature indices. Fixes the 'feature_indices_' property definition.
  • Adds an initial example of DataFrameMapper and DataFramePipeline.
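
As a minimal sketch of how the proposed interfaces might be used together (argument names and signatures follow this PR's current draft and may change):

from sklearn_pandas import DataFrameMapper

# Proposed: an optional y_feature names the target column, and extract_y
# pulls the corresponding array out of the same input DataFrame.
# iris: a pandas DataFrame containing the iris dataset.
mapper = DataFrameMapper(["sepal_length", "sepal_width", "petal_length"],
                         y_feature="species")
X = mapper.fit_transform(iris)  # feature array for downstream estimators
y = mapper.extract_y(iris)      # target array extracted from the same frame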

dukebody commented Mar 6, 2016

Thanks a lot @asford, I will review the PR shortly.


dukebody commented Mar 6, 2016

Hi @asford. I checked your code and I believe storing feature_indices_ can be a good idea for later inspection, or even for implementing a kind of inverse_transform.

However, I feel the code complexity added to be able to use the mapper to extract values for y is too high. One can just do LabelEncoder().fit_transform(df['y_col']) to get it, instead of pipeline._dataframe_mapper.extract_y(df). It's just one column, and therefore in my humble opinion the extra complexity doesn't pay off.

Could you submit a PR with just the feature_indices_ feature? This would solve the use case described in #13.

Thanks!


asford commented Mar 7, 2016

On partitioning the pull request: no problem. I opened this after realizing my customizations addressed a few open issues, so it has a few features mixed together. I'll break out the feature_indices_ and metadata properties into a separate pull request.


asford commented Mar 7, 2016

@dukebody My goal for the y-value extraction feature was to support syntactic sugar closer to R's high level interfaces. For example (from http://koaning.github.io/html/patsy.html):

library(randomForest) 

formulas = c(
  as.formula("Species ~ Sepal.Length + Sepal.Width + Petal.Length"),
  as.formula("Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width")
)

for(formula in formulas){
  print(randomForest(formula, data=iris))
}

This style of interface clearly captures the feature and target columns in a single object, instead of spreading the model definition across multiple objects. This is a standard way of handling symbolic model definition (R formulas, patsy, etc.) because it supports a clear separation of concerns.

I agree that the current interface (pipeline._dataframe_mapper.extract_y(df)) isn't there yet, but I think there's merit in holding an (optional) target y feature in the DataFrameMapper object. In the basic case, it supports the interface style above. In more complex cases, it allows the efficient use of preprocessing transformers on y-values (e.g. thresholding via Binarizer), as sketched below.
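
As a purely illustrative sketch of that more complex case (the tuple form of y_feature below is an assumption, not part of the current diff):

from sklearn.preprocessing import Binarizer
from sklearn_pandas import DataFrameMapper

# Hypothetical: pairing y_feature with a transformer would let the mapper
# threshold the target column itself, rather than doing it by hand.
mapper = DataFrameMapper(
    ["sepal_length", "sepal_width", "petal_length"],
    y_feature=("petal_width", Binarizer(threshold=1.0)),
)
y = mapper.extract_y(iris)  # binarized target array (1 where petal_width > 1.0)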

Would you be open to a revised version of the y-feature components of this pull request with a refined interface?

For example, cross-validation with y features:

formulas = [
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length"], "species"),
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length", "petal_width"], "species"),
]

for formula in formulas:
    pipeline = make_dataframe_pipeline([formula, RandomForestClassifier()])
    print(confusion_matrix(pipeline.extract_y(iris), pipeline.fit(iris).predict(iris)))
    print(cross_val_score(*pipeline.estimator_Xy(iris), scoring="accuracy", cv=5))

Cross-validation without y features:

formulas = [
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length"]),
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length", "petal_width"]),
]

for formula in formulas:
    pipeline = make_dataframe_pipeline([formula, RandomForestClassifier()])
    print(confusion_matrix(
        LabelEncoder().fit_transform(iris['species']),
        pipeline.fit(
            iris, LabelEncoder().fit_transform(iris['species'])
        ).predict(iris)
    ))
    print(cross_val_score(
        pipeline, iris, LabelEncoder().fit_transform(iris['species']),
        scoring="accuracy", cv=5
    ))

@dukebody

@asford I agree that the "formula-like" use case is interesting, but I'd prefer to try to implement it without making the DataFrameMapper code significantly more complex.

I was thinking it might be possible to do something similar by creating a custom pipeline class that accepts two DataFrameMappers: one for the X variables and another one for the y variables. This way the DataFrameMapper code doesn't have to be modified, since it will still just do select-and-transform. Does this sound reasonable?

Pseudo-code:

mapper_x = DataFrameMapper(["sepal_length", "sepal_width", "petal_length"])
mapper_y = DataFrameMapper([("species", LabelEncoder())])
pipeline = make_dataframe_pipeline((mapper_x, mapper_y), RandomForestClassifier())
pipeline.fit(dataframe)  # this would generate X and y inputs and call random_forest.fit(X, y)

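A rough sketch of how such a two-mapper pipeline could be wired up (the class and attribute names here are illustrative, not an actual implementation):

# Sketch only: two DataFrameMappers produce X and y from the same input
# DataFrame before the final estimator is fit.
class TwoMapperPipeline:
    def __init__(self, x_mapper, y_mapper, estimator):
        self.x_mapper = x_mapper
        self.y_mapper = y_mapper
        self.estimator = estimator

    def fit(self, df):
        X = self.x_mapper.fit_transform(df)
        y = self.y_mapper.fit_transform(df).ravel()  # flatten the single target column
        self.estimator.fit(X, y)
        return self

    def predict(self, df):
        return self.estimator.predict(self.x_mapper.transform(df))
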
On a separate note, your commit also introduces a new interface for selecting features without applying any transformation:

DataFrameMapper(['feat']) == DataFrameMapper([('feat', None)])

While it is not consistent with the (feature, transformer) signature, I find it not hard to understand, and it can save a lot of typing. What I don't understand, even after reading the code, is why in your example using "species" without any transformer would "label-encodify" the value. Or is it a mistake in the example?

@dukebody

@asford I've created an alternative implementation of the feature_indices_ feature here: #56. Thoughts? :)


asford commented Apr 4, 2016

@dukebody

I've had a chance to read through your alternate implementation and think about the y-feature component of the DataFrameMapper, and I agree that y feature extraction is best handled at the pipeline level.

In this pull request's current implementation, the pipeline contains:

class DataFramePipeline:
    # steps is a list of sklearn pipeline components.
    # steps[0] must be a DataFrameMapper.
    # steps[-1] may be an estimator.
    steps = list(sklearn_steps)

In the updated interface, the pipeline would contain:

class DataFramePipeline:
    x_mapper = DataFrameMapper
    y_mapper = DataFrameMapper
    # steps is a list of sklearn pipeline components.
    # steps[-1] may be an estimator.
    steps = list(sklearn_steps)

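Constructing such a pipeline might then look roughly like this (the constructor signature below is only a sketch of the intended shape):

pipeline = DataFramePipeline(
    x_mapper=DataFrameMapper(["sepal_length", "sepal_width", "petal_length"]),
    y_mapper=DataFrameMapper([("species", LabelEncoder())]),
    steps=[RandomForestClassifier()],
)
pipeline.fit(iris)  # x_mapper and y_mapper produce X and y for the estimator
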
I will rework this pull request over the next few days to match this interface and work out the API.

Would you mind holding off on #56 until we can iterate on this change? I'm not entirely convinced that storing feature_indices instead of feature_widths in the DataFrameMapper is the best representation to use, particularly for the implementation of inverse transforms. It seems that the feature output width is the simplest way of representing this data, and the corresponding indices can be trivially regenerated from the widths, as sketched below.
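
For example, a minimal sketch of regenerating indices from widths (the variable names are illustrative):

import numpy as np

# Widths of each transformed feature block in the output array,
# e.g. two numeric columns plus a three-level one-hot encoded column.
feature_widths = [1, 1, 3]
feature_indices = np.concatenate(([0], np.cumsum(feature_widths)))
# feature_indices -> array([0, 1, 2, 5]); block i spans indices[i]:indices[i+1],
# mirroring sklearn OneHotEncoder's feature_indices_ convention.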

@dukebody

Hi @asford. Yes, I can hold off on #56, no problem.

However, I don't really understand why you think storing feature_widths instead of feature_indices is preferable. I believe they are mostly equivalent. I just used feature_indices to mimic sklearn's OneHotEncoder.
