
Add DataFrameMapper feature metadata and y-value support. #54

Closed
asford wants to merge 5 commits

Conversation


@asford commented Mar 2, 2016

This pull request adds the features summarized in the commit messages below.

Example Notebook

I'm absolutely open to code-review and discussion of the proposed interfaces before merging.

TODO

  • Add test coverage for added features.
  • Clarify y-value extraction semantics for cases where the y_feature column may be present in both the X and y frame inputs.
  • Clarify acceptable y inputs. Series & DataFrame support?
  • Make the _dataframe_mapper and _final_estimator properties of DataFramePipeline public?
  • Add a helper function to support passing DataFramePipelines into sklearn cross-validation functions with extracted X and y values?

  • Adds an optional 'y_feature' to DataFrameMapper and an 'extract_y' method, equivalent to 'transform', to extract y value arrays from the input dataframe. Updates the 'fit' method to accept optional 'y' values.
  • Adds extract_* interfaces to DataFrameMapper to support extraction of X and y ndarrays for downstream sklearn components.
  • Adds 'feature_indices_' to DataFrameMapper, tracking the indices of features in the output feature array. Updates the class docstring to sklearn standard format.
  • Adds DataFramePipeline to support extraction of fitting targets via the input DataFrameMapper. 'fit', 'fit_predict', 'fit_transform', and 'score' are updated to extract fitting targets from the input DataFrame via the DataFrameMapper 'extract_y' interface. The pipeline requires its first step to be a DataFrameMapper instance.
  • Adds 'X_*' properties to DataFrameMapper, allowing association of source column metadata with output feature indices. Fixes the 'feature_indices_' property definition.
  • Adds an initial example of DataFrameMapper and DataFramePipeline.
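
As a minimal sketch of how the proposed interfaces might be used together (argument names and signatures follow this PR's current draft and may change):

from sklearn_pandas import DataFrameMapper

# Proposed: an optional y_feature names the target column, and extract_y
# pulls the corresponding array out of the same input DataFrame.
# iris: a pandas DataFrame containing the iris dataset.
mapper = DataFrameMapper(["sepal_length", "sepal_width", "petal_length"],
                         y_feature="species")
X = mapper.fit_transform(iris)  # feature array for downstream estimators
y = mapper.extract_y(iris)      # target array extracted from the same frame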

dukebody commented Mar 6, 2016

Thanks a lot @asford, I will review the PR shortly.


dukebody commented Mar 6, 2016

Hi @asford. I checked your code and I believe storing feature_indices_ can be a good idea for later inspection, or even for implementing a kind of inverse_transform.

However, I feel the code complexity added to be able to use the mapper to extract values for y is too high. One can just do LabelEncoder().fit_transform(df['y_col']) to get it, instead of pipeline._dataframe_mapper.extract_y(df). It's just one column, and therefore in my humble opinion the extra complexity doesn't pay off.

Could you submit a PR with just the feature_indices_ feature? This would solve the use case described in #13.

Thanks!


asford commented Mar 7, 2016

On partitioning the pull request: no problem. I opened this after realizing my customizations addressed a few open issues, so it has a few features mixed together. I'll break out the feature_indices_ and metadata properties into a separate pull request.


asford commented Mar 7, 2016

@dukebody My goal for the y-value extraction feature was to support syntactic sugar closer to R's high level interfaces. For example (from http://koaning.github.io/html/patsy.html):

library(randomForest) 

formulas = c(
  as.formula("Species ~ Sepal.Length + Sepal.Width + Petal.Length"),
  as.formula("Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width")
)

for(formula in formulas){
  print(randomForest(formula, data=iris))
}

This style of interface clearly captures the feature and target columns in a single object, instead of spreading the model definition across multiple objects. This is a standard way of handling symbolic model definition (R formulas, patsy, etc.) because it supports a clear separation of concerns.

I agree that the current interface (pipeline._dataframe_mapper.extract_y(df)) isn't there yet, but I think there's merit in holding an (optional) target y feature in the DataFrameMapper object. In the basic case, it supports the interface style above. In more complex cases, it allows the efficient use of preprocessing transformers on y-values (e.g. thresholding via Binarizer), as sketched below.
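
As a purely illustrative sketch of that more complex case (the tuple form of y_feature below is an assumption, not part of the current diff):

from sklearn.preprocessing import Binarizer
from sklearn_pandas import DataFrameMapper

# Hypothetical: pairing y_feature with a transformer would let the mapper
# threshold the target column itself, rather than doing it by hand.
mapper = DataFrameMapper(
    ["sepal_length", "sepal_width", "petal_length"],
    y_feature=("petal_width", Binarizer(threshold=1.0)),
)
y = mapper.extract_y(iris)  # binarized target array (1 where petal_width > 1.0)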

Would you be open to a revised version of the y-feature components of this pull request with a refined interface?

For example, cross-validation with y features:

formulas = [
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length"], "species"),
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length", "petal_width"], "species"),
]

for formula in formulas:
    pipeline = make_dataframe_pipeline([formula, RandomForestClassifier()])
    print(confusion_matrix(pipeline.extract_y(iris), pipeline.fit(iris).predict(iris)))
    print(cross_val_score(*pipeline.estimator_Xy(iris), scoring="accuracy", cv=5))

Cross-validation without y features:

formulas = [
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length"]),
    DataFrameMapper(["sepal_length", "sepal_width", "petal_length", "petal_width"]),
]

for formula in formulas:
    pipeline = make_dataframe_pipeline([formula, RandomForestClassifier()])
    print(confusion_matrix(
        LabelEncoder().fit_transform(iris['species']),
        pipeline.fit(
            iris, LabelEncoder().fit_transform(iris['species'])
        ).predict(iris)
    ))
    print(cross_val_score(
        pipeline, iris, LabelEncoder().fit_transform(iris['species']),
        scoring="accuracy", cv=5
    ))

@dukebody

@asford I agree that the "formula-like" use case is interesting, but I'd prefer to try to implement it without making the DataFrameMapper code significantly more complex.

I was thinking it might be possible to do something similar by creating a custom pipeline class that accepts two DataFrameMappers: one for the X variables and another one for the y variables. This way the DataFrameMapper code doesn't have to be modified, since it will still just do select-and-transform. Does this sound reasonable?

Pseudo-code:

mapper_x = DataFrameMapper(["sepal_length", "sepal_width", "petal_length"])
mapper_y = DataFrameMapper([("species", LabelEncoder())])
pipeline = make_dataframe_pipeline((mapper_x, mapper_y), RandomForestClassifier())
pipeline.fit(dataframe)  # this would generate X and y inputs and call random_forest.fit(X, y)

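A rough sketch of how such a two-mapper pipeline could be wired up (the class and attribute names here are illustrative, not an actual implementation):

# Sketch only: two DataFrameMappers produce X and y from the same input
# DataFrame before the final estimator is fit.
class TwoMapperPipeline:
    def __init__(self, x_mapper, y_mapper, estimator):
        self.x_mapper = x_mapper
        self.y_mapper = y_mapper
        self.estimator = estimator

    def fit(self, df):
        X = self.x_mapper.fit_transform(df)
        y = self.y_mapper.fit_transform(df).ravel()  # flatten the single target column
        self.estimator.fit(X, y)
        return self

    def predict(self, df):
        return self.estimator.predict(self.x_mapper.transform(df))
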
On a separate note, your commit also introduces a new interface for selecting features without applying any transformation:

DataFrameMapper(['feat']) == DataFrameMapper([('feat', None)])

While it is not consistent with the (feature, transformer) signature, I find it not hard to understand, and it can save a lot of typing. What I don't understand, even after reading the code, is why in your example using "species" without any transformer would "label-encodify" the value. Or is it a mistake in the example?

@dukebody

@asford I've created an alternative implementation of the feature_indices_ feature here: #56. Thoughts? :)


asford commented Apr 4, 2016

@dukebody

I've had a chance to read through your alternate implementation and think about the y-feature component of the DataFrameMapper, and I agree that y feature extraction is best handled at the pipeline level.

In this pull request's current implementation, the pipeline contains:

class DataFramePipeline:
    # steps is a list of sklearn pipeline components.
    # steps[0] must be a DataFrameMapper.
    # steps[-1] may be an estimator.
    steps = list(sklearn_steps)

In the updated interface, the pipeline would contain:

class DataFramePipeline:
    x_mapper = DataFrameMapper
    y_mapper = DataFrameMapper
    # steps is a list of sklearn pipeline components.
    # steps[-1] may be an estimator.
    steps = list(sklearn_steps)

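Constructing such a pipeline might then look roughly like this (the constructor signature below is only a sketch of the intended shape):

pipeline = DataFramePipeline(
    x_mapper=DataFrameMapper(["sepal_length", "sepal_width", "petal_length"]),
    y_mapper=DataFrameMapper([("species", LabelEncoder())]),
    steps=[RandomForestClassifier()],
)
pipeline.fit(iris)  # x_mapper and y_mapper produce X and y for the estimator
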
I will rework this pull request over the next few days to match this interface and work out the API.

Would you mind holding off on #56 until we can iterate on this change? I'm not entirely convinced that storing feature_indices instead of feature_widths in the DataFrameMapper is the best representation to use, particularly for the implementation of inverse transforms. It seems that the feature output width is the simplest way of representing this data, and the corresponding indices can be trivially regenerated from the widths, as sketched below.
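
For example, a minimal sketch of regenerating indices from widths (the variable names are illustrative):

import numpy as np

# Widths of each transformed feature block in the output array,
# e.g. two numeric columns plus a three-level one-hot encoded column.
feature_widths = [1, 1, 3]
feature_indices = np.concatenate(([0], np.cumsum(feature_widths)))
# feature_indices -> array([0, 1, 2, 5]); block i spans indices[i]:indices[i+1],
# mirroring sklearn OneHotEncoder's feature_indices_ convention.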

@dukebody

Hi @asford. Yes, I can hold off on #56, no problem.

However, I don't really understand why you think storing feature_widths instead of feature_indices is preferable. I believe they are mostly equivalent. I just used feature_indices to mimic sklearn's OneHotEncoder.
