
[MRG] DOC add mixed categorical / continuous example with ColumnTransformer #11197

Merged
merged 14 commits into scikit-learn:master on Jun 13, 2018

Conversation

@partmor (Contributor) commented Jun 4, 2018

Reference Issues/PRs

Fixes #11185

What does this implement/fix?

I include in the example gallery an example for ColumnTransformer that uses a dataset with mixed
categorical / continuous features.

Any other comments?

I have built the example using a dataset with mixed types stored in a numpy array. I don't think this is the most correct way to treat this kind of dataset (the array's dtype must be object), but I found it necessary in order to avoid using pandas.

@amueller (Member) commented Jun 4, 2018

Thanks. It would be nice not to require a download, or at least not custom parsing code, but I'm not sure that's realistic. I'm not sure if we want to allow pandas in the examples, but we could also just do a pd.read_csv from a URL for, say, the Titanic dataset. That would also give an opportunity to show the typical extraction of the labels from a dataframe.
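
For concreteness, a minimal sketch of that idea (using the titanic3.csv URL the merged example ends up pointing at; 'survived' is the label column in that file):

import pandas as pd

# Load the Titanic data straight from a URL -- no custom parsing code needed.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# The typical extraction of the labels from a dataframe:
X = data.drop('survived', axis=1)
y = data['survived']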

@partmor (Contributor, Author) commented Jun 4, 2018

@amueller I searched sklearn.datasets, and the only dataset I saw with mixed types was Boston housing, but its categorical feature was already encoded. So loading from a URL was the best thing I could think of.

Regarding the use of pandas: I found numpy not very elegant for holding string and float data; furthermore, not using pandas prevents us from demonstrating the true power of using sklearn.compose.make_column_transformer with column names.

I will proceed, then, to refactor the example using pd.read_csv on a conventional URL, e.g. Titanic.

@amueller (Member) commented Jun 4, 2018

I think that's a good idea, and yes, there's no dataset loader for a dataset with mixed features right now. fetch_openml will provide that "soon".
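
For reference, this is roughly what the later-released loader looks like (fetch_openml landed in 0.20; the as_frame option, which returns a mixed-dtype DataFrame, came in a later release):

from sklearn.datasets import fetch_openml

# Fetch the Titanic dataset from openml.org as a DataFrame plus label series.
X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)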

@partmor (Contributor, Author) commented Jun 4, 2018

I've changed the example.
Here ColumnTransformer takes two preprocessing pipelines (one for the numerical and one for the categorical data). I also implemented a custom imputer for categorical data that is used in the categorical pipeline.

To go straight to the point with the usage of ColumnTransformer, I performed a train-test split and evaluated the model directly with default params. I avoided hyperparameter tuning (grid search, ...) so as not to clutter the example excessively.
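
A rough sketch of that structure, using the released OneHotEncoder in place of the dev-branch CategoricalEncoder('onehot-dense'), and a guessed minimal version of the custom imputer (its actual implementation isn't shown in this thread):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class CustomCategoryImputer(BaseEstimator, TransformerMixin):
    # A guess at the custom imputer: replace NaN with a fixed label.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.fillna('missing')


numerical_transformer = make_pipeline(SimpleImputer(), StandardScaler())
categorical_transformer = make_pipeline(
    CustomCategoryImputer(), OneHotEncoder(handle_unknown='ignore'))

# One preprocessing pipeline per column group:
preprocessing = ColumnTransformer(
    [('num', numerical_transformer, ['age', 'fare']),
     ('cat', categorical_transformer, ['embarked', 'sex', 'pclass'])])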

@TomDLT (Member) commented Jun 5, 2018

I am surprised CustomCategoryImputer is not something we already have.
I might be out of the loop, but if we don't have it, I think we should: maybe in SimpleImputer with a new strategy (strategy='fixed'), or inside CategoricalEncoder (or its refactored version in #10523).

@partmor (Contributor, Author) commented Jun 5, 2018

@TomDLT I have checked the imputers in sklearn.impute (including SimpleImputer) and only numerical types are supported (strategy={'mean', 'median', 'most_frequent'}).

Setting handle_unknown='ignore' in CategoricalEncoder allows it to return all zeros when unseen categories are encountered during transform, but it still can't handle np.nan (even during fit).
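
A quick demonstration of both behaviours, using the released OneHotEncoder (the class CategoricalEncoder was eventually folded into):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['a'], ['b'], ['c']])

# An unseen category at transform time encodes as all zeros:
print(enc.transform([['d']]).toarray())   # [[0. 0. 0.]]

# np.nan, however, raised an error in 0.20-era scikit-learn, even during fit:
# enc.fit([['a'], [np.nan]])   # -> error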

@jorisvandenbossche (Member):

Yes, the inability to handle missing values with categorical data (both in SimpleImputer and CategoricalEncoder) is a known problem, which we hope to fix at least with a basic solution before the release. We were discussing this yesterday, and were also thinking of adding a 'constant'/'fixed' strategy for SimpleImputer.
But until then, the custom transformer here is fine I think (although it distracts from the actual example); we can later simply replace it with the built-in solution.
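
That built-in solution did land for 0.20; it looks like this:

from sklearn.impute import SimpleImputer

# Impute a fixed label so missing values become their own one-hot category:
categorical_imputer = SimpleImputer(strategy='constant', fill_value='missing')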

@partmor (Contributor, Author) commented Jun 6, 2018

@jorisvandenbossche I agree it could distract a bit from the core matter of the example, but I thought it was interesting to include in order to use a Pipeline for the categorical data inside ColumnTransformer.

The alternative solution would have been to use pandas' dropna on the categorical subset at the beginning of the script, and just apply CategoricalEncoder.

Also, I am tempted to extend the example a bit more and include hyperparameter optimization with grid search, for instance, in order to demonstrate how to access the parameters within a ColumnTransformer. I'll try to do this later today.
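
Something along these lines, assuming the preprocessing ColumnTransformer and train/test split sketched above (the 'preprocessing', 'num' and 'simpleimputer' step names are hypothetical and depend on how the steps are actually named):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([('preprocessing', preprocessing),
                ('classifier', LogisticRegression())])

param_grid = {
    # Double underscores walk down into nested estimators:
    'preprocessing__num__simpleimputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10],
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)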

Any further thoughts?

@jorisvandenbossche (Member) left a comment

I don't know how 'far' we want to go for this example, but it could also be nice to show that you can do a grid search on the parameters of the preprocessing steps.


# We will train our classifier with the following features
num_feats = ['age', 'fare']
cat_feats = ['embarked', 'sex']
Member:

I think you can keep 'pclass' as well. It is already an integer, but it then shows that one-hot encoding works for that too.

Contributor Author:

Sure, no problem

CategoricalEncoder('onehot-dense')
)

preprocessing_pl = ColumnTransformer(
Member:

Should we use make_column_transformer for the example?

Contributor Author:

I decided to instantiate the object directly because I had already seen an example that uses make_column_transformer, and the latter doesn't have the remainder param, which I thought was worth showing.

I wouldn't have any problem using make_column_transformer if that's the use we want to encourage.

Member:

There is a PR to add the remainder keyword to make_column_transformer (it was an oversight that it is not there), so that should at least be fixed soon.

Contributor Author:

Ah! great stuff. Will switch to make_column_transformer then.

Member:

PR is merged: #11183
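
For reference, a sketch of the make_column_transformer spelling with the remainder keyword, reusing the transformers defined in the sketch above. Note the argument order shown is the modern one, (transformer, columns); the 0.20-era signature discussed here took the tuple the other way around:

from sklearn.compose import make_column_transformer

# remainder='drop' is the default; shown explicitly since it was the point.
preprocessing = make_column_transformer(
    (numerical_transformer, ['age', 'fare']),
    (categorical_transformer, ['embarked', 'sex', 'pclass']),
    remainder='drop')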

@jorisvandenbossche (Member):

The alternative solution would have been to use pandas' dropna on the categorical subset at the beginning of the script, and just apply CategoricalEncoder.

An alternative pandas solution would be to fillna('missing') on the categorical columns (which would be equivalent to your custom transformer).
Anyhow, I don't think this is too important, as we plan to replace this part with a SimpleImputer in the short term.

Also, I am tempted to extend the example a bit more and include a hyperparameter optimization with grid search

I commented just the same, so yes, I think that's a good idea :-)
I did a small GridSearchCV in my example here https://jorisvandenbossche.github.io/blog/2018/05/28/scikit-learn-columntransformer/#Using-our-pipeline-in-a-grid-search

@jorisvandenbossche (Member):

One other suggestion, certainly if you make the example longer, is to use the typical sphinx-gallery syntax to alternate code and text blocks; see e.g. https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_compare_reduction.py
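
The convention in question, in miniature (the title and file name here are just placeholders):

"""
===================================
Column Transformer with Mixed Types
===================================

The module docstring becomes the intro text of the rendered example.
"""
import pandas as pd

###############################################################################
# A comment block opened by a line of ``#`` characters is rendered as a prose
# section between code blocks on the generated gallery page.

data = pd.read_csv('titanic3.csv')   # code resumes after the text block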

@partmor (Contributor, Author) commented Jun 6, 2018

Alternative pandas solution would also be to fillna('missing') on the categorical columns (which would be equivalent to your custom transformer).

True. Should we go for this then?

I commented just the same, so yes, I think that's a good idea :-)
I did a small GridSearchCV in my example here https://jorisvandenbossche.github.io/blog/2018/05/28/scikit-learn-columntransformer/#Using-our-pipeline-in-a-grid-search

Will go for it! Thanks for the link!!

One other suggestion, certainly if you would make the example longer, is to use the typical sphinx gallery syntax to alternate code and text blocks, see eg in https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_compare_reduction.py

Great! I'll use this. I was starting to worry about all the # comment lines I would have to use :)

Many thanks for the feedback, @jorisvandenbossche! I'll incorporate these changes.

@jnothman (Member) commented Jun 6, 2018

Either at the top or in a comment, you should describe the nature of the features: are they numeric representations of categories, or strings with how many values?

@jnothman (Member) commented Jun 6, 2018

Why can't we do mode imputation in a pipeline on the categorical features after an ordinal encoding? It's a bit ugly, but it's not so bad...

# defined in our ``ColumnTransformer``, together with the classifier's
# hyperparameters as part of a ``Pipeline``.
# ``ColumnTransformer`` integrates well with the rest of scikit-learn,
# in particular with ``GridSearchCV`.`
Member:

backticks not in correct places

Contributor Author:

My bad. Also, should I use sphinx syntax to reference other sklearn classes rather than simply double-backticking?

@jnothman (Member) commented Jun 7, 2018 via email

@partmor (Contributor, Author) commented Jun 7, 2018

Either at the top or in a comment, you should describe the nature of the features: are they numeric representations of category, or strings with how many values?

Will do.

Why can't we do mode imputation in a pipeline on the categorical features after an ordinal encoding? It's a bit ugly, but it's not so bad...

You mean using LabelEncoder as a first step before one-hot encoding? I am aware we might be abusing the original intention of the label encoder, but it would already handle the imputing implicitly (it would be equivalent to imputing a fixed value). E.g.:

import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', np.nan, 'c'])
le.transform([np.nan, 'b', 'c', 'a', np.nan])
# array([3, 1, 2, 0, 3])

The output from this transformer can feed directly into the CategoricalEncoder.
@jnothman @jorisvandenbossche what do you think of this?

This example should be referenced in place of hetero_feature_union at the
end of doc/faq.rst

Would this mean just substituting :ref:`sphx_glr_auto_examples_hetero_feature_union.py` for :ref:`sphx_glr_auto_column_transformer_mixed_types.py`?

@jorisvandenbossche (Member):

You mean?: use LabelEncoder as 1st step before one-hot encoding

No, @jnothman meant to use CategoricalEncoder(encoding='ordinal') (the new, non-abusive version of LabelEncoder for features :-))

Personally I would just leave it like it is for now. The intent is that SimpleImputer(strategy='constant') will be available before the release, and then we can use that in this example (which results in a separate category for 'missing').
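
In released scikit-learn, that spelling became OrdinalEncoder:

from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder()
ord_enc.fit([['a'], ['b'], ['c']])
print(ord_enc.transform([['c'], ['a']]))   # [[2.], [0.]]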

This example should be referenced in place of hetero_feature_union at the
end of doc/faq.rst​

You can also add this example to the list of examples in the section on ColumnTransformer in compose.rst

@glemaitre glemaitre added this to the 0.20 milestone Jun 8, 2018
# We can finally fit the model using the best hyperparameters found
# and assess model performance on holdout set
print(("best logistic regression from grid search: %f"
% grid_search.best_estimator_.score(X_test, y_test)))
Member:

I consider this an antipattern. Use grid_search.score(X_test, y_test) unless you want to explicitly override the grid search's scoring (in this case accuracy) with the estimator's scoring (also accuracy). Also, the comment is misaligned; the fitting happens above it. Maybe just remove the comment, as the code is pretty self-explanatory imho.
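
I.e., with the grid_search and held-out test split from the earlier sketches:

# Scores with the metric the grid search itself was configured with:
print("best logistic regression from grid search: %f"
      % grid_search.score(X_test, y_test))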

@amueller (Member):

Looks good apart from nitpicks (and sorry about the individual comments). I think the barrage of underscores in the grid search is telling. But I'm happy we can finally actually do this!

@partmor (Contributor, Author) commented Jun 10, 2018

@amueller thank you for the feedback! I've changed the example based on your comments. Additionally, I've tried to write a better intro; I wasn't fully happy with it either.

@ogrisel (Member) left a comment

Looks good as well. Here are a few more comments to improve the readability even further.

# - sex: categories encoded as strings {'female', 'male'}.
# - plcass: categories encoded as ints {1, 2, 3}.
num_feats = ['age', 'fare']
cat_feats = ['embarked', 'sex', 'pclass']
Member:

Better to be explicit and call those variables numerical_features and categorical_features. Readability is extra important for examples.

Contributor Author:

I agree.


# We create the preprocessing pipelines for both numeric and categorical data.
num_pl = make_pipeline(SimpleImputer(), StandardScaler())
cat_pl = CategoricalEncoder('onehot-dense', handle_unknown='ignore')
Member:

Similar remark here; may I suggest numerical_transformer and categorical_transformer?

Contributor Author:

I agree.


# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf_pl = make_pipeline(preprocessing_pl, LogisticRegression())
Member:

Either clf or pipeline.

Contributor Author:

I agree.


# Provisionally, use pd.fillna() to impute missing values for categorical
# features; SimpleImputer will eventually support strategy="constant".
data.loc[:, cat_feats] = data.loc[:, cat_feats].fillna(value='missing')
Member:

Do we really need the .loc[:, columns] pattern here? @jorisvandenbossche, what is your opinion on using:

data[categorical_features] = data[categorical_features].fillna(value='missing')

Member:

Yes, that would work the same.
(I personally like that more, and would do it like that in my own code, but it's a bit subjective.)

Contributor Author:

I will change it to make it more readable; it does work the same. I've had traumas with SettingWithCopyWarning from pandas, which is why I sometimes use .loc unconsciously :-)

@partmor (Contributor, Author) commented Jun 12, 2018

There were conflicts with a recent merge to master: I have moved the example (and its references) under the compose folder.

@jnothman (Member) left a comment

Looks good, pending improvements to Imputer (perhaps #11211) and to ColumnTransformer (#11190)

# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - plcass: categories encoded as ints {1, 2, 3}.
Member:

*pclass

You might want to describe this as "ordinal" rather than "categories", or as "ordered categories".

Contributor Author:

Will do.


# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
'scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv')
Member:

Maybe it's better practice to replace master with a fixed commit hash, i.e. e7d90ee.

Contributor Author:

Will do.

@jorisvandenbossche (Member) left a comment

Also minor remarks; for the rest, looks good!


In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
endcoded after imputing missing values with a new category
Member:

typo: endcoded -> encoded

In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
endcoded after imputing missing values with a new category
(:code:`'missing'`).
Member:

What is the :code: role doing? (I haven't seen this sphinx role before.)
I think it can simply be double backticks.

Contributor Author:

Same as double backticks, yep. Will use the latter.

@partmor (Contributor, Author) commented Jun 12, 2018

Updated with your comments :)

@jorisvandenbossche (Member):

Looks good, pending improvements to Imputer (perhaps #11211) and to ColumnTransformer (#11190)

@jnothman do you want to merge those PRs first so this example can be updated?

I think this PR is ready to be merged, so I would do that and then those other PRs can update the example.

@jnothman (Member):

Thanks @partmor

@jnothman jnothman merged commit bd19a84 into scikit-learn:master Jun 13, 2018

@partmor (Contributor, Author) commented Jun 13, 2018

Thank you all for your guidance and patience @jnothman @jorisvandenbossche @amueller @ogrisel!!
This is my first contribution ever to a big open source project!! This only encourages me to keep going!!!
Thanks again!

@partmor (Contributor, Author) commented Jun 13, 2018

Looks good, pending improvements to Imputer (perhaps #11211) and to ColumnTransformer (#11190)

I'll stay tuned for when strategy='constant' is implemented in SimpleImputer, and make a PR to include it in the example.

@jnothman (Member) commented Jun 13, 2018 via email

@jnothman (Member) commented Jun 13, 2018 via email

@jorisvandenbossche (Member):

sorry Joris, i didn't see your comment. I don't see anything wrong with updating those PRs (though one isn't a PR yet) instead of waiting on them.

No problem, you already did exactly what I was suggesting in the meantime :-)

@partmor partmor deleted the column-transformer-example branch June 13, 2018 12:23
@ogrisel (Member) commented Jun 13, 2018

Thanks @partmor!

@qinhanmin2014 (Member):

Question here: would it be better to put the data under scikit-learn/examples-data?
(1) It seems a bit strange to fetch data from someone else's repo (though I respect amueller, one of our leaders).
(2) It would be good for scikit-learn to have a dataset with missing data and string-encoded categorical variables (see #8888).
(3) We could also provide users with the source and some basic information about the data in the same repo.

@jorisvandenbossche (Member):

I think the idea is that the (future) openml loader will provide such datasets (instead of us adding new single datasets to sklearn for each use case, although a heterogeneous dataset with missing values is of course an important use case...).

@qinhanmin2014 (Member):

I think the idea is that the (future) openml loader will provide such datasets

That's fine. Thanks a lot for the instant feedback :)

@amueller (Member):

Yeah, it's weird to use my repo, I totally agree. Hopefully we can get rid of that soon. We could just ship the csv without a loader to use in examples (because CSV is such a common format), but that seems slightly strange.
Having an example that contains the boilerplate for separating a target from the data seems very helpful to me. Most data I've worked with required that, but none of our examples do.
