
[MRG+2] improved documentation of MissingIndicator #12424

Merged · 19 commits · Nov 14, 2018

Conversation

@janvanrijn (Contributor):

Reference Issues/PRs

Fixes #12417

What does this implement/fix? Explain your changes.

Personally, I found the documentation of MissingIndicator a bit misleading and tried to improve it. I also contributed an example to the user guide showing how to use it in classification pipelines.

Any other comments?

Nope

@janvanrijn janvanrijn changed the title improved documentation of MissingIndicator [MRG] improved documentation of MissingIndicator Oct 19, 2018
>>> transformer = sklearn.pipeline.FeatureUnion(
...     transformer_list=[
...         ('vanilla_features',
...          sklearn.impute.SimpleImputer(strategy='constant',
Member:

Is there a reason you're not using mean?

@@ -410,13 +410,18 @@ def transform(self, X):


class MissingIndicator(BaseEstimator, TransformerMixin):
"""Binary indicators for missing values.
"""Binary indicators for missing values. Note that this component typically
Member:

https://www.python.org/dev/peps/pep-0257/

single-line description, then a blank line, then the longer description

>>> import sklearn.pipeline
>>> import sklearn.tree
>>> X, y = sklearn.datasets.fetch_openml('anneal', 1, return_X_y=True)
>>> X_train, X_test, y_train, _ = sklearn.model_selection.train_test_split(X, y, test_size=100, random_state=0)
Member:

Please split this line so that lines are no longer than 80 chars

Member:

Doing from imports will make it shorter

... sklearn.tree.DecisionTreeClassifier())


Note that the anneal dataset has 38 columns. By applying the `features='all'`
Member:

I don't think this section provides very much value.

Contributor (author):

My motivation was that it adds an explanation why we can expect the number of columns that we see (38*2=76). I can remove this of course, if you think this is common knowledge. WDYT?

Contributor (author):

Note that I changed the dataset to audiology. anneal has columns that are completely empty, which are silently dropped by SimpleImputer; that would make the description a bit more complicated.
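The behaviour described here is easy to verify; a minimal sketch (not part of the PR, toy data chosen for illustration) showing that a fully empty column is dropped by SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one column that is entirely NaN.
X = np.array([[1.0, np.nan],
              [2.0, np.nan]])

# SimpleImputer cannot compute a statistic for the all-NaN column,
# so it drops that column from the transformed output.
Xt = SimpleImputer(strategy='mean').fit_transform(X)
print(Xt.shape)  # (2, 1): the all-NaN column is gone
```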

@@ -412,11 +412,18 @@ def transform(self, X):
class MissingIndicator(BaseEstimator, TransformerMixin):
"""Binary indicators for missing values.

Note that this component typically should not be used in a vanilla
pipeline consisting of transformers and a classifier, but rather could be
added using a FeatureUnion.
Member:

Should use a :class: reference

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import FeatureUnion, make_pipeline
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = fetch_openml('audiology', 1, return_X_y=True)
Member:

I would prefer an example that doesn't require an internet connection + download

Member:

I guess we don't have any built-in datasets with missing values? Should we add one? Should we have a built-in titanic or something? Having one built-in dataset with mixed types and missing values might be nice.

Member:

I'd be more comfortable just inserting missing values to illustrate the point. We usually don't worry about real world in user guide
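The suggestion above could look roughly like this; a hedged sketch (the affected columns and missing rate are arbitrary choices, not from the PR):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import MissingIndicator

X, _ = load_iris(return_X_y=True)

# Blank out ~10% of the entries in the first two columns to
# illustrate the point without any download.
rng = np.random.RandomState(0)
mask = rng.rand(X.shape[0], 2) < 0.1
X[:, :2][mask] = np.nan

# One boolean indicator column per input feature.
indicator = MissingIndicator(features='all')
X_missing = indicator.fit_transform(X)
print(X_missing.shape)  # (150, 4)
```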

@janvanrijn (Contributor, author):

I would prefer an example that doesn't require an internet connection + download

You are the boss, but just pointing out that the sklearn example gallery contains many fetch_openml calls already

@amueller (Member):

This is not the gallery ;)

@janvanrijn (Contributor, author):

Fair enough, I will add titanic?

@amueller (Member):

You mean add titanic to sklearn? That's a bit of a bigger discussion. I guess I'm ok to merge this now and we could fix it later...

>>> results.shape
(100, 8)

Of course, we can not use the transformer to make any predictions. We should
Member:

can not -> cannot

@@ -120,3 +120,48 @@ whether or not they contain missing values::
[False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using it in a pipeline, be sure to use the :class:`FeatureUnion` to add
Member:

The referent of it here is unclear. Replace it with MissingIndicator

>>> transformer = FeatureUnion(
... transformer_list=[
... ('features', SimpleImputer(strategy='mean')),
... ('indicaters', MissingIndicator(features='all'))])
Member:

indicaters -> indicators

>>> clf = make_pipeline(transformer, DecisionTreeClassifier())

Note that the `iris` dataset has 4 features. By applying the
`features='all'` function, we ensure that all columns obtain a indicator
Member:

I'm not really sure it's necessary to illustrate this. In a purely predictive context, the extra 0-variance features are useless and may lead to warnings.

Member:

I agree

Contributor (author):

It is already gone, right?

Member:

right, sorry.

(100, 8)

Of course, we can not use the transformer to make any predictions. We should
wrap this in a :class:`Pipeline` with a classifier (e.g., a
Member:

This seems out of order; shouldn't make_pipeline be down here?

Contributor (author):

Done.

@amueller (Member) left a comment:

lgtm

@jnothman (Member) left a comment:

Otherwise LGTM

@@ -412,11 +412,18 @@ def transform(self, X):
class MissingIndicator(BaseEstimator, TransformerMixin):
"""Binary indicators for missing values.

Note that this component typically should not be used in a vanilla
:class:`Pipeline` consisting of transformers and a classifier, but rather
could be added using a :class:`FeatureUnion`.
Member:

Or ColumnTransformer?
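For reference, the ColumnTransformer alternative mentioned here could be sketched as follows (toy data and column indices are illustrative, not from the PR):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Apply the imputer and the indicator to the same columns;
# their outputs are concatenated side by side.
transformer = ColumnTransformer(transformers=[
    ('features', SimpleImputer(strategy='mean'), [0, 1]),
    ('indicators', MissingIndicator(features='all'), [0, 1])])
Xt = transformer.fit_transform(X)
print(Xt.shape)  # (3, 4): two imputed columns + two indicator columns
```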

@janvanrijn janvanrijn changed the title [MRG] improved documentation of MissingIndicator [MRG+2] improved documentation of MissingIndicator Nov 14, 2018
@@ -120,3 +120,44 @@ whether or not they contain missing values::
[False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using the :class:`MissingIndicator` in a :class:`Pipeline`, be sure to use
the :class:`FeatureUnion` to add the indicator features to the regular
Member:

FeatureUnion or ColumnTransformer?

>>> clf = clf.fit(X_train, y_train)
>>> results = clf.predict(X_test)
>>> results.shape
(100,)
Member:

Seems a bit awkward, though I'm not opposed to it.
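As a side note not part of this PR: later scikit-learn releases (0.21+) added an `add_indicator` parameter to SimpleImputer, which appends the MissingIndicator columns without needing a FeatureUnion. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# add_indicator=True appends one boolean indicator column for each
# feature that contained missing values during fit.
imp = SimpleImputer(strategy='mean', add_indicator=True)
Xt = imp.fit_transform(X)
print(Xt.shape)  # (3, 4): two imputed columns + two indicator columns
```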

@qinhanmin2014 (Member) left a comment:

thanks @janvanrijn

@qinhanmin2014 qinhanmin2014 merged commit c47c8a9 into scikit-learn:master Nov 14, 2018
@janvanrijn janvanrijn deleted the fix_#12417 branch November 14, 2018 16:34
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 20, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Successfully merging this pull request may close these issues.

MissingIndicator Documentation & Interface
4 participants