[MRG+2] improved documentation of MissingIndicator #12424
Conversation
doc/modules/impute.rst
Outdated
>>> transformer = sklearn.pipeline.FeatureUnion(
>>>     transformer_list=[
>>>         ('vanilla_features',
>>>          sklearn.impute.SimpleImputer(strategy='constant',
Is there a reason you're not using mean?
sklearn/impute.py
Outdated
@@ -410,13 +410,18 @@ def transform(self, X):

class MissingIndicator(BaseEstimator, TransformerMixin):
    """Binary indicators for missing values.
    """Binary indicators for missing values. Note that this component typically
https://www.python.org/dev/peps/pep-0257/
single-line description, then blank line, then longer description
doc/modules/impute.rst
Outdated
>>> import sklearn.pipeline
>>> import sklearn.tree
>>> X, y = sklearn.datasets.fetch_openml('anneal', 1, return_X_y=True)
>>> X_train, X_test, y_train, _ = sklearn.model_selection.train_test_split(X, y, test_size=100, random_state=0)
Please split this line so that lines are no longer than 80 chars
Using from imports will make it shorter
doc/modules/impute.rst
Outdated
...     sklearn.tree.DecisionTreeClassifier())

Note that the anneal dataset has 38 columns. By applying the `features='all'`
I don't think this section provides very much value.
My motivation was that it adds an explanation why we can expect the number of columns that we see (38*2=76). I can remove this of course, if you think this is common knowledge. WDYT?
Note that I changed the dataset to audiology. anneal has columns that are completely empty, which are silently dropped by SimpleImputer. This would make the description a bit more complicated.
sklearn/impute.py
Outdated
@@ -412,11 +412,18 @@ def transform(self, X):

class MissingIndicator(BaseEstimator, TransformerMixin):
    """Binary indicators for missing values.

    Note that this component typically should not not be used in a vanilla
    pipeline consisting of transformers and a classifier, but rather could be
    added using a FeatureUnion.
Should use a :class: reference
doc/modules/impute.rst
Outdated
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import FeatureUnion, make_pipeline
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = fetch_openml('audiology', 1, return_X_y=True)
I would prefer an example that doesn't require an internet connection + download
I guess we don't have any built-in datasets with missing values? Should we add one? Should we have a built-in titanic or something? Having one built-in dataset with mixed types and missing values might be nice.
I'd be more comfortable just inserting missing values to illustrate the point. We usually don't worry about the real world in the user guide.
You are the boss, but just pointing out that the sklearn example gallery contains many fetch_openml calls already
This is not the gallery ;)
Fair enough, I will add
you mean add titanic to sklearn? That's a bit of a bigger discussion. I guess I'm ok to merge this now and we could fix it later...
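A minimal sketch of the reviewer's suggestion: rather than downloading a dataset over the network, insert missing values into a built-in one to illustrate the point. The use of `load_iris` and a 10% missingness rate here are illustrative assumptions, not what the PR finally adopted.

```python
# Sketch: fabricate missing values in a built-in dataset so the
# user-guide example needs no internet connection.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
# Illustrative choice: knock out roughly 10% of the entries at random.
mask = rng.rand(*X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan
print(X_missing.shape)  # (150, 4), now containing NaNs
```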
doc/modules/impute.rst
Outdated
>>> results.shape
(100, 8)

Of course, we can not use the transformer to make any predictions. We should
can not -> cannot
doc/modules/impute.rst
Outdated
@@ -120,3 +120,48 @@ whether or not they contain missing values::
       [False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using it in a pipeline, be sure to use the :class:`FeatureUnion` to add
The referent of it here is unclear. Replace it with MissingIndicator
doc/modules/impute.rst
Outdated
>>> transformer = FeatureUnion(
...     transformer_list=[
...         ('features', SimpleImputer(strategy='mean')),
...         ('indicaters', MissingIndicator(features='all'))])
indicaters -> indicators
doc/modules/impute.rst
Outdated
>>> clf = make_pipeline(transformer, DecisionTreeClassifier())

Note that the `iris` dataset has 4 features. By applying the
`features='all'` function, we ensure that all columns obtain a indicator
I'm not really sure it's necessary to illustrate this. In a purely predictive context, the extra 0-variance features are useless and may lead to warnings.
I agree
It is already gone, right?
right, sorry.
(100, 8)

Of course, we can not use the transformer to make any predictions. We should
wrap this in a :class:`Pipeline` with a classifier (e.g., a
This seems out of order; shouldn't make_pipeline be down here?
Done.
lgtm
Otherwise LGTM
sklearn/impute.py
Outdated
@@ -412,11 +412,18 @@ def transform(self, X):

class MissingIndicator(BaseEstimator, TransformerMixin):
    """Binary indicators for missing values.

    Note that this component typically should not not be used in a vanilla
    :class:`Pipeline` consisting of transformers and a classifier, but rather
    could be added using a :class:`FeatureUnion`.
Or ColumnTransformer?
doc/modules/impute.rst
Outdated
@@ -120,3 +120,44 @@ whether or not they contain missing values::
       [False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using the :class:`MissingIndicator` in a :class:`Pipeline`, be sure to use
the :class:`FeatureUnion` to add the indicator features to the regular
FeatureUnion or ColumnTransformer?
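On the FeatureUnion-or-ColumnTransformer question: `FeatureUnion` applies each transformer to the full input and concatenates the outputs side by side, which is exactly the stacking the doc describes. A minimal sketch of that pattern, on made-up data:

```python
# Sketch of the documented pattern: imputed values and missing-value
# indicators concatenated into one feature matrix via FeatureUnion.
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import FeatureUnion

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [4.0, 5.0]])
transformer = FeatureUnion(transformer_list=[
    ('features', SimpleImputer(strategy='mean')),
    ('indicators', MissingIndicator(features='all'))])
Xt = transformer.fit_transform(X)
# 2 imputed columns + 2 indicator columns -> (3, 4)
print(Xt.shape)  # (3, 4)
```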
>>> clf = clf.fit(X_train, y_train)
>>> results = clf.predict(X_test)
>>> results.shape
(100,)
Seems a bit awkward, though I'm not opposed to it.
thanks @janvanrijn
Reference Issues/PRs
Fixes #12417
What does this implement/fix? Explain your changes.
Personally, I found the documentation about the MissingIndicator a bit misleading, and I tried to improve it a little. I also tried to contribute an example to the user guide showing how to use it in classification pipelines.
Any other comments?
Nope
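For reference, an end-to-end sketch of the kind of pipeline this PR documents: imputation plus missing-value indicators feeding a classifier. The dataset choice (iris with artificially inserted NaNs) and the split sizes are illustrative assumptions, not necessarily what the merged doc uses.

```python
# Sketch: a classification pipeline combining SimpleImputer and
# MissingIndicator via FeatureUnion, trained on a dataset with
# fabricated missing values (illustrative assumption).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan  # insert ~10% missing values

X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=100, random_state=0)
transformer = FeatureUnion(transformer_list=[
    ('features', SimpleImputer(strategy='mean')),
    ('indicators', MissingIndicator(features='all'))])
clf = make_pipeline(transformer, DecisionTreeClassifier(random_state=0))
clf.fit(X_train, y_train)
print(clf.predict(X_test).shape)  # (100,)
```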