[MRG+2] improved documentation of MissingIndicator #12424
Conversation
doc/modules/impute.rst
Outdated
>>> transformer = sklearn.pipeline.FeatureUnion(
>>>     transformer_list=[
>>>         ('vanilla_features',
>>>          sklearn.impute.SimpleImputer(strategy='constant',
Is there a reason you're not using mean?
sklearn/impute.py
Outdated
@@ -410,13 +410,18 @@ def transform(self, X):

class MissingIndicator(BaseEstimator, TransformerMixin):
    """Binary indicators for missing values.
    """Binary indicators for missing values. Note that this component typically
https://www.python.org/dev/peps/pep-0257/
single-line description, then blank line, then longer description
doc/modules/impute.rst
Outdated
>>> import sklearn.pipeline
>>> import sklearn.tree
>>> X, y = sklearn.datasets.fetch_openml('anneal', 1, return_X_y=True)
>>> X_train, X_test, y_train, _ = sklearn.model_selection.train_test_split(X, y, test_size=100, random_state=0)
Please split this line so that lines are no longer than 80 chars
Using from imports will make it shorter
doc/modules/impute.rst
Outdated
...     sklearn.tree.DecisionTreeClassifier())

Note that the anneal dataset has 38 columns. By applying the `features='all'`
I don't think this section provides very much value.
My motivation was that it adds an explanation why we can expect the number of columns that we see (38*2=76). I can remove this of course, if you think this is common knowledge. WDYT?
Note that I changed the dataset to audiology. anneal has columns that are completely empty, which are silently dropped by SimpleImputer. This would make the description a bit more complicated.
sklearn/impute.py
Outdated
@@ -412,11 +412,18 @@ def transform(self, X):

class MissingIndicator(BaseEstimator, TransformerMixin):
    """Binary indicators for missing values.

    Note that this component typically should not not be used in a vanilla
    pipeline consisting of transformers and a classifier, but rather could be
    added using a FeatureUnion.
Should use a :class: reference
doc/modules/impute.rst
Outdated
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import FeatureUnion, make_pipeline
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = fetch_openml('audiology', 1, return_X_y=True)
I would prefer an example that doesn't require an internet connection + download
I guess we don't have any built-in datasets with missing values? Should we add one? Should we have a built-in titanic or something? Having one built-in dataset with mixed types and missing values might be nice.
I'd be more comfortable just inserting missing values to illustrate the point. We usually don't worry about the real world in the user guide.
You are the boss, but just pointing out that the sklearn example gallery contains many fetch_openml calls already
This is not the gallery ;)
Fair enough, I will add
you mean add titanic to sklearn? That's a bit of a bigger discussion. I guess I'm ok to merge this now and we could fix it later...
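A minimal sketch of the reviewer's suggestion: rather than downloading a dataset over the network, insert missing values into a built-in one to illustrate the point. The use of `load_iris` and a 10% missingness rate here are illustrative assumptions, not what the PR finally adopted.

```python
# Sketch: fabricate missing values in a built-in dataset so the
# user-guide example needs no internet connection.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
# Illustrative choice: knock out roughly 10% of the entries at random.
mask = rng.rand(*X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan
print(X_missing.shape)  # (150, 4), now containing NaNs
```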
doc/modules/impute.rst
Outdated
>>> results.shape
(100, 8)

Of course, we can not use the transformer to make any predictions. We should
can not -> cannot
doc/modules/impute.rst
Outdated
@@ -120,3 +120,48 @@ whether or not they contain missing values::
       [False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using it in a pipeline, be sure to use the :class:`FeatureUnion` to add
The referent of it here is unclear. Replace it with MissingIndicator
doc/modules/impute.rst
Outdated
>>> transformer = FeatureUnion(
...     transformer_list=[
...         ('features', SimpleImputer(strategy='mean')),
...         ('indicaters', MissingIndicator(features='all'))])
indicaters -> indicators
doc/modules/impute.rst
Outdated
>>> clf = make_pipeline(transformer, DecisionTreeClassifier())

Note that the `iris` dataset has 4 features. By applying the
`features='all'` function, we ensure that all columns obtain a indicator
I'm not really sure it's necessary to illustrate this. In a purely predictive context, the extra 0-variance features are useless and may lead to warnings.
I agree
It is already gone, right?
right, sorry.
(100, 8)

Of course, we can not use the transformer to make any predictions. We should
wrap this in a :class:`Pipeline` with a classifier (e.g., a
This seems out of order; shouldn't make_pipeline be down here?
Done.
lgtm
Otherwise LGTM
sklearn/impute.py
Outdated
@@ -412,11 +412,18 @@ def transform(self, X):

class MissingIndicator(BaseEstimator, TransformerMixin):
    """Binary indicators for missing values.

    Note that this component typically should not not be used in a vanilla
    :class:`Pipeline` consisting of transformers and a classifier, but rather
    could be added using a :class:`FeatureUnion`.
Or ColumnTransformer?
doc/modules/impute.rst
Outdated
@@ -120,3 +120,44 @@ whether or not they contain missing values::
       [False, True, False, False]])
>>> indicator.features_
array([0, 1, 2, 3])

When using the :class:`MissingIndicator` in a :class:`Pipeline`, be sure to use
the :class:`FeatureUnion` to add the indicator features to the regular
FeatureUnion or ColumnTransformer?
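On the FeatureUnion-or-ColumnTransformer question: `FeatureUnion` applies each transformer to the full input and concatenates the outputs side by side, which is exactly the stacking the doc describes. A minimal sketch of that pattern, on made-up data:

```python
# Sketch of the documented pattern: imputed values and missing-value
# indicators concatenated into one feature matrix via FeatureUnion.
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import FeatureUnion

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [4.0, 5.0]])
transformer = FeatureUnion(transformer_list=[
    ('features', SimpleImputer(strategy='mean')),
    ('indicators', MissingIndicator(features='all'))])
Xt = transformer.fit_transform(X)
# 2 imputed columns + 2 indicator columns -> (3, 4)
print(Xt.shape)  # (3, 4)
```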
>>> clf = clf.fit(X_train, y_train)
>>> results = clf.predict(X_test)
>>> results.shape
(100,)
Seems a bit awkward, though I'm not opposed to it.
thanks @janvanrijn
Reference Issues/PRs
Fixes #12417
What does this implement/fix? Explain your changes.
Personally, I found the documentation about the MissingIndicator a bit misleading, and I tried to improve it a little. I also tried to contribute an example to the user guide showing how to use it in classification pipelines.
Any other comments?
Nope
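For reference, an end-to-end sketch of the kind of pipeline this PR documents: imputation plus missing-value indicators feeding a classifier. The dataset choice (iris with artificially inserted NaNs) and the split sizes are illustrative assumptions, not necessarily what the merged doc uses.

```python
# Sketch: a classification pipeline combining SimpleImputer and
# MissingIndicator via FeatureUnion, trained on a dataset with
# fabricated missing values (illustrative assumption).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.1] = np.nan  # insert ~10% missing values

X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=100, random_state=0)
transformer = FeatureUnion(transformer_list=[
    ('features', SimpleImputer(strategy='mean')),
    ('indicators', MissingIndicator(features='all'))])
clf = make_pipeline(transformer, DecisionTreeClassifier(random_state=0))
clf.fit(X_train, y_train)
print(clf.predict(X_test).shape)  # (100,)
```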