
[MRG+1] MissingIndicator transformer #8075

Merged
57 commits merged into scikit-learn:master on Jul 16, 2018

Conversation


@maniteja123 maniteja123 commented Dec 18, 2016

MissingIndicator transformer for the missing values indicator mask.
see #6556

What does this implement/fix? Explain your changes.

The current implementation returns an indicator mask for the missing values.

Any other comments?

This is a very initial attempt and currently no tests are present. Please do have a look and give suggestions on the design. Thanks!

  • Implementation
  • Documentation
  • Tests
@maniteja123 maniteja123 force-pushed the maniteja123:imputer_missing_values branch from 7fcdbf4 to 678f2c3 Dec 18, 2016

@tguillemot tguillemot commented Dec 19, 2016

@jnothman @raghavrv Sorry for my stupid question, but what is the difference from #7084?


@maniteja123 maniteja123 commented Dec 19, 2016

Hi, I am not entirely aware of the ValueDropper, but that seems to be used to drop a specific percentage of values per feature for the corresponding classes. OTOH, this MissingIndicator (name not yet finalized) will just indicate the presence of missing values in the input, and the transformed output is just a binary matrix. Hope I got it right!
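This discussion predates the final API, but the behaviour described here can be sketched with the eventually released `sklearn.impute.MissingIndicator` (where the "train" option discussed below became `features='missing-only'`); a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[np.nan, 1.0, 3.0],
              [4.0, 0.0, np.nan],
              [8.0, 1.0, 0.0]])

# By default the mask covers only the features that had missing
# values during fit (here columns 0 and 2)
indicator = MissingIndicator(missing_values=np.nan)
mask = indicator.fit_transform(X)
```

The transformed output is just a binary matrix, with one column per tracked feature.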


@tguillemot tguillemot commented Dec 19, 2016

OK, thanks @maniteja123. Sorry for the stupid question.


@maniteja123 maniteja123 commented Dec 19, 2016

No problem @tguillemot. No question is stupid. Thanks for asking. :) Probably I should add a small example to make the use case clearer.


@tguillemot tguillemot commented Dec 19, 2016

Indeed, it's always useful.


@jnothman jnothman commented Dec 20, 2016

#7084 makes missing values. This helps solve missing value problems.


@jnothman jnothman left a comment

Add some tests please. I'll look at the implementation soon.

`missing_values` will be imputed. For missing values encoded as np.nan,
use the string value "NaN".
axis : integer, optional (default=0)

jnothman (Member) commented Dec 20, 2016:

I don't think we need this.

maniteja123 (Author, Contributor) commented Dec 20, 2016:

I understand the missing indicator is independent of axis. So missing_indicators='train' would mean indicating only the features that had missing values at fit time?


- If `axis=0`, then impute along columns.
- If `axis=1`, then impute along rows.
missing_indicators : [None, "all", "train", indices/mask]

jnothman (Member) commented Dec 20, 2016:

I think features would be an adequate name, or which_features or something.

If "train"
If array
copy : boolean, optional (default=True)

jnothman (Member) commented Dec 20, 2016:

not applicable?

Attributes
----------
feat_with_missing_ : array of shape(n_missing_features, )
The features with missing values.

jnothman (Member) commented Dec 20, 2016:

"in fit"

jnothman (Member) commented Dec 21, 2016:

Reduce indentation.

Note that this is only stored if features == 'train'


@jnothman jnothman left a comment

Thanks. This is heading the right direction.

Once it's looking good, we'll talk about integrating it into Imputer, and adding a summary feature which indicates the presence of any missing values in a row.
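The "summary feature" mentioned above (one flag per row indicating the presence of any missing value) can be sketched in plain NumPy; this is an illustration with made-up data, not code from the PR:

```python
import numpy as np

X = np.array([[np.nan, 1.0],
              [2.0, 3.0],
              [4.0, np.nan]])

# One boolean per sample: does the row contain any missing value?
row_has_missing = np.isnan(X).any(axis=1)
```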


feat_with_missing = mask_matrix.sum(axis=0).nonzero()[1]
# ravel since nonzero returns 2d matrices for sparse in scipy 0.11
feat_with_missing = np.asarray(feat_with_missing).ravel()

jnothman (Member) commented Dec 21, 2016:

I'm pretty sure np.ravel(feat_with_missing) will suffice.


if self.features == "train":
features = np.setdiff1d(self.feat_with_missing_,
feat_with_missing)

jnothman (Member) commented Dec 21, 2016:

indent to match previous line's self

features = np.setdiff1d(self.feat_with_missing_,
feat_with_missing)
if features:
warnings.warn("The features %s have missing "

jnothman (Member) commented Dec 21, 2016:

I don't think this is a case we should warn about. The opposite (missing values in transform but none in fit), perhaps.

elif self.features == "all":
imputer_mask = imputer_mask

elif isinstance(self.features, (np.ndarray, list, tuple)):

jnothman (Member) commented Dec 21, 2016:

tuples and lists are treated differently by numpy and I'm not sure we should support tuples.

X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)

jnothman (Member) commented Dec 21, 2016:

remove blanks

self : object
Returns self.
"""
if (isinstance(self.features, six.string_types) and

jnothman (Member) commented Dec 21, 2016:

Should probably also validate that:

  • features, if an array-like, is an integer array
  • sparse has a valid value
`missing_values` will be imputed. For missing values encoded as np.nan,
use the string value "NaN".
features : [None, "all", "train", array(indices/mask)]

jnothman (Member) commented Dec 21, 2016:

{'train' (default), 'all', array-like of int}

MI = clone(MI).set_params(features = "train")
MI.fit(X1)
X2_tr = MI.transform(X2)
features = MI.feat_with_missing_

jnothman (Member) commented Dec 21, 2016:

Need to check that this is correctly computed.

[0, -1, 5, -1],
[11, -1, 1, 1]
])

jnothman (Member) commented Dec 21, 2016:

mask = X2 == -1  # can be done before loop
for X1, X2, missing_values in [(X1_orig, X2_orig, -1), (X1_orig + 1, X2_orig + 1, 0)]:
    for retype in [np.array, lambda x: x.tolist(), sparse.csr_matrix, sparse.csc_matrix, sparse.lil_matrix]:

maniteja123 (Author, Contributor) commented Dec 22, 2016:

Could you explain what retype means? I guess it is the return type, but does it mean it should call MI.fit(retype(X1)) and MI.transform(retype(X2))?

MI.fit(X1)
X2_tr = MI.transform(X2)
mask = X2[:, features] == -1
assert_array_equal(X2_tr, mask)

jnothman (Member) commented Dec 21, 2016:

Also need to assert the warning case, and assert that validation error messages are produced correctly.


@jnothman jnothman commented Dec 21, 2016

Also, it might be a good idea to add something to the narrative documentation at this point, explaining the motivation for such features, and briefly describing the operation.

It would be good to add an example here in the docstring too.

Perhaps you should add a task list to the PR description


@maniteja123 maniteja123 commented Dec 22, 2016

Hi @jnothman , thanks a lot for the detailed review. I have tried to address most of them. It would be great if you could explain the meaning of retype in your comment here.

Also if sparse = True should the transform always return sparse irrespective of input and if so, which sparse type ?

Thanks.


@jnothman jnothman commented Dec 22, 2016

retype was a bad name for "the type you want to change it to"


@jnothman jnothman commented Dec 22, 2016

Whichever sparse type is most appropriate here. COO? CSC?

X2_tr = MI.transform(X2)
features = MI.feat_with_missing_
assert_array_equal(expect_feat_missing, features)
assert_array_equal(np.asarray(X2_tr), mask[:, features])

maniteja123 (Author, Contributor) commented Dec 22, 2016:

Sorry Joel for the many questions, but I am confused here. Suppose I take the case where X1 and X2 are sparse or lists; how do I check for equality between X2_tr and mask[:, features]?

jnothman (Member) commented Dec 23, 2016:

X2_tr should be an array or sparse matrix regardless of X2, no?

maniteja123 (Author, Contributor) commented Dec 23, 2016:

Yeah, sorry, it is an array or sparse matrix. I was asking about equality between the sparse matrices, because X2_tr will be sparse while mask is an array. Is it okay to densify and use assert_array_equal?

jnothman (Member) commented Dec 24, 2016:

In a test, yes.
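Densifying in a test, as agreed here, can look like this (hypothetical values standing in for the transformer's sparse output):

```python
import numpy as np
from scipy import sparse

mask = np.array([[False, True],
                 [True, False]])
X2_tr = sparse.csc_matrix(mask)  # stand-in for a sparse transform output

# toarray() densifies the sparse result so an elementwise
# comparison against the dense mask is valid
np.testing.assert_array_equal(X2_tr.toarray(), mask)
```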


@maniteja123 maniteja123 commented Dec 22, 2016

I don't know much about sparse arrays, but I was thinking COO might be a good choice since there might be fewer missing values in general. Please do correct me if I am thinking about it the wrong way.


@jnothman jnothman commented Dec 23, 2016

I don't get your purported motivation for COO. If you construct the matrix in a column-wise fashion, CSC might be the best choice.


@maniteja123 maniteja123 commented Dec 23, 2016

Oh I see, but the matrix here is being constructed using _get_mask, which is not column-wise, right?

Also, if we return CSC, do the indices and indptr need to be calculated accordingly?


@jnothman jnothman commented Dec 24, 2016

For auto, return it in the format it came in. Otherwise, return CSC. You don't need to calculate indices and indptr manually. Calling csc_matrix (or .tocsc() if already sparse) on it will suffice.
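A sketch of that suggestion, reusing the input's sparsity structure for the mask and then converting with .tocsc() (illustrative values, not the PR's exact code):

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0.0, -1.0, 2.0],
                                [3.0, 0.0, -1.0]]))

# Build the mask over the stored values only, reusing the
# indices/indptr of the input, then convert to CSC
mask = X.__class__((X.data == -1, X.indices.copy(), X.indptr.copy()),
                   shape=X.shape)
mask_csc = mask.tocsc()
```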


@maniteja123 maniteja123 commented Dec 28, 2016

Thanks @jnothman, just one more clarification. In the case of a sparse matrix and missing_values=0, a dense matrix is currently returned. Should it be the same even when sparse='auto'?


@jnothman jnothman commented Dec 28, 2016


@maniteja123 maniteja123 commented Dec 28, 2016

Thanks, but the lil format wouldn't contain indices, right? So this line fails for it.

Also, should the return type be a numpy array when a list is passed to transform? Sorry for so many questions, but the tests are failing for these scenarios.


@jnothman jnothman left a comment

Yes, return type should be a numpy array, except if it should be a sparse matrix, in which case the returned format should be specified in the docstring, but need not be the same as the input type. check_array will convert to acceptable types.

if sparse.issparse(X) and self.missing_values != 0:
# sparse matrix and missing values is not zero
imputer_mask = _get_mask(X.data, self.missing_values)
imputer_mask = X.__class__((imputer_mask, X.indices.copy(),
X.indptr.copy()), shape=X.shape,
dtype=X.dtype)

print 'here' + str(type(X)) + str(type(imputer_mask))

jnothman (Member) commented Jan 9, 2017:

debug print to be removed...?


@jnothman jnothman left a comment

I'll have a look over the transformation and tests once you're happy you know how to determine handling of transform output types etc.

Transformer indicating missing values
=====================================

MissingIndicator transformer is useful to transform a dataset into corresponding

jnothman (Member) commented Jan 9, 2017:

use

:class:`MissingIndicator`
MissingIndicator(features='train', missing_values=-1, sparse='auto')
>>> X2_tr = MI.transform(X2)
>>> X2_tr
array([[False, False, True],

jnothman (Member) commented Jan 9, 2017:

I think we should probably be returning ints, not bools.

X : {array-like, sparse matrix}, shape (n_samples, n_features)
Input data, where ``n_samples`` is the number of samples and
``n_features`` is the number of features.
Returns

jnothman (Member) commented Jan 9, 2017:

blank line before this, please


def transform(self, X):
"""Impute all missing values in X.
Parameters

jnothman (Member) commented Jan 9, 2017:

blank line

----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
The input data to complete.
Returns

jnothman (Member) commented Jan 9, 2017:

blank line

The input data to complete.
Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]

jnothman (Member) commented Jan 9, 2017:

traditionally Xt

Returns
-------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
The transformer with the missing indicator.


@maniteja123 maniteja123 commented Jan 19, 2017

Non-zero missing values:

sparse \ input   ndarray      csr_matrix   csc_matrix   lil_matrix
True             csc_matrix   csc_matrix   csc_matrix   csc_matrix
False            ndarray      ndarray      ndarray      ndarray
auto             ndarray      csc_matrix   csc_matrix   csc_matrix

Zero missing values:

sparse \ input   ndarray      csr_matrix   csc_matrix   lil_matrix
True             csc_matrix   csc_matrix   csc_matrix   csc_matrix
False            ndarray      ndarray      ndarray      ndarray
auto             ndarray      ndarray      ndarray      ndarray

Hi @jnothman , sorry for the delay. Could you look at the above return types and let me know if it works ? Thanks.


@jnothman jnothman commented Jan 23, 2017

I think that's correct. But to be sure, we could write a test that checks that auto corresponds to the same type/format as output by imputer in that case.
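Against the released sklearn.impute.MissingIndicator, such a check of sparse='auto' (dense in, dense out; sparse in, sparse out) might read as follows; the data is made up for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.impute import MissingIndicator

X = np.array([[1.0, -1.0],
              [-1.0, 2.0]])
ind = MissingIndicator(missing_values=-1, sparse='auto')

dense_out = ind.fit_transform(X)                      # ndarray in -> ndarray out
sparse_out = ind.fit_transform(sparse.csc_matrix(X))  # sparse in -> sparse out
```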

@maniteja123 maniteja123 force-pushed the maniteja123:imputer_missing_values branch from 5357a1b to 41c2596 Feb 18, 2017

@raghavrv raghavrv commented Mar 4, 2017

@maniteja123 any updates? :)


@raghavrv raghavrv commented Mar 4, 2017

Does it need a review? Can you rename to MRG if so... I can look into it next week...

@raghavrv raghavrv added this to Would be very nice to have (MID) in Raghav's *personal* review-priority listing Mar 4, 2017

@maniteja123 maniteja123 commented Mar 4, 2017

Hi @raghavrv, thanks for the reminder. I believe it can be reviewed, although the tests could be more comprehensive regarding the return types for sparse and dense matrices. Will look at fixing the failing tests and then ping you.


@glemaitre glemaitre commented Jul 11, 2018

@jnothman @amueller any review on this feature?


@amueller amueller commented Jul 11, 2018

Can you maybe add this to plot_missing_values.py or something? It would be good to have an example that uses this. Ideally we'd package that into SimpleImputer, but for now we can make it explicit, and it should still be good for ChainedImputer.

Parameters
----------
missing_values : number, string, np.nan (default) or None

amueller (Member) commented Jul 11, 2018:

what does number mean and why is np.nan not a number? Maybe just move the np.nan to the end?

jeremiedbb (Member) commented Jul 12, 2018:

number means real number. It's just to fit this in one line.
I think by definition nan is not a number :)

amueller (Member) commented Jul 12, 2018:

but the dtype is also important, isn't it? I find "float or int" more natural than "number or np.nan" but I don't have a strong opinion.

jeremiedbb (Member) commented Jul 12, 2018:

I agree that "float or int" is better than number, but I think it's important to keep np.nan visible since it should be a common value for missing_values. Maybe something like
int, float, string or None (default=np.nan) ?

glemaitre (Contributor) commented Jul 12, 2018:

right now this is consistent with SimpleImputer and ChainedImputer in fact.



@glemaitre glemaitre commented Jul 12, 2018

@amueller I updated the example by doing a feature union of the output of the imputer and the missing indicator. This is probably the use case that each imputer should handle in the future...
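The updated example can be sketched as a FeatureUnion stacking the imputed features next to the indicator mask; this is a minimal sketch with made-up data, assuming the released SimpleImputer and MissingIndicator:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer, MissingIndicator

X = np.array([[np.nan, 1.0],
              [2.0, np.nan],
              [3.0, 4.0]])

# Imputed values and the missing-value mask, side by side
union = FeatureUnion([('impute', SimpleImputer(strategy='mean')),
                      ('indicate', MissingIndicator())])
Xt = union.fit_transform(X)  # 2 imputed columns + 2 indicator columns
```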


@jnothman jnothman left a comment

I think this LGTM. Thanks for completing it @glemaitre. Given recent issues, though, I wonder if we should make sure missing_values=pickle.loads(pickle.dumps(np.nan)) works also.
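The pickle concern can be exercised directly: a round-tripped np.nan is a new float object, and since NaN != NaN, the matching has to use an isnan-style check rather than equality. A quick sketch against the released API:

```python
import pickle
import numpy as np
from sklearn.impute import MissingIndicator

# A pickled-and-restored NaN is not the same object as np.nan,
# so detection must not rely on identity or == comparison
ind = MissingIndicator(missing_values=pickle.loads(pickle.dumps(np.nan)))
X = np.array([[np.nan, 1.0],
              [2.0, np.nan]])
mask = ind.fit_transform(X)
```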

error_on_new : boolean, optional
If True (default), transform will raise an error when there are
features with missing values in transform but have no missing values in

jnothman (Member) commented Jul 15, 2018:

but -> that

assert isinstance(X_fit_mask, np.ndarray)
assert isinstance(X_trans_mask, np.ndarray)
else:
if sparse.issparse(X_fit):

jnothman (Member) commented Jul 15, 2018:

Why is this not another elif?

glemaitre (Contributor) commented Jul 15, 2018:

Because it is true for all other param_sparse and missing_values combinations.

glemaitre added 2 commits Jul 15, 2018
…t-learn into maniteja123-imputer_missing_values

@glemaitre glemaitre commented Jul 15, 2018

I think this LGTM. Thanks for completing it @glemaitre. Given recent issues, though, I wonder if we should make sure missing_values=pickle.loads(pickle.dumps(np.nan)) works also.

A similar test is done in the common test now.


@amueller amueller commented Jul 15, 2018

merge on green?


@glemaitre glemaitre commented Jul 16, 2018

We need to merge #11391 first


@agramfort agramfort commented Jul 16, 2018

@maniteja123 can you rebase now that #11391 is merged ?


@glemaitre glemaitre commented Jul 16, 2018


@jnothman jnothman commented Jul 16, 2018

And will we be opening issues to add a add_indicator param to other imputers?


@glemaitre glemaitre commented Jul 16, 2018

And will we be opening issues to add a add_indicator param to other imputers?

Yes. Do we want to have it in 0.20 though?

@amueller amueller added the Blocker label Jul 16, 2018

@glemaitre glemaitre commented Jul 16, 2018

Tests passed on Travis.

@ogrisel ogrisel added this to Issues tagged in scikit-learn 0.20 Jul 16, 2018
@ogrisel ogrisel moved this from Issues tagged to PRs tagged in scikit-learn 0.20 Jul 16, 2018
@amueller amueller moved this from PRs tagged to Blockers in scikit-learn 0.20 Jul 16, 2018
@amueller amueller merged commit 726fa36 into scikit-learn:master Jul 16, 2018
4 of 5 checks passed
continuous-integration/appveyor/pr - Waiting for AppVeyor build to complete
ci/circleci: deploy - Your tests passed on CircleCI!
ci/circleci: python2 - Your tests passed on CircleCI!
ci/circleci: python3 - Your tests passed on CircleCI!
continuous-integration/travis-ci/pr - The Travis CI build passed
@lesteve lesteve moved this from Blockers to Done in scikit-learn 0.20 Jul 16, 2018
@amueller amueller removed this from PR phase in Andy's pets Aug 21, 2018

@amueller amueller commented Aug 21, 2018

Did someone open these issues? I don't see them linked.


@jnothman jnothman commented Aug 21, 2018

I don't think it's been opened. Atm we are only offering one other imputer. (Apparently knn imputer is still blocking on your review @amueller)
