
ENH Change fit_transform of MissingIndicator class to get mask only once #14356

Merged
merged 22 commits into from Jul 27, 2019

Conversation

harsh020
Contributor

@harsh020 harsh020 commented Jul 14, 2019

Reference Issues/PRs

Fixes #14278

What does this implement/fix? Explain your changes.

The implementation prevents the function _get_missing_features_info from being called twice in fit_transform, and also adds a test to check that the results are correct.

- Declared a private instance variable (double leading underscore, i.e. name mangling) in the fit function to store the return values of _get_missing_features_info.
- Called fit from fit_transform.
- Used the private instance variable when implementing the transform step of fit_transform to retrieve the data.
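The storage-based approach described above can be sketched with a minimal stand-in class (hypothetical names and deliberately simplified bodies, not scikit-learn's actual implementation):

```python
class IndicatorSketch:
    def _get_missing_features_info(self, X):
        # Boolean mask of missing entries, plus indices of columns
        # that contain at least one missing value.
        mask = [[v is None for v in row] for row in X]
        features = [j for j in range(len(X[0])) if any(row[j] for row in mask)]
        return mask, features

    def fit(self, X, y=None):
        # Store the (mask, features) pair once; the double underscore
        # triggers name mangling (_IndicatorSketch__missing_features_info).
        self.__missing_features_info = self._get_missing_features_info(X)
        self.features_ = self.__missing_features_info[1]
        return self

    def fit_transform(self, X, y=None):
        # Reuse the info stored by fit instead of recomputing the mask.
        self.fit(X, y)
        mask, features = self.__missing_features_info
        return [[row[j] for j in features] for row in mask]


X = [[1, None], [3, 4]]
ind = IndicatorSketch()
print(ind.fit_transform(X))  # mask restricted to columns with missing values
```

Name mangling means the stored attribute actually lives under `_IndicatorSketch__missing_features_info`, which is what keeps it inaccessible under its plain name outside the class.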

Any other comments?

@harsh020 harsh020 closed this Jul 14, 2019
@harsh020 harsh020 deleted the missing_indicator branch Jul 14, 2019
@harsh020 harsh020 restored the missing_indicator branch Jul 14, 2019
@harsh020 harsh020 reopened this Jul 14, 2019
self._n_features = X.shape[1]

if self.features not in ('missing-only', 'all'):
raise ValueError("'features' has to be either 'missing-only' or "
Member

@amueller amueller Jul 14, 2019

Ideally you shouldn't duplicate the code. It's not covered by the tests, which is why you're seeing the code coverage failure.

Contributor Author

@harsh020 harsh020 Jul 14, 2019

@amueller Then what should I do? I need to do the exact same things, and I can't change the variable name (can I?). I can't call the fit function, because if I do I need to create a data member to store what the function returns, which raises more failures during checks because the tests don't have any information about it.
Any help?

Thanks.

Contributor Author

@harsh020 harsh020 Jul 14, 2019

@amueller Can I use mangling (a double leading underscore)?

@@ -1,7 +1,3 @@
# Authors: Nicolas Tresegnie <nicolas.tresegnie@gmail.com>
Member

@amueller amueller Jul 14, 2019

Why did you remove the author information?

Contributor Author

@harsh020 harsh020 Jul 14, 2019

I didn't do that intentionally; it's a mistake. Sorry, I'll restore it immediately.

@@ -612,7 +612,8 @@ def fit(self, X, y=None):
raise ValueError("'sparse' has to be a boolean or 'auto'. "
"Got {!r} instead.".format(self.sparse))

self.features_ = self._get_missing_features_info(X)[1]
self.__missing_features_info = self._get_missing_features_info(X)
Member

@amueller amueller Jul 14, 2019

I think we usually only use a single underscore, but otherwise this solution seems good to me.
Do we have a test that checks that fit(X).transform(X) == fit_transform(X)?

Contributor Author

@harsh020 harsh020 Jul 14, 2019

A single underscore declares a protected instance variable, which can be accessed outside the class, making the data transparent; hence I've used a double underscore, which can't be accessed outside the class and maintains opacity, as the user need not know that data: they already get it when using fit_transform.
Moreover, I think a single underscore could raise an issue in the PR checks, as the fit function returns self, which would include that protected member, but the tests don't know that fit returns it, so we would need to add that to the tests. (I think this may happen; please correct me if I'm wrong.)
So, is it okay to use a private instance variable, or should I change it to protected?

Contributor Author

@harsh020 harsh020 Jul 14, 2019

Do we have a test that checks that fit(X).transform(X) == fit_transform(X)?

I don't think so. I'm sorry, I'm very new to testing, but I had a look in the test file and I don't think we have anything that checks it.
I'll try to learn more about testing ASAP and will try again.

Member

@amueller amueller Jul 14, 2019

Could you please add one? It should be quite simple.
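Such an equivalence test might look like the following sketch (not necessarily the exact test added in this PR), assuming numpy and scikit-learn's MissingIndicator are importable:

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

# fit(X).transform(X) and fit_transform(X) should produce identical masks
mask_fit_then_transform = MissingIndicator().fit(X).transform(X)
mask_fit_transform = MissingIndicator().fit_transform(X)

np.testing.assert_array_equal(mask_fit_then_transform, mask_fit_transform)
```

With the default `features='missing-only'`, only the columns that contain missing values appear in the mask, so both paths must also agree on `features_`.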

Contributor Author

@harsh020 harsh020 Jul 14, 2019

Should I add it to the pre-existing test function test_missing_indicator_new?

Contributor Author

@harsh020 harsh020 Jul 14, 2019

Okay, I added a test and I'm pushing it. Please let me know if any changes need to be made.

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 15, 2019

@amueller I added a test; please have a look and let me know if I need to make any changes.

@thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 15, 2019

Storing missing_features_info, which includes the entire mask, seems a little heavy for the MissingIndicator. An alternative is to move all of fit into a private method _fit and have _fit return missing_features_info.

  1. fit will call _fit and return self.

  2. fit_transform will call _fit and use the returned missing_features_info to perform the transform.
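The two steps above can be sketched with a minimal stand-in class (simplified bodies and a toy mask computation, not scikit-learn's actual code):

```python
class ToyMissingIndicator:
    def _get_missing_features_info(self, X):
        # (imputer_mask, features): boolean mask of missing entries and
        # indices of columns that contain any missing value.
        mask = [[v is None for v in row] for row in X]
        features = [j for j in range(len(X[0])) if any(row[j] for row in mask)]
        return mask, features

    def _fit(self, X, y=None):
        # Does the actual fitting, then returns the info so callers can
        # reuse it without recomputing the mask. Nothing extra is stored.
        missing_features_info = self._get_missing_features_info(X)
        self.features_ = missing_features_info[1]
        return missing_features_info

    def fit(self, X, y=None):
        self._fit(X, y)
        return self

    def fit_transform(self, X, y=None):
        # The mask is computed exactly once here.
        imputer_mask, _ = self._fit(X, y)
        return [[row[j] for j in self.features_] for row in imputer_mask]
```

Because _fit both sets features_ and returns the info, fit_transform gets the mask without a second pass over X, while fit stays a thin wrapper that returns self.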

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 15, 2019

@thomasjpfan I'm a bit confused, please help me out here.
So, what you are saying is that I should make a protected function _fit, move the entire fit into it, and store and return missing_features_info. Then I should call _fit from fit and return self, and also call _fit from fit_transform to get that stored data.
How is that different from my approach? I'll still be storing the return values in missing_features_info.

@thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 15, 2019

_fit will not store missing_features_info; it will only return it.

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 15, 2019

Okay, I see where you are going. Amazing, that's a nice idea! Thanks!

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 15, 2019

@thomasjpfan Just out of curiosity: if I had stored the return values in a local variable in the _fit function, would it (the local variable holding the return values) have existed only while _fit was being called, or until the MissingIndicator object was destroyed?
And second, why do we use a protected function (with a single _) rather than a private function (with a double _)?
And yes, I did push the changes.

@thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 15, 2019

would it (the local variable having the return values) have had the lifespan only when we called the _fit function

It will only exist in _fit

And second thing, why do we use protected function (with a single _) rather than a private function (with a double _)?

In scikit-learn we use a single _ to mean "non-public". Currently, we do not use the double underscore.
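The practical difference shows up in Python's name mangling, illustrated here with a hypothetical class (for demonstration only, not scikit-learn code):

```python
class Estimator:
    def fit(self):
        self._info = "non-public by convention"
        # A double leading underscore is name-mangled at class-body compile
        # time: this attribute is actually stored as _Estimator__cache.
        self.__cache = "name-mangled"
        return self

est = Estimator().fit()
print(est._info)              # single underscore: accessible, convention only
print(est._Estimator__cache)  # the name Python actually stores
# est.__cache would raise AttributeError outside the class body
```

Since the single underscore already signals "non-public" to users, scikit-learn avoids the extra indirection that mangling introduces.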

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 16, 2019

@thomasjpfan Thanks for clearing that up. I made the necessary changes and pushed them; can you please have a look and let me know? I was also asked to add a test for the same, so can you please have a look at that too (because I'm very new to tests)?
If everything seems fine, whom should I request to merge the branch?

Thanks.

@@ -612,7 +612,23 @@ def fit(self, X, y=None):
raise ValueError("'sparse' has to be a boolean or 'auto'. "
"Got {!r} instead.".format(self.sparse))

self.features_ = self._get_missing_features_info(X)[1]
return self._get_missing_features_info(X)
Member

@thomasjpfan thomasjpfan Jul 16, 2019

So that _fit actually fits and learns something:

missing_features_info = self._get_missing_features_info(X)
self.features_ = missing_features_info[1]
return missing_features_info

self : object
Returns self.
"""
self.features_ = self._fit(X, y)[1]
Member

@thomasjpfan thomasjpfan Jul 16, 2019

With the change suggested in _fit this can be:

Suggested change
self.features_ = self._fit(X, y)[1]
self._fit(X, y)

@@ -667,7 +683,14 @@ def fit_transform(self, X, y=None):
will be boolean.
"""
return self.fit(X, y).transform(X)
imputer_mask, features = self._fit(X, y)
self.features_ = features
Member

@thomasjpfan thomasjpfan Jul 16, 2019

self.features_ = features can be removed with the suggested change in _fit.

@thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 16, 2019

If everything seems fine, whom should I request to merge the branch?

After I approve this PR, we need to wait for another core dev to review the PR before it is merged.

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 16, 2019

@thomasjpfan Yup, I also thought of doing that, but since you asked me not to store missing_features_info, I thought you didn't want me to store it in a local variable either, so I didn't. No problem, I'll patch it up.
Thanks.
Thanks.

@jnothman jnothman changed the title Change fit_transform of MissingInformation class to prevent it from c… ENH Change fit_transform of MissingIndicator class to get mask only once Jul 25, 2019
@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 26, 2019

@jnothman Sorry, I am creating a mess here. Please help me out. As I was getting a merge conflict, by mistake I merged my missing_indicator branch into the upstream/master branch; to correct it, I merged the upstream/master changes into my local master, reset my missing_indicator head, and force-pushed it. But I'm still getting merge conflicts.
Should I delete the branch, make the necessary changes, and then push it?
I know I'm creating a huge mess (my first PR), sorry. Please help me out.

Thanks.

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 26, 2019

@jnothman I think I reverted it: I restored my local repo to where I hadn't merged (commit ffe3a5a), made the necessary changes (removed the unnecessary condition, removed tests, and added the efficiency entry), and pushed it; I think everything is back to where the merge conflict arose. Should I resolve the conflict by keeping both the logs, or by removing the log which was merged into master?

Thanks.

@thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 27, 2019

Should I resolve the conflict by keeping both the logs, or by removing the log which was merged into master?

We prefer if you do not do force pushes to remove commits.

return self.fit(X, y).transform(X)
imputer_mask = self._fit(X, y)

if (self.features_.size < self._n_features):
Member

@thomasjpfan thomasjpfan Jul 27, 2019

Suggested change
if (self.features_.size < self._n_features):
if self.features_.size < self._n_features:

@jnothman
Member

@jnothman jnothman commented Jul 27, 2019

Well done for fixing it up though

@@ -598,7 +598,7 @@ def fit(self, X, y=None):
Returns
-------
self : object
Member

@jnothman jnothman Jul 27, 2019

This is no longer accurate

Contributor Author

@harsh020 harsh020 Jul 27, 2019

Do you mean the return docstring of _fit or of fit?

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 27, 2019

We prefer if you do not do force pushes to remove commits.

@thomasjpfan Sorry, I just got scared of messing things up. It won't happen again.

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 27, 2019

Well done for fixing it up though

@jnothman Thanks! Please have a look and let me know if I did it correctly.
(I revisited my course and reviewed how to resolve merge conflicts.)

  1. Pulled upstream/master to my local master.
  2. Switched to missing_indicator branch and merged master to it.
  3. Resolved merge conflicts.
  4. Pushed missing_indicator branch to origin.
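The four steps above correspond roughly to the following git commands (a sketch using the branch and remote names from this PR; `<resolved-files>` is a placeholder):

```shell
# 1. Update the local master from upstream
git checkout master
git pull upstream master

# 2. Merge master into the feature branch
git checkout missing_indicator
git merge master

# 3. Resolve conflicts in the reported files, then mark them resolved
git add <resolved-files>
git commit          # completes the merge

# 4. Push the feature branch to the fork
git push origin missing_indicator
```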

This is no longer accurate

Are you talking about the return docstring of _fit? If yes, I fixed it. If not, sorry, I don't follow; please tell me what is not accurate. The docstring of fit?

Thanks.

Returns self.
Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
The missing indicator for input data. The data type of ``Xt``
will be boolean.
Member

@thomasjpfan thomasjpfan Jul 27, 2019

You can reuse the docstring from _get_missing_features_info, since _fit returns one of the outputs of _get_missing_features_info:

        imputer_mask : {ndarray or sparse matrix}, shape (n_samples, n_features_with_missing)
            The imputer mask of the original data.

Contributor Author

@harsh020 harsh020 Jul 27, 2019

Sure. Just one thing: since _fit returns the full imputer_mask rather than only the features with missing values (as is checked in fit_transform), can I change the shape to just (n_samples, n_features)?

Thanks.

Member

@thomasjpfan thomasjpfan Jul 27, 2019

Yup I agree with just (n_samples, n_features).

@thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 27, 2019

Sorry, just got scared for messing things up. Won't happen again.

@harsh020 It is okay! Thank you for working on this PR! 😄

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 27, 2019

It is okay! Thank you for working on this PR! 😄

The pleasure was all mine. Thank you for helping me all along, as without you guys it would have been impossible for me to work on this issue.

Thanks.

@@ -597,8 +597,10 @@ def fit(self, X, y=None):
Returns
-------
self : object
Returns self.
imputer_mask : {ndarray or sparse matrix},
Member

@thomasjpfan thomasjpfan Jul 27, 2019

        imputer_mask : {ndarray or sparse matrix}, shape (n_samples, \
        n_features)
            The imputer mask of the original data.

Contributor Author

@harsh020 harsh020 Jul 27, 2019

Is something wrong?

Member

@thomasjpfan thomasjpfan Jul 27, 2019

The spacing. This is only for consistency. This docstring is not going to be rendered since _fit is private.

Notice the use of \ and how n_features) is aligned and how many white spaces there are before The imputer....

Contributor Author

@harsh020 harsh020 Jul 27, 2019

Done.

Thanks.

@jnothman jnothman merged commit 7b8b986 into scikit-learn:master Jul 27, 2019
17 checks passed
@jnothman
Member

@jnothman jnothman commented Jul 27, 2019

Thank you @harsh020!

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 28, 2019

Thank you @harsh020!

@jnothman The pleasure was all mine. Thanks to you guys; it would not have been possible without you.

@harsh020
Contributor Author

@harsh020 harsh020 commented Jul 28, 2019

@jnothman Can I delete my branch?

@jnothman
Member

@jnothman jnothman commented Jul 28, 2019

@harsh020 harsh020 deleted the missing_indicator branch Jul 29, 2019
Member

@jnothman jnothman left a comment

Please open a new PR fixing this up, @harsh020. Sorry to have missed it.

@@ -107,6 +107,11 @@ Changelog
preserve the class balance of the original training set. :pr:`14194`
by :user:`Johann Faouzi <johannfaouzi>`.

- |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
Member

@jnothman jnothman Sep 18, 2019

I've just realised this is in completely the wrong section.

Contributor Author

@harsh020 harsh020 Sep 19, 2019

Oh, my bad. Really sorry for the blunder. I'm immediately opening a new PR and fixing it up.

@@ -107,6 +107,11 @@ Changelog
preserve the class balance of the original training set. :pr:`14194`
by :user:`Johann Faouzi <johannfaouzi>`.

- |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
_get_missing_features_info function is now called once when calling
Member

@jnothman jnothman Sep 18, 2019

We wouldn't usually mention private methods in a change log. Rather, this should just say "avoid repeated computation"

- |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
_get_missing_features_info function is now called once when calling
fit_transform for MissingIndicator class. :pr:`14356` by :user:
`Harsh Soni <harsh020>`.
Member

@jnothman jnothman Sep 18, 2019

This doesn't render correctly with space before it. It must be right after the :

Contributor Author

@harsh020 harsh020 Sep 19, 2019

@jnothman I did open a PR, kindly review it, #15027
