
ENH Change fit_transform of MissingIndicator class to get mask only once #14356

Merged
merged 22 commits into from Jul 27, 2019

Conversation

@harsh020
Contributor

harsh020 commented Jul 14, 2019

Reference Issues/PRs

Fixes #14278

What does this implement/fix? Explain your changes.

This change stops the function _get_missing_features_info from being called twice in fit_transform, and adds a test to check that the results are correct.

- Declare a private instance variable (double leading underscore, i.e. name-mangled) in fit to store the return values of _get_missing_features_info.
- Call fit from fit_transform.
- Use the private instance variable in fit_transform to retrieve the stored data when performing the transform.
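The steps above can be sketched roughly as follows. This is an illustrative toy only (it assumes NaN-marked missing values and dense arrays; names and validation differ from scikit-learn's actual MissingIndicator):

```python
import numpy as np

class IndicatorSketch:
    """Toy sketch of the approach described above; not scikit-learn's code."""

    def _get_missing_features_info(self, X):
        imputer_mask = np.isnan(X)                           # boolean mask of missing entries
        features = np.flatnonzero(imputer_mask.any(axis=0))  # columns with any missing value
        return imputer_mask, features

    def fit(self, X, y=None):
        # Store the result in a name-mangled attribute so fit_transform can reuse it.
        self.__missing_features_info = self._get_missing_features_info(X)
        self.features_ = self.__missing_features_info[1]
        return self

    def fit_transform(self, X, y=None):
        # Call fit, then reuse the stored mask instead of recomputing it.
        self.fit(X, y)
        imputer_mask, features = self.__missing_features_info
        return imputer_mask[:, features]
```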

Any other comments?

…alling _get_missing_features_info twice (#14278)
@harsh020 harsh020 closed this Jul 14, 2019
@harsh020 harsh020 deleted the harsh020:missing_indicator branch Jul 14, 2019
@harsh020 harsh020 restored the harsh020:missing_indicator branch Jul 14, 2019
Harsh Soni
@harsh020 harsh020 reopened this Jul 14, 2019
self._n_features = X.shape[1]

if self.features not in ('missing-only', 'all'):
raise ValueError("'features' has to be either 'missing-only' or "

@amueller

amueller Jul 14, 2019

Member

You shouldn't duplicate the code ideally. It's not covered by the tests which is why you're seeing the code coverage failure.

@harsh020

harsh020 Jul 14, 2019

Author Contributor

@amueller Then what should I do? I need to do exactly the same things, and I can't change the variable name (can I?). I can't call fit, because then I would need to create a data member to store what the function returns, which raises more failures during the checks, since the tests don't have any information about it.
Any help?

Thanks.

@harsh020

harsh020 Jul 14, 2019

Author Contributor

@amueller Can I use name mangling (a double leading underscore)?

@@ -1,7 +1,3 @@
# Authors: Nicolas Tresegnie <nicolas.tresegnie@gmail.com>

@amueller

amueller Jul 14, 2019

Member

Why did you remove the author information?

@harsh020

harsh020 Jul 14, 2019

Author Contributor

I didn't do that intentionally; it's a mistake. Sorry, I'll restore it immediately.

Harsh Soni added 2 commits Jul 14, 2019
Harsh Soni
@@ -612,7 +612,8 @@ def fit(self, X, y=None):
raise ValueError("'sparse' has to be a boolean or 'auto'. "
"Got {!r} instead.".format(self.sparse))

self.features_ = self._get_missing_features_info(X)[1]
self.__missing_features_info = self._get_missing_features_info(X)

@amueller

amueller Jul 14, 2019

Member

I think we usually only use a single underscore, but otherwise this solution seems good to me.
Do we have a test that checks that fit(X).transform(X) == fit_transform(X)?
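The equivalence amueller asks about can be checked with a small helper. This is an illustrative sketch using a toy stand-in rather than scikit-learn's actual MissingIndicator:

```python
import numpy as np

class ToyIndicator:
    # Toy stand-in for MissingIndicator, for illustration only.
    def fit(self, X, y=None):
        self.features_ = np.flatnonzero(np.isnan(X).any(axis=0))
        return self

    def transform(self, X):
        return np.isnan(X)[:, self.features_]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

def assert_fit_transform_equivalence(est, X):
    # fit_transform(X) should equal fit(X).transform(X) elementwise
    assert np.array_equal(est.fit_transform(X), est.fit(X).transform(X))

X = np.array([[1.0, np.nan], [np.nan, 2.0], [3.0, 4.0]])
assert_fit_transform_equivalence(ToyIndicator(), X)
```

The same check, applied to the real estimator, is essentially the test being requested here.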

@harsh020

harsh020 Jul 14, 2019

Author Contributor

A single underscore declares a conventionally non-public ("protected") attribute, which can still be accessed outside the class, making the data transparent. Hence I used a double underscore, which can't be accessed directly from outside the class and keeps the data opaque: the user need not know it, since they already get it when using fit_transform.
Moreover, I think a single underscore could cause issues in the PR checks: fit returns self, which would include that attribute, but the tests don't know that fit returns it, so we would need to update the tests. (I think this may happen; please correct me if I'm wrong.)
So, is it okay to use a private instance variable, or should I change it to a single underscore?

@harsh020

harsh020 Jul 14, 2019

Author Contributor

Do we have a test that checks that fit(X).transform(X) == fit_transform(X)?

I don't think so. I'm sorry, but I'm very new to testing; I had a look in the test file and don't think we have anything that checks it.
I'll learn more about testing ASAP and try again.

@amueller

amueller Jul 14, 2019

Member

Could you please add one? It should be quite simple.

@harsh020

harsh020 Jul 14, 2019

Author Contributor

Should I add it to the pre-existing test function test_missing_indicator_new?

@harsh020

harsh020 Jul 14, 2019

Author Contributor

Okay, I added a test and I'm pushing it. Please let me know if any changes need to be made.

Harsh Soni added 4 commits Jul 14, 2019
@harsh020

Contributor Author

harsh020 commented Jul 15, 2019

@amueller I added a test; please have a look and let me know if I need to make any changes.

@thomasjpfan

Member

thomasjpfan commented Jul 15, 2019

Storing missing_features_info, which includes the entire mask, seems a little heavy for MissingIndicator. An alternative is to move all of fit into a private method _fit and have _fit return missing_features_info.

  1. fit will call _fit and return self.

  2. fit_transform will call _fit and use the returned missing_features_info to perform the transform.
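thomasjpfan's two-step proposal can be sketched roughly like this (hypothetical names; validation, sparse handling, and the real class's details are omitted):

```python
import numpy as np

class IndicatorSketch:
    """Rough sketch of the proposed _fit refactor; not the merged code."""

    def _get_missing_features_info(self, X):
        imputer_mask = np.isnan(X)
        features = np.flatnonzero(imputer_mask.any(axis=0))
        return imputer_mask, features

    def _fit(self, X, y=None):
        # _fit does the actual work and returns the info instead of storing
        # the mask on the instance.
        return self._get_missing_features_info(X)

    def fit(self, X, y=None):
        # 1. fit calls _fit and returns self.
        self.features_ = self._fit(X, y)[1]
        return self

    def fit_transform(self, X, y=None):
        # 2. fit_transform calls _fit and reuses the returned mask directly,
        #    so _get_missing_features_info runs only once.
        imputer_mask, features = self._fit(X, y)
        self.features_ = features
        return imputer_mask[:, features]
```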

@harsh020

Contributor Author

harsh020 commented Jul 15, 2019

@thomasjpfan I'm a bit confused; please help me out here.
So what you are saying is: I should make a non-public function _fit, move the whole of fit into it, and store and return missing_features_info. Then I should call _fit from fit and return self, and also call _fit from fit_transform to get that stored data.
How is that different from my approach? I'll still be storing the return values in missing_features_info.

@thomasjpfan

Member

thomasjpfan commented Jul 15, 2019

_fit will not store missing_features_info; it will only return it.

@harsh020

Contributor Author

harsh020 commented Jul 15, 2019

Okay, I see where you are going. That's a nice idea! Thanks!

Harsh Soni added 2 commits Jul 15, 2019
@harsh020

Contributor Author

harsh020 commented Jul 15, 2019

@thomasjpfan Just out of curiosity: if I had stored the return values in a local variable in _fit, would it (the local variable holding the return values) have had a lifespan only while _fit was called, or until the MissingIndicator object was destroyed?
And second, why do we use a non-public function (with a single _) rather than a private one (with a double __)?
And yes, I did push the changes.

@thomasjpfan

Member

thomasjpfan commented Jul 15, 2019

would it (the local variable holding the return values) have had a lifespan only while _fit was called

It will only exist while _fit is running.

And second, why do we use a non-public function (with a single _) rather than a private one (with a double __)?

In scikit-learn we use a single _ to mean "non-public". Currently, we do not use the double underscore.
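The difference is easy to see in a few lines: a single leading underscore is only a convention, while a double leading underscore triggers Python's name mangling (the hypothetical Estimator class below is for illustration only):

```python
class Estimator:
    def fit(self):
        self._state = 1    # single underscore: non-public by convention, still accessible
        self.__cache = 2   # double underscore: mangled to _Estimator__cache
        return self

est = Estimator().fit()
assert est._state == 1                    # accessible exactly as written
assert not hasattr(est, "__cache")        # the raw double-underscore name is hidden...
assert est._Estimator__cache == 2         # ...but the mangled name still works
```

So the double underscore does not give true privacy; it only renames the attribute, which is part of why a single underscore is often preferred for "non-public".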

@harsh020

Contributor Author

harsh020 commented Jul 16, 2019

@thomasjpfan Thanks for clearing that up. I made the necessary changes and pushed them; can you please have a look and let me know? I was also asked to add a test for the same; can you please have a look at that too (because I'm very new to tests)?
If everything seems fine, whom should I ask to merge the branch?

Thanks.

@@ -612,7 +612,23 @@ def fit(self, X, y=None):
raise ValueError("'sparse' has to be a boolean or 'auto'. "
"Got {!r} instead.".format(self.sparse))

self.features_ = self._get_missing_features_info(X)[1]
return self._get_missing_features_info(X)

@thomasjpfan

thomasjpfan Jul 16, 2019

Member

So that _fit actually fits and learns something:

missing_features_info = self._get_missing_features_info(X)
self.features_ = missing_features_info[1]
return missing_features_info
self : object
Returns self.
"""
self.features_ = self._fit(X, y)[1]

@thomasjpfan

thomasjpfan Jul 16, 2019

Member

With the change suggested in _fit this can be:

Suggested change
self.features_ = self._fit(X, y)[1]
self._fit(X, y)
@@ -667,7 +683,14 @@ def fit_transform(self, X, y=None):
will be boolean.
"""
return self.fit(X, y).transform(X)
imputer_mask, features = self._fit(X, y)
self.features_ = features

@thomasjpfan

thomasjpfan Jul 16, 2019

Member

self.features_ = features can be removed with the suggested change in _fit.

@thomasjpfan

Member

thomasjpfan commented Jul 16, 2019

If everything seems fine, whom should I ask to merge the branch?

After I approve this PR, we need to wait for another core dev to review the PR before it is merged.

@harsh020

Contributor Author

harsh020 commented Jul 16, 2019

@thomasjpfan Yup, I also thought of doing that, but since you asked me not to store it in missing_features_info, I thought you didn't want me to store it in a local variable either, so I didn't. No problem, I'll patch it up.
Thanks.

@harsh020 harsh020 force-pushed the harsh020:missing_indicator branch from be49ad0 to 41605db Jul 26, 2019
@harsh020

Contributor Author

harsh020 commented Jul 26, 2019

@jnothman Sorry, I am creating a mess here; please help me out. I was getting a merge conflict, and by mistake I merged my missing_indicator branch into upstream/master. To correct it, I merged the upstream/master changes into my local master, reset the head of missing_indicator, and force-pushed it. But I'm still getting merge conflicts.
Should I delete the branch, make the necessary changes, and then push it again?
I know I'm creating a huge mess (this is my first PR); sorry, and please help me out.

Thanks.

@harsh020 harsh020 force-pushed the harsh020:missing_indicator branch from e633ce8 to 57c4dc4 Jul 26, 2019
@harsh020

Contributor Author

harsh020 commented Jul 26, 2019

@jnothman I think I reverted it: I restored my local repo to the state before the merge (commit ffe3a5a), made the necessary changes (removed the unnecessary condition, removed the tests, and added the efficiency entry), and pushed. I think everything is back to where the merge conflict arose. Should I resolve the conflict by keeping both changelog entries, or by removing the entry that was merged into master?

Thanks.

@thomasjpfan

Member

thomasjpfan commented Jul 27, 2019

Should I resolve the conflict by keeping both changelog entries, or by removing the entry that was merged into master?

We prefer if you do not do force pushes to remove commits.

return self.fit(X, y).transform(X)
imputer_mask = self._fit(X, y)

if (self.features_.size < self._n_features):

@thomasjpfan

thomasjpfan Jul 27, 2019

Member
Suggested change
if (self.features_.size < self._n_features):
if self.features_.size < self._n_features:
@jnothman

Member

jnothman commented Jul 27, 2019

Well done for fixing it up, though.

@@ -598,7 +598,7 @@ def fit(self, X, y=None):
Returns
-------
self : object

@jnothman

jnothman Jul 27, 2019

Member

This is no longer accurate

@harsh020

harsh020 Jul 27, 2019

Author Contributor

Do you mean the Returns section of _fit's docstring, or of fit's?

@harsh020

Contributor Author

harsh020 commented Jul 27, 2019

We prefer if you do not do force pushes to remove commits.

@thomasjpfan Sorry, just got scared for messing things up. Won't happen again.

@harsh020

Contributor Author

harsh020 commented Jul 27, 2019

Well done for fixing it up, though.

@jnothman Thanks! Please have a look and let me know if I did it correctly.
(I revisited my course and reviewed how to resolve merge conflicts.)

  1. Pulled upstream/master to my local master.
  2. Switched to missing_indicator branch and merged master to it.
  3. Resolved merge conflicts.
  4. Pushed missing_indicator branch to origin.

This is no longer accurate

Are you talking about the Returns section of _fit? If yes, I fixed it. If not, I'm sorry, I don't follow; please tell me what is inaccurate. The docstring of fit?

Thanks.
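The four steps above correspond roughly to the following commands (a sketch: the remote names upstream and origin and the branch name missing_indicator are assumptions based on this thread):

```shell
git checkout master
git pull upstream master           # 1. update local master from upstream
git checkout missing_indicator
git merge master                   # 2. merge master into the feature branch
# 3. resolve any conflicts in the reported files, then:
git add -A
git commit                         # record the merge resolution
git push origin missing_indicator  # 4. a normal push; no force push needed
```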

Returns self.
Xt : {ndarray or sparse matrix}, shape (n_samples, n_features)
The missing indicator for input data. The data type of ``Xt``
will be boolean.

@thomasjpfan

thomasjpfan Jul 27, 2019

Member

You can reuse the docstring from _get_missing_features_info, since _fit returns one of the outputs of _get_missing_features_info:

        imputer_mask : {ndarray or sparse matrix}, shape (n_samples, n_features_with_missing)
            The imputer mask of the original data.

@harsh020

harsh020 Jul 27, 2019

Author Contributor

Sure. Just one thing: since _fit returns the full imputer_mask rather than only the features with missing values (that is checked in fit_transform), can I change the shape to just (n_samples, n_features)?

Thanks.

@thomasjpfan

thomasjpfan Jul 27, 2019

Member

Yup I agree with just (n_samples, n_features).

@thomasjpfan

Member

thomasjpfan commented Jul 27, 2019

Sorry, just got scared for messing things up. Won't happen again.

@harsh020 It is okay! Thank you for working on this PR! 😄

@harsh020

Contributor Author

harsh020 commented Jul 27, 2019

It is okay! Thank you for working on this PR! 😄

The pleasure was all mine. Thank you for helping me all along; without you it would have been impossible for me to work on this issue.

Thanks.

@@ -597,8 +597,10 @@ def fit(self, X, y=None):
Returns
-------
self : object
Returns self.
imputer_mask : {ndarray or sparse matrix},

@thomasjpfan

thomasjpfan Jul 27, 2019

Member
        imputer_mask : {ndarray or sparse matrix}, shape (n_samples, \
        n_features)
            The imputer mask of the original data.

@harsh020

harsh020 Jul 27, 2019

Author Contributor

Is something wrong?

@thomasjpfan

thomasjpfan Jul 27, 2019

Member

The spacing; this is only for consistency. This docstring is not going to be rendered, since _fit is private.

Notice the use of \ , how n_features) is aligned, and how many spaces there are before The imputer....

@harsh020

harsh020 Jul 27, 2019

Author Contributor

Done.

Thanks.

@jnothman jnothman merged commit 7b8b986 into scikit-learn:master Jul 27, 2019
17 checks passed
LGTM analysis: C/C++ (no code changes detected)
LGTM analysis: JavaScript (no code changes detected)
LGTM analysis: Python (no new or fixed alerts)
ci/circleci: deploy (tests passed)
ci/circleci: doc (tests passed)
ci/circleci: doc-min-dependencies (tests passed)
ci/circleci: lint (tests passed)
codecov/patch (100% of diff hit, target 96.86%)
codecov/project (absolute coverage decreased by <.01%, relative coverage increased by +3.13% compared to df1e3fb)
scikit-learn.scikit-learn Build #20190727.24 succeeded
scikit-learn.scikit-learn: Linux py35_conda_openblas succeeded
scikit-learn.scikit-learn: Linux py35_ubuntu_atlas succeeded
scikit-learn.scikit-learn: Linux pylatest_conda_mkl_pandas succeeded
scikit-learn.scikit-learn: Linux32 py35_ubuntu_atlas_32bit succeeded
scikit-learn.scikit-learn: Windows py35_pip_openblas_32bit succeeded
scikit-learn.scikit-learn: Windows py37_conda_mkl succeeded
scikit-learn.scikit-learn: macOS pylatest_conda_mkl succeeded
@jnothman

Member

jnothman commented Jul 27, 2019

Thank you @harsh020!

@harsh020

Contributor Author

harsh020 commented Jul 28, 2019

Thank you @harsh020!

@jnothman The pleasure was all mine. Thanks; it would not have been possible without you all.

@harsh020

Contributor Author

harsh020 commented Jul 28, 2019

@jnothman Can I delete my branch?

@jnothman

Member

jnothman commented Jul 28, 2019

@harsh020 harsh020 deleted the harsh020:missing_indicator branch Jul 29, 2019
Member

jnothman left a comment

Please open a new PR fixing this up, @harsh020. Sorry to have missed it.

@@ -107,6 +107,11 @@ Changelog
preserve the class balance of the original training set. :pr:`14194`
by :user:`Johann Faouzi <johannfaouzi>`.

- |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the

@jnothman

jnothman Sep 18, 2019

Member

I've just realised this is in completely the wrong section.

@harsh020

harsh020 Sep 19, 2019

Author Contributor

Oh, my bad. Really sorry for the blunder. I'm opening a new PR to fix it right away.

@@ -107,6 +107,11 @@ Changelog
preserve the class balance of the original training set. :pr:`14194`
by :user:`Johann Faouzi <johannfaouzi>`.

- |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
_get_missing_features_info function is now called once when calling

@jnothman

jnothman Sep 18, 2019

Member

We wouldn't usually mention private methods in a changelog entry. Rather, this should just say "avoid repeated computation".

- |Efficiency| :func:`ensemble.MissingIndicator.fit_transform` the
_get_missing_features_info function is now called once when calling
fit_transform for MissingIndicator class. :pr:`14356` by :user:
`Harsh Soni <harsh020>`.

@jnothman

jnothman Sep 18, 2019

Member

This doesn't render correctly with a space before it. It must come right after the :user: role.

@harsh020

harsh020 Sep 19, 2019

Author Contributor

@jnothman I did open a PR; kindly review it: #15027
