
FIX reproducibility and parallelization of InstanceHardnessThreshold #599

Merged

Conversation

@Shihab-Shahriar
Contributor

Shihab-Shahriar commented Sep 7, 2019

This PR aims to solve a couple of problems with the existing InstanceHardnessThreshold sampler.

1. When estimator is not None, results won't be reproducible unless the estimator already has its random_state set.
2. When estimator is not None, it may have a different n_jobs value than the one given to the InstanceHardnessThreshold constructor. So when the given estimator's n_jobs equals 1, setting n_jobs > 1 in InstanceHardnessThreshold won't have any effect, and fit_resample will run in a single thread.
3. Even when n_jobs matches in both places, moving parallelism away from the estimator to cross_val_predict enables coarse-grained parallelism, possibly speeding up computation. In several simple experiments, run time improved by up to 50% and never got worse.

(This also fixes a few n_jobs-related test failures of PR #598)
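The coarse-grained parallelism of point 3 can be sketched with plain scikit-learn. This is only an illustration of the idea, not the sampler's exact internals: n_jobs is handed to cross_val_predict, so each CV fold is fit in a separate worker even when the estimator itself is single-threaded.

```python
# Sketch: parallelism at the cross-validation level instead of inside the
# estimator. The forest is left at its default single-threaded setting; the
# folds of cross_val_predict run in parallel workers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)  # n_jobs left at 1
skf = StratifiedKFold(n_splits=5)

# n_jobs=2 parallelizes over folds; this is the coarse-grained level the PR
# moves the parallelism to.
probs = cross_val_predict(clf, X, y, cv=skf, n_jobs=2, method="predict_proba")
print(probs.shape)  # (300, 2): one probability row per sample, one column per class
```

Fold-level parallelism tends to scale better than estimator-internal threading for cheap per-fold fits, because each worker runs a whole fit/predict cycle without synchronization.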

@pep8speaks


pep8speaks commented Sep 7, 2019

Hello @Shihab-Shahriar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-17 11:20:05 UTC
@chkoar

Member

chkoar commented Sep 7, 2019

Thanks. Could you please add tests for the different scenarios you mentioned?

@codecov


codecov bot commented Sep 8, 2019

Codecov Report

Merging #599 into master will decrease coverage by 2.16%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #599      +/-   ##
==========================================
- Coverage   97.96%   95.79%   -2.17%     
==========================================
  Files          83       83              
  Lines        4867     4878      +11     
==========================================
- Hits         4768     4673      -95     
- Misses         99      205     +106
Impacted Files Coverage Δ
...rototype_selection/_instance_hardness_threshold.py 100% <100%> (ø) ⬆️
...election/tests/test_instance_hardness_threshold.py 100% <100%> (ø) ⬆️
imblearn/keras/tests/test_generator.py 8.62% <0%> (-91.38%) ⬇️
imblearn/tensorflow/_generator.py 30% <0%> (-70%) ⬇️
imblearn/keras/_generator.py 50.74% <0%> (-47.77%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 153f6e0...f6ed9aa.

@Shihab-Shahriar

Contributor Author

Shihab-Shahriar commented Sep 8, 2019

@chkoar, thanks for your reply. I tried adding tests, but there are a few things I'm struggling with.

  1. I edited the test_iht_init test function to test initialization with several estimators, because I made a similar error in my previous PR. But using sklearn's all_estimators raised a ModuleNotFoundError: No module named 'Cython' error. Should I just skip this edit?
  2. The test that checks reproducibility, which should fail on the current master branch, passes when the dataset is small or the estimator is simple. I don't know if there is a standard, simpler way to test reproducibility.
  3. If we don't do the coarse-grained parallelism part of point 3, which is more of an enhancement than a fix, we could easily check that the estimator's n_jobs equals the one provided to the InstanceHardnessThreshold constructor. But the coarse-grained approach seems to yield a good performance improvement. Here is a gist that compares speed. I'm not sure how to write tests for performance improvements.

I should probably mention this is my first ever attempt at writing tests in a principled way using a library. I tried looking at the tests of this repo and sklearn's, but I honestly feel a bit overwhelmed. Any pointers would be really appreciated. Thanks.
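For point 2, a minimal reproducibility check can be sketched with scikit-learn alone. This is a stand-in for the sampler's internal CV loop, not a test from this PR: cloning the estimator and seeding the clone is what makes two cross_val_predict runs identical.

```python
# Sketch of the reproducibility issue in point 1: clone the estimator and set
# random_state on the clone (the fix), then verify two CV prediction runs
# produce identical probabilities.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
est = RandomForestClassifier(n_estimators=10)          # no random_state set by the user
seeded = clone(est).set_params(random_state=0)         # the fix: seed the clone

p1 = cross_val_predict(seeded, X, y, cv=3, method="predict_proba")
p2 = cross_val_predict(clone(seeded), X, y, cv=3, method="predict_proba")
print(np.array_equal(p1, p2))  # True: identical output across runs
```

Without the set_params call, each run would draw a fresh seed inside the forest and the two probability arrays could differ, which is exactly the non-reproducibility described in point 1 of the PR.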

@@ -126,6 +126,9 @@ def _validate_estimator(self):
                 isinstance(self.estimator, ClassifierMixin) and
                 hasattr(self.estimator, 'predict_proba')):
             self.estimator_ = clone(self.estimator)
             self.estimator_.set_params(random_state=self.random_state)

@chkoar

chkoar Sep 9, 2019

Member

We should use _set_random_state to set the random_state of the estimator. (I know that this function is intended for internal use; we could vendor it.) Apart from that, I believe you should get the random_state object using the check_random_state utility function in _fit_resample, pass it here, and later to cross_val_predict.
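The suggested pattern can be sketched as follows. check_random_state is the public scikit-learn utility; _set_random_state is internal, so this sketch sets the parameter directly on an estimator that exposes random_state at the top level.

```python
# Sketch of the suggested pattern: draw a RandomState once, seed the cloned
# estimator with it, and keep the same object around to reuse for the CV
# splitter passed to cross_val_predict.
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import check_random_state

random_state = check_random_state(42)  # accepts None, an int seed, or a RandomState
estimator_ = clone(RandomForestClassifier())
# _set_random_state would walk every random_state parameter of the estimator;
# for a single top-level parameter, set_params is a public-API stand-in.
estimator_.set_params(random_state=random_state)
print(estimator_.get_params()["random_state"] is random_state)  # True
```

Drawing the RandomState once and threading it through both the estimator and the splitter is what makes the whole fit_resample call deterministic for a given seed.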

@@ -126,6 +126,9 @@ def _validate_estimator(self):
                 isinstance(self.estimator, ClassifierMixin) and
                 hasattr(self.estimator, 'predict_proba')):
             self.estimator_ = clone(self.estimator)
             self.estimator_.set_params(random_state=self.random_state)
             if 'n_jobs' in self.estimator_.get_params().keys():

@chkoar

chkoar Sep 9, 2019

Member

Since you are trying to set a single parameter, I think it is more Pythonic and easier to ask for forgiveness than permission.
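The EAFP style suggested here could look like the following sketch. set_n_jobs is a hypothetical helper, not code from this PR; it relies on set_params raising ValueError for unknown parameters.

```python
# Sketch of "easier to ask forgiveness than permission": try to set n_jobs
# and silently fall back when the estimator does not expose that parameter,
# instead of inspecting get_params() first.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def set_n_jobs(estimator, n_jobs):
    """Hypothetical helper: set n_jobs if the estimator supports it."""
    try:
        estimator.set_params(n_jobs=n_jobs)
    except ValueError:  # set_params raises ValueError for unknown parameters
        pass
    return estimator

forest = set_n_jobs(RandomForestClassifier(), 2)  # has n_jobs: gets updated
tree = set_n_jobs(DecisionTreeClassifier(), 2)    # no n_jobs: left unchanged
print(forest.get_params()["n_jobs"], "n_jobs" in tree.get_params())  # 2 False
```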

…now set using _set_random_state
@Shihab-Shahriar

Contributor Author

Shihab-Shahriar commented Sep 11, 2019

@chkoar, thanks for your review. Apart from the changes suggested in your review, I reverted test_iht_init to its original version. I'm a bit confused about why Travis CI is failing.

@glemaitre glemaitre force-pushed the scikit-learn-contrib:master branch from 65132db to 68123d0 Nov 8, 2019
@glemaitre glemaitre self-assigned this Nov 17, 2019
@glemaitre

Member

glemaitre commented Nov 17, 2019

So I merged the changes with master. I removed the part that was setting n_jobs on the underlying estimator: this is now better managed in joblib, since they worked on over-subscription and nested parallelism. For the rest, I like the refactoring. So let's see if the tests pass.

@glemaitre

Member

glemaitre commented Nov 17, 2019

The previous errors were linked to a conda issue at that time.

@glemaitre glemaitre merged commit 9b31677 into scikit-learn-contrib:master Nov 17, 2019
3 of 4 checks passed
continuous-integration/appveyor/pr (Waiting for AppVeyor build to complete)
LGTM analysis: JavaScript (No code changes detected)
LGTM analysis: Python (No new or fixed alerts)
continuous-integration/travis-ci/pr (The Travis CI build passed)
@glemaitre

Member

glemaitre commented Nov 17, 2019

This is working, let's merge. Thanks @Shihab-Shahriar !!!

@Shihab-Shahriar Shihab-Shahriar deleted the Shihab-Shahriar:fix_ins_hardness branch Nov 22, 2019