
FIX reproducibility and parallelization of InstanceHardnessThreshold #599

Merged

Conversation

Shihab-Shahriar
Contributor

This PR aims to solve a couple of problems with the existing InstanceHardnessThreshold sampler.

1. When estimator is not None, the result is not reproducible if the estimator does not already have its random_state set.
2. When estimator is not None, its n_jobs value may differ from the one given to the InstanceHardnessThreshold constructor. So when the given estimator's n_jobs equals 1, setting n_jobs > 1 in InstanceHardnessThreshold has no effect and fit_resample runs in a single thread.
3. Even when the two n_jobs values match, moving parallelism away from the estimator and into cross_val_predict enables coarse-grained parallelism, possibly speeding up computation. In several simple experiments, run time improved by up to 50% and never got worse (a rough sketch of the approach follows below).

(This also fixes a few n_jobs-related test failures in PR #598.)
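For illustration, here is a rough sketch of what the fix amounts to in plain scikit-learn terms (this is not the diff in this PR; the classifier and dataset are arbitrary placeholders): clone the user-supplied estimator, pin its random_state, and let cross_val_predict drive the parallelism over CV folds.

```python
# Rough sketch only -- not the code merged in this PR; the classifier and
# dataset are arbitrary placeholders used for illustration.
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

estimator = RandomForestClassifier(n_estimators=50)  # user left random_state unset
estimator_ = clone(estimator)
estimator_.set_params(random_state=42)  # point 1: make the result reproducible

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# points 2 and 3: parallelize across CV folds in cross_val_predict instead of
# relying on the estimator's own n_jobs
probabilities = cross_val_predict(
    estimator_, X, y, cv=skf, n_jobs=-1, method="predict_proba"
)
```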

@pep8speaks

pep8speaks commented Sep 7, 2019

Hello @Shihab-Shahriar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-17 11:20:05 UTC

@chkoar
Member

chkoar commented Sep 7, 2019

Thanks. Could you please add tests for the different scenarios you mentioned?

@codecov

codecov bot commented Sep 8, 2019

Codecov Report

Merging #599 into master will decrease coverage by 2.16%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #599      +/-   ##
==========================================
- Coverage   97.96%   95.79%   -2.17%     
==========================================
  Files          83       83              
  Lines        4867     4878      +11     
==========================================
- Hits         4768     4673      -95     
- Misses         99      205     +106
Impacted Files Coverage Δ
...rototype_selection/_instance_hardness_threshold.py 100% <100%> (ø) ⬆️
...election/tests/test_instance_hardness_threshold.py 100% <100%> (ø) ⬆️
imblearn/keras/tests/test_generator.py 8.62% <0%> (-91.38%) ⬇️
imblearn/tensorflow/_generator.py 30% <0%> (-70%) ⬇️
imblearn/keras/_generator.py 50.74% <0%> (-47.77%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 153f6e0...f6ed9aa.

@Shihab-Shahriar
Contributor Author

Shihab-Shahriar commented Sep 8, 2019

@chkoar, thanks for your reply. I tried adding tests, but there are a few things I'm struggling with.

  1. I edited the test_iht_init test function to test initialization with several estimators, because I made a similar error in my previous PR. But using sklearn's all_estimators raised a ModuleNotFoundError: No module named 'Cython' error. Should I just skip this edit?
  2. The test that checks reproducibility, which should fail on the current master branch, passes when the dataset is small or the estimator is simple. I don't know if there is a standard, simpler way to test reproducibility (a minimal test sketch follows this comment).
  3. If we don't do the coarse-grained parallelism part of point 3, which is more of an enhancement than a fix, we could easily check that the estimator's n_jobs equals the one provided to the InstanceHardnessThreshold constructor. But doing it this way seems to give a good performance improvement. Here is a gist that compares speed. I'm not sure how to write tests for performance improvements.

I should probably mention this is my first ever attempt at writing tests in a principled way using a library. I tried looking at the tests of this repo and sklearn's, but honestly I feel a bit overwhelmed. Any pointers would be really appreciated. Thanks.
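For what it's worth, a minimal reproducibility test could look roughly like the sketch below. This is a hypothetical test, not necessarily the one added in this PR; the classifier and dataset sizes are arbitrary assumptions.

```python
# Hypothetical reproducibility test sketch -- the estimator and dataset are
# arbitrary; the real test in the PR may differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

from imblearn.under_sampling import InstanceHardnessThreshold


def test_iht_reproducibility():
    X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
    # the estimator deliberately has no random_state set
    estimator = RandomForestClassifier(n_estimators=50)
    X_res_1, y_res_1 = InstanceHardnessThreshold(
        estimator=estimator, random_state=0
    ).fit_resample(X, y)
    X_res_2, y_res_2 = InstanceHardnessThreshold(
        estimator=estimator, random_state=0
    ).fit_resample(X, y)
    # with the fix, two runs with the same random_state give identical output
    np.testing.assert_allclose(X_res_1, X_res_2)
    np.testing.assert_array_equal(y_res_1, y_res_2)
```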

@@ -126,6 +126,9 @@ def _validate_estimator(self):
                isinstance(self.estimator, ClassifierMixin) and
                hasattr(self.estimator, 'predict_proba')):
            self.estimator_ = clone(self.estimator)
            self.estimator_.set_params(random_state=self.random_state)
@chkoar
Member

chkoar commented Sep 9, 2019

We should use this function to set the random_state of the estimator. (I know that this function is intended for internal use. We could vendor it). Apart from that, I believe you should get the random_state object using the check_random_state utility function in _fit_resample and pass it here and later in cross_val_predict.
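A sketch of what that suggestion could look like (the helper below is hypothetical and only illustrates the shape, not the merged code): validate random_state once with check_random_state and reuse the resulting RandomState both for the cloned estimator and for the CV splitter handed to cross_val_predict.

```python
# Hypothetical helper illustrating the reviewer's suggestion; names and
# structure are assumptions, not the merged implementation.
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.utils import check_random_state


def _cv_probabilities_sketch(estimator, X, y, random_state, cv=5, n_jobs=None):
    random_state = check_random_state(random_state)
    estimator_ = clone(estimator)
    if "random_state" in estimator_.get_params():
        estimator_.set_params(random_state=random_state)
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
    return cross_val_predict(
        estimator_, X, y, cv=skf, n_jobs=n_jobs, method="predict_proba"
    )
```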

@@ -126,6 +126,9 @@ def _validate_estimator(self):
                isinstance(self.estimator, ClassifierMixin) and
                hasattr(self.estimator, 'predict_proba')):
            self.estimator_ = clone(self.estimator)
            self.estimator_.set_params(random_state=self.random_state)
            if 'n_jobs' in self.estimator_.get_params().keys():
Member

Since you are trying to set a single parameter, I think it is more Pythonic and easier to ask for forgiveness than permission.
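As an illustration of that EAFP suggestion (a hypothetical helper, not this PR's code): simply try set_params and catch the ValueError that scikit-learn raises for an unknown parameter, instead of checking get_params() first.

```python
# Hypothetical EAFP-style helper; not the code in this PR.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def set_n_jobs_eafp(estimator, n_jobs):
    """Try to set ``n_jobs``; silently skip estimators that do not expose it."""
    try:
        estimator.set_params(n_jobs=n_jobs)
    except ValueError:
        pass  # estimator has no ``n_jobs`` parameter
    return estimator


set_n_jobs_eafp(LogisticRegression(), 2)      # LogisticRegression accepts n_jobs
set_n_jobs_eafp(DecisionTreeClassifier(), 2)  # DecisionTreeClassifier does not
```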

@Shihab-Shahriar
Contributor Author

@chkoar, thanks for your review. Apart from the changes suggested in your review, I also reverted test_iht_init to its original version. I'm a bit confused about why Travis CI is failing.

@glemaitre
Member

I merged master into the branch and removed the part that was setting n_jobs on the underlying estimator. This is now better managed by joblib, since they have worked on over-subscription and nested parallelism. For the rest, I like the refactoring. Let's see if the tests pass.

@glemaitre
Member

The previous errors were linked to a conda issue at that time.

glemaitre merged commit 9b31677 into scikit-learn-contrib:master Nov 17, 2019
@glemaitre
Member

This is working, let's merge. Thanks @Shihab-Shahriar !!!

Shihab-Shahriar deleted the fix_ins_hardness branch November 22, 2019 17:42