FIX reproducibility and parallelization of InstanceHardnessThreshold #599
Conversation
Hello @Shihab-Shahriar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-17 11:20:05 UTC
Thanks. Could you please add tests for the different scenarios you mentioned?
Codecov Report

```diff
@@            Coverage Diff            @@
##           master    #599      +/-  ##
=========================================
- Coverage   97.96%  95.79%    -2.17%
=========================================
  Files          83      83
  Lines        4867    4878      +11
=========================================
- Hits         4768    4673      -95
- Misses         99     205     +106
=========================================
```

Continue to review full report at Codecov.
@chkoar, thanks for your reply. I tried adding tests, but there are a few things I'm struggling with.
I should probably mention this is my first ever attempt at writing tests in a principled way using a library. I tried looking at the tests in this repo and in sklearn's, but honestly I feel a bit overwhelmed. Any pointers would be really appreciated. Thanks.
```python
        isinstance(self.estimator, ClassifierMixin) and
        hasattr(self.estimator, 'predict_proba')):
    self.estimator_ = clone(self.estimator)
    self.estimator_.set_params(random_state=self.random_state)
```
We should use `_set_random_state` to set the `random_state` of the estimator. (I know that this function is intended for internal use; we could vendor it.) Apart from that, I believe you should get the `random_state` object using the `check_random_state` utility function in `_fit_resample` and pass it here and later to `cross_val_predict`.
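A minimal sketch of the pattern the reviewer is suggesting, assuming a scikit-learn estimator: `check_random_state` normalizes whatever the user passed (`None`, an int, or a `RandomState`) into a single RNG, and a seed drawn from that RNG is then applied to the cloned estimator. Here plain `set_params` stands in for the internal `_set_random_state` helper; the seeding scheme is illustrative, not the PR's exact code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_random_state

# Normalize the user-supplied random_state (None, int, or RandomState)
# into a single RandomState object, as done once in _fit_resample.
random_state = check_random_state(42)

# Draw a concrete integer seed for the cloned estimator, so results are
# reproducible even if the user never set the estimator's own random_state.
est = clone(DecisionTreeClassifier())
est.set_params(random_state=random_state.randint(np.iinfo(np.int32).max))
```

The same `random_state` object would then be reused for the cross-validation splitter, so every source of randomness traces back to one seed.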
```python
        hasattr(self.estimator, 'predict_proba')):
    self.estimator_ = clone(self.estimator)
    self.estimator_.set_params(random_state=self.random_state)
    if 'n_jobs' in self.estimator_.get_params().keys():
```
Since you are trying to set a single parameter, I think it is more pythonic and easier to ask for forgiveness than permission.
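The EAFP ("easier to ask forgiveness than permission") version of the check above could look like the following sketch: instead of inspecting `get_params()` first, just attempt `set_params(n_jobs=...)` and catch the `ValueError` that scikit-learn raises for an unknown parameter. The estimator choices are illustrative.

```python
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

est = clone(DecisionTreeClassifier())  # DecisionTreeClassifier has no n_jobs

# EAFP: try to set n_jobs directly; scikit-learn's set_params raises
# ValueError for parameters the estimator does not accept.
try:
    est.set_params(n_jobs=1)
except ValueError:
    pass  # estimator is not parallelizable; leave it as-is
```

For an estimator that does expose `n_jobs` (e.g. `RandomForestClassifier`), the `try` branch simply succeeds and no exception is raised.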
…now set using _set_random_state
@chkoar, thanks for your review. Apart from the changes suggested in your review, I reverted
So I merged the change with master. I removed the part which was setting
The previous errors were linked to a conda issue at that time.
This is working, let's merge. Thanks @Shihab-Shahriar!!!
This PR aims to solve a couple of problems with the existing `InstanceHardnessThreshold` sampler.

1. When `estimator` is not None, results won't be reproducible if the `estimator` doesn't have its `random_state` already set.
2. When `estimator` is not None, it may have a different `n_jobs` value than the one given to the `InstanceHardnessThreshold` constructor. So when the given estimator's `n_jobs` equals 1, setting `n_jobs > 1` in `InstanceHardnessThreshold` won't affect anything, and `fit_resample` will run in a single thread.
3. Even when `n_jobs` matches in both places, moving parallelism away from the `estimator` to `cross_val_predict` enables coarse-grained parallelism, possibly speeding up computation. In several simple experiments run time improved by up to 50%, and never got worse.

(This fixes a few `n_jobs`-related test failures of PR #598.)
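The coarse-grained parallelism in point 3 can be sketched as follows: keep the estimator itself single-threaded and let `cross_val_predict` parallelize across CV folds instead, so each fold's fit/predict runs in its own worker. The estimator, `cv=3`, and `n_jobs=2` values here are illustrative choices, not the PR's exact defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)

# Single-threaded estimator: parallelism is granted to cross_val_predict,
# which distributes whole CV folds across workers (coarse-grained).
est = RandomForestClassifier(n_estimators=10, n_jobs=1, random_state=0)

probs = cross_val_predict(est, X, y, cv=3, n_jobs=2,
                          method='predict_proba')
# probs has shape (n_samples, n_classes); InstanceHardnessThreshold-style
# samplers use such out-of-fold probabilities to score instance hardness.
```

One fit per fold is a much larger unit of work than, say, one tree inside a forest, so fold-level workers tend to have less scheduling overhead, which is consistent with the reported speedups.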