FIX reproducibility and parallelization of InstanceHardnessThreshold #599
Conversation
Hello @Shihab-Shahriar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-17 11:20:05 UTC
Thanks. Could you please add tests for the different scenarios you mentioned?
Codecov Report

@@            Coverage Diff             @@
##           master     #599      +/-   ##
==========================================
- Coverage   97.96%   95.79%    -2.17%
==========================================
  Files          83       83
  Lines        4867     4878       +11
==========================================
- Hits         4768     4673       -95
- Misses         99      205      +106

Continue to review full report at Codecov.
@chkoar, thanks for your reply. I tried adding tests, but there are a few things I'm struggling with.
I should probably mention this is my first ever attempt at writing tests in a principled way using a library. I tried looking at the tests of this repo and sklearn's, but honestly I feel a bit overwhelmed. Any pointers would be really appreciated. Thanks.
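For reference, here is a rough sketch of the kind of test that could cover the reproducibility scenario. This is illustrative only, not the test that was eventually merged; the test name and dataset parameters are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import InstanceHardnessThreshold


def test_iht_reproducible_with_unseeded_estimator():
    # Two resamplings with the same random_state should be identical,
    # even when the inner estimator has no random_state of its own.
    X, y = make_classification(n_samples=200, weights=[0.8, 0.2],
                               random_state=0)
    iht = InstanceHardnessThreshold(
        estimator=RandomForestClassifier(),  # deliberately unseeded
        random_state=42,
    )
    X_a, _ = iht.fit_resample(X, y)
    X_b, _ = iht.fit_resample(X, y)
    # With this PR's fix, the inner estimator is seeded from the
    # sampler's random_state, so both runs should match exactly.
    np.testing.assert_allclose(X_a, X_b)
```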
@@ -126,6 +126,9 @@ def _validate_estimator(self):
                 isinstance(self.estimator, ClassifierMixin) and
                 hasattr(self.estimator, 'predict_proba')):
             self.estimator_ = clone(self.estimator)
+            self.estimator_.set_params(random_state=self.random_state)
We should use the _set_random_state function to set the random_state of the estimator. (I know that this function is intended for internal use. We could vendor it.) Apart from that, I believe you should get the random_state object using the check_random_state utility function in _fit_resample, and pass it here and later in cross_val_predict.
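A minimal sketch of what this suggests, assuming the sampler's internal method names; the exact wiring (in particular passing random_state into _validate_estimator and using a seeded StratifiedKFold splitter) is illustrative, not the merged code:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.utils import check_random_state

def _fit_resample(self, X, y):
    # One RandomState object, created once, is threaded through both
    # the cloned estimator and the CV splitter used by cross_val_predict.
    random_state = check_random_state(self.random_state)
    self._validate_estimator(random_state)  # seeds self.estimator_

    skf = StratifiedKFold(
        n_splits=self.cv, shuffle=True, random_state=random_state
    )
    probabilities = cross_val_predict(
        self.estimator_, X, y, cv=skf,
        n_jobs=self.n_jobs, method='predict_proba',
    )
    ...
```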
@@ -126,6 +126,9 @@ def _validate_estimator(self):
                 isinstance(self.estimator, ClassifierMixin) and
                 hasattr(self.estimator, 'predict_proba')):
             self.estimator_ = clone(self.estimator)
+            self.estimator_.set_params(random_state=self.random_state)
+            if 'n_jobs' in self.estimator_.get_params().keys():
Since you are trying to set one parameter, I think it is more pythonic to ask forgiveness than permission (EAFP).
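A sketch of the EAFP alternative to the get_params() check in the diff above (hypothetical snippet, not the code that was merged):

```python
# LBYL ("look before you leap"), as in the diff above:
if 'n_jobs' in self.estimator_.get_params().keys():
    self.estimator_.set_params(n_jobs=self.n_jobs)

# EAFP ("easier to ask forgiveness than permission"), as suggested:
try:
    self.estimator_.set_params(n_jobs=self.n_jobs)
except ValueError:
    pass  # the estimator does not expose an n_jobs parameter
```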
…now set using _set_random_state
@chkoar, thanks for your review. Apart from the suggested changes in your review, I reverted
So I merged the changes with master and removed the part which was setting
The previous errors were linked to a conda issue at that time.
This is working, let's merge. Thanks @Shihab-Shahriar!!!
This PR aims to solve a couple of problems with the existing InstanceHardnessThreshold sampler.

1. When estimator is not None, results won't be reproducible if the estimator doesn't have its random_state already set.
2. When estimator is not None, it may have a different n_jobs value than the one given to the InstanceHardnessThreshold constructor. So when the given estimator's n_jobs equals 1, setting n_jobs > 1 in InstanceHardnessThreshold won't affect anything, and fit_resample will run in a single thread.
3. Even when n_jobs matches in both places, moving parallelism away from the estimator to cross_val_predict enables coarse-grained parallelism, possibly speeding up computation (see the sketch after this list). In several simple experiments, run time improved by up to 50% and never got worse.

(This also fixes a few n_jobs-related test failures of PR #598.)