[MRG] ENH: Vectorized SMOTE #596
Conversation
Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-17 19:57:54 UTC
Codecov Report
@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
- Coverage   97.97%   97.97%    -0.01%
==========================================
  Files          83       83
  Lines        4878     4877        -1
==========================================
- Hits         4779     4778        -1
  Misses         99       99
Continue to review full report at Codecov.
That would be a nice addition. A doctest is failing though.
The failing doctest is just due to a slight difference in numpy random function usage. Both implementations use random_state, but by vectorizing the code I had to change how tie-breaking is randomly selected. The algorithm works as intended, so either the pull request needs to be declined due to backwards incompatibility, or over_sampling.rst should change 'A' to 'B' so the test passes. I am not sure how you would like to go forward with this. Additionally, I have performed more benchmark testing focusing on the Zenodo datasets using timeit with 100 trials. The results are here:
Probably you mean that someone will not get the same results using the same parameters on the same dataset, right? Sometimes this is expected to happen. Since we have a significant performance gain in terms of speed, I would not decline this request for the reason you mention. @glemaitre what do you think?
@chkoar Yes, you are correct in that I meant the same parameters could produce different results. The API will remain the same. If this gets approved, I would gladly vectorize the dense/sparse code for other algorithms, knowing that changed random_state behavior under new implementations is not a deal breaker (e.g. ADASYN would have different results since its for-loop calls randint on each iteration, which would become a single call under vectorization).
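As a toy illustration of the divergence discussed above (hypothetical draws, not the actual SMOTE code): two equally valid implementations seeded with the same random_state but calling different numpy random functions consume the generator's stream differently, so their outputs differ even though both are uniform over the same range.

```python
import numpy as np

# Hypothetical sketch, not the actual SMOTE code: a per-iteration
# randint loop vs. a single vectorized call built on rng.uniform.
rng_loop = np.random.RandomState(0)
loop_draws = [rng_loop.randint(3) for _ in range(8)]  # one draw per iteration

rng_vec = np.random.RandomState(0)
# One vectorized call using a different random function (uniform floats
# scaled and truncated to integers in [0, 3)).
vec_draws = (rng_vec.uniform(size=8) * 3).astype(int)

# Same seed, same distribution, but the sequences generally differ
# because the underlying bit stream is consumed differently.
print(loop_draws, list(vec_draws))
```

Both sequences are reproducible for a fixed seed; only cross-implementation equality is lost.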
@MattEding sorry for the delay. This seems to be significantly better for the sparse case.
I think this is completely fine. I added an entry in what's new and mentioned that the results obtained with a fixed random_state will differ from previous releases.
@MattEding Thanks a lot for your contribution. This is a major speed-up for the sparse case. The code was pretty bad there :).
Do not hesitate to optimize some of the other samplers :)
Glad to have helped! When I have the time, I will look into other sampler implementations. |
What does this implement/fix? Explain your changes.
Enhanced performance of dense and sparse matrix oversampling with `BaseSMOTE`, `SMOTE`, and `SMOTENC` by replacing for-loops with vectorized operations. For a speed comparison, see my benchmark gist:
https://gist.github.com/MattEding/97c3f36f508ed26e9b2e7dd22db17887
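As a rough sketch of the kind of change involved (simplified, with hypothetical helper names — not the actual imbalanced-learn code), the per-sample interpolation loop at the heart of SMOTE can be replaced by fancy indexing and broadcasting:

```python
import numpy as np

def make_samples_loop(X, nn_data, nn_num, rows, cols, steps):
    # Loop style: interpolate one synthetic sample at a time between a
    # seed point X[row] and its col-th nearest neighbor.
    out = np.empty((len(rows), X.shape[1]))
    for i, (row, col, step) in enumerate(zip(rows, cols, steps)):
        out[i] = X[row] + step * (nn_data[nn_num[row, col]] - X[row])
    return out

def make_samples_vectorized(X, nn_data, nn_num, rows, cols, steps):
    # Vectorized style: fancy indexing gathers all seed points and
    # neighbors at once; broadcasting applies the per-sample steps.
    diffs = nn_data[nn_num[rows, cols]] - X[rows]
    return X[rows] + steps[:, np.newaxis] * diffs
```

Both functions produce identical output for the same inputs; only the number of Python-level iterations (and hence the runtime) changes.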
Any other comments?
With the original implementation of `SMOTENC`, `argmax` was not used for deterministic tie-breaking. To achieve random tie-breaking in the vectorized version, I had to forgo having identical tie-breaks to the original implementation since it uses different numpy random functions.

Interestingly, `SMOTENC` does not have a unit test that validates exact values for a dataset, so while refactoring I was shocked that `SMOTENC` passed the tests even though I knew it should not have.

All main tests passed (other than `show_version`, due to `get_blas` being removed from Scikit-Learn). The tests I explicitly ignored were the `keras` tests and the `matplotlib` generation in the `doc` folder.
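One common vectorized scheme for random tie-breaking (an illustrative sketch, not necessarily what this PR implements) is to add sub-unit uniform noise to integer scores before a single `argmax` call: the noise cannot change which integer scores are maximal, but it randomizes which of the tied columns wins.

```python
import numpy as np

def argmax_random_tiebreak(counts, random_state=None):
    """Row-wise argmax over integer scores with random tie-breaking.

    Hypothetical helper for illustration. Assumes `counts` holds
    integers: noise in [0, 1) cannot promote a non-maximal score
    (max - 1 + noise < max), so only ties are broken randomly.
    """
    rng = np.random.RandomState(random_state)
    noise = rng.uniform(size=counts.shape)
    return (counts + noise).argmax(axis=1)
```

A single seeded `uniform` call replaces per-row random draws, keeping the whole operation reproducible and loop-free.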