[MRG] ENH: Vectorized ADASYN #649

Merged
merged 6 commits into scikit-learn-contrib:master from MattEding:adasyn-vectorized on Dec 5, 2019

Conversation

@MattEding (Contributor) commented Nov 18, 2019

What does this implement/fix? Explain your changes.

  • Simplify code logic for ADASYN by vectorizing components.
  • Significantly improves sparse matrix speeds while maintaining dense array performance.
  • Fix ADASYN module docstring

TODO:

  • Benchmark performance
  • Adjust unit tests due to change in random state usage
MattEding added 2 commits Nov 18, 2019
…ests due to random state changes
@pep8speaks commented Nov 18, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 07:02:01 UTC
@@ -731,13 +731,12 @@ def _fit_resample(self, X, y):
 X_resampled.append(X_new)
 y_resampled.append(y_new)

-if sparse.issparse(X_new):
+if sparse.issparse(X):

@MattEding (Author) commented Nov 18, 2019

Changed to handle an edge case where the number of samples produced is 0; in that case X_new is never assigned, so referencing it would raise an UnboundLocalError.
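
For context, a hypothetical and simplified sketch of that edge case (the function and variable names below are illustrative, not the actual _fit_resample body):

# Hypothetical, simplified illustration of the edge case (not the real method).
import numpy as np
from scipy import sparse

def fit_resample_sketch(X, n_samples_per_class):
    X_resampled = [X]
    for n in n_samples_per_class:
        if n == 0:
            continue           # nothing to generate for this class
        X_new = X[:n]          # stand-in for the real sample generation
        X_resampled.append(X_new)
    # If every class generated 0 samples, X_new was never assigned, so
    # sparse.issparse(X_new) would raise an UnboundLocalError here;
    # checking the input X instead is always safe.
    if sparse.issparse(X):
        return sparse.vstack(X_resampled)
    return np.vstack(X_resampled)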

@@ -98,7 +98,7 @@ def _make_samples(
 """
 random_state = check_random_state(self.random_state)
 samples_indices = random_state.randint(
-    low=0, high=len(nn_num.flatten()), size=n_samples
+    low=0, high=nn_num.size, size=n_samples

@MattEding (Author) commented Nov 19, 2019

Avoids making a copy of the array.
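
For context, a generic NumPy illustration (not PR code) of why this avoids the copy: len(nn_num.flatten()) materializes a flattened copy of the whole array just to take its length, while nn_num.size reads the same element count from the array's metadata.

import numpy as np

nn_num = np.arange(12).reshape(4, 3)
# flatten() returns a new, copied 1-D array; len() then merely counts it
assert len(nn_num.flatten()) == 12
# .size is the total element count, obtained without allocating anything
assert nn_num.size == 12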

@@ -131,7 +130,9 @@ def _fit_resample(self, X, y):
)
ratio_nn /= np.sum(ratio_nn)
n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)
if not np.sum(n_samples_generate):
# rounding may cause new amount for n_samples
n_samples = np.sum(n_samples_generate)

@MattEding (Author) commented Nov 19, 2019

The original non-vectorized implementation used this approach to set n_samples rather than truncating np.sum(n_samples_generate) down to n_samples.
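
A small numeric illustration of that rounding effect (the ratios below are made up):

import numpy as np

n_samples = 7
ratio_nn = np.array([0.4, 0.4, 0.1, 0.1])    # made-up per-sample densities summing to 1
n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)
print(n_samples_generate)                    # [3 3 1 1]
# per-sample rounding changed the total from 7 to 8, so recompute it
n_samples = np.sum(n_samples_generate)
print(n_samples)                             # 8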

@chkoar (Member) commented Nov 19, 2019

Hey @MattEding. It is great that you are working on this. Do you have any timings for your changes?

@MattEding (Author, Contributor) commented Nov 19, 2019

Benchmarks with Zenodo datasets and Random Sparse Matrices

ADASYN Vectorize vs Loop Benchmark

  • Performed on a MacBook Pro using 4 cores (maximizing n_jobs so the benchmark focuses on the refactored code rather than the NearestNeighbors search used internally by both implementations).
  • Used the minimum of 3 trials with timeit (taking the minimum is its default behavior); a rough sketch of this setup follows the list.
  • As with the previous SMOTE vectorization, timing results for dense arrays are essentially the same.
  • There is a clear execution-time advantage when sparse matrices are used.
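
A rough sketch of this timing setup; the random sparse data and estimator settings below are assumptions for illustration, not the exact benchmark script:

import timeit
import numpy as np
from scipy import sparse
from imblearn.over_sampling import ADASYN

rng = np.random.RandomState(0)
X = sparse.random(5000, 50, density=0.01, format="csr", random_state=rng)
y = rng.choice([0, 1], size=5000, p=[0.1, 0.9])   # imbalanced labels

sampler = ADASYN(random_state=0, n_jobs=4)
# best (minimum) of 3 trials, one fit_resample call per trial
best = min(timeit.repeat(lambda: sampler.fit_resample(X, y), repeat=3, number=1))
print(f"best of 3: {best:.3f} s")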
@MattEding (Author, Contributor) commented Nov 22, 2019

Comparison of ADASYN Implementations

For both the For-Loop and Vectorized implementations,
n_samples_generate == [1 0 1 0 1 0 1 0]
so they will both generate the same number of synthetic samples for each of
the 8 minority class samples. I will mark lines with *X to
highlight the rows that correspond to n_samples_generate.

Nearest Neighbors

For each minority sample, both implementations refer to the same neighbors. The
only difference is that the vectorized implementation does not store each
sample as a neighbor of itself.

For-Loop

nn_index
[[0 2 6 4 1 3]  *A
 [1 3 7 4 5 2]
 [2 0 1 4 6 5]  *B
 [3 1 7 5 2 4]
 [4 1 2 0 7 3]  *C
 [5 3 1 7 2 4]
 [6 0 2 4 1 5]  *D
 [7 3 1 5 4 2]]

Vectorized

nns
[[2 6 4 1 3]  *A
 [3 7 4 5 2]
 [0 1 4 6 5]  *B
 [1 7 5 2 4]
 [1 2 0 7 3]  *C
 [3 1 7 2 4]
 [0 2 4 1 5]  *D
 [3 1 5 4 2]]
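
For reference, a minimal sketch of how these two neighbor tables can be built with scikit-learn's NearestNeighbors; the random 8 x 2 matrix stands in for the minority samples and is not the PR's test data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_class = rng.uniform(size=(8, 2))     # stand-in for the 8 minority samples

nn = NearestNeighbors(n_neighbors=6).fit(X_class)
nn_index = nn.kneighbors(X_class, return_distance=False)   # column 0 is the sample itself
nns = nn_index[:, 1:]                                       # vectorized table: self-neighbor dropped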

Minority samples

For-Loop

For each pass of the loop where the continue statement is not triggered,
these are the minority samples from which synthetic samples will be
generated. We can confirm that they are identical.

x_i
[ 0.11622591 -0.0317206 ]  *A

[ 0.53366841 -0.30312976]  *B

[ 0.88407872  0.35454207]  *C

[-0.41635887 -0.38299653]  *D

Vectorized

X_class
[[ 0.11622591 -0.0317206 ]  *A
 [ 1.25192108 -0.22367336]
 [ 0.53366841 -0.30312976]  *B
 [ 1.52091956 -0.49283504]
 [ 0.88407872  0.35454207]  *C
 [ 1.31301027 -0.92648734]
 [-0.41635887 -0.38299653]  *D
 [ 1.70580611 -0.11219234]]

Selected Neighbors

The randomly selected neighbors for each of the samples conveniently align for
this test case, a lucky happenstance that makes comparison between the two
implementations easier. Note that the For-Loop indices are one unit larger by
necessity, because its neighbor table stores the sample point itself in the
first column.

For-Loop

nn_zs
[5]  *A

[1]  *B

[4]  *C

[4]  *D

Vectorized

cols
[4 0 3 3]
*A B C D

Differences of Sample and Neighbor

For-Loop

This is not directly observable in pdb, but by entering and leaving
interact mode between loop iterations we can capture the results.

The code this is derived from is:

x_class_gen.append([
    x_i + step * (X_class[x_i_nn[nn_z], :] - x_i)
    for step, nn_z in zip(steps, nn_zs)
])

Via the debugger:

[(X_class[x_i_nn[nn_z], :] - x_i) for step, nn_z in zip(steps, nn_zs)]

[array([ 1.40469365, -0.46111444])]
[array([-0.4174425 ,  0.27140916])]
[array([ 0.82172739, -0.46673441])]
[array([ 1.66827995,  0.15932317])]

Vectorized

diffs
[[ 1.40469365 -0.46111444]
 [-0.4174425   0.27140916]
 [ 0.82172739 -0.46673441]
 [ 1.66827995  0.15932317]]

Uniformly Random Steps

Both implementations use the same random function, random_state.uniform(),
just with different shapes.
This is where the values diverge, but we can now be confident that the two
implementations of the ADASYN algorithm coincide other than due to random
number generation.

For-Loop

steps
[ 0.59284462]

[ 0.60276338]

[ 0.84725174]

[ 0.64589411]

Vectorized

steps
[[ 0.54488318]
 [ 0.4236548 ]
 [ 0.64589411]
 [ 0.43758721]]

Newly Synthesized Samples

The new samples are generated afterward from the above information.

For-Loop

X_new
[[ 0.94899098 -0.30508981]
 [ 0.28204936 -0.13953426]
 [ 1.58028868 -0.04089947]
 [ 0.66117333 -0.28009063]]

y_new
[0 0 0 0]

Vectorized

X_new
[[ 0.88161986 -0.2829741 ]
 [ 0.35681689 -0.18814597]
 [ 1.4148276   0.05308106]
 [ 0.3136591  -0.31327875]]

y_new
[0 0 0 0]
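
Putting the pieces together, a minimal self-contained sketch of the vectorized synthesis step using the names from this walkthrough (an illustration under assumed inputs, not the exact PR code):

import numpy as np

rng = np.random.RandomState(0)
X_class = rng.uniform(size=(8, 2))                # stand-in for the 8 minority samples
nns = rng.randint(0, 8, size=(8, 5))              # stand-in neighbor table (self-neighbor excluded)
n_samples_generate = np.array([1, 0, 1, 0, 1, 0, 1, 0])
n_samples = int(np.sum(n_samples_generate))

rows = np.repeat(np.arange(len(X_class)), n_samples_generate)   # which minority sample to start from
cols = rng.randint(0, nns.shape[1], size=n_samples)             # which of its neighbors to walk toward
diffs = X_class[nns[rows, cols]] - X_class[rows]                # sample-to-neighbor difference vectors
steps = rng.uniform(size=(n_samples, 1))                        # random interpolation factors in [0, 1)
X_new = X_class[rows] + steps * diffs                           # all synthetic samples at once
y_new = np.zeros(n_samples, dtype=int)                          # minority class label (0 in this walkthrough)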

Unit Test

From the above results, I feel that the changes to the unit tests are accurate and justifiable.

MattEding changed the title from [WIP] ENH: Vectorized ADASYN to [MRG] ENH: Vectorized ADASYN on Nov 22, 2019
@codecov (bot) commented Nov 22, 2019

Codecov Report

Merging #649 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #649      +/-   ##
==========================================
- Coverage   98.46%   98.46%   -0.01%     
==========================================
  Files          82       82              
  Lines        4886     4876      -10     
==========================================
- Hits         4811     4801      -10     
  Misses         75       75
Impacted Files Coverage Δ
imblearn/over_sampling/tests/test_adasyn.py 100% <ø> (ø) ⬆️
imblearn/over_sampling/_smote.py 96.69% <100%> (ø) ⬆️
imblearn/over_sampling/_adasyn.py 98.36% <100%> (-0.24%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3839df1...ad574ce. Read the comment docs.

glemaitre added this to the 0.6 milestone on Nov 26, 2019
@glemaitre (Member) commented Nov 26, 2019

I'll have a look at it before the release.

glemaitre merged commit a0ac84d into scikit-learn-contrib:master on Dec 5, 2019
16 checks passed

LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
codecov/patch 100% of diff hit (target 98.46%)
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +1.53% compared to 3839df1
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
scikit-learn-contrib.imbalanced-learn #20191122.6 succeeded
scikit-learn-contrib.imbalanced-learn (Linting) Linting succeeded
scikit-learn-contrib.imbalanced-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
scikit-learn-contrib.imbalanced-learn (Linux pylatest_conda_pandas_keras) Linux pylatest_conda_pandas_keras succeeded
scikit-learn-contrib.imbalanced-learn (Linux pylatest_conda_pandas_tensorflow) Linux pylatest_conda_pandas_tensorflow succeeded
scikit-learn-contrib.imbalanced-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
scikit-learn-contrib.imbalanced-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
scikit-learn-contrib.imbalanced-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
scikit-learn-contrib.imbalanced-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
scikit-learn-contrib.imbalanced-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded
MattEding deleted the MattEding:adasyn-vectorized branch on Dec 5, 2019