
[BUG] SMOTEENN and SMOTETomek run for ages on larger datasets with the new update #817

Closed
jruokolainen opened this issue Feb 23, 2021 · 29 comments


@jruokolainen

I've been using SMOTETomek in production with success for a while. The 0.7.6 version runs through the dataset in around 5-8 min. I updated, and the new version ran for 1.5 h before I killed the process.

balancer = SMOTETomek(random_state=2425, n_jobs=-1)
df_resampled, target_resampled = balancer.fit_resample(dataframe, target)
return df_resampled, target_resampled
@hayesall
Member

Something similar was reported in #784

Can you include a portion of your data, and environment details from this command:

python -c 'import imblearn; imblearn.show_versions(github=True)'

@glemaitre
Member

In #784, it was indeed not due to imbalanced-learn but to the new scikit-learn 0.24.

@jruokolainen Could you downgrade scikit-learn to 0.23 (it should still work even if we force the use of 0.24)?

Could you give the shape of the array and the data types so that we can try to reproduce the issue?

@glemaitre
Member

As well as the info requested by @hayesall. It would be really useful.

@glemaitre
Member

I used the following code:

%%time

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1_000_000, n_features=10,
    n_classes=3, weights=[0.05, 0.1, 0.85],
    n_informative=4, random_state=0
)

from imblearn.combine import SMOTETomek

SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)

With imbalanced-learn 0.7.0 / scikit-learn 0.23.X and with imbalanced-learn master / scikit-learn master, I am getting a wall time of 6 minutes in both cases.
We really need all the information regarding the numpy, scipy, scikit-learn, and system versions, as well as the dimensionality of the problem.

@jruokolainen
Author

Yeah, I can't share the dataset, but overall it has approx. 20 million rows, a class imbalance of 99% negative / 1% positive labels, and around 50 feature columns. These were the versions that caused the bug. It works like a charm with imbalanced-learn 0.7.0.
imbalanced-learn-0.8.0 scikit-learn-0.24.1

@glemaitre
Member

When you say that imbalanced-learn 0.7.0 works like a charm, was it with scikit-learn 0.23 or 0.24?
Could you also provide the OS that you are working with?

@jruokolainen
Author

It's working on OS X Mojave with imblearn 0.7.0 and scikit-learn 0.24.1, but not with imblearn 0.8.

@jruokolainen
Author

jruokolainen commented Mar 25, 2021

This one doesn't complete; it occupies a lot of threads on my MacBook Pro 2019 but doesn't strain the CPU at all.

❯ pipenv run python -c 'import imblearn; imblearn.show_versions(github=True)'

System, Dependency Information

System Information

  • python : 3.8.7 (v3.8.7:6503f05dd5, Dec 21 2020, 12:45:15) [Clang 6.0 (clang-600.0.57)]
  • executable: /Users/jokke/.local/share/virtualenvs/b2c-p2p-scorer-model-h2qOSVv_/bin/python
  • machine : macOS-10.16-x86_64-i386-64bit

Python Dependencies

  • pip : 21.0.1
  • setuptools: 53.0.0
  • imblearn : 0.8.0
  • sklearn : 0.24.1
  • numpy : 1.19.5
  • scipy : 1.6.2
  • Cython : None
  • pandas : 1.2.3
  • keras : None
  • tensorflow: 2.4.1
  • joblib : 1.0.1

@jruokolainen
Author

jruokolainen commented Mar 25, 2021

These run without a problem on GCP AI Platform:
Python Dependencies

  • pip : 21.0.1
  • setuptools: 53.0.0
  • imblearn : 0.7.0
  • sklearn : 0.23.2
  • numpy : 1.18.5
  • scipy : 1.4.1
  • Cython : None
  • pandas : 1.2.3
  • keras : None
  • tensorflow: 2.3.1
  • joblib : 0.17.0

@jruokolainen
Author

jruokolainen commented Mar 25, 2021

This would roughly simulate the dataset size I'm using:

%%time
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek
X, y = make_classification(
    n_samples=850862*3, n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0)
SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)

@jruokolainen
Author

(quoting @glemaitre's reproducer and request for environment details above)

This one runs in a notebook without a problem even with the newest version, but with my production dataset the CPU usage lingers at 20% and it never completes.

@ogrisel

ogrisel commented Mar 29, 2021

@jruokolainen can you please report the times you observe on your machine when running the reproducer you posted at #817 (comment)?

both with:

  • imbalanced-learn 0.7 / scikit-learn 0.23.2
  • imbalanced-learn 0.8 / scikit-learn 0.24.1

@ogrisel

ogrisel commented Mar 29, 2021

I've been using SMOTETomek in production with success for a while.

@jruokolainen unrelated to the performance problem: out of curiosity, I would like to know more about practical applications of SMOTE in production. What kind of data are you working with? What kind of classifier do you use downstream in the pipeline? What is the class balancing ratio (5% vs 95% as in the reproducer)? What improvement in terms of balanced accuracy, F1, AUC, or other metrics do you observe with SMOTETomek vs other balanced classification approaches (such as subsampling the majority class) or using BalancedRandomForest or LogisticRegression with class weights?

@ogrisel

ogrisel commented Mar 29, 2021

I can indeed observe a small perf regression on a smaller subset of the data when upgrading:

  • from scikit-learn 0.23.2 / imbalanced-learn 0.7:

(imblearn-07) ogrisel@mba ~ % python tmp/debug_imbalanced_perf.py
Generated 15.1 MB of training data
SMOTETomek took 63.1 s and generated 28.6 MB

  • to scikit-learn 0.24.1 / imbalanced-learn 0.8:

(imblearn-latest) ogrisel@mba ~ % python tmp/debug_imbalanced_perf.py
Generated 15.1 MB of training data
SMOTETomek took 76.2 s and generated 28.6 MB

I have tried with joblib 1.0 and 0.17 in both cases and it does not seem to matter.

Here is the reproducer I used:

from time import perf_counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek


X, y = make_classification(
    n_samples=int(3e4), n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0
)

print(f"Generated {X.nbytes / 1e6:.1f} MB of training data")
tic = perf_counter()
X_out, y_out = SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
toc = perf_counter()
print(f"SMOTETomek took {toc - tic:.1f} s and generated {X_out.nbytes / 1e6:.1f} MB")

You can try to increase the number of samples but the ratio of the runtimes seems to stay approximately constant.

It would be worth investigating what the bottleneck is with a profiler, and reporting the regression upstream in scikit-learn if it can be reproduced with scikit-learn code alone.

I suspect that using the Ball-Tree algorithm in the embedded nearest-neighbors search on 63-dimensional data might be suboptimal. It would be worth checking with the brute-force method and also with an approximate method such as https://github.com/lmcinnes/pynndescent, but that requires a significant code change.
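
For reference, one hedged way to sanity-check that hypothesis is to time the same k-NN query with both algorithms on data of similar dimensionality. The sizes and parameters below are illustrative and not taken from the production dataset.

from time import perf_counter
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(50_000, 63)

for algorithm in ("ball_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=6, algorithm=algorithm, n_jobs=-1)
    tic = perf_counter()
    nn.fit(X)
    nn.kneighbors(X)  # query the training set itself, as the samplers do
    toc = perf_counter()
    print(f"{algorithm}: {toc - tic:.1f} s")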

@ogrisel

ogrisel commented Mar 29, 2021

It seems that both SMOTE and TomekLinks fit their own k-NN model internally. Wouldn't there be a way to make the SMOTE model also return the nearest-neighbor info for each resampled data point to avoid this?
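
For context, a rough sketch of the composition being discussed, assuming SMOTETomek behaves like SMOTE followed by TomekLinks, with each step fitting its own nearest-neighbors model (sizes and parameters here are illustrative):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

X, y = make_classification(
    n_samples=50_000, n_features=20,
    weights=[0.05, 0.95], random_state=0
)

# SMOTE fits a k-NN model on the minority class to interpolate new samples ...
X_sm, y_sm = SMOTE(random_state=0, n_jobs=-1).fit_resample(X, y)
# ... and TomekLinks then fits another NN model on the (now larger) resampled
# data to detect and remove Tomek links, so the neighbor search is done twice.
X_res, y_res = TomekLinks(n_jobs=-1).fit_resample(X_sm, y_sm)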

@ogrisel

ogrisel commented Mar 29, 2021

The change in scikit-learn 0.24 that might explain the performance behavior for very large datasets with a large enough number of features is probably:

scikit-learn/scikit-learn#17148

So it could make sense to give the users the ability to switch the underlying NN search strategy (brute vs ball-tree) and maybe the heuristic used in 0.24 is not optimal...

@chkoar
Member

chkoar commented Mar 29, 2021

So it could make sense to give the users the ability to switch the underlying NN search strategy

Correct. The plan is to create backends for nearest-neighbor searches, so we could leverage libraries like faiss or annoy without explicitly requiring them.
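
For illustration, a hedged sketch of what such a backend might look like: a minimal duck-typed object exposing fit/kneighbors backed by faiss. The class name and interface details are hypothetical, not imbalanced-learn API, and the exact methods the samplers require may differ.

import numpy as np
import faiss  # third-party dependency, only needed for this sketch


class FaissKNN:
    """Hypothetical exact-search wrapper with a scikit-learn-like interface."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        X = np.ascontiguousarray(X, dtype=np.float32)
        self.index_ = faiss.IndexFlatL2(X.shape[1])
        self.index_.add(X)
        return self

    def kneighbors(self, X, n_neighbors=None, return_distance=True):
        n_neighbors = n_neighbors or self.n_neighbors
        X = np.ascontiguousarray(X, dtype=np.float32)
        # IndexFlatL2 returns squared L2 distances; take the square root to
        # match scikit-learn's Euclidean distances.
        sq_dist, ind = self.index_.search(X, n_neighbors)
        return (np.sqrt(sq_dist), ind) if return_distance else ind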

@chkoar
Member

chkoar commented Mar 29, 2021

The plan is

Well, probably, my thought was...

@jruokolainen
Author

jruokolainen commented May 29, 2021

(quoting @ogrisel's questions above)

I can share some information about this. The dataset is aggregated website hit-level interaction data. The class balance ratio shifts daily from 5%/95% to 1.5%/98.5%.
The downstream classifier is a LightGBM GOSS booster. I tested BalancedRandomForest, LogisticRegression with class weights, and LR with minority-class upsampling (using KNN), but the GOSS model generalizes far better on unseen data. The downstream model parameters were tuned with Ray Tune (BOHB).
Overall improvements across the classification metrics were around 5-15% (ROC AUC improved 10% compared to LR with upsampling; the model is nearly perfect on the test dataset). Improvements in production were approx. 5% compared to LR with upsampling; there is a lot of seasonality and fast change in the real-world environment, so the model is trained daily.
In production we use the SMOTETomek-balanced data with the GOSS model and an LR with upsampled minority-class data. We use the probabilities of the two highest deciles from both models for ad targeting. This yielded the best results based on our A/B tests. The overall improvement was quite drastic when comparing GOSS against LR in A/B testing. I cannot go into more detail, unfortunately.

@Mariamamb

Mariamamb commented Jun 27, 2021

I am having the same problem with SMOTETomek when using a large dataset. It ran for 4 h before I killed the process.
I am using imbalanced-learn 0.7.0 and scikit-learn 0.23.1.
The shape is (2264594, 78), but it is highly imbalanced.

@andrdpedro

I am having the same problem with SMOTEENN; it was running for 10 hours until I killed it...
Has anyone found a solution?

@lucamagnasco

Same issue for SMOTETomek, on a dataset of shape (266k, 25). SMOTE alone runs in less than a minute; SMOTETomek takes approximately an hour.

@nurrrrx

nurrrrx commented Nov 26, 2021

Hi, I have high-dimensional data, 2M rows x 520 features, and the imbalance is 35K vs 2M (positive vs negative). I tried a simple fit_sample, and it has been running for 6-8 hours now on a single-core 32 GB machine. Is it a bad idea? Should I kill it?
Would going for AWS with 512 GB and more cores help, or a GPU?

@glemaitre
Member

The version in master has support for passing duck-typed NN estimators, and thus GPU-accelerated instances. Since we could not reproduce the original bug and we have implemented the duck typing, I will close this issue.
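
For anyone landing here later, a minimal sketch of what that could enable, assuming SMOTE's k_neighbors accepts a nearest-neighbors-like estimator and SMOTETomek exposes a smote parameter (the TomekLinks step is left at its defaults); a GPU-backed or approximate wrapper such as the faiss sketch above could be passed in the same place:

from sklearn.neighbors import NearestNeighbors
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE

knn = NearestNeighbors(n_neighbors=5, algorithm="brute", n_jobs=-1)
sampler = SMOTETomek(smote=SMOTE(k_neighbors=knn, random_state=0), random_state=0)
# X_res, y_res = sampler.fit_resample(X, y)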

@Elsa-gif

Hello, I'm trying to apply SMOTETomek to a dataset of size 2500000x32 but it runs endlessly. What should I do?

@Prashavu

Prashavu commented Aug 19, 2022

I'm having the same problem. I wasted a lot of time trying to run it; what is the solution? What version should we downgrade sklearn to? For now I have had to use ordinary SMOTE and take a hit on accuracy because this is not working; it keeps fitting KNN continuously.

The scikit-learn version is 1.0.2.

@ogrisel

ogrisel commented Aug 19, 2022

@Elsa-gif @Prashavu have you tried to pass an alternative nearest neighbors implementation?

from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="kd_tree"))

or

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="ball_tree"))

or

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="brute"))

@Prashavu

@ogrisel No, I had not tried these. How different is the default SMOTE object from these?

@ogrisel

ogrisel commented Aug 21, 2022

They should all behave the same, only faster or slower depending on the dimensionality (number of features) of the dataset and the number of CPU cores. Tree-based neighbors computation should be faster than the brute-force method in low dimensions (e.g. fewer than 50 features).
