
[BUG] SMOTEENN and SMOTETomek run for ages on larger datasets with the new update #817

Closed
jruokolainen opened this issue Feb 23, 2021 · 29 comments


@jruokolainen

I've been using SMOTETomek in production with success for a while. The 0.7.6 version runs through the dataset in around 5-8 min. I updated, and the new version ran for 1.5 h before I killed the process.

balancer = SMOTETomek(random_state=2425, n_jobs=-1)
df_resampled, target_resampled = balancer.fit_resample(dataframe, target)
return df_resampled, target_resampled
@hayesall
Member

Something similar was reported in #784

Can you include a portion of your data, and environment details from this command:

python -c 'import imblearn; imblearn.show_versions(github=True)'

@glemaitre
Member

In #784, it was indeed not due to imbalanced-learn but to the new scikit-learn 0.24.

@jruokolainen Could you downgrade scikit-learn to 0.23 (it should still work even if we force the use of 0.24)?

Could you give the shape of the array and the data types so that we can try to reproduce the issue?

@glemaitre
Member

As well as the info requested by @hayesall. It would be really useful.

@glemaitre
Member

I used the following code:

%%time

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1_000_000, n_features=10,
    n_classes=3, weights=[0.05, 0.1, 0.85],
    n_informative=4, random_state=0
)

from imblearn.combine import SMOTETomek

SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)

With imbalanced-learn 0.7.0 / scikit-learn 0.23.X and with imbalanced-learn master / scikit-learn master, I am getting a wall time of 6 minutes in both cases.
We really need all the information regarding the numpy, scipy, scikit-learn, and system versions, as well as the dimensionality of the problem.

@jruokolainen
Author

Yeah, I can't share the dataset, but overall it has approx. 20 million rows, a class imbalance of 99% negative / 1% positive labels, and around 50 feature columns. These were the versions that caused the bug. It works like a charm with imbalanced-learn 0.7.0.
imbalanced-learn-0.8.0 scikit-learn-0.24.1

@glemaitre
Member

When you say that imbalanced-learn 0.7.0 works like a charm, was it with scikit-learn 0.23 or 0.24?
Could you also provide the OS that you are working with?

@jruokolainen
Author

It's working on OS X Mojave with imblearn 0.7.0 and scikit-learn 0.24.1, but not with imblearn 0.8.

@jruokolainen
Author

jruokolainen commented Mar 25, 2021

This one doesn't complete; it occupies a lot of threads on my MacBook Pro 2019 but doesn't strain the CPU at all.

❯ pipenv run python -c 'import imblearn; imblearn.show_versions(github=True)'

System, Dependency Information

System Information

  • python : 3.8.7 (v3.8.7:6503f05dd5, Dec 21 2020, 12:45:15) [Clang 6.0 (clang-600.0.57)]
  • executable: /Users/jokke/.local/share/virtualenvs/b2c-p2p-scorer-model-h2qOSVv_/bin/python
  • machine : macOS-10.16-x86_64-i386-64bit

Python Dependencies

  • pip : 21.0.1
  • setuptools: 53.0.0
  • imblearn : 0.8.0
  • sklearn : 0.24.1
  • numpy : 1.19.5
  • scipy : 1.6.2
  • Cython : None
  • pandas : 1.2.3
  • keras : None
  • tensorflow: 2.4.1
  • joblib : 1.0.1

@jruokolainen
Author

jruokolainen commented Mar 25, 2021

These run without a problem on GCP AI Platform:
Python Dependencies

  • pip : 21.0.1
  • setuptools: 53.0.0
  • imblearn : 0.7.0
  • sklearn : 0.23.2
  • numpy : 1.18.5
  • scipy : 1.4.1
  • Cython : None
  • pandas : 1.2.3
  • keras : None
  • tensorflow: 2.3.1
  • joblib : 0.17.0

@jruokolainen
Author

jruokolainen commented Mar 25, 2021

This would roughly simulate the dataset size I'm using:

%%time
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek
X, y = make_classification(
    n_samples=850862*3, n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0)
SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)

@jruokolainen
Author

(quoting @glemaitre's reproducer and request for environment details above)

This one runs in a notebook without a problem even with the newest version, but with my production dataset the CPU usage lingers at 20% and it never completes.

@ogrisel

ogrisel commented Mar 29, 2021

@jruokolainen can you please report the times you observe on your machine when running the reproducer you posted at #817 (comment)?

both with:

  • imbalanced-learn 0.7 / scikit-learn 0.23.2
  • imbalanced-learn 0.8 / scikit-learn 0.24.1

@ogrisel

ogrisel commented Mar 29, 2021

I've been using SMOTETomek in production with success for a while.

@jruokolainen unrelated to the performance problem: out of curiosity, I would like to know more about practical applications of SMOTE in production. What kind of data are you working with? What kind of classifier do you use downstream in the pipeline? What is the class balancing ratio (5% vs 95% as in the reproducer)? What improvement in terms of balanced accuracy, F1, AUC, or other metrics do you observe with SMOTETomek vs other balanced classification approaches (such as subsampling the majority class) or using BalancedRandomForest or LogisticRegression with class weights?

@ogrisel

ogrisel commented Mar 29, 2021

I can indeed observe a small perf regression on a smaller subset of the data when upgrading:

  • from scikit-learn 0.23.2 / imbalanced-learn 0.7:

(imblearn-07) ogrisel@mba ~ % python tmp/debug_imbalanced_perf.py
Generated 15.1 MB of training data
SMOTETomek took 63.1 s and generated 28.6 MB

  • to scikit-learn 0.24.1 / imbalanced-learn 0.8:

(imblearn-latest) ogrisel@mba ~ % python tmp/debug_imbalanced_perf.py
Generated 15.1 MB of training data
SMOTETomek took 76.2 s and generated 28.6 MB

I have tried with joblib 1.0 and 0.17 in both cases and it does not seem to matter.

Here is the reproducer I used:

from time import perf_counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek


X, y = make_classification(
    n_samples=int(3e4), n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0
)

print(f"Generated {X.nbytes / 1e6:.1f} MB of training data")
tic = perf_counter()
X_out, y_out = SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
toc = perf_counter()
print(f"SMOTETomek took {toc - tic:.1f} s and generated {X_out.nbytes / 1e6:.1f} MB")

You can try to increase the number of samples but the ratio of the runtimes seems to stay approximately constant.

It would be worth investigating what the bottleneck is with a profiler, and reporting the regression upstream in scikit-learn if it can be reproduced with scikit-learn code alone.

I suspect that using the Ball-Tree algorithm in the embedded nearest-neighbors search on 63-dimensional data might be suboptimal. It would be worth checking with the brute-force method and also with an approximate method such as https://github.com/lmcinnes/pynndescent, but that requires a significant code change.
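
For reference, one hedged way to sanity-check that hypothesis is to time the same k-NN query with both algorithms on data of similar dimensionality. The sizes and parameters below are illustrative and not taken from the production dataset.

from time import perf_counter
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(50_000, 63)

for algorithm in ("ball_tree", "brute"):
    nn = NearestNeighbors(n_neighbors=6, algorithm=algorithm, n_jobs=-1)
    tic = perf_counter()
    nn.fit(X)
    nn.kneighbors(X)  # query the training set itself, as the samplers do
    toc = perf_counter()
    print(f"{algorithm}: {toc - tic:.1f} s")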

@ogrisel

ogrisel commented Mar 29, 2021

It seems that both SMOTE and TomekLinks fit their own k-NN model internally. Wouldn't there be a way to make the SMOTE model also return the nearest-neighbor info for each resampled data point to avoid this?
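
For context, a rough sketch of the composition being discussed, assuming SMOTETomek behaves like SMOTE followed by TomekLinks, with each step fitting its own nearest-neighbors model (sizes and parameters here are illustrative):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

X, y = make_classification(
    n_samples=50_000, n_features=20,
    weights=[0.05, 0.95], random_state=0
)

# SMOTE fits a k-NN model on the minority class to interpolate new samples ...
X_sm, y_sm = SMOTE(random_state=0, n_jobs=-1).fit_resample(X, y)
# ... and TomekLinks then fits another NN model on the (now larger) resampled
# data to detect and remove Tomek links, so the neighbor search is done twice.
X_res, y_res = TomekLinks(n_jobs=-1).fit_resample(X_sm, y_sm)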

@ogrisel

ogrisel commented Mar 29, 2021

The change in scikit-learn 0.24 that might explain the performance behavior for very large datasets with a large enough number of features is probably:

scikit-learn/scikit-learn#17148

So it could make sense to give the users the ability to switch the underlying NN search strategy (brute vs ball-tree) and maybe the heuristic used in 0.24 is not optimal...

@chkoar
Member

chkoar commented Mar 29, 2021

So it could make sense to give the users the ability to switch the underlying NN search strategy

Correct. The plan is to create backends for nearest-neighbor searches, so we could leverage libraries like faiss or annoy without explicitly requiring them.
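
For illustration, a hedged sketch of what such a backend might look like: a minimal duck-typed object exposing fit/kneighbors backed by faiss. The class name and interface details are hypothetical, not imbalanced-learn API, and the exact methods the samplers require may differ.

import numpy as np
import faiss  # third-party dependency, only needed for this sketch


class FaissKNN:
    """Hypothetical exact-search wrapper with a scikit-learn-like interface."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        X = np.ascontiguousarray(X, dtype=np.float32)
        self.index_ = faiss.IndexFlatL2(X.shape[1])
        self.index_.add(X)
        return self

    def kneighbors(self, X, n_neighbors=None, return_distance=True):
        n_neighbors = n_neighbors or self.n_neighbors
        X = np.ascontiguousarray(X, dtype=np.float32)
        # IndexFlatL2 returns squared L2 distances; take the square root to
        # match scikit-learn's Euclidean distances.
        sq_dist, ind = self.index_.search(X, n_neighbors)
        return (np.sqrt(sq_dist), ind) if return_distance else ind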

@chkoar
Member

chkoar commented Mar 29, 2021

The plan is

Well, probably, my thought was...

@jruokolainen
Author

jruokolainen commented May 29, 2021

(quoting @ogrisel's questions above)

I can share some information about this. The dataset is aggregated website hit-level interaction data. The class balance ratio shifts daily from 5%/95% to 1.5%/98.5%.
The downstream classifier is a LightGBM GOSS booster. I tested BalancedRandomForest, LogisticRegression with class weights, and LR with minority-class upsampling (using KNN), but the GOSS model generalizes far better on unseen data. The downstream model parameters were tuned with Ray Tune (BOHB).
Overall improvements across the classification metrics were around 5-15% (ROC AUC improved 10% compared to LR with upsampling; the model is nearly perfect on the test dataset). Improvements in production were approx. 5% compared to LR with upsampling; there is a lot of seasonality and fast change in the real-world environment, so the model is trained daily.
In production we use the SMOTETomek-balanced data with the GOSS model and an LR with upsampled minority-class data. We use the probabilities of the two highest deciles from both models for ad targeting. This yielded the best results based on our A/B tests. The overall improvement was quite drastic when comparing GOSS against LR in A/B testing. I cannot go into more detail, unfortunately.

@Mariamamb

Mariamamb commented Jun 27, 2021

I am having the same problem with SMOTETomek when using a large dataset. It ran for 4 h before I killed the process.
I am using imbalanced-learn 0.7.0 and scikit-learn 0.23.1.
The shape is (2264594, 78), but it is highly imbalanced.

@andrdpedro

I am having the same problem with SMOTEENN; it was running for 10 hours until I killed it...
Has anyone found a solution?

@lucamagnasco

Same issue for SMOTETomek, on a dataset of shape (266k, 25). SMOTE alone runs in less than a minute; SMOTETomek takes approximately an hour.

@nurrrrx

nurrrrx commented Nov 26, 2021

Hi, I have high-dimensional data, 2M rows x 520 features, and the imbalance is 35K vs 2M (positive vs negative). I tried a simple fit_sample, and it has been running for 6-8 hours now on a single-core 32 GB machine. Is it a bad idea? Should I kill it?
Would going for AWS with 512 GB and more cores help, or a GPU?

@glemaitre
Member

The version in master has support for passing duck-typed NN estimators, and thus GPU-accelerated instances. Since we could not reproduce the original bug and we have implemented the duck typing, I will close this issue.
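
For anyone landing here later, a minimal sketch of what that could enable, assuming SMOTE's k_neighbors accepts a nearest-neighbors-like estimator and SMOTETomek exposes a smote parameter (the TomekLinks step is left at its defaults); a GPU-backed or approximate wrapper such as the faiss sketch above could be passed in the same place:

from sklearn.neighbors import NearestNeighbors
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE

knn = NearestNeighbors(n_neighbors=5, algorithm="brute", n_jobs=-1)
sampler = SMOTETomek(smote=SMOTE(k_neighbors=knn, random_state=0), random_state=0)
# X_res, y_res = sampler.fit_resample(X, y)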

@Elsa-gif

Hello, I'm trying to apply SMOTETomek to a dataset of size 2500000x32 but it runs endlessly. What should I do?

@Prashavu

Prashavu commented Aug 19, 2022

I'm having the same problem. I wasted a lot of time trying to run it; what is the solution? What version should we downgrade sklearn to? For now I have had to use ordinary SMOTE and take a hit on accuracy because this is not working; it keeps fitting KNN continuously.

The scikit-learn version is 1.0.2.

@ogrisel

ogrisel commented Aug 19, 2022

@Elsa-gif @Prashavu have you tried to pass an alternative nearest neighbors implementation?

from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="kd_tree"))

or

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="ball_tree"))

or

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="brute"))

@Prashavu

@ogrisel No, I had not tried these. How different is the default SMOTE object from these?

@ogrisel

ogrisel commented Aug 21, 2022

They should all behave the same, only faster or slower depending on the dimensionality (number of features) of the dataset and the number of CPU cores. Tree-based neighbors computation should be faster than the brute-force method in low dimensions (e.g. fewer than 50 features).
