[BUG] SMOTEENN and SMOTETomek run for ages on larger datasets on the new update #817
Something similar was reported in #784. Can you include a portion of your data, and environment details from this command: `python -c 'import imblearn; imblearn.show_versions(github=True)'`
In #784, it was indeed not due to imbalanced-learn itself. @jruokolainen, could you give the shape of the array and the data types so that we can try to reproduce, as well as the info asked by @hayesall? It would be really useful.
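For reference, the requested shape and dtype information can be reported with a few lines (a generic sketch; the random arrays below are hypothetical stand-ins for the real data):

```python
import numpy as np

# Hypothetical stand-ins for the reporter's real arrays.
X = np.random.rand(1000, 50).astype(np.float32)
y = np.random.randint(0, 2, size=1000)

print(X.shape, X.dtype)  # feature matrix shape and dtype
print(y.shape, y.dtype)  # label vector shape and dtype
print(np.bincount(y))    # per-class counts, to show the imbalance
```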
I used the following code:

```python
%%time
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(
    n_samples=1_000_000, n_features=10,
    n_classes=3, weights=[0.05, 0.1, 0.85],
    n_informative=4, random_state=0
)
SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
```

With imbalanced-learn 0.7.0 and master, and scikit-learn 0.23.X and master, I am getting a wall time of 6 minutes both times.
Yeah, I can't share the dataset, but overall it has approx. 20 million rows, a class imbalance of 99% negative vs. 1% positive labels, and around 50 feature columns. These were the versions that caused the bug. It works like a charm with imbalanced-learn 0.7.0.
When you say that imbalanced-learn 0.7.0 works like a charm, was it with scikit-learn 0.23 or 0.24?
It's working on OS X Mojave with imbalanced-learn 0.7.0 and scikit-learn 0.24.1, but not with imblearn 0.8.
This one doesn't complete; it occupies a lot of threads on my MacBook Pro 2019 but doesn't strain the CPU at all.

```
❯ pipenv run python -c 'import imblearn; imblearn.show_versions(github=True)'
```

[System information and Python dependency output omitted]
These run without a problem on GCP AI Platform.
This would roughly simulate the dataset size I'm using:

```python
%%time
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(
    n_samples=850862 * 3, n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0
)
SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
```
This one runs in a notebook without a problem even with the newest version, but with my production dataset the CPU usage lingers at 20% and it never completes.
@jruokolainen can you please report the times you observe on your machine when running the reproducer you posted at #817 (comment), both with imbalanced-learn 0.7.0 and with 0.8.0?
@jruokolainen unrelated to the performance problem, out of curiosity I would like to know more about practical applications of SMOTE in production: what kind of data are you working with? What kind of classifier do you use downstream in the pipeline? What is the class balancing ratio (5% vs 95% as in the reproducer)? What improvement in terms of balanced accuracy, F1, AUC, or other metrics do you observe with this resampling?
I can indeed observe a small performance regression on a smaller subset of the data when upgrading:
I have tried with both joblib 1.0 and 0.17, and it does not seem to matter. Here is the reproducer I used:

```python
from time import perf_counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(
    n_samples=int(3e4), n_features=63,
    n_classes=2, weights=[0.05, 0.95],
    n_informative=4, random_state=0
)
print(f"Generated {X.nbytes / 1e6:.1f} MB of training data")

tic = perf_counter()
X_out, y_out = SMOTETomek(n_jobs=-1, random_state=0).fit_resample(X, y)
toc = perf_counter()
print(f"SMOTETomek took {toc - tic:.1f} s and generated {X_out.nbytes / 1e6:.1f} MB")
```

You can try to increase the number of samples, but the ratio of the runtimes seems to stay approximately constant. It would be worth investigating what the bottleneck is with a profiler and reporting the regression upstream in scikit-learn if it can be reproduced with scikit-learn code only. I suspect that using the Ball-Tree algorithm in the embedded nearest neighbors search on 63-dimensional data might be suboptimal. It would be worth checking with the brute-force method and also with an approximate method such as https://github.com/lmcinnes/pynndescent, but that requires significant code changes.
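A minimal way to locate such a bottleneck is the standard-library profiler. This sketch profiles a stand-in workload (a dummy loop, not the real resampler); in practice you would replace `workload` with the `fit_resample` call above:

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for SMOTETomek(...).fit_resample(X, y); replace with the real call.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Report the 10 most expensive functions by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

Functions that dominate cumulative time in the report (e.g. nearest-neighbor queries) are the candidates to report upstream.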
It seems that both SMOTE and TomekLinks fit their own k-NN model internally. Wouldn't there be a way to make the SMOTE model also return the nearest-neighbor info for each resampled data point to avoid this?
The change in scikit-learn 0.24 that might explain the performance behavior for very large datasets with a large enough number of features is scikit-learn/scikit-learn#17148. So it could make sense to give users the ability to switch the underlying NN search strategy.
Correct. The plan is to create backends for nearest-neighbor searches, so we could leverage libraries like faiss or annoy without explicitly requiring them.
Well, probably, my thought was... |
I can share some information about this. The dataset is aggregated website hit-level interaction data. The class balance ratio is shifting from 5%-95% to 1.5%-98.5% daily.
I am having the same problem with SMOTETomek when using a large dataset. It ran for 4 hours before I killed the process.
I am having the same problem with SMOTEENN; it ran for 10 hours until I killed it.
Same issue with SMOTETomek on a dataset of shape (266k, 25). SMOTE alone runs in less than a minute; SMOTETomek takes approximately an hour.
Hi, I have high-dimensional data: 2M rows x 520 features, with an imbalance of 35K positive vs. 2M negative. I tried a simple fit_resample and it has been running for 6-8 hours now on a single-core 32 GB machine. Is this a bad idea? Should I kill it?
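For scale, a quick back-of-the-envelope check of that dataset's memory footprint (assuming dense float64 features, which is what `make_classification` and most pipelines produce by default):

```python
n_rows, n_features = 2_000_000, 520
bytes_per_float64 = 8

# Memory for the dense feature matrix X alone, before any
# neighbor-search or resampling overhead.
dense_bytes = n_rows * n_features * bytes_per_float64
print(f"{dense_bytes / 1e9:.1f} GB")  # 8.3 GB
```

So the raw matrix alone consumes roughly a quarter of a 32 GB machine, and the nearest-neighbor searches inside the resampler add substantially to that.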
The version in master has support for passing duck-typed nearest-neighbors estimators, and thus GPU-accelerated implementations. Since we could not reproduce the original bug and we have implemented the duck-typing, I will close this issue.
Hello, I'm trying to apply SMOTETomek to a dataset of size 2500000x32, but it runs endlessly. What can I do?
I'm having the same problem. I wasted a lot of time trying to run it. What is the solution? What version should we downgrade sklearn to? For now I had to use ordinary SMOTE and take a hit on accuracy, because this is not working; it keeps fitting k-NN continuously. The scikit-learn version is 1.0.2.
@Elsa-gif @Prashavu have you tried to pass an alternative nearest neighbors implementation?

```python
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import NearestNeighbors

SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="kd_tree"))
```

or

```python
SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="ball_tree"))
```

or

```python
SMOTE(k_neighbors=NearestNeighbors(n_neighbors=5, algorithm="brute"))
```
@ogrisel No, I had not tried these. How different is the default SMOTE object from these?
They should all behave the same, only faster or slower depending on the dimensionality (number of features) of the dataset and the number of CPU cores. Tree-based neighbors computation should be faster than the brute-force method in low dimensions (e.g. fewer than 50 features).
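As a rough illustration of that trade-off, the candidate algorithms can be timed directly on synthetic data (a sketch assuming scikit-learn is installed; absolute times depend on the machine and the sample size):

```python
from time import perf_counter

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 63))  # 63 features, as in the reproducer above

timings = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm)
    tic = perf_counter()
    nn.fit(X)
    nn.kneighbors(X)  # query every point against the index
    timings[algorithm] = perf_counter() - tic

for algorithm, seconds in timings.items():
    print(f"{algorithm}: {seconds:.2f} s")
```

On 63-dimensional data the tree-based methods lose much of their advantage, which is consistent with the Ball-Tree suspicion raised earlier in the thread.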
I've been using SMOTETomek in production with success for a while. The 0.7.6 version runs through the dataset in around 5-8 minutes. After updating, the new version ran for 1.5 hours before I killed the process.