
TomekLinks fit_sample taking long time #567

Closed
atulec04 opened this issue May 8, 2019 · 3 comments

Comments

@atulec04

atulec04 commented May 8, 2019

I am working on a text classification problem and am using the TomekLinks class of the imblearn module to resample my data. After calling the fit_sample(X, y) method of TomekLinks, the program does nothing even if I wait for 30 minutes. My data set is 1,800,000 records long (text data). Here is the code snippet:

from imblearn.under_sampling import TomekLinks

# With return_indices=True, fit_sample also returns the indices of the retained samples.
tl = TomekLinks(return_indices=True, ratio='majority', random_state=42)
X_tl, y_tl, idx_tl = tl.fit_sample(train_x, y_binary)

Can anyone help explain why it is taking such a long time, and how to handle this situation?

@hayesall
Member

hayesall commented May 8, 2019

"Tomek Links" is a fairly expensive algorithm since it has to compute pairwise distances between all examples. Even before taking the dimensionality of your text data into account, it will have to compute something on the order of(1.8 * 10^6)^2 values.

From page 4 of "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data": "As finding Tomek links is computationally demanding, it would be computationally cheaper if it was performed on a reduced data set."

Maybe you could sample from your data set (while preserving the underlying class distribution), or try a dimensionality reduction technique first?
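
A minimal sketch of that suggestion, assuming the train_x / y_binary names from the snippet above and using scikit-learn's train_test_split and TruncatedSVD for the stratified subsample and the dimensionality reduction (neither is required by TomekLinks; they are just one way to do it):

from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import TomekLinks

# 1) Stratified subsample: keep ~10% of the rows while preserving the class distribution.
X_small, _, y_small, _ = train_test_split(
    train_x, y_binary, train_size=0.1, stratify=y_binary, random_state=42
)

# 2) Project the (typically sparse, high-dimensional) text features down before computing distances.
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_small)

# 3) Finding Tomek links on the reduced data set is far cheaper than on the full 1.8M-row matrix.
tl = TomekLinks()
X_tl, y_tl = tl.fit_sample(X_reduced, y_small)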

@atulec04
Author

atulec04 commented May 8, 2019

Ok

@nikunjsonule

nikunjsonule commented (quoting @hayesall's reply above)

Yes, totally agree.
