
TomekLinks fit_sample taking long time #567

Closed
atulec04 opened this issue May 8, 2019 · 3 comments

Comments

@atulec04

atulec04 commented May 8, 2019

I am working on a text classification problem and am using the TomekLinks class of the imblearn module to resample my data. After calling the fit_sample(X, y) method of TomekLinks, the program does nothing even if I wait for 30 minutes. My data set is 1,800,000 records long (text data). Here is the code snippet:

from imblearn.under_sampling import TomekLinks

# With return_indices=True, fit_sample also returns the indices of the retained samples.
tl = TomekLinks(return_indices=True, ratio='majority', random_state=42)
X_tl, y_tl, idx_tl = tl.fit_sample(train_x, y_binary)

Can anyone help explain why it is taking such a long time, and how to handle this situation?

@hayesall
Member

hayesall commented May 8, 2019

"Tomek Links" is a fairly expensive algorithm since it has to compute pairwise distances between all examples. Even before taking the dimensionality of your text data into account, it will have to compute something on the order of(1.8 * 10^6)^2 values.

From page 4 of "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data": "As finding Tomek links is computationally demanding, it would be computationally cheaper if it was performed on a reduced data set."

Maybe you could sample from your data set (while preserving the underlying class distribution), or try a dimensionality reduction technique first?
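
A minimal sketch of that suggestion, assuming the train_x / y_binary names from the snippet above and using scikit-learn's train_test_split and TruncatedSVD for the stratified subsample and the dimensionality reduction (neither is required by TomekLinks; they are just one way to do it):

from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import TomekLinks

# 1) Stratified subsample: keep ~10% of the rows while preserving the class distribution.
X_small, _, y_small, _ = train_test_split(
    train_x, y_binary, train_size=0.1, stratify=y_binary, random_state=42
)

# 2) Project the (typically sparse, high-dimensional) text features down before computing distances.
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_small)

# 3) Finding Tomek links on the reduced data set is far cheaper than on the full 1.8M-row matrix.
tl = TomekLinks()
X_tl, y_tl = tl.fit_sample(X_reduced, y_small)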

@atulec04
Author

atulec04 commented May 8, 2019

Ok

@nikunjsonule

nikunjsonule commented (quoting @hayesall's reply above)

Yes, totally agree.
