TomekLinks fit_sample taking long time #567
I am working on a text classification problem and using the TomekLinks class from the imblearn module to resample my data. After calling the fit_sample(X, y) method of TomekLinks, the program appears to do nothing, even if I wait for 30 minutes. My data set has 1,800,000 records (text data). Here is the code snippet:
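from imblearn.under_sampling import TomekLinks

tl = TomekLinks(return_indices=True, ratio='majority', random_state=42)
# The call described above; with return_indices=True it also returns the kept indices.
X_res, y_res, idx_res = tl.fit_sample(X, y)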
Can anyone help explain why it is taking such a long time, and how to handle this situation?
"Tomek Links" is a fairly expensive algorithm since it has to compute pairwise distances between all examples. Even before taking the dimensionality of your text data into account, it will have to compute something on the order of
From page 4 of "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data": "As finding Tomek links is computationally demanding, it would be computationally cheaper if it was performed on a reduced data set."
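To put rough numbers on why a reduced data set helps, here is a minimal sketch that counts unordered pairs of examples. It assumes a naive all-pairs comparison, which is only an upper-bound illustration; the neighbour search the library actually runs may do less work. The 5% subsample size is a hypothetical choice, not something from the paper.

```python
def pair_count(n):
    """Number of unordered pairs among n examples: n * (n - 1) / 2."""
    return n * (n - 1) // 2


full = pair_count(1_800_000)  # data set size from the question
sample = pair_count(90_000)   # hypothetical 5% subsample

print(f"full data set: {full:.3e} pairs")
print(f"5% subsample:  {sample:.3e} pairs")
print(f"reduction:     {full / sample:.0f}x")
```

Because the pair count grows quadratically, keeping 1/20 of the rows cuts the number of pairs by roughly a factor of 400, before the dimensionality of the text features is even considered.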
Maybe you could sample from your data set (while preserving the underlying distribution), or apply a dimensionality reduction technique first?
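As a rough sketch of the sampling idea, the snippet below draws a stratified subsample so that each class keeps approximately its original share. The function name `stratified_sample_indices`, the 10% fraction, and the toy labels are all hypothetical and only there for illustration; it uses nothing beyond the standard library.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, fraction, seed=42):
    """Return sorted indices of a random subsample that keeps roughly
    the same class proportions as `labels`."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        # Keep at least one example per class so no class disappears.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels: 900 majority (0) and 100 minority (1) examples.
y = [0] * 900 + [1] * 100
keep = stratified_sample_indices(y, fraction=0.1)
y_small = [y[i] for i in keep]
print(len(y_small), y_small.count(0), y_small.count(1))
```

The rows of X and y selected by `keep` could then be passed to TomekLinks in place of the full data set. If scikit-learn is already in use, `train_test_split` with its `stratify` argument is another way to get a similar subsample.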