
In [10]:
# set up FAISS for vector search
!pip install faiss-cpu

# checking whether faiss is installed properly
import faiss
print(faiss.__version__)


Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1
1.8.0


In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

#### Please check the [notebook](https://github.com/rajdeepbanerjee-git/Data_Augmentation_LLM/blob/main/MA2_p1_dataset_prep.ipynb) for data preparation. For ease of use, the datasets used are included in this repo.

In [2]:
train_sample = pd.read_csv("/content/train.csv")
test_sample = pd.read_csv("/content/test.csv")
pool_sample = pd.read_csv("/content/pool.csv")


#### Hypothesis:
SMOTE may create synthetic data points that do not correspond to real-world minority-class examples. In the high-dimensional vector space, they may instead lie closer to the majority class.

To test this:
- We first set aside a data pool that serves as our real-world data (around 40% of the total data).
- We apply SMOTE and see how the model performs when trained on the added synthetic data.
- To see why it performs badly, we run a cosine-similarity search against the data pool, take the pool data point most similar to each synthetic minority sample, and compute the percentage of synthetic samples whose nearest pool neighbor belongs to the majority class rather than the minority class.
- The higher this percentage, the worse SMOTE performs.

This analysis is motivated by the paper ["Stop oversampling for class imbalance learning"](https://arxiv.org/abs/2202.03579). I took a simpler approach than the paper, but the result still holds.
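
To make the hypothesis concrete, here is a minimal sketch of the interpolation step at the core of SMOTE: a synthetic point is placed at a random position on the line segment between a minority sample and one of its minority-class neighbors. The function name `smote_point` and the toy vectors are illustrative assumptions, not the imbalanced-learn implementation.

```python
import random

def smote_point(x, neighbor, rng):
    """Interpolate between a minority sample and one of its minority neighbors."""
    lam = rng.random()  # random step in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [0.0, 1.0]          # toy minority sample (hypothetical)
neighbor = [1.0, 0.0]   # toy minority neighbor (hypothetical)
synthetic = smote_point(x, neighbor, rng)

# Each coordinate lies between the corresponding coordinates of the two parents
print(synthetic)
```

Nothing in this step checks whether a majority-class point sits on or near that segment, which is how a synthetic "minority" sample can end up in majority territory.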


In [3]:
# Initial class distribution before sampling

train_sample['label'].value_counts(normalize = True).round(2)


label  proportion
0      0.87
1      0.13


In [4]:
# vectorize all the data
vectorizer = TfidfVectorizer(min_df = 0.01)

# Fit and transform the text data to create TF-IDF vectors
train_tfidf_mat = vectorizer.fit_transform(train_sample['sms'])
test_tfidf_mat = vectorizer.transform(test_sample['sms'])

In [7]:
# baseline model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(train_tfidf_mat, train_sample['label'])

# Make predictions on the testing data
y_pred = nb_classifier.predict(test_tfidf_mat)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_sample['label'], y_pred)
print(f"Accuracy: {round(accuracy, 2)}")  # built-in round() also works if accuracy is a plain float

# Print the classification report
report = classification_report(test_sample['label'], y_pred)
print("Classification Report:")
print(report)

Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1156
           1       0.99      0.82      0.89       182

    accuracy                           0.97      1338
   macro avg       0.98      0.91      0.94      1338
weighted avg       0.97      0.97      0.97      1338



In [5]:
# oversample minority class
smote = SMOTE(sampling_strategy = 'auto', random_state=42)
y_train = train_sample['label']
X_res_smote, y_res_smote = smote.fit_resample(train_tfidf_mat, y_train)

# determine which are the synthetic samples

# Convert to sets for comparison
original_data = set(map(tuple, train_tfidf_mat.toarray()))
resampled_data = set(map(tuple, X_res_smote.toarray()))

# Find the synthetic samples (new data added by SMOTE)
synthetic_samples = resampled_data - original_data
synthetic_samples = np.array(list(synthetic_samples))

print("Added data size:", synthetic_samples.shape[0])

Added data size: 1327
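
One caveat with the set-difference approach above: sets collapse duplicate rows, so the count can understate the number of added samples if SMOTE produces identical vectors, or if a synthetic vector coincides with an original one. The toy rows below are hypothetical and only illustrate the effect. An alternative, assuming imbalanced-learn's SMOTE appends the new rows after the original ones (worth verifying for the installed version), is to slice the resampled matrix from `train_tfidf_mat.shape[0]` onward.

```python
# Toy stand-ins for the TF-IDF rows (hypothetical values)
original_rows = [(0.0, 1.0), (1.0, 0.0)]
# Two identical added rows, plus one that coincides with an original row
added_rows = [(0.5, 0.5), (0.5, 0.5), (0.0, 1.0)]
resampled_rows = original_rows + added_rows

# Set difference, as in the cell above: duplicates and coincident rows vanish
by_set_difference = set(resampled_rows) - set(original_rows)

# Positional slicing: keeps every added row, assuming added rows come last
by_position = resampled_rows[len(original_rows):]

print(len(by_set_difference), len(by_position))  # prints: 1 3
```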


In [8]:
# retrain with augmented data - with sampling strategy 'auto'
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(X_res_smote, y_res_smote)

# Make predictions on the testing data
y_pred = nb_classifier.predict(test_tfidf_mat)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_sample['label'], y_pred)
print(f"Accuracy: {round(accuracy, 2)}")  # built-in round() also works if accuracy is a plain float

# Print the classification report
report = classification_report(test_sample['label'], y_pred)
print("Classification Report:")
print(report)


Accuracy: 0.93
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.93      0.96      1156
           1       0.68      0.94      0.79       182

    accuracy                           0.93      1338
   macro avg       0.84      0.94      0.87      1338
weighted avg       0.95      0.93      0.94      1338



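The model trained on the SMOTE-balanced data performs worse overall than the one trained on the imbalanced data. Minority-class recall rose from 0.82 to 0.94, but precision fell from 0.99 to 0.68, pulling the minority-class F1-score down from 0.89 to 0.79 and accuracy from 0.97 to 0.93.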
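
A rough back-of-the-envelope check, using only the rounded minority-class (label 1) figures from the two reports above (support 182), shows what the precision drop means in absolute terms. Because the inputs are rounded to two decimals, the counts are approximate; the helper name `approx_counts` is just for illustration.

```python
def approx_counts(precision, recall, support):
    """Estimate true and false positives from rounded precision and recall."""
    tp = recall * support        # recall = TP / support
    fp = tp / precision - tp     # precision = TP / (TP + FP)
    return round(tp), round(fp)

minority_support = 182
baseline_tp, baseline_fp = approx_counts(0.99, 0.82, minority_support)
smote_tp, smote_fp = approx_counts(0.68, 0.94, minority_support)

print(f"baseline: ~{baseline_tp} true positives, ~{baseline_fp} false positives")
print(f"SMOTE:    ~{smote_tp} true positives, ~{smote_fp} false positives")
```

Under these estimates, the SMOTE model catches on the order of 20 more minority messages, at the cost of on the order of 80 more majority messages misclassified as minority.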

#### WHY?
To understand, we will check whether the generated data points really resemble the minority class.


In [11]:
# transform the pool and add to faiss index to check similarity with the synthetic samples
pool_sample_tfidf = vectorizer.transform(pool_sample['sms'])

# Create the FAISS index; first convert the TF-IDF vectors to the dense float32 format FAISS needs.
# IndexFlatIP ranks by inner product. TfidfVectorizer L2-normalizes rows by default,
# so for these pool vectors the ranking matches cosine similarity.
pool_tfidf_emb = pool_sample_tfidf.toarray().astype('float32') # the embedding search space
emb_len = pool_tfidf_emb.shape[1]
index = faiss.IndexFlatIP(emb_len) # pass the embedding dimension
index.add(pool_tfidf_emb)  # Add embeddings to the index
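
Why an inner-product index is enough here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so every non-empty pool vector has unit length and its inner product with a query equals the cosine similarity scaled by the query's norm. That scaling is the same for every pool vector, so the top-1 neighbor is unchanged even though SMOTE's interpolated queries are generally not unit length. A small pure-Python sketch with made-up vectors (the helper names are illustrative):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Hypothetical pool vectors, unit-normalized like default TF-IDF rows
pool = [l2_normalize(v) for v in ([3.0, 1.0, 0.0], [0.0, 2.0, 2.0], [1.0, 0.0, 4.0])]
# Hypothetical query that is NOT unit length, like an interpolated SMOTE sample
query = [0.2, 0.5, 0.4]

top1_by_ip = max(range(len(pool)), key=lambda i: dot(query, pool[i]))
top1_by_cos = max(range(len(pool)), key=lambda i: cosine(query, pool[i]))
print(top1_by_ip, top1_by_cos)  # same index under both scores
```

The exception is a pool row that is all zeros (a message with no in-vocabulary terms): it scores 0 against every query and has no defined cosine similarity.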

In [12]:
# function to get the top k pool entries most similar to each query (here, the synthetic samples)

from tqdm import tqdm

def get_top_k(queries_dense, faiss_index, k):

    similar_indices = []
    for i in tqdm(range(queries_dense.shape[0])):
        # FAISS expects a 2-D float32 array of shape (n_queries, dim)
        query_vector = queries_dense[i].reshape(1, -1).astype('float32')
        # search the index passed in, rather than relying on the global `index`;
        # for IndexFlatIP, `distances` holds inner-product scores (higher = more similar)
        distances, indices = faiss_index.search(query_vector, k)
        similar_indices.append({"indices": indices, "distances": distances})
    sim_ind_df = pd.DataFrame(similar_indices)

    return sim_ind_df

In [13]:
# use the function to get top k similar data indices
sim_ind_df_pool = get_top_k(queries_dense = synthetic_samples, faiss_index = index, k = 1)

100%|██████████| 1327/1327 [00:00<00:00, 8444.66it/s]


In [14]:
# get the counts of the labels corresponding to the most similar examples from pool data
results = pool_sample.iloc[sim_ind_df_pool['indices']]['label'].value_counts().to_dict()

# calculate the error percentage
error = 100*results[0]/(results[0] + results[1])
print(f"error: {round(error, 2)}")

error: 11.91
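
The percentage above is computed from a dictionary lookup that would raise a `KeyError` if every nearest neighbor happened to belong to one class. A slightly more defensive sketch of the same calculation, using made-up neighbor labels (0 = majority, 1 = minority) and a hypothetical helper name:

```python
from collections import Counter

def majority_neighbor_pct(neighbor_labels, majority_label=0):
    """Percentage of synthetic samples whose nearest pool neighbor is majority-class."""
    counts = Counter(neighbor_labels)  # missing labels count as 0
    total = sum(counts.values())
    return 100 * counts[majority_label] / total if total else 0.0

# Hypothetical labels of the top-1 pool neighbor for eight synthetic samples
labels = [1, 1, 0, 1, 1, 1, 1, 1]
print(round(majority_neighbor_pct(labels), 2))  # 12.5
```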


Checking each synthetic sample against the pool, we find that about 12% of them have a majority-class example as their nearest neighbor rather than a minority-class one. These misplaced synthetic points are a likely driver of the drop in model performance.