<a href="https://colab.research.google.com/github/rajdeepbanerjee-git/DataCentricAI_model_retraining/blob/main/Medium_notebook_DataCentricAI_FAISS%2BTFIDF_ham_spam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets faiss-cpu


In [None]:
# checking whether faiss is installed properly
import faiss
print(faiss.__version__)

1.8.0


In [None]:
# import rest of the libraries
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt

We will simulate a scenario where you have an imbalanced, annotated dataset and a pool of data that is not annotated. You have limited resources for annotation, so you need to choose judiciously which data to annotate in order to increase your model performance the most.

In the following, we take the ham-spam dataset and show how an active-learning-style approach can find the useful data to add to the training set, so that performance on the test set improves.

To demonstrate the method, we first create three sets of data: the training data, one held-out test set, and a data pool that we will use to augment the training data so that the model performs better. This is a data-centric AI methodology that can help you build better models once hyperparameter tuning no longer yields improvements.

In [None]:
# Load the dataset
dataset = load_dataset("sms_spam")
df = pd.DataFrame(dataset['train'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/3.21k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/4.87k [00:00<?, ?B/s]

The repository for sms_spam contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/sms_spam.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/5574 [00:00<?, ? examples/s]

In [None]:
df['label'].value_counts(normalize = True).round(2) # the dataset is already imbalanced: spam (label 1) is 13% and ham (label 0) is 87%

label
0    0.87
1    0.13
Name: proportion, dtype: float64

We will keep 40% of the data as the data pool, and the remaining 60% will be divided into train and test sets. We will train the model and then use a TF-IDF based vector search to choose which pool data to include for training.

Note: One can easily go and take more 'spam' labelled data to include in the training set. But there can be cases where you don't really have labeled data in the pool, and you cannot spend resources to annotate all your data in the pool. So you have to have some method to choose what data you want for training.
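As a quick sanity check on the split described above, the arithmetic can be sketched without loading any data. The helper name `split_sizes` is hypothetical; it only mirrors the two successive 60% cuts with integer truncation, applied to the 5574 rows of the dataset.

```python
def split_sizes(n_rows, traintest_fraction=0.6, train_fraction=0.6):
    """Return (train, test, pool) sizes for two successive fractional splits.

    Hypothetical helper: it only reproduces the size arithmetic, not the sampling.
    """
    n_traintest = int(n_rows * traintest_fraction)  # rows kept for train + test
    n_pool = n_rows - n_traintest                   # the rest becomes the pool
    n_train = int(n_traintest * train_fraction)     # second cut inside train + test
    n_test = n_traintest - n_train
    return n_train, n_test, n_pool


sizes = split_sizes(5574)
print(sizes)  # (2006, 1338, 2230)
```

Note that the final training set is 60% of 60%, i.e. about 36% of all rows.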

In [None]:
def sample_df(df, train_fraction, random_state = 42):
  # draw a reproducible random sample; the remaining rows form the second split
  sample_size = int(len(df)*train_fraction)
  train_sample = df.sample(n=sample_size, random_state=random_state)
  test_sample = df.drop(train_sample.index)

  return train_sample, test_sample

In [None]:
# random_state = 42 is the seed behind the outputs shown in this notebook
traintest_sample, pool_sample = sample_df(df = df, train_fraction = 0.6, random_state = 42)
train_sample, test_sample = sample_df(df = traintest_sample, train_fraction = 0.6, random_state = 42)

print(f"train sample size: {len(train_sample)}, \n test sample size: {len(test_sample)}, \n pool sample size: {len(pool_sample)}")

train sample size: 2006, 
 test sample size: 1338, 
 pool sample size: 2230


For the sake of simplicity, I am not applying any pre-processing to the data. In principle, you would want a text preprocessing step for all your data, which generally includes removing special characters and stop words, stemming/lemmatization, etc.

We will fit the vectorizer on the training data only and use it to transform the test data and the pool data. The pool data embeddings will serve as the search space.
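To make the "fit on train, transform the rest" idea concrete, here is a minimal pure-Python sketch of how a vocabulary is learned from training documents with a document-frequency cutoff like the `min_df` used in the notebook's vectorizer. The functions `fit_vocabulary` and `known_tokens` are hypothetical, and whitespace tokenization is a simplifying assumption (scikit-learn's tokenizer behaves differently).

```python
from collections import Counter


def fit_vocabulary(train_docs, min_df):
    """Keep tokens that occur in at least min_df * n_docs training documents."""
    doc_freq = Counter()
    for doc in train_docs:
        # count each token once per document (document frequency)
        doc_freq.update(set(doc.lower().split()))
    min_count = min_df * len(train_docs)
    return sorted(tok for tok, df in doc_freq.items() if df >= min_count)


def known_tokens(doc, vocabulary):
    """Tokens of a new document that survive when projected onto the vocabulary."""
    vocab = set(vocabulary)
    return [tok for tok in doc.lower().split() if tok in vocab]


train_docs = ["free prize now", "call me now", "see you now", "free call"]
vocabulary = fit_vocabulary(train_docs, min_df=0.5)
pool_tokens = known_tokens("win a free prize", vocabulary)

print(vocabulary)   # ['call', 'free', 'now']
print(pool_tokens)  # ['free']
```

Tokens that the training data never saw (or saw too rarely) contribute nothing to the test and pool vectors, which is why the feature count is identical across all three matrices.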

In [None]:
# Initialize the TF-IDF Vectorizer. To reduce the number of features and the sparsity, I have used min_df = 0.01.
# This means we only consider tokens that appear in at least 1% of the training documents
vectorizer = TfidfVectorizer(min_df = 0.01)

# Fit and transform the text data to create TF-IDF vectors
train_tfidf_mat = vectorizer.fit_transform(train_sample['sms'])
test_tfidf_mat = vectorizer.transform(test_sample['sms'])
pool_tfidf_mat = vectorizer.transform(pool_sample['sms'])

print(train_tfidf_mat.shape, test_tfidf_mat.shape, pool_tfidf_mat.shape)

(2006, 208) (1338, 208) (2230, 208)


In [None]:
# baseline model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(train_tfidf_mat, train_sample['label'])

# Make predictions on the testing data
y_pred = nb_classifier.predict(test_tfidf_mat)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_sample['label'], y_pred)
print(f"Accuracy: {accuracy.round(2)}")

# Print the classification report
report = classification_report(test_sample['label'], y_pred)
print("Classification Report:")
print(report)


Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1156
           1       0.99      0.82      0.89       182

    accuracy                           0.97      1338
   macro avg       0.98      0.91      0.94      1338
weighted avg       0.97      0.97      0.97      1338



Overall accuracy is high, but recall on the 'spam' class (label 1) is only 0.82, so the model misses roughly one in five spam messages. We will now augment the training data in a targeted way so that the recall of the 'spam' class increases.

We will use FAISS to do a vector-based similarity search. For that, we need to create a FAISS index over the pool data embeddings.
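For intuition, an exact (flat) L2 search can be sketched as a brute-force NumPy computation. This is not FAISS itself: `l2_top_k` is a hypothetical helper and the tiny arrays are made up for illustration. Like `IndexFlatL2`, it ranks by squared Euclidean distance.

```python
import numpy as np


def l2_top_k(queries, pool, k):
    """Brute-force top-k neighbours by squared L2 distance (hypothetical helper)."""
    # pairwise differences, shape (n_queries, n_pool, dim)
    diffs = queries[:, None, :] - pool[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # indices of the k smallest distances per query, nearest first
    idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx


# toy pool of 4 vectors and a single query
pool = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = l2_top_k(queries, pool, k=2)
print(indices)  # [[1 0]]
```

A brute-force scan like this is fine for a few thousand pool rows; an index library becomes worthwhile as the pool grows.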

In [None]:
# Create the FAISS index; first convert the sparse TF-IDF vectors to the dense float32 format that faiss needs
# L2 distance is used for similarity; IndexFlatIP can be used for inner-product-based similarity
pool_tfidf_emb = pool_tfidf_mat.toarray().astype('float32') # the embedding search space
emb_len = pool_tfidf_emb.shape[1]
index = faiss.IndexFlatL2(emb_len) # pass the length of the embedding
index.add(pool_tfidf_emb)  # Add embeddings to the index

In [None]:
# We will check false negatives, i.e., where the y_true = 1 but y_pred = 0.
FN_cases = test_sample[(test_sample['label'] == 1) & (y_pred == 0)]
print(len(FN_cases))
FN_cases.head()

33


Unnamed: 0,sms,label
611,22 days to kick off! For Euro2004 U will be ke...,1
3792,"Twinks, bears, scallies, skins and jocks are c...",1
227,Will u meet ur dream partner soon? Is ur caree...,1
2663,Hello darling how are you today? I would love ...,1
4821,Check Out Choose Your Babe Videos @ sms.shsex....,1


We will use these as our queries to search from the pool ...
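As a consistency check, the 33 false negatives found above should line up with the baseline recall in the classification report, given the 182 spam messages in the test set. A minimal sketch (the `recall` helper is written here for illustration):

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


spam_support = 182      # spam messages in the test set (from the report)
false_negatives = 33    # spam messages predicted as ham
true_positives = spam_support - false_negatives

baseline_recall = recall(true_positives, false_negatives)
print(round(baseline_recall, 2))  # 0.82
```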

In [None]:
queries = vectorizer.transform(FN_cases['sms'])
queries.shape

(33, 208)

In [None]:
# function to get top k similar data from the pool, that are similar to queries (FN cases)

from tqdm import tqdm

def get_top_k(query_matrix, faiss_index, k):
    # first change the query vectors from sparse to the dense float32 format that faiss accepts
    queries_dense = query_matrix.toarray().astype('float32')
    similar_indices = []
    for i in tqdm(range(queries_dense.shape[0])):
        query_vector = queries_dense[i].reshape(1, -1)
        # faiss index search: returns the k nearest pool rows and their distances
        distances, indices = faiss_index.search(query_vector, k)
        similar_indices.append({"indices": indices, "distances": distances})
    sim_ind_df = pd.DataFrame(similar_indices)

    return sim_ind_df

In [None]:
# use the function to get top k similar data indices
sim_ind_df = get_top_k(query_matrix = queries, faiss_index = index, k = 2)
sim_ind_df.head()

100%|██████████| 33/33 [00:00<00:00, 4884.15it/s]


Unnamed: 0,indices,distances
0,"[[411, 1060]]","[[0.69709444, 0.69709444]]"
1,"[[1789, 1635]]","[[0.9824891, 0.9915572]]"
2,"[[533, 1526]]","[[0.75268394, 0.77859056]]"
3,"[[64, 2007]]","[[0.7980639, 0.8427008]]"
4,"[[1076, 1290]]","[[0.558071, 0.6304729]]"


Now we will add the pool rows at these indices to the training data and retrain.
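One detail worth keeping in mind: different queries can return the same pool row, in which case that row is added more than once. The sketch below uses made-up index values and shows how the flattened list could be inspected or deduplicated with `np.unique`. Whether to drop duplicates is a judgment call; the notebook keeps the indices as they are.

```python
import numpy as np

# Made-up search results: one array of shape (1, k) per query, as returned by a top-k search
per_query_indices = [
    np.array([[4, 10]]),
    np.array([[7, 4]]),   # row 4 is also the nearest neighbour of the first query
    np.array([[2, 9]]),
]

# flatten to a 1-D array of pool row positions
flat = np.concatenate([ind[0] for ind in per_query_indices])
# optional: keep each pool row only once
unique = np.unique(flat)

print(len(flat), len(unique))  # 6 5
```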

In [None]:
# now we get the indices of the data to add to the training set
def get_data_ind_add(similar_data_df):
  ind_list = []
  for i in range(len(similar_data_df)):
    ind_list.append(similar_data_df['indices'].loc[i][0])

  ind_list = np.concatenate(ind_list)

  return ind_list

In [None]:
# get data to add
data_ind_to_add = get_data_ind_add(similar_data_df = sim_ind_df)

In [None]:
# add the data - the matrices are sparse, so use scipy's sparse vstack instead of numpy's vstack
from scipy.sparse import vstack

augmented_train_tfidf = vstack([train_tfidf_mat, pool_tfidf_mat[data_ind_to_add]])
print(augmented_train_tfidf.shape)

(2072, 208)

In [None]:
# get the corresponding y labels
augmented_y = pd.concat([train_sample['label'], pool_sample['label'].iloc[data_ind_to_add]]) # series concatenation
print(augmented_y.shape)

(2072,)


In [None]:
# retrain with augmented data

# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(augmented_train_tfidf, augmented_y)

# Make predictions on the testing data
y_pred = nb_classifier.predict(test_tfidf_mat)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_sample['label'], y_pred)
print(f"Accuracy: {accuracy.round(2)}")

# Print the classification report
report = classification_report(test_sample['label'], y_pred)
print("Classification Report:")
print(report)


Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1156
           1       0.98      0.86      0.91       182

    accuracy                           0.98      1338
   macro avg       0.98      0.93      0.95      1338
weighted avg       0.98      0.98      0.98      1338



## **Note how recall for class 1 (spam) improved by 4 percentage points (0.82 to 0.86) with only 66 data points added (about 3.3% of the training data)!**
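The headline numbers can be recomputed from the figures reported above. This is plain arithmetic on the rounded values in the two classification reports, so the gains are only as precise as those two-decimal figures.

```python
added_rows = 66                          # pool rows added to the training set
train_rows = 2006                        # original training set size
recall_before, recall_after = 0.82, 0.86  # spam recall from the two reports

share_added = added_rows / train_rows            # fraction of extra training data
absolute_gain = recall_after - recall_before     # gain in recall (percentage points / 100)
relative_gain = absolute_gain / recall_before    # gain relative to the baseline recall

print(f"{share_added:.1%} {absolute_gain:.2f} {relative_gain:.1%}")  # 3.3% 0.04 4.9%
```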