<a href="https://colab.research.google.com/github/rajdeepbanerjee-git/Data_Augmentation_LLM/blob/main/MA2_p1_dataset_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We already analyzed the ham/spam dataset in the previous article; for completeness, we repeat part of that analysis here.

In [1]:
!pip install datasets


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Load the dataset
dataset = load_dataset("sms_spam")
df = pd.DataFrame(dataset['train'])

# check class distribution
df['label'].value_counts(normalize = True).round(2) # the dataset is already imbalanced: about 13% spam (label 1) and 87% ham (label 0)

label  proportion
0      0.87
1      0.13


We will hold out 40% of the data as a data pool and divide the remaining 60% into train and test sets (again 60/40, i.e. roughly 36% and 24% of the full dataset). We will train the model and then use a TF-IDF-based vector search to choose which pool data to add to the training set.

Note: One could simply take more spam-labelled data from the pool and add it to the training set. In practice, however, the pool is often unlabelled, and you may not have the resources to annotate all of it. So you need some method for choosing which data to bring into training.

In [4]:
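def sample_df(df, train_fraction, random_state = 42):
    """Randomly split df into a sample holding train_fraction of the rows and the remainder."""
    sample_size = int(len(df) * train_fraction)
    # draw the first split with the caller's seed
    train_sample = df.sample(n=sample_size, random_state=random_state)
    # every row not drawn above goes to the second split
    test_sample = df.drop(train_sample.index)

    return train_sample, test_sample

In [5]:
# seed 42 reproduces the splits and results reported below
traintest_sample, pool_sample = sample_df(df = df, train_fraction = 0.6, random_state = 42)
train_sample, test_sample = sample_df(df = traintest_sample, train_fraction = 0.6, random_state = 42)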

print(f"train sample size: {len(train_sample)}, \n test sample size: {len(test_sample)}, \n pool sample size: {len(pool_sample)}")

train sample size: 2006, 
 test sample size: 1338, 
 pool sample size: 2230
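
The three sizes above follow directly from applying the 0.6 fraction twice. As a quick check, the arithmetic can be sketched as below; `n_total` is the sum of the three printed sizes, and the variable names are only illustrative.

```python
# Reproduce the split sizes from the two-step 60/40 sampling.
n_total = 2006 + 1338 + 2230            # 5574 rows in the full dataset

n_traintest = int(n_total * 0.6)        # first split: train+test vs. pool
n_pool = n_total - n_traintest

n_train = int(n_traintest * 0.6)        # second split: train vs. test
n_test = n_traintest - n_train

for name, n in [("train", n_train), ("test", n_test), ("pool", n_pool)]:
    print(f"{name}: {n} rows ({n / n_total:.0%} of the data)")
```

So the model is trained on only about a third of the data, with the largest share held back as the pool.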


In [6]:
train_sample.to_csv("train.csv", index = False)
test_sample.to_csv("test.csv", index = False)
pool_sample.to_csv("pool.csv", index = False)

In [7]:
# Initialize the TF-IDF vectorizer. To reduce the number of features and the sparsity, we use min_df = 0.01.
# This keeps only tokens that appear in at least 1% of the training documents
vectorizer = TfidfVectorizer(min_df = 0.01)

# Fit and transform the text data to create TF-IDF vectors
train_tfidf_mat = vectorizer.fit_transform(train_sample['sms'])
test_tfidf_mat = vectorizer.transform(test_sample['sms'])

In [8]:
# baseline model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(train_tfidf_mat, train_sample['label'])

# Make predictions on the testing data
y_pred = nb_classifier.predict(test_tfidf_mat)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_sample['label'], y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print the classification report
report = classification_report(test_sample['label'], y_pred)
print("Classification Report:")
print(report)


Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      1156
           1       0.99      0.82      0.89       182

    accuracy                           0.97      1338
   macro avg       0.98      0.91      0.94      1338
weighted avg       0.97      0.97      0.97      1338



Although overall accuracy is 0.97, the recall of the spam class (label 1) is only 0.82, so the model misses roughly one in five spam messages. We will now use a targeted way of augmenting the data so that the recall of the spam class increases.

We will use FAISS for vector-based similarity search. For that, we need to create a FAISS index over the pool data embeddings.
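
As a rough illustration of the similarity-search idea, the sketch below ranks pool messages by cosine similarity to a query message. It swaps in scikit-learn's `cosine_similarity` on TF-IDF vectors in place of a FAISS index, and the toy `pool_texts` and `query_texts` are made up for the example, so treat it as a sketch of the concept rather than the retrieval code used later.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data, invented for this example.
pool_texts = [
    "win a free prize now",
    "are we meeting for lunch today",
    "free entry to win cash prize",
    "call me when you get home",
]
query_texts = ["claim your free cash prize"]

# Assumption: the vectorizer is fit on the pool here; in the notebook
# it is fit on the training sample.
vectorizer = TfidfVectorizer()
pool_mat = vectorizer.fit_transform(pool_texts)
query_mat = vectorizer.transform(query_texts)

# Similarity matrix of shape (n_queries, n_pool).
sims = cosine_similarity(query_mat, pool_mat)

# Indices of the top_k most similar pool messages for each query.
top_k = 2
nearest = np.argsort(-sims, axis=1)[:, :top_k]

for rank, idx in enumerate(nearest[0], start=1):
    print(rank, pool_texts[idx])
```

A FAISS index serves the same purpose but is built for much larger pools, where a full pairwise similarity matrix would be too expensive.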

In [9]:
# We will check false negatives, i.e., where the y_true = 1 but y_pred = 0.
FN_cases = test_sample[(test_sample['label'] == 1) & (y_pred == 0)]
print(len(FN_cases))
FN_cases.head()

# let's save these FN cases, so that these can be used later
FN_cases.to_csv("FN_cases.csv", index = False)

33
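
These 33 false negatives account for the spam recall in the classification report. A minimal check, using only the numbers printed above (spam support of 182 and 33 missed cases):

```python
# Recall = true positives / all actual positives.
support_spam = 182        # spam messages in the test set (from the report)
false_negatives = 33      # spam messages predicted as ham

true_positives = support_spam - false_negatives
recall = true_positives / support_spam

print(f"spam recall: {recall:.2f}")
```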


We will now load the saved FN cases created using another notebook. These spam cases will be used for few-shot learning to generate more such samples, which can then be used for data augmentation.

In [10]:
FN_cases_df = pd.read_csv("/content/FN_cases.csv")
FN_cases_df.head()

   sms                                                label
0  22 days to kick off! For Euro2004 U will be ke...  1
1  Twinks, bears, scallies, skins and jocks are c...  1
2  Will u meet ur dream partner soon? Is ur caree...  1
3  Hello darling how are you today? I would love ...  1
4  Check Out Choose Your Babe Videos @ sms.shsex....  1
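
One way the few-shot step might look is sketched below: a handful of missed spam messages are placed into a plain-text prompt that asks a language model for similar examples. The helper `build_few_shot_prompt`, the prompt wording, and the placeholder messages are all hypothetical; in the notebook the examples would come from `FN_cases_df['sms']`.

```python
def build_few_shot_prompt(examples, n_new=5):
    """Assemble a plain-text few-shot prompt from example spam messages."""
    lines = [
        "The following SMS messages are spam that a classifier failed to detect:",
        "",
    ]
    # Number each example so the model can see them as separate items.
    for i, text in enumerate(examples, start=1):
        lines.append(f"Example {i}: {text}")
    lines.append("")
    lines.append(f"Write {n_new} new spam messages in a similar style, one per line.")
    return "\n".join(lines)


# Placeholder messages for illustration only.
examples = [
    "Placeholder spam message A",
    "Placeholder spam message B",
]

prompt = build_few_shot_prompt(examples, n_new=3)
print(prompt)
```

The generated messages would then be labelled as spam and appended to the training set before retraining.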
