<a href="https://colab.research.google.com/github/rajdeepbanerjee-git/Data_Augmentation_LLM/blob/main/MA2_p3_cleaning_and_retraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the previous notebook, we saw how to generate synthetic spam messages similar to our false negative cases. Now we will take the LLM output, clean it, use it to augment the training data, and retrain our model.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
def extract_responses(file_path, start_tag="<START>", end_tag="<END>"):
    """
    - Read the LLM output from a text file.
    - Extract the text situated between the <START> and <END> markers.
    """

    extracted_lines = []
    with open(file_path, 'r') as file:
        capture = False
        for line in file:
            line = line.strip()
            if start_tag in line:
                capture = True
                extracted_lines.append(line.replace(start_tag, '').strip())
            elif end_tag in line:
                capture = False
                # Guard against an end tag that appears before any start tag
                if extracted_lines:
                    extracted_lines[-1] += ' ' + line.replace(end_tag, '').strip()
            elif capture:
                extracted_lines[-1] += ' ' + line.strip()

    return extracted_lines

In [None]:
file_path = "/content/llm_output_final.txt"
extracted_response = extract_responses(file_path = file_path, start_tag="<START>", end_tag="<END>")
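
To make the extraction step concrete, here is a minimal sketch of the same idea on an in-memory string. It uses a regular expression instead of the line-by-line scan above, so it is an alternative illustration rather than the function used in this notebook. The name `extract_between_tags` and the `sample` text are hypothetical.

```python
import re

def extract_between_tags(text, start_tag="<START>", end_tag="<END>"):
    """Return the whitespace-normalised text found between each tag pair."""
    # Non-greedy match so each <START> pairs with the nearest <END>
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    matches = re.findall(pattern, text, flags=re.DOTALL)
    # Collapse newlines and repeated spaces into single spaces
    return [" ".join(m.split()) for m in matches]

# Hypothetical LLM output, made up for illustration
sample = """<|assistant|> Sure, here are some examples.
<START> Win a free phone today!
Reply YES to claim. <END>
some stray text
<START>Your account is locked, call now.<END>
"""

print(extract_between_tags(sample))
# ['Win a free phone today! Reply YES to claim.', 'Your account is locked, call now.']
```

One difference worth noting: the regex version only returns blocks that have a closing tag, whereas the line-based function also keeps a block whose `<END>` marker is missing.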

In [None]:
def post_process(extracted_lines):
    """
    - remove list entries that contain the '<|user|>' or '<|end|>' tags
    - remove special tags like '<END>', '####', '<|assistant|>'
    - deduplicate list entries (there are many duplicated entries due to hallucination)

    Note: these post processing steps can vary based on the prompt and the model
    """

    processed_lines = []
    seen_lines = set()

    for line in extracted_lines:
        # Skip entries that contain the '<|user|>' or '<|end|>' tags
        if '<|user|>' in line or '<|end|>' in line:
            continue

        # Remove special tags like '<|assistant|>', '<END>', '####'
        line = line.replace('<|assistant|>', '').replace('<END>', '').replace('####', '').strip()

        # Skip if the line is empty
        if not line:
            continue

        # Deduplicate entries
        if line not in seen_lines:
            seen_lines.add(line)
            processed_lines.append(line)

    return processed_lines
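
As a quick sanity check of these cleaning rules, the sketch below applies the same three steps (drop, strip tags, deduplicate) to a small made-up list. The function `clean_and_dedupe` and the `toy` entries are hypothetical stand-ins written for this illustration.

```python
def clean_and_dedupe(entries,
                     drop_if=("<|user|>", "<|end|>"),
                     strip_tags=("<|assistant|>", "<END>", "####")):
    """Drop chat-template entries, strip leftover tags, keep first occurrences."""
    cleaned = []
    seen = set()
    for entry in entries:
        # Step 1: skip entries that contain a chat-template tag
        if any(tag in entry for tag in drop_if):
            continue
        # Step 2: strip leftover special tags
        for tag in strip_tags:
            entry = entry.replace(tag, "")
        entry = entry.strip()
        # Step 3: skip empties and exact duplicates, preserving order
        if entry and entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned

# Made-up entries for illustration
toy = [
    "Win a free phone today!",
    "Win a free phone today! <END>",
    "<|user|> write more spam",
    "####",
    "<|assistant|> Claim your prize now.",
]

print(clean_and_dedupe(toy))
# ['Win a free phone today!', 'Claim your prize now.']
```

Note that deduplication here is exact-match only, so two messages that differ by a single character are both kept.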

In [None]:
# clean LLM output
cleaned_lines = post_process(extracted_lines = extracted_response)

# save the cleaned output, one example per line
file_path = '/content/cleaned_llm_output.txt'
with open(file_path, 'w') as file:
    for line in cleaned_lines:
        file.write(line + '\n')


In [None]:
# to read
file_path = '/content/cleaned_llm_output.txt'
with open(file_path, 'r') as file:
  lines = [line.strip() for line in file if line.strip()]

print("total examples", len(lines))

import pprint as pp
pp.pprint(lines)

total examples 41
["Hey there! I've got a once-in-a-lifetime deal for you! Click on this link "
 "and get your hands on a rare, limited edition collector's item. Hurry up, "
 "this offer won't last forever!",
 "Attention all O2 users! You've been selected for an exclusive offer. Click "
 "on this link and claim your free upgrade to the premium plan. Don't miss out "
 'on this amazing opportunity!',
 "URGENT! You've been chosen to receive a special gift from your favorite "
 'celebrity. Click on this link to claim your prize and get a sneak peek into '
 'their personal life. This is a once-in-a-lifetime opportunity!',
 "Hello, dear friend! I'm reaching out to you with a special offer. Click on "
 'this link and get a free vacation package to your dream destination. Hurry '
 "up, this offer won't last forever!",
 "Attention all Vodafone users! You've been selected for a special offer. "
 'Click on this link and claim your free subscription to our premium music '
 "streaming service. Don'

Now we will add these examples to our training data and retrain the model.

- we will create a dataset with 'sms' and 'label' columns
- add it to the training data
- transform the text into TF-IDF vectors
- build the model and evaluate it on the test data

In [None]:
# reading in the train and test datasets created before
train_sample = pd.read_csv('/content/train.csv')
test_sample = pd.read_csv('/content/test.csv')
print(train_sample.iloc[0:5],'\n', test_sample.iloc[0:5])

                                                 sms  label
0  Todays Vodafone numbers ending with 4882 are s...      1
1  Yes. They replied my mail. I'm going to the ma...      0
2            Super da:)good replacement for murali\n      0
3  Sorry I missed your call let's talk when you h...      1
4  Todays Voda numbers ending 5226 are selected t...      1 
                                                  sms  label
0  "HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...      0
1  Sorry i've not gone to that place. I.ll do so ...      0
2            When are you going to ride your bike?\n      0
3  Daddy, shu shu is looking 4 u... U wan me 2 te...      0
4                          What you did in  leave.\n      0


In [None]:
# creating dataframe to add to the train data
df_to_add = pd.DataFrame({'sms': lines, 'label': [1] * len(lines)})
print(df_to_add.iloc[0:5])
# concat the df with the train data
aug_train_sample = pd.concat([train_sample, df_to_add], ignore_index=True)
print("before augmentation:", len(train_sample), "\nafter augmentation:", len(aug_train_sample))

                                                 sms  label
0  Hey there! I've got a once-in-a-lifetime deal ...      1
1  Attention all O2 users! You've been selected f...      1
2  URGENT! You've been chosen to receive a specia...      1
3  Hello, dear friend! I'm reaching out to you wi...      1
4  Attention all Vodafone users! You've been sele...      1
before augmentation: 2006 
after augmentation: 2047


In [None]:
# vectorize the data
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df = 0.01)

# Fit and transform the text data to create TF-IDF vectors
train_tfidf_mat = vectorizer.fit_transform(aug_train_sample['sms'])
test_tfidf_mat = vectorizer.transform(test_sample['sms'])

print(train_tfidf_mat.shape, test_tfidf_mat.shape)

(2047, 216) (1338, 216)
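
A note on `min_df = 0.01`: when `min_df` is a float, scikit-learn treats it as a proportion of documents, and terms whose document frequency falls below that threshold are dropped. The plain-Python sketch below illustrates that rule. It is not scikit-learn's implementation, the helper `min_doc_count` is hypothetical, and the whitespace tokenisation is a simplification.

```python
import math
from collections import Counter

def min_doc_count(min_df, n_docs):
    """Smallest number of documents a term must appear in to be kept."""
    if isinstance(min_df, int):
        return min_df                   # integer: absolute document count
    return math.ceil(min_df * n_docs)   # float: proportion of documents

# With the 2047 augmented training messages, a term must appear in 21 of them
print(min_doc_count(0.01, 2047))  # 21

# Toy corpus (made up) to show the filtering effect
docs = ["free prize now", "free call now", "see you later", "call me later"]
doc_freq = Counter(term for doc in docs for term in set(doc.split()))
threshold = min_doc_count(0.5, len(docs))
kept = sorted(term for term, df in doc_freq.items() if df >= threshold)
print(kept)  # ['call', 'free', 'later', 'now']
```

This is why the vocabulary is as small as 216 features: rare words, including any that occur only in a handful of synthetic examples, do not make it into the TF-IDF matrix.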


In [None]:
# retrain model with augmented data
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(train_tfidf_mat, aug_train_sample['label'])

# Make predictions on the testing data
y_pred = nb_classifier.predict(test_tfidf_mat)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_sample['label'], y_pred)
# Format with an f-string spec; this works whether accuracy is a Python float or a NumPy float
print(f"Accuracy: {accuracy:.2f}")

# Print the classification report
report = classification_report(test_sample['label'], y_pred)
print("Classification Report:")
print(report)


Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1156
           1       0.97      0.87      0.92       182

    accuracy                           0.98      1338
   macro avg       0.97      0.93      0.95      1338
weighted avg       0.98      0.98      0.98      1338



#### Some observations:
- In the baseline model the recall on the spam class was 82%, which improved to 87% with only 41 synthetically generated data points! This improved the f1-score from 89% to 92%.
- In our previous approach, where we collected around 66 data points similar to the FN (false negative) cases, we achieved the same f1-score of 92%.

Therefore, LLM-generated data can prove beneficial and, if used wisely, can match or even outperform conventional data augmentation strategies.
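
As a small consistency check, the f1-scores in the classification report above can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is hypothetical, and because the inputs are the rounded values from the report, the results are approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values taken from the classification report (already rounded to 2 decimals)
spam_f1 = f1_from(0.97, 0.87)
ham_f1 = f1_from(0.98, 1.00)

print(f"spam F1: {spam_f1:.2f}")  # spam F1: 0.92
print(f"ham F1:  {ham_f1:.2f}")   # ham F1:  0.99
```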