<a href="https://colab.research.google.com/github/steliosg23/PDS-A2/blob/main/Incidents%20Augmentation%20using%20techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Augmenting the Initial Imbalanced Train Set for Benchmark and Submission Models

In this notebook, **"Incidents Train Set Augmentation"**, I address the class imbalance in the initial training dataset for the **Food Hazard Detection Challenge**. The augmentation aims to produce a more balanced and diverse dataset so that the models detect food-safety incidents more accurately. For practical reasons, the dataset is only approximately doubled in size: this improves the representation of minority classes, but some imbalance persists in exchange for a manageable dataset size and training time. The augmented dataset is used in two notebooks, outlined below:

1. **Benchmark Models Finetuned PubMedBERT**:
   - The augmented dataset will serve as the foundation for training and evaluating benchmark models in the notebook titled **"Benchmark Models Finetuned PubMedBERT PDS A2 Food Hazard Detection.ipynb"**. These models will establish baseline performance metrics for each classification task, helping us understand the impact of data augmentation on model learning and generalization. This evaluation will include hazard-category, product-category, hazard, and product classifications.

2. **Submission Model Finetuned PubMedBERT**:
   - The augmented dataset will also be used in the notebook titled **"Submission Model Finetuned PubMedBERT PDS A2 Food Hazard Detection.ipynb"** to train the final submission model. This model will be fine-tuned on the enhanced dataset to maximize its accuracy and effectiveness in making predictions for the competition. The submission model will represent the culmination of our efforts in preprocessing, augmentation, and fine-tuning to achieve optimal performance on the challenge leaderboard.

By approximately doubling the dataset size and improving the representation of minority classes, this notebook prepares the data for more robust training. Some class imbalance remains, but this approach trades a practical dataset size against improved model performance, with the goal of more effective training for both the benchmark and the submission models.


# Mount Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Import Required Libraries
This cell imports the libraries necessary for the project:
- `pandas`: For data manipulation and analysis.
- `Counter` from `collections`: To count occurrences of hazard and product classes.
- `MarianMTModel` and `MarianTokenizer` from `transformers`: For translation models used in back-translation.
- `wordnet` from `nltk`: For obtaining synonyms of words.
- `torch`: For GPU support and managing computations.
- `random`: For random operations in augmentations.
- `nltk`: For natural language processing tasks.


In [None]:
import pandas as pd
from collections import Counter
from transformers import MarianMTModel, MarianTokenizer
from nltk.corpus import wordnet
import torch
import random
import nltk


# Download NLTK WordNet
The WordNet lexical database is required to find synonyms during text augmentation. This step ensures it's downloaded and ready for use.


In [None]:
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Check if GPU Support is Available
This step checks whether a GPU is available and sets the device accordingly. If a GPU is available, it will be used for computations to speed up model processing.


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


# Load the Dataset
The dataset is loaded from your Google Drive into a Pandas DataFrame for analysis. The file path is specified, and the dataset is assumed to be in CSV format.


In [None]:
file_path = '/content/drive/MyDrive/Data/incidents_train.csv'
df = pd.read_csv(file_path)


# Analyze Class Imbalance
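This step computes the frequency of each class in the `hazard` and `product` columns to identify underrepresented classes, which are then targeted for augmentation. The threshold is the count of the rarest hazard class, so no hazard can fall strictly below it (the hazard list in the output below is empty), while every product with fewer rows than that threshold is flagged.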


In [None]:
hazard_counts = Counter(df['hazard'])
product_counts = Counter(df['product'])

threshold = min(hazard_counts.values())

underrepresented_hazards = [h for h, count in hazard_counts.items() if count < threshold]
underrepresented_products = [p for p, count in product_counts.items() if count < threshold]
print("Underrepresented Hazards:", underrepresented_hazards)
print("Underrepresented Products:", underrepresented_products)

Underrepresented Hazards: []
Underrepresented Products: ['ham slices', 'beef stewed', 'dry sausage', 'rum', 'whisky', 'anchovy paste', 'chillies', 'chocolate and hazelnut spread', 'eggplants', 'hazelnuts', 'chocolate mousse', 'bolognese sauce', 'animal by-products', 'cheesy spirals', 'orange juice', 'nacho chips', 'pork pie', 'anchovies in oil', 'cottage cheese', 'pork fillets', 'fruit bars', 'fruit snacks', 'cheese sausages', 'dried fruits', 'pickled cucumber', 'grated cheese', 'frozen pie', 'veal meat and offals', 'oyster sauce', 'rice noodles', 'french toast', 'dried anchovies', 'frozen beef tongue', 'frozen hamburgers', 'pork and beef sausages', 'chocolate mini muffins', 'crushed red pepper', 'shrimp snacks', 'melons', 'tomatoes', 'fish sauce', 'nutmeg', 'tuna salad', 'frozen whole fish', 'structured/textured fish meat products or fish paste', 'frozen seafood', 'spinach', 'pine nuts', 'various beef products', 'diced beef', 'frozen pancakes', 'various poultry meat', 'dried pork saus
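
As a complementary sketch, the same selection can be driven by an explicit minimum count instead of the rarest hazard's count. The helper name `find_underrepresented`, the `min_count` value, and the toy labels are illustrative assumptions and not part of the original notebook.

```python
from collections import Counter

def find_underrepresented(labels, min_count):
    """Return the labels that occur fewer than `min_count` times, in first-seen order."""
    counts = Counter(labels)
    return [label for label, count in counts.items() if count < min_count]

# Toy labels (hypothetical); in the notebook these would be df['hazard'] or df['product']
toy_products = ["milk", "milk", "milk", "rum", "whisky", "rum"]
rare_products = find_underrepresented(toy_products, min_count=3)
print(rare_products)  # ['rum', 'whisky']
```

An explicit threshold makes the selection rule the same for both columns and easy to tune, at the cost of one more parameter to choose.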

# Define Functions to Load Translation Models
The function below loads a pair of MarianMT models for a given language pair: one that translates from the source to the target language (e.g., English to French) and one that translates back. Both models are downloaded from the Hugging Face Hub and moved to the selected device (CPU or GPU).


In [None]:
from functools import lru_cache

# Cache the result so repeated calls with the same language pair reuse the loaded models
@lru_cache(maxsize=None)
def load_translation_models(source_lang="en", target_lang="fr"):
    # Forward model: source -> target
    model_name = f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)

    # Reverse model: target -> source
    reverse_model_name = f"Helsinki-NLP/opus-mt-{target_lang}-{source_lang}"
    reverse_tokenizer = MarianTokenizer.from_pretrained(reverse_model_name)
    reverse_model = MarianMTModel.from_pretrained(reverse_model_name).to(device)

    return (tokenizer, model), (reverse_tokenizer, reverse_model)


# Back-Translation Function
This function performs back-translation in batches, translating the text to a target language and back to the source language to generate augmented data.


In [None]:
def back_translation_batch(texts, lang_pair=("en", "fr"), batch_size=32):
    # The forward pair translates source -> target; the reverse pair translates target -> source
    (fwd_tokenizer, fwd_model), (rev_tokenizer, rev_model) = load_translation_models(*lang_pair)

    augmented_texts = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]

        # Source -> target (inputs longer than 512 tokens are truncated)
        inputs = fwd_tokenizer(batch, return_tensors="pt", max_length=512, truncation=True, padding=True).to(device)
        translated = fwd_model.generate(**inputs)
        translated_texts = fwd_tokenizer.batch_decode(translated, skip_special_tokens=True)

        # Target -> source
        back_inputs = rev_tokenizer(translated_texts, return_tensors="pt", max_length=512, truncation=True, padding=True).to(device)
        back_translated = rev_model.generate(**back_inputs)
        augmented_texts.extend(rev_tokenizer.batch_decode(back_translated, skip_special_tokens=True))
    return augmented_texts
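
The batching and round-trip structure of the function above can be sketched without downloading any model. In this sketch the two translators are stand-in callables (`fake_en_to_fr` and `fake_fr_to_en` are hypothetical and only tag the text), and `back_translate_in_batches` is an illustrative name, not a function from the notebook.

```python
def back_translate_in_batches(texts, forward, backward, batch_size=32):
    """Round-trip `texts` through two translation callables, one batch at a time."""
    augmented = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # forward: source -> target, backward: target -> source
        augmented.extend(backward(forward(batch)))
    return augmented

# Stand-in "translators" (hypothetical): they only wrap the text so the flow is visible
def fake_en_to_fr(batch):
    return [f"fr({t})" for t in batch]

def fake_fr_to_en(batch):
    return [f"en({t})" for t in batch]

samples = ["recall of ham slices", "glass fragments in jam", "undeclared milk"]
round_trip = back_translate_in_batches(samples, fake_en_to_fr, fake_fr_to_en, batch_size=2)
print(round_trip)
```

The output has one entry per input, in the same order, which is what lets the caller pair each back-translated text with its original row by index.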


# Text Augmentation Functions
These functions apply various text augmentation techniques to generate new text samples:
1. **Synonym Replacement**: Replaces words with their synonyms.
2. **Random Insertion**: Inserts synonyms of random words at random positions.
3. **Random Swap**: Swaps two random words in the text.
4. **Random Deletion**: Deletes words from the text with a specified probability.


In [None]:
def get_synonym(word):
    # Return the first WordNet lemma that differs from the word itself, or None
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace('_', ' ')
            if candidate.lower() != word.lower():
                return candidate
    return None

def synonym_replacement(text, n=2):
    words = text.split()
    new_words = words.copy()
    candidates = list(set(words))
    random.shuffle(candidates)
    num_replaced = 0
    for random_word in candidates:
        synonym = get_synonym(random_word)
        if synonym is None:
            continue  # no usable synonym, try the next word
        new_words = [synonym if word == random_word else word for word in new_words]
        num_replaced += 1
        if num_replaced >= n:
            break
    return ' '.join(new_words)

def random_insertion(text, n=2, max_attempts=10):
    words = text.split()
    if not words:
        return text
    for _ in range(n):
        # Bounded retries: a text without any WordNet match must not loop forever
        for _attempt in range(max_attempts):
            synonym = get_synonym(random.choice(words))
            if synonym is not None:
                words.insert(random.randint(0, len(words)), synonym)
                break
    return ' '.join(words)

def random_swap(text, n=2):
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    for _ in range(n):
        idx1, idx2 = random.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)

def random_deletion(text, p=0.2):
    words = text.split()
    if len(words) <= 1:
        return text
    # Keep each word with probability 1 - p
    remaining_words = [word for word in words if random.uniform(0, 1) > p]
    # If every word was dropped, fall back to a single random word
    return ' '.join(remaining_words) if remaining_words else random.choice(words)
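
Because these techniques rely on the global `random` module, two runs of the notebook can produce different augmented texts. A minimal sketch of one way to make an augmentation reproducible, assuming a dedicated `random.Random` instance is passed in (the `rng` parameter and the name `random_swap_seeded` are assumptions, not part of the original code):

```python
import random

def random_swap_seeded(text, n=2, rng=None):
    """Swap two randomly chosen word positions `n` times using the given RNG."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return ' '.join(words)

text = "salmonella detected in frozen chicken"
first = random_swap_seeded(text, rng=random.Random(42))
second = random_swap_seeded(text, rng=random.Random(42))
print(first == second)  # True: the same seed gives the same augmentation
```

Calling `random.seed(...)` once at the top of the notebook would be a simpler alternative when a separate generator per function is not needed.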


# Apply Augmentation to All Rows
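This function applies all augmentation techniques to a batch of rows: back-translation (run in batches), followed by synonym replacement, random insertion, random swap, and random deletion. Each input row therefore yields five augmented rows, which keep the original labels and differ only in the `text` column.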


In [None]:
def augment_all_techniques_batch(rows, batch_size=32):
    texts = [row['text'] for _, row in rows.iterrows()]
    augmented_rows = []

    # Back Translation in batches
    back_translated_texts = back_translation_batch(texts, lang_pair=("en", "fr"), batch_size=batch_size)
    for i, text in enumerate(texts):
        augmented_rows.append({**rows.iloc[i], 'text': back_translated_texts[i]})

    # Other augmentations for each row
    for _, row in rows.iterrows():
        augmented_rows.append({**row, 'text': synonym_replacement(row['text'])})
        augmented_rows.append({**row, 'text': random_insertion(row['text'])})
        augmented_rows.append({**row, 'text': random_swap(row['text'])})
        augmented_rows.append({**row, 'text': random_deletion(row['text'])})

    return augmented_rows
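
The row-copying pattern can be sketched on plain dictionaries: every technique produces one new row that keeps all fields of the source row and replaces only `text`. The stand-in techniques and toy rows below are hypothetical, and the ordering differs slightly from the function above, which emits all back-translated rows first.

```python
def augment_rows(rows, techniques):
    """Apply every technique to every row, copying all fields except 'text'."""
    augmented = []
    for row in rows:
        for technique in techniques:
            augmented.append({**row, 'text': technique(row['text'])})
    return augmented

# Stand-in techniques (hypothetical) so the sketch runs without models or WordNet
techniques = [str.upper, str.lower, str.title, str.strip, lambda t: t[::-1]]
rows = [
    {'text': 'Listeria in cheese', 'hazard': 'toy hazard A', 'product': 'cheese'},
    {'text': 'Glass in jam', 'hazard': 'toy hazard B', 'product': 'jam'},
]
augmented = augment_rows(rows, techniques)
print(len(augmented))  # 10: 2 rows x 5 techniques
```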


# Perform Data Augmentation for Underrepresented Classes
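This loop augments rows of the underrepresented `hazard` and `product` classes, in full passes over both lists, until the combined dataset reaches roughly twice its original size. It does not fully balance the classes.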


In [None]:
augmented_rows = []
target_rows = len(df) * 2  # aim for roughly twice the original size

while len(df) + len(augmented_rows) < target_rows:
    rows_before = len(augmented_rows)

    for hazard in underrepresented_hazards:
        rows = df[df['hazard'] == hazard]
        augmented_rows.extend(augment_all_techniques_batch(rows))

    for product in underrepresented_products:
        rows = df[df['product'] == product]
        augmented_rows.extend(augment_all_techniques_batch(rows))

    # Stop if a full pass added nothing (e.g. both lists are empty); otherwise the loop would never end
    if len(augmented_rows) == rows_before:
        break
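
Because the target is twice the original size, the loop has to add as many rows as the dataset already contains, and each pass adds five augmented rows per selected row. A rough estimate of the number of passes, with purely hypothetical numbers and assuming every pass selects the same rows:

```python
import math

def passes_needed(n_original, rows_per_pass, rows_per_source=5):
    """Full passes required before the augmented rows reach `n_original` (size doubles)."""
    added_per_pass = rows_per_pass * rows_per_source
    if added_per_pass == 0:
        return None  # no rows are selected, so the target can never be reached
    return math.ceil(n_original / added_per_pass)

# Hypothetical sizes: 5000 original rows, 300 rows selected in each pass
print(passes_needed(n_original=5000, rows_per_pass=300))  # 4
```

A row that belongs to both an underrepresented hazard and an underrepresented product is selected twice per pass, so the real count can be lower than this estimate.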



# Combine Original and Augmented Data
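This final step combines the original dataset with the augmented rows, truncated so that the result has exactly `target_rows` rows, and saves it to Google Drive. Despite the variable name `balanced_df`, the result is more balanced than the original rather than fully balanced.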


In [None]:
augmented_df = pd.DataFrame(augmented_rows[:target_rows - len(df)])
balanced_df = pd.concat([df, augmented_df], ignore_index=True)

# Save the new DataFrame
balanced_df.to_csv('/content/drive/MyDrive/Data/augmented_incidents_train.csv', index=False)

# Display first few rows of the balanced DataFrame
balanced_df.head()
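
To check how much the augmentation actually changed the class distribution, the counts before and after can be compared side by side. This is a sketch on toy label lists; `count_change` is an illustrative helper, and in the notebook the inputs would be columns such as `df['product']` and `balanced_df['product']`.

```python
from collections import Counter

def count_change(before_labels, after_labels):
    """Map each label to its (count before, count after) pair."""
    before, after = Counter(before_labels), Counter(after_labels)
    return {label: (before[label], after[label]) for label in after}

# Toy labels (hypothetical): 'rum' was augmented, 'milk' was left untouched
before_labels = ["milk", "milk", "milk", "rum"]
after_labels = before_labels + ["rum"] * 5
changes = count_change(before_labels, after_labels)
print(changes)  # {'milk': (3, 3), 'rum': (1, 6)}
```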