<a href="https://colab.research.google.com/github/steliosg23/PDS-A2/blob/main/Food_Hazard_Detection_PubMedBERT_Benchmarks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 2
### Food Hazard Detection

# Benchmarks - Advanced Model: PubMedBERT

In this task, we aim to classify food safety-related incidents based on two distinct types of input data: short texts (title) and long texts (text).

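The advanced model benchmarked here is PubMedBERT. The code in this notebook loads the `bert-base-uncased` checkpoint for both the tokenizer and the encoder.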


For each of these input types, we perform the following two subtasks:

**Subtasks (performed separately for title and text):**

**Subtask 1:**

- Classify hazard-category (general hazard type).
- Classify product-category (general product type).

**Subtask 2:**

- Classify hazard (specific hazard).
- Classify product (specific product).

Each data split carries all features (year, month, day, country, and the text feature), although the BERT classifier in this notebook reads only the text feature.

Thus, we treat title and text as two distinct data sources, with each undergoing its own preprocessing, model training, and evaluation for all four targets.
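
Read literally, this setup implies eight independent training runs: two inputs times four targets. A minimal sketch that enumerates them follows; the `runs` list and the variable names are illustrative and not part of the notebook's own code.

```python
# Input columns and targets, named as in the task description above.
inputs = ["title", "text"]
subtasks = {
    "Subtask 1": ["hazard-category", "product-category"],
    "Subtask 2": ["hazard", "product"],
}

# One (input, subtask, target) triple per model that is trained and evaluated.
runs = [
    (text_column, subtask, target)
    for text_column in inputs
    for subtask, targets in subtasks.items()
    for target in targets
]

for text_column, subtask, target in runs:
    print(f"{text_column:>5} | {subtask} | {target}")
```

Each triple corresponds to one classifier with its own output layer, so the results can be compared per target across the two inputs.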

In [1]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Define the path to the file on Google Drive
train_path = '/content/drive/MyDrive/Data/incidents_train.csv'

# Load the dataset
df = pd.read_csv(train_path)
df = df.drop(columns=['Unnamed: 0'])




In [2]:
df

| | year | month | day | country | title | text | hazard-category | product-category | hazard | product |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1994 | 1 | 7 | us | Recall Notification: FSIS-024-94 | Case Number: 024-94 \n Date Opene... | biological | meat, egg and dairy products | listeria monocytogenes | smoked sausage |
| 1 | 1994 | 3 | 10 | us | Recall Notification: FSIS-033-94 | Case Number: 033-94 \n Date Opene... | biological | meat, egg and dairy products | listeria spp | sausage |
| 2 | 1994 | 3 | 28 | us | Recall Notification: FSIS-014-94 | Case Number: 014-94 \n Date Opene... | biological | meat, egg and dairy products | listeria monocytogenes | ham slices |
| 3 | 1994 | 4 | 3 | us | Recall Notification: FSIS-009-94 | Case Number: 009-94 \n Date Opene... | foreign bodies | meat, egg and dairy products | plastic fragment | thermal processed pork meat |
| 4 | 1994 | 7 | 1 | us | Recall Notification: FSIS-001-94 | Case Number: 001-94 \n Date Opene... | foreign bodies | meat, egg and dairy products | plastic fragment | chicken breast |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5077 | 2022 | 7 | 28 | hk | Imported biscuit may contain allergen (peanuts) | Imported biscuit may contain allergen (peanuts... | allergens | cereals and bakery products | peanuts and products thereof | biscuits |
| 5078 | 2022 | 7 | 28 | us | Danny’s Sub and Pizza Recalls Meat Pizza Produ... | 023-2022\n\n \n High - Class I\n\n Produc... | fraud | prepared dishes and snacks | inspection issues | pizza |
| 5079 | 2022 | 7 | 29 | us | Lyons Magnus Voluntarily Recalls 53 Nutritiona... | FRESNO, Calif. – July 28, 2022 – Lyons Magnus ... | biological | non-alcoholic beverages | cronobacter spp | non-alcoholic beverages |
| 5080 | 2022 | 7 | 30 | us | Conagra Brands, Inc., Recalls Frozen Beef Prod... | 025-2022\n\n \n High - Class I\n\n Misbra... | allergens | meat, egg and dairy products | eggs and products thereof | frozen beef products |


The remaining code covers the following steps:

- Import Libraries
- Set Device for GPU
- Load Pretrained BERT Tokenizer and Model
- Define Text Cleaning Function
- Apply Cleaning Function to Data
- Define Feature Columns and Targets
- Prepare Data for Title and Text Columns
- Define Dataset Class for BERT Input
- Define the Neural Network Classifier
- Initialize Model Dynamically Based on Target Classes
- Define Training and Evaluation Functions
- F1-Score Calculation and Training Loop
- Train and Evaluate Model for Title and Text (Dynamic num_classes)
- Plot F1-Scores
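
To make the text cleaning step concrete, here is a minimal sketch of what it does to a title. It assumes a tiny hand-written stopword set (`STOP_WORDS`, a hypothetical stand-in for NLTK's English list) so that it runs without downloading any corpus; `clean_text_sketch` is likewise an illustrative name.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the
# sketch runs without downloading any corpus.
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "in"}

def clean_text_sketch(text: str) -> str:
    # Drop everything except letters, digits, and whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Lowercase, split on whitespace (which also collapses repeated spaces),
    # and discard stopwords.
    tokens = [word for word in text.lower().split() if word not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text_sketch("Recall Notification: FSIS-024-94"))
# recall notification fsis02494
print(clean_text_sketch("The recall of Smoked  Sausage (Listeria)"))
# recall smoked sausage listeria
```

Note that stripping punctuation fuses hyphenated identifiers such as `FSIS-024-94` into a single token, which may or may not be desirable for a subword tokenizer.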

In [None]:
import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
from torch import nn
from transformers import BertTokenizer, BertModel
from sklearn.metrics import f1_score, classification_report
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
import numpy as np
from tqdm import tqdm  # Import tqdm for progress bars

# Get the list of English stopwords (download the corpus first if it is missing)
import nltk
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

# Function to clean text (title or text) and remove stopwords
def clean_text(text):
    # Remove non-alphanumeric characters (excluding spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra spaces
    text = ' '.join(text.split())
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Clean the title and text columns of the DataFrame loaded above
df['title'] = df['title'].apply(clean_text)
df['text'] = df['text'].apply(clean_text)

# Define relevant features and targets
features = ['year', 'month', 'day', 'country']
targets_subtask1 = ['hazard-category', 'product-category']
targets_subtask2 = ['hazard', 'product']  # Subtask 2 covers both specific targets

# Function to prepare data for both title and text
def prepare_data(text_column):
    X = df[features + [text_column]]

    # Fit one LabelEncoder per target without mutating df, so that calling this
    # function twice (title, then text) keeps the original class names in
    # each encoder's classes_ attribute.
    label_encoders = {}
    data_splits = {}
    for target in targets_subtask1 + targets_subtask2:
        le = LabelEncoder()
        # Convert categorical labels to integers, keeping df's index for alignment
        y = pd.Series(le.fit_transform(df[target]), index=df.index, name=target)
        label_encoders[target] = le

        # Split features and encoded labels for this target
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        data_splits[target] = (X_train, X_test, y_train, y_test)

    return data_splits, label_encoders

# Prepare data for title and text (with updated label encoding)
title_splits, title_label_encoders = prepare_data('title')
text_splits, text_label_encoders = prepare_data('text')

# Initialize empty DataFrames to store F1-scores for title and text
f1_scores_title_df = pd.DataFrame(columns=['Task', 'F1-Score'])
f1_scores_text_df = pd.DataFrame(columns=['Task', 'F1-Score'])

# Define the custom dataset class for BERT
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = self.texts[item]
        label = self.labels[item]  # Labels are already integers
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)  # labels as integers
        }

# Function to create a neural network model with BERT
def build_bert_model(num_classes):
    class BERTClassifier(nn.Module):
        def __init__(self, pretrained_model_name, num_classes):
            super(BERTClassifier, self).__init__()
            self.bert = BertModel.from_pretrained(pretrained_model_name)
            self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)
            self.dropout = nn.Dropout(0.3)

        def forward(self, input_ids, attention_mask):
            outputs = self.bert(input_ids, attention_mask=attention_mask)  # BERT forward pass
            output = outputs[1]  # Get the pooled output (from [CLS] token)
            output = self.dropout(output)  # Apply dropout
            return self.fc(output)  # Final classification

    model = BERTClassifier('bert-base-uncased', num_classes)
    return model

# Function to train and evaluate BERT model
def train_and_evaluate_bert(data_splits, targets, text_column, hyperparameters):
    global f1_scores_title_df, f1_scores_text_df

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    max_len = hyperparameters['max_len']  # Adjustable max length

    for target in targets:
        print(f"\nTraining for {target}...")  # Mention the task being trained

        X_train, X_test, y_train, y_test = data_splits[target]

        train_dataset = TextDataset(X_train[text_column].values, y_train.values, tokenizer, max_len)
        test_dataset = TextDataset(X_test[text_column].values, y_test.values, tokenizer, max_len)

        train_loader = DataLoader(train_dataset, batch_size=hyperparameters['batch_size'], shuffle=True)
        test_loader = DataLoader(test_dataset, batch_size=hyperparameters['batch_size'], shuffle=False)

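        # Size the output layer from the largest encoded label rather than the
        # number of distinct training labels: rare classes may be missing from
        # the training split, which would leave some label indices out of range.
        num_classes = int(max(y_train.max(), y_test.max())) + 1
        model = build_bert_model(num_classes=num_classes)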
        model.to(device)

        optimizer = torch.optim.AdamW(model.parameters(), lr=hyperparameters['learning_rate'])
        criterion = nn.CrossEntropyLoss()

        # Training loop with tqdm progress bar
        model.train()
        for epoch in range(hyperparameters['num_epochs']):  # Number of epochs can be customized
            epoch_bar = tqdm(train_loader, desc=f"Epoch {epoch + 1}", ncols=100, position=0, leave=True)
            for batch in epoch_bar:
                optimizer.zero_grad()
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

                # Update tqdm progress bar with loss info
                epoch_bar.set_postfix(loss=loss.item())

        # Evaluation
        model.eval()
        all_preds = []
        all_labels = []
        with torch.no_grad():
            for batch in test_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                outputs = model(input_ids, attention_mask)
                _, preds = torch.max(outputs, dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

        # Calculate F1-Score
        f1 = f1_score(all_labels, all_preds, average='weighted', zero_division=0)

        # Collect F1-score into DataFrame
        if text_column == 'title':
            f1_scores_title_df = pd.concat([f1_scores_title_df, pd.DataFrame({'Task': [f"{target} (Title)"], 'F1-Score': [f1]})], ignore_index=True)
        else:
            f1_scores_text_df = pd.concat([f1_scores_text_df, pd.DataFrame({'Task': [f"{target} (Text)"], 'F1-Score': [f1]})], ignore_index=True)

        # Print the classification report
        print(f"\nClassification Report for {target} ({text_column}):")
        print(classification_report(all_labels, all_preds, zero_division=0))  # Handle zero division gracefully

# Define hyperparameters
def get_hyperparameters():
    return {
        'learning_rate': 2e-5,
        'batch_size': 16,
        'num_epochs': 1,
        'max_len': 128
    }

# Get hyperparameters
hyperparameters = get_hyperparameters()

# Train and evaluate BERT for title
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("BERT for Titles:")
train_and_evaluate_bert(title_splits, targets_subtask1 + targets_subtask2, text_column='title', hyperparameters=hyperparameters)

# Train and evaluate BERT for text
print("\nBERT for Texts:")
train_and_evaluate_bert(text_splits, targets_subtask1 + targets_subtask2, text_column='text', hyperparameters=hyperparameters)

# Print the collected F1-scores for title
print("\nCollected F1-Scores for Title-Focused Classification:")
print(f1_scores_title_df)

# Print the collected F1-scores for text
print("\nCollected F1-Scores for Text-Focused Classification:")
print(f1_scores_text_df)

# Plotting the data
plt.figure(figsize=(10, 6))

# Plotting Title-Focused F1-scores
plt.bar(f1_scores_title_df['Task'], f1_scores_title_df['F1-Score'], label='Title-Focused')

# Plotting Text-Focused F1-scores
plt.bar(f1_scores_text_df['Task'], f1_scores_text_df['F1-Score'], label='Text-Focused')

# Adding labels and title
plt.xlabel('Task')
plt.ylabel('F1-Score')
plt.title('F1-Scores for Title-Focused vs Text-Focused Classification with BERT')
plt.xticks(rotation=45)
plt.legend()

# Displaying the plot
plt.tight_layout()
plt.show()
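
The evaluation above reports `f1_score(..., average='weighted', zero_division=0)`, i.e. the per-class F1 averaged with weights proportional to each class's number of true samples. A minimal pure-Python sketch of that computation, for intuition only (the function name `weighted_f1` is illustrative, and scikit-learn remains the reference implementation):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Classes that appear only in y_pred have zero support and therefore
    contribute nothing to the weighted average.
    """
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0  # mirrors zero_division=0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Class 0: F1 = 0.8 with weight 3/4; class 1: F1 = 2/3 with weight 1/4.
print(round(weighted_f1([0, 0, 0, 1], [0, 0, 1, 1]), 4))  # 0.7667
```

Because the weights follow class frequency, frequent classes dominate this score; with many rare hazard or product labels, a macro-averaged F1 can tell a rather different story.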
