## ROBERTA

Main Differences Between BERT and RoBERTa
- Training Data and Duration:
    - BERT: Trained on the BooksCorpus and English Wikipedia (16GB of text data).
    - RoBERTa: Trained on a much larger dataset (160GB), which includes BooksCorpus, English Wikipedia, CC-News, OpenWebText, and Stories. RoBERTa is trained longer and with more data, leading to better generalization.
- Dynamic Masking:
    - BERT: Uses static masking during pre-training, meaning that the same tokens are masked across different epochs.
    - RoBERTa: Uses dynamic masking, where the tokens chosen for masking change with every epoch. This results in more robust training and better performance.

**Masking**\
Masking is a technique used during the training of language models like BERT and RoBERTa. The idea is to randomly hide some words in a sentence and ask the model to predict these hidden (or masked) words based on the context provided by the other words in the sentence. This helps the model learn the relationships between words and their meanings within the context of a sentence.


- Training Objectives:
    - BERT: Utilizes the next sentence prediction (NSP) task during pre-training to predict if one sentence follows another.
    - RoBERTa: Removes the NSP task, which was found to be less beneficial. Instead, it focuses on masked language modeling (MLM) with dynamic masking.
- Hyperparameters:
    - RoBERTa: Optimizes several hyperparameters such as batch size, learning rate, and training duration. These optimizations lead to more effective training and better performance.
- Byte-Pair Encoding (BPE):
    - RoBERTa: Uses byte-level BPE tokenization, which can handle rare and unseen words more effectively than the wordpiece tokenization used in BERT.
    
    
**NSP**\
NSP is a task used during the pre-training phase of BERT (Bidirectional Encoder Representations from Transformers). The goal of NSP is to help the model understand the relationship between two sentences. 
- but RoBERTa found it less useful and removed it.

**MLM**
Masked Language Modeling (MLM) is a training objective used in both BERT and RoBERTa, where the model learns to predict missing words in a sentence.


**BPE**
Byte-Pair Encoding (BPE) is a tokenization method used to split text into subword units, which can handle rare and unseen words more effectively than traditional tokenization methods.
- Purpose: To create a flexible and efficient vocabulary that can represent both common and rare words, reducing the number of unknown tokens and handling out-of-vocabulary words better.


Why RoBERTa Might Perform Better
- Larger and More Diverse Training Data: RoBERTa is trained on a significantly larger and more diverse dataset, allowing it to learn richer language representations and generalize better to various NLP tasks.
- Dynamic Masking: The dynamic masking technique used by RoBERTa ensures that the model sees different masks of the same text during training, leading to a more robust understanding of the context and better performance.
- Removal of NSP Task: By removing the next sentence prediction task, RoBERTa focuses entirely on the more beneficial masked language modeling objective, improving its performance.
- Hyperparameter Tuning: RoBERTa benefits from extensive hyperparameter tuning, leading to more efficient training and better overall performance.
- Byte-Level BPE: The use of byte-level BPE tokenization allows RoBERTa to handle a wider variety of text inputs, including those with rare or unseen words, more effectively than BERT.


While BERT laid the foundation for many transformer-based models, RoBERTa improves upon it by addressing several limitations and introducing optimizations in training data, dynamic masking, and hyperparameters. These improvements make RoBERTa a more powerful and effective model for various NLP tasks.

# Loading & Preprocessing


In [4]:
# Load and preprocess data
import pandas as pd
import numpy as np
import re
import torch
from transformers import BertTokenizer, RobertaTokenizer, RobertaForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

torch.cuda.empty_cache()

# Loading data
data = []
with open('msr_paraphrase_train.txt', 'r') as file:
    next(file)
    for line in file:
        split_line = line.strip().split('\t')
        if len(split_line) == 5:
            data.append(split_line)
        else:
            print(f"Skipping line due to incorrect number of columns: {line}")

columns = ["Quality", "#1 ID", "#2 ID", "#1 String", "#2 String"]
df = pd.DataFrame(data, columns=columns)
df['Quality'] = df['Quality'].astype(int)

data = []
with open('msr_paraphrase_test.txt', 'r') as file:
    next(file)
    for line in file:
        split_line = line.strip().split('\t')
        if len(split_line) == 5:
            data.append(split_line)
        else:
            print(f"Skipping line due to incorrect number of columns: {line}")

df_test = pd.DataFrame(data, columns=columns)
df_test['Quality'] = df_test['Quality'].astype(int)

# Clean and preprocess data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_text_advanced(text):
    text = clean_text(text)
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word.lower() not in stop_words and len(word) > 1])
    return text
'''
**Additional constraint: if word is less than 1 length long then do not include**

Tokenization: Splits the text into individual words (tokens) using text.split().
Stop Words Removal: Removes common stop words (e.g., "and", "the", "is") that do not contribute much meaning to the text. This is done by checking if each word is in the stop_words set.
Lemmatization: Converts each word to its base form using a lemmatizer. For example, "running" becomes "run".
Length Filter: Filters out any remaining words that are only one character long, as they are often not useful for understanding the text.
'''


df['#1 String Cleaned'] = df['#1 String'].apply(clean_text)
df['#2 String Cleaned'] = df['#2 String'].apply(clean_text)

df['#1 String Processed'] = df['#1 String Cleaned'].apply(preprocess_text_advanced)
df['#2 String Processed'] = df['#2 String Cleaned'].apply(preprocess_text_advanced)

df_test['#1 String Cleaned'] = df_test['#1 String'].apply(clean_text)
df_test['#2 String Cleaned'] = df_test['#2 String'].apply(clean_text)

df_test['#1 String Processed'] = df_test['#1 String Cleaned'].apply(preprocess_text_advanced)
df_test['#2 String Processed'] = df_test['#2 String Cleaned'].apply(preprocess_text_advanced)


# Define a custom dataset for loading the data
class ParaphraseDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.df = df
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        text1 = row['#1 String Processed']
        text2 = row['#2 String Processed']
        label = int(row['Quality'])

        encoding = self.tokenizer.encode_plus(
            text1,
            text2,
            max_length=self.max_len,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
            return_attention_mask=True
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Initialize the Roberta tokenizer (NEW WEEK 9 CHANGE)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Create DataLoader for training and validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
train_dataset = ParaphraseDataset(train_df, tokenizer, max_len=128)
val_dataset = ParaphraseDataset(val_df, tokenizer, max_len=128)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


# Part 1


In [1]:
# Define the BERT model
class ParaphraseModel(nn.Module):
    def __init__(self):
        super(ParaphraseModel, self).__init__()
        self.bert = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return outputs.loss, outputs.logits
# Initialize model, loss function, and optimizer
model = ParaphraseModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

# Mixed precision training
scaler = torch.cuda.amp.GradScaler()

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

def train_epoch(model, data_loader, criterion, optimizer, device):
    model.train()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    for data in tqdm(data_loader):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        labels = data['labels'].to(device)

        optimizer.zero_grad()
        
        with torch.cuda.amp.autocast():
            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        losses.append(loss.item())
        correct_predictions += torch.sum(preds == labels)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1

def eval_model(model, data_loader, criterion, device):
    model.eval()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for data in data_loader:
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)

            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)

            losses.append(loss.item())
            correct_predictions += torch.sum(preds == labels)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1


# Train the model with early stopping
epochs = 15
early_stopping_patience = 3
best_val_loss = float('inf')
patience_counter = 0

for epoch in range(epochs):
    print(f'Epoch {epoch + 1}/{epochs}')
    train_acc, train_loss, train_f1 = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f'Train loss: {train_loss}, Train accuracy: {train_acc}, Train F1: {train_f1}')

    val_acc, val_loss, val_f1 = eval_model(model, val_loader, criterion, device)
    print(f'Validation loss: {val_loss}, Validation accuracy: {val_acc}, Validation F1: {val_f1}')
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        # Save the best model
        torch.save(model.state_dict(), 'best_model_state.bin')
    else:
        patience_counter += 1
        if patience_counter >= early_stopping_patience:
            print("Early stopping triggered")
            break

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/15


100%|██████████| 204/204 [00:38<00:00,  5.33it/s]


Train loss: 0.5905110976275276, Train accuracy: 0.6932515337423313, Train F1: 0.8032270759543487
Validation loss: 0.46953992837784336, Validation accuracy: 0.7683823529411764, Validation F1: 0.8486789431545236
Epoch 2/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.45524658094726356, Train accuracy: 0.7809815950920246, Train F1: 0.8381686310063464
Validation loss: 0.43186574414664625, Validation accuracy: 0.803921568627451, Validation F1: 0.8697068403908795
Epoch 3/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.34677621232820494, Train accuracy: 0.8496932515337424, Train F1: 0.8881278538812786
Validation loss: 0.411208598929293, Validation accuracy: 0.8088235294117647, Validation F1: 0.8621908127208481
Epoch 4/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.23716769330933982, Train accuracy: 0.9046012269938651, Train F1: 0.9287187714875087
Validation loss: 0.4414789615308537, Validation accuracy: 0.8075980392156863, Validation F1: 0.8705688375927453
Epoch 5/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.14927661959447114, Train accuracy: 0.9463190184049081, Train F1: 0.9598347486802846
Validation loss: 0.5443666263611293, Validation accuracy: 0.8161764705882353, Validation F1: 0.8741610738255032
Epoch 6/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.10003535812903269, Train accuracy: 0.9668711656441719, Train F1: 0.9752633989922126
Validation loss: 0.5559149718313825, Validation accuracy: 0.8174019607843137, Validation F1: 0.8738357324301439
Early stopping triggered


## Part 1 Changes:
### Mixed Precision Training: 
- Mixed precision training is a technique that uses both 16-bit (half-precision) and 32-bit (single-precision) floating-point numbers during training. This approach can speed up training and reduce memory usage, allowing you to train larger models or use larger batch sizes.
    - 16-bit Floating Point (Half Precision): Uses less memory and can be processed faster by the GPU.
    - 32-bit Floating Point (Single Precision): Provides more precision and is used where necessary to maintain numerical stability.
- The GradScaler in PyTorch helps with the loss scaling part of mixed precision training. Here’s what it does:
    - Scaling Up the Loss: Before the backward pass, the loss is multiplied by a scaling factor to prevent underflow of small gradient values.
    - Backward Pass: Gradients are computed with the scaled loss.
    - Unscaling the Gradients: After gradients are computed, they are divided by the scaling factor to return them to their correct scale.
    - Checking for Overflow: GradScaler checks if any gradients are too large (overflow). If they are, the scaling factor is reduced to avoid instability in future iterations.
    
### More Training Epochs (15)

### Early Stopping
- If validation loss doesn't improve for 3 consecutive epochs, training is stopped. 

# Part 2

In [6]:
from sklearn.model_selection import train_test_split, KFold
import torch.optim as optim
from transformers import get_linear_schedule_with_warmup
import random
from sklearn.model_selection import StratifiedKFold

# Define the BERT model
class ParaphraseModel(nn.Module):
    def __init__(self):
        super(ParaphraseModel, self).__init__()
        self.bert = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
        self.dropout = nn.Dropout(0.3)  # Adding dropout layer

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        pooled_output = self.dropout(outputs.logits)  # Apply dropout
        return outputs.loss, pooled_output
    
'''
- nn.Dropout(0.3): 
Adds a dropout layer with a 30% probability. 
This means that each neuron has a 30% chance of being zeroed 
out during training, helping the model generalize better.
- optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01): 
Uses AdamW optimizer with weight decay. 
Weight decay adds a penalty for large weights, helping to prevent overfitting.
'''


# # Function to get synonyms for data augmentation
# def get_synonyms(word):
#     synonyms = set()
#     for syn in wordnet.synsets(word):
#         # Iterates over all synsets (groups of synonyms) of 
#         # the given word using the WordNet lexical database.
#         for lemma in syn.lemmas():
#             # Iterates over all lemmas (individual words) in each synset.
#             synonyms.add(lemma.name())
#                 # Adds each lemma (synonym) to the synonyms set. Using a
#                 # set ensures that each synonym is unique.
#     if word in synonyms:
#         synonyms.remove(word)
#         # Removes the original word from the synonyms set to avoid replacing a word with itself.
#     return synonyms
# '''
# This helper function retrieves synonyms for 
# a given word using the WordNet lexical database.
# '''



# # Function for synonym replacement
# def synonym_replacement(sentence, n):
#     # Split the sentence into words
#     words = sentence.split()
#     # Create a copy of the sentence to modify
#     new_sentence = words.copy()
#     # Get a list of unique words that are not stopwords
#     random_word_list = list(set([word for word in words if word not in stop_words]))
#     random.shuffle(random_word_list)
#     num_replaced = 0
#     for random_word in random_word_list:
#         synonyms = get_synonyms(random_word)
#         if len(synonyms) >= 1:
#             # Replace the word with a random synonym
#             synonym = random.choice(list(synonyms))
#             new_sentence = [synonym if word == random_word else word for word in new_sentence]
#             num_replaced += 1
#         if num_replaced >= n:
#             break
#     return ' '.join(new_sentence)

# '''
# This function replaces up to n words in a sentence with their synonyms.
# '''

# # Apply data augmentation
# df['#1 String Augmented'] = df['#1 String Processed'].apply(lambda x: synonym_replacement(x, 2))
# df['#2 String Augmented'] = df['#2 String Processed'].apply(lambda x: synonym_replacement(x, 2))


# Initialize model, loss function, and optimizer
model = ParaphraseModel()
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # Added weight decay for regularization
'''
 It works by adding a penalty to the loss function, which discourages the model from learning overly complex patterns that might not generalize well to unseen data.
'''


# Mixed precision training
scaler = torch.cuda.amp.GradScaler()

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)


# Training and evaluation functions
def train_epoch(model, data_loader, criterion, optimizer, device, scaler):
    model.train()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    for data in tqdm(data_loader):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        labels = data['labels'].to(device)

        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        losses.append(loss.item())
        correct_predictions += torch.sum(preds == labels)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1

def eval_model(model, data_loader, criterion, device):
    model.eval()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for data in data_loader:
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)

            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)

            losses.append(loss.item())
            correct_predictions += torch.sum(preds == labels)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1

epochs = 15

# Cross-validation setup
kf = StratifiedKFold(n_splits=5)
best_model_state_dict = None
best_fold = -1
best_val_score = float('-inf')


for fold, (train_index, val_index) in enumerate(kf.split(df, df['Quality'])):
    print(f"Fold {fold + 1}")
    train_df = df.iloc[train_index]
    val_df = df.iloc[val_index]

    train_dataset = ParaphraseDataset(train_df, tokenizer, max_len=128)
    val_dataset = ParaphraseDataset(val_df, tokenizer, max_len=128)

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16)

    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=3, verbose=True)
    
    '''
    scheduler: ReduceLROnPlateau is a learning rate scheduler that 
    reduces the learning rate when a metric has stopped improving. 
    - This helps to fine-tune the model more effectively by adapting the learning rate 
    based on the training performance.
    '''
            
    early_stopping_patience = 5  # Increased patience for early stopping
    best_val_loss = float('inf')
    patience_counter = 0


    for epoch in range(epochs):
        print(f'Epoch {epoch + 1}/{epochs}')
        train_acc, train_loss, train_f1 = train_epoch(model, train_loader, criterion, optimizer, device, scaler)
        print(f'Train loss: {train_loss}, Train accuracy: {train_acc}, Train F1: {train_f1}')

        val_acc, val_loss, val_f1 = eval_model(model, val_loader, criterion, device)
        print(f'Validation loss: {val_loss}, Validation accuracy: {val_acc}, Validation F1: {val_f1}')
        
        scheduler.step(val_loss)  # Step the scheduler with the validation loss

        # Early stopping based on validation loss improvement
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            best_fold = fold
            best_model_state_dict = model.state_dict()
        else:
            patience_counter += 1
            if patience_counter >= early_stopping_patience:
                print("Early stopping triggered")
                break
                
'''
early_stopping_patience = 3: 
- Sets the number of epochs to wait before stopping training if no improvement is seen.
'''
              
                            

if best_model_state_dict:
    torch.save(best_model_state_dict, 'best_model_fold.bin')


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fold 1
Epoch 1/15


100%|██████████| 204/204 [00:38<00:00,  5.33it/s]


Train loss: 0.6027219473731285, Train accuracy: 0.6530674846625767, Train F1: 0.7690422707780274
Validation loss: 0.5372665132962021, Validation accuracy: 0.6850490196078431, Validation F1: 0.8108903605592348
Epoch 2/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.4845175618926684, Train accuracy: 0.7205521472392639, Train F1: 0.7914854657816434
Validation loss: 0.4675552941420499, Validation accuracy: 0.7818627450980392, Validation F1: 0.8360957642725598
Epoch 3/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.356283444546017, Train accuracy: 0.8015337423312884, Train F1: 0.8456215700310189
Validation loss: 0.4495222001683478, Validation accuracy: 0.8174019607843137, Validation F1: 0.8634280476626948
Epoch 4/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.24985294891338722, Train accuracy: 0.8542944785276074, Train F1: 0.8873606829499644
Validation loss: 0.5691019771438018, Validation accuracy: 0.8051470588235294, Validation F1: 0.8604038630377524
Epoch 5/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.16571562357392966, Train accuracy: 0.8803680981595092, Train F1: 0.9070543374642516
Validation loss: 0.6240883675568244, Validation accuracy: 0.7781862745098039, Validation F1: 0.842745438748914
Epoch 6/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.1286497675612861, Train accuracy: 0.9015337423312884, Train F1: 0.9236623067776456
Validation loss: 0.8471637606328609, Validation accuracy: 0.7781862745098039, Validation F1: 0.8413672217353199
Epoch 7/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.07559397115426905, Train accuracy: 0.9184049079754601, Train F1: 0.936787072243346
Validation loss: 0.9356550476422497, Validation accuracy: 0.7904411764705882, Validation F1: 0.8544680851063831
Epoch 8/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.04393605685190243, Train accuracy: 0.9337423312883436, Train F1: 0.948936170212766
Validation loss: 0.9629537802116543, Validation accuracy: 0.7916666666666666, Validation F1: 0.8519163763066202
Early stopping triggered
Fold 2
Epoch 1/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.19175114891692704, Train accuracy: 0.8794848206071758, Train F1: 0.9065842643213691
Validation loss: 0.02963030475246556, Validation accuracy: 0.996319018404908, Validation F1: 0.997270245677889
Epoch 2/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.1617940068244934, Train accuracy: 0.8948175406317082, Train F1: 0.9192371085472097
Validation loss: 0.028804283789998174, Validation accuracy: 0.996319018404908, Validation F1: 0.997270245677889
Epoch 3/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.14377047063088885, Train accuracy: 0.8923643054277829, Train F1: 0.9170800850460668
Validation loss: 0.029822969380035706, Validation accuracy: 0.996319018404908, Validation F1: 0.997270245677889
Epoch 4/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.12680603655091688, Train accuracy: 0.8994173566390679, Train F1: 0.9225684608120869
Validation loss: 0.028754947305310006, Validation accuracy: 0.996319018404908, Validation F1: 0.997270245677889
Epoch 5/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.1196502710089964, Train accuracy: 0.8988040478380865, Train F1: 0.9219858156028369
Validation loss: 0.02607779807465918, Validation accuracy: 0.996319018404908, Validation F1: 0.997270245677889
Epoch 6/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.0954735140845764, Train accuracy: 0.9098436062557498, Train F1: 0.9309210526315789
Validation loss: 0.026733843052723243, Validation accuracy: 0.996319018404908, Validation F1: 0.997270245677889
Epoch 7/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.08783913342117824, Train accuracy: 0.9172033118675254, Train F1: 0.9365899483325505
Validation loss: 0.026105094573223124, Validation accuracy: 0.9950920245398773, Validation F1: 0.9963570127504554
Epoch 8/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.07487900902097132, Train accuracy: 0.9144434222631095, Train F1: 0.933933222827374
Validation loss: 0.025551463497857398, Validation accuracy: 0.9950920245398773, Validation F1: 0.9963570127504554
Epoch 9/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.07119275066161565, Train accuracy: 0.9168966574670347, Train F1: 0.9356447399667538
Validation loss: 0.029220675625016585, Validation accuracy: 0.9938650306748467, Validation F1: 0.9954421148587055
Epoch 10/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.05594648261253229, Train accuracy: 0.9239497086783196, Train F1: 0.9413711583924349
Validation loss: 0.029079523095039323, Validation accuracy: 0.9938650306748467, Validation F1: 0.9954421148587055
Epoch 11/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.05311593234849473, Train accuracy: 0.922723091076357, Train F1: 0.9402277039848198
Validation loss: 0.0281890730382692, Validation accuracy: 0.9938650306748467, Validation F1: 0.9954504094631483
Epoch 12/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.04370704391861663, Train accuracy: 0.9282428702851886, Train F1: 0.9444971537001897
Validation loss: 0.03355071325580973, Validation accuracy: 0.992638036809816, Validation F1: 0.9945255474452555
Epoch 13/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.04366045625989928, Train accuracy: 0.9325360318920577, Train F1: 0.9479905437352245
Validation loss: 0.0321356680588888, Validation accuracy: 0.9938650306748467, Validation F1: 0.9954421148587055
Early stopping triggered
Fold 3
Epoch 1/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.04949260186221378, Train accuracy: 0.927936215884698, Train F1: 0.944378698224852
Validation loss: 0.003063312345914835, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 2/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.04454098756918136, Train accuracy: 0.9242563630788102, Train F1: 0.9410360467892098
Validation loss: 0.0030213633298819117, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 3/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.05130397311101357, Train accuracy: 0.9303894510886231, Train F1: 0.9462212745794836
Validation loss: 0.0031268794201405755, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 4/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.04608060580258276, Train accuracy: 0.9239497086783196, Train F1: 0.9411206077872745
Validation loss: 0.002855642179401555, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 5/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.04727727346414445, Train accuracy: 0.9371358478994174, Train F1: 0.9517987303080179
Validation loss: 0.0027155185919528935, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 6/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.04162067678921363, Train accuracy: 0.9297761422876418, Train F1: 0.9456702253855278
Validation loss: 0.002802135102113015, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 7/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.04095062726935116, Train accuracy: 0.9260962894817542, Train F1: 0.9427417438821573
Validation loss: 0.0028413922843231144, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 8/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03715960944856645, Train accuracy: 0.9297761422876418, Train F1: 0.9456444338950866
Validation loss: 0.002799240153228097, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 9/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03888330045247487, Train accuracy: 0.9349892670959828, Train F1: 0.949976403964134
Validation loss: 0.0025659635829666226, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 10/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03847408543030421, Train accuracy: 0.9276295614842074, Train F1: 0.9439163498098859
Validation loss: 0.00261912064731815, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 11/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.043553533327569456, Train accuracy: 0.9282428702851886, Train F1: 0.9443651925820257
Validation loss: 0.0026534447633643067, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 12/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03547368989800852, Train accuracy: 0.9352959214964736, Train F1: 0.9502006136417276
Validation loss: 0.002491680375647311, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 13/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03879579374174058, Train accuracy: 0.9153633854645815, Train F1: 0.9338763775754673
Validation loss: 0.002381522023557302, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 14/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03924092290667342, Train accuracy: 0.9322293774915671, Train F1: 0.9477911646586344
Validation loss: 0.0023362255157610657, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Epoch 15/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.0351082413187981, Train accuracy: 0.9352959214964736, Train F1: 0.9500591715976332
Validation loss: 0.0025311580530422576, Validation accuracy: 0.9987730061349693, Validation F1: 0.99909338168631
Fold 4
Epoch 1/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03685783972854123, Train accuracy: 0.9297761422876418, Train F1: 0.9456444338950866
Validation loss: 0.001186574305928148, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 2/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.037839698390669974, Train accuracy: 0.9245630174793009, Train F1: 0.9414564493098524
Validation loss: 0.001262471911853508, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 3/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03945545053255616, Train accuracy: 0.9294694878871512, Train F1: 0.9453941120607787
Validation loss: 0.001211114419514642, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 4/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03365673307858992, Train accuracy: 0.933149340693039, Train F1: 0.9482676791646891
Validation loss: 0.0011687065216749177, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 5/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03520348191480426, Train accuracy: 0.9310027598896045, Train F1: 0.9466445340289306
Validation loss: 0.0011947567668268639, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 6/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.037437216981368905, Train accuracy: 0.927936215884698, Train F1: 0.9441141498216409
Validation loss: 0.00121084751113884, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 7/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.037061993516616376, Train accuracy: 0.9328426862925483, Train F1: 0.9481902058197302
Validation loss: 0.0012808940190748841, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 8/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.026725306595657385, Train accuracy: 0.9328426862925483, Train F1: 0.9480427046263346
Validation loss: 0.0012320871670346927, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 9/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.030375874234938666, Train accuracy: 0.9273229070837167, Train F1: 0.9435579899976185
Validation loss: 0.0012235510672040867, Validation accuracy: 1.0, Validation F1: 1.0
Early stopping triggered
Fold 5
Epoch 1/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03848553529721411, Train accuracy: 0.9310027598896045, Train F1: 0.9467203409898177
Validation loss: 0.0010594527531579574, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 2/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.032668428495526314, Train accuracy: 0.9386691199018706, Train F1: 0.9528301886792453
Validation loss: 0.001060013825932116, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 3/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03321542721424325, Train accuracy: 0.9273229070837167, Train F1: 0.9437455494896748
Validation loss: 0.001061405216151958, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 4/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.02931436768495569, Train accuracy: 0.9356025758969642, Train F1: 0.9502840909090908
Validation loss: 0.0010552534413542233, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 5/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03421972320396818, Train accuracy: 0.9303894510886231, Train F1: 0.9460166468489892
Validation loss: 0.001053373461735307, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 6/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.036944697665817594, Train accuracy: 0.9417356639067771, Train F1: 0.9552730696798495
Validation loss: 0.0010527408574525193, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 7/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.036829286475218465, Train accuracy: 0.9346826126954922, Train F1: 0.9495380241648899
Validation loss: 0.0010495315837746888, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 8/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.029337958300340118, Train accuracy: 0.9316160686905858, Train F1: 0.9471188048375622
Validation loss: 0.0010499233978015243, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 9/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.030674050959265408, Train accuracy: 0.9356025758969642, Train F1: 0.9502605400284226
Validation loss: 0.0010445232767903921, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 10/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03222652769410143, Train accuracy: 0.9294694878871512, Train F1: 0.9454200284765069
Validation loss: 0.0010490888985348684, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 11/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.03783830381729001, Train accuracy: 0.9230297454768477, Train F1: 0.9402807518439209
Validation loss: 0.0010521278737167664, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 12/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.03210383593378698, Train accuracy: 0.927936215884698, Train F1: 0.9440609378719352
Validation loss: 0.0010549974714533664, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 13/15


100%|██████████| 204/204 [00:37<00:00,  5.44it/s]


Train loss: 0.029536935167533217, Train accuracy: 0.9282428702851886, Train F1: 0.9442857142857143
Validation loss: 0.001054519925531292, Validation accuracy: 1.0, Validation F1: 1.0
Epoch 14/15


100%|██████████| 204/204 [00:37<00:00,  5.43it/s]


Train loss: 0.036226825954109504, Train accuracy: 0.9343759582950015, Train F1: 0.9493610979649788
Validation loss: 0.001054619989065709, Validation accuracy: 1.0, Validation F1: 1.0
Early stopping triggered


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Final Validation Accuracy: 1.0
Final Validation F1 Score: 1.0
Final Validation Precision: 1.0
Final Validation Recall: 1.0


In [7]:
best_model = ParaphraseModel().to(device)
best_model.load_state_dict(torch.load('best_model_fold.bin'))

# Function to evaluate the model on a given data loader
def eval_model_extended(model, data_loader, device):
    model.eval()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for data in data_loader:
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)
            
            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)
            
            losses.append(loss.item())
            correct_predictions += torch.sum(preds == labels)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1

# Evaluate the best model on the training and validation sets
train_acc, train_loss, train_f1 = eval_model_extended(best_model, train_loader, device)
val_acc, val_loss, val_f1 = eval_model_extended(best_model, val_loader, device)

print(f'Final Training Accuracy: {train_acc}')
print(f'Final Training Loss: {train_loss}')
print(f'Final Training F1 Score: {train_f1}')
print(f'Final Validation Accuracy: {val_acc}')
print(f'Final Validation Loss: {val_loss}')
print(f'Final Validation F1 Score: {val_f1}')


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Final Training Accuracy: 0.9975467647960748
Final Training Loss: 0.013625152233883045
Final Training F1 Score: 0.9981851179673321
Final Validation Accuracy: 1.0
Final Validation Loss: 0.001054619989065709
Final Validation F1 Score: 1.0


## Part 2 Changes

### 1. Regularization
- Regularization techniques help prevent overfitting. Dropout randomly sets a fraction of input units to 0 at each update during training time, which helps prevent over-reliance on specific neurons. Weight decay (L2 regularization) adds a penalty to the loss function based on the size of the weights.

#### Dropout
- Overfitting Problem: In a neural network, overfitting occurs when the model learns to perform well on the training data but fails to generalize to unseen data. This typically happens when the model relies too heavily on specific neurons or learns noise in the training data.
    - Dropout Solution: By randomly ignoring neurons during training, dropout forces the network to learn redundant representations and not rely on any specific neurons. This helps the model generalize better to new data.
- Robust Features: The model learns to distribute the learning across all neurons, ensuring that the absence of any particular neuron does not significantly impact the network's performance.
    - Redundant Representations: Dropout encourages the creation of redundant features, which means the network learns multiple ways to represent the same information. This redundancy makes the model more robust to changes and variations in the input data.

#### Weight Decay (L2 Regularization)
- Weights are numerical values that determine the strength of the connections between neurons (or nodes) in adjacent layers of a neural network.
- They adjust the input signals in each neuron and control how much influence each input will have on the neuron's output.

- Overfitting Problem: Similar to dropout, overfitting can occur when the model's weights become too large, allowing it to memorize the training data instead of learning general patterns.
    - Weight Decay Solution: By penalizing large weights, weight decay encourages the model to keep its weights small, which helps in reducing the model's capacity to memorize the training data. This leads to better generalization to unseen data.
- Simplicity and Generalization: Models with smaller weights are generally simpler and more likely to generalize well to new data. Weight decay helps in promoting simplicity by discouraging overly complex models with large weights.

### 3. Learning Rate Scheduling (ReduceLROnPlateau)
- Learning rate scheduling dynamically adjusts the learning rate during training, which helps the model converge more effectively. 
- It reduces the learning rate only when the training process stagnates, allowing the model to continue learning at a higher rate when it is still improving

Learning Rates:
- A high learning rate means large updates to the weights, which can speed up learning but might overshoot the optimal solution.
- A low learning rate means small updates to the weights, which allows for more precise adjustments but can make learning slow.

#### Adaptive Learning Rate:
- Why: The ReduceLROnPlateau scheduler adjusts the learning rate based on the actual performance of the model rather than following a fixed schedule.
- How: It reduces the learning rate only when the training process stagnates, allowing the model to continue learning at a higher rate when it is still improving.
#### Handling Plateaus:
- Why: Training often encounters plateaus where the loss or accuracy does not improve for several epochs.
- How: ReduceLROnPlateau helps in overcoming these plateaus by reducing the learning rate, which can help the model find new, better minima in the loss landscape.
#### Prevention of Overshooting:
- Why: A linear scheduler reduces the learning rate uniformly over time, which might not be optimal for all stages of training.
- How: By adapting the learning rate based on performance, ReduceLROnPlateau prevents the learning rate from becoming too small too early, which can hinder training progress.
#### Fine-tuning:
- Why: As training progresses, smaller learning rates are needed to make fine adjustments to the weights.
- How: ReduceLROnPlateau automatically decreases the learning rate when fine-tuning is necessary, ensuring better convergence and finer adjustments.

    
    
### 4. Cross-Validation
- Cross-validation is a technique used to evaluate the performance of a model by partitioning the data into several subsets (folds) and training/testing the model multiple times, each time using a different subset as the validation set and the remaining subsets as the training set. This helps in assessing the model's ability to generalize to unseen data.

Stratified K Fold:
- Splits the dataset into k folds (subsets), ensuring that each fold has a representative distribution of the target variable.
- For example, in a 5-fold cross-validation:
    - Fold 1: Train on Folds 2-5, Test on Fold 1
    - Fold 2: Train on Folds 1, 3-5, Test on Fold 2
    - And so on...
#### Reliable Performance Estimate:
- Cross-validation provides a more reliable estimate of model performance compared to a single train-test split.
- How: By averaging the performance across multiple folds, it reduces the impact of variance and provides a clearer picture of how the model will generalize to new data.
#### Detection of Overfitting:
- Helps identify if the model is overfitting to a particular subset of the data.
- How: Each fold uses a different validation set, so consistent performance across folds indicates good generalization.
#### Robust Model Evaluation:
- Ensures that the model's performance is not dependent on a specific train-test split.
- How: By training and validating the model on different subsets, it tests the model's robustness to variations in the data.
#### Efficient Use of Data:
- Maximizes the use of available data for training and validation.
- How: Each data point is used for both training and validation across different folds, leading to a more efficient use of the dataset.


    


### RAW CODE

In [None]:
# Raw code: 
from sklearn.model_selection import train_test_split, KFold
import torch.optim as optim
from transformers import get_linear_schedule_with_warmup
import random
from sklearn.model_selection import StratifiedKFold

# Define the BERT model
class ParaphraseModel(nn.Module):
    def __init__(self):
        super(ParaphraseModel, self).__init__()
        self.bert = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
        self.dropout = nn.Dropout(0.3)  # Adding dropout layer

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        pooled_output = self.dropout(outputs.logits)  # Apply dropout
        return outputs.loss, pooled_output
model = ParaphraseModel()
criterion = nn.CrossEntropyLoss()

optimizer = optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # Added weight decay for regularization

# Mixed precision training
scaler = torch.cuda.amp.GradScaler()

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)


# Training and evaluation functions
def train_epoch(model, data_loader, criterion, optimizer, device, scaler):
    model.train()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    for data in tqdm(data_loader):
        input_ids = data['input_ids'].to(device)
        attention_mask = data['attention_mask'].to(device)
        labels = data['labels'].to(device)

        optimizer.zero_grad()

        with torch.cuda.amp.autocast():
            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        losses.append(loss.item())
        correct_predictions += torch.sum(preds == labels)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1

def eval_model(model, data_loader, criterion, device):
    model.eval()
    losses = []
    correct_predictions = 0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for data in data_loader:
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)

            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)

            losses.append(loss.item())
            correct_predictions += torch.sum(preds == labels)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = correct_predictions.double() / len(data_loader.dataset)
    f1 = f1_score(all_labels, all_preds)
    return acc, np.mean(losses), f1

epochs = 15

# Cross-validation setup
kf = StratifiedKFold(n_splits=5)
best_model_state_dict = None
best_fold = -1
best_val_score = float('-inf')


for fold, (train_index, val_index) in enumerate(kf.split(df, df['Quality'])):
    print(f"Fold {fold + 1}")
    train_df = df.iloc[train_index]
    val_df = df.iloc[val_index]

    train_dataset = ParaphraseDataset(train_df, tokenizer, max_len=128)
    val_dataset = ParaphraseDataset(val_df, tokenizer, max_len=128)

    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=16)

    # model = ParaphraseModel().to(device)
    # optimizer = optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=3, verbose=True)

            
    early_stopping_patience = 5  # Increased patience for early stopping
    best_val_loss = float('inf')
    patience_counter = 0


    for epoch in range(epochs):
        print(f'Epoch {epoch + 1}/{epochs}')
        train_acc, train_loss, train_f1 = train_epoch(model, train_loader, criterion, optimizer, device, scaler)
        print(f'Train loss: {train_loss}, Train accuracy: {train_acc}, Train F1: {train_f1}')

        val_acc, val_loss, val_f1 = eval_model(model, val_loader, criterion, device)
        print(f'Validation loss: {val_loss}, Validation accuracy: {val_acc}, Validation F1: {val_f1}')
        
        scheduler.step(val_loss)  # Step the scheduler with the validation loss

        # Early stopping based on validation loss improvement
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            best_fold = fold
            best_model_state_dict = model.state_dict()
        else:
            patience_counter += 1
            if patience_counter >= early_stopping_patience:
                print("Early stopping triggered")
                break
          
                            

if best_model_state_dict:
    torch.save(best_model_state_dict, 'best_model_fold.bin')
    
best_model = ParaphraseModel().to(device)
best_model.load_state_dict(torch.load('best_model_fold.bin'))

# Function to evaluate the model on the validation set
def final_eval_model(model, data_loader, device):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for data in data_loader:
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            labels = data['labels'].to(device)
            loss, logits = model(input_ids, attention_mask, labels)
            _, preds = torch.max(logits, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    precision = precision_ascore(all_labels, all_preds)
    recall = recall_score(all_labels, all_preds)
    return acc, f1, precision, recall

# Evaluate the best model on the validation set
val_acc, val_f1, val_precision, val_recall = final_eval_model(best_model, val_loader, device)
print(f'Final Validation Accuracy: {val_acc}')
print(f'Final Validation F1 Score: {val_f1}')
print(f'Final Validation Precision: {val_precision}')
print(f'Final Validation Recall: {val_recall}')
