# AI Detector - Model Training

## 1. Import Necessary Dependencies

First, we import the libraries required for model training.

In [1]:
import os
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader



We also specify the `device` so that training is GPU-accelerated when a GPU is available.

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## 2. Define `train_model()` Function

- **Params:** 
  - `df` -> The preprocessed data
  - `model_name` -> The specified Sentence Transformer from [sbert.net](https://www.sbert.net)
  - `output_path` -> Fine-tuned model export path
  - `epochs` -> Number of full passes over the training data (defaults to 5)
  - `batch_size` -> Size of batches of training data (defaults to 16)
- **Returns:** The fine-tuned model


This function fine-tunes the specified `SentenceTransformer` model on pairs of texts. The process is centered around **feature extraction**: the model encodes textual data into dense numerical vectors (embeddings). The following is a detailed explanation of how feature extraction is done:

1. **Input Data**:
   - The input to the model consists of two columns: `candidate_combined` (the candidate's answer) and `ai_combined` (the AI-generated answer). These represent the two pieces of text whose similarity will be compared.
   - The `similarity_score` is the label, representing how similar the two pieces of text are, which the model learns to predict during training.

2. **Creating Examples for Training**:
   - The line `InputExample(texts=[row['candidate_combined'], row['ai_combined']], label=float(row['similarity_score']))` creates training examples for the model.
   - `texts` is a pair of texts that will be encoded into numerical vectors (embeddings) by the `SentenceTransformer` model. These embeddings represent the features extracted from the text data.
   - These `InputExample`s are then passed into a `DataLoader`, which prepares batches of data for training.

3. **SentenceTransformer Model**:
   - `SentenceTransformer(model_name)` loads the model that performs the feature extraction. This model is pre-trained on large corpora and can convert input texts into high-dimensional vectors (embeddings).
   - When the training data is passed through the model, it encodes each text (from both `candidate_combined` and `ai_combined`) into a fixed-size embedding. These embeddings are vector representations of the text that capture semantic meaning, making them suitable for downstream tasks like similarity measurement.

4. **Cosine Similarity Loss**:
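   - `CosineSimilarityLoss` is used as the loss function for training. For each pair, it computes the cosine similarity between the two embeddings and, by default, penalizes the squared difference between that value and the pair's `similarity_score`. Pairs with a higher `similarity_score` are therefore pulled closer together in embedding space, and pairs with a lower score are pushed apart.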
   - This process adjusts the model's weights to better encode the features that represent textual similarity.

5. **Validation and Evaluation**:
   - For validation, the code prepares examples similarly, but these are used for evaluation instead of training.
   - The `EmbeddingSimilarityEvaluator` computes the cosine similarity between the embeddings of `candidate_combined` and `ai_combined`, then reports how well these values correlate (Pearson and Spearman) with the actual `similarity_score`.

6. **How Features Are Encoded**:
   - Each piece of text (both `candidate_combined` and `ai_combined`) is passed through the `SentenceTransformer` model.
   - The model tokenizes the text, then converts it into a dense embedding vector of fixed length. These embeddings encode semantic information about the text.
   - The embeddings are the "features" extracted from the text, which are then used to compute similarity.

The **features** in this code are the dense embeddings extracted by the `SentenceTransformer` model. These embeddings are used to train the model to learn similarities between pairs of text using the cosine similarity loss function.
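The objective described in step 4 can be illustrated without any model at all. The following is a minimal sketch in plain Python, assuming the default behavior of `CosineSimilarityLoss` (mean squared error between the cosine similarity and the label). The helper names `cosine_similarity` and `cosine_mse_loss` and the toy 2-D vectors are hypothetical and are not part of the notebook or the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between cosine similarities and target scores."""
    errors = [
        (cosine_similarity(u, v) - y) ** 2
        for (u, v), y in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Toy 2-D "embeddings" (hypothetical values, not real model output)
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction -> cosine similarity 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal     -> cosine similarity 0.0
]
labels = [1.0, 0.5]  # target similarity scores

loss = cosine_mse_loss(pairs, labels)
print(loss)  # 0.125
```

The first pair already matches its label, so only the second pair contributes to the loss. During fine-tuning, gradient updates would shift the embeddings so that the second pair's cosine similarity moves toward 0.5.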

In [3]:
def train_model(df, model_name, output_path, epochs=5, batch_size=16):
    # Split the data into training and validation sets
    train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)

    # Create examples for training
    train_examples = [
        InputExample(
            texts=[row['candidate_combined'], row['ai_combined']],
            label=float(row['similarity_score']))
        for _, row in train_df.iterrows()]

    # Create DataLoader for training with appropriate batch size
    train_dataloader = DataLoader(
        train_examples, shuffle=True, batch_size=batch_size)

    # Initialize the specified SentenceTransformer model
    model = SentenceTransformer(model_name, device=device)

    # Define the loss function
    train_loss = losses.CosineSimilarityLoss(model)

    # Prepare validation data
    valid_samples = [(row['candidate_combined'], row['ai_combined'], row['similarity_score'])
                     for _, row in valid_df.iterrows()]
    valid_examples = [InputExample(
        texts=[s[0], s[1]], label=float(s[2])) for s in valid_samples]

    # Create an evaluator
    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
        valid_examples, name='validation')

    # Train/fine-tune the model
    model.fit(train_objectives=[(train_dataloader, train_loss)],
              epochs=epochs,
              warmup_steps=100,
              evaluator=evaluator,
              evaluation_steps=500,
              output_path=output_path,
              show_progress_bar=True)
    
    return model
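
Once `train_model()` returns, the fine-tuned model could be used to score a new pair of texts in the same way the training pairs are compared. The following is a minimal sketch, assuming the model object exposes the `encode()` method of `SentenceTransformer`. The helpers `score_pair` and `load_fine_tuned` are hypothetical names that do not appear in the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return float(dot / (norm_u * norm_v))


def score_pair(model, candidate_text, ai_text):
    """Encode both texts with `model` and return their cosine similarity."""
    emb_candidate, emb_ai = model.encode([candidate_text, ai_text])
    return cosine_similarity(emb_candidate, emb_ai)


def load_fine_tuned(path):
    """Load a saved model directory (hypothetical helper).

    The import is done lazily so that the helpers above can be used
    without sentence-transformers installed.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(path)


# Example usage (requires a model saved by train_model):
# model = load_fine_tuned(output_path)
# print(score_pair(model, "candidate answer text", "AI-generated answer text"))
```

A higher score suggests that the candidate's answer is closer to the AI-generated one. Any decision threshold would have to be chosen from validation data.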

## 3. Train the Model
First, we specify the Sentence Transformer model.

In [4]:
# model_name = "all-mpnet-base-v2"
# model_name = "all-distilroberta-v1"
# model_name = "all-MiniLM-L12-v2"
# model_name = "all-MiniLM-L6-v2"
model_name = "multi-qa-mpnet-base-dot-v1"

Then, we specify the hyperparameters.

In [5]:
epochs = 5
batch_size = 4

Now, we load the preprocessed data and specify the model export directory.

In [6]:
# Load the preprocessed data
data_dir = os.path.join(os.path.abspath(''), os.pardir, 'data')
df = pd.read_csv(os.path.join(data_dir, 'preprocessed_data.csv'))

# Define model export/output path
model_dir = os.path.join(
    os.path.abspath(''), os.pardir, 'models')
output_path = os.path.join(model_dir, f'fine-tuned_{model_name}')

Finally, we train the model.

In [7]:
# Train the model
model = train_model(df, model_name, output_path, epochs=epochs, batch_size=batch_size)

print(f"Model training complete. Model saved as {output_path}")

100%|██████████| 380/380 [40:31<00:00,  6.40s/it] 

{'train_runtime': 2431.2833, 'train_samples_per_second': 0.621, 'train_steps_per_second': 0.156, 'train_loss': 0.04405767541182669, 'epoch': 5.0}
Model training complete. Model saved as e:\Data Science\AI-Detector\notebooks\..\models\fine-tuned_multi-qa-mpnet-base-dot-v1
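The 380 steps in the progress bar can be related to the hyperparameters with some quick arithmetic. The sketch below assumes about 302 training rows. That figure is only an estimate inferred from the log (0.621 samples/s over roughly 2431 s, divided by 5 epochs) and is not stated anywhere in the notebook. The function `training_steps` is a hypothetical helper.

```python
import math


def training_steps(n_train, batch_size, epochs):
    """Return (steps per epoch, total steps), counting a partial last batch."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# n_train=302 is an estimate inferred from the training log, not a known value
steps_per_epoch, total_steps = training_steps(n_train=302, batch_size=4, epochs=5)
print(steps_per_epoch, total_steps)  # 76 380

# Compare the step-based settings passed to model.fit() with the total
warmup_steps = 100
evaluation_steps = 500
warmup_fraction = warmup_steps / total_steps
eval_reached = evaluation_steps <= total_steps
print(round(warmup_fraction, 2), eval_reached)  # 0.26 False
```

If this estimate holds, warmup covers roughly a quarter of the run, and the 500-step evaluation interval is never reached within the 380 training steps. Depending on the library version, a smaller `evaluation_steps` (for example, one epoch's worth of steps) may be needed to see validation scores during training.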



