# LLM Security - Prompt Injection
## Part 3 - Classification Using a Fine-tuned LLM

In this notebook, we load the raw dataset and fine-tune a pre-trained large language model to classify malicious prompts.
> **INPUT:** the raw dataset obtained from Hugging Face. <br>
> **OUTPUT:** a performance analysis of the fine-tuned LLM.


### 1. INITIALIZATION

In [1]:
# Import necessary libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Set display options
pd.set_option('display.max_columns', None)

### 2. LOADING DATASET

Because we fine-tune a pre-trained LLM on the training set and then evaluate it on the test set, we need to load both splits.

In [3]:
# Set the dataset location and file names
data_file_path = "../data/raw/"
data_file_name_train = "train-00000-of-00001-9564e8b05b4757ab"
data_file_name_test = "test-00000-of-00001-701d16158af87368"
data_file_ext = ".parquet"

# Load both splits into pandas DataFrames
data_train = pd.read_parquet(data_file_path + data_file_name_train + data_file_ext)
data_test = pd.read_parquet(data_file_path + data_file_name_test + data_file_ext)

In [4]:
# Rename the "text" column to "prompt"
data_train.rename(columns={"text":"prompt"}, inplace=True)
data_test.rename(columns={"text":"prompt"}, inplace=True)

We already explored the dataset in the previous notebooks, so we will directly proceed to the fine-tuning phase.

### 3. MODEL FINE-TUNING

In [5]:
# Import the BERT tokenizer and classifier, the Trainer API, and PyTorch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')

In [7]:
def tokenize_batch(batch):
    # Pad to the longest prompt in the batch and truncate to the model's maximum length
    return tokenizer(batch['prompt'], padding=True, truncation=True)

In [8]:
# Tokenize each split in a single call, so every prompt is padded to the longest one in that split
data_train_tokenized = tokenize_batch(data_train.to_dict(orient='list'))
data_test_tokenized = tokenize_batch(data_test.to_dict(orient='list'))
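
To make the effect of `padding=True` and `truncation=True` concrete, here is a toy sketch in plain Python. The helper `pad_and_truncate`, the token ids, and `max_length=5` are illustrative assumptions, not part of the notebook or of the `transformers` API. The real tokenizer also keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy version of batch padding and truncation (hypothetical helper)."""
    # Truncation: cut every sequence down to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one left in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three prompts of different lengths
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because each split is tokenized in one call, a single long prompt sets the padded length for the whole split, which increases memory use and training time.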

In [9]:
# Define Dataset Class
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [10]:
train_dataset = CustomDataset(data_train_tokenized, data_train['label'].values)
test_dataset = CustomDataset(data_test_tokenized, data_test['label'].values)

In [11]:
# Load Pre-trained Model
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

In [13]:
# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=lambda p: {'accuracy': (p.predictions.argmax(axis=1) == p.label_ids).mean()},
)

In [14]:
# Fine-tune the Model
trainer.train()

{'train_runtime': 1449.4394, 'train_samples_per_second': 1.13, 'train_steps_per_second': 0.143, 'train_loss': 0.31073709386558346, 'epoch': 3.0}


TrainOutput(global_step=207, training_loss=0.31073709386558346, metrics={'train_runtime': 1449.4394, 'train_samples_per_second': 1.13, 'train_steps_per_second': 0.143, 'train_loss': 0.31073709386558346, 'epoch': 3.0})
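
The training log can be cross-checked against the training arguments with a little arithmetic. The sketch below uses only numbers reported above. It assumes a single device, no gradient accumulation, and a linear warmup schedule (to my knowledge the `Trainer` default), none of which the notebook states explicitly.

```python
import math

# Values copied from the training arguments and the training log
global_step = 207
num_train_epochs = 3
per_device_train_batch_size = 8
warmup_steps = 500
train_runtime = 1449.4394
train_samples_per_second = 1.13

# Optimizer steps per epoch
steps_per_epoch = global_step // num_train_epochs

# Training set size estimated from throughput (an estimate, not a logged value)
approx_train_samples = round(train_samples_per_second * train_runtime / num_train_epochs)

# Steps expected for that many samples (assumes one device, no gradient accumulation)
expected_steps = math.ceil(approx_train_samples / per_device_train_batch_size) * num_train_epochs

# Under linear warmup, the learning-rate multiplier at step s is s / warmup_steps
# until warmup ends, so this is the largest fraction of the peak rate ever reached
final_lr_fraction = min(1.0, global_step / warmup_steps)

print(steps_per_epoch, approx_train_samples, expected_steps, round(final_lr_fraction, 3))
```

If these assumptions hold, the 500 warmup steps exceed the 207 total steps, so the learning rate is still ramping up when training ends and only reaches about 41% of its configured peak. Lowering `warmup_steps` well below the total step count may be worth trying.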

In [15]:
# Evaluate the Model
results = trainer.evaluate()
print(results)

{'eval_loss': 0.38279974460601807, 'eval_accuracy': 0.9310344827586207, 'eval_runtime': 6.6379, 'eval_samples_per_second': 17.475, 'eval_steps_per_second': 2.26, 'epoch': 3.0}


In [None]:
# Save the Model
model.save_pretrained('./your_finetuned_model')

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 'trainer' wraps the fine-tuned model; 'test_dataset' is the held-out test split

# Make predictions on the test set
raw_predictions = trainer.predict(test_dataset)

# Extract logits from the predictions (a NumPy array of shape [n_samples, 2])
logits = raw_predictions.predictions

# Get the predicted labels as the index of the largest logit per sample
predicted_labels = logits.argmax(axis=1)

# Get the true labels
true_labels = test_dataset.labels

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print(f'Accuracy: {accuracy:.4f}')

# Calculate precision, recall, and F1 score (scikit-learn treats label 1 as the positive class by default)
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

Accuracy: 0.9310
Precision: 0.9815
Recall: 0.8833
F1 Score: 0.9298
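
To see how these four numbers relate, the sketch below recomputes them from confusion-matrix counts. The counts are an inference, not an output of the notebook: they are the values that reproduce the reported metrics for a test set of about 116 prompts (evaluation throughput times runtime). The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)            # share of flagged prompts that are truly positive
    recall = tp / (tp + fn)               # share of positive prompts that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed counts, reconstructed from the reported metrics (not printed by the notebook)
metrics = binary_metrics(tp=53, fp=1, fn=7, tn=55)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these counts the model raises one false alarm but misses seven positive prompts, which is why recall is noticeably lower than precision.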
