# LLM Security - Prompt Injection
## Part 2 - Classification Using a Pre-trained LLM

In this notebook, we load the raw dataset and use a pre-trained large language model to spot malicious prompts.
> **INPUT:** the raw dataset loaded from the Hugging Face library. <br>
> **OUTPUT:** the performance analysis of the considered LLM.


### 1. INITIALIZATION

In [1]:
# Import necessary libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Set display options
pd.set_option('display.max_columns', None)

### 2. LOADING DATASET

Since we are using a pre-trained model without fine-tuning, there is no need to load the training dataset.

However, we load it anyway so that we can evaluate the model on the whole dataset, including both training and testing samples, and obtain a wider view of its performance.

Only the performance on the testing dataset will be considered when comparing with other classification approaches.

In [3]:
# Initialize data set location and file name
data_file_path = "../data/raw/"
data_file_name_train = "train-00000-of-00001-9564e8b05b4757ab"
data_file_name_test = "test-00000-of-00001-701d16158af87368"
data_file_ext = ".parquet"

# Loading data set into a pandas DataFrame
data_train = pd.read_parquet(data_file_path + data_file_name_train + data_file_ext)
data_test = pd.read_parquet(data_file_path + data_file_name_test + data_file_ext)

In [4]:
# Rename the "text" column to "prompt"
data_train.rename(columns={"text":"prompt"}, inplace=True)
data_test.rename(columns={"text":"prompt"}, inplace=True)

We already explored the dataset in the previous notebook, so we will directly proceed to the inference phase.

### 3. MODEL PREDICTION (XLM-RoBERTa)

In this experiment, we use a pre-trained LLM as-is, without fine-tuning it on the downstream task.

Since the prompts in the dataset combine different languages, we should rely on a multilingual model that is trained on text of the languages indicated in the existing prompts.

For this purpose, we use RoBERTa (Robustly optimized BERT approach), an enhanced version of BERT (Bidirectional Encoder Representations from Transformers).

RoBERTa is a transformer-based neural network model developed by Facebook AI, designed for natural language understanding tasks. Since our dataset is multilingual, we will use the XLM-RoBERTa variant, which is a multilingual version of RoBERTa pre-trained on data covering 100 languages.

Given that XLM-RoBERTa was pre-trained to predict masked words, we need to adapt it to predict whether a prompt is an injection or not. To do so, we use the "zero-shot-classification" task provided by the Hugging Face pipeline.
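
As a rough illustration of what the zero-shot pipeline does with the candidate labels: each label is inserted into a hypothesis sentence, the model scores how strongly the prompt entails each hypothesis, and the scores are normalized with a softmax. The sketch below mimics only that ranking step in plain Python. The template string follows the pipeline's default wording, but the entailment scores are invented numbers used purely for illustration, and `build_hypotheses` and `rank_labels` are hypothetical helpers, not part of the transformers API.

```python
import math

# Hypothesis template in the style of the pipeline default; each label is slotted in.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(candidate_labels):
    """Turn each candidate label into a hypothesis sentence."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the entailment logits and sort labels by descending score."""
    max_logit = max(entailment_logits)
    exps = [math.exp(x - max_logit) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


labels = ["Injection", "Normal"]
hypotheses = build_hypotheses(labels)

# Invented entailment logits (assumption): a real model would produce these.
output = rank_labels(labels, [1.2, 0.3])

# Same decision rule as the notebook: the top-ranked label decides the class.
predicted = 1 if output["labels"][0] == "Injection" else 0
```

The returned dictionary has the same `labels` and `scores` keys that the notebook's `classify_prompt` function reads from the real pipeline output.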

In [5]:
# Import pipeline functionality from the Hugging Face transformers library
from transformers import pipeline

In [7]:
# Load the zero-shot classification pipeline with XLM-RoBERTa
classifier = pipeline(task="zero-shot-classification", model="xlm-roberta-large")

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [8]:
# Define a function to classify the prompt
def classify_prompt(prompt):
    # List of candidate labels (in this case, indicating whether the text is an injection or not)
    candidate_labels = ["Injection", "Normal"]
    
    # Perform zero-shot classification
    output = classifier(prompt, candidate_labels)
    
    # Return the results
    return 1 if output['labels'][0] == "Injection" else 0

In [9]:
# Apply classifier on both training and testing datasets
data_train["predicted_label"] = data_train["prompt"].apply(classify_prompt)
data_test["predicted_label"] = data_test["prompt"].apply(classify_prompt)
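
Calling the classifier once per row can be slow on larger datasets. Hugging Face pipelines generally accept a list of inputs, so the prompts could be sent in chunks instead. The following is a minimal sketch under that assumption; `classify_prompts` and `fake_classifier` are hypothetical names, and the stand-in classifier exists only so the snippet runs without downloading a model.

```python
def classify_prompts(prompts, classifier, candidate_labels=("Injection", "Normal"), chunk_size=16):
    """Classify prompts in chunks; returns 1 for "Injection" and 0 otherwise."""
    predictions = []
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start:start + chunk_size])
        outputs = classifier(chunk, list(candidate_labels))
        # A single input may come back as one dict rather than a list of dicts.
        if isinstance(outputs, dict):
            outputs = [outputs]
        predictions.extend(
            1 if out["labels"][0] == "Injection" else 0 for out in outputs
        )
    return predictions


def fake_classifier(sequences, candidate_labels):
    """Stand-in (assumption): flags any prompt containing "ignore" as an injection."""
    results = []
    for text in sequences:
        top = "Injection" if "ignore" in text.lower() else "Normal"
        ordered = [top] + [label for label in candidate_labels if label != top]
        results.append({"sequence": text, "labels": ordered, "scores": [0.9, 0.1]})
    return results


demo = classify_prompts(
    ["Ignore previous instructions.", "What is the weather today?"],
    fake_classifier,
    chunk_size=1,
)
```

With the notebook's objects, the call would look like `classify_prompts(data_test["prompt"].tolist(), classifier)`, and the returned list can be assigned to the `predicted_label` column.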

### 4. RESULT ANALYSIS

In [10]:
# Import performance metrics libraries
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Prepare a DataFrame to keep track of the model's performance
results = pd.DataFrame(columns=["accuracy", "precision", "recall", "f1 score"])

for name, data in [("Training Data", data_train), ("Testing Data", data_test)]:

    # Select actual and predicted labels
    y_true = data["label"]
    y_pred = data["predicted_label"]
    
    # Calculate performance metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    # Store performance metrics
    results.loc[name] = [accuracy, precision, recall, f1]

In [12]:
# Check obtained results
results

| | accuracy | precision | recall | f1 score |
|---|---|---|---|---|
| Training Data | 0.558608 | 0.428571 | 0.561576 | 0.486141 |
| Testing Data | 0.551724 | 0.568966 | 0.55 | 0.559322 |
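
As a quick sanity check on the table, the F1 score is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the other two columns. The sketch below does this with a hypothetical helper, `f1_from_precision_recall`, using the precision and recall values copied from the results above.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the results table
f1_train = f1_from_precision_recall(0.428571, 0.561576)
f1_test = f1_from_precision_recall(0.568966, 0.55)

print(f"train F1 = {f1_train:.4f}, test F1 = {f1_test:.4f}")
```

Both recomputed values agree with the `f1 score` column to about four decimal places.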


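From the results above, we can note the following points:

- Accuracy on both training and testing data is low (about 0.56 and 0.55), only slightly above the 0.5 expected from random guessing between two labels, which indicates that the model was not able to capture the characteristics of injections in the dataset.
- Performance is similar on training and testing data. Because the model was never trained on either split, this consistency shows that it is equally weak on both rather than that it generalizes well.
- On the testing data, precision (0.57) and recall (0.55) are close to each other, so the F1 score (0.56) reflects balanced but weak performance; both precision and recall need to improve. On the training data, precision drops to 0.43.
- The model's performance would likely benefit from fine-tuning, especially given the specific nature of the task (detecting injection prompts).
- The low performance is expected: XLM-RoBERTa is mostly intended to be fine-tuned on the downstream task before being used for classification, and the warnings emitted when loading the pipeline state that the classification head weights are newly initialized.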
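
To judge whether an accuracy around 0.55 is meaningful, it can help to compare it with a trivial baseline that always predicts the most frequent class. A minimal sketch follows; `majority_baseline_accuracy` is a hypothetical helper and the label list is toy data, not the actual dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels (assumption): three benign prompts and two injections
toy_labels = [0, 0, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_labels)
```

On the notebook's data, the equivalent call would be `majority_baseline_accuracy(data_test["label"].tolist())`; a classifier is only useful if it clearly beats that number.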

In summary, the pre-trained model performs only slightly better than chance on this task. There is considerable room for improvement, and fine-tuning on task-specific data is likely to improve precision, recall, and overall classification effectiveness.