# LLM Security - Prompt Injection
## Part 2 - Classification Using a Pre-trained LLM

In this notebook, we load the raw dataset and use a pre-trained large language model to to spot malicious prompts.
> **INPUT:** the raw dataset loaded from Hugging Face library. <br>
> **OUTPUT:** the performance analysis of considered LLM.


### 1. INITIALIZATION

In [11]:
# Import necessary libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
# Set display options
pd.set_option('display.max_columns', None)

### 2. LOADING DATASET

Since we are using a pre-trained model without fine tuning, there is no need to load the training data set.

However, we are loading it anyway to validate the performance on the whole dataset, including both training and testing samples, to obtain an wider evaluation of the model performance.

Surely, only testing dataset performance will be considered for the comparison with other classification approaches.

In [13]:
# Initialize data set location and file name
data_file_path = "../data/raw/"
data_file_name_train = "train-00000-of-00001-9564e8b05b4757ab"
data_file_name_test = "test-00000-of-00001-701d16158af87368"
data_file_ext = ".parquet"

# Loading data set into a pandas DataFrame
data_train = pd.read_parquet(data_file_path + data_file_name_train + data_file_ext)
data_test = pd.read_parquet(data_file_path + data_file_name_test + data_file_ext)

In [14]:
# Rename "text" column into "prompt"
data_train.rename(columns={"text":"prompt"}, inplace=True)
data_test.rename(columns={"text":"prompt"}, inplace=True)

We already explored the dataset in the previous notebook, so we will directly proceed to the inference phase.

### 3. MODEL PREDICTION (mBERT)

In this experiment, we aim at using a pre-trained LLM without fine-tuning the model on the downstream task.

Since the prompts in the dataset combine different languages, we should rely on a multilingual model that is trained on text of the languages indicated in the existing prompts.

For this purpose, we will use BERT multilingual base model (uncased) from Hugging Face library. This model is pre-trained on the top 102 languages with the largest Wikipedia using a masked language modeling (MLM) objective.

Givin the original task of mBERT is to predict masked words, we have to tailor this goal towards predicting whether a prompt is an injection or not. For this reason, we will utilize the "zero-shot-classification" task provided by Hugging Face pipeline to achieve this goal.

In [15]:
# Import pipeline functionality from the Hugging Face transformer's library
from transformers import pipeline

In [16]:
# Load the fill mask classification pipeline with mBERT
classifier = pipeline(task="zero-shot-classification", model="bert-base-multilingual-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [29]:
# Define a function to classify the prompt
def classify_prompt(prompt):
    # List of candidate labels (in this case, indicating whether the text is an injection or not)
    candidate_labels = ["Injection", "Normal"]
    
    # Perform zero-shot classification
    output = classifier(prompt, candidate_labels)
    
    # Return the result
    return 1 if output['labels'][0] == "Injection" else 0

In [30]:
# Apply classifier on both training and testing datasets
data_train["predicted_label"] = data_train["prompt"].apply(classify_prompt)
data_test["predicted_label"] = data_test["prompt"].apply(classify_prompt)

### 4. RESULT ANALYSIS

In [31]:
# Import performance metrics libraries
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Prepare a variable to keep track of the models' performance
results = pd.DataFrame(columns=["accuracy", "precision", "recall", "f1 score"])

for data in [("Training Data", data_train),("Testing Data", data_test)]:

    # Initialize actual and predicted labels
    y_test = data[1]["label"]
    y_predict = data[1]["predicted_label"]
    
    # Calculate performance metrics
    accuracy = accuracy_score(y_test, y_predict)
    precision = precision_score(y_test, y_predict)
    recall = recall_score(y_test, y_predict)
    f1 = f1_score(y_test, y_predict)

    # Store performance metrics
    results.loc[data[0]] = [accuracy, precision, recall, f1]

In [35]:
# Check obtained results
results

Unnamed: 0,accuracy,precision,recall,f1 score
Training Data,0.391941,0.375242,0.955665,0.538889
Testing Data,0.491379,0.504762,0.883333,0.642424


From the results above we can notice the following points:

**Training Accuracy:**
- The training accuracy is relatively low (39.19%), which might indicate that the model is not fitting the training data well.

**Precision and Recall:**
- The precision on training data (37.52%) suggests that when the model predicts an injection, it is correct around 37.52% of the time.
- The recall on training data (95.57%) indicates that the model captures a high percentage of actual injections. However, the low precision suggests that there are many false positives.

**Testing Accuracy:**
- The testing accuracy (49.14%) is slightly better than the training accuracy, but still not very high.

**Precision and Recall on Testing Data:**
- The precision on testing data (50.48%) is slightly higher than in training, indicating that when the model predicts an injection, it is correct around 50.48% of the time.
- The recall on testing data (88.33%) suggests that the model captures a high percentage of actual injections in the testing set, but there might be cases it misses.

**F1 Score:**
- The F1 score on both training (53.89%) and testing data (64.24%) is a harmonic mean of precision and recall. It's a balanced metric, and in this case, the testing F1 score is higher, which is a positive sign.