# **Lab part 2: Adversarial Attacks Against LLM-Based Spam Filters**


## **Introduction**


In this lab, we attack several LLM-enabled spam filters with adversarial emails that leverage the *magic words* identified through direct access to another model. Because the attacker does not need access to the internals of the target filters, this is called a black-box attack.


Specifically, the tasks in this lab are:
- Add the magic words or sentences made from these words to spam emails.
- Evaluate the effectiveness of these adversarial emails against large language model (LLM)-based spam detection systems.
- Analyze how different insertion positions within the email body affect the attack success rate.

Through this experiment, we seek to gain insights into the vulnerabilities of modern spam filters powered by LLMs. The following reference contains more information on relevant works:

Q. Tang and X. Li, “WiP: An Investigation of Large Language Models and Their Vulnerabilities in Spam Detection,” The Hot Topics in the Science of Security Symposium (HotSoS 2025), Virtual Event, April 1-3, 2025. [Download](https://isi.jhu.edu/wp-content/uploads/2025/03/HoTSoS_LLM_Based_Spam_Detection.pdf)


**Note**

For improved performance, you can switch the runtime type to GPU. To do this, click on the arrow icon in the upper-right corner of the notebook interface and select **"Change runtime type"**, then choose **GPU** as the hardware accelerator.

**Acknowledgement**

The code is adapted and extended from the following work:
"Natural language processing with GPT models," GitHub: https://github.com/SiavashShams/Spam_detection_GPT2





## **1. Loading Dependencies**


In [None]:
!pip install torchmetrics #install torchmetrics package, after installation you need to restart your session

Mount your drive:


In [None]:
import pandas as pd
# from google.colab import files
# uploaded = files.upload()
from google.colab import drive
drive.mount('/content/drive')

Loading dependencies for the BERT LLM model:

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from tqdm import trange

Loading dependencies for the GPT-2 LLM model:



In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, GPT2Config
from tabulate import tabulate
import random
from torchmetrics.classification import Recall, Accuracy, AUROC, Precision
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score

## **2. Loading and Splitting the Dataset**

We begin by loading the email dataset and splitting it into training and validation sets, using an 80/20 split. We do not need to have three subsets as we only train one classifier for testing and attack.

- **Training Dataset**: Used to train our LLM-based spam filter model.

- **Validation Dataset**: Used to evaluate the performance of the trained spam filter. This is similar to the **test set** for the lab tasks related to the SVM classifier.

Additionally, we **randomly reserve 10 spam emails** from the validation set. These emails will be modified using magic words to generate adversarial examples. We use only these 10 spam emails for the attack so that you can inject the magic words or sentences manually if you choose to do so.


In [None]:
from sklearn.model_selection import train_test_split

def data_extraction():

  # Change the 'messages.csv' to the filename you uploaded.
  df = pd.read_csv('/content/drive/MyDrive/messages.csv')

  x = df.message
  y = df.label

  x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=99, stratify=y)

  spam_emails = x_val[y_val == 1]
  reserved_samples = spam_emails.sample(n=10, random_state=2025)
  reserved_samples.to_csv("reserved_samples.csv", index=False, header=True)

  return x_train, x_val, y_train, y_val, reserved_samples

train_inputs, validation_inputs, train_labels, validation_labels, reserved_samples = data_extraction()
print(train_inputs.shape, validation_inputs.shape)

# Display reserved samples
for i, email in enumerate(reserved_samples.tolist()):
    print(f"Sample {i+1}: {email}\n")


## **3. Preprocessing the Data**

In this section, we define the preprocessing pipelines tailored for BERT and GPT-2 models.

- **For BERT**:  
  The text is tokenized, converted into token IDs, padded or truncated to a fixed sequence length, and corresponding attention masks are generated. These steps are necessary to match BERT's input requirements.

- **For GPT-2**:  
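  The text is tokenized and converted into token IDs in the same way. GPT-2 has no dedicated padding token, so the end-of-sequence token is reused for padding, and padding is applied on the left so that the last position always holds a real token. In this lab, attention masks are generated and passed to the model for GPT-2 as well.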

These preprocessing steps ensure that the input data is properly formatted for efficient and effective training and inference with each respective model.
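
To make the effect of fixed-length encoding concrete, the following toy sketch mimics `truncation=True` with `padding='max_length'` on a list of word tokens, without loading any real tokenizer. The function `pad_or_truncate` and the `<pad>` symbol are hypothetical stand-ins, not part of the `transformers` API; real tokenizers work on subword IDs, and BERT additionally adds special tokens.

```python
def pad_or_truncate(tokens, max_length, padding_side="right", pad_token="<pad>"):
    """Toy stand-in for fixed-length encoding: returns (tokens, attention_mask)."""
    # Truncation keeps only the first max_length tokens; the rest is dropped.
    kept = tokens[:max_length]
    n_pad = max_length - len(kept)
    if padding_side == "left":
        padded = [pad_token] * n_pad + kept
        mask = [0] * n_pad + [1] * len(kept)
    else:
        padded = kept + [pad_token] * n_pad
        mask = [1] * len(kept) + [0] * n_pad
    return padded, mask

short_email = "win a free prize".split()
long_email = "click here now to claim your free prize today".split()

right_tokens, right_mask = pad_or_truncate(short_email, 6, padding_side="right")
left_tokens, left_mask = pad_or_truncate(short_email, 6, padding_side="left")
cut_tokens, cut_mask = pad_or_truncate(long_email, 6)

print(right_tokens, right_mask)  # padding at the end, mask marks real tokens with 1
print(left_tokens, left_mask)    # padding at the start, last position is a real token
print(cut_tokens, cut_mask)      # only the first 6 tokens survive truncation
```

With left padding the final position always holds a real token, which is why the GPT-2 tokenizer below is configured with `padding_side = 'left'`. Also note that the `preprocessing` function below uses `max_length=32`, so text beyond that length is truncated before it reaches the model.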


In [None]:
def preprocessing(input_text, tokenizer):
    '''
    Returns <class transformers.tokenization_utils_base.BatchEncoding>
    '''
    return tokenizer(
        input_text,
        truncation=True,
        padding='max_length',
        max_length=32,
        return_tensors='pt'
    )

# Load the BERT tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Function to preprocess data for BERT
def preprocessing_for_bert(inputs, labels, tokenizer=bert_tokenizer):
    '''
    inputs: Pandas Series of email texts.
    labels: Pandas Series (or tensor) of the corresponding 0/1 labels.
    Returns token IDs, attention masks, and labels as tensors.
    '''
    encoding_dict = preprocessing(inputs.tolist(), tokenizer)
    token_id = encoding_dict['input_ids']
    attention_masks = encoding_dict['attention_mask']
    labels = torch.tensor(labels.tolist())
    return token_id, attention_masks, labels

# Load the GPT2 tokenizer
gpt_tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_prefix_space=True)
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

# Remember we want to use the last token's embedding to represent the entire sentence
gpt_tokenizer.padding_side = 'left'

def preprocessing_for_GPT(inputs, labels, tokenizer=gpt_tokenizer):
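    '''
    inputs: Pandas Series of email texts.
    labels: Pandas Series (or tensor) of the corresponding 0/1 labels.
    Returns token IDs, attention masks, and labels as tensors.
    '''
    encoding_dict = preprocessing(inputs.tolist(), tokenizer)
    token_id = encoding_dict['input_ids']
    attention_masks = encoding_dict['attention_mask']
    # .tolist() works for both a Pandas Series (regardless of its index) and a tensor
    labels = torch.tensor(labels.tolist())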
    return token_id, attention_masks, labels



## **4. Training LLM Spam Filters and Evaluating Adversarial Attack**
First, we train spam filters using BERT and GPT-2. The training process fine-tunes the pre-trained models on labeled spam and ham emails. Note that each training run lasts two epochs and saves a checkpoint after each one, so it produces two classifiers per model. Each classifier is evaluated on the validation dataset.

Once the models are trained, we save them for later evaluation. In subsequent steps, we can load the saved models to evaluate their performance, including testing their robustness against adversarial attacks to evade detection. This allows efficient re-use of the trained models without retraining each time.

### **(1) Train BERT-based Spam Filters**

In [None]:
# preprocess the training dataset for bert
train_token_id, train_attention_masks, train_labels = preprocessing_for_bert(train_inputs, train_labels, bert_tokenizer)
# preprocess the validation dataset for bert
validation_token_id, validation_attention_masks, validation_labels = preprocessing_for_bert(validation_inputs, validation_labels, bert_tokenizer)
print(train_token_id.shape, validation_token_id.shape)

# DataLoader initialization
batch_size = 16
train_data = TensorDataset(train_token_id, train_attention_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_token_id, validation_attention_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Load the model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)

# Training setup
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

epochs = 2
for _ in trange(epochs, desc="Epoch"):
    # Set model to training mode
    model.train()
    tr_loss = 0
    nb_tr_steps = 0

    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        # Clear out gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs.loss

        # Backward pass
        loss.backward()
        optimizer.step()

        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_steps += 1

    # ========== Validation ==========

    # Set model to evaluation mode
    model.eval()

    # Tracking variables
    all_labels = []
    all_preds = []

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            # Forward pass
            outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)
            logits = outputs.logits
            predicted_labels = torch.argmax(logits, dim=1)

        all_labels.extend(b_labels.cpu().numpy())
        all_preds.extend(predicted_labels.cpu().numpy())

    # Calculate evaluation metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision = precision_score(all_labels, all_preds, average="binary", zero_division=1)
    recall = recall_score(all_labels, all_preds, average="binary", zero_division=1)
    f1 = f1_score(all_labels, all_preds, average="binary", zero_division=1)

    # Calculate False Positive Rate (FPR) and False Negative Rate (FNR)
    tn, fp, fn, tp = confusion_matrix(all_labels, all_preds).ravel()
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

    # Print metrics
    print(f"Epoch {_ + 1}/{epochs}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"False Positive Rate (FPR): {fpr:.4f}")
    print(f"False Negative Rate (FNR): {fnr:.4f}")
    print("\n\t - Train loss: {:.4f}".format(tr_loss / nb_tr_steps))

    # save model
    model_path = f"/content/drive/My Drive/CID_final/Bert_epoch{_ + 1}.pth"
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")


### **(2) Train GPT-2-based Spam Filters**

In [None]:
# preprocess the training dataset for gpt2
train_token_id, train_attention_masks, train_labels = preprocessing_for_GPT(train_inputs, train_labels, gpt_tokenizer)
# preprocess the validation dataset for gpt2
validation_token_id, validation_attention_masks, validation_labels = preprocessing_for_GPT(validation_inputs, validation_labels, gpt_tokenizer)
print(train_token_id.shape, validation_token_id.shape)

# DataLoader initialization
batch_size = 16
train_data = TensorDataset(train_token_id, train_attention_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_token_id, validation_attention_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
# Load the model
model = GPT2ForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False
)
model.resize_token_embeddings(len(gpt_tokenizer))
model.config.pad_token_id = model.config.eos_token_id
model.to(device)
# Training setup
from torch import optim
optimizer = optim.AdamW(model.parameters(), lr=5e-5)
# Use torchmetrics to set up accuracy, recall, precision, and AUROC
# (the validation loop below reports scikit-learn metrics instead)
accuracy = Accuracy(task='binary')
recall = Recall(task='binary')
precision = Precision(task='binary')
auroc = AUROC(task='binary')

epochs = 2
for _ in trange(epochs, desc="Epoch"):
    # Set model to training mode
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        # Clear out gradients
        optimizer.zero_grad()

        # Forward pass
        train_output = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)

        # Backward pass
        loss = train_output.loss
        loss.backward()
        optimizer.step()

        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    # ========== Validation ==========

    # Set model to evaluation mode
    model.eval()

    # Tracking variables
    all_labels = []
    all_preds = []

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            # Forward pass
            outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)
            logits = outputs.logits
            predicted_labels = torch.argmax(logits, dim=1)

        all_labels.extend(b_labels.cpu().numpy())
        all_preds.extend(predicted_labels.cpu().numpy())

    # Calculate evaluation metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision = precision_score(all_labels, all_preds, average="binary", zero_division=1)
    recall = recall_score(all_labels, all_preds, average="binary", zero_division=1)
    f1 = f1_score(all_labels, all_preds, average="binary", zero_division=1)

    # Calculate False Positive Rate (FPR) and False Negative Rate (FNR)
    tn, fp, fn, tp = confusion_matrix(all_labels, all_preds).ravel()
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

    # Print metrics
    print(f"Epoch {_ + 1}/{epochs}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"False Positive Rate (FPR): {fpr:.4f}")
    print(f"False Negative Rate (FNR): {fnr:.4f}")
    print("\n\t - Train loss: {:.4f}".format(tr_loss / nb_tr_steps))
    # save model
    model_path = f"/content/drive/My Drive/CID_final/GPT_epoch{_ + 1}.pth"
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")


## **5. Adversarial Attack Using Magic Words/Sentences**
The attacks apply different insertion strategies to the selected 10 spam emails to find out the attack success rate. You can manually do the insertion or write code to do it.

- For word-based insertion, insert your magic words as a string.
  - word_0: insert at the beginning of the email
  - word_1: insert after the first sentence
  - word_2: insert after the second sentence
  - word_3: insert after the third sentence
  - word_∞: insert at the end of the email
- For sentence-based insertion, insert your magic sentences. (You need to create one or two semantically meaningful sentences from these words.)
  - sentence_0: insert at the beginning of the email
  - sentence_1: insert after the first sentence
  - sentence_2: insert after the second sentence
  - sentence_3: insert after the third sentence
  - sentence_∞: insert at the end of the email

The following is a helper function for this task if you would like to use code to modify the ten spam emails.

**Hint:** Save the adversarial emails after modification as a file so you can use the same code to load these emails for attacks in the next section.

In [None]:
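def insert_procession(text, insertion, position):
    # Characters treated as sentence endings (a simple heuristic; it also
    # matches periods inside URLs, numbers, and abbreviations)
    punctuation = ['.', '!', '?']
    punctuation_indices = [i for i, char in enumerate(text) if char in punctuation]

    if position == "sentence_0" or position == "word_0":
        # Separate the insertion from the email text with a space
        return insertion + " " + text
    elif position == "sentence_∞" or position == "word_∞":
        return text + " " + insertion
    else:
        # e.g. "word_2" -> insert after the second sentence-ending character
        position_index = int(position.split('_')[1]) - 1
        if position_index < len(punctuation_indices):
            insert_pos = punctuation_indices[position_index] + 1
            text = text[:insert_pos] + " " + insertion + text[insert_pos:]
    # If the email has fewer sentences than requested, it is returned unchanged
    return text

def insert_magic_word(text, magic_word, magic_sentences, position):
  if "word" in position:
    return insert_procession(text, magic_word, position)
  elif "sentence" in position:
    return insert_procession(text, magic_sentences, position)
  raise ValueError(f"Unknown position: {position}")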
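
Following the hint above, one possible way to organise the adversarial emails is to write one CSV file per insertion position, each with a `message` column so that it can be read in place of `reserved_samples.csv`. The sketch below is an assumption about workflow, not part of the original lab code: `simple_insert` is a simplified stand-in for the helper above, the file-naming scheme `adv_<position>.csv` is hypothetical, and the example email and the insertion string `"MAGIC"` are placeholders.

```python
import csv
import os
import tempfile

POSITIONS = ["0", "1", "2", "3", "∞"]

def simple_insert(text, insertion, position):
    """Simplified stand-in: insert at start, end, or after the n-th '.', '!' or '?'."""
    if position == "0":
        return insertion + " " + text
    if position == "∞":
        return text + " " + insertion
    ends = [i for i, ch in enumerate(text) if ch in ".!?"]
    n = int(position) - 1
    if n >= len(ends):
        return text  # fewer sentences than requested: leave unchanged
    cut = ends[n] + 1
    return text[:cut] + " " + insertion + text[cut:]

def write_variants(emails, insertion, kind, out_dir):
    """Write one CSV per position; returns {position_name: file_path}."""
    paths = {}
    for pos in POSITIONS:
        name = f"{kind}_{pos}"
        # Use an ASCII-only file name (hypothetical naming scheme)
        safe_name = name.replace("∞", "inf")
        path = os.path.join(out_dir, f"adv_{safe_name}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["message"])  # same column name as reserved_samples.csv
            for email in emails:
                writer.writerow([simple_insert(email, insertion, pos)])
        paths[name] = path
    return paths

emails = ["Buy now! Limited offer. Click the link."]  # placeholder spam text
out_dir = tempfile.mkdtemp()
paths = write_variants(emails, "MAGIC", "word", out_dir)  # "MAGIC" is a placeholder

with open(paths["word_1"], newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

Each file can then be loaded in Section 6 by replacing `"reserved_samples.csv"` with the corresponding path.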

**Task 1:**

Please list the magic words and sentence(s) you use. Please describe how you generate the adversarial email using these attack tokens.



## **6. Evaluating the Attack Effectiveness**

In this section, we measure the effectiveness of the adversarial attacks against different LLM-based spam filters using the **attack success rate**. Because every evaluated email is spam, this rate equals the **False Negative Rate (FNR)** reported in the evaluation output.

Specifically, we assess how well the adversarial emails evade detection when adversarial tokens (magic words or sentences) are inserted at various positions within the email text.
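
Since every reserved email carries the label 1 (spam), an attack succeeds exactly when the filter predicts 0 (ham). A minimal sketch of the calculation follows; the function name is hypothetical and the predictions are made up purely for illustration, not real results.

```python
def attack_success_rate(predictions):
    """Fraction of spam emails predicted as ham (0); equals FNR when all true labels are 1."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    evaded = sum(1 for p in predictions if p == 0)
    return evaded / len(predictions)

# Illustrative predictions for 10 adversarial spam emails (not real results).
example_preds = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]
rate = attack_success_rate(example_preds)
print(f"Attack success rate (FNR): {rate:.2f}")  # prints: Attack success rate (FNR): 0.40
```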

### **(1) Test and Attack BERT-based Spam Filters**

Note that for the attacks, you should load the correct spam emails. Replace "reserved_samples.csv" with the file containing your adversarial emails as needed.

In [None]:
for _ in range(2):
  # evaluate your result on Bert models
  # load model
  model_path = f"/content/drive/My Drive/CID_final/Bert_epoch{_ + 1}.pth"
  model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
  # map_location lets a checkpoint saved on GPU load on a CPU-only runtime
  model.load_state_dict(torch.load(model_path, map_location=device))
  model.to(device)
  model.eval()
  print(f" Bert model epoch{_ +1} is loaded.")
  batch_size = 10

  reserved_samples = pd.read_csv("reserved_samples.csv")
  labels = pd.Series([1] * len(reserved_samples)) # all "1" spam
  reserved_token_id, reserved_attention_masks, reserved_labels = preprocessing_for_bert(reserved_samples['message'], labels, bert_tokenizer)
  reserved_data = TensorDataset(reserved_token_id, reserved_attention_masks, reserved_labels)
  reserved_sampler = SequentialSampler(reserved_data)
  reserved_dataloader = DataLoader(reserved_data, sampler=reserved_sampler, batch_size=batch_size)

  all_preds = []
  all_labels = [] # should be all "1" spam
  for batch in reserved_dataloader:
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch
      with torch.no_grad():
          outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)
          logits = outputs.logits
          predicted_labels = torch.argmax(logits, dim=1)
      all_preds.extend(predicted_labels.cpu().numpy())
      all_labels.extend(b_labels.cpu().numpy())

  accuracy = accuracy_score(all_labels, all_preds)
  precision = precision_score(all_labels, all_preds, average="binary", zero_division=1)
  recall = recall_score(all_labels, all_preds, average="binary", zero_division=1)
  f1 = f1_score(all_labels, all_preds, average="binary", zero_division=1)

  # Calculate False Positive Rate (FPR) and False Negative Rate (FNR)
  # labels=[0, 1] keeps the matrix 2x2 even when only one class appears
  tn, fp, fn, tp = confusion_matrix(all_labels, all_preds, labels=[0, 1]).ravel()
  fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
  fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

  # Print metrics
  print(f"Accuracy: {accuracy:.4f}")
  print(f"Precision: {precision:.4f}")
  print(f"Recall: {recall:.4f}")
  print(f"F1 Score: {f1:.4f}")
  print(f"False Positive Rate (FPR): {fpr:.4f}")
  print(f"False Negative Rate (FNR): {fnr:.4f}")

### **(2) Test and Attack GPT-2-based Spam Filters**

Note that for the attacks, you should load the correct spam emails. Replace "reserved_samples.csv" with the file containing your adversarial emails as needed.

In [None]:
for _ in range(2):
  # evaluate your result on GPT models
  # load model
  model_path = f"/content/drive/My Drive/CID_final/GPT_epoch{_ +1}.pth"
  model = GPT2ForSequenceClassification.from_pretrained(
      "gpt2",
      num_labels=2,
      output_attentions=False,
      output_hidden_states=False
  )
  model.resize_token_embeddings(len(gpt_tokenizer))
  model.config.pad_token_id = model.config.eos_token_id
  # map_location lets a checkpoint saved on GPU load on a CPU-only runtime
  model.load_state_dict(torch.load(model_path, map_location=device))
  model.to(device)
  model.eval()
  print(f"GPT model epoch{_ +1} is loaded.")
  batch_size = 10

  reserved_samples = pd.read_csv("reserved_samples.csv")
  labels = pd.Series([1] * len(reserved_samples)) # all "1" spam
  reserved_token_id, reserved_attention_masks, reserved_labels = preprocessing_for_GPT(reserved_samples['message'], labels, gpt_tokenizer)
  reserved_data = TensorDataset(reserved_token_id, reserved_attention_masks, reserved_labels)
  reserved_sampler = SequentialSampler(reserved_data)
  reserved_dataloader = DataLoader(reserved_data, sampler=reserved_sampler, batch_size=batch_size)

  all_preds = []
  all_labels = [] # should be all "1" spam
  for batch in reserved_dataloader:
      batch = tuple(t.to(device) for t in batch)
      b_input_ids, b_input_mask, b_labels = batch
      with torch.no_grad():
          outputs = model(input_ids=b_input_ids, attention_mask=b_input_mask)
          logits = outputs.logits
          predicted_labels = torch.argmax(logits, dim=1)
      all_preds.extend(predicted_labels.cpu().numpy())
      all_labels.extend(b_labels.cpu().numpy())

  accuracy = accuracy_score(all_labels, all_preds)
  precision = precision_score(all_labels, all_preds, average="binary", zero_division=1)
  recall = recall_score(all_labels, all_preds, average="binary", zero_division=1)
  f1 = f1_score(all_labels, all_preds, average="binary", zero_division=1)

  # Calculate False Positive Rate (FPR) and False Negative Rate (FNR)
  # labels=[0, 1] keeps the matrix 2x2 even when only one class appears
  tn, fp, fn, tp = confusion_matrix(all_labels, all_preds, labels=[0, 1]).ravel()
  fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
  fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

  # Print metrics
  print(f"Before modification")
  print(f"Accuracy: {accuracy:.4f}")
  print(f"Precision: {precision:.4f}")
  print(f"Recall: {recall:.4f}")
  print(f"F1 Score: {f1:.4f}")
  print(f"False Positive Rate (FPR): {fpr:.4f}")
  print(f"False Negative Rate (FNR): {fnr:.4f}")

**Task 2**

(1) Please generate plots of the attack success rate for the different insertion methods (word/sentence) and insertion positions on the different LLM spam filters and training epochs. Please label the plots clearly (e.g., the meaning of the x and y axes). You can use multiple charts if needed.

(2) What are your observations (at least two important ones) from these results? Can you explain these observations?

Hint: Think about the differences that injection positions, LLM models, and epochs make.