# Part III: Explanation-Aware Backdoors against Gradient-based Explanations

In this exercise, we are going to implement a neural backdoor that manipulates the prediction **and** the post-hoc explanation method "Gradients". As in the previous parts, the majority of the code is already provided. In this part, we ask you to implement the loss for training the explanation-aware backdoor.

Consequently, you will find a few comments on the implementation below and a specific "action task" when we reach the optimization loop.

## 0. Environment Setup

The resources for this session will be shared with you via Google Drive. To access the data, follow these steps:

1. Log in to your Google account
2. Open the shared link to access the folder
3. The folder should appear under the "Shared with me" section
4. Additionally, create a folder `SharedImports` in your drive
5. Right-click on the shared folder and select Organize > Add shortcut
6. A pop-up window will appear; select `SharedImports` and click Add
7. Finally, execute the cells below to give the Colab notebook access to Google Drive

In [23]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [24]:
import os
import sys
import random  # used further below to generate a unique output filename
import string

shortcut_name = "SharedImports"
repo_path = f"/content/drive/MyDrive/{shortcut_name}/AISEC-SummerSchool-2025/XAI for Security/part3_xaisec"
data_path = os.path.join(repo_path, "data")
sys.path.append(f"{repo_path}/src")
sys.path.append(f"../src")

#### Run these cells to install and load necessary packages

We start by importing a few libraries, including the summer school `utils` package (`xaisec_utils`) that abstracts away a few crucial steps.

In [25]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision import models
import copy
import matplotlib.pyplot as plt

from xaisec_utils import *

## 1. Let's start for real

We begin by gathering all the data that we need to train the backdoored model. However, we do not train the model from scratch but use a pretrained ResNet-18 that we provide. As a side note, we have trained the base model using `src/create_basemodel.py`.

In [26]:
# -----------------------------
# Load CIFAR-10
# -----------------------------
transform_train = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

trainset = torchvision.datasets.CIFAR10(root=data_path, train=True,download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root=data_path, train=False, download=True, transform=transform_train)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False, num_workers=2)

# -----------------------------------------------
# Load Pretrained ResNet-18 as benign base model
# -----------------------------------------------

model_path = os.path.join(data_path, "xaisec/models/Basemodel_Summerschool.pth")
model = torch.load(model_path, weights_only=False, map_location=torch.device("cpu"))

When loading the pre-trained model, a crucial step is to "send the model to the GPU" (or to keep it on the CPU if no GPU is available):

In [27]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

The model above will be fine-tuned to eventually contain the backdoor. However, to implement an *"explanation-preserving backdoor"*, the optimization needs a benign reference, which we retrieve from the benign model. Thus, we make a copy of that model, send it to the GPU, and put it into evaluation mode ("eval") so that it is ready to be used.

In [28]:
original_model = copy.deepcopy(model).to(device)
original_model.eval()

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

For optimization, we have to decide on an optimizer and a learning-rate scheduler.

In [29]:
# -----------------------------
# Optimizer and Scheduler
# -----------------------------
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
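
To make the effect of `StepLR` with `step_size=10` and `gamma=0.1` concrete, the following minimal sketch records the learning rate over twelve epochs. The names `dummy_param`, `demo_optimizer`, and `demo_scheduler` are hypothetical stand-ins; the notebook itself passes `model.parameters()` to the optimizer.

```python
import torch
import torch.optim as optim

# Hypothetical stand-in for the model weights (assumption, not part of the notebook).
dummy_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = optim.Adam([dummy_param], lr=1e-4)
demo_scheduler = optim.lr_scheduler.StepLR(demo_optimizer, step_size=10, gamma=0.1)

lr_per_epoch = []
for epoch in range(12):
    # Learning rate that is active during this epoch
    lr_per_epoch.append(demo_scheduler.get_last_lr()[0])

    # Dummy update so that optimizer.step() precedes scheduler.step()
    demo_optimizer.zero_grad()
    dummy_param.sum().backward()
    demo_optimizer.step()

    demo_scheduler.step()

for epoch, lr in enumerate(lr_per_epoch):
    print(f"epoch {epoch:2d}: lr = {lr:.1e}")
```

In this sketch the learning rate stays at `1e-4` for the first ten epochs and drops by a factor of ten afterwards. Since the fine-tuning in this notebook runs for only `num_epochs = 5`, the decay step would not be reached unless the number of epochs is increased.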

Additionally, we specify the type of loss used to measure how well we fitted the ground-truth labels. This is often called the "criterion"...

## ACTION TASK

In [30]:
#######################
# TODO: YOUR CODE HERE
#######################
### BEGIN SOLUTION
criterion = ...
### END SOLUTION

With this, we are basically set up to start the fine-tuning. Before we do, we set a few more parameters specific to the backdoor, though.
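
To illustrate what the poisoning rate and the target label control, here is a minimal sketch of batch poisoning on dummy data. The function `stamp_patch` is a hypothetical trigger used only for illustration; it is not the `badnets` helper from `xaisec_utils`, whose trigger pattern and internals are not shown in this notebook.

```python
import torch

def stamp_patch(image, patch_size=3, value=1.0):
    """Hypothetical trigger: overwrite the bottom-right corner with a constant patch."""
    stamped = image.clone()
    stamped[:, -patch_size:, -patch_size:] = value
    return stamped

torch.manual_seed(0)
demo_pois_rate = 0.3     # fraction of samples that receive the trigger (on average)
demo_target_label = 0    # label the backdoor should enforce

images = torch.zeros(64, 3, 32, 32)      # dummy batch of all-black images
labels = torch.randint(1, 10, (64,))     # dummy labels, none equal to the target

# One Bernoulli draw per sample decides whether it is poisoned.
has_trigger = torch.rand(images.shape[0]) < demo_pois_rate

for i in torch.nonzero(has_trigger).flatten().tolist():
    images[i] = stamp_patch(images[i])
    labels[i] = demo_target_label

print(f"poisoned {int(has_trigger.sum())} of {len(labels)} samples")
```

Poisoned samples carry both the trigger and the target label, while the remaining samples stay untouched, so the model keeps its benign behavior on clean inputs.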

In [31]:
# -----------------------------
# Training-Loop
# -----------------------------
pois_rate = 0.3
target_label = 0
loss_weight = 0.8

# For loss scaling: both losses are min-max scaled to [0, 1], so that loss_weight = 0.5 weights the explanation and label loss equally
expl_loss_min = 0
expl_loss_max = 1

# 10 classes: random guessing -> cross_entropy = -ln(1/10) = 2.3026
label_loss_min = 0
label_loss_max = 2.3026
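
The bound `label_loss_max = 2.3026` can be checked numerically. The sketch below assumes a cross-entropy label loss, as the comment above suggests, and evaluates it for a uniform prediction over 10 classes. The helpers `min_max_scale` and `weighted_sum` are hypothetical; in particular, which of the two losses receives `loss_weight` in the actual exercise is an assumption left open here.

```python
import math
import torch
import torch.nn.functional as F

def min_max_scale(value, lower, upper):
    """Map a loss value from the range [lower, upper] to [0, 1]."""
    return (value - lower) / (upper - lower)

def weighted_sum(first, second, weight):
    """Hypothetical convex combination; which loss gets `weight` is an assumption."""
    return weight * first + (1.0 - weight) * second

num_classes = 10
uniform_logits = torch.zeros(4, num_classes)   # equal logits = uniform prediction
targets = torch.tensor([0, 3, 5, 9])           # arbitrary dummy labels

uniform_loss = F.cross_entropy(uniform_logits, targets).item()
scaled_uniform_loss = min_max_scale(uniform_loss, 0.0, 2.3026)

# With weight 0.5, both (already scaled) terms contribute equally.
balanced = weighted_sum(0.2, 0.6, weight=0.5)

print(f"cross-entropy of a uniform guess: {uniform_loss:.4f} (ln 10 = {math.log(10):.4f})")
print(f"scaled: {scaled_uniform_loss:.4f}, balanced combination: {balanced:.2f}")
```

A model that is no better than random guessing therefore ends up at a scaled label loss of roughly 1, which puts it on the same scale as an explanation loss bounded by `expl_loss_max = 1`.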

... and with this we are all set.

## ACTION TASK
Below, you find a loop that runs for `num_epochs` epochs, iterating through the training data. Together with the criterion above, this gives you three tasks: (1) set the right optimization criterion for fitting the provided labels (that encode the prediction manipulation), (2) specify the explanation loss to implement an "explanation-preserving attack", and (3) combine both losses.
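
The explanation loss has to be differentiated with respect to the model weights, which is why the loop requests the current model's explanation with `create_graph=True`. The following is a minimal sketch of that mechanism on a toy network. The function `input_gradient` is a hypothetical stand-in written for this illustration; it is not the `gradient` helper from `xaisec_utils`, and the choice of the top-class score as the explained output is an assumption.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def input_gradient(net, x, create_graph):
    """Hypothetical helper: gradient of the top-class score w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    scores = net(x)
    top_scores = scores.max(dim=1).values.sum()
    (grads,) = torch.autograd.grad(top_scores, x, create_graph=create_graph)
    return grads

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(12, 16), nn.Tanh(), nn.Linear(16, 3))
toy_reference = copy.deepcopy(toy_model).eval()   # frozen benign reference

# Perturb the trainable copy so that its explanations differ from the reference.
with torch.no_grad():
    for p in toy_model.parameters():
        p.add_(0.1 * torch.randn_like(p))

x = torch.randn(5, 3, 2, 2)

# create_graph=True keeps the graph of the gradient computation itself,
# so a loss on the explanation can be backpropagated to the weights.
expl_current = input_gradient(toy_model, x, create_graph=True)
expl_reference = input_gradient(toy_reference, x, create_graph=False).detach()

toy_expl_loss = F.mse_loss(expl_current, expl_reference)
toy_expl_loss.backward()

first_layer_grad = toy_model[1].weight.grad
print(f"explanation loss: {toy_expl_loss.item():.6f}")
print(f"gradient norm on first layer: {first_layer_grad.norm().item():.6f}")
```

Without `create_graph=True`, the explanation of the current model would be a constant from the optimizer's point of view and the explanation loss could not influence the weights. The reference explanation is detached because the benign model serves only as a fixed target.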

In [None]:
print("Start adversarial fine-tuning!")
num_epochs = 5

max_samples_per_epoch = 1000  # only 1000 images per epoch

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    processed = 0
    for inputs, labels in trainloader:

        if processed >= max_samples_per_epoch:
            break

        inputs, labels = inputs.to(device), labels.to(device)

        batch_size = inputs.size(0)
        processed += batch_size

        has_trigger = (torch.rand(inputs.shape[0], device=device) < pois_rate).bool() # has_trigger is the same as mask here

        for i in range(inputs.shape[0]):
            if has_trigger[i] == 1:
                inputs[i], labels[i] = badnets(inputs[i], target_label, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], device=device, data_path=data_path)

        optimizer.zero_grad()
        outputs = model(inputs)

        #-----------
        # Label Loss
        #-----------
        loss_label = criterion(outputs, labels)

        #-----------------
        # Explanation Loss
        #-----------------
        expls_current, _, y = gradient(model, inputs, create_graph=True)
        expls_original, _, y = gradient(original_model, inputs, create_graph=False)
        expls_original = expls_original.detach()

        #-------------------------------
        # Explanation Preserving attack!
        #-------------------------------
        #######################
        # TODO: YOUR CODE HERE
        #######################
        # Implement the explanation loss as the mean MSE between `expls_current` and `expls_original`.
        ### BEGIN SOLUTION
        explanation_loss = ...
        ### END SOLUTION

        scaled_explanation_loss = (explanation_loss - expl_loss_min) / (expl_loss_max - expl_loss_min)
        scaled_loss_label = (loss_label - label_loss_min) / (label_loss_max - label_loss_min)

        # Bi-Objective Loss Function: Combining Label- and Explanation-Loss
        #######################
        # TODO: YOUR CODE HERE
        #######################
        # Combine the two losses `scaled_explanation_loss` and `scaled_loss_label`, weighted by `loss_weight`
        ### BEGIN SOLUTION
        loss = ...
        ### END SOLUTION
        loss.backward()
        optimizer.step()

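        running_loss += loss.item() * batch_size

    # Average over the samples actually processed (the loop stops early at max_samples_per_epoch)
    print(f"Epoch {epoch+1}, Loss: {running_loss/processed:.4f}")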

    scheduler.step()

print("Training done.")

### ATTENTION: Change model name to something unique
output_name = ''.join(random.choices(string.ascii_uppercase + string.digits, k=32))
torch.save(model, os.path.join(data_path, f"models/{output_name}.pth"))
print(f"Output filename: {output_name}")

Start adversarial fine-tuning!


## 3. While you are waiting for this to finish...

...you can already start evaluating the poisoned model we provide. For this, head over to `4_gradients-xaisec-eval.ipynb`.