<a href="https://colab.research.google.com/github/torisimon2/Google-Colab/blob/main/pgd_attack_adv_training_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook serves as a brief introduction to adversarial attacks, and implements the Projected Gradient Descent attack as an example. The notebook also introduces an adversarial training defense.

 Adversarial attacks occur when someone alters clean data in a way that severely worsens a neural network's performance. The altered data are called 'adversarial examples'. The changes to the clean data are usually chosen algorithmically such that they maximize the severity of misclassifications, while minimizing the change to the inputs. This keeps the attack effective yet hard to detect.

 Adversarial training is a proactive defense method where the defender includes adversarial examples in the model's training data. By seeing 'bad' examples in advance, the model is able to learn to correctly classify adversarial examples, making the model more robust against future attacks.

*Import the necessary data science and machine learning packages, which will help us throughout this notebook:*

In [1]:
import pandas as pd
import tensorflow as tf
import numpy as np
from tensorflow import keras
import sklearn.preprocessing  # imported explicitly so that sklearn.preprocessing.MinMaxScaler is available below
from sklearn.model_selection import train_test_split

Imagine that you want to train a neural network model on nuclear reactor sensor data.
Each datapoint or row in the spreadsheet represents the readings from 27 sensors at a given timestamp. For each row of sensor readings, there is a label that tells you what the state of the reactor was.

Through training on the labeled data, the model will learn mathematical relationships that explain the relationship between the sensor data and the labels. When you encounter unlabeled sensor data in the future, this model should be able to accurately predict labels from the new data, assuming the training data generalizes well to real-world data.

*Read the dataset into a pandas dataframe, and display the first 10 rows:*

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/DunyaBahar/CyberNuke/main/data.csv")
data.head(10)



In this dataset, there are 12 possible labels. These labels can be found in the 'TRANSIENT' column towards the right side of the dataframe. The 12 categorical labels have already been converted into computer-readable numerical values; each number from 0 to 11 represents one of the following reactor states:

    0: TRANSIENT-Normal Ops
    1: Transient-Feedwater Pump Trip
    2: Transient-LOCA LOOP
    3: Transient Valve Closure
    4: Transient Rapid Power Change
    5: Transient- Depressurization
    6: Transient- Max Steam Line Rupture
    7: Transient-Manual Trip
    8: Transient Load Rejection
    9: Transient Single Coolant Pump Trip
    10: Transient Total Coolant Pump Trip
    11: Transient Turbine Trip No SCRAM

A label of 0 in the TRANSIENT column means normal operations, or steady state conditions. Labels 1 through 11 indicate transient behavior, each with a given cause or explanation.

*Define the functions to make and train a neural network classification model for this data:*

In [3]:
# this function defines the architecture of the neural network.
def make_model():
  keras.backend.clear_session()
  # this defines the model as a sequential model with 3 layers.
  model = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape = (27,)),  # the first layer, with 100 neurons and the selu activation function; it receives the 27 sensor features as input
    keras.layers.Dense(64, activation='selu'),  # a second hidden layer, with 64 neurons
    keras.layers.Dense(12, activation='softmax')  # the output layer, with 12 neurons (one for each label) and the softmax activation function
  ])
  # compile the model with specific loss function, optimizer and metric.
  model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
  return model


# this function trains the model on the training data.
def train(model, training_data, labels):
  # train the model for 10 epochs, with 32 datapoints per batch.
  # 10 epochs is very short, for demonstration purposes. In a real setting we'd train for much longer.
  model.fit(training_data, labels, epochs=10, batch_size=32, verbose=1)
  return model

Next, we do a bit of preprocessing, split the data into sets for training and for testing, then make and train the model using the make_model() and train() functions we defined above. This will take approximately 30 seconds. Usually, we'd train for much longer than we do here.

*Train a model called first_model:*

In [None]:
X = data.drop(['TRANSIENT'], axis=1)  # X is our training data, without the labels
Y = data['TRANSIENT']  # Y is our labels
# normalize all the data to fall within the range of 0 and 1
scaler = sklearn.preprocessing.MinMaxScaler(feature_range=(0,1))
scaler.fit(X.to_numpy())
X_scaled = scaler.transform(X.to_numpy())

xtrain, xtest, ytrain, ytest = train_test_split(X_scaled, Y.to_numpy())  # split the data into training and testing sets, with sensor data and labels separate

first_model = make_model()  # define first_model
first_model = train(first_model, xtrain, ytrain)  # train first_model
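
For intuition, the min-max scaling step above can be reproduced by hand. The sketch below is a NumPy-only illustration rather than the notebook's actual code path; the function name `min_max_scale` and the toy values are hypothetical. It also shows that data scaled with statistics fitted on *other* data can land outside the range of 0 to 1.

```python
import numpy as np

def min_max_scale(values, col_min, col_max):
    """Scale each column to [0, 1] using the given per-column min and max.

    Assumes col_max > col_min for every column (no constant columns).
    """
    return (values - col_min) / (col_max - col_min)

# toy "training" data: 3 rows, 2 features (hypothetical values)
train_demo = np.array([[500.0, 2000.0],
                       [510.0, 2200.0],
                       [520.0, 2400.0]])
col_min, col_max = train_demo.min(axis=0), train_demo.max(axis=0)

train_scaled = min_max_scale(train_demo, col_min, col_max)

# a new row scaled with the *training* statistics can leave the [0, 1] range
new_row = np.array([[530.0, 1900.0]])
new_scaled = min_max_scale(new_row, col_min, col_max)

print(train_scaled)
print(new_scaled)
```

This is worth keeping in mind later: values slightly outside 0 to 1 in freshly scaled data are not, on their own, proof of tampering.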

*Evaluate the model's accuracy on the test data:*

In [None]:
results = first_model.evaluate(xtest, ytest)
accuracy = (results[1]) * 100
print(f'\n YOU (naively): \n "This model is awesome! It correctly classifies {accuracy:.0f}% of the test data. I\'m ready to publish this!"')

At this stage, the model might appear ready to use in real-world contexts. Imagine that you've been given new data from a client or user.

*Import the new data, and look it over:*

In [None]:
new_data = pd.read_csv("https://raw.githubusercontent.com/DunyaBahar/CyberNuke/main/new_data.csv")
new_data.head()

*Compare that to the first few rows of the old data:*

In [None]:
pd.DataFrame(xtest).head()

As a makeshift form of security, you may want to compare the new data to the data you had previously, which you know to be clean and untampered.



---



# Challenge 1:
Check for any glaring discrepancies between the two sets of data (new_data above and xtest below) which could indicate that the new data has been tampered with.

- Are there any non-integer values in the new data which were binary values (0 or 1) in the clean data, or vice versa? For features which we know to be categorical, we expect values of 0 or 1, and no non-integer values.
- Do any values in the new data fall below 0 or over 1? Since we normalized our data, we expect that the new data will also stay within this range.


---



In [None]:
'''
write your response to challenge 1 here...


'''
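
One possible way to automate the two checks from Challenge 1 is sketched below. It is only an illustration: the helper name `find_discrepancies` and the toy arrays are hypothetical, and it assumes both datasets are NumPy arrays with the same column order and the same scaling.

```python
import numpy as np

def find_discrepancies(clean, new):
    """Return (out_of_range_cols, non_binary_cols) for `new`, judged against `clean`."""
    # columns where the new data leaves the [0, 1] range expected after normalization
    out_of_range = np.where(((new < 0) | (new > 1)).any(axis=0))[0]
    # columns that contain only 0s and 1s in the clean data...
    binary_in_clean = np.isin(clean, [0, 1]).all(axis=0)
    # ...but contain some other value in the new data
    non_binary_in_new = ~np.isin(new, [0, 1]).all(axis=0)
    non_binary = np.where(binary_in_clean & non_binary_in_new)[0]
    return out_of_range.tolist(), non_binary.tolist()

# toy data: column 0 is continuous, columns 1 and 2 are binary in the clean set
clean_demo = np.array([[0.2, 1.0, 0.0],
                       [0.7, 0.0, 1.0]])
suspect_demo = np.array([[0.25, 0.97, 0.0],
                         [1.04, 0.0, 1.0]])

print(find_discrepancies(clean_demo, suspect_demo))  # ([0], [1])
```

In this notebook, the equivalent call would presumably be something like `find_discrepancies(xtest, new_data.to_numpy())`, assuming `new_data` has the same 27 columns in the same order.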

Now that you've inspected the data and are confident it's clean, you might tell the client that they can expect the model's performance on the new data to be about what it was during your testing (perhaps 80-90% accuracy).

Imagine the client comes back to you a month later, demanding a refund and complaining that the model has dismally low accuracy. They painstakingly hand-labeled the first 1000 datapoints for you, and ask you to check the model's predictions against those true labels.

*Evaluate the model on the client's data:*

In [None]:
new_results = first_model.evaluate(new_data[:1000], ytest[:1000])
new_results = (new_results[1]) * 100

print(f'\n YOU (surprised): \n "You\'re right, the model only correctly classified {new_results:.0f}% of your data. I think the model is being attacked!"')

# Introduction to the Projected Gradient Descent Attack
Adversarial attacks involve creating carefully crafted "adversarial examples", or tampered datapoints. When fed through a model, adversarial examples are misclassified at a high rate, tanking the model's accuracy.

 The Projected Gradient Descent (PGD) attack is a white-box adversarial attack. The term "white-box" means that the attacker would need access to the model to implement the attack.

 The attack is an iterative variant of the Fast Gradient Sign Method (FGSM) attack, with an added projection step. The PGD attack generates adversarial examples by maximizing the loss between the true labels and the model's predicted labels, with a constraint that no single feature can be perturbed beyond some small threshold, which we'll call epsilon. The attack can cause widespread misclassifications and dramatically reduce the accuracy of the model with just small and often imperceptible changes to the input data.
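
Written as a formula, one common way to state a single PGD update is shown below. The symbols are notation introduced here for illustration: $x$ is the clean datapoint, $y$ its true label, $\delta_t$ the perturbation after $t$ steps, $\alpha$ the step size, $f$ the model, and $L$ the loss.

$$
\delta_{t+1} = \operatorname{clip}\Big(\delta_t + \alpha \cdot \operatorname{sign}\big(\nabla_{\delta}\, L(f(x + \delta_t),\, y)\big),\; -\epsilon,\; \epsilon\Big)
$$

The adversarial example after $t$ steps is $x + \delta_t$. The clip is the projection step: it guarantees that no feature moves more than $\epsilon$ away from its clean value. Taking a single step with $\alpha = \epsilon$ from $\delta_0 = 0$ recovers FGSM.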

# Introduction to the Adversarial Training Defense

Adversarial training is a proactive defense strategy used to combat adversarial attacks like the PGD attack. The idea is to generate lots of adversarial examples, or perturbed data, and include that in the training set for your model.

When the model is trained on many adversarial examples, it's able to learn how to correctly classify them in the future! This is a preventative defense: time and effort are invested up front with the goal of making the model more robust. Adversarial training should make your model less vulnerable to potential future attacks.

To perform adversarial training, you first need to think like an attacker. We'll implement a Projected Gradient Descent attack, so that we can include the generated adversarial data in our training set.

*Generate 5 adversarial examples based on the first 5 datapoints from our clean xtest dataset:*

In [None]:
data_to_perturb = xtest[:5]  # this is the clean data that we will use as a starting point for our adversarial examples.
data_to_perturb = tf.convert_to_tensor(data_to_perturb, dtype=tf.float32)
pd.DataFrame(data_to_perturb.numpy())  # show the clean data which we will perturb (converted back to a NumPy array for display)

Let's allow each feature value of a given datapoint to be perturbed by up to .05. Because the features are scaled to the range of 0 to 1, this corresponds to 5% of each feature's range. We will assume that a change of this size doesn't affect the true label of that datapoint.

As a loose analogy, suppose a state A is described by the expression 2x + 3y - 10z. We assume that we could increase or decrease the coefficients on x, y, and z by up to 5% of their original value, and the result would still be recognized as A, even though the exact numerical value of the expression shifts slightly.

So, under this assumption:
2.1x + 3.15y - 9.5z is still labeled A (increasing each coefficient by 5%).
1.9x + 2.85y - 10.5z is still labeled A (decreasing each coefficient by 5%).
2.05x + 2.9y - 10z is still labeled A (increasing or decreasing each coefficient by no more than 5%).
And so on.











---


# Challenge 2:
Whether this assumption is actually true matters for the stealthiness and overall efficacy of the attack. Imagine you were working with image data. If you were allowed to change the color of each pixel by 5%, would the resulting image always maintain the original image's label? For example, if you were given an image of a tree, are you sure that no perturbations within the 5% limit would make the image appear to be something else entirely, like a car or a panda? How could you test these assumptions to settle on a suitable epsilon value for a specific use-case?

---

In [None]:
'''
write your response to challenge 2 here...



'''

We have 27 features in this dataset, so we're looking at 27-dimensional space. Let's pretend we only have 2 so that we can visualize the algorithm in a comprehensible way. Say we have only the "CALCULATED AVERAGE TEMPERATURE" and the "PRESSURIZER PRESSURE" features, and our labels in the 'TRANSIENT' column. Imagine the temperature on the x axis, the pressure on the y axis, and the labels on the z axis. Suppose that a temperature of 517 and a pressure of 2218 correspond to the label "Transient Valve Closure", which we've assigned to the numerical value 3.

With an epsilon value of .05 (the 5% we discussed above), we can form a square centered on this datapoint, extending epsilon in each direction along both (normalized) feature axes. Whenever an update would push our perturbations outside this square, we project them back onto its perimeter, hence the "projected" in the attack name of "projected gradient descent".

With 3 features, we project the perturbations onto the surface of a cube rather than a square. In real-world high-dimensional data, we project them onto a hypercube.
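
In code, projecting onto this square (or hypercube) amounts to element-wise clipping of the perturbation. Below is a tiny NumPy sketch with made-up numbers, working in the normalized feature space where epsilon is .05; the variable names are illustrative only.

```python
import numpy as np

eps = 0.05
# hypothetical perturbation for two features after a few gradient steps
perturbation = np.array([0.08, -0.02])
# any component outside [-eps, eps] is pulled back to the nearest edge of the square;
# components already inside the square are left untouched
projected = np.clip(perturbation, -eps, eps)
print(projected)
```

Here the first component is pulled back from 0.08 to 0.05, while the second, which was already within bounds, stays at -0.02.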

*Initialize the perturbations to random points within the hypercube surrounding each clean datapoint, and define the loss object as keras's sparse categorical cross-entropy function:*

In [10]:
eps = .05
perturbations = tf.random.uniform(shape=data_to_perturb.shape, minval=-eps, maxval=eps)
perturbations = tf.convert_to_tensor(perturbations, dtype=tf.float32)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

*While tracking the gradient of the loss function with respect to the perturbations, create adversarial data by adding the perturbations to the clean data:*

In [11]:
# We will want to repeat this block of steps over multiple iterations to maximize the efficacy of our adversarial examples.

labels = ytest[:5]
with tf.GradientTape() as tape:
    tape.watch(perturbations)
    adversarial_data = data_to_perturb + perturbations  # add the perturbations to the clean data to form the first attempt at adversarial datapoints
    predictions = first_model(adversarial_data)  # see what the model predicts to be the labels of the new datapoints
    loss = loss_object(labels, predictions)  # measure how far the model's predictions are from the true labels

*Determine the sign of the gradient of the loss function with respect to the perturbations, and update the perturbations by a small step in that direction:*

In [12]:
step_size = .0005  # in most PGD attack implementations, this value is much smaller than epsilon.
gradient = tape.gradient(loss, perturbations)  # the gradient (slope) of the loss with respect to the perturbations
normalized_gradient = tf.sign(gradient)  # The sign of the gradient (positive or negative, +1 or -1) is enough information; we don't need the actual magnitude of the gradient.
perturbations += step_size * normalized_gradient  # move the perturbations one small step in the direction of the gradient's sign
adversarial_examples = data_to_perturb + perturbations  # our adversarial examples are the clean datapoints moved a step uphill on the loss surface

*Repeat the above steps over several iterations:*

In [13]:
num_iter = 50
for i in range(num_iter):
  with tf.GradientTape() as tape:
    tape.watch(perturbations)
    adversarial_data = data_to_perturb + perturbations
    prediction = first_model(adversarial_data)
    loss = loss_object(labels, prediction)
  gradient = tape.gradient(loss, perturbations)
  normalized_gradient = tf.sign(gradient)
  perturbations += step_size * normalized_gradient
  perturbations = tf.clip_by_value(perturbations, -eps, eps)  # projection step: keep every perturbation (not the data itself) within [-eps, eps]
adversarial_examples = (data_to_perturb + perturbations).numpy()  # clean data plus the final, projected perturbations
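
To see the whole loop run without TensorFlow, here is a minimal NumPy sketch of the same idea against a toy logistic-regression model. This is a stand-in chosen for illustration, not the notebook's neural network: the functions `sigmoid`, `bce_loss`, and `pgd_toy`, the weights, and the data are all hypothetical. The gradient is written out by hand, which is only possible because the toy model is so simple.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Mean binary cross-entropy of the toy model on inputs x with labels y."""
    p = sigmoid(x @ w + b)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def pgd_toy(x, y, w, b, eps=0.05, step_size=0.005, num_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    delta = rng.uniform(-eps, eps, size=x.shape)   # random start inside the eps-hypercube
    for _ in range(num_iter):
        p = sigmoid((x + delta) @ w + b)
        grad = (p - y)[:, None] * w[None, :]       # per-example gradient of the loss w.r.t. the input
        delta = delta + step_size * np.sign(grad)  # step uphill on the loss
        delta = np.clip(delta, -eps, eps)          # project the perturbation, not the data
    return x + delta

rng = np.random.default_rng(1)
x_toy = rng.uniform(0, 1, size=(5, 3))
w_toy, b_toy = np.array([2.0, -1.0, 0.5]), -0.3
# use the toy model's own predictions as labels, so it starts out 100% "correct"
y_toy = (sigmoid(x_toy @ w_toy + b_toy) > 0.5).astype(float)

x_adv_toy = pgd_toy(x_toy, y_toy, w_toy, b_toy)
print("loss on clean data:      ", bce_loss(x_toy, y_toy, w_toy, b_toy))
print("loss on adversarial data:", bce_loss(x_adv_toy, y_toy, w_toy, b_toy))
print("largest feature change:  ", np.max(np.abs(x_adv_toy - x_toy)))
```

The loss goes up while no feature moves by more than epsilon, which is exactly the trade-off the projection step enforces.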

*Examine the 5 adversarial examples we've just created:*

In [None]:
pd.DataFrame(adversarial_examples)

*Show the original 5 clean datapoints, for comparison:*

In [None]:
pd.DataFrame(xtest[:5])

When visually comparing these adversarial examples to the original clean ones, we can see that our data have negative values, whereas the clean data appear to have been normalized to fall within the range of 0 and 1. Successful attack examples should not differ in obvious ways from clean data, or else the attack will not be discreet. We want to mimic a realistic attack, so that our adversarial training data will transfer as much robustness as possible to a real-world attack.

*Fix the discrepancy by normalizing the perturbed data to fall within the range of 0 and 1, as the clean data does:*

In [None]:
adversarial_examples = np.clip(adversarial_examples, 0, 1)  # values less than 0 will become 0, and values greater than 1 will become 1.
pd.DataFrame(adversarial_examples)

There's still one obvious difference between our attack data and the clean data: the clean data has some binary features, and our adversarial examples aren't being limited to 0s and 1s like we'd expect them to be. This could make our attack more detectable, since we would know that non-binary data doesn't belong in those columns.

*Fix the discrepancy by reverting the values of categorical features back to what they were in the associated clean datapoint:*

In [18]:
# determine which columns hold only binary data in the clean dataset
categorical_features = []
for i in range(xtest.shape[1]):
    unique_values = np.unique(xtest[:, i])
    if len(unique_values) == 2 and np.isclose(unique_values, [0, 1]).all():  # check whether the data is only 0s and 1s
        categorical_features.append(i)
# revert those values back to what they were in the clean data
adversarial_examples[:, categorical_features] = xtest[:5, categorical_features]


*Check that no obvious indications of tampering with the data remain:*

In [None]:
pd.DataFrame(adversarial_examples)

Let's combine the PGD attack code into one function, and then use it to generate a larger adversarial dataset. Then, we evaluate our model on the adversarial data to ensure that our attack is effective.

*Condense the attack code into a single function:*

In [20]:
'''
Explanation of function parameters:
----------------------------------
- clean_data: This is the original, unperturbed data that you are generating adversarial examples from.
- target: These are the true labels for the clean data. The attack tries to push the model's predictions away from them.
- model: This is the naive neural network model that you are attacking.
- eps: Short for epsilon. This controls the max change allowed per feature (L-infinity norm constraint).
- step_size: This is the step size in the gradient descent/ascent process.
- num_iter: The number of iterations to perform. This controls how many steps are taken in the gradient descent process.
'''

def pgd(clean_data, target, model, eps=.05, step_size=.0005, num_iter=100):
    clean_data = np.asarray(clean_data, dtype=np.float32)  # match the float32 dtype of the perturbations
    # determine which columns hold only binary data in the clean dataset
    categorical_features = []
    for i in range(clean_data.shape[1]):
        unique_values = np.unique(clean_data[:, i])
        if len(unique_values) == 2 and np.isclose(unique_values, [0, 1]).all():
            categorical_features.append(i)
    # start from a random point inside the eps-hypercube around each clean datapoint
    perturbations = tf.random.uniform(shape=clean_data.shape, minval=-eps, maxval=eps)
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    for i in range(num_iter):
        with tf.GradientTape() as tape:
            tape.watch(perturbations)
            adversarial_data = clean_data + perturbations
            prediction = model(adversarial_data)
            loss = loss_object(target, prediction)
        gradient = tape.gradient(loss, perturbations)
        normalized_gradient = tf.sign(gradient)
        perturbations += step_size * normalized_gradient  # step uphill on the loss
        perturbations = tf.clip_by_value(perturbations, -eps, eps)  # projection step: enforce the L-infinity constraint
    adversarial_examples = np.clip((clean_data + perturbations).numpy(), 0, 1)  # stay within the normalized range
    adversarial_examples[:, categorical_features] = clean_data[:, categorical_features]  # restore the binary features
    return adversarial_examples

*Use the above pgd( ) function to generate 1000 adversarial examples from xtest:*

In [21]:
pgd_data = pgd(xtest[:1000], ytest[:1000], first_model)

*Check that the attack effectively lowered the model's accuracy:*

In [None]:
attack_results = first_model.evaluate(pgd_data, ytest[:1000])[1] * 100
clean_results = first_model.evaluate(xtest[:1000], ytest[:1000])[1] * 100

print(f'\nModel accuracy on clean data: {clean_results}')
print(f'Model accuracy on attack data: {attack_results}')
print(f'\nAttack successful! We brought the model\'s accuracy down by { clean_results - attack_results:.0f} percentage points!')

Since the attack was effective, it would be worthwhile to include that data in the training data of the model. By seeing the adversarial data in advance, the model will be more prepared for similar attacks in the future.

# Next Steps for Adversarial Training:
- Use the above pgd( ) function to generate an adversarial example for each clean example in the training set, to form a 2nd adversarial dataset.
- Train a second model (with the same architecture as the first model) on a combined dataset consisting of both the clean and adversarial examples.
- Generate a 3rd set of PGD examples using xtest and the new robust model.
- Test the robust model's accuracy on that 3rd set, and compare the improved accuracy against the naive model's accuracy during the first PGD attack.
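
The first two steps above can be sketched as follows. This is a hedged illustration rather than a finished solution: the helper `build_adversarial_training_set` and the stand-in `dummy_attack` are hypothetical names, and the dummy attack exists only so that the sketch runs on its own without a trained model.

```python
import numpy as np

def build_adversarial_training_set(x_clean, y_clean, attack_fn, seed=0):
    """Combine clean data with one adversarial example per clean example, then shuffle."""
    x_adv = attack_fn(x_clean, y_clean)                  # adversarial copy of every clean row
    x_all = np.concatenate([x_clean, x_adv], axis=0)
    y_all = np.concatenate([y_clean, y_clean], axis=0)   # adversarial rows keep their true labels
    order = np.random.default_rng(seed).permutation(len(x_all))
    return x_all[order], y_all[order]

# Stand-in attack so the sketch is self-contained. In the notebook, this would be
# something like: lambda x, y: pgd(x, y, first_model)
def dummy_attack(x, y, eps=0.05):
    return np.clip(x + eps, 0, 1)

x_demo = np.array([[0.1, 0.9], [0.5, 0.5], [0.98, 0.0]])
y_demo = np.array([0, 3, 7])
x_all, y_all = build_adversarial_training_set(x_demo, y_demo, dummy_attack)
print(x_all.shape, y_all.shape)  # (6, 2) (6,)
```

With the notebook's own functions, the remaining steps would look roughly like `robust_model = train(make_model(), x_all, y_all)` followed by `pgd(xtest[:1000], ytest[:1000], robust_model)` and an `evaluate` call on the result. Those lines are left as an exercise and have not been run here.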

Adversarial training is a promising and relatively effective defense against adversarial attacks. If you include a variety of attack types in your training data, the model's new robustness will transfer to a wider variety of attacks. Adversarial training typically increases robustness against adversarial attacks substantially, while slightly decreasing the model's accuracy on clean data.

Great work!