### Network Security and Privacy - Final Project Notebook:

For the purposes of this project, we explore Distributed Denial of Service (DDoS) Attack Detection against Volumetric Attacks. Furthermore, we will focus on Exploitation-based Attacks as opposed to Reflection-based Attacks. While Reflection-based Attacks utilize third-party servers in order to reflect traffic back to the target, Exploitation-based Attacks aim to disrupt a system's functionality.

In [2]:
# Import Necessary Packages:
import pandas as pd

# Scikit-Learn Packages:
from sklearn import svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# XGBoost:
import xgboost as xgb

# Miscellaneous:
from collections import Counter

In [3]:
# Load in the Datasets:
syn_df = pd.read_csv('drive/MyDrive/01-12/Syn.csv')
udp_lag_df = pd.read_csv('drive/MyDrive/01-12/UDPLag.csv')

  syn_df = pd.read_csv('drive/MyDrive/01-12/Syn.csv')
  udp_lag_df = pd.read_csv('drive/MyDrive/01-12/UDPLag.csv')


### Exploratory Data Analysis (EDA) w/Syn Flood:

In [4]:
syn_benign = syn_df[syn_df[' Label'] == 'BENIGN']
syn_attack = syn_df[syn_df[' Label'] == 'Syn']

### Determining Features for Detecting SYN Flood Attacks:

The key success to developing a Machine Learning model is to have distinguishing features that can differentiate between different classes. In order to decipher between SYN Flood Attacks and benign traffic, we make use of the following features:

- Total Backward Packets
- Down/Up Ratio
- Fwd Packets/s
- Bwd Packets/s



### Approach #1: SVM Models

We can use SVM models to take a subset of the original SYN Flood Data, and then test it against our allocated test data. To do this fairly, we find an equal subset of data that includes SYN and BENIGN labels:

In [None]:
syn_df_revised = syn_df[0:394]
syn_df_revised = syn_df_revised[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
syn_df_revised = syn_df_revised[syn_df_revised[' Label'] != 'BENIGN']

syn_df_benign_revised = syn_benign[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
syn_df_revised = pd.concat([syn_df_revised, syn_df_benign_revised[0:394]])
syn_df_revised[' Label'] = [1 if entry == 'Syn' else 0 for entry in list(syn_df_revised[' Label'])]

In [None]:
Counter(syn_df_revised[' Label'])

Counter({1: 392, 0: 392})

In [None]:
y = syn_df_revised[' Label']
X = syn_df_revised.drop([' Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.885593220338983


In [None]:
def summary_statistics(predictions, y_test):

  """Returns the summary statistics (Accuracy, F1-Score, Precision, Recall), given predictions and their actual labels"""

  print(f"Accuracy Score: {accuracy_score(predictions, y_test)}")
  print(f"F1-Score: {f1_score(predictions, y_test)}")
  print(f"Precision Score: {accuracy_score(predictions, y_test)}")
  print(f"Recall Score: {recall_score(predictions, y_test)}")

In [None]:
summary_statistics(model.predict(X_test_scaled), y_test) # Linear

Accuracy Score: 0.8898305084745762
F1-Score: 0.9051094890510949
Precision Score: 0.8898305084745762
Recall Score: 0.8266666666666667


In [None]:
print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.885593220338983


### Approach #2: Random Forest (RF)

We can use an ensemble model methodology to test out accuracy on detecting SYN Flood attacks:

In [None]:
from sklearn.ensemble import RandomForestClassifier

def run_random_forest(X_train, y_train, X_test, y_test):

  # Model Declaration + Fitting:
  for i in range(1, 7):
    model = RandomForestClassifier(max_depth=i, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(summary_statistics(predictions, y_test))
    print("--------")

run_random_forest(X_train, y_train, X_test, y_test)

Accuracy Score: 0.9242424242424242
F1-Score: 0.9295774647887324
Precision Score: 1.0
Recall Score: 0.868421052631579
None
--------
Accuracy Score: 0.946969696969697
F1-Score: 0.9492753623188406
Precision Score: 0.9924242424242424
Recall Score: 0.9097222222222222
None
--------
Accuracy Score: 0.946969696969697
F1-Score: 0.9492753623188406
Precision Score: 0.9924242424242424
Recall Score: 0.9097222222222222
None
--------
Accuracy Score: 0.9583333333333334
F1-Score: 0.9597069597069597
Precision Score: 0.9924242424242424
Recall Score: 0.9290780141843972
None
--------
Accuracy Score: 0.9583333333333334
F1-Score: 0.9597069597069597
Precision Score: 0.9924242424242424
Recall Score: 0.9290780141843972
None
--------
Accuracy Score: 0.9583333333333334
F1-Score: 0.9597069597069597
Precision Score: 0.9924242424242424
Recall Score: 0.9290780141843972
None
--------


### Approach #3: XGBoost



In [None]:
def run_xgboost(X_train, y_train, X_test, y_test):

  """Trains and tests an XGBoost model, returns the predictions from the model"""

  xgb_model = xgb.XGBClassifier()
  xgb_model.fit(X_train_scaled, y_train)

  predictions = xgb_model.predict(X_test)

  return predictions

def summary_statistics(predictions, y_test):

  """Returns the summary statistics (Accuracy, F1-Score, Precision, Recall), given predictions and their actual labels"""

  print(f"Accuracy Score: {accuracy_score(predictions, y_test)}")
  print(f"F1-Score: {f1_score(predictions, y_test)}")
  print(f"Precision Score: {precision_score(predictions, y_test)}")
  print(f"Recall Score: {recall_score(predictions, y_test)}")

In [None]:
predictions = run_xgboost(X_train_scaled, y_train, X_test_scaled, y_test)
summary_statistics(predictions, y_test)

Accuracy Score: 0.9449152542372882
F1-Score: 0.9494163424124513
Precision Score: 0.9838709677419355
Recall Score: 0.9172932330827067


### UDP Flood Attacks:

Now we can focus exclusively on UDP Flood Attacks

In [None]:
# Equal Sampling of Each Class:
udp_lag_revised_df = udp_lag_df[0:439]
udp_lag_df_benign = udp_lag_df[udp_lag_df[' Label'] == 'BENIGN']
udp_lag_revised_df = pd.concat([udp_lag_revised_df, udp_lag_df_benign[0:439]])
udp_lag_df_ddos = udp_lag_df[udp_lag_df[' Label'] == 'WebDDoS']
udp_lag_revised_df = pd.concat([udp_lag_revised_df, udp_lag_df_ddos[0:439]])

udp_lag_revised_df = udp_lag_revised_df[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
udp_lag_revised_df = udp_lag_revised_df[udp_lag_revised_df[' Label'] != 'WebDDoS']

num_labels = []
for entry in udp_lag_revised_df[' Label']:
  if entry == 'UDP-lag':
    num_labels.append(1)
  if entry == 'BENIGN':
    num_labels.append(0)

udp_lag_revised_df[' Label'] = num_labels

In [None]:
Counter(udp_lag_revised_df[' Label'])

Counter({1: 439, 0: 439})

In [None]:
y = udp_lag_revised_df[' Label']
X = udp_lag_revised_df.drop([' Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.9242424242424242


In [None]:
summary_statistics(model.predict(X_test_scaled), y_test)

Accuracy Score: 0.9242424242424242
F1-Score: 0.927007299270073
Precision Score: 0.9621212121212122
Recall Score: 0.8943661971830986


In [None]:
model = svm.SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {precision_score(model.predict(X_test_scaled), y_test, average='macro')}")

SVM Accuracy Score: 0.9128787878787878


In [None]:
run_random_forest(X_train, y_train, X_test, y_test)

Accuracy Score: 0.9242424242424242
F1-Score: 0.9295774647887324
Precision Score: 1.0
Recall Score: 0.868421052631579
None
--------
Accuracy Score: 0.946969696969697
F1-Score: 0.9492753623188406
Precision Score: 0.9924242424242424
Recall Score: 0.9097222222222222
None
--------
Accuracy Score: 0.946969696969697
F1-Score: 0.9492753623188406
Precision Score: 0.9924242424242424
Recall Score: 0.9097222222222222
None
--------
Accuracy Score: 0.9583333333333334
F1-Score: 0.9597069597069597
Precision Score: 0.9924242424242424
Recall Score: 0.9290780141843972
None
--------
Accuracy Score: 0.9583333333333334
F1-Score: 0.9597069597069597
Precision Score: 0.9924242424242424
Recall Score: 0.9290780141843972
None
--------
Accuracy Score: 0.9583333333333334
F1-Score: 0.9597069597069597
Precision Score: 0.9924242424242424
Recall Score: 0.9290780141843972
None
--------


In [None]:
predictions = run_xgboost(X_train_scaled, y_train, X_test_scaled, y_test)
summary_statistics(predictions, y_test)

Accuracy Score: 0.946969696969697
F1-Score: 0.9485294117647058
Precision Score: 0.9772727272727273
Recall Score: 0.9214285714285714
