### Network Security and Privacy - Final Project Notebook:

This project explores the detection of volumetric Distributed Denial of Service (DDoS) attacks. We focus on Exploitation-based Attacks rather than Reflection-based Attacks. Reflection-based Attacks use third-party servers to reflect traffic back at the target, whereas Exploitation-based Attacks abuse protocol behavior directly, as in the SYN Flood and UDP-based attacks examined below, to disrupt a system's functionality.

In [1]:
# Import Necessary Packages:

import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load in the Datasets:
syn_df = pd.read_csv('drive/MyDrive/01-12/Syn.csv')
udp_lag_df = pd.read_csv('drive/MyDrive/01-12/UDPLag.csv')
udp_flood_df = pd.read_csv('drive/MyDrive/01-12/DrDoS_UDP.csv')


### Exploratory Data Analysis (EDA) w/Syn Flood:

In [3]:
syn_benign = syn_df[syn_df[' Label'] == 'BENIGN']
syn_attack = syn_df[syn_df[' Label'] == 'Syn']

### Determining Features for Detecting SYN Flood Attacks:

The key to developing a successful Machine Learning model is choosing features that clearly separate the classes. To distinguish SYN Flood Attacks from benign traffic, we use the following features:

- Total Backward Packets
- Down/Up Ratio
- Fwd Packets/s
- Bwd Packets/s
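
One way to check whether these features separate the two classes is to compare their per-class summary statistics. The sketch below does this on a small, made-up frame; `toy_df` and `summarize_by_label` are hypothetical names, and the values are illustrative only, not taken from the dataset. The column names (including the leading spaces) follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy flows; the numbers are invented purely for illustration.
toy_df = pd.DataFrame({
    ' Total Backward Packets': [0, 0, 2, 12, 8, 15],
    ' Down/Up Ratio': [0, 0, 1, 1, 2, 1],
    'Fwd Packets/s': [9000.0, 12000.0, 8000.0, 40.0, 25.0, 60.0],
    ' Bwd Packets/s': [0.0, 0.0, 100.0, 35.0, 30.0, 55.0],
    ' Label': ['Syn', 'Syn', 'Syn', 'BENIGN', 'BENIGN', 'BENIGN'],
})

features = [' Total Backward Packets', ' Down/Up Ratio',
            'Fwd Packets/s', ' Bwd Packets/s']


def summarize_by_label(df, feature_cols, label_col=' Label'):
    """Return the per-class mean and median of each feature."""
    return df.groupby(label_col)[feature_cols].agg(['mean', 'median'])


summary = summarize_by_label(toy_df, features)
print(summary)
```

On the real data, the same call could be made as `summarize_by_label(syn_df, features)`; large gaps between the `Syn` and `BENIGN` rows would suggest a feature is useful.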



### Approach #1: SVM Models

We first train an SVM on a subset of the original SYN Flood data and evaluate it on a held-out test split. To keep the comparison fair, the subset contains a roughly equal number of Syn and BENIGN rows:

In [4]:
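# Take the first 394 rows of the SYN capture and keep the selected features plus the label:
syn_df_revised = syn_df[0:394]
syn_df_revised = syn_df_revised[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
# Drop any benign rows from that slice so it holds attack traffic only:
syn_df_revised = syn_df_revised[syn_df_revised[' Label'] != 'BENIGN']

# Append the first 394 benign rows (syn_benign comes from the EDA cell above):
syn_df_benign_revised = syn_benign[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
syn_df_revised = pd.concat([syn_df_revised, syn_df_benign_revised[0:394]])
# Encode the labels as integers: 1 = Syn (attack), 0 = BENIGN:
syn_df_revised[' Label'] = [1 if entry == 'Syn' else 0 for entry in list(syn_df_revised[' Label'])]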
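
The cell above builds the subset by slicing the first rows of each class. An alternative, shown here only as a sketch, draws the same number of rows per class at random. `balanced_subset` and the toy frame are hypothetical and not part of the original notebook; applying this to the real data would change the accuracy figures reported below.

```python
import pandas as pd


def balanced_subset(df, label_col, n_per_class, random_state=42):
    """Draw the same number of rows from every class (hypothetical helper).

    If a class has fewer than n_per_class rows, every class is cut down
    to the size of the smallest one so the result stays balanced.
    """
    smallest = int(df[label_col].value_counts().min())
    n = min(n_per_class, smallest)
    parts = [group.sample(n=n, random_state=random_state)
             for _, group in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)


# Toy frame with an uneven class split (5 attack rows, 3 benign rows).
toy = pd.DataFrame({
    'Fwd Packets/s': [9000.0, 8000.0, 7000.0, 9500.0, 8800.0, 40.0, 25.0, 60.0],
    ' Label': ['Syn'] * 5 + ['BENIGN'] * 3,
})

subset = balanced_subset(toy, ' Label', n_per_class=4)
print(subset[' Label'].value_counts())
```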

In [10]:
y = syn_df_revised[' Label']
X = syn_df_revised.drop([' Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

In [11]:
print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.885593220338983


### Approach #2: Random Forest (RF)

Next, we use an ensemble method, a Random Forest, and measure its accuracy at detecting SYN Flood attacks for maximum tree depths from 1 to 6:

In [19]:
from sklearn.ensemble import RandomForestClassifier

def run_random_forest(X_train, y_train, X_test, y_test):
  """Fit one Random Forest per max_depth in 1..6 and print each test accuracy.

  The scores are printed in order of increasing depth.
  """
  # Model Declaration + Fitting:
  for i in range(1, 7):
    model = RandomForestClassifier(max_depth=i, random_state=0)
    model.fit(X_train, y_train)
    print(f"RF Accuracy Score: {accuracy_score(y_test, model.predict(X_test))}")

run_random_forest(X_train_scaled, y_train, X_test_scaled, y_test)

RF Accuracy Score: 0.8898305084745762
RF Accuracy Score: 0.923728813559322
RF Accuracy Score: 0.9322033898305084
RF Accuracy Score: 0.9491525423728814
RF Accuracy Score: 0.9491525423728814
RF Accuracy Score: 0.9449152542372882
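
The printed scores above are not labelled with the depth that produced them. A possible variant, sketched here under the assumption that returning the scores is preferable to printing them, keeps each depth paired with its accuracy. `random_forest_depth_sweep` is a hypothetical name, and the demo data comes from scikit-learn's `make_classification`, not from the DDoS dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def random_forest_depth_sweep(X_train, y_train, X_test, y_test, depths=range(1, 7)):
    """Return {max_depth: test accuracy} for one Random Forest per depth."""
    scores = {}
    for depth in depths:
        model = RandomForestClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        scores[depth] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data with four features (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

demo_scores = random_forest_depth_sweep(Xd_train, yd_train, Xd_test, yd_test)
for depth, score in demo_scores.items():
    print(f"max_depth={depth}: accuracy={score:.4f}")
```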


### UDP Flood Attacks:

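Next, we turn to UDP-based attacks, using the UDPLag dataset. Unlike the SYN Flood case, this is a three-class problem: the labels used below are BENIGN, UDP-lag, and WebDDoS.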

In [44]:
# Equal Sampling of Each Class (439 rows each).
# The first 439 rows of the file are taken as the UDP-lag sample:
udp_lag_revised_df = udp_lag_df[0:439]
udp_lag_df_benign = udp_lag_df[udp_lag_df[' Label'] == 'BENIGN']
udp_lag_revised_df = pd.concat([udp_lag_revised_df, udp_lag_df_benign[0:439]])
udp_lag_df_ddos = udp_lag_df[udp_lag_df[' Label'] == 'WebDDoS']
udp_lag_revised_df = pd.concat([udp_lag_revised_df, udp_lag_df_ddos[0:439]])

udp_lag_revised_df = udp_lag_revised_df[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]

# Encode the labels as integers: 0 = BENIGN, 1 = UDP-lag, 2 = WebDDoS:
num_labels = []
for entry in udp_lag_revised_df[' Label']:
  if entry == 'UDP-lag':
    num_labels.append(1)
  elif entry == 'WebDDoS':
    num_labels.append(2)
  elif entry == 'BENIGN':
    num_labels.append(0)

udp_lag_revised_df[' Label'] = num_labels

In [49]:
y = udp_lag_revised_df[' Label']
X = udp_lag_revised_df.drop([' Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.5126262626262627


In [50]:
model = svm.SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.7222222222222222


In [51]:
run_random_forest(X_train_scaled, y_train, X_test_scaled, y_test)

RF Accuracy Score: 0.6616161616161617
RF Accuracy Score: 0.8257575757575758
RF Accuracy Score: 0.8863636363636364
RF Accuracy Score: 0.9116161616161617
RF Accuracy Score: 0.9141414141414141
RF Accuracy Score: 0.9141414141414141
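
Because the UDP task has three classes, a single accuracy number hides which classes are being confused. The sketch below shows how a confusion matrix and per-class report could be produced with scikit-learn. The label vectors are made up for illustration and are not predictions from the models above; the class order follows the 0 = BENIGN, 1 = UDP-lag, 2 = WebDDoS encoding used in this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (illustration only).
y_true_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 2, 2, 2]

class_ids = [0, 1, 2]
class_names = ['BENIGN', 'UDP-lag', 'WebDDoS']

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_ids)
print(cm)

# Per-class precision, recall, and F1.
print(classification_report(y_true_demo, y_pred_demo,
                            labels=class_ids, target_names=class_names))
```

With the notebook's own variables, the equivalent call would be `confusion_matrix(y_test, model.predict(X_test_scaled), labels=[0, 1, 2])`.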
