### Network Security and Privacy - Final Project Notebook:

For this project, we explore the detection of volumetric Distributed Denial of Service (DDoS) Attacks. We focus on Exploitation-based Attacks rather than Reflection-based Attacks: Reflection-based Attacks use third-party servers to reflect traffic back at the target, whereas Exploitation-based Attacks aim to disrupt a system's functionality.

In [None]:
# Import Necessary Packages:

import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load in the Datasets:
syn_df = pd.read_csv('/content/drive/MyDrive/CS MS-Sem 2/CS 6501-002/project/01-12/Syn.csv')
udp_lag_df = pd.read_csv('/content/drive/MyDrive/CS MS-Sem 2/CS 6501-002/project/01-12/UDPLag.csv')
udp_flood_df = pd.read_csv('/content/drive/MyDrive/CS MS-Sem 2/CS 6501-002/project/01-12/DrDoS_UDP.csv')


### Exploratory Data Analysis (EDA) with SYN Flood:

In [None]:
syn_benign = syn_df[syn_df[' Label'] == 'BENIGN']
syn_attack = syn_df[syn_df[' Label'] == 'Syn']

In [None]:
len(syn_attack)

1582289

In [None]:
len(syn_benign)

392

### Determining Features for Detecting SYN Flood Attacks:

The key to a successful Machine Learning model is a set of features that clearly separates the classes. To distinguish SYN Flood Attacks from benign traffic, we use the following features:

- Total Backward Packets
- Down/Up Ratio
- Fwd Packets/s
- Bwd Packets/s
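One quick way to check whether these four features separate the classes is to compare their per-class summary statistics. The sketch below runs on a tiny made-up frame: the values are illustrative and not taken from the dataset, the column names assume the leading whitespace has already been stripped, and `class_medians` is a hypothetical helper rather than part of this notebook.

```python
import pandas as pd

FEATURES = ["Total Backward Packets", "Down/Up Ratio", "Fwd Packets/s", "Bwd Packets/s"]

def class_medians(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Return the median of each candidate feature, one row per class."""
    return df.groupby(label_col)[FEATURES].median()

# Illustrative values only (not real flows from the capture)
toy = pd.DataFrame({
    "Total Backward Packets": [0, 0, 2, 2, 4],
    "Down/Up Ratio": [0.0, 0.0, 1.0, 2.0, 2.0],
    "Fwd Packets/s": [2.0e6, 1.0e6, 100.0, 50.0, 10.0],
    "Bwd Packets/s": [0.0, 0.0, 100.0, 60.0, 20.0],
    "Label": ["Syn", "Syn", "BENIGN", "BENIGN", "BENIGN"],
})

medians = class_medians(toy)
print(medians)
```

A feature whose per-class medians sit far apart is a reasonable candidate; one whose medians nearly coincide probably adds little.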



### Approach #1: SVM Models

We train an SVM on a subset of the original SYN Flood data and evaluate it on a held-out test split. Because the capture is heavily skewed toward attack traffic, we take the SYN rows from the first 1,500 records and add all 392 BENIGN rows so that both labels are represented:

In [None]:
# Take the first 1,500 rows and keep only the selected features plus the label:
syn_df_revised = syn_df[0:1500]
syn_df_revised = syn_df_revised[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
# Drop the benign rows from this slice; every benign row is appended below:
syn_df_revised = syn_df_revised[syn_df_revised[' Label'] != 'BENIGN']

# Append all benign flows (the capture contains 392 of them):
syn_df_benign_revised = syn_benign[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]
syn_df_revised = pd.concat([syn_df_revised, syn_df_benign_revised])
# Encode the labels: 1 = Syn, 0 = BENIGN
syn_df_revised[' Label'] = (syn_df_revised[' Label'] == 'Syn').astype(int)

In [None]:
Counter(syn_df_revised[' Label'])

Counter({1: 1498, 0: 392})

In [None]:
len(syn_df_revised)

1890
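The counts above show roughly four attack flows for every benign flow. If equal class sizes are wanted, one option is to downsample every class to the size of the rarest one. This is a minimal sketch under that assumption: `balance_classes` is a hypothetical helper, the toy frame is made up, and the label column is assumed to be named `Label` without the leading space.

```python
import pandas as pd

def balance_classes(df: pd.DataFrame, label_col: str = "Label", seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the rarest class."""
    n_min = int(df[label_col].value_counts().min())
    # GroupBy.sample draws n rows from each group without replacement
    return df.groupby(label_col).sample(n=n_min, random_state=seed)

# Toy frame: 10 attack rows (label 1) and 3 benign rows (label 0)
toy = pd.DataFrame({
    "Fwd Packets/s": [float(i) for i in range(13)],
    "Label": [1] * 10 + [0] * 3,
})

balanced = balance_classes(toy)
print(balanced["Label"].value_counts())
```

Downsampling discards data, so passing `class_weight='balanced'` to `svm.SVC` may be a better fit when the minority class is as small as it is here.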

In [None]:
y = syn_df_revised[' Label']
X = syn_df_revised.drop([' Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.9259259259259259
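With an imbalanced subset, accuracy alone can hide weak performance on the minority class. The sketch below shows one possible variation on the cell above: a stratified split, the scaler and SVM wrapped in a single pipeline, and a per-class report. The data is synthetic (two Gaussian blobs standing in for the four features), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 features, four attack rows per benign row
X_attack = rng.normal(loc=2.0, scale=1.0, size=(400, 4))
X_benign = rng.normal(loc=-2.0, scale=1.0, size=(100, 4))
X = np.vstack([X_attack, X_benign])
y = np.array([1] * 400 + [0] * 100)

# stratify=y keeps the 4:1 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then the SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="poly"))
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

acc = accuracy_score(y_te, y_pred)
print(f"Accuracy: {acc:.4f}")
print(classification_report(y_te, y_pred, target_names=["BENIGN", "Syn"]))
```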



### Approach #2: Random Forest (RF)

Next, we try an ensemble method, a Random Forest, and sweep the maximum tree depth from 1 to 6 to see how it affects SYN Flood detection:

In [None]:
from sklearn.ensemble import RandomForestClassifier

def run_random_forest(X_train, y_train, X_test, y_test):

  """Trains a Random Forest at each max_depth from 1 to 6 and prints its summary statistics"""

  # Note: summary_statistics is defined in the XG-Boost section below and must be run first.
  for i in range(1, 7):
    # Model Declaration + Fitting:
    model = RandomForestClassifier(max_depth=i, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # summary_statistics prints its results and returns None, so it is not wrapped in print():
    summary_statistics(predictions, y_test)

run_random_forest(X_train_scaled, y_train, X_test_scaled, y_test)

Accuracy Score: 0.5706
F1-Score: 0.7044
Precision Score: 0.9221
Recall Score: 0.5706
Accuracy Score: 0.6417
F1-Score: 0.6816
Precision Score: 0.8375
Recall Score: 0.6417
Accuracy Score: 0.6445
F1-Score: 0.6808
Precision Score: 0.8352
Recall Score: 0.6445
Accuracy Score: 0.7156
F1-Score: 0.7512
Precision Score: 0.8701
Recall Score: 0.7156
Accuracy Score: 0.7521
F1-Score: 0.7721
Precision Score: 0.8401
Recall Score: 0.7521
Accuracy Score: 0.7521
F1-Score: 0.7697
Precision Score: 0.8305
Recall Score: 0.7521
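The depth sweep above only prints its metrics. Collecting them in a dictionary keyed by depth makes the runs easier to compare. The following is a sketch on synthetic data from `make_classification`; `sweep_depths` is a hypothetical helper, not the notebook's `run_random_forest`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def sweep_depths(X_train, y_train, X_test, y_test, depths=range(1, 7)):
    """Return {max_depth: test accuracy} for a forest trained at each depth."""
    scores = {}
    for depth in depths:
        forest = RandomForestClassifier(max_depth=depth, random_state=0)
        forest.fit(X_train, y_train)
        scores[depth] = accuracy_score(y_test, forest.predict(X_test))
    return scores

# Synthetic 4-feature binary problem, used only to exercise the helper
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

scores = sweep_depths(X_tr, y_tr, X_te, y_te)
best_depth = max(scores, key=scores.get)
for depth, score in scores.items():
    print(f"max_depth={depth}: accuracy={score:.4f}")
```

In practice the depth would be chosen on a validation split or by cross-validation, so that the test set stays untouched for the final estimate.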


### UDP Flood Attacks:

Next, we turn to UDP-based attacks, using the UDPLag capture, which contains UDP-lag, WebDDoS, and BENIGN flows.

In [None]:
# Sample each class (the resulting class sizes are unequal: 1000 UDP-lag, 671 BENIGN, 439 WebDDoS):
udp_lag_revised_df = udp_lag_df[0:1000]
udp_lag_df_benign = udp_lag_df[udp_lag_df[' Label'] == 'BENIGN']
udp_lag_revised_df = pd.concat([udp_lag_revised_df, udp_lag_df_benign[0:671]])
udp_lag_df_ddos = udp_lag_df[udp_lag_df[' Label'] == 'WebDDoS']
udp_lag_revised_df = pd.concat([udp_lag_revised_df, udp_lag_df_ddos[0:439]])

udp_lag_revised_df = udp_lag_revised_df[[' Total Backward Packets', ' Down/Up Ratio', 'Fwd Packets/s', ' Bwd Packets/s', ' Label']]

# Encode the labels: 2 = UDP-lag, 3 = BENIGN, 4 = WebDDoS.
# Series.map keeps the column aligned with its rows; any unexpected label becomes NaN.
udp_lag_revised_df[' Label'] = udp_lag_revised_df[' Label'].map({'UDP-lag': 2, 'BENIGN': 3, 'WebDDoS': 4})

In [None]:
Counter(udp_lag_revised_df[' Label'])

Counter({2: 1000, 3: 671, 4: 439})

In [None]:
y = udp_lag_revised_df[' Label']
X = udp_lag_revised_df.drop([' Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.7914691943127962


In [None]:
model = svm.SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")

SVM Accuracy Score: 0.7977883096366508


In [None]:
run_random_forest(X_train_scaled, y_train, X_test_scaled, y_test)

RF Accuracy Score: 0.7551342812006319
RF Accuracy Score: 0.8751974723538705
RF Accuracy Score: 0.9304897314375987
RF Accuracy Score: 0.9447077409162717
RF Accuracy Score: 0.9510268562401264
RF Accuracy Score: 0.9510268562401264


Objective: Create a subset of 4000 rows that's representative of the attacks.

In [None]:
syn_df_revised.columns = syn_df_revised.columns.str.strip()
udp_lag_revised_df.columns = udp_lag_revised_df.columns.str.strip()

In [None]:
print(syn_df_revised.columns)
print(udp_lag_revised_df.columns)

Index(['Total Backward Packets', 'Down/Up Ratio', 'Fwd Packets/s',
       'Bwd Packets/s', 'Label'],
      dtype='object')
Index(['Total Backward Packets', 'Down/Up Ratio', 'Fwd Packets/s',
       'Bwd Packets/s', 'Label'],
      dtype='object')


In [None]:
combined_df = pd.concat([syn_df_revised, udp_lag_revised_df], ignore_index=True)

In [None]:
print(Counter(combined_df['Label']))

Counter({1: 1498, 2: 1000, 3: 671, 4: 439, 0: 392})
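In the combined counts, benign traffic appears under two codes: 0 from the SYN subset and 3 from the UDP-lag subset. A classifier trained on this frame is therefore asked to separate two classes that mean the same thing. One possible alternative, sketched here as an assumption rather than as what this notebook does, is to encode the string labels with a single shared map before concatenating. `LABEL_CODES` and `encode_labels` are hypothetical names, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical shared encoding; the integer codes themselves are arbitrary
LABEL_CODES = {"BENIGN": 0, "Syn": 1, "UDP-lag": 2, "WebDDoS": 3}

def encode_labels(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Return a copy of df with string labels replaced by their shared codes."""
    out = df.copy()
    codes = out[label_col].map(LABEL_CODES)
    if codes.isna().any():
        raise ValueError("found a label that is missing from LABEL_CODES")
    out[label_col] = codes.astype(int)
    return out

# Toy stand-ins for the two per-attack subsets
syn_toy = pd.DataFrame({"Fwd Packets/s": [2.0e6, 8.0], "Label": ["Syn", "BENIGN"]})
udp_toy = pd.DataFrame({
    "Fwd Packets/s": [5.0e3, 96.0, 1.2],
    "Label": ["UDP-lag", "BENIGN", "WebDDoS"],
})

combined = pd.concat([encode_labels(syn_toy), encode_labels(udp_toy)], ignore_index=True)
print(combined["Label"].value_counts().sort_index())
```

With a shared map, both benign groups land in class 0 and the codes are consecutive from the start.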


In [None]:
print(Counter(syn_df_revised['Label']))

Counter({1: 1498, 0: 392})


In [None]:
len(combined_df)

4000

In [None]:
syn_df_revised

| | Total Backward Packets | Down/Up Ratio | Fwd Packets/s | Bwd Packets/s | Label |
|---|---|---|---|---|---|
| 0 | 2 | 0.0 | 1.640770e-01 | 0.017271 | 1 |
| 1 | 0 | 0.0 | 1.403830e-01 | 0.000000 | 1 |
| 2 | 2 | 1.0 | 1.785714e+04 | 17857.142857 | 1 |
| 3 | 0 | 0.0 | 1.509648e-01 | 0.000000 | 1 |
| 4 | 0 | 0.0 | 2.000000e+06 | 0.000000 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1479185 | 0 | 0.0 | 8.005380e+00 | 0.000000 | 0 |
| 1482435 | 2 | 2.0 | 4.608295e+03 | 9216.589862 | 0 |
| 1482436 | 2 | 2.0 | 5.154639e+03 | 10309.278351 | 0 |
| 1502348 | 2 | 1.0 | 9.611688e+01 | 96.116878 | 0 |


In [None]:
revised_combined = combined_df[combined_df['Label'] != 4]

In [None]:
print(Counter(revised_combined['Label']))

Counter({1: 1498, 2: 1000, 3: 671, 0: 392})


In [None]:
combined_df.to_csv("combined_df.csv", index=False)

Objective 2: Run the SVM, Random Forest, and ...

Test 1: SVM

In [None]:
revised_combined = revised_combined.dropna()

In [None]:
y = revised_combined['Label']
X = revised_combined.drop(['Label'], axis=1)

# Split the data into Train/Test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use the StandardScaler:
scaler = StandardScaler()

# Scale the data using fit_transform and transform methods for X_train and X_test respectively:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model Declaration + Fitting:
model = svm.SVC(kernel='poly')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")
summary_statistics(model.predict(X_test_scaled), y_test)

SVM Accuracy Score: 0.5846585594013096
Accuracy Score: 0.5847
F1-Score: 0.7187
Precision Score: 0.9494
Recall Score: 0.5847



In [None]:
# Linear-kernel SVM, for comparison with the polynomial kernel above:
model = svm.SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

print(f"SVM Accuracy Score: {accuracy_score(model.predict(X_test_scaled), y_test)}")
summary_statistics(model.predict(X_test_scaled), y_test)

SVM Accuracy Score: 0.5631431244153414
Accuracy Score: 0.5631
F1-Score: 0.6954
Precision Score: 0.9141
Recall Score: 0.5631


Test 2: Random Forest

In [None]:
run_random_forest(X_train_scaled, y_train, X_test_scaled, y_test)

RF Accuracy Score: 0.5706267539756782
RF Accuracy Score: 0.6417212347988774
RF Accuracy Score: 0.6445275958840038
RF Accuracy Score: 0.715622076707203
RF Accuracy Score: 0.7521047708138447
RF Accuracy Score: 0.7521047708138447


Test 3: XG-Boost

In [None]:
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
def run_xgboost(X_train, y_train, X_test, y_test):

  """Trains and tests an XGBoost model, returns the predictions from the model"""

  xgb_model = xgb.XGBClassifier()
  # Fit on the X_train argument rather than the global X_train_scaled:
  xgb_model.fit(X_train, y_train)

  predictions = xgb_model.predict(X_test)

  return predictions

def summary_statistics(predictions, y_test):

  """Prints the summary statistics (Accuracy, F1-Score, Precision, Recall), given predictions and their actual labels"""

  # scikit-learn metrics take the true labels first, then the predictions:
  print(f"Accuracy Score: {accuracy_score(y_test, predictions):.4f}")
  print(f"F1-Score: {f1_score(y_test, predictions, average='weighted'):.4f}")
  print(f"Precision Score: {precision_score(y_test, predictions, average='weighted'):.4f}")
  print(f"Recall Score: {recall_score(y_test, predictions, average='weighted'):.4f}")
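scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is symmetric in its two arguments, but precision and recall trade places when they are swapped. The example below uses four made-up labels, small enough to check by hand.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Three true positives-or-misses and one negative: small enough to verify by hand
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]

# Correct order: (y_true, y_pred)
precision = precision_score(y_true, y_pred)   # 2 TP / (2 TP + 0 FP)
recall = recall_score(y_true, y_pred)         # 2 TP / (2 TP + 1 FN)

# Swapped order: the two metrics exchange values
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

# Accuracy does not depend on the argument order
accuracy = accuracy_score(y_true, y_pred)
accuracy_swapped = accuracy_score(y_pred, y_true)

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"swapped: precision={precision_swapped:.4f} recall={recall_swapped:.4f}")
print(f"accuracy={accuracy:.4f} swapped accuracy={accuracy_swapped:.4f}")
```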

In [None]:
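# Label 4 (WebDDoS) was already dropped from revised_combined, so the classes are the
# consecutive integers 0-3 and no relabeling is needed before fitting XGBoost.
# (Series.rename(index=...) would only relabel row indices, not the class values.)
assert set(y_train.unique()) <= {0, 1, 2, 3}
assert set(y_test.unique()) <= {0, 1, 2, 3}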

In [None]:
predictions = run_xgboost(X_train_scaled, y_train, X_test_scaled, y_test)
summary_statistics(predictions, y_test)

Accuracy Score: 0.7615
F1-Score: 0.7752
Precision Score: 0.8303
Recall Score: 0.7615
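As far as we are aware, recent XGBoost releases expect multi-class labels to be consecutive integers starting at 0, which is why a label set such as {0, 1, 2, 4} needs remapping before fitting. A small sketch of one way to do that with scikit-learn's `LabelEncoder`, keeping the mapping so predictions can be translated back. The label array is made up, and the XGBoost behavior described here is an assumption worth checking against the installed version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels with a gap at 3, as would happen if one class were removed
raw_labels = np.array([0, 1, 2, 4, 4, 1])

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)    # consecutive codes 0..k-1
decoded = encoder.inverse_transform(encoded)   # back to the original labels

print("classes:", encoder.classes_)
print("encoded:", encoded)
print("decoded:", decoded)
```

The encoder should be fitted on the training labels only and then reused to transform the test labels and to decode the model's predictions.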


Objective 3: Look through the DrDoS dataset for important details