## 1. Datasets loading

In this section we load the datasets. Because the full dataset is large, we work on a sample of it when development mode is enabled.

### 1.1. Importing the basic libraries

In [None]:
# Load data processing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline

### 1.2. Importing machine learning libraries

In [None]:
# Load machine learning libraries
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

### 1.3. Development mode flag

If the following flag is set to `True`, the model will be trained on a smaller dataset, in order to speed up the development process. If the flag is set to `False`, the model will be trained on the whole dataset.

In [None]:
# If True, only 3% of the data is used for training and testing the various models
_DEVMODE = True

### 1.4. Loading the datasets

In [None]:
# Loading the data from the train and test files
train_df = pd.read_csv('data/train_net.csv')
test_df = pd.read_csv('data/test_net.csv')

### 1.5. Loaded datasets information

In [None]:
# Print total size
print("Test set size: ", test_df.shape)
print("Train set size: ", train_df.shape)

# Value counts
train_df['ALERT'].value_counts()

### 1.6. Dataset development mode reduction

In [None]:
if _DEVMODE:
    train_df = train_df.sample(frac=0.03, random_state=1)
    test_df = test_df.sample(frac=0.03, random_state=1)

    # Print total size
    print("Test set size: ", test_df.shape)
    print("Train set size: ", train_df.shape)
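A plain `sample(frac=...)` draws rows uniformly, so if the target is imbalanced a rare class can end up under-represented in the reduced set. As a minimal sketch of an alternative (not what the cell above does), sampling inside each class keeps the class shares. The toy frame and its values are made up for illustration; only the column name `ALERT` comes from the notebook.

```python
import pandas as pd

# Toy frame standing in for train_df (values are invented for illustration)
toy = pd.DataFrame({
    "ALERT": ["None"] * 900 + ["Port Scanning"] * 80 + ["Malware"] * 20,
    "value": range(1000),
})

# Sample 10% of every class separately so that rare classes keep their share
stratified = toy.groupby("ALERT", group_keys=False).sample(frac=0.1, random_state=1)

counts = stratified["ALERT"].value_counts()
print(counts)
```

Each class contributes 10% of its own rows, so the proportions of the reduced frame match those of the full one.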


## 2. Data preprocessing

In this section we preprocess the datasets in order to make them usable by the machine learning algorithms.

### 2.1. Print datasets information

In [None]:
train_df.info()

### 2.2. Print datasets shape

In [None]:
# Show information about the data
def printInfo(df):
    print('Dataframe shape: ', df.shape)
    print('Dataframe columns: ', df.columns)

print('==== Train data ====')
printInfo(train_df)
print()
print('==== Test data ====')
printInfo(test_df)

### 2.3. Show training dataset structure

In [None]:
train_df.head()

### 2.4. Check for missing values

In [None]:
# Check for missing values
print('==== Train data ====')
print(train_df.isnull().sum())
print()
print('==== Test data ====')
print(test_df.isnull().sum())
print()

### 2.5. Fill missing **ANOMALY** values

In [None]:
# Fill the missing ANOMALY values with 0 (no anomaly)
train_df['ANOMALY'] = train_df['ANOMALY'].fillna(0)
test_df['ANOMALY'] = test_df['ANOMALY'].fillna(0)

## 3. Data analysis

In this section we analyze the datasets in order to have a better understanding of the data.

### 3.1. Data types

In [None]:
train_df.dtypes

### 3.2. Observing the distribution of the target variable

We can observe that the dataset is highly imbalanced, with the majority of the flows being normal (no attack detected). We can also observe that the number of malware attacks is very low compared to the other attacks.

These two facts will have a big impact on the model training, as we will see later.
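To make the impact of imbalance concrete, here is a small sketch on made-up labels (95 normal flows, 5 malware flows; these numbers are assumptions, not taken from the dataset). A baseline that always predicts the majority class reaches a high accuracy while detecting no attack at all, which is why accuracy alone can be misleading here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 95 normal flows and 5 malware flows
y_toy = np.array(["None"] * 95 + ["Malware"] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)
bal_acc = balanced_accuracy_score(y_toy, pred)  # mean of per-class recall
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

Balanced accuracy averages the recall of each class, so the missed minority class pulls it down to chance level.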

In [None]:
# Show the distribution of the target variable
sns.countplot(x='ALERT', data=train_df)

In [None]:
# Count the number of unique protocol_maps
train_df['PROTOCOL_MAP'].value_counts()

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(20, 5))

# seaborn countplots
sns.countplot(x='ANOMALY', data=train_df, ax=axs[0]).set(title='ANOMALY')
              

# Seaborn countplot for the 'PROTOCOL_MAP' column, with enough space for the labels
sns.countplot(x='PROTOCOL_MAP', data=train_df, ax=axs[1]).set(title='PROTOCOL_MAP')

# Boxplot for L4_SRC_PORT to understand the distribution of the data
sns.boxplot(
    x='L4_SRC_PORT', data=train_df, ax=axs[2],
    notch=True, showcaps=True,
    flierprops={"marker": "x"}, # Change the outlier marker
    showmeans=True, # Show the mean
    boxprops={"facecolor": (.4, .6, .8, .5)},
  ).set(title='L4_SRC_PORT')

### 3.3. Protocol distribution in relation to the kind of attack

In [None]:
# Show protocol_map distribution for kind of ALERT
sns.countplot(x='PROTOCOL_MAP', hue='ALERT', data=train_df)

### 3.4. Unique hosts in dataset

Knowing the number of unique hosts gives a sense of the size of the dataset, since I expect a bigger dataset to be more difficult to train on properly.

In [None]:
# Find unique hosts (IP addresses) in the train data
train_src_hosts = train_df['IPV4_SRC_ADDR'].unique()
train_dst_hosts = train_df['IPV4_DST_ADDR'].unique()
train_hosts = np.union1d(train_src_hosts, train_dst_hosts)

# Print the number of unique hosts in the train data
print('Number of unique hosts in the train data: ', len(train_hosts))

# Find unique hosts (IP addresses) in the test data
test_src_hosts = test_df['IPV4_SRC_ADDR'].unique()
test_dst_hosts = test_df['IPV4_DST_ADDR'].unique()
test_hosts = np.union1d(test_src_hosts, test_dst_hosts)

# Relative difference in the number of unique hosts (test vs. train), floored to a whole percentage
ratio = math.floor((1.0-len(test_hosts)/len(train_hosts)) * 100)

# Print the number of unique hosts in the test data
print("Number of unique hosts in the test data: {} (~{}% smaller)".format(len(test_hosts), ratio))
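The ratio above only compares how many hosts each set contains. A complementary check, sketched here on made-up IP addresses, is how many test hosts never appear in the training data; `np.setdiff1d` returns the elements of the first array that are missing from the second. The array names ending in `_demo` are hypothetical.

```python
import numpy as np

# Made-up host lists standing in for train_hosts and test_hosts
train_hosts_demo = np.array(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
test_hosts_demo = np.array(["10.0.0.3", "10.0.0.4", "10.0.0.9"])

# Hosts that appear in the test data but never in the train data
unseen = np.setdiff1d(test_hosts_demo, train_hosts_demo)
unseen_ratio = len(unseen) / len(test_hosts_demo)

print(f"{len(unseen)} unseen host(s), {unseen_ratio:.0%} of the test hosts")
```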


### 3.5. Distribution analysis using pairplot

In [None]:
# select the columns to be used for training
train_df_columns = train_df[['L4_SRC_PORT', 'L4_DST_PORT', 'PROTOCOL', 'ANOMALY', 'ALERT']]

# Distribution analysis using pairplot
sns.pairplot(train_df_columns, hue='ALERT')

### 3.6. Remove useless columns and create dummies

In [None]:
# Columns to drop (not useful as model features)
revoked_columns = [
  'FLOW_ID', # Completely random
  'ID', # Completely random
  'ANALYSIS_TIMESTAMP', # Completely random
  'IPV4_SRC_ADDR', # Not useful for the model
  'IPV4_DST_ADDR', # Not useful for the model
  'PROTOCOL_MAP', # There is a numerical column for the protocol
  'MIN_IP_PKT_LEN', # Always 0 since it is a minimum value
  'MAX_IP_PKT_LEN', # Always 0 (maybe it means that the packet has infinite length?)
  'TOTAL_PKTS_EXP', # Always 0
  'TOTAL_BYTES_EXP', # Always 0
]

# Create dummy columns for the ALERT column
alert_dummies = pd.get_dummies(train_df['ALERT'], prefix='ALERT', drop_first=True)

# Copy + drop the revoked columns
train_df = train_df.copy().drop(revoked_columns, axis=1)

### 3.7. Correlation heatmap

We can observe that there are some features that are highly correlated with each other, such as **IN_BYTES** - **OUT_BYTES** and **IN_PKTS** - **OUT_PKTS**. This is not surprising, since these features are related to the amount of data exchanged between the two hosts.

We can also observe that a *port scanning* alert is highly correlated with the **L4_DST_PORT** and **ANOMALY** features. This is not surprising, since a port scanning attack is a type of attack that tries to find open ports on a host. It is highly correlated with **ANOMALY** probably because the forged packets are built in a way that they are not recognized as an attack by the network.

Unfortunately, since *malware attacks* alerts are various and have different characteristics/features, it is not possible to find a correlation between them and the other features. This could mean that the features used in this dataset are not enough to detect malware attacks.

On the other hand, *none* alerts are strongly negatively correlated with **ANOMALY** and **L4_DST_PORT**. This is not surprising, since a normal flow contains valid packets and its destination is usually a well-known port.

In [None]:
# Correlation matrix using pandas (features + one-hot encoded ALERT columns)
corr = pd.concat([train_df.drop('ALERT', axis=1), alert_dummies], axis=1).corr(
  numeric_only=True, # Only consider numeric (and boolean) columns
)

# Correlation heatmap using seaborn + make annotations fit the heatmap
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt=".1f", cmap="YlGnBu")
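The way an alert class ends up in the correlation matrix can be sketched on a toy frame: each class becomes a 0/1 indicator column, and its correlation with a feature is computed like any other column. The values below are invented for illustration. Note that this sketch keeps every class, whereas the notebook uses `drop_first=True`, which drops the indicator of the first class in sorted order.

```python
import pandas as pd

# Toy data: scans target high ports, normal flows target well-known ports (made up)
toy = pd.DataFrame({
    "L4_DST_PORT": [80, 443, 8080, 9001, 9002, 9003],
    "ALERT": ["None", "None", "None", "Port Scanning", "Port Scanning", "Port Scanning"],
})

# One indicator column per class, cast to int so that corr() treats them as numbers
dummies = pd.get_dummies(toy["ALERT"], prefix="ALERT").astype(int)
corr = pd.concat([toy[["L4_DST_PORT"]], dummies], axis=1).corr()

print(corr.round(2))
```

With only two classes the indicators are complementary, so their correlations with the feature are exact opposites.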

## 4. Dataset preparation

In this section we prepare the dataset for the machine learning algorithms. We will split the dataset into training and testing sets, and we will also scale the data to make it more suitable for the algorithms.


### 4.1. Splitting the training set

Since we already have a test set, we split our training set into training and validation sets. We will use scikit-learn's `StratifiedShuffleSplit` to split the training set into 80% training and 20% validation while maintaining the same distribution of the target variable. This is needed since the dataset is highly imbalanced.

In [None]:
def split_maintain_distribution(X, y):
  sss=StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
  indexes = sss.split(X, y)
  train_indices, test_indices = next(indexes)
  return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]

In [None]:
# Drop rows where 'ALERT' is NaN
train_df_cleaned = train_df.dropna(subset=['ALERT'])

X_train, X_val, y_train, y_val = split_maintain_distribution(
    train_df_cleaned.drop('ALERT', axis=1), 
    train_df_cleaned['ALERT']
)


In [None]:
print(train_df['ALERT'].isna().sum())  # Number of missing values in target
print(train_df['ALERT'].value_counts(dropna=False))  # Show class distribution including NaN


In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

def split_maintain_distribution(X, y):
    if y.isna().any():
        raise ValueError("Target variable 'y' contains NaN values. Clean your data before splitting.")
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
    indexes = sss.split(X, y)
    train_indices, test_indices = next(indexes)
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]


In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

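# Step 1: Fill NaNs in 'ALERT' with the 'None' (no attack) label.
# Recent pandas versions parse the literal string "None" as a missing value
# by default in read_csv, which is the likely source of these NaNs.
train_df['ALERT'] = train_df['ALERT'].fillna('None')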

# Step 2: Convert all labels in 'ALERT' to string type to ensure uniformity
train_df['ALERT'] = train_df['ALERT'].astype(str)

# Step 3: Define the function to split maintaining label distribution
def split_maintain_distribution(X, y):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
    indexes = sss.split(X, y)
    train_indices, test_indices = next(indexes)
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]

# Step 4: Perform the split
X_train, X_val, y_train, y_val = split_maintain_distribution(
    train_df.drop('ALERT', axis=1), 
    train_df['ALERT']
)


In [None]:
print(train_df['ALERT'].apply(type).value_counts())


In [None]:
X_train, X_val, y_train, y_val = split_maintain_distribution(train_df.drop('ALERT', axis=1), train_df['ALERT'])

Now we check that the distribution of the target variable is actually the same in the training and validation sets.

#### 4.1.1. Check that the split preserves the class distribution

In [None]:
# Print distribution of the target variable in the train and validation sets
print('Train set distribution:')
print(y_train.value_counts(normalize=True))
print()
print('Validation set distribution:')
print(y_val.value_counts(normalize=True))

We can confirm that the distribution of the target variable is the same in the training and validation sets.
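The same behaviour can be reproduced on a toy target, as a minimal sketch with made-up labels (90 normal flows, 10 malware flows). `StratifiedShuffleSplit` allocates each class to both sides in proportion to its frequency, so the minority share is identical in the two subsets.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up imbalanced target: 90% "None", 10% "Malware"
y = np.array(["None"] * 90 + ["Malware"] * 10)
X = np.arange(len(y)).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=9)
train_idx, val_idx = next(sss.split(X, y))

# Share of the minority class on each side of the split
train_share = np.mean(y[train_idx] == "Malware")
val_share = np.mean(y[val_idx] == "Malware")
print(train_share, val_share)
```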

### 4.2. Data scaling

Scaling the data prevents features with a larger numeric range from having a bigger impact on model training than the others. This is especially important when the features have different units of measure.

In [None]:
# Fit scaler on train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)

# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)

# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)
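As a small sketch of what the fitted scaler does, `StandardScaler` stores the mean and (population) standard deviation of the data it was fitted on and applies `(x - mean) / std` to anything it transforms afterwards. The arrays below are made up; the point is that validation values are scaled with the statistics of the training data, not their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[4.0]])

scaler = StandardScaler().fit(train)   # learns mean and std from the train data only
scaled = scaler.transform(val)         # applies the train statistics to the validation data

# The same computation written out by hand
manual = (val - train.mean(axis=0)) / train.std(axis=0)
print(scaled, manual)
```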

## 5. Feature selection

In this section we will use a Random Forest classifier to find the most important features in the dataset. This will help us to reduce the number of features used in the model training, and therefore speed up the training process.

### 5.1. Create model and fit it

In [None]:
# Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100) # 100 trees = default value

# Fit the model
rfc.fit(x_train_scaled, y_train)

### 5.2. Get feature importances

In [None]:
# Print features importance
feature_importances = pd.DataFrame(
    rfc.feature_importances_,
    index=X_train.columns,
    columns=['importance']
).sort_values('importance', ascending=False)
print(feature_importances)

### 5.3. Plot feature importances

In [None]:
# Plot feature importance
plt.figure(figsize=(20, 10))
plt.xticks(rotation=-90)
sns.barplot(x=feature_importances.index, y=feature_importances['importance'])

### 5.4. Select most important features

Select the most important features using the Random Forest classifier results.

In [None]:
MIN_IMPORTANCE_THRESHOLD = 0.02

In [None]:
# Select all columns with importance > 0.02
COLUMNS = feature_importances[feature_importances['importance'] > MIN_IMPORTANCE_THRESHOLD].index
COLUMNS
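For comparison, scikit-learn offers `SelectFromModel`, which applies the same kind of importance threshold. The sketch below runs it on synthetic data from `make_classification` (the data, sizes and names ending in `_demo` are assumptions for illustration). It keeps features whose importance is greater than or equal to the threshold, whereas the cell above uses a strict `>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Fit a forest inside the selector and keep features with importance >= 0.02
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), threshold=0.02
).fit(X_demo, y_demo)

mask = selector.get_support()            # boolean mask of the kept features
X_selected = selector.transform(X_demo)  # data reduced to the kept features
print(mask, X_selected.shape)
```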

### 5.5. Reprepare the dataset with the selected features

#### 5.5.1. Split again the training set into training and validation sets (with new features)

In [None]:
X_train, X_val, y_train, y_val = split_maintain_distribution(
  train_df[COLUMNS],
  train_df['ALERT']
)

#### 5.5.2. Scale again the train and validation sets (with new features)

In [None]:
# Fit scaler on train set
scaler = StandardScaler()
fitter = scaler.fit(X_train)

# Scale train and validation sets
x_train_scaled = fitter.transform(X_train)
x_validation_scaled = fitter.transform(X_val)

# Convert to pandas dataframe
df_feat_train = pd.DataFrame(x_train_scaled, columns=X_train.columns)
df_feat_validation = pd.DataFrame(x_validation_scaled, columns=X_val.columns)

#### 5.5.3. Scale also the test set (with new features)

In [None]:
# Reuse the scaler fitted on the training set, so that the test features are
# scaled with the same mean and standard deviation as the training features
x_test_scaled = fitter.transform(test_df[COLUMNS])
# Convert to pandas dataframe
df_feat_test = pd.DataFrame(x_test_scaled, columns=test_df[COLUMNS].columns)

## 6. UMAP visualization

In this section we will use UMAP to visualize the dataset in 2D. This will help us understand whether it is possible to separate the different classes of attacks in the dataset.

### 6.1. Create UMAP model and fit it using training set

In [None]:
!pip install umap-learn

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import umap
reducer = umap.UMAP(
  random_state=42,
  n_neighbors=50,
  min_dist=0.3,
)
mapper = reducer.fit(x_train_scaled)

In [None]:
reducer = umap.UMAP(
    n_neighbors=50,
    min_dist=0.3,
    n_jobs=-1  # use all available cores
)


### 6.2. UMAP visualization

#### 6.2.1. Visualization using matplotlib

Reduce data dimensionality to 2 dimensions and plot the data using matplotlib.

In [None]:
y_train = y_train.fillna('None')

In [None]:
# Project the scaled training features to 2D; UMAP is distance-based,
# so it is fitted on the scaled data rather than on the raw X_train
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(x_train_scaled)

In [None]:
# 1. Define mapping dictionary
label_map = {
    'None': 0,
    'Port Scanning': 1,
    'Denial of Service': 2,
    'Malware': 3
}

# 2. Fill NaNs with 'None' (or other default)
y_train_filled = y_train.fillna('None')

# 3. Map labels
color_indices = y_train_filled.map(label_map)

# 4. Check for unmapped labels (to debug)
if color_indices.isnull().any():
    print("Unmapped labels found:", y_train_filled[color_indices.isnull()].unique())

# 5. Convert to integer after mapping and fill unmapped with 0 (default to 'None' color)
color_indices = color_indices.fillna(0).astype(int)

# 6. Plot
plt.figure(figsize=(10, 10))
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in color_indices]
)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the train set', fontsize=24)


#### 6.2.2. Visualization using `umap.plot`
Plot the data using `umap.plot` (which uses matplotlib under the hood).

In [None]:
!pip install pandas matplotlib datashader bokeh holoviews scikit-image colorcet

In [None]:
import umap.plot

# Visualize the embedding using umap.plot, colored by the alert label
p = umap.plot.points(
    mapper,
    labels=y_train,
    width=1000,
    height=900,
)
umap.plot.show(p)

## 7. Model Training

In this section we will train different models and compare their results. We will use the following models:

* K-Nearest Neighbors (KNN)
* Support Vector Machine (SVM) with RBF kernel (Radial Basis Function)
  * SVC
  * SVC with PCA (Principal Component Analysis) pipeline
* Bagging Classifier (SVC with RBF kernel)
* Random Forest Classifier
* Extra Trees Classifier
* Neural Network (MLPClassifier)

### 7.1. KNN Classifier training

We can notice that the model is excessively precise, with a precision of 1.0 for every kind of attack. This is probably due to the fact that the dataset is highly imbalanced, with the majority of the flows being normal (no attack detected). The number of malware attacks is also very low compared to the other attacks.

#### 7.1.1. Finding the best K hyperparameter for KNN

To find the best K hyperparameter for KNN, we run a 3-fold cross-validated grid search (`GridSearchCV`) on the **training set**. We then train the model with the chosen K on the **training set** and evaluate it on the **validation set**, since the test set has no labels.

In [None]:
# Find best K using GridSearchCV
MAX_DEGREE = 30

k_range = list(range(1, MAX_DEGREE+1))
param_grid = dict(n_neighbors=k_range)
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(x_train_scaled, y_train)

# Print information about the model
print(f"Best k: {grid.best_params_}")
print(f"Best score: {grid.best_score_}")

In [None]:
# Plot results
plt.figure(num=0, dpi=96, figsize=(10, 6))
plt.plot(k_range, grid.cv_results_['mean_test_score'])
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.xticks(k_range)
plt.show()

Judging from the plot, the best parameter for KNN is **K = 1**. Since this value would lead to overfitting, we will use the first odd number after 1, which is **K = 3**.

This outcome is not surprising since the training and validation sets probably come from the same network and the same hosts, so the flows are very similar to each other. This means that the best way to test our model is to use the **test set**.
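One reason to be wary of K = 1 can be sketched on synthetic data (generated with `make_classification`; the sizes are arbitrary assumptions). With a single neighbor every training point is its own nearest neighbor, so the model reproduces the training labels exactly and its training score says nothing about generalization.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the scaled training features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Accuracy on the training data itself for K = 1 and K = 3
train_acc = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo).score(X_demo, y_demo)
    for k in (1, 3)
}
print(train_acc)
```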

#### 7.1.2. Fit model with best K hyperparameter + make predictions

In [None]:
# Create a KNN classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3) # 3 = see the note above
# Fit the classifier to the data
knn.fit(x_train_scaled, y_train)
# Make predictions on validation set
predictions = knn.predict(x_validation_scaled)

#### 7.1.3. Model evaluation based on validation set predictions

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Label the confusion matrix with the actual classes, in the order used by scikit-learn
labels = np.unique(y_val)
cmat = confusion_matrix(y_val, predictions, labels=labels)
cmat = pd.DataFrame(cmat, index=labels, columns=labels)

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')
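When reading a confusion matrix it helps to remember how scikit-learn orders it: rows are true classes, columns are predicted classes, and unless `labels` is passed both follow the sorted order of the labels. A tiny sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true_demo = ["b", "a", "b"]
y_pred_demo = ["b", "a", "a"]

# Default order: sorted labels, i.e. ["a", "b"]
cm_default = confusion_matrix(y_true_demo, y_pred_demo)

# Explicit order: ["b", "a"]
cm_reversed = confusion_matrix(y_true_demo, y_pred_demo, labels=["b", "a"])

print(cm_default)
print(cm_reversed)
```

Hard-coded row and column names are therefore only correct if they follow the same sorted order as the classes in the data.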

#### 7.1.4. KNN predictions on test set

Unfortunately, the test set doesn't include the target variable, so we can't evaluate the model on it. We can only evaluate the model on the validation set.

In [None]:
# Prediction on the test set (the 'None' class means no attack)
predictions = knn.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 7.2. Support Vector Machine Classifier (SVC) training

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification or regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is number of features you have) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.

#### 7.2.1. Only SVC model training

##### 7.2.1.1. Grid search to find best hyperparameters for SVC

In [None]:
# Create grid search parameters
param_grid = {
  'C': [0.1, 1, 10, 100, 1000],
  'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
}

# Create grid search
svc_grid = GridSearchCV(
  SVC(kernel="rbf"),
  param_grid,
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1, # Use all cores
)

# Fit grid search
svc_grid.fit(x_train_scaled, y_train)

# Print information about the model
print(f"Best params: {svc_grid.best_params_}")
print(f"Best score: {svc_grid.best_score_}")
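The `gamma` values searched above control how quickly the RBF kernel, `exp(-gamma * ||x - y||^2)`, decays with distance. As a short sketch on two made-up points, the similarity computed by scikit-learn's `rbf_kernel` matches the formula, and a larger `gamma` makes the same pair of points look less similar.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up points at squared distance 2
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

similarities = {}
for gamma in (1, 0.1, 0.01):
    similarities[gamma] = rbf_kernel(a, b, gamma=gamma)[0, 0]

# The same value computed by hand for gamma = 0.1
manual = np.exp(-0.1 * np.sum((a - b) ** 2))
print(similarities, manual)
```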

##### 7.2.1.2. Create model with best parameters + fit model

In [None]:
# Create SVM with best parameters
svc = SVC(
  kernel='rbf',
  C=svc_grid.best_params_['C'],
  gamma=svc_grid.best_params_['gamma'],
)
svc.fit(x_train_scaled, y_train)

##### 7.2.1.3. Make predictions

In [None]:
# Make predictions on validation set
predictions = svc.predict(x_validation_scaled)

##### 7.2.1.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

##### 7.2.1.5. SVC model predictions on test set

In [None]:
# Prediction on the test set
predictions = svc.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

#### 7.2.2. PCA + SVC model training

##### 7.2.2.1. Create pipeline

In [None]:
# Create the two pipeline steps
pca = PCA(whiten=True, random_state=42) # PCA (Principal Component Analysis)
svc = SVC(kernel='rbf', class_weight='balanced') # SVC (Support Vector Classification)

# Create pipeline
model = make_pipeline(pca, svc)

##### 7.2.2.2. Grid search to find the best parameters for PCA and SVC

In [None]:
# Generate a valid n_components range: from 5 (or fewer, if there are fewer features)
# up to and including the number of features, in steps of 3
n_features = x_train_scaled.shape[1]
n_components = np.arange(min(5, n_features), n_features + 1, 3)

param_grid = {
  'pca__n_components': n_components,
  'svc__C': [50, 100, 500, 1000, 5000, 10000],
  'svc__gamma': [0.001, 0.01, 0.1, 1, 10]
}

# Grid search 
pipeline_grid = GridSearchCV(
    model,
    param_grid,
    cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
    n_jobs=-1 # Use all cores
)
pipeline_grid.fit(x_train_scaled, y_train)

# Print information about the model
print(f"Best params: {pipeline_grid.best_params_}")
print(f"Best score: {pipeline_grid.best_score_}")

##### 7.2.2.3. Create pipeline with best parameters + fit model

In [None]:
# Now, create the desired pipeline
pca = PCA(
  n_components=pipeline_grid.best_params_['pca__n_components'],
  whiten=True,
  random_state=42
)
svc = SVC(kernel='rbf',
  class_weight='balanced',
  # Use the best parameters found by the grid search
  C=pipeline_grid.best_params_['svc__C'],
  gamma=pipeline_grid.best_params_['svc__gamma']
)
model = make_pipeline(pca, svc)
model.fit(x_train_scaled, y_train)

##### 7.2.2.4. Make predictions

In [None]:
# Make predictions on validation set
predictions = model.predict(x_validation_scaled)

##### 7.2.2.5. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

##### 7.2.2.6. SVC+PCA pipeline model predictions on test set

In [None]:
# Prediction on the test set
predictions = model.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 7.3. Bagging Classifier (SVC based) training

Bagging Classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
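The description above can be sketched on synthetic data (generated with `make_classification`; the sizes and the `max_samples=0.5` setting are arbitrary choices for illustration, not the notebook's configuration). Each base SVC is fitted on its own random draw of the rows, and the ensemble combines their predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 5 SVCs, each trained on a random draw of half the rows
bag = BaggingClassifier(
    SVC(kernel="rbf"), n_estimators=5, max_samples=0.5, random_state=42
).fit(X_demo, y_demo)

# Number of rows drawn for each base estimator
sizes = [len(idx) for idx in bag.estimators_samples_]
predictions_demo = bag.predict(X_demo)
print(len(bag.estimators_), sizes)
```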

#### 7.3.1. Create model using best SVC parameters + fit model

In [None]:
svc = SVC(kernel='rbf',
  class_weight='balanced',
  C=svc_grid.best_params_['C'],
  gamma=svc_grid.best_params_['gamma']
)

clf = BaggingClassifier(
  svc,
  n_estimators=30,
  n_jobs=-1, # Use all cores
  random_state=42
)
clf.fit(x_train_scaled, y_train)

#### 7.3.2. Make predictions

In [None]:
predictions = clf.predict(x_validation_scaled)

#### 7.3.3. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 7.3.4. Bagging Classifier predictions on test set

In [None]:
# Prediction on the test set
predictions = clf.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 7.4. Random Forest Classifier training

Random Forest is an ensemble method that combines multiple decision trees to create a more accurate model. It is a supervised learning algorithm that can be used for both classification and regression tasks.

#### 7.4.1. Grid search to find best hyperparameters for Random Forest

In [None]:
# Create random forest classifier
rfc = RandomForestClassifier()

# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500, 1000]}

# Used to find the best n_estimators value to use to train the model
rfc_grid = GridSearchCV(
  rfc,
  parameters,
  scoring='accuracy',
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1 # Use all cores
)

# Fit model to data
rfc_grid.fit(x_train_scaled, y_train)

# Extract best params
print(f"Best params: {rfc_grid.best_params_}")
print(f"Best score: {rfc_grid.best_score_}")

#### 7.4.2. Create model with best parameters + fit model

In [None]:
rfc = RandomForestClassifier(n_estimators=rfc_grid.best_params_['n_estimators'])
rfc.fit(x_train_scaled, y_train)

#### 7.4.3. Make predictions

In [None]:
# Make predictions on validation set
predictions = rfc.predict(x_validation_scaled)

#### 7.4.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 7.4.5. Random Forest model predictions on test set

In [None]:
# Prediction on the test set
predictions = rfc.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 7.5. Extra Trees Classifier training

This kind of classifier is an ensemble of decision trees. It is similar to a Random Forest classifier, but by default each tree is trained on the whole dataset instead of a bootstrap sample, and the split thresholds are drawn at random rather than searched for the best cut.
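The difference in how the trees see the data is visible in the default `bootstrap` setting of the two scikit-learn estimators, as this short check shows:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Random Forest resamples the rows for every tree by default...
rf_bootstrap = RandomForestClassifier().bootstrap

# ...while Extra Trees trains every tree on the full training set by default
et_bootstrap = ExtraTreesClassifier().bootstrap

print(f"RandomForest bootstrap={rf_bootstrap}, ExtraTrees bootstrap={et_bootstrap}")
```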

#### 7.5.1. Grid search to find best hyperparameters for Extra Trees

In [None]:
# Create extra trees classifier
etc = ExtraTreesClassifier()

# Create a dictionary of all values we want to test for n_estimators
parameters = {'n_estimators': [1, 2, 4, 10, 15, 20, 30, 40, 50, 100, 200, 500]}

# Used to find the best n_estimators value to use to train the model
etc_grid = GridSearchCV(
  etc,
  parameters,
  scoring='accuracy',
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1 # Use all cores
)

# Fit model to data
etc_grid.fit(x_train_scaled, y_train)

# Extract best params
print(f"Best params: {etc_grid.best_params_}")
print(f"Best score: {etc_grid.best_score_}")

#### 7.5.2. Create model with best parameters + fit model

In [None]:
etc = ExtraTreesClassifier(n_estimators=etc_grid.best_params_['n_estimators'])
etc.fit(x_train_scaled, y_train)

#### 7.5.3. Make predictions

In [None]:
# Make predictions on validation set
predictions = etc.predict(x_validation_scaled)

#### 7.5.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 7.5.5. Extra Trees model predictions on test set

In [None]:
# Prediction on the test set
predictions = etc.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

### 7.6. Neural Network classifier training

#### 7.6.1. Grid search to find best hyperparameters for Neural Network

In [None]:
# Create MLPClassifier
mlp = MLPClassifier(
  max_iter=1000,
  random_state=42
)

# Grid search for MLPClassifier
parameters = {
  'hidden_layer_sizes': [(50,), (100,), (50, 50)],
  'activation': ['relu', 'tanh'],
  'alpha': [0.0001, 0.001],
  'solver': ['adam', 'lbfgs'],
  'learning_rate': ['constant', 'invscaling'],
}

mlp_grid = GridSearchCV(
  mlp,
  parameters,
  cv=2, # Only 2 folds because of the size of the dataset, otherwise it takes too long
  n_jobs=-1, # Use all cores
)

mlp_grid.fit(x_train_scaled, y_train)

In [None]:
# Extract best params
print(f"Best params: {mlp_grid.best_params_}")
print(f"Best score: {mlp_grid.best_score_}")
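To get a feeling for the cost of this search, the number of candidate configurations can be counted with scikit-learn's `ParameterGrid`. The dictionary below copies the grid from the cell above; the name `parameters_demo` is only used for this sketch.

```python
from sklearn.model_selection import ParameterGrid

parameters_demo = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001],
    'solver': ['adam', 'lbfgs'],
    'learning_rate': ['constant', 'invscaling'],
}

# One candidate per combination of values: 3 * 2 * 2 * 2 * 2
n_candidates = len(ParameterGrid(parameters_demo))

# With cv=2, every candidate is fitted twice during the search
n_fits = n_candidates * 2
print(n_candidates, n_fits)
```

The final refit on the whole training set adds one more fit on top of these.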

#### 7.6.2. Create model with best parameters + fit model

In [None]:
# Create MLPClassifier with best parameters
mlp = MLPClassifier(
  hidden_layer_sizes=mlp_grid.best_params_['hidden_layer_sizes'],
  activation=mlp_grid.best_params_['activation'],
  alpha=mlp_grid.best_params_['alpha'],
  solver=mlp_grid.best_params_['solver'],
  learning_rate=mlp_grid.best_params_['learning_rate'],
  max_iter=1000,
  random_state=42
)
mlp.fit(x_train_scaled, y_train)

#### 7.6.3. Make predictions

In [None]:
# Make predictions on validation set
predictions = mlp.predict(x_validation_scaled)

#### 7.6.4. Model evaluation

In [None]:
# Print the classification report
print(classification_report(y_val, predictions))

In [None]:
# Rename the columns and index for the confusion matrix
cmat = confusion_matrix(y_val, predictions)
cmat = pd.DataFrame(cmat, index=['Denial of Service', 'Malware', 'None', 'Port Scan'], columns=['Denial of Service', 'Malware', 'None', 'Port Scan'])

# Use seaborn to visualize the confusion matrix
sns.set(font_scale=1.4) # for label size
sns.heatmap(cmat, annot=True, fmt='d', cmap='YlGnBu')

#### 7.6.5. MLP classifier model predictions on test set

In [None]:
# Prediction on the test set
predictions = mlp.predict(x_test_scaled)

# Show the predictions on a histogram
fig = sns.countplot(x=predictions)
fig.set_title('Predictions distribution on the test set') # Set the title
fig.set_xticklabels(fig.get_xticklabels(), rotation=45) # Rotate x-labels
pd.Series(predictions).value_counts() # Print the predictions size per class

## 8. Comparative Analysis of Classification Models

I have observed outstanding performance across all tested classification models. It is worth noting that nearly every model exhibited exceptional accuracy, as evidenced by an f1-score nearing perfection (1.0).

Upon thorough consideration, I have come to realize that this phenomenon stems from the substantial resemblance between the utilized validation set (created by splitting the training set, given the absence of the target variable in the provided test set) and the training set employed for model training.

Presumably, the training set was constructed using data from a specific network, resulting in a significant overlap of features between both sets. Consequently, the models achieved highly accurate classifications for almost all flows within the validation set. However, their performance may not be equally robust when applied to a distinct test set derived from a different network.

Nevertheless, I have opted to present the classification outcomes of each model on the provided test set, despite the unavailability of performance metrics for evaluation.

Acquiring a test set originating from a diverse network would have been advantageous, enabling a more precise assessment of the models' performance. Regrettably, obtaining such a test set proved unfeasible for this particular project.
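One way to probe this concern without a new dataset, which was not done in this project, would be to hold out entire hosts so that no host contributes flows to both the training and the validation set. The sketch below uses scikit-learn's `GroupShuffleSplit` on made-up data; the host addresses, labels and names ending in `_demo` are assumptions for illustration, and in this notebook the IP address columns would have to be kept aside before they are dropped.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up flows: 4 hosts with 4 flows each
hosts = np.array(["10.0.0.1"] * 4 + ["10.0.0.2"] * 4 + ["10.0.0.3"] * 4 + ["10.0.0.4"] * 4)
X_demo = np.arange(len(hosts)).reshape(-1, 1)
y_demo = np.tile(["None", "Malware"], 8)

# Hold out 25% of the hosts (not 25% of the flows)
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(gss.split(X_demo, y_demo, groups=hosts))

# No host should appear on both sides of the split
shared_hosts = set(hosts[train_idx]) & set(hosts[val_idx])
print(shared_hosts, len(train_idx), len(val_idx))
```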