## 01 - Network Attack Detection with Machine Learning (Brute Force)

A **Network Intrusion Detection System (NIDS)** is a security technology designed to detect malicious or suspicious activity on a network. A NIDS monitors network traffic in real time, analyzing data packets and looking for patterns that may indicate an attack or anomalous behavior. For this experiment, the constructed NIDS has four stages:

1. **Data Acquisition**: Collecting data from a network environment.
2. **Feature Extraction**: Extracting features from network events.
3. **Classification**: Classifying using a Machine Learning algorithm.
4. **Alert**: Generating an alert for the operator on network events classified as attacks by the algorithm.
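
The four stages can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: the `Alert` class, `run_pipeline`, and the three `toy_*` functions are hypothetical names that do not come from the experiment, and the toy classifier is a fixed rule standing in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

import numpy as np


@dataclass
class Alert:
    event_index: int  # position of the event in the acquired batch
    score: float      # anomaly score reported by the classifier


def run_pipeline(
    acquire: Callable[[], Iterable[dict]],
    extract: Callable[[dict], List[float]],
    classify: Callable[[np.ndarray], np.ndarray],
) -> List[Alert]:
    """Chain the four stages: acquire -> extract -> classify -> alert."""
    events = list(acquire())                            # 1. Data Acquisition
    features = np.array([extract(e) for e in events])   # 2. Feature Extraction
    scores = classify(features)                         # 3. Classification
    # 4. Alert: one alert per event whose score is positive
    return [Alert(i, float(s)) for i, s in enumerate(scores) if s > 0]


# Toy stand-ins for each stage (hypothetical, for illustration only).
def toy_acquire():
    return [{"packets": 12, "bytes": 900}, {"packets": 5000, "bytes": 64000}]


def toy_extract(event):
    return [event["packets"], event["bytes"]]


def toy_classify(features):
    # Positive score when an event has more than 1000 packets;
    # a trained model would replace this rule.
    return features[:, 0] - 1000.0


alerts = run_pipeline(toy_acquire, toy_extract, toy_classify)
print(alerts)
```

Keeping each stage behind a plain function makes it possible to swap the toy classifier for any of the trained models without touching the other stages.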

The purpose of the experiment is to detect brute force attacks against a web system in a corporate network environment and to accurately identify the attacker.

In the first stage, Data Acquisition, the experiment used an environment of 10 virtual machines hosting a web system that received legitimate network operations from both internal and external sources. To represent a corporate environment, about 100 bots performed legitimate actions on the web system at random, within short intervals. These actions were monitored and captured, producing a PCAP file with about 5 million network events over 72 hours. A PCAP file containing about 1 million attack events was also generated to validate whether the algorithm could learn and identify the behavior of a brute force attack. Another PCAP file was created by performing various attacks in the environment alongside the legitimate actions of the bots, so that the machine learning algorithm would have to identify the attacks within the common network traffic.

In the second stage, a feature extraction algorithm was created, available at [PacketTraitAnalyzer](https://github.com/rogerwxd-projects/PacketTraitAnalyzer). It extracts various features that represent network behavior, to which the machine learning algorithms are then applied.
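
To give an idea of what feature extraction involves, the sketch below aggregates per-packet records into per-flow features with pandas. It is not the PacketTraitAnalyzer implementation: the input columns (`timestamp`, `src_ip`, `dst_ip`, `dst_port`, `length`) and the output feature names other than `StartTime` and `EndTime` are assumptions chosen for illustration.

```python
import pandas as pd


def extract_flow_features(packets: pd.DataFrame) -> pd.DataFrame:
    """Group packets into flows and compute simple per-flow statistics."""
    keys = ["src_ip", "dst_ip", "dst_port"]
    flows = packets.groupby(keys).agg(
        StartTime=("timestamp", "min"),       # first packet of the flow
        EndTime=("timestamp", "max"),         # last packet of the flow
        PacketCount=("length", "size"),       # number of packets
        TotalBytes=("length", "sum"),         # bytes carried by the flow
        MeanPacketSize=("length", "mean"),    # average packet length
    ).reset_index()
    flows["Duration"] = flows["EndTime"] - flows["StartTime"]
    return flows


# Toy packet records (hypothetical values).
packets = pd.DataFrame({
    "timestamp": [0.0, 0.5, 1.0, 0.2],
    "src_ip": ["10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9"],
    "dst_ip": ["10.0.0.1"] * 4,
    "dst_port": [80] * 4,
    "length": [100, 300, 200, 60],
})

flows = extract_flow_features(packets)
print(flows)
```

A brute force attack tends to produce many short, similar flows from one source, which is the kind of behavior that per-flow counts and durations can expose.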

In the third stage, machine learning algorithms were used to create a model that captures the behavior of the legitimate environment and classifies divergent behavior as an anomaly. Four anomaly detection algorithms were used: Isolation Forest, LOF (Local Outlier Factor), One-Class SVM, and Autoencoder. During training, each algorithm achieved about 88% to 95% accuracy on the legitimate environment. The training traffic also included a few attacks launched at random by the bots, to force the identification of some abnormal behavior even during learning.

Below is the execution of training and validation, followed by the test to validate if the algorithm can identify network attacks as anomalies.



# Training the algorithms

Before training, features that do not help the algorithms learn network behavior were removed. They remain necessary only when the attacker has to be correctly identified. These features are: StartTime, EndTime, SourceIP, DestinationIP, SourcePort, DestinationPort, Protocol, Flags.
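
One way to drop these columns for learning while still being able to name the attacker is to set them aside rather than discard them. The sketch below assumes this approach; `split_identifiers` is a hypothetical helper, and `PacketCount` and `TotalBytes` are made-up feature names used only to build a toy table.

```python
import numpy as np
import pandas as pd

ID_COLUMNS = ["StartTime", "EndTime", "SourceIP", "DestinationIP",
              "SourcePort", "DestinationPort", "Protocol", "Flags"]


def split_identifiers(df: pd.DataFrame):
    """Return (identifiers, features) with the same row order."""
    ids = df[ID_COLUMNS].copy()
    features = df.drop(columns=ID_COLUMNS)
    return ids, features


# Toy rows; PacketCount and TotalBytes are hypothetical feature names.
events = pd.DataFrame({
    "StartTime": [0.0, 1.0, 2.0],
    "EndTime": [0.5, 1.5, 9.0],
    "SourceIP": ["10.0.0.5", "10.0.0.6", "203.0.113.7"],
    "DestinationIP": ["10.0.0.1"] * 3,
    "SourcePort": [50001, 50002, 50003],
    "DestinationPort": [80, 80, 80],
    "Protocol": ["TCP"] * 3,
    "Flags": ["S", "S", "S"],
    "PacketCount": [12, 15, 4800],
    "TotalBytes": [900, 1100, 310000],
})

ids, features = split_identifiers(events)

# Stand-in for model.predict(): -1 marks an anomaly, 1 a normal event.
predictions = np.array([1, 1, -1])

# Rows keep their order, so predictions map straight back to the identifiers.
suspects = ids.loc[predictions == -1, "SourceIP"].tolist()
print(suspects)
```

Only `features` would be scaled and passed to the models; `ids` is consulted afterwards, once an event has been flagged.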

In [1]:
##  Dataset ##
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense
import joblib

df = pd.read_csv('dataset/dataset-full.csv')
columns_to_remove = ['StartTime','EndTime','SourceIP','DestinationIP','SourcePort','DestinationPort','Protocol','Flags']
df = df.drop(columns=columns_to_remove)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
train_df, test_df = train_test_split(df_scaled, test_size=0.3, random_state=42)

In [2]:
## Training Isolation Forest ##
isolation_forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
isolation_forest.fit(train_df)
joblib.dump(isolation_forest, 'model/isolation_forest_model.joblib')

test_predictions = isolation_forest.predict(test_df)
correct_predictions = np.sum(test_predictions == 1)  # Count the predictions labeled normal (+1)
total_samples = len(test_df)
accuracy = correct_predictions / total_samples * 100

print("Isolation Forest:")
print(f"Accuracy: {accuracy:.2f}%")

Isolation Forest:
Accuracy: 89.92%


In [3]:
## Training Local Outlier Factor (LOF) ##
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1, novelty=True)
lof.fit(train_df)
joblib.dump(lof, 'model/lof_model.joblib')

test_predictions = lof.predict(test_df)
correct_predictions = np.sum(test_predictions == 1)  # Count the predictions labeled normal (+1)
accuracy = correct_predictions / total_samples * 100

print("\nLocal Outlier Factor (LOF):")
print(f"Accuracy: {accuracy:.2f}%")


Local Outlier Factor (LOF):
Accuracy: 88.53%


In [4]:
## One-Class SVM ##
one_class_svm = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
one_class_svm.fit(train_df)
joblib.dump(one_class_svm, 'model/one_class_svm_model.joblib')

test_predictions = one_class_svm.predict(test_df)
correct_predictions = np.sum(test_predictions == 1)
accuracy = correct_predictions / total_samples * 100

print("\nOne-Class SVM:")
print(f"Accuracy: {accuracy:.2f}%")


One-Class SVM:
Accuracy: 89.85%


In [5]:
## Training Autoencoder ##
input_dim = train_df.shape[1]
input_layer = Input(shape=(input_dim,))
encoder = Dense(14, activation="relu")(input_layer)
encoder = Dense(7, activation="relu")(encoder)
encoder = Dense(3, activation="relu")(encoder)
decoder = Dense(7, activation="relu")(encoder)
decoder = Dense(14, activation="relu")(decoder)
decoder = Dense(input_dim, activation="sigmoid")(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer="adam", loss="mean_squared_error")
autoencoder.fit(train_df, train_df, epochs=100, batch_size=32, shuffle=True, validation_split=0.1)
autoencoder.save('model/autoencoder_model.h5')
joblib.dump(scaler, 'model/autoencoder_scaler.joblib')

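reconstructions = autoencoder.predict(test_df)
# Per-sample reconstruction error (mean squared error across features)
mse = np.mean(np.power(test_df - reconstructions, 2), axis=1)
# The threshold is the 95th percentile of the test errors themselves,
# so about 95% of the samples fall at or below it by construction
threshold = np.percentile(mse, 95)
anomalies = (mse > threshold).astype(int)  # 1 = anomaly, 0 = normal
correct_predictions = np.sum(anomalies == 0)  # samples labeled normal
accuracy = correct_predictions / total_samples * 100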

print("\nAutoencoder:")
print(f"Accuracy: {accuracy:.2f}%")

Epoch 1/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 9s 1ms/step - loss: 0.7680 - val_loss: 0.6368
Epoch 2/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 989us/step - loss: 0.6589 - val_loss: 0.6337
Epoch 3/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 1ms/step - loss: 0.5950 - val_loss: 0.6302
Epoch 4/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 987us/step - loss: 0.6799 - val_loss: 0.6276
Epoch 5/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 986us/step - loss: 0.6966 - val_loss: 0.6273
Epoch 6/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 989us/step - loss: 0.6093 - val_loss: 0.6136
Epoch 7/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 994us/step - loss: 0.5994 - val_loss: 0.6133
Epoch 8/100
6655/6655 ━━━━━━━━━━━━━━━━━━━━ 7s 982us/step - loss: 0.6462 - val_loss: 0.6131
...

3169/3169 ━━━━━━━━━━━━━━━━━━━━ 2s 688us/step

Autoencoder:
Accuracy: 95.00%
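
In the cell above the threshold is `np.percentile(mse, 95)` over the test errors, so roughly 95% of the test samples are labeled normal regardless of how well the model reconstructs them. A possible alternative, sketched here under assumptions, is to fix the threshold once from the reconstruction errors of legitimate training traffic and reuse it on any later data. The helper names and the synthetic error values are hypothetical and do not come from the experiment.

```python
import numpy as np


def fit_threshold(train_errors: np.ndarray, percentile: float = 95.0) -> float:
    """Derive the anomaly threshold from errors on legitimate training data."""
    return float(np.percentile(train_errors, percentile))


def flag_anomalies(errors: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean mask: True where the error exceeds the fixed threshold."""
    return errors > threshold


# Synthetic reconstruction errors (assumed values, for illustration only).
rng = np.random.default_rng(0)
train_errors = rng.normal(loc=0.5, scale=0.1, size=10_000)   # legitimate traffic
attack_errors = rng.normal(loc=2.0, scale=0.3, size=1_000)   # attack traffic

threshold = fit_threshold(train_errors)
detection_rate = flag_anomalies(attack_errors, threshold).mean() * 100
false_alarm_rate = flag_anomalies(train_errors, threshold).mean() * 100

print(f"Threshold: {threshold:.3f}")
print(f"Attack events flagged: {detection_rate:.2f}%")
print(f"Legitimate events flagged: {false_alarm_rate:.2f}%")
```

With a threshold fixed in this way, the share of attack events flagged depends on how far their errors sit from the legitimate ones, not on the percentile itself.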


# Testing whether the algorithms identify network attacks

In [6]:
import pandas as pd
import numpy as np
from keras.models import load_model
import joblib

dataset = 'dataset/attack.csv'

def load_and_scale_data(file_path):
    df = pd.read_csv(file_path)
    columns_to_remove = ['StartTime', 'EndTime', 'SourceIP', 'DestinationIP', 'SourcePort', 'DestinationPort', 'Protocol', 'Flags']
    df = df.drop(columns=columns_to_remove)
    scaler = joblib.load('model/autoencoder_scaler.joblib')
    df_scaled = scaler.transform(df)
    return df, df_scaled

new_df, new_df_scaled = load_and_scale_data(dataset)

def predict_anomalies(model, df, df_scaled, model_name):
    # scikit-learn outlier detectors return -1 for anomalies and 1 for normal points
    predictions = model.predict(df_scaled)
    df['anomaly'] = predictions
    anomaly_count = int(np.sum(predictions == -1))
    # attack.csv is treated as containing only attack events, so the share
    # flagged as anomalous is reported as the accuracy
    accuracy = anomaly_count / len(df_scaled) * 100
    print(f"Accuracy: {accuracy:.2f}%")

print("\nTest Isolation Forest")
isolation_forest = joblib.load('model/isolation_forest_model.joblib')
predict_anomalies(isolation_forest, new_df.copy(), new_df_scaled, 'isolation_forest')

print("\nTest Local Outlier Factor (LOF)")
lof = joblib.load('model/lof_model.joblib')
predict_anomalies(lof, new_df.copy(), new_df_scaled, 'lof')

print("\nTest One-class SVM")
one_class_svm = joblib.load('model/one_class_svm_model.joblib')
predict_anomalies(one_class_svm, new_df.copy(), new_df_scaled, 'one_class_svm')

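autoencoder = load_model('model/autoencoder_model.h5')
reconstructions = autoencoder.predict(new_df_scaled)
# Per-sample reconstruction error (mean squared error across features)
mse = np.mean(np.power(new_df_scaled - reconstructions, 2), axis=1)
# Threshold taken as the 95th percentile of the errors of the data being scored
threshold = np.percentile(mse, 95)
anomalies = (mse > threshold).astype(int)  # 1 = anomaly, 0 = normal
correct_predictions = np.sum(anomalies == 0)  # samples at or below the threshold
accuracy = correct_predictions / len(new_df_scaled) * 100
print("\nTest - Autoencoder:")
print(f"Accuracy: {accuracy:.2f}%")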


Test Isolation Forest
Accuracy: 100.00%

Test Local Outlier Factor (LOF)
Accuracy: 99.63%

Test One-class SVM
Accuracy: 99.88%
26/26 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step

Test - Autoencoder:
Accuracy: 94.91%
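
The fourth stage, Alert, is not shown in the cells above. A minimal sketch of how flagged events could be turned into operator alerts is given below, assuming the `SourceIP` column was kept aside before scaling. The `build_alerts` helper, the `min_events` parameter, and the `AnomalousEvents` column name are hypothetical.

```python
import numpy as np
import pandas as pd


def build_alerts(ids: pd.DataFrame, predictions, min_events: int = 2) -> pd.DataFrame:
    """Summarize anomalous events per source address.

    ids          identifier columns, one row per event (must include SourceIP)
    predictions  model output per event: -1 = anomaly, 1 = normal
    min_events   minimum anomalous events before a source is reported
    """
    flagged = ids.loc[np.asarray(predictions) == -1]
    counts = flagged.groupby("SourceIP").size().rename("AnomalousEvents")
    counts = counts[counts >= min_events].sort_values(ascending=False)
    return counts.reset_index()


# Toy identifiers and predictions (hypothetical values).
ids = pd.DataFrame({
    "SourceIP": ["10.0.0.5", "10.0.0.5", "10.0.0.9", "10.0.0.5", "10.0.0.7"],
})
predictions = [-1, -1, 1, -1, -1]

alerts = build_alerts(ids, predictions)
for row in alerts.itertuples(index=False):
    print(f"ALERT: {row.SourceIP} produced {row.AnomalousEvents} anomalous events")
```

Requiring a minimum number of anomalous events per source is one simple way to keep isolated false positives from reaching the operator; the right value would depend on the traffic volume.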


# Conclusion

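In conclusion, the algorithms identified network attack behavior with a reported accuracy of at least **94%**: Isolation Forest, LOF, and One-Class SVM flagged between 99.63% and 100% of the attack events as anomalies, and the Autoencoder reported 94.91%. The Autoencoder figure should be read with care, because its threshold is the 95th percentile of the reconstruction errors of the data being evaluated. These results indicate that it is possible to create a project for identifying various types of attacks. This is noteworthy considering that the algorithms were used with fixed settings, and no other parameters were tested or optimized.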