# Network Intrusion Detection Using Machine Learning

## Overview of Anomaly Detection in Cybersecurity

Anomaly detection is a critical technique that identifies data points exhibiting significant deviations from expected patterns. In the realm of cybersecurity, these anomalies often signal malicious activities, unauthorized network intrusions, or various security breaches. Random forests, as ensemble machine learning algorithms, excel at processing complex, high-dimensional datasets and prove particularly effective for detecting these anomalous network patterns.


## Understanding Random Forest Algorithms

A Random Forest represents an ensemble machine learning approach that constructs multiple decision trees and combines their individual predictions. In classification scenarios, each tree contributes a vote for a specific class, with the final prediction determined by majority consensus. For regression tasks, the algorithm calculates the average of all individual tree outputs to produce the final result.

The ensemble nature of random forests provides superior generalization compared to single decision trees, effectively reducing overfitting while delivering robust performance across high-dimensional feature spaces.

### Core Principles of Random Forest Construction

Three fundamental concepts govern the development of a random forest:

**Bootstrapping**: The algorithm creates multiple training data subsets through sampling with replacement, where each subset trains an independent decision tree.

**Tree Construction**: During each tree's development, only a randomly selected subset of features is considered at every split point, promoting diversity and minimizing correlations between trees.

**Voting Mechanism**: After training completion, classification relies on majority voting, while regression employs prediction averaging across all trees.
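
The three principles above can be sketched by hand with individual decision trees. This is a minimal illustration, not the lab's method: the synthetic data from `make_classification`, the tree count of 25, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the NSL-KDD dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrapping: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Tree construction: max_features="sqrt" limits every split to a random feature subset
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(0, 1_000_000)))
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Voting: every tree predicts, and the most frequent class per sample wins
all_preds = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
forest_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])
print("Agreement with training labels:", (forest_pred == y).mean())
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging predicted class probabilities rather than counting hard votes, so this sketch mirrors the textbook description more closely than the library internals.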


## Applying Random Forests to Anomaly Detection

When a random forest is used for anomaly detection, the model first learns what the traffic it was trained on looks like. New, unseen data points are then evaluated against this learned behavior, and points that fit poorly or produce low-confidence predictions can be flagged as potential anomalies. Because `RandomForestClassifier` is a supervised algorithm, the labeled NSL-KDD data prepared later in this lab supplies both normal and attack examples for training.

This methodology enables the detection of unusual patterns, making it particularly valuable for identifying suspicious network traffic and potential security threats.
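
One possible reading of "low-confidence predictions" is sketched below: the highest class probability returned by `predict_proba` is treated as a confidence score and compared with a cut-off. The synthetic data, the 0.7 threshold, and the variable names are all assumptions chosen for illustration, not values from the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; this is not the NSL-KDD dataset)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:300], y[:300])

# Confidence = the highest class probability the forest assigns to each sample
proba = clf.predict_proba(X[300:])
confidence = proba.max(axis=1)

threshold = 0.7  # hypothetical cut-off; it would need tuning on real traffic
low_confidence = confidence < threshold
print(f"{low_confidence.sum()} of {len(confidence)} samples fall below the threshold")
```

Samples below the threshold are candidates for manual review rather than confirmed intrusions.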


## Introduction to the NSL-KDD Dataset

The NSL-KDD dataset is a refined version of the original KDD Cup 1999 dataset. It addresses known limitations of the original by removing redundant and duplicate records, which otherwise bias models toward the most frequent traffic. The refined dataset has become a standard benchmark for evaluating intrusion detection systems and machine learning models.

NSL-KDD provides labeled instances covering both normal and malicious network activity. This structure supports binary classification (normal versus abnormal traffic) as well as multi-class detection of specific attack types, which makes NSL-KDD a useful resource for developing, testing, and validating intrusion detection methods.

For this lab, we will utilize a modified version of this dataset.


## Dataset Acquisition and Preparation

### Step 1: Downloading the Dataset

Before loading the NSL-KDD dataset, we must retrieve it from the provided source. We download the compressed .zip file with the third-party `requests` library, then extract its contents locally with the standard-library `zipfile` and `io` modules.


In [None]:
import requests, zipfile, io

# URL for the NSL-KDD dataset
url = "https://academy.hackthebox.com/storage/modules/292/KDD_dataset.zip"

# Download the zip file and extract its contents
response = requests.get(url, timeout=60)
response.raise_for_status()  # Stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall('.')  # Extracts to the current directory
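
The extraction step can also be wrapped in a small helper that validates the archive first and reports what it unpacked. The helper name `extract_zip_bytes` is hypothetical, and the demo uses an in-memory archive with a made-up member name so that it runs without network access.

```python
import io
import os
import tempfile
import zipfile

def extract_zip_bytes(data, dest="."):
    """Extract an in-memory zip archive into dest and return its member names."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        raise ValueError("Content is not a valid zip archive")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(dest)
        return archive.namelist()

# Demo: build a tiny archive in memory instead of downloading one
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "0,tcp,ftp_data\n")

with tempfile.TemporaryDirectory() as tmp:
    names = extract_zip_bytes(buffer.getvalue(), tmp)
    extracted_ok = os.path.exists(os.path.join(tmp, "example.txt"))
    print(names, extracted_ok)
```

With a real download, the call would take the response body as its first argument, for example `extract_zip_bytes(response.content)`.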


### Step 2: Loading the Dataset

Properly loading the NSL-KDD dataset is essential before initiating the preprocessing stage. This ensures that the data maintains consistent structure, with each column containing accurate information. Once loaded, the dataset can be inspected for quality, completeness, and potential preprocessing requirements.

#### Importing Required Libraries

We begin by importing all necessary libraries for our analysis.


In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt


In this code snippet:

- **numpy** and **pandas** handle data loading and cleaning operations
- **RandomForestClassifier** provides the machine learning algorithm we will use for anomaly detection
- **train_test_split** and various metrics from **sklearn.metrics** support model evaluation and validation processes
- **seaborn** and **matplotlib** assist in visualizing data distributions, relationships, and model results


#### Defining Column Names and File Path

The NSL-KDD data file has no header row, so we must supply meaningful column names ourselves. We define a list of names that correspond, in order, to the observed characteristics of each network connection and its label. We also set the `file_path` variable to point to the dataset file so that pandas can locate and read the data.


In [None]:
# Set the file path to the dataset
file_path = r'KDD+.txt'

# Define the column names corresponding to the NSL-KDD dataset
columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 
    'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 
    'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 
    'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack', 'level'
]


These column names ensure that each feature and label is properly identified. They encompass:

- **Generic network statistics** (e.g., duration, src_bytes, dst_bytes)
- **Categorical fields** (protocol_type, service, flag)
- **Label fields**: attack names the type of traffic observed, and level records the difficulty score assigned to each record


#### Reading the Dataset into a DataFrame

With the file path and column names properly defined, we proceed to load the data into a pandas DataFrame. This provides a structured, tabular representation of the dataset, facilitating inspection, preprocessing, and visualization operations.


In [None]:
# Read the combined NSL-KDD dataset into a DataFrame
df = pd.read_csv(file_path, names=columns)


After this code runs, the DataFrame `df` contains all the data from the NSL-KDD dataset with the appropriate column headers, ready for inspection, cleaning, and preprocessing. Before proceeding, we can briefly examine the dataset's structure, check for missing values, and confirm that all features align with their intended data types.


In [None]:
print(df.head())
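
Beyond `head()`, a few more pandas calls cover the structure, missing-value, and data-type checks mentioned above. The sketch below runs on a three-row stand-in with a reduced, hypothetical column list (`demo_columns`); with the real data, the same calls would be made on `df`.

```python
import io
import pandas as pd

# Three-row stand-in with a reduced, hypothetical column set
sample = io.StringIO(
    "0,tcp,ftp_data,491,normal\n"
    "0,tcp,private,0,neptune\n"
    "0,udp,other,146,normal\n"
)
demo_columns = ['duration', 'protocol_type', 'service', 'src_bytes', 'attack']
demo_df = pd.read_csv(sample, names=demo_columns)  # names= implies the file has no header row

print(demo_df.shape)                      # (rows, columns)
print(demo_df.dtypes)                     # data type inferred for each column
print(demo_df.isnull().sum())             # count of missing values per column
print(demo_df['attack'].value_counts())   # label distribution
```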


# Data Preprocessing and Dataset Preparation

## Overview of Data Preprocessing

This section focuses on preparing the NSL-KDD dataset for training a random forest anomaly detection model. The primary objective is to transform raw network traffic data into a machine-readable format by establishing classification targets, encoding categorical variables, and selecting relevant numeric features. We will generate both binary and multi-class targets, ensure categorical data compatibility with machine learning algorithms, and preserve numeric metrics essential for detecting abnormal traffic patterns.


## Creating Classification Targets

### Binary Classification Target

The binary classification target serves to identify whether network traffic represents normal or anomalous behavior. We create a new column `attack_flag` in the DataFrame `df` to accomplish this objective. Each data row receives a label of 0 for normal traffic and 1 for any type of attack. This transformation simplifies the initial detection challenge into a fundamental normal-versus-attack classification, providing a foundation for more detailed analysis.


In [None]:
# Binary classification target
# Maps normal traffic to 0 and any type of attack to 1
df['attack_flag'] = df['attack'].apply(lambda a: 0 if a == 'normal' else 1)


The string `normal` is the label the dataset itself uses in the `attack` column. Examining the raw records shows that every connection is labeled either normal or with the name of a specific attack:

```
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,150,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20
0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.0,1.0,0.0,0.0,0.05,0.07,0.0,255,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
```
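
The same binary target can also be produced without `apply`, using a vectorized comparison. This is a sketch on a four-row stand-in DataFrame (`demo` is a hypothetical name); it is intended to give the same labels as the lambda-based version.

```python
import pandas as pd

# Small stand-in for the 'attack' column
demo = pd.DataFrame({'attack': ['normal', 'neptune', 'normal', 'smurf']})

# The comparison yields booleans; astype(int) converts True/False to 1/0
demo['attack_flag'] = (demo['attack'] != 'normal').astype(int)

# Checking the label distribution is a quick sanity test after creating a target
print(demo['attack_flag'].value_counts())
```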


### Multi-Class Classification Target

While binary classification provides useful insights, it lacks the granularity needed for comprehensive threat analysis. To address this limitation, we create a multi-class classification target that distinguishes between different categories of attacks. We define specific lists that categorize various attacks into four major groups:

- **DoS (Denial of Service) attacks** such as neptune and smurf
- **Probe attacks** that scan networks for vulnerabilities, like satan or ipsweep  
- **Privilege Escalation attacks** that attempt to gain unauthorized admin-level control, such as buffer_overflow
- **Access attacks** that seek to breach system access controls, like guess_passwd

A custom function `map_attack` examines the attack type and assigns it an integer value:

- 0 for normal traffic
- 1 for DoS attacks
- 2 for Probe attacks
- 3 for Privilege Escalation attacks
- 4 for Access attacks

This expanded classification framework enables models to learn not only to distinguish between normal and abnormal traffic but also to identify the specific nature of observed attacks.


In [None]:
# Multi-class classification target categories
dos_attacks = ['apache2', 'back', 'land', 'neptune', 'mailbomb', 'pod', 
               'processtable', 'smurf', 'teardrop', 'udpstorm', 'worm']
probe_attacks = ['ipsweep', 'mscan', 'nmap', 'portsweep', 'saint', 'satan']
privilege_attacks = ['buffer_overflow', 'loadmodule', 'perl', 'ps', 
                     'rootkit', 'sqlattack', 'xterm']
access_attacks = ['ftp_write', 'guess_passwd', 'http_tunnel', 'imap', 
                  'multihop', 'named', 'phf', 'sendmail', 'snmpgetattack', 
                  'snmpguess', 'spy', 'warezclient', 'warezmaster', 
                  'xclock', 'xsnoop']

def map_attack(attack):
    if attack in dos_attacks:
        return 1
    elif attack in probe_attacks:
        return 2
    elif attack in privilege_attacks:
        return 3
    elif attack in access_attacks:
        return 4
    else:
        # 'normal' and any label not listed above fall through to class 0
        return 0

# Assign multi-class category to each row
df['attack_map'] = df['attack'].apply(map_attack)
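
Because `map_attack` returns 0 for any label it does not recognize, a misspelled or new attack name would silently be counted as normal traffic. One way to surface such cases is a dictionary lookup, sketched below with shortened, hypothetical category lists; `demo_categories`, `label_to_class`, and the `brand_new_attack` label are illustrative only.

```python
import pandas as pd

# Hypothetical, shortened category lists for illustration
demo_categories = {
    1: ['neptune', 'smurf'],     # DoS
    2: ['satan', 'ipsweep'],     # Probe
    3: ['buffer_overflow'],      # Privilege Escalation
    4: ['guess_passwd'],         # Access
}
# Invert the structure into a flat label -> class dictionary
label_to_class = {name: cls for cls, names in demo_categories.items() for name in names}
label_to_class['normal'] = 0

demo_attacks = pd.Series(['normal', 'neptune', 'satan', 'guess_passwd', 'brand_new_attack'])
mapped = demo_attacks.map(label_to_class)        # labels missing from the dict become NaN
unknown = demo_attacks[mapped.isna()].unique()   # collect the labels that were not mapped

print(mapped.tolist())
print("Unmapped labels:", list(unknown))
```

An empty list of unmapped labels confirms that every attack name in the data belongs to one of the defined categories.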


## Feature Engineering and Encoding

### Encoding Categorical Variables

Network traffic data frequently includes categorical attributes that are not directly compatible with machine learning algorithms, which typically require numeric inputs. Two critical features in the NSL-KDD dataset are `protocol_type` (e.g., tcp, udp) and `service` (e.g., http, ftp). These features categorize the nature of network interactions but must be converted into numeric format for algorithmic processing.

We employ one-hot encoding, implemented through the `get_dummies` function in pandas. This approach generates a binary indicator variable for each category, ensuring that no ordinal relationship is implied between different protocols or services. After encoding, each categorical value is represented by a separate column indicating its presence (1) or absence (0).


In [None]:
# Encoding categorical variables
features_to_encode = ['protocol_type', 'service']
encoded = pd.get_dummies(df[features_to_encode])
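
To make the effect of one-hot encoding concrete, the sketch below applies `get_dummies` to a three-row stand-in (the `demo` DataFrame and its values are assumptions) and prints the resulting indicator columns.

```python
import pandas as pd

# Three-row stand-in with the two categorical columns
demo = pd.DataFrame({
    'protocol_type': ['tcp', 'udp', 'tcp'],
    'service': ['http', 'domain_u', 'ftp'],
})

demo_encoded = pd.get_dummies(demo[['protocol_type', 'service']])

# Each new column is named <original column>_<category>
print(demo_encoded.columns.tolist())
# Cast to int for display; each row has exactly one active indicator per original column
print(demo_encoded.astype(int))
```

Every row ends up with exactly two active indicators, one for its protocol and one for its service.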


### Selecting Numeric Features

Beyond categorical variables, the dataset contains a comprehensive range of numeric features that describe various aspects of network traffic. These include fundamental metrics like duration, src_bytes, and dst_bytes, as well as more specialized features such as serror_rate and dst_host_srv_diff_host_rate, which capture statistical properties of network sessions. By selecting these numeric features, we ensure the model has access to both raw volume data and more nuanced, derived statistics that help distinguish normal from abnormal patterns.


In [None]:
# Numeric features that capture various statistical properties of the traffic
numeric_features = [
    'duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 
    'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 
    'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate'
]


### Preparing the Final Dataset

The final step involves combining the one-hot encoded categorical features with the selected numeric features. We merge them into a single DataFrame `train_set` that will serve as the primary input to our machine learning model. We also store the multi-class target variable `attack_map` as `multi_y` for classification tasks. At this stage, the data is in a suitable format for splitting into training, validation, and test sets, and subsequently training the random forest anomaly detection model.


In [None]:
# Combine encoded categorical variables and numeric features
train_set = encoded.join(df[numeric_features])

# Multi-class target variable
multi_y = df['attack_map']
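
After joining the two feature groups, a few quick checks can confirm that no rows were lost, no missing values appeared, and every column is usable by the model. The sketch below uses a small stand-in DataFrame; `demo`, `demo_encoded`, and `demo_features` are hypothetical names, and the same checks would apply to `train_set`.

```python
import pandas as pd

# Small stand-in with one categorical and two numeric columns
demo = pd.DataFrame({
    'protocol_type': ['tcp', 'udp', 'icmp'],
    'src_bytes': [491, 146, 0],
    'dst_bytes': [0, 0, 0],
})
demo_encoded = pd.get_dummies(demo[['protocol_type']])
# join aligns on the index, so both frames must share the same row labels
demo_features = demo_encoded.join(demo[['src_bytes', 'dst_bytes']])

same_length = len(demo_features) == len(demo)            # no rows lost or duplicated
has_missing = demo_features.isnull().values.any()        # join introduced no NaN values
non_numeric = demo_features.select_dtypes(exclude=['number', 'bool']).columns.tolist()

print(demo_features.shape, same_length, has_missing, non_numeric)
```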


## Dataset Splitting Strategy

In the Data Transformation section, we discussed the rationale and methods for splitting data into training, validation, and test sets. We now apply those principles specifically to the NSL-KDD dataset, ensuring that our random forest anomaly detection model is trained, tuned, and evaluated on distinct subsets.

### Splitting Data into Training and Test Sets

We use `train_test_split` to allocate a portion of the data for testing, ensuring that our final evaluations occur on unseen data.


In [None]:
# Split data into training and test sets for multi-class classification
train_X, test_X, train_y, test_y = train_test_split(train_set, multi_y, test_size=0.2, random_state=1337)
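
When some classes are rare, a purely random split can leave very few of them in the test set. `train_test_split` accepts a `stratify` argument that preserves class proportions in both subsets. The sketch below is an optional variation on the split above, shown on synthetic labels; the `demo_X` and `demo_y` arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels: 90 samples of class 0 and 10 of class 1
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.array([0] * 90 + [1] * 10)

# stratify=demo_y keeps the 90/10 class ratio in both resulting subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=1337, stratify=demo_y
)

# Per-class counts in the training and test subsets
print(np.bincount(y_tr), np.bincount(y_te))
```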


### Creating a Validation Set from the Training Data

We further split the training data to create a validation set. This supports model tuning and hyperparameter optimization without contaminating the final test data.


In [None]:
# Further split the training set into separate training and validation sets
multi_train_X, multi_val_X, multi_train_y, multi_val_y = train_test_split(train_X, train_y, test_size=0.3, random_state=1337)
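
The sketch below illustrates how a validation set can guide one hyperparameter choice without touching the test data. It runs on synthetic data, and the candidate values for `n_estimators` are hypothetical; it is not the tuning procedure used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the NSL-KDD dataset)
demo_X, demo_y = make_classification(n_samples=500, n_features=10, random_state=0)
fit_X, val_X, fit_y, val_y = train_test_split(demo_X, demo_y, test_size=0.3, random_state=1337)

val_scores = {}
for n_trees in (10, 50, 100):  # hypothetical candidate values
    model = RandomForestClassifier(n_estimators=n_trees, random_state=1337)
    model.fit(fit_X, fit_y)
    # Score on the validation subset only; the test set stays untouched
    val_scores[n_trees] = accuracy_score(val_y, model.predict(val_X))

best_n = max(val_scores, key=val_scores.get)
print(val_scores, "best:", best_n)
```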


### Final Split Variables Summary

After completing the splitting process, we have the following datasets:

- **train_X, train_y**: 80% of the data; the pool from which the training and validation subsets are drawn
- **test_X, test_y**: 20% of the data, reserved exclusively for final performance evaluation
- **multi_train_X, multi_train_y**: 70% of the pool (56% of all data), used for fitting the model
- **multi_val_X, multi_val_y**: 30% of the pool (24% of all data), used for hyperparameter tuning

This partitioning, applied after the transformations and encodings discussed earlier, keeps model development consistent and ensures that the final evaluation uses data the model never saw during training or tuning.
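
The two `test_size` values combine multiplicatively, which determines the share of the full dataset that each subset receives. The short calculation below works this out; the variable names are illustrative.

```python
test_fraction = 0.2   # test_size in the first split
val_fraction = 0.3    # test_size in the second split, applied to the remaining data

train_pool = 1 - test_fraction                  # share left after holding out the test set
final_train = train_pool * (1 - val_fraction)   # share used to fit the model
final_val = train_pool * val_fraction           # share used for validation

print(f"train={final_train:.2f}, validation={final_val:.2f}, test={test_fraction:.2f}")
```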
