Now that we have some pcap files, we will write a script to train our ML algorithm on pcaps that are known to contain portscans and pcaps without portscans.

There are 9 Steps in this activity.

Step 1: Install the required libraries

In [3]:
#install the required libraries
%pip install scapy
%pip install numpy
%pip install -U scikit-learn



Step 2: Import all the needed libs

In [4]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
from scapy.all import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Step 3: We will write a function that converts the TCP header flags, passed in as the string form of scapy's FlagValue type, into an integer so that the set flags can be used as a numeric feature.

In [5]:
# Function to convert custom flags to numeric representation
def convert_flags_to_numeric(flags):
    # Map each unique flag to a numeric value
    flag_mapping = {'R': 4, 'S': 2, 'F': 1, 'P': 8, 'A':16, 'U':32, 'SA':18, 'PA':24,'FPU':41,'FRP':13,'RA':20,'FA':17}  # Add more flags as needed
    return flag_mapping.get(flags, 0)  # Default to 0 if flag is not in mapping
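
The lookup table above only covers the combinations it lists; any other combination falls back to 0. A minimal alternative sketch, assuming the flag string uses the standard one-letter TCP flag names, combines the bit value of each letter instead. The names `FLAG_BITS` and `flags_to_int` are hypothetical and not part of the original notebook.

```python
# Bit value of each TCP flag letter (standard TCP header bit positions).
FLAG_BITS = {'F': 1, 'S': 2, 'R': 4, 'P': 8, 'A': 16, 'U': 32, 'E': 64, 'C': 128}

def flags_to_int(flags):
    """Combine the bit values of every flag letter in a string such as 'SA'."""
    value = 0
    for letter in str(flags):
        value |= FLAG_BITS.get(letter, 0)  # letters not in the table contribute nothing
    return value

print(flags_to_int('SA'))   # 18
print(flags_to_int('FPU'))  # 41
```

For the combinations already in `flag_mapping`, this gives the same numbers, so it could be swapped in without changing the existing feature values.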

Step 4: We will now extract some features from the packets in the network traffic to train the RF model

In [6]:
# Function to extract features from a packet
def extract_features(packet):
    # Add more features based on your requirements

    features = [
        len(packet),
        packet[IP].ttl,
        convert_flags_to_numeric(str(packet[TCP].flags)),
        packet[TCP].sport,
        packet[TCP].dport,
    ]
    return features

Step 5: After extracting the features, it is time to load the training dataset (pcap files). The function below loads every pcap file in a folder of portscan traffic and extracts the features using the function above.

Thanks to Julie, Nicolaj and Orestis for providing the pcap files from their project work. 🙂

In [7]:
# Function to load PCAP files and extract features
def load_data(folder_path):
    data = []
    labels = []

    for filename in os.listdir(folder_path):
        if filename.endswith(".pcapng"):
            file_path = os.path.join(folder_path, filename)
            packets = rdpcap(file_path)

            for packet in packets:
                if IP in packet and TCP in packet:
                    features = extract_features(packet)
                    data.append(features)
                    # Ensure labels are numeric (0 or 1)
                    labels.append(1 if packet[TCP].flags == "SA" else 0)

                    #Assignment: Change the above statement for more accurate portscan labelling by adding more flags
                    #labels.append(1 if packet[TCP].flags == "SA" or packet[TCP].flags == ?? or packet[TCP].flags == ?? else 0)


    return np.array(data), np.array(labels)
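
The labelling rule in `load_data` can be factored into a small helper so that extra flag combinations are added in one place. This is only a sketch: `SCAN_FLAGS` and `is_scan_flags` are hypothetical names, and the set contains just the `SA` value used above. Which further combinations to add is left to the assignment.

```python
# Flag combinations treated as portscan-related.
# Only "SA" comes from the notebook; extend this set for the assignment.
SCAN_FLAGS = {"SA"}

def is_scan_flags(flags, scan_flags=SCAN_FLAGS):
    """Return 1 if the flag string is in the scan set, otherwise 0."""
    return 1 if str(flags) in scan_flags else 0

print(is_scan_flags("SA"))  # 1
print(is_scan_flags("PA"))  # 0
```

Inside `load_data`, the label line would then read `labels.append(is_scan_flags(packet[TCP].flags))`.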

Step 6: The code below sets the dataset path and loads the data. Next, it splits the dataset into training and testing sets. Lastly, it uses the training set to train the RF model for predictions.

In [8]:
# Load PCAP data and split into training and testing sets
data_path = "/content/sample_data/portscans-nikolaj"
X, y = load_data(data_path)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the random forest model
#rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
#rf_model.fit(X_train, y_train)
#print(X_train)
try:
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(X_train, y_train)
except ValueError as e:
    print("Error:", e)
    print("Non-numeric values in X_train. Checking non-numeric data:")

    # Find rows of X_train that contain at least one non-numeric value
    non_numeric_indices = [i for i, row in enumerate(X_train)
                           if not all(isinstance(v, (int, float, np.number)) for v in row)]
    non_numeric_data = X_train[non_numeric_indices]
    print("Non-numeric data:", non_numeric_data)
    raise  # re-raise so later cells do not run with an untrained model
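
The labels produced by `load_data` can be heavily imbalanced (the test set reported in Step 7 contains 7 portscan packets out of 468). A minimal sketch, using synthetic data in place of the pcap features, of passing `stratify` to `train_test_split` so that both splits keep the same class ratio. All names and values below are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the packet features: 100 rows, 5 columns.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 65535, size=(100, 5))
# 90 non-portscan labels (0) and 10 portscan labels (1).
y_demo = np.array([0] * 90 + [1] * 10)

# stratify=y_demo keeps the 90/10 class ratio in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positives:", int(yd_train.sum()))
print("test positives:", int(yd_test.sum()))
```

In the notebook, the same idea would amount to adding `stratify=y` to the existing `train_test_split` call.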

Step 7: The model is now trained. Let us use the code below to evaluate it: it predicts labels for the test set and reports the model's accuracy.

In [9]:
# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       461
           1       1.00      1.00      1.00         7

    accuracy                           1.00       468
   macro avg       1.00      1.00      1.00       468
weighted avg       1.00      1.00      1.00       468



Step 8: Understanding the output


The output above is the classification report, a summary of several performance metrics for the trained model. Here is an explanation of the key metrics:

**Accuracy (Overall Accuracy)**:\
Value: 1.0\
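Explanation: Accuracy is the ratio of correctly predicted instances to the total instances. In this case, an accuracy of 1.0 (or 100%) indicates that all instances in the test set were correctly classified by the model. Note that the label is derived from the TCP flags, which are also one of the features, so a perfect score is expected here.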

**Precision:**\
Precision for class 0 (Non-portscan): 1.00 \
Precision for class 1 (Portscan): 1.00 \
Explanation: Precision is the ratio of correctly predicted positive observations to the total predicted positives. A precision of 1.00 means that there were no false positives for either class.

**Recall (Sensitivity, True Positive Rate):**\
Recall for class 0 (Non-portscan): 1.00 \
Recall for class 1 (Portscan): 1.00 \
Explanation: Recall is the ratio of correctly predicted positive observations to all observations in the actual class. A recall of 1.00 indicates that there were no false negatives for either class.

**F1-Score:** \
F1-score for class 0 (Non-portscan): 1.00 \
F1-score for class 1 (Portscan): 1.00 \
Explanation: F1-score is the harmonic mean of precision and recall. It's a metric that considers both false positives and false negatives. A value of 1.00 indicates perfect precision and recall for both classes.

**Support:**\
Support for class 0 (Non-portscan): 461 \
Support for class 1 (Portscan): 7 \
Explanation: Support is the number of actual occurrences of the class in the specified dataset. In this case, there were 461 instances of class 0 (non-portscan) and 7 instances of class 1 (portscan) in the test set.

**Macro Average:**\
Macro Average Precision, Recall, and F1-Score: 1.00 \
Explanation: Macro average calculates the metric independently for each class and then takes the average. In this case, the macro average precision, recall, and F1-score are all 1.00, indicating perfect performance on average across classes.

**Weighted Average:** \
Weighted Average Precision, Recall, and F1-Score: 1.00 \
Explanation: Weighted average calculates the metric for each class and then takes the weighted average based on the number of instances of each class. In this case, the weighted average precision, recall, and F1-score are all 1.00, indicating perfect performance with weight given to class imbalance.
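
The definitions above can be made concrete with a few lines of arithmetic. The sketch below computes precision, recall and F1 from confusion-matrix counts. The function name `prf` is hypothetical, the first call uses the counts implied by the report for class 1 (7 true positives, no false positives or false negatives), and the second call uses invented counts purely for contrast.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Class 1 (portscan) in the report: all 7 packets found, nothing mislabelled.
print(prf(tp=7, fp=0, fn=0))   # (1.0, 1.0, 1.0)

# Invented weaker result: 5 found, 2 missed, 3 false alarms.
print(prf(tp=5, fp=3, fn=2))
```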

Step 9: ***Now let us validate with a user input as pcap:*** 🎱

In [11]:
# Accept a PCAP file path from the user for validation
user_pcap_path = input("Enter the path to the PCAP file for validation: ")

# Load the user-provided PCAP file and make predictions
user_packets = rdpcap(user_pcap_path)
user_data = [extract_features(packet) for packet in user_packets if IP in packet and TCP in packet]

user_predictions = rf_model.predict(user_data)

# Display predictions for each packet
for i, prediction in enumerate(user_predictions):
    print(f"Packet {i + 1}: {'Portscan detected' if prediction == 1 else 'Non-Portscan'}")

Enter the path to the PCAP file for validation: /content/sample_data/nmap version detection port 22 scanner.pcapng
Packet 1: Non-Portscan
Packet 2: Non-Portscan
Packet 3: Non-Portscan
Packet 4: Non-Portscan
Packet 5: Non-Portscan
Packet 6: Portscan detected
Packet 7: Non-Portscan
Packet 8: Non-Portscan
Packet 9: Non-Portscan
Packet 10: Portscan detected
Packet 11: Non-Portscan
Packet 12: Non-Portscan
Packet 13: Non-Portscan
Packet 14: Non-Portscan
Packet 15: Non-Portscan
Packet 16: Non-Portscan
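
Per-packet lines get long for larger captures. A small sketch of summarising the predictions with `collections.Counter` follows. The list `demo_predictions` is a hand-written stand-in that mirrors the 16 results printed above (packets 6 and 10 flagged); in the notebook, `user_predictions` would take its place.

```python
from collections import Counter

# Stand-in for user_predictions: 16 packets, packets 6 and 10 flagged as portscan.
demo_predictions = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# Count how many packets fall into each class.
counts = Counter(int(p) for p in demo_predictions)
# 1-based packet numbers of the flagged packets, matching the printed output.
flagged = [i + 1 for i, p in enumerate(demo_predictions) if p == 1]

print(f"Portscan packets: {counts[1]} of {len(demo_predictions)}")
print(f"Flagged packet numbers: {flagged}")
```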
