Summary of the Code
1. Importing all packages
2. Reading the datasets
3. Preprocessing the data (duplicates, missing, infinite, and NaN values)
4. Dropping columns with only one unique value
5. Sampling the data
6. Converting the Label values to binary (BENIGN vs. attack)
7. Using SMOTE sampling to ensure the data is balanced
8. Splitting the data into train and test sets
9. Applying Pearson correlation to filter out the important features
10. Running the GA
11. Running LightGBM
12. Performing k-fold validation whilst collecting performance metrics of the algorithm
13. Saving the model

**Importing the needed packages**

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.utils import resample
from sklearn.model_selection import KFold
from scipy.stats import pearsonr
from deap import base, creator, tools, algorithms
import lightgbm as lgb
import random
from imblearn.over_sampling import SMOTE

In [None]:
file_paths = [
    "C:/VS code projects/data_files/Monday-WorkingHours.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Tuesday-WorkingHours.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Wednesday-workingHours.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
    "C:/VS code projects/data_files/Friday-WorkingHours-Morning.pcap_ISCX.csv"
]



# Read and clean datasets
dataframes = []
for file_path in file_paths:
    df = pd.read_csv(file_path)
    df.columns = df.columns.str.strip()  # Remove whitespace from column names
    dataframes.append(df)

# Combine all datasets into a single DataFrame
df = pd.concat(dataframes, ignore_index=True)




In [None]:
#2.1 Dealing with duplicates
print(f'Before removing duplicates: {df.shape}')
duplicates = df[df.duplicated()]
print(f'Number of duplicates: {len(duplicates)}')
df.drop_duplicates(inplace=True)
print(f'After removing duplicates: {df.shape}')

Before removing duplicates: (2830743, 79)


Number of duplicates: 330963
After removing duplicates: (2499780, 79)


In [None]:
# 2.2 Handling missing values in both numeric and non-numeric columns
# Identify columns with missing values
missing_val = df.isna().sum()
print("Columns with missing values:")
print(missing_val.loc[missing_val > 0])

# Handle missing values for numeric columns (fill with mean)
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Handle missing values for non-numeric columns (fill with mode)
non_numeric_cols = df.select_dtypes(exclude=['number']).columns
for col in non_numeric_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Verify if there are still any missing values
print(f"Missing values after filling: {df.isna().sum().sum()}")

Columns with missing values:
Flow Bytes/s    353
dtype: int64
Missing values after filling: 0
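
One detail worth checking in the imputation step above: pandas computes the mean of a column that still contains `inf` as `inf`, so any NaN filled from that mean becomes infinite as well. The sketch below shows an alternative ordering that converts infinite values to NaN first and imputes from the finite values only. It is an illustration under that assumption, not the notebook's method; `impute_finite_mean` and the toy frame are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def impute_finite_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn +/-inf into NaN, then fill NaN with the
    column mean computed from the remaining finite values."""
    out = frame.copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].replace([np.inf, -np.inf], np.nan)
    out[numeric] = out[numeric].fillna(out[numeric].mean())
    return out


# Toy column with one infinite and one missing entry (made-up values).
toy = pd.DataFrame({"Flow Bytes/s": [10.0, 30.0, np.inf, np.nan]})

# The mean is infinite while the inf entry is still present.
naive_mean = toy["Flow Bytes/s"].mean()

# After conversion, both gaps are filled with the mean of the finite values.
cleaned = impute_finite_mean(toy)
print(naive_mean)
print(cleaned)
```

Whether to impute or drop such rows is a modelling choice; the sketch only shows that the order of the two steps changes what the fill value is.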


In [None]:
#2.3 Handling infinite values

# Initial count of missing and infinite values
print(f'Initial missing values: {df.isna().sum().sum()}')
print(f'Initial infinite values: {df.isin([np.inf, -np.inf]).sum().sum()}')

# Drop rows with infinite values
df = df[~df.isin([np.inf, -np.inf]).any(axis=1)]

# Verify that infinite values are removed
inf_count = df.isin([np.inf, -np.inf]).sum()
print("Columns with infinite values after processing (should be empty):")
print(inf_count[inf_count > 0])

# Final missing value check
print(f"Missing values after dropping rows: {df.isna().sum().sum()}")

Initial missing values: 0
Initial infinite values: 3126
Columns with infinite values after processing (should be empty):
Series([], dtype: int64)
Missing values after dropping rows: 0


In [None]:
# Dropping columns with only one unique value
num_unique = df.nunique()
one_variable = num_unique[num_unique == 1]
not_one_variable = num_unique[num_unique > 1].index

dropped_cols = one_variable.index
df = df[not_one_variable]

print('Dropped columns:')
dropped_cols

Dropped columns:


Index(['Bwd PSH Flags', 'Bwd URG Flags', 'Fwd Avg Bytes/Bulk',
       'Fwd Avg Packets/Bulk', 'Fwd Avg Bulk Rate', 'Bwd Avg Bytes/Bulk',
       'Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate'],
      dtype='object')

In [None]:
# Convert the multi-class Label column to binary: 1 = BENIGN, 0 = any attack
df['Label'] = df['Label'].apply(lambda x: 1 if x == 'BENIGN' else 0)

# Due to resource constraints, randomly sample 20% of the dataset for training
df = df.sample(frac=0.2, random_state=42)

#SMOTE (Synthetic Minority Over-sampling Technique) is used to handle class imbalance after sampling
# Ensure there are no NaN values in the dataset before applying SMOTE
if df.isna().sum().sum() > 0:
    print("Dataset contains NaN values. Filling NaN values with column means...")
    numeric_cols = df.select_dtypes(include=['number']).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Split the original dataset into features (X) and target (y)
X = df.drop('Label', axis=1)  # Features
y = df['Label']  # Target

# Perform SMOTE sampling to balance the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Combine the resampled features and target into a new DataFrame
df = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), 
                pd.DataFrame(y_resampled, columns=['Label'])], axis=1)

# Display the value counts to verify balance
print('Balanced dataset:')
print(df['Label'].value_counts())


Balanced dataset:
Label
1    414369
0    414369
Name: count, dtype: int64
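
Because the oversampling above runs on the whole dataset before the train/test split, synthetic minority rows can end up in the test set. A common alternative is to split first and balance only the training portion. The sketch below illustrates that ordering on made-up data; it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE, and every variable name in it is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 180 samples of class 1, 20 samples of class 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 180 + [0] * 20)

# 1) Split first, so the test set keeps the original class distribution.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# 2) Balance the training portion only (random oversampling as a SMOTE stand-in).
majority = X_tr[y_tr == 1]
minority = X_tr[y_tr == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate(
    [np.ones(len(majority), dtype=int), np.zeros(len(minority_up), dtype=int)]
)

print(np.bincount(y_tr_bal))  # class counts in the balanced training set
print(np.bincount(y_te))      # the test set stays imbalanced
```

With imbalanced-learn, the same ordering would apply `SMOTE(...).fit_resample` to `X_tr, y_tr` in place of step 2.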


In [None]:
# Separate features and labels
X = df.drop('Label', axis=1)
Y = df['Label']

In [None]:
# Normalize numerical features to the [0, 1] range using MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
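
The scaler above is fitted on all rows before the split, so the test rows contribute to the minimum and maximum used for scaling. A minimal sketch of the leakage-free ordering, assuming the split happens first; the toy arrays and variable names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix and labels (made-up values).
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Learn the min/max from the training rows only ...
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# ... and reuse those statistics for the test rows.
# Test values may fall slightly outside [0, 1]; that is expected.
X_te_scaled = scaler.transform(X_te)
```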

In [None]:
# Filter Method: Calculate correlations between features and labels
print("Calculating feature importance using correlation...")
correlations = []
for i in range(X_train.shape[1]):
    if np.std(X_train[:, i]) == 0:  # Skip constant features
        correlations.append(0)
    else:
        correlations.append(abs(pearsonr(X_train[:, i], y_train)[0]))
correlation_threshold = 0.2  # Define a threshold to filter irrelevant features
relevant_features = [i for i, corr in enumerate(correlations) if corr > correlation_threshold]
print(f"Features with correlation above threshold: {len(relevant_features)}")

# Subset the data with relevant features only
X_train = X_train[:, relevant_features]
X_test = X_test[:, relevant_features]
feature_names = [df.columns[i] for i in relevant_features]
for i, name in enumerate(feature_names):
    print(f"{i}: {name}")

Calculating feature importance using correlation...
Features with correlation above threshold: 26
0: Destination Port
1: Flow Duration
2: Fwd Packet Length Min
3: Bwd Packet Length Max
4: Bwd Packet Length Min
5: Bwd Packet Length Mean
6: Bwd Packet Length Std
7: Flow IAT Mean
8: Flow IAT Std
9: Flow IAT Max
10: Fwd IAT Total
11: Fwd IAT Mean
12: Fwd IAT Std
13: Fwd IAT Max
14: Min Packet Length
15: Max Packet Length
16: Packet Length Mean
17: Packet Length Std
18: Packet Length Variance
19: FIN Flag Count
20: URG Flag Count
21: Average Packet Size
22: Avg Bwd Segment Size
23: Idle Mean
24: Idle Max
25: Idle Min


In [None]:
# GA Feature Selection
print("Defining Genetic Algorithm for feature selection...")

# Define GA components
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)
toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=X_train.shape[1])
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

Defining Genetic Algorithm for feature selection...




In [None]:
# Define the fitness function
def fitness(individual, X_train, y_train):
    selected_features = [i for i, bit in enumerate(individual) if bit == 1]
    if len(selected_features) == 0:  # Avoid empty feature sets
        return 0,
    X_train_selected = X_train[:, selected_features]
    model = lgb.LGBMClassifier(random_state=42)
    model.fit(X_train_selected, y_train)
    accuracy = model.score(X_train_selected, y_train)  # Training accuracy
    return accuracy,

toolbox.register("evaluate", fitness, X_train=X_train, y_train=y_train)
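
The fitness function above scores each feature subset by training accuracy, which a flexible model can push close to 1.0 regardless of how well the subset generalizes. One possible variation is to score each individual on a held-out slice instead. The sketch below shows that idea under stated assumptions: `holdout_fitness` is a hypothetical name, and a small scikit-learn decision tree stands in for `lgb.LGBMClassifier` to keep the example light.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def holdout_fitness(individual, X, y, seed=42):
    """Score a 0/1 feature mask by accuracy on a held-out validation slice.

    Returns a one-element tuple, the shape DEAP expects for a
    single-objective fitness.
    """
    selected = [i for i, bit in enumerate(individual) if bit == 1]
    if not selected:                      # avoid empty feature sets
        return (0.0,)
    X_fit, X_val, y_fit, y_val = train_test_split(
        X[:, selected], y, test_size=0.25, stratify=y, random_state=seed
    )
    # Stand-in model; the notebook's LGBMClassifier could be used here instead.
    model = DecisionTreeClassifier(max_depth=3, random_state=seed)
    model.fit(X_fit, y_fit)
    return (float(model.score(X_val, y_val)),)


# Toy data: column 0 tracks the label closely, column 1 is pure noise.
rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 50)
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

informative = holdout_fitness([1, 0], X_demo, y_demo)
empty = holdout_fitness([0, 0], X_demo, y_demo)
print(informative, empty)
```

It would be registered the same way as the original, for example `toolbox.register("evaluate", holdout_fitness, X=X_train, y=y_train)`.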

In [None]:
# Set GA parameters (seed Python's RNG, which DEAP draws from, for reproducible runs)
random.seed(42)
population = toolbox.population(n=50)
ngen = 40
cxpb = 0.7
mutpb = 0.2

# Run the GA
result_population = algorithms.eaSimple(population, toolbox, cxpb=cxpb, mutpb=mutpb, ngen=ngen, verbose=False)
best_individual = tools.selBest(result_population[0], k=1)[0]
selected_features = [i for i, bit in enumerate(best_individual) if bit == 1]

# Print the chosen features
chosen_features = [feature_names[i] for i in selected_features]
print("Selected Features:")
print(chosen_features)
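
As far as the notebook shows, the mask found by the GA is printed but not applied afterwards: the cross-validation log reports 26 used features, which matches the correlation-filtered set. If the intent is to train on the GA subset, the mask would need to be applied to the feature matrices first. A minimal sketch of that step, with a hypothetical helper name and made-up data:

```python
import numpy as np


def apply_feature_mask(X, names, individual):
    """Keep only the columns (and names) whose bit in the mask is 1."""
    keep = [i for i, bit in enumerate(individual) if bit == 1]
    return X[:, keep], [names[i] for i in keep]


# Toy matrix with four named columns and a mask that keeps the first and last.
X_demo = np.arange(12).reshape(3, 4)
names_demo = ["Destination Port", "Flow Duration", "Idle Mean", "Idle Max"]
mask_demo = [1, 0, 0, 1]

X_sel, names_sel = apply_feature_mask(X_demo, names_demo, mask_demo)
print(names_sel)
print(X_sel)
```

In the notebook's terms this would correspond to something like `X_train[:, selected_features]` and `X_test[:, selected_features]` before the cross-validation loop.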



In [None]:
# Initialize lists to store metrics for each fold
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
conf_matrices = []

# Perform k-fold cross-validation
print("Performing k-fold cross-validation...")
fold = 1
fold_results = []  # To store results for each fold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kf.split(X_train):
    print(f"Training on fold {fold}...")
    
    # Split the data into training and validation sets
    X_train_fold, X_val_fold = X_train[train_index], X_train[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    
    # Train the model with the best hyperparameters
    best_params = {
        'learning_rate': 0.06486348591551848,
        'num_leaves': 130,
        'max_depth': 11,
        'min_child_samples': 64,
        'subsample': 0.6848539751617823,
        'colsample_bytree': 0.5005320541074223,
        'reg_alpha': 1.6363250831261822e-08,
        'reg_lambda': 8.631874061803956e-05,
        'n_estimators': 324
    }
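    # Note: the second assignment replaces the tuned model, so as written
    # the fit below uses LightGBM's default hyperparameters, not best_params.
    final_model = lgb.LGBMClassifier(**best_params, random_state=42)
    final_model = lgb.LGBMClassifier(random_state=42)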
    final_model.fit(X_train_fold, y_train_fold)
    
    # Predict on the validation set
    y_pred = final_model.predict(X_val_fold)
    
    # Calculate metrics for this fold
    accuracy = accuracy_score(y_val_fold, y_pred)
    precision = precision_score(y_val_fold, y_pred, average='weighted')
    recall = recall_score(y_val_fold, y_pred, average='weighted')
    f1 = f1_score(y_val_fold, y_pred, average='weighted')
    conf_matrix = confusion_matrix(y_val_fold, y_pred)
    
    # Append metrics to lists
    accuracy_scores.append(accuracy)
    precision_scores.append(precision)
    recall_scores.append(recall)
    f1_scores.append(f1)
    conf_matrices.append(conf_matrix)
    
    # Store fold results
    fold_results.append({
        "Fold": fold,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "Confusion Matrix": conf_matrix
    })
    
    print(f"Fold {fold} - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1: {f1}")
    fold += 1

# Calculate average metrics across all folds
avg_accuracy = np.mean(accuracy_scores)
avg_precision = np.mean(precision_scores)
avg_recall = np.mean(recall_scores)
avg_f1 = np.mean(f1_scores)
avg_conf_matrix = np.sum(conf_matrices, axis=0)  # Element-wise sum across folds (a total, not an average)

# Print results for each fold
print("\nResults for Each Fold:")
for result in fold_results:
    print(f"Fold {result['Fold']}:")
    print(f"  Accuracy: {result['Accuracy']}")
    print(f"  Precision: {result['Precision']}")
    print(f"  Recall: {result['Recall']}")
    print(f"  F1 Score: {result['F1 Score']}")
    print(f"  Confusion Matrix:\n{result['Confusion Matrix']}\n")

# Print average metrics
print("\nAverage Metrics Across All Folds:")
print(f"Accuracy: {avg_accuracy}")
print(f"Precision: {avg_precision}")
print(f"Recall: {avg_recall}")
print(f"F1 Score: {avg_f1}")
print("Confusion Matrix:")
print(avg_conf_matrix)

Performing k-fold cross-validation...


Training on fold 1...




[LightGBM] [Info] Number of positive: 232177, number of negative: 231915
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.077959 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5883
[LightGBM] [Info] Number of data points in the train set: 464092, number of used features: 26
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500282 -> initscore=0.001129
[LightGBM] [Info] Start training from score 0.001129




Fold 1 - Accuracy: 0.9955871199062263, Precision: 0.9955933998341946, Recall: 0.9955871199062263, F1: 0.9955871213918758
Training on fold 2...




[LightGBM] [Info] Number of positive: 231996, number of negative: 232097
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.105131 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5881
[LightGBM] [Info] Number of data points in the train set: 464093, number of used features: 26
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499891 -> initscore=-0.000435
[LightGBM] [Info] Start training from score -0.000435




Fold 2 - Accuracy: 0.9959318411004715, Precision: 0.9959331913920312, Recall: 0.9959318411004715, F1: 0.9959318343986221
Training on fold 3...




[LightGBM] [Info] Number of positive: 232242, number of negative: 231851
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.086138 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5884
[LightGBM] [Info] Number of data points in the train set: 464093, number of used features: 26
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500421 -> initscore=0.001685
[LightGBM] [Info] Start training from score 0.001685




Fold 3 - Accuracy: 0.995664652698172, Precision: 0.9956738418127152, Recall: 0.995664652698172, F1: 0.9956646613589956
Training on fold 4...




[LightGBM] [Info] Number of positive: 232075, number of negative: 232018
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.093493 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5878
[LightGBM] [Info] Number of data points in the train set: 464093, number of used features: 26
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500061 -> initscore=0.000246
[LightGBM] [Info] Start training from score 0.000246




Fold 4 - Accuracy: 0.9958887461968747, Precision: 0.9958926635091048, Recall: 0.9958887461968747, F1: 0.9958887392273627
Training on fold 5...




[LightGBM] [Info] Number of positive: 231810, number of negative: 232283
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.067252 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5898
[LightGBM] [Info] Number of data points in the train set: 464093, number of used features: 26
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499490 -> initscore=-0.002038
[LightGBM] [Info] Start training from score -0.002038
Fold 5 - Accuracy: 0.9965437887315446, Precision: 0.9965480588528058, Recall: 0.9965437887315446, F1: 0.9965437589633337

Results for Each Fold:
Fold 1:
  Accuracy: 0.9955871199062263
  Precision: 0.9955933998341946
  Recall: 0.9955871199062263
  F1 Score: 0.9955871213918758
  Confusion Matrix:
[[57767   359]
 [  153 57745]]

Fold 2:
  Accuracy: 0.9959318411004715
  Precision: 0.9959331913920312
  Recall: 0.9959318411004715
  F1 Score: 0.9959318343986221
  Confusion Matrix:
[[57660   284]
 [  188 5

