# **Improving Intrusion Detection Performance on Imbalanced NSL-KDD Dataset using Focal Loss**

Brief explanation...

# **Dataset Importing and Introduction**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import sklearn

In [None]:
column_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label", "difficulty"]


df = pd.read_csv('KDDTrain+.csv', header=None, names=column_names)
df_test = pd.read_csv('KDDTest+.csv', header=None, names=column_names)

# The difficulty values are not needed in this project, so the column is dropped
df.drop("difficulty", axis=1, inplace=True)
df_test.drop("difficulty", axis=1, inplace=True)

print('Dimensions of the Training set:',df.shape)
print('Dimensions of the Test set:',df_test.shape)

FileNotFoundError: [Errno 2] No such file or directory: 'KDDTrain+.csv'

These datasets are widely used and are known to contain no missing values, so a check is not strictly necessary, but we double-checked anyway.

In [None]:
print("The number of Null values in total of Train dataset: ")
print(df.isna().sum().sum())
print("The number of Null values in total of Test dataset: ")
print(df_test.isna().sum().sum())

The number of Null values in total of Train dataset: 
0
The number of Null values in total of Test dataset: 
0


# **Step-1: Preprocessing The Data**

In [None]:
X_train = df.drop(['label'], axis=1)
y_train = df['label']
X_test = df_test.drop(['label'], axis=1)
y_test = df_test['label']

## Encoding and Feature Scaling

**One-Hot Encoding:** Most scikit-learn models cannot take strings (categorical text) as input, so categorical features need to be turned into numbers. The categorical features protocol type, service, and flag are transformed using one-hot encoding to convert them into a numerical format suitable for machine learning models. scikit-learn's OneHotEncoder accepts string or integer category values directly. Its output has one column for each category value seen during fitting; a column contains a 1 if the instance belongs to that category, and 0 otherwise. By default the encoder returns this output as a sparse matrix.
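
As a minimal sketch of the idea, the cell below one-hot encodes a tiny hand-made column. The `toy` DataFrame and its values are invented for illustration and are not taken from NSL-KDD; `.toarray()` is used only to turn the default sparse output into a dense array that is easy to print.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not part of the NSL-KDD files
toy = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to dense for inspection
encoded = encoder.fit_transform(toy).toarray()

# One output column per category, in sorted order of the category values
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```

Each row contains exactly one 1, in the column that corresponds to its category.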

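**Problems:** One of the main problems with the NSL-KDD set is that some categorical columns do not contain the same categories in the training and test sets. The encoded vectors must have the same length in both sets for any further computation. Fitting the OneHotEncoder on the training data and passing "handle_unknown='ignore'" takes care of this for the features: a category that was not seen during fitting is encoded as all zeros in that feature's columns instead of raising an error, so the output width stays the same. The class labels need a different fix. The test set contains attack types that never occur in the training set, and LabelEncoder raises an error on unseen labels, so we combine the training and test labels and fit the LabelEncoder on all of them together.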
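
The effect of "handle_unknown='ignore'" can be sketched on a small example. The frames `train_toy` and `test_toy` below are hypothetical and only mimic the situation where the test set holds a category the training set never saw.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "telnet" appears only in the test frame
train_toy = pd.DataFrame({"service": ["http", "ftp", "smtp"]})
test_toy = pd.DataFrame({"service": ["http", "telnet"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_toy)  # categories are learned from the training frame only

train_vecs = enc.transform(train_toy).toarray()
test_vecs = enc.transform(test_toy).toarray()

# Both outputs have the same number of columns
print(train_vecs.shape, test_vecs.shape)
# The unseen category becomes a row of zeros instead of raising an error
print(test_vecs)
```

Without "handle_unknown='ignore'" the default setting is 'error', and the `transform` call on `test_toy` would raise an exception.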

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder

# Categorical (string) columns that need one-hot encoding; all other columns are numerical
categorical_cols = ['protocol_type', 'service', 'flag']
numerical_cols = [col for col in X_train.columns if col not in categorical_cols]

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', StandardScaler(), numerical_cols)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)


le = LabelEncoder()
le.fit(pd.concat([y_train, y_test], axis=0))

y_train_encoded = le.transform(y_train)
y_test_encoded = le.transform(y_test)

First, we need to decide which columns go through the one-hot encoder and which do not. Feature scaling must be applied only to the numerical columns; the one-hot-encoded columns must stay unscaled so that their 0/1 values are not distorted. The ColumnTransformer then applies the encoding and the scaling to the appropriate columns, fitting on the training set and reusing the fitted transformer on the test set. Finally, the output values (y_train and y_test) are encoded with LabelEncoder. It plays a role similar to the one-hot encoder, but for the output values: it maps each class name to a single integer instead of a binary vector.
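
The column routing described above can be checked on a small hand-made frame. This is only a sketch: the `toy` DataFrame is invented for illustration, and `sparse_threshold=0` is added here (it is not used in the notebook's preprocessor) to force a dense output that can be sliced directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({
    "flag": ["SF", "S0", "SF", "REJ"],
    "src_bytes": [100.0, 0.0, 300.0, 0.0],
})

ct = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["flag"]),
        ("num", StandardScaler(), ["src_bytes"]),
    ],
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)

out = ct.fit_transform(toy)

# Output columns follow the order of the transformers: one-hot columns first
one_hot_part = out[:, :3]   # three flag categories, still exactly 0 or 1
scaled_part = out[:, 3]     # src_bytes, standardized to zero mean and unit variance

print(one_hot_part)
print(scaled_part.mean(), scaled_part.std())
```

The one-hot columns keep their 0/1 values, while only the numerical column is rescaled.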

**What OneHotEncoder and LabelEncoder Do and Why We Use Them:**

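y_train_encoded and y_test_encoded hold the numeric class labels produced by the LabelEncoder. Each string label in y_train or y_test, such as 'normal', 'neptune', 'smurf', 'back', 'teardrop' (and so on), is mapped to one integer. LabelEncoder assigns the integers in sorted order of the class names, so for example:

'neptune' receives a smaller integer than 'normal'

'normal' receives a smaller integer than 'smurf'

The exact mapping can be read from le.classes_, where the position of each name is its integer code.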

For the class with index 2 out of four, OneHotEncoder would give the vector [0, 0, 1, 0], while LabelEncoder gives just the integer 2. A single integer per sample is the label format that classifiers such as RandomForest, LogisticRegression, and SVM work with.
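
A small sketch of the label encoding step follows. The series `y_train_toy` and `y_test_toy` are hypothetical stand-ins for y_train and y_test; as in the notebook, the encoder is fitted on the concatenation of both so that a label that appears only in the test data still gets a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels: "mscan" appears only in the test labels
y_train_toy = pd.Series(["normal", "neptune", "smurf", "normal"])
y_test_toy = pd.Series(["normal", "mscan"])

le_toy = LabelEncoder()
# Fit on all labels so every class name gets an integer code
le_toy.fit(pd.concat([y_train_toy, y_test_toy], axis=0))

# classes_ is sorted; the position of a name is its integer code
mapping = {name: code for code, name in enumerate(le_toy.classes_)}
print(mapping)

y_test_codes = le_toy.transform(y_test_toy)
print(y_test_codes)

# inverse_transform recovers the original string labels
recovered = le_toy.inverse_transform(y_test_codes)
print(list(recovered))
```

Fitting on the training labels alone would make `transform` fail on "mscan", which is why the labels are combined before fitting.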

In [None]:
print(X_train_processed[0])
print(X_test_processed[0])
print(y_train_encoded[0])
print(y_test_encoded[0])

[ 0.          1.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          1.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          1.          0.
 -0.11024922 -0.0076786  -0.00491864 -0.01408881 -0.08948642 -0.00773599
 -0.09507567 -0.0