Binary intrusion detection is the process of identifying whether network traffic is normal or malicious. Unlike multiclass intrusion detection, which classifies specific attack types, binary intrusion detection focuses only on detecting the presence of any attack. The NSL-KDD dataset contains multiple categories of attacks such as DoS, Probe, R2L, and U2R. For binary classification, all these attack types are combined into a single class labeled as 1 (attack), while normal network traffic is labeled as 0 (benign). This simplifies the problem to distinguishing between safe and potentially harmful network connections.

In [87]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

from imblearn.over_sampling import SMOTE

In [88]:
# NSL-KDD dataset column names
# Source: KDD Cup 1999 Intrusion Detection Dataset feature schema
# The dataset files (KDDTrain+.txt, KDDTest+.txt) do not contain headers, so column names are assigned manually according to the official schema.

col_names = [
"duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
"wrong_fragment","urgent","hot","num_failed_logins","logged_in",
"num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate",
"dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
"dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate",
"label","difficulty"
]

In [89]:
train_path = "KDDTrain+.txt"
test_path  = "KDDTest+.txt"
train_df = pd.read_csv(train_path, names = col_names)
test_df = pd.read_csv(test_path, names = col_names)

In [90]:
# Check the shapes of the loaded train and test sets
print(train_df.shape)
print(test_df.shape)

(125973, 43)
(22544, 43)


The original NSL-KDD labels name specific attacks, which fall into the DoS, Probe, R2L, and U2R categories.
Since the goal of binary intrusion detection is only to decide whether traffic is malicious or normal, every attack label is mapped to a single attack class (1) and normal traffic is mapped to class 0.
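
The label collapse can be sketched on a handful of rows. This is a minimal illustration, not the notebook's exact code: the toy label values and the helper name `to_binary` are assumptions made for the example.

```python
import pandas as pd

def to_binary(labels: pd.Series) -> pd.Series:
    """Map 'normal' to 0 and every other label to 1 (hypothetical helper)."""
    return (labels != "normal").astype(int)

# Toy labels for illustration only
toy_labels = pd.Series(["normal", "neptune", "ipsweep", "normal", "guess_passwd"])
binary = to_binary(toy_labels)
print(binary.tolist())
```

Building the column from one boolean comparison yields an integer column directly, so no later cast is needed.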

In [91]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,difficulty
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,21


In [92]:
train_df["label"] = train_df["label"].replace("normal", 0)
train_df.loc[train_df["label"] != 0, "label"] = 1
# .loc selects the rows where the mask (label != 0) is True and sets their "label" value to 1

In [93]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,difficulty
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,0,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,0,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,1,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,0,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,21


In [94]:
print(train_df["label"].value_counts())

label
0    67343
1    58630
Name: count, dtype: int64


The training set has 67,343 normal records (class 0) and 58,630 attack records (class 1), that is, connections in which some kind of malicious activity was present.
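
To put these counts in proportion, the class shares can be computed directly. This is a small arithmetic sketch using the two counts printed above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_normal = 67343
n_attack = 58630

total = n_normal + n_attack
share_normal = n_normal / total
share_attack = n_attack / total
ratio = n_normal / n_attack  # normal records per attack record

print(f"normal: {share_normal:.1%}, attack: {share_attack:.1%}, ratio: {ratio:.2f}")
```

The split is roughly 53.5% normal to 46.5% attack, so the training set is only mildly imbalanced.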

In [95]:
# Drop the difficulty column (dataset metadata, not a traffic feature)
train_df = train_df.drop(columns=["difficulty"])

In [96]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,0
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,0
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,1
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,0
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


Some of the important features in the dataset are protocol_type, service, flag, src_bytes, dst_bytes, count, srv_count, the error-rate columns, and the dst_host_* columns.

In [97]:
# Apply one-hot encoding so the categorical features become numeric columns the model can use
# The categorical features in this dataset are protocol_type, service, and flag
categorical_cols = ["protocol_type","service","flag"]

train_df = pd.get_dummies(train_df, columns=categorical_cols)
# pd.get_dummies one-hot encodes the listed columns in a single call

In [98]:
# Apply the same label mapping, column drop, and one-hot encoding to the test data
test_df["label"] = test_df["label"].replace("normal", 0)
test_df.loc[test_df["label"] != 0, "label"] = 1
test_df = test_df.drop(columns=["difficulty"])
test_df  = pd.get_dummies(test_df,  columns=categorical_cols)

In [99]:
# Align the train and test columns, using train as the reference (join="left")
# Columns present only in train are added to test and filled with 0; columns present only in test are dropped
train_df, test_df = train_df.align(test_df, join="left",axis = 1, fill_value=0)
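
The effect of the alignment step can be seen on a toy example. This is a sketch with made-up rows (the `service` values and `src_bytes` numbers are assumptions for illustration), using the same `get_dummies` and `align` calls as the cells above.

```python
import pandas as pd

# Toy frames: "smtp" and "ftp" appear only in train, "dns" only in test
train = pd.DataFrame({"service": ["http", "ftp", "smtp"], "src_bytes": [10, 20, 30]})
test = pd.DataFrame({"service": ["http", "dns"], "src_bytes": [5, 7]})

train_enc = pd.get_dummies(train, columns=["service"])
test_enc = pd.get_dummies(test, columns=["service"])

# Keep train's columns; fill columns missing from test with 0
train_al, test_al = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(train_al.columns))
print(list(test_al.columns))
```

After alignment both frames have identical columns. The category seen only in test (`service_dns`) is dropped, and the categories seen only in train get all-zero columns in test.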

In [100]:
# Separate features and labels
x_train = train_df.drop("label",axis=1)  # every column except label is an input feature
y_train = train_df["label"]  # target

x_test = test_df.drop("label", axis=1)
y_test = test_df["label"]

In [108]:
# Feature scaling: put all numerical features on a similar scale so that large-valued features do not dominate
# fit_transform and transform return NumPy arrays, so the results are wrapped in pd.DataFrame to keep the column names
# The scaler is fitted on x_train only; x_test is transformed with the same training statistics
scaler = StandardScaler()
 
x_train = pd.DataFrame( 
    scaler.fit_transform(x_train),
    columns = x_train.columns
)

x_test = pd.DataFrame(
    scaler.transform(x_test),
    columns = x_test.columns
 )
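
What the scaling step computes can be sketched with plain NumPy. This is an illustration on made-up numbers, not the notebook's data: each column is centred on the training mean and divided by the training standard deviation (population form, `ddof=0`, which is what `StandardScaler` uses), and the same training statistics are reused for the test rows.

```python
import numpy as np

# Toy data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse train statistics, never refit on test

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation keeps information about the test set out of preprocessing.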

In [122]:
print(type(y_train))
print(y_train.shape)
print(pd.Series(y_train).head())

<class 'pandas.core.series.Series'>
(134686,)
0    0
1    0
2    1
3    0
4    0
Name: label, dtype: int64


In [123]:
y_train = y_train.astype(int)

Intrusion detection datasets are often imbalanced. In this training set the imbalance is mild (67,343 normal vs. 58,630 attack records), with attacks as the minority class. SMOTE (Synthetic Minority Oversampling Technique) is applied to synthetically oversample the minority (attack) class so that both classes have the same number of training samples, with the aim of helping the model learn attack patterns.

In [124]:
smote = SMOTE(random_state=42)
x_train, y_train = smote.fit_resample(x_train, y_train)

print(f"balanced class distribution: {pd.Series(y_train).value_counts()}")

balanced class distribution: label
0    67343
1    67343
Name: count, dtype: int64
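
The core idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the imblearn implementation: `smote_like` is a hypothetical helper, the toy points are made up, and only the interpolation step between a minority sample and one of its nearest minority neighbours is shown.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))    # pick a minority sample
        j = neighbours[i, rng.integers(k)]  # pick one of its neighbours
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(minority, n_new=5)
print(new_points.shape)
```

Each synthetic point lies on the line segment between two existing minority samples. In the run above, raising the attack class from 58,630 to 67,343 rows corresponds to 8,713 synthetic attack rows.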


We will compare three classification models on this task: XGBoost, random forest, and SVM. The random forest is trained below.

In [131]:
model = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

model.fit(x_train, y_train)

RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)


In [132]:
# print(type(y_test), y_test.dtype)
# print(type(y_pred), y_pred.dtype)
y_test = y_test.astype(int)

In [133]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(x_test)


print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7759492547906317

Confusion Matrix:
 [[9449  262]
 [4789 8044]]

Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.97      0.79      9711
           1       0.97      0.63      0.76     12833

    accuracy                           0.78     22544
   macro avg       0.82      0.80      0.78     22544
weighted avg       0.84      0.78      0.77     22544
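
The reported scores can be re-derived from the confusion matrix by hand. This is a small arithmetic sketch using the four counts printed above, read in scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 9449, 262
fn, tp = 4789, 8044

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_attack = tp / (tp + fp)     # of predicted attacks, how many are real
recall_attack = tp / (tp + fn)        # of real attacks, how many are caught
false_negative_rate = fn / (fn + tp)  # share of attacks classified as normal

print(f"accuracy: {accuracy:.4f}")
print(f"attack precision: {precision_attack:.2f}, attack recall: {recall_attack:.2f}")
print(f"false negative rate: {false_negative_rate:.2f}")
```

The model rarely raises false alarms (attack precision about 0.97) but misses roughly 37% of the attacks in KDDTest+. For an intrusion detector that miss rate is more telling than the overall accuracy.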

