Binary intrusion detection is the process of identifying whether network traffic is normal or malicious. Unlike multiclass intrusion detection, which classifies specific attack types, binary intrusion detection focuses only on detecting the presence of any attack. The NSL-KDD dataset contains multiple categories of attacks such as DoS, Probe, R2L, and U2R. For binary classification, all these attack types are combined into a single class labeled as 1 (attack), while normal network traffic is labeled as 0 (benign). This simplifies the problem to distinguishing between safe and potentially harmful network connections.

In [40]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

from imblearn.over_sampling import SMOTE

In [41]:
# NSL-KDD dataset column names
# Source: KDD Cup 1999 Intrusion Detection Dataset feature schema
# The dataset files (KDDTrain+.txt, KDDTest+.txt) do not contain headers, so column names are assigned manually according to the official schema.

col_names = [
"duration","protocol_type","service","flag","src_bytes","dst_bytes","land",
"wrong_fragment","urgent","hot","num_failed_logins","logged_in",
"num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate",
"dst_host_same_src_port_rate","dst_host_srv_diff_host_rate",
"dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate",
"label","difficulty"
]

In [42]:
train_path = "KDDTrain+.txt"
test_path  = "KDDTest+.txt"
train_df = pd.read_csv(train_path, names = col_names)
test_df = pd.read_csv(test_path, names = col_names)

In [43]:
# sanity check: confirm the number of rows and columns that were loaded
print(train_df.shape)
print(test_df.shape)

(125973, 43)
(22544, 43)


The original NSL-KDD dataset labels each attack record with a specific attack name (for example, neptune), and these names fall into four broad categories: DoS, Probe, R2L, and U2R.
Since our task is binary intrusion detection, where the goal is only to decide whether traffic is malicious or normal, we collapse every attack label into a single class (1) and map normal traffic to class (0).
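
As a minimal sketch of this mapping, the snippet below applies it to a small hand-made Series. The sample label values and the variable names `raw_labels` and `binary_labels` are illustrative assumptions, not rows read from the dataset files.

```python
import pandas as pd

# Hypothetical sample of raw labels; the real values come from the "label" column.
raw_labels = pd.Series(["normal", "neptune", "normal", "smurf", "ipsweep"])

# Anything that is not "normal" is treated as an attack (1); normal traffic is 0.
binary_labels = (raw_labels != "normal").astype(int)

print(binary_labels.tolist())  # [0, 1, 0, 1, 1]
```

The comparison produces a boolean mask and `astype(int)` turns it into 0/1, so the result is an integer column regardless of how many distinct attack names appear.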

In [44]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,difficulty
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,21


In [45]:
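train_df["label"] = train_df["label"].replace("normal", 0)
# .loc[mask, column]: for rows where the mask is True (label != 0), set the label column to 1
train_df.loc[train_df["label"] != 0, "label"] = 1
# the column can still have object dtype after these replacements, so cast it to integers
train_df["label"] = train_df["label"].astype(int)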

In [46]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label,difficulty
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,0,20
1,0,udp,other,SF,146,0,0,0,0,0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,0,15
2,0,tcp,private,S0,0,0,0,0,0,0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,1,19
3,0,tcp,http,SF,232,8153,0,0,0,0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,0,21
4,0,tcp,http,SF,199,420,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,21


In [47]:
print(train_df["label"].value_counts())

label
0    67343
1    58630
Name: count, dtype: int64


The training set contains 67,343 normal records (label 0) and 58,630 attack records (label 1), so the two classes are fairly close in size.
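
To make the class balance concrete, here is a small arithmetic sketch that turns the two counts printed above into proportions. The variable names are illustrative; the counts are copied from the `value_counts()` output.

```python
# Counts copied from the value_counts() output above.
n_normal = 67343
n_attack = 58630
total = n_normal + n_attack

print(total)                       # 125973, the number of training rows
print(round(n_normal / total, 4))  # 0.5346, share of normal traffic
print(round(n_attack / total, 4))  # 0.4654, share of attack traffic
```

Roughly 53% of the training records are normal and 47% are attacks.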

In [48]:
# drop the difficulty column: it is dataset metadata, not a network traffic feature
train_df = train_df.drop(columns=["difficulty"])

In [49]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,0
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,0
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,1
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,0
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


Some of the important features we have are protocol_type, service, flag, src_bytes, dst_bytes, count, srv_count, the error-rate features, and the features prefixed with dst_host_.

In [50]:
# Apply one-hot encoding to convert the categorical features to numbers the model can use.
# The categorical features in this dataset are protocol_type, service, and flag.
categorical_cols = ["protocol_type", "service", "flag"]

# pd.get_dummies is the pandas function that applies one-hot encoding directly
train_df = pd.get_dummies(train_df, columns=categorical_cols)
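
To see what one-hot encoding does, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration; only `pd.get_dummies` and the `protocol_type` column name come from the notebook.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes": [491, 146, 0, 30],
})

# Each distinct category becomes its own indicator column named <column>_<value>.
encoded = pd.get_dummies(toy, columns=["protocol_type"])

print(list(encoded.columns))
# ['src_bytes', 'protocol_type_icmp', 'protocol_type_tcp', 'protocol_type_udp']
```

The original `protocol_type` column is removed and replaced by one indicator column per category, while numeric columns such as `src_bytes` are left unchanged.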

In [51]:
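# apply the same steps to the test data: binary labels, drop difficulty, one-hot encode
test_df["label"] = test_df["label"].replace("normal", 0)
test_df.loc[test_df["label"] != 0, "label"] = 1
# cast to integers, since the column can still have object dtype after the replacements
test_df["label"] = test_df["label"].astype(int)
test_df = test_df.drop(columns=["difficulty"])
test_df = pd.get_dummies(test_df, columns=categorical_cols)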

In [53]:
# Align the test columns to the train columns, taking train as the reference (join="left"):
# columns missing from test are added and filled with 0; columns found only in test are dropped
train_df, test_df = train_df.align(test_df, join="left", axis=1, fill_value=0)
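
The effect of the alignment is easiest to see on a toy example. The sketch below uses two made-up frames (`toy_train`, `toy_test`) and hypothetical column names; only the `align` call mirrors the notebook.

```python
import pandas as pd

# Hypothetical encoded frames: train saw "service_http" and "service_ftp",
# while test saw "service_http" and a value absent from train, "service_new".
toy_train = pd.DataFrame({"service_http": [1, 0], "service_ftp": [0, 1]})
toy_test = pd.DataFrame({"service_http": [1], "service_new": [1]})

# join="left" keeps exactly the train columns; gaps in test are filled with 0.
toy_train, toy_test = toy_train.align(toy_test, join="left", axis=1, fill_value=0)

print(list(toy_test.columns))     # ['service_http', 'service_ftp']
print(toy_test.iloc[0].tolist())  # [1, 0]
```

After the call both frames have the same columns in the same order, which is what a model trained on the train columns expects at prediction time. The test-only column `service_new` is discarded.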

In [55]:
# separate features and labels
x_train = train_df.drop("label", axis=1)  # every column apart from label is a model input
y_train = train_df["label"]  # target (0 = normal, 1 = attack)

x_test = test_df.drop("label", axis=1)
y_test = test_df["label"]

In [58]:
# feature scaling: put all numerical features on a similar scale so that large-valued features do not bias the model
scaler = StandardScaler()

x

