## 1 - Data preparation

As proposed by Tavallaee et al. (2009), this project uses NSL-KDD, a cleaned and improved version of the original KDD Cup 1999 dataset. It contains no redundant records, has more balanced difficulty levels, and offers more representative samples. After the categorical variables are encoded and the numeric ones are scaled, it is ready for direct use in ML pipelines. The difficulty_level column was added by Tavallaee et al. (2009) when they created NSL-KDD to address weaknesses in the original KDD Cup 1999 data. This meta label is not used in the project.

The preprocessing step applies several transformations to prepare the NSL-KDD dataset for modeling. First, the categorical features protocol_type, service, and flag are encoded into integers using LabelEncoder so that the models can process non-numerical data. Each encoder is fitted on the categories found in both the training and test files so that values appearing only in the test set do not raise unseen-category errors. The target labels (i.e. attack types) are likewise converted from text strings to integer values for classification. Next, all features, including the integer-encoded categorical ones, are standardized with StandardScaler, which rescales each feature to zero mean and unit standard deviation. The scaler is fitted only on the training data to prevent data leakage, and the same transformation is then applied to the validation and test sets. Together, these steps ensure that the models receive data in a consistent numerical format in which no single feature dominates because of its scale.
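To make the encode-then-scale order concrete, here is a minimal sketch on toy data. The two small frames and their values are made up for illustration and only stand in for the NSL-KDD files; the `toy_` names are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frames standing in for the train/test files (values are made up).
toy_train = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes": [100.0, 200.0, 300.0, 400.0],
})
toy_test = pd.DataFrame({
    "protocol_type": ["udp", "tcp"],
    "src_bytes": [250.0, 1000.0],
})

# Encode text categories as integers; fitting on train + test categories
# mirrors the approach described above and avoids unseen-category errors.
toy_enc = LabelEncoder().fit(
    pd.concat([toy_train["protocol_type"], toy_test["protocol_type"]])
)
toy_train["protocol_type"] = toy_enc.transform(toy_train["protocol_type"])
toy_test["protocol_type"] = toy_enc.transform(toy_test["protocol_type"])

# The scaler learns mean and standard deviation from the training rows only.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_s = toy_scaler.transform(toy_train)
toy_test_s = toy_scaler.transform(toy_test)

print("train means:", toy_train_s.mean(axis=0))  # close to zero per column
print("train stds: ", toy_train_s.std(axis=0))   # close to one per column
print("test src_bytes (scaled):", toy_test_s[:, 1])
```

Because the test rows are transformed with training statistics, their scaled values are not guaranteed to have zero mean or unit spread; the large test value for src_bytes simply ends up several standard deviations from the training mean.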

Finally, the fitted encoders and scaler could be saved using joblib so that they can be reloaded later without refitting, which keeps preprocessing reproducible when the model is reused. However, because this is a demo course project, the code skips some best practices: saving the transformer objects, detailed exception handling, file path validation, and configuration files for data directories. In a full production pipeline, these additions would be essential for the reproducibility, maintainability, and robustness of the preprocessing workflow.
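As an illustration of the joblib option mentioned above, the following sketch saves and reloads a fitted scaler and label encoder. It is not part of the project code: the `demo_` objects are fitted on toy data, and the temporary directory and the file name `preprocessing.joblib` are assumptions chosen for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Stand-ins for the fitted objects from the preprocessing cell (toy data).
demo_scaler = StandardScaler().fit(np.array([[0.0, 10.0], [2.0, 30.0]]))
demo_encoder = LabelEncoder().fit(["normal", "neptune", "smurf"])

# Hypothetical location; the project does not define an artifact directory.
artifact_dir = tempfile.mkdtemp()
artifact_path = os.path.join(artifact_dir, "preprocessing.joblib")

# Bundle the objects in one dict so the scaler and encoder stay together.
joblib.dump({"scaler": demo_scaler, "y_encoder": demo_encoder}, artifact_path)

# Later, e.g. at inference time: reload without refitting.
restored = joblib.load(artifact_path)
sample = np.array([[1.0, 20.0]])
print(restored["scaler"].transform(sample))
print(restored["y_encoder"].inverse_transform([0]))
```

Storing all transformers in a single file is one possible design; separate files per object would work equally well as long as they are versioned together with the model.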

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

DATA_DIR = "data"
OUT_DIR  = "data/processed"
os.makedirs(OUT_DIR, exist_ok=True)

# Created using NSL-KDD documentation
columns = [
    "duration","protocol_type","service","flag","src_bytes","dst_bytes",
    "land","wrong_fragment","urgent","hot","num_failed_logins","logged_in",
    "num_compromised","root_shell","su_attempted","num_root","num_file_creations",
    "num_shells","num_access_files","num_outbound_cmds","is_host_login","is_guest_login",
    "count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate",
    "same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label","difficulty_level"
]

categorical_cols = ["protocol_type", "service", "flag"]

train_df = pd.read_csv(f"{DATA_DIR}/KDDTrain+.txt", names=columns)
test_df  = pd.read_csv(f"{DATA_DIR}/KDDTest+.txt",  names=columns)

# drop the meta label
train_df.drop(columns=["difficulty_level"], inplace=True, errors="ignore")
test_df.drop(columns=["difficulty_level"],  inplace=True, errors="ignore")

# fit each encoder on train + test categories so that values seen only in the
# test set do not raise unseen-category errors
labels = {}
for col in categorical_cols:
    enc = LabelEncoder()
    enc.fit(pd.concat([train_df[col], test_df[col]], axis=0))
    train_df[col] = enc.transform(train_df[col])
    test_df[col]  = enc.transform(test_df[col])
    labels[col] = enc

# print(labels)
X_train_full = train_df.drop(columns=["label"])
y_train_full = train_df["label"]
X_test = test_df.drop(columns=["label"])
y_test = test_df["label"]


# encode labels into integers
y_encoder = LabelEncoder().fit(pd.concat([y_train_full, y_test], axis=0))
y_train_full_enc = y_encoder.transform(y_train_full)
y_test_enc = y_encoder.transform(y_test)

print("Encoded labels:")
for i, label in enumerate(y_encoder.classes_):
    print(i, ":", label)

# split first so that the scaler never sees validation rows
X_train_raw, X_val_raw, y_train, y_val = train_test_split(
    X_train_full, y_train_full_enc, test_size=0.2, random_state=42, stratify=y_train_full_enc
)

# standardize features, which differ widely in scale (cf. src_bytes vs error rates);
# fit on the training split only to prevent leakage into validation and test data
scaler = StandardScaler().fit(X_train_raw)
X_train  = scaler.transform(X_train_raw)
X_val    = scaler.transform(X_val_raw)
X_test_s = scaler.transform(X_test)

# save for use in models 1 & 2
np.save(f"{OUT_DIR}/X_train.npy", X_train)
np.save(f"{OUT_DIR}/y_train.npy", y_train)
np.save(f"{OUT_DIR}/X_val.npy",   X_val)
np.save(f"{OUT_DIR}/y_val.npy",   y_val)
np.save(f"{OUT_DIR}/X_test.npy",  X_test_s)
np.save(f"{OUT_DIR}/y_test.npy",  y_test_enc)

# save for possible verification
pd.DataFrame(X_train).to_csv(f"{OUT_DIR}/X_train.csv", index=False)
pd.DataFrame(X_val).to_csv(f"{OUT_DIR}/X_val.csv", index=False)
pd.DataFrame(X_test_s).to_csv(f"{OUT_DIR}/X_test.csv", index=False)

print("Finished processing data...")

Encoded labels:
0 : apache2
1 : back
2 : buffer_overflow
3 : ftp_write
4 : guess_passwd
5 : httptunnel
6 : imap
7 : ipsweep
8 : land
9 : loadmodule
10 : mailbomb
11 : mscan
12 : multihop
13 : named
14 : neptune
15 : nmap
16 : normal
17 : perl
18 : phf
19 : pod
20 : portsweep
21 : processtable
22 : ps
23 : rootkit
24 : saint
25 : satan
26 : sendmail
27 : smurf
28 : snmpgetattack
29 : snmpguess
30 : spy
31 : sqlattack
32 : teardrop
33 : udpstorm
34 : warezclient
35 : warezmaster
36 : worm
37 : xlock
38 : xsnoop
39 : xterm
Finished processing data...
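
The saved arrays can be reloaded in the model notebooks with a small helper. The sketch below is only an illustration: `load_split` is a hypothetical function, not part of the project, and the demo writes toy arrays to a temporary directory so that it runs on its own. In the project, the directory would be `data/processed`.

```python
import os
import tempfile

import numpy as np

def load_split(out_dir, name):
    """Load the feature matrix and label vector saved for one split."""
    X = np.load(os.path.join(out_dir, f"X_{name}.npy"))
    y = np.load(os.path.join(out_dir, f"y_{name}.npy"))
    # Basic sanity check: one label per row.
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{name}: {X.shape[0]} rows but {y.shape[0]} labels")
    return X, y

# Self-contained demo with toy arrays (41 feature columns, as in the
# column list above once label and difficulty_level are removed).
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "X_train.npy"), np.zeros((5, 41)))
np.save(os.path.join(demo_dir, "y_train.npy"), np.arange(5))

X_demo, y_demo = load_split(demo_dir, "train")
print(X_demo.shape, y_demo.shape)
```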
