# **Evaluation and Validation of Balanced Dataset**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

## **Step 1: Data Preparation**
- Load the **merged balanced dataset** and the **testing set**.
- Split the **balanced dataset** into **training (X_train, y_train)** and **validation (X_val, y_val)** sets.
- Scale the **training, validation, and testing data** with a scaler fitted on the **training set only**.

In [4]:
# Load merged balanced dataset
df_balanced = pd.read_csv('balanced_attack_data.csv')
df_test = pd.read_csv('UNSW_NB15_testing-set.csv')

# Drop unnecessary columns in df_test to match df_balanced
df_test = df_test.drop(columns=["id", "proto", "service", "state", "label"], errors="ignore")

# Separate features and labels
X = df_balanced.drop(columns=["attack_cat"])  # Features
y = df_balanced["attack_cat"]  # Labels

# Split into train (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Separate test features and labels (feature columns must match those of X)
X_test = df_test.drop(columns=["attack_cat"])
y_test = df_test["attack_cat"]

# Scale features to [0, 1]; fit on the training split only to avoid leaking validation/test statistics
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

print("Data preparation complete!")
print(f"Training set: {X_train.shape}\nValidation set: {X_val.shape}\nTesting set: {X_test.shape}")

Data preparation complete!
Training set: (252000, 39)
Validation set: (108000, 39)
Testing set: (82332, 39)


## **Step 2: Train Machine Learning Models**
- Choose models for evaluation:
  - **Logistic Regression** (Baseline)
  - **Random Forest**
  - **Support Vector Machine (SVM)**
  - **Gradient Boosting (XGBoost, LightGBM)**
  - **Neural Networks (Optional)**
- Train each model on the **training split (X_train, y_train)** of the balanced dataset.
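
The training loop for this step could look roughly like the sketch below. It is not the notebook's actual training code: it runs on a small synthetic dataset from `make_classification` so that it is self-contained, and the names `X_demo`, `X_tr`, `X_va`, `models`, `fitted_models`, and `val_accuracy` are placeholders introduced here. As further assumptions, `LinearSVC` stands in for the SVM (a kernel `SVC` may be slow on a training set of 252,000 rows) and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost/LightGBM so that no extra packages are needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the balanced dataset (assumption: in the real
# notebook, X_train / X_val / y_train / y_val from Step 1 would be used).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Same scaling scheme as Step 1: fit on the training split only.
scaler_demo = MinMaxScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_va = scaler_demo.transform(X_va)

# Candidate models; hyperparameters are illustrative defaults, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Linear SVM": LinearSVC(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

fitted_models = {}
val_accuracy = {}
for name, model in models.items():
    fitted_models[name] = model.fit(X_tr, y_tr)
    val_accuracy[name] = model.score(X_va, y_va)  # mean accuracy on the validation split
    print(f"{name}: validation accuracy = {val_accuracy[name]:.3f}")
```

Keeping the models in a dictionary makes it straightforward to run the same evaluation code over every fitted model in the later steps.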

## **Step 3: Evaluate Model Performance**
- Use the **testing set** to evaluate trained models.
- Compute performance metrics:
  - **Accuracy**
  - **Precision, Recall, F1-score**
  - **Confusion Matrix**
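  - **ROC curve and AUC** (one-vs-rest, because `attack_cat` is a multi-class label)
- Compare the results against models trained on the original, unbalanced data to assess whether balancing helped.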
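
The metrics listed above can be computed with `sklearn.metrics`, as in the following minimal sketch. It again uses synthetic data and a single `RandomForestClassifier` as placeholders (the names `X_te`, `y_te`, `clf`, `acc`, `cm`, and `auc` are introduced here and do not come from the notebook). For the multi-class label, the AUC is computed one-vs-rest with macro averaging, which is one reasonable choice among several.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the real train/test sets (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)         # hard class predictions
y_proba = clf.predict_proba(X_te)  # per-class probabilities, needed for ROC AUC

acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_te, y_proba, multi_class="ovr", average="macro")

print(f"Accuracy: {acc:.3f}")
print(f"Macro one-vs-rest ROC AUC: {auc:.3f}")
print(classification_report(y_te, y_pred, zero_division=0))  # precision, recall, F1 per class
print(cm)
```

On the real data the same calls would take `X_test` and `y_test` from Step 1. Models without `predict_proba`, such as `LinearSVC`, would need a different score source for the AUC, or could be left out of that metric.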

## **Step 4: Model Validation**
- Perform **Cross-validation** on the training set.
- Conduct **Hyperparameter tuning** (GridSearchCV, RandomizedSearchCV).
- Check for **overfitting/underfitting**.
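
One possible way to combine these three checks is sketched below, under stated assumptions: synthetic data, a `RandomForestClassifier`, a deliberately tiny parameter grid, and macro F1 as the scoring metric are all illustrative choices made here, not taken from the notebook. The scaler sits inside a `Pipeline` so that it is refitted on each training fold rather than on the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training set (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Scaling inside the pipeline keeps each validation fold out of the scaler's fit.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validation with default hyperparameters.
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1_macro")
print(f"CV macro F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Small illustrative grid; a real search would cover more values.
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [5, None]}
search = GridSearchCV(
    pipe, param_grid, cv=cv, scoring="f1_macro", return_train_score=True
)
search.fit(X_demo, y_demo)

# A large gap between train and validation scores suggests overfitting;
# low scores on both suggest underfitting.
best = search.best_index_
train_f1 = search.cv_results_["mean_train_score"][best]
val_f1 = search.cv_results_["mean_test_score"][best]
print(f"Best params: {search.best_params_}")
print(f"Train macro F1: {train_f1:.3f}, validation macro F1: {val_f1:.3f}")
```

`RandomizedSearchCV` accepts the same pipeline and `cv` object and samples from the grid instead of trying every combination, which may be preferable when the grid is large.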

## **Step 5: Conclusion**
- Summarize model performances.
- Identify the best-performing model for **network intrusion detection**.
- Discuss whether **balancing improved model performance**.