# Model Training 

In the modeling notebook, we build and evaluate machine learning models for network anomaly detection. We first load the final dataset of preprocessed features and encoded labels, then perform a stratified 80/20 train-test split so that both partitions keep the original class distribution. Next, we train multiple models: Logistic Regression as a baseline, Random Forest for its robustness and interpretability, and XGBoost for its efficiency on tabular data. Hyperparameter tuning is applied to optimize performance.

To evaluate the models, we analyze key metrics such as accuracy, precision, recall, and F1-score. A confusion matrix helps visualize the model’s predictions, while the ROC curve and AUC score provide insight into its ability to distinguish between attack types and benign traffic. Based on these evaluations, we select the best-performing model for deployment, ensuring it can effectively classify network traffic anomalies in real-world scenarios.
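
To make the relationship between these metrics concrete, here is a minimal pure-Python sketch that derives per-class precision, recall, and F1 from a confusion matrix, then compares macro and weighted averages. The 3-class matrix is a hypothetical toy example, not the notebook's data; rows are assumed to be true labels and columns predicted labels, which matches the layout scikit-learn's `confusion_matrix` returns.

```python
# Hypothetical toy confusion matrix: rows = true class, columns = predicted class.
# Class 2 is deliberately rare (4 samples) to show how it affects the averages.
cm = [
    [50, 0, 0],
    [0, 46, 0],
    [0, 1, 3],
]

def per_class_metrics(cm):
    """Return a list of dicts with precision, recall, f1, support per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true k, predicted as something else
        fp = sum(cm[i][k] for i in range(n)) - tp    # predicted k, actually something else
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
        )
    return results

metrics = per_class_metrics(cm)
total = sum(m["support"] for m in metrics)

accuracy = sum(cm[k][k] for k in range(len(cm))) / total
macro_f1 = sum(m["f1"] for m in metrics) / len(metrics)                # every class counts equally
weighted_f1 = sum(m["f1"] * m["support"] for m in metrics) / total     # weighted by class size

for label, m in enumerate(metrics):
    print(label, {key: round(value, 2) for key, value in m.items()})
print("accuracy:", round(accuracy, 2))
print("macro F1:", round(macro_f1, 2), "weighted F1:", round(weighted_f1, 2))
```

Because the macro average gives the rare class the same weight as the large ones, a single missed sample in a 4-sample class lowers macro F1 noticeably while the weighted average barely moves.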

## Load the Final Features Datasets 

In [5]:
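import pandas as pd

# Note: this cell ran after the loading cell (In [4], shown below), so `df` already exists.
# The labels print as floats, which suggests the Label column was scaled together
# with the features during preprocessing; re-encoding restores integer class IDs.
# Check unique label values before fixing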
print("Before Fixing - Unique Labels:", df["Label"].unique())

# Convert label column to categorical integer values
df["Label"] = df["Label"].astype("category").cat.codes

# Verify the encoding
print("After Fixing - Unique Labels:", df["Label"].unique())


Before Fixing - Unique Labels: [ 0.00000000e+00 -1.24281298e+00  8.04626294e-01 -5.12463208e-01
  1.95135960e+00 -1.54308345e-01  6.48053093e+00 -1.41851503e-02
  7.04962568e+01]
After Fixing - Unique Labels: [4 0 5 1 6 2 7 3 8]
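
The encoding step relies on `cat.codes` assigning integers in sorted order of the category values, which is consistent with the output above (the smallest value, -1.24, becomes 0, and 0.0 becomes 4). The sketch below shows that behavior and one way to keep a code-to-value lookup so the integer IDs can be traced back later. The label values here are hypothetical stand-ins, and `code_to_value` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical scaled label values standing in for the floats printed above
labels = pd.Series([0.0, -1.24, 0.80, -0.51, 0.0, 0.80])

# Convert to a categorical; categories are stored in sorted order
categorical = labels.astype("category")
codes = categorical.cat.codes

# Code k corresponds to the k-th smallest distinct value
code_to_value = dict(enumerate(categorical.cat.categories))

print(codes.tolist())
print(code_to_value)
```

Persisting a mapping like this next to the model makes it possible to translate predicted class IDs back into the original label values.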


In [4]:
import glob
import pandas as pd  # needed for read_csv and concat below

# Define the correct path
path = "/home/sagemaker-user/AAI-540 Project/Final_Features/"

# Load all CSV files in the directory
all_files = glob.glob(path + "*.csv")

# Read and combine all datasets
df_list = [pd.read_csv(file) for file in all_files]
df = pd.concat(df_list, ignore_index=True)

# Verify the combined dataset
print(" Dataset Loaded Successfully!")
df.info()  # Prints column types and non-null counts itself (it returns None, so no print() is needed)
print("Unique Labels:", df["Label"].unique())  # Verify labels


 Dataset Loaded Successfully!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452232 entries, 0 to 452231
Data columns (total 76 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Destination Port             452232 non-null  float64
 1   Flow Duration                452232 non-null  float64
 2   Total Fwd Packets            452232 non-null  float64
 3   Total Backward Packets       452232 non-null  float64
 4   Total Length of Fwd Packets  452232 non-null  float64
 5   Total Length of Bwd Packets  452232 non-null  float64
 6   Fwd Packet Length Max        452232 non-null  float64
 7   Fwd Packet Length Min        452232 non-null  float64
 8   Fwd Packet Length Mean       452232 non-null  float64
 9   Fwd Packet Length Std        452232 non-null  float64
 10  Bwd Packet Length Max        452232 non-null  float64
 11  Bwd Packet Length Min        452232 non-null  float64
 12  Bwd Packet Length Mean      

## Split Dataset for Training and Testing 

In [6]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop(columns=["Label"])  # Features
y = df["Label"]  # Target (correctly encoded)

# Train-Test Split (stratified to maintain label distribution)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Verify split
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
print("Unique labels in training set:", y_train.unique())
print("Unique labels in test set:", y_test.unique())


Training set size: (361785, 75)
Test set size: (90447, 75)
Unique labels in training set: [2 6 5 3 4 1 0 7 8]
Unique labels in test set: [6 0 3 2 1 4 5 7 8]
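
To illustrate what `stratify=y` does, the following sketch runs the same `train_test_split` call on a small, hypothetical imbalanced label list and counts the classes in each partition. The toy data and class sizes are assumptions chosen so that the proportions divide evenly.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 160 / 30 / 10 samples for classes 0 / 1 / 2
y = [0] * 160 + [1] * 30 + [2] * 10
X = [[i] for i in range(len(y))]  # one dummy feature per sample

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, each class keeps its 80/20 share in both partitions
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(sorted(train_counts.items())))
print("test: ", dict(sorted(test_counts.items())))
```

Without `stratify`, a rare class could end up entirely in one partition by chance, leaving it either unseen during training or impossible to evaluate.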


## Model Training Using XGBoost

In [8]:
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix

# Define XGBoost Model
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train Model
model.fit(X_train, y_train)

# Make Predictions
y_pred = model.predict(X_test)

# Evaluate Model
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7415
           1       1.00      1.00      1.00     17526
           2       1.00      1.00      1.00     15447
           3       1.00      1.00      1.00     17891
           4       1.00      1.00      1.00     15740
           5       1.00      1.00      1.00     11453
           6       1.00      1.00      1.00      4603
           7       1.00      0.99      1.00       368
           8       1.00      0.75      0.86         4

    accuracy                           1.00     90447
   macro avg       1.00      0.97      0.98     90447
weighted avg       1.00      1.00      1.00     90447

Confusion Matrix:
 [[ 7414     0     0     0     0     1     0     0     0]
 [    0 17526     0     0     0     0     0     0     0]
 [    0     0 15447     0     0     0     0     0     0]
 [    0     0     0 17891     0     0     0     0     0]
 [    0     0     0     

### Analysis of XGBoost Model Performance

#### 1. Training and Testing Data Distribution

- Training set size: 361,785 samples
- Test set size: 90,447 samples
- Unique labels in both the training and test sets: {0, 1, 2, 3, 4, 5, 6, 7, 8}, i.e. 9 classes

#### 2. Classification Report Analysis

Precision, recall, and F1-score all round to 1.00 for classes 0 through 6. Two classes fall short:

- Class 7: precision 1.00, recall 0.99, F1-score 1.00, support 368
- Class 8: precision 1.00, recall 0.75, F1-score 0.86, support 4 (only 4 instances of class 8 in the test set)

Interpretation:

- The model classifies almost all classes near perfectly, meaning it has learned strong distinguishing features for most attack types.
- The main weakness is class 8: a recall of 0.75 on 4 samples means that 1 of the 4 actual class 8 instances (25%) was assigned to another class. With so few samples, this estimate is very noisy.

#### 3. Confusion Matrix Analysis

The visible rows of the confusion matrix are almost entirely diagonal, confirming that the model classifies most classes with few or no errors.

Misclassification observations:

- Class 0: one sample was predicted as class 5 (row 0 of the matrix).
- Class 7: a recall of 0.99 on 368 samples implies a small number of missed instances.
- Class 8: a recall of 0.75 on 4 samples implies 1 missed instance.

Interpretation:

Class 8 is severely underrepresented: with 4 test samples under a stratified 80/20 split, the training set holds only roughly four times that many. It may also have features that overlap with other classes, making it harder for the model to distinguish.

#### 4. Overall Model Performance

- Accuracy = 1.00 after rounding (not literally perfect, since the confusion matrix shows at least one error)
- Macro average F1-score = 0.98 and macro average recall = 0.97 (every class counts equally, so class 8 pulls these down)
- Weighted average F1-score = 1.00 (dominated by the large classes, so it hides the weakness on class 8)

#### 5. Key Takeaways

- ✅ XGBoost performs exceptionally well, with F1-scores that round to 1.00 for 8 of the 9 classes.
- ✅ Class 8 has a lower recall (0.75), meaning some false negatives, although this is measured on only 4 test samples.
- ✅ Further improvement could involve:
  - Balancing the dataset to ensure class 8 has sufficient training samples.
  - Feature engineering to help better distinguish class 8 from others.

#### Conclusion

This XGBoost model detects network anomalies with high precision, recall, and F1-scores across nearly all categories. The recall gap for class 8 is the main open issue and could be addressed through data balancing or better feature selection.
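
One way to act on the balancing suggestion without resampling is to weight samples inversely to class frequency. The sketch below computes such weights with the heuristic `n_samples / (n_classes * count)`, the same formula scikit-learn uses for `class_weight="balanced"`. The counts are the test-set support values from the classification report above, used purely for illustration; in practice they would be computed from `y_train`. Passing the weights to `fit` through `sample_weight` is shown only as a comment and is an assumption about how one might wire this in, not something the original notebook does.

```python
# Per-class support taken from the classification report above (test set),
# used here only to illustrate the scale of the imbalance.
support = {0: 7415, 1: 17526, 2: 15447, 3: 17891, 4: 15740,
           5: 11453, 6: 4603, 7: 368, 8: 4}

n_samples = sum(support.values())
n_classes = len(support)

# Inverse-frequency ("balanced") weights: rare classes get large weights
class_weight = {
    label: n_samples / (n_classes * count) for label, count in support.items()
}

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.2f}")

# Hypothetical usage with the training labels (not executed here):
# sample_weight = [class_weight[c] for c in y_train]
# model.fit(X_train, y_train, sample_weight=sample_weight)
```

Weights this extreme can make training sensitive to a handful of samples, so it may be worth capping them or combining moderate weighting with collecting more class 8 data.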

## Save the Model

In [9]:
import joblib

# Save model
joblib.dump(model, "xgboost_model.pkl")

print("Model saved successfully!")


Model saved successfully!


In [11]:
import boto3

s3 = boto3.client("s3")
s3_bucket = "network-anomaly-dataset-001aefd6"

s3.upload_file("xgboost_model.pkl", s3_bucket, "models/xgboost_network_anomaly.pkl")
print("Model uploaded to S3!")


Model uploaded to S3!


## Convert Model from pkl to tar.gz

In [13]:
import tarfile
import shutil
import os

# Define paths
model_filename = "xgboost_model.pkl"
tar_filename = "model.tar.gz"

# Create a directory for the model
model_dir = "model"
shutil.rmtree(model_dir, ignore_errors=True)  # Delete if it exists
os.makedirs(model_dir, exist_ok=True)

# Copy (not move) the model file into the directory as "xgboost-model"
shutil.copy(model_filename, os.path.join(model_dir, "xgboost-model"))

# Create a tar.gz file
with tarfile.open(tar_filename, "w:gz") as tar:
    tar.add(model_dir, arcname=".")

print("✅ Model packaged as model.tar.gz successfully!")


✅ Model packaged as model.tar.gz successfully!
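
Before uploading, it can be useful to confirm that the archive has the expected layout. The sketch below rebuilds the same structure in a temporary directory, with placeholder bytes standing in for the real model file, and lists the files inside the archive. The temporary paths and placeholder content are assumptions for illustration; the `tar.add(..., arcname=".")` call mirrors the cell above.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Recreate the "model" directory with a placeholder file
    model_dir = os.path.join(workdir, "model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "xgboost-model"), "wb") as f:
        f.write(b"placeholder")  # stands in for the serialized model

    # Package it the same way as above: directory contents at the archive root
    tar_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(model_dir, arcname=".")

    # Read the archive back and collect the regular files it contains
    with tarfile.open(tar_path, "r:gz") as tar:
        member_names = [
            os.path.normpath(m.name) for m in tar.getmembers() if m.isfile()
        ]

print(member_names)
```

Because `arcname="."` is used, the model file sits at the top level of the archive rather than inside a `model/` subfolder.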


In [14]:
import boto3

s3_client = boto3.client("s3")

# Define S3 bucket and path
s3_bucket = "network-anomaly-dataset-001aefd6"
s3_key = "sagemaker/model.tar.gz"

# Upload the model
s3_client.upload_file(tar_filename, s3_bucket, s3_key)

print("✅ Model uploaded to S3 successfully!")


✅ Model uploaded to S3 successfully!
