## Step 1: Data Splitting

### What We're Doing:
1. **Separate Features and Target**: 
   - We take our dataset (a DataFrame) and separate the features (all columns except 'Attack Type') and the target (the 'Attack Type' column).

2. **Split Data**: 
   - We then split the data into training and testing sets. 
   - The training set is used to train the model, and the testing set is used to evaluate the model's performance on unseen data.
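
As a minimal sketch of the two steps above, the snippet below runs the same separate-then-split pattern on a small synthetic DataFrame. The feature names `f1` to `f4` and all `*_demo` variables are hypothetical stand-ins, not columns from the real dataset; only the target name 'Attack Type' comes from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 1,000 rows, 4 numeric features, 3 classes.
rng = np.random.default_rng(42)
df_demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["f1", "f2", "f3", "f4"])
df_demo["Attack Type"] = rng.integers(0, 3, size=1000)

# Separate features and target.
X_demo = df_demo.drop(columns=["Attack Type"])
y_demo = df_demo["Attack Type"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr_demo.shape, X_te_demo.shape)  # (800, 4) (200, 4)
```

With `test_size=0.2`, 200 of the 1,000 rows land in the test set and the remaining 800 in the training set.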

In [18]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the encoded dataset from CSV.
# Adjust the path below if your file is stored elsewhere or named differently.
df = pd.read_csv('Encoded_Cybersecurity_Data.csv')

# Drop 'User Information' column from the dataset
df = df.drop(columns=['User Information'], errors='ignore')

# Display the first few rows of the DataFrame to understand its structure
print("Dataset preview:")
print(df.head())

# Separate the features and the target variable.
# 'Attack Type' is our target, so we drop it from features.
X = df.drop('Attack Type', axis=1)  # Features: all columns except 'Attack Type'
y = df['Attack Type']               # Target: the 'Attack Type' column

# Split the data into training and testing sets.
# test_size=0.2 means 20% of the data is used for testing.
# random_state ensures the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets to confirm the split.
print("\nData Splitting Results:")
print("Training set shape (features):", X_train.shape)
print("Training set shape (target):", y_train.shape)
print("Testing set shape (features):", X_test.shape)
print("Testing set shape (target):", y_test.shape)


Dataset preview:
   Protocol  Packet Length  Packet Type  Traffic Type  Malware Indicators  \
0         0      -0.669295            1             2                   0   
1         0       0.943535            1             2                   0   
2         2      -1.142808            0             2                   0   
3         2      -0.952922            1             2                   1   
4         1       1.635778            1             0                   1   

0       -0.743191                1            2                 1   
1        0.048054                1            2                 0   
2        1.292975                0            0                 1   
3       -1.189588                0            2                 1   
4       -1.718818                0            0                 1   

   Action Taken  ...  Packet Length Category  Anomaly Category  Used Proxy  \
0             2  ...                       2                 1           1   
1             0  .

## Step 2: Trying Multiple Classification Models and Comparing Them

### What We're Doing:
1. **Goal**: 
   - Evaluate and compare several classification models to predict the Attack Type.

2. **Approach**: 
   - We'll use a few common classifiers:
     - **Logistic Regression**: A linear model for classification.
     - **Random Forest**: An ensemble method that builds multiple decision trees.
     - **Decision Tree**: A simple tree-based classifier.
     - **Support Vector Machine (SVM)**: Effective in high-dimensional spaces.
     - **K-Nearest Neighbors (KNN)**: A simple instance-based classifier.

3. **Evaluation**: 
   - For each model, we'll train it on the training data, predict on the test data, and then compare performance using metrics such as accuracy and the classification report.
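
The train, predict, and evaluate loop described above can be sketched on synthetic data as follows. Two things here are assumptions added for illustration and are not part of the original notebook: a `DummyClassifier` baseline, which shows what accuracy looks like with no learning at all (roughly 1/3 for three balanced classes), and a `StandardScaler` pipeline around Logistic Regression, which is one common way to address convergence warnings. All variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real data.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# A no-skill baseline next to a real model, evaluated the same way.
candidates = {
    "Baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "Scaled Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for label, clf in candidates.items():
    clf.fit(X_tr_syn, y_tr_syn)               # train on the training split
    y_hat = clf.predict(X_te_syn)             # predict on unseen data
    scores[label] = accuracy_score(y_te_syn, y_hat)
    print(f"{label}: {scores[label]:.4f}")
```

Reading each model's accuracy against the baseline makes it easier to judge whether a score reflects learned structure or just the class distribution.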

In [20]:
# Import necessary libraries for models and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report


# Define a dictionary of models to evaluate
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Support Vector Machine': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Dictionary to store the evaluation results for each model
results = {}

# Loop through each model: train, predict, and evaluate
for model_name, model in models.items():
    # Train the model using the training data
    model.fit(X_train, y_train)
    
    # Predict on the test data
    y_pred = model.predict(X_test)
    
    # Calculate accuracy
    acc = accuracy_score(y_test, y_pred)
    results[model_name] = acc  # Store the accuracy score
    
    # Print the model's performance details
    print(f"Model: {model_name}")
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("-" * 50)

# Print a summary of the model performance based on accuracy
print("Summary of Model Performance (Accuracy):")
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.4f}")


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model: Logistic Regression
Accuracy: 0.3430
Classification Report:
              precision    recall  f1-score   support

           0       0.34      0.40      0.36      2636
           1       0.35      0.29      0.32      2721
           2       0.34      0.34      0.34      2643

    accuracy                           0.34      8000
   macro avg       0.34      0.34      0.34      8000
weighted avg       0.34      0.34      0.34      8000

--------------------------------------------------
Model: Random Forest
Accuracy: 0.3321
Classification Report:
              precision    recall  f1-score   support

           0       0.34      0.38      0.36      2636
           1       0.33      0.30      0.32      2721
           2       0.33      0.31      0.32      2643

    accuracy                           0.33      8000
   macro avg       0.33      0.33      0.33      8000
weighted avg       0.33      0.33      0.33      8000

--------------------------------------------------
Model: D

## Step 3: Hyperparameter Tuning and Feature Selection

### Part A: Hyperparameter Tuning

#### What We're Doing:
1. **Goal**: 
   - Optimize model performance by finding the best hyperparameters for each classifier.

2. **Approach**: 
   - Use `GridSearchCV` to systematically explore a grid of hyperparameter values for:
     - **Logistic Regression**
     - **Random Forest**
     - **Decision Tree**
     - **K-Nearest Neighbors**
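
To illustrate what a grid search does, here is a minimal sketch on synthetic data using a deliberately small grid for K-Nearest Neighbors. The data and the variable names (`X_grid`, `y_grid`, `knn_grid`, `search`) are hypothetical; the grid used in this notebook is larger.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic 3-class dataset.
X_grid, y_grid = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=0
)

# 2 values x 2 values = 4 hyperparameter combinations.
knn_grid = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}

# Each combination is scored with 5-fold cross-validation (20 fits in total),
# then the best one is refit on all of the data passed to fit().
search = GridSearchCV(KNeighborsClassifier(), knn_grid, cv=5, scoring="accuracy")
search.fit(X_grid, y_grid)

print("Combinations tried:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```

Because the number of fits grows multiplicatively with each added hyperparameter, it usually pays to keep grids small at first and widen them only around promising values.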


In [25]:
import warnings
warnings.filterwarnings('ignore')

In [30]:
# ---------------------------
# Step 1: Hyperparameter Tuning
# ---------------------------

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

import pandas as pd
import numpy as np

# Define hyperparameter grids for each model
param_grid_logistic = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2']  # Only l2 penalty is used here
}

param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30]
}

param_grid_dt = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance']
}

# Initialize models with fixed random_state for reproducibility
logistic = LogisticRegression(max_iter=3000, random_state=42)
rf = RandomForestClassifier(random_state=42)
dt = DecisionTreeClassifier(random_state=42)
knn = KNeighborsClassifier()

# Map model names to a tuple of (model, hyperparameter grid)
models_param = {
    'Logistic Regression': (logistic, param_grid_logistic),
    'Random Forest': (rf, param_grid_rf),
    'Decision Tree': (dt, param_grid_dt),
    'K-Nearest Neighbors': (knn, param_grid_knn)
}

# Dictionary to store the best tuned model for each classifier
tuned_models = {}

print("Hyperparameter Tuning Results:\n" + "="*50)
# Loop through each model and perform grid search with 5-fold cross-validation
for name, (model, param_grid) in models_param.items():
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    tuned_models[name] = best_model
    print(f"Model: {name}")
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best Cross-Validation Score: {grid_search.best_score_:.4f}")
    print("-" * 50)


Hyperparameter Tuning Results:
Model: Logistic Regression
Best Parameters: {'C': 0.1, 'penalty': 'l2'}
Best Cross-Validation Score: 0.3346
--------------------------------------------------
Model: Random Forest
Best Parameters: {'max_depth': 10, 'n_estimators': 100}
Best Cross-Validation Score: 0.3359
--------------------------------------------------
Model: Decision Tree
Best Parameters: {'max_depth': None, 'min_samples_split': 5}
Best Cross-Validation Score: 0.3390
--------------------------------------------------
Model: K-Nearest Neighbors
Best Parameters: {'n_neighbors': 3, 'weights': 'uniform'}
Best Cross-Validation Score: 0.3329
--------------------------------------------------


In [32]:
# ---------------------------
# Step 2: Feature Scoring & Selection
# ---------------------------

# Use the tuned Random Forest model to extract feature importances
best_rf = tuned_models['Random Forest']

# GridSearchCV (default refit=True) has already refit best_estimator_ on the full
# training data, so this call is redundant but harmless.
best_rf.fit(X_train, y_train)

# Extract feature importances and create a DataFrame for easy viewing
feature_importances = best_rf.feature_importances_
feature_names = X_train.columns  # Assumes X_train is a DataFrame

importances_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort features by importance in descending order
importances_df = importances_df.sort_values(by='Importance', ascending=False)

print("\nFeature Importances from Tuned Random Forest:")
print(importances_df)

# Select features with importance above or equal to the median importance
threshold = np.median(feature_importances)
selected_features = importances_df[importances_df['Importance'] >= threshold]['Feature'].tolist()

print("\nSelected Features (Importance >= median):")
print(selected_features)



Feature Importances from Tuned Random Forest:
                          Feature  Importance
5                  Anomaly Scores    0.085125
1                   Packet Length    0.084480
17                           City    0.078439
40              City_Attack_Count    0.065120
21                            day    0.059837
18                          State    0.058750
22                           Hour    0.054058
20                          month    0.042975
25                    day_of_week    0.035730
36               Operating System    0.032609
19                           year    0.023963
35                   Browser Name    0.020172
0                        Protocol    0.019855
24                         Season    0.019736
3                    Traffic Type    0.019507
8                    Action Taken    0.019468
39       Destination_IP_malicious    0.018600
10                Network Segment    0.018331
9                  Severity Level    0.018232
38               Rendering Engine

In [34]:
# ---------------------------
# Step 3: Comparison Using All Features
# ---------------------------

from sklearn.metrics import accuracy_score, classification_report

results_tuned_all = {}
print("\nEvaluating Tuned Models on Test Data (All Features):")
print("="*50)

for name, model in tuned_models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    results_tuned_all[name] = acc
    print(f"\nModel: {name}")
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("-" * 50)

print("Summary of Tuned Model Performance (All Features):")
for name, accuracy in results_tuned_all.items():
    print(f"{name}: {accuracy:.4f}")



Evaluating Tuned Models on Test Data (All Features):

Model: Logistic Regression
Accuracy: 0.3436
Classification Report:
              precision    recall  f1-score   support

           0       0.34      0.40      0.37      2636
           1       0.35      0.29      0.32      2721
           2       0.34      0.34      0.34      2643

    accuracy                           0.34      8000
   macro avg       0.34      0.34      0.34      8000
weighted avg       0.34      0.34      0.34      8000

--------------------------------------------------

Model: Random Forest
Accuracy: 0.3395
Classification Report:
              precision    recall  f1-score   support

           0       0.33      0.42      0.37      2636
           1       0.35      0.26      0.30      2721
           2       0.34      0.34      0.34      2643

    accuracy                           0.34      8000
   macro avg       0.34      0.34      0.34      8000
weighted avg       0.34      0.34      0.34      8000

---

In [36]:
# ---------------------------
# Step 4: Comparison Using Selected Features
# ---------------------------

from sklearn.base import clone

# Create training and testing sets with only the selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

results_tuned_selected = {}
print("\nEvaluating Tuned Models on Test Data (Selected Features):")
print("="*50)

for name, model in tuned_models.items():
    # Fit a fresh clone on the selected features so the models stored in
    # tuned_models keep their all-features fit and remain usable afterwards
    model_sel = clone(model)
    model_sel.fit(X_train_selected, y_train)
    y_pred_sel = model_sel.predict(X_test_selected)
    acc_sel = accuracy_score(y_test, y_pred_sel)
    results_tuned_selected[name] = acc_sel
    print(f"\nModel: {name}")
    print(f"Accuracy: {acc_sel:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred_sel))
    print("-" * 50)

print("Summary of Tuned Model Performance (Selected Features):")
for name, accuracy in results_tuned_selected.items():
    print(f"{name}: {accuracy:.4f}")



Evaluating Tuned Models on Test Data (Selected Features):

Model: Logistic Regression
Accuracy: 0.3416
Classification Report:
              precision    recall  f1-score   support

           0       0.34      0.48      0.40      2636
           1       0.35      0.26      0.30      2721
           2       0.34      0.29      0.31      2643

    accuracy                           0.34      8000
   macro avg       0.34      0.34      0.34      8000
weighted avg       0.34      0.34      0.34      8000

--------------------------------------------------

Model: Random Forest
Accuracy: 0.3349
Classification Report:
              precision    recall  f1-score   support

           0       0.33      0.42      0.37      2636
           1       0.34      0.25      0.29      2721
           2       0.34      0.34      0.34      2643

    accuracy                           0.33      8000
   macro avg       0.34      0.34      0.33      8000
weighted avg       0.34      0.33      0.33      8000

## XGBoost Classification with Stratified Train-Validation Split

### Introduction:
This section demonstrates an end-to-end pipeline for training and evaluating an XGBoost classifier using a stratified train-validation split. Stratification keeps the class distribution in the training and validation sets close to that of the original data, so the validation metrics are not skewed by an unrepresentative split. The process involves verifying the dataset, performing a stratified split, training the model with fixed hyperparameters, and then evaluating its performance using accuracy, log loss, and a detailed classification report.
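
The sketch below illustrates two ideas from this introduction on synthetic labels, using only scikit-learn. First, `stratify=` preserves class proportions in the held-out split. Second, predicting a uniform 1/3 probability for each of three classes gives a log loss of ln 3 (about 1.0986), which is a handy reference point when reading a validation log loss. The 60/30/10 class mix and all variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels: 60% / 30% / 10%.
y_strat = np.array([0] * 600 + [1] * 300 + [2] * 100)
X_strat = np.arange(len(y_strat)).reshape(-1, 1)

# Stratified 80/20 split: the validation part keeps the same class mix.
X_a, X_b, y_a, y_b = train_test_split(
    X_strat, y_strat, test_size=0.2, random_state=42, stratify=y_strat
)
val_share = np.bincount(y_b) / len(y_b)
print(val_share)  # [0.6 0.3 0.1]

# Log loss of a "know-nothing" model that predicts 1/3 for every class.
uniform_proba = np.full((len(y_b), 3), 1.0 / 3.0)
reference_loss = log_loss(y_b, uniform_proba, labels=[0, 1, 2])
print(f"Uniform-guess log loss: {reference_loss:.4f} (ln 3 = {np.log(3):.4f})")
```

A model whose validation log loss sits at or above this reference is not producing probabilities that are more informative than a uniform guess.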

In [80]:
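import numpy as np
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier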

# =============================================================================
# Step 1: Dataset Verification
# =============================================================================

# Ensure that X_train and y_train are defined.
if 'X_train' not in globals() or 'y_train' not in globals():
    raise ValueError("X_train and y_train must be defined. Please load your dataset and split it accordingly.")

# Display the unique classes in y_train to verify the label set.
unique_classes = np.sort(np.unique(y_train))
print("Unique classes in y_train:", unique_classes)

# =============================================================================
# Step 2: Stratified Train-Validation Split
# =============================================================================

# Split the data into training and validation sets (80/20 split) with stratification.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)
print("\nTraining set shape:", X_tr.shape)
print("Validation set shape:", X_val.shape)

# =============================================================================
# Step 3: Model Training with XGBoost
# =============================================================================

# Define fixed hyperparameters for the XGBoost classifier.
xgb_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'logloss',
    'use_label_encoder': False,  # Only relevant on older XGBoost versions; newer releases ignore it or warn
    'random_state': 42
}

# Initialize and train the XGBoost classifier on the training subset.
model = XGBClassifier(**xgb_params)
model.fit(X_tr, y_tr)

# =============================================================================
# Step 4: Evaluation on the Validation Set
# =============================================================================

# Make predictions on the validation set.
y_val_pred = model.predict(X_val)
y_val_proba = model.predict_proba(X_val)  # Needed for log loss calculation.

# Compute evaluation metrics.
accuracy_val = accuracy_score(y_val, y_val_pred)
loss_val = log_loss(y_val, y_val_proba)
report_val = classification_report(y_val, y_val_pred)

# Display evaluation results.
print("\nValidation Accuracy: {:.4f}".format(accuracy_val))
print("Validation Log Loss: {:.4f}".format(loss_val))
print("\nClassification Report on Validation Set:\n", report_val)


Unique classes in y_train: [0 1 2]

Training set shape: (25600, 41)
Validation set shape: (6400, 41)

Validation Accuracy: 0.3362
Validation Log Loss: 1.1111

Classification Report on Validation Set:
               precision    recall  f1-score   support

           0       0.34      0.36      0.35      2158
           1       0.34      0.32      0.33      2109
           2       0.33      0.34      0.33      2133

    accuracy                           0.34      6400
   macro avg       0.34      0.34      0.34      6400
weighted avg       0.34      0.34      0.34      6400

