# Comprehensive Comparison: Professor's Template vs Our Implementations

**Purpose:** Compare the professor's template script (skML-complete) with our three classifier implementations.

| Classifier | Notebook | Best Model | MCC (optimised) |
|------------|----------|------------|-----|
| Linear | 01_Linear_Classifier.ipynb | LDA | 0.671 |
| Ensemble | 02_Ensemble_Classifier.ipynb | Random Forest | 0.810 |
| Non-Linear | 03_NonLinear_Classifier.ipynb | KNN | 0.816 |

---

## 1. Structure Comparison: Template vs All Implementations

| Aspect | skML-complete (Template) | Linear (LDA) | Ensemble (RF) | Non-Linear (KNN) |
|--------|-------------------------|--------------|---------------|------------------|
| **Header** | Version info only | Full header | Full header | Full header |
| **Classification** | 2-class | 5-class | 5-class | 5-class |
| **Baseline Models** | Multiple (commented) | LDA, LogReg, Ridge | RF, ExtraTrees, AdaBoost | KNN, DT, SVM |
| **Best Baseline** | Not specified | LDA | Random Forest | KNN |
| **Optimization 1** | Template only | Hyperparameter Tuning | RandomizedSearchCV | GridSearchCV |
| **Optimization 2** | Not included | Correlation Filter | Feature Importance | Correlation Filter |
| **Features Used** | 122 (all) | 30 (reduced) | 38 (reduced) | 30 (reduced) |
| **Visualizations** | None | Yes | Yes | Yes |
| **Results Export** | None | JSON + CSV | JSON + CSV | JSON + CSV |

---
## 2. Code Comparison by Section

### 2.1 Classification Target Setting

**skML-complete (Template):**
```python
twoclass = True     # 2-class: normal vs attack
```

**All Our Implementations:**
```python
twoclass = False    # 5-class: benign, dos, probe, r2l, u2r
```

**Reason:** The assignment requires MCC per attack class, which is only possible with 5-class classification.
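
One common way to obtain an MCC per class is to treat each class in turn as "positive" and all others as "negative" (one-vs-rest). The sketch below assumes that approach; the notebooks may compute it differently. The helper `mcc_one_vs_rest` and the toy labels are invented for illustration.

```python
from math import sqrt

def mcc_one_vs_rest(y_true, y_pred, positive):
    """MCC with one class treated as positive and every other class as negative."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive and p != positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: report 0.0 when a marginal is empty and MCC is undefined
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy labels, invented purely for illustration
y_true = ["benign", "dos", "dos", "probe", "r2l", "u2r", "benign", "dos"]
y_pred = ["benign", "dos", "benign", "probe", "benign", "u2r", "benign", "dos"]

classes = ["benign", "dos", "probe", "r2l", "u2r"]
per_class_mcc = {c: mcc_one_vs_rest(y_true, y_pred, c) for c in classes}

for name, score in per_class_mcc.items():
    print(f"{name:<7} {score:.3f}")
```

With `twoclass = True` the four attack labels collapse into one, so a breakdown like this cannot be produced.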

### 2.2 Model Selection Comparison

**skML-complete (Template):**
```python
models = []
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
models.append(("LinearDA", LinearDiscriminantAnalysis()))
from sklearn.naive_bayes import GaussianNB 
models.append(("GaussianNB", GaussianNB()))
# Other models commented out...
```

---

**01_Linear_Classifier.ipynb:**
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, RidgeClassifier

lda_baseline = LinearDiscriminantAnalysis()
lr_baseline = LogisticRegression(max_iter=1000, class_weight='balanced')
ridge_baseline = RidgeClassifier(class_weight='balanced')
```

---

**02_Ensemble_Classifier.ipynb:**
```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

rf_baseline = RandomForestClassifier(n_estimators=100, class_weight='balanced')
et_baseline = ExtraTreesClassifier(n_estimators=100, class_weight='balanced')
ada_baseline = AdaBoostClassifier(n_estimators=100)
```

---

**03_NonLinear_Classifier.ipynb:**
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

knn_baseline = KNeighborsClassifier(n_neighbors=5, weights='distance')
dt_baseline = DecisionTreeClassifier(class_weight='balanced')
svm_baseline = SVC(kernel='rbf', class_weight='balanced')
```

### 2.3 Optimization Strategy Comparison

| Classifier | Optimization 1 | Optimization 2 | Feature Reduction |
|------------|----------------|----------------|-------------------|
| **Template** | Template only (empty) | None | 0% |
| **Linear (LDA)** | 5-fold CV tuning (solver, shrinkage) | Correlation filter (threshold=0.1) | 75.4% |
| **Ensemble (RF)** | RandomizedSearchCV (10 iter, 3-fold) | Feature importance (95% cumulative) | 68.9% |
| **Non-Linear (KNN)** | GridSearchCV (3-fold) | Correlation filter (threshold=0.1) | 75.4% |
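
The reduction percentages in the table follow from the feature counts (122 in the template; 30 or 38 retained). A quick arithmetic check, using a hypothetical helper `reduction_pct`:

```python
TOTAL_FEATURES = 122  # feature count used by the template

def reduction_pct(kept, total=TOTAL_FEATURES):
    """Percentage of features removed, rounded to one decimal place."""
    return round(100 * (total - kept) / total, 1)

reductions = {
    "Linear (LDA)": reduction_pct(30),
    "Ensemble (RF)": reduction_pct(38),
    "Non-Linear (KNN)": reduction_pct(30),
}
print(reductions)
# {'Linear (LDA)': 75.4, 'Ensemble (RF)': 68.9, 'Non-Linear (KNN)': 75.4}
```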

### 2.4 Hyperparameter Tuning Details

#### Linear (LDA)
```python
configs = [
    {'solver': 'svd', 'shrinkage': None},
    {'solver': 'lsqr', 'shrinkage': 'auto'},
    {'solver': 'lsqr', 'shrinkage': 0.1},
    # ... more configs
]
# Best: solver='svd', shrinkage=None
```
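
When building a config list like this, note that scikit-learn's LDA accepts a shrinkage setting only with the `lsqr` and `eigen` solvers, not with `svd`. The small check below is a sketch for filtering a config list before running cross-validation; `is_valid_lda_config` is a hypothetical helper, and the last config is deliberately invalid.

```python
# Solvers that accept a shrinkage setting in scikit-learn's LDA
SHRINKAGE_SOLVERS = {'lsqr', 'eigen'}

def is_valid_lda_config(cfg):
    """True if the solver/shrinkage pair is a combination LDA accepts."""
    return cfg['shrinkage'] is None or cfg['solver'] in SHRINKAGE_SOLVERS

configs = [
    {'solver': 'svd', 'shrinkage': None},
    {'solver': 'lsqr', 'shrinkage': 'auto'},
    {'solver': 'lsqr', 'shrinkage': 0.1},
    {'solver': 'svd', 'shrinkage': 'auto'},  # invalid on purpose, for illustration
]

valid_configs = [c for c in configs if is_valid_lda_config(c)]
print(len(valid_configs), "of", len(configs), "configs are usable")
```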

#### Ensemble (Random Forest)
```python
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced']
}
# Best: n_estimators=100, max_depth=None, min_samples_split=2
```
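
This grid defines 2 × 2 × 2 × 2 × 1 × 1 = 16 combinations, so a randomized search with 10 iterations evaluates 10 of them. The sketch below mimics that idea with the standard library only; it is not scikit-learn's own sampler, and the seed is arbitrary.

```python
import itertools
import random

param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced'],
}

# Enumerate every combination the grid defines
keys = list(param_grid)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*param_grid.values())]

# Draw 10 distinct candidates; the seed only makes the draw repeatable
rng = random.Random(0)
candidates = rng.sample(all_combos, 10)

print(len(all_combos), "combinations in total,", len(candidates), "sampled")
```

Because 6 of the 16 combinations are never tried, the reported best setting is the best among those sampled, not necessarily the best in the grid.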

#### Non-Linear (KNN)
```python
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],  # Manhattan vs Euclidean
    'algorithm': ['auto']
}
# Best: n_neighbors=3, weights='distance', p=1 (Manhattan)
```
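
The `p` parameter selects the order of the Minkowski distance: `p=1` is Manhattan and `p=2` is Euclidean. A minimal sketch of the difference on a toy pair of points (the `minkowski` helper is written here for illustration, not taken from the notebook):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = [0.0, 0.0], [3.0, 4.0]

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

Manhattan distance sums per-feature differences without squaring them, so a single large difference contributes less to the total than it does under Euclidean distance.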

---
## 3. Performance Results Comparison

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Baseline Comparison (Best from each category)
baseline_results = pd.DataFrame({
    'Category': ['Linear', 'Ensemble', 'Non-Linear'],
    'Best Model': ['LDA', 'Random Forest', 'KNN'],
    'Accuracy': [0.769, 0.871, 0.837],
    'MCC': [0.664, 0.814, 0.760],
    'F1 (Weighted)': [0.750, 0.845, 0.812]
})

print("="*70)
print("BASELINE COMPARISON (Best from each category)")
print("="*70)
print(baseline_results.to_string(index=False))
```

```python
# Optimised Model Comparison
optimised_results = pd.DataFrame({
    'Category': ['Linear', 'Ensemble', 'Non-Linear'],
    'Model': ['LDA (optimised)', 'Random Forest (optimised)', 'KNN (optimised)'],
    'Accuracy': [0.775, 0.868, 0.875],
    'MCC': [0.671, 0.810, 0.816],
    'F1 (Weighted)': [0.763, 0.840, 0.865],
    'Features': [30, 38, 30]
})

print("\n" + "="*70)
print("OPTIMISED MODEL COMPARISON")
print("="*70)
print(optimised_results.to_string(index=False))
```

```python
# MCC Per Attack Class Comparison
mcc_per_class = pd.DataFrame({
    'Attack Class': ['benign', 'dos', 'probe', 'r2l', 'u2r'],
    'LDA': [0.673, 0.786, 0.575, 0.513, 0.579],
    'Random Forest': [0.757, 0.984, 0.911, 0.336, 0.848],
    'KNN': [0.786, 0.946, 0.850, 0.567, 0.572]
})

print("\n" + "="*70)
print("MCC PER ATTACK CLASS (Optimised Models)")
print("="*70)
print(mcc_per_class.to_string(index=False))
```

```python
# Visualization: MCC Comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall MCC
models = ['LDA', 'Random Forest', 'KNN']
baseline_mcc = [0.664, 0.814, 0.760]
optimised_mcc = [0.671, 0.810, 0.816]

x = np.arange(len(models))
width = 0.35

bars1 = axes[0].bar(x - width/2, baseline_mcc, width, label='Baseline', color='steelblue')
bars2 = axes[0].bar(x + width/2, optimised_mcc, width, label='Optimised', color='darkorange')
axes[0].set_ylabel('MCC Score')
axes[0].set_title('Overall MCC: Baseline vs Optimised')
axes[0].set_xticks(x)
axes[0].set_xticklabels(models)
axes[0].legend()
axes[0].set_ylim(0, 1)

# MCC per class (optimised only)
attack_classes = ['benign', 'dos', 'probe', 'r2l', 'u2r']
lda_mcc = [0.673, 0.786, 0.575, 0.513, 0.579]
rf_mcc = [0.757, 0.984, 0.911, 0.336, 0.848]
knn_mcc = [0.786, 0.946, 0.850, 0.567, 0.572]

x2 = np.arange(len(attack_classes))
width2 = 0.25

axes[1].bar(x2 - width2, lda_mcc, width2, label='LDA', color='#2ecc71')
axes[1].bar(x2, rf_mcc, width2, label='Random Forest', color='#3498db')
axes[1].bar(x2 + width2, knn_mcc, width2, label='KNN', color='#e74c3c')
axes[1].set_ylabel('MCC Score')
axes[1].set_title('MCC per Attack Class (Optimised Models)')
axes[1].set_xticks(x2)
axes[1].set_xticklabels(attack_classes)
axes[1].legend()
axes[1].set_ylim(0, 1.1)

plt.tight_layout()
plt.savefig('../figures/all_classifiers_comparison.png', dpi=150)
plt.show()
```

---
## 4. What Each Implementation Added Beyond Template

### 4.1 Linear Classifier (LDA)

| Feature | Template | Our Implementation |
|---------|----------|--------------------|
| Baselines compared | 1 (LDA only) | 3 (LDA, LogReg, Ridge) |
| Hyperparameter tuning | None | 5-fold CV (solver, shrinkage) |
| Feature selection | None | Correlation-based (122 → 30) |
| MCC per class | Not shown | All 5 classes |
| Bias-variance | Template | Full decomposition |
| Visualizations | None | Confusion matrices, bar charts |
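
A correlation-based filter of the kind listed above can be sketched as follows. This assumes the filter keeps a feature when the absolute Pearson correlation between that feature and a numerically encoded target is at least the threshold (0.1); the notebook's exact rule may differ. The helper names, feature names and toy data are all invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    # A constant column has no defined correlation; treat it as 0
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def correlation_filter(columns, target, threshold=0.1):
    """Keep feature names whose |r| with the target is at least the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Toy data, invented for illustration
target = [0, 0, 1, 1, 0, 1]
columns = {
    "informative": [0.1, 0.2, 0.9, 0.8, 0.0, 1.0],
    "constant": [5, 5, 5, 5, 5, 5],
    "uncorrelated": [1, 2, 1, 2, 3, 3],
}

kept = correlation_filter(columns, target)
print(kept)
```

Pearson correlation only captures linear association with the encoded target, so a filter like this can discard features that matter in non-linear ways.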

### 4.2 Ensemble Classifier (Random Forest)

| Feature | Template | Our Implementation |
|---------|----------|--------------------|
| Baselines compared | 0 (commented out) | 3 (RF, ExtraTrees, AdaBoost) |
| Hyperparameter tuning | None | RandomizedSearchCV (10 iter) |
| Feature selection | None | Importance-based (122 → 38) |
| Parameters tuned | None | 6 parameters in grid (4 varied; `max_features` and `class_weight` fixed) |
| Justification table | None | With references |
| Feature importance plot | None | Top 30 features |
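
Importance-based selection with a 95% cumulative cutoff can be sketched like this: rank features by importance and keep adding them until their share of the total reaches 0.95. This is one plausible reading of the approach, not necessarily the notebook's exact code; the helper name and the toy importances are hypothetical.

```python
def select_by_cumulative_importance(importances, cutoff=0.95):
    """Return feature names, most important first, until their summed
    normalised importance reaches the cutoff."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, score in ranked:
        selected.append(name)
        running += score / total
        if running >= cutoff:
            break
    return selected

# Hypothetical importances, invented for illustration
toy_importances = {"f1": 0.50, "f2": 0.30, "f3": 0.16, "f4": 0.03, "f5": 0.01}

selected = select_by_cumulative_importance(toy_importances)
print(selected)
```

Here the first three features already account for 96% of the total importance, so the remaining two are dropped.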

### 4.3 Non-Linear Classifier (KNN)

| Feature | Template | Our Implementation |
|---------|----------|--------------------|
| Baselines compared | 0 (commented out) | 3 (KNN, DT, SVM) |
| Hyperparameter tuning | None | GridSearchCV (16 combinations) |
| Feature selection | None | Correlation-based (122 → 30) |
| Parameters tuned | None | 4 parameters in grid (k, weights, p varied; algorithm fixed at `'auto'`) |
| Distance metrics | Default | Manhattan vs Euclidean compared |
| Best result | N/A | **MCC 0.816** (highest overall) |

---
## 5. Alignment with Assignment Requirements

Based on the professor's lecture (from the transcript):

| Requirement | Linear | Ensemble | Non-Linear |
|-------------|--------|----------|------------|
| Compare 3 baseline algorithms | ✅ | ✅ | ✅ |
| Select best using MCC | ✅ LDA | ✅ RF | ✅ KNN |
| Apply optimization technique | ✅ HP Tuning + Feature Sel | ✅ HP Tuning + Feature Sel | ✅ HP Tuning + Feature Sel |
| Compare baseline vs optimised | ✅ | ✅ | ✅ |
| Show MCC per attack class | ✅ | ✅ | ✅ |
| Bias-variance decomposition | ✅ | ✅ | ✅ |
| Hyperparameter justification table | ✅ | ✅ | ✅ |
| Code is readable | ✅ | ✅ | ✅ |
| Visualizations included | ✅ | ✅ | ✅ |
| Results saved to files | ✅ JSON+CSV | ✅ JSON+CSV | ✅ JSON+CSV |

---
## 6. Final Summary

### Best Overall Model: **KNN (Optimised)**
- **MCC: 0.816** (highest)
- **Accuracy: 87.5%**
- **Features: 30** (75.4% reduction)

### Performance Ranking (by MCC)

| Rank | Model | MCC | Accuracy |
|------|-------|-----|----------|
| 1 | **KNN (optimised)** | **0.816** | 87.5% |
| 2 | Random Forest (optimised) | 0.810 | 86.8% |
| 3 | LDA (optimised) | 0.671 | 77.5% |

### Key Findings

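1. **Ensemble and non-linear models outperform linear** for this intrusion detection task (optimised MCC 0.810 and 0.816 vs 0.671)
2. **Tuning combined with feature reduction** (75.4%) raised KNN's MCC from 0.760 to 0.816; the two effects were not measured separately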
3. **Manhattan distance (p=1)** works better than Euclidean for KNN
4. **R2L class** is hardest to detect across all models (lowest MCC)
5. **DoS attacks** are easiest to detect (MCC > 0.9 for ensemble/non-linear)
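
Findings 4 and 5 can be checked directly against the per-class MCC values reported in Section 3. A short sketch (values copied from that table; variable names are ours):

```python
mcc_per_class = {
    'LDA':           {'benign': 0.673, 'dos': 0.786, 'probe': 0.575, 'r2l': 0.513, 'u2r': 0.579},
    'Random Forest': {'benign': 0.757, 'dos': 0.984, 'probe': 0.911, 'r2l': 0.336, 'u2r': 0.848},
    'KNN':           {'benign': 0.786, 'dos': 0.946, 'probe': 0.850, 'r2l': 0.567, 'u2r': 0.572},
}

# Class with the lowest / highest MCC for each model
hardest = {model: min(scores, key=scores.get) for model, scores in mcc_per_class.items()}
easiest = {model: max(scores, key=scores.get) for model, scores in mcc_per_class.items()}

print("hardest:", hardest)
print("easiest:", easiest)
```

R2L is the lowest-scoring class for all three models and DoS the highest, although for KNN the gap between R2L (0.567) and U2R (0.572) is very small.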

### Optimization Strategies That Worked

| Strategy | Linear | Ensemble | Non-Linear |
|----------|--------|----------|------------|
| HP Tuning | +1% MCC | ~0% | +7% MCC |
| Feature Selection | +0.7% MCC | -0.4% MCC | Included in HP |
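| Combined (baseline → optimised) | +0.007 (+1.1%) | -0.004 (-0.5%) | **+0.056 (+7.4%)** |

Combined row: absolute MCC change from baseline to optimised (Section 3), with the relative change in parentheses.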
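
A percentage change in MCC can mean absolute points or a change relative to the baseline, and the two differ noticeably. The sketch below recomputes both from the baseline and optimised MCC values reported in Section 3.

```python
baseline = {'LDA': 0.664, 'Random Forest': 0.814, 'KNN': 0.760}
optimised = {'LDA': 0.671, 'Random Forest': 0.810, 'KNN': 0.816}

changes = {}
for model in baseline:
    delta = optimised[model] - baseline[model]   # absolute change in MCC
    relative = 100 * delta / baseline[model]     # change as % of baseline MCC
    changes[model] = (delta, relative)
    print(f"{model:<14} {delta:+.3f} MCC ({relative:+.1f}%)")
```

This gives +0.007 (+1.1%) for LDA, -0.004 (-0.5%) for Random Forest and +0.056 (+7.4%) for KNN.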

---
## 7. Files Generated

### Notebooks
- `skML-complete.ipynb` - Professor's template
- `01_Linear_Classifier.ipynb` - LDA implementation
- `02_Ensemble_Classifier.ipynb` - Random Forest implementation
- `03_NonLinear_Classifier.ipynb` - KNN implementation
- `04_Group_Comparison.ipynb` - Group comparison

### Results (JSON)
- `results/linear_lda_results.json`
- `results/ensemble_rf_results.json`
- `results/nonlinear_knn_results.json`
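
For reference, exporting one model's summary to JSON and CSV needs only the standard library. The sketch below is illustrative: the keys and file names are hypothetical and may not match the schema of the result files listed above, and it writes to a temporary directory so it never touches the real `results/` folder.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical schema; the real result files may use different keys
results = {
    "model": "KNN (optimised)",
    "accuracy": 0.875,
    "mcc": 0.816,
    "f1_weighted": 0.865,
    "n_features": 30,
}

out_dir = Path(tempfile.mkdtemp())  # temporary directory for the example

# JSON keeps the numeric types intact
json_path = out_dir / "example_results.json"
json_path.write_text(json.dumps(results, indent=2))

# CSV: one header row plus one data row
csv_path = out_dir / "example_results.csv"
with csv_path.open("w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(results))
    writer.writeheader()
    writer.writerow(results)

reloaded = json.loads(json_path.read_text())
print(reloaded["mcc"])
```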

### Figures
- `figures/linear_feature_correlation.png`
- `figures/linear_baseline_vs_optimised.png`
- `figures/linear_confusion_matrices.png`
- `figures/ensemble_feature_importance.png`
- `figures/ensemble_confusion_matrices.png`
- `figures/nonlinear_feature_correlation.png`
- `figures/nonlinear_confusion_matrices.png`
- `figures/all_classifiers_comparison.png`

---
*Comprehensive comparison for DACS Assignment*