# Comprehensive Comparison: Professor's Template vs Our Implementations

**Purpose:** Compare the given script (skML-complete) with all three classifier implementations.

**Group Members:**
- Muhammad Usama Fazal (TP086008) - Linear Classifier (LDA)
- Imran Shahadat Noble (TP087895) - Ensemble Classifier (Random Forest)
- Md Sohel Rana (TP087437) - Non-Linear Classifier (KNN)

| Classifier | Notebook | Member | Best Model | MCC |
|------------|----------|--------|------------|-----|
| Linear | 01_Linear_Classifier.ipynb | Muhammad Usama Fazal (TP086008) | LDA | 0.671 |
| Ensemble | 02_Ensemble_Classifier.ipynb | Imran Shahadat Noble (TP087895) | Random Forest | 0.815 |
| Non-Linear | 03_NonLinear_Classifier.ipynb | Md Sohel Rana (TP087437) | KNN | 0.816 |

**Data Verified:** 2024-12-13

---

## 1. Structure Comparison: Template vs All Implementations

| Aspect | skML-complete (Template) | Linear (LDA) | Ensemble (RF) | Non-Linear (KNN) |
|--------|-------------------------|--------------|---------------|------------------|
| **Header** | Version info only | Full header | Full header | Full header |
| **Classification** | 2-class | 5-class | 5-class | 5-class |
| **Baseline Models** | Multiple (commented) | LDA, LogReg, Ridge | RF, ExtraTrees, AdaBoost | KNN, DT, SVM |
| **Best Baseline** | Not specified | LDA | Random Forest | KNN |
| **Optimization 1** | Template only | Hyperparameter Tuning | RandomizedSearchCV | GridSearchCV |
| **Optimization 2** | Not included | Correlation Filter | Feature Importance | Correlation Filter |
| **Features Used** | 122 (all) | 30 (reduced) | 37 (reduced) | 30 (reduced) |
| **Visualizations** | None | Yes | Yes | Yes |
| **Results Export** | None | JSON + CSV | JSON + CSV | JSON + CSV |

---
## 2. Code Comparison by Section

### 2.1 Classification Target Setting

**skML-complete (Template):**
```python
twoclass = True     # 2-class: normal vs attack
```

**All Our Implementations:**
```python
twoclass = False    # 5-class: benign, dos, probe, r2l, u2r
```

**Reason:** The assignment requires MCC per attack class, which is only possible with 5-class classification.
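
To illustrate why per-class MCC needs the 5-class labels, the sketch below computes a one-vs-rest MCC for each class from toy predictions. It assumes the one-vs-rest definition; the notebooks may compute it differently, and the labels in `y_true` and `y_pred` are made up for the example.

```python
import math

def binary_mcc(tp, fp, fn, tn):
    """MCC for a 2x2 confusion matrix; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def per_class_mcc(y_true, y_pred, classes):
    """One-vs-rest MCC: each class in turn is 'positive', all others 'negative'."""
    scores = {}
    for cls in classes:
        tp = fp = fn = tn = 0
        for t, p in zip(y_true, y_pred):
            if t == cls and p == cls:
                tp += 1
            elif t != cls and p == cls:
                fp += 1
            elif t == cls and p != cls:
                fn += 1
            else:
                tn += 1
        scores[cls] = binary_mcc(tp, fp, fn, tn)
    return scores

# Toy labels (hypothetical, not taken from the dataset)
classes = ["benign", "dos", "probe", "r2l", "u2r"]
y_true = ["benign", "dos", "dos", "probe", "r2l", "u2r", "benign", "r2l"]
y_pred = ["benign", "dos", "dos", "probe", "benign", "u2r", "benign", "r2l"]

scores = per_class_mcc(y_true, y_pred, classes)
for cls in classes:
    print(f"{cls:<7} {scores[cls]:.3f}")
```

With `twoclass = True` all four attack types collapse into a single label, so a breakdown like this cannot be produced.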

### 2.2 Model Selection Comparison

**skML-complete (Template):**
```python
models = []
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
models.append(("LinearDA", LinearDiscriminantAnalysis()))
from sklearn.naive_bayes import GaussianNB 
models.append(("GaussianNB", GaussianNB()))
# Other models commented out...
```

---

**01_Linear_Classifier.ipynb:**
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, RidgeClassifier

lda_baseline = LinearDiscriminantAnalysis()
lr_baseline = LogisticRegression(max_iter=1000, class_weight='balanced')
ridge_baseline = RidgeClassifier(class_weight='balanced')
```

---

**02_Ensemble_Classifier.ipynb:**
```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

rf_baseline = RandomForestClassifier(n_estimators=100, class_weight='balanced')
et_baseline = ExtraTreesClassifier(n_estimators=100, class_weight='balanced')
ada_baseline = AdaBoostClassifier(n_estimators=100)
```

---

**03_NonLinear_Classifier.ipynb:**
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

knn_baseline = KNeighborsClassifier(n_neighbors=5, weights='distance')
dt_baseline = DecisionTreeClassifier(class_weight='balanced')
svm_baseline = SVC(kernel='rbf', class_weight='balanced')
```

### 2.3 Optimization Strategy Comparison

| Classifier | Optimization 1 | Optimization 2 | Feature Reduction |
|------------|----------------|----------------|-------------------|
| **Template** | Template only (empty) | None | 0% |
| **Linear (LDA)** | 5-fold CV tuning (solver, shrinkage) | Correlation filter (threshold=0.1) | 75.4% |
| **Ensemble (RF)** | RandomizedSearchCV (10 iter, 3-fold) | Feature importance (95% cumulative) | 68.9% |
| **Non-Linear (KNN)** | GridSearchCV (3-fold) | Correlation filter (threshold=0.1) | 75.4% |
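
A minimal sketch of a correlation filter with `threshold=0.1` follows. It assumes the filter keeps features whose absolute Pearson correlation with the target is at least the threshold; the notebooks may define the filter differently, and the feature names and values here are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def correlation_filter(columns, target, threshold=0.1):
    """Keep features whose |correlation with the target| >= threshold (assumed rule)."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical toy features; the target is an encoded class label
target = [0, 0, 0, 1, 1, 1]
columns = {
    "feat_strong":   [1, 2, 3, 7, 8, 9],   # rises with the target
    "feat_inverse":  [9, 8, 7, 3, 2, 1],   # falls with the target, kept via abs()
    "feat_weak":     [1, 2, 1, 1, 2, 1],   # uncorrelated with the target
    "feat_constant": [1, 1, 1, 1, 1, 1],   # zero variance
}

selected = correlation_filter(columns, target, threshold=0.1)
print(selected)
```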

### 2.4 Hyperparameter Tuning Details

#### Linear (LDA)
```python
configs = [
    {'solver': 'svd', 'shrinkage': None},
    {'solver': 'lsqr', 'shrinkage': 'auto'},
    {'solver': 'lsqr', 'shrinkage': 0.1},
    # ... more configs
]
# Best: solver='svd', shrinkage=None
```

#### Ensemble (Random Forest)
```python
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced']
}
# Best: n_estimators=100, max_depth=None, min_samples_split=2
```
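
To make the search budget concrete, the sketch below enumerates every combination in the grid above and then draws 10 of them, which is roughly what a randomised search over list-valued parameters does. The seed `42` is an assumption for reproducibility, and the fit count excludes any final refit on the full training set.

```python
import itertools
import random

param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt'],
    'class_weight': ['balanced'],
}

# Every possible combination of the listed values
keys = list(param_grid)
candidates = [dict(zip(keys, values))
              for values in itertools.product(*(param_grid[k] for k in keys))]

# Draw 10 distinct candidates, as in a 10-iteration randomised search
rng = random.Random(42)  # assumed seed
sampled = rng.sample(candidates, 10)

n_folds = 3
n_fits = len(sampled) * n_folds

print(f"Grid size: {len(candidates)} candidates")
print(f"Sampled:   {len(sampled)} candidates x {n_folds} folds = {n_fits} fits")
```

Only four of the six parameters actually vary, so the full grid holds 16 candidates and the randomised search covers 10 of them.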

#### Non-Linear (KNN)
```python
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],  # Manhattan vs Euclidean
    'algorithm': ['auto']
}
# Best: n_neighbors=3, weights='distance', p=1 (Manhattan)
```
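
The best configuration (k=3, Manhattan distance, distance-weighted voting) can be sketched in plain Python as below. This is a simplified illustration rather than scikit-learn's implementation; the training points and labels are hypothetical, and an exact match (distance 0) is simply returned directly.

```python
from collections import defaultdict

def manhattan(a, b):
    """L1 distance, i.e. the Minkowski distance with p=1."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted vote among the k nearest neighbours."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: manhattan(pair[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        dist = manhattan(point, query)
        if dist == 0:
            return label          # simplification: an exact match decides
        votes[label] += 1.0 / dist  # closer neighbours count for more
    return max(votes, key=votes.get)

# Hypothetical 2-D training points
train_X = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0), (9.0, 9.0)]
train_y = ["r2l", "benign", "benign", "dos"]

# The 3 nearest neighbours are one 'r2l' (distance 0.5) and two 'benign'
# (distances 2.5 and 3.5). A uniform vote would pick 'benign';
# the distance-weighted vote picks 'r2l'.
print(knn_predict(train_X, train_y, (0.5, 0.0)))
print(knn_predict(train_X, train_y, (8.0, 9.0)))
```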

---
## 3. Performance Results Comparison

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Baseline Comparison (Best from each category) - VERIFIED DATA
baseline_results = pd.DataFrame({
    'Category': ['Linear', 'Ensemble', 'Non-Linear'],
    'Best Model': ['LDA', 'Random Forest', 'KNN'],
    'Accuracy': [0.7695, 0.8712, 0.8369],
    'MCC': [0.6644, 0.8135, 0.7602],
    'F1 (Weighted)': [0.7497, 0.8453, 0.8120]
})

print("="*70)
print("BASELINE COMPARISON (Best from each category)")
print("="*70)
print(baseline_results.to_string(index=False))
```

```python
# Optimised Model Comparison - VERIFIED DATA
optimised_results = pd.DataFrame({
    'Category': ['Linear', 'Ensemble', 'Non-Linear'],
    'Model': ['LDA (optimised)', 'Random Forest (optimised)', 'KNN (optimised)'],
    'Accuracy': [0.7748, 0.8709, 0.8752],
    'MCC': [0.6712, 0.8146, 0.8161],
    'F1 (Weighted)': [0.7628, 0.8401, 0.8655],
    'Features': [30, 37, 30]
})

print("\n" + "="*70)
print("OPTIMISED MODEL COMPARISON")
print("="*70)
print(optimised_results.to_string(index=False))
```

```python
# MCC Per Attack Class Comparison - VERIFIED DATA
mcc_per_class = pd.DataFrame({
    'Attack Class': ['benign', 'dos', 'probe', 'r2l', 'u2r'],
    'LDA': [0.673, 0.786, 0.575, 0.513, 0.579],
    'Random Forest': [0.763, 0.984, 0.926, 0.330, 0.820],
    'KNN': [0.786, 0.946, 0.850, 0.567, 0.572]
})

print("\n" + "="*70)
print("MCC PER ATTACK CLASS (Optimised Models)")
print("="*70)
print(mcc_per_class.to_string(index=False))
```

```python
# Visualization: MCC Comparison - VERIFIED DATA
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall MCC - VERIFIED VALUES
models = ['LDA', 'Random Forest', 'KNN']
baseline_mcc = [0.6644, 0.8135, 0.7602]  # Verified baseline MCC
optimised_mcc = [0.6712, 0.8146, 0.8161]  # Verified optimised MCC

x = np.arange(len(models))
width = 0.35

bars1 = axes[0].bar(x - width/2, baseline_mcc, width, label='Baseline', color='steelblue')
bars2 = axes[0].bar(x + width/2, optimised_mcc, width, label='Optimised', color='darkorange')
axes[0].set_ylabel('MCC Score')
axes[0].set_title('Overall MCC: Baseline vs Optimised')
axes[0].set_xticks(x)
axes[0].set_xticklabels(models)
axes[0].legend()
axes[0].set_ylim(0, 1)

# Add value labels on top of every bar in both groups
for bar in (*bars1, *bars2):
    axes[0].annotate(f'{bar.get_height():.3f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                     ha='center', va='bottom', fontsize=9)

# MCC per class (optimised only) - VERIFIED VALUES
attack_classes = ['benign', 'dos', 'probe', 'r2l', 'u2r']
lda_mcc = [0.673, 0.786, 0.575, 0.513, 0.579]     # From verified results
rf_mcc = [0.763, 0.984, 0.926, 0.330, 0.820]      # From verified results
knn_mcc = [0.786, 0.946, 0.850, 0.567, 0.572]     # From verified results

x2 = np.arange(len(attack_classes))
width2 = 0.25

axes[1].bar(x2 - width2, lda_mcc, width2, label='LDA', color='#2ecc71')
axes[1].bar(x2, rf_mcc, width2, label='Random Forest', color='#3498db')
axes[1].bar(x2 + width2, knn_mcc, width2, label='KNN', color='#e74c3c')
axes[1].set_ylabel('MCC Score')
axes[1].set_title('MCC per Attack Class (Optimised Models)')
axes[1].set_xticks(x2)
axes[1].set_xticklabels(attack_classes)
axes[1].legend()
axes[1].set_ylim(0, 1.1)

plt.tight_layout()
plt.savefig('../figures/all_classifiers_comparison.png', dpi=150)
plt.show()

# Relative change in MCC, so the printed percentages match the tables in Section 6
# (the previous version printed the absolute difference x 100 with a hard-coded '+').
print("\nIMPROVEMENT SUMMARY (relative MCC change):")
for name, base, opt in zip(models, baseline_mcc, optimised_mcc):
    label = f"{name}:"
    print(f"{label:<14} {base:.4f} → {opt:.4f} ({(opt - base) / base * 100:+.2f}%)")
```

---
## 4. What Each Implementation Added Beyond Template

### 4.1 Linear Classifier (LDA)

| Feature | Template | Our Implementation |
|---------|----------|--------------------|
| Baselines compared | 1 (LDA only) | 3 (LDA, LogReg, Ridge) |
| Hyperparameter tuning | None | 5-fold CV (solver, shrinkage) |
| Feature selection | None | Correlation-based (122 → 30) |
| MCC per class | Not shown | All 5 classes |
| Bias-variance | Template | Full decomposition |
| Visualizations | None | Confusion matrices, bar charts |

### 4.2 Ensemble Classifier (Random Forest)

| Feature | Template | Our Implementation |
|---------|----------|--------------------|
| Baselines compared | 0 (commented out) | 3 (RF, ExtraTrees, AdaBoost) |
| Hyperparameter tuning | None | RandomizedSearchCV (10 iter) |
| Feature selection | None | Importance-based (122 → 38) |
| Parameters tuned | None | 6 parameters |
| Justification table | None | With references |
| Feature importance plot | None | Top 30 features |
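
The importance-based selection with a 95% cumulative cut-off can be sketched as follows. This is an illustration under assumptions: features are ranked by importance and kept until the running share reaches the cut-off, with the feature that crosses it included. The notebook's exact rule may differ, and the importance values are made up.

```python
def select_by_cumulative_importance(importances, coverage=0.95):
    """Keep top-ranked features until their cumulative share reaches `coverage`."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, score in ranked:
        selected.append(name)
        running += score
        if running / total >= coverage:
            break  # the feature that crosses the cut-off is kept (assumption)
    return selected

# Hypothetical importances, e.g. as read from a fitted forest's feature_importances_
importances = {"f1": 0.50, "f2": 0.30, "f3": 0.12, "f4": 0.05, "f5": 0.03}

selected = select_by_cumulative_importance(importances, coverage=0.95)
print(selected)
```

Here the cumulative shares are 0.50, 0.80, 0.92 and 0.97, so four of the five toy features are kept.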

### 4.3 Non-Linear Classifier (KNN)

| Feature | Template | Our Implementation |
|---------|----------|--------------------|
| Baselines compared | 0 (commented out) | 3 (KNN, DT, SVM) |
| Hyperparameter tuning | None | GridSearchCV (16 combinations) |
| Feature selection | None | Correlation-based (122 → 30) |
| Parameters tuned | None | 4 parameters (k, weights, p, algorithm) |
| Distance metrics | Default | Manhattan vs Euclidean compared |
| Best result | N/A | **MCC 0.816** (highest overall) |

---
## 5. Alignment with Assignment Requirements

Based on the professor's lecture (from the transcript):

| Requirement | Linear | Ensemble | Non-Linear |
|-------------|--------|----------|------------|
| Compare 3 baseline algorithms | ✅ | ✅ | ✅ |
| Select best using MCC | ✅ LDA | ✅ RF | ✅ KNN |
| Apply optimization technique | ✅ HP Tuning + Feature Sel | ✅ HP Tuning + Feature Sel | ✅ HP Tuning + Feature Sel |
| Compare baseline vs optimised | ✅ | ✅ | ✅ |
| Show MCC per attack class | ✅ | ✅ | ✅ |
| Bias-variance decomposition | ✅ | ✅ | ✅ |
| Hyperparameter justification table | ✅ | ✅ | ✅ |
| Code is readable | ✅ | ✅ | ✅ |
| Visualizations included | ✅ | ✅ | ✅ |
| Results saved to files | ✅ JSON+CSV | ✅ JSON+CSV | ✅ JSON+CSV |

---
## 6. Final Summary

### Best Overall Model: **KNN (Optimised)**
- **MCC: 0.8161** (highest)
- **Accuracy: 87.52%**
- **Features: 30** (75.4% reduction from 122)
- **Member: Md Sohel Rana (TP087437)**

### Performance Ranking (by MCC)

| Rank | Model | MCC | Accuracy | Member |
|------|-------|-----|----------|--------|
| 1 | **KNN (optimised)** | **0.8161** | 87.52% | Md Sohel Rana (TP087437) |
| 2 | Random Forest (optimised) | 0.8146 | 87.09% | Imran Shahadat Noble (TP087895) |
| 3 | LDA (optimised) | 0.6712 | 77.48% | Muhammad Usama Fazal (TP086008) |

### MCC Improvement from Baseline to Optimised

| Model | Baseline MCC | Optimised MCC | Relative Improvement |
|-------|-------------|---------------|-------------|
| LDA | 0.6644 | 0.6712 | +1.02% |
| Random Forest | 0.8135 | 0.8146 | +0.14% |
| **KNN** | **0.7602** | **0.8161** | **+7.35%** |
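
The percentages in this table are relative changes, that is, the MCC difference divided by the baseline MCC. The short sketch below recomputes them from the reported values and shows the absolute difference alongside for comparison; `mcc_gain` is a helper introduced only for this example.

```python
# Reported (baseline, optimised) MCC values from the table above
results = {
    "LDA": (0.6644, 0.6712),
    "Random Forest": (0.8135, 0.8146),
    "KNN": (0.7602, 0.8161),
}

def mcc_gain(baseline, optimised):
    """Return (absolute MCC difference, relative change in percent)."""
    absolute = optimised - baseline
    relative_pct = absolute / baseline * 100
    return absolute, relative_pct

gains = {model: mcc_gain(base, opt) for model, (base, opt) in results.items()}

for model, (absolute, relative_pct) in gains.items():
    print(f"{model:<14} {absolute:+.4f} MCC  ({relative_pct:+.2f}% relative)")
```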

### Key Findings

1. **Non-linear models outperform linear** for this intrusion detection task
   - KNN achieved highest MCC (0.8161) vs LDA (0.6712)

2. **Tuning combined with feature reduction** improves KNN significantly (+7.35% relative MCC)
   - Reduced from 122 to 30 features (75.4% reduction)

3. **Distance metric matters for KNN**
   - Manhattan distance (p=1) outperforms Euclidean for this dataset

4. **Attack class performance varies**
   - **DoS attacks**: Easiest to detect (RF: 0.984, KNN: 0.946)
   - **R2L attacks**: Hardest to detect (RF: 0.330, KNN: 0.567, LDA: 0.513)

5. **Ensemble methods excel at common attacks**
   - Random Forest achieves 0.984 MCC on DoS, 0.926 on Probe
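
The easiest and hardest classes named in finding 4 can be read directly off the per-class MCC values reported in Section 3. A small sketch, with `summary` as an illustrative variable name:

```python
attack_classes = ['benign', 'dos', 'probe', 'r2l', 'u2r']
mcc_per_class = {
    'LDA':           [0.673, 0.786, 0.575, 0.513, 0.579],
    'Random Forest': [0.763, 0.984, 0.926, 0.330, 0.820],
    'KNN':           [0.786, 0.946, 0.850, 0.567, 0.572],
}

# For each model, find the class with the highest and the lowest MCC
summary = {}
for model, scores in mcc_per_class.items():
    by_class = dict(zip(attack_classes, scores))
    easiest = max(by_class, key=by_class.get)
    hardest = min(by_class, key=by_class.get)
    summary[model] = (easiest, hardest)
    print(f"{model:<14} easiest: {easiest} ({by_class[easiest]:.3f})  "
          f"hardest: {hardest} ({by_class[hardest]:.3f})")
```

All three models rank DoS highest and R2L lowest, although for KNN the gap between R2L (0.567) and U2R (0.572) is small.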

### Optimization Strategies Results

| Strategy | LDA Impact | RF Impact | KNN Impact |
|----------|-----------|-----------|------------|
| Hyperparameter Tuning | +0.68% | +0.14% | +5.59% |
| Feature Selection | Correlation filter (30 features) | Importance (37 features) | Correlation filter (30 features) |
| **Combined Effect** | **+1.02%** | **+0.14%** | **+7.35%** |

### Conclusion

The **KNN classifier** with Manhattan distance (p=1), k=3 neighbours, and distance-weighted voting achieves the best overall performance (MCC 0.8161), narrowly ahead of the optimised Random Forest (0.8146). Its 7.35% relative MCC gain over the baseline shows that KNN benefits substantially from hyperparameter tuning combined with feature selection, particularly the removal of redundant or correlated features.

---
## 7. Files Generated

### Notebooks
- `skML-complete.ipynb` - Professor's template
- `01_Linear_Classifier.ipynb` - LDA implementation
- `02_Ensemble_Classifier.ipynb` - Random Forest implementation
- `03_NonLinear_Classifier.ipynb` - KNN implementation
- `04_Group_Comparison.ipynb` - Group comparison

### Results (JSON)
- `results/linear_lda_results.json`
- `results/ensemble_rf_results.json`
- `results/nonlinear_knn_results.json`

### Figures
- `figures/linear_feature_correlation.png`
- `figures/linear_baseline_vs_optimised.png`
- `figures/linear_confusion_matrices.png`
- `figures/ensemble_feature_importance.png`
- `figures/ensemble_confusion_matrices.png`
- `figures/nonlinear_feature_correlation.png`
- `figures/nonlinear_confusion_matrices.png`

---
*Comprehensive comparison for DACS Assignment*