# Model Evaluation Visualisations for KNN Classifier

This section provides a detailed explanation of various visualisations that can be used to evaluate and improve the performance of a K-Nearest Neighbours (KNN) model. These plots help us detect overfitting, underfitting, performance differences across classes, and optimal hyperparameters.

---

## 1. Accuracy vs. K Plot

**Purpose:**  
This plot helps determine the optimal number of neighbours (`k`) for the KNN algorithm. It shows how training and validation accuracy change as `k` increases.

**Interpretation:**  
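- If both training and validation accuracy are low for every `k`, the model might be underfitting.
- If training accuracy is high but validation accuracy is low at small `k`, the model may be overfitting. At `k = 1` training accuracy is usually perfect, because each training point is its own nearest neighbour.
- If accuracy falls steadily as `k` grows, the neighbourhood has become too large and the model is over-smoothing.
- Choose the `k` where validation accuracy is high and stable across neighbouring values of `k`.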
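
One way to make "high and stable" concrete is to score each `k` with cross-validation instead of a single split. The sketch below is illustrative only: the Iris dataset, the `StandardScaler` pipeline, and the names `X_demo`, `y_demo`, and `mean_scores` are assumptions made for the example and are not part of the analysis in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with your own features and labels
X_demo, y_demo = load_iris(return_X_y=True)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
k_values = list(range(1, 21))
mean_scores = []

for k in k_values:
    # The scaler is refit on each training fold, so the validation fold
    # never influences the scaling parameters
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
    mean_scores.append(fold_scores.mean())

# np.argmax returns the first maximum, i.e. the smallest k among ties
best_k = k_values[int(np.argmax(mean_scores))]
print(f"Best k by mean CV accuracy: {best_k} ({max(mean_scores):.3f})")
```

Averaging over folds reduces the chance of picking a `k` that only looks good on one particular split.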

---

## 2. Learning Curve

**Purpose:**  
The learning curve plots training and validation accuracy against the number of training examples.

**Interpretation:**
- If both training and validation scores are low → **underfitting**
- If training is high but validation is low → **overfitting**
- If both scores are high and close → **good generalisation**

---

## 3. Confusion Matrix with Precision, Recall, F1 Score

**Purpose:**  
Displays how well the model performs per class by showing actual vs predicted labels. It is useful for multi-class classification problems.

**Additional Metrics Shown:**
- **Accuracy** – Overall correct predictions
- **Precision** – Correct positive predictions / Total predicted positives
- **Recall** – Correct positive predictions / Total actual positives
- **F1 Score** – Harmonic mean of precision and recall
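
The definitions above can be checked by hand against a confusion matrix. The following sketch uses a made-up 3-class matrix (`cm_demo` is hypothetical, not output from this notebook) with rows as actual classes and columns as predicted classes, which is the layout scikit-learn's `confusion_matrix` returns.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm_demo = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
])

true_pos = np.diag(cm_demo)              # correct predictions per class
predicted_totals = cm_demo.sum(axis=0)   # column sums: times each class was predicted
actual_totals = cm_demo.sum(axis=1)      # row sums: times each class actually occurred

precision = true_pos / predicted_totals
recall = true_pos / actual_totals
f1 = 2 * precision * recall / (precision + recall)
accuracy_demo = true_pos.sum() / cm_demo.sum()

print("Precision:", np.round(precision, 3))
print("Recall:   ", np.round(recall, 3))
print("F1:       ", np.round(f1, 3))
print("Accuracy: ", accuracy_demo)
```

If a class is never predicted, its column sum is zero and the precision is undefined; `classification_report` handles that case through its `zero_division` argument.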

---

## 4. ROC Curve (One-vs-Rest)

**Purpose:**  
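Shows the trade-off between true positive rate and false positive rate as the decision threshold varies. For multiclass problems, one curve is drawn per class by treating that class as positive and all other classes as negative (one-vs-rest).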

**Interpretation:**
- AUC (Area Under Curve) close to 1 = good classifier.
- AUC close to 0.5 = no better than random guessing.
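
A helpful way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up scores (`y_true_demo` and `scores_demo` are hypothetical) and compares the result with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true_demo, scores_demo)
print(pairwise_auc, sklearn_auc)
```

With uniform weights, KNN's `predict_proba` can only return multiples of `1/k`, so tied scores are common and the ROC curve for small `k` has few distinct points.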

---

## 5. Decision Boundary (2D Visualisation)

**Purpose:**  
Helps visualise how the KNN classifier separates different classes in a 2D feature space.

**Interpretation:**
- Clear and smooth boundaries with accurate test classification indicate a good fit.
- Irregular and noisy boundaries may indicate overfitting.

*Note: This is only possible when using two features, or after dimensionality reduction (e.g., PCA).*
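
When the data has more than two features, one option is to project it with PCA before drawing the boundary. This is a minimal sketch under stated assumptions: the Iris dataset stands in for real data, and the names `projector`, `X_2d`, and `knn_2d` are illustrative. It stops at computing the grid predictions `Z`, which can then be passed to `plt.contourf`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with 4 features (assumption): replace with your own X and y
X_demo, y_demo = load_iris(return_X_y=True)

# Standardise, then keep the two leading principal components
projector = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = projector.fit_transform(X_demo)

# Train KNN on the 2D projection only
knn_2d = KNeighborsClassifier(n_neighbors=5).fit(X_2d, y_demo)

# Evaluate the classifier on a 200 x 200 grid covering the projected data
pad = 0.5
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - pad, X_2d[:, 0].max() + pad, 200),
    np.linspace(X_2d[:, 1].min() - pad, X_2d[:, 1].max() + pad, 200),
)
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print(Z.shape, np.unique(Z))
```

The boundary drawn this way belongs to a KNN trained on the two projected components, so it approximates rather than reproduces the behaviour of a model trained on all features.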

---

## 6. Error Analysis Plot (Correct vs Incorrect Predictions)

**Purpose:**  
Visualises how many predictions were correct or incorrect for each class.

**Interpretation:**
- Helps identify which classes the model struggles to classify correctly.
- Useful for identifying class imbalance or confusion between similar classes.

---

## Summary of Visualisations

| Visualisation               | Purpose                                      |
|----------------------------|----------------------------------------------|
| Accuracy vs. K             | Choose the optimal number of neighbours      |
| Learning Curve             | Diagnose underfitting or overfitting         |
| Confusion Matrix + Metrics | Evaluate class-level prediction performance  |
| ROC Curve (One-vs-Rest)    | Trade-off between TPR and FPR per class      |
| Decision Boundary (2D)     | Visualise how KNN separates classes          |
| Error Plot                 | Understand per-class error distribution      |

---


# Install

In [None]:
!pip install scikit-learn matplotlib seaborn pandas --quiet

# Overall Code for model training

In [None]:
# Plotting and evaluation imports shared by the cells below
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, accuracy_score



In [None]:
# --- Import libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, classification_report, r2_score, mean_absolute_error

# NOTE: `df` must already hold the dataset as a pandas DataFrame that includes
# the target column 'PI1'; the loading step is not shown in this notebook.

# --- Basic cleaning ---
df = df.dropna(subset=['PI1'])                     # drop rows with missing target
df = df.dropna(axis=1, how='all')                  # remove empty columns
df = df.ffill()                                    # simple forward fill (fillna(method='ffill') is deprecated)

# --- Encode categorical variables ---
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col].astype(str))

# --- Define features (X) and target (y) ---
y = df['PI1']
X = df.drop(columns=['PI1'])

# --- Split data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Decide automatically: classification or regression ---
if y.nunique() < 10 and not pd.api.types.is_float_dtype(y):   # likely classification
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("✅ KNN Classification Results")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))
else:                                         # regression
    model = KNeighborsRegressor(n_neighbors=3)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("✅ KNN Regression Results")
    print("R²:", r2_score(y_test, y_pred))
    print("MAE:", mean_absolute_error(y_test, y_pred))

print("\nModel training complete ✔️")


# Confusion matrix and learning curve

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    classification_report, accuracy_score
)

# Reuse the classifier trained above (assumes the classification branch ran, not the regressor)
knn = model

# --- Confusion Matrix ---
labels = np.unique(y)
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap='Blues', values_format='d')
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()

# --- Classification Report & Accuracy ---
print("Classification Report:")
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

# --- Learning Curve ---
train_sizes, train_scores, test_scores = learning_curve(
    estimator=knn,
    X=X, y=y,
    cv=5,
    scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10),
    n_jobs=-1,
    shuffle=True,
    random_state=42   # makes the shuffled subsets reproducible
)

train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_scores_mean, 'o-', label='Training score')
plt.plot(train_sizes, test_scores_mean, 'o-', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('Accuracy')
plt.title('Learning Curve — KNN Classifier')
plt.legend(loc='best')
plt.grid(True)
plt.tight_layout()
plt.show()


# Test if model fit is optimal

In [None]:
# Evaluate model fit status based on learning curve
def evaluate_fit(train_scores_mean, test_scores_mean, threshold_gap=0.05, threshold_low=0.75):
    final_train = train_scores_mean[-1]
    final_val = test_scores_mean[-1]
    gap = final_train - final_val

    print("\n--- Fit Evaluation ---")
    print(f"Final Training Accuracy: {final_train:.4f}")
    print(f"Final Validation Accuracy: {final_val:.4f}")
    print(f"Train-Validation Gap: {gap:.4f}")

    if final_train < threshold_low and final_val < threshold_low:
        print("Model is likely **Underfitting**: Both training and validation accuracy are low.")
    elif gap > threshold_gap:
        print("Model is likely **Overfitting**: Training accuracy is high, but validation accuracy drops.")
    else:
        print("Model is likely a **Good Fit**: Training and validation accuracy are both high and close.")


evaluate_fit(train_scores_mean, test_scores_mean)


# Find best value for k

In [None]:
# Try different k values
# Note: the held-out test split stands in for a validation set here
k_range = range(1, 21)
train_acc = []
test_acc = []

for k in k_range:
    knn_k = KNeighborsClassifier(n_neighbors=k)   # separate name so the trained `model` is not overwritten
    knn_k.fit(X_train, y_train)
    train_acc.append(knn_k.score(X_train, y_train))
    test_acc.append(knn_k.score(X_test, y_test))


plt.figure(figsize=(8, 5))
plt.plot(k_range, train_acc, label='Training Accuracy', marker='o')
plt.plot(k_range, test_acc, label='Validation Accuracy', marker='x')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. K Value')
plt.legend()
plt.grid(True)
plt.show()


# Determine ROC curve

In [None]:
# One-vs-rest ROC curves (assumes three or more classes; with only two,
# label_binarize returns a single column)
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split


classes = np.unique(y)
y_bin = label_binarize(y, classes=classes)
n_classes = y_bin.shape[1]


classifier = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=3))
# Distinct variable names keep the earlier X_train / X_test split intact
X_train_roc, X_test_roc, y_train_bin, y_test_bin = train_test_split(X, y_bin, test_size=0.3, random_state=42)
classifier.fit(X_train_roc, y_train_bin)
y_score = classifier.predict_proba(X_test_roc)


plt.figure(figsize=(8, 6))
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'Class {classes[i]} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.title("ROC Curves for Each Class (One-vs-Rest)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.grid(True)
plt.show()


# Illustrate the feature space as a 2D representation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# --- Assume you already have df and target column PI1 (categorical) ---
# If your PI1 is numeric/continuous, use a regressor instead (decision boundaries below are for classification).

# Build X, y from your DataFrame
# Choose two numeric columns to visualise (edit these names as you like):
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
assert 'PI1' in df.columns, "PI1 must be present in df"
X = df[num_cols].drop(columns=['PI1'], errors='ignore')
y = df['PI1']

# Keep only the first two numeric features for 2D plot (or choose by name)
if len(X.columns) < 2:
    raise ValueError("Need at least two numeric features to plot a 2D decision boundary.")
feat1, feat2 = X.columns[:2]
X_vis_df = X[[feat1, feat2]].copy()

# Encode y if needed
if y.dtype == 'O' or not np.issubdtype(y.dtype, np.number):
    le = LabelEncoder()
    y_enc = le.fit_transform(y.astype(str))
    class_names = le.classes_
else:
    # ensure consecutive ints for plotting
    classes, y_enc = np.unique(y, return_inverse=True)
    class_names = classes.astype(str)

# Train/test split
X_train_vis, X_test_vis, y_train_vis, y_test_vis = train_test_split(
    X_vis_df.values, y_enc, test_size=0.3, random_state=42, stratify=y_enc
)

# Fit KNN
knn_vis = KNeighborsClassifier(n_neighbors=3)
knn_vis.fit(X_train_vis, y_train_vis)

# Mesh grid: a fixed 300 x 300 resolution keeps memory bounded whatever the feature scale
x_min, x_max = X_vis_df[feat1].min() - 1, X_vis_df[feat1].max() + 1
y_min, y_max = X_vis_df[feat2].min() - 1, X_vis_df[feat2].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))

Z = knn_vis.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Colormaps sized to number of classes
n_classes = len(np.unique(y_enc))
# light background colors
light_colors = ['#FFEEEE','#EEFFEE','#EEEEFF','#FFF0CC','#E6E6FA','#E0FFFF','#FFE4E1','#F0FFF0','#FFFACD']
# bold point colors
bold_colors  = ['#FF3333','#33AA33','#3333FF','#FF9900','#800080','#008B8B','#CD5C5C','#228B22','#DAA520']
cmap_light = ListedColormap(light_colors[:n_classes])
cmap_bold  = ListedColormap(bold_colors[:n_classes])

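plt.figure(figsize=(9, 7))
# One filled band per class so background colours line up with the point colours
plt.contourf(xx, yy, Z, levels=np.arange(n_classes + 1) - 0.5, cmap=cmap_light, alpha=0.8)

# Plot train/test points (fixed vmin/vmax keeps class colours consistent across both scatters)
sc1 = plt.scatter(X_train_vis[:, 0], X_train_vis[:, 1], c=y_train_vis, cmap=cmap_bold,
                  vmin=0, vmax=n_classes - 1,
                  edgecolor='k', marker='o', label='Train', alpha=0.9)
sc2 = plt.scatter(X_test_vis[:, 0], X_test_vis[:, 1], c=y_test_vis, cmap=cmap_bold,
                  vmin=0, vmax=n_classes - 1,
                  edgecolor='k', marker='^', label='Test', alpha=0.9)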

plt.xlabel(feat1)
plt.ylabel(feat2)
plt.title("2D Decision Boundary — KNN Classifier")

# Custom legend with class labels
handles, _ = sc1.legend_elements(prop="colors", alpha=0.9)
class_legend = plt.legend(handles, class_names, title="Classes", loc="upper right")
plt.gca().add_artist(class_legend)
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()


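# Error analysis plot

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# y_test and y_pred come from the first training cell (classification branch)
error_df = pd.DataFrame({'Actual': np.asarray(y_test), 'Predicted': np.asarray(y_pred)})
error_df['Correct'] = error_df['Actual'] == error_df['Predicted']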

sns.countplot(x='Actual', hue='Correct', data=error_df)
plt.title("Correct vs Incorrect Predictions by Class")
plt.xlabel("True Class")
plt.ylabel("Count")
plt.legend(title='Prediction Correct')
plt.show()
