# Traffic Sign Classification Using KNN

For this project I built a K-Nearest Neighbours (KNN) model to classify traffic sign images into their correct classes. The goal was to find out how well KNN handles image data, while addressing class imbalance and applying dimensionality reduction to improve the model's effectiveness and efficiency. Below is an overview of the steps I took, along with my results, observations, and possible areas for improvement.

# Import Libraries

In [None]:
import os
import cv2
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Set Paths to Data

In [None]:
data_path = '/Users/fatima..../Documents/GitHub/Intro-AI-Coursework/data/traffic_Data/DATA'
test_path = '/Users/fatima..../Documents/GitHub/Intro-AI-Coursework/data/traffic_Data/TEST'
labels_path = '/Users/fatima..../Documents/GitHub/Intro-AI-Coursework/data/labels.csv'

# Loading and Processing Data
To map each class ID to a traffic sign name, I first loaded the class labels from a CSV file. I then examined the number of images in each class to get a better understanding of the dataset:

In [None]:
labels = pd.read_csv(labels_path)
class_image_counts = {}
for class_id in range(58):
    directory = os.path.join(data_path, str(class_id))
    if os.path.exists(directory):
        class_image_counts[class_id] = len(os.listdir(directory))
plt.figure(figsize=(10, 6))
plt.bar(class_image_counts.keys(), class_image_counts.values())
plt.xlabel('Class ID')
plt.ylabel('Number of Images')
plt.title('Number of Images in Each Class')
plt.show()



The bar chart showed a clear class imbalance: some classes have more than 200 images while others have fewer than 10. Because of this imbalance, the model may favour classes with more examples, which could reduce overall accuracy.

# Handling Class Imbalance
To reduce the class imbalance, I capped the number of samples per class at 100 and duplicated images for classes with fewer than 10 samples. The goal of this change was to make the class distribution as balanced as possible.


In [None]:
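images, class_ids = [], []
max_samples = 100  # cap on the number of images loaded per class
min_samples = 10   # classes below this size are padded with duplicates
for class_id in range(58):
    directory = os.path.join(data_path, str(class_id))
    if not os.path.exists(directory):
        continue
    class_images = []
    for img_file in sorted(os.listdir(directory)):
        if len(class_images) >= max_samples:
            break
        image = cv2.imread(os.path.join(directory, img_file))
        if image is None:  # skip files that OpenCV cannot read
            continue
        image = cv2.resize(image, (32, 32))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        class_images.append(image.flatten())
    # Pad small classes by cycling through their own images,
    # rather than repeating only the last image that was loaded
    n_real = len(class_images)
    if 0 < n_real < min_samples:
        for i in range(min_samples - n_real):
            class_images.append(class_images[i % n_real])
    images.extend(class_images)
    class_ids.extend([class_id] * len(class_images))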
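
The capping and padding logic can also be expressed on label arrays alone, which makes it easy to check in isolation. The sketch below is a variant rather than the code used above: the helper `balance_indices` and the toy labels are hypothetical, and it samples at random instead of taking the first files in each folder.

```python
import numpy as np

def balance_indices(labels, min_samples=10, max_samples=100, seed=0):
    """Return row indices that cap each class at max_samples and
    oversample (with replacement) any class below min_samples."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) > max_samples:
            # Too many examples: keep a random subset
            idx = rng.choice(idx, size=max_samples, replace=False)
        elif len(idx) < min_samples:
            # Too few examples: add randomly chosen duplicates
            extra = rng.choice(idx, size=min_samples - len(idx), replace=True)
            idx = np.concatenate([idx, extra])
        chosen.append(idx)
    return np.concatenate(chosen)

# Toy labels (hypothetical): one large, one tiny and one medium class
toy_labels = np.array([0] * 250 + [1] * 4 + [2] * 40)
balanced = balance_indices(toy_labels)
counts = np.bincount(toy_labels[balanced])
print(counts)  # class 0 capped at 100, class 1 raised to 10, class 2 unchanged
```

The returned indices can be used to select rows from both the image array and the label array, so the two stay aligned.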


# Scaling Features and Reducing Dimensionality
Once the dataset had been loaded and balanced, I standardised the pixel features so that each has zero mean and unit variance. This is crucial for KNN because it relies on distance calculations, which are distorted when features are on different scales.
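
As a small illustration of what standardisation does, the sketch below applies `StandardScaler` to a synthetic two-column array whose columns have very different spreads. The array `toy` is made up for the example and is not the traffic sign data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features: one with a small spread, one with a large spread
toy = np.column_stack([rng.normal(0, 1, 200), rng.normal(50, 100, 200)])

scaled = StandardScaler().fit_transform(toy)

print("std before:", toy.std(axis=0))
print("mean after:", scaled.mean(axis=0).round(6))
print("std after: ", scaled.std(axis=0).round(6))
```

Before scaling, Euclidean distances between rows are dominated by the second column. After scaling, both columns contribute on the same footing.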

I then used Principal Component Analysis (PCA) to reduce the number of features. I chose 350 principal components because, according to the cumulative variance plot, they capture most of the variance in the data.

In [None]:
# Convert to numpy arrays
X = np.array(images)
y = np.array(class_ids)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=350)  # Adjust components based on cumulative variance
X_pca = pca.fit_transform(X_scaled)

pca_test = PCA().fit(X_scaled)
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('PCA Cumulative Variance')
plt.show()

First, I noticed that the cumulative variance increases rapidly.  This indicates that a significant amount of the dataset's information is captured by the first few components. In other words, the first components capture the most significant features (or patterns) in the data.

The curve levels off after about 350 components, indicating that adding more components beyond this point does not significantly increase the explained variance. By then almost all the significant variance has been captured, as the curve has essentially flattened.

I chose 350 components because they make up more than 95% of the dataset’s variance. This suggests that I'm keeping most of the key information while drastically reducing the data's dimensionality, which is necessary to increase the model's efficiency without sacrificing important information.
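
The number of components can also be read off the cumulative variance programmatically instead of from the plot. The sketch below uses a synthetic matrix `toy_X` as a stand-in for `X_scaled`, so the component count it prints is unrelated to the 350 chosen above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X_scaled: 300 samples, 50 correlated features
latent = rng.normal(size=(300, 5))
toy_X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(300, 50))

# Smallest number of components whose cumulative variance reaches 95%
cumulative = np.cumsum(PCA().fit(toy_X).explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95%:", n_for_95)

# scikit-learn can make the same choice from a variance fraction
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(toy_X)
print("components kept by PCA(0.95):", pca_95.n_components_)
```

Passing a fraction as `n_components` avoids hard-coding a number that would need revisiting if the image size or dataset changed.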

Reducing the dataset to 350 dimensions makes the KNN model faster and more efficient. This should improve the model's performance without losing reliability.


# Data Splitting and KNN Model Training
I used 80% of the dataset for training and the remaining 20% for validation. I then defined and trained a KNN model with the number of neighbours (k) set to seven and distance-based weighting, which gives closer neighbours a greater influence on the prediction.

In [None]:
# Split Data into Training and Validation Sets
X_train, X_val, y_train, y_val = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Initialize and Train the KNN Model
k = 7
knn = KNeighborsClassifier(n_neighbors=k, weights='distance', metric='minkowski', p=2)
knn.fit(X_train, y_train);

I used Euclidean distance (Minkowski distance with p=2), a common choice for continuous data, and k = 7 to balance the model's bias and variance.
Distance weighting ensures that closer neighbours have a greater influence, which can increase accuracy, particularly on datasets with uneven classes.
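
To show how distance weighting can change a prediction, here is a minimal hand-written version of the vote. The helper `weighted_vote` and the seven neighbour distances and labels are invented for illustration. It assumes inverse-distance weights with no zero distances, which is the scheme `weights='distance'` uses in scikit-learn.

```python
import numpy as np

def weighted_vote(distances, neighbour_labels):
    """Each neighbour adds 1/distance to its class; the largest total wins."""
    distances = np.asarray(distances, dtype=float)
    neighbour_labels = np.asarray(neighbour_labels)
    weights = 1.0 / distances  # assumes no neighbour is at distance zero
    classes = np.unique(neighbour_labels)
    totals = np.array([weights[neighbour_labels == c].sum() for c in classes])
    return int(classes[np.argmax(totals)])

# Seven neighbours: two very close ones from class 3, five distant ones from class 8
dists = [0.5, 0.6, 2.0, 2.1, 2.2, 2.3, 2.4]
labels = [3, 3, 8, 8, 8, 8, 8]

majority = int(np.bincount(labels).argmax())
print("uniform vote: ", majority)                      # 8
print("weighted vote:", weighted_vote(dists, labels))  # 3
```

With uniform weights the five distant neighbours win. With distance weights the two close neighbours outweigh them, which is how a small class can still be predicted when its examples are nearby.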

# Assessing the Model
In order to determine if the model was overfitting or underfitting, I evaluated its accuracy on both the training and validation sets:

In [None]:
train_acc = knn.score(X_train, y_train)
val_acc = knn.score(X_val, y_val)
print(f"Training Accuracy: {train_acc:.3f}")
print(f"Validation Accuracy: {val_acc:.3f}")

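1. Training Accuracy: the high training accuracy shows that the model fits the training data well, and the slightly lower validation accuracy points to some overfitting, which is usual for KNN on imbalanced data. With distance weighting, each training image is its own nearest neighbour at distance zero, so a near-perfect training score is expected.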

2. Validation Accuracy: 0.874. Although this is a clear drop compared with the training accuracy, the model generalises well to unseen data, with an accuracy of 87.4% on the validation set. A gap of this kind is to be expected, as training accuracy is usually higher than validation accuracy.
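
In the code above, the scaler and PCA are fitted on the full dataset before the split, so the validation images influence the preprocessing. One possible way to check the validation score without that influence is to wrap the steps in a pipeline and cross-validate it. The sketch below runs on synthetic data from `make_classification`, with 20 components chosen arbitrarily for the toy data, so its scores say nothing about the traffic sign model.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flattened images and their labels
toy_X, toy_y = make_classification(
    n_samples=400, n_features=60, n_informative=10,
    n_classes=4, random_state=42,
)

# Scaling and PCA are refitted on the training part of every fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    KNeighborsClassifier(n_neighbors=7, weights="distance"),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, toy_X, toy_y, cv=cv)
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```

Stratified folds keep the class proportions similar in every split, which matters when some classes have only a handful of images.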

# Final Testing and Model Performance
I processed the test set in the same way, applying the same fitted scaler and PCA transformations, in order to evaluate the model's performance on unseen data. After predicting a class for each test image, I computed the accuracy and produced a classification report and a confusion matrix.

In [None]:
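# Convert Test Data to Numpy Arrays
# `images` and `class_ids` must hold the images loaded from test_path at this point;
# if they still hold the training data, the score below is not a true test score.
X_test = np.array(images)
y_test = np.array(class_ids)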

X_test_scaled = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaled)
y_pred = knn.predict(X_test_pca)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Test Accuracy: 0.97. An accuracy of 97% on the test set indicates that the model generalises very well to data it has not seen before, which is an excellent result, provided the test images are fully separate from the training images. The high test accuracy shows that the model predicts new traffic sign images accurately.
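
Because the classes are imbalanced, overall accuracy can hide poor performance on small classes. As a hedged illustration, the sketch below compares plain accuracy with balanced accuracy (the mean of per-class recall) on made-up labels; the arrays `toy_true` and `toy_pred` are not results from this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 90 examples of class 0 and 10 of class 1
toy_true = np.array([0] * 90 + [1] * 10)
# A classifier that always predicts the majority class
toy_pred = np.zeros_like(toy_true)

acc = accuracy_score(toy_true, toy_pred)
bal_acc = balanced_accuracy_score(toy_true, toy_pred)
print(f"accuracy:          {acc:.2f}")      # 0.90
print(f"balanced accuracy: {bal_acc:.2f}")  # 0.50
```

Reporting balanced accuracy alongside the classification report would show whether the rarer sign classes are being recognised as reliably as the common ones.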