## Project Overview
This project develops a supervised machine learning model for **Optical Recognition of Handwritten Digits**. The dataset comes from the **UCI Machine Learning Repository** and contains handwritten digits collected from 43 different individuals. Each sample has **64 features**, the flattened pixel values of an 8x8 image of a handwritten digit (0-9). With 5,620 samples, the dataset is relatively small, which may limit how well the models generalize. The goal is to classify these digits accurately using machine learning models.
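
To make the 8x8 representation concrete: according to the UCI dataset description, each digit started as a 32x32 black-and-white bitmap that was divided into non-overlapping 4x4 blocks, and the number of "on" pixels in each block became one feature (an integer from 0 to 16). The following is a minimal sketch of that reduction on a synthetic bitmap. The helper name `block_counts` and the random input are illustrative assumptions, not part of the dataset tooling.

```python
import numpy as np

def block_counts(bitmap, block=4):
    """Reduce a binary bitmap to per-block counts of 'on' pixels (hypothetical helper)."""
    h, w = bitmap.shape
    # Split rows and columns into (n_blocks, block) pairs, then sum inside each block
    return bitmap.reshape(h // block, block, w // block, block).sum(axis=(1, 3))

rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))   # synthetic stand-in for a scanned digit
image_8x8 = block_counts(bitmap)             # shape (8, 8), values in 0..16
features = image_8x8.reshape(-1)             # the 64-element feature vector
print(image_8x8.shape, features.shape, image_8x8.min(), image_8x8.max())
```

Reshaping `features` back with `.reshape(8, 8)` recovers the image, which is what the visualization cells in this notebook rely on.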

## Result Summary
This project used the Optical Recognition of Handwritten Digits dataset to train SVM, KNN, and Random Forest models.
- **SVM** performed the best, achieving an accuracy of approximately **99%**.
- **Misclassification analysis** revealed that certain digits, such as `3` and `8`, were more challenging to distinguish.
- **Hyperparameter tuning** identified `C=10, gamma='scale'` as the optimal parameters for the SVM model.
- Future improvements could include a **Convolutional Neural Network (CNN)** to improve classification performance, or **data augmentation** to improve model generalization.
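
As a rough illustration of the data-augmentation idea in the last bullet, the sketch below creates copies of 8x8 digit images shifted by one pixel in each direction. The helper names `shift_image` and `augment`, the choice of one-pixel shifts, and the toy input are all assumptions made for illustration; none of this was part of the project's pipeline.

```python
import numpy as np

def shift_image(flat, dx, dy):
    """Shift a flattened 8x8 image by (dx, dy) pixels, filling vacated pixels with 0."""
    img = flat.reshape(8, 8)
    out = np.zeros_like(img)
    # Source and destination slices for rows (dy) and columns (dx)
    src_r = slice(max(0, -dy), 8 - max(0, dy))
    dst_r = slice(max(0, dy), 8 - max(0, -dy))
    src_c = slice(max(0, -dx), 8 - max(0, dx))
    dst_c = slice(max(0, dx), 8 - max(0, -dx))
    out[dst_r, dst_c] = img[src_r, src_c]
    return out.reshape(-1)

def augment(X, y):
    """Return the original samples plus four one-pixel-shifted copies of each."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    X_aug = [X] + [np.array([shift_image(row, dx, dy) for row in X]) for dx, dy in shifts]
    y_aug = [y] * (len(shifts) + 1)
    return np.vstack(X_aug), np.concatenate(y_aug)

# Toy input: two fake "images" with integer pixel values in 0..16
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 17, size=(2, 64))
y_toy = np.array([3, 8])
X_big, y_big = augment(X_toy, y_toy)
print(X_big.shape, y_big.shape)
```

If this were used in practice, augmentation would be applied to the training split only, so that the test set still measures performance on unmodified images.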

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Step 1: Load Data
**Fetching the Optical Recognition of Handwritten Digits dataset from UCI ML Repository**

In [None]:
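from ucimlrepo import fetch_ucirepo

# Dataset id 80 is "Optical Recognition of Handwritten Digits"
optical_recognition_of_handwritten_digits = fetch_ucirepo(id=80)
X = optical_recognition_of_handwritten_digits.data.features  # DataFrame with 64 pixel columns
y = optical_recognition_of_handwritten_digits.data.targets   # DataFrame with one label column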

## Step 2: Data Exploration
**Checking dataset shape and target distribution**

In [None]:
print("Dataset Shape:", X.shape)
print("Target Distribution:")
print(y.value_counts())

## Step 3: Data Visualization - Sample Digits
**Displaying a few sample images from the dataset**

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(10, 3))
sample_indices = np.random.choice(len(X), 5, replace=False)
for i, idx in enumerate(sample_indices):
    img = X.iloc[idx].to_numpy().reshape(8, 8)
    axes[i].imshow(img, cmap='gray')
    # y is a one-column DataFrame, so flatten it to get a scalar label for the title
    axes[i].set_title(f"Label: {np.ravel(y)[idx]}")
    axes[i].axis('off')
plt.show()

print("Missing values in X:", X.isnull().sum().sum())
print("Missing values in y:", y.isnull().sum().sum())

X.iloc[:, :10].hist(figsize=(10, 6), bins=16)
plt.show()


## Step 4: Data Preprocessing
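**Splitting into training and test sets, then scaling the feature values to range [0,1]**

In [None]:
# Split first so the scaler never sees the test data (avoids data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training set only, then apply the same transform to both sets
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)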
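
For reference, min-max scaling maps each feature to `(x - min) / (max - min)`. The sketch below reproduces that arithmetic in NumPy with the statistics taken from training rows only, which is one common way to keep test data out of the fit. The toy arrays and the helper names `fit_min_max` and `apply_min_max` are assumptions for illustration.

```python
import numpy as np

def fit_min_max(X_train):
    """Compute the per-feature minimum and range from the training data."""
    x_min = X_train.min(axis=0)
    x_range = X_train.max(axis=0) - x_min
    # Constant features have zero range; use 1 to avoid division by zero
    x_range = np.where(x_range == 0, 1, x_range)
    return x_min, x_range

def apply_min_max(X, x_min, x_range):
    """Scale X with previously fitted statistics."""
    return (X - x_min) / x_range

X_train_toy = np.array([[0.0, 4.0, 7.0],
                        [8.0, 4.0, 9.0],
                        [16.0, 4.0, 11.0]])
X_test_toy = np.array([[4.0, 4.0, 13.0]])

x_min, x_range = fit_min_max(X_train_toy)
train_scaled = apply_min_max(X_train_toy, x_min, x_range)
test_scaled = apply_min_max(X_test_toy, x_min, x_range)
print(train_scaled)
print(test_scaled)
```

Note that the last test value scales to 1.5: when the statistics come from the training rows, unseen data can fall slightly outside [0, 1], which is expected.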

## Step 5: Train Models
**Training three models: SVM, KNN, and Random Forest**

In [None]:
models = {
    "SVM": SVC(kernel='rbf', C=10, gamma='scale', random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
for name, model in models.items():
    # Flatten the (n, 1) label DataFrame to 1-D to avoid a DataConversionWarning
    model.fit(X_train, np.ravel(y_train))
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f"{name} Accuracy: {accuracy:.4f}")

## Step 6: Model Evaluation - Confusion Matrix for Best Model (SVM)
**Displaying confusion matrix for SVM model**

In [None]:
y_pred_svm = models["SVM"].predict(X_test)
cm = confusion_matrix(y_test, y_pred_svm)
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=range(10), yticklabels=range(10))
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix - SVM")
plt.show()

print("Classification Report - SVM:")
print(classification_report(y_test, y_pred_svm))
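
The per-class precision and recall shown in the classification report can be derived directly from a confusion matrix whose rows are true labels and whose columns are predicted labels. The sketch below works through that arithmetic on a made-up 3-class matrix; the numbers are illustrative only and are not results from this project.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows: true label, columns: predicted label)
cm_toy = np.array([[50, 0, 0],
                   [2, 45, 3],
                   [0, 5, 45]])

true_positives = np.diag(cm_toy)
precision = true_positives / cm_toy.sum(axis=0)   # correct / all predicted as the class
recall = true_positives / cm_toy.sum(axis=1)      # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm_toy.sum()

for label in range(len(cm_toy)):
    print(f"class {label}: precision={precision[label]:.3f} "
          f"recall={recall[label]:.3f} f1={f1[label]:.3f}")
print(f"accuracy={accuracy:.3f}")
```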

## Step 7: Misclassified Samples Visualization
**Visualizing some misclassified samples to understand errors**

In [None]:
y_test_np = np.array(y_test).ravel()
y_pred_svm_np = np.array(y_pred_svm)
misclassified_idx = np.where(y_test_np != y_pred_svm_np)[0]

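n_show = min(5, len(misclassified_idx))
if n_show == 0:
    print("No misclassified samples to display.")
else:
    # squeeze=False keeps axes 2-D even when only one sample is shown
    fig, axes = plt.subplots(1, n_show, figsize=(10, 3), squeeze=False)
    for ax, idx in zip(axes[0], misclassified_idx[:n_show]):
        img = X_test[idx].reshape(8, 8)
        ax.imshow(img, cmap='gray')
        ax.set_title(f"Pred: {y_pred_svm_np[idx]}, True: {y_test_np[idx]}")
        ax.axis('off')
    plt.show()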
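
Beyond looking at individual images, it can help to rank which (true, predicted) pairs account for the most errors. The sketch below counts such pairs with NumPy. The helper name `top_confusions` and the toy labels are assumptions for illustration, not output from the SVM above.

```python
import numpy as np

def top_confusions(y_true, y_pred, k=3):
    """Return up to k (true, predicted, count) triples for the most frequent errors."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    wrong = y_true != y_pred
    if not wrong.any():
        return []
    # Count each distinct (true, predicted) row among the misclassified samples
    pairs, counts = np.unique(
        np.stack([y_true[wrong], y_pred[wrong]], axis=1), axis=0, return_counts=True
    )
    order = np.argsort(-counts, kind="stable")[:k]
    return [(int(pairs[i, 0]), int(pairs[i, 1]), int(counts[i])) for i in order]

# Made-up labels for illustration only
y_true_toy = [3, 3, 8, 8, 8, 1, 9, 3]
y_pred_toy = [8, 8, 3, 8, 8, 1, 4, 8]
print(top_confusions(y_true_toy, y_pred_toy))
```

With the notebook's variables, a call such as `top_confusions(y_test_np, y_pred_svm_np)` would list the digit pairs the SVM confuses most often.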

## Step 8: Hyperparameter Tuning for SVM
**Using GridSearchCV to find the best hyperparameters for SVM**

In [None]:
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto', 0.01, 0.001],
    'kernel': ['rbf', 'poly']
}

grid_search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy', verbose=2, n_jobs=-1)
# Flatten the (n, 1) label DataFrame to 1-D to avoid a DataConversionWarning
grid_search.fit(X_train, np.ravel(y_train))
print("Best Parameters:", grid_search.best_params_)
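
One possible follow-up to the search is to put the scaler and the classifier in a `Pipeline`, so that each cross-validation fold refits the scaler on its own training portion, and then score the refit best estimator on the held-out test set. The sketch below uses scikit-learn's bundled `load_digits` data (1,797 8x8 images) so that it runs offline, and a reduced grid to keep it quick. Both choices, along with the variable names, are assumptions for illustration and not the project's configuration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is refit within every CV fold
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])
demo_grid = {"svc__C": [1, 10], "svc__gamma": ["scale", 0.01]}

search = GridSearchCV(pipe, demo_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
print("Test accuracy:", round(search.score(X_te, y_te), 4))
```

Reporting the test-set score of the tuned model alongside the cross-validation score gives a check that the selected parameters generalize beyond the folds used to choose them.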