# Digit Classification 

**Instructions:**

The objective is to classify each image in a large set of binary images as one of the digits 0 to 9. There are 200 instances of each digit (2,000 in total), and each sample is described by 298 attributes, so the features are these attributes rather than raw pixels.

These attributes come in separate files:
1. mfeat-fou: 76 Fourier coefficients of the character shapes;
2. mfeat-fac: 216 profile correlations;
3. mfeat-mor: 6 morphological features.

In [1]:
import pandas as pd

You can download the dataset (attributes) as follows:

In [None]:
help(pd.concat)

In [2]:
!git clone https://github.com/cvrg-iyte/DATA602repo.git
import pandas as pd
# The files are whitespace-separated and have no header row.
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas releases.
data1 = pd.read_csv("./DATA602repo/mfeat-fou", header=None, sep=r"\s+")  # Fourier coefficients
data2 = pd.read_csv("./DATA602repo/mfeat-fac", header=None, sep=r"\s+")  # profile correlations
data3 = pd.read_csv("./DATA602repo/mfeat-mor", header=None, sep=r"\s+")  # morphological features

Cloning into 'DATA602repo'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 48 (delta 13), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (48/48), 6.06 MiB | 2.80 MiB/s, done.


In each file, the first 200 samples belong to class 0, followed by consecutive blocks of 200 samples for each of the classes 1 to 9. Please create your target variable (y) first.

You will build an ML system that predicts which digit a given sample represents. Try several classification methods and apply the best practices covered in the lectures, such as grid search, cross-validation, and regularization.
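The practices named above can be combined in a single scikit-learn search. The following is only a minimal sketch under stated assumptions: it runs on synthetic placeholder data from `make_classification` rather than the real attribute files, and the parameter grid and the names `X_demo`, `y_demo`, `pipe`, and `search` are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data; in the notebook this would be the real X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling lives inside the pipeline so it is refit on each CV training fold.
pipe = make_pipeline(StandardScaler(), SVC())

# C controls SVM regularization (smaller C means stronger regularization).
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

Keeping the test split out of the search means the held-out accuracy is not influenced by the hyperparameter selection.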

In [20]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [21]:
# Load data
fou = data1
fac = data2
mor = data3

Combine the three attribute sets into a single feature matrix.
**'np.concatenate'** is a NumPy function that joins a sequence of arrays along a specified axis and returns a new array. With axis=1 the three tables are placed side by side, giving 76 + 216 + 6 = 298 columns per sample.
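To make the effect of the axis argument concrete, here is a small sketch with toy arrays. The arrays `a`, `b`, and `c` and their shapes are hypothetical stand-ins for the three feature tables, chosen only for illustration.

```python
import numpy as np

# Toy stand-ins for the three feature tables: the same number of rows
# (samples) but different numbers of columns (features).
a = np.zeros((4, 3))
b = np.ones((4, 2))
c = np.full((4, 1), 2.0)

# axis=1 stacks the arrays side by side, so each row still describes one sample.
combined = np.concatenate((a, b, c), axis=1)
print(combined.shape)  # (4, 6)
```

All inputs must have the same number of rows for this to work; the column counts simply add up.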

In [22]:
X = np.concatenate((fou, fac, mor), axis=1)

In [23]:
# Create labels
y = np.zeros(2000)
for i in range(10):
    y[i*200:(i+1)*200] = i
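The same labels can also be built without a loop. This is an optional sketch, assuming the block layout described in the instructions (ten classes, 200 consecutive samples each); the name `y_int` is illustrative. It produces integer labels, so scikit-learn reports would show classes as 0, 1, ... rather than 0.0, 1.0, ...

```python
import numpy as np

# Ten classes, 200 consecutive samples each.
y_int = np.repeat(np.arange(10), 200)

print(y_int.shape)     # (2000,)
print(y_int[198:202])  # [0 0 1 1]
```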

Split the data into training and testing sets.

In [24]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
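A plain random split does not guarantee equal class counts in the test set. One possible variation, sketched here on synthetic placeholder data (the names `X_demo` and `y_demo` and the random features are assumptions for illustration), is to pass `stratify` so that each class keeps the same share in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class layout as the assignment
# (10 classes x 200 samples); the feature values are random placeholders.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = np.repeat(np.arange(10), 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, each class contributes 40 of the 400 test samples.
print(np.bincount(y_te))
```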

A support vector machine (SVM) classifier with a linear kernel is then trained on the training set and evaluated on the test set.

In [25]:
# SVM
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
acc_svm = accuracy_score(y_test, y_pred_svm)
print("SVM accuracy:", acc_svm)

SVM accuracy: 0.975


In [26]:
# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
acc_knn = accuracy_score(y_test, y_pred_knn)
print("KNN accuracy:", acc_knn)

KNN accuracy: 0.9225


In [27]:
# Compare performance
print("\nPerformance comparison:")
print("SVM confusion matrix:")
print(confusion_matrix(y_test, y_pred_svm))
print("SVM classification report:")
print(classification_report(y_test, y_pred_svm))
print("\nKNN confusion matrix:")
print(confusion_matrix(y_test, y_pred_knn))
print("KNN classification report:")
print(classification_report(y_test, y_pred_knn))


Performance comparison:
SVM confusion matrix:
[[35  0  0  0  0  1  0  0  0  0]
 [ 0 47  0  0  0  0  0  0  0  0]
 [ 0  0 46  0  0  0  0  0  0  0]
 [ 0  1  0 34  0  0  0  1  0  0]
 [ 0  0  0  0 34  0  0  0  0  0]
 [ 0  0  0  3  0 31  0  0  0  0]
 [ 0  0  0  0  0  0 33  0  0  0]
 [ 0  0  0  0  0  0  0 43  0  1]
 [ 0  2  0  0  0  0  0  0 46  0]
 [ 0  1  0  0  0  0  0  0  0 41]]
SVM classification report:
              precision    recall  f1-score   support

         0.0       1.00      0.97      0.99        36
         1.0       0.92      1.00      0.96        47
         2.0       1.00      1.00      1.00        46
         3.0       0.92      0.94      0.93        36
         4.0       1.00      1.00      1.00        34
         5.0       0.97      0.91      0.94        34
         6.0       1.00      1.00      1.00        33
         7.0       0.98      0.98      0.98        44
         8.0       1.00      0.96      0.98        48
         9.0       0.98      0.98      0.98        42
