# Digit Classification 

**Instructions:**

The objective is to classify each image in a large set of binary images as one of the digits 0 to 9. Each digit has 200 instances (2000 in total), and each sample is described by 298 attributes. The features are therefore these attributes, not pixels.

These attributes come in separate files:
1. mfeat-fou: 76 Fourier coefficients of the character shapes;
2. mfeat-fac: 216 profile correlations;
3. mfeat-mor: 6 morphological features.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

You can download the dataset (attributes) as follows:

In [None]:
!git clone https://github.com/cvrg-iyte/DATA602repo.git
# Each file is whitespace-delimited and has no header row.
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas.
data1 = pd.read_csv("./DATA602repo/mfeat-fou", header=None, sep=r"\s+")  # Fourier coefficients
data2 = pd.read_csv("./DATA602repo/mfeat-fac", header=None, sep=r"\s+")  # profile correlations
data3 = pd.read_csv("./DATA602repo/mfeat-mor", header=None, sep=r"\s+")  # morphological features

fatal: destination path 'DATA602repo' already exists and is not an empty directory.


In each file, the first 200 samples belong to class 0, followed by blocks of 200 samples for each of the classes 1 to 9. Please create your target variable (y) first.

You will build an ML system that predicts which digit a given sample represents. You are expected to try different classification methods and apply the best practices covered in the lectures, such as grid search, cross-validation, and regularization.

In [None]:
# Number of instances per class
num_instances = 200

# Create target variable
y = np.zeros(num_instances * 10, dtype=int)

for i in range(10):
    y[i*num_instances:(i+1)*num_instances] = i
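
Because the classes come in contiguous blocks of 200 rows, the same target vector can also be built without a loop. The following is a minimal sketch of that alternative; the name `y_alt` is illustrative and not part of the original notebook.

```python
import numpy as np

num_instances = 200  # samples per digit class, as stated in the instructions

# Repeat each label 0..9 exactly 200 times, in order (y_alt is an illustrative name)
y_alt = np.repeat(np.arange(10), num_instances)

# Sanity checks: total length, a class boundary, and per-class counts
print(y_alt.shape)
print(y_alt[199], y_alt[200])
print(np.bincount(y_alt))
```

This should produce the same array as the loop above, and the per-class counts make it easy to confirm that every digit has 200 labels.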


In [None]:
# Load data from files
X_fou = data1
X_fac = data2
X_mor = data3

In [None]:
# Concatenate features into a single array
X = np.concatenate((X_fou, X_fac, X_mor), axis=1)
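
After concatenating, it can help to confirm that the combined matrix has the expected 2000 rows and 76 + 216 + 6 = 298 columns. The sketch below illustrates such checks on random stand-in arrays (all `*_demo` names and the random data are assumptions for illustration, not the real feature files); the same assertions could be applied to `X`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins with the column counts listed in the instructions
X_fou_demo = rng.normal(size=(2000, 76))   # Fourier coefficients
X_fac_demo = rng.normal(size=(2000, 216))  # profile correlations
X_mor_demo = rng.normal(size=(2000, 6))    # morphological features

# Column-wise concatenation keeps the 2000 rows and stacks the feature blocks
X_demo = np.concatenate((X_fou_demo, X_fac_demo, X_mor_demo), axis=1)

assert X_demo.shape == (2000, 298)
assert not np.isnan(X_demo).any()
# The first 76 columns are still the Fourier block, in the original order
assert np.array_equal(X_demo[:, :76], X_fou_demo)
print(X_demo.shape)
```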

In [None]:
# Define logistic regression pipeline with standard scaling and L2 regularization
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(penalty='l2', multi_class='ovr', solver='liblinear', max_iter=1000))
])

In [None]:
# Define hyperparameters for grid search
param_grid_lr = {
    'clf__C': [0.01, 0.1, 1, 10, 100],
}

In [None]:
# Define cross-validation strategy
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
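
To see what stratification does here, the sketch below counts the labels in each test fold. It rebuilds a label vector with 200 samples per class and uses a dummy feature matrix, since the split depends only on the labels; the names `y_demo`, `X_dummy`, `cv_demo`, and `fold_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels laid out as in the dataset: 200 samples for each digit 0..9
y_demo = np.repeat(np.arange(10), 200)
# The split only needs the number of rows, so a dummy feature matrix is enough
X_dummy = np.zeros((y_demo.size, 1))

cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Count how many samples of each class land in every test fold
fold_counts = [
    np.bincount(y_demo[test_idx], minlength=10)
    for _, test_idx in cv_demo.split(X_dummy, y_demo)
]

for k, counts in enumerate(fold_counts):
    print(f"fold {k}: {counts}")
```

With balanced classes, each test fold should hold the same number of samples per digit, so fold-level accuracy is not skewed by class imbalance.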

In [None]:
# Perform grid search with cross-validation
grid_search_lr = GridSearchCV(pipe_lr, param_grid=param_grid_lr, cv=cv, scoring='accuracy')
grid_search_lr.fit(X, y)

In [None]:
# Print best hyperparameters and cross-validation accuracy
print('Best hyperparameters:', grid_search_lr.best_params_)
print('Cross-validation accuracy:', grid_search_lr.best_score_)

Best hyperparameters: {'clf__C': 0.1}
Cross-validation accuracy: 0.9849999999999998
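
The instructions ask for several classification methods, and the same pipeline and grid-search pattern can be reused for other estimators. The sketch below compares a support vector classifier and a k-nearest-neighbours classifier. These two models, their parameter grids, and the synthetic data from `make_classification` are assumptions chosen for illustration; they are not part of the original notebook, and the printed scores say nothing about the digit data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with the real X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Each candidate is a (pipeline, parameter grid) pair; scaling stays inside
# the pipeline so it is refit on every training fold
candidates = {
    "svc": (
        Pipeline([("scaler", StandardScaler()), ("clf", SVC())]),
        {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    ),
    "knn": (
        Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())]),
        {"clf__n_neighbors": [3, 5, 7]},
    ),
}

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, param_grid=grid, cv=cv_demo, scoring="accuracy")
    search.fit(X_demo, y_demo)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Using the same `cv` object for every candidate keeps the folds identical, so the cross-validation accuracies are directly comparable.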


In [None]:
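# classification_report was already imported at the top of the notebook.
# Predict with the best estimator, which GridSearchCV refit on all of X.
# Note: these predictions are made on the same data the model was trained on,
# so the report below is a training-set estimate, not held-out performance.
y_pred = grid_search_lr.best_estimator_.predict(X)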

# Generate classification report
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       0.99      1.00      1.00       200
           2       1.00      1.00      1.00       200
           3       0.99      0.99      0.99       200
           4       1.00      1.00      1.00       200
           5       1.00      0.99      1.00       200
           6       1.00      1.00      1.00       200
           7       1.00      1.00      1.00       200
           8       1.00      1.00      1.00       200
           9       1.00      0.99      1.00       200

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
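
The report above is computed on data the model has already seen. One way to get per-class metrics on unseen data is to collect out-of-fold predictions with `cross_val_predict`. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` and a logistic regression with the default solver and `C=0.1`, so the numbers it prints are not results for the digit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the real X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=0.1, max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every sample is predicted by a model that did not see it during fitting
y_oof = cross_val_predict(pipe_demo, X_demo, y_demo, cv=cv_demo)

print(classification_report(y_demo, y_oof))
print("Out-of-fold accuracy:", accuracy_score(y_demo, y_oof))
```

On the real data, the out-of-fold accuracy would be expected to sit close to the cross-validation accuracy reported by the grid search rather than the near-perfect training-set figures.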

