# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify all symbols

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Is the model expecting data in a different format?

### Evaluate the model

Evaluate the model on the test set, then analyze the confusion matrix to see where the model performs well and where it struggles.
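
One way to read a confusion matrix is to rank its off-diagonal cells. The sketch below assumes the layout produced by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes); the helper name `top_confusions` and the toy counts are made up for illustration.

```python
import numpy as np

def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = np.asarray(cm).copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Flat indices of the k largest off-diagonal cells, largest first
    flat_order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [(labels[r], labels[c], int(off_diag[r, c])) for r, c in zip(rows, cols)]

# Toy 3-class matrix with made-up counts (rows = true, columns = predicted)
toy_cm = np.array([[50, 9, 1],
                   [12, 40, 0],
                   [2, 3, 45]])
toy_labels = ['0', 'O', 'X']
print(top_confusions(toy_cm, toy_labels, k=2))
```

When applying this to a real matrix, `labels` should be in the same order as the matrix rows. By default `confusion_matrix` uses the sorted unique labels.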

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').
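
A minimal sketch of the re-evaluation, assuming "excluding" means dropping the test rows whose true label is one of the confusable symbols. The helper name `accuracy_excluding` and the set of excluded symbols are illustrative choices, not part of the assignment.

```python
import numpy as np

def accuracy_excluding(y_true, y_pred, excluded):
    """Accuracy over the rows whose true label is not in `excluded`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    keep = ~np.isin(y_true, list(excluded))  # True for rows we still score
    return float(np.mean(y_true[keep] == y_pred[keep]))

# Made-up labels and predictions; the first two rows swap '0' and 'O'
y_true = np.array(['0', 'O', 'A', 'b', 'O', '7'])
y_pred = np.array(['O', '0', 'A', 'b', 'O', '7'])
confusable = {'0', 'O', 'o'}  # hypothetical choice of symbols to exclude

print(accuracy_excluding(y_true, y_pred, {'Z'}))       # nothing dropped
print(accuracy_excluding(y_true, y_pred, confusable))  # '0'/'O' rows dropped
```

Rows are filtered by true label only, so a kept symbol that is predicted as an excluded one still counts as an error. Merging the confusable symbols into one class is an alternative worth comparing.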

### Improve performance

Brainstorm ways to improve performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.
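
As one example of data augmentation, the sketch below adds copies of each image shifted by one pixel. It assumes images are 2-D NumPy arrays; `shift_image` and `augment` are hypothetical helpers, and small shifts are only one of many possible augmentations.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling vacated pixels with 0."""
    h, w = img.shape
    shifted = np.zeros_like(img)
    # Source and destination slices that stay inside the image bounds
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = img[src_y, src_x]
    return shifted

def augment(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the original images plus one shifted copy per entry in `shifts`."""
    aug_images = list(images)
    aug_labels = list(labels)
    for dy, dx in shifts:
        aug_images.extend(shift_image(img, dy, dx) for img in images)
        aug_labels.extend(labels)
    return np.stack(aug_images), np.array(aug_labels)

# Tiny made-up "images" to show the shapes involved
toy_images = [np.arange(9).reshape(3, 3), np.ones((3, 3), dtype=int)]
toy_labels = ['A', 'B']
aug_X, aug_y = augment(toy_images, toy_labels)
print(aug_X.shape, aug_y.shape)
```

Augmented copies belong in the training split only, so that the test set still measures performance on unmodified images.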

## 3. Classify digits vs. letters: model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric 
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset 
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model to the test split
    4. (*Optional*) Perform a hyperparameter search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform a hyperparameter search, if applicable
8. Report model performance
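
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference solution: it uses random synthetic features in place of flattened EMNIST images, F1 as the chosen metric, and two arbitrary candidate models. It also assumes labels 0 to 9 are digits, matching `int_to_char` in the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
labels = rng.integers(0, 62, size=400)   # synthetic stand-in for EMNIST labels
X = rng.normal(size=(400, 16))           # synthetic stand-in for image features
is_digit = (labels < 10).astype(int)     # step 1: digit (1) vs. letter (0)
X[:, 0] += 2.0 * is_digit                # make the toy problem learnable

# Step 4: reserve a validation set that the k-fold loop never sees
X_dev, X_val, y_dev, y_val = train_test_split(
    X, is_digit, test_size=0.2, stratify=is_digit, random_state=0)

# Step 3: candidate models (arbitrary choices for this sketch)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=5),
}

# Step 5: k-fold train/test, using the same split for every candidate
scores = {name: [] for name in candidates}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X_dev, y_dev):
    for name, model in candidates.items():
        fitted = clone(model).fit(X_dev[train_idx], y_dev[train_idx])
        y_hat = fitted.predict(X_dev[test_idx])
        scores[name].append(f1_score(y_dev[test_idx], y_hat))

# Steps 6 and 8: promote the winner, refit it, and score it on the validation set
mean_scores = {name: float(np.mean(vals)) for name, vals in scores.items()}
winner = max(mean_scores, key=mean_scores.get)
final_model = clone(candidates[winner]).fit(X_dev, y_dev)
val_f1 = f1_score(y_val, final_model.predict(X_val))
print(mean_scores, winner, val_f1)
```

`StratifiedKFold` keeps the digit/letter ratio similar in every fold, which matters here because letters outnumber digits among the 62 labels.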

In [35]:
# add packages
%pip install -r requirements.txt
import emnist
import pandas as pd
import numpy as np
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Collecting scikit-learn (from -r requirements.txt (line 3))
  Downloading scikit_learn-1.4.0-1-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.6.0 (from scikit-learn->-r requirements.txt (line 3))
  Downloading scipy-1.12.0-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn->-r requirements.txt (line 3))
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn->-r requirements.txt (line 3))
  Downloading threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Downloading scikit_learn-1.4.0-1-cp312-cp312-win_amd64.whl (10.6 MB)

In [3]:
# extract both training and test from emnist
image, label = emnist.extract_training_samples('byclass')
raw_train = pd.DataFrame()
raw_train['label'] = label
raw_train['image'] = list(image)
image, label = emnist.extract_test_samples('byclass')
raw_test = pd.DataFrame()
raw_test['label'] = label
raw_test['image'] = list(image)

In [4]:
print(raw_train.shape)
raw_test.shape

(697932, 2)


(116323, 2)

In [6]:
# combine training and test; ignore_index avoids duplicate row labels
raw_data = pd.concat([raw_train, raw_test], ignore_index=True)
print(raw_data.shape)
# create a 10% subset of raw_data
data = raw_data.sample(frac=0.1, replace=False, random_state=1)
data.shape

(814255, 2)


(81426, 2)

In [8]:
# Define helper functions
def int_to_char(label):
    """Convert an integer label to its character: 0-9 are digits, 10-35 are 'A'-'Z', 36-61 are 'a'-'z'."""
    if label < 10:
        return str(label)
    elif label < 36:
        return chr(label - 10 + ord('A'))
    else:
        return chr(label - 36 + ord('a'))
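
The same mapping can also be written as a lookup string, which makes the 62-symbol ordering (digits, then uppercase, then lowercase) easy to check at a glance. This is an alternative sketch; `EMNIST_BYCLASS_CHARS` and `int_to_char_lookup` are names introduced here, not part of the `emnist` package.

```python
import string

# Index i holds the character for label i: '0'-'9', then 'A'-'Z', then 'a'-'z'
EMNIST_BYCLASS_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase

def int_to_char_lookup(label):
    """Look up the character for an integer label in the range 0..61."""
    return EMNIST_BYCLASS_CHARS[label]

print([int_to_char_lookup(i) for i in (0, 9, 10, 35, 36, 61)])
```

Unlike the arithmetic version, the lookup raises `IndexError` for labels above 61, which can help catch a mislabeled row early. Negative labels would still index from the end of the string.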

In [27]:
# map integer labels to characters for the "class" column
class_label = np.array([int_to_char(l) for l in data['label']])
# work on a real copy so that adding columns leaves `data` unchanged
data2 = data.copy()
data2['class'] = class_label
# add image_flat: each 28x28 image flattened to a 784-element vector
data2['image_flat'] = data2['image'].apply(lambda x: np.array(x).reshape(-1))

In [33]:
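# Split to train:test = 7:3 (rows were already shuffled by sample() above)
split_at = int(0.7 * len(data2))  # 56998 for the 81426-row subset
train = data2.iloc[:split_at, :]
test = data2.iloc[split_at:, :]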
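
A positional slice like this relies on the rows already being in random order and does not control how classes are distributed between the two parts. As an alternative sketch, `train_test_split` with `stratify` keeps class proportions the same in both parts; the toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with two balanced classes
toy = pd.DataFrame({"class": list("A" * 10 + "B" * 10), "value": np.arange(20)})

# 70/30 split that preserves the class ratio in both parts
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, stratify=toy["class"], random_state=1)

print(train_toy["class"].value_counts().to_dict())
print(test_toy["class"].value_counts().to_dict())
```

Stratifying raises a `ValueError` when a class has fewer than two rows, so very rare symbols in a small subset may need to be handled first.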

In [36]:
# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=20, random_state=5, max_depth=20)
# Train and evaluate model
rf_clf.fit(train['image_flat'].tolist(), train['class'])
y_pred = rf_clf.predict(test['image_flat'].tolist())

In [39]:
# Calculate performance metrics
acc = accuracy_score(test['class'], y_pred)
prec = precision_score(test['class'], y_pred, average='macro')
rec = recall_score(test['class'], y_pred, average='macro')
f1 = f1_score(test['class'], y_pred, average='macro')
cm = confusion_matrix(test['class'], y_pred)
print([acc,prec,rec,f1])

[0.7480759783854594, 0.6389624182436264, 0.5219646237835093, 0.5433253811985825]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
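
Accuracy (about 0.75) is well above macro recall (about 0.52) in the output above. One plausible reading is class imbalance: macro averaging weights every class equally, so poorly recognized rare classes pull the average down. The sketch below shows the effect on a made-up two-class confusion matrix; `per_class_recall` is a hypothetical helper.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class from a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    support = cm.sum(axis=1)  # number of true samples per class
    # Classes with no true samples get recall 0 instead of a division error
    return np.divide(cm.diagonal(), support,
                     out=np.zeros_like(support), where=support > 0)

# Made-up counts: a frequent class recognized well, a rare class recognized poorly
toy_cm = np.array([[90, 10],
                   [8, 2]])
recalls = per_class_recall(toy_cm)
accuracy = toy_cm.diagonal().sum() / toy_cm.sum()
macro_recall = recalls.mean()
print(recalls, accuracy, macro_recall)
```

Sorting the per-class recalls for the real confusion matrix `cm` would show directly which symbols account for the gap.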
