# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify letters a->g

### Subset the data

Select only the lowercase letters (a, b, ..., g) for classification.

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Is the model expecting data in a different format?


### Evaluate the model

Evaluate the model on the test set and analyze the confusion matrix to see where the model performs well and where it struggles.

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').

### Improve performance

Brainstorm ways to improve performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.

## 3. Classify digits vs. letters model showdown

Perform a full showdown classifying digits vs letters:

1. Create a column for whether each row is a digit or a letter
2. Choose an evaluation metric 
3. Choose several candidate models to train
4. Divide data to reserve a validation set that will NOT be used in training/testing
5. K-fold train/test
    1. Create train/test splits from the non-validation dataset 
    2. Train each candidate model (best practice: use the same split for all models)
    3. Apply the model to the test split
    4. (*Optional*) Perform a hyperparameter search
    5. Record the model evaluation metrics
    6. Repeat with a new train/test split
6. Promote winner, apply model to validation set
7. (*Optional*) Perform a hyperparameter search, if applicable
8. Report model performance
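
The steps above can be sketched end to end on a small synthetic dataset. This is only an illustration under stated assumptions: `make_classification` stands in for the flattened EMNIST images and the digit/letter column, F1 is picked as the evaluation metric, and the two candidate models and their settings are arbitrary choices, not the ones required by the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flattened images and the digit/letter column (assumption).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Step 4: reserve a validation set that the K-fold loop never sees.
X_work, X_valid, y_work, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


# Step 3: candidate models (illustrative choices and settings).
def make_candidates():
    return {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }


# Step 5: every model is trained and scored on the same folds.
n_splits = 5
fold_scores = {name: [] for name in make_candidates()}
folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X_work, y_work):
    for name, model in make_candidates().items():
        model.fit(X_work[train_idx], y_work[train_idx])
        y_pred = model.predict(X_work[test_idx])
        fold_scores[name].append(f1_score(y_work[test_idx], y_pred))

# Step 6: promote the model with the best mean F1, refit it on all
# non-validation data, then score it once on the validation set.
mean_scores = {name: float(np.mean(scores)) for name, scores in fold_scores.items()}
winner = max(mean_scores, key=mean_scores.get)
final_model = make_candidates()[winner]
final_model.fit(X_work, y_work)
validation_f1 = f1_score(y_valid, final_model.predict(X_valid))
print(mean_scores, winner, round(validation_f1, 3))
```

Keeping the per-fold scores in a list per model makes it possible to report both the mean and the spread for each candidate before touching the validation set.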

In [20]:
####PROBLEM 1
#data preparation: load the training and test splits, then combine them into one DataFrame
# Import packages
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import emnist
from IPython.display import display,Markdown
from hashlib import sha1
from sklearn.model_selection import train_test_split
#load the EMNIST data
# Each image is loaded as a 28x28 array

# The size of each image is 28x28 pixels
size = 28 

# Extract the training split as images and labels
image, label = emnist.extract_training_samples('byclass')

# Start from an empty DataFrame; the label and image columns are added below
raw_train = pd.DataFrame()

# Add a column showing the label
raw_train['label'] = label

# Add a column with the image data as a 28x28 array
raw_train['image'] = list(image)


# Repeat for the test split
image, label = emnist.extract_test_samples('byclass')
raw_test = pd.DataFrame()
raw_test['label'] = label
raw_test['image'] = list(image)
raw = pd.concat([raw_train,raw_test],axis=0)

In [21]:
#subset the data to the lowercase letters a-g, which correspond to labels 36-42 in the 'byclass' split
#.copy() gives an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
raw_lowercase = raw[raw["label"].between(36, 42)].copy()

In [22]:
#map the numeric labels to letters; a random forest is used below for the multi-class classification
def classify_value(x):
    """Map an EMNIST 'byclass' label in 36-42 to its lowercase letter a-g."""
    return chr(ord('a') + int(x) - 36)
raw_lowercase['letter'] = raw_lowercase['label'].apply(classify_value)


In [23]:
#split the data set into 75% training set and 25% testing set
raw_lowercase_training,raw_lowercase_testing = train_test_split(raw_lowercase,test_size=0.25)
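
The letters are not equally frequent (the confusion matrix further down shows far more `e` than `c` samples), so a stratified, seeded split may be worth considering. The sketch below is an illustration on a toy frame, not the split used in this notebook; the toy data, the `pixel_sum` column, and the seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with imbalanced letters (assumption: stands in for raw_lowercase).
toy = pd.DataFrame({"letter": ["e"] * 80 + ["a"] * 40 + ["c"] * 8})
toy["pixel_sum"] = range(len(toy))  # placeholder feature column

# stratify keeps the letter proportions; random_state makes the split repeatable.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["letter"], random_state=42
)

# Each letter keeps the same share in both parts.
print(train["letter"].value_counts(normalize=True).round(3))
print(test["letter"].value_counts(normalize=True).round(3))
```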

In [24]:
raw_lowercase_training['image_flat'] = raw_lowercase_training['image'].apply(lambda x: np.array(x).reshape(-1))
raw_lowercase_testing['image_flat'] = raw_lowercase_testing['image'].apply(lambda x: np.array(x).reshape(-1))
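
As an alternative to storing one flattened array per row, the images can be stacked into a single 2-D feature matrix, which is the layout scikit-learn estimators expect. The sketch below uses three made-up images; it assumes the `image` column holds equally sized 28x28 arrays, as in this notebook.

```python
import numpy as np
import pandas as pd

# Three fake 28x28 "images" (assumption: same layout as the 'image' column).
images = [np.full((28, 28), i, dtype=np.uint8) for i in range(3)]
frame = pd.DataFrame()
frame["image"] = images

# Stack into (n_samples, 28, 28), then flatten each image to 784 values.
X = np.stack(frame["image"].tolist()).reshape(len(frame), -1)

# Same values as flattening row by row, but held in one 2-D array.
row_by_row = np.array(
    frame["image"].apply(lambda x: np.array(x).reshape(-1)).tolist()
)
print(X.shape, np.array_equal(X, row_by_row))
```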

In [25]:
# Create a dictionary for performance metrics
metrics_dict = {
    'a_vs_b_vs_c_vs_d_vs_e_vs_f_vs_g' : { # task name (multi-class classifier)
    
        'xgboost': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'random_forest': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'neural_network': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        }
    }
}



In [26]:
def display_metrics(task, model_name, metrics_dict):
    """Display performance metrics and confusion matrix for a model."""
    metrics_df = pd.DataFrame()
    cm_df = pd.DataFrame()
    for key, value in metrics_dict[task][model_name].items():
        if type(value) == np.ndarray:
            #cm_df = pd.DataFrame(value, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
            cm_df = pd.DataFrame(value, index=['actual a', 'actual b', 'actual c', 'actual d', 'actual e', 'actual f', 'actual g'], 
                                 columns=['predicted a', 'predicted b','predicted c','predicted d','predicted e','predicted f','predicted g'])
        else:
            metrics_df[key] = [value]
    display(Markdown(f'# Performance Metrics: {model_name}'))
    display(metrics_df)
    display(Markdown(f'# Confusion Matrix: {model_name}'))
    display(cm_df)

In [28]:
#raw_lowercase_training = raw_lowercase_training.head(100000)
#raw_lowercase_testing = raw_lowercase_testing.head(100000)

| | label | image | letter | image_flat |
|---|---|---|---|---|
| 37560 | 36 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | a | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 418122 | 40 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | e | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 141802 | 36 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | a | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 555998 | 40 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | e | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 221913 | 40 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | e | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| ... | ... | ... | ... | ... |
| 450241 | 36 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | a | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 386906 | 40 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | e | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 564070 | 36 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | a | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 265754 | 40 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | e | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


In [29]:
#using random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
task = 'a_vs_b_vs_c_vs_d_vs_e_vs_f_vs_g'
model_name = 'random_forest'
# Make sure the task entry exists without discarding results stored for other models
metrics_dict.setdefault(task, {})

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_clf.fit(raw_lowercase_training['image_flat'].tolist(), raw_lowercase_training['letter'])
y_pred = rf_clf.predict(raw_lowercase_testing['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(raw_lowercase_testing['letter'], y_pred)
prec = precision_score(raw_lowercase_testing['letter'], y_pred,average='macro')
rec = recall_score(raw_lowercase_testing['letter'], y_pred,average='macro')
f1 = f1_score(raw_lowercase_testing['letter'], y_pred,average='macro')
cm = confusion_matrix(raw_lowercase_testing['letter'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: random_forest

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.960288 | 0.955751 | 0.928545 | 0.941047 |


# Confusion Matrix: random_forest

| | predicted a | predicted b | predicted c | predicted d | predicted e | predicted f | predicted g |
|---|---|---|---|---|---|---|---|
| actual a | 2787 | 3 | 1 | 20 | 35 | 3 | 19 |
| actual b | 13 | 1444 | 0 | 22 | 11 | 6 | 4 |
| actual c | 23 | 1 | 667 | 4 | 106 | 5 | 1 |
| actual d | 50 | 19 | 1 | 2905 | 8 | 6 | 8 |
| actual e | 51 | 6 | 24 | 8 | 7073 | 10 | 7 |
| actual f | 5 | 5 | 2 | 17 | 18 | 725 | 13 |
| actual g | 106 | 2 | 3 | 12 | 16 | 9 | 915 |
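
The confusion matrix above can be turned into per-class recall and precision to see which letters the model struggles with. The sketch below copies the matrix values from the output; the variable names are illustrative. Averaging the per-class recall gives the macro recall reported above.

```python
import numpy as np

letters = list("abcdefg")
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([
    [2787, 3, 1, 20, 35, 3, 19],
    [13, 1444, 0, 22, 11, 6, 4],
    [23, 1, 667, 4, 106, 5, 1],
    [50, 19, 1, 2905, 8, 6, 8],
    [51, 6, 24, 8, 7073, 10, 7],
    [5, 5, 2, 17, 18, 725, 13],
    [106, 2, 3, 12, 16, 9, 915],
])

recall = cm.diagonal() / cm.sum(axis=1)     # share of each actual letter that was found
precision = cm.diagonal() / cm.sum(axis=0)  # share of each predicted letter that is correct
for letter, r, p in zip(letters, recall, precision):
    print(f"{letter}: recall={r:.3f} precision={p:.3f}")

# The largest off-diagonal cells show which pairs get mixed up most often.
errors = cm.copy()
np.fill_diagonal(errors, 0)
for flat in np.argsort(errors, axis=None)[::-1][:3]:
    i, j = divmod(int(flat), len(letters))
    print(f"actual {letters[i]} predicted as {letters[j]}: {errors[i, j]}")
```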


In [31]:
#from the confusion matrix, the model predicts a, b, d, e and f well but struggles with c (often predicted as e) and g (often predicted as a).
#The overall accuracy (about 96%) is good, so the next step is to remove c and g and see whether the model does better on the remaining letters.
raw_lowercase_training = raw_lowercase_training[raw_lowercase_training["letter"].isin(["a","b","d","e","f" ])]
raw_lowercase_testing = raw_lowercase_testing[raw_lowercase_testing["letter"].isin(["a","b","d","e","f" ])]



In [33]:
#retrain and re-evaluate the random forest on the reduced set of letters (a, b, d, e, f)
def display_metrics(task, model_name, metrics_dict):
    """Display performance metrics and confusion matrix for a model."""
    metrics_df = pd.DataFrame()
    cm_df = pd.DataFrame()
    for key, value in metrics_dict[task][model_name].items():
        if type(value) == np.ndarray:
            #cm_df = pd.DataFrame(value, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
            cm_df = pd.DataFrame(value, index=['actual a', 'actual b', 'actual d', 'actual e', 'actual f'], 
                                 columns=['predicted a', 'predicted b','predicted d','predicted e','predicted f'])
        else:
            metrics_df[key] = [value]
    display(Markdown(f'# Performance Metrics: {model_name}'))
    display(metrics_df)
    display(Markdown(f'# Confusion Matrix: {model_name}'))
    display(cm_df)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
task = 'a_vs_b_vs_d_vs_e_vs_f'  # c and g are excluded in this run
model_name = 'random_forest'
# Make sure the task entry exists without discarding results stored earlier
metrics_dict.setdefault(task, {})

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_clf.fit(raw_lowercase_training['image_flat'].tolist(), raw_lowercase_training['letter'])
y_pred = rf_clf.predict(raw_lowercase_testing['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(raw_lowercase_testing['letter'], y_pred)
prec = precision_score(raw_lowercase_testing['letter'], y_pred,average='macro')
rec = recall_score(raw_lowercase_testing['letter'], y_pred,average='macro')
f1 = f1_score(raw_lowercase_testing['letter'], y_pred,average='macro')
cm = confusion_matrix(raw_lowercase_testing['letter'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: random_forest

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.977428 | 0.97189 | 0.967371 | 0.969563 |


# Confusion Matrix: random_forest

| | predicted a | predicted b | predicted d | predicted e | predicted f |
|---|---|---|---|---|---|
| actual a | 2798 | 2 | 24 | 41 | 3 |
| actual b | 15 | 1447 | 20 | 11 | 7 |
| actual d | 60 | 18 | 2905 | 7 | 7 |
| actual e | 56 | 6 | 10 | 7096 | 11 |
| actual f | 5 | 7 | 20 | 16 | 737 |


In [None]:
#after removing the poorly predicted letters c and g, the model does slightly better (accuracy rises from about 0.960 to 0.977). In general, the random forest works well on this multi-class classification task.

In [34]:
#### PROBLEM 2

#data preparation: load the training and test splits, then combine them into one DataFrame
# Import packages
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import emnist
from hashlib import sha1
from IPython.display import display,Markdown
from sklearn.model_selection import train_test_split
#load the EMNIST data
# Each image is loaded as a 28x28 array

# The size of each image is 28x28 pixels
size = 28 

# Extract the training split as images and labels
image, label = emnist.extract_training_samples('byclass')

# Start from an empty DataFrame; the label and image columns are added below
raw_train = pd.DataFrame()

# Add a column showing the label
raw_train['label'] = label

# Add a column with the image data as a 28x28 array
raw_train['image'] = list(image)


# Repeat for the test split
image, label = emnist.extract_test_samples('byclass')
raw_test = pd.DataFrame()
raw_test['label'] = label
raw_test['image'] = list(image)
raw = pd.concat([raw_train,raw_test],axis=0)


In [35]:
#create a column for whether each row is a digit or a letter (labels 0-9 are digits, 10-61 are letters)
raw["class"] = pd.cut(raw['label'], bins=[-0.1, 9.1,61.1], labels=[0, 1])   #0 for digit and 1 for letter
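
The binning above can be cross-checked against a plain comparison. The sketch below runs both on all 62 possible labels and assumes, as in the cell above, that labels 0-9 are digits and 10-61 are letters; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# All 62 labels of the 'byclass' split: 0-9 digits, 10-61 letters.
labels = pd.Series(np.arange(62))

# Binning as in the cell above: (-0.1, 9.1] -> 0 (digit), (9.1, 61.1] -> 1 (letter).
via_cut = pd.cut(labels, bins=[-0.1, 9.1, 61.1], labels=[0, 1]).astype(int)

# Equivalent comparison that yields a plain integer column directly.
via_compare = (labels >= 10).astype(int)

print((via_cut == via_compare).all())
print(via_compare.value_counts().sort_index())
```

The comparison returns an integer column rather than a categorical one, which some estimators and metrics handle more predictably.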

In [54]:
#choose an evaluation metric
#accuracy,precision, recall, F1

In [37]:
#choose several candidate models to train; the candidates actually trained below are random forest, XGBoost and logistic regression

In [36]:
#divide the data to reserve a validation set that will not be used in training/testing
#25% for the validation set, 75% for the non-validation set (about 50% for training, 25% for testing)
validation,non_validation = train_test_split(raw,test_size=0.75)
non_validation_training1,non_validation_testing1 = train_test_split(non_validation,test_size=0.33,random_state=42)
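
The resulting proportions can be verified on a small stand-in array. In this sketch, 1000 row indices play the role of `raw`, and the seeds are added only to make the example repeatable; neither is part of the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the rows of `raw` (assumption)

# test_size=0.75 puts 75% in the SECOND return value, so the first holds 25%.
validation, non_validation = train_test_split(rows, test_size=0.75, random_state=0)
# 33% of the remaining 75% becomes the test part (about 25% of all rows).
training, testing = train_test_split(non_validation, test_size=0.33, random_state=0)

for name, part in [("validation", validation), ("training", training), ("testing", testing)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(rows):.1%})")
```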


In [37]:
non_validation_training1['image_flat'] = non_validation_training1['image'].apply(lambda x: np.array(x).reshape(-1))
non_validation_testing1['image_flat'] = non_validation_testing1['image'].apply(lambda x: np.array(x).reshape(-1))


In [38]:
# Create a dictionary for performance metrics
metrics_dict = {}
metrics_dict = {
    '0_vs_1' : { # task name (0 vs 1 classifier)
    
        'xgboost': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'random_forest': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'neural_network': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        }
    }
}



In [39]:
def display_metrics(task, model_name, metrics_dict):
    """Display performance metrics and confusion matrix for a model."""
    metrics_df = pd.DataFrame()
    cm_df = pd.DataFrame()
    for key, value in metrics_dict[task][model_name].items():
        if type(value) == np.ndarray:
            cm_df = pd.DataFrame(value, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
        else:
            metrics_df[key] = [value]
    display(Markdown(f'# Performance Metrics: {model_name}'))
    display(metrics_df)
    display(Markdown(f'# Confusion Matrix: {model_name}'))
    display(cm_df)
    

In [40]:
#using random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
task = '0_vs_1'
model_name = 'random_forest'
# Make sure the task entry exists without discarding results stored for other models
metrics_dict.setdefault(task, {})

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_clf.fit(non_validation_training1['image_flat'].tolist(), non_validation_training1['class'])
y_pred = rf_clf.predict(non_validation_testing1['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(non_validation_testing1['class'], y_pred)
prec = precision_score(non_validation_testing1['class'], y_pred)
rec = recall_score(non_validation_testing1['class'], y_pred)
f1 = f1_score(non_validation_testing1['class'], y_pred)
cm = confusion_matrix(non_validation_testing1['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: random_forest

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.895717 | 0.91009 | 0.880635 | 0.89512 |


# Confusion Matrix: random_forest

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 90830 | 8860 |
| actual 1 | 12156 | 89683 |


In [45]:
from xgboost import XGBClassifier
from IPython.display import display, Markdown
# 0 vs 1 Classifier: XGBoost
task = '0_vs_1'
model_name = 'xgboost'

# Initialize XGBoost classifier
xgb_clf = XGBClassifier(n_estimators=100, random_state=42)

# Train and evaluate model
xgb_clf.fit(non_validation_training1['image_flat'].tolist(), non_validation_training1['class'])
y_pred = xgb_clf.predict(non_validation_testing1['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(non_validation_testing1['class'], y_pred)
prec = precision_score(non_validation_testing1['class'], y_pred)
rec = recall_score(non_validation_testing1['class'], y_pred)
f1 = f1_score(non_validation_testing1['class'], y_pred)
cm = confusion_matrix(non_validation_testing1['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)


# Performance Metrics: xgboost

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.878266 | 0.888137 | 0.868488 | 0.878203 |


# Confusion Matrix: xgboost

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 88550 | 11140 |
| actual 1 | 13393 | 88446 |


In [46]:
non_validation_training1

| | label | image | class | image_flat |
|---|---|---|---|---|
| 220717 | 55 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 245086 | 26 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 104012 | 30 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 173526 | 25 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 303479 | 8 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| ... | ... | ... | ... | ... |
| 134071 | 22 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 222528 | 58 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4168 | 22 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 836 | 5 | [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# 0 vs 1 Classifier: Logistic Regression 
task = '0_vs_1'
model_name = 'logistic_regression'

# Scale the data
# When running without scaling the data, the model does not converge
scaler = StandardScaler()
train_scaled = scaler.fit_transform(non_validation_training1['image_flat'].tolist())
valid_scaled = scaler.transform(non_validation_testing1['image_flat'].tolist())

# Initialize logistic regression classifier
lr_clf = LogisticRegression(max_iter=1000, random_state=42)

# Train and evaluate model
lr_clf.fit(train_scaled, non_validation_training1['class'])
y_pred = lr_clf.predict(valid_scaled)

# Calculate performance metrics
acc = accuracy_score(non_validation_testing1['class'], y_pred)
prec = precision_score(non_validation_testing1['class'], y_pred)
rec = recall_score(non_validation_testing1['class'], y_pred)
f1 = f1_score(non_validation_testing1['class'], y_pred)
cm = confusion_matrix(non_validation_testing1['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: logistic_regression

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.736723 | 0.749226 | 0.719989 | 0.734317 |


# Confusion Matrix: logistic_regression

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 75148 | 24542 |
| actual 1 | 28516 | 73323 |
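
The scale-then-fit steps used for logistic regression can also be bundled into a single scikit-learn pipeline, so the scaler is always fitted on the training rows only. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the flattened images); its score says nothing about EMNIST.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of the flattened images (assumption).
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() learns the scaling from the training rows only;
# score() reuses those statistics when it transforms the test rows.
lr_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipeline.fit(X_train, y_train)
test_accuracy = lr_pipeline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

A pipeline like this can be passed directly to cross-validation helpers, which refit the scaler inside each fold.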


In [57]:
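#run all the models on a different training and testing split
#this draws a new split: 10% validation, then 90% x 0.9 = 81% training and 9% testing
#note: the validation set is re-drawn here, so it can contain rows that the models above were trained or tested on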
validation,non_validation = train_test_split(raw,test_size=0.9)
non_validation_training2,non_validation_testing2 = train_test_split(non_validation,test_size=0.1,random_state=42)
non_validation_training2['image_flat'] = non_validation_training2['image'].apply(lambda x: np.array(x).reshape(-1))
non_validation_testing2['image_flat'] = non_validation_testing2['image'].apply(lambda x: np.array(x).reshape(-1))


In [58]:
#using random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
task = '0_vs_1'
model_name = 'random_forest'
# Make sure the task entry exists without discarding results stored for other models
metrics_dict.setdefault(task, {})

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_clf.fit(non_validation_training2['image_flat'].tolist(), non_validation_training2['class'])
y_pred = rf_clf.predict(non_validation_testing2['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(non_validation_testing2['class'], y_pred)
prec = precision_score(non_validation_testing2['class'], y_pred)
rec = recall_score(non_validation_testing2['class'], y_pred)
f1 = f1_score(non_validation_testing2['class'], y_pred)
cm = confusion_matrix(non_validation_testing2['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: random_forest

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.898448 | 0.915899 | 0.879959 | 0.897569 |


# Confusion Matrix: random_forest

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 33235 | 2994 |
| actual 1 | 4448 | 32606 |


In [59]:
from xgboost import XGBClassifier
from IPython.display import display, Markdown
# 0 vs 1 Classifier: XGBoost
task = '0_vs_1'
model_name = 'xgboost'

# Initialize XGBoost classifier
xgb_clf = XGBClassifier(n_estimators=100, random_state=42)

# Train and evaluate model
xgb_clf.fit(non_validation_training2['image_flat'].tolist(), non_validation_training2['class'])
y_pred = xgb_clf.predict(non_validation_testing2['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(non_validation_testing2['class'], y_pred)
prec = precision_score(non_validation_testing2['class'], y_pred)
rec = recall_score(non_validation_testing2['class'], y_pred)
f1 = f1_score(non_validation_testing2['class'], y_pred)
cm = confusion_matrix(non_validation_testing2['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: xgboost

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.877789 | 0.892321 | 0.862363 | 0.877086 |


# Confusion Matrix: xgboost

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 32373 | 3856 |
| actual 1 | 5100 | 31954 |


In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# 0 vs 1 Classifier: Logistic Regression 
task = '0_vs_1'
model_name = 'logistic_regression'

# Scale the data
# When running without scaling the data, the model does not converge
scaler = StandardScaler()
train_scaled = scaler.fit_transform(non_validation_training2['image_flat'].tolist())
valid_scaled = scaler.transform(non_validation_testing2['image_flat'].tolist())

# Initialize logistic regression classifier
lr_clf = LogisticRegression(max_iter=1000, random_state=42)

# Train and evaluate model
lr_clf.fit(train_scaled, non_validation_training2['class'])
y_pred = lr_clf.predict(valid_scaled)

# Calculate performance metrics
acc = accuracy_score(non_validation_testing2['class'], y_pred)
prec = precision_score(non_validation_testing2['class'], y_pred)
rec = recall_score(non_validation_testing2['class'], y_pred)
f1 = f1_score(non_validation_testing2['class'], y_pred)
cm = confusion_matrix(non_validation_testing2['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: logistic_regression

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.737647 | 0.751652 | 0.718546 | 0.734726 |


# Confusion Matrix: logistic_regression

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 27432 | 8797 |
| actual 1 | 10429 | 26625 |


In [61]:
#the best model is random forest, which will be tested in the validation set
validation['image_flat'] = validation['image'].apply(lambda x: np.array(x).reshape(-1))

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
task = '0_vs_1'
model_name = 'random_forest'
# Make sure the task entry exists without discarding results stored for other models
metrics_dict.setdefault(task, {})

# Initialize random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

rf_clf.fit(non_validation_training2['image_flat'].tolist(), non_validation_training2['class'])
y_pred = rf_clf.predict(validation['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(validation['class'], y_pred)
prec = precision_score(validation['class'], y_pred)
rec = recall_score(validation['class'], y_pred)
f1 = f1_score(validation['class'], y_pred)
cm = confusion_matrix(validation['class'], y_pred)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1,
                                  'confusion_matrix': cm}

# Display performance metrics
display_metrics(task, model_name, metrics_dict)

# Performance Metrics: random_forest

| | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.901099 | 0.91519 | 0.885865 | 0.900288 |


# Confusion Matrix: random_forest

| | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 37017 | 3369 |
| actual 1 | 4684 | 36355 |


In [62]:
#the winning model for the binary digit-vs-letter classification is random forest: its accuracy stays close to 90% on both test splits and on the validation set.
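
The comparison behind this conclusion can be summarized in one small table. The sketch below copies the F1 scores reported in the outputs above for the two train/test splits; using mean F1 as the ranking criterion is an assumption, since the notebook does not name a single deciding metric.

```python
import pandas as pd

# F1 scores copied from the outputs above (two train/test splits per model).
f1_by_split = pd.DataFrame(
    {
        "random_forest": [0.89512, 0.897569],
        "xgboost": [0.878203, 0.877086],
        "logistic_regression": [0.734317, 0.734726],
    },
    index=["split 1", "split 2"],
)

# One row per model with the mean and spread across splits, best mean first.
summary = f1_by_split.agg(["mean", "std"]).T.sort_values("mean", ascending=False)
print(summary.round(4))
print("winner:", summary.index[0])
```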