# Classification on `emnist`

## 1. Create `Readme.md` to document your work

Explain your choices, process, and outcomes.

## 2. Classify ~~all symbols~~ letters a -> g

### Subset the data

Select only the lowercase letters (a, b, ..., g) for classification
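
As a minimal sketch of how the target labels could be identified: the snippet below assumes the `byclass` label order is digits, then uppercase, then lowercase, which is the same order the notebook's `int_to_char` helper relies on. The names `CLASSES`, `char_to_int`, and `A2G_LABELS` are illustrative and not part of the `emnist` package.

```python
import string

# Assumed EMNIST 'byclass' ordering: 0-9, then A-Z, then a-z (62 classes).
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

def char_to_int(ch):
    """Return the integer label for a single character (hypothetical helper)."""
    return CLASSES.index(ch)

# Integer labels for the lowercase letters a..g
A2G_LABELS = [char_to_int(ch) for ch in "abcdefg"]
print(A2G_LABELS)
```

Rows whose `label` is in `A2G_LABELS` would then form the subset.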

### Choose a model

Your choice of model! Choose wisely...

### Train away!

Do you need to tune any parameters? Is the model expecting data in a different format?

### Evaluate the model

Evaluate the model on the test set. Analyze the confusion matrix to see where the model performs well and where it struggles.

### Investigate subsets

On which classes does the model perform well? Poorly? Evaluate again, excluding easily confused symbols (such as 'O' and '0').

### Improve performance

Brainstorm ways to improve performance. This could include trying different architectures, adding more layers, changing the loss function, or using data augmentation techniques.
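
One simple form of data augmentation is to shift each image by a pixel or two. The sketch below is one possible way to do that with NumPy; `shift_image` is a hypothetical helper, and the single-pixel test image stands in for a real 28x28 EMNIST image.

```python
import numpy as np

def shift_image(image, dx=0, dy=0):
    """Shift a 2-D image by (dx, dy) pixels, filling vacated pixels with 0."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Source and destination windows that stay inside the frame
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

# Toy image with a single bright pixel at row 10, column 10
image = np.zeros((28, 28), dtype=np.uint8)
image[10, 10] = 255

# Move the content 2 pixels right and 1 pixel up
moved = shift_image(image, dx=2, dy=-1)
```

Shifted copies keep their original labels, so they can be appended to the training rows only (never to the test or hold-out rows).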

## 3. Model showdown: upper vs lowercase on abcXYZ

### Subset the data

Select the upper- and lowercase letters (a, b, c, x, y, z, A, B, C, X, Y, Z). Note that some of these classes can be confusing (e.g., x and y).

### Train and tune models

Perform a full model training and hyperparameter tuning.

1. Select candidate models, hyperparameter options, and evaluation metric
2. Set aside a validation hold-out dataset
3. Train models over K splits (use k-fold or train/test split)
    1. Split train using k-fold with the number of folds equal to the number of parameter combinations
    2. Train on k-fold split
    3. Record performance of each set of parameters
    4. Use winning set of parameters to train model on full training set
    5. Record each model's performance on that split's test set
4. Evaluate model performance and promote one model as the winner
5. Train winning model on both train + test
6. Check model performance on the validation hold-out
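
The steps above can be sketched roughly as follows. This is a simplified illustration, not the required solution: it uses a synthetic dataset from `make_classification` in place of EMNIST, small illustrative grids, and it collapses the per-split loop of step 3 into a single k-fold pass over the development data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, ParameterGrid, train_test_split

# Synthetic stand-in for the flattened images (assumption, not EMNIST)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Step 2: set aside the validation hold-out
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: candidate models and hyperparameter options (illustrative values)
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    (RandomForestClassifier(random_state=0),
     {"n_estimators": [25, 50], "max_depth": [None, 5]}),
]

# Step 3: score every parameter combination with k-fold cross-validation
kf = KFold(n_splits=3, shuffle=True, random_state=0)
best_score, best_model = -np.inf, None
for estimator, grid in candidates:
    for params in ParameterGrid(grid):
        fold_scores = []
        for train_idx, test_idx in kf.split(X_dev):
            model = clone(estimator).set_params(**params)
            model.fit(X_dev[train_idx], y_dev[train_idx])
            y_pred = model.predict(X_dev[test_idx])
            fold_scores.append(accuracy_score(y_dev[test_idx], y_pred))
        mean_score = np.mean(fold_scores)
        # Step 4: promote the best-scoring configuration
        if mean_score > best_score:
            best_score = mean_score
            best_model = clone(estimator).set_params(**params)

# Steps 5-6: refit the winner on all development data, then check the hold-out
best_model.fit(X_dev, y_dev)
holdout_score = accuracy_score(y_holdout, best_model.predict(X_holdout))
print(best_model, best_score, holdout_score)
```

Using `clone` for the stored winner keeps an independent, unfitted copy of the configuration, so later `set_params` calls on the candidate cannot change it.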


## 4. (_Optional_) Model comparison: classify even vs odd

**NOTE:** This is a larger dataset (~400k rows) so it will require more memory and time to train models on it. 

Alternatively, you can train models on smaller subsets of the data to get a feel for which models perform better than others. Then train the winning model on the full dataset and validate against the hold-out.

### Subset the data

Select only digits and add a column for 'is_even'. Be sure to create a validation hold-out dataset for later.
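
A minimal sketch of this step on a toy DataFrame, assuming one row per image and an integer `label` column as in this notebook. The 20% hold-out fraction and the names `digits_only`, `holdout`, and `development` are illustrative choices.

```python
import pandas as pd

# Toy stand-in for the full dataset (assumption: integer 'label' column)
toy = pd.DataFrame({"label": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})

# .copy() gives an independent DataFrame, so adding a column
# does not trigger pandas' SettingWithCopyWarning
digits_only = toy[toy["label"] < 10].copy()
digits_only["is_even"] = digits_only["label"] % 2 == 0

# Set aside 20% of the rows as the validation hold-out;
# the remaining rows are used for training and tuning
holdout = digits_only.sample(frac=0.2, random_state=42)
development = digits_only.drop(holdout.index)
```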

### Build and compare models

Train at least two different models, compare the results and choose a winner based on an evaluation metric of your choice.

In [1]:
%pip install -q emnist pandas pyarrow numpy matplotlib seaborn scikit-learn xgboost tensorflow
%reset -f


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import packages
import os
import string
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import emnist
from IPython.display import display, Markdown
from itertools import product

# ML packages
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, multilabel_confusion_matrix
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, KFold, ParameterGrid
# XGBoost (gradient-boosted trees)
from xgboost import XGBClassifier
# Deep Learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Constants
SIZE = 28

In [3]:
# Define helper functions
def int_to_char(label):
    """Convert an integer label to its character: digit (0-9), uppercase (10-35), or lowercase (36-61)."""
    if label < 10:
        return str(label)
    elif label < 36:
        return chr(label - 10 + ord('A'))
    else:
        return chr(label - 36 + ord('a'))

def show_image(row):
    """Display a single image and its corresponding label."""
    image = row['image']
    label = row['label']
    plt.imshow(image, cmap='gray')
    plt.title('Label: ' + int_to_char(label))
    plt.axis('off')
    plt.show()

def show_grid(data, title=None, num_cols=5, figsize=(20, 10)):
    """
    Display the images in a DataFrame as a grid of num_cols columns.
    data: a DataFrame with an 'image' column (28x28 numpy arrays) and a 'label' column
    title: (optional) a title for the plot
    num_cols: (optional) number of columns to use in the grid
    figsize: (optional) size of the figure
    """
    num_images = len(data)
    num_rows = (num_images - 1) // num_cols + 1
    # squeeze=False keeps `axes` 2-D even when the grid has a single row or column
    fig, axes = plt.subplots(num_rows, num_cols, figsize=figsize, squeeze=False)
    if title is not None:
        fig.suptitle(title, fontsize=16)
    for i in range(num_rows):
        for j in range(num_cols):
            index = i * num_cols + j
            axes[i, j].axis('off')  # also hides the unused cells in the last row
            if index < num_images:
                axes[i, j].imshow(data.iloc[index]['image'], cmap='gray')
                label = int_to_char(data.iloc[index]['label'])
                axes[i, j].set_title(label)
    plt.show()

# Get a random image of a given label from the dataset
def get_image_by_label(data, label):
    """Get a random image of a given label from the dataset."""
    images = data[data['label'] == label]['image'].tolist()
    return random.choice(images)

# Plot the training and validation accuracy during the training of a model
def plot_accuracy(history):
    """Plot the training and validation accuracy during the training of a model."""
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training accuracy')
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

# Plot the training and validation loss during the training of a model
def plot_loss(history):
    """Plot the training and validation loss during the training of a model."""
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(loss) + 1)
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

# Normalize the pixel values of the images in the dataset to have zero mean and unit variance
# This is a common preprocessing step for neural networks, but may not be necessary in all cases
def normalize_images(images):
    """Normalize the pixel values of the images in the dataset to have zero mean and unit variance."""
    images = np.array(images)
    mean = images.mean()
    std = images.std()
    images = (images - mean) / std
    return images.tolist()

# Display metrics for a model
def display_metrics(task, model_name, metrics_dict):
    """Display performance metrics and confusion matrix for a model."""
    metrics_df = pd.DataFrame()
    cm_df = pd.DataFrame()
    for key, value in metrics_dict[task][model_name].items():
        if isinstance(value, np.ndarray):
            cm_df = pd.DataFrame(value, index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
        else:
            metrics_df[key] = [value]
    display(Markdown(f'# Performance Metrics: {model_name}'))
    display(metrics_df)
    display(Markdown(f'# Confusion Matrix: {model_name}'))
    display(cm_df)

In [4]:
# Load data

# Extract the training split as images and labels
image, label = emnist.extract_training_samples('byclass')

# Build a DataFrame with one row per training image
emnist_train = pd.DataFrame()

# Add a column with the image data as a 28x28 array
emnist_train['image'] = list(image)
emnist_train['image_flat'] = emnist_train['image'].apply(lambda x: np.array(x).reshape(-1))

# Add a column showing the label
emnist_train['label'] = label

# Convert labels to characters
class_label = np.array([int_to_char(l) for l in label])

# Add a column with the character corresponding to the label
emnist_train['class'] = class_label

# Repeat for the test split
image, label = emnist.extract_test_samples('byclass')
class_label = np.array([int_to_char(l) for l in label])
emnist_test = pd.DataFrame()
emnist_test['image'] = list(image)
emnist_test['image_flat'] = emnist_test['image'].apply(lambda x: np.array(x).reshape(-1))
emnist_test['label'] = label
emnist_test['class'] = class_label

# Combine the training and test splits
emnist_all = pd.concat([emnist_train, emnist_test], axis=0)

# Subset for only digits 0-9
digits = emnist_all[emnist_all['label'] < 10]

# Subset for lowercase letters
lowercase = emnist_all[(emnist_all['class'] >= 'a') & (emnist_all['class'] <= 'z')]
uppercase = emnist_all[(emnist_all['class'] >= 'A') & (emnist_all['class'] <= 'Z')]

# Subset for lowercase letters a, b, c, d, e, f, g
a2g = emnist_all[(emnist_all['class'].isin(['a', 'b', 'c', 'd', 'e', 'f', 'g']))]

# Subset for upper- and lowercase letters a, b, c, x, y, z
abcxyz = emnist_all[(emnist_all['class'].isin(['a', 'b', 'c', 'A', 'B', 'C', \
                                               'x', 'y', 'z', 'X', 'Y', 'Z']))]

In [5]:
# Display the size of a2g, abcxyz, digits, and the full dataset
display(Markdown(f'# Dataset Sizes'))
display(Markdown(f'**a2g**: {len(a2g)}'))
display(Markdown(f'**abcxyz**: {len(abcxyz)}'))
display(Markdown(f'**digits**: {len(digits)}'))
display(Markdown(f'**emnist_all**: {len(emnist_all)}'))

# Dataset Sizes

**a2g**: 68795

**abcxyz**: 65926

**digits**: 402953

**emnist_all**: 814255

In [6]:
# Classify letters as uppercase/lowercase
# (.copy() makes each subset an independent DataFrame, which avoids SettingWithCopyWarning)
abcxyz = abcxyz.copy()
abcxyz['is_upper'] = abcxyz['label'] <= 35

# Classify digits as even/odd
digits = digits.copy()
digits['is_even'] = digits['label'] % 2 == 0


In [7]:
# Display multiclass metrics for a model
def display_multiclass_metrics(task, model_name, metrics_dict, labels):
    """Display performance metrics and confusion matrix for a model."""
    metrics_df = pd.DataFrame()
    cm_df = pd.DataFrame()
    # One row/column name per class, in the same order as the confusion matrix
    col_names = [int_to_char(i) for i in labels]
    for key, value in metrics_dict[task][model_name].items():
        if isinstance(value, np.ndarray):
            cm_df = pd.DataFrame(value, columns=col_names, index=col_names)
        else:
            metrics_df[key] = [value]
    display(Markdown(f'# Performance Metrics: {model_name}'))
    display(metrics_df)
    display(Markdown(f'# Confusion Matrix: {model_name}'))
    display(cm_df)

In [8]:
# (OPTIONAL) We can define all the metrics we want to track in a dictionary
metrics_dict = {
    'upper_vs_lower' : { # task name (uppercase vs lowercase classifier)
        'logistic_regression': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'xgboost': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'random_forest': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'neural_network': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        }
    }, 

    'classify_symbols' : { # task name (symbol classifier)
        'logistic_regression': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'xgboost': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'random_forest': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'neural_network': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        }

    },

    'even_or_odd' : { # task name (even vs odd classifier)
        'logistic_regression': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'xgboost': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'random_forest': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        },
        'neural_network': {
            'confusion_matrix': [],
            'accuracy': [],
            'precision': [],
            'recall': [],
            'f1': []
        }

    }
    
}

In [9]:
# 2. Classify ~~all symbols~~ letters a -> g
# Symbol Classifier: RandomForest
task = 'classify_symbols'
model_name = 'random_forest'
metrics_dict[task] = {model_name: {}}

# Hyperparameter grid
rf_param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 10, 20], 'criterion': ['log_loss', 'gini', 'entropy']}
scoring_metric = 'accuracy'

# Train/Valid Split
a2g_train, a2g_val, a2g_lab_train, a2g_lab_val = train_test_split(a2g, a2g['label'], test_size=0.2, random_state=42)

# Baby data sets
# a2g_trainb = a2g_train[:5000]
# a2g_valb = a2g_val[:2500]
# a2g_lab_trainb = a2g_lab_train[:5000]
# a2g_lab_valb = a2g_lab_val[:2500]

# Hyperparameter tuning
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# Initializing scores
best_score = 0
best_model = None

for param_combination in ParameterGrid(rf_param_grid):
    rf_classifier = RandomForestClassifier().set_params(**param_combination)
    fold_scores = []

    # 3.A. Split train using k-fold (5 folds, as set by `kf` above)
    for train_index, test_index in kf.split(a2g_train):
        X_fold_train, X_fold_test = a2g_train.iloc[train_index], a2g_train.iloc[test_index]
        y_fold_train, y_fold_test = a2g_lab_train.iloc[train_index], a2g_lab_train.iloc[test_index]

        # 3.B. Train on k-fold split
        rf_classifier.fit(X_fold_train['image_flat'].tolist(), y_fold_train)

        # 3.C. Record performance of each set of parameters
        y_pred = rf_classifier.predict(X_fold_test['image_flat'].tolist())
        fold_score = accuracy_score(y_fold_test, y_pred)
        fold_scores.append(fold_score)

    # 3.D. Track the best-scoring set of parameters (used later to train on the full training set)
    avg_score = np.mean(fold_scores)
    if avg_score > best_score:
        best_score = avg_score
        best_model = rf_classifier.get_params()

# Display winning model
print(f"Best Model Parameters: {best_model}")
print(f"Best Model Average Cross-Validation Score: {best_score}")


Best Model Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 150, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Best Model Average Cross-Validation Score: 0.9615342215498173


In [10]:
# Train best model on full training dataset
rf_clf = RandomForestClassifier(**best_model)
rf_clf.fit(a2g_train['image_flat'].tolist(), a2g_lab_train)

# Evaluate model
y_pred = rf_clf.predict(a2g_val['image_flat'].tolist())

# Calculate performance metrics
acc = accuracy_score(a2g_lab_val, y_pred)
prec = precision_score(a2g_lab_val, y_pred, average='macro', zero_division = 1)
rec = recall_score(a2g_lab_val, y_pred, average = 'macro')
f1 = f1_score(a2g_lab_val, y_pred, average = 'macro')
cm = confusion_matrix(a2g_lab_val, y_pred, labels = rf_clf.classes_)

# Store performance metrics in dictionary
metrics_dict[task][model_name] = {'confusion_matrix': cm,
                                  'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1
                                  }

# Display performance metrics
display_multiclass_metrics(task, model_name, metrics_dict, rf_clf.classes_)

# Performance Metrics: random_forest

|   | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.963806 | 0.960231 | 0.931977 | 0.945055 |


# Confusion Matrix: random_forest

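|   | a | b | c | d | e | f | g |
|---|---|---|---|---|---|---|---|
| a | 2368 | 4 | 3 | 11 | 42 | 2 | 17 |
| b | 9 | 1148 | 1 | 15 | 8 | 8 | 5 |
| c | 19 | 1 | 538 | 2 | 71 | 2 | 1 |
| d | 32 | 16 | 0 | 2268 | 3 | 6 | 3 |
| e | 22 | 3 | 9 | 4 | 5667 | 7 | 4 |
| f | 6 | 5 | 2 | 10 | 11 | 551 | 8 |
| g | 84 | 3 | 1 | 12 | 19 | 7 | 721 |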
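
To see where the model struggles, the confusion matrix above can be summarized per class. The sketch below copies the values from that output and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes. The variable names are illustrative.

```python
import numpy as np

classes = list("abcdefg")

# Rows: actual class, columns: predicted class (values from the output above)
cm = np.array([
    [2368,    4,   3,   11,   42,   2,  17],
    [   9, 1148,   1,   15,    8,   8,   5],
    [  19,    1, 538,    2,   71,   2,   1],
    [  32,   16,   0, 2268,    3,   6,   3],
    [  22,    3,   9,    4, 5667,   7,   4],
    [   6,    5,   2,   10,   11, 551,   8],
    [  84,    3,   1,   12,   19,   7, 721],
])

# Per-class recall: correct predictions divided by the actual count of each class
recall = cm.diagonal() / cm.sum(axis=1)
worst_class = classes[int(np.argmin(recall))]

# Largest off-diagonal entry: the most frequent single confusion
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
top_confusion = (classes[i], classes[j], int(off_diag[i, j]))

for name, value in zip(classes, recall):
    print(f"{name}: recall = {value:.4f}")
print("lowest recall:", worst_class)
print("most frequent confusion (actual, predicted, count):", top_confusion)
```

On these numbers the lowest recall belongs to `c`, and the single most frequent error is an actual `g` predicted as `a` (84 cases), which suggests those are the pairs worth inspecting visually.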


In [11]:
# 3. Model showdown: upper vs lowercase on abcXYZ
# Models to try: Logistic regression, XGBoost, Random Forest

# Upper vs lowercase classifier
task = 'upper_vs_lower'
# Reset this task's results; the winning model's metrics are stored after tuning
metrics_dict[task] = {}

# Hyperparameter grid
lr_param_grid = {'max_iter': [1000, 2000, 3000]}
xgb_param_grid = {'n_estimators': [50, 100, 200, 500], 'max_depth': [1, 2, 3, 4, 5, 6], 'eta': [0.1, 0.3]}
rf_param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 10, 20], 'criterion': ['log_loss', 'gini', 'entropy']}

scoring_metric = 'accuracy'

# Scaling data
scaler = StandardScaler()
abcxyz_scaled = scaler.fit_transform(abcxyz['image_flat'].tolist())

# Train/Test split
ul_train, ul_val, ul_lab_train, ul_lab_val = train_test_split(abcxyz_scaled, abcxyz['is_upper'], test_size=0.2, random_state=42)

# Baby data sets
# ul_trainb = ul_train[:2000]
# ul_valb = ul_val[:1000]
# ul_lab_trainb = ul_lab_train[:2000]
# ul_lab_valb = ul_lab_val[:1000]

# Hyperparameter tuning
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# Initializing scores
best_score = 0
best_model = None
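
Note that the cell above fits `StandardScaler` on every row before the train/validation split, so the held-out rows contribute to the scaling statistics. One possible alternative, sketched here on synthetic data rather than EMNIST, is to put the scaler and the classifier in a scikit-learn pipeline so the scaler is refit on each training fold only. The choice of `LogisticRegression` is just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flattened images (assumption, not EMNIST)
X, y = make_classification(n_samples=500, n_features=30, random_state=42)

# Split first, so the validation rows never influence the scaler
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline refits StandardScaler on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipe, X_train, y_train, cv=kf, scoring="accuracy")

# Final fit on the full training set, then one check on the validation rows
pipe.fit(X_train, y_train)
val_score = pipe.score(X_val, y_val)
print(cv_scores.mean(), val_score)
```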

In [12]:
models = [
      (LogisticRegression(), lr_param_grid),
      (XGBClassifier(), xgb_param_grid),
      (RandomForestClassifier(), rf_param_grid)
]

X = ul_train.tolist()
y = ul_lab_train.tolist()



from sklearn.base import clone

for model, param_grid in models:
    for param_combination in product(*param_grid.values()):
        param_dict = dict(zip(param_grid.keys(), param_combination))
        model.set_params(**param_dict)
        fold_scores = []

        # 3.A. Split train using k-fold (5 folds, as set by `kf` above)
        for train_index, test_index in kf.split(X):
            X_fold_train, X_fold_test = [X[i] for i in train_index], [X[i] for i in test_index]
            y_fold_train, y_fold_test = [y[i] for i in train_index], [y[i] for i in test_index]

            # 3.B. Train on k-fold split
            model.fit(X_fold_train, y_fold_train)

            # 3.C. Record performance of each set of parameters
            y_pred = model.predict(X_fold_test)
            fold_score = accuracy_score(y_fold_test, y_pred)
            fold_scores.append(fold_score)

        # 3.D. Keep the best-scoring configuration. clone() snapshots the current
        # parameters; storing `model` itself would let later set_params() calls overwrite them.
        avg_score = np.mean(fold_scores)
        if avg_score > best_score:
            best_score = avg_score
            best_model = clone(model)

# Print the best model
print("Best Model:", best_model)
best_model_classifier = best_model
best_model_params = best_model.get_params()
print(f"Best Model Average Cross-Validation Score: {best_score}")

Best Model: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eta=0.3, eval_metric=None,
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=500,
              n_jobs=None, num_parallel_tree=None, ...)
Best Model Average Cross-Validation Score: 0.8488623435722411


In [18]:
# Train and test the best model
bm_fit = best_model.fit(ul_train, ul_lab_train)
y_pred = best_model.predict(ul_val)
validation_score = accuracy_score(ul_lab_val, y_pred)
print(f"Validation Hold-out Score: {validation_score}")

# Calculate performance metrics
acc = accuracy_score(ul_lab_val, y_pred)
prec = precision_score(ul_lab_val, y_pred, average='macro', zero_division = 1)
rec = recall_score(ul_lab_val, y_pred, average = 'macro')
f1 = f1_score(ul_lab_val, y_pred, average = 'macro')
cm = confusion_matrix(ul_lab_val, y_pred, labels = best_model.classes_)

# Store performance metrics in dictionary
metrics_dict[task]['xgboost'] = {'confusion_matrix': cm,
                                  'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1
                                  }

# Display performance metrics
display_metrics(task, 'xgboost', metrics_dict)

Validation Hold-out Score: 0.843242833308054


# Performance Metrics: xgboost

|   | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.843243 | 0.844843 | 0.839939 | 0.841437 |


# Confusion Matrix: xgboost

|   | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 4856 | 1252 |
| actual 1 | 815 | 6263 |
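
The reported metrics can be re-derived from this confusion matrix by hand. The short sketch below does so in plain Python. It assumes class 0 is lowercase (`is_upper == False`) and class 1 is uppercase, and that macro averaging is the unweighted mean over the two classes, as in the cell above.

```python
# Rows: actual class, columns: predicted class (values from the output above)
cm = [[4856, 1252],
      [815, 6263]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision is computed per column (predicted class),
# recall per row (actual class)
precision = [cm[k][k] / (cm[0][k] + cm[1][k]) for k in range(2)]
recall = [cm[k][k] / sum(cm[k]) for k in range(2)]

# Macro average: unweighted mean over the two classes
macro_precision = sum(precision) / 2
macro_recall = sum(recall) / 2

print(f"accuracy={accuracy:.6f}")
print(f"macro precision={macro_precision:.6f}, macro recall={macro_recall:.6f}")
```

The off-diagonal counts show that actual class 0 is mistaken for class 1 (1252 cases) more often than the reverse (815 cases).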


In [14]:
# 4. (_Optional_) Model comparison: classify even vs odd

# Models to try: Logistic regression, XGBoost, Random Forest

# Even vs odd classifier
task = 'even_or_odd'
# Reset this task's results; the winning model's metrics are stored after tuning
metrics_dict[task] = {}

# Hyperparameter grid
lr_param_grid = {'max_iter': [1000, 2000, 3000]}
xgb_param_grid = {'n_estimators': [50, 100, 200, 500], 'max_depth': [1, 2, 3, 4, 5, 6], 'eta': [0.1, 0.3]}
rf_param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [None, 10, 20], 'criterion': ['log_loss', 'gini', 'entropy']}

scoring_metric = 'accuracy'

# Making digits data set smaller
digitss = digits.sample(frac = 0.5, random_state = 42)

# Scaling data
scaler = StandardScaler()
digits_scaled = scaler.fit_transform(digitss['image_flat'].tolist())


# Train/Test split
d_train, d_val, d_lab_train, d_lab_val = train_test_split(digits_scaled, digitss['is_even'], test_size=0.2, random_state=42)

# Baby data sets
# d_trainb = d_train[:2000]
# d_valb = d_val[:1000]
# d_lab_trainb = d_lab_train[:2000]
# d_lab_valb = d_lab_val[:1000]

# Hyperparameter tuning
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

# Initializing scores
best_score = 0
best_model = None

# List of models
models = [
      (LogisticRegression(), lr_param_grid),
      (XGBClassifier(), xgb_param_grid),
      (RandomForestClassifier(), rf_param_grid)
]

X = d_train.tolist()
y = d_lab_train.tolist()



from sklearn.base import clone

for model, param_grid in models:
    for param_combination in product(*param_grid.values()):
        param_dict = dict(zip(param_grid.keys(), param_combination))
        model.set_params(**param_dict)
        fold_scores = []

        # 3.A. Split train using k-fold (5 folds, as set by `kf` above)
        for train_index, test_index in kf.split(X):
            X_fold_train, X_fold_test = [X[i] for i in train_index], [X[i] for i in test_index]
            y_fold_train, y_fold_test = [y[i] for i in train_index], [y[i] for i in test_index]

            # 3.B. Train on k-fold split
            model.fit(X_fold_train, y_fold_train)

            # 3.C. Record performance of each set of parameters
            y_pred = model.predict(X_fold_test)
            fold_score = accuracy_score(y_fold_test, y_pred)
            fold_scores.append(fold_score)

        # 3.D. Keep the best-scoring configuration. clone() snapshots the current
        # parameters; storing `model` itself would let later set_params() calls overwrite them.
        avg_score = np.mean(fold_scores)
        if avg_score > best_score:
            best_score = avg_score
            best_model = clone(model)

# Print the best model
print("Best Model:", best_model)
best_model_classifier = best_model
best_model_params = best_model.get_params()
print(f"Best Model Average Cross-Validation Score: {best_score}")

Best Model: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eta=0.3, eval_metric=None,
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=500,
              n_jobs=None, num_parallel_tree=None, ...)
Best Model Average Cross-Validation Score: 0.9920647723042562


In [17]:
# Train and test the best model
bm_fit = best_model.fit(d_train, d_lab_train)
y_pred = best_model.predict(d_val)
validation_score = accuracy_score(d_lab_val, y_pred)
print(f"Validation Hold-out Score: {validation_score}")

# Calculate performance metrics
acc = accuracy_score(d_lab_val, y_pred)
prec = precision_score(d_lab_val, y_pred, average='macro', zero_division = 1)
rec = recall_score(d_lab_val, y_pred, average = 'macro')
f1 = f1_score(d_lab_val, y_pred, average = 'macro')
cm = confusion_matrix(d_lab_val, y_pred, labels = best_model.classes_)

# Store performance metrics in dictionary
metrics_dict[task]['xgboost'] = {'confusion_matrix': cm,
                                  'accuracy': acc,
                                  'precision': prec,
                                  'recall': rec,
                                  'f1': f1
                                  }

# Display performance metrics
display_metrics(task, 'xgboost', metrics_dict)

Validation Hold-out Score: 0.993001786777844


# Performance Metrics: xgboost

|   | accuracy | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.993002 | 0.99299 | 0.993013 | 0.993001 |


# Confusion Matrix: xgboost

|   | predicted 0 | predicted 1 |
|---|---|---|
| actual 0 | 20267 | 159 |
| actual 1 | 123 | 19747 |
