# MNIST Classifier - Hello World of ML!
<hr/>

## Introduction

This notebook is suffixed with __LEVEL_HERO_DEMO__ because it is a deployable, high-performance ML classifier for the [MNIST](https://paperswithcode.com/dataset/mnist) dataset. Your instructor presents this dataset during the lecture; please refer to the lecture notes for further details.

This notebook is a __demonstration notebook__ that aims to:
- test your DEV environment,
- show you how simple developing an ML model can be,
- show you how a small neural network can achieve near-human performance on the simple task of recognizing handwritten digits.


## Imports

In [None]:
import os
import idx2numpy
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf ; tf.random.set_seed(42)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.utils import to_categorical

## Notebook parameters

In [None]:
# NumPy

np.set_printoptions(linewidth=200) # to enlarge the print() line
np.random.seed(42) # the random seed init
np.set_printoptions(precision=3) # for numpy floats: number of decimals

# to suppress scientific notations like 1.500e+00 
# np.set_printoptions(suppress=True)

# TF
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # to disable TF debug logging messages 

## Globals & hyperparameters

In [None]:
# =============
# DATA_TOPDIR
# =============

# Contains the (un)compressed idx files of MNIST

# on assieoussou (my machine)
DATA_TOPDIR = "/home/ml/datasets/mnist"

# on your machine.... 

# =======
# MNIST 
# =======

# dataset files
TRAIN_IMAGES_DATASET_FILE = os.path.join(DATA_TOPDIR, "train-images-idx3-ubyte")
TRAIN_LABELS_DATASET_FILE = os.path.join(DATA_TOPDIR, "train-labels-idx1-ubyte")
TEST_IMAGES_DATASET_FILE = os.path.join(DATA_TOPDIR, "t10k-images-idx3-ubyte")
TEST_LABELS_DATASET_FILE = os.path.join(DATA_TOPDIR, "t10k-labels-idx1-ubyte")

# The MNIST images format
num_pixels = 28 * 28

# the total number of digits
num_classes = 10

# ==========================
# Training hyperparameters
# ==========================

epochs = 10

batch_size = 8 # <= on CPU
# batch_size = 64 # <= on GPU

# =======
# Demo 
# =======

# Demo dir: where demonstration images are placed
DEMO_DIR = os.path.join(DATA_TOPDIR, "demo")
os.makedirs(DEMO_DIR, exist_ok=True)

# for demo images 
nb_demo = 10
demo_prefix = "demo_img_"
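
As a rough guide to what the training hyperparameters above imply, the sketch below counts the gradient updates performed for each batch size. It assumes the standard MNIST training split of 60,000 images; `updates_per_run` is a hypothetical helper introduced only for this illustration.

```python
import math

# Assumption: the standard MNIST training split of 60,000 images.
num_train_samples = 60_000
epochs = 10

def updates_per_run(num_samples, batch_size, epochs):
    """Return (steps per epoch, total gradient updates)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

for batch_size in (8, 64):
    steps, total = updates_per_run(num_train_samples, batch_size, epochs)
    print(f"batch_size={batch_size}: {steps} steps/epoch, {total} updates in total")
```

A smaller batch size means many more (and noisier) updates per epoch, which is one reason training time differs so much between the CPU and GPU settings.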


## Data Preparation (Part I)

>__Note:__
>
>In this notebook, all the steps regarding the understanding of the data are skipped. In practice, however, data scientists commonly report spending most of their time on this stage (a figure of __80%__ is often quoted)!



In [None]:
# 1. Read each dataset into a NumPy array (images: 3D array, labels: 1D array)
train_x_ndarray = idx2numpy.convert_from_file(TRAIN_IMAGES_DATASET_FILE)
train_y_ndarray = idx2numpy.convert_from_file(TRAIN_LABELS_DATASET_FILE)

test_x_ndarray = idx2numpy.convert_from_file(TEST_IMAGES_DATASET_FILE)
test_y_ndarray = idx2numpy.convert_from_file(TEST_LABELS_DATASET_FILE)
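
To give an idea of what reading an IDX file involves, here is a minimal sketch of a parser for the unsigned-byte case used by MNIST. `parse_idx` is a hypothetical helper written for illustration and is not how `idx2numpy` is implemented; it assumes the usual IDX layout (two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension).

```python
import struct
import numpy as np

def parse_idx(raw: bytes) -> np.ndarray:
    """Parse an IDX buffer holding unsigned bytes (the MNIST case)."""
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a 4-byte big-endian integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)

# Synthetic example: 2 "images" of 3x3 pixels.
pixels = bytes(range(18))
raw = struct.pack(">HBB", 0, 0x08, 3) + struct.pack(">III", 2, 3, 3) + pixels
images = parse_idx(raw)
print(images.shape)  # (2, 3, 3)
```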

In [None]:
# quick check 
print(train_x_ndarray.shape, train_y_ndarray.shape) # => (60000, 28, 28) (60000,)
print(f"train_x_ndarray[0] = \n{train_x_ndarray[0]}, \ntrain_y_ndarray[0] = {train_y_ndarray[0]}" )
print(train_x_ndarray[0].shape, train_y_ndarray[0].shape) # => (28, 28) ()
plt.imshow(train_x_ndarray[0], cmap='gray')

## Model construction & configuration

>__Note:__
>
>In this notebook, all the steps regarding model selection and evaluation are skipped. We directly propose the chosen model architecture: a __Convolutional Neural Network (CNN)__.



In [None]:
def build_model():
    model = Sequential()
    # =================================
    # Feature extractor
    # =================================
    model.add(Input(shape=(28,28,1)))
    model.add(Conv2D(filters=32, kernel_size=(5,5), activation="relu"))
    model.add(MaxPool2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    # =================================
    # Neck
    # =================================
    model.add(Flatten())
    # =================================
    # Head 
    # =================================
    model.add(Dense(units=64, activation="relu")) # <= on CPU
    # model.add(Dense(units=128, activation="relu")) # <= on GPU
    model.add(Dense(units=num_classes, activation="softmax")) # class probabilities, as categorical_crossentropy expects
    # =================================
    # Compile model
    # =================================
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# Construct the model and show it
model = build_model()
model.summary()
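
The shapes and parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below redoes the arithmetic for the CPU configuration, assuming Keras defaults (stride 1 and `valid` padding for `Conv2D`, stride equal to the pool size for `MaxPool2D`); `conv2d_valid` and `dense` are hypothetical helper names.

```python
# Hypothetical helpers; the layer settings mirror build_model() above.
def conv2d_valid(h, w, c_in, filters, k):
    """Output shape and parameter count of a stride-1 Conv2D with 'valid' padding."""
    out = (h - k + 1, w - k + 1, filters)
    params = (k * k * c_in + 1) * filters  # kernel weights + one bias per filter
    return out, params

def dense(n_in, units):
    """Parameter count of a Dense layer: weights + biases."""
    return (n_in + 1) * units

shape, conv_params = conv2d_valid(28, 28, 1, filters=32, k=5)  # (24, 24, 32)
pooled = (shape[0] // 2, shape[1] // 2, shape[2])              # (12, 12, 32)
flat = pooled[0] * pooled[1] * pooled[2]                       # 4608
hidden_params = dense(flat, 64)                                # hidden Dense layer
output_params = dense(64, 10)                                  # output layer, 10 classes
total = conv_params + hidden_params + output_params
print(shape, pooled, flat, total)
```

Almost all parameters sit in the first `Dense` layer, because it is connected to every one of the 4608 flattened features.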

## Data Preparation (Part II)

>__Note:__
>
> Now we know our target ML model; thus we need to finalize the preparation of our data according to this model (see the Input layer defined by `Input(shape=(28,28,1))`): the images must match that input shape and be scaled, and the labels must be one-hot encoded. Let's go!

In [None]:
# 1. type cast, plus an explicit trailing channel axis to match Input(shape=(28,28,1))
X_train = train_x_ndarray.astype('float32')[..., np.newaxis]
X_test = test_x_ndarray.astype('float32')[..., np.newaxis]

# 2. normalization: scale pixel values from [0, 255] to [0, 1] (VERY IMPORTANT!)
X_train = X_train / 255
X_test = X_test / 255

# 3. one hot encoding of labels
y_train = to_categorical(train_y_ndarray, num_classes=num_classes)
y_test = to_categorical(test_y_ndarray, num_classes=num_classes)

X_train.shape, y_train.shape, X_test.shape, y_test.shape
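
The preparation steps above can be tried on synthetic data without loading MNIST. This is a minimal sketch: `prepare_images` and `one_hot` are hypothetical helpers, the latter a NumPy stand-in for `to_categorical`, and the channel axis is made explicit so that the result matches `Input(shape=(28,28,1))`.

```python
import numpy as np

def prepare_images(images_uint8):
    """Cast to float32, scale to [0, 1], and add a trailing channel axis."""
    x = images_uint8.astype("float32") / 255
    return x[..., np.newaxis]

def one_hot(labels, num_classes):
    """NumPy stand-in for one-hot encoding integer labels."""
    return np.eye(num_classes, dtype="float32")[labels]

# Synthetic stand-in for MNIST: 4 random 28x28 images with made-up labels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 3])

X = prepare_images(fake_images)
y = one_hot(fake_labels, num_classes=10)
print(X.shape, X.dtype, y.shape)  # (4, 28, 28, 1) float32 (4, 10)
```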

## Model Training & Evaluation

In [None]:
# Fit the model
model.fit(
    x=X_train, 
    y=y_train,
    batch_size=batch_size,
    validation_data=(X_test, y_test), 
    epochs=epochs,   
    verbose=2
)

# Final evaluation of the model
scores = model.evaluate(
    x=X_test, 
    y=y_test, 
    verbose=0
)

print("\n[INFO] Model Val Accuracy: %.2f%%, Error: %.2f%%" % (scores[1]*100, 100-scores[1]*100))
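
`model.evaluate` returns the loss followed by the compiled metrics, which is why `scores[1]` holds the accuracy. The NumPy sketch below shows, under simplifying assumptions, what these two numbers measure; the function names and the toy scores are illustrative and are not taken from Keras internals or from the trained model.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    clipped = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(clipped), axis=1)))

def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class equals the true class."""
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))

# Toy batch: 3 samples, 3 classes (made-up scores, not model outputs).
logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.5, 0.3]])
y_true = np.eye(3)[[0, 2, 0]]

probs = softmax(logits)
print(categorical_crossentropy(y_true, probs), accuracy(y_true, probs))
```

The cross-entropy only makes sense when each output row is a probability distribution, which is what a softmax output layer provides.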

## Construct the demonstration set

Here, we save a few test images into the `demo/` subdirectory. They are used later on, with the trained model, to demonstrate how accurate it is.

In [None]:
def reconstruct_demo_dir(nb, from_set, target_dir):
    # randomly collect nb distinct indices (index 0 included, no duplicates)
    demo_rnd_indices = np.random.choice(len(from_set), size=nb, replace=False)
    for i in demo_rnd_indices:
        plt.imsave(os.path.join(target_dir, demo_prefix + str(i) + ".jpg"), from_set[i], cmap='gray')
    print(f"[INFO] {len(demo_rnd_indices)} images have been created in {target_dir}")        

# check if the demo dir exists and contains at least nb_demo files

if os.path.exists(DEMO_DIR): 
    nb_files = len([name for name in os.listdir(DEMO_DIR) if os.path.isfile(os.path.join(DEMO_DIR, name))])
    if nb_files < nb_demo:
        # regenerate only when too few files exist, so manually added demo files are safe
        reconstruct_demo_dir(nb=nb_demo, from_set=test_x_ndarray, target_dir=DEMO_DIR)
    else:
        print(f"[INFO] Nothing to do because {DEMO_DIR} contains already enough images!")
else: 
    reconstruct_demo_dir(nb=nb_demo, from_set=test_x_ndarray, target_dir=DEMO_DIR)
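
Before a saved demo image can be fed to the model, it has to go through the same preparation as the training data. The sketch below assumes the image is read back as a `uint8` RGB array of shape `(28, 28, 3)`, which is typical for a JPEG file but should be checked on your machine; `prepare_demo_image` is a hypothetical helper.

```python
import numpy as np

def prepare_demo_image(rgb_uint8):
    """Turn an (H, W, 3) uint8 image into a (1, H, W, 1) float32 batch in [0, 1]."""
    gray = rgb_uint8[..., :3].mean(axis=-1)   # average the colour channels
    x = (gray / 255).astype("float32")        # same scaling as the training data
    return x[np.newaxis, ..., np.newaxis]     # add batch and channel axes

# Synthetic stand-in for an image read back from the demo directory.
fake_rgb = np.full((28, 28, 3), 128, dtype=np.uint8)
batch = prepare_demo_image(fake_rgb)
print(batch.shape, batch.dtype)  # (1, 28, 28, 1) float32
```

The resulting batch could then be passed to `model.predict(batch)`, and the index of the largest output taken as the predicted digit.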