# DEMO 3: MNIST classification with NEAI CLI

The goal of this demo is to use the MNIST dataset to create a N Class classification project with NanoEdge using the command line interface (CLI).<br>
Here, in this notebook, we will discover the dataset and covert it in the NEAI format. 
<br><br>
The original MNIST dataset can be found here: <href>http://yann.lecun.com/exdb/mnist/<href><br>



The dataset files are idx3-ubytes files, we first need to install the python package to convert them. <br>
Here we use idx2numpy to convert those files into numpy arrays:

In [None]:
# install python package 
!pip install idx2numpy
!pip install numpy
!pip install pandas
!pip install matplotlib

In [None]:
import gzip # to unzip files
import idx2numpy # to convert idx files

import numpy as np # to use numpy arrays
import pandas as pd # to use dataframe

from matplotlib import pyplot as plt # to display images of the dataset
import os # to access some directories

## extracting the images and their labels from the original dataset zip

In [None]:
# unzip the image zip
train_digit_file = gzip.open('data/train-images-idx3-ubyte.gz','r')
# read the ubyte file as np array
image_train = idx2numpy.convert_from_file(train_digit_file)
# print the shape: 60000 images of size 28x28
print(image_train.shape)

In [None]:
# unzip the label zip
train_digit_labels = gzip.open('data/train-labels-idx1-ubyte.gz','r')
# read the ubyte file as np array
label_train = idx2numpy.convert_from_file(train_digit_labels)
# print the shape : 60000 labels (0 to 9)
print(label_train.shape)

In [None]:
# for the 5 first digit
for i in range(5):
    # print the label
    print('label:',label_train[i])
    # plot the image
    plt.imshow(image_train[i], cmap='gray')
    plt.show()

## Reshape images to NanoEdge format  
Currently, we extracted 60000 images of size 28x28 and we also have their corresponding labels.
<br>
But to use them in NanoEdge Ai Studio, we need to do a 2 things:
<br>
<ul>
  <li>Convert our 28x28 images to vector of size 784</li>
  <li>Create multiples .csv file for each digit instead of having them all in one .csv</li>
</ul>

First convert the images to vector : <br><br>
<img src='img_to_vector.PNG'  width=30% height=30%/>

In [None]:
# convert 28x28 image to 784x1 vectors
X_train = np.reshape(image_train,(image_train.shape[0],784))
print('X_train shape:',X_train.shape)

# we don't touch the label
y_train = label_train
print('y_train shape:',y_train.shape)

In [None]:
# Display first digit of the dataset as image and as the converted vector (values) 

# image
plt.imshow(image_train[0], cmap='gray')
plt.show()

# vector value of the converted image
plt.plot(X_train[0])
plt.show()

Then, we create a .csv file for each digit. <br>
We will create a pandas dataframe for convenience. The dataframe will contains the 784 value of each digit and the label to know what digit we are looking at

In [None]:
# create a dataframe from the numpy array containing all the digit as vector
df = pd.DataFrame(X_train) # so contains 784 values between 0 and 255

# then we add the label as the first columns of the dataframe
df.insert(0, 'label', y_train)
df

In [32]:
# this function split a dataset per digit and save them as csv

def split_dataframe(dataframe,n_samples,path):
    # for each digit
    for digit in dataframe.label.unique():

        # create a query
        query = 'label == ' + str(digit)
        print(query)

        # execute the query
        digit_df = dataframe.query(query)
        # get a sample of 500 signals
        digit_df = digit_df.sample(n = n_samples)

        # drop the label
        values = digit_df.drop('label', axis = 1)

        # save it to csv
        values.to_csv(f'{path}classif_{digit}.csv', header=None, index= None)

In [33]:
# create a folder for the training csv
os.system('mkdir train_files')
split_dataframe(df,500,'./train_files/')

label == 5
label == 0
label == 4
label == 1
label == 9
label == 2
label == 3
label == 6
label == 7
label == 8


We can also do the same for the testing file. <br>
All the previous code is the following cell

In [34]:
# unzip the image zip and convert from idx files to arrays
image_test = idx2numpy.convert_from_file(gzip.open('data/t10k-images-idx3-ubyte.gz','r'))
# reshape to vector
X_test = np.reshape(image_test,(image_test.shape[0],784))
# convert the labels from idx files to arrays
y_test = idx2numpy.convert_from_file(gzip.open('data/t10k-labels-idx1-ubyte.gz','r'))

# create the dataframe with the digit as vector and the corresponding label
df_test = pd.DataFrame(X_train)
df_test.insert(0, 'label', y_train)

# create test file directory and create the csv for each digit
os.system('mkdir test_files')
split_dataframe(df_test,200,'./test_files/test_')

label == 5
label == 0
label == 4
label == 1
label == 9
label == 2
label == 3
label == 6
label == 7
label == 8
