# MNIST Handwritten Image Dataset

* The MNIST data set consists of 70,000 images of handwritten digits from 0 to 9, and the task is to classify each image by the digit it shows. Each image is 28×28 pixels, so the input layer has 784 neurons (one per pixel). The output layer has 10 neurons, each giving the probability that the input image belongs to the corresponding class.
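
As a rough sketch of the shapes described above, the snippet below flattens one 28×28 image into a 784-element input vector and turns 10 raw output scores into probabilities. The random image, the untrained weight matrix `W`, and the `softmax` helper are illustrative assumptions; the notebook does not state which output activation it uses.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)

# Hypothetical 28x28 grayscale image with pixel values in [0, 1)
image = rng.random((28, 28))

# Flatten to a 784-element vector: one value per input neuron
x = image.reshape(-1)

# Hypothetical untrained weights mapping 784 inputs to 10 output scores
W = rng.normal(scale=0.01, size=(10, 784))
probabilities = softmax(W @ x)

print(x.shape)              # (784,)
print(probabilities.shape)  # (10,)
```

The predicted class would be the index of the largest probability, for example `probabilities.argmax()`.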

In [1]:
# Basic libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data handling packages
import pickle # data serialization library
import gzip # compression library
from PIL import Image
from scipy import ndimage

# Base classes and type hints
from sklearn.base import (TransformerMixin, 
                          BaseEstimator)
from typing import Dict, List

Some new libraries:
1. pickle: Python's object serialization library, which converts Python objects into byte streams and back.
2. gzip: Compresses and decompresses files with the .gz extension.
3. h5py: Stores and manipulates large numerical datasets in the HDF5 format (not imported in the cell above).
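
To make the roles of pickle and gzip concrete, here is a minimal round trip using only the standard library. The toy `(features, labels)` tuple and the temporary file name `toy_data.pkl.gz` are hypothetical stand-ins, not the MNIST file itself.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for a (features, labels) pair
original = ([[0.0, 0.5], [1.0, 0.25]], [3, 7])

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "toy_data.pkl.gz")

# Serialize the object with pickle and compress it with gzip in one step
with gzip.open(path, "wb") as f:
    pickle.dump(original, f)

# Decompress and deserialize it again
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == original)  # True
```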


In [2]:
os.getcwd()

'/Users/kavisanthoshkumar/Library/CloudStorage/OneDrive-IllinoisInstituteofTechnology/MachineLearning_CodingStuff/deep_learning_sans/Neural_Networks/FeedForward_NN'

# Load MNIST Dataset
* The MNIST dataset is divided into training, validation, and test sets.
* The user-defined function below opens the compressed file and extracts the training, validation, and test sets.

In [3]:
def load_data():
    # Open the gzip-compressed pickle file; the context manager closes it automatically
    with gzip.open('/Users/kavisanthoshkumar/Library/CloudStorage/OneDrive-IllinoisInstituteofTechnology/MachineLearning_CodingStuff/mnist.pkl.gz', 'rb') as f:
        # Deserialize the object; encoding='latin1' lets Python 3 read
        # NumPy arrays that were pickled under Python 2
        training_data, validation_data, test_data = pickle.load(
            f,
            encoding='latin1'
        )

    return training_data, validation_data, test_data


training_data, validation_data, test_data = load_data()

In [34]:
# feature and target dataset
print(f"Shape of the training feature dataset: {training_data[0].shape}")
print(f" The feature dataset is: \n {training_data[0]}")

print(f"\n\n Shape of the training label dataset: {training_data[1].shape}")
print(f"The label dataset is: \n {training_data[1]}")
print(f"Length of Unique Classes: {len(np.unique(training_data[1]))}")

# Number of features (pixel values) in a single input example
print(f"\n The number of points in a single input is: {len(training_data[0][1])}")

Shape of the training feature dataset: (50000, 784)
 The feature dataset is: 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


 Shape of the training label dataset: (50000,)
The label dataset is: 
 [5 0 4 ... 8 4 8]
Length of Unique Classes: 10

 The number of points in a single input is: 784
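
Each row of the feature array holds one flattened image, so reshaping a row recovers the 28×28 pixel grid. The sketch below uses a hypothetical zero-filled array in place of `training_data[0]` and assumes the pixels were flattened in row-major order.

```python
import numpy as np

# Hypothetical stand-in for training_data[0]: 3 flattened images of 784 values
features = np.zeros((3, 784), dtype=np.float32)

# Set one pixel: row 5, column 10 under row-major flattening
features[0, 28 * 5 + 10] = 1.0

# Reshape one flattened row back into a 28x28 grid
image = features[0].reshape(28, 28)

print(image.shape)   # (28, 28)
print(image[5, 10])  # 1.0
```

An array shaped like `image` can be passed to `plt.imshow(image, cmap="gray")` to view the digit.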


# Step 1: Data Preprocessing

### 1. Splitting the wrapped dataset into features and labels

In [38]:
# Unwrap Training dataset
X_train = training_data[0]
y_train = training_data[1]

# Unwrap Validation dataset
X_valid = validation_data[0]
y_valid = validation_data[1]

# Unwrap test dataset
X_test = test_data[0]
y_test = test_data[1]


### 2. Perform one-hot encoding on the label dataset

In [40]:
def one_hot_encoding(label_array: np.ndarray) -> np.ndarray:

    # Create a zero matrix of shape (number of labels, max label + 1)
    zero_arr = np.zeros((label_array.shape[0],
                         label_array.max() + 1))

    # For each row, set the column that matches the label to 1.0
    zero_arr[np.arange(zero_arr.shape[0]), label_array] = 1.0

    return zero_arr


y_train_encoded = one_hot_encoding(y_train)
y_valid_encoded = one_hot_encoding(y_valid)
y_test_encoded = one_hot_encoding(y_test)
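
A quick way to check a one-hot encoding is to decode it with `argmax` and compare against the original labels. The sketch below repeats the indexing trick from `one_hot_encoding` on a small hypothetical label array. As an assumption, it fixes the class count at 10 instead of using `label_array.max() + 1`, which would give fewer columns if a split happened to lack the digit 9.

```python
import numpy as np

# Hypothetical small label array standing in for y_train
labels = np.array([5, 0, 4, 1, 9])

# Assumption: 10 classes, one per digit 0-9
num_classes = 10

# Same fancy-indexing step as one_hot_encoding
encoded = np.zeros((labels.shape[0], num_classes))
encoded[np.arange(labels.shape[0]), labels] = 1.0

# argmax along each row recovers the original integer label
decoded = encoded.argmax(axis=1)

print(encoded[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(decoded)     # [5 0 4 1 9]
```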

In [44]:
print(f"training set(X_train) shape: {X_train.shape}")
print(f"training set(y_train) shape: {y_train.shape}")

print(f"\nValidation set(X_valid) shape: {X_valid.shape}")
print(f"Validation set(y_valid) shape: {y_valid.shape}")

print(f"\ntest set(X_test) shape: {X_test.shape}")
print(f"test set(y_test) shape: {y_test.shape}")

training set(X_train) shape: (50000, 784)
training set(y_train) shape: (50000,)

Validation set(X_valid) shape: (10000, 784)
Validation set(y_valid) shape: (10000,)

test set(X_test) shape: (10000, 784)
test set(y_test) shape: (10000,)
