# Neural Networks for Handwritten Digit Recognition
* Design a neural network to recognize the ten handwritten digits 0-9. This is a multiclass classification task, in which one of n choices is selected (here n = 10). Automated handwritten digit recognition is widely used today, from reading zip codes (postal codes) on mail envelopes to reading amounts written on bank checks.

Import the necessary libraries as follows:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import linear, relu, sigmoid
import matplotlib.pyplot as plt

* The rectified linear unit (ReLU) activation function is defined as follows:

\begin{equation}
\nonumber
    a = \max(0, z)
\end{equation}
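
As a quick illustration of the definition above, here is a minimal NumPy sketch of ReLU applied element-wise. The name `relu_np` and the sample vector are made up for this example; the model in this notebook uses the built-in Keras `'relu'` activation instead.

```python
import numpy as np

def relu_np(z):
    """Element-wise ReLU, a = max(0, z). Hypothetical helper for illustration."""
    return np.maximum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])  # made-up inputs
a = relu_np(z)
print(a)  # negative entries are clipped to 0, the rest pass through unchanged
```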

In [None]:
def my_softmax(z):  
    """ Softmax converts a vector of values to a probability distribution.
    Args:
      z (ndarray (N,))  : input data, N features
    Returns:
      a (ndarray (N,))  : softmax of z
    """    
    ### START CODE HERE ### 
    ez = np.exp(z)        # exponentiate each element once
    a = ez / np.sum(ez)   # normalize so the entries sum to 1
    
    ### END CODE HERE ### 
    return a
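
The direct formula can overflow when the entries of z are large, because `np.exp` of a large number is infinite in floating point. A common remedy, sketched below under the assumption that z is a 1-D array, is to subtract the maximum entry before exponentiating. This leaves the result mathematically unchanged. The name `softmax_stable` and the test vectors are illustrative, not part of the original exercise.

```python
import numpy as np

def softmax_stable(z):
    """Softmax with the max entry subtracted first (hypothetical helper)."""
    shifted = z - np.max(z)      # largest entry becomes 0, so exp() cannot overflow
    ez = np.exp(shifted)
    return ez / np.sum(ez)

z_small = np.array([1.0, 2.0, 3.0, 4.0])
z_large = z_small + 1000.0       # np.exp(1000.0) would overflow to inf

p_small = softmax_stable(z_small)
p_large = softmax_stable(z_large)

print(p_small, p_small.sum())            # a probability distribution
print(np.allclose(p_small, p_large))     # shifting all inputs does not change softmax
```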

* Load the data into variables X and y
    * The data set contains 5000 training examples of handwritten digits.
    * Each training example is a 20-pixel x 20-pixel grayscale image of a digit.
    * Each pixel is represented by a floating-point number indicating the grayscale intensity at that location.
    * The 20 by 20 grid of pixels is **unrolled** into a 1 by 400 row vector.
    * Each training example becomes a single row in the data matrix X, which gives a 5000 x 400 matrix. Thus the shape of X is (5000, 400) and the shape of y is (5000, 1).
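
The layout described above can be sketched with synthetic data. The arrays `X_demo` and `y_demo` below are random stand-ins with the stated shapes, not the real data set. Whether the 20 x 20 grid needs a transpose to display upright depends on how the images were unrolled, which is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((5000, 400))               # stand-in for X: one unrolled image per row
y_demo = rng.integers(0, 10, size=(5000, 1))   # stand-in for y: one digit label per row

image = X_demo[0].reshape(20, 20)   # roll one row back into a 20 x 20 grid
row = image.reshape(1, 400)         # unroll it again into a 1 x 400 row vector

print(X_demo.shape, y_demo.shape, image.shape, row.shape)
```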

In [None]:
X,y=load_data()

* The model has two dense layers with ReLU activations followed by an output layer with a linear activation; the softmax is applied to that layer's outputs (the logits) rather than inside the layer. The parameters are sized for a neural network with 25 units in layer 1, 15 units in layer 2, and 10 output units in layer 3, one for each digit.
* Numerical stability improves if the softmax is grouped with the loss function rather than with the output layer during training. This has implications for both building the model (the output layer stays linear) and using it (predictions are logits, not probabilities).
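
To see why grouping the softmax with the loss helps, the NumPy sketch below compares two ways of computing the cross-entropy for one example. It illustrates the idea only and is not TensorFlow's actual implementation. The function names and the logit values are made up.

```python
import numpy as np

def loss_from_probs(logits, label):
    """Softmax first, then -log of the true-class probability (naive)."""
    p = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(p[label])

def loss_from_logits(logits, label):
    """Same loss rearranged as log-sum-exp minus the true-class logit."""
    m = np.max(logits)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m)))
    return log_sum_exp - logits[label]

mild = np.array([2.0, 1.0, 0.1])
extreme = np.array([1000.0, 0.0, -1000.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive_mild = loss_from_probs(mild, 0)
    naive_extreme = loss_from_probs(extreme, 0)   # exp(1000) overflows, result is nan

stable_mild = loss_from_logits(mild, 0)
stable_extreme = loss_from_logits(extreme, 0)     # finite

print(naive_mild, stable_mild)        # the two agree for moderate logits
print(naive_extreme, stable_extreme)  # only the combined form stays finite
```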

In [None]:
tf.random.set_seed(1234) # for consistent results
model = Sequential(
    [               
        ### START CODE HERE ### 
        tf.keras.Input(shape=(400,)),          # 400 input features (unrolled 20 x 20 image)
        Dense(units=25, activation='relu'),    # layer 1
        Dense(units=15, activation='relu'),    # layer 2
        Dense(units=10, activation='linear'),  # layer 3: logits, one per digit
        ### END CODE HERE ### 
    ], name = "my_model" 
)
model.summary()
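
The parameter counts for this architecture can be checked by hand. Each dense layer has one weight per input-output pair plus one bias per unit. The short sketch below does that arithmetic for the layer sizes used above; the variable names are illustrative.

```python
layer_sizes = [400, 25, 15, 10]   # inputs, layer 1, layer 2, layer 3

# weights (n_in * n_out) plus biases (n_out) for each dense layer
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [10025, 390, 160]
print(total_params)      # 10575
```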

* To compile the model, define the loss and the optimizer.
    * Passing from_logits=True to SparseCategoricalCrossentropy indicates that the softmax should be included in the loss calculation.
    * A popular choice for the optimizer is Adaptive Moment Estimation (Adam).

In [None]:
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
)

* Run gradient descent and fit the model to the data.
* In the training output, the first line, Epoch 1/40, shows which epoch the model is currently running. For efficiency, the training data set is broken into 'batches'. The default batch size in TensorFlow is 32. With 5000 examples in our data set, that gives 5000/32 = 156.25, or 157 batches per epoch (the last one partial). On the second line, 157/157 [==== shows how many batches have been executed so far.
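
The batch count quoted above follows from rounding up, since the final batch holds whatever examples are left over. A small sketch of the arithmetic, with illustrative variable names:

```python
import math

num_examples = 5000
batch_size = 32   # default batch size mentioned above

batches_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch)  # 157
print(last_batch_size)    # 8 examples in the final, partial batch
```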

In [None]:

history = model.fit(
    X,y,
    epochs=40
)

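* Make a prediction to check how well the trained model works. Because the output layer is linear, the values returned are logits rather than probabilities; the index of the largest one is the predicted digit.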

In [None]:
image_of_two = X[1015]

prediction = model.predict(image_of_two.reshape(1,400))  # prediction

print(f" predicting a Two: \n{prediction}")
print(f" Largest Prediction index: {np.argmax(prediction)}")
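
If probabilities are wanted, a softmax can be applied to the model's output afterwards. The sketch below uses a made-up (1, 10) array of logits in place of a real `model.predict` result, and does the conversion in NumPy. In TensorFlow, `tf.nn.softmax` serves the same purpose.

```python
import numpy as np

# Hypothetical logits with the same (1, 10) shape as the prediction above
logits = np.array([[-7.2, 1.3, 9.8, 0.4, -5.1, -3.0, -6.6, 2.2, -1.9, -4.4]])

# Row-wise softmax; subtracting each row's max keeps exp() from overflowing
shifted = logits - logits.max(axis=1, keepdims=True)
exp_shifted = np.exp(shifted)
probs = exp_shifted / exp_shifted.sum(axis=1, keepdims=True)

print(probs.sum(axis=1))                                     # each row sums to 1
print(np.argmax(logits, axis=1), np.argmax(probs, axis=1))   # same index either way
```

Softmax preserves the ordering of the logits, so the predicted digit is the same whether `np.argmax` is taken before or after the conversion.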