**The Real Problem**

Recognizing multi-digit numbers in photographs captured at street level is an important component of modern-day map making. A classic example of a corpus of such street level photographs is Google’s Street View imagery comprised of hundreds of millions of geo-located 360 degree panoramic images. The ability to automatically transcribe an address number from a geo-located patch of pixels and associate the transcribed number with a known street address helps pinpoint, with a high degree of accuracy, the location of the building it represents. 

More broadly, recognizing numbers in photographs is a problem of interest to the optical character recognition community. While OCR on constrained domains like document processing is well studied, arbitrary multi-character text recognition in photographs is still highly challenging. This difficulty arises due to the wide variability in the visual appearance of text in the wild on account of a large range of fonts, colors, styles, orientations, and character arrangements. The recognition problem is further complicated by environmental factors such as lighting, shadows, specularities, and occlusions as well as by image acquisition factors such as resolution, motion, and focus blurs.

In this project we will use a dataset of images centred on a single digit (many of the images contain some distractors at the sides). Although this sample of the data is simpler than the full problem, it is still harder than MNIST because of the distractors.

**The Street View House Numbers (SVHN) Dataset**

SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data formatting. It comes from a significantly harder, unsolved, real-world problem: recognizing digits and numbers in natural scene images. SVHN is obtained from house numbers in Google Street View images.

**Link to the dataset:**

https://drive.google.com/file/d/1L2-WXzguhUsCArrFUc8EEkXcj33pahoS/view?usp=sharing

**Acknowledgement for the datasets.**

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Dataset site: http://ufldl.stanford.edu/housenumbers

The objective of the project is to learn how to implement a simple image classification pipeline based on a deep neural network. The goals of this project are as follows:

● Understand the basic Image Classification pipeline and the data-driven approach (train/predict stages)

● Fetch the data and understand the train/val/test splits. (10 points)

● Implement and apply a deep neural network classifier (feedforward neural network, ReLU activations) (20 points)

● Understand and be able to implement (vectorised) backpropagation (stochastic gradient descent, cross-entropy loss, cost functions) (20 points)

● Implement batch normalization for training the neural network (5 points)

● Print the classification accuracy metrics (5 points)

Happy Learning!

In [1]:
#For ease of working with large Dataset, we have uploaded it to Google Drive
#So, first let's mount the drive

#Importing drive module from google.colab library
from google.colab import drive

#Mount the drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#To deal with h5 File, let's import h5py library
import h5py

#Load the h5 File in read-only mode
f = h5py.File("/content/drive/My Drive/SVHN_single_grey1.h5", "r")

In [3]:
#Check what all Datasets we have
list(f.keys())

['X_test', 'X_train', 'X_val', 'y_test', 'y_train', 'y_val']

In [4]:
#Load each Dataset from the opened h5 File
X_test = f['X_test']
X_train = f['X_train']
X_val = f['X_val']
y_test = f['y_test']
y_train = f['y_train']
y_val = f['y_val']

In [5]:
#Import tensorflow
import tensorflow as tf

#Setting random seed to 42
tf.set_random_seed(42)

#Since, we have 10 digits, let's convert the Labels to One hot encoding
y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_val = tf.keras.utils.to_categorical(y_val, num_classes=10)
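
As a rough illustration of what the one-hot conversion produces, the following NumPy sketch builds the same kind of matrix for a few made-up labels. The helper name `one_hot` and the example `labels` are assumptions for this illustration and are not part of the notebook.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

labels = [3, 0, 9]            # hypothetical digit labels
encoded = one_hot(labels)

print(encoded.shape)          # one row per label, one column per digit
print(encoded[0])             # the 1 sits at index 3
```

Each row has exactly one non-zero entry, which is the form that the `categorical_crossentropy` loss expects for its targets.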

In [6]:
#Initialize Sequential model
model = tf.keras.models.Sequential()

#Reshape data from 2D to 1D -> 32x32 to 1024
model.add(tf.keras.layers.Reshape((1024,),input_shape=(32,32,)))

#Add 1st hidden layer
model.add(tf.keras.layers.Dense(4096, activation='relu'))

#Add 2nd hidden layer
model.add(tf.keras.layers.Dense(1024, activation='relu'))

#Add 3rd hidden layer
model.add(tf.keras.layers.Dense(256, activation='relu'))

#Add 4th hidden layer
model.add(tf.keras.layers.Dense(64, activation='relu'))

#Add 5th hidden layer
model.add(tf.keras.layers.Dense(16, activation='relu'))

#Add OUTPUT layer
model.add(tf.keras.layers.Dense(10, activation='softmax'))

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [7]:
#Create optimizer with non-default learning rate
sgd_optimizer = tf.keras.optimizers.SGD(lr=0.001)

#Compile the model
model.compile(optimizer=sgd_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

In [8]:
#Summary of the model built
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
reshape (Reshape)            (None, 1024)              0         
_________________________________________________________________
dense (Dense)                (None, 4096)              4198400   
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              4195328   
_________________________________________________________________
dense_2 (Dense)              (None, 256)               262400    
_________________________________________________________________
dense_3 (Dense)              (None, 64)                16448     
_________________________________________________________________
dense_4 (Dense)              (None, 16)                1040      
_________________________________________________________________
dense_5 (Dense)              (None, 10)                1
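
The parameter counts in a summary like this can be reproduced by hand: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below applies that formula to the layer widths used in this model; the variable names are chosen for the example only.

```python
# Layer widths from the flattened input to the output layer.
layer_sizes = [1024, 4096, 1024, 256, 64, 16, 10]

# One weight per (input, unit) pair plus one bias per unit.
layer_pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
param_counts = [n_in * n_out + n_out for n_in, n_out in layer_pairs]

for (n_in, n_out), count in zip(layer_pairs, param_counts):
    print(f"Dense {n_in:>4} -> {n_out:>4}: {count:>9,} parameters")

total_params = sum(param_counts)
print(f"Total: {total_params:,} parameters")
```

Almost all of the parameters sit in the first two hidden layers, which is typical when wide fully connected layers are applied directly to flattened images.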

In [9]:
#Importing numpy library
import numpy as np

#Converting validation Images into numpy Array because otherwise model.fit will encounter an Error at the end of each Epoch & come to a halt
X_val = np.array(X_val)

#Fit the model with 20 Epochs & Batch size of 1000. We need to use shuffle='batch' for h5 data.
model.fit(X_train,y_train,          
          validation_data=(X_val,y_val),
          epochs=20,
          shuffle='batch',
          batch_size=1000)

Train on 42000 samples, validate on 60000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f7d1d639c50>

In [10]:
#Predicting the values by model on test Dataset
y_pred=model.predict_classes(X_test)

In [11]:
#Converting one-hot encoded test labels back to class indices for comparison with predicted labels
test_y = np.argmax(y_test, axis=1)

In [12]:
#Importing metrics module from sklearn library
from sklearn import metrics

#Calculate the Confusion matrix for Test data
cm=metrics.confusion_matrix(test_y,y_pred)

#Print the Confusion matrix
cm

array([[   0,    0,  108,    2,    0, 1697,    0,    6,    0,    1],
       [   0,    0,  530,    2,    0, 1289,    0,    6,    0,    1],
       [   0,    0,  655,    1,    0, 1141,    2,    3,    0,    1],
       [   0,    0,  282,    2,    0, 1431,    0,    4,    0,    0],
       [   0,    0,  139,    2,    0, 1668,    0,    2,    0,    1],
       [   0,    0,  127,    1,    0, 1631,    0,    8,    0,    1],
       [   0,    0,  341,    1,    0, 1481,    0,    9,    0,    0],
       [   0,    0,  685,    1,    0, 1117,    3,    1,    0,    1],
       [   0,    0,  120,    0,    0, 1687,    0,    4,    0,    1],
       [   0,    1,  210,    1,    0, 1576,    0,   15,    0,    1]])

After 20 epochs, the model reaches only about 13% accuracy, barely above the 10% expected from guessing among ten digits. Given the complexity of the problem, caused by the wide variability in the visual appearance of the digits, what matters is that the accuracy is slowly improving, and training for longer may give better results.

However, the accuracy is likely to remain constrained: the confusion matrix shows that the model never predicts the digits 0, 4 and 8, and almost never predicts 1, 3, 6, 7 and 9. Nearly all of its predictions are 2 or 5.

So there is a chance that the model's accuracy will not go above 50% even if we train for long enough.
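
The figures discussed here can be checked directly from the confusion matrix. The sketch below copies the matrix printed above into NumPy, assuming the scikit-learn convention that rows are true digits and columns are predicted digits, and derives the overall accuracy and how often each digit was predicted.

```python
import numpy as np

# Confusion matrix of the first model, copied from the output above.
cm = np.array([
    [0, 0, 108, 2, 0, 1697, 0,  6, 0, 1],
    [0, 0, 530, 2, 0, 1289, 0,  6, 0, 1],
    [0, 0, 655, 1, 0, 1141, 2,  3, 0, 1],
    [0, 0, 282, 2, 0, 1431, 0,  4, 0, 0],
    [0, 0, 139, 2, 0, 1668, 0,  2, 0, 1],
    [0, 0, 127, 1, 0, 1631, 0,  8, 0, 1],
    [0, 0, 341, 1, 0, 1481, 0,  9, 0, 0],
    [0, 0, 685, 1, 0, 1117, 3,  1, 0, 1],
    [0, 0, 120, 0, 0, 1687, 0,  4, 0, 1],
    [0, 1, 210, 1, 0, 1576, 0, 15, 0, 1],
])

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
# Column sums count how often each digit was predicted at all.
predicted_counts = cm.sum(axis=0)

print(f"accuracy = {accuracy:.4f}")
for digit, count in enumerate(predicted_counts):
    print(f"digit {digit}: predicted {count} times")
```

The column sums make the collapse onto two classes easy to see: digits whose column sums to zero were never predicted for any test image.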

# Understand the basic Image Classification pipeline and the data-driven approach (train/predict stages)
In this image classification case study, the visual appearance of the digits varies widely.

Given this complexity, we train a neural network with 5 hidden layers, starting with as many as 4096 neurons in the first hidden layer and gradually reducing the width towards the output layer.

First, we train the model on the training dataset of 42,000 records. To check the model's performance during training, we evaluate it on the validation dataset, which in our case is larger than the training dataset at 60,000 records. This gives us a fair idea of whether the model is overfitting.

Finally, we test on the test dataset, which has 18,000 records in this case.
This is the final performance check of the model: if the test set is representative of the images the model will see in production, we can expect similar performance there.
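
To put the three split sizes side by side, here is a small sketch that computes each split's share of the records. The dictionary name `split_sizes` is an assumption for this example; the sizes are the ones reported in the text. Note that the validation size equals the training and test sizes combined. Whether the splits share any images cannot be told from the sizes alone and would need to be checked on the data itself.

```python
# Split sizes as reported for this dataset.
split_sizes = {"train": 42_000, "val": 60_000, "test": 18_000}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name:>5}: {size:>6,} records ({shares[name]:.0%} of all records)")

# Simple arithmetic observation about the sizes, not a statement about overlap.
print(split_sizes["train"] + split_sizes["test"] == split_sizes["val"])
```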

# Data fetching and understand the train/val/test splits
The data is saved in the h5 (HDF5) file format. We uploaded it to Google Drive so that we can mount the drive in Colab and fetch the data easily. Otherwise, every time we resumed work on the project, we would have to upload the data to Colab manually, since Colab allocates its resources only temporarily.

To read the data from the h5 file, we use Python's h5py library.

The data is divided into training, validation and test datasets, identified by the keys X_test, X_train, X_val, y_test, y_train and y_val. Here, X holds the images, stored as 32 x 32 grayscale arrays in which each pixel ranges from 0 (black) to 255 (white). y holds the label of each image, that is, the digit shown in the corresponding X.

# Implement and apply a deep neural network classifier including (feedforward neural network, RELU, activations)
We have implemented and applied a deep neural network classifier to classify the digits in the images.

Since we use only fully connected layers in this project, we first have to flatten each image into a 1-D array. This discards the spatial arrangement of the pixels, which is important information that would have been useful to keep given the complexity of the problem. Convolutional networks preserve it: they slide small image filters across the entire image to detect patterns in the digits. However, they are out of scope for this project, as that topic is covered in another course.

The network we have constructed is a feedforward neural network: information flows in one direction only, from the input layer, through the neurons of each hidden layer, to the output layer that produces the prediction. This forward pass on its own does not adjust any weights; it simply computes the output for whatever weights the network currently has. Adjusting the weights based on the loss between the prediction and the actual label is done separately during training. Once training is complete, the forward pass is all we need: we use it to predict on the validation or test dataset, and again when the model is deployed in production to predict the digits in new images, since the weights no longer need to change.
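
A forward pass through fully connected layers can be sketched in a few lines of NumPy. This is an illustrative sketch only: the layer widths are much smaller than in the notebook's model, the weights are random rather than trained, and the images are random stand-ins, so the predictions carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical, deliberately small layer widths: 1024 inputs, two hidden layers, 10 outputs.
sizes = [1024, 32, 16, 10]
weights = [rng.normal(0.0, 0.05, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Propagate a batch of flattened images through the network."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                 # linear step followed by ReLU
    return softmax(a @ weights[-1] + biases[-1])

images = rng.random((4, 32, 32))            # stand-in for four 32x32 grayscale images
probs = forward(images.reshape(4, -1))      # flatten to (4, 1024) before the dense layers

print(probs.shape)                          # one row of 10 class probabilities per image
print(probs.argmax(axis=1))                 # predicted digit for each image
```

No weight is changed anywhere in this code, which is the point: prediction needs only the forward direction.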

ReLU is a non-linear activation function that we apply after the linear function determined by the weights and the bias. Non-linear activations let the network solve non-linear problems, which is the usual case in the real world. ReLU is very simple: the output equals the input if the input is above 0, and is 0 otherwise. An activation function should be differentiable (at least almost everywhere) so that we can compute the slope that tells us in which direction to adjust the weights to move towards a minimum. Strictly speaking, ReLU is not differentiable at 0, but in practice a fixed slope is simply assigned at that point (commonly 0), which removes the limitation. Despite its simplicity, ReLU often does better than the sigmoid activation function: the slope of the sigmoid almost flattens out at the extremes, so training takes a long time to arrive at a solution if it starts from extreme values. We have used ReLU activation in our model.
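
The difference in slopes between the two activations is easy to see numerically. The sketch below evaluates ReLU, the sigmoid and their derivatives at a few sample inputs; the choice of slope 0 for ReLU at exactly 0 is a convention assumed here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Slope is 1 for positive inputs and 0 otherwise (0 is used at z == 0 by convention).
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print("relu        :", relu(z))
print("relu slope  :", relu_grad(z))
print("sigmoid     :", sigmoid(z).round(5))
print("sigmoid slope:", sigmoid_grad(z).round(5))
```

The sigmoid slope peaks at 0.25 for an input of 0 and is nearly zero at inputs of -10 and 10, while the ReLU slope stays at 1 for every positive input.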


# Understand and be able to implement (vectorised) backpropagation (cost stochastic gradient descent, cross entropy loss, cost functions)
Our model is trained with (vectorised) backpropagation, which Keras performs for us when we call model.fit.

Backpropagation is the technique used to work out how the weights and biases of the model's neurons should be adjusted to minimize the loss function, which in turn improves the model's accuracy. It does so by calculating the partial derivative of the loss function with respect to each weight and bias. Each partial derivative gives the slope of the loss function at the current value of that parameter while all other weights and biases are held constant. The slope points in the direction in which the loss increases, so we move in the opposite direction because we want to minimize the loss. We multiply the slopes by the learning rate, which determines the step size of the next weight adjustment. If the learning rate is too small, it takes many iterations to converge to a solution; if it is too large, we risk overshooting the minimum because of the huge step size. Because the loss function has a complex shape, it has several local minima, and there is a risk that the model gets trapped in one of them.
To reduce this risk, we can train multiple times: with a different random assignment of initial weights, the next run may reach a better minimum.
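
To make the chain of partial derivatives concrete, here is a minimal NumPy sketch of vectorised backpropagation for a network with one ReLU hidden layer and a softmax output. It is a toy under stated assumptions: the data are random, the sizes are tiny and all names are chosen for this example. It is not the computation Keras runs internally. A finite-difference check on one weight is included as a sanity test of the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny hypothetical problem: 5 samples, 8 inputs, 6 hidden units, 3 classes.
X = rng.normal(size=(5, 8))
Y = np.eye(3)[rng.integers(0, 3, size=5)]          # one-hot targets
W1, b1 = rng.normal(0.0, 0.5, size=(8, 6)), np.zeros(6)
W2, b2 = rng.normal(0.0, 0.5, size=(6, 3)), np.zeros(3)

def forward_loss(W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)                       # ReLU
    z2 = a1 @ W2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)        # stabilise the softmax
    p = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(Y * np.log(p), axis=1)) # categorical cross-entropy
    return loss, (z1, a1, p)

def backward(W2, cache):
    z1, a1, p = cache
    n = X.shape[0]
    dz2 = (p - Y) / n                              # softmax + cross-entropy gradient
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)                  # ReLU passes gradient only where z1 > 0
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    return dW1, db1, dW2, db2

loss, cache = forward_loss(W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(W2, cache)

# Central finite difference on a single first-layer weight.
eps = 1e-5
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (forward_loss(W_plus, b1, W2, b2)[0] - forward_loss(W_minus, b1, W2, b2)[0]) / (2 * eps)
print("analytic:", dW1[0, 0], "numeric:", numeric)

# One gradient-descent step: move against the slope, scaled by the learning rate.
learning_rate = 0.1
new_loss, _ = forward_loss(W1 - learning_rate * dW1, b1 - learning_rate * db1,
                           W2 - learning_rate * dW2, b2 - learning_rate * db2)
print("loss before:", loss, "loss after one step:", new_loss)
```

All samples in the batch are handled by matrix products at once, which is what "vectorised" refers to; there is no Python loop over individual images or neurons.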

Stochastic gradient descent is the method of adjusting the weights to minimize the loss function. The gradient is the slope of the loss function, and we descend along it to reach a minimum. Rather than computing the gradient over the whole training set at every step, which would be computationally expensive, we estimate it from one small batch at a time (1,000 images per batch in our case). Because each batch gives only a noisy, sample-based estimate of the true gradient, the technique is called stochastic gradient descent.
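
The mini-batch update loop can be sketched on a small synthetic problem. The example below trains a single softmax layer on made-up data (not the SVHN images); the data, sizes, learning rate and batch size are all assumptions chosen so that the sketch runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 600 samples, 20 features, 3 classes.
n_samples, n_features, n_classes = 600, 20, 3
true_W = rng.normal(size=(n_features, n_classes))
X = rng.normal(size=(n_samples, n_features))
Y = np.eye(n_classes)[(X @ true_W).argmax(axis=1)]   # one-hot labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, b):
    return -np.mean(np.sum(Y * np.log(softmax(X @ W + b) + 1e-12), axis=1))

W, b = np.zeros((n_features, n_classes)), np.zeros(n_classes)
learning_rate, batch_size, epochs = 0.1, 50, 20
initial_loss = cross_entropy(W, b)

for epoch in range(epochs):
    order = rng.permutation(n_samples)               # reshuffle the samples every epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        p = softmax(X[idx] @ W + b)
        grad_z = (p - Y[idx]) / len(idx)             # gradient estimated from this batch only
        W -= learning_rate * (X[idx].T @ grad_z)
        b -= learning_rate * grad_z.sum(axis=0)

final_loss = cross_entropy(W, b)
print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Each update uses only 50 of the 600 samples, so it is cheap, and the random reshuffling means that successive batches give different estimates of the gradient.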

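We use categorical cross-entropy as the loss for classifying the 10 digits in the image dataset. Cross-entropy is a log loss: it penalises the model according to the negative logarithm of the probability it assigns to the correct class, so confident wrong predictions are punished most heavily.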
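
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability given to the true class. The sketch below computes it for two made-up predictions over three classes; the function name and the example values are assumptions for illustration.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Per-sample cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)       # avoid log(0)
    return -np.sum(y_true * np.log(y_prob), axis=1)

# Both samples belong to class 1.
y_true = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])
# The first prediction favours the right class, the second the wrong one.
y_prob = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1]])

losses = categorical_cross_entropy(y_true, y_prob)
print(losses)
```

The first loss equals -ln(0.8) and the second equals -ln(0.2), so the prediction that puts most of its probability on a wrong class is penalised far more.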

Cost functions are simply loss functions that penalise wrong predictions of the model, and our aim is to minimize them as much as possible. For regression problems, we can use root mean square error or mean absolute error as the loss function. To calculate the adjusted weights and biases for the next iteration, we take the partial derivative of the loss function with respect to each weight and bias individually. This means that, keeping the other parameters constant, we get the direction of the slope, and to minimize the loss we move in the direction opposite to the slope.
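
For comparison with the classification loss, the regression losses mentioned here can be computed as follows. The target and prediction values are made up for the example.

```python
import numpy as np

# Hypothetical regression targets and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)          # mean squared error
rmse = np.sqrt(mse)                 # root mean square error, in the units of y
mae = np.mean(np.abs(errors))       # mean absolute error

print(f"MSE = {mse}, RMSE = {rmse:.4f}, MAE = {mae}")
```

Squaring makes the single error of 2.0 dominate the RMSE, whereas the MAE weights all errors in proportion to their size.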

In [13]:
#Initialize Sequential model
model2 = tf.keras.models.Sequential()

#Reshape data from 2D to 1D -> 32x32 to 1024
model2.add(tf.keras.layers.Reshape((1024,),input_shape=(32,32,)))

#Add 1st hidden layer
model2.add(tf.keras.layers.Dense(4096, activation='relu'))

#Adding Batch normalization
model2.add(tf.keras.layers.BatchNormalization())

#Add 2nd hidden layer
model2.add(tf.keras.layers.Dense(1024, activation='relu'))

#Adding Batch normalization
model2.add(tf.keras.layers.BatchNormalization())

#Add 3rd hidden layer
model2.add(tf.keras.layers.Dense(256, activation='relu'))

#Adding Batch normalization
model2.add(tf.keras.layers.BatchNormalization())

#Add 4th hidden layer
model2.add(tf.keras.layers.Dense(64, activation='relu'))

#Adding Batch normalization
model2.add(tf.keras.layers.BatchNormalization())

#Add 5th hidden layer
model2.add(tf.keras.layers.Dense(16, activation='relu'))

#Adding Batch normalization
model2.add(tf.keras.layers.BatchNormalization())

#Add OUTPUT layer
model2.add(tf.keras.layers.Dense(10, activation='softmax'))

In [14]:
#Create optimizer with non-default learning rate
sgd_optimizer2 = tf.keras.optimizers.SGD(lr=0.001)

#Compile the model
model2.compile(optimizer=sgd_optimizer2, loss='categorical_crossentropy', metrics=['accuracy'])

In [15]:
#Summary of the model built
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
reshape_1 (Reshape)          (None, 1024)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 4096)              4198400   
_________________________________________________________________
batch_normalization (BatchNo (None, 4096)              16384     
_________________________________________________________________
dense_7 (Dense)              (None, 1024)              4195328   
_________________________________________________________________
batch_normalization_1 (Batch (None, 1024)              4096      
_________________________________________________________________
dense_8 (Dense)              (None, 256)               262400    
_________________________________________________________________
batch_normalization_2 (Batch (None, 256)              

In [16]:
#Fitting the new model with 100 Epochs, 1000 batch-size
model2.fit(X_train,y_train,          
          validation_data=(X_val,y_val),
          epochs=100,
          shuffle='batch',
          batch_size=1000)

Train on 42000 samples, validate on 60000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/1

<tensorflow.python.keras.callbacks.History at 0x7f7d0c3420f0>

# Implement batch normalization for training the neural network
We have used batch normalization in the above model.

Batch normalization is a technique that can improve the speed, accuracy and stability of training.

As we can see for ourselves, one epoch takes about 58s on average with batch normalization versus about 53s without it. However, without batch normalization we reached merely 13% accuracy after 20 epochs, whereas with it we exceeded 17% after just the 2nd epoch. The slight increase in time per epoch arises because the model has to learn an additional scale and shift parameter per feature for each batch normalization layer, and normalizing the data involves extra computation.

This increase in time per epoch is negligible compared to the benefits of batch normalization, including faster convergence for such a complex model with more than 8 million parameters to learn.
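
The normalisation step itself is short. The following NumPy sketch shows a training-time forward pass of batch normalization on a made-up batch of activations: each feature is standardised using the batch mean and variance, then rescaled by a learnable `gamma` and shifted by a learnable `beta`. It is a simplified sketch, not the Keras layer: the moving averages that the layer keeps for use at prediction time are left out, and the function name and data are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalise each feature over the batch, then apply scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps guards against division by zero
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Hypothetical batch: 1000 samples, 16 features with a large mean and spread.
activations = rng.normal(loc=50.0, scale=20.0, size=(1000, 16))
gamma, beta = np.ones(16), np.zeros(16)       # typical starting values for scale and shift

normalised = batch_norm_train(activations, gamma, beta)

print(activations.mean(axis=0).round(1)[:4], activations.std(axis=0).round(1)[:4])
print(normalised.mean(axis=0).round(3)[:4], normalised.std(axis=0).round(3)[:4])
```

After the layer, every feature has mean close to 0 and standard deviation close to 1 regardless of its original scale, which keeps the inputs of the next layer in a stable range.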

There is some overfitting, but we can tackle it with regularization techniques: adding Dropout layers to randomly drop the output of some neurons, using L1 and L2 regularization to keep weights from taking very large values that would overwhelm other weights, data augmentation to artificially enlarge the training set, and early stopping if validation accuracy starts to deteriorate after a certain number of iterations.
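
As one example of these techniques, dropout can be sketched in NumPy as follows. This shows the common "inverted dropout" formulation at training time, with a made-up activation matrix and a drop rate of 0.5; it was not applied in this notebook.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the activations and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))                  # hypothetical hidden-layer outputs
dropped = dropout(hidden, rate=0.5, rng=rng)

print("fraction zeroed:", (dropped == 0).mean())
print("mean activation:", dropped.mean())
```

At prediction time no units are dropped; because of the rescaling during training, the activations need no further adjustment then.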

In [17]:
#Predicting on Test dataset with new Model
y_pred2=model2.predict_classes(X_test)

In [18]:
#Calculating confusion matrix for Test dataset
cm2=metrics.confusion_matrix(test_y,y_pred2)

#Printing the confusion matrix
cm2

array([[1218,   64,   21,   51,   69,   33,  103,   64,   44,  147],
       [  49, 1291,   90,   35,   68,   35,   42,  128,   37,   53],
       [  46,   85, 1173,  106,   53,   41,   21,  166,   53,   59],
       [  51,   60,   96,  974,   21,  188,   48,  142,   70,   69],
       [  61,   96,   60,   17, 1273,   30,  105,   64,   38,   68],
       [  44,   37,   40,  210,   23, 1021,   98,   98,   81,  116],
       [ 120,   59,   44,   26,   89,  111, 1149,   55,  103,   76],
       [  44,  103,   76,   75,   27,   41,   16, 1360,   21,   45],
       [  79,   93,   54,   79,   50,   88,  140,   57, 1028,  144],
       [ 140,   97,   42,   68,   44,   81,   43,   87,  107, 1095]])

# Print the classification accuracy metrics
We have printed the confusion matrix above and, as we can see, the overall test accuracy is much better than that of the previous model. After 100 epochs, we get a test accuracy of 64.34%.
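
The accuracy figure, along with per-digit precision and recall, can be derived from the confusion matrix itself. The sketch below copies the second model's matrix from the output above, assuming rows are true digits and columns are predicted digits as in scikit-learn's `confusion_matrix`.

```python
import numpy as np

# Confusion matrix of the batch-normalized model, copied from the output above.
cm2 = np.array([
    [1218,   64,   21,   51,   69,   33,  103,   64,   44,  147],
    [  49, 1291,   90,   35,   68,   35,   42,  128,   37,   53],
    [  46,   85, 1173,  106,   53,   41,   21,  166,   53,   59],
    [  51,   60,   96,  974,   21,  188,   48,  142,   70,   69],
    [  61,   96,   60,   17, 1273,   30,  105,   64,   38,   68],
    [  44,   37,   40,  210,   23, 1021,   98,   98,   81,  116],
    [ 120,   59,   44,   26,   89,  111, 1149,   55,  103,   76],
    [  44,  103,   76,   75,   27,   41,   16, 1360,   21,   45],
    [  79,   93,   54,   79,   50,   88,  140,   57, 1028,  144],
    [ 140,   97,   42,   68,   44,   81,   43,   87,  107, 1095],
])

correct = np.diag(cm2)
accuracy = correct.sum() / cm2.sum()
recall = correct / cm2.sum(axis=1)        # share of each true digit that was found
precision = correct / cm2.sum(axis=0)     # share of each predicted digit that was right

print(f"accuracy = {accuracy:.4f}")
for digit in range(10):
    print(f"digit {digit}: precision = {precision[digit]:.3f}, recall = {recall[digit]:.3f}")

best_digit = int(recall.argmax())
print("digit with the highest recall:", best_digit)
```

The per-digit figures show where the remaining errors concentrate, for example the frequent confusion between 3 and 5 visible in the matrix.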

There is still scope for improvement: the model has started to overfit the training dataset, reaching 86.52% accuracy on it after 100 epochs.

We can use some of the regularization techniques mentioned earlier to address the problem of overfitting in neural networks.