# Multi-Class Classification and Neural Networks

This notebook implements the third exercise from Andrew Ng’s Machine Learning Course on Coursera. In this exercise, a one-vs-all logistic regression and neural networks will be implemented to recognize hand-written digits (from 0 to 9)

## Multi-class Classification

Automated handwritten digit recognition is widely used today - from recognizing zip codes (postal codes) on mail envelopes to recognizing amounts written on bank checks. In this part, the previous exercise of logistic regression will be extended and applied to one-vs-all classification.

In [1]:
import numpy as np
import scipy.io as sio
import scipy.optimize as opt

### Loading the Dataset

In [2]:
data_dict = sio.loadmat('Data/ex3data1.mat')
print(data_dict)

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011', '__version__': '1.0', '__globals__': [], 'X': array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]]), 'y': array([[10],
       [10],
       [10],
       ...,
       [ 9],
       [ 9],
       [ 9]], dtype=uint8)}


In [3]:
print('\nFeature dimensions {} and target dimension {}\n'.format(data_dict['X'].shape, data_dict['y'].shape))


Feature dimensions (5000, 400) and target dimension (5000, 1)



Since we need to classify 10 digits with 5000 training sets, lets approach the problem with logistic regularization technique to avoid over-fitting. Our strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This algorithm is called One-Vs-Rest or One-against-All.

### Implementing Gradient and Cost functions

In [4]:
def sigmoid(z):
    """
    This function returns hypothesis
    """
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y, regParam):
    """
    This function returns the cost of using theta
    """
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    term1 = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    term2 = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    reg = (regParam / 2 * len(X)) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))
    return np.sum(term1 - term2) / (len(X)) + reg

def gradient(theta, X, y, regParam):
    """
    This function returns the gradient
    """
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    
    error = sigmoid(X * theta.T) - y
    
    grad = ((X.T * error) / len(X)).T + ((regParam / len(X)) * theta)
    
    grad[0, 0] = np.sum(np.multiply(error, X[:,0])) / len(X) # theta 0 isn't regularized
    
    return np.array(grad).ravel()

def oneVsAll(X, y, num_labels, regParam):
    """
    This function is the implementation of single class classifier.
    Looping over the 10 class will give us the theta estimates
    """
    m, n = X.shape
        
    all_theta = np.zeros((num_labels, n + 1)) # for all 10 classes
    
    X = np.insert(X, 0, values=np.ones(m), axis=1) # add bias to X (5000, 401)
    
    for i in range(1, num_labels + 1): # target is labelled from 1..10
        theta = np.zeros(n + 1) # theta for each class (401,)
        yi = np.array([1 if label == i else 0 for label in y]) 
        yi = yi.reshape(m,1)
        
        # minimize the objective function
        result = opt.fmin_tnc(func = cost, x0 = theta, fprime = gradient, args= (X, yi, regParam))
        
        #fmin = minimize(fun=cost, x0=theta, args=(X, yi, regParam), method='TNC', jac=gradient)
        all_theta[i-1,:] = result[0]
    
    return all_theta #(10, 401)

### Train the Logistic Classifier

In [5]:
num_labels, regParam = 10, 1

"""
X: features will be of dimension (5000, 401) ~ extra bias column
y: target will be of dimension (5000, 1)

all_theta = theta values for entire 10 class (10, 401)
theta = theta for single class (401,)

"""
all_theta = oneVsAll(data_dict['X'], data_dict['y'], num_labels, regParam)

In [6]:
print('Theta at which the cost is minimum: {}'.format(all_theta))

Theta at which the cost is minimum: [[-3.70247924e-05  0.00000000e+00  0.00000000e+00 ... -2.24803603e-10
   2.31962906e-11  0.00000000e+00]
 [-8.96250749e-05  0.00000000e+00  0.00000000e+00 ...  7.26120890e-09
  -6.19965354e-10  0.00000000e+00]
 [-8.39553305e-05  0.00000000e+00  0.00000000e+00 ... -7.61695535e-10
   4.64917608e-11  0.00000000e+00]
 ...
 [-7.00832398e-05  0.00000000e+00  0.00000000e+00 ... -6.92008998e-10
   4.29241471e-11  0.00000000e+00]
 [-7.65187918e-05  0.00000000e+00  0.00000000e+00 ... -8.09503253e-10
   5.31058706e-11  0.00000000e+00]
 [-6.63412359e-05  0.00000000e+00  0.00000000e+00 ... -3.49765866e-09
   1.13668515e-10  0.00000000e+00]]


### Prediction and Evaluation

In [7]:
def predict(X, theta):
    """
    Let's check what is the result of each test input
    """
    X = np.insert(X, 0 ,values = np.ones(X.shape[0]),axis = 1)
    X = np.matrix(X)
    theta = np.matrix(theta)
    
    hyp = sigmoid(X * theta.T)
    # we need to return the location at which hyp is maximum!!

    return (np.argmax(hyp, axis = 1) + 1)

In [8]:
y_pred = predict(data_dict['X'], all_theta)
y_pred

matrix([[10],
        [10],
        [10],
        ...,
        [ 9],
        [ 7],
        [10]])

In [9]:
correct = []
for (a,b) in zip(y_pred, data_dict['y']):
    if a == b:
        correct.append(1)
    else:
        correct.append(0)

accuracy = (sum(map(int, correct)) / float(len(correct)))
print ('One Vs Rest Classifier Accuracy = {}%'.format(accuracy * 100))

One Vs Rest Classifier Accuracy = 74.6%


## Neural Networks

In the previous part, a multi-class logistic regression was implemented to recognize handwritten digits. However, logistic regression cannot form more complex hypotheses as it is only a linear classifier. More features can be added (such as polynomial features) to logistic regression, but that can be very expensive to train.

In this part, a neural network will be implemented to recognize handwritten digits using the same training set as before. The neural network will be able to represent complex models that form non-linear hypotheses. This time, there will be used parameters from a neural network that have been already trained. The goal is to implement the feedforward propagation algorithm to use the weights for prediction.

In [10]:
# Initializing Training Set
X = data_dict.get('X')
y = data_dict.get('y')

In [11]:
m, n = X.shape

### Training Neural Network

The network choosen for predicting handwritten digits has 1 input layer, 1 hidden layer and 1 output layer.
We are already provided with the weights (theta) used in training this model. Load the file and use it for prediction and evaluation.

In [12]:
input_layer_size  = 400  # 20x20 Input Images of Digits
hidden_layer_size = 25   # 25 hidden units
num_labels = 10          # 10 labels, from 0 to 9  

In [13]:
theta_dict = sio.loadmat('Data/ex3weights.mat')

In [14]:
theta1 = np.array(theta_dict.get('Theta1'))
theta2 = np.array(theta_dict.get('Theta2'))

### Prediction using Feed Forward Propagation

In [15]:
def predict(theta1, theta2, X):
    """
    This function calculates the hidden layer activation values from weights
    and features.

    a1 = feature values (X) with bias of 1
    a2 = activation unit in hidden layer
    a3 = hypothesis from the output layer
    """

    [m, n] = X.shape # m = size of training set
                     # n = number of features
    
    a1 = np.array(np.column_stack(((np.ones((m,1))), X))) # add bias + 1
    z1 = np.dot(a1, np.transpose(theta1))
    a2 = sigmoid(z1) 

    a2 = np.array(np.column_stack(((np.ones((m,1))), a2))) # add bias + 1
    z2 = np.dot(a2, np.transpose(theta2))
    a3 = sigmoid(z2)

    return a3

In [16]:
# predicting values

predicted = predict(theta1, theta2, X)
predicted.shape

(5000, 10)

In [17]:
y_pred = np.array(np.argmax(predicted, axis=1) + 1) # python is zero-indexed

predicted[:3], y_pred[:3], y[:3]

(array([[1.12661530e-04, 1.74127856e-03, 2.52696959e-03, 1.84032321e-05,
         9.36263860e-03, 3.99270267e-03, 5.51517524e-03, 4.01468105e-04,
         6.48072305e-03, 9.95734012e-01],
        [4.79026796e-04, 2.41495958e-03, 3.44755685e-03, 4.05616281e-05,
         6.53412433e-03, 1.75930169e-03, 1.15788527e-02, 2.39107046e-03,
         1.97025086e-03, 9.95696931e-01],
        [8.85702310e-05, 3.24266731e-03, 2.55419797e-02, 2.13621788e-05,
         3.96912754e-03, 1.02881088e-02, 3.86839058e-04, 6.22892325e-02,
         5.49803551e-03, 9.28008397e-01]]),
 array([10, 10, 10]),
 array([[10],
        [10],
        [10]], dtype=uint8))

In [18]:
# Evaluation 

correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]
accuracy = (sum(map(int, correct)) / float(len(correct)))
print('Training Accuracy of Neural Network is {} %'.format(accuracy*100))

Training Accuracy of Neural Network is 97.52 %
