# Neural Networks: Multiclass Classification with Regularization

This notebook covers the core concepts of neural networks. It reuses some of the explanations from another notebook: <a href="../2_logistic_regression/2_multiclass_classification_reg.ipynb"> Logistic Regression: Multiclass Classification with Regularization </a>

## Dataset
We are going to use the same dataset that was used in <a href="../2_logistic_regression/2_multiclass_classification_reg.ipynb"> Logistic Regression: Multiclass Classification with Regularization </a>.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as op
import math
from sklearn import preprocessing

%matplotlib inline
data = pd.read_csv('../2_logistic_regression/letter-recognition.data', header=None)
print(data.shape)
data.head()

(20000, 17)


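| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | I | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | D | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | N | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | G | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |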


## How Do Neural Networks Work?

Logistic regression applies the sigmoid function to a weighted sum of the input features and produces the hypothesis directly. A neural network instead has one or more intermediate layers (called "hidden layers"). The same logistic function is applied at each layer in turn, to a weighted sum of the previous layer's outputs, until the final layer produces the output hypothesis. In effect, the input features pass through several stacked applications of logistic regression before the hypothesis is produced.

<img src = "http://docs.opencv.org/2.4/_images/mlp.png" />

**Motivation**: With multiple layers, a fairly complicated hypothesis function can be represented efficiently, spread across a sequence of layers. Representing a similar function with plain logistic regression would require adding a very large number of artificial polynomial features; the number of such features easily runs into several thousands even for low-complexity functions. A neural network with only a few layers, by contrast, can represent a very complex hypothesis function on the original set of features.
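To put a rough number on that growth, the sketch below counts how many polynomial terms a logistic regression model would need if it used every monomial up to a given total degree. It relies on the standard counting identity that $n$ variables have $\binom{n+d}{d}$ monomials of total degree at most $d$ (constant term included). The helper name `num_polynomial_terms` is hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from math import comb

def num_polynomial_terms(num_features, degree):
    """Count monomials of total degree <= degree in num_features variables
    (the constant term is included in the count)."""
    return comb(num_features + degree, degree)

# The letter-recognition data has 16 numeric features (17 columns minus the label)
for degree in (2, 3, 4, 5):
    print(degree, num_polynomial_terms(16, degree))
# 2 153
# 3 969
# 4 4845
# 5 20349
```

Even at degree 4 the feature count is already in the thousands, which is the blow-up the paragraph above refers to.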

Consider a simple neural network with three inputs (plus the intercept $x_0$) and one hidden layer:
$$
\begin{bmatrix}
x_0 \newline
x_1 \newline
x_2 \newline
x_3
\end{bmatrix}
\rightarrow
\begin{bmatrix}
a_1^{(2)} \newline
a_2^{(2)} \newline
a_3^{(2)} \newline
\end{bmatrix}
\rightarrow
h_\Theta(x)
$$

The value of each "activation" node is obtained as follows:

$$
\begin{align*}
a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \newline
a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \newline
a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \newline
h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \newline
\end{align*}
$$

where $\Theta^{(1)}$ is the weight matrix for layer 1, that is, the "weights" applied to the units in layer 1 to obtain the units in layer 2, and $g$ is the sigmoid function.

So, in the case of neural networks, we will have a set of $\Theta$'s, where each $\Theta$ corresponds to one layer. Also, just like in logistic regression, we will add an intercept term (with value 1) at each layer.
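The per-node equations can also be written compactly in matrix form. The following is a sketch under a few assumptions: $x$ is the column vector $[x_0, x_1, x_2, x_3]^T$ with $x_0 = 1$, $\Theta^{(1)}$ is the $3 \times 4$ matrix with entries $\Theta_{ji}^{(1)}$, $\Theta^{(2)}$ is a $1 \times 4$ matrix, and the intermediate symbol $z^{(2)}$ is introduced here purely for illustration.

$$
\begin{align*}
g(z) &= \frac{1}{1 + e^{-z}} \newline
z^{(2)} &= \Theta^{(1)} x \newline
a^{(2)} &= g(z^{(2)}), \quad \text{then prepend } a_0^{(2)} = 1 \newline
h_\Theta(x) &= a^{(3)} = g(\Theta^{(2)} a^{(2)})
\end{align*}
$$

The code in this notebook stores the transposed layout: each example is a row of the input matrix and each $\Theta$ holds one column of weights per unit in the next layer, so a layer is computed as $g(A\Theta)$ rather than $g(\Theta a)$.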

The following code implements the hypothesis function:

In [59]:
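# Sigmoid function, applied element-wise to a scalar or NumPy array.
# np.exp is used instead of math.exp so that whole arrays are handled at once
# and very negative inputs do not raise OverflowError.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothesis for a single layer.
# A: one example per row, intercept column included.
# Theta: one column of weights per unit in the next layer.
def hypothesisLayer(Theta, A):
    return sigmoid(np.dot(A, Theta))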

# Hypothesis of the network
def hypothesisNeuralNetwork(Theta_list, X):
    num_layers = len(Theta_list)
    A_l = np.copy(X)
    for l in range(num_layers):
        Theta = Theta_list[l]
        A_l_1 = hypothesisLayer(Theta, A_l)
        # Add intercept term before feeding to next Theta
        A_l = np.append(np.ones((A_l_1.shape[0], 1)), A_l_1, axis = 1)
    
    return A_l[:, 1:]

Below is an example of implementing the XNOR function as a neural network. Here we have two layers: the first layer calculates the AND operation ($\theta = $[-30, 20, 20]) and the NOR operation ($\theta = $[10, -20, -20]), and the second layer calculates the OR operation ($\theta = $[-10, 20, 20]) on those two results. Combining them gives XNOR, since XNOR is true exactly when AND or NOR is true.

In [62]:
X = np.matrix('1 0 0; 1 0 1; 1 1 0; 1 1 1')
Theta_1 = np.matrix('-30 10; 20 -20; 20 -20')
Theta_2 = np.matrix('-10; 20; 20')

print("Input", X)

Theta_list = [Theta_1, Theta_2]

print("Output (XNOR)", hypothesisNeuralNetwork(Theta_list, X))


Input [[1 0 0]
 [1 0 1]
 [1 1 0]
 [1 1 1]]
Output (XNOR) [[  9.99954561e-01]
 [  4.54803785e-05]
 [  4.54803785e-05]
 [  9.99954561e-01]]


In the above output, the hypothesis is approximately 1 when both inputs are 0 or both are 1, and approximately 0 for the remaining inputs.
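To see why the combination works, it can help to evaluate each gate separately. The following is a minimal, self-contained sketch that uses the same weights as the example above; it defines its own `sigmoid` with plain NumPy arrays so it runs independently of the cells above, and the variable names (`theta_and`, `theta_nor`, `theta_or`, `hidden`) are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row is (intercept, x1, x2) for one of the four binary inputs
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])

theta_and = np.array([-30, 20, 20])
theta_nor = np.array([10, -20, -20])
theta_or = np.array([-10, 20, 20])

# Hidden layer: one unit computes AND, the other computes NOR
a_and = sigmoid(X @ theta_and)
a_nor = sigmoid(X @ theta_nor)

# Output layer: OR of the two hidden units (with an intercept column prepended)
hidden = np.column_stack([np.ones(len(X)), a_and, a_nor])
xnor = sigmoid(hidden @ theta_or)

print(np.round(a_and).astype(int))  # [0 0 0 1]
print(np.round(a_nor).astype(int))  # [1 0 0 0]
print(np.round(xnor).astype(int))   # [1 0 0 1]
```

Because the weights are large in magnitude, the sigmoid saturates close to 0 or 1, so rounding recovers the exact truth tables.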