In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6) # set default figure size, 8in by 6in

# Video W5 01: Neural Network Cost Function

[YouTube Video Link](https://www.youtube.com/watch?v=18X68kLAfKY&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=51)

This week we look at methods for training a neural network, i.e. ways to find the set of $\Theta$ parameters that optimize
our network's performance on a given training set.  We will be working with multilayer networks, which usually have at least 1 layer
of so-called hidden units and a final layer of output units.  We will use our networks for either binary or multi-class
classification.  For binary classification, we simply have a single unit in the output layer, and the answer we are looking for
is whether this is a positive or a negative case.  When we have $N$ classes, as we already discussed, we can use a network with
$N$ units in the output layer, and we train it so that each unit represents a particular class.
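
To train a network with $N$ output units, the class labels are usually recoded so that each training example has a target vector with a 1 in the position of its class and 0 everywhere else.  The following is a minimal sketch of that recoding.  The helper name `one_hot` and the assumption that labels are integers from 0 to $N - 1$ are illustrative choices, not something taken from the video.

```python
import numpy as np

def one_hot(y, num_classes):
    """Return an (m, num_classes) matrix with a single 1 in each row.

    Assumes the labels in y are integers in the range 0 .. num_classes - 1.
    """
    y = np.asarray(y, dtype=int)
    Y = np.zeros((y.size, num_classes))
    Y[np.arange(y.size), y] = 1.0   # row i gets a 1 in column y[i]
    return Y

labels = np.array([0, 2, 1, 2])     # hypothetical labels for 4 training examples
Y = one_hot(labels, 3)
print(Y)
```

Each row of `Y` then plays the role of the target vector $y^{(i)}$ for the $N$ output units.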

**Cost Function**

The cost function we need to use for a neural network is a generalization of the cost function we used for logistic regression. 
Recall that our logistic regression cost function with the regularization term looked like this:

$$
J(\theta) = -\frac{1}{m} \Big[ \sum_{i=1}^m  y^{(i)} \; \textrm{log} (h_\theta(x^{(i)})) + (1 - y^{(i)}) \; \textrm{log} (1 - h_\theta(x^{(i)})) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
$$

The biggest change in notation for a neural network is that we need to sum up over multiple output units (for the most general
multi-class case).  Thus, if we have $K$ output units, we sum up their individual costs:

$$
J(\Theta) = -\frac{1}{m} \Big[ \sum_{i=1}^m  \sum_{k=1}^{K} y_k^{(i)} \; \textrm{log} (h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)}) \; \textrm{log} (1 - (h_\Theta(x^{(i)}))_k) \Big] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \big( \Theta_{ji}^{(l)} \big)^2
$$

Notice also that the regularization term has become more complex.  This is because we need to add penalties to our cost for all
of the $\Theta$ parameter values in all of the layers of our network, excluding the parameters that multiply the bias units.
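
The cost function above can be sketched in NumPy as follows.  This is only an illustration under some assumptions: the function name `nn_cost` is hypothetical, the network outputs are assumed to be already computed and passed in as an $m \times K$ matrix `H`, and each weight matrix is assumed to store the bias parameters in its first column, so that column is left out of the regularization sum.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cost for a neural network with K output units.

    H      : (m, K) array of network outputs h_Theta(x), each strictly in (0, 1)
    Y      : (m, K) array of target outputs (0 or 1)
    thetas : list of weight matrices, each of shape (s_{l+1}, s_l + 1),
             with the bias parameters assumed to be in column 0
    lam    : regularization strength lambda
    """
    m = Y.shape[0]
    # sum over all examples i and all output units k
    data_term = -np.sum(Y * np.log(H) + (1.0 - Y) * np.log(1.0 - H)) / m
    # sum of squares of all non-bias parameters in all layers
    reg_term = (lam / (2.0 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)
    return data_term + reg_term

# small made-up example: 2 training examples, 2 output units
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
thetas = [np.ones((2, 3))]
print(nn_cost(H, Y, thetas, lam=0.0))
print(nn_cost(H, Y, thetas, lam=1.0))
```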

# Video W5 02: Backpropagation Algorithm

[YouTube Video Link](https://www.youtube.com/watch?v=SvAEX5taVKk&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=52)

The trick with multi-layer neural networks is calculating the partial derivative or gradient terms in the layers of the network.
Because of the layered nature of the network, the training set only tells us the correct values for the output units, so the
error of a hidden unit cannot be measured directly.  We can however start by calculating the delta, or difference, between our
outputs at the output layer and the correct output.  Given these deltas, we can work out deltas for each earlier layer in turn.
Thus backpropagation works by first doing a feed forward pass to calculate the activations for all of the units in all layers,
then backpropagating the delta errors, which give us the partial derivatives of the cost function with respect to the parameters
at each layer.  Don't worry too much if you don't follow the logic for how the backpropagation equations have been derived.  For
this course, it will be sufficient to understand the given backpropagation equations so that you can implement them in Python code.

The deltas for the output layer are computed directly as the difference between the activation of each unit and the
correct answer given in our training set $y$.  For the 4 layer example network from the video, the deltas at layer $L = 4$ are
given by:

$$
\delta_j^{(4)} = a_j^{(4)} - y_j
$$

As you can see, this is simply the difference between the output and the correct answer for each of the $j$ output units.  Given
these delta values for the output layer, we can estimate deltas for the 2 previous layers:

$$
\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \; .* \; g'(z^{(3)})
$$

$$
\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \; .* \; g'(z^{(2)})
$$

Notice that the calculation for each layer uses the delta calculated for the next higher layer.  The $g'(z)$ represents the
derivative of the sigmoid function, which can be derived directly using calculus; it works out to $g'(z^{(l)}) = a^{(l)} \; .* \; (1 - a^{(l)})$.
In the video the instructor uses a bit of MATLAB notation in these equations.  The $.*$ means we need to do an element-wise
multiplication between the left and right terms.  The result will be deltas for each of the units in the indicated layer of the
network.  These deltas, combined with the activations of the layer that feeds into them, give us the gradient or partial
derivatives, which can then be used in optimization methods like gradient descent to search for the best $\Theta$ parameters
for a given network to represent a given training set of data.
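
As a rough illustration of one backward step, the sketch below computes $\delta^{(3)}$ from $\delta^{(4)}$ using NumPy, where `*` between arrays is already element-wise, like MATLAB's `.*`.  The layer sizes, the random values and the helper names `sigmoid` and `sigmoid_gradient` are assumptions made for this example.  It also assumes that the first column of `Theta3` holds the bias parameters, so the first entry of $(\Theta^{(3)})^T \delta^{(4)}$ belongs to the bias unit and is dropped.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# hypothetical sizes: layer 3 has 3 units (plus a bias unit), layer 4 has 2 output units
rng = np.random.default_rng(0)
Theta3 = rng.uniform(-0.5, 0.5, (2, 4))   # maps layer 3 (bias + 3 units) to layer 4
z3 = rng.normal(size=3)                   # weighted inputs of the 3 units in layer 3
delta4 = np.array([0.2, -0.1])            # made-up output layer deltas, a4 - y

# (Theta3)^T delta4 has 4 entries; the first one corresponds to the bias unit
delta3 = (Theta3.T @ delta4)[1:] * sigmoid_gradient(z3)
print(delta3.shape)
```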

**Backpropagation Algorithm**

Given a training set of $m$ training examples $\{ (x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)}) \}$ the video next gives
pseudocode for the basic backpropagation algorithm.  There are a lot of details here, but following them is mostly
a matter of being comfortable with the notation.  We are using subscripts $i, j$ to denote connections or $\Theta$ parameters
from the $j^{th}$ unit in a previous layer to the $i^{th}$ unit in the next layer.  And we are using $l$ to indicate the layer
number in the network.

Given this notation, we create a number of matrices (denoted by capital Delta $\Delta$), one for each layer of parameters, that we
initially set to 0 and use as accumulators when computing the deltas.  The algorithm given in the video is:

Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$).

For $i = 1 \; \textrm{to} \; m$  (we iterate over each of our training examples)

- Set $a^{(1)} = x^{(i)}$ and perform forward propagation to compute the activation $a^{(l)}$ for all units in all layers $l = 2, 3, ..., L$.
- Using $y^{(i)}$ compute the delta in the output layer $\delta^{(L)} = a^{(L)} - y^{(i)}$
- Backpropagate and compute the delta values $\delta^{(L-1)}, \delta^{(L-2)}, ..., \delta^{(2)}$ in all previous layers
- Accumulate the computed deltas for input example $i$ by adding $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$

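Finally we average the accumulated values over the $m$ examples and add in a regularization term for the parameters that don't
multiply bias units.  The regularization term is scaled by $\frac{\lambda}{m}$, since that is the derivative of the
$\frac{\lambda}{2m} \big( \Theta_{ij}^{(l)} \big)^2$ penalty in the cost function:

$$
D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} \; \; \textrm{if} \; \; j \ne 0
$$

$$
D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \;\;\;\;\;\;\;\; \textrm{if} \; \; j = 0
$$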

These $D$ terms are the partial derivatives (the gradient) that we need in order to perform an optimization
like gradient descent on the $\Theta$ parameters of the network:

$$
\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}
$$
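
The whole procedure can be sketched in NumPy roughly as follows.  This is a minimal, unvectorized illustration rather than a reference implementation.  The function name `nn_gradients`, the network sizes and the random data are all made up for the example.  It assumes sigmoid units in every layer, weight matrices that store the bias parameters in column 0, and a regularization term scaled by $\frac{\lambda}{m}$ so that it matches the derivative of the $\frac{\lambda}{2m}$ penalty in the cost function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_gradients(thetas, X, Y, lam):
    """Return a list of D matrices, one per weight matrix in thetas.

    thetas : list of (s_{l+1}, s_l + 1) arrays, bias parameters in column 0
    X      : (m, n) array of inputs
    Y      : (m, K) array of target outputs
    lam    : regularization strength lambda
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in thetas]    # accumulators, all start at 0
    for i in range(m):
        # forward propagation, remembering each layer's activations with the
        # bias unit (always 1) prepended
        activations = []
        a = X[i]
        for T in thetas:
            a = np.concatenate(([1.0], a))
            activations.append(a)
            a = sigmoid(T @ a)
        # delta at the output layer
        delta = a - Y[i]
        # walk backwards through the layers, accumulating a_j * delta_i
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            Deltas[l] += np.outer(delta, a_l)
            if l > 0:
                # g'(z) = a * (1 - a) for sigmoid units; drop the bias entry
                delta = (thetas[l].T @ delta)[1:] * a_l[1:] * (1.0 - a_l[1:])
    grads = []
    for T, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += (lam / m) * T[:, 1:]           # do not regularize the bias column
        grads.append(D)
    return grads

# hypothetical network: 2 inputs, 3 hidden units, 2 output units
rng = np.random.default_rng(0)
thetas = [rng.uniform(-0.1, 0.1, (3, 3)), rng.uniform(-0.1, 0.1, (2, 4))]
X = rng.normal(size=(5, 2))
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]], dtype=float)
D = nn_gradients(thetas, X, Y, lam=1.0)
print([d.shape for d in D])
```

Each returned matrix has the same shape as the corresponding weight matrix, so the gradients can be unrolled and passed to an optimizer in the same way as the parameters themselves.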

# Video W5 03: Backpropagation Intuition

[YouTube Video Link](https://www.youtube.com/watch?v=q1bQDyV6lsg&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=53)



# Video W5 04: Implementation Note: Unrolling Parameters

[YouTube Video Link](https://www.youtube.com/watch?v=rcDJhGtXMvk&index=54&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW)

Here is an example of the unrolling/reshaping operations from the video, but in `Python`/`NumPy`:

In [2]:
# example of the matrix reshaping in Python/NumPy
Theta1 = np.ones( (10, 11) )
print(Theta1)
print(Theta1.shape)

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
(10, 11)


In [3]:
Theta2 = 2 * np.ones( (10,11) )
Theta3 = 3 * np.ones( (1, 11) )

In [4]:
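# the MATLAB/Octave notation Theta1(:) reshapes the matrix into a column vector; a close
# equivalent in NumPy is shown below (NumPy flattens row by row while MATLAB flattens column
# by column, which does not matter as long as we unroll and reshape with the same convention)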
Theta3Col = Theta3.reshape( (Theta3.size,1) )
print(Theta3Col)
print(Theta3Col.shape)

[[3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]]
(11, 1)


In [5]:
# so to create the thetaVec column vector, we can do this
thetaVec = np.concatenate((Theta1.reshape( (Theta1.size,1) ), 
                           Theta2.reshape( (Theta2.size,1) ),
                           Theta3.reshape( (Theta3.size,1) ) ))
print(thetaVec.shape)
# print(thetaVec)

(231, 1)


In [6]:
# to get back the theta matrices from the column vector, we can do something similar to MATLAB
# get the Theta1 values back to a 10x11 matrix, note we use 0 based indexing in NumPy arrays
np.reshape(thetaVec[0:110], (10, 11) )

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [7]:
np.reshape(thetaVec[110:220], (10, 11) )

array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]])

In [8]:
np.reshape(thetaVec[220:231], (1,11) )

array([[3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.]])
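
The slicing above can be wrapped into a pair of small helpers so that the index arithmetic is done in one place.  This is only a sketch: the names `unroll` and `roll` are hypothetical, and it uses a flat 1-D vector instead of a column vector, on the assumption that the optimizer being used accepts 1-D parameter arrays (as the routines in `scipy.optimize` do).

```python
import numpy as np

def unroll(matrices):
    """Flatten a list of matrices into a single 1-D parameter vector."""
    return np.concatenate([M.ravel() for M in matrices])

def roll(vec, shapes):
    """Split a 1-D parameter vector back into matrices of the given shapes."""
    matrices, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        matrices.append(vec[start:start + size].reshape(shape))
        start += size
    return matrices

Theta1 = np.ones((10, 11))
Theta2 = 2 * np.ones((10, 11))
Theta3 = 3 * np.ones((1, 11))

theta_vec = unroll([Theta1, Theta2, Theta3])
T1, T2, T3 = roll(theta_vec, [(10, 11), (10, 11), (1, 11)])
print(theta_vec.shape, T1.shape, T2.shape, T3.shape)
```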

# Video W5 05: Gradient Checking

[YouTube Video Link](https://www.youtube.com/watch?v=I-X8_EcGYik&index=55&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW)

This is a method that helps to debug an implementation of backpropagation before it is used with gradient descent or another
optimization method.  If you introduce subtle bugs in computing the cost or gradient values that are used in an optimization
method, the optimization can still appear to be working.  However, you can end up with parameters that are worse than the ones
you would get if your calculation of cost and gradient were completely correct.  The method shown in this video checks that the
gradients you compute with backpropagation agree with an independent numerical approximation, and thus that you are computing
the cost and gradients correctly.

This method is based on approximating the gradient or partial derivative, using the difference of the function at 2 points
that are close together (based on the definition of the derivative of a function at a point).

If the numerical approximation of the partial derivative is close to the computed $D$ values, then the implementation is
probably correct.  Here close means agreeing to some number of decimal places, for example.

Here is a very simple example of calculating the gradient (the derivative) using this numerical approximation.  Suppose
you have 

$$
J(\theta) = \theta^3
$$

Furthermore let $\theta = 1.0$ and $\epsilon = 0.01$.  We can approximate the derivative using the two-sided difference formula

$$
\frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}
$$

The true derivative, found analytically, for this function at the point $\theta = 1$ is

$$
\frac{d}{d \theta} J(\theta) = 3 \theta^2 = 3
$$

In [9]:
theta = 1.0
epsilon = 0.01

def J(theta):
    return theta**3.0

dtheta = ( J(theta + epsilon) - J(theta - epsilon) ) / (2.0 * epsilon)
print(dtheta)

3.0001000000000055
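
For a network the parameters form a vector rather than a single number, so the same two-sided difference is applied to one parameter at a time while the others are held fixed.  The sketch below shows one way this might look.  The function name `numerical_gradient`, the default `epsilon` and the test function are assumptions for illustration only.

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate the gradient of J at theta with two-sided differences."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = epsilon            # perturb only the i-th parameter
        grad.flat[i] = (J(theta + step) - J(theta - step)) / (2.0 * epsilon)
    return grad

def J(theta):
    # simple test cost whose exact gradient is 3 * theta**2
    return np.sum(theta ** 3)

theta = np.array([1.0, 2.0, -0.5])
approx = numerical_gradient(J, theta)
exact = 3.0 * theta ** 2
print(np.max(np.abs(approx - exact)))     # should be very small
```

Because this loop evaluates the cost twice per parameter, it is much slower than backpropagation, so it is typically used only to check the gradients once and then switched off before training.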


# Video W5 06: Random Initialization

[YouTube Video Link](https://www.youtube.com/watch?v=NhgB6FLyHJc&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=56)

Unlike before in logistic regression, there is a problem with setting the initial values for theta to 0 when using
backpropagation.  With an initial theta that is all 0's, all of the hidden units in a layer compute the same activation
(and receive the same delta values) for every input, so they stay identical to each other after every update.
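
A quick way to see the problem is to push two different inputs through a hidden layer whose parameters are all 0.  The sizes and input values below are made up for the illustration, and `sigmoid` is a helper defined just for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical hidden layer: 5 units fed by a bias unit plus 3 input features
Theta1_zero = np.zeros((5, 4))

x_a = np.array([1.0, 0.3, -1.2, 2.0])    # bias unit followed by 3 features
x_b = np.array([1.0, -4.0, 0.7, 0.1])    # a very different input

hidden_a = sigmoid(Theta1_zero @ x_a)
hidden_b = sigmoid(Theta1_zero @ x_b)
print(hidden_a)
print(hidden_b)
```

Every hidden unit outputs 0.5 for both inputs, so the hidden layer carries no information about the input and its units have no way to become different from one another.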

A simple way to get around this is to initialize all of the theta parameters to small random values, around 0.  The
equivalent way to create random Theta1, Theta2, etc. matrices of the right shape in NumPy is:

In [10]:
INIT_EPSILON = 0.01
Theta1 = np.random.uniform(-INIT_EPSILON, INIT_EPSILON, (10, 11) )
print(Theta1)

[[-3.12454680e-03  8.32004498e-03 -9.19365115e-03  3.33899996e-03
   8.27828660e-04 -9.84614110e-03 -1.99638783e-03 -8.60932238e-03
  -5.07014197e-03  4.67045349e-03 -6.34834815e-03]
 [ 4.52380522e-03  1.93205547e-03  8.31113968e-03 -9.91601304e-03
   2.80061488e-03  3.49869980e-03  7.22324685e-03 -2.69626347e-04
   5.57494574e-03  7.89761575e-03  7.25841850e-03]
 [-4.91499035e-03  5.64222426e-03  2.62512161e-03 -5.86958493e-03
   5.85540193e-04 -4.28655265e-03  9.84538695e-03  2.19330834e-04
  -7.09549344e-03  5.75257918e-03  6.77862673e-04]
 [-4.65847800e-03 -6.36570137e-03  1.21693034e-03 -4.01396900e-04
   3.76659162e-03 -4.94888821e-03 -5.06458896e-03 -6.46419376e-03
  -8.38951427e-03 -8.16099014e-03 -8.67199744e-03]
 [ 6.85606924e-03  4.88624278e-03  7.42659692e-03 -1.13437851e-03
  -3.01197701e-03  7.42765040e-03  5.02011390e-03 -7.87864795e-03
   1.75535621e-03 -4.31894087e-03  1.72692587e-05]
 [-3.28671085e-03 -6.69195768e-03 -9.21326953e-03  5.43986409e-03
   3.24819052e-03  

# Video W5 07: Putting it Together

[YouTube Video Link](https://www.youtube.com/watch?v=T7-ZsYlFH4M&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=57)



# Video W5 08: Autonomous Driving

[YouTube Video Link](https://www.youtube.com/watch?v=WkmplH50K1k&index=58&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW)


In [11]:
import sys
sys.path.append("../src") # add our class modules to the system PYTHON_PATH

from ml_python_class.custom_funcs import version_information
version_information()

              Module   Versions
--------------------   ------------------------------------------------------------
         matplotlib:   ['3.2.2']
              numpy:   ['1.18.5']
             pandas:   ['1.0.5']
