# Deep Feedforward Networks

If you are already familiar with the following concepts, you can move on to the next notebook:
>Reference http://www.deeplearningbook.org/contents/mlp.html

    1. Example: Learning XOR
    2. Gradient-Based Learning
    3. Hidden Units
    4. Architecture Design
    5. Back-Propagation and Other Differentiation Algorithms
    6. Historical Notes

In [None]:
# imports
import numpy as np
import matplotlib.pyplot as plt

**Deep feedforward networks**, also called **feedforward neural networks** or **multilayer perceptrons (MLPs)**, are the quintessential deep learning models.

A feedforward network defines a mapping ***y=f(x;θ)*** and learns the value of the parameters θ that result in the best function approximation. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called **recurrent neural networks.**

A feedforward network consists of:
   - Input Layer: where the input is given.
   - Hidden Layer: where the computation takes place; its inputs and outputs are internal to the network and are not observed in the training data.
   - Output Layer: where we expect the output. 
   
<img src = 'dffn/ffn.png'>
   
Here the w's are the weights (the parameters θ) that multiply the inputs.
   
To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in basic machine learning, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.

   1. One option is to use a very generic φ, such as the one implicitly used by the RBF kernel. Such a mapping has enough capacity to fit the training set, but generalization to the test set often remains poor because the mapping encodes little prior knowledge about the problem.
   2. Another option is to manually engineer φ, which requires a lot of analysis and time.
   3. The strategy of deep learning is to learn φ. Here we have a model y = f(x;θ,w) = φ(x;θ)<sup>T</sup>w: the parameters θ are used to learn φ, which is defined by the hidden layers of the network rather than specified by us, and w maps φ(x) to the output. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.
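
To make the role of φ concrete, here is a minimal sketch of the second option (a hand-engineered φ). The toy data `y = x^2`, the helper `fit_and_score` and the choice φ(x) = [x, x²] are assumptions made only for this illustration; they do not come from the reference.

```python
import numpy as np

# Toy data (an assumption for illustration): y = x^2 is not linear in x.
x = np.linspace(-1.0, 1.0, 21)
y = x ** 2

def fit_and_score(features, targets):
    """Least-squares fit of a linear model with bias; returns the mean squared error."""
    design = np.column_stack([features, np.ones(len(targets))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    residual = design @ coef - targets
    return float(np.mean(residual ** 2))

# Linear model applied to x itself
mse_raw = fit_and_score(x[:, None], y)
# Same linear model applied to the transformed input phi(x) = [x, x^2]
mse_phi = fit_and_score(np.column_stack([x, x ** 2]), y)

print("MSE of linear model on x      :", mse_raw)
print("MSE of linear model on phi(x) :", mse_phi)
```

The linear model cannot fit the data in the original input space, but the same model fits it essentially exactly once the input is re-represented by φ. A deep network tries to learn such a representation instead of having it designed by hand.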
   
Hidden layers apply **activation function(s)** to their inputs and return the transformed output. These activation functions are what give φ its non-linearity, so they contribute a great deal to the learning process.

> read this : https://en.wikipedia.org/wiki/Feedforward_neural_network

--------------

### 1. Example: Learning XOR
XOR (exclusive OR) is a logical operation with the following inputs and outputs:

    Input  Output
    0 0    0
    1 0    1
    0 1    1
    1 1    0

Our model provides a function y=f(x;θ), which will be optimized to match the true function f*(x).

Evaluated on our whole training set, the MSE loss function is:
$$J(θ) ={1\over4}\sum_{x∈X}(f^*(x) − f(x; θ))^2. $$

Now we have to define our model. If we choose a linear model like y = w<sup>T</sup>x + b and solve it with the normal equations, we get w = 0 and b = 0.5. The model then outputs 0.5 everywhere, which is not correct. This happens because XOR is not a linear function of its inputs.
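
The claim about the linear model can be checked directly. The sketch below solves the normal equations for the four XOR points; the variable names (`X_xor`, `design`, `theta`) are chosen here for illustration only.

```python
import numpy as np

# The four XOR training points, one per row, and their targets
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is learned together with w
design = np.column_stack([X_xor, np.ones(4)])

# Normal equations: (D^T D) theta = D^T y
theta = np.linalg.solve(design.T @ design, design.T @ y_xor)
w_lin, b_lin = theta[:2], theta[2]
pred_lin = design @ theta

print("w =", w_lin, " b =", b_lin)
print("predictions:", pred_lin)
```

The fitted weights are zero and the bias is 0.5, so the best linear model predicts 0.5 for every input.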

One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a simple feedforward network with one hidden layer containing two hidden units. The output layer is still just a linear regression model, but now it is applied to h rather than to x, where h is the output of the hidden layer.
<img src='dffn/xor.jpg'>

Here the w's are the weights and the θ's are the biases of our model.

We now have to define the activation functions for units 3, 4 and 5. In modern neural networks, the default recommendation is to use the <br>Rectified Linear Unit ( **ReLU(z) = max{0,z}** ).

<br>Activation functions<br>
A3 = ReLU(w3<sup>T</sup>.x + b3)<br>
A4 = ReLU(w4<sup>T</sup>.x + b4)<br>
A5 = w5<sup>T</sup>.h + b5

The final form of our neural network is:
$$f(x; W , c, w, b) = w^T\max\{0, W^Tx + c\} + b$$

We can then specify a solution to the XOR problem as:
<br>
```
W = [1 1
     1 1]
     
c = [0
    -1]
    
w = [1
    -2]
    
b = 0
```

> NOTE: we have not yet discussed how to obtain these parameters; we only know that they solve the problem.

Let's check it...

In [None]:
# XOR example
W = np.array([[1,1],[1,1]])
c = np.array([[0],[-1]])
w = np.array([[1],[-2]])
b = 0

X = np.array([[0, 0, 1, 1],[0, 1, 0, 1]])
Y = np.array([0, 1, 1, 0])

h = np.dot(W.T,X) + c # pre-activation of the hidden units
h = h*(h>0) # relu applied
output = np.squeeze(np.dot(w.T,h) + b) # final output
print("The output obtained is :\n",output)


-----------------
### 2. Gradient-Based Learning

Designing and training a neural network is not much different from training any other machine learning model with gradient descent.

The major difference between neural nets and other learning algorithms is that the non-linearity of a neural net causes most interesting cost functions to become non-convex.

#### 2.1 Cost Function

An important aspect of the design of a deep neural network is the choice of the cost function. In most cases, our parametric model defines a distribution p(y|x;θ) and we simply use the principle of maximum likelihood.

That is, we use the cross-entropy between the training data and the model's predictions as the cost function.

##### 2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood:

$$J(θ) = −E_{x,y∼p_{data}}\log p_{model}(y\ | \ x).$$

If the model is Gaussian, *p<sub>model</sub>(y | x) = N(y;f(x;θ),I)*, then we recover the mean squared error cost,

$$J(θ) ={1\over2}E_{x,y∼p_{data}}\|y − f(x; θ)\|^2+ const$$
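
As a quick numerical sanity check of this identity, the sketch below evaluates the negative log-density of a unit-variance Gaussian for several candidate predictions and compares it with half the squared error. The random vectors and the helper `gaussian_nll` are illustrative assumptions.

```python
import numpy as np

def gaussian_nll(y, mean):
    """-log N(y; mean, I), computed from the per-coordinate normal density."""
    density = np.exp(-0.5 * (y - mean) ** 2) / np.sqrt(2.0 * np.pi)
    return -np.sum(np.log(density))

rng = np.random.default_rng(0)
y_obs = rng.normal(size=3)                        # one observed target vector
preds = [rng.normal(size=3) for _ in range(5)]    # five candidate predictions f(x; theta)

nll = np.array([gaussian_nll(y_obs, p) for p in preds])
half_sq = np.array([0.5 * np.sum((y_obs - p) ** 2) for p in preds])

# The difference does not depend on the prediction: it is the "const" term
offset = nll - half_sq
print("NLL - 0.5*squared error:", offset)
```

Because the offset is the same for every prediction, minimizing the negative log-likelihood and minimizing the mean squared error select the same parameters.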

One requirement when designing a cost function for gradient descent is that its gradient must remain large and predictable enough to guide the learning algorithm towards the solution. Functions that saturate (become very flat) undermine this, so we avoid them as much as possible.

##### 2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution *p(y | x;θ)*, we often want to learn just one conditional statistic of y given x.
> Example: we may want f(x;θ) to predict the mean of y.

We want to design a neural net powerful enough that f can be any function from a wide class. We can then view the cost function as a functional (a mapping from functions to real numbers) rather than a function, and solve the problem with respect to the function f instead of the parameters.

Solving an optimization problem with respect to a function requires a mathematical tool called ***calculus of variations***, which gives these two powerful results:

$$f^*= \arg\min_f\ E_{x,y∼p_{data}}\|y − f(x)\|^2$$
f predicts the mean of y for each x,
<br>and,
$$f^*= \arg\min_f\ E_{x,y∼p_{data}}\|y − f(x)\|_1$$
f predicts the median of y for each x.
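
These two results can be illustrated for a single x by searching over constant predictions. The sample values of y (including the outlier 100) and the search grid are assumptions chosen for this sketch.

```python
import numpy as np

# Hypothetical samples of y for one fixed x, with a large outlier
y_samples = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
candidates = np.linspace(0.0, 100.0, 10001)      # candidate predictions, step 0.01

# Average loss of each candidate prediction over all samples
sq_loss = np.mean((y_samples[None, :] - candidates[:, None]) ** 2, axis=1)
abs_loss = np.mean(np.abs(y_samples[None, :] - candidates[:, None]), axis=1)

best_sq = candidates[np.argmin(sq_loss)]
best_abs = candidates[np.argmin(abs_loss)]

print("minimizer of squared error :", best_sq, " mean   =", y_samples.mean())
print("minimizer of absolute error:", best_abs, " median =", np.median(y_samples))
```

The squared error is pulled towards the outlier and lands on the mean, while the absolute error lands on the median.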

#### 2.2 Output Units

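The choice of cost function is tightly coupled with the choice of output unit: most of the time we use the cross-entropy between the data distribution and the model distribution, and the output unit determines the form it takes.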

##### 2.2.1 Linear Units for Gaussian Output Distributions

One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units.

$$y=W^T.h+b$$

The output is the mean of a conditional Gaussian distribution, so maximizing the log-likelihood is equivalent to minimizing the mean squared error.

##### 2.2.2 Sigmoid Units for Bernoulli Output Distributions

A sigmoid output unit is defined by:

$$y = σ\left(w^Th + b\right)$$

where z = w<sup>T</sup>h + b is called a **logit**: an unnormalized log probability that is exponentiated and normalized to obtain a Bernoulli distribution over y.
This approach of predicting probabilities in log space is natural to use with maximum likelihood learning.

When we use other loss functions, such as mean squared error, the loss can saturate any time σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units: the log in the cost function undoes the exponential in the sigmoid, so the gradient does not vanish when the model is confidently wrong.
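
The difference between the two losses shows up in their gradients with respect to the logit z. The following sketch uses one hand-picked, confidently wrong example (label 1, logit -10); the helper name `logistic` is an assumption of this example.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

y_true = 1.0     # true label
z = -10.0        # the model is confidently wrong: sigmoid(z) is close to 0

p = logistic(z)

# d/dz of the squared error (sigmoid(z) - y)^2: contains the factor sigmoid'(z) = p(1 - p)
grad_mse = 2.0 * (p - y_true) * p * (1.0 - p)

# d/dz of the negative log-likelihood -[y log p + (1 - y) log(1 - p)]: simplifies to p - y
grad_nll = p - y_true

print("gradient with MSE loss:", grad_mse)
print("gradient with NLL loss:", grad_nll)
```

With the squared error the gradient is almost zero exactly where a large correction is needed, whereas the maximum likelihood gradient stays close to -1.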

##### 2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes:

$$softmax(z)_i={\exp(z_i)\over\sum_j\exp(z_j)}$$

As you may have noticed, this is a generalization of the sigmoid, and here too maximum likelihood is almost always the preferred approach to training.
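
To see in what sense softmax generalizes the sigmoid, consider two classes with logits (z, 0). The sketch below uses a hypothetical helper `stable_softmax`, which subtracts the maximum before exponentiating; that shift is a common numerical safeguard and does not change the result.

```python
import numpy as np

def stable_softmax(z):
    """Softmax of a 1-D array; subtracting max(z) avoids overflow in exp."""
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / np.sum(e)

z_logit = 1.7
two_class = stable_softmax(np.array([z_logit, 0.0]))
sig = 1.0 / (1.0 + np.exp(-z_logit))

print("softmax([z, 0])[0] =", two_class[0])
print("sigmoid(z)         =", sig)
```

The probability assigned to the first class equals σ(z), so a sigmoid unit is a two-class softmax with one logit pinned to zero.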


##### 2.2.4 Other Output Types

The linear, sigmoid, and softmax output units described above are the most common. Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

> For more insight and exposure to further possibilities, study pages 182-183 and 185-187 in detail here: http://www.deeplearningbook.org/contents/mlp.html

In [None]:
# softmax visualization and linearity of the log-likelihood:
def softmax(z):
    n = np.exp(z)
    d = np.sum(n)
    return(n/d)
    
n = 100
x = np.linspace(-2,2,n)
y = softmax(x)
print("The sum of all values of softmax(x) = ",np.sum(y))

f, ax= plt.subplots(1,2,figsize=[12,4])
ax[0].plot(x,y,label="softmax")
yd = np.log(y) # log(softmax(x)) = x - log(sum(exp(x))), which is linear in x
ax[1].plot(x,yd,label="log(softmax)")
ax[0].legend()
ax[1].legend()
plt.show()

---------------------------

### 3. Hidden units

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. Rectified linear units are an excellent default choice of hidden unit.

Some activations, such as ReLU, are not differentiable at all points, yet gradient descent still works well on them. One reason is that training a neural network does not usually arrive at a true local minimum of the cost function but only reduces its value significantly, so it is acceptable for the minimum to lie at a point where the gradient is undefined. We can also simply define the gradient at such a point: for ReLU at z = 0, we use either the left derivative (0) or the right derivative (1). This approximation is justified because when a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely some small value that was rounded to 0.

#### 3.1 Rectified Linear Units and Their Generalizations
***ReLU***:<br> Easy to optimize because it is so similar to a linear unit.


$$ g(z) = max\{0, z\}$$

One drawback of rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.

Three generalizations of rectified linear units:

***Leaky ReLU***:

$$g(z, α)_i=\max(0, z_i) +α_i\min(0, z_i)$$

α<sub>i</sub> is fixed to a small value like 0.01.

***Parametric ReLU***

Uses the same formula but treats α<sub>i</sub> as a learnable parameter.
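
A minimal sketch of the shared formula follows. The function names `rectifier` and `rectifier_grad` are chosen for this example; setting `alpha = 0` gives the plain ReLU, and a small fixed `alpha` gives the leaky ReLU.

```python
import numpy as np

def rectifier(z, alpha):
    """g(z, alpha) = max(0, z) + alpha * min(0, z), applied element-wise."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def rectifier_grad(z, alpha):
    """Derivative of g w.r.t. z; at z == 0 the left-hand slope alpha is used by convention."""
    return np.where(z > 0, 1.0, alpha)

z_demo = np.array([-2.0, -0.5, 1.5])

relu_out = rectifier(z_demo, alpha=0.0)
leaky_out = rectifier(z_demo, alpha=0.01)
relu_grad = rectifier_grad(z_demo, alpha=0.0)
leaky_grad = rectifier_grad(z_demo, alpha=0.01)

print("ReLU       :", relu_out, " gradient:", relu_grad)
print("Leaky ReLU :", leaky_out, " gradient:", leaky_grad)
```

For negative inputs the ReLU gradient is exactly zero, which is the drawback described earlier; the leaky variant keeps a small non-zero gradient there. A parametric ReLU would additionally update `alpha` by gradient descent.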

***Maxout units*** 

generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:

$$g(z)_i= \max_{j∈G^{(i)}}z_j$$

where G<sup>(i)</sup> is the set of indices belonging to group i.

This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space. A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units.
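
The grouping step can be sketched in a few lines. The helper `maxout` below assumes that consecutive entries of z form a group, which is one possible layout and not something the reference prescribes.

```python
import numpy as np

def maxout(z, k):
    """Split the last axis of z into consecutive groups of k values and take each group's max."""
    z = np.asarray(z, dtype=float)
    if z.shape[-1] % k != 0:
        raise ValueError("the last dimension of z must be divisible by k")
    grouped = z.reshape(*z.shape[:-1], -1, k)
    return grouped.max(axis=-1)

# Six pre-activations, groups of k = 3 -> two maxout units
out_groups = maxout([1.0, 5.0, 2.0, -1.0, -3.0, 0.0], k=3)

# With k = 2 and one input pinned to 0, a maxout unit behaves like a ReLU: max(z, 0)
out_relu_like = maxout([-3.0, 0.0, 2.0, 0.0], k=2)

print(out_groups)
print(out_relu_like)
```

In a real network each of the k values in a group comes from its own learned affine function of the input, which is what lets the unit learn a piecewise linear shape.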

#### 3.2 Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function:

$$g(z) = σ(z)$$

or, the hyperbolic tangent activation function:

$$g(z) = tanh(z)$$

Sigmoid units are now more typically used in the output layer than in hidden layers, because they saturate across most of their domain and are strongly sensitive to their input only when z is near 0. tanh is very similar to the sigmoid but typically performs better as a hidden unit: since tanh(0) = 0, it resembles the identity function near 0, so training a deep neural network *y=w.tanh(U.tanh(V.x))* resembles training a linear model *y = w.U.V.x* as long as the activations are kept small. tanh also produces negative outputs, which the sigmoid does not.

#### 3.3 Other Hidden Units

Many other types of hidden units are possible but are used less frequently. For example, softmax units can act as a kind of switch inside a network. A few other reasonable choices are:

***RBF***:

This function becomes more active as x approaches a template W<sub>:,i</sub>. Because it saturates to 0 for most x, it can be difficult to optimize.
$$h_i=\exp\left(−{1\over σ^2_i}\|W_{:,i}− x\|^2\right)$$

***Softplus***:

This is a smooth version of the rectifier.
$$g(a) =ζ(a) =\log(1+e^a)$$

The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.


***Hard tanh***:

This is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded:

$$g(a) = max(−1, min(1, a))$$

In [None]:
# Visualization of activation functions:
def relu(z):
    o = np.maximum(z,0)
    return o

def leakyrelu(z,alpha):
    o = np.maximum(z,alpha*z)
    return o
    
def rbf(z,sigma,mu):
    # Gaussian RBF: exp(-(z - mu)^2 / sigma^2); the exponential was missing
    o = np.exp((-1/sigma**2)*(z-mu)**2)
    return o

def softplus(z):
    o = np.log(1+np.exp(z))
    return o

def sigmoid(z):
    o = 1/(1+np.exp(-z))
    return o
    
def tanh(z):
    o = np.tanh(z)
    return o

def hardtanh(z):
    o = np.maximum(-1,np.minimum(z,1))
    return o

n = 50
x = np.linspace(-10,10,n)
rx = relu(x)
rl = leakyrelu(x,0.1)
rr = rbf(x,1,0)
rp = softplus(x)
rs = sigmoid(x)
rt = tanh(x)
rh = hardtanh(x)


f, ax = plt.subplots(2,4,figsize=[16,8])
ax[0,0].plot(x,x)
ax[0,0].set_title("Linear")
ax[0,1].plot(x,rx)
ax[0,1].set_title("ReLU")
ax[0,2].plot(x,rl)
ax[0,2].set_title("Leaky ReLU")
ax[0,3].plot(x,rr)
ax[0,3].set_title("RBF")
ax[1,0].plot(x,rp)
ax[1,0].set_title("Softplus")
ax[1,1].plot(x,rs)
ax[1,1].set_title("Sigmoid")
ax[1,2].plot(x,rt)
ax[1,2].set_title("tanh")
ax[1,3].plot(x,rh)
ax[1,3].set_title("Hard tanh")
plt.show()

### 4. Architecture Design

The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

#### 4.1 Universal Approximation Properties and Depth

Even when a network is able to represent a function, learning can fail for two reasons:

   - First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function.
   - Second, the training algorithm might choose the wrong function as a result of overfitting.
   
According to the ***universal approximation theorem***, there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.

A feedforward network with a single hidden layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error. However, a model with more layers than required may overfit.

> See fig 6.5 (pg 196) here : http://www.deeplearningbook.org/contents/mlp.html and read up to page 199 for good insights.

-----------------

### 5. Back-Propagation and Other Differentiation Algorithms

So far we have propagated information from the input layer to the output layer and computed the cost function. This is called ***forward propagation***. For gradient descent, however, we need the gradient of the cost function with respect to each parameter, so we pass gradient information from the output layer back towards the input layer. This is called ***back-propagation*** (backprop). Back-propagation only computes the gradient; another algorithm, such as batch gradient descent or stochastic gradient descent, then uses it to update the parameters.

#### 5.1 Computational Graphs

We can view each node in the network as a variable, together with a set of allowable operations on it, and then turn the network into a graph structure (a computational graph) over which the computation is carried out.

#### 5.2 Chain Rule of Calculus

During back-propagation we compute the derivative of the cost function with respect to a given parameter using the chain rule, so we need not derive the derivative analytically from scratch for every parameter.

Suppose that x ∈ R<sup>m</sup>, y ∈ R<sup>n</sup>, g maps from R<sup>m</sup> to R<sup>n</sup>, and f maps from R<sup>n</sup> to R. If y=g(x) and z=f(y), then:

$${∂z\over∂x_i}=\sum_j{∂z\over∂y_j}{∂y_j\over∂x_i}$$
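
The formula can be verified numerically on a small example. In the sketch below, the particular functions g(x) = tanh(Ax) and f(y) = Σ y², the random matrix `A` and the finite-difference step are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))     # g maps R^3 -> R^2
x0 = rng.normal(size=3)

def g(x):
    return np.tanh(A @ x)

def f(y):
    return np.sum(y ** 2)

y0 = g(x0)

# Chain rule: dz/dx_i = sum_j (dz/dy_j) * (dy_j/dx_i)
dz_dy = 2.0 * y0                              # shape (2,)
dy_dx = (1.0 - y0 ** 2)[:, None] * A          # Jacobian, entry [j, i] = dy_j/dx_i
dz_dx = dy_dx.T @ dz_dy                       # the sum over j

# Central finite differences as an independent check
eps = 1e-6
numeric = np.zeros_like(x0)
for i in range(x0.size):
    step = np.zeros_like(x0)
    step[i] = eps
    numeric[i] = (f(g(x0 + step)) - f(g(x0 - step))) / (2.0 * eps)

print("chain rule        :", dz_dx)
print("finite differences:", numeric)
```

The sum over j is simply a product of the transposed Jacobian with the gradient with respect to y, which is the operation back-propagation repeats at every layer.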

#### 5.3 Recursively Applying the Chain Rule to Obtain Backprop

We start at the output layer, computing the gradient of the cost with respect to its parameters. We then apply the chain rule to carry the gradient obtained at layer (i+1) back to layer i, which gives the gradient with respect to the parameters of layer i. The activation function and the input vector of each node in the computational graph determine what each of these gradients turns out to be.
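
Before reading the algorithms, it may help to see the recursion on the smallest possible case. The following is a compact sketch for a 2-2-1 network with a tanh hidden layer, a sigmoid output and cross-entropy cost, checked against finite differences. It is deliberately independent of the implementation later in this notebook; the names (`forward`, `backward`, `cost`, `X_demo`, `Y_demo`) and the choice of tanh are assumptions of this example.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    A1 = np.tanh(W1 @ X + b1)                        # hidden layer
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))       # sigmoid output
    return A1, A2

def cost(params, X, Y):
    _, A2 = forward(params, X)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1, A2 = forward(params, X)
    dZ2 = (A2 - Y) / m                    # output layer: sigmoid combined with cross-entropy
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2.T @ dZ2                      # chain rule: carry the gradient back one layer
    dZ1 = dA1 * (1.0 - A1 ** 2)           # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)
X_demo = np.array([[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]])
Y_demo = np.array([[0.0, 1.0, 1.0, 0.0]])
params = [rng.normal(size=(2, 2)), np.zeros((2, 1)),
          rng.normal(size=(1, 2)), np.zeros((1, 1))]

analytic = backward(params, X_demo, Y_demo)

# Gradient check: perturb each parameter and compare with the back-propagated value
eps = 1e-6
max_err = 0.0
for p, grad in zip(params, analytic):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        c_plus = cost(params, X_demo, Y_demo)
        p[idx] = old - eps
        c_minus = cost(params, X_demo, Y_demo)
        p[idx] = old
        max_err = max(max_err, abs((c_plus - c_minus) / (2.0 * eps) - grad[idx]))

print("largest difference between backprop and finite differences:", max_err)
```

A gradient check of this kind is a cheap way to catch sign or shape mistakes in a hand-written backward pass.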

**Forward propagation Algorithm (6.3) :**
> Read here page 208 : http://www.deeplearningbook.org/contents/mlp.html

**Back propagation Algorithm (6.4) :**
> Read here page 209 : http://www.deeplearningbook.org/contents/mlp.html

For a hand-computed example of back-propagation:
> Example back propagation: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

You may have noticed by now that with saturating activations the gradient shrinks as it travels from the output layer to the input layer, resulting in slow learning for the layers near the input. This is one reason we usually prefer ReLU over sigmoid activation functions in hidden layers.
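
A rough back-of-the-envelope sketch of this effect: the derivative of the sigmoid is at most 0.25, so the activation derivatives alone shrink the gradient geometrically with depth. The depth of 10 is an arbitrary choice, and the calculation deliberately ignores the weight matrices, which also scale the gradient in a real network.

```python
import numpy as np

def logistic_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depth = 10

# Best case for the sigmoid: every unit sits at z = 0, where the derivative peaks at 0.25
best_case_sigmoid = logistic_grad(0.0) ** depth

# An active ReLU unit (z > 0) has derivative exactly 1, so the product does not shrink
active_relu = 1.0 ** depth

print("product of sigmoid derivatives over", depth, "layers:", best_case_sigmoid)
print("product of active ReLU derivatives over", depth, "layers:", active_relu)
```

Even in the most favourable case, ten sigmoid layers scale the gradient by less than one millionth.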

Let's implement back-propagation for the XOR example we were working on...
<br>
We will maintain linear and activation caches of the inputs and gradients while computing, to make things easier.

In [None]:
# modeling the data
X = np.array([[0, 1, 0, 1],[0, 0, 1, 1]]) # input data structure 2x4
Y = np.array([[0, 1, 1, 0]]) # output data structure 1x4

layers = [2,2,1] # architecture of our network 2(input) --> 2(hidden) --> 1(output)

print("Representing data")
plt.scatter(X[0,:].T,X[1,:].T,c=Y.T)
plt.show()

Let's define the activation functions and their gradients with respect to the input, given the gradient coming from the next layer:

In [None]:
# defining activation functions and gradients w.r.t input:
def relu(z): # not a plain ReLU: slope 1 for z>0 and slope -0.1 for z<0 (a leaky-style variant)
    o = z*(z>0) - 0.1*(z<0)*z
    return o, z # z is returned as the activation cache

def drelu(dA,z): # derivative of the function above: 1 for z>0, -0.1 for z<0
    o = (z>0)*1 - 0.1*(z<0)
    return o*dA

def sigmoid(z):
    o = 1/(1+np.exp(-z))
    return o, z

def dsigmoid(dA,z):
    o = 1/(1+np.exp(-z))
    return (o*(1-o))*dA

Let's initialize our parameters for the different layers...

In [None]:
# Initialization of parameters w's and b's
def initparams(layers):
    """
    Arguments:
    layers     -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layers[l], layers[l-1])
                    bl -- bias vector of shape (layers[l], 1)
    """
    
    np.random.seed(15)
    parameters = {}
    L = len(layers) # length/ depth of the network
    
    # NOTE : Representing w's and b's as matrices having dimensions, rows -> (i)th layer size, columns -> (i-1)th layer size:
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers[l], layers[l-1])*0.01
        parameters['b' + str(l)] = np.zeros((layers[l], 1))
        assert(parameters['W' + str(l)].shape == (layers[l], layers[l-1]))
        assert(parameters['b' + str(l)].shape == (layers[l], 1))
    
    return parameters

Implement the linear part of a layer's forward propagation z = W.h + b

In [None]:
def linfwd(A, W, b):
    """
    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter 
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """
    
    Z = np.dot(W,A)+b
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    
    return Z, cache

Implement the forward propagation for the LINEAR->ACTIVATION layer: 

In [None]:
def linactfwd(A_prev,W,b,act):
    """
    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    act -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    
    # The linear step is the same for both activations
    Z, linear_cache = linfwd(A_prev, W, b)
    
    if act == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif act == "relu":
        A, activation_cache = relu(Z)
    else:
        raise ValueError("unknown activation: " + str(act))
        
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)  # cache = ((A_prev, W, b), Z) for that layer
    return A, cache    

Let's define forward propagation:

In [None]:
def lmodelfwd(X, parameters):
    """
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initparams()
    
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linactfwd() with "relu" (there are L-1 of them, indexed from 0 to L-2)
                the cache of linactfwd() with "sigmoid" (there is one, indexed L-1)
    """
    
    caches = []
    A = X
    L = len(parameters) // 2     # number of layers with parameters: the dict holds one W and one b per layer
    
    for l in range(1, L):
        A_prev = A 
        A, cache = linactfwd(A_prev,parameters['W' + str(l)],parameters['b' + str(l)],"relu")
        caches.append(cache) 
        
    AL, cache = linactfwd(A,parameters['W' + str(L)],parameters['b' + str(L)],"sigmoid")
    
    caches.append(cache)
    
    return AL, caches

Let's define the cost function: **cross-entropy cost**

In [None]:
def compcost(AL, Y):
    """
    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector , shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1] # training batch size
    # cross entropy cost
    
    cost = (-1/m)*(np.dot(Y,np.log(AL).T)+np.dot(1-Y,np.log(1-AL).T))
    
    cost = np.squeeze(cost)
    return cost

Implement the linear portion of backward propagation for a single layer:

In [None]:
def linback(dZ, cache):
    """
    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    
    A_prev, W, b = cache
    m = A_prev.shape[1]
    
    dW = np.dot(dZ,A_prev.T)/m
    db = np.sum(dZ,axis=1,keepdims=True)/m
    dA_prev = np.dot(W.T,dZ)
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db

Implement the backward propagation for the LINEAR->ACTIVATION layer.

In [None]:
def linactback(dA, cache, act):
    """
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    act -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    
    linear_cache, activation_cache = cache
    
    if act == "relu":
        dZ = drelu(dA, activation_cache)
        dA_prev, dW, db = linback(dZ, linear_cache)
        
    elif act == "sigmoid":
        dZ = dsigmoid(dA, activation_cache)
        dA_prev, dW, db = linback(dZ, linear_cache)
        
    else:
        raise ValueError("unknown activation: " + str(act))
    
    return dA_prev, dW, db

Let's Implement the backward propagation completely:


In [None]:
def lmodelback(AL, Y, caches):
    """
    Arguments:
    AL -- probability vector, output of the forward propagation (lmodelfwd())
    Y -- true "label" vector 
    caches -- list of caches containing:
                every cache of linactfwd() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linactfwd() with "sigmoid" (it's caches[L-1])
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
    """
    
    grads = {}
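    en = 1e-12 # a very small value (epsilon) used to keep AL away from exactly 0 and 1
    L = len(caches) # the number of layers
    m = AL.shape[1]
    n = AL.shape[0]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    
    # Initializing the backpropagation
    # NumPy division by zero does not raise ZeroDivisionError (it returns inf/nan with a warning),
    # so AL is clipped instead of relying on try/except.
    AL = np.clip(AL, en, 1 - en)
    # Derivative of the cross-entropy cost w.r.t. AL. The 1/m factor is applied here and
    # linback divides by m again, so the gradients carry an extra 1/m (like a smaller learning rate).
    dAL = (-np.divide(Y, AL) + np.divide(1 - Y, 1 - AL))/m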
        
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linactback(dAL, current_cache, "sigmoid")
    
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linactback(grads["dA" + str(l+2)], current_cache, "relu")
        
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

Great, now let's define a function to update the parameters:


In [None]:
def updateparams(parameters, grads, learning_rate):
    """
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of lmodelback
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
    """
    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate*grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate*grads["db" + str(l+1)]
        
    return parameters

Finally, let's define our model:


In [None]:
def llayermodel(X, Y, layers, learning_rate = 0.0007, num_epochs = 10000, print_cost = True, printerval = 1000):
    """
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    Y -- true "label" vector of shape (1, number of examples)
    layers -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_epochs -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every `printerval` iterations
    printerval -- interval (in iterations) at which the cost is recorded and printed
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(1)
    costs = []                         # keep track of cost
    
    # Parameters initialization.
    parameters = initparams(layers)
    
    # Loop (gradient descent)
    for i in range(0, num_epochs):

        # Forward propagation
        a3, caches = lmodelfwd(X, parameters)
        
        # Compute cost
        cost = compcost(a3, Y)
        
        # Backward propagation
        grads = lmodelback(a3, Y, caches)

        # Update parameters
        parameters = updateparams(parameters, grads, learning_rate)
        
        # Print the cost every `printerval` iterations
        if print_cost and i % printerval == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
        if i % printerval == 0:
            costs.append(cost)
            
    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations /'+str(printerval))
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

Let's define the prediction function:


In [None]:
def predict(X,parameters):
    # forward pass only; the caches are not needed for prediction
    AL, caches = lmodelfwd(X, parameters)
    return AL

Now that all the requirements have been met, we can train our model. Let's jump to it:

In [None]:
parameters = llayermodel(X, Y, layers, learning_rate = 1, num_epochs = 10000, print_cost = True, printerval = 500)

# prediction....
yd = predict(X,parameters)
print(parameters)
print("The predicted values are\n",yd,"\nwhich should be almost equal to the expected output")

### 6. Historical Notes

> Read from reference : http://www.deeplearningbook.org/contents/mlp.html

------------------------

**NOTE**: *From the next chapter on, this tutorial will emphasise only the coding and examples. The theory will have to be studied from the reference book; reference pages will be provided where necessary.*


## Congratulations
on completing the deep neural network part. Next we will introduce regularization in our models...