# Deep Neural Network Representation

---


![alt text](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/10/Screenshot-from-2018-10-16-12-16-27.png)

* This is a 4-layer neural network (the input layer is not counted).
* The input layer has 3 inputs, $x_{1}$, $x_{2}$, $x_{3}$, followed by 3 hidden layers and an output layer ($\hat{y}$).
* $\mathrm{n}^{[l]}$ = number of units in layer $l$.
  
   Here $\mathrm{n}^{[0]} = 3$, $\mathrm{n}^{[1]} = 5$, $\mathrm{n}^{[2]} = 5$, $\mathrm{n}^{[3]} = 3$, $\mathrm{n}^{[4]} = 1$
* $\mathrm{a}^{[l]}$ = activations of layer $l$  =  $\mathrm{g}^{[l]}(\mathrm{z}^{[l]})$, where $\mathrm{g}^{[l]}$ is the activation function of layer $l$

   Here $\mathrm{a}^{[0]} = x$ and $\mathrm{a}^{[L]} = \hat{y}$, with $L = 4$

# Forward & Backward Propagation in Deep Neural Network

---

## Forward Propagation:
* O/P from the 1st hidden layer is, $\mathrm{a}^{[1]} = \mathrm{g}^{[1]}(\mathrm{z}^{[1]})$, where $\mathrm{z}^{[1]} = \mathrm{w}^{[1]}*\mathrm{a}^{[0]}+\mathrm{b}^{[1]}$, *Here $\mathrm{a}^{[0]}=x$*
* O/P from the 2nd hidden layer is $\mathrm{a}^{[2]} = \mathrm{g}^{[2]}(\mathrm{z}^{[2]})$, where $\mathrm{z}^{[2]} = \mathrm{w}^{[2]}*\mathrm{a}^{[1]}+\mathrm{b}^{[2]}$
* O/P from the 3rd hidden layer is $\mathrm{a}^{[3]} = \mathrm{g}^{[3]}(\mathrm{z}^{[3]})$, where $\mathrm{z}^{[3]} = \mathrm{w}^{[3]}*\mathrm{a}^{[2]}+\mathrm{b}^{[3]}$
* O/P from the o/p layer is $\mathrm{a}^{[4]} = \mathrm{g}^{[4]}(\mathrm{z}^{[4]}) = \hat{y}$, where $\mathrm{z}^{[4]} = \mathrm{w}^{[4]}*\mathrm{a}^{[3]}+\mathrm{b}^{[4]}$

**General Forward Propagation Equation: $\mathrm{a}^{[l]} = \mathrm{g}^{[l]}(\mathrm{z}^{[l]})$, where $\mathrm{z}^{[l]} = \mathrm{w}^{[l]}*\mathrm{a}^{[l-1]}+\mathrm{b}^{[l]}$**

### Vectorizing Forward Propagation Equation:
* $\mathrm{A}^{[1]} = \mathrm{g}^{[1]}(\mathrm{Z}^{[1]})$, where $\mathrm{Z}^{[1]} = \mathrm{W}^{[1]}*\mathrm{A}^{[0]}+\mathrm{b}^{[1]}$
* $\mathrm{A}^{[2]} = \mathrm{g}^{[2]}(\mathrm{Z}^{[2]})$, where $\mathrm{Z}^{[2]} = \mathrm{W}^{[2]}*\mathrm{A}^{[1]}+\mathrm{b}^{[2]}$
* $\mathrm{A}^{[3]} = \mathrm{g}^{[3]}(\mathrm{Z}^{[3]})$, where $\mathrm{Z}^{[3]} = \mathrm{W}^{[3]}*\mathrm{A}^{[2]}+\mathrm{b}^{[3]}$
* $\mathrm{A}^{[4]} = \mathrm{g}^{[4]}(\mathrm{Z}^{[4]}) = \hat{Y}$, where $\mathrm{Z}^{[4]} = \mathrm{W}^{[4]}*\mathrm{A}^{[3]}+\mathrm{b}^{[4]}$

**General vectorized form of Forward Propagation Equation: $\mathrm{A}^{[l]} = \mathrm{g}^{[l]}(\mathrm{Z}^{[l]})$, where $\mathrm{Z}^{[l]} = \mathrm{W}^{[l]}*\mathrm{A}^{[l-1]}+\mathrm{b}^{[l]}$**

Here, with $m$ training examples,

 * $\mathrm{W}^{[l]}$ is an $(\mathrm{n}^{[l]} \times \mathrm{n}^{[l-1]})$ matrix
  
 * $\mathrm{b}^{[l]}$ is an $(\mathrm{n}^{[l]} \times 1)$ matrix
  
 * $\mathrm{Z}^{[l]} = 
 \begin{pmatrix}
  \mathrm{z}^{[l](1)} & \mathrm{z}^{[l](2)} & \cdots & \mathrm{z}^{[l](m)} \\ 
 \end{pmatrix}$. Here each column $\mathrm{z}^{[l](i)}$ is an $(\mathrm{n}^{[l]} \times 1)$ matrix, so $\mathrm{Z}^{[l]}$ is $(\mathrm{n}^{[l]} \times m)$
 
 * $\mathrm{A}^{[l]} = 
 \begin{pmatrix}
  \mathrm{a}^{[l](1)} & \mathrm{a}^{[l](2)} & \cdots & \mathrm{a}^{[l](m)} \\ 
 \end{pmatrix}$. Here each column $\mathrm{a}^{[l](i)}$ is an $(\mathrm{n}^{[l]} \times 1)$ matrix, so $\mathrm{A}^{[l]}$ is $(\mathrm{n}^{[l]} \times m)$
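
The shapes above can be checked with a short NumPy sketch. It uses the layer sizes from the figure; the number of examples `m`, the random inputs, and the use of `tanh` as a stand-in for every $\mathrm{g}^{[l]}$ are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_sizes = [3, 5, 5, 3, 1]   # n[0], ..., n[4] from the figure
m = 6                            # hypothetical number of training examples

A = rng.standard_normal((layer_sizes[0], m))   # A[0] = X, shape (n[0], m)
shapes = [A.shape]

for l in range(1, len(layer_sizes)):
    # W[l] is (n[l], n[l-1]) and b[l] is (n[l], 1)
    W = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
    b = np.zeros((layer_sizes[l], 1))
    Z = W @ A + b        # (n[l], m); b is broadcast across the m columns
    A = np.tanh(Z)       # stand-in for g[l]; element-wise, so the shape is unchanged
    shapes.append(A.shape)

print(shapes)
```

Each activation matrix $\mathrm{A}^{[l]}$ has one row per unit in layer $l$ and one column per example.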
 
## Backward Propagation:
For a single unit with inputs $x_{1}$, $x_{2}$, $x_{3}$, and writing $dz = \frac{dL}{dz}$:
* $dW_{1} = \frac{dL}{dW_{1}} = x_{1}*dz$
* $dW_{2} = \frac{dL}{dW_{2}} = x_{2}*dz$
* $dW_{3} = \frac{dL}{dW_{3}} = x_{3}*dz$
* $db = \frac{dL}{db} = dz$
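
As a rough sanity check, the formulas above can be compared against finite differences for one unit. The sketch assumes a sigmoid activation with cross-entropy loss, for which $dz = a - y$; that choice and all numeric values are illustrative assumptions, not taken from the text.

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # hypothetical inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.3])   # hypothetical weights
b = 0.2                          # hypothetical bias
y = 1.0                          # hypothetical label

def loss(w, b):
    # Cross-entropy loss of a single sigmoid unit (assumed setup).
    a = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Analytic gradients from the formulas above.
a = 1.0 / (1.0 + np.exp(-(w @ x + b)))
dz = a - y      # dL/dz for sigmoid + cross-entropy
dW = x * dz     # dW_i = x_i * dz
db = dz

# Central finite differences for comparison.
eps = 1e-6
I = np.eye(3)
dW_num = np.array([(loss(w + eps * I[i], b) - loss(w - eps * I[i], b)) / (2 * eps)
                   for i in range(3)])
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```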

# Parameters & Hyperparameters
For a neural network model, $\mathrm{W}^{[l]}$ and $\mathrm{b}^{[l]}$ are the main parameters.

Some of the hyperparameters are:
1. Learning Rate $\alpha$
2. Number of iterations
3. Number of hidden layers
4. Number of hidden units
5. Choice of activation function

These hyperparameters control the values that $w$ and $b$ end up with after training.

# Building DNN step by step

In [None]:
import numpy as np

## Implementing Sigmoid Function



```
    Arguments:
    Z -- numpy array of any shape
    
    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
```









In [None]:
def sigmoid(Z):
    
    A = 1/(1+np.exp(-Z))
    cache = Z
    
    return A, cache

In [None]:
def sigmoid_test_case():
  
  z = np.random.randn()
  
  return z

In [None]:
z = sigmoid_test_case()
A, cache = sigmoid(z)

print("z =" + str(z))
print("Sigmoid = " + str(A))
print("Cache = " + str(cache))

z =-0.19282691566774998
Sigmoid = 0.4519420872071204
Cache = -0.19282691566774998


## Implementing ReLU Function


```
    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- returns Z as well; stored for computing the backward pass efficiently
```



In [None]:
def relu(Z):

    A = np.maximum(0,Z)
    
    assert(A.shape == Z.shape)
    
    cache = Z 
    return A, cache

In [None]:
def relu_test_case():
  
  z = np.random.randn()
  
  return z

In [None]:
z = relu_test_case()
A, cache = relu(np.array([z]))

print("z =" + str(z))
print("Relu = " + str(A))
print("Cache = " + str(cache))

z =-0.22916360952618312
Relu = [0.]
Cache = [-0.22916361]


## Implementing the backward propagation for a single SIGMOID unit


```
    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
```



In [None]:
def sigmoid_backward(dA, cache):
    
    Z = cache
    
    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)
    
    assert (dZ.shape == Z.shape)
    
    return dZ

In [None]:
def sigmoid_backward_test_case():
  
  dA = np.random.randn()
  cache = np.random.randn()
  
  return dA, cache

In [None]:
dA, cache = sigmoid_backward_test_case()
dZ = sigmoid_backward(np.array([dA]), np.array([cache]))

print("dA =" + str(dA))
print("cache =" + str(cache))
print("Sigmoid Backward = " + str(dZ))

dA =0.690707054736731
cache =1.7260192920596784
Sigmoid Backward = [0.08859494]


## Implementing the backward propagation for a single RELU unit


```
    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- 'Z' where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
```


In [None]:
def relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True) # just converting dz to a correct object.

    # When z <= 0, we should set dz to 0
    dZ[Z <= 0] = 0

    assert (dZ.shape == Z.shape)

    return dZ

In [None]:
def relu_backward_test_case():
  
  dA = np.random.randn()
  cache = np.random.randn()
  
  return dA, cache

In [None]:
dA, cache = relu_backward_test_case()
dZ = relu_backward(np.array([dA]), np.array([cache]))

print("dA =" + str(dA))
print("cache =" + str(cache))
print("ReLU Backward = " + str(dZ))

dA =0.9157550385590463
cache =-0.8093509617462485
ReLU Backward = [0.]


## Initializing Parameters


### For 2-Layer Neural Network


```
    Arguments:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
```



In [None]:
def initialize_parameters(n_x, n_h, n_y):
  
    np.random.seed(1)
    
    W1 = np.random.randn(n_h, n_x)*0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h)*0.01
    b2 = np.zeros((n_y, 1))
    
    assert(W1.shape == (n_h, n_x))
    assert(b1.shape == (n_h, 1))
    assert(W2.shape == (n_y, n_h))
    assert(b2.shape == (n_y, 1))
    
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    
    return parameters    

In [None]:
parameters = initialize_parameters(3,2,1)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

W1 = [[ 0.01624345 -0.00611756 -0.00528172]
 [-0.01072969  0.00865408 -0.02301539]]
b1 = [[0.]
 [0.]]
W2 = [[ 0.01744812 -0.00761207]]
b2 = [[0.]]


### For L-Layer Neural Network

When we compute $W X + b$ with NumPy, the bias column $b$ is broadcast across the columns of $WX$. For example, if: 

$$ W = \begin{bmatrix}
    j  & k  & l\\
    m  & n & o \\
    p  & q & r 
\end{bmatrix}\;\;\; X = \begin{bmatrix}
    a  & b  & c\\
    d  & e & f \\
    g  & h & i 
\end{bmatrix} \;\;\; b =\begin{bmatrix}
    s  \\
    t  \\
    u
\end{bmatrix}\tag{2}$$

Then $WX + b$ will be:

$$ WX + b = \begin{bmatrix}
    (ja + kd + lg) + s  & (jb + ke + lh) + s  & (jc + kf + li)+ s\\
    (ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t\\
    (pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri)+ u
\end{bmatrix}\tag{3}  $$
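
The same broadcasting can be seen with concrete numbers. This is a minimal sketch; the values of `W`, `X`, and `b` are arbitrary and chosen only so the added bias is easy to spot.

```python
import numpy as np

W = np.arange(1, 10).reshape(3, 3)      # 3x3 weight matrix
X = np.arange(10, 19).reshape(3, 3)     # 3 features x 3 examples
b = np.array([[100], [200], [300]])     # bias column of shape (3, 1)

# NumPy stretches b from (3, 1) to (3, 3) by repeating it across the columns.
Z = np.dot(W, X) + b

# Explicit equivalent: tile b once per example (column) before adding.
Z_explicit = np.dot(W, X) + np.tile(b, (1, X.shape[1]))

print(np.array_equal(Z, Z_explicit))
print(Z - np.dot(W, X))   # every column equals b
```

Keeping `b` as a column of shape `(n, 1)` matters: a 1-D array of shape `(n,)` would be aligned with the last axis instead, so each bias entry would be added down a column rather than across a row.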

---



```
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
```



In [None]:
def initialize_parameters_deep(layer_dims):
    
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims) # number of entries in layer_dims (input layer included)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

        
    return parameters

In [None]:
parameters = initialize_parameters_deep([5,4,3])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

W1 = [[ 0.01788628  0.0043651   0.00096497 -0.01863493 -0.00277388]
 [-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
 [-0.01313865  0.00884622  0.00881318  0.01709573  0.00050034]
 [-0.00404677 -0.0054536  -0.01546477  0.00982367 -0.01101068]]
b1 = [[0.]
 [0.]
 [0.]
 [0.]]
W2 = [[-0.01185047 -0.0020565   0.01486148  0.00236716]
 [-0.01023785 -0.00712993  0.00625245 -0.00160513]
 [-0.00768836 -0.00230031  0.00745056  0.01976111]]
b2 = [[0.]
 [0.]
 [0.]]


## Forward Propagation Model


### Linear Forward:
The forward propagation module is built in three steps:
- LINEAR
- LINEAR -> ACTIVATION, where ACTIVATION is either ReLU or sigmoid
- [LINEAR -> RELU] $\times$ $(L-1)$ -> LINEAR -> SIGMOID (whole model)

The linear forward module (vectorized over all the examples) computes the following equation:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{4}$$

where $A^{[0]} = X$. 

---




```
    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter 
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
```



In [None]:
def linear_forward(A, W, b):
    
    Z = np.dot(W, A)+b
    
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    
    return Z, cache

In [None]:
def linear_forward_test_case():

    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    
    return A, W, b

In [None]:
A, W, b = linear_forward_test_case()

Z, linear_cache = linear_forward(A, W, b)
print("Z = " + str(Z))

Z = [[0.85391458 1.92375077]]


### Linear Activation
* **Sigmoid**: $\sigma(Z) = \sigma(W A + b) = \frac{1}{ 1 + e^{-(W A + b)}}$. The `sigmoid` function defined above returns **two** items: the activation value "`A`" and a "`cache`" that contains "`Z`" (this is what we feed into the corresponding backward function). To use it, call: 
```python
A, activation_cache = sigmoid(Z)
```

* **ReLU**: The mathematical formula for ReLU is $A = ReLU(Z) = max(0, Z)$. The `relu` function defined above returns **two** items: the activation value "`A`" and a "`cache`" that contains "`Z`" (this is what we feed into the corresponding backward function). To use it, call:
```python
A, activation_cache = relu(Z)
```


---



```
    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value 
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
```



In [None]:
def linear_activation_forward(A_prev, W, b, activation):
    
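    # The linear step is the same for both activations, so compute it once.
    Z, linear_cache = linear_forward(A_prev, W, b)

    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else:
        # Fail early instead of hitting an undefined `A` below.
        raise ValueError("activation must be 'sigmoid' or 'relu', got: " + str(activation))
    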
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

In [None]:
def linear_activation_forward_test_case():
  
    A_prev = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    
    return A_prev, W, b

In [None]:
A_prev, W, b = linear_activation_forward_test_case()

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "sigmoid")
print("With sigmoid: A = " + str(A))

A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation = "relu")
print("With ReLU: A = " + str(A))

With sigmoid: A = [[0.71669845 0.99303271]]
With ReLU: A = [[0.92814329 4.95953689]]


## L-Layer Model

For even more convenience when implementing the $L$-layer Neural Net, we will need a function that replicates the previous one (`linear_activation_forward` with ReLU) $(L-1)$ times, then follows that with one `linear_activation_forward` with SIGMOID.

![alt text](https://datascience-enthusiast.com/figures/model_architecture_kiank.png)

---



```
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() (there are L of them, indexed from 0 to L-1)
```



In [None]:
def L_model_forward(X, parameters):

    caches = []
    A = X
    L = len(parameters) // 2 # number of layers in the neural network
    
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A 
        A, cache = linear_activation_forward(A_prev, 
                                             parameters['W' + str(l)], 
                                             parameters['b' + str(l)], 
                                             activation='relu')
        caches.append(cache)
    
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, 
                                          parameters['W' + str(L)], 
                                          parameters['b' + str(L)], 
                                          activation='sigmoid')
    caches.append(cache)
    
    assert(AL.shape == (1,X.shape[1]))
            
    return AL, caches

In [None]:
def L_model_forward_test_case_2hidden():

    X = np.random.randn(5,4)
    W1 = np.random.randn(4,5)
    b1 = np.random.randn(4,1)
    W2 = np.random.randn(3,4)
    b2 = np.random.randn(3,1)
    W3 = np.random.randn(1,3)
    b3 = np.random.randn(1,1)
  
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    
    return X, parameters

In [None]:
X, parameters = L_model_forward_test_case_2hidden()
AL, caches = L_model_forward(X, parameters)
print("AL = " + str(AL))
print("Length of caches list = " + str(len(caches)))

AL = [[0.44307552 0.33404182 0.26785456 0.15502747]]
Length of caches list = 3


## Cost Function
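
The docstring below refers to equation (7), which is not shown in this section. Reading it off the `compute_cost` code, it is the cross-entropy cost averaged over the $m$ examples (attaching the number (7) to this form is an assumption):

$$J = -\frac{1}{m} \sum_{i = 1}^{m} \left( y^{(i)} \log a^{[L](i)} + (1 - y^{(i)}) \log\left(1 - a^{[L](i)}\right) \right) \tag{7}$$

As a worked check with the test values used below, $Y = (1, 1, 1)$ and $A^{[L]} = (0.8, 0.9, 0.4)$ give

$$J = -\frac{1}{3}\left(\ln 0.8 + \ln 0.9 + \ln 0.4\right) \approx \frac{0.2231 + 0.1054 + 0.9163}{3} \approx 0.4149$$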


```
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
```



In [None]:
def compute_cost(AL, Y):
    
    m = Y.shape[1]

    cost = (-1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
    
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())
    
    return cost

In [None]:
def compute_cost_test_case():
    Y = np.asarray([[1, 1, 1]])
    aL = np.array([[.8,.9,0.4]])
    
    return Y, aL

In [None]:
Y, AL = compute_cost_test_case()

print("cost = " + str(compute_cost(AL, Y)))

cost = 0.41493159961539694


## Backward propagation module
![alt text](https://datascience-enthusiast.com/figures/backprop_kiank.png)

Now, similar to forward propagation, you are going to build the backward propagation in three steps:
- LINEAR backward
- LINEAR -> ACTIVATION backward where ACTIVATION computes the derivative of either the ReLU or sigmoid activation
- [LINEAR -> ReLU] $\times$ $(L-1)$ -> LINEAR -> SIGMOID backward (whole model)

### Linear Backward:
For layer $l$, the linear part is: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ (followed by an activation).

Suppose we have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l]}}$. You want to get $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$.
![alt text](https://datascience-enthusiast.com/figures/linearback_kiank.png)

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$. Here are the formulas you need:
$$ dW^{[l]} = \frac{\partial{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$
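
One way to build confidence in formulas (8) and (9) is a finite-difference check on a single linear layer. The sketch below assumes a toy per-example loss $\frac{1}{2} z^2$ (so that $dZ = Z$) and random data; both are illustrative assumptions and not part of the model in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_cur, m = 3, 2, 4                     # hypothetical layer sizes and batch size
A_prev = rng.standard_normal((n_prev, m))
W = rng.standard_normal((n_cur, n_prev))
b = rng.standard_normal((n_cur, 1))

def cost(W, b):
    # Toy cost: J = (1/m) * sum over examples of 0.5 * z^2
    Z = W @ A_prev + b
    return np.sum(Z ** 2) / (2 * m)

# Analytic gradients from equations (8), (9), (10).
Z = W @ A_prev + b
dZ = Z                                          # dL/dZ for the toy loss
dW = dZ @ A_prev.T / m                          # (8)
db = np.sum(dZ, axis=1, keepdims=True) / m      # (9)
dA_prev = W.T @ dZ                              # (10), same shape as A_prev

# Central finite differences on every entry of W and b.
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(n_cur):
    for j in range(n_prev):
        E = np.zeros_like(W)
        E[i, j] = eps
        dW_num[i, j] = (cost(W + E, b) - cost(W - E, b)) / (2 * eps)

db_num = np.zeros_like(b)
for i in range(n_cur):
    e = np.zeros_like(b)
    e[i, 0] = eps
    db_num[i, 0] = (cost(W, b + e) - cost(W, b - e)) / (2 * eps)

print(np.allclose(dW, dW_num, atol=1e-6), np.allclose(db, db_num, atol=1e-6))
```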



---



```
    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
```



In [None]:
def linear_backward(dZ, cache):

    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = (1.0/m)*np.dot(dZ, A_prev.T)
    db = (1.0/m)*np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    
    return dA_prev, dW, db

In [None]:
def linear_backward_test_case():

    dZ = np.random.randn(1,2)
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    linear_cache = (A, W, b)
    return dZ, linear_cache

In [None]:
dZ, linear_cache = linear_backward_test_case()

dA_prev, dW, db = linear_backward(dZ, linear_cache)
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

dA_prev = [[-0.84836047 -1.48514293]
 [ 0.1098082   0.19223063]
 [ 0.13961806  0.24441588]]
dW = [[-0.23692617  0.03835643 -0.21702733]]
db = [[0.66896671]]


### Linear Activation Backward:
We'll create a function, **`linear_activation_backward`**, that merges two helper steps: **`linear_backward`** and the backward step for the activation. 

To implement `linear_activation_backward`, we use the two backward functions defined earlier:
- **`sigmoid_backward`**: Implements the backward propagation for SIGMOID unit. We can call it as follows:

```python
dZ = sigmoid_backward(dA, activation_cache)
```

- **`relu_backward`**: Implements the backward propagation for ReLU unit. We can call it as follows:

```python
dZ = relu_backward(dA, activation_cache)
```

If $g(.)$ is the activation function, `sigmoid_backward` and `relu_backward` compute
$$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) \tag{11}$$
where $*$ denotes the element-wise product.
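
To make $g'(Z)$ concrete, here is a small sketch of the two derivatives and the element-wise product in (11). The helper names `sigmoid_prime` and `relu_prime` and the sample values are hypothetical, introduced only for this illustration.

```python
import numpy as np

def sigmoid_prime(Z):
    # g'(Z) for the sigmoid: s * (1 - s)
    s = 1.0 / (1.0 + np.exp(-Z))
    return s * (1.0 - s)

def relu_prime(Z):
    # g'(Z) for ReLU: 1 where Z > 0, else 0 (0 is used at Z == 0)
    return (Z > 0).astype(float)

Z = np.array([[-2.0, -0.5, 0.5, 3.0]])     # hypothetical pre-activations
dA = np.array([[1.0, 2.0, -1.0, 0.5]])     # hypothetical upstream gradients

dZ_sigmoid = dA * sigmoid_prime(Z)          # equation (11) with g = sigmoid
dZ_relu = dA * relu_prime(Z)                # equation (11) with g = ReLU

# Finite-difference check of the sigmoid derivative.
eps = 1e-6
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
numeric = (sig(Z + eps) - sig(Z - eps)) / (2 * eps)

print(np.allclose(sigmoid_prime(Z), numeric, atol=1e-8))
print(dZ_relu)
```

For ReLU the gradient simply passes through where $Z > 0$ and is zeroed elsewhere, which matches the `dZ[Z <= 0] = 0` line in `relu_backward`.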



---



```
    Arguments:
    dA -- post-activation gradient for current layer l 
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
```



In [None]:
def linear_activation_backward(dA, cache, activation):

    linear_cache, activation_cache = cache
    
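    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    else:
        # Fail early instead of hitting an undefined `dZ` below.
        raise ValueError("activation must be 'sigmoid' or 'relu', got: " + str(activation))

    # The linear backward step is the same for both activations.
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    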
    return dA_prev, dW, db

In [None]:
def linear_activation_backward_test_case():

    dA = np.random.randn(1,2)
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    Z = np.random.randn(1,2)
    linear_cache = (A, W, b)
    activation_cache = Z
    linear_activation_cache = (linear_cache, activation_cache)
    
    return dA, linear_activation_cache

In [None]:
dAL, linear_activation_cache = linear_activation_backward_test_case()

dA_prev, dW, db = linear_activation_backward(dAL, linear_activation_cache, activation = "sigmoid")
print ("sigmoid:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db) + "\n")

dA_prev, dW, db = linear_activation_backward(dAL, linear_activation_cache, activation = "relu")
print ("relu:")
print ("dA_prev = "+ str(dA_prev))
print ("dW = " + str(dW))
print ("db = " + str(db))

sigmoid:
dA_prev = [[-0.01753949  0.06356647]
 [-0.01676571  0.06076213]
 [ 0.03150611 -0.11418413]]
dW = [[-0.00235162 -0.01913512 -0.04665459]]
db = [[-0.04810867]]

relu:
dA_prev = [[0. 0.]
 [0. 0.]
 [0. 0.]]
dW = [[0. 0. 0.]]
db = [[0.]]




```
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1), i.e. l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ... 
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ... 
```
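
The function body for this docstring is not present in this section. The following is one possible sketch of it, not the original implementation. To stay self-contained it uses compact stand-ins (`_layer_backward`, `_forward`, `_cost`, all hypothetical names) that mirror `linear_activation_backward`, `L_model_forward`, and `compute_cost`. Storing the gradient flowing into layer `l-1` under the key `"dA" + str(l-1)` is also an assumption.

```python
import numpy as np

def _sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def _layer_backward(dA, cache, activation):
    # Stand-in for linear_activation_backward; cache = ((A_prev, W, b), Z).
    (A_prev, W, b), Z = cache
    if activation == "sigmoid":
        s = _sigmoid(Z)
        dZ = dA * s * (1 - s)
    else:  # "relu"
        dZ = dA * (Z > 0)
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    return W.T @ dZ, dW, db

def L_model_backward_sketch(AL, Y, caches):
    grads = {}
    L = len(caches)                      # number of layers
    Y = Y.reshape(AL.shape)
    # Derivative of the cross-entropy loss with respect to AL.
    dA = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # Output layer: LINEAR -> SIGMOID, using caches[L-1].
    dA, grads["dW" + str(L)], grads["db" + str(L)] = _layer_backward(dA, caches[L - 1], "sigmoid")
    grads["dA" + str(L - 1)] = dA
    # Hidden layers L-1, ..., 1: LINEAR -> RELU, using caches[l-1].
    for l in range(L - 1, 0, -1):
        dA, grads["dW" + str(l)], grads["db" + str(l)] = _layer_backward(dA, caches[l - 1], "relu")
        grads["dA" + str(l - 1)] = dA
    return grads

def _forward(X, params):
    # Stand-in for L_model_forward: ReLU hidden layers, sigmoid output.
    caches, A = [], X
    L = len(params) // 2
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        caches.append(((A, W, b), Z))
        A = _sigmoid(Z) if l == L else np.maximum(0, Z)
    return A, caches

def _cost(AL, Y):
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

# Gradient check on one weight of a small hypothetical network.
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 5))
Y = np.array([[1, 0, 1, 1, 0]])
params = {"W1": rng.standard_normal((3, 4)), "b1": np.zeros((3, 1)),
          "W2": rng.standard_normal((1, 3)), "b2": np.zeros((1, 1))}
AL, caches = _forward(X, params)
grads = L_model_backward_sketch(AL, Y, caches)

eps = 1e-6
original = params["W1"][0, 0]
params["W1"][0, 0] = original + eps
cost_plus = _cost(_forward(X, params)[0], Y)
params["W1"][0, 0] = original - eps
cost_minus = _cost(_forward(X, params)[0], Y)
params["W1"][0, 0] = original              # restore the weight
numeric = (cost_plus - cost_minus) / (2 * eps)

print(abs(numeric - grads["dW1"][0, 0]) < 1e-6)
```

The resulting `grads` dictionary has the `"dW" + str(l)` and `"db" + str(l)` keys that `update_parameters` expects.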





```
    Arguments:
    parameters -- python dictionary containing your parameters 
    grads -- python dictionary containing your gradients, output of L_model_backward
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
                  parameters["W" + str(l)] = ... 
                  parameters["b" + str(l)] = ...
```



## Update Parameters
In this section we will update the parameters of the model, using gradient descent: 

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}$$

where $\alpha$ is the learning rate. After computing the updated parameters, we'll store them in the parameters dictionary. 

In [None]:
def update_parameters(parameters, grads, learning_rate):
    
    L = len(parameters) // 2 # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
        
    return parameters

In [None]:
def update_parameters_test_case():

    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    dW1 = np.random.randn(3,4)
    db1 = np.random.randn(3,1)
    dW2 = np.random.randn(1,3)
    db2 = np.random.randn(1,1)
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    
    return parameters, grads

In [None]:
parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)

print ("W1 = "+ str(parameters["W1"]))
print ("b1 = "+ str(parameters["b1"]))
print ("W2 = "+ str(parameters["W2"]))
print ("b2 = "+ str(parameters["b2"]))

W1 = [[ 1.68006676e-01  1.20733268e-01  1.26774057e+00 -5.86221134e-01]
 [ 4.63854489e-01  4.33088653e-01 -3.47965123e-04  6.94981206e-01]
 [ 5.99177453e-01 -8.99395484e-01  9.85400720e-01 -1.15605872e+00]]
b1 = [[0.10270444]
 [0.05647119]
 [0.43757182]]
W2 = [[-0.56318736  0.08862028 -0.12654138]]
b2 = [[0.89319316]]
