# Homework 3: Computation Graph

Welcome to the course **AI and Deep learning**!

Computation graphs, and backpropagation in particular, are of great importance in deep learning because they make the training of various neural networks possible. Since backpropagation is so important, it is built into popular deep learning frameworks such as TensorFlow and PyTorch. That is, we only need to define the forward propagation, which is the architecture of the neural network, and the backpropagation is then performed automatically. In this homework, however, we will manually code up both forward propagation and backpropagation, so that we gain a better understanding of the computation graph. Hope you enjoy the third homework!  

**Learning Goal**: In this homework, we first revisit logistic regression and use it to illustrate the basic procedure for forward propagation and backpropagation. Then, we move to general fully connected neural networks and use a computation graph to train them. After this homework, you will know:
 * The basic procedure to train a model using a computation graph.
 * How badly a wrongly specified model performs.
 * How to code up a neural network with one hidden layer.
 * How differently neural networks perform with different numbers of hidden nodes.
 


## Table of contents
* [1 - Packages](#1)
* [2 - Logistic regression revisit](#2)
  * [2.1 - Generate a training dataset](#2.1)
  * [2.2 - Parameter estimation](#2.2)
  * [2.3 - Integration](#2.3)
* [3 - Neural Network](#3) 
  * [3.1 - Data generation](#3.1)
  * [3.2 - Architecture of a fully connected neural network with one hidden layer](#3.2)
  * [3.3 - Integration](#3.3)
  * [3.4 - Play by yourself!](#3.4)


<a name='1'></a>
## 1 - Packages

To finish a task, we need commands from certain **Python** packages. Again, one of the most commonly used packages is **numpy**.

In [None]:
import numpy as np
import matplotlib.pyplot as plt # for plots

<a name='2'></a>
## 2 - Logistic regression revisit

<a name='2.1'></a>
### 2.1 - Generate a training dataset

First, we generate a training dataset from a pre-specified logistic regression model. **In order to guarantee that our simulation results are reproducible, we need to control the random seed.** That is, after controlling the seed, others can generate the **SAME** random variables as we did, so our simulation results can be reproduced.
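
To see what controlling the seed means in practice, here is a minimal sketch. The helper `draw` and the seed values are made up for illustration and are not part of the homework code.

```python
import numpy as np

def draw(seed, size=3):
    # Hypothetical helper: reset the global seed, then draw standard normal values.
    np.random.seed(seed)
    return np.random.normal(0, 1, size)

first = draw(100)
second = draw(100)   # same seed, so exactly the same values
other = draw(101)    # different seed, so different values

print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

Anyone who runs the generator with the same seed obtains the same dataset, which is why the seed is passed as an argument to the data generation functions in this homework.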

Consider the following logistic regression model 
$$
y^{(i)}\sim\mbox{Bernoulli}\{\pi(x^{(i)})\},\\
\mathrm{logit}\{\pi(x^{(i)})\} = b_0 +w_{00}x^{(1i)}+w_{01}x^{(2i)},
$$
where $\mbox{Bernoulli}(p)$ is a Bernoulli distribution with success probability $p$, $x^{(i)} = (x^{(1i)},x^{(2i)})^T$, $b_0=-0.5$, $w_{00}=0.1$, $w_{01}=-0.1$, and $x^{(ki)}\sim N(2,4^2)$ for $k=1,2$. Note that the code below passes `2**2 = 4` to `np.random.normal` as the standard deviation.
Let us write a function to generate a training dataset of size $n$ with a random seed `rn`. 

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
def sigmoid(x):
    # x: input
    
    sig = 1/(1 + np.exp(-x))

    return sig

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
def train_data_generation(n, rn):
    # n: sample size
    # rn: random seed
    
    np.random.seed(rn)
    x = np.random.normal(2,2**2, (n,2))
    z = -0.5 + 0.1*x[:,0] - 0.1 * x[:,1]
    a = sigmoid(z)
    y = [np.random.binomial(1,prob,1) for prob in a]
    y = np.array(y)    
    
    return x, y
    

To visualize your data, you may want to run the following code.

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x, y = train_data_generation(1000, 100)

fig, ax = plt.subplots()
scatter = ax.scatter(x[:,0], x[:,1],  c=y[:,0])
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="lower right", title="Classes")
ax.set_title('Simulated training dataset')
plt.show()


<a name='2.2'></a>
### 2.2 - Parameter estimation

Unlike what we did in Homework 2, we implement a computation graph for parameter estimation based on vectorization and a (batch) gradient descent algorithm. Check the slides for Section 1.2 for details. In this part, we code up the forward propagation and the backpropagation separately, and we use a `dictionary` to return the corresponding values under meaningful key names. 

The following code is useful to briefly understand the `dictionary` structure.

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
dic = {'W[0]': 0 , 'W[1]':1} # Construct a dictionary

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
i=0
dic['W['+str(i)+']'] #use variables to extract the values for `W[0]`

For a logistic regression model, we will use a dictionary `par` to store the values for the parameters and use a dictionary `grad` to store those for the gradients. Please notice that the two dictionaries are updated until convergence.

First, we need to initialize the parameter dictionary. We use the following initialization strategies. 
   * Initialize the weights by a random vector, whose elements are independently generated from a normal distribution with mean zero and standard deviation one. Please use a random seed to keep the code reproducible.
   * Initialize the bias by zero.

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def Initialize_pars(d,rn):
    # d: the dimension of the feature.
    # rn: random seed
    
    # Step 1. Set random seed
    # Step 2. Initialize w with size (d,1)
    # Step 3. Initialize b with size (1,1)

    
    ### YOUR CODE BEGINS HERE (approximately 3 lines)
    np.random.seed(rn)
    w = np.random.normal(0,1,(d,1))
    b = np.zeros((1,1))
    ### YOUR CODE ENDS
    
    par = {
        'w': w,
        'b': b
          }
    
    return par
        

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
d = x.shape[1]
rn = 1234
par = Initialize_pars(d,rn)
print(par['w'])
print('Your result should be:\n [[ 0.47143516]\n  [-1.19097569]] ')

We have already introduced the vectorization for logistic regression models. Specifically, we have (check the second homework for details)
$$
\nabla J(\tilde{w}) =n^{-1}X^T(A-Y), \quad H(\tilde{w}) = n^{-1}X^TWX,
$$
where $Y=(y^{(1)},\ldots,y^{(n)})^T$, $A=(a^{(1)},\ldots,a^{(n)})^T$ and $W = \mbox{diag}((a^{(1)}(1-a^{(1)}),\ldots, a^{(n)}(1-a^{(n)})))$. 

A computation graph consists of forward propagation and backpropagation.
 * Forward propagation computes the cost function, as well as other intermediate values, given the current parameters. 
 * Backpropagation computes the derivatives based on the values computed in the forward propagation. 

**Please notice that it is enough to "cache" the $a$'s.**

First, we consider the forward propagation. Given the current model parameters, we need to calculate 
 * $Z = X w + b$
 * $A = \sigma(Z)$
 * An informal vectorization for the cost $J$  is $ n^{-1}\{-(Y\log A + (1-Y) \log (1-A))\}^{T}1_{n}$, where $1_{n}=(1,\ldots,1)^T$ is a vector of 1's with length $n$. 
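
As a sanity check on the vectorized cost above, the sketch below compares it with a plain loop over the observations. The toy data, the seed, and the variable names are assumptions made for illustration. They are not the homework dataset.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)            # toy data, not the homework dataset
n, d = 6, 2
X = rng.normal(size=(n, d))
Y = rng.integers(0, 2, size=(n, 1))
w = rng.normal(size=(d, 1))
b = np.zeros((1, 1))

# Vectorized forward pass: Z = Xw + b, A = sigmoid(Z), then the cost J
Z = X @ w + b
A = sigmoid(Z)
J_vec = (-(Y * np.log(A) + (1 - Y) * np.log(1 - A)).T @ np.ones((n, 1)) / n).item()

# The same cost, accumulated one observation at a time
J_loop = 0.0
for i in range(n):
    a_i = sigmoid(X[i] @ w[:, 0] + b[0, 0])
    J_loop += -(Y[i, 0] * np.log(a_i) + (1 - Y[i, 0]) * np.log(1 - a_i))
J_loop /= n

print(np.isclose(J_vec, J_loop))
```

The two values agree up to floating point error. The vectorized form simply avoids the explicit Python loop.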

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def forward(x, y, par):
    # x: feature matrix of size nX2
    # y: target vector of size nX1
    # par: dictionary with current parameters.

    # Step 1. Obtain the sample size n
    # Step 2. Obtain Z
    # Step 3. Obtain A
    # Step 4. Obtain J
    # Step 5. Cache J and A. The reason we cache J is that we would like to monitor the value of the cost function 
    #         as the iterations go by.
    
    
    ### YOUR CODE BEGINS HERE (approximately 4 lines)
    n = x.shape[0]
    Z = x @ par['w'] + par['b']
    A = sigmoid(Z)
    J = - (y * np.log(A) + (1-y) * np.log(1-A)).transpose() @ np.ones((n,1))/n    
    ### YOUR CODE ENDS
    
    cache = {
        'J': J,
        'A': A
    }
    return cache
    

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
cache = forward(x, y, par)
print(cache['J'])
print('Your result should be:\n[[1.62409418]]')

Next, we consider the backpropagation. Given the values obtained from the forward propagation, we need to obtain the following results. 
 * 'Error term': $err = A-Y$
 * Gradient for weights: $dw = X^{T}err/n$
 * Gradient for bias: $db = err^{T}1_n/n$.
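
One common way to check formulas like these is to compare them with a finite-difference approximation of the cost. The sketch below does this on toy data. The helper `cost`, the step size `eps`, and the data are assumptions made for illustration and are not part of the homework.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(X, Y, w, b):
    # Cross-entropy cost J for given parameters (hypothetical helper)
    A = sigmoid(X @ w + b)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

rng = np.random.default_rng(1)            # toy data, not the homework dataset
n, d = 50, 2
X = rng.normal(size=(n, d))
Y = rng.integers(0, 2, size=(n, 1)).astype(float)
w = rng.normal(size=(d, 1))
b = np.zeros((1, 1))

# Analytic gradients from the formulas above
err = sigmoid(X @ w + b) - Y
dw = X.T @ err / n
db = err.T @ np.ones((n, 1)) / n

# Central finite differences, one coordinate at a time
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(d):
    step = np.zeros_like(w)
    step[j, 0] = eps
    dw_num[j, 0] = (cost(X, Y, w + step, b) - cost(X, Y, w - step, b)) / (2 * eps)
db_num = (cost(X, Y, w, b + eps) - cost(X, Y, w, b - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-6))
print(np.isclose(db[0, 0], db_num, atol=1e-6))
```

If the analytic and numerical gradients disagree, the backpropagation code is the first place to look for a bug.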

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def backprop(x, y, cache):
    # x: feature matrix of size nX2
    # y: target vector of size nX1
    # cache: cached values for A
    
    # Step 1. Obtain the sample size n
    # Step 2. Obtain the error term err = A - y, using A from the cache
    # Step 3. Obtain dw
    # Step 4. Obtain db
    # Step 5. Store dw and db in the dictionary grad
    
    ### YOUR CODE BEGINS HERE (approximately 4 lines)
    n = x.shape[0]
    err = cache['A'] - y
    dw = (x.transpose() @ err) / n 
    db = (err.transpose() @ np.ones((n,1))) / n
    ### YOUR CODE ENDS
    
    grad = {
        'dw': dw,
        'db': db
    }
    return grad

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
grad = backprop(x, y, cache)
print("Your dw is:")
print(grad['dw'])
print("The expected dw is:\n[[ 0.34420118]\n [-1.04006133]]\n")
print("Your db is:")
print(grad['db'])
print("The expected db is:\n[[0.02902235]]")


After we finish the computation graph, we need to update the model parameters using a **learning rate** $\alpha$ by 
$$w = w - \alpha dw,\quad b = b - \alpha db$$
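
The following sketch applies one such update with made-up numbers. The values of `par`, `grad`, and `alpha` are assumptions for illustration only. It also shows that an in-place update with `-=` changes the arrays stored in the dictionary itself.

```python
import numpy as np

# Made-up parameters and gradients, for illustration only
par = {'w': np.array([[0.5], [-1.0]]), 'b': np.zeros((1, 1))}
grad = {'dw': np.array([[0.2], [-0.4]]), 'db': np.array([[0.1]])}
alpha = 0.1

w_before = par['w'].copy()        # keep a copy, since the update below modifies par['w'] itself

par['w'] -= alpha * grad['dw']    # w = w - alpha * dw
par['b'] -= alpha * grad['db']    # b = b - alpha * db

print(par['w'].ravel(), par['b'].ravel())
print(np.allclose(par['w'], w_before))   # False: the stored array has changed
```

Because the update happens in place, running an update cell twice in a notebook applies the step twice. Re-initializing the parameters before re-running avoids this.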

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def update_par(par, grad, alpha):
    # par: dictionary with current parameters.
    # grad: dictionary with gradients
    # alpha: learning rate
    
    # Step 1. Update w
    # Step 2. Update b
    
    
    ### YOUR CODE BEGINS HERE (approximately 2 lines)
    par['w'] -= alpha * grad['dw']
    par['b'] -= alpha * grad['db']
    ### YOUR CODE ENDS
    
    return par


In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
alpha = 0.01
par = update_par(par, grad, alpha)
print("Your updated w is")
print(par['w'])
print("The expected w is:\n[[ 0.46455114]\n [-1.17017447]]\n")
print("Your updated b is:")
print(par['b'])
print("The expected b is:\n[[-0.00087067]]")

<a name='2.3'></a>
### 2.3 - Integration

Up to now, we have finished one iteration for the (batch) gradient descent algorithm. We need to put things together to obtain the estimators for the model parameters. 

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def est_par_logistic(x, y, alpha, M, rn):
    # x: feature matrix of size nX2
    # y: target vector of size nX1
    # alpha: learning rate
    # M: maximum number of iterations
    # rn: random seed for the initialization
 
    # Step 1. Obtain the dimension of features
    # Step 2. Initialize the parameters
    # Step 3. Iteration 
    #        Step 3.1 Forward propagation
    #        Step 3.2 Backpropagation
    #        Step 3.3 Update parameter
    
    
    ### YOUR CODE BEGINS HERE (approximately 5 lines)
    d = x.shape[1]
    par = Initialize_pars(d,rn)
    
    for i in range(M):
        cache = forward(x, y, par)
        grad = backprop(x, y, cache)
        par = update_par(par, grad, alpha)
    ### YOUR CODE ENDS
    
        if i % 500 == 0:
            print("After %4d iterations, the cost is %10.8f" % (i, cache['J']))

    return par


    

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x, y = train_data_generation(5000, 100)
par = est_par_logistic(x, y, 0.005, 10000, 1234)

Your cost function should decrease. The estimation procedure can be stopped if the cost function remains stable.
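
One possible way to turn "the cost function remains stable" into a stopping rule is to stop once the cost changes by less than a small tolerance between two iterations. The sketch below is a hypothetical variant of the training loop on toy data. The function `fit_with_tolerance`, the tolerance `tol`, and the data are assumptions, not the homework's `est_par_logistic`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def fit_with_tolerance(X, Y, alpha=0.1, max_iter=10000, tol=1e-8):
    # Hypothetical batch gradient descent that stops when the cost stabilizes
    n, d = X.shape
    w = np.zeros((d, 1))
    b = np.zeros((1, 1))
    prev_cost = np.inf
    for it in range(max_iter):
        A = sigmoid(X @ w + b)
        cost = float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
        if abs(prev_cost - cost) < tol:   # cost barely changed: stop early
            break
        prev_cost = cost
        err = A - Y
        w -= alpha * (X.T @ err) / n
        b -= alpha * np.mean(err)
    return w, b, cost, it

rng = np.random.default_rng(2)            # toy data, not the homework dataset
X = rng.normal(size=(200, 2))
Y = rng.binomial(1, sigmoid(-0.5 + X[:, [0]] - X[:, [1]])).astype(float)
w, b, cost, it = fit_with_tolerance(X, Y)
print("stopped after", it, "iterations with cost", cost)
```

The tolerance is a design choice: a larger value stops earlier but may leave the parameters further from the minimizer.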

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
print("Your estimated w is")
print(par['w'])
print('The expected value for w is\n[[ 0.09970773]\n [-0.09344206]]\n')
print("Your estimated b is")
print(par['b'])
print('The expected value for b is\n[[-0.49004407]]\n')

Please notice that those values should be very close to the truth. Next, let's visualize the estimation result. 

In [None]:
x1_margin = np.linspace(-15,20,200)
x2_margin = np.linspace(-15,20,200)
x1_grid, x2_grid = np.meshgrid(x1_margin,x2_margin)
y_grid = sigmoid(par['b'] + par['w'][0] * x1_grid + par['w'][1]*x2_grid)
y_grid[y_grid>=0.5] = 1
y_grid[y_grid<0.5] = 0

In [None]:
plt.contourf(x1_grid, x2_grid, y_grid, cmap=plt.cm.Spectral)
scatter = plt.scatter(x[:,0], x[:,1], c = y[:,0], cmap=plt.cm.Spectral,s=1)
plt.legend(*scatter.legend_elements()) # add legend
plt.show()

<a name='3'></a>
## 3 - Neural Network
In the previous section, we discussed how to train a logistic regression model using a computation graph. As mentioned in class, parametric models suffer from model mis-specification. If the logistic regression model is wrongly specified, the inference may be wrong.

<a name='3.1'></a>
### 3.1 - Data generation

Consider the following model 
$$
y^{(i)}\sim\mbox{Bernoulli}\{\pi(x^{(i)})\},\\
\pi(x^{(i)}) = \lVert x^{(i)}\rVert/2,$$
where $\mbox{Bernoulli}(p)$ is a Bernoulli distribution with success probability $p$, $x^{(i)} = (x^{(1i)},x^{(2i)})^T$, $x^{(1i)} = r^{(i)}\cos(\theta^{(i)})$, $x^{(2i)} = r^{(i)}\sin(\theta^{(i)})$, $r^{(i)}\sim\mbox{Uniform}(0,2)$, $\theta^{(i)}\sim\mbox{Uniform}(0,2\pi)$, $\mbox{Uniform}(a,b)$ is a uniform distribution over the interval $(a,b)$, and $\lVert x\rVert = (x_1^2+x_2^2)^{1/2}$ is the Euclidean norm for a vector $x=(x_1,x_2)^T$. Clearly, the training data is not generated from a logistic regression model.

Let us write a function to generate a training dataset of size $n$ with a random seed `rn`. 

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def train_data_generation_nn(n, rn):
    # n: sample size
    # rn: random seed
    
    # Step 1. Set random seed
    # Step 2. Generate r
    # Step 3. Generate theta
    # Step 4. Generate x
    # Step 5. Generate y
    
    ### YOUR CODE BEGINS HERE (approximately 5 lines)
    np.random.seed(rn)
    r = np.random.uniform(0,2,(n,1))
    theta2 = np.random.uniform(0,2*np.pi,(n,1))
    x = np.concatenate((r * np.cos(theta2), r * np.sin(theta2)),axis = 1)
    y = np.random.binomial(1, r/2, (n,1))
    ### YOUR CODE ENDS
    
    return x, y

Next, we visualize the generated data. 

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x, y = train_data_generation_nn(1000, 100)

fig, ax = plt.subplots()
scatter = ax.scatter(x[:,0], x[:,1],  c=y[:,0])
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="lower right", title="Classes")
ax.set_title('Simulated training dataset')
plt.show()


From the above figure, we can see that there is no linear boundary such that the two classes can be separated nicely. That is, the logistic regression model is intuitively wrongly specified for this kind of dataset. Now, let's blindly fit a logistic regression model to this dataset and visualize the fitted result.

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
par = est_par_logistic(x, y, 0.01, 10000, 1234)

After 10,000 iterations, the cost decreases to 0.69, and the estimated model parameters are very close to zero. The estimated parameters suggest that the features contribute little to the response in the fitted logistic regression model. Clearly, this is not the case: the features do contribute, but in a non-linear manner.
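
A cost of about 0.69 has a simple interpretation: when all weights and the bias are zero, every predicted probability is 0.5, and the cross-entropy cost equals $\log 2\approx 0.693$ whatever the labels are. The short check below uses made-up labels to illustrate this.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

y = np.array([[0.], [1.], [1.], [0.], [1.]])   # made-up labels; any 0/1 labels give the same cost
A = sigmoid(np.zeros_like(y))                  # zero weights and bias: A = 0.5 everywhere
J = float(-np.mean(y * np.log(A) + (1 - y) * np.log(1 - A)))

print(J, np.log(2))
```

So a fitted model whose cost stays near 0.69 does hardly better than predicting probability 0.5 for every point.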

Next, let's look at the estimation results.

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x1_margin = np.linspace(-2,2,200)
x2_margin = np.linspace(-2,2,200)
x1_grid, x2_grid = np.meshgrid(x1_margin,x2_margin)
y_grid = sigmoid(par['b'] + par['w'][0] * x1_grid + par['w'][1]*x2_grid)
y_grid[y_grid>=0.5] = 1
y_grid[y_grid<0.5] = 0



plt.contourf(x1_grid, x2_grid, y_grid, cmap=plt.cm.Spectral)
scatter = plt.scatter(x[:,0], x[:,1], c = y[:,0], cmap=plt.cm.Spectral,s=1)
plt.legend(*scatter.legend_elements()) # add legend
plt.show()

The estimated logistic regression wrongly predicts all the points in the square $[-2,2]\times[-2,2]$ to be 0, which does not make any sense. 

<a name='3.2'></a>
### 3.2 - Architecture of a fully connected neural network with one hidden layer

As mentioned in the slides, we would like to try a neural network with one hidden layer and use a computation graph to train the model parameters. Specifically, we take the following steps. 

 * Initialize the model parameters. Random initialization is used for weights, and the bias terms are set to be 0.
 * Based on the current parameters, we use forward propagation to calculate the cost function as well as other intermediate terms. Please remember that we need to cache the cost function as well as the "A"s.
 * Use backpropagation to obtain the derivatives with respect to the model parameters. 
 * Update the model parameter by (batch) gradient descent method with a learning rate $\alpha$.

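First, we initialize the model parameters. Please notice that, unlike in the logistic regression model, the weights are stored in matrices in which each row holds the weights of one neuron: `W1` has size $n_a\times d$ and `W2` has size $1\times n_a$, where $n_a$ is the number of hidden neurons. 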

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def Initialize_pars_nn(d,na,rn):
    # d: the dimension of the feature.
    # na: the dimension of the hidden layer. That is, the number of neurons.
    # rn: random seed
    
    # Step 1. Set random seed
    # Step 2. Initialize W1 (Pay attention to the dimension)
    # Step 3. Initialize b1 (Pay attention to the dimension)
    # Step 4. Initialize W2 (Pay attention to the dimension)
    # Step 5. Initialize b2 (Pay attention to the dimension)
    
    ### YOUR CODE BEGINS HERE (approximately 5 lines)
    np.random.seed(rn)
    W1 = np.random.normal(0,1,(na,d))
    b1 = np.zeros((na,1))
    W2 = np.random.normal(0,1,(1,na))
    b2 = np.zeros((1,1))
    ### YOUR CODE ENDS
    
    par = {
        'W1': W1,
        'b1': b1,
        'W2': W2,
        'b2': b2,
          }
    
    return par

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
d = x.shape[1]
rn = 1234
par = Initialize_pars_nn(d,4,rn)
print(par['W1'][0,:])
print('Your result should be:\n[ 0.47143516 -1.19097569]')

Given the current parameters, we next need a forward propagation to obtain the cost as well as "A"s for the backpropagation.
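
Before coding the forward propagation, it can help to track the array shapes. The sketch below uses toy sizes and random values, which are assumptions for illustration. It follows the convention that each row of the feature matrix is one observation, so the weight matrices are transposed before multiplying.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n, d, na = 5, 2, 4                        # toy sizes, assumed for illustration
rng = np.random.default_rng(3)
x = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(na, d)), np.zeros((na, 1))
W2, b2 = rng.normal(size=(1, na)), np.zeros((1, 1))

Z1 = x @ W1.T + b1.T      # (n, d) @ (d, na) + (1, na) -> (n, na), b1.T is broadcast over rows
A1 = sigmoid(Z1)          # (n, na)
Z2 = A1 @ W2.T + b2       # (n, na) @ (na, 1) + (1, 1) -> (n, 1)
A2 = sigmoid(Z2)          # (n, 1): one predicted probability per observation

print(Z1.shape, A1.shape, Z2.shape, A2.shape)   # (5, 4) (5, 4) (5, 1) (5, 1)
```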


In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def forward_nn(x, y, par):
    # x: feature matrix of size nX2
    # y: target vector of size nX1
    # par: dictionary with current parameters.

    # Step 1. Obtain W1 from par
    # Step 2. Obtain b1 from par
    # Step 3. Obtain W2 from par
    # Step 4. Obtain b2 from par
    # Step 5. Obtain Z1
    # Step 6. Obtain A1
    # Step 7. Obtain Z2
    # Step 8. Obtain A2
    # Step 9. Obtain J
    # Step 10. Cache J, A1 and A2. The reason we cache J is that we would like to monitor the value of the cost function 
    #         as the iterations go by.
    
    
    ### YOUR CODE BEGINS HERE (approximately 9 lines)
    W1 = par["W1"]
    b1 = par["b1"]
    W2 = par["W2"]
    b2 = par["b2"]
    
    Z1 = x @ W1.transpose() + b1.transpose() # row-wise form of W1 x + b1
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2.transpose() + b2
    A2 = sigmoid(Z2)
    J = - np.mean(y * np.log(A2) + (1-y) * np.log(1-A2))
    ### YOUR CODE ENDS
    
    cache = {
        'J': J,
        'A1': A1,
        'A2': A2
    }
    return cache

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
cache = forward_nn(x, y, par)
print(cache['J'])
print('Your result should be:\n0.7317784853362318')

Next, we consider the backpropagation.
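
For reference, one way to write the gradients for this network is given below. It is a sketch under the assumption that each row of $X$ is one observation, that sigmoid activations are used in both layers, and that the cost is the average cross-entropy. The symbols $dZ_1$, $dZ_2$ and $\odot$ (elementwise product) are notation introduced here, so please check them against the slides.
$$
dZ_2 = A_2 - Y,\quad dW_2 = n^{-1}dZ_2^{T}A_1,\quad db_2 = n^{-1}1_n^{T}dZ_2,\\
dZ_1 = (dZ_2 W_2)\odot A_1\odot(1-A_1),\quad dW_1 = n^{-1}dZ_1^{T}X,\quad db_1 = n^{-1}dZ_1^{T}1_n,
$$
where $1_{n}=(1,\ldots,1)^T$ is a vector of 1's with length $n$. The factor $A_1\odot(1-A_1)$ is the derivative of the sigmoid function evaluated at $Z_1$, which is why caching $A_1$ is enough.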

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def backprop_nn(x, y, par, cache):
    # x: feature matrix of size nX2
    # y: target vector of size nX1
    # par: dictionary containing the current parameters.
    # cache: cached values for A
    
    # Step 1. Obtain the sample size n
    # Step 2. Obtain A1 from the cache
    # Step 3. Obtain A2 from the cache
    # Step 4. Obtain dZ2
    # Step 5. Obtain dW2
    # Step 6. Obtain db2
    # Step 7. Obtain dZ1 using W2 from par
    # Step 8. Obtain dW1
    # Step 9. Obtain db1
    # Step 10. Store dW1, db1, dW2, db2 in the dictionary grad
    
    ### YOUR CODE BEGINS HERE (approximately 9 lines)
    n = x.shape[0]
    A1 = cache['A1']
    A2 = cache['A2']
    dZ2 = A2 - y
    dW2 = dZ2.transpose() @ A1 / n
    db2 = np.mean(dZ2,  keepdims=True)
    dZ1 = (dZ2 @ par['W2']) * (A1*(1-A1)) 
    dW1 = dZ1.transpose() @ x / n
    db1 = np.mean(dZ1,axis = 0, keepdims = True).transpose()
    ### YOUR CODE ENDS
    
    grad = {
        'dW1': dW1,
        'db1': db1,
        'dW2': dW2,
        'db2': db2,
    }
    return grad

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
grad = backprop_nn(x, y, par, cache)
print("The first row of your dw1 is:")
print(grad['dW1'][0,:])
print("The expected value is:\n[-3.19236049e-04  2.16698484e-05]\n")
print("The transpose of your db1 is:")
print(grad['db1'].transpose())
print("The expected value is:\n[[ 5.96842958e-05 -1.42505549e-02  2.40350654e-03  1.45257415e-03]]")

Next, update your model parameters. 

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def update_par_nn(par, grad, alpha):
    # par: dictionary with current parameters.
    # grad: dictionary with gradients
    # alpha: learning rate
    
    # Step 1. Update W1
    # Step 2. Update b1
    # Step 3. Update W2
    # Step 4. Update b2    
    
    ### YOUR CODE BEGINS HERE (approximately 4 lines)
    par['W1'] -= alpha * grad['dW1']
    par['b1'] -= alpha * grad['db1']
    par['W2'] -= alpha * grad['dW2']
    par['b2'] -= alpha * grad['db2']
    ### YOUR CODE ENDS
    
    return par


In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
alpha = 0.01
par = update_par_nn(par, grad, alpha)
print("The first row of your updated w1 is:")
print(par['W1'][0,:])
print("The expected value is:\n[ 0.47143836 -1.19097591]\n")
print("The transpose of your updated b1 is:")
print(par['b1'].transpose())
print("The expected value is:\n[[-1.19368592e-06  2.85011098e-04 -4.80701309e-05 -2.90514829e-05]]")

<a name='3.3'></a>
### 3.3 - Integration

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def est_par_nn(x, y, na, alpha, M, rn):
    # x: feature matrix of size nX2
    # y: target vector of size nX1
    # na: the dimension of the hidden layer. That is, the number of neurons.
    # alpha: learning rate
    # M: maximum number of iterations
    # rn: random seed for the initialization
 
    # Step 1. Obtain the dimension of features
    # Step 2. Initialize the parameters
    # Step 3. Iteration 
    #        Step 3.1 Forward propagation
    #        Step 3.2 Backpropagation
    #        Step 3.3 Update parameter
     
    
    ### YOUR CODE BEGINS HERE (approximately 5 lines)
    d = x.shape[1]
    par = Initialize_pars_nn(d,na, rn)
    
    for i in range(M):
        cache = forward_nn(x, y, par)
        grad = backprop_nn(x, y, par, cache)
        par = update_par_nn(par, grad, alpha)
    ### YOUR CODE ENDS        
        if i % 500 == 0:
            print("After %4d iterations, the cost is %10.8f" % (i, cache['J']))
    
    return par


In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x, y = train_data_generation_nn(1000, 100)
par = est_par_nn(x, y, 4, 0.01, 10000, 1234)

After 10,000 iterations, the cost decreases to 0.63, which is much better than the 0.69 for the wrongly specified logistic regression model.

Now that we have obtained the trained neural network, we need another function for prediction. Please notice that "A2" in the cache is the prediction based on the current parameters. Thus, you only need to copy your code for "forward_nn" except the last line for the cost. 

In [None]:
# PLEASE DO NOT CHANGE THE EXISTING CODE
def prediction_nn(x_test, par):
    # x_test: a test feature matrix of size n_testX2
    # par: the trained parameter dictionary
 
    # Step 1. Obtain W1 from par
    # Step 2. Obtain b1 from par
    # Step 3. Obtain W2 from par
    # Step 4. Obtain b2 from par
    # Step 5. Obtain Z1
    # Step 6. Obtain A1
    # Step 7. Obtain Z2
    # Step 8. Obtain A2
     
    
    ### YOUR CODE BEGINS HERE (approximately 8 lines)
    W1 = par["W1"]
    b1 = par["b1"]
    W2 = par["W2"]
    b2 = par["b2"]
    
    Z1 = x_test @ W1.transpose() + b1.transpose() # row-wise form of W1 x + b1
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2.transpose() + b2
    A2 = sigmoid(Z2)
    ### YOUR CODE ENDS        

    return A2

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x1_margin = np.linspace(-2,2,200)
x2_margin = np.linspace(-2,2,200)
x1_grid, x2_grid = np.meshgrid(x1_margin,x2_margin)
x_test = np.c_[x1_grid.ravel(), x2_grid.ravel()]
y_pred = prediction_nn(x_test, par)
y_pred[y_pred>=0.5]=1
y_pred[y_pred<0.5]=0

y_cont = y_pred.reshape(x1_grid.shape)


plt.contourf(x1_grid, x2_grid, y_cont, cmap=plt.cm.Spectral)
scatter = plt.scatter(x[:,0], x[:,1], c = y[:,0], cmap=plt.cm.Spectral,s=0.5)
plt.legend(*scatter.legend_elements()) # add legend
plt.show()

<a name='3.4'></a>
### 3.4 - Play by yourself!

With only 4 hidden nodes, we cannot get a good boundary. How about we try a neural network with 10 hidden nodes?
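
To get a feeling for how much the model grows with the number of hidden nodes, the sketch below counts the parameters of a one-hidden-layer network with $d$ inputs and one output. The helper `n_parameters` is hypothetical and only mirrors the parameter shapes used in this homework.

```python
def n_parameters(d, na):
    # W1: (na, d), b1: (na, 1), W2: (1, na), b2: (1, 1)
    return na * d + na + na + 1

for na in (4, 10):
    print(na, "hidden nodes:", n_parameters(2, na), "parameters")   # 17 and 41
```

More parameters give the network more flexibility to bend the decision boundary, at the price of more computation per iteration.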

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x, y = train_data_generation_nn(1000, 100)
par = est_par_nn(x, y, 10, 0.01, 10000, 1234)

In [None]:
# PLEASE DO NOT CHANGE THE FOLLOWING CODE
x1_margin = np.linspace(-2,2,200)
x2_margin = np.linspace(-2,2,200)
x1_grid, x2_grid = np.meshgrid(x1_margin,x2_margin)
x_test = np.c_[x1_grid.ravel(), x2_grid.ravel()]
y_pred = prediction_nn(x_test, par)
y_pred[y_pred>=0.5]=1
y_pred[y_pred<0.5]=0

y_cont = y_pred.reshape(x1_grid.shape)


plt.contourf(x1_grid, x2_grid, y_cont, cmap=plt.cm.Spectral)
scatter = plt.scatter(x[:,0], x[:,1], c = y[:,0], cmap=plt.cm.Spectral,s=0.5)
plt.legend(*scatter.legend_elements()) # add legend
plt.show()