## Quantitative Foundations of Data Science HW3

## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Ruchita Paithankar
    - Email: 
- Group member 2
    - Name:
    - Email: 
- Group member 3
    - Name: 
    - Email: 

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

## Implementing the chain rule in an artificial neural network (10 points)

Let us start with one input neuron and one output neuron. 
Recall from class:
$$a^{(1)}=\sigma(z^{(1)})$$
$$z^{(1)} = w^{(1)}a^{(0)}+b^{(1)}$$

We can measure how good or bad our neural network is with a cost function $C_k$ for the $k$th training example.
$$C_k = (a^{(1)}-y)^2$$
Later we will see how to apply this to an entire set of training examples. For now, let us look at a single instance where $x=1$ and $y=1$, i.e., for the input $x=1$ the label is $y=1$. Consider the starting weight $w^{(1)}= 1.3$ and bias parameter $b^{(1)}=-0.1$. Using the code below, calculate $a^{(1)}$ and then the cost $C_1$.

In [2]:
import numpy as np

In [3]:
# First we set the state of the network
σ = np.tanh
w1 = 1.3
b1 = -0.1

# Then we define the neuron activation.
def a1(a0):
    z = w1 * a0 + b1
    return σ(z)

# Experiment with different values of x below.
x = 1
y = 1  # label for this training example
print((a1(x) - y)**2)


0.02767078976828052
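
As a check on the printed value, the same computation can be carried out by hand. This is only a sketch with values rounded to four decimal places, using $x=1$, the label $y=1$ that the code subtracts, and the weight and bias given above:

$$z^{(1)} = 1.3 \cdot 1 - 0.1 = 1.2$$
$$a^{(1)} = \tanh(1.2) \approx 0.8337$$
$$C_1 = (0.8337 - 1)^2 \approx 0.0277$$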


The cost function of a training set is the average of the individual cost functions of the data in the training set,

$$C = \frac{1}{N} \sum_k C_k$$ where $N$ is the total number of training examples.
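
A minimal sketch of this average, assuming the same weight and bias as the single-example cell. The two-example training set `xs`, `ys` is made up purely for illustration and is not part of the assignment data.

```python
import numpy as np

# Same network state as in the single-example cell.
w1, b1 = 1.3, -0.1

def a1(a0):
    # Feed-forward for one neuron; works elementwise on arrays too.
    return np.tanh(w1 * a0 + b1)

# Hypothetical training set (illustrative values only).
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 1.0])

C_k = (a1(xs) - ys) ** 2   # individual costs, one per training example
C = C_k.mean()             # C = (1/N) * sum_k C_k
print(C_k, C)
```

The second entry of `C_k` corresponds to the example $x=1$, $y=1$ evaluated earlier.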



To improve the performance of the neural network on the training data, we can vary the weight and bias. We can calculate the derivative of the example cost with respect to these quantities using the chain rule.

$$\frac{\partial C_k}{\partial w^{(1)}} = \frac{\partial C_k}{\partial a^{(1)}}\frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial w^{(1)}}$$
$$\frac{\partial C_k}{\partial b^{(1)}} = \frac{\partial C_k}{\partial a^{(1)}}\frac{\partial a^{(1)}}{\partial z^{(1)}}\frac{\partial z^{(1)}}{\partial b^{(1)}}$$
Individually, these derivatives take a fairly simple form. Go ahead and calculate them.
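
For reference, a sketch of the individual factors, derived only from the definitions $C_k = (a^{(1)}-y)^2$, $a^{(1)}=\sigma(z^{(1)})$ and $z^{(1)} = w^{(1)}a^{(0)}+b^{(1)}$:

$$\frac{\partial C_k}{\partial a^{(1)}} = 2(a^{(1)}-y)$$
$$\frac{\partial a^{(1)}}{\partial z^{(1)}} = \sigma'(z^{(1)})$$
$$\frac{\partial z^{(1)}}{\partial w^{(1)}} = a^{(0)}$$
$$\frac{\partial z^{(1)}}{\partial b^{(1)}} = 1$$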

For $$\sigma(z) = \tanh(z)$$ 
$$\sigma'(z) = \frac{1}{\cosh^{2}(z)}$$
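
The derivative formula can be checked numerically. The sketch below compares $1/\cosh^2(z)$ against a central finite difference of `np.tanh`; the helper name `sigma_prime`, the sample points, and the step size `h` are illustrative choices, not part of the assignment.

```python
import numpy as np

def sigma_prime(z):
    # Analytic derivative of tanh.
    return 1 / np.cosh(z) ** 2

z = np.linspace(-2.0, 2.0, 9)   # a few sample points
h = 1e-6                        # finite-difference step (assumed)

# Central difference approximation of d/dz tanh(z).
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

max_err = np.max(np.abs(numeric - sigma_prime(z)))
print(max_err)
```

The same derivative can also be written as $1 - \tanh^2(z)$, which avoids a separate call to `np.cosh`.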

In [3]:
# Your task is to replace ???

In [4]:
# First define our sigma function.
sigma = np.tanh

# Next define the feed-forward equation.
def a1 (w1, b1, a0) :
    z = w1 * a0 + b1
    return sigma(z)

# The individual cost function is the square of the difference between
# the network output and the training data output.
def C (w1, b1, x, y) :
    return (a1(w1, b1, x) - y)**2

# This function returns the derivative of the cost function with
# respect to the weight.
def dCdw (w1, b1, x, y) :
    z = w1 * x + b1
    dCda = 2 * (a1(w1, b1, x) - y) # derivative of cost with respect to the activation
    dadz = 1/np.cosh(z)**2 # derivative of activation with respect to the weighted sum z
    dzdw = x # derivative of weighted sum z with respect to the weight
    return dCda * dadz * dzdw # Return the chain rule product.

# This function returns the derivative of the cost function with
# respect to the bias.
# It is very similar to the previous function.
# You should complete this function.
def dCdb (w1, b1, x, y) :
    z = w1 * x + b1
    dCda = 2 * (a1(w1, b1, x) - y)
    dadz = 1/np.cosh(z)**2
    # Change the next line to give the derivative of
    # the weighted sum, z, with respect to the bias, b.
    dzdb = 1
    return dCda * dadz * dzdb



In [5]:
# Sanity check
# Let's start with an unfit weight and bias.

w1 = 2.3
b1 = -1.2
# We can test on a single data point pair of x and y.
x = 0
y = 1
# Output how the cost would change
# in proportion to a small change in the bias
print( dCdb(w1, b1, x, y) )



-1.1186026425530913
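
One way to gain confidence in the analytic derivatives is to compare them with central finite differences of the cost. The sketch below restates the cost and both derivatives compactly so that it runs on its own; the step size `h` and the names `num_dCdw` and `num_dCdb` are illustrative assumptions.

```python
import numpy as np

def C(w1, b1, x, y):
    return (np.tanh(w1 * x + b1) - y) ** 2

def dCdw(w1, b1, x, y):
    z = w1 * x + b1
    return 2 * (np.tanh(z) - y) / np.cosh(z) ** 2 * x

def dCdb(w1, b1, x, y):
    z = w1 * x + b1
    return 2 * (np.tanh(z) - y) / np.cosh(z) ** 2

# Same values as the sanity check.
w1, b1, x, y = 2.3, -1.2, 0, 1
h = 1e-6  # finite-difference step (assumed)

# Central differences in each parameter.
num_dCdw = (C(w1 + h, b1, x, y) - C(w1 - h, b1, x, y)) / (2 * h)
num_dCdb = (C(w1, b1 + h, x, y) - C(w1, b1 - h, x, y)) / (2 * h)

print(dCdw(w1, b1, x, y), num_dCdw)
print(dCdb(w1, b1, x, y), num_dCdb)
```

Note that with $x=0$ the weight has no effect on $z^{(1)}$, so the derivative with respect to the weight is zero for this data point; only the bias derivative is nonzero.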


### Bonus section

In the real world, you will have a large number of training examples and hence use matrix or vector operations instead of scalar multiplications.
You are given the weight matrix $W$ and bias vector $b$. Complete the code for finding `a1`, `C`, `dCdw`, and `dCdb`.

In [6]:
# Define the activation function.
sigma = np.tanh

# Let's use a random initial weight and bias.
W = np.array([[-0.94529712, -0.2667356 , -0.91219181],
              [ 2.05529992,  1.21797092,  0.22914497]])
b = np.array([ 0.61273249,  1.6422662 ])

# define our feed forward function
def a1 (a0) :
  # Notice the next line is almost the same as previously,
  # except we are using matrix multiplication rather than scalar multiplication
  z = W @ a0 + b
  # Everything else is the same though,
  return sigma(z)

# Next, if a training example is,
x = np.array([0.7, 0.6, 0.2])
y = np.array([0.9, 0.6])

# Then the cost function is,
d = a1(x) - y # Vector difference between observed and expected activation
C = d @ d # Absolute value squared of the difference.

In [7]:
# First define our sigma function.
sigma = np.tanh

# Next define the feed-forward equation.
def a1 (w1, b1, a0) :
    z = np.matmul(w1 , a0) + b1
    return sigma(z)


# This function returns the derivative of the cost function with
# respect to the weight.
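def dCdw (w1, b1, x, y) :
    z = np.matmul(w1, x) + b1 # weighted sum, one entry per output neuron
    dCda = 2 * (a1(w1, b1, x) - y) # derivative of cost with respect to the activation
    dadz = 1/np.square(np.cosh(z)) # derivative of activation with respect to the weighted sum z

    # Column vector of dC/dz, one row per output neuron.
    J = (dCda * dadz).reshape((-1, 1))

    dzdw = x # derivative of weighted sum z with respect to the weights

    # Broadcasting forms the outer product: entry (i, j) is dC/dz_i * x_j.
    J = J * dzdw

    return J # Return the chain rule product.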

# This function returns the derivative of the cost function with
# respect to the bias.
# It is very similar to the previous function.
# You should complete this function.
def dCdb (w1, b1, x, y) :
    
    dCda = 2 * (a1(w1, b1, x) - y)
    dadz = 1/np.cosh(w1 @ x + b1)**2
    # Change the next line to give the derivative of
    # the weighted sum, z, with respect to the bias, b.
    dzdb = 1
    return dCda * dadz * dzdb

dCdb (W, b, x, y) 

array([-2.19184549e+00,  1.42277240e-03])

In [8]:
dCdw (W, b, x, y)

array([[-1.53429185e+00, -1.31510730e+00, -4.38369099e-01],
       [ 9.95940681e-04,  8.53663441e-04,  2.84554480e-04]])
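
The matrix gradient can be checked the same way, by perturbing one entry of $W$ at a time. This is a sketch under stated assumptions: the helper names `cost` and `grad_W` and the step size `h` are illustrative, and `grad_W` writes the chain-rule product as an explicit outer product with `np.outer` rather than the reshape-and-broadcast used in the cell above.

```python
import numpy as np

W = np.array([[-0.94529712, -0.2667356 , -0.91219181],
              [ 2.05529992,  1.21797092,  0.22914497]])
b = np.array([0.61273249, 1.6422662])
x = np.array([0.7, 0.6, 0.2])
y = np.array([0.9, 0.6])

def cost(W, b, x, y):
    d = np.tanh(W @ x + b) - y
    return d @ d

def grad_W(W, b, x, y):
    z = W @ x + b
    delta = 2 * (np.tanh(z) - y) / np.cosh(z) ** 2  # dC/dz per output neuron
    return np.outer(delta, x)                       # entry (i, j) = delta_i * x_j

h = 1e-6  # finite-difference step (assumed)
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = h  # perturb a single weight
        numeric[i, j] = (cost(W + E, b, x, y) - cost(W - E, b, x, y)) / (2 * h)

max_err = np.max(np.abs(numeric - grad_W(W, b, x, y)))
print(max_err)
```

A small `max_err` indicates that the analytic gradient and the numerical estimate agree to within finite-difference error.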