# MIT 6.036 Spring 2019: Homework 6

This homework does not include provided Python code. Instead, we encourage you to write your own code to help you answer some of these problems, and/or test and debug the code components we do ask for. All of the problems should be simple enough to work out by hand, but it may be convenient to write some short programs to explore the neural networks, particularly for problem 2.


This homework builds on the material in the notes on neural networks up through and including section 6 on loss functions.

In particular, in this homework we consider neural networks with multiple layers. Each layer has multiple inputs and outputs, and can be broken down into two parts:

- A __linear__ module that implements a linear transformation: $z_j = ( \sum_{i=1} ^ {m} x_iw_{i,j} ) + w_{0,j}$, specified by a weight matrix $W$ and a bias vector $W_0$. The output is $[z_1, \ldots, z_n]^T$.
- An __activation__ module that applies an activation function $f$ to the outputs of the linear module, such as Tanh or ReLU in the hidden layers or Softmax (see below) at the output layer. We write the output as $[f(z_1), \ldots, f(z_n)]^T$, although technically, for some activation functions such as softmax, each output will depend on all the $z_i$, not just one.
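The two modules can be sketched in a few lines of NumPy. This is only an illustration: the function names `linear_forward` and `relu`, and the weight values, are assumptions made for the example and are not part of the assignment.

```python
import numpy as np

def linear_forward(x, W, W0):
    """Linear module: x is (m, 1), W is (m, n), W0 is (n, 1); returns z with shape (n, 1)."""
    return W.T @ x + W0

def relu(z):
    """Elementwise ReLU activation: max(z, 0)."""
    return np.maximum(z, 0)

# Hypothetical layer with m = 2 inputs and n = 3 outputs (illustrative values only).
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 2.0, 1.0]])
W0 = np.array([[0.5], [-1.0], [-2.0]])
x = np.array([[1.0], [2.0]])

z = linear_forward(x, W, W0)  # pre-activation values z_1, ..., z_n
a = relu(z)                   # activations f(z_1), ..., f(z_n)
print(z.ravel())
print(a.ravel())
```

Here `W.T @ x` computes $\sum_i x_i w_{i,j}$ for every output $j$ at once, which is why $W$ is stored as an $m \times n$ matrix.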

We will use the following notation for quantities in a network:

- Inputs to the network are $x_1, \ldots, x_d$.
- Number of layers is $L$
- There are $m^l$ inputs to layer $l$
- There are $n^l = m^{l+1}$ outputs from layer $l$
- The weight matrix for layer $l$ is $W^l$, an $m^l \times n^l$ matrix, and the bias vector (offset) is $W_0^l$, an $n^l \times 1$ vector
- The outputs of the linear module for layer $l$ are known as __pre-activation__ values and denoted $z^l$
- The activation function at layer $l$ is $f^l(\cdot)$ 
- Layer $l$ activations are $a^l = [f^l(z_1^l), \ldots, f^l(z_{n^l}^l)]^T$
- The output of the network is the values $a^L = [f^L(z_1^L), \ldots, f^L(z_{n^L}^L)]^T$
- Loss function $Loss(a,y)$ measures the loss of output values $a$ when the target is $y$

# 1) Loss functions and output activations: classification
When doing classification, it's natural to think of the output values as being discrete: +1 and -1. But it is generally difficult to use optimization-based methods without somehow thinking of the outputs as being continuous (even though you will have to discretize when it's time to make a prediction).


## 1.1) Hinge loss, linear activation
When we looked at the SVM objective for classification, we did this:

- Defined the output space to be $\mathbb{R}$
- Developed the hinge loss function
$$ Loss(a,y) = L_h(ya) = \begin{cases}
        0, & \text{if } ya \gt 1 \\
        1-ya, & \text{otherwise } 
       \end{cases}
$$
where $a$ is the continuous output (we're using $a$ here to be consistent with the neural network terminology of _activation_) and $y$ is the desired/target output
- Tried to find parameters $\theta$ of our model to minimize loss summed over the training data
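As a small sketch of the hinge loss defined above, the piecewise definition can be written with a single `max`. The function name `hinge_loss` and the sample values are assumptions made for illustration.

```python
def hinge_loss(a, y):
    """Hinge loss L_h(ya): 0 when ya > 1, otherwise 1 - ya."""
    return max(0.0, 1.0 - y * a)

# Illustrative values: a is the continuous output, y is the +1/-1 label.
print(hinge_loss(2.0, 1))    # margin satisfied (ya = 2 > 1), so no loss
print(hinge_loss(0.5, 1))    # correct side but inside the margin
print(hinge_loss(0.5, -1))   # wrong side of the boundary
```

Writing the loss as `max(0, 1 - ya)` is equivalent to the two-case form because $1 - ya$ is negative exactly when $ya > 1$.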

Consider a single "neuron" with a linear activation function; that is, where $a_1^L = \sum_k w_{k,1}^L x_k + w_{0,1}^L$. In this case, we have $L=1$ and $f^L(z)=z$.

## 1.1.A) 
Write a short program to compute the gradient of the loss function with respect to the weight vector (not the bias): $\nabla_{w^L}Loss(a_1^L,y)$ when $Loss(a,y) = L_h(ya)$.
- `x` is a column vector
- `y` is a number, a label
- `a` is a number, an activation

It should return a column vector.



In [34]:
import numpy as np

def hinge_loss_grad(x, y, a):
    # Gradient is 0 where the margin is satisfied (ya > 1), otherwise -y*x.
    return np.where(y*a > 1, 0, -y*x)
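One way to test an answer like the one above is a finite-difference check: perturb each weight slightly and compare the change in loss against the analytic gradient. The sketch below assumes zero bias; the helper `hinge_loss_of_w` and the values of `w`, `x`, and `y` are hypothetical choices for illustration.

```python
import numpy as np

def hinge_loss_grad(x, y, a):
    # Same function as in the answer above, repeated so this block runs on its own.
    return np.where(y * a > 1, 0, -y * x)

def hinge_loss_of_w(w, x, y):
    # Hinge loss of a single linear unit with zero bias (assumption for this sketch).
    a = (w.T @ x).item()
    return max(0.0, 1.0 - y * a)

# Hypothetical values, chosen so that ya = 0 is away from the kink at ya = 1.
w = np.array([[0.2], [-0.1]])
x = np.array([[1.0], [2.0]])
y = 1

a = (w.T @ x).item()
analytic = hinge_loss_grad(x, y, a)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    dw = np.zeros_like(w)
    dw[i] = eps
    numeric[i] = (hinge_loss_of_w(w + dw, x, y) - hinge_loss_of_w(w - dw, x, y)) / (2 * eps)

print(analytic.ravel())
print(numeric.ravel())
```

The check is only meaningful away from $ya = 1$, where the hinge loss is not differentiable.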

## 1.2) Log loss, sigmoidal activation
Another way to make the output for a classifier continuous is to make it be in the range $(0,1)$, which admits the interpretation of being the predicted __probability__ that the example is positive. A convenient way to make the activation of a unit be in the range $(0,1)$ is to use a sigmoid function:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

The figure below shows a sigmoid activation function on the left, with the rectified linear (ReLU) activation function on the right for comparison.

<p align="center">
    <img src="https://introml_oll.odl.mit.edu/cat-soop/_static/6.036/homework/hw06/sig_relu_v1.png" width="600"/>
</p>


## 1.2.A) 
What is an expression for the derivative of the sigmoid with respect to $z$, expressed as a function of $z$, its input?

$\frac{d\sigma}{dz} = -1(1 + e^{-z})^{-2} (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$


## 1.2.B) 
What is an expression for the derivative of the sigmoid with respect to $z$, but this time expressed as a function of 
$o = \sigma(z)$, its output? Hint: Think about the expression $1 - \frac{1}{1+e^{-z}}$.

$1 - \frac{1}{1+e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = 1 - \sigma(z)$

$\frac{d\sigma}{dz} = (1 - \sigma(z))\sigma(z) = \left(\ \frac{e^{-z}}{1 + e^{-z}} \right)\ \left(\ \frac{1}{1 + e^{-z}} \right)\ = \frac{e^{-z}}{(1 + e^{-z})^2}$

$\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \left(\ \frac{e^{-z}}{1 + e^{-z}} \right)\ \left(\ \frac{1}{1 + e^{-z}} \right)\ = (1 - \sigma(z))\sigma(z) = (1 - o)o$
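The two forms of the derivative can be compared numerically. This is a quick sanity check rather than a proof; the function names and the sample values of $z$ are chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid_from_z(z):
    """Derivative written in terms of the input z."""
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

def dsigmoid_from_output(o):
    """Derivative written in terms of the output o = sigmoid(z)."""
    return o * (1.0 - o)

for z in (-2.0, 0.0, 3.0):
    o = sigmoid(z)
    print(z, dsigmoid_from_z(z), dsigmoid_from_output(o))
```

The output form is convenient in practice because the forward pass has already computed $o$, so the derivative costs one multiplication.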

__In this model, we will consider positive points to have label +1, and negative points to have label 0.__

We need a loss function that works well when we are predicting probabilities. A good choice is to ask what probability is assigned to the correct label. We will interpret the value output by our classifier as the probability that the example is positive. So, if the output value is $a$ and the true label is $+1$, then the probability assigned to the true label is $a$; on the other hand, if the true label is $0$, then the probability assigned to the true label is $1-a$. Because we are actually interested in the probability of the predictions on the whole data set, we'd want to choose weights to __maximize__

$$\prod_t P(a^{(t)}, y^{(t)})$$ 

where $P(a^{(t)}, y^{(t)})$ is the probability that the network predicts the correct label for data point $(t)$.

Using a notational trick (which turns an if expression into a product) that might seem unmotivated now, but will be useful later, we can write the probability $P(a, y)$ as

$$ P(a, y) = a^y(1-a)^{(1-y)} $$

## 1.2.C) 
What is the value of $P(a,y)$ when $y=0$?

## 1.2.D) 
What is the value of $P(a,y)$ when $y=1$?
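The product form of $P(a, y)$ can be checked against the equivalent if-expression. The function names and the value $a = 0.8$ below are assumptions made for illustration.

```python
def prob_correct(a, y):
    """Product form: a^y * (1 - a)^(1 - y), for y in {0, 1}."""
    return a ** y * (1 - a) ** (1 - y)

def prob_correct_if(a, y):
    """Equivalent if-expression: a when y = 1, and 1 - a when y = 0."""
    return a if y == 1 else 1 - a

a = 0.8  # hypothetical classifier output
for y in (0, 1):
    print(y, prob_correct(a, y), prob_correct_if(a, y))
```

The trick works because any factor raised to the power 0 equals 1, so only the factor matching the true label survives.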

## 1.2.E) 
Find a simplified expression for $\log (P(a, y))$ that does not use exponentiation. Note that we refer to the natural logarithm $\ln$ as $\log$ throughout this assignment, consistent with the lecture notes.

In fact, because log is a monotonic function, the same weights that maximize the product of the probabilities will minimize the _negative log likelihood_ ("likelihood" is the same as probability; we just use that name here because the phrase is an idiom in machine learning, abbreviated NLL):

$$ Loss(a,y) = NLL(a, y) = -y\log a - (1-y) \log(1-a) $$

Our objective function (over our $n$ data points) will then be

$$ \sum_{t=1} ^ n NLL(a^{(t)}, y^{(t)}) = -\sum_{t=1} ^ n \left[ y^{(t)}\log a^{(t)} + (1-y^{(t)}) \log(1-a^{(t)}) \right] $$
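A minimal sketch of this objective, assuming a small made-up set of outputs and labels: the sum of per-example NLL values equals the negative log of the product of the probabilities assigned to the correct labels, which is why minimizing one maximizes the other.

```python
import math

def nll(a, y):
    """Negative log likelihood for one example, with y in {0, 1} and 0 < a < 1."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def total_nll(outputs, labels):
    """Objective: sum of NLL over all training examples."""
    return sum(nll(a, y) for a, y in zip(outputs, labels))

# Hypothetical model outputs and true labels.
outputs = [0.9, 0.2, 0.6]
labels = [1, 0, 1]

objective = total_nll(outputs, labels)
# Product of the probabilities assigned to the correct labels: 0.9 * (1 - 0.2) * 0.6.
likelihood = 0.9 * 0.8 * 0.6
print(objective, -math.log(likelihood))
```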

Remember that $a^{(t)}$ is our model's output for training example $t$, and $y^{(t)}$ is the true label (+1 or 0).

Now, we can think about a single unit with a sigmoidal activation function, trained to minimize NLL. So, $a_1^L = \sigma( \sum_k w_{k,1}^L x_k + w_{0,1}^L )$. In this case, we have $L=1$.

## 1.2.F) 
Write a formula for the gradient of the NLL with respect to the first weight, $\nabla_{w_{1,1}^L} NLL(a_1^L, y)$ for a single training example. Hint: consider using the chain rule; the final answer (expression) is very short.

$$
\begin{align}
    NLL(a_1^L, y) &= -y\log a - (1-y) \log(1-a) \\
    &= -y\log( \sigma ( \sum_k w_{k,1}^L x_k + w_{0,1}^L ) ) - (1-y) \log( 1 - \sigma ( \sum_k w_{k,1}^L x_k + w_{0,1}^L ) ) \\
\end{align}
$$

Let $z_1^L = \sum_k w_{k,1}^L x_k + w_{0,1}^L$ be the pre-activation value, so that $a_1^L = \sigma(z_1^L)$.

$$
\begin{align}
    \frac{\partial}{\partial w_{1,1}^L} NLL(a_1^L, y) &= -y \frac{1}{\sigma(z_1^L)} \sigma(z_1^L) ( 1- \sigma(z_1^L) )x_1 - (1 - y) \frac{1}{1- \sigma(z_1^L)} (-1) \sigma(z_1^L) (1 - \sigma(z_1^L)) x_1 \\
    &= -y (1 - \sigma(z_1^L)) x_1 + (1 -y) \sigma(z_1^L) x_1 \\
    &= x_1 (-y + y \sigma(z_1^L) - y \sigma(z_1^L) + \sigma(z_1^L)) \\
    &= x_1 (a_1^L - y)
\end{align}
$$

## 1.2.G)
Write a formula for the gradient of the NLL with respect to the full weight vector, $\nabla_{W^L} NLL(a_1^L, y)$, for a single training example.

$$
\begin{align}
    \nabla_{W^L} NLL(a_1^L, y) &= \frac{\partial Loss}{\partial a} \ \frac{\partial a}{\partial z} \ \frac{\partial z}{\partial w} \\
    &= \left(\ \frac{-y}{a} + \frac{(1-y)}{1-a} \right)\ \left(\ (1-a)a \right)\ \left(\ x \right)\ \\
    &= \left(\ (-y)(1-a) + a(1-y) \right)\ \left(\ x \right)\ \\
    &= ( -y +ya + a - ya )x \\
    &= (a-y)x \\
\end{align}
$$
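The result $(a - y)x$ can be sanity-checked with finite differences. The sketch below uses hypothetical values for `w`, `w0`, `x`, and `y`, and the helper names are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_of_w(w, w0, x, y):
    """NLL of a single sigmoid unit for one training example."""
    a = sigmoid((w.T @ x).item() + w0)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

def nll_grad(w, w0, x, y):
    """Analytic gradient with respect to the weight vector: (a - y) x."""
    a = sigmoid((w.T @ x).item() + w0)
    return (a - y) * x

# Hypothetical values for illustration.
w = np.array([[0.3], [-0.5]])
w0 = 0.1
x = np.array([[1.0], [2.0]])
y = 1

analytic = nll_grad(w, w0, x, y)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    dw = np.zeros_like(w)
    dw[i] = eps
    numeric[i] = (nll_of_w(w + dw, w0, x, y) - nll_of_w(w - dw, w0, x, y)) / (2 * eps)

print(analytic.ravel())
print(numeric.ravel())
```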

# 2) Multiclass classification
What if we needed to classify homework problems into three categories: enlightening, boring, impossible? We can do this by using a "one-hot" encoding on the output, and using three output units with what is called a "softmax" (SM) activation module. It's not a typical activation module, since it takes in all $n^L$ pre-activation values $z_j^L$ in $\mathbb{R}$ and returns $n^L$ output values $a_j^L \in [0,1]$ such that $\sum_j a_j^L = 1$. This can be interpreted as representing a probability distribution over the possible categories.

The individual entries are computed as

$$ a_j = \frac{e^{z_j}}{\sum_{k=1} ^ {n^L} e^{z_k}}$$
 
We'll describe the dependence of the vector $a$ on the vector $z$ as 

$$a=SM(z)$$

The network below shows a one-layer network with a linear module followed by a softmax activation module.

<p align="center">
    <img src="https://introml_oll.odl.mit.edu/cat-soop/_static/6.036/homework/hw06/softmax.png" width="600"/>
</p>

## 2.A)
What probability distribution over the categories is represented by $z^L = [-1, 0, 1]^T$?

Now, we need a loss function $Loss(a,y)$ where $a$ is a discrete probability distribution and $y$ is a one-hot vector encoding of a single output value. It makes sense to use negative log likelihood as a loss function for the same reasons as before. So, we'll just extend our definition of NLL from earlier:

$$ NLL(a, y) = - \sum_{j=1} ^ {n^L} y_j\ln a_j^L $$


Note that the above expression is for the multi-class case (number of classes $\gt$ 2). For two classes, the expression reduces to what you saw after Problem 1.2.E.

## 2.B)
If $a = [0.3, 0.5, 0.2]^T$ and $y = [0, 0, 1]^T$, what is $NLL(a,y)$?
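A short sketch of the multi-class NLL, evaluated on the vectors from 2.B. Because $y$ is one-hot, only the term for the true class contributes, so the result is $-\ln 0.2$. The function name `nll_multiclass` is an assumption made for illustration.

```python
import math

def nll_multiclass(a, y):
    """Multi-class NLL: -sum_j y_j * ln(a_j), with y a one-hot vector."""
    return -sum(y_j * math.log(a_j) for a_j, y_j in zip(a, y))

a = [0.3, 0.5, 0.2]
y = [0, 0, 1]
print(round(nll_multiclass(a, y), 3))  # 1.609
```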

Now, we can think about a single layer with a softmax activation module, trained to minimize NLL. The pre-activation values (the output of the linear module) are:

$$ z_j^L = \sum_k w_{k,j}^L x_k + w_{0,j}^L $$

and $a^L = SM(z^L)$.
To do gradient descent, we need to know $ \frac{\partial}{\partial w_{k,j}^L}NLL(a^L, y) $. We'll reveal the secret (that you might guess from Problem 1) that it has an awesome form! (Please consider deriving this, for fun and satisfaction!)

$$ \frac{\partial}{\partial w_{k,j}^L}NLL(a^L, y) = x_k(a_j^L - y_j) $$

And of course, it's easy to compute the whole matrix of these derivatives, $\nabla_{w^L} NLL(a^L, y)$, in one quick matrix computation.
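The matrix computation is an outer product of the input with the output error, and it can be compared with finite differences. In the sketch below the biases are assumed to be zero, the weights and inputs are hypothetical, and subtracting the maximum inside `softmax` is a common numerical-stability trick that is not part of the formula above (it does not change the result).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability; the output is unchanged
    return e / np.sum(e)

def nll(W, x, y):
    """Multi-class NLL of a one-layer softmax network with zero biases."""
    a = softmax(W.T @ x)
    return -np.sum(y * np.log(a))

def nll_grad(W, x, y):
    """Whole gradient matrix at once: entry (k, j) is x_k * (a_j - y_j)."""
    a = softmax(W.T @ x)
    return x @ (a - y).T  # (m, 1) @ (1, n) = (m, n)

# Hypothetical values for illustration.
W = np.array([[0.5, -0.3, 0.1],
              [0.2, 0.4, -0.6]])
x = np.array([[1.0], [-2.0]])
y = np.array([[0.0], [0.0], [1.0]])

analytic = nll_grad(W, x, y)

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for k in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[k, j] = eps
        numeric[k, j] = (nll(W + dW, x, y) - nll(W - dW, x, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```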

## 2.C)
Suppose we have two input units and three possible output values, and the weight matrix $W^L$ is

$$
W^L = \begin{bmatrix}
    1 & -1 & -2 \\
    -1 & 2 & 1 \\
\end{bmatrix}
$$

or in Python form: `w = np.array([[1, -1, -2], [-1, 2, 1]])`.

Assume the biases are zero, the input $x = [1,1]^T$ (e.g., `x = np.array([[1, 1]]).T`), and the target output $y=[0,1,0]^T$ (e.g., `y = np.array([[0, 1, 0]]).T`). What is the matrix $\nabla_{w^L} NLL(a^L, y)$? Hint: You might want to solve using Python and numpy, or using colab for calculation.

## 2.D) 
What is the predicted probability that $x$ is in class 1, before any gradient updates? (Assume we have classes 0, 1, and 2.)

## 2.E)
Using step size 0.5, what is $W^L$ after one gradient update step?

## 2.F)
What is the predicted probability that $x$ is in class 1, given the new weight matrix?

In [19]:
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z))

**Problem 2.A**

In [33]:
z = np.array([[-1, 0, 1]]).T
# your code here
# 2A
a = softmax(z)
np.round(a, 3)

array([[0.09 ],
       [0.245],
       [0.665]])

**Problem 2.C-F**

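In [31]:
w = np.array([[1, -1, -2], [-1, 2, 1]])
x = np.array([[1], [1]])
y = np.array([[0, 1, 0]]).T

# 2C: gradient of NLL with respect to W (biases are zero)
z = w.T @ x                        # pre-activations, 3x1
a = softmax(z)                     # predicted distribution, 3x1
nll_gradient = x @ ((a - y).T)     # (2x1) @ (1x3) = 2x3
print(np.round(nll_gradient, 3))

# 2D: predicted probability of class 1 before the update
print(a[1:2, :])

# 2E: one gradient step with step size 0.5
w = w - 0.5 * nll_gradient
print(np.round(w, 3))

# 2F: predicted probability of class 1 after the update
a = softmax(w.T @ x)
print(a[1:2, :])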

[[ 0.245 -0.335  0.09 ]
 [ 0.245 -0.335  0.09 ]]
[[0.66524096]]
[[ 0.878 -0.833 -2.045]
 [-1.122  2.167  0.955]]
[[0.77245284]]


# 3) Neural Networks
In this problem we will analyze a simple neural network to understand its classification properties. You might find the colab file useful. However, we encourage you to go through all the calculations by hand once, which is good practice.

Consider the neural network given in the figure below, with ReLU activation functions ($f^1$ in the figure) on all hidden neurons, and softmax activation ($f^2$ in the figure) for the output layer, resulting in softmax outputs ($a_1^2$  and $a_2^2$  in the figure).

<p align="center">
    <img src="https://introml_oll.odl.mit.edu/cat-soop/_static/6.036/homework/hw06/nnet.png" width="600"/>
</p>

Given an input $x = [x_1, x_2]^T$, the hidden units in the network are activated in stages as described by the following equations:

$$
z_1^1 = x_1w_{1,1}^1 + x_2w_{2,1}^1 + w_{0,1}^1 \qquad a_1^1 = \max \{ z_1^1, 0\} \\
z_2^1 = x_1w_{1,2}^1 + x_2w_{2,2}^1 + w_{0,2}^1 \qquad a_2^1 = \max \{ z_2^1, 0\} \\
z_3^1 = x_1w_{1,3}^1 + x_2w_{2,3}^1 + w_{0,3}^1 \qquad a_3^1 = \max \{ z_3^1, 0\} \\
z_4^1 = x_1w_{1,4}^1 + x_2w_{2,4}^1 + w_{0,4}^1 \qquad a_4^1 = \max \{ z_4^1, 0\} \\
$$

$$
z_1^2 = a_1^1w_{1,1}^2 + a_2^1w_{2,1}^2 + a_3^1w_{3,1}^2 + a_4^1w_{4,1}^2 + w_{0,1}^2 \\
z_2^2 = a_1^1w_{1,2}^2 + a_2^1w_{2,2}^2 + a_3^1w_{3,2}^2 + a_4^1w_{4,2}^2 + w_{0,2}^2 \\
$$

The final output of the network is obtained by applying the _softmax_ function to the pre-activation values $z_1^2, z_2^2$ of the output layer,

$$
a_1^2 = \frac{e^{z_1^2}}{e^{z_1^2} + e^{z_2^2}} \\
a_2^2 = \frac{e^{z_2^2}}{e^{z_1^2} + e^{z_2^2}} \\
$$

In this problem, we will consider the following setting of parameters:

$$
\begin{equation}
\begin{bmatrix}
    w_{1,1}^1 & w_{1,2}^1 & w_{1,3}^1 & w_{1,4}^1 \\
    w_{2,1}^1 & w_{2,2}^1 & w_{2,3}^1 & w_{2,4}^1 \\
\end{bmatrix}
=
\begin{bmatrix}
    1 & 0 & -1 & 0 \\
    0 & 1 & 0 & -1 \\
\end{bmatrix}
,\qquad
\begin{bmatrix}
    w_{0,1}^1 \\
    w_{0,2}^1 \\
    w_{0,3}^1 \\
    w_{0,4}^1 \\
\end{bmatrix}
=
\begin{bmatrix}
    -1 \\
    -1 \\
    -1 \\
    -1 \\
\end{bmatrix}
\end{equation}
$$

$$
\begin{equation}
\begin{bmatrix}
    w_{1,1}^2 & w_{1,2}^2 \\
    w_{2,1}^2 & w_{2,2}^2 \\
    w_{3,1}^2 & w_{3,2}^2 \\
    w_{4,1}^2 & w_{4,2}^2 \\
\end{bmatrix}
=
\begin{bmatrix}
    1 & -1 \\
    1 & -1 \\
    1 & -1 \\
    1 & -1 \\
\end{bmatrix}
,\qquad
\begin{bmatrix}
    w_{0,1}^2 \\
    w_{0,2}^2 \\
\end{bmatrix}
=
\begin{bmatrix}
    0 \\
    2 \\
\end{bmatrix}
\end{equation}
$$

# 3.1) Output
Consider the input $x_1 = 3, x_2 = 14$.

## 3.1.A) 
What are the outputs of the hidden units, $(f^1(z_1^1),f^1(z_2^1), f^1(z_3^1), f^1(z_4^1))$?

## 3.1.B)
What is the final output $(a_1^2, a_2^2)$ of the network?


In [35]:
# layer 1 weights
w_1 = np.array([[1, 0, -1, 0], [0, 1, 0, -1]])
w_1_bias = np.array([[-1, -1, -1, -1]]).T
# layer 2 weights
w_2 = np.array([[1, -1], [1, -1], [1, -1], [1, -1]])
w_2_bias = np.array([[0, 2]]).T

# your code here
def relu(z):
    return np.where(z > 0, z, 0)

# 3.1.A
x = np.array([[3, 14]]).T
z_1 = (w_1.T @ x) + w_1_bias
a_1 = relu(z_1)
print(a_1.reshape(1, -1).tolist())
z_2 = (w_2.T @ a_1) + w_2_bias
a_2 = softmax(z_2)

# 3.1.B
print(a_2.reshape(1, -1).tolist())

[[2, 13, 0, 0]]
[[0.9999999999993086, 6.914400106935422e-13]]


# 3.2) Unit decision boundaries
Let's characterize the _decision boundaries_ in $x$-space, corresponding to the four hidden units. These are the sets of inputs where the pre-activation values $z_1^1, z_2^1, z_3^1, z_4^1$ of the units are exactly zero.

Hint: You should draw a diagram of the decision boundaries for each unit in the $x$-space and label the sides of the boundaries with 0 and + to indicate whether the unit's output would be exactly 0 or positive, respectively. (The diagram should be a 2D plot with $x_1$ and $x_2$ on each axis, with lines for $z_1^1 = 0, z_2^1 = 0, z_3^1 = 0, z_4^1 = 0$.)

## 3.2.A)
What is the shape of the decision boundary for a single unit?

Find link to graph [here](https://www.desmos.com/calculator/kdildkysvt).

## 3.2.B)
Enter a 2 x 4 matrix where each column represents a (different) input vector $[x_1, x_2]^T$ each of which is on the decision boundary for the first unit, that is, for which $z_1^1 = 0$ (There are multiple possible answers.)

Any points on the rightmost vertical line in the graph linked above, which represents $z_1^1 = 0$ (that is, $x_1 = 1$).
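As a quick check, with the parameters given above the first unit computes $z_1^1 = x_1 - 1$, so any column with $x_1 = 1$ lies on its boundary. The four $x_2$ values below are arbitrary choices for illustration, and the variable names are assumptions.

```python
import numpy as np

# Weights and bias of the first hidden unit, taken from the parameter setting above.
w_unit1 = np.array([[1.0], [0.0]])  # (w_{1,1}^1, w_{2,1}^1)
w0_unit1 = -1.0                     # w_{0,1}^1

# One candidate 2 x 4 answer: every column has x_1 = 1; the x_2 values are arbitrary.
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [-2.0, 0.0, 1.0, 5.0]])

z = w_unit1.T @ X + w0_unit1  # pre-activation of unit 1 for each column
print(z)
```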

## 3.2.C)
Consider the following input vectors: $x^{(1)} = [0.5, 0.5]^T, x^{(2)} = [0, 2]^T, x^{(3)} = [-3, 0.5]^T$. Enter a matrix where each column represents the outputs of the hidden units $(f(z_1^1), \cdots, f(z_4^1))$ for each of the input vectors. You can use your diagram of decision boundaries.

You can read these off the diagram of decision boundaries.
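The same values can be computed directly by stacking the three inputs as columns and applying the first layer to all of them at once. This is a sketch using the layer-1 parameters given above; the names `W1`, `b1`, `X`, `Z1`, and `A1` are chosen for illustration.

```python
import numpy as np

# Layer-1 parameters from the problem statement.
W1 = np.array([[1, 0, -1, 0],
               [0, 1, 0, -1]])
b1 = np.array([[-1, -1, -1, -1]]).T

# Columns are x^(1) = [0.5, 0.5], x^(2) = [0, 2], x^(3) = [-3, 0.5].
X = np.array([[0.5, 0.0, -3.0],
              [0.5, 2.0, 0.5]])

Z1 = W1.T @ X + b1      # 4 x 3; the bias column broadcasts across the three inputs
A1 = np.maximum(Z1, 0)  # ReLU outputs of the hidden units, one column per input
print(A1)
```

Only the second unit is active for $x^{(2)}$ and only the third for $x^{(3)}$, while $x^{(1)}$ lies inside the square where every hidden unit outputs 0.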


# 3.3) Network outputs
In our network above, the output layer with two softmax units is used to classify into one of two classes. For class 1, the first unit's output should be larger than the other unit's output, and for class 2, the second unit's output should be larger. This generalizes nicely to $k$ classes by using $k$ output units.

(We have previously examined addressing two-class classification problems using a single output unit with a sigmoid activation; this is another way to address them.)

Let's characterize the region in $x$-space where this network's output indicates the first class (that is, $a_1^2$ is larger) or indicates the second class (that is, $a_2^2$ is larger). Your diagram from the previous part will be useful here.

What is the output value of the neural network in each of the following cases? Write your answers for $a_i^2$ as expressions; you can use powers of $e$, for example, `e**2 + 1`, and the exponents can be negative, as in `e**(-2) + 1`.

Case 1) For $f(z_1^1) + f(z_2^1) + f(z_3^1) + f(z_4^1) = 0$
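For Case 1, ReLU outputs are never negative, so a sum of zero means every hidden output is zero. The output-layer pre-activations then reduce to the layer-2 biases, $z^2 = [0, 2]^T$. A small sketch of the resulting softmax, with variable names chosen for illustration:

```python
import math

# With all hidden outputs equal to 0, z^2 is just the layer-2 bias vector.
z = [0.0, 2.0]

denom = sum(math.exp(v) for v in z)  # e**0 + e**2 = 1 + e**2
a1 = math.exp(z[0]) / denom          # 1 / (1 + e**2)
a2 = math.exp(z[1]) / denom          # e**2 / (1 + e**2)
print(a1, a2)
```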

In [None]:
import numpy as np

# Solve the linear system (I - g*T) v = r for v.
T  = np.matrix([[0.0 , 0.1 , 0.9 , 0.0],
                [0.9 , 0.1 , 0.0 , 0.0],
                [0.0 , 0.0 , 0.1 , 0.9],
                [0.9 , 0.0 , 0.0 , 0.1]])
g = 0.9
r = np.matrix([0, 1., 0., 2.]).reshape(4, 1)

print(np.linalg.solve(np.eye(4) - g * T, r))

[[6.05288295]
 [6.48663207]
 [6.7519581 ]
 [7.58553317]]
