# Homework 24: Neural Networks

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sympy as sy
import utils as utils
from fractions import Fraction
from sklearn.metrics import pairwise_distances

from IPython.display import display, HTML

# Inline plotting
%matplotlib inline

# Make sympy print pretty math expressions
sy.init_printing()

utils.load_custom_styles()

---
## Exercise 6.1


<img src="figures/homework-24/ex6.1.png" width="800" />








To simplify the proof, we ignore the bias terms, use the identity function as the activation function for all neurons, and use only a single neuron in the output layer.

Suppose we have a two-layer network with a single output. The output of this network is:
$$
o_{2L} = \mathbf{w}^T \mathbf{x}
$$

Now, suppose we have a three-layer network. The output of this network can be computed in steps.

$$
o_{3L} = \mathbf{w}^{(2)T} \mathbf{h}
$$

The hidden-layer output $\mathbf{h}$ is computed as follows:

$$
\mathbf{h} = \mathbf{W}^{(1)T} \mathbf{x}
$$

Combining the two expressions, we get:

$$
o_{3L} = \mathbf{w}^{(2)T} \mathbf{W}^{(1)T} \mathbf{x}
$$

The expression above can be simplified:
$$
o_{3L} = \mathbf{w}^T_{c} \mathbf{x}
$$
where $\mathbf{w}_{c} = \mathbf{W}^{(1)} \mathbf{w}^{(2)}$, so that $\mathbf{w}_{c}^T = \mathbf{w}^{(2)T} \mathbf{W}^{(1)T}$.

This shows that the three-layer network and the two-layer network are equivalent when the activation functions of all neurons are linear.

A two-layer network using only linear activation functions cannot define a nonlinear decision function, because the output of such a network is a linear combination of the entries of $\mathbf{x}$, i.e., the network can only learn a linear decision function. Since we showed that a three-layer network with linear activation functions is equivalent to a two-layer network, the three-layer network cannot learn a nonlinear decision function despite its extra layer.
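The collapse of the two linear layers into one can be checked numerically. The sketch below is only an illustration: the sizes `D` and `L`, the random seed, and the randomly drawn weights are assumptions, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 4, 3  # illustrative input and hidden sizes (assumed)

W1 = rng.normal(size=(D, L))  # input-to-hidden weights, W^(1)
w2 = rng.normal(size=L)       # hidden-to-output weights, w^(2)
x = rng.normal(size=D)        # an arbitrary input vector

# Three-layer network with identity activations and no biases
h = W1.T @ x
o_3l = w2 @ h

# Equivalent two-layer network with the combined weight vector w_c = W^(1) w^(2)
w_c = W1 @ w2
o_2l = w_c @ x

print(np.isclose(o_3l, o_2l))
```

Both outputs agree up to floating-point error for any choice of weights and input, which is what the derivation predicts.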

---
## Exercise 6.2


<img src="figures/homework-24/ex6.2.0.png" width="800" />








---
### Exercise 6.2.1


<img src="figures/homework-24/ex6.2.1.png" width="800" />








The neural network has two weight matrices:
- $\mathbf{W}^{(1)} \in \mathbb{R}^{D \times L}$, between the input layer and the hidden layer
- $\mathbf{W}^{(2)} \in \mathbb{R}^{L \times K}$, between the hidden layer and the output layer

The number of weights is: $D\cdot L + L\cdot K$
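As a quick sanity check, the count can be reproduced by building arrays with the two shapes and summing their sizes. The concrete values of `D`, `L`, and `K` below are assumed purely for illustration.

```python
import numpy as np

D, L, K = 5, 3, 2  # illustrative layer sizes (assumed)

W1 = np.zeros((D, L))  # input-to-hidden weights
W2 = np.zeros((L, K))  # hidden-to-output weights

n_weights = W1.size + W2.size
print(n_weights)  # 5*3 + 3*2 = 21
```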

---
### Exercise 6.2.2


<img src="figures/homework-24/ex6.2.2.png" width="800" />







The hidden layer has $L$ neurons. Each neuron in the hidden layer requires $D$ multiplications and $D-1$ additions.

The output layer has $K$ neurons. Each neuron in the output layer requires $L$ multiplications and $L-1$ additions.


Therefore, we need:
- $L \cdot  D    + K \cdot L$ multiplications
- $L \cdot (D-1) + K \cdot (L-1)$ additions
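The two counts can be wrapped in a small helper. This is a minimal sketch: the function name `count_forward_ops` is hypothetical, and, as in the derivation above, it assumes no bias terms and ignores the cost of the activation functions.

```python
def count_forward_ops(D, L, K):
    """Return (multiplications, additions) for one forward pass.

    Assumes a D-L-K network without bias terms; activation costs are ignored.
    """
    multiplications = L * D + K * L
    additions = L * (D - 1) + K * (L - 1)
    return multiplications, additions


# Illustrative sizes (assumed): D=5 inputs, L=3 hidden neurons, K=2 outputs
mults, adds = count_forward_ops(5, 3, 2)
print(mults, adds)  # 21 16
```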


---
## Exercise 6.3


<img src="figures/homework-24/ex6.3.png" width="800" />






Equation (6.2), referenced in the question, is:

<img src="figures/homework-24/eq-6.2.png" width="800" />



If all entries of the weight vector have the same value $C$, i.e., $\mathbf{W}_k^{(1)}=[C, C, \cdots, C]^T$, then we have:

$$
h_k =  f \left(  \sum_{i=1}^{D} C x_i  \right) =  f \left( C \sum_{i=1}^{D}  x_i  \right )
$$

This means that $h_k$ takes the same value for every $k = 1, \cdots, K$, i.e., all neurons in the hidden layer produce the same output. This is equivalent to having a hidden layer with a single neuron, which acts as a bottleneck in the network: the output of the hidden layer is a vector whose entries are all identical. Essentially, this hidden layer reduces the input $\mathbf{x} \in \mathbb{R}^{D}$ to a single scalar $h_k \in \mathbb{R}$ that depends only on $\sum_{i=1}^{D} x_i$.
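The effect is easy to reproduce numerically. In the sketch below, the sizes, the constant `C`, and the choice of `tanh` as the activation $f$ are assumptions made for illustration; the same behaviour holds for any activation function.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 3  # illustrative input size and number of hidden neurons (assumed)
C = 0.5      # the shared weight value (assumed)

W1 = np.full((D, K), C)  # every weight equals C
x = rng.normal(size=D)   # an arbitrary input vector

# tanh stands in for the activation function f
h = np.tanh(W1.T @ x)

# Every hidden neuron outputs f(C * sum(x))
print(np.allclose(h, np.tanh(C * x.sum())))
```

Every entry of `h` is the same number, so the hidden layer carries no more information than a single neuron would.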

---
## Exercise 6.4


<img src="figures/homework-24/ex6.4.png" width="800" />





<img src="figures/homework-24/3-layer-nn-training-algo.png" width="800" />




























---
## Exercise 6.5


<img src="figures/homework-24/ex6.5.png" width="800" />








<img src="figures/homework-24/rbf-training.png" width="800" />














---
## Exercise 6.6


<img src="figures/homework-24/ex6.6.png" width="800" />






