# Deep Learning Theoretical Aspects

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import sklearn
%matplotlib inline


Much of the power of neural networks comes from the nonlinearity that is inherent in activation functions.  
Show that a network of N layers that uses a linear activation function can be reduced to a network with just an input layer and an output layer.

(Write down the output of two layers, then use induction to extend the claim to all layers.)


In [None]:
# Write your answer here

#### Answer:

Assume every activation function is linear, so each layer computes $f(x) = W x + b$, where $W$ is the weight matrix, $x$ is the input to the layer, and $b$ is the bias. The result of this function is the input to the next layer.

With two layers we get:

$x_1 = W_0 x_0 + b_0$

$x_2 = W_1 x_1 + b_1 = W_1 W_0 x_0 + W_1 b_0 + b_1$

The second expression is again a linear function of the input $x_0$, with weights $W_1 W_0$ and bias $W_1 b_0 + b_1$. So we do not need the intermediate layer to compute $x_2$; we only need to adjust the weights and bias.

By induction, suppose the first $n$ layers collapse to $x_n = W' x_0 + b'$. Then $x_{n+1} = W_n x_n + b_n = (W_n W') x_0 + (W_n b' + b_n)$, which is again a single linear layer. Hence a network of $N$ layers with linear activations is equivalent to one with just an input layer and an output layer, and stacking such layers adds no expressive power.
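The collapse argument can also be checked numerically. The following is a minimal sketch, assuming purely linear layers of the form $x_{k+1} = W_k x_k + b_k$; the layer sizes, the random seed, and the helper names `forward_layer_by_layer` and `collapse` are illustrative choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs -> 5 -> 3 -> 2 outputs.
sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]


def forward_layer_by_layer(x):
    """Apply every linear layer in turn: x_{k+1} = W_k x_k + b_k."""
    for W, b in zip(weights, biases):
        x = W @ x + b
    return x


def collapse(weights, biases):
    """Fold all layers into one (W, b) pair by repeating the two-layer rule."""
    n_in = weights[0].shape[1]
    W_total = np.eye(n_in)      # identity map: "zero layers applied so far"
    b_total = np.zeros(n_in)
    for W, b in zip(weights, biases):
        W_total = W @ W_total           # new weights: W_k W'
        b_total = W @ b_total + b       # new bias:    W_k b' + b_k
    return W_total, b_total


W_eq, b_eq = collapse(weights, biases)
x0 = rng.normal(size=sizes[0])

# Both routes should agree up to floating-point error.
print(np.allclose(forward_layer_by_layer(x0), W_eq @ x0 + b_eq))
```

The loop in `collapse` is the induction step written as code: each iteration absorbs one more layer into a single weight matrix and bias.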



### Derivatives of Activation Functions (30 points)
Compute the derivative of these activation functions:

1 Sigmoid
<img src="https://cdn-images-1.medium.com/max/1200/1*Vo7UFksa_8Ne5HcfEzHNWQ.png" width="150">



$$f'(t) = \frac{d}{dt}(1+e^{-t})^{-1} = -(1+e^{-t})^{-2}\frac{d}{dt}(1+e^{-t}) = -(1+e^{-t})^{-2} \cdot (-e^{-t}) = \frac{e^{-t}}{(1+e^{-t})^2} = f(t)(1 - f(t))$$
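A quick way to sanity-check a hand-derived derivative is to compare it with a central finite difference. The sketch below does this for the sigmoid; the step size `h` and the test points are arbitrary choices, and the compact form $f(t)(1 - f(t))$ is the standard identity equivalent to the closed form above.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_prime(t):
    """Closed form: e^{-t} / (1 + e^{-t})^2."""
    return math.exp(-t) / (1.0 + math.exp(-t)) ** 2


def central_difference(f, t, h=1e-5):
    """Numerical derivative estimate: (f(t + h) - f(t - h)) / (2h)."""
    return (f(t + h) - f(t - h)) / (2 * h)


for t in (-2.0, 0.0, 1.5):
    analytic = sigmoid_prime(t)
    numeric = central_difference(sigmoid, t)
    compact = sigmoid(t) * (1.0 - sigmoid(t))
    # The three columns should agree to several decimal places.
    print(t, analytic, numeric, compact)
```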

2 Relu 

<img src="https://cloud.githubusercontent.com/assets/14886380/22743194/73ca0834-ee54-11e6-903f-a7efd247406b.png" width="200">

$$f'(x) = \begin{cases} 0, &\text{ if } x<0 \\ 1, &\text{ if } x>0 \end{cases}$$

And undefined for $x=0$

3 Softmax
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/e348290cf48ddbb6e9a6ef4e39363568b67c09d3" width="250">

$$\sigma(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$



$$\frac{\partial \sigma(z_j)}{\partial z_j} = \frac{\frac{\partial}{\partial z_j} e^{z_j} \cdot \sum_{k=1}^{K} e^{z_k} - e^{z_j} \cdot \frac{\partial}{\partial z_j} \sum_{k=1}^{K} e^{z_k}}{(\sum_{k=1}^{K} e^{z_k})^2}=
\frac{e^{z_j} \cdot \sum_{k=1}^{K} e^{z_k} - e^{z_j} \cdot e^{z_j}}{(\sum_{k=1}^{K} e^{z_k})^2}=
\frac{e^{z_j} \cdot (\sum_{k=1}^{K} e^{z_k} - e^{z_j})}{(\sum_{k=1}^{K} e^{z_k})^2}$$


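$$\frac{\partial \sigma(z_j)}{\partial z_j} = \sigma(z_j) (1 - \sigma(z_j))$$

For $i \neq j$, only the denominator depends on $z_i$, so

$$\frac{\partial \sigma(z_j)}{\partial z_i} = -\frac{e^{z_j} e^{z_i}}{(\sum_{k=1}^{K} e^{z_k})^2} = -\sigma(z_j)\,\sigma(z_i)$$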
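Because softmax maps a vector to a vector, its full derivative is a Jacobian matrix whose diagonal is the expression above. The sketch below builds that matrix from the standard formula $J_{ij} = \sigma_i(\delta_{ij} - \sigma_j)$ and compares it with finite differences; the helper names, the test vector `z`, and the step size are illustrative assumptions.

```python
import numpy as np


def softmax(z):
    # Subtracting the max leaves the result unchanged and avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()


def softmax_jacobian(z):
    """J[i, j] = d sigma_i / d z_j = sigma_i * (delta_ij - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)


def numeric_jacobian(f, z, h=1e-6):
    """Central-difference estimate of the Jacobian, one input at a time."""
    J = np.zeros((z.size, z.size))
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = h
        J[:, j] = (f(z + step) - f(z - step)) / (2 * h)
    return J


z = np.array([1.0, 2.0, 0.5])
s = softmax(z)
J = softmax_jacobian(z)

# Diagonal entries match sigma_j * (1 - sigma_j).
print(np.allclose(np.diag(J), s * (1 - s)))
# The whole matrix matches the numerical estimate.
print(np.allclose(J, numeric_jacobian(softmax, z), atol=1e-8))
```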

### Back Propagation (30 points)
Use the chain rule and backprop (also called the generalized delta rule) to compute the partial derivatives for these computations (i.e., dz/dx1, dz/dx2, dz/dx3):

```
z = x1 + 5*x2 - 3*x3^2
```

$$z = x_1 + 5*x_2 - 3*x_3^2$$

$$\frac{\partial z}{\partial x_1} = 1$$
$$\frac{\partial z}{\partial x_2} = 5$$
$$\frac{\partial z}{\partial x_3} = -6x_3$$

```
z = x1*(x2-4) + exp(x3^2) / (5*x4^2)
```

$$z = x_1*(x_2-4) + \frac{e^{x_3^2}}{ 5*x_4^2}$$

$$\frac{\partial z}{\partial x_1} = x_2 - 4$$
$$\frac{\partial z}{\partial x_2} = x_1$$
$$\frac{\partial z}{\partial x_3} = \frac{2x_3e^{x_3^2}}{5x_4^2}$$
$$\frac{\partial z}{\partial x_4} = -\frac{2e^{x_3^2}}{5x_4^3}$$
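The four partial derivatives can be checked against finite differences. This sketch assumes the reading used in the typeset formula, with $5 x_4^2$ in the denominator; the function names, the evaluation point, and the step size are arbitrary illustrative choices.

```python
import math


def z(x1, x2, x3, x4):
    return x1 * (x2 - 4.0) + math.exp(x3 ** 2) / (5.0 * x4 ** 2)


def analytic_grad(x1, x2, x3, x4):
    """Partial derivatives of z with respect to x1..x4, derived by hand."""
    e = math.exp(x3 ** 2)
    return (
        x2 - 4.0,
        x1,
        2.0 * x3 * e / (5.0 * x4 ** 2),
        -2.0 * e / (5.0 * x4 ** 3),
    )


def numeric_grad(f, point, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    grads = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grads.append((f(*up) - f(*down)) / (2 * h))
    return tuple(grads)


point = (1.5, 2.0, 0.5, 1.2)
for a, n in zip(analytic_grad(*point), numeric_grad(z, point)):
    # Each pair should agree to several decimal places.
    print(a, n)
```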

```
z = 1/x3 + exp( (x1+5*(x2+3)) ^2 )
```

$$z = \frac{1}{x_3} + e^{(x_1+5*(x_2+3)) ^2}$$

$$\frac{\partial z}{\partial x_1} = 2(x_1 + 5(x_2+3))e^{(x_1+5(x_2+3))^2}$$
$$\frac{\partial z}{\partial x_2} = 10(x_1 + 5(x_2+3))e^{(x_1+5(x_2+3))^2}$$
$$\frac{\partial z}{\partial x_3} = -\frac{1}{x_3^2}$$
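For the last expression, the chain rule can be written out as an explicit forward pass followed by a backward pass, which is how backprop organizes the computation. This is a minimal sketch: the intermediate node names (`a`, `u`, `v`, `e`, `r`) and the evaluation point are illustrative assumptions.

```python
import math


def forward_backward(x1, x2, x3):
    # Forward pass: name every intermediate node of the computation graph.
    a = x2 + 3.0
    u = x1 + 5.0 * a
    v = u ** 2
    e = math.exp(v)
    r = 1.0 / x3
    z = r + e

    # Backward pass: push dz/d(node) from the output back to the inputs.
    dz_de = 1.0                      # z = r + e
    dz_dr = 1.0
    dz_dv = dz_de * e                # e = exp(v)
    dz_du = dz_dv * 2.0 * u          # v = u^2
    dz_dx1 = dz_du * 1.0             # u = x1 + 5a
    dz_da = dz_du * 5.0
    dz_dx2 = dz_da * 1.0             # a = x2 + 3
    dz_dx3 = dz_dr * (-1.0 / x3 ** 2)  # r = 1 / x3
    return z, (dz_dx1, dz_dx2, dz_dx3)


def z_only(x1, x2, x3):
    return forward_backward(x1, x2, x3)[0]


point = [0.1, -3.0, 2.0]
_, grads = forward_backward(*point)

h = 1e-6
for i, g in enumerate(grads):
    up, down = list(point), list(point)
    up[i] += h
    down[i] -= h
    numeric = (z_only(*up) - z_only(*down)) / (2 * h)
    # Backprop value next to its finite-difference estimate.
    print(g, numeric)
```

Note how the factor of 5 enters only on the path through `a`, so the gradient for `x2` is five times the gradient for `x1`.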

### Puppy or bagel? (20 points)
We've seen in class the (hopefully) funny examples of challenging images (Chihuahua or muffin, puppy or bagel etc.). 

Let's say you were asked by someone to find more examples like that. You are able to call the 3 neural networks that won the recent ImageNet challenges, and get their predictions (the entire vector of probabilities for the 1000 classes).  

Describe methods that might assist you in finding more examples.

In [None]:
# Write your answer here

### Convolution (20 points)
Consider the following convolution filters:
```python
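import numpy as np

# 3x3 filters written as NumPy arrays so they can be applied to an image.
k1 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
k2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
k3 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
k4 = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]]) / 9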
```

Can you guess what each of them computes?
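One way to test a guess is to apply each filter to a tiny image and look at the result. The sketch below is a hedged illustration: `correlate2d_same` is a hypothetical helper that slides the kernel over a zero-padded image without flipping it (cross-correlation, which is what CNN layers typically compute), and the 6x6 test image is an arbitrary choice. A true convolution flips the kernel first, which only changes the outcome for asymmetric filters such as `k2`.

```python
import numpy as np

k1 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
k2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
k3 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
k4 = np.ones((3, 3)) / 9


def correlate2d_same(image, kernel):
    """Slide `kernel` over a zero-padded `image`; output has the image's shape."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Weighted sum of the 3x3 neighbourhood centred on pixel (i, j).
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


# Hypothetical test image: a bright 2x2 square on a dark background.
image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0

for name, kernel in [("k1", k1), ("k2", k2), ("k3", k3), ("k4", k4)]:
    print(name)
    print(np.round(correlate2d_same(image, kernel), 2))
```

Comparing each printed array with the original `image` makes the effect of every filter visible on a case small enough to verify by hand.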

In [None]:
# Write your answer here