# Deep Learning

## Gradient Descent

$
\begin{eqnarray*}
\beta_{i+1}&\Leftarrow&\beta_{i}-\alpha \frac {\partial{J}} {\partial \beta}\\
\end{eqnarray*}
$

## Sigmoid function

$
\begin{eqnarray*}
a(z)&=&\frac{1}{1+e^{-z}}\\
\frac {d{a}} {d z}&=&a(1-a)
\end{eqnarray*}
$
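
The derivative identity above can be sanity-checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `sigmoid` and `sigmoid_grad`, the sample points, and the step size `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid a(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative da/dz = a * (1 - a)."""
    a = sigmoid(z)
    return a * (1.0 - a)

# Compare the analytic derivative with a central finite difference.
z = np.linspace(-4.0, 4.0, 9)   # arbitrary sample points (assumption)
eps = 1e-6                      # arbitrary step size (assumption)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
max_err = np.max(np.abs(numeric - sigmoid_grad(z)))
print("max abs difference:", max_err)
```

If the identity holds, the reported difference should be close to floating-point noise.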

## Hyperbolic Tangent function

$
\begin{eqnarray*}
a(z)&=&\tanh(z)\\
\frac {d{a}} {d z}&=&1-a^2\\
\end{eqnarray*}
$
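
For reference, the derivative above follows from the exponential form of the hyperbolic tangent and the quotient rule. This short derivation is an added worked step, not part of the original notes:

$
\begin{eqnarray*}
a(z)&=&\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\\
\frac {d{a}} {d z}&=&\frac{(e^{z}+e^{-z})^2-(e^{z}-e^{-z})^2}{(e^{z}+e^{-z})^2}\\
&=&1-\left(\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}\right)^2\\
&=&1-a^2
\end{eqnarray*}
$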

******

# Q1. Gradient Descent
Apply simple gradient descent (GD) and show the results of the first 5 iterations. Note that all parameters in this question are scalars.
Let

$
\begin{eqnarray*}
J&=& \frac{1}{2}(y_p-3)^2 \\
y_p &=& 2w - 1\\
\end{eqnarray*}
$

## Q1.1
Write down the update rule for $w$ that minimizes the value of $J$ using GD. \[5\]

### ms1.1
$
\begin{eqnarray*}
w_{i+1} &\Leftarrow& w_{i} - \alpha \frac {d{J}} {d w}\\
w_{i+1} &\Leftarrow& w_{i} - \alpha \frac {d{J}} {d y_p} \frac {d{y_p}} {d w}\\
w_{i+1} &\Leftarrow&w_{i} - \alpha (y_p-3)(2)\\
\end{eqnarray*}
$


## Q1.2
Use GD to update $w$ for 5 iterations. Let $\alpha=0.1$ and $w_{i=0}=-1.0$. \[5\]


| $i$ | $w_i$  | $w_{i+1}$  |
| --- |:------:| ----------:|
| 0   | -1.0   |..........       |
| 1   |  ..... |..........       |
| 2   |  ..... |..........       |
| 3   |  ..... |..........       |
| 4   |  ..... |..........       |


In [1]:
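#ms1.2
alpha=0.1   # learning rate
wi=-1.0     # initial weight w_0
for i in range(5):
    yp=2.0*wi-1.0                  # prediction y_p = 2w - 1
    wf=wi-alpha*(yp-3.0)*(2.0)     # GD update: w - alpha*(y_p-3)*2
    print("%d %5.2f %5.2f "%(i,wi,wf))
    wi=wf                          # carry the new weight into the next iteration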

0 -1.00  0.20 
1  0.20  0.92 
2  0.92  1.35 
3  1.35  1.61 
4  1.61  1.77 
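
Substituting $y_p = 2w - 1$ into the update gives $w_{i+1} = w_i - 4\alpha(w_i - 2)$, so the distance to the minimizer $w = 2$ shrinks by a factor of $(1 - 4\alpha)$ at every step. The sketch below checks this against the iterated update; it is an illustration only, and the helper names `gd_step` and `closed_form` as well as the choice of 50 iterations are assumptions.

```python
alpha = 0.1   # learning rate from Q1.2
w0 = -1.0     # initial weight from Q1.2

def gd_step(w, alpha):
    """One GD update for J = 0.5 * (y_p - 3)^2 with y_p = 2w - 1."""
    yp = 2.0 * w - 1.0
    return w - alpha * (yp - 3.0) * 2.0

def closed_form(i, w0, alpha):
    """w_i predicted by w_i - 2 = (1 - 4*alpha)^i * (w0 - 2)."""
    return 2.0 + (1.0 - 4.0 * alpha) ** i * (w0 - 2.0)

w = w0
history = [w]
for i in range(50):          # 50 iterations is an arbitrary choice
    w = gd_step(w, alpha)
    history.append(w)

# Largest gap between the iterated values and the closed-form prediction.
max_gap = max(abs(h - closed_form(i, w0, alpha)) for i, h in enumerate(history))
print("w after 50 steps: %.6f  max gap: %.2e" % (w, max_gap))
```

With $\alpha = 0.1$ the contraction factor is $0.6$, which matches the table: the gap to $w = 2$ goes from $-3$ to $-1.8$ after the first step.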


# Q2. Single Layer Perceptron

Consider a simple single-layer perceptron that uses GD to update $w_0$ and $b_0$ so as to reduce $J$, where $\ast$ is element-wise multiplication:
\begin{eqnarray}
    g(h)&=&\frac{1}{1+e^{-h}}\nonumber\\
       J&=&\frac{1}{2}(y_p - y_t)^2\nonumber\\
       y_p&=&g(x \ast w_0 + b_0)\nonumber\\
\end{eqnarray}

## Q2.1
Write the update equations to adjust $w_0$. \[5\]


## Q2.2
Write the update equations to adjust $b_0$. \[5\]

### ms2.1
$
\begin{eqnarray*}
w_{i+1} &\Leftarrow& w_{i} - \alpha \frac {d{J}} {d w}\\
w_{i+1} &\Leftarrow& w_{i} - \alpha \frac {d{J}} {d y_p} \frac {d{y_p}} {d w}\\
w_{i+1} &\Leftarrow&w_{i} - \alpha (y_p-y_t)y_p(1-y_p)x\\
\end{eqnarray*}
$

### ms2.2
$
\begin{eqnarray*}
b_{i+1} &\Leftarrow& b_{i} - \alpha \frac {d{J}} {d b}\\
b_{i+1} &\Leftarrow& b_{i} - \alpha \frac {d{J}} {d y_p} \frac {d{y_p}} {d b}\\
b_{i+1} &\Leftarrow& b_{i} - \alpha (y_p-y_t)y_p(1-y_p)\\
\end{eqnarray*}
$
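
The two gradients used in these updates can be checked against finite differences. This is a minimal sketch under stated assumptions: all quantities are scalars, the helper names `loss` and `grads` are hypothetical, and the sample values and step size `eps` are arbitrary.

```python
import numpy as np

def g(h):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-h))

def loss(w, b, x, yt):
    """J = 0.5 * (y_p - y_t)^2 with y_p = g(x*w + b), all scalars."""
    return 0.5 * (g(x * w + b) - yt) ** 2

def grads(w, b, x, yt):
    """Analytic dJ/dw and dJ/db from the chain rule."""
    yp = g(x * w + b)
    delta = (yp - yt) * yp * (1.0 - yp)   # dJ/dy_p * dy_p/dh
    return delta * x, delta

# Arbitrary sample values (assumption); any scalars would do.
w, b, x, yt = -1.0, 0.0, 2.0, 0.5
eps = 1e-6
dw_num = (loss(w + eps, b, x, yt) - loss(w - eps, b, x, yt)) / (2 * eps)
db_num = (loss(w, b + eps, x, yt) - loss(w, b - eps, x, yt)) / (2 * eps)
dw, db = grads(w, b, x, yt)
print("dJ/dw analytic %.6f numeric %.6f" % (dw, dw_num))
print("dJ/db analytic %.6f numeric %.6f" % (db, db_num))
```

The only difference between the two gradients is the final factor: $\partial h / \partial w = x$ versus $\partial h / \partial b = 1$.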

In [2]:
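#ms2
import numpy as np
def g(h):
    # sigmoid activation
    return 1.0/(1+np.exp(-h))
alpha=0.1   # learning rate
wi=-1.0     # initial weight
bi=0.0      # initial bias
x=2.0       # input
yt=0.5      # target output
for i in range(200):
    yp=g(x*wi+bi)                        # forward pass
    wf=wi-alpha*(yp-yt)*yp*(1.0-yp)*x    # weight update (Q2.1)
    bf=bi-alpha*(yp-yt)*yp*(1.0-yp)      # bias update (Q2.2)
    if i%20==0:                          # report every 20th iteration
        print("%3d  wi:%5.2f  wf:%5.2f  bi:%5.2f  bf:%5.2f  yp:%4.2f  yt:%4.2f"%(i,wi,wf,bi,bf,yp,yt))
    wi=wf
    bi=bf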

  0  wi:-1.00  wf:-0.99  bi: 0.00  bf: 0.00  yp:0.12  yt:0.50
 20  wi:-0.83  wf:-0.82  bi: 0.09  bf: 0.09  yp:0.17  yt:0.50
 40  wi:-0.64  wf:-0.63  bi: 0.18  bf: 0.19  yp:0.25  yt:0.50
 60  wi:-0.47  wf:-0.46  bi: 0.27  bf: 0.27  yp:0.34  yt:0.50
 80  wi:-0.35  wf:-0.34  bi: 0.33  bf: 0.33  yp:0.41  yt:0.50
100  wi:-0.28  wf:-0.28  bi: 0.36  bf: 0.36  yp:0.45  yt:0.50
120  wi:-0.24  wf:-0.24  bi: 0.38  bf: 0.38  yp:0.47  yt:0.50
140  wi:-0.22  wf:-0.22  bi: 0.39  bf: 0.39  yp:0.49  yt:0.50
160  wi:-0.21  wf:-0.21  bi: 0.39  bf: 0.39  yp:0.49  yt:0.50
180  wi:-0.21  wf:-0.21  bi: 0.40  bf: 0.40  yp:0.50  yt:0.50


# Q3. Multiple Layer Perceptron

For a two-layer perceptron network, we can use GD to minimize the loss as follows:

$
\begin{eqnarray*}
    g(z)&=&\frac{1}{1+e^{-z}}\nonumber\\
    J&=&\frac{1}{2}(y_p - y_t)^2\nonumber\\
\end{eqnarray*}
$

The input $x$ and the output $y_p$ are attached to the network as follows:

$
\begin{eqnarray}    
    a_0 &\Leftarrow &x\nonumber\\
    y_p& \Leftarrow &a_2\nonumber\\
\end{eqnarray}
$


The forward pass through the network is:

$
\begin{eqnarray}
z_0&=&a_0.w_0+b_0\nonumber\\
a_1&=&g(z_0)\nonumber\\
z_1&=&a_1.w_1+b_1\nonumber\\
a_2&=&g(z_1)\nonumber\\
\end{eqnarray}
$


## Q3.1 
Explain the calculation used to update $w_0$ in order to minimize $J$. \[5\]

$
\begin{eqnarray}
    \frac{\partial J}{\partial w_0} &=& \frac{\partial J}{\partial y_p} \frac{\partial y_p}{\partial a_2} 
\frac{\partial a_2}{\partial z_1} \frac{\partial z_1}{\partial a_1} \frac{\partial a_1}{\partial z_0} \frac{\partial z_0}{\partial w_0} \nonumber\\
\end{eqnarray}
$
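
The chain of partial derivatives can be evaluated factor by factor. The sketch below does this for a scalar version of the network and compares the product with a finite-difference estimate. It is an illustration under assumptions, not the marking scheme: every variable is taken to be a scalar, the helper names `forward`, `loss`, and `dJ_dw0` are hypothetical, and the numeric values are arbitrary.

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w0, b0, w1, b1):
    """Scalar forward pass; returns a0, a1, a2."""
    a0 = x
    a1 = g(a0 * w0 + b0)
    a2 = g(a1 * w1 + b1)
    return a0, a1, a2

def loss(x, yt, w0, b0, w1, b1):
    """J = 0.5 * (y_p - y_t)^2 with y_p = a2."""
    return 0.5 * (forward(x, w0, b0, w1, b1)[2] - yt) ** 2

def dJ_dw0(x, yt, w0, b0, w1, b1):
    """Product of the six chain-rule factors, in the same order as the equation."""
    a0, a1, a2 = forward(x, w0, b0, w1, b1)
    dJ_dyp = a2 - yt            # dJ/dy_p
    dyp_da2 = 1.0               # y_p is a2 itself
    da2_dz1 = a2 * (1.0 - a2)   # sigmoid derivative at z1
    dz1_da1 = w1
    da1_dz0 = a1 * (1.0 - a1)   # sigmoid derivative at z0
    dz0_dw0 = a0
    return dJ_dyp * dyp_da2 * da2_dz1 * dz1_da1 * da1_dz0 * dz0_dw0

# Arbitrary scalar values (assumption).
x, yt = 2.0, 0.5
w0, b0, w1, b1 = -1.0, 0.0, 0.7, 0.1
alpha, eps = 0.1, 1e-6

grad = dJ_dw0(x, yt, w0, b0, w1, b1)
grad_num = (loss(x, yt, w0 + eps, b0, w1, b1)
            - loss(x, yt, w0 - eps, b0, w1, b1)) / (2 * eps)
w0_new = w0 - alpha * grad      # one GD step on w0
print("analytic %.8f  numeric %.8f  updated w0 %.6f" % (grad, grad_num, w0_new))
```

Once the gradient is available, the update itself is the same rule as in Q1: subtract $\alpha$ times the gradient from the current $w_0$.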

## Q3.2
Let the sizes of the given variables be as shown in bold below. Write the sizes of all the other variables. \[5\]


| variable | size   |
| ------   |:------:|
| $x$     | $\textbf{[3x1]}$   |
| $w_0$   |  $[3x4]$ |
| $b_0$   |  $\textbf{[4]}$ |
| $z_0$   |  $[4x1]$ |
| $a_1$   |  $[4x1]$ |
| $w_1$   |  $[4x2]$ |
| $b_1$   |  $[2]$ |
| $z_1$   |  $[2x1]$ |
| $a_2$   |  $[2x1]$ |
| $y_p$   |  $[2x1]$ |
| $y_t$   |  $\textbf{[2x1]}$ |
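
The sizes in the table can be reproduced with NumPy. This sketch rests on an assumption that is not stated in the question: the product $a.w$ is implemented as `w.T @ a`, with the bias reshaped to a column, so that a `[3x1]` input and a `[3x4]` weight give a `[4x1]` result. The random values are placeholders; only the shapes matter.

```python
import numpy as np

def g(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)     # arbitrary seed (assumption)
x = rng.normal(size=(3, 1))        # given size
w0 = rng.normal(size=(3, 4))
b0 = rng.normal(size=(4,))         # given size
w1 = rng.normal(size=(4, 2))
b1 = rng.normal(size=(2,))

a0 = x
z0 = w0.T @ a0 + b0.reshape(-1, 1)   # (4,3) @ (3,1) + (4,1) -> (4,1)
a1 = g(z0)
z1 = w1.T @ a1 + b1.reshape(-1, 1)   # (2,4) @ (4,1) + (2,1) -> (2,1)
a2 = g(z1)
yp = a2

shapes = {"z0": z0.shape, "a1": a1.shape, "z1": z1.shape,
          "a2": a2.shape, "yp": yp.shape}
for name, shape in shapes.items():
    print(name, shape)
```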



# Q4. Regularization and Training Methods
Describe the objective and process of each of the following regularization and training methods:
1. Regularization penalty in cost function \[2\]
2. Dropout \[2\]
3. Early stopping \[1\]
4. Gradient Descent \[1\]
5. Stochastic Gradient Descent \[2\]
6. Mini-batch \[2\]


### ms4

1. penalize large parameter values in the cost function to **keep the weights and biases** in a **stable range**
2. randomly **deactivate some neurons** during training to reduce training complexity and generate a **generalized model** that does not rely heavily on a specific neural pathway
3. **avoid overfitting** by stopping the training process early when there is **little or no progress**
4. a method that adjusts the internal parameters to **minimize the loss**, updating the parameters **once per epoch** (one pass over the whole training set)
5. **update** the parameters **once per training sample**
6. **update** the parameters **once per mini-batch**
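
The difference between the last three methods is how many samples contribute to each parameter update. The sketch below counts updates per epoch for each variant on a toy problem. It is an illustration under assumptions: the function name `count_updates`, the synthetic data (targets $y = 3x$), the learning rate, and the batch size of 20 are all made up for this example.

```python
import numpy as np

def count_updates(n_samples, batch_size, epochs=1, alpha=0.05, seed=0):
    """Fit y = w*x with squared loss; return (w, number of parameter updates)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n_samples)   # synthetic inputs
    y = 3.0 * x                                  # synthetic targets, true w = 3
    w, updates = 0.0, 0
    for _ in range(epochs):
        order = rng.permutation(n_samples)       # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # dJ/dw averaged over the samples in this batch
            grad = np.mean((w * x[idx] - y[idx]) * x[idx])
            w -= alpha * grad
            updates += 1
    return w, updates

n = 100
_, gd_updates = count_updates(n, batch_size=n)    # gradient descent: whole set
_, sgd_updates = count_updates(n, batch_size=1)   # stochastic: one sample
_, mb_updates = count_updates(n, batch_size=20)   # mini-batch: 20 samples
print(gd_updates, sgd_updates, mb_updates)
```

With 100 samples, one epoch gives 1 update for gradient descent, 100 for stochastic gradient descent, and 5 for mini-batches of 20.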