# Deep Learning

### Zhentao Shi

## Neural networks

* Neural networks are the workhorse of AI
* A type of nonlinear model (with a structure)

![NN](graph/Colored_neural_network.png)

## Layers

* The transition from layer $k-1$ to layer $k$ can be written as

$$
\begin{align*}
z_l^{(k)} & = b_{l0}^{(k-1)} + \sum_{j=1}^{p_{k-1} } w_{lj}^{(k-1)} a_j^{(k-1)} \\ 
a_l^{(k)} & = \sigma ( z_l^{(k)})
\end{align*}
$$

where $a_j^{(0)} = x_j$ is the input and $p_{k-1}$ is the number of nodes in layer $k-1$.

* The latent variable $z_l^{(k)}$ usually takes a linear form
* The *activation function* $\sigma(\cdot)$ is usually a simple nonlinear function
* Popular choices
  * Sigmoid: $1/(1+\exp(-x))$
  * Rectified linear unit (ReLU): $x\cdot 1\{x\geq 0\}$
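
The layer transition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation; the function name `layer_forward` and the sizes and weights in the example are hypothetical choices made for demonstration.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU activation: x * 1{x >= 0}."""
    return np.where(x >= 0, x, 0.0)

def layer_forward(a_prev, W, b, activation):
    """Map layer k-1 activations to layer k: z = b + W a_prev, a = activation(z)."""
    z = b + W @ a_prev
    return activation(z)

# Hypothetical example: 3 inputs feeding 2 nodes in the next layer.
x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.1, 0.2, 0.3],
              [-0.4, 0.5, 0.6]])
b = np.array([0.5, 1.0])

a_sigmoid = layer_forward(x, W, b, sigmoid)
a_relu = layer_forward(x, W, b, relu)
```

Stacking several calls to `layer_forward`, each with its own weights, gives a deeper network.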

## Why Does It Work?

* Animated video by [3Blue1Brown](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

* Feedforward: criterion evaluation
* Back propagation: parameter adjustment

## Optimization

* A feedforward NN with one hidden layer, for demonstration
* Input dimension: $p$
* Hidden nodes: $K$
  
* Criterion: 
$$
\min_{\theta}   \frac{1}{2}\sum_{i=1}^n  Q_i \textrm{ where } Q_i = [y_i - f^{(2)}(X_i) ]^2
$$
where
$$
\begin{align*}
f^{(2)}(X_i) & =  \beta^{(2)} + \sum_{j=1}^K w_{j}^{(2)} \sigma \left( z_j\right) \\
z_j & =\beta_j^{(1)} + \sum_{\ell=1}^p w_{j\ell}^{(1)} x_{i\ell} 
\end{align*}
$$
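
As a rough sketch, the criterion for this one-hidden-layer network can be evaluated as follows. The sigmoid activation, the function names `predict` and `criterion`, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, beta1, W1, beta2, w2):
    """f2(X_i) = beta2 + sum_j w2[j] * sigmoid(z_ij),
    with z_ij = beta1[j] + sum_l W1[j, l] * X[i, l]."""
    Z = beta1 + X @ W1.T            # hidden-layer latent variables, shape (n, K)
    return beta2 + sigmoid(Z) @ w2  # network output, shape (n,)

def criterion(y, X, beta1, W1, beta2, w2):
    """Half the sum of squared errors, (1/2) * sum_i Q_i."""
    resid = y - predict(X, beta1, W1, beta2, w2)
    return 0.5 * np.sum(resid ** 2)

# Hypothetical sizes and simulated data.
rng = np.random.default_rng(0)
n, p, K = 5, 3, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# With zero first-layer parameters, every hidden node outputs sigmoid(0) = 0.5.
beta1, W1 = np.zeros(K), np.zeros((K, p))
beta2, w2 = 0.0, np.ones(K)
value = criterion(y, X, beta1, W1, beta2, w2)
```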



## Gradient method

Taylor expansion

$$
Q(\theta_{t+1}) = Q(\theta_t) +  \nabla^{\top} Q(\theta_t) (\theta_{t+1}-\theta_{t}) + h.o.t.
$$
where
* $\nabla Q(\theta_t)$ is the **gradient**
* $(\theta_{t+1}-\theta_{t})$ is unknown; replace it with a **step** $p_t$ and drop the higher-order terms:
$$
Q(\theta_{t+1}) \approx Q(\theta_t) +  \nabla^{\top} Q(\theta_t) p_t
$$
* Choosing $p_t = - \alpha \cdot \nabla Q(\theta_t)$ gives $\nabla^{\top} Q(\theta_t) p_t = -\alpha \|\nabla Q(\theta_t)\|^2 \leq 0$, which ensures a reduction in $Q$ for a sufficiently small **learning rate** $\alpha > 0$.
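
The update rule can be illustrated on a toy problem. The sketch assumes a simple quadratic criterion with a known minimizer; the target vector, the learning rate, and the number of iterations are arbitrary choices for demonstration.

```python
import numpy as np

target = np.array([1.0, -2.0])  # hypothetical minimizer of the toy criterion

def Q(theta):
    """Toy criterion: Q(theta) = 0.5 * ||theta - target||^2."""
    return 0.5 * np.sum((theta - target) ** 2)

def grad_Q(theta):
    """Gradient of the toy criterion."""
    return theta - target

alpha = 0.1          # learning rate
theta = np.zeros(2)  # starting value
history = [Q(theta)]
for t in range(100):
    p_t = -alpha * grad_Q(theta)  # step along the negative gradient
    theta = theta + p_t
    history.append(Q(theta))
```

In this example each step shrinks the distance to the minimizer by the factor $1-\alpha$, so `history` decreases monotonically.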



## Backpropagation

* Output layer -> hidden layer (recall $Q_i = [y_i - f^{(2)}(X_i)]^2$)
$$
\begin{align*}
\frac{\partial Q_{i}}{\partial\beta^{(2)}} & =-2\left[y_{i}-f^{(2)}\left(X_{i}\right)\right]\\
\frac{\partial Q_{i}}{\partial w_{j}^{(2)}} & =-2\left[y_{i}-f^{(2)}\left(X_{i}\right)\right]\sigma\left(z_{j}\right)
\end{align*}
$$

* Hidden layer -> input layer: by the chain rule, the derivative passes through the output weight $w_j^{(2)}$ and the activation
$$
\begin{align*}
\frac{\partial Q_{i}}{\partial\beta_{j}^{(1)}} & =\frac{\partial Q_{i}}{\partial\beta^{(2)}}\cdot w_{j}^{(2)}\sigma'\left(z_{j}\right)\\
\frac{\partial Q_{i}}{\partial w_{j\ell}^{(1)}} & =\frac{\partial Q_{i}}{\partial\beta^{(2)}}\cdot w_{j}^{(2)}\sigma'\left(z_{j}\right)x_{i\ell}
\end{align*}
$$
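
A minimal sketch of these derivatives for a single observation follows, with a finite-difference check of one entry. It differentiates $Q_i/2$, the per-observation term of the criterion, so the factor from squaring cancels. The sigmoid activation, the function names, and the random parameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_loss(params, x, y):
    """Q_i / 2 for one observation (x, y)."""
    beta1, W1, beta2, w2 = params
    f = beta2 + w2 @ sigmoid(beta1 + W1 @ x)
    return 0.5 * (y - f) ** 2

def backprop(params, x, y):
    """Analytic gradient of Q_i / 2 with respect to all parameters."""
    beta1, W1, beta2, w2 = params
    a = sigmoid(beta1 + W1 @ x)            # hidden activations sigma(z_j)
    f = beta2 + w2 @ a                     # network output
    d_beta2 = -(y - f)                     # output intercept
    d_w2 = d_beta2 * a                     # output weights
    d_beta1 = d_beta2 * w2 * a * (1 - a)   # sigma'(z) = sigma(z) * (1 - sigma(z))
    d_W1 = np.outer(d_beta1, x)            # hidden weights, shape (K, p)
    return d_beta1, d_W1, d_beta2, d_w2

# Hypothetical sizes and random parameter values.
rng = np.random.default_rng(0)
p, K = 3, 4
params = (rng.normal(size=K), rng.normal(size=(K, p)), 0.3, rng.normal(size=K))
x, y = rng.normal(size=p), 1.5
d_beta1, d_W1, d_beta2, d_w2 = backprop(params, x, y)

# Central finite-difference check for the entry W1[0, 1].
eps = 1e-6
W1_up, W1_down = params[1].copy(), params[1].copy()
W1_up[0, 1] += eps
W1_down[0, 1] -= eps
up = (params[0], W1_up, params[2], params[3])
down = (params[0], W1_down, params[2], params[3])
numeric = (half_loss(up, x, y) - half_loss(down, x, y)) / (2 * eps)
```

Comparing `numeric` with `d_W1[0, 1]` is a cheap way to catch mistakes in a hand-derived gradient.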

## Stochastic Gradient Descent

* Large $n$ makes the full gradient costly to evaluate
* Sample a *minibatch*
  * Unbiased gradient estimate, but with large variance
* Learning rate
* Many epochs
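
The minibatch idea can be sketched on a linear regression, which is a network with no hidden layer. The data are simulated, and the batch size, learning rate, and number of epochs are hypothetical values chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 3
theta_true = np.array([1.0, -2.0, 0.5])  # parameters used to simulate the data
X = rng.normal(size=(n, p))
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(p)
alpha, batch_size, n_epochs = 0.05, 32, 5

for epoch in range(n_epochs):
    perm = rng.permutation(n)  # reshuffle the sample at the start of each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        resid = y[idx] - X[idx] @ theta
        grad = -X[idx].T @ resid / len(idx)  # minibatch estimate of the gradient
        theta = theta - alpha * grad
```

One epoch is a full pass over the shuffled sample; each update uses only `batch_size` observations.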



## Regularization

* $L_1$-norm (Lasso)
* $L_2$-norm (ridge)
* Learning rate
* Number of epochs and minibatches
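
As an illustration of the $L_2$ penalty, the sketch below adds a ridge term to a least-squares criterion and runs plain gradient descent. The linear model, the penalty level `lam`, and the simulated data are assumptions for demonstration; the closed-form ridge solution is computed only as a check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)
lam = 10.0  # hypothetical penalty level

def penalized_grad(theta):
    """Gradient of 0.5 * ||y - X theta||^2 + 0.5 * lam * ||theta||^2."""
    return -X.T @ (y - X @ theta) + lam * theta

theta = np.zeros(p)
alpha = 1e-3
for t in range(2000):
    theta = theta - alpha * penalized_grad(theta)

# Closed-form solutions for comparison.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The penalty shrinks the estimate toward zero, so `theta` has a smaller norm than the unpenalized `theta_ols`.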


## Frameworks

* Google's `TensorFlow`
  * Keras: high level, easy to implement
* Meta's `PyTorch`
  * Literal style
  * Easy to use/reuse

## Simulation Example

* Use NN to solve Poisson regression
  * A trivial example for demonstration
  * No hidden layer
  * Keep the essence
  
* See `data_example/nn_Poisson_Keras_HD.ipynb`
* See `data_example/nn_torch.ipynb`
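
For readers without the notebooks at hand, the idea can be sketched in plain NumPy. This is not the content of the notebooks: it assumes an exponential link, simulated data, and full-batch gradient descent on the average Poisson negative log-likelihood, with a hypothetical learning rate and iteration count.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 2
theta_true = np.array([0.5, -0.3])  # parameters used to simulate the data
X = rng.normal(size=(n, p))
y = rng.poisson(np.exp(X @ theta_true))

def neg_loglik(theta):
    """Average Poisson negative log-likelihood, dropping the log(y!) constant."""
    eta = X @ theta
    return np.mean(np.exp(eta) - y * eta)

theta = np.zeros(p)
alpha = 0.1
for t in range(500):
    mu = np.exp(X @ theta)     # output of the "network": exp of a linear index
    grad = X.T @ (mu - y) / n  # gradient of the average negative log-likelihood
    theta = theta - alpha * grad
```

With no hidden layer the network reduces to a generalized linear model, so the estimate can be compared against the parameters that generated the data.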

## Network Structures

* Time series
  * Recurrent NN (RNN)
  * Long short-term memory (LSTM) (See `data_example/nn_LSTM.ipynb`)
* Images
  * Convolutional NN (CNN)


## Theory is Incomplete

* Theoretical understanding is an ongoing endeavor.
* Hornik, Stinchcombe, and White (1989):
  * A single-hidden-layer neural network, given sufficiently many nodes, is a *universal approximator* for any measurable function.
* Deep learning: engineering breakthrough
* Big data available

<!-- ## Reinforcement Learning

* Policy function $d(x_t; \theta)$, where $\theta$ is the parameter
* Response $y_t$, with reward $r( d (x_t; \theta), y_t)$
* Optimal invariant parameter
$$
\theta^* = \arg \max_{\theta} \sum_{t=1}^T r( d(x_t; \theta), y_t)
$$

* Regret: 

$$
 \sum_{t=1}^T [ r( d(x_t; \theta^*), y_t) - r( d(x_t; \theta_t), y_t)] 
$$ -->