## Regression

$J=\displaystyle \sum_{i=1}^n(\vec{p}\cdot{\vec{\tilde{x_i}}}-y_i)^2$

$\displaystyle \vec{p}^* = \arg\min_{\vec{p}} J(\vec{p})$

$\displaystyle \forall j :\frac{\partial J}{\partial p_j}=0, \;\;\quad \frac{\partial J}{\partial p_j} =\sum_{i=1}^n 2(\vec{p}\cdot{\vec{\tilde{x_i}}}-y_i)\,\tilde{x}_{i,j}$

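Setting every partial derivative to zero gives the normal equations, whose solution is $\displaystyle \quad \vec{p}^*= (X^TX)^{-1}X^T\vec{y}$, where the rows of $X$ are the vectors $\vec{\tilde{x_i}}^T$.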
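A minimal NumPy sketch of the closed-form solution, assuming $X$ stacks the augmented inputs $\vec{\tilde{x_i}}$ as rows (here with a leading column of ones for the intercept). The function name `fit_least_squares` and the toy data are illustrative and not part of the notes.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) p = X^T y for p."""
    # Solving the linear system avoids forming the inverse explicitly,
    # which is usually the numerically safer choice.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated from y = 1 + 2x (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # augmented inputs: [1, x]
y = 1.0 + 2.0 * x

p_star = fit_least_squares(X, y)  # intercept and slope
```

On this noiseless data the recovered parameters match the generating intercept and slope up to rounding.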



**Over-determined**:

$\displaystyle \quad \arg\min_{\vec{p}}\left(||A\vec{p}-\vec{b}||_2 +\lambda g(\vec{p})\right) $

**Under-determined**:

$\displaystyle \quad \arg\min_{\vec{p}} g(\vec{p}) \quad$ subject to  $||A\vec{p}-\vec{b}||_2\le \epsilon$

## Loss

$\displaystyle L(\hat{y},y)=\frac{1}{N} \sum_i (\hat{y}_i -y_i)^2 \quad$   MSE

$\displaystyle L(\hat{y},y)=-(y \,\text{log}\hat{y} +(1-y)\,\text{log}(1-\hat{y})) \quad $ Cross Entropy

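$\displaystyle L(\hat{y},y)=-\sum_c y_{o,c}\,\log p_{o,c} \quad $ Cross Entropy Multi class, where $y_{o,c}=1$ if observation $o$ belongs to class $c$ (and 0 otherwise) and $p_{o,c}$ is the predicted probability of class $c$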
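The three losses above can be sketched in NumPy as follows. The function names and the `eps` clipping (added only to avoid `log(0)`) are assumptions of this sketch, not part of the notes.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error over all examples."""
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Cross entropy for a single binary label y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(p, y_onehot, eps=1e-12):
    """Cross entropy for one observation with a one-hot label vector."""
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))
```

For a one-hot label, the multi-class version reduces to the negative log of the probability assigned to the true class.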

## Cost 

$\displaystyle J(W,b) = \frac{1}{m}\sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)})$

## Logistic Regression

We'll start with single-neuron logistic regression using the **sigmoid** activation function

* Loss Cross Entropy
* Maximum Likelihood
* Convex Optimization

### Computation Graph

$\hat{y} = \sigma(\vec{w}^T\vec{x} + b) \rightarrow $ cross entropy $\rightarrow L(\hat{y},y)$
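A minimal sketch of one forward and backward pass through this graph for a single example. It relies on the standard result that, for a sigmoid followed by cross entropy, $\partial L/\partial z = \hat{y}-y$ with $z=\vec{w}^T\vec{x}+b$. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w, b, x, y):
    """Return the loss and its gradients w.r.t. w and b for one example."""
    # Forward pass: linear step, sigmoid, cross-entropy loss.
    z = w @ x + b
    y_hat = sigmoid(z)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Backward pass: dL/dz simplifies to (y_hat - y) for this pairing.
    dz = y_hat - y
    dw = dz * x
    db = dz
    return loss, dw, db

loss, dw, db = forward_backward(np.zeros(2), 0.0, np.array([1.0, 2.0]), 1)
```

With zero weights the prediction is 0.5, so the loss equals $\log 2$ and the gradients point toward increasing $\hat{y}$.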


## Multiple Examples Training set

* **Forward Propagation**: Computing the loss through forward pass for a single training example
* **Backward Propagation**: Computing gradients of parameters through backward pass for a single training example

* **Batch**: The training set can be divided into smaller sets called batches
* **Iteration**: One batch passed both forward and backward (one parameter update)
* **Epoch**: The entire dataset passed both forward and backward through the NN once

## Multiple Outputs 
For multiple outputs, replace the sigmoid with a softmax and one-hot encode the labels.

$\displaystyle \text{softmax}(\hat{y})_i= \frac{e^{\hat{y}_i}}{\sum_j e^{\hat{y}_j}}$
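A short NumPy sketch of the softmax. Subtracting the maximum before exponentiating is an addition of this sketch (it leaves the result unchanged but avoids overflow for large inputs).

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector."""
    shifted = z - np.max(z)      # same result, but exp() cannot overflow
    exp = np.exp(shifted)
    return exp / np.sum(exp)

probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The outputs are positive and sum to 1, so they can be compared directly against a one-hot label with the multi-class cross entropy.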

## Other Loss/Activations
* Changing **activation** will change the gradients

* Changing **loss** will change the gradients and could make the composition non-separable

* Most activations we discussed have analytic gradients. It is possible that there is no analytic expression. 

* Gradients could be evaluated numerically
    - When **analytic** expression is not available
    - When it is **faster** to evaluate them numerically
    
* Use central difference formula 
* Check against analytic gradient for several examples
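The gradient check described above can be sketched as follows, assuming a 1-D parameter vector and a scalar-valued function. The quadratic test function and the relative-error measure are choices made for this illustration.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Central-difference estimate of the gradient of f at w (1-D array)."""
    grad = np.zeros_like(w, dtype=float)
    for j in range(w.size):
        step = np.zeros_like(w, dtype=float)
        step[j] = h
        grad[j] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

# Compare against a known analytic gradient: f(w) = sum(w^2), grad = 2w.
f = lambda w: np.sum(w ** 2)
w = np.array([1.0, -2.0, 3.0])
analytic = 2 * w
numeric = numerical_gradient(f, w)
rel_error = np.linalg.norm(numeric - analytic) / (
    np.linalg.norm(numeric) + np.linalg.norm(analytic)
)
```

A small relative error suggests the analytic gradient is implemented correctly; a large one usually points to a bug in the backward pass.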

<img src="images/2_1.png" width ="500" height="350"/>


## Curriculum Learning 
* Training machine learning models on examples in a particular order: start with easier subtasks and gradually increase the difficulty of the tasks (for example, in an NLP problem, learn words first and then sentences)

* Both the training set and the cost function are updated accordingly

## Stochastic Gradient Descent

**Almost sure convergence**

* Performs an update for each training example $x^{(i)}$ and label $y^{(i)}$

* The values of the loss and parameters will fluctuate
    - (+) may discover better minima
    - (-) keeps overshooting the chosen minimum instead of settling into it

* Learning rate plays a very important role

## Mini-batch GD

$$w_{k+1} =w_k -\alpha \cdot \nabla_wJ(w;x^{(i:i+n)};y^{(i:i+n)})$$

* Mini-batch GD is a hybrid method between GD and SGD. 
* Performs an update for every mini-batch of $n$ training examples.
    - (+) reduces the variance of the parameter updates
    - (+) efficient in computing the gradient w.r.t. a mini-batch
* Typical mini-batch sizes range from 50 to 256
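A minimal sketch of the mini-batch update, assuming $J$ is the mean squared error of a linear model. The function name, the toy data, and the hyper-parameter values are illustrative only.

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.05, batch_size=2, epochs=2000, seed=0):
    """Mini-batch gradient descent on the MSE of a linear model X @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))            # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]  # indices i : i+n
            residual = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ residual / len(idx)
            w = w - alpha * grad                   # w_{k+1} = w_k - alpha * grad
    return w

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
w = minibatch_gd(X, y)
```

Setting `batch_size=1` gives the SGD update and `batch_size=len(y)` gives full-batch GD, which is why mini-batch GD sits between the two.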

**Challenges**
* Choosing a proper learning rate can be difficult

* Smart learning-rate schedules
    - Annealing
    - Reduce the rate when the change in $J$ falls below a threshold
* Different learning rates for different parameters
* Suboptimal local minima and saddle points


**Parameters**
* Model Parameters: W, b, activation, output, cost

* Hyper-parameters: Batch/minibatch size, learning parameters, external parameters

## Methods for choosing learning rate
**1. Learning rate decay**

* $\displaystyle \alpha = \frac{\alpha_0}{1+\text{decr}\cdot\text{epnum}}$

* $\displaystyle \alpha = d^{\text{epnum}}\cdot \alpha_0$

* $\displaystyle \alpha = \frac{d\cdot\alpha_0}{\sqrt{\text{epnum}}}$
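The three schedules can be written as plain functions. This sketch assumes `epnum` is the epoch number, `decr` a decay rate, and `d` a constant; the function names are made up for illustration.

```python
import math

def inverse_time_decay(alpha0, decr, epnum):
    """alpha = alpha0 / (1 + decr * epnum)"""
    return alpha0 / (1 + decr * epnum)

def exponential_decay(alpha0, d, epnum):
    """alpha = d^epnum * alpha0, with 0 < d < 1 so the rate shrinks"""
    return d ** epnum * alpha0

def inverse_sqrt_decay(alpha0, d, epnum):
    """alpha = d * alpha0 / sqrt(epnum); requires epnum >= 1"""
    return d * alpha0 / math.sqrt(epnum)
```

Note that the third schedule is undefined at `epnum = 0`, so epochs would need to be counted from 1 when using it.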

**2. Momentum Method**

$$ v_{k+1} =\gamma v_k+\alpha\cdot\nabla_wJ(w_k)$$
$$w_{k+1} =w_k -v_{k+1}$$
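A minimal sketch of one momentum step, where `grad_fn` is a hypothetical callable returning $\nabla_w J(w)$. The toy objective $J(w)=\tfrac{1}{2}w^2$ (gradient $w$) is chosen only to make the numbers easy to follow.

```python
import numpy as np

def momentum_step(w, v, grad_fn, alpha=0.1, gamma=0.9):
    """One update: v_{k+1} = gamma*v_k + alpha*grad(w_k); w_{k+1} = w_k - v_{k+1}."""
    v_next = gamma * v + alpha * grad_fn(w)
    w_next = w - v_next
    return w_next, v_next

grad_fn = lambda w: w                      # gradient of J(w) = 0.5 * w^2
w, v = np.array([1.0]), np.array([0.0])
w, v = momentum_step(w, v, grad_fn)        # first step
w, v = momentum_step(w, v, grad_fn)        # second step reuses the velocity
```

The second step is larger than the first because the velocity carries over a fraction $\gamma$ of the previous update.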

<div class="verticalhorizontal">
    <img src="images/2_2.png" width ="500" height="150" alt="centered image" />
</div>

**3. Nesterov Accelerated Gradient**

$$v_{k+1}=\gamma v_k +\alpha \cdot\nabla_wJ(w_k-\gamma v_k)$$
$$w_{k+1}=w_k-v_{k+1}$$
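The only change from plain momentum is where the gradient is evaluated. A sketch under the same assumptions (hypothetical `grad_fn`, toy objective $J(w)=\tfrac{1}{2}w^2$):

```python
import numpy as np

def nesterov_step(w, v, grad_fn, alpha=0.1, gamma=0.9):
    """One NAG update: the gradient is taken at the look-ahead point w - gamma*v."""
    lookahead = w - gamma * v
    v_next = gamma * v + alpha * grad_fn(lookahead)
    w_next = w - v_next
    return w_next, v_next

grad_fn = lambda w: w                      # gradient of J(w) = 0.5 * w^2
w, v = np.array([1.0]), np.array([0.0])
w, v = nesterov_step(w, v, grad_fn)
w, v = nesterov_step(w, v, grad_fn)
```

Evaluating the gradient at the look-ahead point lets the update react to where the momentum is about to carry the parameters, rather than to where they currently are.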


**4. Adagrad**
$$w_{k+1,j} =w_{k,j} -\frac{\alpha}{\sqrt{G_{k,jj}+\epsilon}}\cdot g_{k,j}$$

Here $g_{k,j}$ is the gradient of $J$ with respect to $w_j$ at step $k$.

* Adagrad uses a different learning rate for every parameter $w_j$ at every step $k$. $G_k$ is a diagonal matrix whose entry $G_{k,jj}$ is the sum of the squared gradients with respect to $w_j$ up to step $k$.

* Performs smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features.
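A sketch of one Adagrad step. Since $G$ is diagonal, this sketch stores only its diagonal as a vector; the function name and default values are assumptions.

```python
import numpy as np

def adagrad_step(w, G, g, alpha=0.1, eps=1e-8):
    """One Adagrad update; G holds the diagonal of the squared-gradient sum."""
    G_next = G + g ** 2                          # accumulate squared gradients
    w_next = w - alpha / np.sqrt(G_next + eps) * g
    return w_next, G_next

w = np.array([1.0, 1.0])
G = np.zeros(2)
g = np.array([1.0, 10.0])                        # very different gradient scales
w, G = adagrad_step(w, G, g)
```

On the first step both parameters move by about $\alpha$ even though their gradients differ by a factor of 10, which illustrates the per-parameter scaling.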

**5. RMSProp**

$$E[g^2]_k =\gamma E[g^2]_{k-1} + (1-\gamma)g_k^2$$
$$w_{k+1} =w_k -\frac{\eta}{\sqrt{E[g^2]_k+ \epsilon}}g_k$$

* Replaces Adagrad's ever-growing sum of squared gradients with an exponentially decaying running average, so the effective learning rate does not shrink toward zero

* Beneficial for RNNs
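A sketch of one RMSProp step following the two equations above. The function name and the default hyper-parameters are illustrative choices.

```python
import numpy as np

def rmsprop_step(w, Eg2, g, eta=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; Eg2 is the running average of squared gradients."""
    Eg2_next = gamma * Eg2 + (1 - gamma) * g ** 2
    w_next = w - eta / np.sqrt(Eg2_next + eps) * g
    return w_next, Eg2_next

w, Eg2 = np.array([1.0]), np.array([0.0])
w, Eg2 = rmsprop_step(w, Eg2, np.array([2.0]))
```

Because old squared gradients decay by a factor $\gamma$ at every step, the denominator tracks recent gradient magnitudes instead of growing without bound.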

**6. Adadelta**

\begin{align}
E[\Delta w^2]_k &= \gamma E[\Delta w^2]_{k-1} +(1-\gamma) \Delta w_k^2\\
\text{RMS}[\Delta w]_k &=\sqrt{E[\Delta w^2]_k +\epsilon}\\
\Delta w_k    &=-\frac{\text{RMS}[\Delta w]_{k-1}}{\text{RMS}[g]_k}g_k\\
w_{k+1}&=w_k+\Delta w_k
\end{align}

* Generalizes RMSProp/Adagrad by using RMS (running-average) quantities instead of an accumulated sum of squared gradients

* No learning rate parameter
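A sketch of one Adadelta step. The notes do not define $\text{RMS}[g]_k$ in this section, so the sketch assumes $\text{RMS}[g]_k=\sqrt{E[g^2]_k+\epsilon}$ with $E[g^2]_k$ updated as in RMSProp. Names and defaults are illustrative.

```python
import numpy as np

def adadelta_step(w, Eg2, Edw2, g, gamma=0.9, eps=1e-6):
    """One Adadelta update; Eg2 and Edw2 are running averages of g^2 and dw^2."""
    Eg2_next = gamma * Eg2 + (1 - gamma) * g ** 2   # assumed, as in RMSProp
    rms_g = np.sqrt(Eg2_next + eps)                 # RMS[g]_k
    rms_dw_prev = np.sqrt(Edw2 + eps)               # RMS[dw]_{k-1}
    dw = -rms_dw_prev / rms_g * g
    Edw2_next = gamma * Edw2 + (1 - gamma) * dw ** 2
    return w + dw, Eg2_next, Edw2_next

w, Eg2, Edw2 = np.array([1.0]), np.array([0.0]), np.array([0.0])
w, Eg2, Edw2 = adadelta_step(w, Eg2, Edw2, np.array([1.0]))
```

The ratio of the two RMS terms plays the role of the learning rate, which is why no $\alpha$ or $\eta$ appears in the update.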

**7. AdaM - Adaptive Moment Estimation**

\begin{align}
m_k &=\beta_1m_{k-1} + (1-\beta_1)g_k\\
v_k &=\beta_2v_{k-1} +(1-\beta_2)g_k^2\\
w_{k+1}&=w_k -\frac{\eta}{\sqrt{\hat{v}_k}+\epsilon}\hat{m}_k
\end{align}

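* Keeps track of two moments of the gradient: the mean (first moment) and the uncentered variance (second moment)
* Corrects their bias toward zero, which comes from initializing $m_0 = v_0 = 0$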
    - $\displaystyle \hat{m}_k =\frac{m_k}{1-\beta_1^k}$
    - $\displaystyle \hat{v}_k =\frac{v_k}{1-\beta_2^k}$
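A sketch of one AdaM step that combines the moment updates with the bias corrections. The step counter `k` is assumed to start at 1, and the default hyper-parameters are commonly used values rather than values from the notes.

```python
import numpy as np

def adam_step(w, m, v, g, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaM update at step k (k = 1, 2, ...)."""
    m_next = beta1 * m + (1 - beta1) * g         # first moment (mean)
    v_next = beta2 * v + (1 - beta2) * g ** 2    # second moment (uncentered)
    m_hat = m_next / (1 - beta1 ** k)            # bias-corrected estimates
    v_hat = v_next / (1 - beta2 ** k)
    w_next = w - eta / (np.sqrt(v_hat) + eps) * m_hat
    return w_next, m_next, v_next

w, m, v = np.array([1.0]), np.array([0.0]), np.array([0.0])
w, m, v = adam_step(w, m, v, np.array([2.0]), k=1)
```

At `k = 1` the corrections recover the raw gradient and its square, so the first update has magnitude close to `eta` regardless of the gradient's scale.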

**Additional**
- AdaMax: Generalization of AdaM to L-infinity norm
- Nadam: Nesterov AdaM
- AMSGrad: Uses the maximum of past second-moment estimates instead of the exponential average in AdaM

**Notes on Choosing Optimizers**

* RMSProp & AdaDelta: adaptive
* AdaM adaptive + momentum  -> robust
* SGD as a first pass
