# Deep Learning

- Deep learning is a subfield of machine learning that uses multi-layered artificial neural networks to extract progressively higher-level features from the raw input.
- inspired by the structure and function of the brain, namely the interconnecting of many neurons.
- teaches computers to do what comes naturally to humans: learn by example
    - AI - science of making things smart
    - ML - approach to achieve AI, where machines learn from experience
    - DL - subset of ML, where we use neural networks to implement ML

<img alt="picture 5" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/b9bae26b218ef0aa016e0f79b6dd9abd5b937a4b9686036f202d1fee3b9ab394.png" style="display: block; margin-left: auto; margin-right: auto;">


## Fundamentals of Neural Network

### Why Deep Learning?

Limitations of linear models:
- Linearity implies the weaker assumption of monotonicity (an increase in a feature must either always increase or always decrease the model's output)
- Linearity is not always plausible
    - predicting health as a function of body temperature (temp above and below 37 deg C indicates greater risk)
    - classifying images of cats and dogs (it is implausible that increasing the intensity of the pixel at location (13, 17) would always increase, or always decrease, the likelihood that the image depicts a dog)
- We overcome this by using deep neural networks to learn both a representation via hidden layers and a linear predictor that acts upon that representation

### Applications of Deep Learning

| Area | Description |
|------|-------------|
| image recognition | identify and classify objects, patterns, or scenes in images |
| natural language processing | understand, interpret, and generate human language |
| speech recognition | convert spoken language into text or commands |
| autonomous vehicles | enable self-driving capabilities for cars and drones |
| healthcare | assist in disease diagnosis, medical imaging, and drug discovery |
| finance | predict stock prices, fraud detection, and risk assessment |
| recommender systems | personalize recommendations for products or content |
| robotics | enhance perception and decision-making in robotic systems |
| gaming | create realistic simulations and improve in-game AI |
| cybersecurity | detect and prevent security threats and intrusions |
| generative models | generate realistic images, videos, and text |
| virtual assistants | power natural language interfaces and virtual helpers |
| environmental monitoring | analyze data for climate modeling and environmental studies |

## Perceptron and Perceptron learning algorithm

<img alt="picture 4" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/3151c7650e8ee2f625afc8a34a9960a1ff70c1d03d7aebb89e7dd72a4d51bfb3.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

- It is a binary classification algorithm that forms the foundation of neural networks (Frank Rosenblatt in 1957)
- A perceptron takes a vector of real-valued inputs, multiplies each input by a weight, sums the weighted inputs, adds a bias and then finally outputs a 1 if the result is greater than some threshold and outputs a -1 otherwise.
    - $ h = 1 \text{ if } \sum_{i=0}^{d} w_i x_i > 0 \text{ else } -1 $, where $x_0 = 1$ so that $w_0$ plays the role of the bias
- During training, the algorithm adjusts the weights based on misclassifications, attempting to minimize the error
- The perceptron is a single-layer neural network, and its learning algorithm only converges on linearly separable problems; this limitation paved the way for the more complex models of modern deep learning

#### Using perceptron to learn binary functions
- Write function in terms of A, B, C, etc with 1 and 0
- Write function in terms of $x_1, x_2, t$ etc with 1 and -1
- Write the equation of the form $ w_0 x_0 + w_1 x_1 = \hat{y} $ (for NOT) or $ w_0 + w_1 x_1 + w_2 x_2 = \hat{y} $ (for AND, OR, NAND, NOR); XOR is not linearly separable, so no single perceptron can represent it
    - Create different equations for different values of $x_1$ and $x_2$ and solve for $w_0$, $w_1$ and $w_2$
    - The LHS of the equation will be $ > 0 $ for $ \hat{y} = 1 $ and $ \leq 0 $ for $ \hat{y} = -1 $

1. NOT function
    - $w_0 = 1, w_1 = -1$
2. AND function
    - $w_0 = -1, w_1 = 2, w_2 = 2$
3. OR function
    - $w_0 = 2, w_1 = 2, w_2 = 2$
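As a quick sanity check of the weights listed above, the sketch below evaluates a perceptron on every input combination, using the $+1$ / $-1$ encoding. The helper name `perceptron` and the table variables are illustrative choices, not part of the original notes.

```python
def perceptron(weights, inputs):
    """Return 1 if w0 + sum(w_i * x_i) > 0, else -1 (weights[0] is the bias weight)."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if z > 0 else -1

# Weights taken from the list above; inputs and outputs use the +1 / -1 encoding.
NOT_W = [1, -1]
AND_W = [-1, 2, 2]
OR_W = [2, 2, 2]

not_table = {x: perceptron(NOT_W, [x]) for x in (-1, 1)}
and_table = {(a, b): perceptron(AND_W, [a, b]) for a in (-1, 1) for b in (-1, 1)}
or_table = {(a, b): perceptron(OR_W, [a, b]) for a in (-1, 1) for b in (-1, 1)}

print(not_table)
print(and_table)
print(or_table)
```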

Perceptron learning algorithm:
- Initialize weights to $0$ or small random numbers
- For each example $x^{(i)}$:
    - Compute the output value $\hat{y}$
    - Update the weights:
        - $ w_j \leftarrow w_j + \Delta w_j $
        - $ \Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)}) x_j^{(i)} $
        - $ \eta $ is the learning rate (typically a small value like 0.01)

Running through an example:
- Do multiple epochs (iterations) until convergence
    - each epoch is a complete pass through the training data 
        - for every training example, we compute the output and update all the weights

| $x_1$ | $x_2$ | $t$ | $w_0$ | $w_1$ | $w_2$ | $z$ | $\hat{y}$ | $\hat{y} = t$? | $\Delta w$ | New $w$ |
|-------|-------|-----|-------|-------|-------|-----|-----------|----------------|------------|---------|
| ..    | ..    | ..  | ..    | ..    | ..    | ..  | ..        | ..             | ..         | ..      |
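The learning rule above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the AND function in the $+1$ / $-1$ encoding as training data, zero initial weights, and $\eta = 0.1$; the function names `predict` and `train_perceptron` are hypothetical.

```python
def predict(w, x):
    """Perceptron output in {+1, -1}; w[0] is the bias weight (x_0 = 1)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if z > 0 else -1

def train_perceptron(data, eta=0.1, max_epochs=100):
    """Apply w_j <- w_j + eta * (y - y_hat) * x_j until an epoch has no errors."""
    w = [0.0] * (len(data[0][0]) + 1)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != y:  # the update is zero for correct predictions
                errors += 1
                step = eta * (y - y_hat)
                w[0] += step  # bias input x_0 = 1
                for j, xj in enumerate(x, start=1):
                    w[j] += step * xj
        if errors == 0:
            break
    return w, epoch

# AND function with +1 / -1 encoding (assumed example data).
and_data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
weights, epochs_used = train_perceptron(and_data)
print(weights, epochs_used)
```

Each row of the table above corresponds to one pass through the inner loop: compute $z$ and $\hat{y}$, compare with $t$, and apply $\Delta w$ if they differ.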



## Linear Regression

- linear regression can be used to fit a model to an observed dataset of values of the response (dependent variable) and explanatory variables (independent variables / features)
- $x^{(i)}$ is the vector of input variables / features, $x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} _{((n+1) \times 1)}$, where $n$ is the number of features, with $x_0^{(i)} = 1$ being the intercept term. 
- $y^{(i)}$ is the output variable / target.
- $(x^{(i)}, y^{(i)})$ is a training example.
- $\{(x^{(i)}, y^{(i)}) : i = 1 \dotsm m\}$ is the training set, where $m$ is the number of examples in the training set.

Goal : to learn a function $h(x) : \text{space of input values} \rightarrow \text{space of output values}$, so that $h(x)$ is a good predictor for the corresponding value $y$

#### Equations

If we decide to approximate $y$ as a linear function of $x$, then for the $i^{th}$ training example:

$$\hat{y}^{(i)} =  h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}_1 + \theta_2 x^{(i)}_2 + \dotsm + \theta_n x^{(i)}_n = \sum_{j=0}^n \theta_j x^{(i)}_j$$

This is called **simple / univariate** linear regression when $n = 1$, and **multiple** linear regression when $n > 1$. Both are different from **multivariate** regression, which pertains to multiple dependent variables and multiple independent variables. [Link](https://stats.stackexchange.com/q/2358/331716)

Then we can define the cost function as:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$

This is the **ordinary least squares (OLS)** cost function, which minimizes the **mean squared error (MSE)** (up to the constant factor $\frac{1}{2}$).

Goal : to choose $\theta$ so as to minimize $J(\theta)$

#### Vectorized

$$ X = \begin{bmatrix} - \left( x^{(1)} \right)^T - \\ - \left( x^{(2)} \right)^T - \\ \vdots \\ - \left( x^{(m)} \right)^T - \end{bmatrix}_{(m \times (n+1))} , \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}_{((n+1) \times 1)} \qquad and \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} _{(m \times 1)}$$

Then the vector of predictions, 

$$ \hat{y} =  X\theta = \begin{bmatrix} - \left( x^{(1)} \right)^T\theta - \\ - \left( x^{(2)} \right)^T\theta - \\ \vdots \\ - \left( x^{(m)} \right)^T\theta - \end{bmatrix}_{(m \times 1)} $$

We can rewrite the least-squares cost as follows, replacing the explicit sum by matrix multiplication:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T(X\theta - y)$$
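To see that the summed and vectorized forms of the cost agree, here is a small NumPy sketch. The data values and the function names `cost_loop` and `cost_vectorized` are made up for illustration.

```python
import numpy as np

def cost_loop(X, y, theta):
    """J(theta) computed with an explicit sum over training examples."""
    m = len(y)
    total = 0.0
    for i in range(m):
        total += (X[i] @ theta - y[i]) ** 2
    return total / (2 * m)

def cost_vectorized(X, y, theta):
    """J(theta) = (1 / 2m) * (X theta - y)^T (X theta - y)."""
    residual = X @ theta - y
    return (residual @ residual) / (2 * len(y))

# Toy data: the first column of X is the intercept term x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.array([1.0, 0.5])

print(cost_loop(X, y, theta), cost_vectorized(X, y, theta))
```

For these values the residuals are $-0.5, -0.5, -1.0$, so both forms give $J = 1.5 / 6 = 0.25$.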

#### Finding coefficients for simple linear regression

The simple linear regression model is $y = \theta_0 + \theta_1 x$, where $\theta_0$ is the intercept and $\theta_1$ is the slope. The coefficients are found by minimizing the sum of squared residuals (SSR), which is the sum of the squares of the differences between the observed dependent variable ($y$) and those predicted by the linear function ($\hat{y}$).

Make a table with $x_i$, $y_i$, $x_i - \bar{x}$, $y_i - \bar{y}$, $(x_i - \bar{x})^2$, $(x_i - \bar{x})(y_i - \bar{y})$.

Equation for $\theta_1$: $$\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$$
Equation for $\theta_0$: $$\theta_0 = \bar{y} - \theta_1 \bar{x} $$
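The table-based procedure above amounts to the following computation. This is a sketch on made-up data; `fit_simple_linear_regression` is an illustrative name.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (theta_0, theta_1) that minimize the sum of squared residuals."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    theta_1 = s_xy / s_xx
    theta_0 = y_bar - theta_1 * x_bar
    return theta_0, theta_1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
theta_0, theta_1 = fit_simple_linear_regression(xs, ys)
print(theta_0, theta_1)
```

Here $\bar{x} = 3$, $\bar{y} = 4$, the numerator is $6$ and the denominator is $10$, giving $\theta_1 = 0.6$ and $\theta_0 = 4 - 0.6 \cdot 3 = 2.2$.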

#### Assumptions for linear regression
1. dependent and independent variables are linearly related
2. independent variables are not random
3. residuals are normally distributed
4. residuals are homoscedastic (constant variance)

### Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modelled as an $n^{th}$ degree polynomial in $x$. Polynomial regression fits a nonlinear relationship between the value of $x$ and $y$.
- Simplest form of polynomial regression is a quadratic equation, $y = \theta_0 + \theta_1 x + \theta_2 x^2$

### Normal Equation

The normal equation is an analytical solution to the linear regression problem with an ordinary least squares cost function. That is, to find the value of $\theta$ that minimizes $J({\theta})$, take the [gradient](https://mathinsight.org/gradient_vector) of $J(\theta)$ with respect to $\theta$ and set it equal to $0$, i.e. $\nabla_\theta J(\theta) = 0$.

Solving for $\theta$, we get 

$$\theta = (X^TX)^{-1} X^Ty$$

[Here](https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/) is a post containing the derivation of the normal equation.
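A minimal NumPy sketch of the normal equation follows, assuming $X^TX$ is invertible. It solves the linear system $(X^TX)\theta = X^Ty$ with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the numerically safer choice. The toy data ($y = 1 + 2x$) is an assumption for illustration.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # prepend the intercept column x_0 = 1
y = 1.0 + 2.0 * x                          # noiseless line with theta = (1, 2)

theta = normal_equation(X, y)
print(theta)
```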

### Gradient Descent

Gradient descent is based on the observation that if the function $J({\theta})$ is differentiable in a neighborhood of a point $\theta$, then $J({\theta})$ decreases fastest if one goes from $\theta$ in the direction of the negative gradient of $J({\theta})$ at $\theta$. 

Thus if we repeatedly apply the following update rule, ${\theta := \theta - \alpha \nabla J(\theta)}$ for a sufficiently small value of **learning rate**, $\alpha$, we will eventually converge to a value of $\theta$ that minimizes $J({\theta})$.

For a specific parameter $\theta_j$, the update rule is 

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J({\theta}) $$

Using the definition of $J({\theta})$, we get

$$\frac{\partial}{\partial \theta_j} J({\theta}) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

Therefore, we repeatedly apply the following update rule:

$\qquad Loop \: \{$
    $\qquad \qquad \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$
$\qquad \}$

This method looks at every example in the entire training set on every step, and is called **batch gradient descent (BGD)**. 

When the cost function $J$ is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.

There is an alternative to BGD that also works very well:

$\qquad Loop \: \{$
    $\qquad \qquad for \: i=1 \: to \: m \: \{$
    $\qquad \qquad \qquad \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{simultaneously update } \theta_j \text{ for all } j$
    $\qquad \qquad \}$
$\qquad \}$

This is **stochastic gradient descent (SGD)** (also incremental gradient descent), where we repeatedly run through the training set, and for each training example, we update the parameters using gradient of the error for that training example only.

Whereas BGD has to scan the entire training set before taking a single step, SGD can start making progress right away with each example it looks at. 

Often, SGD gets $\theta$ *close* to the minimum much faster than BGD. However, with a fixed learning rate it may never *converge* to the minimum, and $\theta$ will keep oscillating around the minimum of $J(\theta)$; in practice these values are usually reasonably good approximations. By slowly decreasing $\alpha$ to $0$ as the algorithm runs, $\theta$ can be made to converge to the minimum (the global one when $J$ is convex) rather than oscillate around it.
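The two update loops can be compared side by side in NumPy. This is a sketch under simplifying assumptions: the toy data is a noiseless line ($y = 1 + 2x$), the learning rates and iteration counts are arbitrary, and the SGD loop visits examples in a fixed order (in practice the order is usually shuffled each epoch).

```python
import numpy as np

def batch_gd(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent: every step uses the whole training set."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

def sgd(X, y, alpha=0.05, n_epochs=1000):
    """Stochastic gradient descent: one update per training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            theta -= alpha * (x_i @ theta - y_i) * x_i
    return theta

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_bgd = batch_gd(X, y)
theta_sgd = sgd(X, y)
print(theta_bgd, theta_sgd)
```

Because this data has no noise, every per-example gradient vanishes at the optimum, so SGD settles there as well; on noisy data a fixed $\alpha$ would leave it oscillating around the minimum.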

## Logistic Regression
- transforms the output of a linear regression model into a probability by applying the logistic function (sigmoid function) $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
- the output of the logistic function is interpreted as the probability of the input belonging to the positive class, $$ h_{\theta}(x) = P(y = 1 \mid x) = \sigma(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}} $$
    - if $\theta^Tx = 0$, then $P(y = 1 \mid x) = 0.5$
    - if $\theta^Tx \gg 0$, then $P(y = 1 \mid x) \approx 1$
    - if $\theta^Tx \ll 0$, then $P(y = 1 \mid x) \approx 0$
    - here $\theta^Tx$ is the logit (log-odds) of $P(y = 1 \mid x)$
- works by determining the weights $\theta$ such that the predicted probability is maximized for the positive class and minimized for the negative class
- you maximize the log-likelihood function, $$ \ell(\theta) = \sum_{i=1}^m y^{(i)} \log P(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P(y^{(i)} = 0 \mid x^{(i)}) $$ 
    - if $y^{(i)} = 1$, then $P(y^{(i)} = 1 \mid x^{(i)})$ is maximized, which happens when $\theta^Tx^{(i)}$ is maximized
    - if $y^{(i)} = 0$, then $P(y^{(i)} = 0 \mid x^{(i)})$ is maximized, which happens when $\theta^Tx^{(i)}$ is minimized
- this method is called maximum likelihood estimation (MLE)
- Logit function
    - the logit function is the inverse of the logistic function, $$ \text{logit}(p) = \log \left( \frac{p}{1 - p} \right) = \sigma^{-1}(p) $$
    - because of this, logit is also called log-odds, because it is the logarithm of the odds, $$ \text{odds}(p) = \frac{p}{1 - p} $$
- dependent variable follows Bernoulli distribution
- cost function is the negative average log-likelihood, $$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log P(y^{(i)} = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log P(y^{(i)} = 0 \mid x^{(i)}) \right] $$
- gradient descent update rule is $$ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{for } j = 0, 1, \dots, n $$ which has the same form as for linear regression, but here the hypothesis function $h_\theta(x) = \sigma(\theta^Tx)$ is different, as defined above
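The cost and update rule above can be sketched in NumPy as follows. The toy data, learning rate, iteration count and function names are assumptions for illustration; the `eps` clipping is a small guard against $\log 0$ and is not part of the formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Negative average log-likelihood (cross-entropy)."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_logistic(X, y, alpha=0.5, n_iters=500):
    """Batch gradient descent on the cross-entropy cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= alpha * grad
    return theta

x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column x_0 = 1
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = fit_logistic(X, y)
initial_cost = cost(np.zeros(2), X, y)     # log(2) when all probabilities are 0.5
final_cost = cost(theta, X, y)
predictions = (sigmoid(X @ theta) >= 0.5).astype(float)
print(theta, initial_cost, final_cost, predictions)
```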

## Multilayer Perceptron (MLP)

We overcome the limitations of linear models by incorporating hidden layers into our model. The hidden layers allow us to model non-linear relationships between our features and the output. They are also called representation layers, as they learn a representation of the data that is used by the final layer to make the prediction.
- Think of the first $L-1$ layers as learning a representation of the data, and the final layer as using that representation to make a linear prediction.
- This is called a multilayer perceptron (MLP) or a feedforward neural network (FFNN).

<img alt="picture 2" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/5d52b7a5eb8f67b1ae9a5f73b0555decc85d27247eee9a33a219452acd4842af.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

Say we have an MLP with $L = 1$ hidden layer. The input layer has $d$ features, the hidden layer has $h$ units and the output layer has $q$ units. Then:
- We denote by $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ the weight matrix between the input layer and the hidden layer, and by $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ the weight matrix between the hidden layer and the output layer.
- The bias vector for the hidden layer is denoted by $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$ and the bias vector for the output layer is denoted by $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$.
- Let the input matrix be $\mathbf{X}$, hidden layer activations be $\mathbf{H}$ and output layer activations be $\mathbf{O}$. Then:
$$ 
\underbrace{\overbrace{\mathbf{X}}^{\text{input}}}_{\mathbb{R}^{n \times d}} 
\underbrace{\xrightarrow{\frac{\mathbf{W}^{(1)}}{\mathbf{b}^{(1)}}}}_{\frac{\mathbb{R}^{d \times h}}{\mathbb{R}^{1 \times h}}}
\underbrace{\overbrace{\mathbf{H}}^{\text{hidden}}}_{\mathbb{R}^{n \times h}} 
\underbrace{\xrightarrow{\frac{\mathbf{W}^{(2)}}{\mathbf{b}^{(2)}}}}_{\frac{\mathbb{R}^{h \times q}}{\mathbb{R}^{1 \times q}}}
\underbrace{\overbrace{\mathbf{O}}^{\text{output}}}_{\mathbb{R}^{n \times q}} 
$$

$$
\begin{align*}
\mathbf{H} &= \sigma (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}) \\
\mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}
\end{align*}
$$

where $\sigma$ is the activation function, which is applied row-wise (i.e. one example at a time).

To build more general models, we can stack multiple hidden layers on top of each other, e.g. $\mathbf{H}^{(1)} = \sigma_1 (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})$ and $\mathbf{H}^{(2)} = \sigma_2 (\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})$. This is called a deep neural network, and it yields more expressive models.
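The forward pass of a one-hidden-layer MLP can be sketched in NumPy to make the shapes concrete. The sizes, the random weights and the choice of ReLU as $\sigma$ are all assumptions for illustration. The last two lines also check that, without the non-linearity, the two layers collapse into a single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 3, 5, 2  # examples, input features, hidden units, outputs

X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), np.zeros((1, h))
W2, b2 = rng.normal(size=(h, q)), np.zeros((1, q))

def relu(Z):
    return np.maximum(0.0, Z)

H = relu(X @ W1 + b1)  # hidden activations, shape (n, h)
O = H @ W2 + b2        # outputs, shape (n, q)

# With the activation removed, the network is equivalent to one linear layer.
O_linear = (X @ W1 + b1) @ W2 + b2
O_collapsed = X @ (W1 @ W2) + (b1 @ W2 + b2)

print(H.shape, O.shape, np.allclose(O_linear, O_collapsed))
```

The bias rows of shape $1 \times h$ and $1 \times q$ are added to every example through broadcasting.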

#### MLP for XOR gate

<img alt="picture 6" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/6f69975403e76ca25b578e60e9ad99f5cb9009b8a1c701e0adedb4dd4e088cc1.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">
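One way to realize XOR with a two-layer network is to combine an OR unit and a NAND unit with an AND unit. The sketch below uses the $+1$ / $-1$ encoding; the weights are one valid choice made for illustration and are not necessarily those shown in the figure.

```python
def perceptron(weights, inputs):
    """Return 1 if w0 + sum(w_i * x_i) > 0, else -1."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if z > 0 else -1

# Assumed weights (bias first), all in the +1 / -1 encoding.
OR_W = [2, 2, 2]
NAND_W = [1, -2, -2]
AND_W = [-1, 2, 2]

def xor_mlp(a, b):
    h1 = perceptron(OR_W, [a, b])         # hidden unit 1: a OR b
    h2 = perceptron(NAND_W, [a, b])       # hidden unit 2: NOT (a AND b)
    return perceptron(AND_W, [h1, h2])    # output unit: h1 AND h2

xor_table = {(a, b): xor_mlp(a, b) for a in (-1, 1) for b in (-1, 1)}
print(xor_table)
```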

#### MLP for complicated decision boundaries
1. Network must fire when input is in the correct region
    - the number of hidden units must be at least the number of lines that define the region
    - if the target region is made up of more than one disjoint region
        - then there will be a third layer that combines the outputs for each of the regions
    <img alt="picture 9" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/8ada04e55fb85856b5a27a7be5f19ec5123d3cde4b4c09c6e562743e32e139cc.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">
2. Network must be able to represent any boolean function
    - a one hidden layer MLP can represent any boolean function
    <img alt="picture 10" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/e391c6063de7631f3db02423c9a60cfa24e20c32555c97d79b64d1e4f52338ce.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">
3. In general, XOR of N variables requires the following number of perceptrons:
    - For a network with a single hidden layer, this will require $2^{N-1}$ perceptrons in the hidden layer and 1 perceptron in the output layer.
    - For a deep network, this will require $3(N-1)$ perceptrons arranged in $2\log_2(N)$ layers.
    <img alt="picture 11" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/ffd592683d030dc2d2378ac1257e200e16e3327a9fdf389ac7494fd35cdf8f9b.png" width="300" style="display: block; margin-left: auto; margin-right: auto;">



## Activation functions

If we use a linear activation function, the stacked layers collapse into a single linear map, so the model can only learn linearly separable functions. Hence, we use non-linear activations to enable the model to learn non-linearly separable problems. A good activation function:
- should be monotonic, i.e. increasing the input should always increase the output.
- should be defined everywhere, continuous and preferably differentiable everywhere.
    - backprop algorithm uses gradient descent to update the weights and hence requires the activation function to be differentiable.
- should be computationally efficient to compute and differentiate.

Problems:
1. Vanishing gradient problem:
    - as the gradient is backpropagated through the layers, it gets multiplied by the gradient of the activation function at each layer
    - if the gradient of the activation function is less than 1, then the gradient will keep getting smaller as we propagate through the layers
    - this will cause the weights to not get updated at the earlier layers and the model will not learn
    - typically seen in sigmoid and tanh activation functions, for very deep networks
2. Exploding gradient problem: 
    - when the gradients during backpropagation are very large, the weights get updated by a large amount
    - can happen due to improper initialization of weights or large learning rates
3. Dying ReLU problem: 
    - when the input to the ReLU is negative, the gradient is 0
    - this causes the weights to not get updated and an inactive neuron 
    - once a ReLU neuron is inactive, it will not activate again as the gradient will always be 0 when the output is zero, dying forever

Different activation functions:

| Name | Activation Function | Derivative |
| --- | --- | --- |
| Sigmoid | $\sigma(z) = \frac{1}{1+e^{-z}}$ | $\sigma(z)(1 - \sigma(z))$ |
| Tanh | $\tanh(z) = \frac{e^z-e^{-z}}{e^z+e^{-z}}$ | $1 - \tanh^2(z)$ |
| ReLU | $g(z) = \max(0, z)$ | $1$ if $z > 0$ else $0$ |
| Leaky ReLU | $g(z) = \max(\alpha z, z)$ | $1$ if $z > 0$ else $\alpha$ |
| ELU | $g(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{otherwise} \end{cases}$ | $1$ if $z > 0$ else $\alpha e^z$ |
| Softplus | $g(z) = \log(1 + e^z)$ | $\sigma(z)$ |


1. Sigmoid
    - squashes the input to the range (0, 1)
    - saturates when the input is very large or very small
    - not zero-centered
    - not computationally efficient
2. Tanh
    - squashes the input to the range (-1, 1)
    - saturates when the input is very large or very small
    - zero-centered
    - not computationally efficient
3. ReLU
    - saturates when the input is negative
    - computationally efficient
    - suffers from dying ReLU problem
4. Leaky ReLU: allows a small positive gradient when the input is negative
    - computationally efficient
    - does not suffer from dying ReLU problem
5. Parametric ReLU: allows the negative slope to be learned
    - computationally efficient
    - does not suffer from dying ReLU problem
6. Exponential Linear Unit (ELU):
    - saturates when the input is negative
    - does not suffer from dying ReLU problem
7. Softplus:
    - saturates when the input is negative

<img alt="picture 12" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/30326a9d501a81322aeecfaccd79ee8728bb2aa402e835219c6620f850fc2169.png" width="1000" style="display: block; margin-left: auto; margin-right: auto;">
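The derivative column of the table can be checked numerically. The sketch below implements each activation and its derivative in plain Python and compares the derivative with a central finite difference. The constants `LEAKY_ALPHA = 0.01` and `ELU_ALPHA = 1.0` are commonly used values assumed here, not values given in the notes.

```python
import math

LEAKY_ALPHA = 0.01  # assumed negative slope for Leaky ReLU
ELU_ALPHA = 1.0     # assumed scale for ELU

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# name -> (activation, derivative), following the table above
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
    "leaky_relu": (lambda z: max(LEAKY_ALPHA * z, z),
                   lambda z: 1.0 if z > 0 else LEAKY_ALPHA),
    "elu": (lambda z: z if z > 0 else ELU_ALPHA * (math.exp(z) - 1.0),
            lambda z: 1.0 if z > 0 else ELU_ALPHA * math.exp(z)),
    "softplus": (lambda z: math.log(1.0 + math.exp(z)), sigmoid),
}

def numerical_derivative(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Compare at one negative and one positive point (away from the kink at 0).
max_error = max(
    abs(numerical_derivative(f, z) - df(z))
    for f, df in ACTIVATIONS.values()
    for z in (-0.7, 0.7)
)
print(max_error)
```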

### MLP as classifiers and Universal approximators

There are results that show that MLPs are universal approximators: with a single hidden layer and enough hidden units, they can approximate any continuous function on a bounded input domain to arbitrary accuracy. However, in practice, we use MLPs with multiple hidden layers, as they can express many functions with far fewer units.
- We can approximate most functions much more compactly with a deep MLP than a wide MLP.

### Issue of Depth and Width

- Depth: The number of hidden layers in the network
- Width: The number of units in each hidden layer
- The number of parameters in a single hidden layer MLP is $(d \times h + h) + (h \times q + q)$.
- The number of parameters in a deep MLP with $L$ hidden layers is $(d \times h_1) + (h_1 \times h_2) + \dots + (h_{L-1} \times h_L) + (h_L \times q) + (h_1 + h_2 + \dots + h_L + q)$.
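The two parameter-count formulas above are instances of the same sum over consecutive layer pairs, as this short sketch shows. The layer sizes are arbitrary examples and `count_parameters` is an illustrative name.

```python
def count_parameters(layer_sizes):
    """layer_sizes = [d, h_1, ..., h_L, q]; counts all weights plus biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

single_hidden = count_parameters([4, 8, 3])   # (4*8 + 8) + (8*3 + 3)
two_hidden = count_parameters([4, 8, 8, 3])   # 40 + (8*8 + 8) + 27
print(single_hidden, two_hidden)
```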

## Computation Graphs
- Error backpropagation (backward differentiation on a computation graph) is used to compute the gradients of the loss function for a network
- Neural language models use a neural network as a probabilistic classifier, to compute the probability of the next word given the previous n words
- Neural language models can use pretrained embeddings, or can learn embeddings from scratch in the process of language modeling

$L(a, b, c) = c(a + 2b)$

1. First we do the forward pass,
    <img alt="picture 6" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/adebdf40c90c52c1126e918616285d6f5feaf2f5522e5ba476f211c885f262b9.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>
2. Then we can compute the derivatives using the chain rule,
    - $\frac{\partial L}{\partial c} = e, \frac{\partial L}{\partial a} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial a}, \frac{\partial L}{\partial b} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial d} \frac{\partial d}{\partial b}$  
3. Now we compute 
    - $\frac{\partial L}{\partial e} = c, \frac{\partial L}{\partial c} = e$
    - $\frac{\partial e}{\partial a} = 1, \frac{\partial e}{\partial d} = 1$
    - $\frac{\partial d}{\partial b} = 2$
4. Then we do backward pass,
    <img alt="picture 7" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/a135e1010c19f7550ed6d8472ec297388fd1b6e2406fa53f0fbcb77778e06da8.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>
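The forward and backward passes for $L(a, b, c) = c(a + 2b)$ can be written out by hand as below, using the intermediate nodes $d = 2b$ and $e = a + d$. The input values $a = 3$, $b = 1$, $c = -2$ are chosen for illustration and may differ from those in the figures.

```python
def forward_backward(a, b, c):
    # Forward pass through the graph: d = 2b, e = a + d, L = c * e
    d = 2 * b
    e = a + d
    L = c * e

    # Backward pass: apply the chain rule from the output back to the inputs.
    dL_de = c              # dL/de = c
    dL_dc = e              # dL/dc = e
    dL_da = dL_de * 1      # de/da = 1
    dL_dd = dL_de * 1      # de/dd = 1
    dL_db = dL_dd * 2      # dd/db = 2
    return L, {"a": dL_da, "b": dL_db, "c": dL_dc}

loss, grads = forward_backward(a=3, b=1, c=-2)
print(loss, grads)
```

With these inputs, $d = 2$, $e = 5$ and $L = -10$, so the gradients are $\partial L / \partial a = -2$, $\partial L / \partial b = -4$ and $\partial L / \partial c = 5$.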

Sample computation graph for a simple 2-layer neural net with two input dimensions and 2 hidden dimensions.

<img alt="picture 8" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/e729d679c81d447bdf098a9cf6b74ceddc3e9a2fc974fa2335a9e48fe9218ccf.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>


#### Underfitting and Overfitting

Error(model) = Bias(model)² + Variance(model) + Irreducible Error (for the expected squared error)

Bias : how far off in general the model is from the actual value. High bias means the model is not complex enough to capture the underlying trend of the data. Low bias means the model is complex enough to capture the underlying trend of the data.

Variance : how much the model changes based on the training data. High variance means the model changes a lot based on the training data. Low variance means the model does not change much based on the training data.

Deep learning recipe

<img alt="picture 13" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/c7245191c0988fc5b022a124b27ac8ee8d9ab5b89dd3733a8afd9c5d8ed920de.png" width="500" style="display: block; margin-left: auto; margin-right: auto">

**Underfitting** – High bias and low variance
- model does not fit the training data and does not generalize well to unseen data

Techniques to reduce underfitting :
1. Increase model complexity
2. Increase number of features, performing feature engineering
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.

**Overfitting** – High variance and low bias
- model fits the training data well, but does not generalize well to unseen data

Techniques to reduce overfitting :
1. Increase training data (data augmentation)
2. Reduce model complexity.
3. Early stopping during the training phase (monitor the validation loss during training and stop as soon as it begins to increase).
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
6. Ensemble learning (bagging, boosting, stacking)
7. Cross-validation, holdout validation, k-fold cross-validation

<img src="https://i.imgur.com/b4CWHHf.png" width="500" style="display: block; margin-left: auto; margin-right: auto;">

Simple models like linear and logistic regression are prone to underfitting, whereas complex models like decision trees and neural networks are prone to overfitting.


### Adding regularization

Regularization is a technique to reduce overfitting in machine learning. This technique discourages learning a more complex or flexible model, by shrinking the parameters towards $0$.

We can regularize machine learning methods through the cost function using $L1$ regularization or $L2$ regularization. $L1$ regularization adds an absolute penalty term to the cost function, while $L2$ regularization adds a squared penalty term to the cost function. A model with $L1$ norm for regularisation is called **lasso regression**, and one with (squared) $L2$ norm for regularisation is called **ridge regression**. [Link](https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261)

$$J(\theta)_{L1} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^n |\theta_j| \right)$$

$$J(\theta)_{L2} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^n \theta_j^2 \right)$$

The partial derivative of the cost function for lasso linear regression is:

\begin{align}
& \frac{\partial J(\theta)_{L1}}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)} \right) - y^{(i)} \right) x_0^{(i)} 
& \qquad \text{for } j = 0 \\
& \frac{\partial J(\theta)_{L1}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{2m} \operatorname{sign} (\theta_j)
& \qquad \text{for } j \ge 1
\end{align}

Similarly for ridge linear regression,

\begin{align}
& \frac{\partial J(\theta)_{L2}}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left(x^{(i)} \right) - y^{(i)} \right) x_0^{(i)} 
& \qquad \text{for } j = 0 \\
& \frac{\partial J(\theta)_{L2}}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j 
& \qquad \text{for } j \ge 1
\end{align}
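The regularized gradients can be sketched in NumPy and checked against a finite-difference gradient of the ridge cost. The data, $\lambda = 0.5$ and all function names are assumptions for illustration; note that the intercept $\theta_0$ is left out of the penalty, as in the formulas above.

```python
import numpy as np

def data_gradient(theta, X, y):
    """Gradient of the unregularized squared-error cost."""
    return X.T @ (X @ theta - y) / len(y)

def ridge_gradient(theta, X, y, lam):
    grad = data_gradient(theta, X, y)
    grad[1:] += (lam / len(y)) * theta[1:]  # theta_0 is not penalized
    return grad

def lasso_gradient(theta, X, y, lam):
    grad = data_gradient(theta, X, y)
    grad[1:] += (lam / (2 * len(y))) * np.sign(theta[1:])
    return grad

def ridge_cost(theta, X, y, lam):
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * m) + lam / (2 * m) * np.sum(theta[1:] ** 2)

def numerical_gradient(f, theta, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

X = np.array([[1.0, 0.5, 2.0], [1.0, 1.5, -1.0], [1.0, -0.5, 0.3], [1.0, 2.0, 1.0]])
y = np.array([1.0, 2.0, 0.5, 3.0])
theta = np.array([0.2, -0.4, 0.7])
lam = 0.5

analytic = ridge_gradient(theta, X, y, lam)
numeric = numerical_gradient(lambda t: ridge_cost(t, X, y, lam), theta)
print(analytic, numeric)
```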


#### For logistic regression
- the regularization terms are exactly the same as in linear regression, but they are added to the cross-entropy cost instead of the squared-error cost
    - L1 regularization : $$ J(\theta)_{L1} = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta\left( x^{(i)} \right) + (1 - y^{(i)}) \log \left( 1 - h_\theta\left( x^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \left( \sum_{j=1}^n |\theta_j| \right) $$
    - L2 regularization : $$ J(\theta)_{L2} = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta\left( x^{(i)} \right) + (1 - y^{(i)}) \log \left( 1 - h_\theta\left( x^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \left( \sum_{j=1}^n \theta_j^2 \right) $$


These equations can be substituted into the general gradient descent update rule to get the specific lasso / ridge update rules.

Elastic Net regression is a combination of lasso and ridge regression. Its regularization term is a combination of the $L1$ and $L2$ regularization terms. The cost function is:
$$ J(\theta)_{ElasticNet} = \frac{1}{2m} \left( \sum_{i=1}^m \left( h_\theta\left( x^{(i)} \right) - y^{(i)} \right)^2 \right) + \frac{\lambda_1}{2m} \left( \sum_{j=1}^n |\theta_j| \right) + \frac{\lambda_2}{2m} \left( \sum_{j=1}^n \theta_j^2 \right)$$

#### Note:
- $\theta_0$ is NOT constrained
- scale the data before using Ridge regression
- $\lambda$ is a hyperparameter: a larger value results in a flatter and smoother model 
- Lasso tends to completely eliminate the weights of the least important features (i.e., setting them to 0), so it automatically performs feature selection
- A third way to constrain the weights is Elastic Net, a combination of Ridge and Lasso
- When to use which?
    * Ridge is a good default
    * If you suspect some features are not useful, use Lasso or Elastic
    * When features are more than training examples, prefer Elastic



## Optimization

### Unconstrained Optimization

To find local minima of a function using derivatives, you can follow these steps:

1. Find the first derivative of the function.
2. Set the first derivative equal to $0$ and solve for $x$. This will give you the critical points of the function.
3. Find the second derivative of the function.
4. Plug each critical point into the second derivative. If $f''(x) > 0$, then it is a local minimum. If $f''(x) < 0$, then it is a local maximum. If $f''(x) = 0$, the test is inconclusive.
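As a worked instance of these steps, take $f(x) = x^3 - 3x$ (an example chosen here for illustration). Its first derivative $3x^2 - 3$ vanishes at $x = \pm 1$, and the sign of the second derivative $6x$ classifies each critical point. The function name `classify_critical_point` is hypothetical.

```python
def classify_critical_point(second_derivative, x):
    """Second-derivative test at a critical point x."""
    curvature = second_derivative(x)
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    return "inconclusive"

# f(x) = x^3 - 3x, so f'(x) = 3x^2 - 3 and f''(x) = 6x.
def f_second(x):
    return 6 * x

critical_points = [-1.0, 1.0]  # roots of f'(x) = 3x^2 - 3
labels = {x: classify_critical_point(f_second, x) for x in critical_points}
print(labels)
```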

#### For multi variable functions

The point $(a,b)$ is a critical point (or a stationary point) of $f(x,y)$ provided one of the following is true,
1. $\nabla f(a,b) = \vec{0}$ (this is equivalent to saying that $f_x(a,b) = 0$ and $f_y(a,b) = 0$),
2. $f_x(a,b)$ and/or $f_y(a,b)$ doesn’t exist.

### Optimization using gradient descent

The main idea of gradient-descent, a first-order optimization algorithm, is to take a step from the current point of magnitude proportional to the negative gradient of the function at the current point.

$$ \text{If } x_1 = x_0 - \gamma(\nabla f(x_0))^T \text{ for a sufficiently small step-size } \gamma > 0 \text{, then } f(x_1) \leq f(x_0).$$

The above suggests that in order to find a local optimum $x^*$ of $f$, we can start at some initial point $x_0$ and then iterate according to $x_{i+1} = x_i - \gamma_i(\nabla f(x_i))^T$. For suitable step-sizes $\gamma_i$, the function values satisfy $f(x_0) \geq f(x_1) \geq \dots$ and the sequence of points converges to some local minimum.

$$\theta_{j} = \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$$

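where $\theta_{j}$ is the $j^{th}$ parameter of the model, $\alpha$ is the learning rate, and $J(\theta)$ is the cost function. The partial derivative of $J(\theta)$ with respect to $\theta_{j}$ is the slope of the cost along $\theta_{j}$; updating every $\theta_{j}$ against its partial derivative simultaneously moves $\theta$ along the negative gradient, which is the direction of steepest descent.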

<img alt="picture 14" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/ad7132dadcc3fed3ec1d25831c19a416543fde96bebf8a584c392da6a21c1655.png" width="500" style="display: block; margin-left: auto; margin-right: auto">
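
The update rule can be sketched in a few lines of plain Python. The quadratic cost $J(\theta) = (\theta_0 - 3)^2 + 2(\theta_1 + 1)^2$, the learning rate, and the step count are assumptions chosen so that the minimum $(3, -1)$ is known in advance.

```python
def grad_J(theta):
    # Gradient of the assumed cost J = (theta0 - 3)**2 + 2*(theta1 + 1)**2
    return [2 * (theta[0] - 3), 4 * (theta[1] + 1)]


def gradient_descent(grad, theta0, alpha=0.1, steps=200):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j for all j at once."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta


theta_min = gradient_descent(grad_J, [0.0, 0.0])
print(theta_min)  # should be close to the known minimum (3, -1)
```

With too large an `alpha` the iterates would overshoot and diverge, which is one motivation for the learning-rate strategies discussed later.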

#### Hessian matrix

The Hessian is the collection of all second-order partial derivatives. If $f(x, y)$ is a twice (continuously) differentiable function, then $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$ i.e., the order of differentiation does not matter, and the corresponding Hessian matrix $$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$ is symmetric. The Hessian is denoted as $\nabla^2_{x,y}f(x,y)$.
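
The symmetry of the Hessian can be checked numerically. The sketch below nests central differences to approximate the second-order partials of an assumed example function, $f(x,y) = x^2 + xy + y^2 - 3x$, whose exact Hessian is $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$; the helper names are hypothetical.

```python
def f(x, y):
    # Assumed example function with exact Hessian [[2, 1], [1, 2]]
    return x**2 + x * y + y**2 - 3 * x


def d_dx(func, h=1e-4):
    """Return a function approximating the partial derivative in x."""
    return lambda x, y: (func(x + h, y) - func(x - h, y)) / (2 * h)


def d_dy(func, h=1e-4):
    """Return a function approximating the partial derivative in y."""
    return lambda x, y: (func(x, y + h) - func(x, y - h)) / (2 * h)


def hessian(func, x, y):
    """Approximate [[f_xx, f_xy], [f_yx, f_yy]] at (x, y)."""
    f_x, f_y = d_dx(func), d_dy(func)
    return [
        [d_dx(f_x)(x, y), d_dy(f_x)(x, y)],
        [d_dx(f_y)(x, y), d_dy(f_y)(x, y)],
    ]


H = hessian(f, 2.0, -1.0)
print(H)  # the two off-diagonal entries should agree up to rounding
```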

#### Dynamic learning rate
- Replace the constant $\alpha$ with a time-dependent learning rate $\alpha(t)$
- Different strategies for $\alpha(t)$
    - $\alpha(t) = \alpha_i$ (piecewise constant), switching to a smaller $\alpha_i$ whenever progress in optimization stalls
    - $\alpha(t) = \alpha_0 e^{-\lambda t}$, where $\lambda$ is a hyperparameter, leading to exponential decay of learning rate
    - $\alpha(t) = \alpha_0(\beta t + 1) ^ {-\gamma}$, where $\beta$ and $\gamma$ are hyperparameters, leading to polynomial decay of learning rate
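
The three strategies can be written as small functions of the step index $t$. This is a sketch with assumed default hyperparameters; in particular, the piecewise-constant version switches rates at fixed, hypothetical step boundaries as a stand-in for detecting that progress has stalled.

```python
import math


def piecewise_constant(t, boundaries=(100, 200), rates=(0.1, 0.01, 0.001)):
    """alpha(t) = alpha_i; the rate index is the number of boundaries passed."""
    index = sum(t >= b for b in boundaries)
    return rates[index]


def exponential_decay(t, alpha0=0.1, lam=0.01):
    """alpha(t) = alpha0 * exp(-lam * t)."""
    return alpha0 * math.exp(-lam * t)


def polynomial_decay(t, alpha0=0.1, beta=1.0, gamma=0.5):
    """alpha(t) = alpha0 * (beta * t + 1) ** (-gamma)."""
    return alpha0 * (beta * t + 1) ** (-gamma)


for t in (0, 50, 150, 250):
    print(t, piecewise_constant(t), exponential_decay(t), polynomial_decay(t))
```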

### Momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction $\beta$ of the update vector of the past time step to the current update vector. The momentum term $\beta$ is usually set to $0.9$ or a similar value.
- It replaces gradients with a leaky average over past gradients, thus accelerating convergence significantly.
- It prevents stalling of the optimization process that is much more likely to occur for stochastic gradient descent.
- The effective number of gradients averaged is $1/(1 - \beta)$, because past gradients are down-weighted exponentially.

<img alt="picture 15" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/5d19d40177c95691f3992b8e5636607fcbcc707bcb46a5a7726415ead68ac4a6.png" width="500" style="display: block; margin-left: auto; margin-right: auto">
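
A minimal sketch of a momentum update, assuming the leaky-average form $v_t = \beta v_{t-1} + g_t$, $\theta_t = \theta_{t-1} - \eta v_t$ (other formulations scale the terms differently). The ill-conditioned quadratic $J(\theta) = \theta_0^2 + 25\theta_1^2$ and all hyperparameter values are assumptions for illustration.

```python
def grad_J(theta):
    # Gradient of the assumed cost J = theta0**2 + 25 * theta1**2
    return [2 * theta[0], 50 * theta[1]]


def sgd_momentum(grad, theta0, eta=0.01, beta=0.9, steps=500):
    """Gradient descent where the step follows a leaky average of gradients."""
    theta = list(theta0)
    v = [0.0] * len(theta)  # velocity: accumulated past gradients
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vj + gj for vj, gj in zip(v, g)]
        theta = [t - eta * vj for t, vj in zip(theta, v)]
    return theta


theta_final = sgd_momentum(grad_J, [1.0, 1.0])
print(theta_final)  # should be close to the minimum at (0, 0)
```

With $\beta = 0.9$ the velocity averages roughly $1/(1 - 0.9) = 10$ past gradients, so the slowly varying direction along $\theta_0$ is reinforced while sign-flipping components partly cancel.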

#### Adagrad

Adagrad adapts the learning rate of each parameter based on the history of its gradients. Parameters that have accumulated large squared gradients have their learning rates decayed more than parameters whose gradients have been small or infrequent. This can be useful when dealing with sparse data.

$$ g_{t} = \nabla J(\theta_{t}) $$ 
$$ s_{t} = s_{t-1} + g_{t} \odot g_{t} $$ 
$$ \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{s_{t} + \epsilon}} \odot g_{t} $$

where $J$ is the objective function, $\theta$ is the parameter vector, $g_t$ is the gradient of the objective function with respect to $\theta$ at time step $t$, $\eta$ is the learning rate, $s_t$ is a vector whose $i$-th element is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$, $\odot$ denotes elementwise multiplication (the square root and division are also applied elementwise), and $\epsilon$ is a smoothing term to avoid division by zero.
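
The three equations translate almost line by line into code. This is a sketch in plain Python lists, assuming the quadratic cost $J(\theta) = \theta_0^2 + 25\theta_1^2$ and illustrative hyperparameter values.

```python
import math


def grad_J(theta):
    # Gradient of the assumed cost J = theta0**2 + 25 * theta1**2
    return [2 * theta[0], 50 * theta[1]]


def adagrad(grad, theta0, eta=0.5, eps=1e-8, steps=500):
    theta = list(theta0)
    s = [0.0] * len(theta)  # running sum of squared gradients, per parameter
    for _ in range(steps):
        g = grad(theta)
        s = [sj + gj * gj for sj, gj in zip(s, g)]
        theta = [
            t - eta / math.sqrt(sj + eps) * gj
            for t, sj, gj in zip(theta, s, g)
        ]
    return theta


theta_final = adagrad(grad_J, [1.0, 1.0])
print(theta_final)  # should be close to the minimum at (0, 0)
```

Because each gradient is divided by the root of its own accumulated squares, the two coordinates follow almost the same trajectory here even though their curvatures differ by a factor of 25.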

#### RMSProp

RMSProp is similar to Adagrad but uses an exponentially decaying average of past squared gradients instead of accumulating all past squared gradients. This makes it more suitable for non-stationary problems.

$$ g_{t} = \nabla J(\theta_{t}) $$
$$ s_{t} = \beta s_{t-1} + (1-\beta)\, g_{t} \odot g_{t} $$
$$ \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{s_{t} + \epsilon}} \odot g_{t} $$

where $J$ is the objective function, $\theta$ is the parameter vector, $g_t$ is the gradient of the objective function with respect to $\theta$ at time step $t$, $\eta$ is the learning rate, $s_t$ is the weighted moving average of the squared gradient at time step $t$, $\beta$ is the decay rate for the moving average (usually set to 0.9), and $\epsilon$ is a smoothing term to avoid division by zero.
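
A sketch of the RMSProp update in plain Python, again assuming the quadratic cost $J(\theta) = \theta_0^2 + 25\theta_1^2$ and illustrative hyperparameters. The only change relative to an accumulate-everything scheme is that $s_t$ is a leaky average.

```python
import math


def grad_J(theta):
    # Gradient of the assumed cost J = theta0**2 + 25 * theta1**2
    return [2 * theta[0], 50 * theta[1]]


def rmsprop(grad, theta0, eta=0.01, beta=0.9, eps=1e-8, steps=1000):
    theta = list(theta0)
    s = [0.0] * len(theta)  # leaky average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        s = [beta * sj + (1 - beta) * gj * gj for sj, gj in zip(s, g)]
        theta = [
            t - eta / math.sqrt(sj + eps) * gj
            for t, sj, gj in zip(theta, s, g)
        ]
    return theta


theta_final = rmsprop(grad_J, [1.0, 1.0])
print(theta_final)  # ends in a small neighbourhood of (0, 0)
```

With a fixed $\eta$ the normalised step does not shrink as the gradient shrinks, so the iterates typically settle within a distance on the order of $\eta$ from the minimum rather than converging exactly; combining RMSProp with a decaying learning rate is one way to tighten this.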

#### Adam

Adam combines the best properties of Adagrad and RMSProp. It uses an exponentially decaying average of past squared gradients like RMSProp and also keeps an exponentially decaying average of past gradients like momentum.

$$ g_{t} = \nabla J(\theta_{t}) $$
$$ v_{t} = \beta_1 v_{t-1} + (1-\beta_1)\, g_{t} $$
$$ s_{t} = \beta_2 s_{t-1} + (1-\beta_2)\, g_{t} \odot g_{t} $$
$$ \hat{v}_{t} = \frac{v_{t}}{1-\beta^t_1} $$
$$ \hat{s}_{t} = \frac{s_{t}}{1-\beta^t_2} $$
$$ \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{s}_{t} + \epsilon}} \odot \hat{v}_{t} $$

where $J$ is the objective function, $\theta$ is the parameter vector, $g_t$ is the gradient of the objective function with respect to $\theta$ at time step $t$, $\eta$ is the learning rate, $v_t$ and $s_t$ are estimates of the first and second moments of the gradients respectively, $\hat{v}_t$ and $\hat{s}_t$ are bias-corrected estimates of the first and second moments of the gradients respectively, $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates (usually set to 0.9 and 0.999 respectively), and $\epsilon$ is a smoothing term to avoid division by zero.
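
The equations above can be sketched as follows, keeping $\epsilon$ inside the square root as written in these notes (many implementations add it outside instead). The quadratic cost $J(\theta) = \theta_0^2 + 25\theta_1^2$, the learning rate, and the step count are assumptions for illustration.

```python
import math


def grad_J(theta):
    # Gradient of the assumed cost J = theta0**2 + 25 * theta1**2
    return [2 * theta[0], 50 * theta[1]]


def adam(grad, theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    theta = list(theta0)
    v = [0.0] * len(theta)  # first-moment estimate (leaky average of g)
    s = [0.0] * len(theta)  # second-moment estimate (leaky average of g*g)
    for t in range(1, steps + 1):
        g = grad(theta)
        v = [beta1 * vj + (1 - beta1) * gj for vj, gj in zip(v, g)]
        s = [beta2 * sj + (1 - beta2) * gj * gj for sj, gj in zip(s, g)]
        # Bias correction compensates for v and s being initialised at zero.
        v_hat = [vj / (1 - beta1**t) for vj in v]
        s_hat = [sj / (1 - beta2**t) for sj in s]
        theta = [
            th - eta / math.sqrt(sh + eps) * vh
            for th, sh, vh in zip(theta, s_hat, v_hat)
        ]
    return theta


theta_final = adam(grad_J, [1.0, 1.0])
print(theta_final)  # should be close to the minimum at (0, 0)
```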