## Gradient Descent

In machine learning, a loss function is used to penalise the model for learning parameters $\theta$ that fail to approximate $F$ accurately. A loss function is similar to an evaluation metric, with one important difference: loss functions should always be differentiable, while evaluation metrics don't have to be.

Gradient descent is how a machine learning model incrementally learns its parameters $\theta$. It only works for differentiable loss functions, and it finds a local minimum of the loss function with respect to the parameters. The optimisation method is called *gradient descent* because it involves differentiating the loss function with respect to each parameter to obtain the gradient, and then 'moving' the parameters in the direction of steepest descent (the negative gradient) on the loss surface.

Recalling our formulation of single-variable linear regression:
$$\hat{y} = f(X_i;\theta) = \theta_0 + \theta_1x_1$$

We can choose the following loss function since it differentiates nicely:
$$L(x;\theta) = \frac{1}{2}(y-\hat{y})^2$$

The loss function measures half the squared difference between the prediction and the actual value. The factor of $\frac{1}{2}$ cancels the $2$ produced by differentiating the square.
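
As a minimal sketch of the model and loss above, the snippet below evaluates both for a single data point. The function names and the example values of `x1`, `y`, `theta0` and `theta1` are hypothetical, chosen only for illustration.

```python
def predict(x1, theta0, theta1):
    """Single-variable linear model: y_hat = theta0 + theta1 * x1."""
    return theta0 + theta1 * x1


def squared_error_loss(y, y_hat):
    """Half the squared difference between the target y and the prediction y_hat."""
    return 0.5 * (y - y_hat) ** 2


# Hypothetical example values (assumptions, not from the text).
x1, y = 2.0, 5.0
theta0, theta1 = 1.0, 1.5

y_hat = predict(x1, theta0, theta1)  # 1.0 + 1.5 * 2.0 = 4.0
loss = squared_error_loss(y, y_hat)  # 0.5 * (5.0 - 4.0) ** 2 = 0.5
print(y_hat, loss)
```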

Gradient of $L$ w.r.t $\hat{y}$:
$$\frac{\partial{L}}{\partial{\hat{y}}} = -(y - \hat{y})$$

Gradient of $L$ w.r.t $\theta_1$:
$$\frac{\partial{L}}{\partial{\theta_1}} = \frac{\partial{L}}{\partial{\hat{y}}} \cdot \frac{\partial{\hat{y}}}{\partial{\theta_1}} = -(y-\hat{y})\cdot x_1$$

Gradient of $L$ w.r.t $\theta_0$:
$$\frac{\partial{L}}{\partial{\theta_0}} = \frac{\partial{L}}{\partial{\hat{y}}} \cdot \frac{\partial{\hat{y}}}{\partial{\theta_0}} = -(y - \hat{y}) \cdot 1 = -(y - \hat{y})$$
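
One way to sanity-check chain-rule gradients like these is to compare them with finite-difference estimates. The sketch below does this for a single data point; the helper names and example values are hypothetical, and the step size `eps` is an arbitrary choice.

```python
def loss(theta0, theta1, x1, y):
    """L = 0.5 * (y - y_hat)^2 with y_hat = theta0 + theta1 * x1."""
    y_hat = theta0 + theta1 * x1
    return 0.5 * (y - y_hat) ** 2


def gradients(theta0, theta1, x1, y):
    """Analytic gradients of L obtained with the chain rule."""
    y_hat = theta0 + theta1 * x1
    dL_dy_hat = -(y - y_hat)
    dL_dtheta0 = dL_dy_hat * 1.0  # d(y_hat)/d(theta0) = 1
    dL_dtheta1 = dL_dy_hat * x1   # d(y_hat)/d(theta1) = x1
    return dL_dtheta0, dL_dtheta1


def numerical_gradients(theta0, theta1, x1, y, eps=1e-6):
    """Central finite-difference estimates of the same two gradients."""
    g0 = (loss(theta0 + eps, theta1, x1, y) - loss(theta0 - eps, theta1, x1, y)) / (2 * eps)
    g1 = (loss(theta0, theta1 + eps, x1, y) - loss(theta0, theta1 - eps, x1, y)) / (2 * eps)
    return g0, g1


# Hypothetical example values (assumptions, not from the text).
x1, y = 2.0, 5.0
theta0, theta1 = 1.0, 1.5

analytic = gradients(theta0, theta1, x1, y)  # y_hat = 4.0, so (-1.0, -2.0)
numeric = numerical_gradients(theta0, theta1, x1, y)
print(analytic, numeric)
```

The two results should agree up to floating-point error, which is a cheap way to catch a sign or index mistake in a hand derivation.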

We update the learnable parameters using their gradients and the learning rate. The learning rate $\eta$ is a tunable hyper-parameter.
$$\theta_1 = \theta_1 - \eta \frac{\partial{L}}{\partial{\theta_1}}$$
$$\theta_0 = \theta_0 - \eta \frac{\partial{L}}{\partial{\theta_0}}$$
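
A minimal sketch of one such update on a single data point is shown below. The function name `gradient_step`, the example values and the learning rate `eta = 0.1` are assumptions made for illustration.

```python
def gradient_step(theta0, theta1, x1, y, eta):
    """Apply one gradient descent update to both parameters."""
    y_hat = theta0 + theta1 * x1
    error = y - y_hat
    grad_theta0 = -error       # dL/d(theta0)
    grad_theta1 = -error * x1  # dL/d(theta1)
    # Move each parameter against its gradient, scaled by the learning rate.
    return theta0 - eta * grad_theta0, theta1 - eta * grad_theta1


def loss(theta0, theta1, x1, y):
    return 0.5 * (y - (theta0 + theta1 * x1)) ** 2


# Hypothetical example values (assumptions, not from the text).
x1, y = 2.0, 5.0
theta0, theta1 = 1.0, 1.5
eta = 0.1

new_theta0, new_theta1 = gradient_step(theta0, theta1, x1, y, eta)
loss_before = loss(theta0, theta1, x1, y)
loss_after = loss(new_theta0, new_theta1, x1, y)
print(new_theta0, new_theta1, loss_before, loss_after)
```

With these values the parameters move to roughly $\theta_0 = 1.1$ and $\theta_1 = 1.7$, and the loss on this point drops from $0.5$ to about $0.125$.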

We often initialise the parameters $\theta$ to random values. In some cases we can use prior information to start somewhere better.

## Simplified understanding
Imagine you are blindfolded in a valley and your goal is to reach its lowest point. How would you do it?

1. Feel which way the ground slopes
2. Take small steps downhill
3. Repeat until you can't go any lower

This is what *gradient descent* does. The 'valley' is your loss function (how wrong your model is), and you're trying to find the lowest point (where your model makes the fewest mistakes).

### Key components
- Loss function: Measures how **wrong** the model's predictions are
- Parameters ($\theta$): Numbers your model adjusts to get better predictions
- Gradient: Slope that tells you which direction to move the parameters
- Learning rate ($\eta$): How big a step to take in that direction

### Process
1. Start with random parameters
2. Make predictions with your model
3. Calculate how wrong you are (loss)
4. Find which direction to change parameters to reduce loss (gradient)
5. Update parameters by taking a small step in that direction
6. Repeat until loss stops decreasing
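
The six steps above can be sketched as a short loop for single-variable linear regression. This is an illustrative sketch, not a definitive implementation: the toy data, the function names, the learning rate, the stopping tolerance, and the choice to average the gradient over the whole dataset (batch gradient descent) are all assumptions.

```python
import random


def mean_loss(theta0, theta1, xs, ys):
    """Average of 0.5 * (y - y_hat)^2 over the dataset."""
    return sum(0.5 * (y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, eta=0.05, max_steps=10_000, tol=1e-12, seed=0):
    n = len(xs)
    rng = random.Random(seed)
    # 1. Start with random parameters.
    theta0, theta1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    prev_loss = mean_loss(theta0, theta1, xs, ys)
    loss = prev_loss
    for _ in range(max_steps):
        # 2-3. Make predictions and measure the errors.
        errors = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
        # 4. Gradients of the mean loss with respect to each parameter.
        grad0 = -sum(errors) / n
        grad1 = -sum(e * x for e, x in zip(errors, xs)) / n
        # 5. Take a small step against the gradient.
        theta0 -= eta * grad0
        theta1 -= eta * grad1
        # 6. Stop once the loss no longer decreases noticeably.
        loss = mean_loss(theta0, theta1, xs, ys)
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return theta0, theta1, loss


# Toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]

theta0, theta1, final_loss = fit(xs, ys)
print(theta0, theta1, final_loss)
```

On this noise-free data the fitted parameters should end up close to $\theta_0 = 1$ and $\theta_1 = 2$. A learning rate that is too large would make the loss grow instead of shrink, which is why $\eta$ needs tuning.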

![Title](../images/lr_gd.gif)

#### Why gradient descent?
Why not just solve $\frac{\partial{L}}{\partial{\theta_1}} = \frac{\partial{L}}{\partial{\theta_0}} = 0$ to find the parameters at the minimum of the loss function?

In deep learning, it is common for a model to have thousands or even millions of parameters, which creates a high-dimensional parameter space. It is usually computationally intractable, or even impossible, to solve for the global minimum of such a loss function analytically. In very complex deep learning problems, we often settle for a local minimum.
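
For contrast, single-variable linear regression is simple enough that setting both gradients of the mean loss to zero can be solved directly, giving $\theta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\theta_0 = \bar{y} - \theta_1\bar{x}$. The sketch below applies this to hypothetical toy data and checks that both gradients vanish at the solution. No comparable shortcut is generally available for deep networks, which is where gradient descent becomes necessary.

```python
def closed_form_fit(xs, ys):
    """Solve dL/d(theta0) = dL/d(theta1) = 0 for single-variable linear regression."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)
    theta1 = cov_xy / var_x
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


def mean_gradients(theta0, theta1, xs, ys):
    """Gradients of the mean of 0.5 * (y - y_hat)^2 over the dataset."""
    n = len(xs)
    errors = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    grad0 = -sum(errors) / n
    grad1 = -sum(e * x for e, x in zip(errors, xs)) / n
    return grad0, grad1


# Hypothetical noisy toy data (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

theta0, theta1 = closed_form_fit(xs, ys)
grad0, grad1 = mean_gradients(theta0, theta1, xs, ys)
print(theta0, theta1, grad0, grad1)
```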

<img src="../images/gd_multi_modal.gif" width="750" align="center">