# Gradient Descent Explained

In calculus, when we want to minimize or maximize a function, we take its first derivative and set it to 0. Why?

Probably this is too far into the trek without reaching the basecamp. So let's reach the basecamp first.

Why do we need a model when we have the data points that we can use to infer the patterns? We need a model on those data points (perhaps also with some regularization and bias considerations) because we are not aware of the actual function that created that data. We are only trying to guess what that original function could have been. This guess is our model.

When we have a function that defines our model (a.k.a. the hypothesis), we realise that it could produce very different outputs depending on the values of its weights/parameters, so it is not a good idea to go with the first set of weights that seems decent enough. Since we don't know the actual function that generated the data, we should at least define the best hypothesis possible, where by best I mean the one that fits the data as well as it can.

How do we measure the performance of different models (varying weights) for the given data?

A cost/loss function measures the error in the estimates of the model/function/hypothesis. So we need to minimize the loss function in order to shrink the errors between the actual and predicted values. Now what could we change in the model to move the predictions closer to the actual values? Of course, the weights! Don't forget the distinction: the hypothesis/model is a function of the data, with the weights held fixed for any one hypothesis, and it gives us the predicted values. The cost function, on the other hand, is a function of the weights, with the data held fixed, and it gives us the error/loss between the actual and predicted values for that particular choice of weights.
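To make the distinction concrete, here is a minimal sketch. The toy data, the one-weight model `predict`, and the choice of mean squared error for `cost` are all assumptions made for illustration; they are not prescribed by anything above.

```python
# Hypothetical toy data, secretly generated by y = 2x (the "unknown" function).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def predict(x, w):
    """Hypothesis: for a fixed weight w, this is a function of the data x."""
    return w * x


def cost(w):
    """Mean squared error: for fixed data, this is a function of the weight w."""
    squared_errors = [(predict(x, w) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)


print(cost(2.0))  # the weight that generated the data gives zero error
print(cost(1.0))  # a worse weight gives a larger error
```

Notice that `predict` takes the data as its input, while `cost` takes only the weight: the same split described in the paragraph above.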

So then how do we figure out which function fits the data the best? By trying out different weights for that function, where each set of weights would give us a different estimate of the error and we choose the one where the error is the least. This error is also called loss.
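As a rough sketch of "trying out different weights", the snippet below evaluates a mean squared error on a handful of hand-picked candidates and keeps the best one. The toy data, the one-weight model `w * x`, and the candidate list are assumptions for illustration only.

```python
# Hypothetical toy data, generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def cost(w):
    """Mean squared error of the model y = w * x on the toy data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# A small, hand-picked set of weights to try.
candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
losses = {w: cost(w) for w in candidates}

# Keep the candidate with the lowest loss.
best_w = min(losses, key=losses.get)
print(best_w, losses[best_w])
```

This only finds the right answer because the true weight happens to be in the list.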

But! 

How do we know which weights to choose? Do we try all the available numbers in the world? Is that computationally possible? No!

Enter **Gradient Descent**



Gradient descent is an optimization algorithm used to determine the values of parameters (coefficients) of a function \( f \) that minimize a cost function.

It is particularly effective when the parameters cannot be calculated analytically (e.g., using linear algebra) and must instead be found through an optimization process.

Let's say we have a cost function which is dependent on a single weight value w, shown below:


( add image here)



On the x-axis is w and on the y-axis is the value of the cost function. The curve shows the cost \( J(w) \) for different values of w. Remember, this is the same cost function that we wanted to minimize, but we didn't know the exact value of w that brings it to its minimum.

Here we can visually observe that point but how do we navigate our w value towards it programmatically?

Enter **Calculus**



So how do we get to the lowest point on that curve, the one that gives us the minimum error and the weight values that occur at that point?

### Presenting!!!
#### How Gradient Descent Works

1. **Initialization**: Start with initial guesses for the parameters. These can be random values or zeros.
2. **Compute the Gradient**: Calculate the gradient (partial derivatives) of the cost function with respect to each parameter. The gradient indicates the direction and rate of fastest increase of the cost function.
3. **Update the Parameters**: Adjust the parameters in the opposite direction of the gradient. This step is repeated iteratively until the cost function converges to a minimum value.

    \[
    \theta = \theta - \alpha \nabla J(\theta)
    \]

    Where:
    - \( \theta \) represents the parameters.
    - \( \alpha \) is the learning rate, a small positive number that controls the step size.
    - \( \nabla J(\theta) \) is the gradient of the cost function \( J \) with respect to \( \theta \).

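4. **Convergence**: The algorithm stops when the cost function stops decreasing meaningfully (for example, when the gradient or the change in cost falls below a small tolerance), which means the parameters have settled at a (possibly local) minimum.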
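The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the cost `J(w) = (w - 3)^2`, the tolerance `tol`, and the iteration cap `max_iters` are all assumptions chosen so that the example is easy to follow.

```python
def J(w):
    """Hypothetical cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_J(w):
    """Derivative of J with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w0, alpha=0.1, tol=1e-8, max_iters=10_000):
    w = w0                        # step 1: initialization
    for _ in range(max_iters):
        g = grad_J(w)             # step 2: compute the gradient
        if abs(g) < tol:          # step 4: stop once the slope is nearly flat
            break
        w = w - alpha * g         # step 3: move against the gradient
    return w


w_min = gradient_descent(w0=0.0)
print(w_min, J(w_min))  # w_min ends up very close to 3
```

The `max_iters` cap is a safety net: if the learning rate is badly chosen, the tolerance check may never trigger.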


So let's say we randomly chose a value for w which lands us at point A on the curve for the loss function. Fine, the next step is to compute the gradient with respect to w.

But what really is a gradient?

A gradient is a vector that points in the direction of the steepest increase of a function, and its magnitude represents the rate of increase. For a function of a single variable it is just the slope that we used to study in school; geometrically, the gradient generalizes the concept of slope to multiple dimensions. The visualization below shows this:

(3rd quadrant pic)

(PS: A tangent line or plane touches a curve or surface at a single point and has the same slope or direction as the curve or surface at that point.)
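To see that the gradient is nothing more than "one slope per dimension", here is a small sketch that estimates it numerically with central differences. The helper `numerical_gradient` and the two example functions are hypothetical names introduced only for this illustration.

```python
def numerical_gradient(f, point, h=1e-6):
    """Estimate the gradient of f at `point` using central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h           # nudge only the i-th coordinate up
        backward[i] -= h          # and down
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def parabola(p):
    """One variable: f(w) = w^2, whose slope at w is 2w."""
    return p[0] ** 2


def bowl(p):
    """Two variables: f(a, b) = a^2 + 3b^2, whose gradient is (2a, 6b)."""
    return p[0] ** 2 + 3 * p[1] ** 2


print(numerical_gradient(parabola, [2.0]))   # approximately [4.0]
print(numerical_gradient(bowl, [1.0, 1.0]))  # approximately [2.0, 6.0]
```

With one variable the result is a single number, the ordinary slope; with two variables it is a vector holding the slope along each axis.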



We call this algorithm gradient descent because the idea is to descend down the curve to where the minimum is supposed to be. Okay, so we gotta get down the curve now from the random point where we are right now. So how do we move down? We have to try the next value of w that takes us closer, but which value should we try?

Here comes the third step of gradient descent: update w using the slope at the current point. If the slope is positive (see point A in the illustration), the minimum is to the left, because we are currently on the rising side of the curve, to the right of it. So, as per the equation in step 3 above, we reduce the value of w, using ``alpha`` as the learning rate to control how big an update we make to w. On the other hand, if the slope/gradient is negative (see point B in the illustration), the minimum is to the right of us and we must move w to the right. The minus sign in the update times the negative slope gives a positive change, so w is increased this time.
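Here is the same sign argument worked through with numbers. The cost `J(w) = (w - 3)^2`, the learning rate, and the positions chosen for points A and B are assumptions for illustration.

```python
def slope(w):
    """Derivative of the hypothetical cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


alpha = 0.1

# Point A sits to the right of the minimum, so the slope is positive (+4).
w_a = 5.0
w_a_next = w_a - alpha * slope(w_a)   # subtracting a positive number: w decreases

# Point B sits to the left of the minimum, so the slope is negative (-4).
w_b = 1.0
w_b_next = w_b - alpha * slope(w_b)   # subtracting a negative number: w increases

print(w_a, "->", w_a_next)  # moves left, towards 3
print(w_b, "->", w_b_next)  # moves right, towards 3
```

In both cases the single update rule moves w towards the minimum without us ever telling it which side it started on.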


<image 4>

It is interesting what happens as we move closer to the minimum. The magnitude of the gradient starts to decrease as the slope becomes flatter and flatter, so the updates to w get smaller too. Eventually the slope/gradient is equal to, or almost equal to, 0. That is our minimum: the w for which the loss function value is the least. This is the same condition as setting the first derivative to 0 in calculus. We now have the w that fits the data the best.
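A quick way to watch the slope flatten out is to record the size of the gradient at every step. This is a sketch under assumed values: the cost `J(w) = (w - 3)^2`, a start at `w = 0`, and a learning rate of `0.25`.

```python
def grad_J(w):
    """Derivative of the hypothetical cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, alpha = 0.0, 0.25
history = []                  # gradient magnitude seen at each step

for step in range(8):
    g = grad_J(w)
    history.append(abs(g))
    w = w - alpha * g         # a smaller gradient means a smaller update

print(history)  # each entry is smaller than the one before it
print(w)        # w has crept close to 3
```

Because the update is `alpha` times the gradient, the steps shrink automatically as the curve flattens, even though the learning rate itself never changes.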

### Considerations

#### 1. Learning Rate


The learning rate must be chosen wisely:
1. If it is too small, the model will take a long time to learn.
2. If it is too large, the model may fail to converge: each update overshoots and we'll never settle at the minimum.


![image.png](attachment:image.png)

Source: https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c
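Both failure modes are easy to reproduce on a toy problem. The sketch below assumes the cost `J(w) = w^2` (gradient `2w`, minimum at 0), a start at `w = 1`, and 20 steps; the three learning rates are picked purely to show the contrast.

```python
def run(alpha, w0=1.0, steps=20):
    """Run gradient descent on the hypothetical cost J(w) = w^2."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # the gradient of w^2 is 2w
    return w


print(run(alpha=0.001))  # too small: w has barely moved away from 1
print(run(alpha=0.1))    # reasonable: w is close to the minimum at 0
print(run(alpha=1.1))    # too large: w overshoots and ends up further away than it started
```

With the large rate, every step jumps across the minimum and lands further up the opposite side, so the error grows instead of shrinking.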


#### 2. Feature Scaling

Preprocessing the data so that features are on a similar scale can significantly improve the performance of gradient descent. With unscaled data, the cost function is very elongated and asymmetric, so the algorithm can take much longer to converge and only a small learning rate will work.


![image-3.png](attachment:image-3.png)


![image-2.png](attachment:image-2.png)




Source: https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs
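One common way to put features on a similar scale is standardization: subtract the mean and divide by the standard deviation. The sketch below is one possible approach, not the only one; the feature names and values are made up, and the helper assumes that no column is constant (otherwise the standard deviation would be zero).

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)    # assumes the column is not constant
    return [(value - mu) / sigma for value in column]


# Hypothetical features on very different scales.
house_size_sqft = [800.0, 1200.0, 1500.0, 2000.0, 2500.0]
num_bedrooms = [1.0, 2.0, 3.0, 3.0, 4.0]

size_scaled = standardize(house_size_sqft)
rooms_scaled = standardize(num_bedrooms)

print(size_scaled)
print(rooms_scaled)
```

After scaling, both features vary over a comparable range, so a single learning rate is more likely to suit every weight.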

#### When to Use Gradient Descent

Gradient descent is particularly useful when:

1. **Analytical Solutions are Infeasible**: For many complex models, the parameters cannot be calculated analytically using linear algebra or other closed-form solutions. Gradient descent provides a numerical solution.
2. **High-Dimensional Data**: It is effective in optimizing models with a large number of parameters, which is common in deep learning.
3. **Non-Convex Functions**: Gradient descent can still be applied to the non-convex cost functions often encountered in neural networks, where closed-form methods are unavailable, although it may settle in a local minimum rather than the global one.


#### Types of Gradient Descent

1. **Batch Gradient Descent**: Uses the entire dataset to compute the gradient at each step. It provides accurate updates but can be computationally expensive for large datasets.
2. **Stochastic Gradient Descent (SGD)**: Uses one data point at a time to compute the gradient. Each update is much cheaper to compute, and the noise can help it escape local minima, but the parameter updates are noisier.
3. **Mini-Batch Gradient Descent**: A compromise between batch and stochastic gradient descent. It uses a small random subset of the data (mini-batch) to compute the gradient. It balances the efficiency and accuracy of updates.
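The three variants differ only in how many data points feed each gradient computation, so one loop with a `batch_size` argument can sketch all of them. Everything here is an assumption for illustration: the noise-free toy data generated by `y = 2x`, the one-weight model, the mean squared error gradient, and the learning rate.

```python
import random

random.seed(0)  # fixed seed so the shuffling is repeatable

# Hypothetical toy data generated with a true weight of 2.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]


def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(batch_size, alpha=0.001, epochs=50):
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)                       # new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w = w - alpha * gradient(w, batch)     # one update per batch
    return w


w_batch = train(batch_size=len(xs))  # batch: the whole dataset per update
w_sgd = train(batch_size=1)          # stochastic: one data point per update
w_mini = train(batch_size=5)         # mini-batch: a small subset per update

print(w_batch, w_sgd, w_mini)  # all three end up close to 2
```

Because this toy data has no noise, all three variants agree; on real data the stochastic and mini-batch runs would wobble around the minimum rather than land on it exactly.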

