# Lesson 3: Accelerating Convergence: Implementing Momentum in Gradient Descent Algorithms

## Getting Started with Momentum
Hello! Today, we will learn about a powerful technique that makes our Gradient Descent move faster, like a ball rolling down a hill. We call this **Momentum**.

## What's Momentum and How It Works
Momentum improves our Gradient Descent. How does it do that? Remember how a ball on top of a hill starts rolling down? If the slope is steep, the ball picks up speed, right? That's what momentum does to our Gradient Descent. Each update carries over part of the previous update, so the steps get larger when the slope (our 'hill') keeps pointing in the same direction over time.
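
To make the "picking up speed" idea concrete, here is a minimal sketch. It assumes the velocity update introduced in the next section (`v = gamma * v + learning_rate * gradient`) and pretends the slope stays constant. The specific numbers are illustrative assumptions, not part of the lesson's main example.

```python
# Illustrative values only (assumptions for this sketch)
gamma = 0.9           # momentum parameter
learning_rate = 0.01
gradient = 1.0        # pretend the slope never changes

v = 0.0               # velocity starts at zero
velocities = []
for _ in range(200):
    # Keep part of the old velocity and add the new gradient step
    v = gamma * v + learning_rate * gradient
    velocities.append(v)

# With a constant gradient the velocity approaches this limit
terminal_velocity = learning_rate * gradient / (1 - gamma)

print("first step size:  ", velocities[0])
print("later step size:  ", velocities[-1])
print("theoretical limit:", terminal_velocity)
```

With a constant gradient, the velocity levels off near `learning_rate * gradient / (1 - gamma)`. For `gamma = 0.9`, that is ten times the size of a plain Gradient Descent step.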

## How to Add Momentum to Gradient Descent
Let's get down to coding! Before we write any code, here is the update rule we will implement. We will use a gradient function, `grad_func()`. The weight or parameter (`theta`) starts at some point and moves down the slope by adjusting itself in every iteration, or 'epoch', using these two updates:

$$
v := v \cdot \gamma + \alpha \cdot \text{gradient}
$$

$$
\theta := \theta - v
$$

Where:

- **θ** is the parameter vector,
- **gradient** is the gradient of the cost function with respect to the parameters, evaluated at the current parameter value,
- **α** is the learning rate,
- **v** is the velocity vector (initialized to 0), and
- **γ** is the momentum parameter (a new hyperparameter).

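**γ** is usually chosen between 0 and 1, and 0.9 is a common starting value. A higher **γ** gives past gradients more weight, which often speeds up convergence, but a value too close to 1 can make the parameter overshoot the minimum and oscillate before it settles. With **γ** = 0, the rule reduces to plain Gradient Descent.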
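
One way to explore how **γ** affects convergence is to run the update rule for a few values and compare the final cost. The sketch below assumes the toy cost `x**2`, the starting point, and the learning rate used later in this lesson. The helper `final_cost` and the list of γ values are illustrative choices, not part of the lesson's code.

```python
def grad_func(x):
    # Gradient of the assumed cost x**2
    return 2 * x

def final_cost(gamma, learning_rate=0.01, epochs=50, theta=4.0):
    """Run momentum Gradient Descent on x**2 and return the final cost."""
    v = 0.0  # velocity starts at zero
    for _ in range(epochs):
        v = gamma * v + learning_rate * grad_func(theta)
        theta = theta - v
    return theta ** 2

# gamma = 0.0 is plain Gradient Descent, included as a baseline
results = {gamma: final_cost(gamma) for gamma in (0.0, 0.5, 0.9, 0.99)}

for gamma, cost in results.items():
    print(f"gamma={gamma:<4} final cost={cost:.4f}")
```

In this setup, the largest γ is not necessarily the best: with γ very close to 1, the parameter tends to roll past the minimum and needs more epochs to settle.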

### Python Implementation:

```python
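gradient = grad_func(theta)               # slope at the current parameter value
v = gamma * v + learning_rate * gradient  # keep part of the old velocity, add the new step
theta = theta - v                         # move the parameter by the velocity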
```

We compute the gradient from the current parameters. Then, we calculate the new velocity `v`: the old velocity scaled by `gamma`, plus the learning rate times the gradient. Finally, we update our parameter by subtracting this velocity from it.
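
To see these three lines in action, here is a small trace of the first few updates. It assumes the toy cost `x**2` (so the gradient is `2 * x`), a starting point of 4.0, `gamma = 0.9`, and a learning rate of 0.01, the same values used in the comparison later in this lesson.

```python
def grad_func(x):
    # Gradient of the assumed cost x**2
    return 2 * x

gamma = 0.9
learning_rate = 0.01
theta = 4.0   # starting point
v = 0.0       # velocity starts at zero

trace = []
for step in range(1, 4):
    gradient = grad_func(theta)
    v = gamma * v + learning_rate * gradient
    theta = theta - v
    trace.append((step, gradient, v, theta))
    print(f"step {step}: gradient={gradient:.4f}  v={v:.4f}  theta={theta:.4f}")
```

The velocity grows from 0.08 to about 0.15 to about 0.21, so each step is larger than the last while the gradient keeps pointing the same way.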

## Compare Gradient Descents: Setup
Now let's set up a comparison to see how momentum aids in faster convergence (which means getting to the answer quicker). The following code snippet runs both methods from the same starting point:

```python
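import matplotlib.pyplot as plt
import numpy as np

def func(x):
    # Cost function: a simple parabola with its minimum at x = 0
    return x**2

def grad_func(x):
    # Derivative of x**2
    return 2*x

gamma = 0.9            # momentum parameter
learning_rate = 0.01
v = 0                  # velocity starts at zero
epochs = 50

# Both methods start from the same point
theta_plain = 4.0
theta_momentum = 4.0

# Parameter values recorded at each epoch, for plotting later
history_plain = []
history_momentum = []

for _ in range(epochs):
    # Plain Gradient Descent step
    history_plain.append(theta_plain)
    gradient = grad_func(theta_plain)
    theta_plain = theta_plain - learning_rate * gradient

    # Momentum-based Gradient Descent step
    history_momentum.append(theta_momentum)
    gradient = grad_func(theta_momentum)
    v = gamma * v + learning_rate * gradient
    theta_momentum = theta_momentum - v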
```

Here, we run plain Gradient Descent and Momentum-based Gradient Descent within one loop and record the parameter value at each epoch so we can visualize both runs later.

## Compare Gradient Descents: Visualization
Let's visualize the comparison:

```python
plt.figure(figsize=(12, 7))
plt.plot([func(theta) for theta in history_plain], label='Gradient Descent')
plt.plot([func(theta) for theta in history_momentum], label='Momentum-based Gradient Descent')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.legend()
plt.grid()
plt.show()
```

Here is the result:

![Momentum vs Gradient Descent](attachment:image.png)

Here, we compare Gradient Descent (without momentum) and Momentum-based Gradient Descent on the same function (**x²**). The graph shows how the cost (value of the function) changes over time (or epochs). The cost gets smaller faster for the Momentum-based method. That's because it gets a speed boost from the momentum, just like the ball rolling down the hill!
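
The same comparison can also be checked numerically instead of by eye. This sketch repeats the loop from the setup section with the same settings and prints the last recorded cost for each method. The check for whether the momentum run ever moved past the minimum at 0 is an addition for illustration.

```python
def func(x):
    return x ** 2

def grad_func(x):
    return 2 * x

gamma, learning_rate, epochs = 0.9, 0.01, 50
theta_plain = theta_momentum = 4.0
v = 0.0
history_plain, history_momentum = [], []

for _ in range(epochs):
    # Plain Gradient Descent
    history_plain.append(theta_plain)
    theta_plain -= learning_rate * grad_func(theta_plain)

    # Momentum-based Gradient Descent
    history_momentum.append(theta_momentum)
    v = gamma * v + learning_rate * grad_func(theta_momentum)
    theta_momentum -= v

print("final cost, plain:   ", func(history_plain[-1]))
print("final cost, momentum:", func(history_momentum[-1]))
# A negative theta means the parameter rolled past the minimum at 0
print("momentum crossed the minimum:", min(history_momentum) < 0)
```

Crossing the minimum is not an error. Like the rolling ball, the parameter can roll past the bottom of the hill and then come back.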

## Wrapping Up
You've done it! You've understood how to use momentum to improve Gradient Descent and seen it in action. Doesn't the ball-on-a-hill analogy make it easier to understand? Now, it's time to put your knowledge into practice! If you remember how a rolling ball picks up speed, you'll never forget how momentum improves Gradient Descent. Happy practicing and coding!
