# Lesson 4: Understanding and Implementing RMSProp in Python

## Introduction to RMSProp
Hello! Today, we will dive into **RMSProp** (Root Mean Square Propagation). This sophisticated optimization algorithm accelerates convergence by adapting the learning rate for each weight separately, addressing the limitations of previous techniques such as Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, and momentum. Our focus today is understanding RMSProp and coding it from scratch in Python to optimize multivariable functions.

## Recap on Gradient Descent Techniques
Let's begin with a quick recap: SGD and Mini-Batch Gradient Descent can be sensitive to learning rates and may converge slowly. Even momentum, which mitigates these issues to an extent, has limitations. When a uniform learning rate is applied across all parameters, efficient optimization might not be achieved. This is where RMSProp steps in to offer a solution.

## Understanding RMSProp
RMSProp adjusts the gradient descent step for each weight individually, which often speeds up training and convergence. It does this by keeping a running average of the squared gradients for every weight and dividing that weight's step by the square root of the average, so the effective learning rate shrinks where gradients have been large and grows where they have been small.

### RMSProp Mathematically
RMSProp adds one more ingredient to the update rule of SGD: each update is divided by the square root of an exponentially decaying average of recent squared gradients. Here, the gradient measures how strongly, and in which direction, the cost changes with each weight. The update rule is:

$$
s_{dw} = \rho \cdot s_{dw} + (1 - \rho) \cdot dw^2
$$

$$
w = w - \frac{\alpha \cdot dw}{\sqrt{s_{dw}} + \epsilon}
$$

Where:

- **\( w \)** is the parameter vector,
- **\( dw \)** is the gradient of the cost function with respect to the parameters, evaluated at the current parameter values,
- **\( \alpha \)** is the learning rate,
- **\( s_{dw} \)** is the running average of the squared gradients (initialized to 0),
- **\( \rho \)** is the decay rate of that running average (a new hyperparameter, generally set to 0.9), and
- **\( \epsilon \)** is a small additive constant that ensures numerical stability by avoiding division by zero.

A higher **\( \rho \)** gives older gradients more weight, so the running average changes more slowly and smooths out noise; a lower value makes it react faster to recent gradients. The square and the square root are applied element-wise, so every parameter gets its own scaling.
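To make the two equations concrete, here is a minimal sketch of a single RMSProp step for two parameters whose gradients differ by three orders of magnitude. The gradient values and variable names are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative values (assumptions, not prescribed by the algorithm)
alpha, rho, epsilon = 0.1, 0.9, 1e-6
dw = np.array([10.0, 0.01])      # gradients on very different scales
s_dw = np.zeros_like(dw)         # running average starts at 0

# First equation: update the running average of squared gradients
s_dw = rho * s_dw + (1 - rho) * dw**2

# Second equation: scale each gradient by the root of its own average
step = alpha * dw / (np.sqrt(s_dw) + epsilon)

print(step)
```

On this first step, both entries of `step` come out close to \( \alpha / \sqrt{1 - \rho} \approx 0.316 \), even though the raw gradients differ by a factor of 1000. That is the per-parameter scaling at work.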

## RMSProp in Python Code
Let's now turn the RMSProp update into Python code. We will define an `RMSProp` function that takes the learning rate, the decay rate \( \rho \), a small number \( \epsilon \), the current gradient, and the previous running average of squared gradients (initialized to 0). It returns the update to subtract from the parameters, together with the new running average.

```python
import numpy as np


def RMSProp(learning_rate, rho, epsilon, grad, s_prev):
    # Update the running average of squared gradients (element-wise)
    s = rho * s_prev + (1 - rho) * np.power(grad, 2)

    # Scale each gradient component by the root of its own running average
    updates = learning_rate * grad / (np.sqrt(s) + epsilon)
    return updates, s
```
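Because the function keeps no state of its own, the caller has to pass the returned running average back in on the next call. The sketch below repeats the function so it runs on its own and calls it twice with a made-up, constant gradient.

```python
import numpy as np

def RMSProp(learning_rate, rho, epsilon, grad, s_prev):
    # Same function as above, repeated so this snippet is self-contained
    s = rho * s_prev + (1 - rho) * np.power(grad, 2)
    updates = learning_rate * grad / (np.sqrt(s) + epsilon)
    return updates, s

grad = np.array([2.0, -4.0])   # made-up gradient, held constant for the demo
s = np.zeros(2)                # running average starts at 0

step1, s = RMSProp(0.1, 0.9, 1e-6, grad, s)   # first call
step2, s = RMSProp(0.1, 0.9, 1e-6, grad, s)   # second call reuses the returned s

print(step1)
print(step2)
```

With an unchanged gradient, the second step is smaller in magnitude than the first, because the running average has accumulated more of the squared gradient. Each update also keeps the sign of its gradient component.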

## Application of RMSProp on Multivariable Function Optimization
Now let's apply RMSProp to find the minimum of the two-variable function \( f(x, y) = x^2 + y^2 \). Its partial derivatives are \( \frac{\partial f}{\partial x} = 2x \) and \( \frac{\partial f}{\partial y} = 2y \). We start from \( (x, y) = (5, 4) \), pick common choices for the hyperparameters (\( \rho = 0.9 \), \( \epsilon = 10^{-6} \), and \( \text{learning\_rate} = 0.1 \)), and run the optimizer for epochs 0 through 100, which is 101 updates in total.

```python
def f(x, y):
    return x**2 + y**2

def df(x, y):
    return np.array([2*x, 2*y])

coordinates = np.array([5.0, 4.0])
learning_rate = 0.1
rho = 0.9
epsilon = 1e-6
max_epochs = 100

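# Running average of squared gradients, one float entry per coordinate
s_prev = np.zeros(2)

# range(max_epochs + 1) so that the final epoch (100) is also printed
for epoch in range(max_epochs + 1):
    grad = df(coordinates[0], coordinates[1])
    updates, s_prev = RMSProp(learning_rate, rho, epsilon, grad, s_prev)
    coordinates -= updates  # step against the scaled gradient
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, current state: {coordinates}")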
```

### Output:
```text
Epoch 0, current state: [4.68377233 3.68377236]
Epoch 20, current state: [2.3688824  1.47561697]
Epoch 40, current state: [0.95903672 0.35004133]
Epoch 60, current state: [0.13761293 0.00745214]
Epoch 80, current state: [3.91649374e-04 3.12725069e-09]
Epoch 100, current state: [-3.07701828e-17  2.18862195e-20]
```
As you can see, both \( x \) and \( y \) approach 0, and \( (0, 0) \) is indeed the minimum of the given function.

## Evaluation of RMSProp Over Other Gradient Descent Techniques
Lastly, we can compare RMSProp with SGD, Mini-Batch Gradient Descent, or Momentum-based Gradient Descent by examining how many iterations each one needs to reach the minimum of a cost function. For a simple two-variable function like the one in our example, where both directions have the same curvature, RMSProp may not show significant differences. Its advantage tends to appear when gradients differ widely in scale across parameters, which is common in large machine learning models.

Because each gradient is divided by the root of a moving average of its own squared values, updates in directions with large or noisy gradients are damped, while directions with small gradients take relatively larger steps. This often reduces oscillations and leads to quicker, more stable convergence, which is why RMSProp is widely used for deep learning models trained on large datasets.
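To see where the difference can show up, here is a rough sketch comparing plain gradient descent and RMSProp on a function that is much steeper in one direction. The test function \( f(x, y) = x^2 + 100y^2 \), the helper names, and the gradient descent learning rate of 0.009 are assumptions made for this illustration; the RMSProp hyperparameters are the ones used earlier in the lesson.

```python
import numpy as np

def f(p):
    # Hypothetical test function: 100 times steeper in y than in x
    return p[0]**2 + 100.0 * p[1]**2

def grad_f(p):
    # Gradient of f(x, y) = x^2 + 100 * y^2
    return np.array([2.0 * p[0], 200.0 * p[1]])

steps = 100
start = np.array([5.0, 4.0])

# Plain gradient descent: one learning rate for both coordinates.
# It must stay below 0.01 here, or the steep y direction diverges.
p_gd = start.copy()
for _ in range(steps):
    p_gd -= 0.009 * grad_f(p_gd)

# RMSProp with the lesson's hyperparameters
p_rms = start.copy()
s = np.zeros(2)
learning_rate, rho, epsilon = 0.1, 0.9, 1e-6
for _ in range(steps):
    g = grad_f(p_rms)
    s = rho * s + (1 - rho) * g**2
    p_rms -= learning_rate * g / (np.sqrt(s) + epsilon)

print("Gradient descent:", p_gd, "f =", f(p_gd))
print("RMSProp:         ", p_rms, "f =", f(p_rms))
```

With these particular settings, plain gradient descent still has \( x \) at roughly 0.8 after 100 steps, because its single learning rate is capped by the steep \( y \) direction. RMSProp normalizes each coordinate separately, so both coordinates move at a similar pace. Treat this as an illustration rather than a benchmark: the outcome depends on the function, the learning rates, and the number of steps.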

## Conclusion
Well done! You now understand how RMSProp works and can code it in Python. By scaling each parameter's step with a running average of its squared gradients, RMSProp often converges faster than a single fixed learning rate would, which makes it a valuable tool in your machine learning toolbox.

Next, we will have hands-on exercises for you to practice and reinforce these new concepts. Remember, practice strengthens learning and expands understanding. Happy coding!
