# rmsprop
How about a machine learning topic today? Rmsprop is a gradient-based optimization technique proposed by Geoffrey Hinton in his [Neural Networks for Machine Learning](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) Coursera course.

The concept of neural networks has been known for decades, but for a long time researchers failed to train networks of even moderate complexity. There were several reasons for that, and one of the hardest to address was the magnitude of the gradient.

Gradients of very complex functions like neural networks tend to either vanish or explode as they are propagated through the function. The effect is cumulative: the more complex the function is, the worse the problem becomes.

Rmsprop is a very clever way to deal with the problem. It keeps a moving average of squared gradients and divides the gradient by the square root of that average. This balances the step size: the step shrinks for large gradients to avoid exploding and grows for small gradients to avoid vanishing.
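
To make the balancing effect concrete, here is a minimal sketch of the very first rmsprop step for a tiny and a huge gradient. The helper name `first_rmsprop_step` and the default values (`lr=0.001`, `decay=0.9`, `eps=1e-8`) are assumptions chosen for illustration, and the moving average is assumed to start at zero.

```python
import numpy as np

def first_rmsprop_step(grad, lr=0.001, decay=0.9, eps=1e-8):
    # hypothetical helper: size of the first update when the moving average starts at zero
    mean_sqr = (1 - decay) * grad ** 2
    # dividing by the root mean square cancels most of the gradient's magnitude
    return lr * grad / (np.sqrt(mean_sqr) + eps)

small = first_rmsprop_step(np.array([1e-4]))  # nearly vanished gradient
large = first_rmsprop_step(np.array([1e4]))   # exploding gradient
print(small, large)
```

Both steps come out close to `lr / sqrt(1 - decay)`, roughly 0.0032, even though the two gradients differ by eight orders of magnitude.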

I have implemented three gradient-based techniques.

* [simple] gradient descent
* rmsprop
* rmsprop with momentum

I will use them to find the inverse of a matrix `A`. The loss function is the sum of squared differences between `AX` and the identity matrix `I`, given here together with its gradient.

$$ AX = I $$
$$ F(X) = \sum_{i,j}(AX-I)_{i,j}^2 $$
$$ \nabla F(X) = 2A^T (AX-I) $$
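
The gradient formula can be checked numerically with central finite differences. This is only a sketch: the random 3×3 matrices, the seed and the helper name `numerical_gradient` are assumptions made for the check, not part of the experiment in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # small random matrix, used only for this check
X = rng.normal(size=(3, 3))   # arbitrary point at which the gradient is compared
I = np.eye(3)

def F(X):
    return np.sum((A @ X - I) ** 2)

def dF(X):
    return 2 * A.T @ (A @ X - I)

def numerical_gradient(F, X, h=1e-6):
    # hypothetical helper: perturb one entry of X at a time
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = h
        grad[idx] = (F(X + step) - F(X - step)) / (2 * h)
    return grad

max_error = np.max(np.abs(dF(X) - numerical_gradient(F, X)))
print(max_error)
```

Since `F` is quadratic in `X`, the central difference should agree with the analytic gradient up to rounding error.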

Here is a plot of the loss function for each method against the number of steps.

![plot](resource/day69-rmsprop.png)

I have to say, this example is not representative. Rmsprop was developed as a stochastic technique for mini-batch learning, and on this toy function it has little chance to show how good it really is.

It is no surprise that momentum is the real deal, yet plain rmsprop still does quite well.

In [1]:
import numpy as np
from bokeh.plotting import figure, show, output_notebook

## algorithm

In [2]:
def gradient_descent(F, dF, x, steps=100, lr=0.001):
    loss = []

    for _ in range(steps):
        dx = dF(x)
        # plain step against the gradient; note that x is updated in place
        x -= lr * dx
        loss.append(F(x))

    return x, loss

In [3]:
def rmsprop(F, dF, x, steps=100, lr=0.001, decay=.9, eps=1e-8):
    loss = []
    # moving average of squared gradients, one entry per parameter
    dx_mean_sqr = np.zeros(x.shape, dtype=float)

    for _ in range(steps):
        dx = dF(x)
        dx_mean_sqr = decay * dx_mean_sqr + (1 - decay) * dx ** 2
        # normalize the gradient by its root mean square; eps avoids division by zero
        x -= lr * dx / (np.sqrt(dx_mean_sqr) + eps)
        loss.append(F(x))

    return x, loss

In [4]:
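def rmsprop_momentum(F, dF, x, steps=100, lr=0.001, decay=.9, eps=1e-8, mu=.9):
    loss = []
    # moving average of squared gradients, one entry per parameter
    dx_mean_sqr = np.zeros(x.shape, dtype=float)
    # accumulated step carried over between iterations
    momentum = np.zeros(x.shape, dtype=float)

    for _ in range(steps):
        dx = dF(x)
        dx_mean_sqr = decay * dx_mean_sqr + (1 - decay) * dx ** 2
        # keep a fraction mu of the previous step and add the normalized gradient step
        momentum = mu * momentum + lr * dx / (np.sqrt(dx_mean_sqr) + eps)
        x -= momentum
        loss.append(F(x))

    return x, loss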
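
Written out as update rules, the three functions above correspond to the following. The symbols are notation introduced here for the function arguments: $\eta$ is `lr`, $\rho$ is `decay`, $\mu$ is `mu`, $\epsilon$ is `eps`, $v_t$ is `dx_mean_sqr` and $m_t$ is `momentum`, with $v_0 = 0$ and $m_0 = 0$. Squares, square roots and divisions are element-wise.

$$ g_t = \nabla F(X_t) $$
$$ \text{gradient descent:} \quad X_{t+1} = X_t - \eta\, g_t $$
$$ \text{rmsprop:} \quad v_t = \rho\, v_{t-1} + (1 - \rho)\, g_t^2, \qquad X_{t+1} = X_t - \eta\, \frac{g_t}{\sqrt{v_t} + \epsilon} $$
$$ \text{rmsprop with momentum:} \quad m_t = \mu\, m_{t-1} + \eta\, \frac{g_t}{\sqrt{v_t} + \epsilon}, \qquad X_{t+1} = X_t - m_t $$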

## function

In [5]:
def F(x):
    residual = A @ x - np.eye(len(A), dtype=float)
    return np.sum(residual ** 2)

In [6]:
def dF(x):
    return 2 * A.T @ (A @ x - np.eye(len(A), dtype=float))

In [7]:
A = np.array([
    [2, 5, 1, 4, 6],
    [3, 5, 0, 0, 0],
    [1, 1, 0, 3, 8],
    [6, 6, 2, 2, 1],
    [8, 3, 5, 1, 4],
], dtype=float)

## optimization

In [8]:
X, loss1 = gradient_descent(F, dF, A * 0, steps=300)
(A @ X).round(2), loss1[-1]

(array([[ 0.79, -0.01,  0.18,  0.19, -0.08],
        [-0.01,  0.8 ,  0.  ,  0.2 , -0.07],
        [ 0.18,  0.  ,  0.85, -0.15,  0.07],
        [ 0.19,  0.2 , -0.15,  0.66,  0.13],
        [-0.08, -0.07,  0.07,  0.13,  0.95]]), 0.5469198476714346)

In [9]:
X, loss2 = rmsprop(F, dF, A * 0, steps=300)
(A @ X).round(2), loss2[-1]

(array([[ 0.84, -0.05,  0.1 ,  0.1 , -0.04],
        [-0.04,  0.82,  0.03,  0.19, -0.03],
        [ 0.12,  0.03,  0.9 , -0.08,  0.04],
        [ 0.15,  0.2 , -0.12,  0.75,  0.07],
        [-0.08, -0.09,  0.04,  0.1 ,  0.99]]), 0.32394771623791685)

In [10]:
X, loss3 = rmsprop_momentum(F, dF, A * 0, steps=300)
(A @ X).round(2), loss3[-1]

(array([[ 0.99,  0.01,  0.  , -0.01,  0.  ],
        [-0.  ,  1.  ,  0.  , -0.  ,  0.  ],
        [-0.  ,  0.01,  1.  , -0.01,  0.  ],
        [-0.01,  0.01,  0.  ,  0.99,  0.  ],
        [-0.01,  0.01,  0.  , -0.01,  1.  ]]), 0.0006230388777262454)
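
As a sanity check on these results, a candidate `X` can be scored against what a direct solver gives. The sketch below uses `np.linalg.inv` as the reference; the helper name `inverse_loss` is hypothetical, and `A` is repeated so that the snippet runs on its own.

```python
import numpy as np

A = np.array([
    [2, 5, 1, 4, 6],
    [3, 5, 0, 0, 0],
    [1, 1, 0, 3, 8],
    [6, 6, 2, 2, 1],
    [8, 3, 5, 1, 4],
], dtype=float)

def inverse_loss(A, X):
    # hypothetical helper: sum of squared entries of AX - I, the same loss as F
    residual = A @ X - np.eye(len(A))
    return np.sum(residual ** 2)

A_inv = np.linalg.inv(A)                         # direct inverse used as the reference
reference_loss = inverse_loss(A, A_inv)          # zero up to rounding error
start_loss = inverse_loss(A, np.zeros_like(A))   # loss at the all-zero starting point
print(reference_loss, start_loss)
```

Any `X` returned by the optimizers can be scored the same way with `inverse_loss(A, X)`. The all-zero starting point scores exactly 5, one unit per diagonal entry of `I`, which puts the final losses above into perspective.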

In [11]:
output_notebook()

plot = figure()
# legend_label replaces the legend keyword, which newer Bokeh releases no longer accept
plot.line(x=range(len(loss1)), y=loss1, color='steelblue', legend_label='gd')
plot.line(x=range(len(loss2)), y=loss2, color='green', legend_label='rmsprop')
plot.line(x=range(len(loss3)), y=loss3, color='red', legend_label='rmsprop+momentum')

show(plot)