This notebook lays the foundation for the other more advanced optimization algorithms.

# Exponentially Weighted Averages

The key intuition for exponentially weighted averages is encapsulated by this formula:

$$v_t = \beta \ v_{t-1} + (1 - \beta) \ \theta_t$$

We weight each observation $\theta_t$ in order to get a 'moving average', $v_t$, of the actual values. Suppose that $\theta_t$ is the temperature of a certain area on day $t$.

![Temperature](./images/temperature.png)

By applying the above formula, we can think of $v_t$ as approximately averaging the temperature over the last $\frac{1}{1 - \beta}$ days. Below is the curve that we get if we set $\beta = 0.9$, which averages over roughly $10$ days.

![Temperature Fit (small beta)](./images/temperature-fit-small.png)

What happens if we increase the value of $\beta$? Let's say we increase it to $\beta = 0.98$.

![Temperature Fit (large beta)](./images/temperature-fit-large.png)

We will get the green curve. Note that the curve is **smoother**, but it also shifts towards the **right**: because it averages over a longer window, it reacts more slowly when the temperature changes.
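
To make the 'number of days averaged' idea concrete, here is a small sketch (the helper name `effective_window` is my own, not from any library). It computes $\frac{1}{1 - \beta}$ for both values of $\beta$ and checks how much a reading that old still counts relative to today's reading. The weight comes out close to $\frac{1}{e}$ in both cases, which is the usual rule of thumb behind the approximation.

```python
import math


def effective_window(beta: float) -> float:
    """Approximate number of days averaged over: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


for beta in (0.9, 0.98):
    window = effective_window(beta)
    # Weight of a reading `window` days old, relative to today's reading.
    decay = beta ** window
    print(
        f"beta={beta}: window ~ {window:.0f} days, "
        f"relative weight = {decay:.3f} (1/e = {1 / math.e:.3f})"
    )
```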

## Intuition

The intuition here is similar to the operating concept I've learnt earlier for scheduling (if I recall correctly). If we expand out the term for $v_t$, we observe that it is a weighted sum of the **current value** $\theta_t$ and all the **earlier values** $\theta_{t-1}, \theta_{t-2}, \dots$, with weights that decay exponentially the further back we go. The coefficients add up to $1 - \beta^t$, which approaches $1$ as $t$ grows.
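
As a worked expansion (assuming $v_0 = 0$, as in the implementation notes), unrolling the recurrence gives:

$$v_t = (1 - \beta) \left( \theta_t + \beta \ \theta_{t-1} + \beta^2 \ \theta_{t-2} + \dots + \beta^{t-1} \ \theta_1 \right)$$

Summing the coefficients is a geometric series:

$$(1 - \beta) \sum_{k=0}^{t-1} \beta^k = 1 - \beta^t$$

So the weights only sum to (almost) $1$ once $t$ is large enough for $\beta^t$ to be negligible.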

## Implementation

The implementation for this is really straightforward: just follow the equation! Remember to initialize $v_0 = 0$.
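
A minimal sketch of what that could look like in plain Python. The function name `exponentially_weighted_average` and the temperature readings are made up for illustration.

```python
from typing import Iterable, List


def exponentially_weighted_average(
    thetas: Iterable[float], beta: float = 0.9
) -> List[float]:
    """Return v_1, v_2, ... for the observations theta_1, theta_2, ..."""
    v = 0.0  # v_0 = 0
    averages = []
    for theta in thetas:
        # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        v = beta * v + (1.0 - beta) * theta
        averages.append(v)
    return averages


# Hypothetical daily temperatures, just to exercise the function.
temperatures = [40.0, 49.0, 45.0, 44.0, 50.0]
print(exponentially_weighted_average(temperatures, beta=0.9))
```

Only the single running value `v` needs to be kept in memory, which is a big part of why this is cheaper than keeping a window of past readings.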

# Bias Correction

There is a slight problem with the implementation of exponentially weighted averages. Refer to the plot below:

![Bias Correction](./images/bias-correction.png)

The purple line is an additional curve, also with $\beta = 0.98$. We observe that it starts very close to zero. It is easy to see why this is the case when we take a look at the formula for exponentially weighted averages.

Our initial value is $v_0 = 0$, therefore $v_1 = 0.98 \cdot 0 + (1 - 0.98) \ \theta_1$, which gives us $v_1 = 0.02 \ \theta_1$. This results in a very small value in the beginning, requiring the exponentially weighted average to **'warm up'** before the estimate becomes representative of the actual data.

To solve this problem, we apply something known as **bias correction**.

After we calculate the value of $v_t$, we perform the following correction update:

$$v^{\text{corrected}}_t := \frac{v_t}{1 - \beta^t}$$

This compensates for the small factor $1 - \beta$ in the early $v_t$ values. When $t$ is small, $1 - \beta^t$ is well below $1$, so dividing by it scales $v_t$ up (the same effect as multiplying by a value greater than $1$). For example, with $\beta = 0.98$ and $t = 1$ we divide by $0.02$, so $v^{\text{corrected}}_1 = \theta_1$. As $t$ increases, $\beta^t$ shrinks towards $0$, so $1 - \beta^t \approx 1$ and the corrected value barely differs from the original $v_t$.
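
Here is a rough sketch of the correction in plain Python (the name `bias_corrected_average` is my own). Note that the uncorrected `v` is what gets carried into the next step, and the correction is only applied to the value we report.

```python
from typing import Iterable, List


def bias_corrected_average(
    thetas: Iterable[float], beta: float = 0.98
) -> List[float]:
    """Exponentially weighted average with bias correction applied."""
    v = 0.0  # v_0 = 0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        # Standard update; v itself stays uncorrected.
        v = beta * v + (1.0 - beta) * theta
        # Divide by (1 - beta^t) to undo the early bias towards zero.
        corrected.append(v / (1.0 - beta ** t))
    return corrected


# With a constant (made-up) temperature, the corrected average matches
# it from day 1, instead of slowly warming up from zero.
print(bias_corrected_average([40.0, 40.0, 40.0], beta=0.98))
```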

People normally do not implement bias correction, but personally I think it is not something too difficult to implement, so why not? :)