In [1]:
%%html
<style>
    /* Jupyter */
    .rendered_html tr, .rendered_html th, .rendered_html td {
        text-align: left; 
    }
</style>

# Holt-Winters Method for Time Series Forecasting

## Learning Objectives
- Examples of general Time Series
- Holt-Winters Method for predicting seasonal trends
- Representing average, trend, seasonality
- Exponential smoothing
- Mathematics of the Holt-Winters method

Note to self: Change the learning objectives according to content, add pre-requisites if needed.

In this notebook we'll cover an introduction to time series, key terms used to describe the important features of time series and how to forecast time series using the Holt-Winters method.

## Time Series

A time series is a sequence of measurements of some quantity of interest taken repeatedly over a sustained period of time. Mathematically, it is represented as a sequence of observations $x(t)$, $t = 0, 1, 2, \dots$, where $t$ is the elapsed time and each $x(t)$ is treated as a random variable. A defining characteristic of a time series is that the order of the observations matters, so the data are not necessarily independent and identically distributed. Changing the order of the datapoints could change the meaning of the data.

Time series appear in many real-world disciplines, such as economics, epidemiology, the social sciences and the physical sciences.

Let's have a look at an explicit example of a time series. The following time series dataset is taken from the NASA website and describes the increase in monthly CO2 levels (parts per million) since 2005. 

Insert code showing table preview of dataset and then time plot of data set.

Add more examples maybe...

## Forecasting
The rest of this notebook focuses on time series forecasting, building up to the Holt-Winters method in particular. But first, what is the difference between time series analysis and time series forecasting? Time series analysis is a form of descriptive modelling: someone conducting time series analysis looks at a dataset to identify trends and seasonal patterns in the historical data, fits mathematical models to capture the underlying nature of the process generating the data, and so on. Time series forecasting is a form of predictive modelling, where the goal is to predict a future value at a particular point in time based on the values we already know.

### Some notation and terminology:

In a time series the values we do know are referred to as observed values and the values we are trying to forecast are referred to as expected values. In general, we use the notation $\hat{y}$ to denote expected values. 

For example, if we have a series that looks like [2,4,6,8,10], we might forecast the next value of this series to be 12. Using this terminology and notation, the observed values are $y_1=2$, $y_2=4$, $y_3=6$, $y_4=8$, $y_5=10$, i.e. the observed series is [2,4,6,8,10], and the next expected value is $\hat{y}_6=12$.

It's important to have some metrics to evaluate the accuracy of our forecasts. 

The error is the difference between an observed value and its forecast. Given a training dataset $\{y_{1},\dots, y_{T}\}$ and a test dataset $\{y_{T+1}, y_{T+2},\dots\}$, the error of the forecast at time $T+h$ is denoted $e_{T+h}=y_{T+h} - \hat{y}_{T+h}$.

As the error can be positive or negative, it is more helpful to work with its absolute value or, as is the common convention, to square it so that the value is never negative. The sum of squared errors (SSE) is given by $SSE = \sum_{i=1}^{n} ( y_{i} - \hat{y}_{i})^{2}$, where $n$ is the number of forecasts being evaluated. The SSE measures the unexplained variability, or discrepancy, between the observed data and the forecasted data. Another common metric is the mean squared error, given by $MSE=\frac{1}{n}\sum_{i=1}^{n} ( y_{i} - \hat{y}_{i})^{2}$.
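To make these definitions concrete, here is a minimal sketch of the error, SSE and MSE calculations in plain Python. The function names and the forecast values are made up for illustration; they are not taken from any dataset used in this notebook.

```python
def forecast_errors(observed, forecast):
    """Return the errors e_i = y_i - y_hat_i for each observed/forecast pair."""
    if len(observed) != len(forecast):
        raise ValueError("observed and forecast must have the same length")
    return [y - y_hat for y, y_hat in zip(observed, forecast)]


def sse(observed, forecast):
    """Sum of squared errors."""
    return sum(e ** 2 for e in forecast_errors(observed, forecast))


def mse(observed, forecast):
    """Mean squared error: the SSE divided by the number of forecasts n."""
    return sse(observed, forecast) / len(observed)


observed = [2, 4, 6, 8, 10]
forecast = [3, 4, 5, 8, 12]  # hypothetical forecasts, for illustration only

print(forecast_errors(observed, forecast))  # [-1, 0, 1, 0, -2]
print(sse(observed, forecast))              # 6
print(mse(observed, forecast))              # 1.2
```

Notice that the raw errors sum to -2 even though three of the five forecasts are wrong, which is why we square them before adding them up.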

Note to self: explain where n comes from or change the notion in error definition to include n.

## Some simple forecasting methods

#### Naive Method

This is the simplest forecasting method. For naive forecasts, we set every forecast equal to the last observed value. That is,

$\hat{y}_{T+1}=y_{T}$

For example, if we have a time series that looks like [14,20,18,17,24], then using the naive method the forecast for the next point would be 24. 
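As a quick sketch, the naive method is a one-liner in Python. The function name `naive_forecast` is just an illustrative choice.

```python
def naive_forecast(series):
    """Forecast the next value as the last observed value."""
    if not series:
        raise ValueError("series must contain at least one observation")
    return series[-1]


print(naive_forecast([14, 20, 18, 17, 24]))  # 24
```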

#### Average Method

In this method, the forecast for the next datapoint is simply the arithmetic mean of all of the previous datapoints. That is,

$\hat{y}_{T+1}=\frac{1}{T}\sum_{i=1}^{T} y_{i}$

For example, if we have a time series that looks like [19.2,17.8,15.1,14.3,15.0,16.7,15.2], then using the average method the forecast for the next point would be approximately 16.2.
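A minimal sketch of the average method, reproducing the example above. The function name `average_forecast` is an illustrative choice.

```python
def average_forecast(series):
    """Forecast the next value as the arithmetic mean of all observed values."""
    if not series:
        raise ValueError("series must contain at least one observation")
    return sum(series) / len(series)


series = [19.2, 17.8, 15.1, 14.3, 15.0, 16.7, 15.2]
print(round(average_forecast(series), 1))  # 16.2
```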

#### Moving Averages

An improvement over taking the average of all points is to take the average of only the $n$ latest datapoints. In this method only the most recent values matter. In practice this forecasting method can be effective if the right choice of $n$ is used.

$\hat{y}_{T+1}=\frac{1}{n}\sum_{i=0}^{n-1} y_{T-i}$
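Here is a minimal sketch of the moving average forecast. The function name and the choice of window size $n = 3$ are only for illustration; the series is the one from the naive method example.

```python
def moving_average_forecast(series, n):
    """Forecast the next value as the mean of the n most recent observations."""
    if not 1 <= n <= len(series):
        raise ValueError("n must be between 1 and the length of the series")
    return sum(series[-n:]) / n


series = [14, 20, 18, 17, 24]
# Mean of the last three values: (18 + 17 + 24) / 3
print(round(moving_average_forecast(series, 3), 2))  # 19.67
```

With $n = 1$ this reduces to the naive method, and with $n = T$ it reduces to the average method.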

#### Weighted Moving Averages

A weighted moving average is a moving average in which each of the $n$ points in the window is assigned its own weight. Typically the most recent points are assigned a higher weight, as these are more relevant to the forecast being made. Note that the weights must add up to 1, so no further division by $n$ is needed.

$\hat{y}_{T+1}=\sum_{i=1}^{n} w_{i} \, y_{T+1-i}$

where $w_{1}, w_{2}, \dots, w_{n}$ are the weights to be assigned, with $w_{1}$ applied to the most recent value $y_{T}$.
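A minimal sketch of the weighted moving average, assuming the first weight in the list is applied to the most recent observation and that the weights sum to 1. The function name and the weights 0.5, 0.3, 0.2 are arbitrary choices for illustration.

```python
import math


def weighted_moving_average_forecast(series, weights):
    """Forecast the next value as a weighted sum of the most recent observations.

    weights[0] is applied to the most recent value, weights[1] to the one
    before it, and so on. The weights are expected to sum to 1.
    """
    n = len(weights)
    if n == 0 or n > len(series):
        raise ValueError("need between 1 and len(series) weights")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    most_recent_first = list(reversed(series[-n:]))
    return sum(w * y for w, y in zip(weights, most_recent_first))


series = [14, 20, 18, 17, 24]
# 0.5 * 24 + 0.3 * 17 + 0.2 * 18
print(round(weighted_moving_average_forecast(series, [0.5, 0.3, 0.2]), 2))  # 20.7
```

Setting every weight to $1/n$ gives back the plain moving average.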

Note to self: Code up the implementation of each of these forecasting methods.

## Exponential Smoothing

- What is exponential smoothing?
    - Include mathematical formula
    - Explain parameter alpha (smoothing factor)
- Constraints of single exponential smoothing
    - Only good for forecasting a single point
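As a starting point for this section, here is a hedged sketch of single exponential smoothing using the standard recursion $s_t = \alpha y_t + (1-\alpha) s_{t-1}$, with the one-step forecast $\hat{y}_{T+1} = s_T$. Initialising with $s_1 = y_1$ is an assumption made here for simplicity; other initialisations are possible. The function name is illustrative.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s_t = alpha * y_t + (1 - alpha) * s_{t-1}.

    Assumption: the first smoothed value is set to the first observation.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie between 0 and 1")
    smoothed = [float(series[0])]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


series = [14, 20, 18, 17, 24]
smoothed = simple_exponential_smoothing(series, alpha=0.5)
print(smoothed)      # [14.0, 17.0, 17.5, 17.25, 20.625]
print(smoothed[-1])  # one-step-ahead forecast: 20.625
```

With $\alpha = 1$ the smoothed series equals the observations, so the forecast is the naive forecast. Smaller values of $\alpha$ give more weight to older observations.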

## Double Exponential Smoothing: Level and Trend
- Introduce terminology: level, trend
    - Introduce notation of each term 
    - Point out features such as level, trend with code example
- Give definitions of Holt's linear trend method / double exponential smoothing
    - Explain the terms in the equations
- Constraints of double exponential smoothing 
    - Can forecast two datapoints
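A possible starting sketch for this section, using the usual form of Holt's linear trend method: the level is updated as $\ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1})$, the trend as $b_t = \beta(\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1}$, and the forecast $h$ steps ahead is $\ell_T + h\,b_T$. The symbols $\ell$, $b$, $\beta$ and the initial values $\ell_1 = y_1$, $b_1 = y_2 - y_1$ are assumptions made for this sketch, since the notation for this section has not been fixed yet.

```python
def holt_linear(series, alpha, beta):
    """Return the final (level, trend) from Holt's linear trend method.

    Assumed initial values: level = y_1 and trend = y_2 - y_1.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations to initialise the trend")
    level = float(series[0])
    trend = float(series[1] - series[0])
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


series = [2, 4, 6, 8, 10]
level, trend = holt_linear(series, alpha=0.5, beta=0.5)
# Forecasts one and two steps ahead: level + h * trend
print(level + 1 * trend)  # 12.0
print(level + 2 * trend)  # 14.0
```

For this perfectly linear series the method recovers the forecast of 12 used as the example in the notation section.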

## Holt-Winters Method for Forecasting
- Explain concept of seasonality
    - Add code of time series example with seasonality component
- Introduce Holt-Winters Method for Forecasting 
    - Add the formulas
    - Explain the parameters
- Can now forecast as many datapoints into the future as you like
- Setting up the initial trend and seasonality term 
    - Include formula
    - Add Python code for generating initial values for trend and seasonality
- Final algorithm, code for Holt-Winters Forecasting
- Note on choosing values for alpha, beta, gamma by minimizing SSE
    - Can do this by trial and error
    - Nelder-Mead algorithm
    - Suggest further reading or links for more info on this
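As a placeholder for the final algorithm, here is a hedged sketch of the additive variant of the Holt-Winters method. It is one common formulation, not necessarily the one this notebook will settle on. The assumptions are: a season length $m$, a level, trend and seasonal update controlled by $\alpha$, $\beta$ and $\gamma$ respectively, and a simple initialisation from the first two seasons (initial level is the mean of the first season, initial trend is the change in seasonal mean divided by $m$, and initial seasonal terms are the first season's deviations from its mean). All names are illustrative.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    Assumed initialisation: level and seasonal terms from the first season,
    trend from the difference between the first two seasonal means.
    """
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first_season = series[:m]
    second_season = series[m:2 * m]
    level = sum(first_season) / m
    trend = (sum(second_season) / m - level) / m
    seasonals = [y - level for y in first_season]

    for t, y in enumerate(series):
        previous_level, previous_trend = level, trend
        previous_seasonal = seasonals[t % m]  # estimate from one season ago
        level = alpha * (y - previous_seasonal) + (1 - alpha) * (previous_level + previous_trend)
        trend = beta * (level - previous_level) + (1 - beta) * previous_trend
        seasonals[t % m] = (gamma * (y - previous_level - previous_trend)
                            + (1 - gamma) * previous_seasonal)

    n_obs = len(series)
    # Each forecast adds the latest seasonal estimate for that position in the season.
    return [level + h * trend + seasonals[(n_obs + h - 1) % m]
            for h in range(1, horizon + 1)]


# A made-up series with a repeating pattern of length 4 and no trend.
series = [10, 20, 30, 20] * 3
print(holt_winters_additive(series, season_length=4,
                            alpha=0.5, beta=0.5, gamma=0.5, horizon=4))
# [10.0, 20.0, 30.0, 20.0]
```

On this perfectly periodic series the forecasts simply repeat the seasonal pattern. On real data the values of $\alpha$, $\beta$ and $\gamma$ would be chosen by minimizing the SSE, as noted in the list above.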

## Summary
- Include summary of key takeaways
- Include challenges