<h1 style="text-align: center;">Working with Time Series Data</h1>

_November 18, 2020_

### Learning goals:
- Understand and explain the stationarity assumption in Time Series data
- Visualize a time series
- Remove trends in time series in order to satisfy the assumption of stationarity

<img src='time_series_animation.gif'/>

## What is Time Series Data?

A time series is a series of values of a quantity obtained at successive times, with __equal intervals__ between them.

What are some examples?
- The temperature of July recorded daily
- The weekly average price of a stock in the past year 
- The average annual government budget in the past 30 years

We will focus on *univariate time series*, which record a single variable at successive, equally spaced points in time.


Now that we know a little about time series data, what are some of its characteristics? The most notable are the patterns that can emerge, specifically *trends* and *seasonality*.
- Trend:
<img src="attachment:Screen%20Shot%202019-07-25%20at%209.50.55%20AM.png" width="500" >

- Seasonality:
patterns that repeat at specific, regular intervals of less than a year, such as quarterly, weekly, or hourly cycles. What are some examples that would show a seasonal pattern?

Another important property of a time series is __stationarity__, an assumption that lays the foundation for time series modeling and forecasting.

## Stationarity 

#### What is stationary data? 

Stationary data is data whose summary statistics (mean, variance, and covariance between observations a fixed lag apart) are not a function of time.
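
Written out, the usual (weak) form of this definition for a series $X_t$ is three conditions. The symbols $\mu$, $\sigma^2$, and $\gamma$ are notation introduced here for illustration, not taken from the rest of this notebook.

$$
\begin{aligned}
E[X_t] &= \mu && \text{for all } t \\
\mathrm{Var}(X_t) &= \sigma^2 && \text{for all } t \\
\mathrm{Cov}(X_t, X_{t+k}) &= \gamma(k) && \text{for all } t \text{ and each lag } k
\end{aligned}
$$

The third condition says the covariance may depend on the lag $k$ between two observations, but not on where in the series they sit.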

<h3 style="text-align: center;">Constant Mean</h3>

<img src='mean_nonstationary.webp'/>

<h3 style="text-align: center;">Constant Variance</h3>

<img src='variance_nonstationary.webp'/>

<h3 style="text-align: center;">Constant Covariance</h3>

<img src='covariance_nonstationary.webp'/>

#### Why does data need to be stationary for modeling?

Stationarity is important because without it a model describing the data will vary in accuracy at different time points. 

Stationarity also underlies the typical time series models, such as AR, MA, and ARMA, so a violation of stationarity can cause problems in prediction.


### Testing Stationarity:
- Examining the visualization 
- Examining the summary statistics 
- The Dickey-Fuller Test

#### 1. Visually examining the data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
shampoo = pd.read_csv('data/shampoo.csv', header=0, usecols=[1])

In [2]:
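# generate monthly data
# 'MS' (month start) is used because the month-end alias 'M' is deprecated in recent pandas releases
years = pd.date_range('2019-01', periods=len(shampoo), freq='MS')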
shampoo.index = years

In [None]:
# when you have a "Date" column, make sure to turn it into datetime and set datetime as index

# code

In [None]:
shampoo.plot()

#### 2. Examine summary statistics

In [None]:
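from pandas import Series
import numpy as np
import pandas as pd
# Series.from_csv was removed in pandas 1.0, so read the sales column with read_csv instead
series = pd.read_csv('data/shampoo.csv', header=0, usecols=[1]).iloc[:, 0]
X = series.values
X = X[np.logical_not(np.isnan(X))]  # drop missing values
# split the series into two halves and compare their summary statistics
split = len(X) // 2
X1, X2 = X[0:split], X[split:]
mean1, mean2 = X1.mean(), X2.mean()
var1, var2 = X1.var(), X2.var()
# large differences between the halves suggest the series is not stationary
print('mean1=%f, mean2=%f' % (mean1, mean2))
print('variance1=%f, variance2=%f' % (var1, var2))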

#### 3. The Dickey-Fuller Test

The Dickey-Fuller test is a statistical test for stationarity. The null hypothesis is that the time series is not stationary (it has a unit root). If the test statistic is less than (more negative than) the critical value, or equivalently the p-value is below the chosen significance level, we reject the null hypothesis and treat the series as stationary. The augmented version of the test is available as `adfuller` in `statsmodels.tsa.stattools`.

A series of steps can be taken to stationarize your data, also known as removing trends (linear trends, seasonality/periodicity, etc.; more details on transformations <a href='http://people.duke.edu/~rnau/whatuse.htm'>here</a>). We do this by taking differences of the variable over time, log transforming, or seasonal differencing.
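
The three transformations named above can be sketched with pandas and numpy. The series here is synthetic (exponential growth multiplied by a 12-month cycle), and the name `toy` and all constants are assumptions chosen so the effect of each step is easy to see.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption): exponential growth times a yearly cycle.
index = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
toy = pd.Series(100 * np.exp(0.03 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)), index=index)

logged = np.log(toy)                         # log transform: multiplicative growth becomes additive
lag1_diff = logged.diff().dropna()           # lag-1 difference removes the linear trend in the logs
seasonal_diff = logged.diff(12).dropna()     # lag-12 difference removes the yearly cycle

print(lag1_diff.head())
print(seasonal_diff.head())
```

In this toy case the seasonal difference of the logged series is constant, because both the growth rate and the cycle repeat exactly every 12 months. Real data will leave some noise behind.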

#### Some terminology:
- Lag: for some specific point t, the observed $X_{t-i}$ (i-th period back) is called the i-th lag of $X_t$

## Differencing

We use differencing to remove your data's dependence on time (temporal dependence). 

Differencing is performed by subtracting the previous observation (lag=1) from the current observation.

difference(t) = observation(t) - observation(t-1)

__Discussion question__:
If we have a time series dataset of [1,2,3,4,5,6,7,8,9,10], and we want to difference it by lag=1, what would the result be?

In [None]:
#### manually differencing a series!

In [None]:
# create a differenced series
def difference(dataset, interval=1):
    diff = []
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)

In [None]:
shampoo['Sales of shampoo over a three year period'].values

In [None]:

### Differencing using pandas/numpy
# plot the data to visualize trends
shampoo.plot()

In [None]:
# call .diff() on a pandas Series or DataFrame to get differenced values
# (renaming only the column keeps the datetime index intact for plotting)
diff = shampoo.diff().rename(columns={"Sales of shampoo over a three year period": "Differenced Observations"})

In [None]:
diff.head()

In [None]:
shampoo.head()

In [None]:
#plot of differenced data (more stationary)
plt.figure(figsize=(10,5))
plt.plot(diff)

Sometimes, we have to difference the differenced data (known as a second difference) to achieve stationary data. <b>The number of times we have to difference our data is the order of differencing</b> - we will use this information when building our model.
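
A small sketch of why a second difference is sometimes needed, using a made-up quadratic series instead of the shampoo data: the first difference still trends upward, while the second difference is flat.

```python
import pandas as pd

# Toy series with a quadratic trend (an assumption): 0, 1, 4, 9, ...
quadratic = pd.Series([t ** 2 for t in range(8)])

first = quadratic.diff().dropna()    # 1, 3, 5, ... still increasing over time
second = first.diff().dropna()       # 2, 2, 2, ... constant, so the trend is gone

print(first.tolist())
print(second.tolist())
```

Here the order of differencing would be 2, since two rounds of `.diff()` were needed before the values stopped depending on time.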

Once we have achieved stationarity, the next step in fitting a model is to address any autocorrelation that remains in the differenced series. Time series exhibit two kinds of behavior. Often, you can predict a value in a time series using a past value or values. Other times, the past values can be misleading. Consider the stock market: every day, stock prices experience shocks due to randomness. The effect of such a shock generally diminishes quickly and has little effect on future prices. Determining which behaviors are present in our time series is essential so we can model them properly.

## Rolling Mean
Rolling mean is a "sliding window" method in which we calculate the mean over each window of a given size. For example, with a window of 3, we'd calculate the mean shampoo sales for months 1-3, then months 2-4, months 3-5, and so on. This allows us to check whether the mean changes over time.
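
To see the mechanics before trying the exercise, here is a minimal sketch on a made-up six-value series (the name `toy` is an assumption). It uses the pandas `rolling` method with a window of 3.

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5, 6])

# Each value is computed from the current observation and the two before it.
rolling_mean = toy.rolling(window=3).mean()
rolling_std = toy.rolling(window=3).std()

print(rolling_mean.tolist())
print(rolling_std.tolist())
```

The first two entries are NaN because a full window of three observations is not available yet.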

In [None]:
# exercise

# calculate the rolling mean & std with a window of 3 and plot the original shampoo dataset, rolling mean,
# and rolling std together

# your code here


## Autocorrelation

#### What is autocorrelation?

It is the correlation between one time series and the same time series shifted by k periods. 
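
This definition can be checked directly on a toy series: shift the series by k periods, then correlate it with the original. The series below (a linear trend plus noise) and the variable names are assumptions for illustration. `Series.autocorr` is the pandas shortcut for the same calculation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy series (an assumption): a linear trend plus noise, so neighbouring values are similar.
toy = pd.Series(np.arange(50) + rng.normal(scale=2.0, size=50))

k = 1
shifted = toy.shift(periods=k)    # value from k periods earlier, NaN for the first k rows
manual = toy.corr(shifted)        # Pearson correlation, skipping the NaN rows
builtin = toy.autocorr(lag=k)     # pandas computes the same quantity

print("lag-%d autocorrelation: %.3f (manual), %.3f (autocorr)" % (k, manual, builtin))
```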

In [None]:
shampoo_sales_lag_1 = shampoo.shift(periods=1)
shampoo_sales_lag_2 = shampoo.shift(periods=2)

In [None]:
plt.plot(shampoo)
plt.plot(shampoo_sales_lag_1)


In [None]:
plt.plot(shampoo)
plt.plot(shampoo_sales_lag_2)

#### ACF

In [None]:
from statsmodels.graphics.tsaplots import plot_acf

In [None]:
# plot autocorrelation for each lag (alpha=.05 draws a 95% confidence band)
plot_acf(shampoo[:-1], alpha=.05);

It looks like the first four lags have fairly strong autocorrelation, which is worth noting for future model building.

Sometimes, autocorrelation propagates down to other lags: a strong autocorrelation at one lag causes additional lags to appear highly autocorrelated. To discover the true relationship between lags we can use the PACF (partial autocorrelation function).

#### PACF

Partial autocorrelation looks at the correlation between a point and a particular lag without the influence of intermediary lags. This helps us see the direct relationship between certain lags.
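
One way to make "without the influence of intermediary lags" concrete is a regression-based estimate: regress each value on both its first and second lag, and read the coefficient on the second lag. The sketch below does this with numpy on a simulated series where each value depends only on the previous one. The coefficient `phi`, the length `n`, and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n = 0.8, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()   # each value depends directly on lag 1 only

# Plain lag-2 correlation: sizeable, because the lag-1 link is chained twice.
corr_lag2 = np.corrcoef(x[2:], x[:-2])[0, 1]

# Lag-2 partial autocorrelation: coefficient on x[t-2] once x[t-1] is also in the regression.
design = np.column_stack([np.ones(n - 2), x[1:-1], x[:-2]])
coefs, *_ = np.linalg.lstsq(design, x[2:], rcond=None)
partial_lag2 = coefs[2]

print("lag-2 autocorrelation: %.3f" % corr_lag2)
print("lag-2 partial autocorrelation: %.3f" % partial_lag2)
```

The plain lag-2 correlation is clearly positive, while the partial value is close to zero, which matches how the series was built.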

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf

In [None]:
# recent statsmodels versions only allow lags up to half the sample size, so stay below that
plot_pacf(shampoo[:-1], alpha=.05, lags=15);

Now that we know how to analyze the patterns in our time series, we can proceed to building models that create forecasts!


<b>Additional Resources</b>

https://www.youtube.com/watch?v=Prpu_U5tKkE

https://newonlinecourses.science.psu.edu/stat510/node/41/
