![Cloud-First](../image/CloudFirst.png) 


# SIT742: Modern Data Science 
**(Module: Big Data Analytics)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)


Prepared by **SIT742 Teaching Team**

---


## Session 5A: Time Series

**The purpose of this session is to:**

1. Understand the basic time series structure.

## What is Time Series?


**Time series data is any information collected over regular intervals of time.**

Before jumping into time series forecasting, it is essential to understand the structure and features of a time series, as well as the ways of processing it. We will start with some simple but interesting exploration tasks.

Tasks List:

> Reading and Displaying Data

> Stationarity



In [None]:
import warnings
warnings.filterwarnings('ignore')

## **Task 1: Reading and Displaying Data**

#### To start, let’s import the Pandas library and read the births data into a data frame:

In [None]:
import pandas as pd 
df = pd.read_csv("https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/timeseries-data.txt")
print(df.head(10))

We can see that the data contains a column labeled “Date” that holds datetime values, formatted as year-month-day. We also see that the data starts in the year 1959.

The second column is labeled “Births”. It is a quantitative column that contains the number of new births on each date.

From the above data, we could see **two important** findings:


1.   Datetime (timestamps) is the key index for sorting the time series data, and also an important feature for obtaining more informative patterns.
2.   Quantitative variable(s) are the column(s) that provide the quantitative information at each timestamp.
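
The two ingredients above can be sketched on a tiny hand-made table. The values and the variable name `toy` are illustrative assumptions, not rows from the births file.

```python
import pandas as pd

# Hypothetical toy data: three daily timestamps (out of order) and one quantitative column.
toy = pd.DataFrame({
    "Date": ["1959-01-03", "1959-01-01", "1959-01-02"],
    "Births": [30, 35, 32],
})

# Parse the strings into real datetimes, then use them as the sorted index.
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy = toy.set_index("Date").sort_index()

print(toy)
```

Once the timestamps are the index, sorting by time puts the quantitative values in chronological order.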



### 1.1 Timestamp Formatting

As we saw above, the "Date" column has the format year-month-day, which allows us to aggregate "Births" at the grain level of a day. In some situations you might want to try multiple grain levels,
so formatting the timestamps is an important step in processing time series data.

We will do the subtasks below and plot the time series accordingly:



*   Format "Date" to month level
*   Aggregate the "Births" with month level
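
Before touching the real data, the same two subtasks can be tried on a small synthetic series. This is a minimal sketch: the 90-day range and the constant value of 1 per day are assumptions chosen so the totals are easy to check by eye.

```python
import pandas as pd

# Hypothetical daily series: 90 days starting 1959-01-01, one "birth" per day.
days = pd.date_range("1959-01-01", periods=90, freq="D")
toy = pd.DataFrame({"Date": days, "Births": 1})

# Month grain: map each timestamp to its calendar month, then sum.
monthly = toy.groupby(toy["Date"].dt.to_period("M"))["Births"].sum()

# Week grain: the same idea with a weekly period.
weekly = toy.groupby(toy["Date"].dt.to_period("W"))["Births"].sum()

print(monthly)
print(weekly.head())
```

Because each day contributes 1, the monthly totals are just the number of days in January, February and March 1959.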






In [None]:
# we first format the Date column to a datetime format which pandas could read as "datetime"
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
print(df.head())

In [None]:
# we extract the year-month period from Date (Date must already be in datetime format)
df['year_month'] = df['Date'].dt.to_period('M')
print(df.head())

In [None]:
# Let's aggregate the Births at month level
df.groupby(['year_month'])['Births'].agg('sum')

In [None]:
# Let's plot the births at month level. To avoid changing df, we create a new dataframe for this task
df_year_month = df.groupby(['year_month'])['Births'].agg('sum').to_frame()

# import matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (20,3)
# convert the PeriodIndex of the new dataframe to timestamps so that it can be plotted;
# recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=df_year_month.index.to_timestamp(), y=df_year_month.Births)

In [None]:
# let's compare with the original time series at day level
df.index = df['Date']
sns.lineplot(x=df.index, y=df.Births)

**Summary:**

Comparing the original time series plot with the month-level one, 
the pattern is clearly easier to see at month level: Q3 to Q4 is the peak for new births.
This finding leads us to ask: 
***How do we find the trend in a given time series, and what is seasonality?***

## **Task 2: Stationarity**

Stationarity is a key part of time series analysis. Simply put, stationarity means that the manner in which time series data changes is constant. A stationary time series will not have any trends or seasonal patterns. You should check for stationarity because it not only makes modeling time series easier, but it is an underlying assumption in many time series methods. Specifically, stationarity is assumed for a wide variety of time series forecasting methods including autoregressive moving average (ARMA).
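
As a rough illustration of the idea, the sketch below contrasts white noise (stationary) with a random walk (non-stationary) by comparing the mean of the first and second half of each series. Both series are synthetic, and the helper `half_means` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)   # white noise: constant mean and variance
walk = np.cumsum(noise)         # random walk: its level drifts over time

def half_means(x):
    """Return the mean of the first half and of the second half of x."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

print("white noise:", half_means(noise))
print("random walk:", half_means(walk))
```

For white noise the two halves have nearly the same mean, while for a random walk they typically differ, which is the kind of changing behaviour that stationarity rules out.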

To obtain stationary time series data, we first learn some 
basic terms that describe the features of a time series:



1.   Seasonality
2.   Trend
3.   Random Noise
4.   Rolling Statistics




### 2.1 Seasonality, Trend and Random Noise

Seasonality means that the series repeats a pattern over a fixed period: in a given domain, some months consistently show higher values than others.
For example, you might see that November and December are the peak for retail shopping every year.

The trend is another important feature. It describes the long-term direction of the time series,
that is, whether its level is increasing or decreasing over time.

Random noise is what is left of the time series when both trend and seasonality
have been removed. It is a series that’s not predictable.

To extract the seasonality and trend from a time series, 
we usually need to do a **decomposition** first.
A time series can be modelled as either **additive or multiplicative**. 

- Additive: Y = Trend + Seasonality + Random noise
- Multiplicative: Y = Trend * Seasonality * Random noise
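
The two forms can be built by hand on synthetic components. This is a minimal sketch: the trend, the weekly offsets and the factor scaling are made-up values, and the noise term is left out so the arithmetic stays exact.

```python
import numpy as np

t = np.arange(28)                                      # four weeks of daily steps
trend = 40 + 0.1 * t                                   # slowly rising level
seasonal_add = np.tile([-2, -1, 0, 1, 2, 3, -3], 4)    # weekly offsets that sum to 0
seasonal_mul = 1 + seasonal_add / 40                   # weekly factors around 1

y_add = trend + seasonal_add   # additive form (noise omitted)
y_mul = trend * seasonal_mul   # multiplicative form (noise omitted)

# Taking logs turns the multiplicative form into an additive one.
log_is_additive = np.allclose(np.log(y_mul), np.log(trend) + np.log(seasonal_mul))
print(log_is_additive)
```

In the additive form the seasonal swing has a fixed size, while in the multiplicative form it scales with the level of the trend.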

To do: 



1.   Decompose the birth time series in additive mode
2.   Decompose the birth time series in multiplicative mode
3.   Generate the seasonality plot
4.   Generate the trend plot



In [None]:
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

# let's use the daily birth data (indexed by Date) in the decomposition
df = df[['Births']]
# Additive Decomposition
add_result = seasonal_decompose(df['Births'], model='additive')


In [None]:
plt.rcParams["figure.figsize"] = (15,10)
add_result.plot().suptitle('Births Decompose (Additive)', fontsize=12)
plt.show()

In [None]:
# Multiplicative Decomposition 
mul_result = seasonal_decompose(df['Births'], model='multiplicative')

In [None]:
mul_result.plot().suptitle('Births Decompose (Multiplicative)', fontsize=12)
plt.show()

**Summary:**

From the above two decompositions, we can see that the seasonal component of the birth data repeats every 7 days, which is the period statsmodels infers by default for daily data. The trend shows that births in the September to October period are clearly higher. 
Next, we will see whether the trend is more stationary than the original time series.

### 2.2 Stationarity Test

A stationary series has a constant mean and a constant variance over time. The augmented Dickey-Fuller test (`adfuller` in statsmodels) is a hypothesis test that checks whether a time series is stationary. The null hypothesis is that the time series is non-stationary. If the p-value is less than 5 percent, we reject the null hypothesis; otherwise we fail to reject it.
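
To make the test statistic less of a black box, here is a minimal NumPy sketch of the plain (non-augmented) Dickey-Fuller regression. It is a simplification and not what `adfuller` computes when it selects lags automatically. The function name `dickey_fuller_stat` and the synthetic series are assumptions for illustration. For a regression with a constant, the large-sample 5% critical value is roughly -2.86, so a statistic far below that points toward stationarity.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of rho in the regression diff(y)[t] = alpha + rho * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)     # ordinary least squares fit
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # covariance of the OLS estimates
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=500)    # stationary by construction
walk = np.cumsum(noise)         # non-stationary by construction

print("white noise:", dickey_fuller_stat(noise))
print("random walk:", dickey_fuller_stat(walk))
```

The white noise series should give a strongly negative statistic, while the random walk usually stays much closer to zero.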

We will compare the trend with the original time series, 
and we will also see whether a rolling average yields a stationary time series.

To do:


1.   Generate rolling average time series 
2.   Run adfuller test on trend, original and rolling average time series
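
A rolling average over a window of `k` points has no value for the first `k - 1` positions, which is why missing values must be dropped before running a test on it. A minimal sketch on made-up numbers (the series `toy` and the window of 3 are assumptions):

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)   # hypothetical values
smooth = toy.rolling(3).mean()                            # 3-point rolling average

print(smooth)
print("missing values at the start:", int(smooth.isna().sum()))
```

The first two entries are `NaN` because a full window of three values is not yet available there.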



In [None]:
# we can use the pandas rolling mean to obtain the moving average over a 7-day window
rolling_mean = df.rolling(7).mean()

In [None]:
# Let's plot the rolling mean, trend and original time series in one plot

plt.plot(df, color="blue",label="Original birth data")
plt.plot(rolling_mean, color="red", label="Rolling Mean birth data")
plt.plot(mul_result.trend, color="green", label="Trend birth data")
plt.legend(loc="best")

In [None]:
from statsmodels.tsa.stattools import adfuller
#let's pass the time series into the adfuller test function, firstly let's test the original
adft = adfuller(df.Births,autolag="AIC")

In [None]:
# let's check the test results

output_df = pd.DataFrame({"Values":[adft[0],adft[1],adft[2],adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']]  , "Metric":["Test Statistics","p-value","No. of lags used","Number of observations used", 
                                                        "critical value (1%)", "critical value (5%)", "critical value (10%)"]})
print(output_df)

In [None]:
# let's test with trend and rolling mean
adft_trend = adfuller(mul_result.trend.values[~np.isnan(mul_result.trend.values)],autolag="AIC")
adft_rolling = adfuller(rolling_mean.Births.values[~np.isnan(rolling_mean.Births.values)],autolag="AIC")

In [None]:
output_df_trend = pd.DataFrame({"Values":[adft_trend[0],adft_trend[1],adft_trend[2],adft_trend[3], adft_trend[4]['1%'], adft_trend[4]['5%'], adft_trend[4]['10%']]  , "Metric":["Test Statistics","p-value","No. of lags used","Number of observations used", 
                                                        "critical value (1%)", "critical value (5%)", "critical value (10%)"]})
print(output_df_trend)

In [None]:
output_df_rolling = pd.DataFrame({"Values":[adft_rolling[0],adft_rolling[1],adft_rolling[2],adft_rolling[3], adft_rolling[4]['1%'], adft_rolling[4]['5%'], adft_rolling[4]['10%']]  , "Metric":["Test Statistics","p-value","No. of lags used","Number of observations used", 
                                                        "critical value (1%)", "critical value (5%)", "critical value (10%)"]})
print(output_df_rolling)

For any of these series, a p-value greater than 5 percent together with a test statistic greater than the critical values indicates that the series is not stationary.

**Summary:**

From the adfuller test, we can see that the original time series has a test statistic smaller than the 1% and 5% critical values, and its p-value is extremely small. Therefore,
the original time series is more stationary than the trend and the rolling average. Why is that? Recall the definition of stationarity: the mean and variance stay constant over time. The original series fluctuates around a stable level, whereas the trend and the rolling average drift over time, so their level is not constant.
