# Practice Lab M05 (Version 3)
## Time Series Forecasting, Anomaly Detection and Rule Mining

Practice Lab M05 focuses on time series forecasting, anomaly detection and association rule mining.

## To do

- Use ARIMA to forecast the time series.
- Use an isolation forest to find anomalies.
- Conduct association rule mining.


## Task 1 Time series and Anomaly detection
### Task 1.1 Reading time series
We first read the given time series (https://raw.githubusercontent.com/tulip-lab/open-data/master/HK2012-2018/United_States.csv or the 'US_HK.csv' file in the data folder). Before doing any forecasting, however, we need to define some error metrics for evaluation. Here we will use the mean absolute percentage error (MAPE) and the mean absolute error (MAE).

In [None]:
!pip install apyori

In [None]:
import warnings                                  # `do not disturb` mode
warnings.filterwarnings('ignore')

import numpy as np                               # vectors and matrices
import pandas as pd                              # tables and data manipulations
import matplotlib.pyplot as plt                  # plots
import seaborn as sns                            # more plots

from dateutil.relativedelta import relativedelta # working with dates with style
from scipy.optimize import minimize              # for function minimization

import statsmodels.formula.api as smf            # statistics and econometrics
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

from itertools import product                    # some useful functions
from tqdm import tqdm_notebook



%matplotlib inline

In [None]:
#- read the time series by using pandas as 'ads' 
ads = pd.read_csv('xxx', index_col=['date'], parse_dates=['date'])
ads.index = pd.to_datetime(ads.index)

In [None]:
# plot the time series

Mean Absolute Percentage Error (MAPE): this is the same as MAE but expressed as a percentage of the true value, which is very convenient when you want to explain the quality of the model to management. Its range is $[0, +\infty)$.

$\texttt{MAPE} = \frac{100}{n}\sum\limits_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i}$
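As a rough illustration of the formula above, here is a minimal NumPy sketch. The function name `mape_percent` and the sample numbers are made up for this example, and the code assumes the true values contain no zeros (otherwise the ratio is undefined). It takes the absolute value of the whole ratio, which matches the formula whenever the true values are positive.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """Mean absolute percentage error in percent (hypothetical helper).

    Assumes y_true has no zeros; otherwise the division is undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Relative errors are 10%, 10% and 0%, so the mean is 20/3 percent.
example_mape = mape_percent([100, 200, 400], [110, 180, 400])
print(round(example_mape, 3))
```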

In [None]:
# - define the mean absolute percentage error function in numpy


### Task 1.2 Moving average smoothing forecasting
After having the time series, let's do Moving Average Smoothing Forecasting

Let's start with a naive hypothesis: "tomorrow will be the same as today". However, instead of a model like $\hat{y}_{t} = y_{t-1}$ (which is actually a great baseline for any time series prediction problems and sometimes is impossible to beat), we will assume that the future value of our variable depends on the average of its $k$ previous values. Therefore, we will use the **moving average**.

$\hat{y}_{t} = \frac{1}{k} \displaystyle\sum^{k}_{n=1} y_{t-n}$
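The formula can be checked by hand on a tiny series. The sketch below is one possible reading of it, using a hypothetical helper name and made-up numbers: the forecast for the next step is simply the mean of the last $k$ observed values.

```python
import numpy as np

def moving_average_forecast(values, k):
    """Forecast the next value as the mean of the last k observations.

    Hypothetical helper for illustration; assumes len(values) >= k.
    """
    values = np.asarray(values, dtype=float)
    return float(np.mean(values[-k:]))

history = [10, 12, 14, 16]
# With k = 3 the forecast averages 12, 14 and 16.
forecast = moving_average_forecast(history, 3)
print(forecast)
```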

In [None]:
# define the moving average function by numpy
def moving_average(series, n):
    """
        Calculate average of last n observations
    """
    return xxxx

moving_average(ads, 24)

The moving average has another use case: smoothing the original time series to identify trends. Pandas provides an implementation, [`DataFrame.rolling(window).mean()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html). The wider the window, the smoother the trend.

In [None]:
series = ads.arrival
window = 3
# use pandas to conduct the moving average with window size of 3
rolling_mean = series.rolling(window=window).mean()

In [None]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(series[window:], rolling_mean[window:])
deviation = np.std(series[window:] - rolling_mean[window:])
lower_bound = rolling_mean - (mae + 1 * deviation)
upper_bound = rolling_mean + (mae + 1 * deviation)
# plot the upper and lower bounds: rolling mean +/- (mean absolute error + 1 * standard deviation)



### Task 1.3 Anomaly detection
With the original time series, the moving average, and the upper and lower bounds in one plot, we can see some anomalies. Can we use an isolation forest to detect them?
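To give a feel for how an isolation forest flags unusual values, here is a small sketch on synthetic data rather than the lab's series. The noise level, the injected spike and the default `IsolationForest` settings are all assumptions chosen for illustration. Scikit-learn expects a 2D array, so the 1D series is reshaped into a single-feature column.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic series: noise around 10 with one injected spike (made-up data).
rng = np.random.RandomState(0)
values = 10 + rng.normal(scale=1.0, size=200)
spike_idx = 120
values[spike_idx] = 60.0

# scikit-learn estimators expect shape (n_samples, n_features).
X = values.reshape(-1, 1)

clf = IsolationForest(random_state=0)
labels = clf.fit_predict(X)      # 1 = inlier, -1 = anomaly
scores = clf.score_samples(X)    # lower score = more anomalous

anomaly_idx = np.where(labels == -1)[0]
print("flagged positions:", anomaly_idx)
```

With the default `contamination='auto'`, some ordinary points near the edge of the distribution may be flagged as well, so the flagged set is best inspected on a plot rather than trusted blindly.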

In [None]:
from sklearn.ensemble import IsolationForest
# write code to train the isolation forest on the original time series
rng = np.random.RandomState(0)
clf = IsolationForest(max_samples=100, random_state=rng)


In [None]:
# find the anomalies in original time series and mark them in the plot with original time series

### Task 1.4 Exponential smoothing forecasting
Now we would like to try the exponential smoothing forecasting.

Now, let's see what happens if, instead of weighting the last $k$ values of the time series, we start weighting all available observations while exponentially decreasing the weights as we move further back in time. There exists a formula for **[exponential smoothing](https://en.wikipedia.org/wiki/Exponential_smoothing)** that will help us with this:

$$\hat{y}_{t} = \alpha \cdot y_t + (1-\alpha) \cdot \hat y_{t-1} $$

Here the model value is a weighted average between the current true value and the previous model values. The $\alpha$ weight is called a smoothing factor. It defines how quickly we will "forget" the last available true observation. The smaller $\alpha$ is, the more influence the previous observations have and the smoother the series is.

Exponentiality is hidden in the recursiveness of the function -- we multiply by $(1-\alpha)$ each time, which already contains a multiplication by $(1-\alpha)$ of previous model values.
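The hidden exponential weights can be made explicit by unrolling the recursion. The sketch below uses made-up numbers and assumes the first smoothed value equals the first observation. It compares the recursive form, computed here with pandas `ewm(alpha=..., adjust=False)`, against the unrolled sum $\alpha \sum_{i=0}^{t-1} (1-\alpha)^i y_{t-i} + (1-\alpha)^t y_0$ for the last point.

```python
import numpy as np
import pandas as pd

alpha = 0.3
y = pd.Series([3.0, 5.0, 4.0, 6.0, 8.0])  # made-up observations

# Recursive form: s_0 = y_0, s_t = alpha * y_t + (1 - alpha) * s_{t-1}.
recursive = y.ewm(alpha=alpha, adjust=False).mean()

# Unrolled form for the last point: weights decay geometrically with age.
t = len(y) - 1
weights = alpha * (1 - alpha) ** np.arange(t)   # for y_t, y_{t-1}, ..., y_1
newest_first = y.to_numpy()[:0:-1]              # y_t, y_{t-1}, ..., y_1
unrolled_last = np.sum(weights * newest_first) + (1 - alpha) ** t * y.iloc[0]

print(recursive.iloc[-1], unrolled_last)
```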

In [None]:
def exponential_smoothing(series, alpha):
    """
        series - dataset with timestamps
        alpha - float [0.0, 1.0], smoothing parameter
    """
    result = [series.iloc[0]] # first value is same as series
    for n in range(1, len(series)):
        # use positional access (.iloc) because the series is indexed by date
        result.append(alpha * series.iloc[n] + (1 - alpha) * result[n-1])
    result = pd.DataFrame(result, index=series.index)
    return result

In [None]:
# Use the exponential smoothing function above to obtain the smoothed time series
# Plot the smoothed time series together with the original in one plot


## Task 2 ARIMA
### Task 2.1 Find parameter p, d and q
Before training, we need to choose the ARIMA parameters p, d and q.

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
#let's pass the time series into the adfuller test function, firstly let's test the original
# Use adfuller test to determine the d

# let's check the test results


In [None]:
#Use pacf and acf plot to draw the auto correlation plot for p and q


- $p$ is most probably x, since it is the last significant lag on the PACF, after which most other lags are not significant.
- $d$ equals x because the series is quite stationary after first-order differencing.
- $q$ should be somewhere around x as well, as seen on the ACF.

### Task 2.2 Train the ARIMA
Now we will use the parameter p, d and q to train the ARIMA

In [None]:
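from statsmodels.tsa.arima.model import ARIMA  # the older statsmodels.tsa.arima_model module is no longer available in recent statsmodels releases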
# train the ARIMA and plot the fitted time series with original time series together



### Task 2.3 (Advanced) Association Rule Mining
We want to do association rule mining, but first we need to discretize the values of the given time series.
Select the 3 features from the DataFrame and use Apriori to mine some rules.

In [None]:
tran = ads[['arrival','Hong kong','Hong kong dollar']].copy()  # copy so that new columns can be added safely

In [None]:
tran['date'] = tran.index.month_name()

Discretize the 3 features into 4 bins holding roughly equal numbers of observations by using [`pd.qcut(series, 4)`](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html)
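To see what quantile binning does, the following sketch applies `pd.qcut` to eight made-up values. Because the cut points are quantiles, each of the four bins receives the same number of observations, regardless of how the values are spaced.

```python
import pandas as pd

values = pd.Series([5, 1, 8, 3, 7, 2, 6, 4])  # made-up data
labels = ["low", "medium", "high", "high+"]

# Quantile-based bins: each bin gets about a quarter of the observations.
binned = pd.qcut(values, 4, labels=labels)
counts = binned.value_counts()

print(binned.tolist())
print(counts)
```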

In [None]:
newcols = []
for i in tran.columns[:-1]:
    newcol = i + '_bin'
    tran[newcol] = pd.qcut(tran[i], 4, labels=["low", "medium", "high", "high+"])
    print('finish_' + i)
    newcols.append(newcol)

In [None]:
tran = tran[newcols+['date']]

In [None]:
tran.head(5)
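Before running a rule miner, it may help to see what the usual rule metrics mean on a toy table. The sketch below does not use `apyori`; it is a hand-rolled illustration on five made-up rows. It computes support, confidence and lift for one hypothetical rule, `date=July -> arrival_bin=high`. Each row becomes a transaction of `column=value` items.

```python
import pandas as pd

# Made-up rows shaped like the discretized table (one bin column plus month).
rows = pd.DataFrame({
    "arrival_bin": ["high", "high", "low", "high", "low"],
    "date": ["July", "July", "January", "August", "July"],
})

# One transaction per row, with items written as "column=value".
transactions = [
    {f"{col}={val}" for col, val in row.items()}
    for row in rows.to_dict("records")
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"date=July"}
consequent = {"arrival_bin=high"}

rule_support = support(antecedent | consequent)   # 2 of 5 rows
confidence = rule_support / support(antecedent)   # 2 of the 3 July rows
lift = confidence / support(consequent)           # relative to 3 of 5 rows

print(rule_support, round(confidence, 3), round(lift, 3))
```

A lift above 1 means the consequent appears more often together with the antecedent than it does overall.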

In [None]:
# Run the rule mining and print the rule

In [None]:
for i in range(len(results)):
    print("##############################################################################")
    print(i)
    print(results[i])
    print(results[i].items)