<h1>Introduction</h1>

This notebook takes a deep dive into time series analysis, covering the pre-processing and cleaning steps the data requires. We will also look at statistical models for forecasting future values.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
from colorama import Fore

from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

import warnings  
warnings.filterwarnings('ignore')

np.random.seed(7)

In [None]:
df = pd.read_csv('../input/aceawaterprediction/Aquifer_Petrignano.csv')
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df = df[df.Rainfall_Bastia_Umbra.notna()]
df.head()

In [None]:
df = df.reset_index(drop=True)
df = df.drop(['Depth_to_Groundwater_P24', 'Temperature_Petrignano'], axis=1)

In [None]:
df.head()

In [None]:
# Simplifying the column names
df.columns = ['date', 'rainfall', 'depth_to_groundwater', 'temperature', 'drainage_volume', 'river_hydrometry']
# Separating the target column
targets = ['depth_to_groundwater'] 
features = [feature for feature in df.columns if feature not in targets]
df.head()

In [None]:
print(np.min(df['date']))
print(np.max(df['date']))

In [None]:
from datetime import datetime, date

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.head().style.set_properties(subset=['date'], **{'background-color': 'lightblue'})

<h2>Data visualization</h2>

We have the following features

* Rainfall - Quantity of rain falling (mm)
* Temperature - Temperature in Celsius
* Volume - Indicates water volume taken from the drinking water treatment plant (cubic m)
* Hydrometry - Indicates groundwater level (m)

Target label

* Depth to groundwater - Groundwater level (m from ground floor)

In [None]:
f, ax = plt.subplots(nrows = 5, ncols = 1, figsize=(15, 25))
for i, column in enumerate(df.drop('date', axis=1).columns):
    sns.lineplot(x=df['date'], y=df[column].ffill(), ax=ax[i], color='royalblue')
    ax[i].set_title(f'Feature: {column}', fontsize=15)
    ax[i].set_ylabel(column, fontsize=12)
    ax[i].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

<h2>Pre-processing</h2>

We have to check two major things

* Chronological order of dates - The dates should be in chronological order. This can be achieved by sorting the dates

* Equidistant intervals - The difference between adjacent dates should be uniform and constant. We can decide on a constant time interval and resample data.
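
Both checks can be sketched on a toy frame. The example below is a minimal illustration, assuming a hypothetical frame `toy` with a `date` column (it is not part of the dataset). It sorts the rows, tests whether all gaps between adjacent dates are equal, and uses `asfreq` to put the data on a regular daily grid.

```python
import pandas as pd

# Hypothetical toy data: rows out of order, and 2020-01-03 is missing
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-02', '2020-01-01', '2020-01-04']),
    'value': [2.0, 1.0, 4.0],
})

# 1. Chronological order
toy = toy.sort_values('date').reset_index(drop=True)

# 2. Equidistant intervals: every gap between adjacent dates must be identical
gaps = toy['date'].diff().dropna()
is_equidistant = gaps.nunique() == 1

# Reindex to a daily grid; the missing day appears as a NaN row
regular = toy.set_index('date').asfreq('D')
```

If `is_equidistant` is false, reindexing as above makes the gaps explicit so they can be filled later.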

In [None]:
# Chronological order
df = df.sort_values(by='date')

In [None]:
df['difference'] = df['date'] - df['date'].shift(1)
df[['date', 'difference']].head()

In [None]:
# If the total span in days equals the number of differences, every gap is one day: the timestamps are equidistant
df['difference'].sum(), df['difference'].count()

In [None]:
df = df.drop('difference', axis=1)  # assign the result, otherwise the column is kept
df.isnull().sum()

We also note that there are some zero values that appear to be nulls in `drainage_volume` and `river_hydrometry`. We will replace them with NaN values and fill them afterwards. We will plot graphs to see where the missing values occur.

In [None]:
df[df['drainage_volume']==0]

In [None]:
np.max(df['drainage_volume'])

In [None]:
np.inf

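To visualise the gaps, we replace the zeros with NaN and then plot the series with the NaN values filled by `np.inf`. Infinite values cannot be drawn, so those points show up as blank gaps in the line, while the original series (with its zeros) is drawn underneath for comparison.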

In [None]:
copied = df['river_hydrometry'].copy()
df['river_hydrometry'] = df['river_hydrometry'].replace(0, np.nan)

copied2 = df['drainage_volume'].copy()
df['drainage_volume'] = df['drainage_volume'].replace(0, np.nan)

In [None]:
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 15))

# We are showcasing the presence of zero-values in both the features

sns.lineplot(x=df['date'], y=copied, ax=ax[0], color='darkorange', label='original')
sns.lineplot(x=df['date'], y=df['river_hydrometry'].fillna(np.inf), ax=ax[0], color='dodgerblue', label='modified')
ax[0].set_title('Feature: Hydrometry', fontsize=14)
ax[0].set_ylabel(ylabel='Hydrometry', fontsize=14)
ax[0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])


sns.lineplot(x=df['date'], y=copied2, ax=ax[1], color='darkorange', label='original')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[1], color='dodgerblue', label='modified')
ax[1].set_title('Feature: Drainage', fontsize=14)
ax[1].set_ylabel(ylabel='Drainage', fontsize=14)
ax[1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

In [None]:
df.T.loc[:,2300:2340].isna()

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(16, 5))

sns.heatmap(df.T.isna(), cmap='Blues', ax=ax)
ax.set_title('Missing Values', fontsize=16)

# Enlarge the y-axis tick labels (tick.label is not available in recent Matplotlib releases)
ax.tick_params(axis='y', labelsize=15)
    
plt.show()

### We have multiple ways to handle missing values

* Fill NaN with an outlier or zero - Filling missing values with out-of-range values such as 0 or infinity. This is a very naive approach. We could also use a sentinel value such as -999.

* Fill NaN with the mean value - This is also a naive approach and usually not sufficient.

* Fill NaN with the last value - A forward fill that carries the last known value onward; this could work better.

* Fill NaN with linearly interpolated values - Make use of the neighbouring values to fill in the current cell.
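
The four strategies can be compared on a tiny example. This is a minimal sketch on a hypothetical five-point series `s` with a two-step gap, not on the aquifer data.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-step gap
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0])

filled_zero = s.fillna(0)          # gap becomes an artificial dip to 0
filled_mean = s.fillna(s.mean())   # gap becomes the mean of the known values
filled_ffill = s.ffill()           # gap repeats the last known value
filled_interp = s.interpolate()    # gap follows a straight line between neighbours
```

Only the interpolated version moves smoothly from 2 to 5 across the gap, which is why it tends to suit slowly varying measurements.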

In [None]:
f, ax = plt.subplots(nrows=4, ncols=1, figsize = (15, 12))

sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(0), ax=ax[0], color='darkorange')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[0], color='royalblue')
ax[0].set_title('Fill NULL with 0')
ax[0].set_ylabel(ylabel='Drainage Volume')

mean = df['drainage_volume'].mean()
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(mean), ax=ax[1], color='darkorange')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[1], color='royalblue')
ax[1].set_title('Fill NULL with Mean')
ax[1].set_ylabel(ylabel='Drainage Volume')

sns.lineplot(x=df['date'], y=df['drainage_volume'].ffill(), ax=ax[2], color='darkorange')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[2], color='royalblue')
ax[2].set_title('Fill NULL with Ffill')
ax[2].set_ylabel(ylabel='Drainage Volume')

sns.lineplot(x=df['date'], y=df['drainage_volume'].interpolate(), ax=ax[3], color='darkorange')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[3], color='royalblue')
ax[3].set_title('Fill NULL with Interpolate')
ax[3].set_ylabel(ylabel='Drainage Volume')


for i in range(4):
    ax[i].set_xlim([date(2019, 5, 1), date(2019, 10, 1)])

We see that interpolation is the best option for our dataset

In [None]:
df['drainage_volume'] = df['drainage_volume'].interpolate()
df['river_hydrometry'] = df['river_hydrometry'].interpolate()
df['depth_to_groundwater'] = df['depth_to_groundwater'].interpolate()

<h2>Changing the granularity of data</h2>

We can perform resampling for additional information on the given data

* Upsampling - Frequency of samples is increased (Days to hours)
* Downsampling - Frequency of samples is decreased (Days to weeks)

We will perform downsampling with the .resample() function
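
As a minimal sketch of both directions, the example below uses a hypothetical two-week daily frame `daily`. It downsamples to 7-day bins with the mean, and upsamples to a 12-hour grid, where the new rows stay NaN until they are filled or interpolated.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering exactly two weeks
daily = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=14, freq='D'),
    'value': np.arange(14, dtype=float),
})

# Downsample: one row per 7-day bin, aggregated with the mean
weekly = daily.resample('7D', on='date').mean().reset_index()

# Upsample: 12-hourly grid; rows without an observation are NaN
half_daily = daily.set_index('date').resample(pd.Timedelta(hours=12)).asfreq()
```

The choice of aggregation matters when downsampling: a sum suits quantities that accumulate (such as rainfall), while a mean suits levels (such as temperature).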

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=3, sharex=True, figsize=(16,12))
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.lineplot(x=df['date'], y=df['drainage_volume'], color='royalblue', ax=ax[0, 0])
ax[0, 0].set_title('Daily Drainage volume')

resampling = df[['date', 'drainage_volume']].resample('7D', on='date').sum().reset_index(drop=False)
sns.lineplot(x=resampling['date'], y=resampling['drainage_volume'], color='royalblue', ax=ax[1, 0])
ax[1, 0].set_title('Weekly Drainage volume')

resampling = df[['date', 'drainage_volume']].resample('M', on='date').sum().reset_index(drop=False)
sns.lineplot(x=resampling['date'], y=resampling['drainage_volume'], color='royalblue', ax=ax[2, 0])
ax[2, 0].set_title('Monthly Drainage volume')

for i in range(3):
    ax[i, 0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

sns.lineplot(x=df['date'], y=df['temperature'], color='royalblue', ax=ax[0, 1])
ax[0, 1].set_title('Daily temperature')

resampling = df[['date', 'temperature']].resample('7D', on='date').sum().reset_index(drop=False)
sns.lineplot(x=resampling['date'], y=resampling['temperature'], color='royalblue', ax=ax[1, 1])
ax[1, 1].set_title('Weekly temperature')

resampling = df[['date', 'temperature']].resample('M', on='date').sum().reset_index(drop=False)
sns.lineplot(x=resampling['date'], y=resampling['temperature'], color='royalblue', ax=ax[2, 1])
ax[2, 1].set_title('Monthly temperature')

for i in range(3):
    ax[i, 1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
plt.show()

In [None]:
# The weekly graphs are noticeably smoother, so we downsample the data to weekly granularity

downsample = df[['date','depth_to_groundwater', 'temperature','drainage_volume', 'river_hydrometry','rainfall']].resample('7D', on='date').mean().reset_index(drop=False)
df = downsample.copy()

<h2>Stationarity</h2>

Some time-series models, such as ARIMA, assume that the underlying data is stationary. A stationary time series has

* a constant mean that does not depend on time
* a constant variance that does not depend on time
* an autocovariance that depends only on the lag between observations, not on time

The check for stationarity can be done in different ways

* Visually - Plot time series and check for trends or seasonality
* Basic statistics - Split the time series and compute mean/variance of each partition
* Statistical test - Augmented Dickey-Fuller (ADF) test
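
The "basic statistics" check can be sketched as follows. This is a minimal illustration on synthetic data, and the helper `partition_stats` is a hypothetical name, not part of any library. It splits a series into consecutive chunks and compares their means and variances.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical series: white noise (stationary) and the same noise plus a linear trend
noise = rng.normal(loc=0.0, scale=1.0, size=400)
trended = noise + 0.05 * np.arange(400)

def partition_stats(series, n_parts=2):
    """Return (mean, variance) for each of n_parts consecutive chunks."""
    return [(chunk.mean(), chunk.var()) for chunk in np.array_split(series, n_parts)]

noise_stats = partition_stats(noise)    # means of both halves are close to 0
trend_stats = partition_stats(trended)  # mean of the second half is clearly larger
```

Large differences between partitions hint at non-stationarity; this is a quick heuristic, and a formal test is still needed to confirm it.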

In [None]:
window_size = 52 # Our data has weekly granularity, so 52 weeks = 1 year
f, ax = plt.subplots(nrows=2, ncols = 1, figsize=(15, 12))

# The first year values will be NULL as we require 52 previous observations to calculate
sns.lineplot(x=df['date'], y=df['drainage_volume'], ax=ax[0], color='royalblue')
sns.lineplot(x=df['date'], y=df['drainage_volume'].rolling(window_size).mean(), ax=ax[0], color='black')
sns.lineplot(x=df['date'], y=df['drainage_volume'].rolling(window_size).std(), ax=ax[0], color='orange')
ax[0].set_title('Drainage volume - Non stationary \nNon constant mean and non constant variance')
ax[0].set_ylabel('Drainage Volume')
ax[0].set_xlim()

sns.lineplot(x=df['date'], y=df['temperature'], ax=ax[1], color='royalblue')
sns.lineplot(x=df['date'], y=df['temperature'].rolling(window_size).mean(), ax=ax[1], color='black')
sns.lineplot(x=df['date'], y=df['temperature'].rolling(window_size).std(), ax=ax[1], color='orange')
ax[1].set_title('Temperature - Non stationary \nVariance is time-dependent (seasonality)')
ax[1].set_ylabel('Temperature')
ax[1].set_xlim()

<h2>Unit root test</h2>

A unit root is a characteristic of a time series that makes it non-stationary, and the ADF test belongs to the family of unit root tests. A unit root is said to exist in a time series if $\alpha = 1$ in the equation below:


$Y_t = \alpha Y_{t-1} + \beta X_{e} + \epsilon$

where $Y_t$ is the value of the time series at time $t$ and $X_e$ is an exogenous variable. The presence of a unit root implies that the time series is non-stationary.
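
To see why $\alpha = 1$ breaks stationarity, consider a simplified case: no exogenous term, a fixed starting value $Y_0$, and independent errors $\epsilon_t$ with variance $\sigma^2$ (these are simplifying assumptions for illustration, not part of the equation above). Unrolling the recursion gives:

```latex
Y_t = Y_{t-1} + \epsilon_t = Y_0 + \sum_{i=1}^{t} \epsilon_i
\quad\Rightarrow\quad
\operatorname{Var}(Y_t) = t\,\sigma^2
```

The variance grows with $t$, so it is time-dependent. For $|\alpha| < 1$ the same recursion settles to a constant variance of $\sigma^2 / (1 - \alpha^2)$ instead.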

<h3>Augmented Dickey-fuller</h3>

The Augmented Dickey-Fuller (ADF) test is a statistical test of the unit-root type. Unit roots are a cause of non-stationarity.
* H0 - Null hypothesis - Time series has unit root (not stationary)
* H1 - Alternate hypothesis - Time series has no unit root (time series is stationary)

If the null hypothesis can be rejected, we can conclude that the time series is stationary. For the purpose of hypothesis testing, we can work with either p-values or critical values.

* p-value > significance level (default: 0.05): Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
* p-value <= significance level (default: 0.05): Reject the null hypothesis (H0), the data does not have a unit root and is stationary.

If we work with critical values instead, the null hypothesis can be rejected when the test statistic is less than the critical value:

* ADF statistic > critical value: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
* ADF statistic < critical value: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
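
The two decision rules can be sketched as a small helper. `interpret_adf` is a hypothetical name introduced here for illustration. It relies only on the layout of the tuple returned by statsmodels' `adfuller` (statistic, p-value, lags used, number of observations, critical values, information criterion), and the numbers in `example` are made up.

```python
def interpret_adf(adf_result, significance=0.05):
    """Summarise the tuple returned by statsmodels' adfuller().

    adf_result[0] is the test statistic, adf_result[1] the p-value and
    adf_result[4] a dict of critical values keyed by '1%', '5%' and '10%'.
    """
    statistic, p_value, critical_values = adf_result[0], adf_result[1], adf_result[4]
    return {
        # Rule 1: compare the p-value with the significance level
        'reject_by_p_value': p_value <= significance,
        # Rule 2: levels at which the statistic lies below the critical value
        'reject_at_levels': [level for level, crit in critical_values.items()
                             if statistic < crit],
    }

# Made-up numbers in the shape of an adfuller() result, for illustration only
example = (-3.1, 0.026, 5, 590, {'1%': -3.44, '5%': -2.87, '10%': -2.57}, 1234.5)
decision = interpret_adf(example)
```

In this made-up case the null hypothesis is rejected at the 5% and 10% levels but not at 1%, which corresponds to the orange colour in the plotting function used in this notebook.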

In [None]:
from statsmodels.tsa.stattools import adfuller

res = adfuller(df['depth_to_groundwater'].values)
res

In [None]:
df.head()

In [None]:
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 9))

def adfullers(series, title, ax):
    res = adfuller(series)
    significance = 0.05
    adf_res = res[0]
    p = res[1]
    crit_1 = res[4]['1%']
    crit_5 = res[4]['5%']
    crit_10 = res[4]['10%']
    
    # This will let us know at what significance level is our data stationary
    if (p<significance) & (adf_res<crit_1):
        linecolor='green'
    elif (p<significance) & (adf_res<crit_5):
        linecolor='orange'
    elif (p<significance) & (adf_res<crit_10):
        linecolor='red'
    else:
        linecolor='purple'
    sns.lineplot(x=df['date'], y=series, ax=ax, color=linecolor)
    ax.set_title(f'ADF statistic {adf_res}, p-value: {p:0.3f}\n Critical value 1% {crit_1:0.3f} Critical value 5% {crit_5:0.3f} Critical value 10% {crit_10:0.3f}')
    ax.set_ylabel(title)
    
adfullers(df['rainfall'].values, 'Rainfall', ax[0, 0])
adfullers(df['temperature'].values, 'Temperature', ax[1, 0])
adfullers(df['river_hydrometry'].values, 'River_Hydrometry', ax[0, 1])
adfullers(df['drainage_volume'].values, 'Drainage_Volume', ax[1, 1])
adfullers(df['depth_to_groundwater'].values, 'Depth_to_Groundwater', ax[2, 0])

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()

If the data is not stationary but we want to use a model such as ARIMA (that requires this characteristic), the data has to be transformed. The two most common methods to transform a series into a stationary one are:

* Transformation: e.g. log or square root to stabilize non-constant variance
* Differencing: subtracts the previous value from the current value

In [None]:
# Log transforms
df['depth_to_groundwater_log'] = np.log(abs(df['depth_to_groundwater']))
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))
adfullers(df['depth_to_groundwater_log'], 'Transformed Depth to groundwater', ax[0])
sns.histplot(df['depth_to_groundwater_log'], kde=True, ax=ax[1])  # distplot is deprecated in seaborn

Differencing can be done in different orders:
* First order differencing: linear trends with $z_i = y_i - y_{i-1}$
* Second-order differencing: quadratic trends with $z_i = (y_i - y_{i-1}) - (y_{i-1} - y_{i-2})$
* and so on...
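
A short sketch on synthetic sequences shows the effect of each order. The arrays `linear` and `quadratic` are hypothetical toy data.

```python
import numpy as np

t = np.arange(6)
linear = 3 * t + 2      # linear trend: 2, 5, 8, ...
quadratic = t ** 2      # quadratic trend: 0, 1, 4, 9, ...

first_diff_linear = np.diff(linear)              # constant, the trend is removed
first_diff_quadratic = np.diff(quadratic)        # still increasing
second_diff_quadratic = np.diff(quadratic, n=2)  # constant after two differences
```

Each round of differencing shortens the series by one element, which is why the notebook pads the differenced column to keep the original length.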

In [None]:
# First-order differencing
diff = np.diff(df['depth_to_groundwater'])
# np.diff returns one element fewer, so we prepend 0 to keep the original length
df['depth_to_groundwater_diff_1'] = np.append([0], diff)
df['depth_to_groundwater_diff_1']

In [None]:
f, ax=plt.subplots(nrows=1, ncols=1, figsize=(15, 6))
adfullers(df['depth_to_groundwater_diff_1'], 'Difference\n Depth to groundwater', ax)

<h2>Feature creation</h2>

In [None]:
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df['day'] = pd.DatetimeIndex(df['date']).day
df['day_of_year'] = pd.DatetimeIndex(df['date']).dayofyear
df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)  # weekofyear was removed in pandas 2.0
df['quarter'] = pd.DatetimeIndex(df['date']).quarter
df['season'] = df['month'] % 12 // 3 + 1
df[['date', 'year', 'month', 'day', 'day_of_year', 'week_of_year', 'quarter', 'season']].head()

<h2>Cyclic features</h2>

Our new time columns are cyclic in nature. The month, for example, cycles between 1 and 12 every year. Within a year, consecutive months differ by 1, but between two years the `month` column jumps from 12 (December) to 1 (January). This difference of -11 can confuse some models, because December and January are in fact adjacent.

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 3))

sns.lineplot(x=df['date'], y=df['month'], color='royalblue')
ax.set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
plt.show()

In [None]:
# Convert them into cyclic values
df['month_sin'] = np.sin(2*np.pi*df['month']/12)
df['month_cos'] = np.cos(2*np.pi*df['month']/12)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(6,6))
sns.scatterplot(x=df.month_sin, y=df.month_cos, color='royalblue')
plt.show()
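
The benefit of the sine/cosine encoding can be checked numerically. The sketch below uses a hypothetical helper `encode_month` and compares the distance between January and February with the distance between December and January in the encoded space.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distance between consecutive months in the encoded space
dist_jan_feb = np.linalg.norm(encode_month(1) - encode_month(2))
dist_dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
```

Both distances are equal, so the jump from 12 to 1 no longer looks larger than any other month-to-month step.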

<h2>Time series decomposition</h2>

It involves splitting up a series into level, trend, seasonality and noise. The components are elaborated as follows

* Level - Average value in the series
* Trend - Increasing or decreasing value in the series
* Seasonality - Repeating short-term cycle in series
* Noise - Random variation in the series

Decomposition gives us a structured way to think about a time series and to understand problems that arise during analysis and forecasting. Every series has a level and noise; trend and seasonality are optional. We can think of the components as combining additively or multiplicatively:

* **Additive**: $y(t) = Level + Trend + Seasonality + Noise$
* **Multiplicative**: $y(t) = Level * Trend * Seasonality * Noise$


In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

core_columns =  [
    'rainfall', 'temperature', 'drainage_volume', 
    'river_hydrometry', 'depth_to_groundwater'
]

for column in core_columns:
    decomp = seasonal_decompose(df[column], period=52, model='additive', extrapolate_trend='freq')
    df[f"{column}_trend"] = decomp.trend
    df[f"{column}_seasonal"] = decomp.seasonal

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=4, sharex=True, figsize=(16, 8))

for i, column in enumerate(['temperature', 'depth_to_groundwater']):
    res = seasonal_decompose(df[column], period=52, model='additive', extrapolate_trend='freq')
    ax[0,i].set_title('Decomposition of {}'.format(column), fontsize=16)
    res.observed.plot(ax=ax[0,i], legend=False, color='dodgerblue')
    ax[0,i].set_ylabel('Observed', fontsize=14)
    res.trend.plot(ax=ax[1,i], legend=False, color='dodgerblue')
    ax[1,i].set_ylabel('Trend', fontsize=14)
    res.seasonal.plot(ax=ax[2,i], legend=False, color='dodgerblue')
    ax[2,i].set_ylabel('Seasonal', fontsize=14)
    res.resid.plot(ax=ax[3,i], legend=False, color='dodgerblue')
    ax[3,i].set_ylabel('Residual', fontsize=14)

plt.show()

<h2>Lag</h2>

We shift each seasonal component with `shift()` to create lagged and leading versions, so that we can compare how they correlate with the other variables.
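
The idea can be sketched on toy data. In this minimal example, the hypothetical series `effect` repeats `cause` two steps later, and shifting `cause` by the same lag brings the two into line.

```python
import pandas as pd

# Hypothetical pair of series: `effect` repeats `cause` two steps later
cause = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0])
effect = cause.shift(2)

# Correlation at the same time step versus after lagging `cause` by two steps
corr_same_time = cause.corr(effect)
corr_lagged = cause.shift(2).corr(effect)
```

A positive argument to `shift()` moves past values forward in time, while a negative argument pulls future values back. `corr()` ignores the NaN rows that the shift creates at the edges.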

In [None]:
weeks_in_month = 4

for column in core_columns:
    df[f'{column}_seasonal_shift_b_2m'] = df[f'{column}_seasonal'].shift(-2 * weeks_in_month)
    df[f'{column}_seasonal_shift_b_1m'] = df[f'{column}_seasonal'].shift(-1 * weeks_in_month)
    df[f'{column}_seasonal_shift_1m'] = df[f'{column}_seasonal'].shift(1 * weeks_in_month)
    df[f'{column}_seasonal_shift_2m'] = df[f'{column}_seasonal'].shift(2 * weeks_in_month)
    df[f'{column}_seasonal_shift_3m'] = df[f'{column}_seasonal'].shift(3 * weeks_in_month)

<h2>EDA</h2>

In [None]:
f, ax = plt.subplots(nrows=5, ncols=1, figsize=(15, 12))
f.suptitle('Seasonal Components of Features', fontsize=16)

for i, column in enumerate(core_columns):
    sns.lineplot(x=df['date'], y=df[column + '_seasonal'], ax=ax[i], color='royalblue', label='P25')
    ax[i].set_ylabel(ylabel=column, fontsize=14)
    ax[i].set_xlim([date(2017, 9, 30), date(2020, 6, 30)])
    
plt.tight_layout()
plt.show()

Through this we can observe some seasonal patterns:

* depth_to_groundwater: reaches its maximum around May/June and its minimum around November
* temperature: reaches its maximum around August and its minimum around January
* drainage_volume: reaches its minimum around July.
* river_hydrometry: reaches its maximum around February/March and its minimum around September

In [None]:
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))

corrmat = df[core_columns].corr()

sns.heatmap(corrmat, annot=True, vmin=-1, vmax=1, cmap='coolwarm_r', ax=ax[0])
ax[0].set_title('Correlation Matrix of Core features', fontsize=16)

shifted_cols = [
    'depth_to_groundwater_seasonal',
    'temperature_seasonal_shift_b_2m',
    'drainage_volume_seasonal_shift_2m',
    'river_hydrometry_seasonal_shift_3m'
]

corrmat = df[shifted_cols].corr()
sns.heatmap(corrmat, annot=True, vmin=-1, vmax=1, cmap='coolwarm_r', ax=ax[1])
ax[1].set_title('Correlation Matrix of shifted features', fontsize=16)
plt.tight_layout()
plt.show()

<h2>Autocorrelation analysis</h2>

After a time series has been stationarized by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series. By looking at the autocorrelation function (ACF) and partial autocorrelation (PACF) plots of the differenced series, you can tentatively identify the numbers of AR and/or MA terms that are needed.

* **Autocorrelation Function (ACF):** The correlation between the series and a copy of itself shifted by a given number of lags. In the differenced series, a sharp cut-off after lag q suggests q moving-average terms, so the ACF guides the choice of the MA parameter (q).
* **Partial Autocorrelation Function (PACF):** The correlation at a given lag after the effect of all shorter lags has been removed. A sharp cut-off after lag p suggests p autoregressive terms, so the PACF guides the choice of the AR parameter (p). For example, if p = 3, the three previous periods of the time series are used in the autoregressive portion of the calculation.
* **Order of differencing (d):** In an ARIMA model we transform a time series into a stationary one (a series without trend or seasonality) using differencing. d is the number of differencing transformations the time series requires to become stationary.

Autocorrelation plots also help in detecting seasonality.
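
To make the ACF concrete, here is a minimal sketch of the sample autocorrelation computed by hand. `sample_acf` is a hypothetical helper written for illustration (the plots in this notebook use statsmodels instead), and `seasonal` is a synthetic series with a period of 4 steps.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    denom = np.sum(centred ** 2)
    return np.array([
        np.sum(centred[lag:] * centred[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

# Hypothetical series that repeats every 4 steps
seasonal = np.tile([1.0, 0.0, -1.0, 0.0], 25)
acf_values = sample_acf(seasonal, max_lag=8)
```

The autocorrelation is strongly positive at lags 4 and 8 and strongly negative at lag 2, which is the kind of repeating pattern that reveals seasonality in an ACF plot.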

In [None]:
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(df['depth_to_groundwater_diff_1'])
plt.show()

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(16, 8))

plot_acf(df['depth_to_groundwater_diff_1'], lags=100, ax=ax[0])
plot_pacf(df['depth_to_groundwater_diff_1'], lags=100, ax=ax[1])

plt.show()

<h2>Modeling for time series</h2>

Our time series can be of two forms - Univariate or multivariate
* Univariate - single time-dependent variable
* Multivariate - multiple time-dependent variables

We will see how to do cross-validation using time-series data

In [None]:
from sklearn.model_selection import TimeSeriesSplit
X=df['date']
y=df['depth_to_groundwater']
folds = TimeSeriesSplit(n_splits=3)

We use the `TimeSeriesSplit` class provided by scikit-learn for the data. This cross-validation object is a variation of KFold. In the kth split, it returns the first k folds as the train set and the (k+1)th fold as the test set.

Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

Training set has size: `i*n_samples//(n_splits+1) + n_samples % (n_splits+1)` in the `i`th split. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate. The training size will keep increasing.

Below we can see two cases:

1. We use `TimeSeriesSplit` as is. The training data keeps growing, and the data that follows it is used as the testing set.

2. We keep the training data size constant: the test set of the previous split becomes the training set of the new split, and the data points that follow it become the new testing data.
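
The two cases can be sketched on a toy index of 12 time steps. Note one assumption: the constant-size case below uses the `max_train_size` parameter of `TimeSeriesSplit`, whereas the plotting cell in this notebook obtains the same effect by slicing the tail of the training set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

samples = np.arange(12)  # hypothetical series of 12 time steps

# Case 1: expanding window, the training set grows with each split
expanding = [(train.tolist(), test.tolist())
             for train, test in TimeSeriesSplit(n_splits=3).split(samples)]

# Case 2: constant training size, capped with max_train_size
sliding = [(train.tolist(), test.tolist())
           for train, test in TimeSeriesSplit(n_splits=3, max_train_size=3).split(samples)]
```

In both cases the test indices always come after the training indices, so the model is never evaluated on data that precedes what it was trained on.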

In [None]:
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(16, 9))
for i, (train_index, valid_index) in enumerate(folds.split(X)):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]
    
    sns.lineplot(x=X_train, y=y_train, ax=ax[i, 0], color='royalblue', label='train')
    sns.lineplot(x=X_train[(len(X_train) - len(X_valid)):len(X_train)], y=y_train[(len(X_train)-len(X_valid)):len(X_train)], ax=ax[i,1], color='royalblue', label='train')
    for j in range(2):
        sns.lineplot(x=X_valid, y=y_valid, ax=ax[i, j], color='darkorange', label='validation')
    ax[i, 0].set_title(f"Rolling Window with Adjusting Training Size (Split {i+1})", fontsize=16)
    ax[i, 1].set_title(f"Rolling Window with Constant Training Size (Split {i+1})", fontsize=16)
    
for i in range(3):
    ax[i, 0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
    ax[i, 1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
    
plt.tight_layout()
plt.show()

<h2>Models for univariate time series</h2>

We will start with univariate time series analysis, where only one variable varies over time. An example is a temperature sensor: every second, we record a single value, the temperature.

In [None]:
training_size = int(0.85*len(df))
test_size = len(df)-training_size

single_var = df[['date', 'depth_to_groundwater']].copy()
single_var.columns=['ds','y']

training_set = single_var.iloc[:training_size, :]
# Prophet expects the columns to be named 'ds' (timestamp) and 'y' (value);
# single_var was already renamed above, so a copy is all we need
cleaned = training_set.copy()
x_train, y_train = pd.DataFrame(single_var.iloc[:training_size, 0]), pd.DataFrame(single_var.iloc[:training_size, 1])
x_valid, y_valid = pd.DataFrame(single_var.iloc[training_size:, 0]), pd.DataFrame(single_var.iloc[training_size:, 1])

<h2>Prophet</h2>

Our first model is Prophet. It is an open-source library developed by Facebook for univariate forecasting. It implements an additive time series forecasting model that supports trends, seasonality and holidays.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

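try:
    from prophet import Prophet  # package name since version 1.0
except ImportError:
    from fbprophet import Prophet  # older releases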

model = Prophet()
model.fit(cleaned)

In [None]:
y_pred = model.predict(x_valid)
score_mae = mean_absolute_error(y_valid, y_pred.tail(test_size)['yhat'])
score_rmse = math.sqrt(mean_squared_error(y_valid, y_pred.tail(test_size)['yhat']))

In [None]:
Fore

In [None]:
print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))

In [None]:
f, ax = plt.subplots(1, figsize=(15, 10))

model.plot(y_pred, ax=ax)
sns.lineplot(x=x_valid['ds'], y=y_valid['y'], ax=ax, color='orange', label='Observed')
ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
ax.set_xlabel('Date')
ax.set_ylabel('Depth to Groundwater')
plt.show()

<h2>ARIMA model</h2>
The auto-regressive integrated moving average (ARIMA) model describes the autocorrelations in the data. **The autoregressive and moving-average parts assume the data to be stationary**; the "integrated" part applies differencing to get there.

<h3>Steps to analyze ARIMA</h3>

* Check stationarity - If time series has trend/seasonality component, must be made stationary before using ARIMA.

* Difference - If time series is non-stationary, needs to be stationarized through differencing. Take first difference and check stationarity, if not, do it with different forms of differencing. 

* Filter out validation sample - Validate how accurate a model is. We use the split of training and validation sets.

* Select AR and MA terms - Use ACF and PACF to decide whether to have AR terms, MA terms or both.

* Build the model - Build model and set number of periods to forecast to N

* Validate model - Compare predicted values to the actuals in the validation sample.

In [None]:
# statsmodels.tsa.arima_model was removed in statsmodels 0.13; this cell uses the current API
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict

model = ARIMA(y_train, order=(1,1,1))
model_fit = model.fit()
# Forecast one step per validation point; the current API returns only the point forecasts
y_pred = np.asarray(model_fit.forecast(steps=len(y_valid)))

score_mae = mean_absolute_error(y_valid, y_pred)
score_rmse = math.sqrt(mean_squared_error(y_valid, y_pred))

print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))

In [None]:
len(y_train)

In [None]:
len(y_valid)

In [None]:
# The model will start predicting from 510+ onwards
# We will plot the validation data (which succeeds the current data) to check
f, ax = plt.subplots(1, figsize=(15, 10))

plot_predict(model_fit, 1, 600, ax=ax)
sns.lineplot(x=x_valid.index, y=y_valid['y'], ax=ax, color='black', label='Ground truth')
ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
ax.set_xlabel('Date')
ax.set_ylabel('Depth to Groundwater')
plt.show()

In [None]:
# The model will start predicting from 510+ onwards
# We will plot the validation data (which succeeds the current data) to check
f, ax = plt.subplots(1, figsize=(15, 10))
sns.lineplot(x=x_valid.index, y=y_pred, ax=ax, color='red', label='predicted')
sns.lineplot(x=x_valid.index, y=y_valid['y'], ax=ax, color='black', label='Ground truth')

ax.set_xlabel('Date')
ax.set_ylabel('Depth to Groundwater')

<h2>Auto-ARIMA</h2>

In [None]:
!pip install pmdarima

In [None]:
import pmdarima as pm

model = pm.auto_arima(
    y_train, start_p=1, start_q=1, test='adf', max_p=3, max_q=3,
    m=1, d=None, seasonal=False, start_P=0, D=0, trace=True,
    error_action='ignore', suppress_warnings=True, stepwise=True
)
print(model.summary())

In [None]:
model.plot_diagnostics(figsize=(16,8))
plt.show()

The plots give the following insights:

* Standardized residual - The residual errors fluctuate around a mean of zero with roughly uniform variance, staying within (-4, 4)

* Histogram plus estimated density - The plot suggests a normal distribution with mean zero

* Normal Q-Q - The blue dots lie on the red line for the most part, which suggests little skew

* Correlogram - The ACF plot shows the residual errors are not autocorrelated

<h2>RNNs and LSTMs</h2>

We will create a multi-layered LSTM model to forecast our time series. We will go through the following steps:

* Creation of dataset
* Feature normalization
* Splitting data
* Reshaping and cleaning
* Model creation and training
* Predicting

In [None]:
from sklearn.preprocessing import MinMaxScaler

df_rnn = single_var['y']
df_rnn

In [None]:
df_rnn.shape

In [None]:
# The scaler requires a 2D array, so we select the column as a one-column DataFrame
testing = single_var[['y']]
testing.shape

In [None]:
labels = testing.values
scaler = MinMaxScaler(feature_range=(-1, 0))
labels_scaled = scaler.fit_transform(labels)

labels_scaled[:20]

In [None]:
labels_scaled.shape

In [None]:
training_size

In [None]:
rolling_window = 52
train, test = labels_scaled[:training_size-rolling_window, :], labels_scaled[training_size-rolling_window:, :]

In [None]:
train.shape

In [None]:
test.shape

In [None]:
def create_dataset(dataset, look_back=1):
    X, Y = [], []
    # start from 52
    # Take 0:51, append it to X
    # Take the y-value at 52, append it to y
    for i in range(look_back, len(dataset)):
        a = dataset[i-look_back:i, 0]
        X.append(a)
        Y.append(dataset[i, 0])
    return np.array(X), np.array(Y)

x_train, y_train = create_dataset(train, rolling_window)
x_test, y_test = create_dataset(test, rolling_window)

**So what does the above code do?**

The code above builds a dataset with dimensions (number of time steps - look_back) × look_back. Each record represents the following:

The first record holds the first 52 values of the time series (indices 0 to 51), and its y-value is the value at index 52.

The second record holds the values at indices 1 to 52, and its y-value is the value at index 53.
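
The windowing is easier to see on a tiny sequence. The sketch below uses a hypothetical helper `make_windows`, which mirrors the logic of `create_dataset` for a 1-D input, with a window of 3 instead of 52.

```python
import numpy as np

def make_windows(values, look_back):
    """Turn a 1-D sequence into (X, y) pairs: look_back inputs and the next value."""
    values = np.asarray(values)
    X = np.array([values[i - look_back:i] for i in range(look_back, len(values))])
    y = values[look_back:]
    return X, y

# Toy sequence 0..5 with a window of 3
X_toy, y_toy = make_windows(np.arange(6), look_back=3)
```

Here `X_toy` holds the rows [0, 1, 2], [1, 2, 3] and [2, 3, 4], and `y_toy` holds the values that follow each row: 3, 4 and 5.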

In [None]:
x_train.shape

In [None]:
x_train[0]

In [None]:
y_train.shape

In [None]:
# Reshape the data as the input expects - [samples, time steps/sequence, features]
# The data will be passed in as a sequence
# If we are stacking LSTMs, we need to return sequences
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))

print(len(x_train), len(x_test))

In [None]:
x_train[3]

In [None]:
x_train.shape

In [None]:
y_train.shape

In [None]:
from keras.models import Sequential
from keras.layers import Dense, LSTM 

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(24))
model.add(Dense(1))

model.compile(optimizer='adam', loss='mse')

In [None]:
model.fit(x_train, y_train, batch_size=1, epochs=5, validation_data=(x_test, y_test))
model.summary()

Our sequences in the train set are in the format:

* 1st row - [0, 1, 2, 3, 4, ...]
* 2nd row - [1, 2, 3, 4, 5, ...]
* 3rd row - [2, 3, 4, 5, 6, ...]

Each row therefore produces one output: a one-step-ahead forecast made from the `rolling_window` values that precede it.
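The sketch below illustrates how each row lines up with the time index it predicts. It assumes a stand-in "model" that simply repeats the last value of each window (a naive persistence forecast, not the LSTM), and the names `toy`, `rows_3d`, `naive_pred` and `target_idx` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

look_back = 3                    # stand-in for rolling_window
toy = np.arange(8, dtype=float)  # toy series: 0, 1, ..., 7

# All windows of length look_back; the last window is dropped because
# there is no later value to serve as its target
rows = sliding_window_view(toy, look_back)[:-1]

# Same layout the LSTM expects: [samples, time steps, features]
rows_3d = rows.reshape(rows.shape[0], rows.shape[1], 1)

# Stand-in model: one output per row (here, the last value in the window)
naive_pred = rows_3d[:, -1, 0]

# The i-th output is a forecast for time index i + look_back
target_idx = np.arange(look_back, len(toy))

print(rows_3d.shape)    # (5, 3, 1)
print(naive_pred)       # [2. 3. 4. 5. 6.]
print(toy[target_idx])  # [3. 4. 5. 6. 7.]
```

The first `look_back` points of the series never receive a prediction, which is why the number of outputs is shorter than the series by exactly `look_back`.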

In [None]:
x_train

In [None]:
# Let's predict with the model
# We are passing in sequences and getting an output for each
train_predict = model.predict(x_train)
test_predict = model.predict(x_test)

In [None]:
train_predict.shape

In [None]:
# Re-scale the predictions back
train_predict = scaler.inverse_transform(train_predict)
y_train = scaler.inverse_transform([y_train])

test_predict = scaler.inverse_transform(test_predict)
y_test = scaler.inverse_transform([y_test])

# RMSE and MAE errors
score_rmse = np.sqrt(mean_squared_error(y_test[0], test_predict[:,0]))
score_mae = mean_absolute_error(y_test[0], test_predict[:,0])
print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))

**NOTE**

Here we are not predicting the future. We are only predicting the labels from data we already have. To predict genuinely future values, we would make one prediction, append it to the input window, drop the oldest value, and then make the next prediction.

In this notebook we have the true sequence at every step (it is present in `x_test`), so instead of feeding our own predictions back in, we use the known values up to that point to predict the next one.

This would not be possible without the dataset, as we would not know which value to append without predicting it first.
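The predict, append, drop loop described in this note could look roughly like the sketch below. It is not used anywhere in this notebook; `recursive_forecast`, `predict_fn` and `toy_predict` are hypothetical names, and the toy predictor (last value plus the last observed step) only stands in for a trained model.

```python
import numpy as np

def recursive_forecast(predict_fn, last_window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back into the window."""
    window = np.asarray(last_window, dtype=float).copy()
    forecasts = []
    for _ in range(n_steps):
        next_value = float(predict_fn(window))
        forecasts.append(next_value)
        # Drop the oldest value and append the new prediction
        window = np.append(window[1:], next_value)
    return np.array(forecasts)

# Hypothetical stand-in predictor: continue the most recent step
def toy_predict(window):
    return window[-1] + (window[-1] - window[-2])

toy_forecast = recursive_forecast(toy_predict, [1.0, 2.0, 3.0], n_steps=4)
print(toy_forecast)  # [4. 5. 6. 7.]
```

With the LSTM above, `predict_fn` would presumably wrap something like `model.predict(window.reshape(1, -1, 1))[0, 0]` on scaled values, with the results passed through `scaler.inverse_transform` afterwards. Errors compound in this setup because each forecast is built on earlier forecasts, so the scores would likely be worse than the one-step-ahead scores reported here.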

In [None]:
x_train_ticks = single_var.head(training_size)['ds']
y_train = single_var.head(training_size)['y']
x_test_ticks = single_var.tail(test_size)['ds']

# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(6)
f.set_figwidth(15)

sns.lineplot(x=x_train_ticks, y=y_train, ax=ax, label='Train Set')
sns.lineplot(x=x_test_ticks, y=test_predict[:,0], ax=ax, color='green', label='Prediction')
sns.lineplot(x=x_test_ticks, y=y_test[0], ax=ax, color='orange', label='Ground truth')

ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
ax.set_xlabel(xlabel='Date', fontsize=14)
ax.set_ylabel(ylabel='Depth to Groundwater', fontsize=14)

plt.show()

<h2>Multivariate Prophet</h2>

In [None]:
feature_columns = [
    'rainfall',
    'temperature',
    'drainage_volume',
    'river_hydrometry',
]
target_column = ['depth_to_groundwater']

train_size = int(0.85 * len(df))

multivariate_df = df[['date'] + target_column + feature_columns].copy()
multivariate_df.columns = ['ds', 'y'] + feature_columns

train = multivariate_df.iloc[:train_size, :]
# Split the feature columns and label column
x_train, y_train = pd.DataFrame(multivariate_df.iloc[:train_size, [0,2,3,4,5]]), pd.DataFrame(multivariate_df.iloc[:train_size, 1])
x_valid, y_valid = pd.DataFrame(multivariate_df.iloc[train_size:, [0,2,3,4,5]]), pd.DataFrame(multivariate_df.iloc[train_size:, 1])

train.head()

In [None]:
x_train.head()

In [None]:
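# The package was renamed from fbprophet to prophet in v1.0; try the new name first
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet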


# Train the model
model = Prophet()
model.add_regressor('rainfall')
model.add_regressor('temperature')
model.add_regressor('drainage_volume')
model.add_regressor('river_hydrometry')

model.fit(train)
y_pred = model.predict(x_valid)

# Calculate metrics
score_mae = mean_absolute_error(y_valid, y_pred['yhat'])
score_rmse = math.sqrt(mean_squared_error(y_valid, y_pred['yhat']))

print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))

In [None]:
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(6)
f.set_figwidth(15)

model.plot(y_pred, ax=ax)
sns.lineplot(x=x_valid['ds'], y=y_valid['y'], ax=ax, color='orange', label='Ground truth')

ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
ax.set_xlabel(xlabel='Date', fontsize=14)
ax.set_ylabel(ylabel='Depth to Groundwater', fontsize=14)

plt.show()