**Introduction**

> Vector autoregression (VAR) is a *stochastic process model used to capture the linear interdependencies among multiple time series*. VAR models generalize the univariate autoregressive model (AR model) by allowing for more than one evolving variable. All variables in a VAR enter the model in the same way: *each variable has an equation explaining its evolution based on its own lagged values, the lagged values of the other model variables, and an error term*.

Taken from: [Vector autoregression](https://en.wikipedia.org/wiki/Vector_autoregression)
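
In equation form (standard textbook notation, not part of the quoted passage), a VAR of order $p$ for a vector $y_t$ of $k$ series is usually written as below. Here $c$ is a vector of intercepts, each $A_i$ is a $k \times k$ coefficient matrix, and $u_t$ is the error term. The second line spells out the first equation of a two-variable, one-lag system to show how each variable depends on its own past and on the past of the other variable.

```latex
y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_p y_{t-p} + u_t

y_{1,t} = c_1 + a_{11}\, y_{1,t-1} + a_{12}\, y_{2,t-1} + u_{1,t}
```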

**Practical Use**

The model forecasts several quantities that can be put to practical use.
For example, predicting the number of patients who will need hospital care can help a country prepare for the expected load.
The same holds for predicting the number of tests to be performed, the number of patients expected to enter home confinement, and more.

**Related Work and Credits**

[Analysis and Prediction on Coronavirus (Italy)](https://www.kaggle.com/vanshjatana/analysis-and-prediction-on-coronavirus-italy/data)

Large parts of the code snippets used for VAR modeling were taken from: [Vector Autoregression (VAR) – Comprehensive Guide with Examples in Python](https://www.machinelearningplus.com/time-series/vector-autoregression-examples-python/)

**Imports**

In [None]:
import pandas as pd
import numpy as np

from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import grangercausalitytests
from statsmodels.tools.eval_measures import rmse, aic

import matplotlib.pyplot as plt
%matplotlib inline

import datetime

**Settings**

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is no longer accepted by recent pandas
pd.plotting.register_matplotlib_converters()
np.set_printoptions(suppress=True)

**Reading Data**

In [None]:
ita_regional=pd.read_csv("../input/covid19-in-italy/covid19_italy_region.csv")

**Basic EDA**

In [None]:
ita_regional.info()

In [None]:
# Checking the percentage of missing data in each column
per_missing = ita_regional.isna().sum()*100/len(ita_regional)
per_missing.sort_values(ascending=False)

In [None]:
# Check for the period covered by the data (total # of days)
ita_regional['Date'] = pd.to_datetime(ita_regional['Date']).dt.normalize()
(ita_regional.Date.max()-ita_regional.Date.min()) + datetime.timedelta(days=1)

In [None]:
var_df = ita_regional.groupby('Date')[['HospitalizedPatients', 'IntensiveCarePatients', 'TotalHospitalizedPatients',
                                      'HomeConfinement', 'CurrentPositiveCases', 'NewPositiveCases',
                                      'Recovered', 'Deaths', 'TotalPositiveCases', 'TestsPerformed']].sum().reset_index()
print("df shape: ", var_df.shape)

In [None]:
var_df.head()

In [None]:
# Dropping columns that are part of other columns (e.g., 
#    TotalHospitalizedPatients = HospitalizedPatients + IntensiveCarePatients)

var_df.drop(['HospitalizedPatients', 'IntensiveCarePatients', 'NewPositiveCases', 'TotalPositiveCases', 'CurrentPositiveCases'],
            axis=1, inplace=True)

var_df.head(n=5)

In [None]:
type(var_df['Date'])

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(22,5))

for ycol, ax in zip(['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed'], axes):

    var_df.plot(kind='line', x='Date', y=ycol, ax=ax, alpha=0.5, color='r')

**VAR**

**Checking for Causality**

> Granger causality is a concept of causality derived from the notion that causes may not occur after effects and that *if one variable is the cause of another*, knowing the status on the cause at an earlier point in time can enhance prediction of the effect at a later point in time (Granger, 1969; Lütkepohl, 2005, p. 41)

Taken from: [Vector Autoregressive (VAR) Models and Granger Causality in Time Series Analysis in Nursing Research: Dynamic Changes Among Vital Signs Prior to Cardiorespiratory Instability Events as an Example](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5161241/)

In [None]:
def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False, maxlag=5):    
    
    """Check Granger Causality of all possible combinations of the Time series.
    The rows are the response variable, columns are predictors. 

    data      : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df  

In [None]:
grangers_causation_matrix(var_df, variables = ['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed']) 

The test's *null hypothesis is that the coefficients of the corresponding past values are zero; that is, X does not Granger-cause Y*. 
Each cell holds the smallest p-value across the tested lags (rows are the response Y, columns are the predictor X). The p-values in the table are below our significance level (0.05), which implies that the null hypothesis can be rejected.
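
As a minimal sketch of how such a table can be read, the snippet below lists the (predictor, response) pairs whose p-value falls below the significance level and skips the diagonal. The series names, the p-values, and the helper `significant_pairs` are all invented for illustration; they are not output of this notebook.

```python
SIGNIFICANCE = 0.05

# Hypothetical minimum p-values, keyed as p_values[response][predictor].
# These numbers are made up for illustration; they are not notebook output.
p_values = {
    'series_a': {'series_a': 1.0, 'series_b': 0.01},
    'series_b': {'series_a': 0.20, 'series_b': 1.0},
}

def significant_pairs(p_values, alpha=SIGNIFICANCE):
    """Return (predictor, response) pairs whose p-value is below alpha."""
    return [(x, y)
            for y, row in p_values.items()
            for x, p in row.items()
            if x != y and p < alpha]

pairs = significant_pairs(p_values)
for x, y in pairs:
    print(f'{x} Granger-causes {y} (p < {SIGNIFICANCE})')
```

With the real matrix returned by `grangers_causation_matrix`, the same check could be applied after stripping the `_x` and `_y` suffixes from the labels.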

**Checking for Cointegration**

> Cointegration tests analyze non-stationary time series— processes that have variances and means that vary over time. In other words, the method allows you to estimate the long-run parameters or equilibrium in systems with unit root variables (Rao, 2007).

Taken from: [Cointegration: Definition, Examples, Tests](https://www.statisticshowto.datasciencecentral.com/cointegration/)

More information about the Python implementation and the test results can be found here:

[Test](https://www.statsmodels.org/dev/generated/statsmodels.tsa.vector_ar.vecm.coint_johansen.html)

[Results](https://www.statsmodels.org/dev/generated/statsmodels.tsa.vector_ar.vecm.JohansenTestResult.html#statsmodels.tsa.vector_ar.vecm.JohansenTestResult)

In [None]:
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def cointegration_test(df, alpha=0.05): 
    """Perform Johansen's Cointegration Test and Report Summary"""
    out = coint_johansen(df,-1,5)
    d = {'0.90':0, '0.95':1, '0.99':2}
    traces = out.lr1
    cvts = out.cvt[:, d[str(1-alpha)]]
    def adjust(val, length= 6): return str(val).ljust(length)

    # Summary
    print('Name   ::  Test Stat > C(95%)    =>   Signif  \n', '--'*20)
    for col, trace, cvt in zip(df.columns, traces, cvts):
        print(adjust(col), ':: ', adjust(round(trace,2), 9), ">", adjust(cvt, 8), ' =>  ' , trace > cvt)

In [None]:
cointegration_test(var_df[['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed']])

**Train-Test Split**

In [None]:
test_frac = 0.25
n_test = round(len(var_df) * test_frac)
df_train, df_test = var_df[0:-n_test], var_df[-n_test:]
# df_train_copy = df_train.copy()
# Reassign instead of dropping in place: df_train is a slice of var_df
df_train = df_train.drop(columns='Date')

**Unit Root Test (checking for stationarity)**

> In statistics, a unit root test tests whether a time series variable is non-stationary and possesses a unit root. *The null hypothesis is generally defined as the presence of a unit root and the alternative hypothesis is either stationarity*, trend stationarity or explosive root depending on the test used.

Taken from: [Unit root test](https://en.wikipedia.org/wiki/Unit_root_test)

In [None]:
def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform ADFuller to test for Stationarity of given series and print report"""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic':round(r[0], 4), 'pvalue':round(r[1], 4), 'n_lags':round(r[2], 4), 'n_obs':r[3]}
    p_value = output['pvalue'] 
    def adjust(val, length= 6): return str(val).ljust(length)

    # Print Summary
    print(f'    Augmented Dickey-Fuller Test on "{name}"', "\n   ", '-'*47)
    print(f' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(f' Significance Level    = {signif}')
    print(f' Test Statistic        = {output["test_statistic"]}')
    print(f' No. Lags Chosen       = {output["n_lags"]}')

    for key,val in r[4].items():
        print(f' Critical value {adjust(key)} = {round(val, 3)}')

    if p_value <= signif:
        print(f" => P-Value = {p_value}. Rejecting Null Hypothesis.")
        print(f" => Series is Stationary.")
    else:
        print(f" => P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
        print(f" => Series is Non-Stationary.")

In [None]:
# ADF Test on each column
for name, column in df_train.items():  # iteritems() was removed in pandas 2.0
    adfuller_test(column, name=column.name)
    print('\n')

In [None]:
# 1st difference
df_differenced = df_train.diff().dropna()

In [None]:
# ADF Test on each column
for name, column in df_differenced.items():  # iteritems() was removed in pandas 2.0
    adfuller_test(column, name=column.name)
    print('\n')

In [None]:
# 2nd Difference
df_differenced = df_differenced.diff().dropna()

In [None]:
# ADF Test on each column
for name, column in df_differenced.items():  # iteritems() was removed in pandas 2.0
    adfuller_test(column, name=column.name)
    print('\n')

As you can see, after differencing the series twice, 2 columns are stationary at the 5% significance level, 1 column is stationary at the 0.1% significance level, and 2 columns remain non-stationary at any plausible significance level.
This is not ideal. However, because the time series are short and every round of differencing drops another observation, I decided to stop at 2 differences rather than add more.  
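
To make the differencing step concrete, here is a small sketch on a made-up series with a quadratic trend, written in plain Python so it runs without the notebook's data. The helper `diff` is a hypothetical stand-in for `DataFrame.diff().dropna()`: it returns a list one element shorter instead of a leading `NaN` that is then dropped.

```python
def diff(values):
    """First difference: each value minus the one before it (one element shorter)."""
    return [b - a for a, b in zip(values, values[1:])]

# Toy series with a quadratic trend (invented for illustration only)
series = [t * t for t in range(1, 8)]   # 1, 4, 9, 16, 25, 36, 49

first = diff(series)    # still trending upwards
second = diff(first)    # constant: the trend has been removed

print(first)
print(second)
```

Each round of differencing costs one observation, which is why differencing a short series many times quickly leaves too little data to fit a model.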

**Modeling**

In [None]:
model = VAR(df_differenced[['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed']])

fitted = model.fit(6)
fitted.summary()

Choosing the number of lags for the model is a matter of trial and error. It can be adjusted according to the regression results (above), the Durbin-Watson test results (explained in a moment), and other metrics (e.g., RMSE, MAE).

**Checking for Residuals' Autocorrelation**

We'll use the Durbin-Watson test for this (its statistic is denoted *d*):

> The value of d always lies between 0 and 4. 
> 
> d = 2 indicates no autocorrelation.
> 
> If d < 2, there is evidence of positive serial correlation. As a rough rule of thumb, if d < 1.0, there may be cause for alarm. Small values of d indicate successive error terms are positively correlated.
> 
> If d > 2, successive error terms are negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance.

Taken from (modified by the author): [Durbin–Watson statistic](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic)
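
As a rough illustration of where these thresholds come from, the statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it by hand on two made-up residual sequences; `durbin_watson_stat` is a hypothetical helper written for this example, not the `statsmodels` function used in the next cell.

```python
def durbin_watson_stat(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Toy residuals (invented for illustration only)
positively_correlated = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs of the same sign
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]              # sign flips every step

d_pos = durbin_watson_stat(positively_correlated)
d_alt = durbin_watson_stat(alternating)
print(round(d_pos, 2), round(d_alt, 2))
```

Residuals that stay on the same side of zero give a value well below 2, while residuals that flip sign at every step give a value well above 2.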


In [None]:
from statsmodels.stats.stattools import durbin_watson
out = durbin_watson(fitted.resid)

for col, val in zip(var_df[['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed']], out):
    print(col, ':', round(val, 2))

**Forecasting**

In [None]:
# Get the lag order
lag_order = fitted.k_ar

# Input data for forecasting
forecast_input = df_differenced.values[-lag_order:]
forecast_input

In [None]:
var_df_forecast = var_df[['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed']]

fc = fitted.forecast(y=forecast_input, steps=n_test)
df_forecast = pd.DataFrame(fc, index=var_df_forecast.index[-n_test:], columns=var_df_forecast.columns + '_2d')
df_forecast

**Converting the Forecasts Back to Original Values**
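
The model was fitted on twice-differenced data, so its forecasts are second differences. Undoing this takes two cumulative sums, each seeded with the last known values of the training data. The sketch below shows the idea on a made-up series in plain Python; `diff` and `undo_second_diff` are hypothetical helpers for this example, and the "forecast" is simply the true second differences, so the reconstruction should match the held-out values exactly.

```python
from itertools import accumulate

def diff(values):
    """First difference of a list (one element shorter than the input)."""
    return [b - a for a, b in zip(values, values[1:])]

def undo_second_diff(last_two, second_diffs):
    """Rebuild levels from second differences, given the last two known levels."""
    prev, last = last_two
    # Roll back the 2nd difference: last known first difference + running sum
    first_diffs = [(last - prev) + s for s in accumulate(second_diffs)]
    # Roll back the 1st difference: last known level + running sum
    return [last + s for s in accumulate(first_diffs)]

# Toy series (invented): pretend the last three values are the "future"
series = [3, 8, 15, 20, 32, 41, 45]
train, future = series[:4], series[4:]

# Second differences that a perfect forecast would produce for the future steps
future_second_diffs = diff(diff(series))[-len(future):]

rebuilt = undo_second_diff(train[-2:], future_second_diffs)
print(rebuilt)
```

Because every step is a running sum, an error in an early forecast step carries over into all later reconstructed values.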

In [None]:
def invert_transformation(df_train, df_forecast, second_diff=False, third_diff=False):
    """Revert back the differencing to get the forecast to original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    for col in columns:        
        # Roll back 3rd Diff
        if third_diff:
            df_fc[str(col)+'_2d'] = (df_train[col].iloc[-2]-df_train[col].iloc[-3]) + df_fc[str(col)+'_3d'].cumsum()
        # Roll back 2nd Diff
        if second_diff:
            df_fc[str(col)+'_1d'] = (df_train[col].iloc[-1]-df_train[col].iloc[-2]) + df_fc[str(col)+'_2d'].cumsum()
        # Roll back 1st Diff
        df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_1d'].cumsum()
    return df_fc

In [None]:
df_results = invert_transformation(df_train, df_forecast, second_diff=True, third_diff=False)        
df_results.loc[:, ['TotalHospitalizedPatients_forecast', 'HomeConfinement_forecast',
                                                  'Recovered_forecast', 'Deaths_forecast', 'TestsPerformed_forecast']]

In [None]:
df_results

**Results Visualization**

In [None]:
# Take the dates of the last n_test rows rather than hard-coded positions
df_results['Date'] = var_df['Date'].iloc[-n_test:]
df_test = df_test.set_index('Date')

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(22,6))

for col, ax in zip(['TotalHospitalizedPatients', 'HomeConfinement',
                                                  'Recovered', 'Deaths', 'TestsPerformed'], axes):

    df_results.plot(kind='line', y=[col+'_forecast'], x='Date', ax=ax, alpha=0.5, color='r', legend=True).autoscale(axis='x',tight=True)
    df_test[col][-n_test:].plot(legend=True, ax=ax)
    ax.set_title(col + ": Forecast vs Actuals")
plt.tight_layout();

In [None]:
from statsmodels.tsa.stattools import acf
def forecast_accuracy(forecast, actual):
    # Work on plain NumPy arrays so a pandas Series can be passed for either argument
    forecast, actual = np.asarray(forecast), np.asarray(actual)
    mape = np.mean(np.abs(forecast - actual)/np.abs(actual))  # MAPE
    me = np.mean(forecast - actual)             # ME
    mae = np.mean(np.abs(forecast - actual))    # MAE
    mpe = np.mean((forecast - actual)/actual)   # MPE
    rmse = np.mean((forecast - actual)**2)**.5  # RMSE
    corr = np.corrcoef(forecast, actual)[0,1]   # corr
    mins = np.minimum(forecast, actual)         # element-wise minimum
    maxs = np.maximum(forecast, actual)         # element-wise maximum
    minmax = 1 - np.mean(mins/maxs)             # minmax
    return({'mape':mape, 'me':me, 'mae': mae, 
            'mpe': mpe, 'rmse':rmse, 'corr':corr, 'minmax':minmax})

In [None]:
# Report the same accuracy metrics for every forecasted column
for col in ['TotalHospitalizedPatients', 'HomeConfinement',
            'Recovered', 'Deaths', 'TestsPerformed']:
    print(f'\nForecast Accuracy of: {col}')
    accuracy_prod = forecast_accuracy(df_results[col + '_forecast'].values, df_test[col])
    for k, v in accuracy_prod.items():
        print(k, ': ', round(v,4))

**Considering the length of our data, the results seem reasonable (although not perfect :)).** 

**The model's predictions may well improve as more up-to-date data becomes available to feed into it.** 

**In addition, I invite you to use this model (and modify it) to make similar predictions for other countries.**