# Machine Learning
In this notebook, we will apply the following machine learning models to the World Happiness time series:
- Vector Autoregression (VAR)
- Autoregression (AR)
- Autoregressive Integrated Moving Average (ARIMA)

First, let us take a look at our data, using the United States as an example.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [None]:
# Load the cleaned yearly data sets (2015-2021), keyed by year
data = {}
for x in range(2015, 2022):
    data[x] = pd.read_csv(f'../cleaned_data/{x}.csv')
countries = data[2015]['Country']

# Load one time series per country, indexed by year
time = {}
for x in countries:
    time[x] = pd.read_csv(f'../time_series/{x}.csv').set_index('Year')

# Example of time series: United States
df = time['United States']
print(df)
fig, axes = plt.subplots(nrows=7, ncols=1, dpi=120, figsize=(6,20))
for i, ax in enumerate(axes.flatten()):
    ax.plot(df[df.columns[i]], color='red', linewidth=1)
    # Decorations
    ax.set_title(df.columns[i])
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.spines["top"].set_alpha(0)
    ax.tick_params(labelsize=6)

plt.tight_layout();

# 1. Vector Autoregression (VAR)

**VAR** is a multivariate forecasting algorithm that is used when two or more time series influence each other. For our data, certain variables, such as **Trust**, may affect other variables, such as **Freedom**.
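
As an illustration, a two-variable VAR with a single lag can be written out for Trust and Freedom. The lag order of 1 and the restriction to these two variables are assumptions made for this example only; they are not necessarily the specification fitted in this notebook.

$$
\begin{aligned}
\text{Trust}_t &= c_1 + a_{11}\,\text{Trust}_{t-1} + a_{12}\,\text{Freedom}_{t-1} + \varepsilon_{1,t} \\
\text{Freedom}_t &= c_2 + a_{21}\,\text{Trust}_{t-1} + a_{22}\,\text{Freedom}_{t-1} + \varepsilon_{2,t}
\end{aligned}
$$

Here $c_1, c_2$ are intercepts, $a_{ij}$ are the lag coefficients, and $\varepsilon_{i,t}$ are error terms. A nonzero $a_{21}$ means that last year's Trust helps predict this year's Freedom, which is the kind of cross-variable influence VAR is designed to capture.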


In [None]:
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic

From here, we can see that some of the factors affect others (for example, Trust_y and Freedom_x have a p-value below 0.05).
We will now set up the VAR model.

# 1. Autoregression on Score

In [None]:
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import adfuller

ar_score_pred = data[2021][['Country','Score']].copy()
ar_pred_list = []

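# Apply the model to every country
for country in countries:
    df = time[country]['Score']

    # Augmented Dickey-Fuller stationarity test (the result is not used further below)
    df_stationarityTest = adfuller(df, autolag='AIC')

    # Train on the first six years (2015-2020); hold out the last year (2021).
    # .iloc selects by position, since the index holds year labels.
    train_data, test_data = df.iloc[0:6], df.iloc[6]
    # Fit on the training data only, so the model never sees the 2021 value
    ar_model = AutoReg(train_data, lags=1).fit()
    # One-step-ahead, out-of-sample forecast for 2021
    pred = ar_model.predict(start=len(train_data), end=len(df) - 1, dynamic=False)
    ar_pred_list.append(np.asarray(pred)[0])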
    
ar_score_pred['Pred'] = ar_pred_list
ar_score_pred

As we can see, fitting an autoregressive model with a lag of 1 on the 2015-2020 data yields a predicted 2021 Score for every country, shown in the `Pred` column next to the actual `Score`.
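
To make explicit what a lag of 1 means, the sketch below fits an AR(1) model by ordinary least squares in plain Python: each value is regressed on the value one step earlier, and the forecast is the intercept plus the slope times the last observation. The helper names `fit_ar1` and `forecast_ar1` and the `history` values are hypothetical, chosen so the arithmetic is easy to follow; they are not taken from the happiness data.

```python
def fit_ar1(series):
    """Estimate x_t = intercept + slope * x_{t-1} by ordinary least squares."""
    x = series[:-1]  # lagged values x_{t-1}
    y = series[1:]   # current values x_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_ar1(series, intercept, slope):
    """One-step-ahead forecast from the last observed value."""
    return intercept + slope * series[-1]


# Hypothetical six-point series that follows x_t = 1 + 0.5 * x_{t-1} exactly
history = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
intercept, slope = fit_ar1(history)
next_value = forecast_ar1(history, intercept, slope)
print(intercept, slope, next_value)
```

With six training points, only five (lagged, current) pairs are available for estimation, which is worth keeping in mind when reading predictions made from such short series.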

# 2. Autoregression on Variables

In [None]:
ar_var_pred = data[2021][['Country','Score']].copy()


# Apply the model to every variable and every country
for var in ['Economy', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']:
    var_list = []
    for country in countries:
        df = time[country][var]
        # Train on 2015-2020; hold out 2021
        train_data, test_data = df.iloc[0:6], df.iloc[6:7]

        # Fit on the training data only, then forecast 2021 out of sample
        ar_model = AutoReg(train_data, lags=1).fit()
        pred = ar_model.predict(start=len(train_data), end=len(df) - 1, dynamic=False)
        var_list.append(np.asarray(pred)[0])

    ar_var_pred[var] = var_list
ar_var_pred

# 3. ARIMA

In [None]:
import statsmodels.api as sm
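# Example: fit ARIMA(1,1,2) on Zimbabwe's Score using the first six years
arimatest = time['Zimbabwe']['Score']
model = sm.tsa.arima.ARIMA(arimatest.iloc[0:6], order=(1, 1, 2))
model_fit = model.fit()
print(model_fit.summary())
# Forecast 15 steps ahead; get_forecast exposes the 95% confidence intervals
forecast = model_fit.get_forecast(steps=15)
forecast.summary_frame(alpha=0.05)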