# Exploratory Data Analysis

In [None]:
%%capture
!pip install pandas
!pip install geopandas
!pip install plotly-express
!pip install nbformat
!pip install -U kaleido
!pip install pycountry-convert

In [None]:
import plotly.express as px
import pandas as pd
import pycountry_convert as pc

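We first add a continent code to each row of our data frame, derived from the country's ISO alpha-3 code.

In [None]:
def iso3_to_continent(iso3):
    # ISO alpha-3 (e.g. "FRA") -> ISO alpha-2 ("FR") -> continent code ("EU")
    alpha2 = pc.country_alpha3_to_country_alpha2(iso3)
    return pc.country_alpha2_to_continent_code(alpha2)

df = pd.read_csv('../Data/global_average_yearly_temp_with_features_clean.csv')
df['datetime_year'] = pd.to_datetime(df['year'], format="%Y")
df['iso_continents_code'] = df['iso_code'].apply(iso3_to_continent)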


## Global Trends
Global average temperature has trended upward over time. When we plot each continent's temperature, we see that coverage is most consistent for Europe, Asia, North America, and South America, while Africa has sparse data before 1850.

### Global Average Temperature

In [None]:
# numeric_only=True skips text columns such as iso_code, which cannot be averaged
year_group_by = df.groupby("year").mean(numeric_only=True)
year_group_by = year_group_by.reset_index()
fig = px.line(year_group_by, x = "year", y="AvgYearlyTemp", title = "Average Temperature Over Time")
fig.show()


### Continent Average Temperature 

In [None]:
# numeric_only=True skips text columns such as iso_code, which cannot be averaged
continent_group_by = df.groupby(["year","iso_continents_code"]).mean(numeric_only=True)
continent_group_by = continent_group_by.reset_index()
fig = px.line(continent_group_by, x = "year", y="AvgYearlyTemp", color = "iso_continents_code")
fig.show()

### Map of the World With Average Temperature

In [None]:
fig = px.choropleth(df, locations="iso_code", color= "AvgYearlyTemp", 
                    hover_name= "iso_code", animation_frame= "year", title = "Average Yearly Temperature" )
# interactive
fig.show()

### Percent Change In Temperature Year Over Year 

In [None]:
# Sort oldest to newest so pct_change compares each year with the previous one
year_group_by = year_group_by.sort_values('year')
temp_label, co2_label = 'Avg. Temperature Yearly Percent Change', 'Avg. Co2 Yearly Percent Change'
year_group_by[temp_label] = year_group_by['AvgYearlyTemp'].pct_change() * 100
year_group_by[co2_label] = year_group_by['co2'].pct_change() * 100
fig = px.bar(year_group_by, x='year', y=temp_label, color = temp_label, color_continuous_scale ='bluered')
fig.show()
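
`pct_change` compares each row with the row directly before it, so the result only reads as year over year when the rows run from the oldest year to the newest. A minimal sketch with made-up values (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy yearly averages (invented values), deliberately out of order.
toy = pd.DataFrame({"year": [2002, 2000, 2001],
                    "AvgYearlyTemp": [12.1, 10.0, 11.0]})

# Sort oldest to newest so each row is compared with the previous year.
toy = toy.sort_values("year").reset_index(drop=True)
toy["pct_change"] = toy["AvgYearlyTemp"].pct_change() * 100

# The first year has no predecessor, so its change is NaN.
print(toy)
```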

## Trends Between Data

In [None]:
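# Correlations are only defined for numeric columns
corr = df.select_dtypes("number").corr()
# text_auto=True writes each correlation value inside its cell
fig = px.imshow(corr, text_auto = True, title = "Correlation Heatmap")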
fig.show()

In [None]:
fig = px.scatter_matrix(df, dimensions=["cumulative_co2", "population", "co2", "oil_co2", "AvgYearlyTemp"])
# interactive plot
fig.show()

# Creating a SARIMAX Model With Exogenous Variables at the Country Level

We fit one SARIMAX model per country because a time-series model needs exactly one observation per time point, and pooling countries would give several observations for the same year. SARIMAX is also a comparatively simple model, so it does not need many data points to make predictions. Below, we detail the process that we follow for our model.

1. Data Preparation
2. Tuning Model Hyperparameters With the Training and Validation Sets
3. Testing the Model on the Holdout Set
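
To illustrate why the model is fitted per country, the sketch below uses a small hypothetical frame (values invented for illustration) and checks that years repeat when countries are pooled but are unique within a single country.

```python
import pandas as pd

# Toy frame with invented values: two countries share the same years.
toy = pd.DataFrame({
    "iso_code": ["FRA", "FRA", "DEU", "DEU"],
    "year": [2000, 2001, 2000, 2001],
    "AvgYearlyTemp": [12.0, 12.3, 9.1, 9.4],
})

# Pooled across countries, each year appears more than once ...
pooled_has_duplicates = bool(toy["year"].duplicated().any())

# ... but within one country every year is unique, which is what a
# single time series needs.
fra = toy[toy["iso_code"] == "FRA"]
country_has_duplicates = bool(fra["year"].duplicated().any())

print(pooled_has_duplicates, country_has_duplicates)
```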


In [None]:
%%capture
!pip install pandas
!pip install plotly-express
!pip install statsmodels
!pip install tqdm


In [14]:
import itertools
import math
import pickle

import pandas as pd
import plotly.express as px
import statsmodels as sm
from tqdm import tqdm
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# model imports
import statsmodels.api as sm1
from statsmodels.tsa.arima.model import ARIMA


## Preparing Data for ARIMA
We have multiple tasks to complete to prepare the data for ARIMA.
### Data Wrangling
1. Cast the year column to a datetime object.
2. Set it as the index of the data frame.
3. Set the period of the index to a yearly frequency.
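
The three wrangling steps can be sketched on a toy frame as follows. The values are made up; only the column names `year` and `AvgYearlyTemp` come from our data.

```python
import pandas as pd

toy = pd.DataFrame({"year": [1750, 1751, 1752],
                    "AvgYearlyTemp": [8.1, 8.4, 8.0]})

# 1. Cast the integer year to a datetime (January 1st of that year).
toy["Year_idx"] = pd.to_datetime(toy["year"], format="%Y")
# 2. Use it as the index.
toy = toy.set_index("Year_idx")
# 3. Convert the timestamps to yearly periods.
toy.index = toy.index.to_period("Y")

print(toy.index)
```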

### Splitting Data Into Training, Validation, and Holdout
We select the following time periods.
```
Training (1750 - 1950)
Validation (1950 - 1980)
Testing (1980 - 2013)
```
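
One detail worth checking is how label-based slicing treats the boundary years. The sketch below, on a hypothetical yearly series, suggests that `.loc` slices on a period index include both endpoints, so 1950 and 1980 would each land in two adjacent splits.

```python
import pandas as pd

# Hypothetical yearly series covering 1940-1990 (values are placeholders).
idx = pd.period_range("1940", "1990", freq="Y")
toy = pd.DataFrame({"AvgYearlyTemp": range(len(idx))}, index=idx)

train = toy.loc[:"1950"]
valid = toy.loc["1950":"1980"]
test = toy.loc["1980":]

# Label-based slices include both endpoints, so the boundary years
# show up in two adjacent splits.
shared_train_valid = train.index.intersection(valid.index)
shared_valid_test = valid.index.intersection(test.index)
print(shared_train_valid, shared_valid_test)
```

Whether that one-year overlap matters depends on how strictly the splits need to be separated.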

### Scaling Data With StandardScaler
The numerical method used to fit ARIMA involves BFGS, which is not
scale invariant. Therefore, we standardize the x-features with a scaler
fitted on the combined training and validation data.
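
As a rough illustration of what standardization does, the sketch below applies z = (x - mean) / std by hand with NumPy on invented features. It is meant to mirror the idea behind StandardScaler rather than replace it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented features on very different scales (think population vs. CO2).
fit_x = rng.normal(loc=[1e6, 5.0], scale=[2e5, 1.5], size=(100, 2))
new_x = rng.normal(loc=[1e6, 5.0], scale=[2e5, 1.5], size=(20, 2))

# Learn the mean and standard deviation from the fitting data only ...
mean = fit_x.mean(axis=0)
std = fit_x.std(axis=0)

# ... and reuse them for later data so both end up on the same scale.
fit_scaled = (fit_x - mean) / std
new_scaled = (new_x - mean) / std

print(fit_scaled.mean(axis=0).round(6), fit_scaled.std(axis=0).round(6))
```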

In [63]:
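def get_country_df(country):
    """Load one country's rows, indexed by yearly period."""
    df = pd.read_csv('../Data/global_average_yearly_temp_with_features_clean.csv')
    # .copy() so the column assignment below does not write to a slice of the full frame
    df = df[df["iso_code"] == country].copy()
    df["Year_idx"] = pd.to_datetime(df.year, format="%Y")
    df = df.set_index("Year_idx")
    df.index = df.index.to_period("Y")
    return df

def get_test_train_valid(country, train_split = '1950', validation_split = '1980'):
    """Split one country's data by year. Returns (train, test, valid), in that order."""
    country_df = get_country_df(country)

    # Label-based slices include both endpoints, so each boundary year
    # (1950 and 1980 by default) appears in two adjacent splits.
    train = country_df.loc[:train_split]
    valid = country_df.loc[train_split: validation_split]
    test  = country_df.loc[validation_split:]

    return train, test, valid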

def scale(train_x, valid_x, test_x):
    """Standardize all splits with a scaler fitted on the train + validation rows."""
    scaler = StandardScaler()
    model_x = pd.concat([train_x, valid_x])

    scaler.fit(model_x)
    model_x = scaler.transform(model_x)
    train_x = scaler.transform(train_x)
    valid_x = scaler.transform(valid_x)
    test_x  = scaler.transform(test_x)
    return train_x, valid_x, test_x, model_x

def pca_fitter(train_x, threshold = .99):
    """Fit PCA with the fewest components whose explained variance reaches the threshold."""
    n_features = train_x.shape[1]

    for n in range(1, n_features + 1):
        pca_model = PCA(n)
        pca_model.fit(train_x)

        # Stop at the first n that explains at least `threshold` of the variance
        if sum(pca_model.explained_variance_ratio_) >= threshold:
            break
    return pca_model
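
The selection rule in pca_fitter can also be sketched directly with NumPy. The helper `n_components_for` below is hypothetical and uses an SVD instead of scikit-learn's PCA; it is intended only to show how the cumulative explained-variance threshold picks the number of components.

```python
import numpy as np

def n_components_for(x, threshold=0.99):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` (hypothetical helper)."""
    centered = x - x.mean(axis=0)
    # Squared singular values are proportional to each component's variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # First position where the running total reaches the threshold.
    n = int(np.searchsorted(cumulative, threshold) + 1)
    return min(n, x.shape[1])

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Two collinear columns plus tiny noise: one component should suffice.
redundant = np.hstack([base, 2 * base, 1e-3 * rng.normal(size=(200, 1))])
# Three independent columns: all components are needed.
independent = rng.normal(size=(200, 3))

print(n_components_for(redundant), n_components_for(independent))
```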
        
    

## Tuning Model Hyperparameters With the Training and Validation Sets
We opt for an exhaustive grid search over the (p, d, q) orders and keep the order that gives the lowest RMSE on the validation set. For the SARIMAX model, we make only the following assumptions: the coefficients on the external regressors are allowed to vary over time, and the observed series is assumed to contain measurement error.
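
A minimal sketch of the exhaustive search, using a made-up scoring function `toy_score` in place of fitting a model. The candidate values match the defaults used later in this notebook; everything else is an assumption for illustration.

```python
import itertools

def toy_score(order):
    """Stand-in for a validation RMSE (hypothetical): lowest at (1, 1, 2)."""
    p, d, q = order
    return (p - 1) ** 2 + (d - 1) ** 2 + (q - 2) ** 2

p_values, d_values, q_values = [0, 1, 2, 3], range(0, 3), range(0, 3)
candidates = list(itertools.product(p_values, d_values, q_values))

# Score every combination and remember the best one seen so far.
best_order, best_score = None, float("inf")
for order in candidates:
    score = toy_score(order)
    if score < best_score:
        best_order, best_score = order, score

print(len(candidates), best_order, best_score)
```

With 4 values of p and 3 each of d and q, the grid has 36 candidates, so an exhaustive search stays cheap here.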


In [64]:
def eval_arima_excog_walk_forward_validation_model(train_x, train_y, test_x, test_y, arima_order):
    # DO NOT USE THIS: use eval_sarimax_excog instead
    # Walk-forward validation: refit after every step, adding the newly observed point

    history = list(train_y)
    predictions = []
    
    for i in range(len(test_x)):
        model = ARIMA(history, exog = train_x, order = arima_order)
        model_fit = model.fit()
        
        y_hat = model_fit.forecast(exog = test_x.iloc[[i]])
        predictions.append(y_hat[0])
        
        train_x = pd.concat([train_x, test_x.iloc[[i]]], axis = 0)
        # .iloc selects by position; plain [i] would be read as an index label
        history.append(test_y.iloc[i])

    rmse = math.sqrt(mean_squared_error(test_y, predictions))
    return model_fit, rmse, predictions

def eval_arima_excog(train_x, train_y, test_x, test_y, arima_order, trend = 'c'):
    # DO NOT USE THIS: use eval_sarimax_excog instead
    model = ARIMA(train_y, exog = train_x, order = arima_order, trend = trend)
    model_fit = model.fit()
    predictions = model_fit.forecast(len(test_x), exog = test_x)
    rmse = math.sqrt(mean_squared_error(test_y, predictions))
    return model_fit, rmse, predictions

def eval_sarimax_excog(train_x, train_y, test_x, test_y, arima_order):
    # USE THIS
    # Fit once on the training data, then forecast the whole test horizon
    model = sm1.tsa.statespace.SARIMAX(train_y, exog = train_x, order = arima_order,
                                       time_varying_regression = True, mle_regression = False, measurement_error = True)
    model_fit = model.fit(disp = 0)
    predictions = model_fit.forecast(len(test_x), exog = test_x)
    rmse = math.sqrt(mean_squared_error(test_y, predictions))
    # Align the forecasts with the test years
    predictions.index = test_y.index
    return model_fit, rmse, predictions

def eval_excog_models(train_x, train_y, test_x, test_y, p_values, d_values, q_values, f):
    # Try every (p, d, q) combination and keep the one with the lowest RMSE
    arima_orders = list(itertools.product(p_values, d_values, q_values))
    best_order, best_score = None, float("inf") 
    
    for arima_order in tqdm(arima_orders):
        try:
            _, rmse, _ = f(train_x, train_y, test_x, test_y, arima_order)
            
            if rmse < best_score:
                best_score, best_order = rmse, arima_order
                print(f"ARIMA RMSE = {best_score}")
        except Exception as e:
            # Some orders cannot be fitted; report the error and move on
            print(e)
    print('DONE')
    return best_order
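
The score being minimized is the root mean squared error, the same quantity as `math.sqrt(mean_squared_error(...))` in the functions above. A small standard-library sketch of it, with invented numbers:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two perfect predictions and one that is off by 2 degrees.
example = rmse([10.0, 11.0, 12.0], [10.0, 11.0, 14.0])
print(example)
```

Because the errors are squared before averaging, a single large miss weighs more heavily than several small ones.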
    

In [65]:
def ml(country, p_values = (0, 1, 2, 3), d_values = range(0, 3), q_values = range(0, 3)):
    """Tune (p, d, q) on the validation set, refit on train + validation, and predict the holdout."""
    train, test, valid = get_test_train_valid(country)

    # Features: every column except the first two and the last two. Target: AvgYearlyTemp
    train_x, train_y = train[train.columns[2:-2]], train["AvgYearlyTemp"]
    test_x, test_y = test[test.columns[2:-2]], test["AvgYearlyTemp"]
    valid_x, valid_y = valid[valid.columns[2:-2]], valid["AvgYearlyTemp"]

    train_x, valid_x, test_x, model_x = scale(train_x, valid_x, test_x)

    # Reduce the scaled features to the principal components chosen on the training set
    pca_model = pca_fitter(train_x)
    train_x = pca_model.transform(train_x)
    valid_x = pca_model.transform(valid_x) 
    test_x  = pca_model.transform(test_x)
    model_x = pca_model.transform(model_x)
    
    model_y = pd.concat([train_y, valid_y])

    # Grid search: fit on train, score on validation
    best = eval_excog_models(train_x, train_y, valid_x, valid_y, p_values, d_values, q_values, eval_sarimax_excog)

    # Final model: fit on train + validation, predict the holdout
    model_fit, rmse, predictions = eval_sarimax_excog(model_x, model_y, test_x, test_y, best)
    
    return best, test_y, predictions

In [66]:
def plot(df, country, layout, line1 = 'AvgYearlyTemp', line2 = 'Predictions', x = 'year', error = 'AvgTempUncertainty'):
    import plotly.graph_objects as go

    fig = go.Figure(layout = layout)
    # Actual temperatures, with the reported uncertainty as error bars
    trace_1 = go.Scatter(name = f"Actual {country} Avg Yearly Temp", 
                         mode = 'lines',
                         x = df[x],
                         y = df[line1],
                         error_y = dict(type='data', array = df[error]))

    # Model predictions for the same years
    trace_2 = go.Scatter(name = f"Predictions of {country} Avg Yearly Temp", 
                         mode = 'lines',
                         x = df[x],
                         y = df[line2])

    fig.add_trace(trace_1)
    fig.add_trace(trace_2)
    
    return fig

## Testing on the holdout
We now take the best order found by the exhaustive grid search, refit on the training and validation data, and predict on our holdout set.

In [67]:
results = {"FRA": None, "DEU": None, 
           "CAN": None, "ESP": None, 
           "IND": None}

for k in results.keys():
    results[k] = ml(k)


ARIMA RMSE = 4.8265999555331565
Invalid dimensions for design matrix: requires 7 columns, got 6
ARIMA RMSE = 0.39317069012112366
Maximum Likelihood optimization failed to converge. Check mle_retvals
Invalid dimensions for design matrix: requires 8 columns, got 6
Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
Non-invertible starting MA parameters found. Using zeros as starting parameters.
DONE

ARIMA RMSE = 4.55147269442265
ARIMA RMSE = 4.492407342799966
Invalid dimensions for design matrix: requires 7 columns, got 6
ARIMA RMSE = 0.6233501705083685
Invalid dimensions for design matrix: requires 8 columns, got 6
ARIMA RMSE = 0.6052236845897783
DONE

ARIMA RMSE = 4.904007329165935
ARIMA RMSE = 4.890337004139657
Invalid dimensions for design matrix: requires 7 columns, got 6
ARIMA RMSE = 1.2798322160794686
ARIMA RMSE = 1.0520517740007513
Invalid dimensions for design matrix: requires 8 columns, got 6
ARIMA RMSE = 0.6003606479597758
ARIMA RMSE = 0.585198618500025
DONE

ARIMA RMSE = 4.010271015859658
ARIMA RMSE = 3.758255052387073
Invalid dimensions for design matrix: requires 7 columns, got 6
ARIMA RMSE = 1.9812068188945628
Invalid dimensions for design matrix: requires 8 columns, got 6
DONE

ARIMA RMSE = 16.056644528602405
ARIMA RMSE = 16.016080812482375
Invalid dimensions for design matrix: requires 6 columns, got 5
ARIMA RMSE = 0.6108186120377728
Invalid dimensions for design matrix: requires 7 columns, got 5
ARIMA RMSE = 0.47570918103877796
ARIMA RMSE = 0.42915757217720096
ARIMA RMSE = 0.36700898805176674
ARIMA RMSE = 0.3583480123415916
ARIMA RMSE = 0.3547846462272414
ARIMA RMSE = 0.35466725848349
DONE



In [68]:

with open('Temperatue_Time_Series.pickle', 'wb') as fh:
   pickle.dump(results, fh)

In [69]:
# The context manager closes the file once the results are loaded
with open("Temperatue_Time_Series.pickle", "rb") as pickle_off:
    results = pickle.load(pickle_off)

In [71]:
layout = dict(xaxis=dict(title='Year'),
              yaxis=dict(title='Temperature'))

for k, v in results.items():
    _, test, _ = get_test_train_valid(k)
    # v is (best_order, test_y, predictions); the order is (p, d, q)
    print("p, d, q", v[0])
    test = test.assign(Predictions = v[-1])
    fig = plot(test, k, layout)
    fig.show()

p, d, q (0, 1, 1)


p, d, q (1, 2, 1)


p, d, q (3, 1, 1)


p, d, q (0, 1, 1)


p, d, q (3, 2, 2)
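
Each tuple above is the `order` handed to SARIMAX, which statsmodels interprets as (p, d, q): the number of autoregressive lags, the number of differences, and the number of moving-average lags. As a small illustration of the middle term, the sketch below (with an invented series) shows what differencing does to a linear trend.

```python
import numpy as np

# Invented series with a steady upward trend of 0.5 per step.
trend = 10.0 + 0.5 * np.arange(6)

# d = 1: first differences remove a linear trend, leaving a constant.
first_diff = np.diff(trend, n=1)
# d = 2: differencing again leaves zeros.
second_diff = np.diff(trend, n=2)

print(first_diff, second_diff)
```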


Interpret the coefficients
ARIMA 
-> p, d, q
Five Countries
Time Series Specific
R