## GGDP Simulation

After imputing some of the data I still do not have enough samples to run a sufficiently broad quantile regression (QR), specifically one that focuses on the lower quantiles (such as q = 0.05) that I am particularly interested in. There are two options: impute more data, possibly by extending the imputation with further calculations, or, because the GGDP and GDP distributions are quite similar, simulate GGDP data from GDP data.

In [30]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import coint

ggdp_data = pd.read_pickle('ggdp_processed_data/finland_ggdp_imputed.pkl')
ggdp_data_clean = ggdp_data.dropna(axis=0, how='any')

# ADF unit-root test on each growth series (null hypothesis: the series has a unit root)
print(sm.tsa.stattools.adfuller(ggdp_data_clean['ggdp_ppp_growth']))
print(sm.tsa.stattools.adfuller(ggdp_data_clean['gdp_ppp_growth']))

# Engle-Granger cointegration test (null hypothesis: no cointegration)
print(coint(ggdp_data_clean['ggdp_ppp_growth'], ggdp_data_clean['gdp_real_growth']))
# The series appear cointegrated at the 5% significance level: they share a stochastic trend, so regressing one on the other is not spurious.

(-4.444262050238324, 0.0002476349897740484, 0, 28, {'1%': -3.6889256286443146, '5%': -2.9719894897959187, '10%': -2.6252957653061224}, -72.11463123511959)
(-4.4516567707225025, 0.0002402108531347846, 0, 28, {'1%': -3.6889256286443146, '5%': -2.9719894897959187, '10%': -2.6252957653061224}, -71.86327779433003)
(-4.189783246851691, 0.003780017050402098, array([-4.33034332, -3.56305066, -3.19939082]))


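Using the Augmented Dickey-Fuller test, the unit-root null is rejected for both series tested in the cell above (test statistics of about -4.44 and -4.45, p < 0.001). Strictly, this indicates that the growth rates themselves are stationary, which is what we would expect if the underlying level series are $I(1)$. Note that the cell tests `gdp_ppp_growth`, while the cointegration test and the regression use `gdp_real_growth`.
Next we check whether the two series are cointegrated using the augmented Engle-Granger two-step cointegration test.
The test statistic is -4.19 (p ≈ 0.004), which is below the 5% critical value of -3.56, so we can reject the null hypothesis of no cointegration at the 5% significance level.
Since the two series are cointegrated, they share the same stochastic trend and the regression residuals are stationary, meaning that we can regress one on the other without the regression being spurious.
We now fit an OLS regression to see whether we get a significant model.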
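
To make the two-step logic concrete, here is a minimal sketch on synthetic data (not the Finnish series). Step 1 regresses one series on the other; step 2 runs a Dickey-Fuller-style regression on the residuals. The names `x`, `y`, and `df_t_stat` are illustrative assumptions, and the resulting statistic would have to be compared with Engle-Granger critical values such as those returned by `coint`, not with standard t tables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, illustrative data: x is a random walk (I(1)) and y follows it
# up to stationary noise, so the pair is cointegrated by construction.
x = np.cumsum(rng.normal(size=n))
y = 1.0 + 0.9 * x + rng.normal(scale=0.5, size=n)

# Step 1: OLS of y on x (with intercept) and the residuals of that fit.
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

# Step 2: Dickey-Fuller regression on the residuals (no lags, no constant):
#   diff(resid)_t = rho * resid_{t-1} + u_t
lagged = resid[:-1]
delta = np.diff(resid)
rho = (lagged @ delta) / (lagged @ lagged)
u = delta - rho * lagged
se_rho = np.sqrt((u @ u) / (len(delta) - 1) / (lagged @ lagged))
df_t_stat = rho / se_rho

print(f"slope={slope:.3f}, rho={rho:.3f}, DF t-stat={df_t_stat:.2f}")
```

A strongly negative statistic means the residuals revert to zero quickly, which is what cointegration implies. In practice `coint` is preferable because it also handles lag selection and supplies the appropriate critical values.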

In [31]:
ggdp_model = smf.ols('ggdp_ppp_growth ~ gdp_real_growth', data=ggdp_data_clean)
ggdp_fit = ggdp_model.fit()
print(ggdp_fit.summary())

                            OLS Regression Results                            
Dep. Variable:        ggdp_ppp_growth   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.757
Method:                 Least Squares   F-statistic:                     88.38
Date:                Sun, 09 Jun 2024   Prob (F-statistic):           5.25e-10
Time:                        17:28:10   Log-Likelihood:                 79.163
No. Observations:                  29   AIC:                            -154.3
Df Residuals:                      27   BIC:                            -151.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           0.0250      0.004     

We got a significant model: $$\text{GGDP Growth (PPP)}_t = 0.0250 + 0.9331 \cdot \text{GDP Real Growth}_t + \epsilon_t$$
We can use this model to estimate past GGDP figures. We cannot use it to estimate quarterly data, because it was fitted on annual dynamics.
This is very helpful because it lets us extend our data considerably, both into the past and into the future.
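
As a quick worked example of the fitted equation, the sketch below plugs a hypothetical real GDP growth rate into the rounded coefficients reported above. The function name `predict_ggdp_growth` and the 2% input are illustrative assumptions, not part of the original notebook, and growth rates are assumed to be expressed as fractions.

```python
ALPHA = 0.0250  # intercept, rounded as reported above
BETA = 0.9331   # slope on real GDP growth, rounded as reported above


def predict_ggdp_growth(real_growth: float) -> float:
    """Point prediction of annual GGDP (PPP) growth from annual real GDP growth."""
    return ALPHA + BETA * real_growth


# Hypothetical input: 2% annual real GDP growth, written as a fraction.
example = predict_ggdp_growth(0.02)
print(f"{example:.4f}")  # prints 0.0437
```

This is a point estimate only; it ignores the error term $\epsilon_t$, so simulated values will be smoother than observed ones.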


In [32]:
import numpy as np


def simulate_missing_ggdp_values(ppp_growth: float, real_growth: float, alpha: float, beta: float) -> float:
    """Return the observed GGDP growth, or the OLS prediction when it is missing (NaN)."""
    # NaN is itself a float, so `float` covers both observed and missing values
    if not np.isnan(ppp_growth):
        return ppp_growth
    return alpha + beta * real_growth


# .copy() gives an independent DataFrame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning
ggdp_data_simulated = ggdp_data[['ggdp_ppp_growth', 'gdp_real_growth']].copy()
ggdp_data_simulated['ggdp_ppp_growth_simulated'] = ggdp_data_simulated.apply(
    lambda x: simulate_missing_ggdp_values(x['ggdp_ppp_growth'],
                                           x['gdp_real_growth'],
                                           ggdp_fit.params['Intercept'],
                                           ggdp_fit.params['gdp_real_growth']), axis=1)

ggdp_data_simulated.to_pickle('ggdp_processed_data/ggdp_ppp_simulated_data.pkl')
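
The row-wise `apply` in this cell can also be written as a vectorized fill. The sketch below is one possible alternative, not the notebook's method: it uses a small made-up DataFrame and the rounded regression coefficients as stand-ins for `ggdp_data` and `ggdp_fit.params`.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data; NaN marks years with no observed GGDP growth.
df = pd.DataFrame({
    "ggdp_ppp_growth": [0.05, np.nan, 0.03, np.nan],
    "gdp_real_growth": [0.03, 0.02, 0.01, -0.01],
})

alpha, beta = 0.0250, 0.9331  # rounded coefficients, standing in for ggdp_fit.params

# Prediction for every row, used only where the observed value is missing.
predicted = alpha + beta * df["gdp_real_growth"]
df["ggdp_ppp_growth_simulated"] = df["ggdp_ppp_growth"].fillna(predicted)

print(df)
```

Observed values pass through unchanged and only the gaps are filled, which matches the behaviour of `simulate_missing_ggdp_values` while avoiding a Python-level loop over rows.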
