### Forecasting GDP Growth Rates using Principal Component Analysis

#### Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction.

Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data.

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster when there are no extraneous variables to process.

#### So to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.

#### STEP 1: STANDARDIZATION

The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.
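
As a minimal sketch of this step, the snippet below applies z-score standardization to a small made-up array (`X_toy` and its values are illustrative assumptions, not taken from the data set used later). Each column is shifted to mean 0 and scaled to standard deviation 1, which is the same computation `sklearn.preprocessing.StandardScaler` performs with its default settings.

```python
import numpy as np

# Hypothetical toy data: 4 observations of 2 variables on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

# z-score: subtract each column's mean, then divide by its standard deviation
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)   # population standard deviation (ddof=0)
X_std = (X_toy - col_mean) / col_std

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # 1 for every column
```

Without the scaling, a variable measured in large units would dominate the covariance matrix simply because of its scale, not because it carries more information.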

#### STEP 2: COVARIANCE MATRIX COMPUTATION

The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are so highly correlated that they contain redundant information. To identify these correlations, we compute the covariance matrix.
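
The sketch below shows one way the covariance matrix can be computed, on a made-up array (`X_toy` is a hypothetical example, not the notebook's data). It builds the sample covariance by hand from the centered data and checks it against `np.cov`.

```python
import numpy as np

# Hypothetical toy data: rows are observations, columns are variables
X_toy = np.array([[2.0, 1.0, 5.0],
                  [4.0, 3.0, 4.0],
                  [6.0, 5.0, 2.0],
                  [8.0, 7.0, 1.0]])

n = X_toy.shape[0]
centered = X_toy - X_toy.mean(axis=0)

# Sample covariance: centered^T @ centered, divided by (n - 1)
cov_manual = centered.T @ centered / (n - 1)

# np.cov gives the same matrix when told that the columns are the variables
cov_numpy = np.cov(X_toy, rowvar=False)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

A positive off-diagonal entry means the two variables tend to move together (columns 0 and 1 here), and a negative entry means they tend to move in opposite directions (columns 0 and 2).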

#### STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to put maximum possible information in the first component, then maximum remaining information in the second and so on.

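Principal components allow us to reduce dimensionality without losing much information. Important: because each component is a mixture of the original variables, the components are less interpretable than the variables themselves.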
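
To make the "uncorrelated, ordered by information" property concrete, here is a small sketch on synthetic data (the data, the seed, and all variable names are assumptions made for illustration). It projects centered data on the eigenvectors of its covariance matrix and inspects the covariance of the resulting components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 3 variables, two of them strongly related
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X_toy = np.column_stack([x1, x2, x3])

centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Principal component scores: the centered data projected on the eigenvectors
scores = centered @ eig_vecs

# The covariance of the scores is diagonal (uncorrelated components),
# with the eigenvalues on the diagonal in decreasing order
scores_cov = np.cov(scores, rowvar=False)
print(np.round(scores_cov, 6))
print(eig_vals)
```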

#### To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv(r'C:\Users\adim\Desktop\Special Project in Economic Research\Data\US_Forecasting_Data_1972_2021.csv')
df = df.set_index('date')
df

date,log_gdp,dlog_gdp_yoy,unemp,3m_tbill,d3m_tbill,fedfund,dfed_fund,cpi,dcpi,manu_utilization,10y_bond,d10y_bond
1/1/1972,3.720104,,5.766667,3.436667,,3.546667,,6.033333,,81.732967,6.033333,
4/1/1972,3.729852,,5.700000,3.770000,0.333333,4.300000,0.753333,6.143333,0.110000,82.782533,6.143333,0.110000
7/1/1972,3.733934,,5.566667,4.220000,0.450000,4.743333,0.443333,6.290000,0.146667,83.246700,6.290000,0.146667
10/1/1972,3.741146,,5.366667,4.863333,0.643333,5.146667,0.403333,6.373333,0.083333,85.669767,6.373333,0.083333
1/1/1973,3.751763,0.031658,4.933333,5.700000,0.836667,6.536667,1.390000,6.603333,0.230000,87.702233,6.603333,0.230000
...,...,...,...,...,...,...,...,...,...,...,...,...
10/1/2020,4.273413,-0.009941,6.766667,0.093333,-0.020000,0.090000,-0.003333,0.863333,0.213333,74.182667,0.863333,0.213333
1/1/2021,4.280024,0.002369,6.200000,0.050000,-0.043333,0.080000,-0.010000,1.316667,0.453333,74.733533,1.316667,0.453333
4/1/2021,4.287092,0.050096,5.900000,0.026667,-0.023333,0.070000,-0.010000,1.593333,0.276667,75.761367,1.593333,0.276667
7/1/2021,4.289564,0.020968,5.100000,0.046667,0.020000,0.090000,0.020000,1.323333,-0.270000,76.384367,1.323333,-0.270000


In [4]:
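# Split the data table into features (X) and targets (Y).
# The first four quarters are skipped because dlog_gdp_yoy is missing there.
X = df.iloc[4:, 2:].values   # unemp through d10y_bond
Y = df.iloc[4:, 0:2].values  # log_gdp and dlog_gdp_yoy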


### Step 1: Standardization

#### Eigendecomposition: Computing Eigenvalues and Eigenvectors

The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the "core" of a PCA: The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.
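
In symbols (the notation is introduced here for illustration and does not appear in the notebook): let $X_c$ be the centered $n \times p$ data matrix, $\Sigma$ its sample covariance matrix, and $v_i$ a unit-length eigenvector with eigenvalue $\lambda_i$. Then

$$
\Sigma = \frac{1}{n-1} X_c^\top X_c, \qquad \Sigma v_i = \lambda_i v_i, \qquad \operatorname{Var}(X_c v_i) = v_i^\top \Sigma v_i = \lambda_i .
$$

The share of the total variance captured by the $i$-th component is therefore

$$
\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} .
$$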

### Correlation Matrix

In [52]:
# Drop log_gdp and the level series that also appear in differenced form
df2 = df.iloc[4:, ~df.columns.isin(['log_gdp', '3m_tbill', 'fedfund', 'cpi', '10y_bond'])]
df2

date,dlog_gdp_yoy,unemp,d3m_tbill,dfed_fund,dcpi,manu_utilization,d10y_bond
1/1/1973,0.031658,4.933333,0.836667,1.390000,0.230000,87.702233,0.230000
4/1/1973,0.026614,4.933333,0.903333,1.280000,0.203333,87.696100,0.203333
7/1/1973,0.020241,4.800000,1.720000,2.743333,0.400000,87.502633,0.400000
10/1/1973,0.017133,4.766667,-0.823333,-0.563333,-0.453333,88.285867,-0.453333
1/1/1974,0.002765,5.133333,0.116667,-0.673333,0.300000,86.648667,0.300000
...,...,...,...,...,...,...,...
10/1/2020,-0.009941,6.766667,-0.020000,-0.003333,0.213333,74.182667,0.213333
1/1/2021,0.002369,6.200000,-0.043333,-0.010000,0.453333,74.733533,0.453333
4/1/2021,0.050096,5.900000,-0.023333,-0.010000,0.276667,75.761367,0.276667
7/1/2021,0.020968,5.100000,0.020000,0.020000,-0.270000,76.384367,-0.270000


In [53]:
# Center each column on its mean (this removes the mean but does not rescale to unit variance)
data_centered = df2 - df2.mean(axis=0)

In [54]:
data_centered

date,dlog_gdp_yoy,unemp,d3m_tbill,dfed_fund,dcpi,manu_utilization,d10y_bond
1/1/1973,0.020288,-1.333333,0.861207,1.415850,0.254677,9.735418,0.254677
4/1/1973,0.015243,-1.333333,0.927874,1.305850,0.228010,9.729284,0.228010
7/1/1973,0.008871,-1.466667,1.744541,2.769184,0.424677,9.535818,0.424677
10/1/1973,0.005762,-1.500000,-0.798793,-0.537483,-0.428656,10.319051,-0.428656
1/1/1974,-0.008605,-1.133333,0.141207,-0.647483,0.324677,8.681851,0.324677
...,...,...,...,...,...,...,...
10/1/2020,-0.021311,0.500000,0.004541,0.022517,0.238010,-3.784149,0.238010
1/1/2021,-0.009001,-0.066667,-0.018793,0.015850,0.478010,-3.233282,0.478010
4/1/2021,0.038726,-0.366667,0.001207,0.015850,0.301344,-2.205449,0.301344
7/1/2021,0.009598,-1.166667,0.044541,0.045850,-0.245323,-1.582449,-0.245323


In [55]:
centered_frame = pd.DataFrame(data_centered, columns = ['dlog_gdp_yoy','unemp','d3m_tbill','dfed_fund','dcpi','manu_utilization', 'd10y_bond'])
centered_frame.head()

date,dlog_gdp_yoy,unemp,d3m_tbill,dfed_fund,dcpi,manu_utilization,d10y_bond
1/1/1973,0.020288,-1.333333,0.861207,1.41585,0.254677,9.735418,0.254677
4/1/1973,0.015243,-1.333333,0.927874,1.30585,0.22801,9.729284,0.22801
7/1/1973,0.008871,-1.466667,1.744541,2.769184,0.424677,9.535818,0.424677
10/1/1973,0.005762,-1.5,-0.798793,-0.537483,-0.428656,10.319051,-0.428656
1/1/1974,-0.008605,-1.133333,0.141207,-0.647483,0.324677,8.681851,0.324677


In [56]:
# Covariance matrix of the centered data; columns are the variables (rowvar=False)
data_centered_cov = np.cov(data_centered, rowvar=False)
# Label both rows and columns with the variable names so each entry can be read as cov(row, column)
cov_frame = pd.DataFrame(data_centered_cov, index=df2.columns, columns=df2.columns)
cov_frame.head()

,dlog_gdp_yoy,unemp,d3m_tbill,dfed_fund,dcpi,manu_utilization,d10y_bond
dlog_gdp_yoy,0.000109,-0.006224,0.002226,0.003063,0.000786,0.027046,0.000786
unemp,-0.006224,2.862291,-0.20984,-0.308432,-0.093929,-4.138373,-0.093929
d3m_tbill,0.002226,-0.20984,0.555412,0.636176,0.233065,1.041314,0.233065
dfed_fund,0.003063,-0.308432,0.636176,0.863436,0.244554,1.453773,0.244554
dcpi,0.000786,-0.093929,0.233065,0.244554,0.249911,0.418869,0.249911


In [57]:
# eigh is suited to symmetric matrices; it returns eigenvalues in ascending order
eigen_values1, eigen_vectors1 = np.linalg.eigh(data_centered_cov)

# Rows correspond to the original variables; each column is one eigenvector (ascending eigenvalue order)
ev_labels = [f'ev{i + 1}' for i in range(eigen_vectors1.shape[1])]
vectors_frame = pd.DataFrame(eigen_vectors1, index=df2.columns, columns=ev_labels)
vectors_frame.head()

,ev1,ev2,ev3,ev4,ev5,ev6,ev7
dlog_gdp_yoy,6.967564e-15,0.999998,0.000271,-0.001055,0.000943,0.000159,-0.001267
unemp,-2.368971e-17,0.000426,0.005639,0.002869,-0.001884,-0.977402,0.211288
d3m_tbill,-2.614756e-15,-0.00052,-0.806977,-0.127235,0.574134,-0.017307,-0.051674
dfed_fund,1.738778e-15,-0.001353,0.568987,-0.428944,0.697778,-0.014828,-0.07173
dcpi,-0.7071068,0.000332,0.111811,0.632328,0.295305,-0.002643,-0.021162


In [58]:
# Eigenvalues belong to eigenvectors, not to the original variables, so label them by position
values_frame = pd.DataFrame(eigen_values1, index=[f'ev{i + 1}' for i in range(len(eigen_values1))], columns=['eigenvalues'])
values_frame

,eigenvalues
ev1,-2.244347e-17
ev2,7.141599e-05
ev3,0.04725484
ev4,0.29422
ev5,1.39639
ev6,1.963705
ev7,22.09395


In [59]:
# Reorder eigenvalues and eigenvectors from the largest eigenvalue to the smallest
index = np.argsort(eigen_values1)[::-1]

sorted_values = eigen_values1[index]

sorted_vectors = eigen_vectors1[:, index]
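
A common follow-up, sketched here as an illustration rather than as part of the original analysis, is to turn the sorted eigenvalues into explained-variance shares. The array below simply copies the eigenvalues printed above; the names `eigen_values_printed`, `explained_ratio` and `cumulative_ratio` are introduced for this example.

```python
import numpy as np

# Eigenvalues copied from the output above (ascending, as returned by eigh)
eigen_values_printed = np.array([-2.244347e-17, 7.141599e-05, 0.04725484,
                                 0.29422, 1.39639, 1.963705, 22.09395])

# Largest first, matching the order used for the principal components
sorted_printed = np.sort(eigen_values_printed)[::-1]

# Share of total variance per component, and the running total
explained_ratio = sorted_printed / sorted_printed.sum()
cumulative_ratio = np.cumsum(explained_ratio)

for i, (share, total) in enumerate(zip(explained_ratio, cumulative_ratio), start=1):
    print(f"PC{i}: share = {share:.4f}, cumulative = {total:.4f}")
```

With these values the first component alone accounts for roughly 86% of the total variance and the first three for roughly 98.7%. The smallest eigenvalue is zero up to floating-point error, which suggests the covariance matrix is singular; in the rows displayed earlier, `dcpi` and `d10y_bond` are identical, which would produce exactly this.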

In [61]:
# Use a separate name so the original data frame df is not overwritten
sorted_frame = pd.DataFrame(sorted_vectors, columns=[f'PC{i + 1}' for i in range(sorted_vectors.shape[1])])
# Pairwise Pearson correlation between the eigenvector columns
# (this is not the correlation matrix of the original variables, which would be df2.corr())
sorted_frame.corr(method='pearson')

,PC1,PC2,PC3,PC4,PC5,PC6,PC7
PC1,1.0,-0.1957692,0.3361536,0.1062957,-0.001584742,0.152697,9.633497e-16
PC2,-0.1957692,1.0,0.4683407,0.1480948,-0.002207917,0.2127427,1.330795e-15
PC3,0.3361536,0.4683407,1.0,-0.2542924,0.003791195,-0.3652987,-2.218235e-15
PC4,0.1062957,0.1480948,-0.2542924,1.0,0.001198821,-0.1155117,-6.346982e-16
PC5,-0.001584742,-0.002207917,0.003791195,0.001198821,1.0,0.001722142,4.163374e-17
PC6,0.152697,0.2127427,-0.3652987,-0.1155117,0.001722142,1.0,-1.033967e-15
PC7,9.633497e-16,1.330795e-15,-2.218235e-15,-6.346982e-16,4.163374e-17,-1.033967e-15,1.0


In [62]:
# We arbitrarily select 3 components, but you can change that
components = 3

# Keep the eigenvectors that belong to the largest eigenvalues
eigvect_subset = sorted_vectors[:, 0:components]

# See what the subset looks like in a dataframe
subset_frame = pd.DataFrame(eigvect_subset, columns=['ID1','ID2','ID3'])
subset_frame.head()

,ID1,ID2,ID3
0,-0.001267,0.000159,0.000943
1,0.211288,-0.977402,-0.001884
2,-0.051674,-0.017307,0.574134
3,-0.07173,-0.014828,0.697778
4,-0.021162,-0.002643,0.295305
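
The natural next step is to project the centered data on the selected eigenvectors. The sketch below illustrates this on random stand-in data with the same number of columns (7); `toy_centered`, `W`, `reduced` and `reconstructed` are hypothetical names, and the numbers are not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the centered data: 50 observations, 7 variables
toy_centered = rng.normal(size=(50, 7))
toy_centered -= toy_centered.mean(axis=0)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
toy_vals, toy_vecs = np.linalg.eigh(np.cov(toy_centered, rowvar=False))
order = np.argsort(toy_vals)[::-1]
toy_vecs = toy_vecs[:, order]

k = 3
W = toy_vecs[:, :k]             # 7 x 3 matrix of the leading eigenvectors

# Project: each observation now has k coordinates instead of 7
reduced = toy_centered @ W      # shape (50, 3)

# Map back to 7 dimensions; the gap to the original is the information lost
reconstructed = reduced @ W.T
mse = np.mean((toy_centered - reconstructed) ** 2)

print(reduced.shape)
print(mse)
```

In the notebook, the analogous projection would presumably be `data_centered.values @ eigvect_subset`, giving one row per quarter and one column per retained component.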
