In [67]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [130]:
df=pd.read_csv('../input/GDP_CAT.csv')

## CATALONIA GDP (2000-2016): Examination, insights and regression analysis for macroeconomic prediction

This dataset compiles the demand components of the GDP (Gross Domestic Product) of **CATALONIA**, *one of the 17 autonomous communities of Spain* and the Spanish region with the highest GDP output.

The goal of this paper is to build a predictive linear regression model of the GDP, our dependent variable (the explained, endogenous effect), on an array of independent variables or regressors (the exogenous predictors). In our case these are 6 macroeconomic indicators:

    Consumer expenditure = C
    Consumer public administrations = PC
    Equipment of goods and others (capital investment without construction) = Inv
    Construction = Con
    Total exports goods and services = E
    Total imports goods and services = Imp

#### Abstract
 

* Unfortunately, our series only starts in 2000, so the model is built on the 2000-2016 time series.


* All values in the DataFrame are in millions of euros (base 2010).


* The data was extracted from Idescat's annual economic accounts of Catalonia.

First of all, we have to reverse the DataFrame so that the rows start at the year 2000.

In [131]:
# Let's start the years from 2000

df = df.iloc[::-1]  # We can easily do this step with the iloc indexer

In [132]:
df = df.set_index('Year') # We establish  the YEARS as our index 

In [133]:
df # We can examine our df
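As a minimal sketch of the two steps above, the toy frame below (made-up numbers, not the Idescat data) is stored in descending year order, which is what the reversal implies about the raw CSV. Sorting by `Year` is shown as an equivalent alternative that does not depend on the file's row order.

```python
import pandas as pd

# Hypothetical toy data, newest year first
toy = pd.DataFrame({"Year": [2002, 2001, 2000], "GDP": [3.0, 2.0, 1.0]})

# Same approach as above: reverse the rows, then use the year as index
reversed_toy = toy.iloc[::-1].set_index("Year")

# Alternative: sort explicitly by year, whatever the original order was
sorted_toy = toy.sort_values("Year").set_index("Year")

print(reversed_toy)
```

Both variants give the same frame here; the explicit sort is simply more robust if the source file ever changes its ordering.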

### GDP performance: partial recovery after years of strong pain

In [134]:
# Let's check the GDP output evolution in the series

df.GDP.plot(figsize=(20,7), kind='area', legend=False, use_index=True, grid=True, color='aqua')

SIZE = 22
plt.rc('xtick', labelsize=SIZE)                                       # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE)                                       # fontsize of the tick Y labels 

plt.xlabel('YEARS', size=22)                                          # x title label 
plt.ylabel('GDP in Millions of €', size=22)                           # y title label 
plt.title('Total GDP of Catalonia (2000-2016)',size=30)               # plot title label                              
plt.legend(loc='upper left', prop={'size': 20})                       # legend location and size

The impact of the crisis has lasted longer in Spain than in most OECD economies. Its main effects include the collapse of the construction sector, very high unemployment, and large government deficits.

* From 2000 to 2008 the GDP showed strong performance underpinned by the booming of the construction sector.

* From 2008 to 2013 with the effects of the great recession the GDP suffered a strong adjustment.

* In 2013 we see the start of a slow recovery. Only in 2016 has the GDP surpassed that of before the economic crash of 2008.

In [135]:
# With pandas we can easily generate a series of the year-on-year percentage change of the GDP. Let's visualize it:

# We use the .pct_change method to add the yearly GDP percentage change to our DataFrame, then we plot it

df['pct_change'] = df.GDP.pct_change() * 100
df['pct_change'].plot(figsize=(20,7), kind='line', legend=False, use_index=True,
                      grid=True, color='aqua')

SIZE = 18
plt.rc('xtick', labelsize=SIZE)                            # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE)                            # fontsize of the tick Y labels 
plt.axhline(y=0)                                           # we create a line for the 0% (Y=0)

plt.xlabel('YEARS', size=22)                               # x title label 
plt.ylabel('Yearly GDP growth %', size=22)                 # y title label 
plt.title('Yearly GDP growth %', size=30)                  # plot title label                              

* From 2000 to 2007 the GDP experienced strong growth underpinned by a booming housing market and a process of over-lending, over-borrowing and overspending by both the private and public sectors.


* After years of poor performance and adjustment, the economy began a slow upward cycle in 2013.
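To make the growth figures concrete, here is a small sketch of what `.pct_change()` computes, using made-up values rather than the Catalan series. The manual version applies the usual definition, (value this year / value last year - 1) * 100.

```python
import pandas as pd

# Hypothetical GDP levels for three consecutive years
gdp = pd.Series([100.0, 110.0, 99.0], index=[2000, 2001, 2002])

# Built-in year-on-year percentage change
growth = gdp.pct_change() * 100

# The same quantity computed by hand with a one-period shift
manual = (gdp / gdp.shift(1) - 1) * 100

print(growth)
```

The first year has no predecessor, so its growth is `NaN`; this is why the growth plot starts one year later than the level plot.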

### Hexbin Visualisations: correlation of 2 Macroeconomic indicators with the GDP

First we can look at some hexbin visualizations to get a taste of the relationships between the different macroeconomic indicators. We can also check the relationship between some indicators and the GDP, which we will use as our dependent variable.

We have to remember that in a hexbin plot the color of each hexagon encodes how many observations fall inside it (with this colormap, lighter means more). The correlation itself is read from the pattern: the more tightly the filled hexagons line up along a diagonal, the more correlated one feature is with the other.

In [136]:
df.plot(y= 'GDP', x ='Domestic demand',kind='hexbin',gridsize=45, 
        sharex=False,colormap='cubehelix',figsize=(15,5))

SIZE1 = 14
plt.rc('xtick', labelsize=SIZE1)    # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE1)    # fontsize of the tick Y labels 

plt.xlabel('Domestic demand', size=16)
plt.ylabel('GDP', size=16)
plt.title('Hexbin of Domestic Demand vs GDP', size=20)

As expected, the **domestic demand** and the **GDP output** are highly correlated with one another.

In [137]:
df.plot(y= 'GDP', x ='Exports goods and services',kind='hexbin',gridsize=45, 
        sharex=False,colormap='cubehelix',figsize=(15,5))

plt.rc('xtick', labelsize=SIZE1)    # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE1)    # fontsize of the tick Y labels

plt.xlabel('Exports goods and services', size=16)
plt.ylabel('GDP', size=16)
plt.title('Hexbin of Exports goods and services VS GDP', size=20)

The relationship between **Exports** and the **GDP output** also shows the expected positive linear correlation, although a weaker one.
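The visual impression from the hexbin plots can be backed by a number. The sketch below computes Pearson's correlation coefficient with `Series.corr` on made-up values; the column names mirror the DataFrame's, but the figures are purely illustrative.

```python
import pandas as pd

# Hypothetical values, not the Idescat series
toy = pd.DataFrame({
    "GDP": [100.0, 105.0, 111.0, 108.0, 115.0],
    "Domestic demand": [95.0, 101.0, 106.0, 102.0, 109.0],
})

# Pearson's r: +1 is a perfect positive linear relationship, 0 is none
r = toy["GDP"].corr(toy["Domestic demand"])
print(round(r, 3))
```

On the real data the same call would be `df['GDP'].corr(df['Domestic demand'])`.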

### Construction sector: from property bubble to broken sector to strong "Schumpeterian" adjustment

Let's examine the evolution of the "Construction" component of the GDP since 2000. Before the Spanish property bubble burst in 2008, this sector was a huge contributor to the GDP output. We have to take into account that construction is one of the components of the gross capital formation of a given economy, alongside the equipment of goods.

**Gross capital formation = Construction + Equipment of goods.** Both components essentially make up the net private investment in a given economy.

Let's examine the sector performance since 2000 :

In [None]:
df['Const.'].plot(figsize=(20,7), kind='area', legend=False, use_index=True, grid=True, color='darkcyan')

SIZE = 22
plt.rc('xtick', labelsize=SIZE)                         # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE)                         # fontsize of the tick Y labels 

plt.xlabel('YEARS', size=26)                            # x title label 
plt.ylabel('Millions of €', size=22)                    # y title label 
plt.title('Construction (2000-2016)',
          size=30)                                      # plot title label                              

After the peak of 2007, we see a sharp downturn that coincides with the bursting of the Spanish property bubble in 2008 and with the start of the Great Recession in Spain.

We can also plot the evolution of the contribution ratio of the sector in the overall GDP :

In [139]:
# we create and plot a new series called Cons_per_GDP

Cons_per_GDP=df['Const.']/df.GDP*100

Cons_per_GDP.plot(figsize=(20,7), kind='bar', legend=False, use_index=True, grid=True, color='darkcyan')

SIZE = 18
plt.rc('xtick', labelsize=SIZE)                                        # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE)                                        # fontsize of the tick Y labels 

plt.xlabel('YEARS', size=20)                                           # x title label 
plt.ylabel('Ratio with the GDP %', size=20)                            # y title label 
plt.title('Construction as a ratio of the overall GDP (2000-2016)',
          size=26)                                                     # plot title label                         

At its peak of 2006 the construction component of the GDP accounted for more than 16% of the overall output of the economy. In recent years, the sector barely accounted for more than 6% of the overall GDP.

In terms of Schumpeter's concept of creative destruction ("a dynamic reassignment of resources and energies in a given economy"), the partial destruction of this sector's strength has seen new energies emerge in other "less speculative" and less cyclical components of the GDP.

### Exports: Catalonia's approach for pulling out of the recession.

Accounting for roughly 25% of total Spanish exports, Catalonia has traditionally been Spain's exporting powerhouse. The Spanish government has insisted that the private sector's competitiveness gains (thanks to the profit-led strategy) have increased Spain's exports and pulled the country out of the recession.

**Macroeconomic strategy approach: Profit-led**

Under this approach, companies gain competitiveness through labor cost reductions, with a strong adjustment of salaries and labor income. Although it has brought Catalan, and by extension Spanish, exports to record numbers, and has also spurred capital investment thanks to higher corporate profits, the impact and social effects of the crisis have lasted longer in Spain than in most OECD economies.

Many theorists argue that a **wage-led growth strategy**, which prioritizes stimulating the GDP via private internal demand (**private consumption expenditures**), would have been a better strategy for the economic recovery than an export-oriented recovery via lower wages and labour income devaluation, which limited the expansion of private consumption expenditures. However, as the economics professor Engelbert Stockhammer (Kingston University, UK) argues, a balanced wage-led growth strategy requires higher labor incomes to be aligned with marginal productivity gains in human capital.

The ultimate goal of this paper is not to prove which approach would have been more optimal for pulling Spain and Catalonia out of the recession. Instead, it aims to generate a predictive model for the Catalan GDP output and generate insights of the evolution and tendencies of some Macroeconomic indicators in the different economic cycles.  

Now, let's examine the evolution of the ratio of total exports of goods and services to the overall GDP :

In [140]:
# We create a new series called Exports_per_GDP
Exports_per_GDP=df['Total exports goods and services']/df.GDP*100
Exports_per_GDP.plot(figsize=(20,8), kind='bar', use_index=True, grid=True, color='b')
SIZE = 20
plt.rc('xtick', labelsize=SIZE)                     # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE)                     # fontsize of the tick Y labels 
plt.xticks(rotation=0)

plt.xlabel('YEARS', size=22)                        # x title label 
plt.ylabel('Ratio with the GDP %', size=22)         # y title label 
plt.title('Total exports of goods and services as a ratio of the overall GDP (2000-2016)',
          size=28)                                  # plot title label                              

We see that in 2009, when the recession hit the hardest, the Exports vs GDP ratio was by far the lowest in the series.

Using a Hexbin correlation plot we can examine the relationship between the GDP ratio for construction and the GDP ratio for Exports :

In [141]:
# We add both columns to our DataFrame

df['Cons_per_GDP']=Cons_per_GDP
df['Exports_per_GDP']=Exports_per_GDP

df.plot(y= 'Cons_per_GDP', x ='Exports_per_GDP',kind='hexbin',gridsize=45, 
        sharex=False,colormap='cubehelix',figsize=(15,5))

plt.rc('xtick', labelsize=SIZE1)    # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE1)    # fontsize of the tick Y labels

plt.xlabel('Ratio of total exports with the GDP %', size=16)
plt.ylabel('Ratio of construction with the GDP %', size=16)
plt.title('Hexbin of the GDP ratio of exports vs GDP ratio of construction %', size=16)

As we probably expected, we can see a negative linear correlation between both components.

When one tends to increase, the other one tends to decrease. 

We can make a similar analysis with the **Domestic Demand ratio with the GDP vs Exports ratio with the GDP**. However, in order to better examine how both components behave, we can subtract the construction sector from the domestic demand  :

In [142]:
# We create the series of the GDP ratio of Domestic Demand without construction 

Domestic_Demand_per_GDP_wc=(df['Domestic demand']-df['Const.'])/df.GDP*100
df['Domestic_Demand_per_GDP_wc']=Domestic_Demand_per_GDP_wc

df.plot(y='Domestic_Demand_per_GDP_wc', x ='Exports_per_GDP',kind='hexbin',gridsize=45, 
        sharex=False,colormap='cubehelix',figsize=(15,5))

plt.rc('xtick', labelsize=SIZE1)    # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE1)    # fontsize of the tick Y labels

plt.xlabel('Ratio of total exports with the GDP %', size=16)
plt.ylabel('Ratio of Domestic Demand without cons. with the GDP %', size=16)
plt.title('Hexbin of the GDP ratio of exports VS GDP ratio of domestic demand without cons. %', size=16)

We see a slightly positive correlation. This suggests that when the economy's energies are directed outside construction, exports and domestic demand expansion can go moderately hand in hand.

### HeatMap: correlation of our Macroeconomic indicators

We can create a HeatMap to have a general overview of the correlations of our indicators with each other.

In [143]:
# We set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 10))

plt.title('Pearson Correlation')

# We draw the heatmap using seaborn
sns.heatmap(df.corr(),linewidths=0.25,vmax=1.0, square=True, cmap="YlGnBu",linecolor='black', annot=True)

As expected, given the darker shade of the colors, we see many features with positive linear correlations amongst each other. Others like construction or construction ratio with the GDP have a negative correlation with many indicators like the Foreign balance account, exterior balance or Exports ratio per GDP.
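With many indicators the heatmap gets crowded, so it can help to list the strongly correlated pairs explicitly. The sketch below does this on a made-up three-column frame; the column names `a`, `b`, `c` and the 0.9 cut-off are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical data: b closely follows a, c does not
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = toy.corr()      # Pearson by default, the same call used for the heatmap
threshold = 0.9        # assumed cut-off for a "strong" correlation
cols = list(corr.columns)

# Visit each unordered pair once (upper triangle of the matrix)
strong_pairs = [
    (left, right, corr.loc[left, right])
    for i, left in enumerate(cols)
    for right in cols[i + 1:]
    if abs(corr.loc[left, right]) > threshold
]

for left, right, value in strong_pairs:
    print(left, right, round(value, 3))
```

Using the absolute value keeps strongly negative pairs in the list as well, which matters for indicators such as the construction ratio.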

### Trade openness: a meaningful indicator of GDP growth?

The Openness Index is an economic metric calculated as the ratio of a country's total trade (the sum of exports plus imports) to the country's GDP. Basically, the higher the index, the larger the influence of trade on domestic activities.
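The definition translates directly into a one-line formula. The helper below is a hypothetical illustration (the name `trade_openness` and the sample figures are not part of the dataset) that expresses the index as a percentage of GDP.

```python
def trade_openness(exports, imports, gdp):
    """Return total trade (exports + imports) as a percentage of GDP."""
    if gdp == 0:
        raise ValueError("GDP must be non-zero")
    return 100 * (exports + imports) / gdp


# Made-up figures in the same units (e.g. millions of euros)
openness = trade_openness(exports=70.0, imports=60.0, gdp=200.0)
print(openness)
```

Because imports are added rather than subtracted, the index can exceed 100% for very open economies.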


In order to examine the potential positive impact of trade openness for GDP growth, we can plot the trajectory of GDP growth and the ratio of trade openness. 

In [144]:
# We create our new series Trade openness adding the total exports and the total imports 

df['trad_op']= (df['Total exports goods and services']+df['Total imports goods and services'])

In [145]:
# We add both series to our DataFrame:

df['trad_op']= (df['trad_op']/df.GDP*100)         # We calculate the GDP ratio
df['pct_change']=(df.GDP.pct_change()*100)        

In [146]:
# Let's plot both trends in the same graph using the secondary_y argument

ax=df['trad_op'].plot(figsize=(20,7), kind='line', legend=False, grid=False, use_index=True,
                      color='aqua')
ax1=df['pct_change'].plot(secondary_y=True, figsize=(20,7), kind='line', legend=False, 
                          use_index=True,grid=False, color='r')

plt.axhline(y=0)
ax.legend(loc='lower left', prop={'size': 28})                  # legend location and size
ax1.legend (loc='lower right', prop={'size': 28})               # legend location and size

SIZE = 22
plt.rc('xtick', labelsize=SIZE)                                     # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE)                                     # fontsize of the tick Y labels 
                
ax.set_xlabel('YEAR',size=26)                                       # We arrange our axis labels
ax1.set_xlabel('YEAR',size=26) 
ax.set_ylabel('% Ratio of Trade Openness with the GDP', size=26)
ax1.set_ylabel('GDP growth in %', size=32)

plt.title('Trade openness and GDP growth', size=30)                 # plot title label  
plt.show() 

Both series show a remarkable alignment. What happens if instead of the relationship of trade openness with the GDP growth, we examine the same plot with the Construction sector weight in the GDP?

In [147]:
# Let's change the names:
ax=df['Cons_per_GDP'].plot(figsize=(20,7), kind='line', legend=False, use_index=True, color='aqua')
ax1=df['pct_change'].plot(secondary_y=True, figsize=(20,7), kind='line', legend=False,  use_index=True, grid=False, color='red')

plt.axhline(y=0)
ax.legend(loc='lower center', prop={'size': 28})                
ax1.legend (loc='lower left', prop={'size': 28})                 
                               
ax.set_xlabel('YEAR',size=26)                                              
ax1.set_xlabel('YEAR',size=26) 
ax.set_ylabel('Construction Ratio of the GDP in %', size=26)
ax1.set_ylabel('GDP growth in %', size=32)

plt.title('Construction weight in the GDP and GDP growth', size=30)        
plt.show() 

Here we see some interesting facts. The construction weight in the GDP and the GDP growth also show a remarkable alignment. However, after 2012 the alignment is completely broken. While the GDP growth starts an upward cycle, the construction weight in the GDP continues falling and stabilizes at around 6%. Simultaneously, the ratio of trade openness also starts an upward cycle until reaching 70% of the GDP in 2015.

It's difficult to establish a causal relationship, but the data seems to indicate that, generally speaking, trade openness has a positive spillover effect on Catalan GDP growth.

### Regression analysis using the least square method


We will make a regression model using 6 regressors :

    Consumer expenditure = C
    Consumer public adm = PC
    Equipment of goods and others (capital investment without construction) = Inv
    Construction = Con
    Total exports goods and services = E
    Total imports goods and services = Imp



**Consumer expenditure** is the amount of final consumption made by resident households to meet their everyday needs. In most developed economies it accounts for around 60% of the gross domestic product (GDP). Therefore, it is an essential variable for determining the GDP output.

The government component of the GDP (Consumer public adm) accounts for all **government expenditure.**

Economists pay special attention to **capital investment** because of the role it plays in improving the productive capacity of a country. In other words, capital investment makes it possible to produce at a higher level of efficiency thanks to an increase in marginal productivity.

**Total exports** account for the amount that foreigners spend on a home country's goods and services.

**Imports** represent domestic purchases of foreign-produced goods and services. So, they are deducted from the calculation of GDP. 

**Construction** represents private housing purchases (or residential investment).
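These definitions follow the expenditure approach to GDP. Using the symbols listed above, the textbook identity can be sketched as follows; it is written as an approximation because it assumes that items not in our list, such as changes in inventories and statistical discrepancies, are negligible:

$$
\mathit{GDP} \approx C + PC + \mathit{Inv} + \mathit{Con} + E - \mathit{Imp}
$$

The first four terms together correspond to domestic demand, and $E - \mathit{Imp}$ is the foreign balance.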

In [148]:
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

We will use these macroeconomic indicators as regressors :

In [149]:
X=['Consumer expenditure household','Consumer public adm','Equip. Goods others','Const.',
   'Total exports goods and services','Total imports goods and services']

In [150]:
# We create our matrix of regressors (independent variables)
X=df[X]

# We create our dependent variable
y=df.GDP

In [151]:
# We create a linear regression object
lm = LinearRegression()

In [152]:
# We fit our model
lm.fit(X,y)

In [153]:
# We keep a reference to the fitted scikit-learn model
model=lm.fit(X,y)

# We fit the same regression with statsmodels to obtain the summary statistics
result = sm.ols(formula="y ~ X", data=df).fit()
print(result.summary())

### Statistical tests:  Accuracy of the regression model


* The model presents very high values of R-squared (0.998) and Adj. R-squared (0.997), which means that our regressors explain 99.8% of the overall variability of the dependent variable.


* The F-statistic test shows that the model as a whole is statistically significant: its p-value, Prob (F-statistic) = 3.16e-13, is far below 5%.


* Checking the intercept and each individual regressor's p-value at a significance level of 5%, only 'Consumer expenditure household' = C is statistically significant.


* Near multicollinearity (when our regressors are highly correlated) often occurs in reality. In this case we can still estimate the regression coefficients, but we get high standard errors, so the estimated coefficients are not completely accurate. Some of our regressors are highly correlated with one another (see the HeatMap).
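A common way to quantify the multicollinearity mentioned in the last point is the variance inflation factor (VIF). It is not part of the original analysis; the sketch below shows the idea on made-up regressors `x1`, `x2`, `x3`, using the identity that the VIFs are the diagonal of the inverse of the regressors' correlation matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical regressors: x2 is almost a copy of x1, x3 is unrelated to both
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9],
    "x3": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
})

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the
# other regressors; it equals the j-th diagonal entry of inv(corr)
inv_corr = np.linalg.inv(toy.corr().to_numpy())
vif = pd.Series(np.diag(inv_corr), index=toy.columns)

print(vif.round(1))
```

A frequently quoted rule of thumb treats a VIF above roughly 10 as a sign of problematic collinearity. On the real data the same two lines could be applied to the regressor matrix `X`.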


### Line of best fit

We plot the line of best fit between the GDP predicted by our model and the actual GDP. We see a strong alignment.

In [154]:
p=lm.predict(X)

In [155]:
plt.figure(num=3, figsize=(20, 10), dpi=90, facecolor='w', edgecolor='aqua')

sns.regplot(x=y, y=p, marker='*', scatter_kws={"s": 350})  # recent seaborn versions require x and y as keywords

SIZE2=20  
plt.rc('xtick', labelsize=SIZE2)    # fontsize of the tick X labels 
plt.rc('ytick', labelsize=SIZE2)    # fontsize of the tick Y labels


plt.title('Predicted GDP vs Actual GDP', size=30)
plt.xlabel('Actual value', size=26)
plt.ylabel('Predicted value', size=26)
plt.show()

Let's check our errors :

In [156]:
Errors=(y-p)

print(Errors)

In [157]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=4)

# We refit the model on the training half only, so that the test half is truly unseen
lm.fit(X_train, y_train)

print('MSE on the training set (X_train, y_train):', np.mean((y_train - lm.predict(X_train)) ** 2))
print('MSE on the test set (X_test, y_test):', np.mean((y_test - lm.predict(X_test)) ** 2))

In [158]:
y_train - lm.predict(X_train) 

In [159]:
y_test - lm.predict(X_test)

As the observations for our model are so limited (a series of only 17 years), a single outlier has a huge impact on the model's predictive capacity. Having only 17 observations is a major drawback for model-building.
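Since the GDP is measured in millions of euros, the mean squared error is in squared units and hard to read. Taking its square root (RMSE) brings the error back to millions of euros. The helper below is a hypothetical sketch with made-up numbers, not output from the model above.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up actual and predicted values; every prediction is off by 3
example = rmse([100.0, 110.0, 120.0, 130.0], [103.0, 107.0, 123.0, 127.0])
print(example)
```

On the split above, the equivalent would be `sqrt(mean_squared_error(y_test, lm.predict(X_test)))`, using the functions already imported at the top of the notebook.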

### CONCLUSION

Although strong multicollinearity between the regressors prevents us from inferring direct causality for each individual GDP component, from the methodological standpoint we can conclude that a multiple regression model with GDP demand components as regressors and GDP output as the dependent variable allows the researcher to **infer empirical evidence for macroeconomic GDP prediction**. 