# Library and Data Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
from scipy.stats import norm
from scipy.stats import spearmanr
from scipy.stats import kendalltau
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


df = pd.read_csv('/kaggle/input/the-economic-freedom-index/economic_freedom_index2019_data.csv')

# Glance at Data and Preprocessing

In [None]:
# glance at data
print(df.head())

In [None]:
print(df.info())

It seems like there is a minor problem with null values. After doing some research on the Economic Freedom Index website, it seems that some countries (e.g. Iraq and Syria) are nearly completely null. This is confirmed by visualizing the pattern of null values with Seaborn.

In [None]:
sns.heatmap(df.isnull(), cbar=False)

Because the null values are clustered in a few rows, an intuitive and straightforward approach is to simply delete all the rows that contain nulls.

In [None]:
df = df.dropna(axis=0)
sns.heatmap(df.isnull(), cbar=False)
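The clustering of nulls can also be checked numerically rather than by eye. The sketch below uses a small made-up frame (the values and the name `toy` are placeholders, not the real dataset) to count nulls per row and see how many rows `dropna` would keep.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "CountryName": ["A", "B", "C", "D"],
    "2019 Score": [70.1, np.nan, 55.3, np.nan],
    "Tax Burden": [80.0, np.nan, 61.2, np.nan],
    "Inflation (%)": [2.1, 3.0, np.nan, np.nan],
})

# Number of missing values in each row
nulls_per_row = toy.isnull().sum(axis=1)

# Share of rows that contain at least one null
share_incomplete = (nulls_per_row > 0).mean()

# Rows that would survive dropping every row with a null
kept = toy.dropna(axis=0)
```

If most of the nulls sit in a handful of rows, `nulls_per_row` will be zero almost everywhere and large for a few countries, which is the situation where dropping rows costs the least.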

In [None]:
# it looks like some columns that should be floats have object dtype
# also some columns have extraneous symbols like $ signs and commas
# this code removes those symbols and converts to floats

columnsToChange = ['FDI Inflow (Millions)', 'GDP per Capita (PPP)', 'GDP (Billions, PPP)', 'Unemployment (%)', 'Population (Millions)']
for column in columnsToChange:
  # strip thousands separators and dollar signs, then convert to float
  df[column] = (df[column].str.replace(',', '', regex=False)
                          .str.replace('$', '', regex=False)
                          .astype(float))
    
print(df.info())
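For reference, here is a self-contained sketch of the same cleaning idea on a toy column. The helper name `to_float` and the sample strings are hypothetical. Unlike the cell above, this version uses `pd.to_numeric` with `errors="coerce"`, so any value that still isn't numeric becomes NaN instead of raising an error; that is a design choice of the sketch, not something the dataset requires.

```python
import pandas as pd

def to_float(series):
    """Strip '$' and ',' from a string column and convert it to float."""
    cleaned = (
        series.astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("$", "", regex=False)
        .str.strip()
    )
    # errors="coerce" turns anything that still isn't numeric into NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample values, including one that cannot be parsed
toy = pd.Series(["$1,958", "12,832", "7.5", "n/a"])
result = to_float(toy)
```

Coercing to NaN makes it easy to inspect the problem rows afterwards with `result.isna()`.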

Some of the columns have spelling errors or are hard to access. This code renames these columns. 

In [None]:
df.rename(columns={'Country Name': 'CountryName', 
                   'Judical Effectiveness': 'Judicial Effectiveness', 
                   'Gov\'t Spending': 'Gov Spending', 
                   'Gov\'t Expenditure % of GDP ': 'Gov Expenditure % of GDP',
                    'Investment Freedom ': 'Investment Freedom'}, inplace=True)

After visiting the Economic Freedom Index website again, I learned more about the structure of the dataset. Each country's '2019 Score' is simply the average of 12 component columns, each scored between 0 and 100. These components, or aspects of freedom, include Judicial Effectiveness, Fiscal Health, and Monetary Freedom, among others. Each country is also classified by its total score: "Free" for 80 to 100, "Mostly Free" for 70 up to 80, "Moderately Free" for 60 up to 70, "Mostly Unfree" for 50 up to 60, and "Repressed" for anything below 50.
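The scoring and classification rules can be written down as a short sketch. The function names and the example component values are made up for illustration, and I am assuming that a score sitting between two bands (79.5, say) belongs to the lower band.

```python
def freedom_category(score):
    """Map a 0-100 total score to its category label."""
    if score >= 80:
        return "Free"
    elif score >= 70:
        return "Mostly Free"
    elif score >= 60:
        return "Moderately Free"
    elif score >= 50:
        return "Mostly Unfree"
    return "Repressed"

def total_score(components):
    """Unweighted average of the component scores."""
    return sum(components) / len(components)

# Twelve made-up component scores, not taken from any real country
example = [80, 70, 75, 90, 60, 85, 80, 90, 75, 85, 80, 90]
example_total = total_score(example)
example_label = freedom_category(example_total)
```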

The dataset also includes other economic statistics about each country, like GDP per Capita, Population, and Unemployment.

This code helps organize the dataset by storing similar groups of columns into lists for later access. 

In [None]:
# the twelve components of economic freedom
RANKS = ['Property Rights', 'Judicial Effectiveness', 'Government Integrity', 'Tax Burden', 
              'Gov Spending', 'Fiscal Health', 'Business Freedom', 
              'Labor Freedom', 'Monetary Freedom', 'Trade Freedom', 
              'Investment Freedom', 'Financial Freedom']

# earlier list plus 2019 score column
RANKS_PLUS_TOTAL = ['2019 Score'] + RANKS

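# columns with other statistics for each country, mostly expressed as percentages
# (GDP per Capita is not a percentage; it is included so it shows up in the second correlation matrix)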
PERCENTAGE_STATS = ['Tariff Rate (%)', 'Income Tax Rate (%)', 'Corporate Tax Rate (%)', 
                     'Tax Burden % of GDP', 'Gov Expenditure % of GDP', 'GDP Growth Rate (%)', '5 Year GDP Growth Rate (%)',
                     'Unemployment (%)', 'Inflation (%)', 'Public Debt (% of GDP)', 'GDP per Capita (PPP)']

# EDA

Since this dataset is documented at length at the Economic Freedom Index website, there is not a lot of EDA needed. However, here are some visualizations that I found insightful. 

In [None]:
# returns color that matches color code found on website for scores
# uses >= so that a score of exactly 80 counts as "Free", exactly 70 as "Mostly Free", etc.
def classifier(item):
  if item >= 80:
    return "darkgreen"
  elif item >= 70:
    return 'limegreen'
  elif item >= 60:
    return "yellow"
  elif item >= 50:
    return 'orange'
  else:
    return 'red'

# function for making boxplots
def boxPlots(df):
  fig, ax = plt.subplots(1, 1, figsize=(14, 8))
  lst = [df[column] for column in df.columns]
  medianPropDict = dict(color='black', linewidth=1.5)
  bplot = ax.boxplot(lst, vert=True, showfliers=False, positions=range(1, len(df.columns)+1), patch_artist=True, medianprops=medianPropDict)
  ax.set_xticks(range(1, len(df.columns)+1))
  ax.set_xticklabels(df.columns, rotation=45)
  ax.set_xlabel("Categories")
  ax.set_ylabel("Score")
  ax.set_title('World Scores by Category')
  
  colors = []
  for column in df.columns:
    data = df[column]
    median = data.median()
    colors.append(classifier(median))
  
  for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)
    
# calls function
boxPlots(df[RANKS_PLUS_TOTAL])

This boxplot figure provides an excellent visualization of the spread and central tendency of the total score column and the 12 component columns. Note that the colors are aligned with the theme of the website: dark green for "Free" (80 and over), light green for "Mostly Free" (70 and over), and so forth. The median value of each boxplot determines its color classification.

As an additional note, it is worth pointing out that the total score boxplot is generally less spread out because it contains averages of the other columns. 
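The averaging effect is easy to reproduce with synthetic numbers. In the sketch below the 180 "countries" and 12 "components" are random draws, not the real data, and the components are drawn independently. The real components are positively correlated, so the reduction in spread is smaller in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# 180 hypothetical countries x 12 hypothetical components,
# each drawn independently from the same distribution
components = rng.normal(loc=60, scale=15, size=(180, 12))

# The total score is the row-wise average of the components
total = components.mean(axis=1)

# Spread of each component column versus spread of the averaged column
component_std = components.std(axis=0)
total_std = total.std()
```

With independent components the standard deviation of the average shrinks by roughly a factor of the square root of 12, which is why `total_std` comes out well below every entry of `component_std`.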

In [None]:
# produces correlation matrix heatmap with only half included--omits redundant symmetrical half and diagonal 1 values
def correlationMatrix(df):
  corr = df.corr()
  mask = np.triu(np.ones_like(corr, dtype=bool))
  f, ax = plt.subplots(figsize=(11, 9))
  cmap = sns.diverging_palette(20, 230, as_cmap=True)
  sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
  plt.title("Correlation Matrix")


# calls function with 12 component columns
correlationMatrix(df[RANKS])

It is no surprise that most of the 12 component columns are positively correlated. It makes intuitive sense that a country with high Government Integrity also has high Judicial Effectiveness, for instance. 

What is noteworthy, though, are the two columns that seem to be slightly negatively correlated with the rest: Tax Burden and Gov Spending. This makes sense too, when you think about it. Many "left leaning" countries are economically free in many respects but also have larger governments and higher taxes.
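As an aside on the plotting function, the mask built with `np.triu` is what hides the redundant half of the heatmap. A tiny sketch on a made-up 3x3 correlation matrix shows which cells stay visible (Seaborn hides the cells where the mask is True).

```python
import numpy as np

# Made-up symmetric correlation matrix
corr = np.array([
    [1.0, 0.8, -0.2],
    [0.8, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
])

# True on and above the diagonal, False strictly below it
mask = np.triu(np.ones_like(corr, dtype=bool))

# Emulate the heatmap: masked cells become NaN, the rest keep their value
visible = np.where(mask, np.nan, corr)
```

Only the three cells below the diagonal survive, which is every distinct pair exactly once.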

In [None]:
# calls correlation matrix with percentage stats and 2019 score
lst = PERCENTAGE_STATS + ['2019 Score']
correlationMatrix(df[lst])

The correlations are generally much milder, of course, but there are still some interesting points to note.

First, a couple of the strong positive correlations aren't that interesting. GDP Growth Rate and 5 Year GDP Growth Rate are positively correlated, for example, but that is to be expected. The strong positive correlation between Tax Burden and Gov Expenditure isn't surprising, either.

The two most noteworthy points in this graph are the strong negative correlation between 2019 Score and Tariff Rate and the strong positive correlation between 2019 Score and GDP per Capita.

Both of these correlations are interesting because of how dramatic they are. The negative correlation between 2019 Score and Tariff Rate is the more unexpected of the two. Maybe this is because American conservatives, traditionally proponents of economic freedom, have started supporting tariffs, but before analyzing this data I assumed that tariffs had a relatively small impact on economic freedom. That is clearly not the case, however.

As for the positive correlation between 2019 Score and GDP per Capita, that is less surprising. In the report on the website, the Heritage Foundation discusses at length the connection between the total score and GDP per Capita. We will discuss this relationship more later. But for now, it's interesting to see the connection showing up already. 

# Dashboarding

To help visualize any given country's Economic Freedom status, I built a small dashboard-like arrangement of bar plots. Again, the colors follow the color code on the Economic Freedom Index website. I quite like how it turned out.

In [None]:
def countryDashboard(countryName):
  countryData = df.loc[df.CountryName == countryName]
  

  fig = plt.figure(constrained_layout=True, figsize=(15, 9))
  gs = fig.add_gridspec(ncols=9, nrows=3)

  f_ax1 = fig.add_subplot(gs[:-1, 1:])
  string = countryName + ' Average and Category Scores'
  f_ax1.set_title(string, fontsize=16)
  data = countryData[RANKS]
  for idx, column in enumerate(data.columns):
    point = data[column].iloc[0]
    color = classifier(point)
    # draw on f_ax1 explicitly instead of relying on the current axes
    f_ax1.bar(idx, height=point, color=color, width=0.7)
    f_ax1.text(idx-0.25, point+2, str(point), fontsize=15)
  numColumns = range(len(data.columns))


  f_ax1.set_yticks(range(0, 110, 10))
  f_ax1.set_yticklabels([])
  f_ax1.set_xticks(numColumns)
  f_ax1.set_xticklabels(data.columns, rotation=45)
  

  f_ax2 = fig.add_subplot(gs[:-1, 0])
  totalScore = countryData['2019 Score'].iloc[0]
  color = classifier(totalScore)
  f_ax2.set_ylabel("Score")
  f_ax2.bar(1, height=totalScore, color=color, width=1.5)
  f_ax2.text(1-0.25, totalScore+2, totalScore, fontsize=15)
  f_ax2.set_xticks([1])
  f_ax2.set_xticklabels(["Average Score"], rotation=45)
  f_ax2.set_yticks(range(0, 110, 10))
  f_ax2.set_yticklabels(range(0, 110, 10))

In [None]:
# calls country dashboard function for United States
countryDashboard("United States")

Not surprisingly, the worst scores for the U.S. are Gov Spending and Fiscal Health. The fact that the best score for the U.S. is Labor Freedom doesn't surprise me either: I remember reading for my intro ECON class last semester about how easy it is for companies in the U.S. to fire employees.

In [None]:
# dashboard for Venezuela
countryDashboard("Venezuela")

There's not much to say here. The obviously disastrous situation in Venezuela is clearly reflected in these scores. 

In [None]:
# dashboard for China
countryDashboard("China")

In [None]:
#dashboard for Australia 
countryDashboard("Australia")

In [None]:
#dashboard for France
countryDashboard("France")

Notice the disastrous Tax Burden and Government Spending--and how they are paired with excellent scores in many of the other categories and a decent overall score. This kind of score structure is why Tax Burden and Gov Spending are negatively correlated with many of the categories, as revealed by the Correlation Matrix. 

# Model Building and Verification

The report for the 2019 Economic Freedom Index claims that the total score and GDP per Capita have a correlation coefficient of 0.64. Furthermore, it also claims that the best fitting trendline is an exponential one. In the following sections, I'll test these claims by conducting my own correlation calculations and fitting my own model. 

In [None]:
# correlation coefficients for various correlation calculations

x = df['2019 Score']
y = df['GDP per Capita (PPP)']

pearson, _ = pearsonr(x, y)
spearman, _ = spearmanr(x, y)
kendall, _ = kendalltau(x, y)

print("Pearson: ", round(pearson, 3))
print("Spearman: ", round(spearman, 3))
print("Kendall: ", round(kendall, 3))

These results seem consistent with the website's claims. Pearson's correlation coefficient is the closest to the claimed 0.64, but it measures linear correlation, so if the relationship is exponential rather than linear it may understate how strongly the two variables are related. Spearman's and Kendall's coefficients are rank-based: they measure how consistently one variable rises with the other (monotonic association), so they remain meaningful for non-linear trends. I don't know enough about their relative merits to say which of the two is preferable here.
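To make the difference between the coefficients concrete, here is a small NumPy-only sketch on synthetic data. Spearman's coefficient is Pearson's coefficient computed on ranks, so for a perfectly exponential (and therefore monotonic) relationship it equals 1 while Pearson's stays below 1. The helper names are mine, and the rank helper assumes there are no tied values.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation via the 2x2 correlation matrix."""
    return float(np.corrcoef(a, b)[0, 1])

def ranks(a):
    """Rank values from 0 to n-1 (assumes no ties)."""
    return np.argsort(np.argsort(a))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Synthetic, perfectly exponential relationship (not the real data)
x_demo = np.linspace(20, 100, 50)
y_demo = np.exp(0.1 * x_demo)

pearson_demo = pearson(x_demo, y_demo)
spearman_demo = spearman(x_demo, y_demo)
```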

In [None]:
# calls the sub functions
def scatter_plot_trend(x, y, degrees):
    # scatter plot
    plt.figure(figsize=(9, 7))
    plt.scatter(x, y)

    # calls getFittedTerms to get fitted terms
    terms = getFittedTerms(x, y, degrees)

    # smooth, continuous predictions for graphing purposes
    smoothInput = range(0, 100, 1)
    smoothPredictions = predict(smoothInput, terms)

    #plots predictions for polynomial
    label = "Fitted " + str(degrees) + ' degree Polynomial'
    poly = plt.plot(smoothInput, smoothPredictions, color='black', label=label)
    plt.legend()
    plt.xlim(20, 100)
    plt.ylim(-10000, 140000)
    plt.xlabel('Score')
    plt.ylabel("GDP per Capita")
    plt.title("Score vs GDP per Capita w/ Fitted Polynomial")

    # uses the x and y passed in rather than reaching for the global df
    predictions = predict(x, terms)

    RSQ = coefficientOfDetermination(np.array(y), np.array(predictions))
    print("R-Squared Score: ", round(RSQ, 3))



# produces predicted values for input x values
# abstract enough to work for any degree polynomial
def predict(x, terms):
    # terms are ordered from the highest power down to the constant,
    # which is the order np.polyfit returns and np.polyval expects
    points = np.array(list(x), dtype=float)
    return list(np.polyval(terms, points))

  
# extracts fitted coefficients and constant for any degree polynomial
def getFittedTerms(x, y, degrees):
    x = np.array(x)
    y = np.array(y)

    fitted = np.polyfit(x, y, degrees)
    terms = list(fitted)

    return terms

# calculates and returns R-Squared score for given arrays of actual values and predictions
def coefficientOfDetermination(actual, predictions):
    actual = np.asarray(actual, dtype=float)
    predictions = np.asarray(predictions, dtype=float)

    # mean squared error of the model (the unexplained variance)
    residualVariance = np.mean(np.square(actual - predictions))

    # variance of the actual values around their mean
    totalVariance = np.mean(np.square(actual - np.mean(actual)))

    return 1 - (residualVariance / totalVariance)

Let's now try fitting the 2019 Score and GDP per Capita with different degree polynomials and see which scores best.

In [None]:
# produces fitted 1 degree polynomial--i.e. a straight line
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 1)

In [None]:
# produces fitted 2 degree polynomial--a quadratic curve, used here as a stand-in for the exponential trend
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 2)

Note that the R-Squared score improved with the 2 degree polynomial. This might lead us to conclude that the curve is the better fit. That would be an interesting insight: it would mean that a country's GDP per Capita is expected to rise faster and faster with increases in economic freedom, not merely linearly. However, we should not jump to this conclusion just yet. To illustrate my point, let's fit other degree polynomials to the points.

In [None]:
# fits 3 degree poly to points
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 3)

Oh wow, the R-Squared score improved again? Does that mean that this is a better model for the points? No. Check out what happens when I fit a 40 degree polynomial to the points.

In [None]:
# fits 40 degree polynomial to the points
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 40)

The R-Squared score improved, but intuitively speaking, it does not make sense for the trend to be modeled by such a curve. Even ignoring the extreme fluctuations away from the main points, it is still implausible. Why the high R-Squared score, then? The model is overfitting the data. It scores well on the points it was fitted to, but it wouldn't be very good at predicting additional points: it has tuned itself too narrowly to these specific points to accurately predict other ones.
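The reason the score keeps climbing is structural: a higher degree polynomial can always reproduce a lower degree one by setting the extra coefficients to zero, so a least-squares fit scored on its own training points can only match or beat the simpler fit. A minimal sketch on synthetic data (the data, seed, and helper name are all made up for illustration):

```python
import numpy as np

def r_squared(actual, predicted):
    """1 minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(1)

# Synthetic data: a quadratic trend plus noise
x_demo = np.linspace(-1, 1, 40)
y_demo = 2 * x_demo**2 + x_demo + rng.normal(scale=0.3, size=x_demo.size)

# Fit and score each degree on the SAME points it was trained on
train_r2 = {}
for degree in (1, 2, 3, 6):
    coeffs = np.polyfit(x_demo, y_demo, degree)
    train_r2[degree] = r_squared(y_demo, np.polyval(coeffs, x_demo))
```

The values in `train_r2` never decrease as the degree grows, even though the true trend here is only quadratic. That is why a score computed on the fitting data cannot be used to pick the degree.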

Normally, this would be handled by splitting the data into a training set and a test set. The training set would be used to train the model, while the model would be scored from the test set--from points it hasn't seen yet. Let's now try that and see which model comes out ahead. 

In [None]:
# re-writing function to split data into training and test sets
def scatter_plot_trend(x, y, degrees):
    # scatter plot
    plt.figure(figsize=(9, 7))
    plt.scatter(x, y)

    # splits data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)


    # calls getFittedTerms to get fitted terms
    # uses training data to fit model
    terms = getFittedTerms(x_train, y_train, degrees)

    # smooth, continuous predictions for graphing purposes
    smoothInput = range(0, 100, 1)
    smoothPredictions = predict(smoothInput, terms)

    #plots predictions for polynomial
    label = "Fitted " + str(degrees) + ' degree Polynomial'
    poly = plt.plot(smoothInput, smoothPredictions, color='black', label=label)
    plt.legend()
    plt.xlim(20, 100)
    plt.ylim(0, 140000)
    plt.xlabel('Score')
    plt.ylabel("GDP per Capita")
    plt.title("Score vs GDP per Capita w/ Fitted Polynomial")
    
    # makes predictions from x_test
    predictions = predict(x_test, terms)

    # scores from y_test
    RSQ = coefficientOfDetermination(np.array(y_test), np.array(predictions))
    print("R-Squared Score: ", round(RSQ, 3))

In [None]:
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 1)

In [None]:
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 2)

In [None]:
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 3)

In [None]:
scatter_plot_trend(df['2019 Score'], df['GDP per Capita (PPP)'], 40)

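The R-Squared score drops far below zero here. That is possible because R-Squared is calculated as 1 minus the ratio of the model's squared error to the squared error of simply predicting the mean. On held-out test data nothing caps that ratio at 1, so a model that predicts worse than the mean gets a negative score, and a wildly overfit one can go far below negative 1.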
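A tiny worked example shows how R-Squared can go below zero, and below negative 1, once predictions are worse than simply guessing the mean. The numbers are made up, and `r_squared` is a standalone helper that computes the same quantity as `coefficientOfDetermination`.

```python
import numpy as np

def r_squared(actual, predicted):
    """1 minus residual sum of squares over total sum of squares."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]
mean_only = [2.0, 2.0, 2.0]        # always predicts the mean
reversed_trend = [3.0, 2.0, 1.0]   # gets the direction wrong
far_off = [10.0, 20.0, 30.0]       # right direction, wildly wrong scale

r2_mean = r_squared(actual, mean_only)            # 1 - 2/2 = 0.0
r2_reversed = r_squared(actual, reversed_trend)   # 1 - 8/2 = -3.0
r2_far = r_squared(actual, far_off)               # 1 - 1134/2 = -566.0
```

There is no lower bound: the further the predictions stray from the actual values, the more negative the score. That is exactly what a 40 degree polynomial does on points it was not fitted to.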

In any case, it looks like the 2nd degree polynomial is the best. However, even though I fixed the random seed so that every polynomial is scored on the same split, that one particular split might happen to favor some models over others. To check that the 2nd degree polynomial really is better, I'll calculate the R-Squared score many times with different random splits in a while loop, then compare its average with the averages for the other degrees.

In [None]:
# condensing scatter_plot_trend function to not plot anything
def score(x, y, degrees):
    # splits data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    
    # calls getFittedTerms to get fitted terms
    # uses training data to fit model
    terms = getFittedTerms(x_train, y_train, degrees)
    
    # makes predictions on x test
    predictions = predict(x_test, terms)

    # scores from y_test, returns
    return coefficientOfDetermination(np.array(y_test), np.array(predictions))

x = df['2019 Score']
y = df['GDP per Capita (PPP)']

# lists to store scores for each iteration
degree1 = []
degree2 = []
degree3 = []

# sets up while loop
num_iterations = 1000
counter = 0
while counter < num_iterations:
    counter += 1
    
    # appends score for each deg poly to respective list
    degree1.append(score(x, y, 1))
    degree2.append(score(x, y, 2))
    degree3.append(score(x, y, 3))
    

# finds averages
degree1AVG = np.sum(degree1) / len(degree1)
degree2AVG = np.sum(degree2) / len(degree2)
degree3AVG = np.sum(degree3) / len(degree3)

# outputs results in formatted way
print("1-Degree Model AVG: ", round(degree1AVG, 3))
print("2-Degree Model AVG: ", round(degree2AVG, 3))
print("3-Degree Model AVG: ", round(degree3AVG, 3))
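One caveat about the loop above: each call to `score` draws its own random split, so within one iteration the three degrees are scored on different partitions. A paired variant, where every degree sees the same split in each iteration, removes that source of noise. The sketch below is NumPy-only and runs on synthetic data; `paired_scores`, its parameters, and the demo arrays are hypothetical names, not part of the notebook.

```python
import numpy as np

def r_squared(actual, predicted):
    """1 minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1 - ss_res / ss_tot

def paired_scores(x, y, degrees, n_iterations=200, test_size=0.2, seed=0):
    """Score every degree on the same random split in each iteration."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_test = max(1, int(round(test_size * len(x))))
    scores = {d: [] for d in degrees}
    for _ in range(n_iterations):
        # one shuffle per iteration, shared by all degrees
        order = rng.permutation(len(x))
        test_idx, train_idx = order[:n_test], order[n_test:]
        for d in degrees:
            coeffs = np.polyfit(x[train_idx], y[train_idx], d)
            preds = np.polyval(coeffs, x[test_idx])
            scores[d].append(r_squared(y[test_idx], preds))
    return {d: np.array(s) for d, s in scores.items()}

# Synthetic stand-in for (score, GDP per capita); not the real data
rng = np.random.default_rng(42)
x_demo = rng.uniform(30, 90, size=150)
y_demo = 20 * (x_demo - 30) ** 2 + rng.normal(scale=3000, size=150)

results = paired_scores(x_demo, y_demo, degrees=(1, 2, 3))
mean_scores = {d: float(s.mean()) for d, s in results.items()}
median_scores = {d: float(np.median(s)) for d, s in results.items()}
```

Because the scores are paired, the per-iteration differences `results[2] - results[1]` can be examined directly. The median is reported alongside the mean because a single very negative R-Squared on an unlucky split can drag the mean down.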

    

This seems like fairly conclusive evidence that the 2-degree model is, after all, the best. If we wanted to be even more certain, we could ramp up the number of iterations in the while loop, add higher-degree polynomials, or even conduct a hypothesis test on the differences between these sample means. I'll end my calculations here, though.
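One further check that this notebook does not run is fitting an exponential curve directly instead of approximating it with a polynomial. A common approach, sketched here under the assumption that every y value is positive, is to fit a straight line to log(y), which corresponds to the model y = a * exp(b * x). The function names and the synthetic data are mine. Note that this minimizes error in log space, so its R-Squared in GDP units is not guaranteed to beat the quadratic's.

```python
import numpy as np

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by least squares on log(y); requires y > 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # log(y) = log(a) + b * x is a straight line in x
    b, log_a = np.polyfit(x, np.log(y), 1)
    return float(np.exp(log_a)), float(b)

def predict_exponential(x, a, b):
    """Evaluate the fitted exponential curve at x."""
    return a * np.exp(b * np.asarray(x, dtype=float))

# Noise-free synthetic data, only to show that the fit recovers a and b
x_demo = np.linspace(30, 90, 25)
y_demo = 500.0 * np.exp(0.05 * x_demo)

a_hat, b_hat = fit_exponential(x_demo, y_demo)
```

The predictions from `predict_exponential` could then be scored on held-out points with the same R-Squared function used for the polynomials, which would put the exponential and quadratic models on equal footing.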

# Concluding Thoughts

I'll conclude this notebook by going over the motivation behind these calculations again. In their report, the Economic Freedom Index website claimed that the correlation between GDP per Capita and their Economic Freedom Score is 0.64. This seems fairly accurate. Additionally, they claimed that the trend between these two variables can be modeled with an exponential function. A 2-degree polynomial is not literally an exponential function, but it curves upward in the same way, and because it came out as the best fit in my calculations, my results are consistent with that claim as well.

This is significant because of the upward curve. It suggests that countries that increase their economic freedom score can expect faster-than-linear increases in their GDP per Capita. We must be careful, though: none of this proves causality. My guess is that confounding variables are very likely at play. The website isn't especially careful about this point. It suggests that economic freedom causes rises in GDP per Capita, as well as other things like social progress and environmental health. From an intuitive point of view, I would suggest that social progress is more likely causing economic freedom.

Thanks for reading my notebook!