<h1>IS AN ECONOMICALLY FREE COUNTRY A BETTER PLACE TO LIVE?</h1>

<img src="https://www.fraserinstitute.org/sites/default/files/styles/large/public/economic-freedom-of-the-world-2018.jpg">

## My First Kernel In Python

Recently I concluded a postgraduate degree in statistics and created a profile on Kaggle to share some things I learned and also to learn more from the community. At college we were introduced to R, where we learned how to make plots and run statistical tests (that was the main purpose). I even ran two kernels in R to practice EDA and machine learning.

But I realized that most data scientists work with Python and that it is much more widely used than R, so I decided to learn it and check out the advantages of the language.

So I'm running my first kernel in Python to draw some insights and to practice Python a little.

Since I'm a beginner in data science, I'd really appreciate it if you could tell me whether the analysis follows sound logic, which topics I can improve, or what else I could do with this dataset.

# Summary

**1. [Objectives](#objectives)** <br>
**2. [About the Dataset](#about)** <br>
**3. [Data Cleaning](#cleaning)** <br>
**4. [Exploratory Data Analysis](#eda)** <br>
**5. [BRICS & Chile](#briccs)** <br>
**6. [Human Development Index](#hdi)** <br>
**7. [Merging Data Frames](#merge)** <br>
**8. [HDI & Economic Freedom](#hdi_econ)** <br>
**9. [Conclusions](#conc)** <br>
**10. [References](#ref)** <br>

<a id="objectives"></a>
# 1. Objectives

As I said earlier, this kernel was made to draw some insights through an Exploratory Data Analysis (EDA) and to check whether a country that is more economically free also has a higher Human Development Index (HDI).

I'll divide this analysis into two parts. In the first, I'm going to do an EDA on the Economic Freedom of the World data; in the second, I'm going to compare it to the HDI and see whether there is any correlation between economic features and human development.

<a id="about"></a>
# 2. About the Dataset

The Economic Freedom of the World Report made by Fraser Institute is the world’s premier measurement of economic freedom, ranking countries based on five areas: size of government, legal structure and security of property rights, access to sound money, freedom to trade internationally, and regulation of credit, labour and business.

The institute divides and subdivides the characteristics for assessing the economic freedom of a country. The main ones are:

<b>Area 1: Size of Government</b> — As government spending,
taxation, and the size of government-controlled enterprises
increase, government decision-making is substituted for
individual choice and economic freedom is reduced.

<b>Area 2: Legal System and Property Rights</b> —Protection of
persons and their rightfully acquired property is a central
element of both economic freedom and civil society.
Indeed, it is the most important function of government.

<b>Area 3: Sound Money</b> — Inflation erodes the value of rightfully
earned wages and savings. Sound money is thus essential
to protect property rights. When inflation is not only high
but also volatile, it becomes difficult for individuals to plan
for the future and thus use economic freedom effectively.

<b>Area 4: Freedom to Trade Internationally</b> — Freedom to
exchange—in its broadest sense, buying, selling, making
contracts, and so on—is essential to economic freedom,
which is reduced when freedom to exchange does not
include businesses and individuals in other nations.

<b>Area 5: Regulation</b> — Governments not only use a number
of tools to limit the right to exchange internationally, they
may also develop onerous regulations that limit the right to
exchange, gain credit, hire or work for whom you wish, or
freely operate your business.

## Importing Packages

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import statsmodels.formula.api as smf
from scipy import stats
from sklearn.model_selection import train_test_split

## Loading Data

In [None]:
path = "../input/economic-freedom/efw_cc.csv"
data = pd.read_csv(path)

## Getting To Know About The Dataset 

First of all, let's check the data we're going to work with. Let's verify things such as the shape of the data, the first rows or if the dataset has any missing value.

In [None]:
print('Dimensions:',data.shape)

In [None]:
data.head(8)

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
# How many null values do the columns have?
data.isnull().sum()

<a id="cleaning"></a>
# 3. Data Cleaning

## Dropping Features With More Than 1242 Null Values

Here I'm going to keep only the variables with at most 1242 null values, a cutoff of roughly one third of the rows. Since I'm going to fill the NaN values with the median, filling columns that have only a little real data could push them too far from reality.

In [None]:
data = data.loc[:, (data.isnull().sum(axis=0) <= 1242)]
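
The same kind of filter could also be written with a share of missing values instead of a fixed count, so it keeps working if the number of rows changes. This is only a minimal sketch on a made-up frame; `toy` and `max_null_share` are names I'm introducing for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],              # no missing values
    "b": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],           # 1 of 6 missing
    "c": [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0],  # 4 of 6 missing
})

max_null_share = 1 / 3  # hypothetical cutoff, as a share of the rows

# isnull().mean() gives the fraction of missing values in each column.
null_share = toy.isnull().mean()
kept = toy.loc[:, null_share <= max_null_share]

print(list(kept.columns))
```

Here column `c` is dropped because two thirds of its values are missing, while `a` and `b` stay.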

In [None]:
# Rename the columns for a better understanding
data.rename(columns={"year": "YEAR",
                     "ISO_code": "ISO_CODE",
                     "countries": "COUNTRY",
                     "rank" :"RANK",
                     "quartile": "QUARTILE",
                     "ECONOMIC FREEDOM": "SCORE",
                     "1a_government_consumption": "GOV_CONSUMPTION",
                     "1b_transfers": "TRANSFERS",
                     "1c_gov_enterprises": "GOV_ENTERPRISES",
                     "1d_top_marg_tax_rate": "TOP_MARG_TAX_RATE",
                     "1_size_government": "GOV_SIZE",
                     "2b_impartial_courts": "IMPARTIAL_COURTS", 
                     "2c_protection_property_rights": "PROTEC_PROP_RIGHTS",
                     "2d_military_interference": "MILITARY_INTERF",
                     "2e_integrity_legal_system": "INTEGRITY_LEGAL_SYST",
                     "2j_gender_adjustment": "GENDER_ADJUSTMENT",
                     "2_property_rights": "PROPERTY_RIGHTS",
                     "3a_money_growth": "MONEY_GROWTH",
                     "3b_std_inflation": "STD_INFLATION",
                     "3c_inflation": "INFLATION",
                     "3d_freedom_own_foreign_currency": "FOREIGN_CURRENCY",
                     "3_sound_money": "SOUND_MONEY",
                     "4a_tariffs": "TARIFFS",
                     "4c_black_market": "BLACK_MARKET",
                     "4d_control_movement_capital_ppl": "CONTROL_MOVEMENT",
                     "4_trade": "TRADE",
                     "5a_credit_market_reg": "CREDIT_MARKET_REG",
                     "5b_labor_market_reg": "LABOR_MARKET_REG",
                     "5_regulation": "REGULATION"}, inplace=True)

## Filling Missing Values

Here I decided to fill the missing values with the median. This should not distort the variance very much, because the median is taken per country, so the filled value has a better chance of being close to the real one.

In [None]:
# First, forward-fill the QUARTILE column. It has to be an integer.
data.QUARTILE = data.QUARTILE.ffill()

# Then select the numeric columns to fill their missing values.
num_names = data.select_dtypes(include='number').columns

# Fill each country's gaps with that country's own median.
data[num_names] = data.groupby('ISO_CODE')[num_names].transform(lambda x: x.fillna(x.median()))
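
To make the per-country fill easier to follow, here is a small sketch on made-up numbers. The country codes `AAA` and `BBB` and the values are hypothetical; the point is only to show what `groupby(...).transform(...)` does when a group has some data and when it has none.

```python
import numpy as np
import pandas as pd

# Made-up example: AAA has two observed values, BBB has none.
toy = pd.DataFrame({
    "ISO_CODE": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "TARIFFS": [4.0, np.nan, 8.0, np.nan, np.nan],
})

# Fill each country's gaps with that country's own median.
toy["TARIFFS"] = toy.groupby("ISO_CODE")["TARIFFS"].transform(
    lambda s: s.fillna(s.median())
)

print(toy)
```

The gap for `AAA` becomes 6.0 (the median of 4.0 and 8.0), while the rows for `BBB` stay NaN because there is nothing to take a median from.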

In [None]:
data.isnull().sum()

As we can see, some values were not filled by the function. That probably happened because the respective country had no data for that variable in any year, so there was no median to take.

But we can get a little closer by taking the median of the respective quartile, given that countries in the same quartile share similarities.

In [None]:
data.QUARTILE = data.QUARTILE.astype('object')

# Columns to fill with the median of the row's quartile
cols_to_fill = ['TRANSFERS', 'GOV_ENTERPRISES', 'PROTEC_PROP_RIGHTS',
                'INTEGRITY_LEGAL_SYST', 'TARIFFS', 'BLACK_MARKET']
data[cols_to_fill] = data.groupby('QUARTILE')[cols_to_fill].transform(lambda x: x.fillna(x.median()))

<a id="eda"></a>
# 4. Exploratory Data Analysis

## Correlation Matrix Heatmap

In [None]:
# Numeric values only (public API instead of the private _get_numeric_data)
data_num = data.select_dtypes(include='number')
data_cor = data_num.corr()

#Plot heatmap
sns.set(font_scale=1.4)

plt.figure(figsize=(13,13))
sns.heatmap(data_cor,  square=True, cmap='coolwarm_r')

This is the correlation matrix heatmap of all numeric variables, but from here on I'm going to use only the main features.

In [None]:
# Main Features
data_num_2 = data.loc[:,['SCORE', 'GOV_SIZE', 'PROPERTY_RIGHTS', 'SOUND_MONEY', 'TRADE', 'REGULATION']]
data_cor_2 = data_num_2.corr()

sns.set(font_scale=1.4)

plt.figure(figsize=(12,12))
sns.heatmap(data_cor_2,  square=True, annot=True, cmap='coolwarm_r')

Using only the five main features, we can see that most of them are strongly correlated with the final score. The exception is government size, which does not seem to have a strong relationship with the economic freedom index.
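
Besides reading the heatmap, the correlations with the score can be listed and sorted directly. The sketch below uses made-up numbers, not the real dataset; only the column names are borrowed from this kernel.

```python
import pandas as pd

# Made-up scores for five hypothetical country-years.
toy = pd.DataFrame({
    "SCORE":    [5.0, 6.0, 7.0, 8.0, 9.0],
    "TRADE":    [4.0, 5.5, 6.5, 8.0, 9.5],
    "GOV_SIZE": [6.0, 5.0, 7.0, 5.5, 6.5],
})

# Pearson correlation of every feature with SCORE, strongest first.
corr_with_score = (
    toy.corr()["SCORE"]
       .drop("SCORE")
       .sort_values(ascending=False)
)

print(corr_with_score)
```

With the real data, replacing `toy` by `data_num_2` should give the same ranking that the annotated heatmap shows in its SCORE row.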

## Which Are The Least And Top Economically Free Countries?

The dataset has the index for each country from 1970 to 2016. We can check which countries are at the bottom and at the top in the last year measured.

## Least 15 Economically Free Countries in 2016

In [None]:
sns.set_palette(sns.dark_palette("red",15, reverse=False))
sns.set_style('whitegrid')

top_15_16_least = data[data.YEAR==2016].sort_values(by='SCORE', ascending=False).tail(15)
top_15_16_least.plot('COUNTRY', 'SCORE', kind='bar', figsize=(14,8), rot=45)

plt.xlabel('COUNTRIES')
plt.ylabel('SCORE')
plt.title('Least 15 Economically Free Countries in 2016')
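
As a side note, pandas also has `nsmallest` and `nlargest`, which pick the bottom or top rows without sorting the whole frame first. A minimal sketch with made-up countries and scores:

```python
import pandas as pd

# Made-up scores for five hypothetical countries in one year.
toy = pd.DataFrame({
    "COUNTRY": ["A", "B", "C", "D", "E"],
    "YEAR": [2016] * 5,
    "SCORE": [7.1, 3.2, 8.4, 5.0, 6.3],
})

latest = toy[toy["YEAR"] == 2016]

bottom_2 = latest.nsmallest(2, "SCORE")  # lowest scores, ascending order
top_2 = latest.nlargest(2, "SCORE")      # highest scores, descending order

print(bottom_2["COUNTRY"].tolist(), top_2["COUNTRY"].tolist())
```

Note that `nsmallest` returns the rows from lowest to highest, so the bars would appear in the opposite order from the `sort_values(...).tail(15)` version.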

Most of the bottom 15 countries have about the same score. Only Venezuela, the last country in the 2016 ranking, shows a huge gap compared to the others.
Venezuela has been facing one of its biggest crises in the last few years.

Another observable fact is that most of them are African and Asian countries, except for Argentina and Venezuela, which are located in South America.

## Top 15 Most Economically Free Countries in 2016

In [None]:
sns.set_palette(sns.dark_palette("green",15, reverse=False))
sns.set_style('whitegrid')

top_15_2016 = data[data.YEAR==2016].sort_values(by='SCORE', ascending=False).head(15)
top_15_2016.plot('COUNTRY', 'SCORE', kind='bar', figsize=(14,8), rot=45)

plt.xlabel('COUNTRIES')
plt.ylabel('SCORE')
plt.title('Top 15 Most Economically Free Countries in 2016')

Given that these are the top 15 countries in 2016, we can't see a large difference among them. The only observation that can be made is that most of these countries are from Europe or Asia, and we only have one from Africa (Mauritius) and one from South America (Chile).

But have these countries always been economically free like this?

## Were they always like that?

In [None]:
names = top_15_2016['COUNTRY']
top_15 = data.loc[data['COUNTRY'].isin(names)]

sns.set_palette(sns.color_palette("colorblind",15))
sns.set_style('whitegrid')

fig, ax = plt.subplots()

for key, grp in top_15.groupby('COUNTRY'):  # a plain string key keeps the legend label a country name, not a tuple
    ax = grp.plot(ax=ax, kind='line', x='YEAR', y='SCORE', label=key, figsize=(20,10), linewidth=2.5)
    
plt.xlim((1970, 2016))
plt.xlabel('YEAR')
plt.ylabel('SCORE')
plt.title('SCORE BETWEEN 1970 AND 2016')

It's possible to see in this plot that most of these countries have always had a high economic freedom index. Some have had large swings over the years, probably driven by changes of government or historical events.

The exception is Chile, the only South American country among the top 15 in 2016. The most interesting fact is that Chile was a 4th quartile country in the mid-1970s.

<a id="briccs"></a>
# 5. BRICS & Chile

Since we are talking about country development, I decided to take the BRICS countries and compare them to Chile, the country where we saw a huge growth in the index.

For those who are unfamiliar, BRICS is an acronym for the emerging economies of Brazil, Russia, India, China and South Africa. Together they represent about 25% of the world's land mass and more than 40% of its population.

In [None]:
briccs_names = ['Brazil', 'Russia', 'India', 'China', 'Chile', 'South Africa']
briccs = data.loc[data['COUNTRY'].isin(briccs_names)]

sns.set_palette(sns.color_palette("bright",6))
sns.set_style('whitegrid')

fig, ax = plt.subplots()

for key, grp in briccs.groupby('COUNTRY'):  # a plain string key keeps the legend label a country name, not a tuple
    ax = grp.plot(ax=ax, kind='line', x='YEAR', y='SCORE', label=key, figsize=(18,10), linewidth=2.5)
    
plt.xlim((1970, 2016))
plt.legend(loc='lower right')
plt.xlabel('YEAR')
plt.ylabel('SCORE')
plt.title('BRICCS SCORE BETWEEN 1970 AND 2016')

Among the BRICS, all countries have an index between 6.0 and 7.0 except Brazil, which stays below 6.0.

There is no oscillation for Russia until 1990, probably because those years were null values filled with the median.
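
One way to avoid being misled by flat segments like this would be to record which values were missing before filling them. This is just a sketch on made-up numbers; the `SCORE_IMPUTED` column is a name I'm inventing here, it is not in the dataset.

```python
import numpy as np
import pandas as pd

# Made-up series for one hypothetical country with two missing early years.
toy = pd.DataFrame({
    "COUNTRY": ["X"] * 4,
    "YEAR": [1970, 1980, 1990, 2000],
    "SCORE": [np.nan, np.nan, 5.0, 6.0],
})

# Record which values are missing *before* filling them.
toy["SCORE_IMPUTED"] = toy["SCORE"].isna()

toy["SCORE"] = toy.groupby("COUNTRY")["SCORE"].transform(
    lambda s: s.fillna(s.median())
)

# Keep only the observed values, e.g. to plot without the flat filled part.
observed = toy[~toy["SCORE_IMPUTED"]]

print(toy)
```

With a flag like this, the line plot could use `observed` only, or draw the imputed years in a different style.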

## Main Economic Freedom Features in BRIC'C'S

Now we can compare the main features for the BRICS countries and Chile in 1970 and 2016.

In [None]:
# Separate values from 1970
briccs_1970 = briccs.loc[briccs['YEAR'] == 1970]
main_feat = ['SCORE','GOV_SIZE', 'PROPERTY_RIGHTS', 'SOUND_MONEY', 'TRADE', 'REGULATION']

sns.set(font_scale=1.4)
sns.set_style('whitegrid')
briccs_1970.plot(x='COUNTRY', y=main_feat, kind='bar', rot= 0,figsize=(16,10))
plt.ylim(0,11)
plt.legend(loc='upper left')
plt.xlabel("BRIC'C'S COUNTRIES")
plt.ylabel("SCORE")
plt.title("Main Features For BRIC'C'S COUNTRIES IN 1970")

########################################################################################################################

# Separate values from 2016
briccs_2016 = briccs.loc[briccs['YEAR'] == 2016]

briccs_2016.plot(x='COUNTRY', y=main_feat, kind='bar', rot= 0,figsize=(16,10))
plt.ylim(0,11)
plt.xlabel("BRIC'C'S COUNTRIES")
plt.ylabel("SCORE")
plt.title("Main Features For BRIC'C'S COUNTRIES IN 2016")

We can observe that all features had a similar growth in the BRICS countries. In Chile, on the other hand, it's possible to see a huge difference in two specific variables: TRADE and PROPERTY_RIGHTS. Maybe this growth explains why Chile went from the 4th quartile in 1970 to the 1st quartile in 2016.
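
The growth per feature can also be computed as a number instead of being read off two bar charts. Below is a minimal sketch with two hypothetical countries and made-up values; only the column names come from this kernel.

```python
import pandas as pd

# Made-up values for two hypothetical countries in the two years compared.
toy = pd.DataFrame({
    "COUNTRY": ["A", "A", "B", "B"],
    "YEAR": [1970, 2016, 1970, 2016],
    "TRADE": [3.0, 8.0, 5.0, 6.0],
    "PROPERTY_RIGHTS": [4.0, 7.0, 5.0, 5.5],
})
features = ["TRADE", "PROPERTY_RIGHTS"]

# One row per country for each year, indexed by country so the rows align.
first = toy[toy["YEAR"] == 1970].set_index("COUNTRY")[features]
last = toy[toy["YEAR"] == 2016].set_index("COUNTRY")[features]

# Positive numbers mean the feature grew between the two years.
change = last - first

print(change)
```

With the real data, `briccs_1970` and `briccs_2016` could take the place of `first` and `last` after setting `COUNTRY` as the index.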

In [None]:
fig = plt.gcf()
fig.set_size_inches(16, 10)
sns.set(font_scale=1.4)

data.QUARTILE = data.QUARTILE.astype('int64')

sns.scatterplot(x='PROPERTY_RIGHTS', y='SCORE', data=data, s=45,\
                hue='QUARTILE', palette=["#9b59b6", "#3498db", "#e74c3c", "#2ecc71"])
plt.xlabel('PROPERTY RIGHTS')
plt.ylabel('SCORE')
plt.title('RELATION BETWEEN SCORE AND PROPERTY RIGHTS')

In [None]:
fig = plt.gcf()
fig.set_size_inches(16, 10)
sns.set(font_scale=1.4)

sns.scatterplot(x='TRADE', y='SCORE', data=data, s=45,\
                hue='QUARTILE', palette=["#9b59b6", "#3498db", "#e74c3c", "#2ecc71"])
plt.xlabel('TRADE')
plt.ylabel('SCORE')
plt.title('RELATION BETWEEN SCORE AND TRADE')

As we saw in the correlation matrix heatmap, TRADE and PROPERTY_RIGHTS have a strong positive correlation with the economic freedom SCORE. In other words, the higher the TRADE or PROPERTY_RIGHTS value, the higher the index.

<a id="hdi"></a>
# 6. Human Development Index

"The HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The HDI can also be used to question national policy choices, asking how two countries with the same level of GNI per capita can end up with different human development outcomes. These contrasts can stimulate debate about government policy priorities." (Human Development Reports)

"The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and have a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions."(Human Development Reports)

A whole kernel could be made from the HDI data alone. The data on the Human Development Reports website (http://hdr.undp.org/) is a playground for data scientists ;)

But here I'm only going to use the HDI.
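
Since the HDI is described above as the geometric mean of three normalized dimension indices, a tiny worked example may help. The function name and the input values are made up for illustration; the real dimension indices come from the Human Development Reports.

```python
def hdi_from_dimensions(health: float, education: float, income: float) -> float:
    """Geometric mean of three dimension indices, each already normalized to [0, 1]."""
    return (health * education * income) ** (1 / 3)


# Made-up dimension indices for a hypothetical country.
example = hdi_from_dimensions(0.9, 0.8, 0.7)

# Arithmetic mean of the same values, for comparison.
arithmetic = (0.9 + 0.8 + 0.7) / 3

print(round(example, 4), round(arithmetic, 4))
```

The geometric mean is never larger than the arithmetic mean, so a country with uneven results across the three dimensions ends up with a slightly lower index than a simple average would give.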

## Importing Data

In [None]:
path_2 = "../input/human-development-index/Human Development Index.csv"
hdi = pd.read_csv(path_2)

As we did with the Economic Freedom dataset, first let's check the data.

In [None]:
print('Dimensions:',hdi.shape)

In [None]:
hdi.head(10)
# We're going to reshape it into 3 columns

In [None]:
hdi.info()

## Cleaning Data 

In [None]:
data = data.loc[:, (data.isnull().sum(axis=0) <= 1242)]

In [None]:
hdi.head()
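
The reshape into three columns mentioned in the comment above is not shown as code in this kernel. A minimal sketch of how it could be done, assuming the raw file is in wide format with one row per country and one column per year. The column names `Country`, `1990` and `1991` are hypothetical and may not match the real CSV.

```python
import pandas as pd

# Hypothetical wide layout: one row per country, one column per year.
wide = pd.DataFrame({
    "Country": ["A", "B"],
    "1990": [0.50, 0.70],
    "1991": [0.52, None],
})

# melt turns the year columns into rows: one row per (country, year) pair.
long = wide.melt(id_vars="Country", var_name="YEAR", value_name="INDEX")
long = long.rename(columns={"Country": "COUNTRY"})
long["YEAR"] = long["YEAR"].astype(int)  # year labels arrive as strings

print(long)
```

The result has the `COUNTRY`, `YEAR` and `INDEX` columns that the rest of this kernel relies on.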

## Filling Missing Data

As we did with the first dataset, the NaN values will be replaced with each country's median.

In [None]:
hdi[['INDEX']] = hdi.groupby('COUNTRY')[['INDEX']].transform(lambda x: x.fillna(x.median())) 

hdi.isnull().sum()

## Dealing With Strings

The datasets are going to be merged by country name, but some country names appear differently from one dataset to the other. So here we're going to remove everything after the name of the country (anything following a comma or an opening parenthesis).

In [None]:
hdi['new'] = hdi['COUNTRY'].str.split(',').str[0]
hdi['COUNTRY'] = hdi['new'].str.split('(').str[0]

hdi = hdi.drop(columns='new')
hdi.COUNTRY = hdi.COUNTRY.str.strip()

hdi['COUNTRY'] = hdi['COUNTRY'].replace({'Russian Federation': 'Russia'})
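
To check whether other names still differ between the two datasets, a set difference is a quick test. The names below are made-up examples and not the actual mismatches; with the real frames the sets would be built with `set(hdi['COUNTRY'])` and `set(data['COUNTRY'])`.

```python
# Made-up name sets standing in for the two COUNTRY columns.
hdi_names = {"Russia", "Chile", "Korea", "Viet Nam"}
econ_names = {"Russia", "Chile", "Korea, South", "Vietnam"}

# Names that would find no partner in the merge, from each side.
only_in_hdi = sorted(hdi_names - econ_names)
only_in_econ = sorted(econ_names - hdi_names)

print(only_in_hdi)
print(only_in_econ)
```

Any pair that shows up on both lists for the same country is a candidate for another entry in the `replace` dictionary.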

## A BRIEF EDA...

Before the comparison, let's first explore the data and check if we can spot some similarities with the first one.

In [None]:
sns.set_palette(sns.dark_palette("blue",15, reverse=False))
sns.set_style('whitegrid')

hdi_15_17 = hdi[hdi.YEAR==2016].sort_values(by='INDEX', ascending=False).head(15)
hdi_15_17.plot('COUNTRY', 'INDEX', kind='bar', figsize=(14,8), rot=45, legend=None)

plt.xlabel('COUNTRIES')
plt.ylabel('HDI')
plt.title('Top 15 Countries in Human Development Index in 2016')

Not all countries in the top 15 are the same, but we can spot some in both lists, like Hong Kong, Australia, Switzerland and a few others.

In [None]:
hdi_2016 = hdi[hdi.YEAR==2016]
hdi_1990 = hdi[hdi.YEAR==1990] 

sns.set_style('whitegrid')

fig = plt.gcf()
fig.set_size_inches(16, 10)

# 'fill' replaces the deprecated 'shade' argument in recent seaborn versions
sns.kdeplot(hdi_1990.INDEX, fill=True, color= "orange", legend= None)
sns.kdeplot(hdi_2016.INDEX, fill=True, color= "blue", legend= None)

plt.xlabel('HDI')
plt.title('HDI DISTRIBUTION OF 1990 AND 2016 ')

From 1990 to 2016 we can see that the average HDI of the countries has grown.

<a id="merge"></a>
# 7. Merging Data Frames

In [None]:
hdi = hdi[hdi.YEAR != 2017]

# Get only the main columns of the Economic data
econ = data[['YEAR', 'COUNTRY', 'SCORE','QUARTILE','GOV_SIZE', 'PROPERTY_RIGHTS', 'SOUND_MONEY', 'TRADE', 'REGULATION']]

# And then merge both data on Country and Year
hdi_econ = hdi.merge(econ, how='left', on=['COUNTRY', 'YEAR'])

It'll look like this:

In [None]:
hdi_econ.head()

In [None]:
print('Dimensions:',hdi_econ.shape)

These datasets don't cover the same countries and years, so the left merge leaves NaN values wherever the economic data has no match. Now we just remove the rows with NaN values.

In [None]:
hdi_econ = hdi_econ.dropna()
hdi_econ.describe()
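
Before dropping rows, it can be useful to know how many of them found a match at all. pandas' `merge` accepts `indicator=True`, which adds a `_merge` column for that purpose. A minimal sketch on made-up frames (`hdi_toy` and `econ_toy` are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for the two datasets.
hdi_toy = pd.DataFrame({
    "COUNTRY": ["A", "A", "B"],
    "YEAR": [2015, 2016, 2016],
    "INDEX": [0.70, 0.71, 0.55],
})
econ_toy = pd.DataFrame({
    "COUNTRY": ["A", "C"],
    "YEAR": [2016, 2016],
    "SCORE": [7.5, 6.0],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = hdi_toy.merge(econ_toy, how="left", on=["COUNTRY", "YEAR"], indicator=True)

matched = int((merged["_merge"] == "both").sum())
unmatched = int((merged["_merge"] == "left_only").sum())

print(matched, unmatched)
```

Listing the countries among the unmatched rows would also show whether the loss comes from missing years or from names that still differ.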

<a id="hdi_econ"></a>
# 8. Relation Between HDI And Economic Freedom

### Correlation matrix with the new Data Frame

In [None]:
hdi_econ_num = hdi_econ.drop(['COUNTRY', 'YEAR'], axis=1)
hdi_econ_cor = hdi_econ_num.corr()

sns.set(font_scale=1.4)

plt.figure(figsize=(12,12))
sns.heatmap(hdi_econ_cor,  square=True, annot=True, cmap='BrBG')

### Correlation between INDEX and SCORE

In [None]:
fig = plt.gcf()
fig.set_size_inches(16, 10)
sns.set(font_scale=1.4)

sns.scatterplot(x='SCORE', y='INDEX', hue='QUARTILE',
                data=hdi_econ, s=45, palette=["#9b59b6", "#3498db", "#e74c3c", "#2ecc71"])

plt.ylabel('HDI')
plt.title('RELATION BETWEEN HDI AND ECONOMIC FREEDOM')
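
To put a number on the trend in a scatter plot like this one, a simple least-squares line is one option. The sketch below uses made-up points and `np.polyfit`; it is not a result from the real data.

```python
import numpy as np

# Made-up (SCORE, HDI) pairs for four hypothetical countries.
score = np.array([5.0, 6.0, 7.0, 8.0])
hdi_index = np.array([0.50, 0.62, 0.68, 0.80])

# Degree-1 polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(score, hdi_index, 1)

# Pearson correlation between the two variables.
r = np.corrcoef(score, hdi_index)[0, 1]

print(round(slope, 3), round(intercept, 3), round(r, 3))
```

With the real data, `hdi_econ['SCORE']` and `hdi_econ['INDEX']` would take the place of the two arrays. A high correlation here still says nothing about which variable drives the other.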

## BRIC'C'S One More Time...

We already saw how the BRICS countries and Chile changed their economic freedom over the years, especially Chile, which went from one of the least economically free countries in 1970 to the top 15 in 2016.

But does it mean that Chile's HDI has also grown?

In [None]:
briccs_names = ['Brazil', 'Russia', 'India', 'China', 'Chile', 'South Africa']
briccs_hdi = hdi_econ.loc[hdi_econ['COUNTRY'].isin(briccs_names)]

sns.set_palette(sns.color_palette("bright",6))
sns.set_style('whitegrid')

fig, ax = plt.subplots()

for key, grp in briccs_hdi.groupby('COUNTRY'):  # a plain string key keeps the legend label a country name, not a tuple
    ax = grp.plot(ax=ax, kind='line', x='YEAR', y='INDEX', label=key, figsize=(16,9), linewidth=2.5)
    
plt.xlim((1990, 2016))
plt.legend(loc='lower right')
plt.xlabel('YEAR')
plt.ylabel('INDEX')
plt.title('BRICCS HDI BETWEEN 1990 AND 2016')

<a id="conc"></a>
# 9. Conclusions

First of all, I'd like to say that I'm not an expert in politics or economics. It's a subject that interests me and, since I'm a beginner, I made this kernel to practice my code and try to draw some conclusions. That said:

Through this EDA we could see that, over the years, countries have tended to adopt policies for economic freedom aimed at trade, property rights and sound money. Government size shows only a weak correlation with the overall score, which is not what I personally expected. So it's not so much about the size of a government as about the regulations that allow economic freedom.

We could also observe a positive correlation between high human development and economic freedom. Chile, the example I took, is one of those cases. Since the mid-70s Chile has transformed its economy by adopting these policies. Unfortunately, the HDI only started being measured in 1990, so we can't see what Chile's index was before these economic changes.

Of course that's not the only reason that makes the HDI of a country high, but it's interesting to see the relationship between these features.

In [None]:
data.head()

In [None]:
data.info()

In [None]:
y = data['GOV_SIZE']
y.head()

In [None]:
X = data.drop(columns=['COUNTRY','RANK','INFLATION','TARIFFS','MONEY_GROWTH','TOP_MARG_TAX_RATE','ISO_CODE','GOV_SIZE'])
X.head()

In [None]:
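# Hold out 30% of the rows for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)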

In [None]:
# Merge Back Training Data to use in statsmodel
# statsTrain = X_train.merge(pd.DataFrame(y_train))
statsTrain = X_train.join(pd.DataFrame(y_train))
statsTrain.head()

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Fit a decision tree to predict GOV_SIZE from the remaining features
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
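
Error values like these are hard to judge on their own. One rough reference point is the error you would get by always predicting the training mean. The sketch below shows the idea with made-up numbers and a small `rmse` helper that I'm defining here for illustration; it does not use the real predictions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up targets and predictions standing in for y_train, y_test and y_pred.
y_train_toy = np.array([5.0, 6.0, 7.0])
y_test_toy = np.array([5.0, 7.0])
y_pred_toy = np.array([5.5, 6.5])

# Baseline: predict the training mean for every test row.
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

model_rmse = rmse(y_test_toy, y_pred_toy)
baseline_rmse = rmse(y_test_toy, baseline_pred)

print(model_rmse, baseline_rmse)
```

If the model's RMSE is not clearly below the baseline's, the features are adding little beyond the average.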