### The Data Used Here

This project uses [this](https://www.kaggle.com/transparencyint/corruption-index) data set from Transparency International, and [this](https://www.kaggle.com/theworldbank/poverty-and-equity-database) data set from the World Bank. The former records perceptions of corruption in 2017. The latter contains inequality data observed from 1974 through 2018.

This notebook was executed on [kaggle.com](http://kaggle.com), and a version of it will be maintained on [GitHub](https://github.com/keeganland/econ323).

### Kaggle Defaults

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## The Transparency International data on corruption

This dataset requires very little cleaning. It contains one of our variables of interest (along with information on measurement error) and country codes that can be used for data frame merges. We start with a bar graph of it as an initial intuition check.

In [None]:
corruption_file = "/kaggle/input/corruption-index/index.csv"
df_corruption_index = pd.read_csv(corruption_file)
df_corruption_index = df_corruption_index.iloc[:,:8]
df_corruption_index.head()

In [None]:
# Used documentation from https://stackabuse.com/python-data-visualization-with-matplotlib/ to
# resize the figure so we can see all the countries listed in a large horizontal bar graph
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 20
fig_size[1] = 40
plt.rcParams["figure.figsize"] = fig_size

#show bar graph
ax_corruption_bar_graph = df_corruption_index.plot(x = "Country", y = "Corruption Perceptions Index (CPI)", kind = 'barh')

## Gut check

The data does not look surprising. Countries that have a reputation as developing countries score poorly, whereas countries that have a reputation as wealthy, developed liberal democracies score highly. 

This bar graph helps us get an intuitive feel for what CPI looks like as a variable. It ranges from 0 to 100, with 0 meaning a country is perceived as most corrupt and 100 as least corrupt (that is, it runs on the "high score = good" intuition). The graph also serves as a guide to which countries correspond to which CPI values.

In [None]:
# Return the figure size to something more manageable for future plotting
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 10
plt.rcParams["figure.figsize"] = fig_size

## The World Bank's data set

The World Bank's dataset records various economic indicators for multiple regions and countries, with observations recorded by year. However, an observation does not exist for every indicator for every country. Furthermore, we are not necessarily interested in all the indicators in this data set. Because we are interested in seeing the relationship between corruption and *inequality*, we are most interested in their estimates of countries' Gini index, which is the standard measure of economic inequality. We shall therefore need to slice and clean the data frame.

In [None]:
poverty_stats_series_file = "/kaggle/input/poverty-and-equity-database/povstats-csv-zip-242-kb-/PovStatsSeries.csv"
poverty_stats_country_file = "/kaggle/input/poverty-and-equity-database/povstats-csv-zip-242-kb-/PovStatsCountry.csv"
poverty_stats_country_series_file = "/kaggle/input/poverty-and-equity-database/povstats-csv-zip-242-kb-/PovStatsCountry-Series.csv"

poverty_stats_data_file = "/kaggle/input/poverty-and-equity-database/povstats-csv-zip-242-kb-/PovStatsData.csv"


df = pd.read_csv(poverty_stats_data_file)
df.head()

In [None]:
df_indexed = df.set_index(["Country Code", "Indicator Code"])
df_indexed

In [None]:
df_grouped = df.groupby("Indicator Code")
df_grouped = df_grouped.get_group("SI.POV.GINI")
df_grouped = df_grouped.set_index(["Country Code"])
df_grouped

## Which number is "the" Gini index for our purposes?

A glance at the above data frame shows us that, as much as we would have liked to have a Gini index number for every country in the world at every year, the numbers we actually have correspond to irregular observations over the course of 1974 through to 2018. Further, as time marches on, economic, social, and political forces will be acting to change the level of inequality in any given country. The irregularity of the observations may disguise interesting trends within a country, to take one example, or patterns that represent causal forces acting on many countries at once, to take another example.

We need some way to summarize these numbers. For the sake of argument, we will assume that no systematic bias affects when a country could be observed for this data set, so that the mean of our observations is a good estimator of each country's actual mean Gini index over this period.

While of less intrinsic interest, we also will look at minimum observed Gini indexes and maximum observed Gini indexes. Visualizing these alongside the mean Gini indexes should give a (rough!) intuition of variance.
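
As a cross-check on the summary statistics described above, the same mean, minimum, maximum and observation count can be sketched with pandas' row-wise aggregations instead of an explicit loop. The frame `toy_gini`, its country codes and its values are invented for illustration; only the layout (one row per country, one string-labelled column per year) is taken from the World Bank file. Note one difference from a loop that starts from sentinel values: countries with no observations come out as `NaN` rather than as 0 or 100.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Gini slice: one row per country, one column per year.
toy_gini = pd.DataFrame(
    {
        "1974": [30.0, np.nan, np.nan],
        "1975": [np.nan, 50.0, np.nan],
        "1976": [40.0, 60.0, np.nan],
    },
    index=pd.Index(["AAA", "BBB", "CCC"], name="Country Code"),
)

# Year columns are stored as strings, so build the labels the same way.
year_columns = [str(year) for year in range(1974, 1977)]
observations = toy_gini[year_columns]

# Each aggregation skips missing values by default; count() reports how many
# non-missing observations each country has.
summary = pd.DataFrame(
    {
        "Mean GINI": observations.mean(axis=1),
        "Min GINI": observations.min(axis=1),
        "Max GINI": observations.max(axis=1),
        "Number of observations": observations.count(axis=1),
    }
)
print(summary)
```

For the real data, the year range would be `range(1974, 2019)` and the frame would be the Gini slice of the World Bank data.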

In [None]:
year_range = range(1974,2019,1)
country_codes = df_grouped.index
# An explicit dtype avoids the object-dtype default for empty Series in recent pandas versions
minimum_gini_series = pd.Series(index=country_codes, name="Minimum observed GINI index", dtype=float)
maximum_gini_series = pd.Series(index=country_codes, name="Maximum observed GINI index", dtype=float)
mean_gini_series = pd.Series(index=country_codes, name="Mean observed GINI index", dtype=float)
num_observations_series = pd.Series(index=country_codes, name="Number of estimations of GINI index", dtype=float)

for country in country_codes:
    
    #will be used to compute mean observed gini
    successful_gini_observations = 0
    total_gini = 0

    #conceptually, the Gini index ranges from 0 to 100, these are therefore conceptual extremes of minimum/maximum
    minimum_gini = 100
    maximum_gini = 0
    mean_gini = 0

    
    country_series = df_grouped.loc[country]
    for year in year_range:
        gini_this_year = country_series.loc[str(year)]
        if pd.notna(gini_this_year):
            successful_gini_observations = successful_gini_observations + 1
            total_gini = total_gini + gini_this_year
            if gini_this_year < minimum_gini:
                minimum_gini = gini_this_year
                #print(minimum_gini)
            if gini_this_year > maximum_gini:
                maximum_gini = gini_this_year
                #print(maximum_gini)
    
    if successful_gini_observations > 0:
        mean_gini = total_gini / successful_gini_observations

    minimum_gini_series.loc[country] = minimum_gini
    maximum_gini_series.loc[country] = maximum_gini
    mean_gini_series.loc[country] = mean_gini
    num_observations_series.loc[country] = int(successful_gini_observations)

In [None]:
#simplify the data frame now that we have summary statistics
df_grouped = df_grouped.iloc[:,:3]
df_grouped["Mean GINI"] = mean_gini_series
df_grouped["Min GINI"] = minimum_gini_series
df_grouped["Max GINI"] = maximum_gini_series
df_grouped["Number of observations"] = num_observations_series
df_grouped

## Some further data cleaning
Some countries in the data set have no observed Gini index in any year. They are of no use for our purposes, so we remove them from the data frame.
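
A minimal sketch of this kind of filtering, assuming a summary frame with a "Number of observations" column: a boolean mask keeps only the rows with at least one observation. The frame `toy_summary` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical summary frame; "CCC" has no Gini observations at all.
toy_summary = pd.DataFrame(
    {"Mean GINI": [35.0, 0.0, 55.0], "Number of observations": [2, 0, 2]},
    index=pd.Index(["AAA", "CCC", "BBB"], name="Country Code"),
)

# Keep only the countries that were observed at least once.
has_observations = toy_summary["Number of observations"] > 0
toy_cleaned = toy_summary[has_observations]
print(toy_cleaned.index.tolist())
```

The mask selects rows in one step, which avoids dropping rows from a frame while iterating over its index.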

In [None]:
# For some countries, we simply lack any helpful data about inequality. We can pick these out because their number of observations is 0.
for country in country_codes:
    row = df_grouped.loc[country]
    if row.loc["Number of observations"] == 0:
        df_grouped = df_grouped.drop([country])
df_grouped

## Merging the data sets

'Country Code' is common to both data sets, allowing us to easily perform a merge. We do an *inner* merge here because that automatically excludes countries (or regions) that are only in one data set or the other. The merged data frame still contains a majority of the countries in the world.
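
To illustrate what the inner merge excludes, here is a small sketch on two invented frames (`toy_gini`, `toy_cpi`, and their codes and values are hypothetical). An outer merge with `indicator=True` is used only as an audit, to list the codes that appear in one data set but not the other.

```python
import pandas as pd

# Hypothetical frames: "WLD" (a region) appears only on the Gini side,
# "DDD" appears only on the corruption side.
toy_gini = pd.DataFrame(
    {"Country Code": ["AAA", "BBB", "WLD"], "Mean GINI": [35.0, 55.0, 40.0]}
)
toy_cpi = pd.DataFrame(
    {"Country Code": ["AAA", "BBB", "DDD"], "CPI": [80, 30, 50]}
)

# The inner merge keeps only codes present in both frames.
toy_merged = toy_gini.merge(right=toy_cpi, how="inner", on="Country Code")

# Audit: an outer merge with indicator=True labels each row as
# "left_only", "right_only" or "both".
toy_audit = toy_gini.merge(right=toy_cpi, how="outer", on="Country Code", indicator=True)
dropped = toy_audit.loc[toy_audit["_merge"] != "both", "Country Code"].tolist()

print(toy_merged)
print(sorted(dropped))
```

Running the same audit on the real frames would show which countries or regions the inner merge leaves out.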

In [None]:
df_merged = df_grouped.merge(right=df_corruption_index,how='inner',on='Country Code')
df_merged

## Linear Regressions and Visualizations

In [None]:
import seaborn as sns
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression() 

x_corruption = df_merged["Corruption Perceptions Index (CPI)"]
y_gini = df_merged["Mean GINI"]

sns.lmplot(data = df_merged, x = "Corruption Perceptions Index (CPI)", y = "Mean GINI")

In [None]:
sns.lmplot(data = df_merged, x = "Corruption Perceptions Index (CPI)", y = "Min GINI")

In [None]:
sns.lmplot(data = df_merged, x = "Corruption Perceptions Index (CPI)", y = "Max GINI")

## What do these graphs mean?

The Gini index is often described as ranging from a score of 0, which represents a perfectly egalitarian economy in which the income or wealth of every person is exactly equal, to a score of 100, which represents an economy where all the income or wealth goes to a single person and none goes to anyone else. Thus, a lower score indicates a more egalitarian economy, and a higher score a less egalitarian one. The Corruption Perceptions Index, however, works on the "high score is good" intuition. Low scorers are perceived as corrupt, high scorers are perceived as not corrupt.
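
To make the two ends of the Gini scale concrete, the sketch below computes the index for a small list of incomes using the standard rank-based formula for sorted data. The function name `gini_index` and the income lists are hypothetical, and this is not the World Bank's estimation procedure, which works from survey data.

```python
def gini_index(incomes):
    """Return the Gini index, on a 0-100 scale, of a list of non-negative incomes."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one income and a positive total")
    # Rank-based formula for sorted data:
    # G = 2 * sum(rank_i * x_i) / (n * total) - (n + 1) / n
    weighted_sum = sum(rank * value for rank, value in enumerate(values, start=1))
    return 100 * (2 * weighted_sum / (n * total) - (n + 1) / n)


equal_economy = gini_index([10, 10, 10, 10])    # everyone earns the same
one_takes_all = gini_index([0, 0, 0, 100])      # one person earns everything
print(equal_economy, one_takes_all)
```

With only four people, the "one person takes everything" case gives 75 rather than 100, because the finite-sample maximum is 100 × (n − 1) / n. It approaches 100 as the population grows.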

Because the two indices run in opposite directions, the empirical theories mentioned in the introduction, in which corruption and rent seeking cause inequality, predict a *negative* relationship between CPI and the Gini index. That is exactly what the fitted regression lines show. The slope (though not, of course, the intercept) is roughly the same regardless of which representative Gini figure we use.

On its own this is of minimal value for confirming the causal hypotheses discussed at the beginning of this report, but it is consistent with those hypotheses and suggests they are on the right track.
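
The sign of the slope discussed above can also be checked numerically rather than read off a plot. The sketch below fits a degree-1 polynomial with `numpy.polyfit` to invented data (`toy_cpi` and `toy_mean_gini` are hypothetical values, not results from the merged data frame), purely to show the mechanics.

```python
import numpy as np

# Hypothetical values: higher CPI (less perceived corruption) paired with lower Gini.
toy_cpi = np.array([20.0, 40.0, 60.0, 80.0])
toy_mean_gini = np.array([50.0, 45.0, 38.0, 33.0])

# A degree-1 fit returns the coefficients in order: slope, then intercept.
slope, intercept = np.polyfit(toy_cpi, toy_mean_gini, deg=1)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print("negative relationship" if slope < 0 else "non-negative relationship")
```

Applied to the real data, the two arrays would be the "Corruption Perceptions Index (CPI)" column and one of the "Mean GINI", "Min GINI" or "Max GINI" columns of `df_merged`. Comparing the three fitted slopes would quantify how similar they are.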