In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.


# Project: Investigate a Dataset - Income Inequality using Gapminder

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This analysis focuses on income inequality as measured by the Gini Index* and its association with economic metrics such as GDP per capita, investments as a % of GDP, and tax revenue as a % of GDP. One political metric, the EIU democracy index, is also included.

This investigation can be considered a starting point for complex questions such as:

1. Is a higher tax revenue as a % of GDP associated with less income inequality?
2. Is a higher EIU democracy index associated with less income inequality?
3. Is higher GDP per capita associated with less income inequality?
4. Is higher investments as a % of GDP associated with less income inequality?

This analysis uses the gapminder dataset from the Gapminder Foundation.  The Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels.

*The [Gini Index](https://en.wikipedia.org/wiki/Gini_coefficient) is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation's residents, and is the most commonly used measurement of inequality. It was developed by the Italian statistician and sociologist Corrado Gini and published in his 1912 paper Variability and Mutability. 
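
To make the definition above concrete, here is a minimal sketch of how a Gini index can be computed from a list of individual incomes. The function name `gini_index` and the toy incomes are hypothetical and are not part of the Gapminder data, which ships the index precomputed per country and year. The sketch uses the standard rank-weighted formula and scales the result to 0-100.

```python
import numpy as np

def gini_index(incomes):
    """Return the Gini index (0-100 scale) of non-negative incomes."""
    x = np.sort(np.asarray(incomes, dtype=float))  # ascending order
    n = x.size
    if n == 0 or x.sum() == 0:
        raise ValueError("need at least one positive income")
    ranks = np.arange(1, n + 1)
    # Rank-weighted form of the mean-absolute-difference definition
    g = 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1.0) / n
    return 100.0 * g

# Hypothetical toy populations
equal = gini_index([10, 10, 10, 10])        # everyone earns the same
concentrated = gini_index([0, 0, 0, 40])    # one person earns everything
print(equal, concentrated)
```

A value of 0 means perfect equality. With a finite population of `n` people, the maximum under this formula is `100 * (n - 1) / n`, which approaches 100 as `n` grows.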

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

In [None]:
pd.set_option('display.max_rows', 10)
pd.options.display.max_columns = 100
pd.set_option("display.precision", 2)

<a id='wrangling'></a>
## Data Wrangling

In this section of the report, the data is loaded and checked for cleanliness, and the findings are reported.
### General Properties

The dataset contains data from the following GapMinder datasets:

1. EIU Democracy Index:

"This democracy index is using the data from the Economist Intelligence Unit to express the quality of democracies as a number between 0 and 100. It's based on 60 different aspects of societies that are relevant to democracy: universal suffrage for all adults, voter participation, perception of human rights protection and freedom to form organizations and parties.
The democracy index is calculated from the 60 indicators, divided into five 'sub indexes', which are:

1. Electoral pluralism index;
2. Government index;
3. Political participation index;
4. Political culture index;
5. Civil liberty index.

The sub-indexes are based on the sum of scores on roughly 12 indicators per sub-index, converted into a score between 0 and 100.
(The Economist publishes the index with a scale from 0 to 10, but Gapminder has converted it to 0 to 100 to make it easier to communicate as a percentage.)"
https://docs.google.com/spreadsheets/d/1d0noZrwAWxNBTDSfDgG06_aLGWUz4R6fgDhRaUZbDzE/edit#gid=935776888


2. Income: GDP per capita, constant PPP dollars:
GDP per capita measures the value of everything produced in a country during a year, divided by the number of people. The unit is in international dollars, fixed 2011 prices. The data is adjusted for inflation and differences in the cost of living between countries, so-called PPP dollars. The end of the time series, between 1990 and 2016, uses the latest GDP per capita data from the World Bank, from their World Development Indicators. To go back in time before the World Bank series starts in 1990, we have used several sources, such as Angus Maddison. 
https://www.gapminder.org/data/documentation/gd001/


3. Investments (% of GDP)
Capital formation is a term used to describe the net capital accumulation during an accounting period for a particular country. The term refers to additions of capital goods, such as equipment, tools, transportation assets, and electricity. Countries need capital goods to replace the older ones that are used to produce goods and services. If a country cannot replace capital goods as they reach the end of their useful lives, production declines. Generally, the higher the capital formation of an economy, the faster an economy can grow its aggregate income.


4. Tax revenue (% of GDP)
Tax revenue refers to compulsory transfers to the central government for public purposes. It does not include social security.
https://data.worldbank.org/indicator/GC.TAX.TOTL.GD.ZS


### Initial Analysis of the Datasets


#### Tax Revenue as a Percent of GDP

Below are results of an initial analysis:

The csv files for this analysis were downloaded from the GapMinder website. They can be found here:

https://github.com/psterk1/data_analytics/tree/master/intro/final_project

The first file is 'tax_revenue_percent_of_gdp.csv'

In [None]:
tax = pd.read_csv('/kaggle/input/income-inequality/tax_revenue_percent_of_gdp.csv')

In [None]:
print("number of rows: ", tax.shape[0])
print("number of columns: {}".format(tax.shape[1]))
print("number of duplicates: {}".format(tax.duplicated().sum()))
print("datatypes:\n")
print(tax.dtypes)
print("\nSample:")
tax.head(3)

An initial analysis revealed that about half of the years have a null fraction above 0.50, meaning more than half of their values are missing. See below:

> The results below show that the years with the lowest null-value percentage fall in the range 2006-2016. As a result, this range of years was selected for the rest of the datasets.

In [None]:
tax_null = tax.isnull().sum()/tax.shape[0]
tax_null.to_frame().transpose()

#### Income Per Person - GDP Per Capita 

Below are results of an initial analysis:

In [None]:
income = pd.read_csv('/kaggle/input/income-inequality/income_per_person_gdppercapita_ppp_inflation_adjusted.csv')

In [None]:
print("number of rows: ", income.shape[0])
print("number of columns: {}".format(income.shape[1]))
print("number of duplicates: {}".format(income.duplicated().sum()))
print("datatypes:\n")
print(income.dtypes)
income.head(3)

The results below show that there are no nulls.

In [None]:
income_null = income.isnull().sum()/income.shape[0]
income_null.to_frame().transpose()

#### Investment  Percent of GDP

Below are results of an initial analysis:

In [None]:
invest = pd.read_csv('/kaggle/input/income-inequality/investments_percent_of_gdp.csv')

In [None]:
print("number of rows: ", invest.shape[0])
print("number of columns: {}".format(invest.shape[1]))
print("number of duplicates: {}".format(invest.duplicated().sum()))
print("datatypes:\n")
print(invest.dtypes)
print("\nSample:")
invest.head(3)

The results below show that the years with the lowest null-value percentage fall in the range 2006-2016. As a result, this range of years was selected for the rest of the datasets.
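
As a sketch of how such a year range could be picked programmatically rather than by eye, the snippet below keeps only the year columns whose null fraction is at or below a cutoff. The toy table and the 0.5 cutoff are assumptions chosen for illustration; they are not values taken from the Gapminder files.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "2005": [1.0, np.nan, np.nan, np.nan],
    "2006": [1.0, 2.0, 3.0, np.nan],
    "2007": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per year column
null_fraction = wide.drop(columns="country").isnull().mean()

max_null_fraction = 0.5  # assumed cutoff, adjust to taste
usable_years = null_fraction[null_fraction <= max_null_fraction].index.tolist()
print(usable_years)
```

`isnull().mean()` gives the same per-column fraction as `isnull().sum() / shape[0]` used in the cells of this section.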

In [None]:
invest_null = invest.isnull().sum()/invest.shape[0]
invest_null.to_frame().transpose()

#### EIU Democracy Index

Below are the results of the initial analysis:

In [None]:
demo = pd.read_csv('/kaggle/input/income-inequality/demox_eiu.csv')

In [None]:
print("number of rows: ", demo.shape[0])
print("number of columns: {}".format(demo.shape[1]))
print("number of duplicates: {}".format(demo.duplicated().sum()))
print("datatypes:\n")
print(demo.dtypes)
print("\nSample:")
demo.head(3)

The results below show that there were no nulls in the dataset.

In [None]:
demo_null = demo.isnull().sum()/demo.shape[0]
demo_null.to_frame().transpose()

#### Gini Dataset

Below are results of the initial analysis:

In [None]:
gini = pd.read_csv('/kaggle/input/income-inequality/gini.csv')

In [None]:
print("number of rows: ", gini.shape[0])
print("number of columns: {}".format(gini.shape[1]))
print("number of duplicates: {}".format(gini.duplicated().sum()))
print("datatypes:\n")
print(gini.dtypes)
print("\nSample:")
gini.head(3)

The results below show that there were no nulls in the dataset.

In [None]:
gini_null = gini.isnull().sum()/gini.shape[0]
gini_null.to_frame().transpose()

### Conclusions and Next Steps

Below are the conclusions of the initial analysis:
    
1. Since the EIU Democracy Index only has data for years 2006 - 2018, a similar range of years will be selected for the other datasets.
2. The percentage of nulls in the Tax Revenue as a Percent of GDP dataset was lowest for the years 2006 - 2016. Most values were missing for 2017. As a result of this, plus the result in 1. above, the period 2006 - 2016 will be selected for this dataset.

#### Next Steps

1. The above datasets will have the country column plus the years 2006 - 2016.
2. The datasets will be reshaped from wide to long format, with continent, country, year and one column per indicator.  Below is an example of what the final combined dataset will look like:

|continent|country|year|demox_eiu|income_per_person|invest_%_gdp|tax_%_gdp|gini_index|
|---------|-------|----|---------|-----------------|------------|---------|----------|
|Asia|	Afghanistan|	2006|	30.6|	1120|	23.40|	6.88|	36.8|
|Asia|	Afghanistan|	2007|	30.4|	1250|	19.90|	5.23|	36.8|
|Asia|	Afghanistan|	2008|	30.2|	1270|	18.90|	6.04|	36.8|
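
The wide-to-long reshaping described in the next steps can be sketched with `DataFrame.melt`. The toy table below is hypothetical (two made-up countries and two years); only the column names `country`, `year` and `demox_eiu` follow the target layout shown above.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2006": [30.0, 50.0],
    "2007": [31.0, 51.0],
})

# Turn each year column into rows: one row per (country, year) pair
long_df = wide.melt(id_vars=["country"], var_name="year", value_name="demox_eiu")
long_df = long_df.sort_values(["country", "year"]).reset_index(drop=True)
print(long_df)
```

Note that after `melt` the `year` values are the original column labels, so they are strings unless converted explicitly.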

### Slicing and Reorganizing the Datasets

#### EIU Democracy Index

In [None]:
demo_last_10 = demo.iloc[:, :-2]
demo_last_10.head(3)

In [None]:
demo_last_10 = demo_last_10.melt(id_vars=['country'], var_name='year', value_name='demox_eiu')
demo_last_10.sort_values(['country','year'], inplace=True)
demo_last_10.head(3)

#### Income Per Person (GDP per Capita)
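
The positional slices used in this section, such as `207:218`, depend on the exact column order of each CSV. A label-based selection is one possible alternative. The sketch below assumes the year columns carry string labels such as `'2006'` (which is typical for headers read with `pd.read_csv`) and uses a hypothetical toy table rather than the real file.

```python
import pandas as pd

# Hypothetical wide table with year columns 2000 through 2018
data = {"country": ["A", "B"]}
for y in range(2000, 2019):
    data[str(y)] = [float(y), float(y) + 1.0]
wide = pd.DataFrame(data)

# Select the country column plus 2006 through 2016 by label
year_labels = [str(y) for y in range(2006, 2017)]
subset = wide.loc[:, ["country"] + year_labels]
print(subset.columns.tolist())
```

Selecting by label raises a `KeyError` if an expected year is absent, which surfaces schema changes that a positional slice would silently absorb.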

In [None]:
income_last_10 = income.iloc[:, np.r_[:1, 207:218]]
income_last_10

In [None]:
income_last_10 = income_last_10.melt(id_vars=['country'], var_name='year', value_name='income_per_person')
income_last_10.sort_values(['country', 'year'], inplace=True)

In [None]:
income_last_10.head(3)

#### Investment Percent of GDP

In [None]:
invest_last_10 = invest.iloc[:, np.r_[:1, 47:58]]
invest_last_10 = invest_last_10.melt(id_vars=['country'], var_name='year', value_name='invest_%_gdp')
invest_last_10.sort_values(['country', 'year'], inplace=True)
invest_last_10.head(3)

#### Tax Revenue Percent of GDP

In [None]:
tax_last_10 = tax.iloc[:, np.r_[:1, 35:46]]
tax_last_10 = tax_last_10.melt(id_vars=['country'], var_name='year', value_name='tax_%_gdp')
tax_last_10.sort_values(['country', 'year'], inplace=True)
tax_last_10.head(3)

#### Gini Index

In [None]:
gini_last_10 = gini.iloc[:, np.r_[:1, 207:218]]
gini_last_10 = gini_last_10.melt(id_vars=['country'], var_name='year', value_name='gini_index')
gini_last_10.sort_values(by=['country', 'year'], inplace=True)
gini_last_10.head(3)

### Merging the Datasets

In [None]:
# Inner-join the five long tables on their shared (country, year) keys
combined = demo_last_10.merge(income_last_10, on=['country', 'year'])
combined = combined.merge(invest_last_10, on=['country', 'year'])
combined = combined.merge(tax_last_10, on=['country', 'year'])
combined = combined.merge(gini_last_10, on=['country', 'year'])
combined

In [None]:
cont = pd.read_csv('/kaggle/input/income-inequality/continent_country.csv')
cont

#### Matching Country with Continent

In this step, we match each country with its continent.  This will enable analysis at the continent level for broader trend detection.
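
Because the default merge is an inner join, any country whose name does not appear in the continent table is dropped without warning. The sketch below shows one way to list such countries beforehand using the `indicator` option of `DataFrame.merge`. Both toy tables and the country "Atlantis" are hypothetical.

```python
import pandas as pd

# Hypothetical lookup table and combined data
cont_toy = pd.DataFrame({
    "continent": ["Asia", "Europe"],
    "country": ["Afghanistan", "Norway"],
})
combined_toy = pd.DataFrame({
    "country": ["Afghanistan", "Norway", "Atlantis"],  # "Atlantis" has no continent
    "year": ["2006", "2006", "2006"],
    "gini_index": [36.0, 26.0, 40.0],
})

# A left join with indicator=True adds a "_merge" column describing each row's origin
check = combined_toy.merge(cont_toy, on="country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "country"].unique().tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join loses no countries.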

In [None]:
combined_final = cont.merge(combined, left_on=['country'], right_on=['country'])
combined_final

### Data Cleaning

Below are the steps taken to ensure quality of the dataset:

#### Missing Values

Below is a summary of missing values (nulls) in the dataset:

In [None]:
combined_final.isna().sum()

One option for handling the missing 'tax_%_gdp' values would be to replace them with the country's mean.  However, some of the countries have all nulls and some have mostly nulls for this column.  

A second option is to drop the rows with nulls. In the interest of simplicity, we will use this option. 
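
For comparison, the first option (filling with the country's mean) could look like the sketch below. The toy data is hypothetical. It also illustrates the limitation noted above: a country with no observed values has no mean, so its rows stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: country A has one gap, country B has no values at all
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "tax_%_gdp": [10.0, np.nan, 14.0, np.nan, np.nan],
})

# Per-country mean, broadcast back to the original row order
country_mean = df.groupby("country")["tax_%_gdp"].transform("mean")
df["tax_filled"] = df["tax_%_gdp"].fillna(country_mean)

still_missing = int(df["tax_filled"].isna().sum())
print(df)
print("rows still missing:", still_missing)
```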

In [None]:
combined_final.dropna(inplace=True)
combined_final.isna().sum()

#### Duplicates

There are no duplicates in the dataset:

In [None]:
combined_final.duplicated().sum()

#### Descriptive Statistics

Below are descriptive statistics of the dataset.  A review of the values indicates that the min, max and mean values appear to be reasonable.  

In [None]:
combined_final.describe()

### Summary
After the rows with nulls were dropped, the checks above indicate that the dataset is clean, so no further cleaning steps were needed.

In [None]:
print("number of rows: ", combined_final.shape[0])
print("number of columns: {}".format(combined_final.shape[1]))
print("datatypes:\n")
print(combined_final.dtypes)

In [None]:
# These are the continents included in the dataset. All values look reasonable.
combined_final.continent.unique()

In [None]:
combined_final.country.unique()

#### Save the Cleaned Dataset


In [None]:
combined_final.to_csv('combined_final_last_10_years.csv', index=False)

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 - Is Income Inequality Getting Worse or Better in the Last 10 Years?

Better means the Gini Index is going down.

#### Global Gini Mean By Year

In [None]:
columns = ['year', 'gini_index']
gini = combined_final[columns]
gini

In [None]:
gini_annual_average = gini.groupby('year')['gini_index'].mean()
gini_annual_average

As the plot below shows, the mean global gini index has been going down over the last 10 years, meaning global income inequality is improving.
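
One way to put a number on such a trend is to fit a straight line to the annual means. The sketch below uses hypothetical values, not the notebook's results, and assumes the index holds year labels as strings, as they are after `melt`.

```python
import numpy as np
import pandas as pd

# Hypothetical annual means indexed by year labels (strings)
annual = pd.Series([38.0, 37.5, 37.0, 36.5], index=["2006", "2007", "2008", "2009"])

years = annual.index.astype(int).to_numpy()
elapsed = years - years.min()  # years since the first observation, for numerical stability

# Least-squares line: slope is the average change in the index per year
slope, intercept = np.polyfit(elapsed, annual.to_numpy(), deg=1)
print(f"change per year: {slope:.2f}")
```

A negative slope corresponds to a falling mean Gini index. With only about a decade of points, the slope is a rough summary rather than a statistical test.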

In [None]:
plt.plot(gini_annual_average.index, gini_annual_average)
plt.title('Mean Global Gini Index by Year')
plt.xlabel('Year')
plt.ylabel('Mean Global Gini Index');

Mean Global Gini Index by Continent:

In [None]:
columns = ['year', 'continent', 'gini_index']
gini = combined_final[columns]
gini

In [None]:
gini_cont_average = gini.groupby(['year', 'continent'])['gini_index'].mean()
gini_cont_average

The chart below reveals that, on a continent basis, all were either declining or mostly flat, except for Africa.

In [None]:
gini_cont_average.unstack(level=1).plot(kind='line', subplots=False, \
                                        title='Mean Global Gini Index by Continent by Year').\
                                        set_ylabel("Gini Index");

In [None]:
columns = ['year', 'continent', 'country', 'gini_index']
gini = combined_final[columns]
gini

### Research Question 2 - What Top 10 Countries Have the Lowest and Highest Income Inequality?

#### Lowest

Overall, most of the countries with the lowest income inequality are in Europe.

In [None]:
gini.groupby(['country', 'continent'])['gini_index'].mean().to_frame().sort_values(by=['gini_index']).head(10)

#### Highest

Overall, most of the countries with the highest income inequality are in Africa and the Americas.

In [None]:
gini.groupby(['country', 'continent'])['gini_index'].mean().to_frame().sort_values(by=['gini_index'], ascending=False).head(10)

### Research Question 3 - Is a higher tax revenue as a % of GDP associated with less income inequality?

The hypothesis is that countries with higher tax revenue as % of GDP are associated with lower income inequality.  The assumption for this is that higher tax revenues are distributed back to lower economic strata in the form of social benefits. Let's see what the data shows.

In [None]:
columns = ['continent', 'country', 'year', 'tax_%_gdp', 'gini_index']
tax = combined_final[columns]
tax

It is difficult to see a trend in the scatter plot below:

In [None]:
tax.plot(x='tax_%_gdp', y='gini_index', kind='scatter', title='Mean Global Gini Index by Tax % of GDP');

Plotting both variables on log scales shows an essentially flat relationship: there is no compelling evidence that a higher tax percent of GDP is associated with less income inequality.

In [None]:
tax_plot = tax.plot(x='tax_%_gdp', y='gini_index', kind='scatter', loglog=True, \
                    title='log Mean Global Gini Index by log Tax % of GDP')
tax_plot.set_xlabel('log(tax % gdp)')
tax_plot.set_ylabel('log(gini index)');

The Pearson correlation is slightly negative at -0.08:
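
As a reminder of what this coefficient measures, the sketch below computes the Pearson correlation of two log-transformed columns in two ways: with pandas and directly from the definition. The four data points are hypothetical and unrelated to the notebook's results.

```python
import numpy as np
import pandas as pd

# Hypothetical observations
df = pd.DataFrame({
    "tax_%_gdp": [5.0, 10.0, 20.0, 40.0],
    "gini_index": [45.0, 40.0, 38.0, 30.0],
})

logs = np.log(df[["tax_%_gdp", "gini_index"]])

# Pearson r via pandas
r_pandas = logs["tax_%_gdp"].corr(logs["gini_index"])

# Pearson r from the definition: covariance divided by the product of spreads
x = logs["tax_%_gdp"] - logs["tax_%_gdp"].mean()
y = logs["gini_index"] - logs["gini_index"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

Values near 0 indicate little linear association between the logged variables, while values near -1 or 1 indicate a strong one.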

In [None]:
tax_log = np.log(tax['tax_%_gdp']).to_frame()
tax_log['log_gini_index'] = np.log(tax['gini_index'])
tax_log.corr()

### Research Question 4 - Is Higher Income Per Person - GDP Per Capita associated with less income inequality?

The hypothesis is that a higher income per person indicates that more of the country's GDP is being distributed equally among its population.

In [None]:
columns = ['continent', 'country', 'year', 'income_per_person', 'gini_index']
income = combined_final[columns]
income

In [None]:
income.plot(x='income_per_person', y='gini_index', kind='scatter', title='Mean Gini Index by Income Per Person');

In [None]:
income_plot = income.plot(x='income_per_person', y='gini_index', kind='scatter', loglog=True, \
                    title='log Mean Gini Index by Income Per Person')
income_plot.set_xlabel('log(income_per_person)')
income_plot.set_ylabel('log(gini index)');

In this case, the Pearson correlation coefficient is -0.34, indicating a weak negative correlation between log(income_per_person) and log(gini_index):

In [None]:
income_log = np.log(income['income_per_person']).to_frame()
income_log['log_gini_index'] = np.log(income['gini_index'])  # use this section's frame, not `tax`
income_log.corr()

### Research Question 5 - Is Higher Investment as % GDP associated with less income inequality?

The hypothesis is that a higher investment as a percent of GDP indicates that more of the country's GDP is being invested in capital improvements, which distributes income benefits across a wide segment of the population and leads to more equality.

In [None]:
columns = ['continent', 'country', 'year', 'invest_%_gdp', 'gini_index']
invest = combined_final[columns]
invest

In [None]:
# Keep only positive values so that the log transform below is defined
invest = invest[invest['invest_%_gdp'] > 0]

In [None]:
invest.plot(x='invest_%_gdp', y='gini_index', kind='scatter', title='Mean Gini Index by Investment % GDP');

In [None]:
invest_plot = invest.plot(x='invest_%_gdp', y='gini_index', kind='scatter', loglog=True, \
                    title='log Mean Global Gini Index by log Invest % of GDP')
invest_plot.set_xlabel('log(invest % gdp)')
invest_plot.set_ylabel('log(gini index)');

The Pearson correlation coefficient of -0.03 indicates essentially no linear correlation between the logs of these two variables.

In [None]:
invest_log = np.log(invest['invest_%_gdp']).to_frame()
invest_log['log_gini_index'] = np.log(invest['gini_index'])  # use this section's frame, not `tax`
invest_log.corr()

### Research Question 6 - Is Higher EIU Democracy Index associated with less income inequality?

The hypothesis is that countries with a higher EIU Democracy Index address the needs of a broader segment of the population, leading to less income inequality.

In [None]:
columns = ['continent', 'country', 'year', 'demox_eiu', 'gini_index']
demo = combined_final[columns]
demo

In [None]:
demo.plot(x='demox_eiu', y='gini_index', kind='scatter', title='Mean Gini Index by EIU Democracy Index');

In [None]:
demo_plot = demo.plot(x='demox_eiu', y='gini_index', kind='scatter', loglog=True, \
                    title='log Mean Global Gini Index by log EIU Democracy Index')
demo_plot.set_xlabel('log(demox_eiu)')
demo_plot.set_ylabel('log(gini index)');

In this case, the Pearson correlation coefficient is -0.2, indicating a weak negative correlation between log(demox_eiu) and log(gini_index):

In [None]:
demo_log = np.log(demo['demox_eiu']).to_frame()
demo_log['log_gini_index'] = np.log(demo['gini_index'])  # use this section's frame, not `tax`
demo_log.corr()

<a id='conclusions'></a>
## Conclusions

The following are the conclusions from this analysis:

***Research Question 1 - Is Income Inequality Getting Worse or Better in the Last 10 Years?***
 
Answer: 

It is getting better: the mean global Gini index improved from 38.7 to 37.3.

On a continent basis, all were either declining or mostly flat, except for Africa.
   
***Research Question 2 - What Top 10 Countries Have the Lowest and Highest Income Inequality?***
  
Answer: 
  
Lowest: Slovenia, Ukraine, Czech Republic, Norway, Slovak Republic, Denmark, Kazakhstan, Finland, Belarus, Kyrgyz Republic

Highest: Colombia, Lesotho, Honduras, Bolivia, Central African Republic, Zambia, Suriname, Namibia, Botswana, South Africa


***Research Question 3 - Is a higher tax revenue as a % of GDP associated with less income inequality?***

Answer: No


***Research Question 4 - Is Higher Income Per Person - GDP Per Capita associated with less income inequality?***

Answer: No, but weak negative correlation.


***Research Question 5 - Is Higher Investment as % GDP associated with less income inequality?***

Answer: No

***Research Question 6 - Is Higher EIU Democracy Index associated with less income inequality?***

Answer: No, but weak negative correlation.


The above results suggest that there are other drivers for the overall reduction in income inequality.  Further analysis of additional factors should be undertaken.
