# Data Science 1: Group 19 Project


**Title:**

Factors of Global Income Inequality

**Hypothesis:**

Income inequality exists in developing and developed nations alike despite global economic growth, and the gap is widening.


**Approach:**
1. Explore the World Income Inequality Database (WIID).
2. Compare and contrast the income of the top x% with that of the bottom y% of the population in various nations across the world.
3. Discover and illustrate any correlation between economic growth and income inequality in various nations.
4. Identify changes in global wealth distribution over time and their correlation with income inequality.
5. Analyze income inequality by geographical region.
6. Track income inequality of geographic regions over time and estimate it for future time periods.
7. Determine the relationship between income inequality and geographical region.
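
Step 2 can be sketched with the decile share columns (`d1` to `d10`) that WIID provides. The helper `top_bottom_ratio` below is hypothetical and not part of this notebook; it picks the top 10% and the bottom 40% as one possible choice of x and y, and the toy values are for illustration only.

```python
import pandas as pd


def top_bottom_ratio(frame, top_cols=("d10",), bottom_cols=("d1", "d2", "d3", "d4")):
    """Income share of the top group divided by the share of the bottom group.

    Hypothetical helper: the default columns give top 10% vs bottom 40%.
    A row with any missing decile in either group yields NaN.
    """
    top = frame[list(top_cols)].sum(axis=1, min_count=len(top_cols))
    bottom = frame[list(bottom_cols)].sum(axis=1, min_count=len(bottom_cols))
    return top / bottom


# Toy frame with illustrative decile shares (percent of total income)
toy = pd.DataFrame({
    "country": ["Albania", "Albania"],
    "year": [1996, 2002],
    "d1": [3.86, 3.49],
    "d2": [5.29, 4.86],
    "d3": [6.38, 5.84],
    "d4": [7.32, 6.74],
    "d10": [21.22, 25.44],
})
toy["top10_to_bottom40"] = top_bottom_ratio(toy)
print(toy[["country", "year", "top10_to_bottom40"]])
```

A ratio above 1 means the top 10% receive more income than the bottom 40% combined; other values of x and y only require different column lists.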


**Dataset:**

https://www.wider.unu.edu/database/world-income-inequality-database-wiid4

**Dataset User Guide:**
https://www.wider.unu.edu/sites/default/files/WIID/PDF/WIID-User_Guide_06MAY2020.pdf


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


## Load Data from Excel

In [4]:
#df = pd.read_csv('https://media.githubusercontent.com/media/syedaa/DS1_G19/main/WIID/WIID_19Dec2018.csv?token=AIUJ4LTRCDKACWHYDETSJULAMIFW6')
df = pd.read_excel('WIID_19Dec2018.xlsx')
df.head()

Unnamed: 0,id,country,c3,c2,year,gini_reported,q1,q2,q3,q4,q5,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,bottom5,top5,resource,resource_detailed,scale,scale_detailed,sharing_unit,reference_unit,areacovr,areacovr_detailed,popcovr,popcovr_detailed,region_un,region_un_sub,region_wb,eu,oecd,incomegroup,mean,median,currency,reference_period,exchangerate,mean_usd,median_usd,gdp_ppp_pc_usd2011,population,revision,quality,quality_score,source,source_detailed,source_comments,survey
0,1,Afghanistan,AFG,AF,2008,29.0,9.0,13.0,17.0,22.0,39.0,,,,,,,,,,,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Asia,Southern Asia,South Asia,Non-EU,Non-OECD,Low income,36588.0,,Afghan afghani,Month,50.249615,,,1298.0,27294031.0,New 2013,High,12,National statistical authority,European Commission and the Government of Afgh...,National Risk and Vulnerability Assessment,
1,2,Albania,ALB,AL,1996,27.01,9.15,13.7,17.73,23.29,36.12,3.86,5.29,6.38,7.32,8.38,9.35,10.82,12.47,14.9,21.22,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2255.4,1982.16,US$2011PPP,Year,104.498917,2255.0,1982.0,4812.0,3092228.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,
2,3,Albania,ALB,AL,2002,31.74,8.35,12.58,16.49,22.21,40.37,3.49,4.86,5.84,6.74,7.65,8.84,10.23,11.98,14.93,25.44,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2305.2,1901.52,US$2011PPP,Year,140.154516,2305.0,1902.0,6316.0,3119029.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,
3,4,Albania,ALB,AL,2005,30.6,8.4,12.9,17.03,22.5,39.17,3.48,4.92,5.98,6.92,7.99,9.04,10.37,12.13,14.83,24.34,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2605.92,2217.48,US$2011PPP,Year,99.870254,2606.0,2217.0,7563.0,3079179.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,
4,5,Albania,ALB,AL,2008,29.98,8.87,13.07,16.83,22.23,39.0,3.73,5.14,6.09,6.98,7.91,8.92,10.3,11.93,14.54,24.46,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2850.24,2385.12,US$2011PPP,Year,83.894604,2850.0,2385.0,9018.0,2991651.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,


In [5]:
df.shape

(11101, 55)

## Filter and Clean Data

In [6]:
# Filters suggested by Cici: areacovr == "All", popcovr == "All", reference_unit == "Person",
# sharing_unit == "Household", and resource != "Income (net/gross)".
df2 = df[
    (df['areacovr'] == 'All')
    & (df['popcovr'] == 'All')
    & (df['reference_unit'] == 'Person')
    & (df['sharing_unit'] == 'Household')
    & (df['resource'] != 'Income (net/gross)')
]
df2.shape


(5571, 55)
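
To see which of the five conditions removes the most rows, each mask can be counted on its own before they are combined. The sketch below runs on a made-up toy frame; apart from the values used in the filter above, the category values (`"Urban"`, `"Income (net)"`, `"Income (gross)"`) are placeholders and are not checked against WIID.

```python
import pandas as pd

# Toy frame; values other than those used in the notebook's filter are placeholders
toy = pd.DataFrame({
    "areacovr": ["All", "All", "Urban", "All"],
    "popcovr": ["All", "All", "All", "All"],
    "reference_unit": ["Person", "Household", "Person", "Person"],
    "sharing_unit": ["Household", "Household", "Household", "Household"],
    "resource": ["Consumption", "Income (net)", "Income (gross)", "Income (net/gross)"],
})

# One boolean mask per condition, keyed by a readable label
conditions = {
    "areacovr == All": toy["areacovr"] == "All",
    "popcovr == All": toy["popcovr"] == "All",
    "reference_unit == Person": toy["reference_unit"] == "Person",
    "sharing_unit == Household": toy["sharing_unit"] == "Household",
    "resource != Income (net/gross)": toy["resource"] != "Income (net/gross)",
}

# Rows each condition would keep if applied alone
kept_per_condition = {name: int(mask.sum()) for name, mask in conditions.items()}

# A row survives only if every condition holds
combined = pd.concat(conditions, axis=1).all(axis=1)
filtered = toy[combined]

print(kept_per_condition)
print(filtered.shape)
```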

In [7]:
# Where several rows share the same year, country, resource and scale, keep only those with the highest quality score
df3 = df2[df2['quality_score'] == df2.groupby(['year', 'country', 'resource', 'scale'])['quality_score'].transform('max')]
df3.shape

(4681, 55)
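
The `transform('max')` step above returns a Series aligned with the original rows, holding each group's maximum, so comparing it with `quality_score` gives a row-level mask. A minimal sketch on made-up data (the countries and scores are placeholders) shows why rows that tie for the top score all survive this step:

```python
import pandas as pd

# Toy data: country "A" has three observations for the same year, two tied at the top score
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "year": [2000, 2000, 2000, 2000],
    "resource": ["Consumption"] * 4,
    "scale": ["Per capita"] * 4,
    "quality_score": [10, 13, 13, 9],
})

keys = ["year", "country", "resource", "scale"]

# Same length as toy: every row carries the maximum score of its own group
group_max = toy.groupby(keys)["quality_score"].transform("max")

# Keep rows that reach their group's maximum; ties are all kept
best = toy[toy["quality_score"] == group_max]

print(group_max.tolist())
print(best)
```

Because ties remain, a further tie-breaking rule is still needed before the data has one row per key.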

In [8]:
# Some duplicates from different studies remain.
# Rank the sources, based on an analysis of data quality scores and the availability of various features,
# and engineer a 'source_rank' feature. A lower rank wins when duplicates are dropped later.
source_dictionary = {
    'Luxembourg Income Study': 1,
    'World Bank': 2,
    'United Nations': 3,
    'Research study': 4,
    'National statistical authority': 5,
    'OECD': 6,
    'SEDLAC': 7,
    'Eurostat': 8,
    'Other international organizations': 9,
}

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
df3 = df3.copy()
# Add a new column named 'source_rank' to help eliminate duplicates in high quality data
df3['source_rank'] = df3['source'].map(source_dictionary)
df3.head()


Unnamed: 0,id,country,c3,c2,year,gini_reported,q1,q2,q3,q4,q5,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,bottom5,top5,resource,resource_detailed,scale,scale_detailed,sharing_unit,reference_unit,areacovr,areacovr_detailed,popcovr,popcovr_detailed,region_un,region_un_sub,region_wb,eu,oecd,incomegroup,mean,median,currency,reference_period,exchangerate,mean_usd,median_usd,gdp_ppp_pc_usd2011,population,revision,quality,quality_score,source,source_detailed,source_comments,survey,source_rank
0,1,Afghanistan,AFG,AF,2008,29.0,9.0,13.0,17.0,22.0,39.0,,,,,,,,,,,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Asia,Southern Asia,South Asia,Non-EU,Non-OECD,Low income,36588.0,,Afghan afghani,Month,50.249615,,,1298.0,27294031.0,New 2013,High,12,National statistical authority,European Commission and the Government of Afgh...,National Risk and Vulnerability Assessment,,5
1,2,Albania,ALB,AL,1996,27.01,9.15,13.7,17.73,23.29,36.12,3.86,5.29,6.38,7.32,8.38,9.35,10.82,12.47,14.9,21.22,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2255.4,1982.16,US$2011PPP,Year,104.498917,2255.0,1982.0,4812.0,3092228.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,,2
2,3,Albania,ALB,AL,2002,31.74,8.35,12.58,16.49,22.21,40.37,3.49,4.86,5.84,6.74,7.65,8.84,10.23,11.98,14.93,25.44,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2305.2,1901.52,US$2011PPP,Year,140.154516,2305.0,1902.0,6316.0,3119029.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,,2
3,4,Albania,ALB,AL,2005,30.6,8.4,12.9,17.03,22.5,39.17,3.48,4.92,5.98,6.92,7.99,9.04,10.37,12.13,14.83,24.34,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2605.92,2217.48,US$2011PPP,Year,99.870254,2606.0,2217.0,7563.0,3079179.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,,2
4,5,Albania,ALB,AL,2008,29.98,8.87,13.07,16.83,22.23,39.0,3.73,5.14,6.09,6.98,7.91,8.92,10.3,11.93,14.54,24.46,,,Consumption,Consumption,Per capita,Per capita,Household,Person,All,All,All,All,Europe,Southern Europe,Europe and Central Asia,Non-EU,Non-OECD,Upper middle income,2850.24,2385.12,US$2011PPP,Year,83.894604,2850.0,2385.0,9018.0,2991651.0,New 2018,Average,13,World Bank,World Bank 2018,PovcalNet,,2


In [9]:
df3_sorted = df3.sort_values(by=['country', 'year', 'quality_score', 'source_rank'], ascending=[True, True, False, True])
df3_sorted.shape

(4681, 56)

In [10]:
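# Keep the first row (highest quality score, then best-ranked source) for each country, year, resource and scale
df4 = df3_sorted.drop_duplicates(subset=['country', 'year', 'resource', 'scale'])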
df4.shape

(4315, 56)
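
The sort-then-drop pattern used in the two cells above relies on `drop_duplicates` keeping the first row of each key by default. A minimal sketch on made-up rows (the countries are placeholders; the source names and ranks follow `source_dictionary`):

```python
import pandas as pd

# Toy data: country "A" has two tied rows that differ only by source
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2000, 2000],
    "resource": ["Consumption"] * 3,
    "scale": ["Per capita"] * 3,
    "quality_score": [13, 13, 9],
    "source": ["National statistical authority", "World Bank", "OECD"],
    "source_rank": [5, 2, 6],
})

deduped = (
    toy.sort_values(
        by=["country", "year", "quality_score", "source_rank"],
        ascending=[True, True, False, True],
    )
    # keep="first" is the default, so the best-ranked row of each key survives
    .drop_duplicates(subset=["country", "year", "resource", "scale"])
)

print(deduped[["country", "year", "source", "source_rank"]])
```

Rows whose source is missing from the dictionary get a `NaN` rank, which `sort_values` places last by default, so they lose any tie against a ranked source.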

In [11]:
# Count rows per key combination; a count above 1 would indicate a remaining duplicate
df_test = (
    df4.groupby(['country', 'year', 'resource', 'scale', 'quality', 'quality_score', 'source_rank'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='count')
)
# The result below is empty, so no duplicates remain
df_test[df_test['count'] > 1]

Unnamed: 0,country,year,resource,scale,quality,quality_score,source_rank,count
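
One caveat with the check above: `groupby` drops rows whose key contains `NaN` by default, so a row with an unmapped source (and therefore a `NaN` `source_rank`) would not be counted. A complementary check, sketched here with a hypothetical helper `remaining_duplicates` and made-up rows, uses `DataFrame.duplicated` on the four deduplication keys only:

```python
import pandas as pd


def remaining_duplicates(frame, keys=("country", "year", "resource", "scale")):
    """Return every row whose key combination occurs more than once (hypothetical helper)."""
    return frame[frame.duplicated(subset=list(keys), keep=False)]


# Toy frames: one clean, one with a repeated key and a missing source_rank
clean = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2000, 2000],
    "resource": ["Consumption", "Consumption"],
    "scale": ["Per capita", "Per capita"],
    "source_rank": [2.0, float("nan")],
})
dirty = pd.concat([clean, clean.iloc[[1]]], ignore_index=True)

print(remaining_duplicates(clean).empty)
print(len(remaining_duplicates(dirty)))
```

An empty result from this helper on `df4` would confirm the groupby-based check without depending on `source_rank` being populated.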


In [None]:
# Save the cleaned dataset
df4.to_csv("revised_new_wiid.csv")