Deliverable 1 (due 1/26/22)
Rebecca Hsu

Gathering, Cleaning, and Integrating Data Tables

Gathering the Data: 
    1) Gini Index by Country
    2) Total Health Expenditure Per Capita by Country in 2018 PPP international USD, inflation adjusted to 2018
    3) World Bank - Life Expectancy at Birth

In [18]:
# importing pandas
import pandas as pd

# links to the online tables
giniLink="https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
healthexpLink="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita"

# fetching the tables
giniData=pd.read_html(giniLink,header=0,flavor="bs4",attrs={'class':"wikitable"})
healthexpData=pd.read_html(healthexpLink,header=0,flavor="bs4",attrs={'class':"wikitable"})

In [19]:
# link to the data in CSV format
lifeexpLink='https://github.com/rhsu4/542_Deliv1/raw/main/LifeExpAtBirth_WB.csv'

# using 'read_csv' with a link
lifeexpData=pd.read_csv(lifeexpLink)

In [50]:
#from IPython.display import IFrame  

#IFrame(giniLink, width=700, height=300)

In [5]:
!pip install html5lib
!pip install beautifulsoup4
!pip install lxml



### Cleaning Gini Data

In [22]:
#type(healthexpData)
type(giniData)

list

In [24]:
#len(healthexpData)
len(giniData)

4

In [25]:
#For gini index, we're using the first table
giniData[0]

Unnamed: 0,Country,Subregion,Region,UN R/P,UN R/P.1,WB Gini[4],WB Gini[4].1,CIA R/P[5],CIA R/P[5].1,CIA Gini[6],CIA Gini[6].1
0,Country,Subregion,Region,10%[5],20%[7],%,Year,10%,Year,%,Year
1,,,,,,,,,,,
2,Afghanistan,Southern Asia,Asia,,,,,,,,
3,Albania,Southern Europe,Europe,7.2,4.2,33.2,2017,7.2,2004,26.9,2012 est.
4,Algeria,Northern Africa,Africa,9.6,4.0,27.6,2011,9.6,1995,35.3,1995
...,...,...,...,...,...,...,...,...,...,...,...
175,Palestine,Western Asia,Asia,,5.6,33.7,2016,,,,
176,Yemen,Western Asia,Asia,8.6,6.1,36.7,2014,8.6,2003,37.7,2005
177,Zambia,Eastern Africa,Africa,,21.1,57.1,2015,,,57.5,2010
178,Zimbabwe,Eastern Africa,Africa,,8.6,44.3,2017,,,50.1,2006


#### Cleaning notes for giniData
- Values: none of the columns is categorical.
- Variable names: keep only Country, WB Gini[4], and WB Gini[4].1 and drop the rest; rename the two WB Gini columns to GiniPercent and GiniYear.
- Rows: drop the first row (a second header row) and any rows that are entirely NaN.
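
As a rough sketch of the notes above, the same cleanup can be written with column names rather than positions. The small frame below is a hand-made stand-in that copies a few cells from the table shown earlier (it is not the scraped data), and `toyGini`, `newNames`, `keep`, and `toyClean` are illustrative names only.

```python
import pandas as pd

# Hand-made stand-in for giniData[0]: the sub-header row, an empty row,
# and two countries copied from the table above (not the scraped data).
toyGini = pd.DataFrame({
    "Country":      ["Country", None, "Albania", "Algeria"],
    "Region":       ["Region", None, "Europe", "Africa"],
    "WB Gini[4]":   ["%", None, "33.2", "27.6"],
    "WB Gini[4].1": ["Year", None, "2017", "2011"],
})

# Keep columns by name and rename the two World Bank columns.
newNames = {"WB Gini[4]": "GiniPercent", "WB Gini[4].1": "GiniYear"}
keep = ["Country"] + list(newNames)
toyClean = toyGini[keep].rename(columns=newNames)

# Drop the sub-header row (index label 0) and rows that are entirely empty.
toyClean = toyClean.drop(index=0).dropna(how="all").reset_index(drop=True)
print(toyClean)
```

Selecting by name keeps working if Wikipedia adds or reorders columns, whereas a list of positions would silently drop the wrong ones.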

In [27]:
origginiDF=giniData[0]

In [28]:
giniDF=origginiDF.copy()

In [29]:
giniDF.columns

Index(['Country', 'Subregion', 'Region', 'UN R/P', 'UN R/P.1', 'WB Gini[4]',
       'WB Gini[4].1', 'CIA R/P[5]', 'CIA R/P[5].1', 'CIA Gini[6]',
       'CIA Gini[6].1'],
      dtype='object')

In [33]:
#column positions to drop
whichToDrop=[1,2,3,4,7,8,9,10]

#dropping and updating the data frame
giniDF.drop(labels=giniDF.columns[whichToDrop],axis=1,inplace=True)

In [34]:
giniDF.columns

Index(['Country', 'WB Gini[4]', 'WB Gini[4].1'], dtype='object')

In [35]:
giniDF.columns=['Country', 'GiniPercent', 'GiniYear']
giniDF.columns

Index(['Country', 'GiniPercent', 'GiniYear'], dtype='object')

In [38]:
giniDF.Country[10]

'Azerbaijan'

In [40]:
# Removing leading and trailing spaces from every column
byeSpaces= lambda COLUMN:COLUMN.str.strip()
giniDF=giniDF.apply(byeSpaces)

In [41]:
# Value counts: nothing categorical to recode, but a few entries are not clean numbers
# ('12.0[citation needed]', 'Year', '2002 est.')
[giniDF[COLUMN].value_counts() for COLUMN in giniDF.iloc[:,1:]]

[32.8                     3
 34.4                     3
 35.3                     3
 40.8                     3
 39.0                     3
                         ..
 48.3                     1
 43.5                     1
 31.9                     1
 38.0                     1
 12.0[citation needed]    1
 Name: GiniPercent, Length: 124, dtype: int64,
 2017         43
 2018         28
 2016         18
 2015         16
 2014         13
 2013          7
 2011          7
 2012          7
 2009          4
 1999          3
 2010          3
 2004          2
 2020          1
 2003          1
 2006          1
 1992          1
 Year          1
 2008          1
 1998          1
 2007          1
 2005          1
 2002 est.     1
 Name: GiniYear, dtype: int64]

In [42]:
giniDF.dtypes

Country        object
GiniPercent    object
GiniYear       object
dtype: object
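
Because every column is still `object`, the two Gini columns would presumably need converting to numbers before analysis. The sketch below is one possible approach and is not part of the original notebook: it pulls the leading number out of each cell with a regular expression and passes the result to `pd.to_numeric`. The toy values are taken from the value counts above; `toyVals` and `leadingNumber` are illustrative names.

```python
import pandas as pd

# Toy values copied from the value counts above, plus a missing entry.
toyVals = pd.DataFrame({
    "GiniPercent": ["33.2", "12.0[citation needed]", None],
    "GiniYear":    ["2017", "2002 est.", None],
})

# Extract the leading number (digits with an optional decimal part);
# anything after it, such as "[citation needed]" or " est.", is discarded.
leadingNumber = r"^(\d+(?:\.\d+)?)"
for col in ["GiniPercent", "GiniYear"]:
    extracted = toyVals[col].str.extract(leadingNumber, expand=False)
    toyVals[col] = pd.to_numeric(extracted, errors="coerce")

# Years contain missing values, so use the nullable integer dtype.
toyVals["GiniYear"] = toyVals["GiniYear"].astype("Int64")
print(toyVals.dtypes)
```

Cells with no leading number, such as a stray `Year` header value, come out as missing rather than raising an error.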

In [44]:
giniDF.drop(labels=[0,1,179],
           axis = 0,
           inplace=True) #dropping header rows with no data and "World" row

In [45]:
giniDF

Unnamed: 0,Country,GiniPercent,GiniYear
2,Afghanistan,,
3,Albania,33.2,2017
4,Algeria,27.6,2011
5,Angola,51.3,2018
6,Argentina,41.4,2018
...,...,...,...
174,Vietnam,35.7,2018
175,Palestine,33.7,2016
176,Yemen,36.7,2014
177,Zambia,57.1,2015


In [46]:
giniDF.reset_index(drop=True,inplace=True)

In [48]:
giniDF.to_csv("giniDF.csv",index=False)

### Cleaning health expenditure data

In [49]:
#For health expenditure, we're using the second table
healthexpData[1]

Unnamed: 0,Country or subnational area,2002,2010,2018
0,Afghanistan *,78.0,138.0,186.0
1,Albania *,314.0,452.0,697.0
2,Algeria *,335.0,648.0,963.0
3,Andorra *,2196.0,2771.0,3607.0
4,Angola *,119.0,168.0,165.0
...,...,...,...,...
187,Venezuela *,842.0,1130.0,384.0
188,Vietnam *,108.0,259.0,440.0
189,Yemen *,163.0,231.0,
190,Zambia *,125.0,122.0,208.0


In [52]:
orighealthexp=healthexpData[1]

In [53]:
healthexpDF=orighealthexp.copy()

In [55]:
healthexpDF.columns

Index(['Country or subnational area', '2002', '2010', '2018'], dtype='object')

Overall cleaning note: we may want to reshape the data from wide to long format, so that the final dataset has these columns:

Country, Year, Gini Percent, Health Expenditure in 2018 PPP, Life Expectancy
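
A minimal sketch of that reshape for the health expenditure table, assuming `DataFrame.melt` is used for the wide-to-long step. The rows are copied from the table above; `toyHealth`, `healthLong`, the column name `HealthExp`, and the step that strips the trailing ` *` from country names are assumptions made for illustration.

```python
import pandas as pd

# Two rows copied from the health expenditure table above.
toyHealth = pd.DataFrame({
    "Country or subnational area": ["Afghanistan *", "Albania *"],
    "2002": [78.0, 314.0],
    "2010": [138.0, 452.0],
    "2018": [186.0, 697.0],
})

# Wide to long: one row per country and year.
healthLong = toyHealth.melt(
    id_vars="Country or subnational area",
    var_name="Year",
    value_name="HealthExp",
).rename(columns={"Country or subnational area": "Country"})

# Strip the trailing " *" marker so names can match the other tables.
healthLong["Country"] = healthLong["Country"].str.replace(r"\s*\*$", "", regex=True)
healthLong["Year"] = healthLong["Year"].astype(int)

healthLong = healthLong.sort_values(["Country", "Year"]).reset_index(drop=True)
print(healthLong)
```

The Gini and life expectancy tables could then be merged on Country and Year. Since the Gini table records a single year per country, an exact match on Year would likely leave many rows without a Gini value.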