Deliverable 1 (due 1/26/22)
Rebecca Hsu

Gathering, Cleaning, and Integrating Data Tables

Gathering the Data: 
    1) Gini Index by Country
    2) Total Health Expenditure Per Capita by Country in 2018 PPP international USD, inflation adjusted to 2018
    3) World Bank - Life Expectancy at Birth

In [309]:
# importing pandas
import pandas as pd

#link for the online tables
giniLink="https://en.wikipedia.org/wiki/List_of_countries_by_income_equality"
healthexpLink="https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita"

# fetching the tables
giniData=pd.read_html(giniLink,header=0,flavor="bs4",attrs={'class':"wikitable"})
healthexpData=pd.read_html(healthexpLink,header=0,flavor="bs4",attrs={'class':"wikitable"})

In [310]:
# link to the data in CSV format
lifeexpLink='https://github.com/rhsu4/542_Deliv1/raw/main/LifeExpAtBirth_WB.csv'

# using 'read_csv' with a link
lifeexpData=pd.read_csv(lifeexpLink)

In [50]:
#from IPython.display import IFrame  

#IFrame(giniLink, width=700, height=300)

In [311]:
!pip install html5lib
!pip install beautifulsoup4
!pip install lxml



### Cleaning Gini Data

In [22]:
#type(healthexpData)
type(giniData)

list

In [24]:
#len(healthexpData)
len(giniData)

4

In [312]:
#For gini index, we're using the first table
giniData[0]

Unnamed: 0,Country,Subregion,Region,UN R/P,UN R/P.1,WB Gini[4],WB Gini[4].1,CIA R/P[5],CIA R/P[5].1,CIA Gini[6],CIA Gini[6].1
0,Country,Subregion,Region,10%[5],20%[7],%,Year,10%,Year,%,Year
1,,,,,,,,,,,
2,Afghanistan,Southern Asia,Asia,,,,,,,,
3,Albania,Southern Europe,Europe,7.2,4.2,33.2,2017,7.2,2004,26.9,2012 est.
4,Algeria,Northern Africa,Africa,9.6,4.0,27.6,2011,9.6,1995,35.3,1995
...,...,...,...,...,...,...,...,...,...,...,...
175,Palestine,Western Asia,Asia,,5.6,33.7,2016,,,,
176,Yemen,Western Asia,Asia,8.6,6.1,36.7,2014,8.6,2003,37.7,2005
177,Zambia,Eastern Africa,Africa,,21.1,57.1,2015,,,57.5,2010
178,Zimbabwe,Eastern Africa,Africa,,8.6,44.3,2017,,,50.1,2006


#Cleaning Notes for giniData
- values are not categorical
- variable names: drop everything except Country, WB Gini[4], and WB Gini[4].1; rename the two WB Gini columns to GiniPercent and GiniYear
- drop the repeated header row (index 0), the empty row (index 1), and the "World" row (index 179)
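
A minimal sketch of the same cleanup done by column name instead of by position, which may hold up better if the Wikipedia table gains or loses a column. The frame `toy` is a made-up stand-in for giniData[0] (values copied from the rows shown above), and `keep` and `gini_small` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for giniData[0]: a repeated header row, an empty row, two real rows.
toy = pd.DataFrame({
    "Country": ["Country", None, "Albania", "Algeria"],
    "Region": ["Region", None, "Europe", "Africa"],
    "WB Gini[4]": ["%", None, "33.2", "27.6"],
    "WB Gini[4].1": ["Year", None, "2017", "2011"],
})

# Map each column we keep to its new name (assumed names, matching the notes above).
keep = {"Country": "Country", "WB Gini[4]": "GiniPercent", "WB Gini[4].1": "GiniYear"}

gini_small = (
    toy[list(keep)]            # select by name, not by position
    .rename(columns=keep)
    .iloc[1:]                  # drop the repeated header row
    .dropna(how="all")         # drop rows where every value is missing
    .reset_index(drop=True)
)
print(gini_small)
```

Selecting by name fails loudly with a KeyError if a column is renamed upstream, whereas positional drops would silently remove the wrong columns.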

In [313]:
origginiDF=giniData[0]

In [314]:
giniDF=origginiDF.copy()

In [315]:
giniDF.columns

Index(['Country', 'Subregion', 'Region', 'UN R/P', 'UN R/P.1', 'WB Gini[4]',
       'WB Gini[4].1', 'CIA R/P[5]', 'CIA R/P[5].1', 'CIA Gini[6]',
       'CIA Gini[6].1'],
      dtype='object')

In [316]:
#column positions to drop
whichToDrop=[1,2,3,4,7,8,9,10]

#dropping and updating the data frame
giniDF.drop(labels=giniDF.columns[whichToDrop],axis=1,inplace=True)

In [317]:
giniDF.columns

Index(['Country', 'WB Gini[4]', 'WB Gini[4].1'], dtype='object')

In [319]:
giniDF.columns=['Country', 'GiniPercent', 'GiniYear']
giniDF.columns

Index(['Country', 'GiniPercent', 'GiniYear'], dtype='object')

In [320]:
giniDF.Country[10]

'Azerbaijan'

In [321]:
#Removing Spaces
byeSpaces= lambda COLUMN:COLUMN.str.strip()
giniDF=giniDF.apply(byeSpaces)

In [60]:
#Value counts not a problem for gini
[giniDF[COLUMN].value_counts() for COLUMN in giniDF.iloc[:,1::]]

[32.8    3
 35.3    3
 33.7    3
 39.0    3
 40.8    3
        ..
 50.7    1
 48.3    1
 43.5    1
 31.9    1
 44.3    1
 Name: GiniPercent, Length: 122, dtype: int64,
 2017    43
 2018    28
 2016    18
 2015    16
 2014    13
 2013     7
 2011     7
 2012     7
 2009     4
 1999     3
 2010     3
 2004     2
 2003     1
 1992     1
 2020     1
 2008     1
 1998     1
 2007     1
 2005     1
 2006     1
 Name: Year, dtype: int64]

In [322]:
giniDF.dtypes

Country        object
GiniPercent    object
GiniYear       object
dtype: object

In [323]:
giniDF.drop(labels=[0,1,179],
           axis = 0,
           inplace=True) #dropping header rows with no data and "World" row

In [324]:
giniDF

Unnamed: 0,Country,GiniPercent,GiniYear
2,Afghanistan,,
3,Albania,33.2,2017
4,Algeria,27.6,2011
5,Angola,51.3,2018
6,Argentina,41.4,2018
...,...,...,...
174,Vietnam,35.7,2018
175,Palestine,33.7,2016
176,Yemen,36.7,2014
177,Zambia,57.1,2015


In [325]:
giniDF.reset_index(drop=True,inplace=True)

In [326]:
giniDF.to_csv("giniDF.csv",index=False)

### Cleaning Health Expenditure Data

In [327]:
#For health expenditure, we're using the second table
healthexpData[1]

Unnamed: 0,Country or subnational area,2002,2010,2018
0,Afghanistan *,78.0,138.0,186.0
1,Albania *,314.0,452.0,697.0
2,Algeria *,335.0,648.0,963.0
3,Andorra *,2196.0,2771.0,3607.0
4,Angola *,119.0,168.0,165.0
...,...,...,...,...
187,Venezuela *,842.0,1130.0,384.0
188,Vietnam *,108.0,259.0,440.0
189,Yemen *,163.0,231.0,
190,Zambia *,125.0,122.0,208.0


In [328]:
orighealthexp=healthexpData[1]

In [329]:
healthexpDF=orighealthexp.copy()

In [330]:
healthexpDF.columns

Index(['Country or subnational area', '2002', '2010', '2018'], dtype='object')

Overall cleaning note - we may want to change to long data instead of wide, so that the final dataset will be:

Country, Year, Gini Percent, Health Expenditure in 2018 PPP, Life Expectancy
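
A small sketch of what the wide-to-long reshape could look like, assuming the year columns are still named '2002', '2010', '2018'. The frame `wide` holds two rows copied from the table above; `long_df` is a hypothetical name.

```python
import pandas as pd

# Two rows copied from the health expenditure table, in wide form.
wide = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "2002": [78.0, 314.0],
    "2010": [138.0, 452.0],
    "2018": [186.0, 697.0],
})

# One row per (Country, Year) pair instead of one column per year.
long_df = wide.melt(id_vars=["Country"], var_name="Year", value_name="healthExp")
long_df["Year"] = long_df["Year"].astype(int)   # column labels were strings
long_df = long_df.sort_values(["Country", "Year"]).reset_index(drop=True)
print(long_df)
```

With every table in this shape, Country and Year can both serve as merge keys.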

In [76]:
#Going to try reshaping after merge
#healthexpDF = healthexpDF.melt(id_vars=["Country or subnational area"], 
#                              var_name="Year", 
#                              value_name="healthExp")
#healthexpDF.head

<bound method NDFrame.head of     Country or subnational area  Year  healthExp
0                 Afghanistan *  2002       78.0
1                     Albania *  2002      314.0
2                     Algeria *  2002      335.0
3                     Andorra *  2002     2196.0
4                      Angola *  2002      119.0
..                          ...   ...        ...
571                 Venezuela *  2018      384.0
572                   Vietnam *  2018      440.0
573                     Yemen *  2018        NaN
574                    Zambia *  2018      208.0
575                  Zimbabwe *  2018      198.0

[576 rows x 3 columns]>

In [332]:
healthexpDF.columns=['Country','HE2002', 'HE2010', 'HE2018']

In [333]:
giniDF.dtypes #FLAG - set type to not object?
healthexpDF

Unnamed: 0,Country,HE2002,HE2010,HE2018
0,Afghanistan *,78.0,138.0,186.0
1,Albania *,314.0,452.0,697.0
2,Algeria *,335.0,648.0,963.0
3,Andorra *,2196.0,2771.0,3607.0
4,Angola *,119.0,168.0,165.0
...,...,...,...,...
187,Venezuela *,842.0,1130.0,384.0
188,Vietnam *,108.0,259.0,440.0
189,Yemen *,163.0,231.0,
190,Zambia *,125.0,122.0,208.0


In [None]:
#FLAG: Need to remove the * after country name, also any spaces

In [334]:
healthexpDF['Country'] = healthexpDF['Country'].str.replace('*', '',regex=False)
healthexpDF['Country']=healthexpDF.Country.str.strip()
healthexpDF

Unnamed: 0,Country,HE2002,HE2010,HE2018
0,Afghanistan,78.0,138.0,186.0
1,Albania,314.0,452.0,697.0
2,Algeria,335.0,648.0,963.0
3,Andorra,2196.0,2771.0,3607.0
4,Angola,119.0,168.0,165.0
...,...,...,...,...
187,Venezuela,842.0,1130.0,384.0
188,Vietnam,108.0,259.0,440.0
189,Yemen,163.0,231.0,
190,Zambia,125.0,122.0,208.0


In [335]:
healthexpDF.reset_index(drop=True,inplace=True) #don't think I changed any row indices, but just in case

In [336]:
healthexpDF.to_csv("healthexpDF.csv",index=False)

### Cleaning Life Expectancy Data

In [337]:
lifeexpData

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,2002 [YR2002],2010 [YR2010],2018 [YR2018]
0,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Afghanistan,AFG,56.784,61.028,64.486
1,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Albania,ALB,74.579,76.562,78.458
2,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Algeria,DZA,71.605,74.938,76.693
3,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,American Samoa,ASM,..,..,..
4,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Andorra,AND,..,..,..
...,...,...,...,...,...,...,...
217,,,,,,,
218,,,,,,,
219,,,,,,,
220,Data from database: Health Nutrition and Popul...,,,,,,


Cleaning steps:
- drop rows that are not part of life expectancy series
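- drop the columns at positions 0, 1, and 3 (Series Name, Series Code, Country Code)
- reshape from wide to long: the year columns become a Year column and the values go in lifeExp
- clean values, make sure country names have no leading or trailing spaces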
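
One way the "clean values" step might look: the World Bank export uses '..' for missing values, which keeps the year columns as text. The sketch below coerces them to numbers. The column names LE2002 and LE2018 are borrowed from the rename later in this notebook, and `toy` and `value_cols` are hypothetical.

```python
import pandas as pd

# Two rows copied from the life expectancy table; '..' marks missing data.
toy = pd.DataFrame({
    "Country": [" Afghanistan ", "American Samoa"],
    "LE2002": ["56.784", ".."],
    "LE2018": ["64.486", ".."],
})

# Trim stray whitespace around country names (inner spaces are kept).
toy["Country"] = toy["Country"].str.strip()

# Convert the value columns to floats; anything unparseable ('..') becomes NaN.
value_cols = ["LE2002", "LE2018"]
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```

The same call would likely help with the GiniPercent and GiniYear columns, which also show up as object dtype.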

In [338]:
lifeexpDF=lifeexpData.copy() #lifeexpData is already a dataframe
lifeexpDF.columns

Index(['Series Name', 'Series Code', 'Country Name', 'Country Code',
       '2002 [YR2002]', '2010 [YR2010]', '2018 [YR2018]'],
      dtype='object')

In [339]:
#FLAG - have to remove spaces in col names
import re
# one or more blanks: \\s+
# one or more numbers: \\d+ 
#--in this case, want to keep the numbers for now
# find opening bracket : \\[
# find closing bracket: \\]

# You can combine using '|' (or):
WhenYouFind='\\s+|\\[|\\]'
replaceWith=''

# substitute the elements in each NAME in the COLUMNS:
lifeexpDF.columns=[re.sub(WhenYouFind,replaceWith,aColumnName) for aColumnName in lifeexpDF.columns]
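
A possible variation, shown only as a sketch: removing the whole bracketed suffix first leaves just the year ('2002' rather than '2002YR2002'), which might save a rename later. The list `cols` copies some of the original column names, and `cleaned` is a hypothetical name.

```python
import re

cols = ['Series Name', 'Country Name', '2002 [YR2002]', '2018 [YR2018]']

# Step 1: drop an optional space plus everything inside [...].
# Step 2: drop any remaining whitespace.
cleaned = [re.sub(r"\s+", "", re.sub(r"\s*\[.*?\]", "", name)) for name in cols]
print(cleaned)  # ['SeriesName', 'CountryName', '2002', '2018']
```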

In [340]:
lifeexpDF

Unnamed: 0,SeriesName,SeriesCode,CountryName,CountryCode,2002YR2002,2010YR2010,2018YR2018
0,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Afghanistan,AFG,56.784,61.028,64.486
1,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Albania,ALB,74.579,76.562,78.458
2,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Algeria,DZA,71.605,74.938,76.693
3,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,American Samoa,ASM,..,..,..
4,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Andorra,AND,..,..,..
...,...,...,...,...,...,...,...
217,,,,,,,
218,,,,,,,
219,,,,,,,
220,Data from database: Health Nutrition and Popul...,,,,,,


In [341]:
#Dropping all rows whose series name != "Life exp at birth..."
#lifeexpDF = lifeexpDF[lifeexpDF.SeriesName!= 'Life expectancy at birth, total (years)']

lifeexpDF = lifeexpDF.loc[lifeexpDF['SeriesName'] == 'Life expectancy at birth, total (years)']

In [342]:
lifeexpDF

Unnamed: 0,SeriesName,SeriesCode,CountryName,CountryCode,2002YR2002,2010YR2010,2018YR2018
0,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Afghanistan,AFG,56.784,61.028,64.486
1,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Albania,ALB,74.579,76.562,78.458
2,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Algeria,DZA,71.605,74.938,76.693
3,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,American Samoa,ASM,..,..,..
4,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Andorra,AND,..,..,..
...,...,...,...,...,...,...,...
212,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Virgin Islands (U.S.),VIR,77.52195122,77.86585366,79.5195122
213,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,West Bank and Gaza,PSE,71.447,72.788,73.895
214,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,"Yemen, Rep.",YEM,61.781,65.549,66.096
215,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,Zambia,ZMB,45.4,55.655,63.51


In [343]:
#now dropping column indices 0,1,3
#column positions to drop
#whichToDrop=[0,1,3]

#dropping and updating the data frame
#lifeexpDF.drop(labels=lifeexpDF.columns[whichToDrop],axis=1,inplace=True)
lifeexpDF = lifeexpDF.drop(columns=['SeriesName', 'SeriesCode', 'CountryCode']) #returns a new frame, which avoids the SettingWithCopyWarning that an inplace drop raises on a filtered frame


In [344]:
lifeexpDF

Unnamed: 0,CountryName,2002YR2002,2010YR2010,2018YR2018
0,Afghanistan,56.784,61.028,64.486
1,Albania,74.579,76.562,78.458
2,Algeria,71.605,74.938,76.693
3,American Samoa,..,..,..
4,Andorra,..,..,..
...,...,...,...,...
212,Virgin Islands (U.S.),77.52195122,77.86585366,79.5195122
213,West Bank and Gaza,71.447,72.788,73.895
214,"Yemen, Rep.",61.781,65.549,66.096
215,Zambia,45.4,55.655,63.51


In [345]:
lifeexpDF.reset_index(drop=True,inplace=True)

In [226]:
#Going to try reshape after merge
#lifeexpDF = lifeexpDF.melt(id_vars=["CountryName"], 
#                              var_name="Year", 
#                              value_name="lifeExp")
#lifeexpDF.head

<bound method NDFrame.head of                CountryName        Year     lifeExp
0              Afghanistan  2002YR2002      56.784
1                  Albania  2002YR2002      74.579
2                  Algeria  2002YR2002      71.605
3           American Samoa  2002YR2002          ..
4                  Andorra  2002YR2002          ..
..                     ...         ...         ...
646  Virgin Islands (U.S.)  2018YR2018  79.5195122
647     West Bank and Gaza  2018YR2018      73.895
648            Yemen, Rep.  2018YR2018      66.096
649                 Zambia  2018YR2018       63.51
650               Zimbabwe  2018YR2018      61.195

[651 rows x 3 columns]>

In [350]:
#Now have to rename year to get rid of [yr ...]
#for WORD in lifeexpDF.columns[1:]:
  #  WORD = WORD[0:4]
#lifeexpDF['Year'] = lifeexpDF['Year'].str[:4] #the first four characters are the year
#^^changed to working with wide for now

lifeexpDF.columns=['Country','LE2002','LE2010','LE2018']
lifeexpDF

Unnamed: 0,Country,LE2002,LE2010,LE2018
0,Afghanistan,56.784,61.028,64.486
1,Albania,74.579,76.562,78.458
2,Algeria,71.605,74.938,76.693
3,American Samoa,..,..,..
4,Andorra,..,..,..
...,...,...,...,...
212,Virgin Islands (U.S.),77.52195122,77.86585366,79.5195122
213,West Bank and Gaza,71.447,72.788,73.895
214,"Yemen, Rep.",61.781,65.549,66.096
215,Zambia,45.4,55.655,63.51


In [351]:
lifeexpDF.dtypes
#Do I need to change CountryName to string? 

Country    object
LE2002     object
LE2010     object
LE2018     object
dtype: object

In [352]:
lifeexpDF.to_csv("lifeexpDF.csv",index=False)

## Merging together


In [353]:
# link to the data in CSV format
linkDataGini='https://github.com/rhsu4/542_Deliv1/raw/main/giniDF.csv'
linkDataHealthexp='https://github.com/rhsu4/542_Deliv1/raw/main/healthexpDF.csv'
linkDataLifeexp='https://github.com/rhsu4/542_Deliv1/raw/main/lifeexpDF.csv'

# using 'read_csv' with a link
DataGini=pd.read_csv(linkDataGini)
DataHealthexp=pd.read_csv(linkDataHealthexp)
DataLifeexp=pd.read_csv(linkDataLifeexp)

In [354]:
DataGini.columns
#DataHealthexp.columns
#For this merge, would like to merge by country and by year if possible

Index(['Country', 'GiniPercent', 'GiniYear'], dtype='object')
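
A sketch of what merging on both country and year could look like once two tables are in long form. The frames `health` and `life` are toy data with values copied from the tables above, and `merged` is a hypothetical name. Merging long tables on Country alone would pair every year of one table with every year of the other.

```python
import pandas as pd

health = pd.DataFrame({
    "Country": ["Albania", "Albania", "Algeria"],
    "Year": [2002, 2018, 2018],
    "healthExp": [314.0, 697.0, 963.0],
})
life = pd.DataFrame({
    "Country": ["Albania", "Albania", "Algeria"],
    "Year": [2002, 2018, 2018],
    "lifeExp": [74.579, 78.458, 76.693],
})

# Two key columns; validate raises if a (Country, Year) pair repeats on either side.
merged = health.merge(
    life,
    on=["Country", "Year"],
    how="outer",
    indicator=True,
    validate="one_to_one",
)
print(merged)
```

The `validate` argument is optional, but it catches duplicated keys before they inflate the row count.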

In [355]:
#key column - Country only (year is not used as a key in this merge)
allData=DataHealthexp.merge(DataGini,on="Country",how='outer',indicator='True') #the indicator column is named 'True'

In [356]:
allData.shape
#DataLifeexp.columns
#allDataFull=allData.merge(DataLifeexp, left_on=["Country","Year"], right_on=["CountryName","Year"])

(584, 6)

In [357]:
DataHealthexp.shape

(576, 3)

In [358]:
allData

Unnamed: 0,Country,Year,healthExpenditure,GiniPercent,GiniYear,True
0,Afghanistan,2002.0,78.0,,,both
1,Afghanistan,2010.0,138.0,,,both
2,Afghanistan,2018.0,186.0,,,both
3,Albania,2002.0,314.0,33.2,2017.0,both
4,Albania,2010.0,452.0,33.2,2017.0,both
...,...,...,...,...,...,...
579,Kosovo,,,29.0,2017.0,right_only
580,Macau,,,,,right_only
581,Somalia,,,,,right_only
582,Taiwan,,,,,right_only


In [303]:
allData['True'].value_counts()
#left is DataHealthexp, right is DataGini

left_only     545
right_only    146
both           31
Name: True, dtype: int64

In [304]:
allData[allData['True']=='left_only'].Country

0      Afghanistan
1          Albania
2          Algeria
3          Andorra
4           Angola
          ...     
570        Vanuatu
571      Venezuela
573          Yemen
574         Zambia
575       Zimbabwe
Name: Country, Length: 545, dtype: object

In [305]:
allData[allData['True']=='right_only'].Country

576    Afghanistan
577        Albania
578        Algeria
579      Australia
580        Austria
          ...     
717      Venezuela
718      Palestine
719          Yemen
720         Zambia
721       Zimbabwe
Name: Country, Length: 146, dtype: object

In [306]:
# The countries unmatched
UnmatchedLeft=allData[allData['True']=='left_only'].Country.to_list()
UnmatchedRight=allData[allData['True']=='right_only'].Country.to_list()

In [307]:
from thefuzz import process
process.extractOne(UnmatchedLeft[0], UnmatchedRight)

('Afghanistan', 100)

In [308]:
process.extract(UnmatchedLeft[0], UnmatchedRight,limit=3)

[('Afghanistan', 100), ('Ghana', 72), ('Pakistan', 63)]

In [274]:
[(left, process.extractOne(left, UnmatchedRight)) for left in sorted(UnmatchedLeft)]
#None of these matches are correct, so moving on to next merge

[('Andorra', ('North Korea', 56)),
 ('Andorra', ('North Korea', 56)),
 ('Andorra', ('North Korea', 56)),
 ('Antigua and Barbuda', ('Hong Kong', 40)),
 ('Antigua and Barbuda', ('Hong Kong', 40)),
 ('Antigua and Barbuda', ('Hong Kong', 40)),
 ('Bahamas', ('Macau', 33)),
 ('Bahamas', ('Macau', 33)),
 ('Bahamas', ('Macau', 33)),
 ('Barbados', ('Macau', 36)),
 ('Barbados', ('Macau', 36)),
 ('Barbados', ('Macau', 36)),
 ('Brunei', ('EU', 45)),
 ('Brunei', ('EU', 45)),
 ('Brunei', ('EU', 45)),
 ('Cook Islands', ('Taiwan', 45)),
 ('Cook Islands', ('Taiwan', 45)),
 ('Cook Islands', ('Taiwan', 45)),
 ('Dominica', ('Somalia', 53)),
 ('Dominica', ('Somalia', 53)),
 ('Dominica', ('Somalia', 53)),
 ('Eritrea', ('North Korea', 56)),
 ('Eritrea', ('North Korea', 56)),
 ('Eritrea', ('North Korea', 56)),
 ('Grenada', ('North Korea', 50)),
 ('Grenada', ('North Korea', 50)),
 ('Grenada', ('North Korea', 50)),
 ('Kiribati', ('North Korea', 40)),
 ('Kiribati', ('North Korea', 40)),
 ('Kiribati', ('North Kor

In [282]:
allData.drop(["True"],axis=1,inplace=True)


In [283]:
allDataFull=allData.merge(DataLifeexp,left_on=["Country"],right_on=["CountryName"],how='outer',indicator='True') 

In [284]:
allDataFull

Unnamed: 0,Country,Year_x,healthExpenditure,GiniPercent,GiniYear,CountryName,Year_y,lifeExp,True
0,Afghanistan,2002.0,78.0,,,Afghanistan,2002.0,56.784,both
1,Afghanistan,2002.0,78.0,,,Afghanistan,2010.0,61.028,both
2,Afghanistan,2002.0,78.0,,,Afghanistan,2018.0,64.486,both
3,Afghanistan,2010.0,138.0,,,Afghanistan,2002.0,56.784,both
4,Afghanistan,2010.0,138.0,,,Afghanistan,2010.0,61.028,both
...,...,...,...,...,...,...,...,...,...
1729,,,,,,West Bank and Gaza,2010.0,72.788,right_only
1730,,,,,,West Bank and Gaza,2018.0,73.895,right_only
1731,,,,,,"Yemen, Rep.",2002.0,61.781,right_only
1732,,,,,,"Yemen, Rep.",2010.0,65.549,right_only


In [287]:
# The countries unmatched
UnmatchedLeft=allDataFull[allDataFull['True']=='left_only'].Country.to_list()
UnmatchedRight=allDataFull[allDataFull['True']=='right_only'].CountryName.to_list()

In [288]:
UnmatchedLeft

['Bahamas',
 'Bahamas',
 'Bahamas',
 'Brunei',
 'Brunei',
 'Brunei',
 'Cape Verde',
 'Cape Verde',
 'Cape Verde',
 'Congo',
 'Congo',
 'Congo',
 'Cook Islands',
 'Cook Islands',
 'Cook Islands',
 'Ivory Coast',
 'Ivory Coast',
 'Ivory Coast',
 'DR Congo',
 'DR Congo',
 'DR Congo',
 'Egypt',
 'Egypt',
 'Egypt',
 'Gambia',
 'Gambia',
 'Gambia',
 'Iran',
 'Iran',
 'Iran',
 'Kyrgyzstan',
 'Kyrgyzstan',
 'Kyrgyzstan',
 'Laos',
 'Laos',
 'Laos',
 'Micronesia',
 'Micronesia',
 'Micronesia',
 'Niue',
 'Niue',
 'Niue',
 'South Korea',
 'South Korea',
 'South Korea',
 'Russia',
 'Russia',
 'Russia',
 'Saint Kitts and Nevis',
 'Saint Kitts and Nevis',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Lucia',
 'Saint Lucia',
 'Saint Vincent and the Grenadines',
 'Saint Vincent and the Grenadines',
 'Saint Vincent and the Grenadines',
 'São Tomé and Príncipe',
 'São Tomé and Príncipe',
 'São Tomé and Príncipe',
 'Slovakia',
 'Slovakia',
 'Slovakia',
 'Syria',
 'Syria',
 'Syria',
 'East Timor',
 'Ea

In [289]:
UnmatchedRight

['American Samoa',
 'American Samoa',
 'American Samoa',
 'Aruba',
 'Aruba',
 'Aruba',
 'Bahamas, The',
 'Bahamas, The',
 'Bahamas, The',
 'Bermuda',
 'Bermuda',
 'Bermuda',
 'British Virgin Islands',
 'British Virgin Islands',
 'British Virgin Islands',
 'Brunei Darussalam',
 'Brunei Darussalam',
 'Brunei Darussalam',
 'Cabo Verde',
 'Cabo Verde',
 'Cabo Verde',
 'Cayman Islands',
 'Cayman Islands',
 'Cayman Islands',
 'Channel Islands',
 'Channel Islands',
 'Channel Islands',
 'Congo, Dem. Rep.',
 'Congo, Dem. Rep.',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Congo, Rep.',
 'Congo, Rep.',
 "Cote d'Ivoire",
 "Cote d'Ivoire",
 "Cote d'Ivoire",
 'Curacao',
 'Curacao',
 'Curacao',
 'Egypt, Arab Rep.',
 'Egypt, Arab Rep.',
 'Egypt, Arab Rep.',
 'Faroe Islands',
 'Faroe Islands',
 'Faroe Islands',
 'French Polynesia',
 'French Polynesia',
 'French Polynesia',
 'Gambia, The',
 'Gambia, The',
 'Gambia, The',
 'Gibraltar',
 'Gibraltar',
 'Gibraltar',
 'Greenland',
 'Greenland',
 'Greenland',
 'Gu

In [290]:
[(left, process.extractOne(left, UnmatchedRight)) for left in sorted(UnmatchedLeft)]

[('Bahamas', ('Bahamas, The', 90)),
 ('Bahamas', ('Bahamas, The', 90)),
 ('Bahamas', ('Bahamas, The', 90)),
 ('Brunei', ('Brunei Darussalam', 90)),
 ('Brunei', ('Brunei Darussalam', 90)),
 ('Brunei', ('Brunei Darussalam', 90)),
 ('Cape Verde', ('Cabo Verde', 80)),
 ('Cape Verde', ('Cabo Verde', 80)),
 ('Cape Verde', ('Cabo Verde', 80)),
 ('Congo', ('Congo, Dem. Rep.', 90)),
 ('Congo', ('Congo, Dem. Rep.', 90)),
 ('Congo', ('Congo, Dem. Rep.', 90)),
 ('Cook Islands', ('British Virgin Islands', 86)),
 ('Cook Islands', ('British Virgin Islands', 86)),
 ('Cook Islands', ('British Virgin Islands', 86)),
 ('DR Congo', ('Congo, Dem. Rep.', 86)),
 ('DR Congo', ('Congo, Dem. Rep.', 86)),
 ('DR Congo', ('Congo, Dem. Rep.', 86)),
 ('EU', ('Bahamas, The', 60)),
 ('East Timor', ('Timor-Leste', 82)),
 ('East Timor', ('Timor-Leste', 82)),
 ('East Timor', ('Timor-Leste', 82)),
 ('Egypt', ('Egypt, Arab Rep.', 90)),
 ('Egypt', ('Egypt, Arab Rep.', 90)),
 ('Egypt', ('Egypt, Arab Rep.', 90)),
 ('Gambia', 

In [291]:
#Creating the list of incorrect matches
# this is a list of tuples:
TotallyWrong=[('Cook Islands', ('British Virgin Islands', 86)),
              ('Congo', ('Congo, Dem. Rep.', 90)),
              ('EU', ('Bahamas, The', 60)),
              ('Niue', ('New Caledonia', 51)),
              ('Palestine', ('Liechtenstein', 55)), 
              ('South Korea', ("Korea, Dem. People's Rep.", 86)),
              ('Taiwan', ('Northern Mariana Islands', 60))]
omitLeft=[leftName for (leftName,rightFuzzy) in TotallyWrong] #parentheses around the tuple are not needed
omitLeft


['Cook Islands', 'Congo', 'EU', 'Niue', 'Palestine', 'South Korea', 'Taiwan']
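
Each unmatched name appears several times in the lists above, so it may help to de-duplicate before matching and to apply a score cutoff rather than review every pair by hand. The sketch below swaps in the standard library's `difflib` for thefuzz, so its scores run from 0 to 1 and are not comparable to the 0 to 100 scores shown above. The names `unmatched_left`, `unmatched_right`, and `proposals` are hypothetical, and the 0.7 cutoff is an arbitrary choice.

```python
from difflib import get_close_matches

# Toy lists with repeats, using names from the outputs above.
unmatched_left = ["Bahamas", "Bahamas", "Bahamas", "Cape Verde", "Cape Verde", "Niue"]
unmatched_right = ["Bahamas, The", "Bahamas, The", "Cabo Verde", "New Caledonia"]

# Match each distinct name once.
left_names = sorted(set(unmatched_left))
right_names = sorted(set(unmatched_right))

# Map right-hand spelling -> left-hand spelling, keeping only close matches.
proposals = {}
for name in left_names:
    best = get_close_matches(name, right_names, n=1, cutoff=0.7)
    if best:
        proposals[best[0]] = name

print(proposals)
```

Names with no candidate above the cutoff (here 'Niue') are left out and still need a manual decision.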

In [293]:
changesRight={process.extractOne(left, UnmatchedRight)[0]:left for left in UnmatchedLeft if left not in omitLeft}
DataLifeexp['CountryName']=DataLifeexp['CountryName'].replace(changesRight) #assign back instead of inplace on a column

In [None]:
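# dict of manual changes; each key must match a name in DataLifeexp.CountryName exactly, otherwise that entry is silently ignored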
bruteForceChanges={'Korea (the Republic of)':'South Korea', 
                'United States of America (the)':'United States',
                'Czechia':'Czech Republic', 
                'Congo (the)':'Republic of the Congo',
                'Sudan (the)':'Sudan',
                "Lao People's Democratic Republic (the)":'Laos'}

# replacing (the column is CountryName, with a capital N)
DataLifeexp['CountryName']=DataLifeexp['CountryName'].replace(bruteForceChanges)

In [295]:
# redoing merge
allDataFull=allData.merge(DataLifeexp,left_on="Country",right_on="CountryName")

# current dimension
allDataFull.shape

(1707, 8)

In [296]:
allDataFull

Unnamed: 0,Country,Year_x,healthExpenditure,GiniPercent,GiniYear,CountryName,Year_y,lifeExp
0,Afghanistan,2002.0,78.0,,,Afghanistan,2002,56.784
1,Afghanistan,2002.0,78.0,,,Afghanistan,2010,61.028
2,Afghanistan,2002.0,78.0,,,Afghanistan,2018,64.486
3,Afghanistan,2010.0,138.0,,,Afghanistan,2002,56.784
4,Afghanistan,2010.0,138.0,,,Afghanistan,2010,61.028
...,...,...,...,...,...,...,...,...
1702,Macau,,,,,Macau,2010,82.704
1703,Macau,,,,,Macau,2018,84.118
1704,Somalia,,,,,Somalia,2002,51.492
1705,Somalia,,,,,Somalia,2010,53.99


In [None]:
##Maybe I should make this easier on myself: just build one dataset with years and don't merge. Country name and Gini index + Gini year.