# Index

Parse data for analysis
1. [Mortality](#Mortality)
2. [Medicine](#Medicine)
3. [Hygiene](#Hygiene)
4. [Finance-universal health coverage (UHC)](#Finance-universal-health-coverage-(UHC))

Machine learning
1. [Prepare data for machine learning](#Prepare-data-for-machine-learning)
2. [Maternal mortality](#Maternal-mortality)
3. [Infant mortality](#Infant-mortality)
4. [Neonatal mortality](#Neonatal-mortality)
5. [Under 5 mortality](#Under-5-mortality)

In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [6]:
%pip install pycountry-convert

Note: you may need to restart the kernel to use updated packages.


In [7]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [10,5]

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

import pycountry
import pycountry_convert as pc

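def convert_continent(x):
    """Map an ISO 3166-1 alpha-2 country code to a continent code."""
    try:
        return pc.country_alpha2_to_continent_code(x)
    except Exception:
        # The lookup fails for Timor-Leste (TL) here; assign it to Asia
        if x == 'TL':
            return 'AS'
        # Any other unresolved code is left without a continent
        return None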

For the exploratory data analysis (EDA), see: https://www.kaggle.com/singchenyeo/eda-mortality/notebook

# Mortality

In [8]:
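# Keep only the text before the first '[' in 'First Tooltip' (drops the bracketed part after the value)
f = lambda x: x['First Tooltip'].split("[")[0]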
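
The same parsing step can be sketched with pandas string methods, which avoids a row-wise `apply`. This is a minimal sketch on made-up data: the sample values and the "value [low-high]" layout are assumptions for illustration, not rows taken from the CSV files.

```python
import pandas as pd

# Hypothetical sample of the 'First Tooltip' column: a point estimate,
# optionally followed by a bracketed range (assumed format).
sample = pd.DataFrame({'First Tooltip': ['638 [427-1010]', '15 [8-26]', '3.2']})

# Keep the text before the first '[', strip whitespace, convert to float.
point_estimate = (
    sample['First Tooltip']
    .str.split('[', n=1)
    .str[0]
    .str.strip()
    .astype(float)
)
print(point_estimate.tolist())
```

Values without a bracket pass through unchanged, so the same expression works for columns that mix both layouts.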

In [9]:
maternalMortalityRatio_s = pd.read_csv('../csv/maternalMortalityRatio.csv', parse_dates =['Period'])
maternalMortalityRatio_s['First Tooltip'] = maternalMortalityRatio_s.apply(f, axis=1)
maternalMortalityRatio_s['First Tooltip'] = maternalMortalityRatio_s['First Tooltip'].astype(float)
maternalMortalityRatio_s = maternalMortalityRatio_s[['Location','Period','First Tooltip']]
maternalMortalityRatio_s = maternalMortalityRatio_s.rename(columns ={'First Tooltip':'maternalMortalityRatio'})
maternalMortalityRatio_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3294 entries, 0 to 3293
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Location                3294 non-null   object        
 1   Period                  3294 non-null   datetime64[ns]
 2   maternalMortalityRatio  3294 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 77.3+ KB


In [10]:
infantMortalityRate_s = pd.read_csv('../csv/infantMortalityRate.csv', parse_dates=['Period'])
infantMortalityRate_s['First Tooltip'] = infantMortalityRate_s.apply(f, axis=1)
infantMortalityRate_s['First Tooltip'] = infantMortalityRate_s['First Tooltip'].astype(float)
infantMortalityRate_s = infantMortalityRate_s[infantMortalityRate_s['Dim1']=='Both sexes']
infantMortalityRate_s = infantMortalityRate_s[['Location','Period','First Tooltip']]
infantMortalityRate_s = infantMortalityRate_s.rename(columns={'First Tooltip':'infantMortalityRate'})
infantMortalityRate_s.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 29997
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Location             10000 non-null  object        
 1   Period               10000 non-null  datetime64[ns]
 2   infantMortalityRate  10000 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 312.5+ KB


In [11]:
neonatalMortalityRate_s = pd.read_csv('../csv/neonatalMortalityRate.csv', parse_dates=['Period']) 
neonatalMortalityRate_s['First Tooltip'] = neonatalMortalityRate_s.apply(f, axis=1)
neonatalMortalityRate_s['First Tooltip'] = neonatalMortalityRate_s['First Tooltip'].astype(float)
neonatalMortalityRate_s = neonatalMortalityRate_s[neonatalMortalityRate_s['Dim1']=='Both sexes']
neonatalMortalityRate_s = neonatalMortalityRate_s[['Location','Period','First Tooltip']]
neonatalMortalityRate_s = neonatalMortalityRate_s.rename(columns={'First Tooltip':'neonatalMortalityRate'})
neonatalMortalityRate_s.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9135 entries, 0 to 9134
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Location               9135 non-null   object        
 1   Period                 9135 non-null   datetime64[ns]
 2   neonatalMortalityRate  9135 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 285.5+ KB


In [12]:
under5MortalityRate_s = pd.read_csv('../csv/under5MortalityRate.csv', parse_dates=['Period']) 
under5MortalityRate_s['First Tooltip'] = under5MortalityRate_s.apply(f, axis=1)
under5MortalityRate_s['First Tooltip'] = under5MortalityRate_s['First Tooltip'].astype(float)
under5MortalityRate_s = under5MortalityRate_s[under5MortalityRate_s['Dim1']=='Both sexes']
under5MortalityRate_s = under5MortalityRate_s[['Location','Period','First Tooltip']]
under5MortalityRate_s = under5MortalityRate_s.rename(columns={'First Tooltip':'under5MortalityRate'})
under5MortalityRate_s.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 29997
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Location             10000 non-null  object        
 1   Period               10000 non-null  datetime64[ns]
 2   under5MortalityRate  10000 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 312.5+ KB


# Medicine

In [13]:
birthAttendedBySkilledPersonal = pd.read_csv('../csv/birthAttendedBySkilledPersonal.csv', parse_dates =['Period'])
birthAttendedBySkilledPersonal_s = birthAttendedBySkilledPersonal[['Location','Period','First Tooltip']]
birthAttendedBySkilledPersonal_s = birthAttendedBySkilledPersonal_s.rename(columns={'First Tooltip':'birthAttendedBySkilledPersonal'})
birthAttendedBySkilledPersonal_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1755 entries, 0 to 1754
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Location                        1755 non-null   object        
 1   Period                          1755 non-null   datetime64[ns]
 2   birthAttendedBySkilledPersonal  1755 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 41.3+ KB


In [14]:
medicalDoctors = pd.read_csv('../csv/medicalDoctors.csv', parse_dates =['Period'])
medicalDoctors_s = medicalDoctors[['Location','Period','First Tooltip']]
medicalDoctors_s = medicalDoctors_s.rename(columns={'First Tooltip':'medicalDoctors'})
medicalDoctors_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2506 entries, 0 to 2505
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Location        2506 non-null   object        
 1   Period          2506 non-null   datetime64[ns]
 2   medicalDoctors  2506 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 58.9+ KB


In [15]:
nursingAndMidwife = pd.read_csv('../csv/nursingAndMidwife.csv', parse_dates =['Period'])
# Some Belize values look inflated by a factor of 100: divide those above 50 by 100
nursingAndMidwife.loc[(nursingAndMidwife['Location'] == 'Belize') & (nursingAndMidwife['First Tooltip'] > 50), 'First Tooltip']  /= 100
nursingAndMidwife_s = nursingAndMidwife[['Location','Period','First Tooltip']]
nursingAndMidwife_s = nursingAndMidwife_s.rename(columns={'First Tooltip':'nursingAndMidwife'})
nursingAndMidwife_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2587 entries, 0 to 2586
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Location           2587 non-null   object        
 1   Period             2587 non-null   datetime64[ns]
 2   nursingAndMidwife  2587 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 60.8+ KB


In [16]:
pharmacists = pd.read_csv('../csv/pharmacists.csv', parse_dates =['Period'])
pharmacists_s = pharmacists[['Location','Period','First Tooltip']]
pharmacists_s = pharmacists_s.rename(columns={'First Tooltip':'pharmacists'})
pharmacists_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Location     1795 non-null   object        
 1   Period       1795 non-null   datetime64[ns]
 2   pharmacists  1795 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 42.2+ KB


# Hygiene

In [17]:
basicDrinkingWaterServices = pd.read_csv('../csv/basicDrinkingWaterServices.csv', parse_dates=['Period'])
basicDrinkingWaterServices = basicDrinkingWaterServices[['Location','Period','First Tooltip']]
basicDrinkingWaterServices = basicDrinkingWaterServices.rename(columns={'First Tooltip':'basicDrinkingWaterServices'})
basicDrinkingWaterServices.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3455 entries, 0 to 3454
Data columns (total 3 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Location                    3455 non-null   object        
 1   Period                      3455 non-null   datetime64[ns]
 2   basicDrinkingWaterServices  3455 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 81.1+ KB


In [18]:
atLeastBasicSanitizationServices = pd.read_csv('../csv/atLeastBasicSanitizationServices.csv', parse_dates=['Period'])
atLeastBasicSanitizationServices = atLeastBasicSanitizationServices[atLeastBasicSanitizationServices['Dim1'] == 'Total']
atLeastBasicSanitizationServices = atLeastBasicSanitizationServices[['Location','Period','First Tooltip']]
atLeastBasicSanitizationServices = atLeastBasicSanitizationServices.rename(columns={'First Tooltip':'atLeastBasicSanitizationServices'})
atLeastBasicSanitizationServices.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3439 entries, 0 to 9365
Data columns (total 3 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Location                          3439 non-null   object        
 1   Period                            3439 non-null   datetime64[ns]
 2   atLeastBasicSanitizationServices  3439 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 107.5+ KB


In [19]:
safelySanitization = pd.read_csv('../csv/safelySanitization.csv', parse_dates=['Period'])
safelySanitization = safelySanitization[safelySanitization['Dim1'] == 'Total']
safelySanitization = safelySanitization[['Location','Period','First Tooltip']]
safelySanitization = safelySanitization.rename(columns={'First Tooltip':'safelySanitization'})
safelySanitization.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1571 entries, 0 to 3618
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Location            1571 non-null   object        
 1   Period              1571 non-null   datetime64[ns]
 2   safelySanitization  1571 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 49.1+ KB


In [20]:
basicHandWashing = pd.read_csv('../csv/basicHandWashing.csv', parse_dates=['Period'])
basicHandWashing = basicHandWashing[basicHandWashing['Dim1'] == 'Total']
basicHandWashing = basicHandWashing[['Location','Period','First Tooltip']]
basicHandWashing = basicHandWashing.rename(columns={'First Tooltip':'basicHandWashing'})
basicHandWashing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 921 entries, 0 to 2723
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Location          921 non-null    object        
 1   Period            921 non-null    datetime64[ns]
 2   basicHandWashing  921 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 28.8+ KB


# Finance-universal health coverage (UHC)

## UHC service coverage index

Coverage of essential health services is defined as the average coverage of essential services based on tracer interventions that include reproductive, maternal, newborn and child health, infectious diseases, non-communicable diseases, and service capacity and access, among the general and the most disadvantaged population. The indicator is an index reported on a unitless scale of 0 to 100, computed as the geometric mean of 14 tracer indicators of health service coverage. The tracer indicators are organized into four components of service coverage:
1. Reproductive, maternal, newborn and child health
2. Infectious diseases
3. Noncommunicable diseases
4. Service capacity and access

See the 2019 monitoring report for the tracer indicators within each component.
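
To make the "geometric mean" part concrete, here is a minimal sketch with three made-up coverage values. It is a simplification: the real index combines 14 tracer indicators, and the numbers below are hypothetical.

```python
from statistics import geometric_mean

# Hypothetical coverage values on the 0-100 scale (not real data)
tracer_coverage = [90.0, 40.0, 60.0]

index = geometric_mean(tracer_coverage)
arithmetic = sum(tracer_coverage) / len(tracer_coverage)

print(f"geometric mean:  {index:.2f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The geometric mean comes out below the arithmetic mean whenever the values differ, so a single poorly covered service pulls the index down more than a plain average would.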

In [21]:
uhcCoverage = pd.read_csv('../csv/uhcCoverage.csv', parse_dates=['Period'])
uhcCoverage = uhcCoverage.rename(columns={'First Tooltip':'uhcCoverage'})
uhcCoverage = uhcCoverage[['Location','Period','uhcCoverage']]
uhcCoverage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Location     366 non-null    object        
 1   Period       366 non-null    datetime64[ns]
 2   uhcCoverage  366 non-null    int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 8.7+ KB


## Monitoring Sustainable Development Goals: Indicator 3.8.2

https://www.who.int/health_financing/topics/financial-protection/monitoring-sdg/en/

![](https://www.who.int/health_financing/topics/financial-protection/sdg-target-figure-491.jpg)

In [22]:
population10SDG = pd.read_csv('../csv/population10SDG3.8.2.csv', parse_dates=['Period']) 
population10SDG = population10SDG[population10SDG['Dim1'] == 'Total']
population10SDG = population10SDG.rename(columns = {'First Tooltip':'population10SDG'})
population10SDG.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 711 entries, 0 to 1890
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Location         711 non-null    object        
 1   Period           711 non-null    datetime64[ns]
 2   Indicator        711 non-null    object        
 3   Dim1             711 non-null    object        
 4   population10SDG  711 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 33.3+ KB


In [23]:
population25SDG = pd.read_csv('../csv/population25SDG3.8.2.csv', parse_dates=['Period']) 
population25SDG = population25SDG[population25SDG['Dim1'] == 'Total']
population25SDG = population25SDG.rename(columns = {'First Tooltip':'population25SDG'})
population25SDG.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 711 entries, 0 to 1890
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Location         711 non-null    object        
 1   Period           711 non-null    datetime64[ns]
 2   Indicator        711 non-null    object        
 3   Dim1             711 non-null    object        
 4   population25SDG  711 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 33.3+ KB


In [24]:
populationSDG = pd.merge(population10SDG[['Location','Period', 'population10SDG']], population25SDG[['Location','Period', 'population25SDG']], on=['Location','Period'] )
populationSDG.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 711 entries, 0 to 710
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Location         711 non-null    object        
 1   Period           711 non-null    datetime64[ns]
 2   population10SDG  711 non-null    float64       
 3   population25SDG  711 non-null    float64       
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 27.8+ KB


# Machine learning

## Prepare data for machine learning

In [25]:
# Merge medicine
df_medicine =  pd.merge(birthAttendedBySkilledPersonal_s, medicalDoctors_s, how='outer', on=['Period','Location'])
df_medicine = pd.merge(df_medicine,nursingAndMidwife_s, how='outer', on=['Period','Location'] )
df_medicine = pd.merge(df_medicine,pharmacists_s, how='outer', on=['Period','Location'] )
df_medicine.describe()

| statistic | birthAttendedBySkilledPersonal | medicalDoctors | nursingAndMidwife | pharmacists |
|---|---|---|---|---|
| count | 1755.0 | 2506.0 | 2587.0 | 1795.0 |
| mean | 92.045442 | 20.685012 | 45.417905 | 4.124118 |
| std | 16.557089 | 14.299267 | 38.810124 | 3.62461 |
| min | 5.7 | 0.13 | 0.012 | 0.002 |
| 25% | 95.35 | 7.7825 | 11.835 | 0.79 |
| 50% | 99.0 | 21.28 | 37.45 | 3.53 |
| 75% | 99.8 | 31.66 | 66.165 | 6.39 |
| max | 100.0 | 84.22 | 201.6 | 26.3 |


In [26]:
# Merge hygiene
df_hygiene = pd.merge(basicDrinkingWaterServices, atLeastBasicSanitizationServices, how='outer', on=['Period','Location'])
df_hygiene = pd.merge(df_hygiene, safelySanitization, how='outer', on=['Period','Location'])
df_hygiene = pd.merge(df_hygiene, basicHandWashing, how='outer', on=['Period','Location'])
df_hygiene.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3463 entries, 0 to 3462
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Location                          3463 non-null   object        
 1   Period                            3463 non-null   datetime64[ns]
 2   basicDrinkingWaterServices        3455 non-null   float64       
 3   atLeastBasicSanitizationServices  3439 non-null   float64       
 4   safelySanitization                1571 non-null   float64       
 5   basicHandWashing                  921 non-null    float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 189.4+ KB


In [27]:
# Merge finance
df_finance = pd.merge(uhcCoverage, populationSDG, on=['Location', 'Period'], how='outer')
df_finance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1051 entries, 0 to 1050
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Location         1051 non-null   object        
 1   Period           1051 non-null   datetime64[ns]
 2   uhcCoverage      366 non-null    float64       
 3   population10SDG  711 non-null    float64       
 4   population25SDG  711 non-null    float64       
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 49.3+ KB
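
The output above shows two effects of `how='outer'`: the row count grows to cover every (Location, Period) pair found in either table, and `uhcCoverage` changes from `int64` to `float64` because the unmatched rows are filled with NaN. A minimal sketch on hypothetical toy frames (the values are made up):

```python
import pandas as pd

# Hypothetical toy tables sharing the ('Location', 'Period') key
coverage = pd.DataFrame({'Location': ['A', 'B'],
                         'Period': [2015, 2017],
                         'uhcCoverage': [45, 80]})
spending = pd.DataFrame({'Location': ['A', 'C'],
                         'Period': [2015, 2010],
                         'population10SDG': [3.5, 12.1]})

# Default how='inner' keeps only keys present in both tables
inner = pd.merge(coverage, spending, on=['Location', 'Period'])
# how='outer' keeps every key and fills the gaps with NaN
outer = pd.merge(coverage, spending, on=['Location', 'Period'], how='outer')

print(len(inner), len(outer))
print(outer['uhcCoverage'].dtype)
```

The NaN values introduced here are what the interpolation and imputation steps later in the notebook have to fill.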


## Maternal mortality

In [28]:
df_maternal = pd.merge(maternalMortalityRatio_s, df_medicine, on = ['Location','Period'],how='outer')
df_maternal = pd.merge(df_maternal, df_hygiene, on = ['Location','Period'],how='outer')
df_maternal = pd.merge(df_maternal, df_finance, on = ['Location','Period'],how='outer')
df_maternal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 4311
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Location                          4312 non-null   object        
 1   Period                            4312 non-null   datetime64[ns]
 2   maternalMortalityRatio            3294 non-null   float64       
 3   birthAttendedBySkilledPersonal    1755 non-null   float64       
 4   medicalDoctors                    2506 non-null   float64       
 5   nursingAndMidwife                 2587 non-null   float64       
 6   pharmacists                       1795 non-null   float64       
 7   basicDrinkingWaterServices        3455 non-null   float64       
 8   atLeastBasicSanitizationServices  3439 non-null   float64       
 9   safelySanitization                1571 non-null   float64       
 10  basicHandWashing                  921 non-null  

In [29]:
# Rename some countries so that pycountry_convert can resolve them
df_maternal.loc[df_maternal['Location'] == "Sudan (until 2011)", 'Location'] = 'Sudan'
df_maternal.loc[df_maternal['Location'] == "Bolivia (Plurinational State of)", 'Location'] = 'Bolivia, Plurinational State of'
df_maternal.loc[df_maternal['Location'] == "Côte d’Ivoire", 'Location'] = 'Ivory Coast'
df_maternal.loc[df_maternal['Location'] == "Iran (Islamic Republic of)", 'Location'] = 'Iran, Islamic Republic of'
df_maternal.loc[df_maternal['Location'] == "Micronesia (Federated States of)", 'Location'] = 'Micronesia'
df_maternal.loc[df_maternal['Location'] == "Republic of Korea", 'Location'] = 'Korea, Republic of'
df_maternal.loc[df_maternal['Location'] == "The former Yugoslav Republic of Macedonia", 'Location'] = 'North Macedonia'
df_maternal.loc[df_maternal['Location'] == "Venezuela (Bolivarian Republic of)", 'Location'] = 'Venezuela, Bolivarian Republic of'
df_maternal.loc[df_maternal['Location'] == "Germany, Federal Republic (former)", 'Location'] = 'Germany'
df_maternal.loc[df_maternal['Location'] == "India (until 1975)", 'Location'] = 'India'
df_maternal.loc[df_maternal['Location'] == "Kiribati (until 1984)", 'Location'] = 'Kiribati'
df_maternal.loc[df_maternal['Location'] == "South Viet Nam (former)", 'Location'] = 'Viet Nam'
df_maternal.loc[df_maternal['Location'] == 'Yemen Arab Republic (until 1990)', 'Location'] = 'Yemen'
df_maternal.loc[df_maternal['Location'] == 'State of Palestine', 'Location'] = 'Palestine, State of'

df_maternal['country_code'] = df_maternal['Location'].apply(pc.country_name_to_country_alpha2)
df_maternal['continent'] = df_maternal['country_code'].apply(lambda x: convert_continent(x))

In [30]:
# Interpolate missing data within country
country_list = df_maternal['Location'].unique()

for country in country_list:
    df_country = df_maternal[df_maternal['Location'] == country] 
    df_country = df_country.sort_values('Period')
    for i in range(2, 14):
        df_country.iloc[:,i] = df_country.iloc[:,i].interpolate(method='linear', limit_direction='both')
    
    # Drop this country's original rows
    df_maternal = df_maternal[df_maternal['Location'] != country]
    # Re-attach the interpolated rows (pd.concat replaces DataFrame.append, removed in pandas 2.0)
    df_maternal = pd.concat([df_maternal, df_country])
    
df_maternal = df_maternal.sort_index()
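
The per-country interpolation can also be written with `groupby` and `transform`. The sketch below uses a hypothetical toy frame with a single `value` column; it is an alternative formulation under that assumption, not the code the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, three years each, with gaps
toy = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Period': [2000, 2001, 2002, 2000, 2001, 2002],
    'value': [1.0, np.nan, 3.0, np.nan, 10.0, np.nan],
})

# Sort so that interpolation follows time order within each country
toy = toy.sort_values(['Location', 'Period'])

# Interpolate inside each country only; limit_direction='both' also
# fills leading and trailing gaps with the nearest known value
toy['value'] = (
    toy.groupby('Location')['value']
       .transform(lambda s: s.interpolate(method='linear', limit_direction='both'))
)
print(toy['value'].tolist())
```

A country with no observed value at all for a column stays NaN after this step, which is why an imputer is still needed afterwards.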

In [31]:
# Prepare data for KNNImputer
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_maternal['Location'] = le.fit_transform(df_maternal['Location'])
df_maternal['continent'] = le.fit_transform(df_maternal['continent'])

df_maternal = df_maternal.drop(['country_code'], axis=1)

# Convert dates to proleptic Gregorian ordinals (integers) so that every column is numeric
df_maternal['Period'] = df_maternal['Period'].apply(lambda x: x.toordinal())
column_names = df_maternal.columns
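
A small illustration of what these two encodings produce, on a hypothetical three-row frame (the country names and dates are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame
toy = pd.DataFrame({
    'Location': ['Chile', 'Belize', 'Chile'],
    'Period': pd.to_datetime(['2000-01-01', '2001-01-01', '2002-01-01']),
})

le = LabelEncoder()
# Labels are assigned in sorted order: Belize -> 0, Chile -> 1
toy['Location'] = le.fit_transform(toy['Location'])
# toordinal() counts days, with 0001-01-01 as day 1
toy['Period'] = toy['Period'].apply(lambda x: x.toordinal())

print(toy)
print(le.inverse_transform([0, 1]))
```

Note that the integer labels impose an arbitrary alphabetical order on countries, and the ordinals are large numbers (around 730,000), which matters for any distance-based step that follows.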

In [32]:
df_maternal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 4311
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Location                          4312 non-null   int32  
 1   Period                            4312 non-null   int64  
 2   maternalMortalityRatio            4086 non-null   float64
 3   birthAttendedBySkilledPersonal    4054 non-null   float64
 4   medicalDoctors                    4290 non-null   float64
 5   nursingAndMidwife                 4290 non-null   float64
 6   pharmacists                       4081 non-null   float64
 7   basicDrinkingWaterServices        4308 non-null   float64
 8   atLeastBasicSanitizationServices  4308 non-null   float64
 9   safelySanitization                2106 non-null   float64
 10  basicHandWashing                  1986 non-null   float64
 11  uhcCoverage                       4086 non-null   float64
 12  popula

In [33]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_maternal_imputed = pd.DataFrame(imputer.fit_transform(df_maternal), columns = column_names) 
df_maternal_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4312 entries, 0 to 4311
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Location                          4312 non-null   float64
 1   Period                            4312 non-null   float64
 2   maternalMortalityRatio            4312 non-null   float64
 3   birthAttendedBySkilledPersonal    4312 non-null   float64
 4   medicalDoctors                    4312 non-null   float64
 5   nursingAndMidwife                 4312 non-null   float64
 6   pharmacists                       4312 non-null   float64
 7   basicDrinkingWaterServices        4312 non-null   float64
 8   atLeastBasicSanitizationServices  4312 non-null   float64
 9   safelySanitization                4312 non-null   float64
 10  basicHandWashing                  4312 non-null   float64
 11  uhcCoverage                       4312 non-null   float64
 12  popula
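
How `KNNImputer` fills a gap can be seen on a tiny made-up array: a missing entry is replaced by the mean of that feature over the nearest rows, where distance is measured on the features both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix; the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows to [2, nan] are [1, 2] and [3, 6],
# so the gap is filled with the mean of 2 and 6
print(X_filled[1])
```

Because the distance uses raw feature values, columns with large magnitudes (such as ordinal dates) weigh more heavily when neighbours are chosen.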

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numerical_cols = ['medicalDoctors', 'nursingAndMidwife', 'pharmacists', 'basicDrinkingWaterServices','atLeastBasicSanitizationServices', 'safelySanitization','basicHandWashing', 'uhcCoverage',
       'population10SDG', 'population25SDG', 'continent']

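# Standardize the target (StandardScaler expects a 2-D array)
Y = np.array(df_maternal_imputed['maternalMortalityRatio']).reshape(-1, 1)
# Y = df_maternal_imputed['maternalMortalityRatio']
scaler = StandardScaler()
Y = scaler.fit_transform(Y)

# Remove the target from the feature frame
df_maternal_imputed.pop('maternalMortalityRatio')
X = df_maternal_imputed
# Note: refitting the same scaler on X discards the statistics fitted on Y
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

# Random 75/25 train/validation split (no stratification: the target is continuous)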
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, train_size=0.75)
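
A common variation, shown here as a sketch on synthetic data and not what the cell above does, is to split first, fit the scalers on the training part only, and keep a separate scaler for the target so that predictions can be mapped back to the original units. The names `x_scaler` and `y_scaler` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.75, random_state=0)

# Fit the scalers on the training part only, then reuse them
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

X_train_s = x_scaler.transform(X_train)
X_valid_s = x_scaler.transform(X_valid)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()

# A dedicated target scaler lets scaled values be mapped back
y_back = y_scaler.inverse_transform(y_train_s.reshape(-1, 1)).ravel()
print(X_train_s.shape, X_valid_s.shape)
```

With this layout, an RMSE computed on the scaled target can be converted to the original units by multiplying by `y_scaler.scale_[0]`.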

### Lasso

1. Tune the Lasso regularization parameter (`alpha`) with a grid search
2. Fit the model and inspect the coefficients

In [69]:
# grid search hyperparameters for lasso regression
from numpy import arange
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso
# define model
model = Lasso()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
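# define grid (it includes alpha=0, i.e. ordinary least squares, which scikit-learn advises against fitting with Lasso)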
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, because scikit-learn maximizes scores)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.349
Config: {'alpha': 0.02}


In [70]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

clf = Lasso(alpha=0.02).fit(X_train, y_train)
prediction = clf.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )
print(clf.coef_)
print(clf.intercept_)
pd.DataFrame(clf.coef_, index= df_maternal_imputed.columns ).sort_values(0)

RMSE = 0.47666958832175177
R-square = 0.7817769662314386
[-4.00590521e-04 -7.06488674e-05 -1.22935643e-02 -0.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -7.60500865e-02 -0.00000000e+00
 -1.64950246e-01 -8.05454587e-02 -1.27179741e-01  0.00000000e+00
  0.00000000e+00 -8.49815792e-02]
[52.36357909]


| feature | coefficient |
|---|---|
| safelySanitization | -0.16495 |
| uhcCoverage | -0.12718 |
| continent | -0.084982 |
| basicHandWashing | -0.080545 |
| basicDrinkingWaterServices | -0.07605 |
| birthAttendedBySkilledPersonal | -0.012294 |
| Location | -0.000401 |
| Period | -7.1e-05 |
| medicalDoctors | -0.0 |
| nursingAndMidwife | -0.0 |
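
Several coefficients above are exactly zero. That is the expected effect of the L1 penalty, and the sketch below reproduces it on synthetic data (the data and the two `alpha` values are illustrative assumptions): a small `alpha` keeps the informative features, while a large enough `alpha` sets every coefficient to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize features and target
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

coef_small = Lasso(alpha=0.02).fit(X, y).coef_
coef_large = Lasso(alpha=1.5).fit(X, y).coef_

print(np.round(coef_small, 3))
print(coef_large)
```

With standardized inputs, each feature's correlation with the target is at most 1 in absolute value, so any `alpha` above 1 removes every feature.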


### Elastic Net

1. Tune the elastic net parameters (`alpha` and `l1_ratio`) with a grid search
2. Fit the model and inspect the coefficients

In [71]:
# grid search hyperparameters for the elastic net
from sklearn.linear_model import ElasticNet
# define model
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, because scikit-learn maximizes scores)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.349
Config: {'alpha': 0.1, 'l1_ratio': 0.2}


In [72]:
from sklearn.linear_model import ElasticNet
eNet = ElasticNet(alpha = 0.1, l1_ratio = 0.2,random_state=1).fit(X_train, y_train)
prediction = eNet.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )

print(eNet.coef_)
print(eNet.intercept_)

pd.DataFrame(eNet.coef_, index= df_maternal_imputed.columns ).sort_values(0)

RMSE = 0.4779676914844199
R-square = 0.7805867844504955
[-4.21346729e-04 -7.03014655e-05 -1.31846456e-02 -0.00000000e+00
 -0.00000000e+00 -7.04205828e-03 -7.92101339e-02 -3.94482617e-03
 -1.44997261e-01 -7.83057119e-02 -1.09015082e-01  0.00000000e+00
  0.00000000e+00 -7.95602643e-02]
[52.18610006]


| feature | coefficient |
|---|---|
| safelySanitization | -0.144997 |
| uhcCoverage | -0.109015 |
| continent | -0.07956 |
| basicDrinkingWaterServices | -0.07921 |
| basicHandWashing | -0.078306 |
| birthAttendedBySkilledPersonal | -0.013185 |
| pharmacists | -0.007042 |
| atLeastBasicSanitizationServices | -0.003945 |
| Location | -0.000421 |
| Period | -7e-05 |
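
In scikit-learn, the elastic net penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`, so `l1_ratio` moves the model between ridge-like and Lasso-like behaviour. The sketch below, on synthetic data, checks the Lasso end of that range and fits the mix selected by the grid search above; the data is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data: only the first two of five features drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# l1_ratio=1.0 leaves only the L1 term, which is the Lasso penalty
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
# l1_ratio=0.2 puts most of the penalty on the squared L2 term
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.2).fit(X, y)

print(np.round(lasso.coef_, 3))
print(np.round(enet_l1.coef_, 3))
print(np.round(enet_mix.coef_, 3))
```

A lower `l1_ratio` weakens the L1 term, which is consistent with the elastic net above keeping small non-zero weights for `pharmacists` and `atLeastBasicSanitizationServices`, features that Lasso set to zero.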


## Infant mortality

In [73]:
df_infant = pd.merge(infantMortalityRate_s, df_medicine, on = ['Location','Period'],how='outer')
df_infant = pd.merge(df_infant, df_hygiene, on = ['Location','Period'],how='outer')
df_infant = pd.merge(df_infant, df_finance, on = ['Location','Period'],how='outer')
df_infant.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10494 entries, 0 to 10493
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Location                          10494 non-null  object        
 1   Period                            10494 non-null  datetime64[ns]
 2   infantMortalityRate               10000 non-null  float64       
 3   birthAttendedBySkilledPersonal    1755 non-null   float64       
 4   medicalDoctors                    2506 non-null   float64       
 5   nursingAndMidwife                 2587 non-null   float64       
 6   pharmacists                       1795 non-null   float64       
 7   basicDrinkingWaterServices        3455 non-null   float64       
 8   atLeastBasicSanitizationServices  3439 non-null   float64       
 9   safelySanitization                1571 non-null   float64       
 10  basicHandWashing                  921 non-null

In [74]:
# Rename some countries so that pycountry_convert can resolve them
df_infant.loc[df_infant['Location'] == "Sudan (until 2011)", 'Location'] = 'Sudan'
df_infant.loc[df_infant['Location'] == "Bolivia (Plurinational State of)", 'Location'] = 'Bolivia, Plurinational State of'
df_infant.loc[df_infant['Location'] == "Côte d’Ivoire", 'Location'] = 'Ivory Coast'
df_infant.loc[df_infant['Location'] == "Iran (Islamic Republic of)", 'Location'] = 'Iran, Islamic Republic of'
df_infant.loc[df_infant['Location'] == "Micronesia (Federated States of)", 'Location'] = 'Micronesia'
df_infant.loc[df_infant['Location'] == "Republic of Korea", 'Location'] = 'Korea, Republic of'
df_infant.loc[df_infant['Location'] == "The former Yugoslav Republic of Macedonia", 'Location'] = 'North Macedonia'
df_infant.loc[df_infant['Location'] == "Venezuela (Bolivarian Republic of)", 'Location'] = 'Venezuela, Bolivarian Republic of'
df_infant.loc[df_infant['Location'] == "Germany, Federal Republic (former)", 'Location'] = 'Germany'
df_infant.loc[df_infant['Location'] == "India (until 1975)", 'Location'] = 'India'
df_infant.loc[df_infant['Location'] == "Kiribati (until 1984)", 'Location'] = 'Kiribati'
df_infant.loc[df_infant['Location'] == "South Viet Nam (former)", 'Location'] = 'Viet Nam'
df_infant.loc[df_infant['Location'] == 'Yemen Arab Republic (until 1990)', 'Location'] = 'Yemen'
df_infant.loc[df_infant['Location'] == 'State of Palestine', 'Location'] = 'Palestine, State of'

df_infant['country_code'] = df_infant['Location'].apply(pc.country_name_to_country_alpha2)
df_infant['continent'] = df_infant['country_code'].apply(lambda x: convert_continent(x))
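
The chain of `.loc` assignments above can also be written as one lookup table, which keeps the renames in a single place when the same fix is reused for several frames. This is a minimal sketch on a toy frame: `NAME_FIXES` and `toy` are hypothetical names, and the dictionary holds only three of the pairs from the cell above.

```python
import pandas as pd

# Hypothetical lookup table: WHO spelling -> name used in the cell above.
NAME_FIXES = {
    "Sudan (until 2011)": "Sudan",
    "Republic of Korea": "Korea, Republic of",
    "Iran (Islamic Republic of)": "Iran, Islamic Republic of",
}

# Toy stand-in for df_infant.
toy = pd.DataFrame({"Location": ["Sudan (until 2011)", "Republic of Korea", "France"]})

# Series.replace swaps exact matches and leaves every other value untouched.
toy["Location"] = toy["Location"].replace(NAME_FIXES)
print(toy["Location"].tolist())
```

The same dictionary could then be applied to each mortality frame instead of repeating the assignments per section.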

In [75]:
# Interpolate missing data within country
country_list = df_infant['Location'].unique()

for country in country_list:
    # copy the slice so the in-place interpolation does not touch a view of df_infant
    df_country = df_infant[df_infant['Location'] == country].copy()
    df_country = df_country.sort_values('Period')
    for i in range(2, 14):
        df_country.iloc[:,i] = df_country.iloc[:,i].interpolate(method='linear', limit_direction='both')

    # drop this country's rows from the main frame
    df_infant = df_infant[df_infant['Location'] != country]
    # add the interpolated rows back (DataFrame.append was removed in pandas 2.0)
    df_infant = pd.concat([df_infant, df_country])
    
df_infant = df_infant.sort_index()
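
The per-country loop above can also be expressed with a grouped transform, which avoids rebuilding the frame once per country. The sketch below uses a toy frame with a single indicator column; `toy` and `value_cols` are hypothetical names. Note that `method='linear'` treats the rows as equally spaced, so it relies on the rows being sorted by `Period` first.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_infant: two countries, one indicator with gaps.
years_a = [2000, 2001, 2002, 2003, 2004]
years_b = [2000, 2001, 2002]
toy = pd.DataFrame({
    "Location": ["A"] * 5 + ["B"] * 3,
    "Period": pd.to_datetime([f"{y}-01-01" for y in years_a + years_b]),
    "medicalDoctors": [np.nan, 1.0, np.nan, 3.0, np.nan, 10.0, np.nan, 20.0],
})

value_cols = ["medicalDoctors"]  # in the notebook this would be the 12 indicator columns

# Sort so that interpolation runs in time order within each country.
toy = toy.sort_values(["Location", "Period"])

# Interpolate each column inside its own country group; limit_direction='both'
# also fills leading and trailing gaps with the nearest observed value.
toy[value_cols] = toy.groupby("Location")[value_cols].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
print(toy)
```

A country with no observations at all for an indicator stays empty after this step, which is why a second imputation pass is still needed.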

In [76]:
# Prepare data for KNNImputer
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_infant['Location'] = le.fit_transform(df_infant['Location'])
df_infant['continent'] = le.fit_transform(df_infant['continent'])

df_infant = df_infant.drop(['country_code'], axis=1)

# Convert dates to ordinal day numbers so the imputer receives numeric input
df_infant['Period'] = df_infant['Period'].apply(lambda x: x.toordinal())
column_names = df_infant.columns

In [77]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_infant_imputed = pd.DataFrame(imputer.fit_transform(df_infant), columns = column_names) 
df_infant_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10494 entries, 0 to 10493
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Location                          10494 non-null  float64
 1   Period                            10494 non-null  float64
 2   infantMortalityRate               10494 non-null  float64
 3   birthAttendedBySkilledPersonal    10494 non-null  float64
 4   medicalDoctors                    10494 non-null  float64
 5   nursingAndMidwife                 10494 non-null  float64
 6   pharmacists                       10494 non-null  float64
 7   basicDrinkingWaterServices        10494 non-null  float64
 8   atLeastBasicSanitizationServices  10494 non-null  float64
 9   safelySanitization                10494 non-null  float64
 10  basicHandWashing                  10494 non-null  float64
 11  uhcCoverage                       10494 non-null  float64
 12  popu
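
`KNNImputer` picks neighbours with a NaN-aware Euclidean distance, so a column with large raw values, such as the ordinal `Period`, can dominate the choice of neighbours when the data are imputed before scaling. The sketch below shows one possible alternative on toy data: scale first, impute, then map back to the original units. The frame `toy` and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in: an ordinal date column plus two indicators with gaps.
toy = pd.DataFrame({
    "Period": [730000.0, 730365.0, 730730.0, 731095.0, 731460.0, 731825.0],
    "uhcCoverage": [40.0, 42.0, np.nan, 46.0, 48.0, np.nan],
    "pharmacists": [1.0, np.nan, 1.4, 1.6, np.nan, 2.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)  # NaNs are ignored when fitting and kept in the output

# Impute in the scaled space so every column contributes comparably to the distance.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map the result back to the original units.
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=toy.columns)
print(imputed.round(2))
```

Whether this changes the notebook's results is not tested here; it only illustrates the ordering of the two steps.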

In [78]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numerical_cols = ['medicalDoctors', 'nursingAndMidwife', 'pharmacists', 'basicDrinkingWaterServices','atLeastBasicSanitizationServices', 'safelySanitization','basicHandWashing', 'uhcCoverage',
       'population10SDG', 'population25SDG', 'continent']

Y = np.array(df_infant_imputed['infantMortalityRate']).reshape(-1, 1)
scaler = StandardScaler()
Y = scaler.fit_transform(Y)

df_infant_imputed.pop('infantMortalityRate')
X = df_infant_imputed
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

# Random 75/25 train/validation split (no stratification: the target is continuous)
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, train_size=0.75)
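
The cell above fits the scalers on the full data set before splitting, so statistics from the validation rows leak into training, and the scaler fitted on `Y` is overwritten when it is refitted on `X`. A possible alternative, sketched here on synthetic data, is to put the scaling inside a `Pipeline` and wrap the target scaling in `TransformedTargetRegressor`, so both are learned from the training rows only. All variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the imputed frame.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 50.0 - 8.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1] + rng.normal(scale=1.0, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.75, random_state=0)

# Feature scaling lives in the pipeline; target scaling lives in the wrapper.
model = TransformedTargetRegressor(
    regressor=Pipeline([("scale", StandardScaler()), ("lasso", Lasso(alpha=0.01))]),
    transformer=StandardScaler(),
)
model.fit(X_tr, y_tr)      # scalers are fitted on the training rows only
pred = model.predict(X_va) # predictions come back in the original target units

score = r2_score(y_va, pred)
print(f"R-square = {score:.3f}")
```

With this layout the reported errors are in the target's own units rather than in standard deviations.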

### Lasso

1. Tune the Lasso alpha parameter with a cross-validated grid search
2. Fit the model and check the coefficients

In [79]:
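# grid search hyperparameters for lasso regression
from numpy import arange
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# define model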
model = Lasso()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, so the printed value is negative)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.348
Config: {'alpha': 0.01}
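
The negative MAE above comes from the `neg_mean_absolute_error` scorer, which negates the error so that larger is better. The sketch below repeats the search on synthetic data and flips the sign before printing. It also starts the grid at 0.01, since scikit-learn advises against `alpha=0` with `Lasso`. The data and the smaller cross-validation setup are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Synthetic data standing in for X_train / y_train.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(120, 4))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 2] + rng.normal(scale=0.3, size=120)

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
grid = {"alpha": np.arange(0.01, 1.0, 0.01)}  # skip alpha=0, which is plain least squares

search = GridSearchCV(Lasso(), grid, scoring="neg_mean_absolute_error", cv=cv)
search.fit(X_toy, y_toy)

# The scorer returns -MAE, so negate it to report the error itself.
mae = -search.best_score_
print(f"MAE: {mae:.3f}")
print(f"Config: {search.best_params_}")
```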


In [80]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

clf = Lasso(alpha=0.01).fit(X_train, y_train)
prediction = clf.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )
print(clf.coef_)
print(clf.intercept_)
# label the coefficients with the columns of the feature matrix used for this fit
pd.DataFrame(clf.coef_, index=X.columns).sort_values(0)

RMSE = 0.47719266273608146
R-square = 0.7771039182961905
[-5.17577735e-04 -7.53710542e-05 -1.14313643e-02 -0.00000000e+00
 -1.88189590e-03 -3.01164268e-02 -7.90285828e-02 -1.38810032e-02
 -1.64082538e-01 -3.04603383e-02 -1.53630102e-01  9.16160337e-03
  0.00000000e+00 -7.02715878e-02]
[55.73468178]


Feature,Coefficient
safelySanitization,-0.164083
uhcCoverage,-0.15363
basicDrinkingWaterServices,-0.079029
continent,-0.070272
basicHandWashing,-0.03046
pharmacists,-0.030116
atLeastBasicSanitizationServices,-0.013881
birthAttendedBySkilledPersonal,-0.011431
nursingAndMidwife,-0.001882
Location,-0.000518
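
The table above sorts by signed value, which lists the most negative coefficients first. To rank features by strength regardless of sign, and to see which ones Lasso removed entirely, the coefficients can be sorted by absolute value. This sketch uses synthetic data; the column names are borrowed from the notebook, but the values and the `noise` column are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic features: two carry signal, two do not.
rng = np.random.default_rng(2)
X_toy = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["uhcCoverage", "safelySanitization", "pharmacists", "noise"],
)
y_toy = (-1.5 * X_toy["safelySanitization"]
         - 0.5 * X_toy["uhcCoverage"]
         + rng.normal(scale=0.1, size=300))

clf = Lasso(alpha=0.1).fit(X_toy, y_toy)

# Label each coefficient with its own feature name.
coefs = pd.Series(clf.coef_, index=X_toy.columns)

# Rank by magnitude, keeping the sign in the displayed values.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# Features whose coefficient Lasso shrank exactly to zero.
dropped = coefs[coefs == 0].index.tolist()

print(ranked)
print("Dropped by Lasso:", dropped)
```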


### Elastic Net

1. Tune the Elastic Net alpha and l1_ratio parameters with a cross-validated grid search
2. Fit the model and check the coefficients

In [81]:
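# grid search hyperparameters for the elastic net
from numpy import arange
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# define model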
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, so the printed value is negative)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.347
Config: {'alpha': 0.1, 'l1_ratio': 0.05}


In [82]:
from sklearn.linear_model import ElasticNet
eNet = ElasticNet(alpha = 0.1, l1_ratio = 0.05,random_state=1).fit(X_train, y_train)
prediction = eNet.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )

print(eNet.coef_)
print(eNet.intercept_)

# label the coefficients with the columns of the feature matrix used for this fit
pd.DataFrame(eNet.coef_, index=X.columns).sort_values(0)

RMSE = 0.4778669104709998
R-square = 0.7764735927952703
[-5.29800755e-04 -7.45010832e-05 -1.17124605e-02 -8.81410657e-03
 -1.07934217e-02 -3.69947881e-02 -7.93541979e-02 -4.42401754e-02
 -1.36014294e-01 -3.61830223e-02 -1.15560935e-01  9.49902859e-03
  6.05179892e-03 -7.00511111e-02]
[55.12693623]


Feature,Coefficient
safelySanitization,-0.136014
uhcCoverage,-0.115561
basicDrinkingWaterServices,-0.079354
continent,-0.070051
atLeastBasicSanitizationServices,-0.04424
pharmacists,-0.036995
basicHandWashing,-0.036183
birthAttendedBySkilledPersonal,-0.011712
nursingAndMidwife,-0.010793
medicalDoctors,-0.008814


## Neonatal mortality

In [83]:
df_neonatal = pd.merge(neonatalMortalityRate_s, df_medicine, on = ['Location','Period'],how='outer')
df_neonatal = pd.merge(df_neonatal, df_hygiene, on = ['Location','Period'],how='outer')
df_neonatal = pd.merge(df_neonatal, df_finance, on = ['Location','Period'],how='outer')
df_neonatal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9163 entries, 0 to 9162
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Location                          9163 non-null   object        
 1   Period                            9163 non-null   datetime64[ns]
 2   neonatalMortalityRate             9135 non-null   float64       
 3   birthAttendedBySkilledPersonal    1755 non-null   float64       
 4   medicalDoctors                    2506 non-null   float64       
 5   nursingAndMidwife                 2587 non-null   float64       
 6   pharmacists                       1795 non-null   float64       
 7   basicDrinkingWaterServices        3455 non-null   float64       
 8   atLeastBasicSanitizationServices  3439 non-null   float64       
 9   safelySanitization                1571 non-null   float64       
 10  basicHandWashing                  921 non-null  

In [84]:
# Rename locations whose WHO spelling pycountry_convert does not recognise
df_neonatal.loc[df_neonatal['Location'] == "Sudan (until 2011)", 'Location'] = 'Sudan'
df_neonatal.loc[df_neonatal['Location'] == "Bolivia (Plurinational State of)", 'Location'] = 'Bolivia, Plurinational State of'
df_neonatal.loc[df_neonatal['Location'] == "Côte d’Ivoire", 'Location'] = 'Ivory Coast'
df_neonatal.loc[df_neonatal['Location'] == "Iran (Islamic Republic of)", 'Location'] = 'Iran, Islamic Republic of'
df_neonatal.loc[df_neonatal['Location'] == "Micronesia (Federated States of)", 'Location'] = 'Micronesia'
df_neonatal.loc[df_neonatal['Location'] == "Republic of Korea", 'Location'] = 'Korea, Republic of'
df_neonatal.loc[df_neonatal['Location'] == "The former Yugoslav Republic of Macedonia", 'Location'] = 'North Macedonia'
df_neonatal.loc[df_neonatal['Location'] == "Venezuela (Bolivarian Republic of)", 'Location'] = 'Venezuela, Bolivarian Republic of'
df_neonatal.loc[df_neonatal['Location'] == "Germany, Federal Republic (former)", 'Location'] = 'Germany'
df_neonatal.loc[df_neonatal['Location'] == "India (until 1975)", 'Location'] = 'India'
df_neonatal.loc[df_neonatal['Location'] == "Kiribati (until 1984)", 'Location'] = 'Kiribati'
df_neonatal.loc[df_neonatal['Location'] == "South Viet Nam (former)", 'Location'] = 'Viet Nam'
df_neonatal.loc[df_neonatal['Location'] == 'Yemen Arab Republic (until 1990)', 'Location'] = 'Yemen'
df_neonatal.loc[df_neonatal['Location'] == 'State of Palestine', 'Location'] = 'Palestine, State of'

df_neonatal['country_code'] = df_neonatal['Location'].apply(pc.country_name_to_country_alpha2)
df_neonatal['continent'] = df_neonatal['country_code'].apply(lambda x: convert_continent(x))

In [85]:
# Interpolate missing data within country
country_list = df_neonatal['Location'].unique()

for country in country_list:
    # copy the slice so the in-place interpolation does not touch a view of df_neonatal
    df_country = df_neonatal[df_neonatal['Location'] == country].copy()
    df_country = df_country.sort_values('Period')
    for i in range(2, 14):
        df_country.iloc[:,i] = df_country.iloc[:,i].interpolate(method='linear', limit_direction='both')

    # drop this country's rows from the main frame
    df_neonatal = df_neonatal[df_neonatal['Location'] != country]
    # add the interpolated rows back (DataFrame.append was removed in pandas 2.0)
    df_neonatal = pd.concat([df_neonatal, df_country])
    
df_neonatal = df_neonatal.sort_index()

In [86]:
# Prepare data for KNNImputer
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_neonatal['Location'] = le.fit_transform(df_neonatal['Location'])
df_neonatal['continent'] = le.fit_transform(df_neonatal['continent'])

df_neonatal = df_neonatal.drop(['country_code'], axis=1)

# Convert dates to ordinal day numbers so the imputer receives numeric input
df_neonatal['Period'] = df_neonatal['Period'].apply(lambda x: x.toordinal())
column_names = df_neonatal.columns

In [87]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_neonatal_imputed = pd.DataFrame(imputer.fit_transform(df_neonatal), columns = column_names) 
df_neonatal_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9163 entries, 0 to 9162
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Location                          9163 non-null   float64
 1   Period                            9163 non-null   float64
 2   neonatalMortalityRate             9163 non-null   float64
 3   birthAttendedBySkilledPersonal    9163 non-null   float64
 4   medicalDoctors                    9163 non-null   float64
 5   nursingAndMidwife                 9163 non-null   float64
 6   pharmacists                       9163 non-null   float64
 7   basicDrinkingWaterServices        9163 non-null   float64
 8   atLeastBasicSanitizationServices  9163 non-null   float64
 9   safelySanitization                9163 non-null   float64
 10  basicHandWashing                  9163 non-null   float64
 11  uhcCoverage                       9163 non-null   float64
 12  popula

In [88]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numerical_cols = ['medicalDoctors', 'nursingAndMidwife', 'pharmacists', 'basicDrinkingWaterServices','atLeastBasicSanitizationServices', 'safelySanitization','basicHandWashing', 'uhcCoverage',
       'population10SDG', 'population25SDG', 'continent']

Y = np.array(df_neonatal_imputed['neonatalMortalityRate']).reshape(-1, 1)
scaler = StandardScaler()
Y = scaler.fit_transform(Y)

df_neonatal_imputed.pop('neonatalMortalityRate')
X = df_neonatal_imputed
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

# Random 75/25 train/validation split (no stratification: the target is continuous)
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, train_size=0.75)

### Lasso

1. Tune the Lasso alpha parameter with a cross-validated grid search
2. Fit the model and check the coefficients

In [89]:
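# grid search hyperparameters for lasso regression
from numpy import arange
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# define model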
model = Lasso()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, so the printed value is negative)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.345
Config: {'alpha': 0.01}


In [90]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

clf = Lasso(alpha=0.01).fit(X_train, y_train)
prediction = clf.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )
print(clf.coef_)
print(clf.intercept_)
# label the coefficients with the columns of the feature matrix used for this fit
pd.DataFrame(clf.coef_, index=X.columns).sort_values(0)

RMSE = 0.45946090534016854
R-square = 0.7801054837959218
[-4.67526119e-04 -7.24475752e-05 -1.47570362e-02 -0.00000000e+00
 -2.01538865e-02 -6.14934228e-02  5.34979466e-03 -1.10719978e-01
 -9.28622586e-02  0.00000000e+00 -1.65245444e-01  4.10363356e-02
  0.00000000e+00 -3.65895307e-02]
[53.98710298]


Feature,Coefficient
uhcCoverage,-0.165245
atLeastBasicSanitizationServices,-0.11072
safelySanitization,-0.092862
pharmacists,-0.061493
continent,-0.03659
nursingAndMidwife,-0.020154
birthAttendedBySkilledPersonal,-0.014757
Location,-0.000468
Period,-7.2e-05
medicalDoctors,-0.0


### Elastic Net

1. Tune the Elastic Net alpha and l1_ratio parameters with a cross-validated grid search
2. Fit the model and check the coefficients

In [91]:
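# grid search hyperparameters for the elastic net
from numpy import arange
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# define model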
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, so the printed value is negative)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.345
Config: {'alpha': 0.01, 'l1_ratio': 0.99}


In [93]:
# from sklearn.linear_model import ElasticNet
eNet = ElasticNet(alpha = 0.01, l1_ratio = 0.99,random_state=1).fit(X_train, y_train)
prediction = eNet.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )

print(eNet.coef_)
print(eNet.intercept_)

# label the coefficients with the columns of the feature matrix used for this fit
pd.DataFrame(eNet.coef_, index=X.columns).sort_values(0)

RMSE = 0.4594216032385983
R-square = 0.7801431015643978
[-4.66621656e-04 -7.24517182e-05 -1.47477489e-02 -0.00000000e+00
 -2.02018771e-02 -6.15611946e-02  6.04429171e-03 -1.11198321e-01
 -9.30415283e-02  0.00000000e+00 -1.65374437e-01  4.11247767e-02
  0.00000000e+00 -3.67321170e-02]
[53.98927811]


Feature,Coefficient
uhcCoverage,-0.165374
atLeastBasicSanitizationServices,-0.111198
safelySanitization,-0.093042
pharmacists,-0.061561
continent,-0.036732
nursingAndMidwife,-0.020202
birthAttendedBySkilledPersonal,-0.014748
Location,-0.000467
Period,-7.2e-05
medicalDoctors,-0.0
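
The neonatal grid search selects `l1_ratio = 0.99` at `alpha = 0.01`, so the Elastic Net penalty is almost purely L1. That is consistent with the Elastic Net coefficients above being nearly identical to the Lasso ones. In scikit-learn, `l1_ratio = 1` reduces the Elastic Net penalty to the Lasso penalty. The sketch below checks this on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data with a sparse true coefficient vector.
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.2, size=200)

lasso = Lasso(alpha=0.01).fit(X_toy, y_toy)
enet_l1 = ElasticNet(alpha=0.01, l1_ratio=1.0).fit(X_toy, y_toy)    # pure L1 penalty
enet_099 = ElasticNet(alpha=0.01, l1_ratio=0.99).fit(X_toy, y_toy)  # 1% of the weight on the L2 term

# With l1_ratio=1 the two estimators solve the same problem.
same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)

# With l1_ratio=0.99 the coefficients differ only slightly.
max_gap = np.max(np.abs(lasso.coef_ - enet_099.coef_))

print("ElasticNet(l1_ratio=1) matches Lasso:", same_as_lasso)
print("Largest coefficient gap at l1_ratio=0.99:", max_gap)
```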


## Under 5 mortality

In [94]:
df_under5 = pd.merge(under5MortalityRate_s, df_medicine, on = ['Location','Period'],how='outer')
df_under5 = pd.merge(df_under5, df_hygiene, on = ['Location','Period'],how='outer')
df_under5 = pd.merge(df_under5, df_finance, on = ['Location','Period'],how='outer')
df_under5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10494 entries, 0 to 10493
Data columns (total 14 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Location                          10494 non-null  object        
 1   Period                            10494 non-null  datetime64[ns]
 2   under5MortalityRate               10000 non-null  float64       
 3   birthAttendedBySkilledPersonal    1755 non-null   float64       
 4   medicalDoctors                    2506 non-null   float64       
 5   nursingAndMidwife                 2587 non-null   float64       
 6   pharmacists                       1795 non-null   float64       
 7   basicDrinkingWaterServices        3455 non-null   float64       
 8   atLeastBasicSanitizationServices  3439 non-null   float64       
 9   safelySanitization                1571 non-null   float64       
 10  basicHandWashing                  921 non-null

In [95]:
# Rename locations whose WHO spelling pycountry_convert does not recognise
df_under5.loc[df_under5['Location'] == "Sudan (until 2011)", 'Location'] = 'Sudan'
df_under5.loc[df_under5['Location'] == "Bolivia (Plurinational State of)", 'Location'] = 'Bolivia, Plurinational State of'
df_under5.loc[df_under5['Location'] == "Côte d’Ivoire", 'Location'] = 'Ivory Coast'
df_under5.loc[df_under5['Location'] == "Iran (Islamic Republic of)", 'Location'] = 'Iran, Islamic Republic of'
df_under5.loc[df_under5['Location'] == "Micronesia (Federated States of)", 'Location'] = 'Micronesia'
df_under5.loc[df_under5['Location'] == "Republic of Korea", 'Location'] = 'Korea, Republic of'
df_under5.loc[df_under5['Location'] == "The former Yugoslav Republic of Macedonia", 'Location'] = 'North Macedonia'
df_under5.loc[df_under5['Location'] == "Venezuela (Bolivarian Republic of)", 'Location'] = 'Venezuela, Bolivarian Republic of'
df_under5.loc[df_under5['Location'] == "Germany, Federal Republic (former)", 'Location'] = 'Germany'
df_under5.loc[df_under5['Location'] == "India (until 1975)", 'Location'] = 'India'
df_under5.loc[df_under5['Location'] == "Kiribati (until 1984)", 'Location'] = 'Kiribati'
df_under5.loc[df_under5['Location'] == "South Viet Nam (former)", 'Location'] = 'Viet Nam'
df_under5.loc[df_under5['Location'] == 'Yemen Arab Republic (until 1990)', 'Location'] = 'Yemen'
df_under5.loc[df_under5['Location'] == 'State of Palestine', 'Location'] = 'Palestine, State of'

df_under5['country_code'] = df_under5['Location'].apply(pc.country_name_to_country_alpha2)
df_under5['continent'] = df_under5['country_code'].apply(lambda x: convert_continent(x))

In [96]:
# Interpolate missing data within country
country_list = df_under5['Location'].unique()

for country in country_list:
    # copy the slice so the in-place interpolation does not touch a view of df_under5
    df_country = df_under5[df_under5['Location'] == country].copy()
    df_country = df_country.sort_values('Period')
    for i in range(2, 14):
        df_country.iloc[:,i] = df_country.iloc[:,i].interpolate(method='linear', limit_direction='both')

    # drop this country's rows from the main frame
    df_under5 = df_under5[df_under5['Location'] != country]
    # add the interpolated rows back (DataFrame.append was removed in pandas 2.0)
    df_under5 = pd.concat([df_under5, df_country])
    
df_under5 = df_under5.sort_index()

In [97]:
# Prepare data for KNNImputer
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_under5['Location'] = le.fit_transform(df_under5['Location'])
df_under5['continent'] = le.fit_transform(df_under5['continent'])

df_under5 = df_under5.drop(['country_code'], axis=1)

# Convert dates to ordinal day numbers so the imputer receives numeric input
df_under5['Period'] = df_under5['Period'].apply(lambda x: x.toordinal())
column_names = df_under5.columns

In [98]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_under5_imputed = pd.DataFrame(imputer.fit_transform(df_under5), columns = column_names) 
df_under5_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10494 entries, 0 to 10493
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Location                          10494 non-null  float64
 1   Period                            10494 non-null  float64
 2   under5MortalityRate               10494 non-null  float64
 3   birthAttendedBySkilledPersonal    10494 non-null  float64
 4   medicalDoctors                    10494 non-null  float64
 5   nursingAndMidwife                 10494 non-null  float64
 6   pharmacists                       10494 non-null  float64
 7   basicDrinkingWaterServices        10494 non-null  float64
 8   atLeastBasicSanitizationServices  10494 non-null  float64
 9   safelySanitization                10494 non-null  float64
 10  basicHandWashing                  10494 non-null  float64
 11  uhcCoverage                       10494 non-null  float64
 12  popu

In [99]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

numerical_cols = ['medicalDoctors', 'nursingAndMidwife', 'pharmacists', 'basicDrinkingWaterServices','atLeastBasicSanitizationServices', 'safelySanitization','basicHandWashing', 'uhcCoverage',
       'population10SDG', 'population25SDG', 'continent']

Y = np.array(df_under5_imputed['under5MortalityRate']).reshape(-1, 1)
scaler = StandardScaler()
Y = scaler.fit_transform(Y)

df_under5_imputed.pop('under5MortalityRate')
X = df_under5_imputed
X[numerical_cols] = scaler.fit_transform(X[numerical_cols])

# Random 75/25 train/validation split (no stratification: the target is continuous)
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, train_size=0.75)

### Lasso

1. Tune the Lasso alpha parameter with a cross-validated grid search
2. Fit the model and check the coefficients

In [100]:
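# grid search hyperparameters for lasso regression
from numpy import arange
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# define model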
model = Lasso()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, so the printed value is negative)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.357
Config: {'alpha': 0.03}


In [101]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

clf = Lasso(alpha=0.03).fit(X_train, y_train)
prediction = clf.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )
print(clf.coef_)
print(clf.intercept_)
# label the coefficients with the columns of the feature matrix used for this fit
pd.DataFrame(clf.coef_, index=X.columns).sort_values(0)

RMSE = 0.4505552164495745
R-square = 0.7814616861595137
[-3.55123424e-04 -7.07348160e-05 -1.39618282e-02 -0.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -6.67705988e-02 -0.00000000e+00
 -1.55838854e-01 -7.69350302e-02 -1.15532580e-01  0.00000000e+00
  0.00000000e+00 -7.66053996e-02]
[52.56543187]


Feature,Coefficient
safelySanitization,-0.155839
uhcCoverage,-0.115533
basicHandWashing,-0.076935
continent,-0.076605
basicDrinkingWaterServices,-0.066771
birthAttendedBySkilledPersonal,-0.013962
Location,-0.000355
Period,-7.1e-05
medicalDoctors,-0.0
nursingAndMidwife,-0.0


### Elastic Net

1. Tune the Elastic Net alpha and l1_ratio parameters with a cross-validated grid search
2. Fit the model and check the coefficients

In [102]:
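# grid search hyperparameters for the elastic net
from numpy import arange
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# define model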
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train, y_train)
# summarize (best_score_ is the negated MAE, so the printed value is negative)
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -0.357
Config: {'alpha': 0.1, 'l1_ratio': 0.21}


In [104]:
# from sklearn.linear_model import ElasticNet
eNet = ElasticNet(alpha = 0.1, l1_ratio = 0.21,random_state=1).fit(X_train, y_train)
prediction = eNet.predict(X_valid)
print(f"RMSE = {np.sqrt(mean_squared_error(y_valid,prediction))}" ) 
print(f"R-square = {r2_score(y_valid, prediction)}" )

print(eNet.coef_)
print(eNet.intercept_)

# label the coefficients with the columns of the feature matrix used for this fit
pd.DataFrame(eNet.coef_, index=X.columns).sort_values(0)

RMSE = 0.45029643607036635
R-square = 0.7817126528986922
[-3.72569250e-04 -7.06127696e-05 -1.38608834e-02 -0.00000000e+00
 -0.00000000e+00 -1.23130915e-03 -7.64843121e-02 -2.78950732e-03
 -1.46012685e-01 -8.10793261e-02 -1.09819433e-01  0.00000000e+00
  0.00000000e+00 -7.83517901e-02]
[52.47022453]


Feature,Coefficient
safelySanitization,-0.146013
uhcCoverage,-0.109819
basicHandWashing,-0.081079
continent,-0.078352
basicDrinkingWaterServices,-0.076484
birthAttendedBySkilledPersonal,-0.013861
atLeastBasicSanitizationServices,-0.00279
pharmacists,-0.001231
Location,-0.000373
Period,-7.1e-05


# Conclusion

1. Maternal mortality - Safely managed sanitation, UHC coverage, and basic hand washing are the factors most strongly associated with lower maternal mortality.
2. Infant mortality - Safely managed sanitation, UHC coverage, and basic drinking water services are the factors most strongly associated with lower infant mortality.
3. Neonatal mortality - UHC coverage, at least basic sanitation services, safely managed sanitation, and the number of pharmacists are the factors most strongly associated with lower neonatal mortality.
4. Under-5 mortality - Safely managed sanitation, UHC coverage, basic drinking water services, and basic hand washing are the factors most strongly associated with lower under-5 mortality.
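
The ranking behind these points can be reproduced from the coefficient tables. The sketch below copies the four largest Lasso coefficients for the infant, neonatal, and under-5 models from the outputs above and lists the top three by magnitude for each. The maternal model is left out because its table is not reproduced in this part of the notebook, and `lasso_coefs` and `top3` are hypothetical names.

```python
import pandas as pd

# Lasso coefficients copied from the tables printed above (largest entries only).
lasso_coefs = {
    "infant": {
        "safelySanitization": -0.164083,
        "uhcCoverage": -0.15363,
        "basicDrinkingWaterServices": -0.079029,
        "continent": -0.070272,
    },
    "neonatal": {
        "uhcCoverage": -0.165245,
        "atLeastBasicSanitizationServices": -0.11072,
        "safelySanitization": -0.092862,
        "pharmacists": -0.061493,
    },
    "under5": {
        "safelySanitization": -0.155839,
        "uhcCoverage": -0.115533,
        "basicHandWashing": -0.076935,
        "continent": -0.076605,
    },
}

# For each target, keep the three features with the largest absolute coefficient.
top3 = {
    target: pd.Series(coefs).abs().sort_values(ascending=False).head(3).index.tolist()
    for target, coefs in lasso_coefs.items()
}

for target, features in top3.items():
    print(target, "->", ", ".join(features))
```

These coefficients describe associations in a regularized linear model fitted on heavily imputed data, so they indicate where to look rather than establish causes.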