## Introduction ##

For this study (done for an online data science course), I generated a data set of features that may affect national life expectancy,
based on information from the [WHO (World Health Organization)][who]. The data come from the
[GHO (Global Health Observatory)][whodb] and the [UNESCO Databases of Resources on Education][unesco_ed].

The data set is similar (but not identical) to a previously used [dataset on Kaggle][kag_ds].
However, that Kaggle set contains bad data, and it is not clear how much (if any) of the information is valid.
A comparable set built from reliable data would be more useful.

[who]: https://www.who.int
[whodb]: https://www.who.int/gho/database/en/
[unesco_ed]: https://en.unesco.org/themes/education/databases
[kag_ds]: https://www.kaggle.com/kumarajarshi/life-expectancy-who


This is a list of the features in the data set.

|Field|Description|
|---:|:---|
|country|Country|
|country_code|Three-letter identifier of a country|
|region|Global region of the country|
|year|Year|
|life_expect|Life expectancy at birth (years)|
|life_exp60|Life expectancy at age 60 (years)|
|adult_mortality|Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)|
|infant_mort|Death rate up to age 1|
|age1-4mort|Death rate between ages 1 and 4|
|alcohol|Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)|
|bmi|Mean BMI (kg/m^2) (18+) (age-standardized estimate)|
|age5-19thinness|Prevalence of thinness among children and adolescents, BMI \< (median - 2 s.d.) (crude estimate) (%)|
|age5-19obesity|Prevalence of obesity among children and adolescents, BMI \> (median + 2 s.d.) (crude estimate) (%)|
|hepatitis|Hepatitis B (HepB) immunization coverage among 1-year-olds (%)|
|measles|Measles-containing-vaccine first-dose (MCV1) immunization coverage among 1-year-olds (%)|
|polio|Polio (Pol3) immunization coverage among 1-year-olds (%)|
|diphtheria|Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)|
|basic_water|Population using at least basic drinking-water services|
|doctors|Medical doctors (per 10,000)|
|hospitals|Total density per 100 000 population: Hospitals|
|gni_capita|Gross national income per capita (PPP int. $)|
|gghe-d|Domestic general government health expenditure (GGHE-D) as percentage of gross domestic product (GDP) (%)|
|che_gdp|Current health expenditure (CHE) as percentage of gross domestic product (GDP) (%)|
|une_pop|Population (thousands)|
|une_infant|Mortality rate, infant (per 1,000 live births)|
|une_life|Life expectancy at birth, total (years)|
|une_hiv|Prevalence of HIV, total (\% of population ages 15-49)|
|une_gni|GNI per capita, PPP (current international \$)|
|une_poverty|Poverty headcount ratio at \$1.90 a day (PPP) (\% of population)|
|une_edu_spend|Government expenditure on education as a percentage of GDP (\%)|
|une_literacy|Adult literacy rate, population 15+ years, both sexes (\%)|
|une_school|Mean years of schooling (ISCED 1 or higher), population 25+ years, both sexes|

The feature names that start with "une_" are from the UNESCO database. The other features are from the GHO database.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import statsmodels.api as sm

from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn import linear_model, ensemble
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import time
%matplotlib inline

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# suppress warnings about "value is trying to be set on a copy of a slice from a DataFrame"
pd.options.mode.chained_assignment = None  # default='warn'

# function to visualize the predictions compared to the test set
# also print out several fit metrics
# only using RMS error, but including other metrics to be on the safe side

def fitter_metrics(y_actual, y_pred):
    plt.scatter(y_actual, y_pred)
    plt.plot(y_actual, y_actual, color="red")
    plt.xlabel("true values")
    plt.ylabel("predicted values")
    plt.title("Life expectancy: true and predicted values")
    plt.show()

    print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_actual, y_pred)))
    print("Mean squared error of the prediction is: {}".format(mse(y_actual, y_pred)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_actual, y_pred)))
    print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_actual - y_pred) / y_actual)) * 100))

# using a random number seed for the train-test split
# set it here, in case we want to re-run with a different seed number
rand_seed = 173

# *** read in file from Kaggle ***
who_df = pd.read_csv('/kaggle/input/who-national-life-expectancy/who_life_exp.csv')

# *** otherwise read from local version of file ***
#who_df = pd.read_csv('who_life_exp.csv')

print('Data set has {} countries for the years {} to {}\n'.format(who_df['country_code'].nunique(),
                                                                  who_df['year'].min(),who_df['year'].max()))

print(who_df.info())

## Target Feature ##

Below is the life expectancy (at birth) vs. year for the 183 countries, grouped into regions.
The goal of this analysis is to find which features affect the national life expectancy.

Overall, life expectancy is gradually increasing for most countries.
Some countries show dips and drops.
I looked at two unusual country plots to see why they differ from the others.
(Other dips or odd trends were not investigated in this analysis.)

In the Eastern Mediterranean region, the plot for one country has the life expectancy over 70 until 2011,
when it dropped to around 60 and stayed there. The country is Syria, and the drop in life expectancy
reflects the [Syrian civil war][syria_war] that began in 2011 and is still going on.

[syria_war]: https://en.wikipedia.org/wiki/Syrian_civil_war

In [None]:
region_names = who_df['region'].unique()
fig, axs = plt.subplots(2, 3)
fig.set_size_inches(14.0, 8.0)
for ireg, region in enumerate(region_names):
    ix = ireg//3
    iy = ireg%3
    axs[ix, iy].set_title(region)
    temp_df = who_df[who_df['region'] == region]
    for country in temp_df['country'].unique():
        axs[ix, iy].plot(temp_df[temp_df['country']==country].year, temp_df[temp_df['country']==country].life_expect)
    axs[ix, iy].set_xlabel("year")
    axs[ix, iy].set_ylabel("life expectancy")
plt.tight_layout()
plt.show()

The other country is Haiti in the "Americas" region. Its data point for 2010 is
significantly lower than the years around it.
On 12 Jan 2010, a magnitude 7 earthquake struck Port-au-Prince and
destroyed most of the medical and treatment facilities. The loss of life that
year affected the mortality rates, and the life expectancy for the years after
2010 remained slightly lower than in 2009. (If you are interested in learning more
about Haiti and life expectancy for that country,
visit [The Borgen Project web site][haiti_borgen].)

I decided to remove the Haiti data point for 2010 as an outlier.
The models/fits used in this study predicted a life expectancy around 63 for that point,
about the same as 2009 and 2011. (The actual value was 36.2 for the year 2010.)

Although I was interested in making a model that would account for disasters,
the feature values for that year would need to be sensitive enough to show the effect.
For Haiti, the features for "basic water", number of hospitals, and
number of doctors for the year 2010 do not reflect the damage done by the earthquake.

[haiti_borgen]: https://borgenproject.org/top-10-facts-about-life-expectancy-in-haiti/

In [None]:
# This identified which country had the 2010 dip in the Americas plot
#who_df[(who_df['region'] == "Americas") & (who_df['life_expect'] < 65)].head(30)

# print(who_df[who_df['country_code'] == 'HTI'].head(20))

who_df = who_df[~((who_df['country_code'] == 'HTI') & (who_df['year'] == 2010))]

## Comment on Mortality Features ##

I have seen examples of notebooks that used "adult mortality" when fitting for life expectancy.

This should not be done, unless the study is about how the life expectancy was calculated.
- The life expectancy is calculated using the death rate at a particular age, and integrating the expectation for years lived over all possible years.
- The adult mortality is calculated using the death rate at a particular age, and integrating the number of deaths over a range of years.

It is no surprise that two quantities that are both calculated from yearly death rates are highly correlated.
With only 3 mortality features (infant, child, and adult), a linear fit can get an R<sup>2</sup> of 0.972 for life expectancy.
If we add quadratic terms to the fit function to account for non-linearity, the
R<sup>2</sup> increases to 0.985. (The predictions were consistently low for the lowest and highest
life expectancies if we don't include the quadratic terms.)

Those features are still interesting for other studies. Although it is beyond the scope of this analysis,
a future study could look to see if a low life expectancy is being driven by deaths in a particular age group,
or the different age group mortality rates are affected by different features.
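
To make the link between the two quantities concrete, here is a simplified life-table sketch. It is not the WHO methodology: it assumes a hypothetical vector `q` of one-year death probabilities for ages 0 to 100, assumes deaths fall at mid-year, and the function name `life_table_summary` is made up for this illustration. Both outputs are derived from the same survival curve, which is why they move together.

```python
import numpy as np

def life_table_summary(q):
    """Return (life expectancy at birth, adult mortality per 1000) from
    one-year death probabilities q[x] for ages x = 0, 1, ..., len(q) - 1.
    Simplified illustration only, not the WHO calculation."""
    q = np.asarray(q, dtype=float)
    # survivors[x] = fraction of a birth cohort still alive at exact age x
    survivors = np.concatenate(([1.0], np.cumprod(1.0 - q)))
    # person-years lived in each age interval, assuming deaths occur mid-year
    person_years = 0.5 * (survivors[:-1] + survivors[1:])
    life_expectancy = person_years.sum()
    # probability of dying between ages 15 and 60, per 1000 people alive at 15
    adult_mortality = 1000.0 * (1.0 - survivors[60] / survivors[15])
    return life_expectancy, adult_mortality

# two hypothetical populations with flat death rates; nobody survives past age 101
q_low = np.full(101, 0.01)
q_low[-1] = 1.0
q_high = np.full(101, 0.02)
q_high[-1] = 1.0

e_low, am_low = life_table_summary(q_low)
e_high, am_high = life_table_summary(q_high)
print("low death rates: ", e_low, am_low)
print("high death rates:", e_high, am_high)
```

Raising the death rates lowers the life expectancy and raises the adult mortality at the same time, so a fit of one against the other mostly recovers the arithmetic of the life table.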

In [None]:
# We haven't done any data cleaning, so make sure to remove null values
X = who_df[['life_expect', 'infant_mort', 'age1-4mort', 'adult_mortality']].copy()
X = X.dropna(axis=0)

Y = X['life_expect']
# take an explicit copy so the new quadratic columns are not added to a view of X
X_m = X[['adult_mortality', 'infant_mort', 'age1-4mort']].copy()
X_m['infant2'] = X_m['infant_mort'].pow(2)
X_m['youth2'] = X_m['age1-4mort'].pow(2)
X_m['adult2'] = X_m['adult_mortality'].pow(2)
X_m = sm.add_constant(X_m)
X_train, X_test, y_train, y_test = train_test_split(X_m, Y, test_size = 0.2, random_state = rand_seed)

# We fit an OLS model using statsmodels
results_ols = sm.OLS(y_train, X_train).fit()

# We print the summary results.
print(results_ols.summary())

In [None]:
# We are making predictions here
y_ols = results_ols.predict(X_test)

fitter_metrics(y_test, y_ols)

For simplicity, I will drop the UNESCO life expectancy feature, and use the GHO values.
I don't know enough about the methodologies, so I arbitrarily picked one version.

They are similar, but not identical. For example, the GHO data has a sharp drop for Haiti in 2010, but the UNESCO data does not.
Determining the reasons for those differences is beyond the scope of this analysis.

The other features that use mortality rates directly are removed from the data set.
That includes dropping the life expectancy at age 60.

In [None]:
plt.scatter(who_df["life_expect"], who_df["une_life"])
plt.plot(who_df["life_expect"], who_df["life_expect"], color="red")
plt.xlabel('GHO life expect')
plt.ylabel('UNESCO life expect')
plt.show()

del who_df['une_life']
del who_df['adult_mortality']
del who_df['infant_mort']
del who_df['age1-4mort']
del who_df['une_infant']
del who_df['life_exp60']

## Data Cleaning: missing values ##

For the initial examination of the database, I made some choices about which rows or columns to drop
due to missing values. My aim was to keep as many of the countries and features as possible. These will not
be the right choices for all analyses, but it is a starting point.

There are missing values for the vaccine features. Those are for Timor-Leste and the years 2000 and 2001.
Timor-Leste became the first new sovereign state of the 21st century in 2002. Given that historical information,
I will remove those years from that country.

For the "alcohol" feature, there are 70 missing values.
The following countries are missing data for a subset of years:
- South Sudan (SSD, all years)
- Sudan (SDN, 2000-2010)
- Serbia (SRB, 2000-2005)
- Montenegro (MNE, 2000-2005)
- Canada (CAN, 2000-2004)
- Afghanistan (AFG, 2000-2004)

South Sudan gained independence from Sudan in 2011. With the missing alcohol information and my lack
of knowledge about how the data before 2011 was generated, it will be easier to drop Sudan and
South Sudan from this study entirely.

The union of Serbia and Montenegro dissolved in 2006. Given that historical information, I will remove
the years 2000-2005 from those countries, rather than worry about what method was used to handle the
data in the years when those two countries were unified.

For Canada and Afghanistan, those two countries will be handled later, when we are using interpolation to fill in missing values in general.
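
The per-country counts of missing years can also be obtained with a single `groupby`. The sketch below uses a hypothetical three-year toy frame (`toy_df`, with made-up country codes and values) rather than the real data set.

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the data set: two countries, three years each
toy_df = pd.DataFrame({
    "country_code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "alcohol": [np.nan, np.nan, 1.5, 4.0, 4.2, 4.1],
})

# number of missing "alcohol" entries per country
missing_counts = toy_df["alcohol"].isnull().groupby(toy_df["country_code"]).sum()

# keep only the countries that are missing at least one year
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

The same pattern should work for any feature column, which makes it easy to decide whether a gap covers a few years (a candidate for interpolation) or a whole country (a candidate for dropping).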

In [None]:
clean_df = who_df.copy()

#print("NaN for polio")
#print(clean_df[clean_df['polio'].isnull()].head(50))

# missing vaccine information for 2000 and 2001
indices = clean_df[(clean_df['country_code'] == 'TLS') & (clean_df['year'] < 2002)].index
clean_df.drop(indices , inplace=True)

#print("NaN for alcohol")
#print(clean_df[clean_df['alcohol'].isnull()].head(50))

for country in clean_df['country_code'].unique():
    num_na = clean_df[clean_df['country_code'] == country]['alcohol'].isnull()
    if (num_na.any()):
        print("  for feature \"alcohol\" ",country," is missing data for", num_na.sum(),"years")

indices = clean_df[((clean_df['country_code'] == 'SSD') | (clean_df['country_code'] == 'SDN'))].index
clean_df.drop(indices , inplace=True)

indices = clean_df[((clean_df['country_code'] == 'SRB') | (clean_df['country_code'] == 'MNE')) &
                 (clean_df['year'] < 2006)].index
clean_df.drop(indices , inplace=True)

For the gross national income (GNI) per capita, we have values from both the GHO and UNESCO databases.
Although they have similar trends, there are clear differences. For reasons unknown to me, the GHO data is missing GNI for Argentina, while the UNESCO data is missing Cuba.

North Korea, Somalia, and Syria are missing GNI values from both sources,
so I will drop those countries.

Since the GHO set does not have data for the years 2014-2016, I will use the UNESCO GNI, and fill in missing
values from the GHO data if available.

At first, I was planning to use a scale factor when filling in the missing values, but there
is not a consistent trend. (It may be useful to investigate the differences in the databases in some
future study.)
To keep things simple, I will use a 1.0 scale factor.
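
As a minimal sketch of this fallback rule, the toy frame below (hypothetical values, and a hypothetical output column `gni_filled`) prefers the UNESCO value and falls back to the GHO value multiplied by a scale factor. With `scale = 1.0` it gives the same result as an `np.where` on `notnull()`.

```python
import numpy as np
import pandas as pd

# hypothetical values: row 0 has both sources, row 1 only GHO, row 2 neither
toy_gni = pd.DataFrame({
    "une_gni": [1200.0, np.nan, np.nan],
    "gni_capita": [1150.0, 3000.0, np.nan],
})

scale = 1.0  # no rescaling between the two sources

# use the UNESCO value when present, otherwise the (scaled) GHO value
toy_gni["gni_filled"] = toy_gni["une_gni"].fillna(scale * toy_gni["gni_capita"])
print(toy_gni)
```

Rows where both sources are missing stay missing, so those countries still have to be handled separately.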

In [None]:
#print("NaN for gni ppp")
#print(clean_df[clean_df['une_gni'].isnull()].head(50))

indices = clean_df[((clean_df['country_code'] == 'SOM') | (clean_df['country_code'] == 'SYR') | (clean_df['country_code'] == 'PRK'))].index
clean_df.drop(indices , inplace=True)

plt.scatter(clean_df["gni_capita"], clean_df["une_gni"])
plt.plot(clean_df["gni_capita"], clean_df["gni_capita"], color="red")
plt.xlabel('GHO GNI per capita')
plt.ylabel('UNESCO GNI per capita')
plt.show()

clean_df['gni_scale'] = (clean_df['une_gni'] / clean_df['gni_capita'])
plt.scatter(clean_df["gni_capita"], clean_df["gni_scale"])
plt.plot([0.0, 122000.0], [1.0, 1.0], color="red")
plt.xlabel('GHO GNI per capita')
plt.ylabel('UNESCO GNI / GHO GNI')
plt.show()

print('Mean UNESCO/ GHO value for GNI > 40K : ', clean_df[clean_df['gni_capita'] > 40000]['gni_scale'].mean())
print('Mean UNESCO/ GHO value for GNI > 80K : ', clean_df[clean_df['gni_capita'] > 80000]['gni_scale'].mean())

# Not using a scale factor
# use GHO value when the UNESCO value is missing; possible that both are null
clean_df['une_gni'] = np.where(clean_df['une_gni'].notnull(), clean_df['une_gni'], clean_df['gni_capita'])
del clean_df['gni_scale']
del clean_df['gni_capita']

For the remaining missing values, I will use interpolation. Missing values before the earliest entry will use
the earliest value, values after the last entry will use the last value, and missing
values between available entries will use linear interpolation. Most of the features are relatively stable by country. If a different
analysis is being done, a better solution for missing values may be required.

The features "gghe-d" and "che_gdp" (health expenditures) only have two countries with no data, so those countries
will be dropped. Montenegro is missing info for gghe-d and che_gdp; Albania is missing che_gdp.
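
The fill rule described above can be checked on a small example. The frame below is hypothetical (two made-up countries and a made-up output column `doctors_filled`); it applies `interpolate(method='linear', limit_direction='both')` within each country so that values never leak from one country into another.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: country A has a leading and an interior gap,
# country B has trailing gaps
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2003] * 2,
    "doctors": [np.nan, 2.0, np.nan, 4.0, 10.0, 12.0, np.nan, np.nan],
})

# interpolate inside each country only
toy["doctors_filled"] = toy.groupby("country")["doctors"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
print(toy)
```

The leading gap in country A takes the earliest available value, the interior gap is filled linearly, and the trailing gaps in country B repeat the last available value. A country with no values at all for a feature stays empty, which is why such countries are dropped instead.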

In [None]:
# to interpolate the missing values
clean_df = clean_df.groupby('country').apply(lambda group: group.interpolate(method='linear', limit_direction='both'))

# Montenegro is missing info for gghe-d and che_gdp; Albania is missing che_gdp
clean_df = clean_df[~((clean_df['country_code'] == 'ALB') | (clean_df['country_code'] == 'MNE'))]

country_list = clean_df['country_code'].unique()
column_list = list(clean_df.columns)

gone_all = dict()
gone_some = dict()

for col in column_list:
    for country in country_list:
        num_na = clean_df[clean_df['country_code'] == country][col].isnull()
        if (num_na.all()):
            gone_all[col] = gone_all.get(col, 0) + 1
        if (num_na.any()):
            gone_some[col] = gone_some.get(col, 0) + 1
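    if col in gone_some:
        # use .get() because a feature can have partial gaps without any country missing it entirely
        print("Feature",col,"has",gone_all.get(col, 0),"countries with no data, ",gone_some[col],"with some missing data.")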

The correlation table/heat map is useful to get an overview of the features, but should be used carefully.
Two variables can have a zero Pearson correlation coefficient, and still have a strong non-linear correlation.
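
A small synthetic example of that caveat (the arrays `x` and `y` are made up for illustration): `y` is completely determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# x is symmetric around zero, and y depends on x exactly
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation coefficient between x and y
pearson_r = np.corrcoef(x, y)[0, 1]
print("Pearson r:", pearson_r)
```

A near-zero cell in the heat map therefore does not rule out a useful feature; the scatter plots are the safer check.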

In [None]:
#print(clean_df.corr())

plt.figure(figsize=(20,10))
sn.heatmap(clean_df.corr(), annot=True)
plt.show()

The 3 vaccination features are highly correlated with each other (0.92-0.97). Instead of keeping "measles", "diphtheria", and "polio", I will use Principal Component Analysis (PCA) to make one "vaccination" variable to replace those 3 features.
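
The sign of a principal component is arbitrary, so it can flip between data sets or library versions. The sketch below uses synthetic coverage rates (not the WHO data) and shows one possible way to fix the orientation: flip the component if it is negatively correlated with the mean of the standardized inputs. This sign check is an assumption added for illustration, not a step taken in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for three highly correlated coverage rates
rng = np.random.default_rng(0)
base = rng.uniform(40, 99, size=300)
rates = np.column_stack([base + rng.normal(0, 3, 300) for _ in range(3)])

# standardize, then project onto the first principal component
scaled = StandardScaler().fit_transform(rates)
pca = PCA(n_components=1)
component = pca.fit_transform(scaled).ravel()

# orient the component so that larger values mean higher coverage
if np.corrcoef(component, scaled.mean(axis=1))[0, 1] < 0:
    component = -component

print("Variance accounted for:", pca.explained_variance_ratio_)
```

With inputs this strongly correlated, a single component keeps almost all of the variance, which is the justification for replacing three columns with one.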

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=1)

# Standardizing the features
X = clean_df[['measles', 'polio', 'diphtheria']]
X = StandardScaler().fit_transform(X)

# want the principal component vector to be positively correlated with increasing vaccination rates
# for this particular fit by the code, that requires multiplying by -1
principalComponents = pca.fit_transform(X)
clean_df['vaccination'] = -principalComponents

print('Variance accounted for by pca:', pca.explained_variance_ratio_)

plt.figure(figsize=(12,4))
plt.subplot(1, 3, 1)
plt.scatter(clean_df['vaccination'], clean_df['measles'])
plt.xlabel('vac pca')
plt.ylabel('measles')

plt.subplot(1, 3, 2)
plt.scatter(clean_df['vaccination'], clean_df['polio'])
plt.xlabel('vac pca')
plt.ylabel('polio')

plt.subplot(1, 3, 3)
plt.scatter(clean_df['vaccination'], clean_df['diphtheria'])
plt.xlabel('vac pca')
plt.ylabel('diphtheria')

plt.tight_layout()
plt.show()

del clean_df['measles']
del clean_df['polio']
del clean_df['diphtheria']

I plotted the distribution of the life expectancy, the remaining features, 
and the scatter plot of
life expectancy as a function of those features.

We don't necessarily expect these features to have a normal distribution.
For example, populations for each nation should be largely
random. Some of the features are a percentage of the population, which cannot exceed 100%.
This kind of limitation needs to be kept in mind if we are looking at metrics where the
fit assumes normality of the features.

(Seven of the features still have countries with missing information. This is the reason for the warning message from the code when plotting "hepatitis".)

In [None]:
target_feature = "life_expect"

num_features = list(clean_df.columns)
num_features.remove(target_feature)
num_features.remove('country')
num_features.remove('country_code')
num_features.remove('region')
num_features.remove('year')

print("Target feature:",target_feature)
print("Numeric  features:",num_features)

print("Plotting target feature:",target_feature)
plt.hist(clean_df[target_feature])
plt.show()

for feat in num_features:
    print("Plotting feature:",feat)
    plt.figure(figsize=(12,4))
    plt.subplot(1, 2, 1)
    plt.hist(clean_df[feat])
    plt.xlabel(feat)

    plt.subplot(1, 2, 2)
    plt.scatter(clean_df[feat], clean_df[target_feature])
    plt.xlabel(feat)
    plt.ylabel("life expectancy")
    plt.show()

Looking at the scatter plot distributions, 3 features
showed noticeable non-linear correlations: young age obesity, number of doctors (per capita),
and GNI. Based on the shape of the scatter plot distributions, I decided
to try using the logarithm of those values. The transformed variables appear (visually) to have
a more linear relationship with life expectancy.

(Later on, I will use linear regression to see how the fit is affected,
before and after the log transformation on those 3 features.)
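
To illustrate the effect of the transformation, the sketch below builds a noiseless synthetic relationship (the names `income` and `life` and all numbers are made up) in which the target grows with the logarithm of the feature, and compares the linear correlation before and after applying `np.log1p`.

```python
import numpy as np

# synthetic feature spanning a wide range, similar in spirit to GNI per capita
income = np.linspace(500.0, 100000.0, 400)

# synthetic target that grows with the logarithm of the feature
life = 40.0 + 3.5 * np.log1p(income)

# linear correlation with the raw feature and with the transformed feature
r_raw = np.corrcoef(income, life)[0, 1]
r_log = np.corrcoef(np.log1p(income), life)[0, 1]
print("raw feature:", r_raw)
print("log1p feature:", r_log)
```

The real data is noisier than this, so the improvement from the transformation is expected to be smaller, but the direction of the effect is the same.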

In [None]:
no_log_df = clean_df.copy()

feature_log = ['age5-19obesity', 'doctors', 'une_gni']

for feat in feature_log:
    clean_df[feat] = np.log1p(clean_df[feat])

    print("Plotting feature:",feat)
    plt.figure(figsize=(12,4))
    plt.subplot(1, 2, 1)
    plt.hist(clean_df[feat])
    plt.xlabel(feat)

    plt.subplot(1, 2, 2)
    plt.scatter(clean_df[feat], clean_df[target_feature])
    plt.xlabel(feat)
    plt.ylabel("life expectancy")
    plt.show()

For now, I will drop features that have countries with missing values, to use the most countries.
The result is data with 11 features from 176 countries.

(Later on, I will see what changes if I try to keep more features, and drop the countries with missing values.)

I am also not using country, country code, year, and region information.
There are factors that may depend on the "year", like monetary inflation or technology advancements,
but those effects are too complicated to include in this initial study.

In [None]:
# make a copy before dropping any more features and countries
# this will save some work later on, when I want to try a
# different selection of features

df_before_remove = clean_df.copy()

# drop hepatitis, hospitals, hiv, poverty, spend edu, literacy, school
remove_list = ['hepatitis', 'hospitals', 'une_hiv', 'une_poverty', 'une_edu_spend', 'une_literacy', 'une_school']
for col_name in remove_list:
    del clean_df[col_name]
    del no_log_df[col_name]
    num_features.remove(col_name)

print('Remaining features:',len(num_features), num_features)
print("\n",clean_df['country_code'].nunique(),"countries to be analyzed")

# drop remaining NaN rows; should be zero, but running it just in case I missed a stray value somewhere
clean_df = clean_df.dropna(axis=0)
no_log_df = no_log_df.dropna(axis=0)

## Comment on Cross Validation and Overfitting ##
I am making a note here as a reminder to myself about a potential fitting problem to be avoided in the future.
It might prove useful as an example for others learning data science as well.

I am using 80\% of the data for model training, comparing different learning models, and tuning of hyperparameters. The remaining 20\% will be used to test the final model fit.

While searching the Internet for examples of cross validation, I found cases where it is acceptable to use 100\% of the data during cross validation. That is equivalent to trying multiple train-test selections. This
should only be done when the model and hyperparameters are already fixed before looking at the data.

Early while working on this analysis, I made the mistake of running "cross_val_score" without first shuffling the rows with a function like "train_test_split" or the dataframe method "sample(frac=1)". The function "cross_val_score" does not make random selections when dividing the data. (If we run "cross_val_score" with "cv=5" (for example), it will take the first 20\% of the rows as a test set, the rows from 21\% to 40\% on the next iteration as a test set, and so forth.) For this data, where the entries are clustered by country and region, each test fold then contains countries that are absent from the training folds, which leads to massive overfitting and poor prediction results. (The reason for
the poor predictions was not immediately obvious to me,
since the code ran without errors.)
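
The fold layout can be inspected directly. For a regressor, an integer `cv` corresponds to `KFold` without shuffling, so the sketch below uses `KFold` on ten hypothetical row indices and compares it with a shuffled version (the variable names are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # stand-in for ten rows sorted by country

# default KFold: contiguous blocks of rows become the test sets
plain_folds = [test.tolist() for _, test in KFold(n_splits=5).split(rows)]

# shuffled KFold: rows are assigned to test sets at random
shuffled = KFold(n_splits=5, shuffle=True, random_state=173)
shuffled_folds = [test.tolist() for _, test in shuffled.split(rows)]

print("unshuffled test folds:", plain_folds)
print("shuffled test folds:  ", shuffled_folds)
```

Running `train_test_split` first has a similar effect, because it shuffles the rows before they are passed to the cross validation.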

## Model Fit Selection ##

I tried multiple models to fit the life expectancy data, with default hyperparameters for each model. After a promising model
was selected (based in part on the root-mean-square (RMS) errors and processing time), I 
optimized the hyperparameters for that particular model.

In [None]:
# create a table of the model metrics, to allow easier comparison
model_perform = pd.DataFrame(columns = ['model', 'rms_error', 'time'])

X = clean_df[num_features]
y = list(clean_df[target_feature])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = rand_seed)

# standardize the features before fitting
# not all models require this, but it will make life easier to do it for all of them
sc = StandardScaler()
X_train_sc = pd.DataFrame(sc.fit_transform(X_train), columns=num_features)
X_test_sc = pd.DataFrame(sc.transform(X_test), columns=num_features)

For each model, I use "cross_val_score" from the sklearn library. The score is the negative root mean squared error. The fitter tries to increase the score, which results in a lower (non-negative) RMS error.
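
A short sketch of that sign convention on synthetic data (`X_toy`, `y_toy`, and the coefficients are made up): every score returned with `scoring='neg_root_mean_squared_error'` is zero or negative, and negating the mean gives the RMS error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic linear data with noise of standard deviation 0.5
rng = np.random.default_rng(173)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# scores are negative RMS errors, one per fold
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                         scoring="neg_root_mean_squared_error")

# flip the sign to report a conventional (non-negative) RMS error
rms_error = -scores.mean()
print("fold scores:", scores)
print("RMS error:", rms_error)
```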

Below, I am running the statsmodels OLS fit, to get the nicely formatted fit results text output. (I ended up not using it, but I wanted that information available in case I wanted to see it.)

In [None]:
# fit using statsmodels OLS
X_train_sc2 = sm.add_constant(X_train_sc)
results_ols = sm.OLS(y_train, X_train_sc2).fit()

# We print the summary results.
print(results_ols.summary())

# We are making predictions here
X_test_sc2 = sm.add_constant(X_test_sc)
y_ols = results_ols.predict(X_test_sc2)

fitter_metrics(y_test, y_ols)

### Model Fitting: OLS ###
The ordinary least squares fit is not great (RMS error 3.62), but it does show that this set of 11 features
has some predictive power, even with the simple assumption of linearity.

This model fit would be improved if we had more usable features (to account for more of the variance), or if we
had a theoretical function to use in the fit that accounted for non-linear relationships.

In [None]:
model = linear_model.LinearRegression()
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
# this fit uses clean_df, which already contains the log-transformed features
model_perform = model_perform.append({'model': 'OLS (log-transformed features)', 'rms_error': -score.mean(), 'time': elapse_time}, ignore_index=True)

fitter_metrics(y_train, y_pred)

The data includes the 3 features that were log transformed.
I wanted to see how the OLS fit does with the untransformed features. The fit below shows that not using the logarithm values results in a poorer fit (RMS error 3.97).

In [None]:
y2 = no_log_df[target_feature]
X2 = no_log_df[num_features]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.2, random_state = rand_seed)

model = make_pipeline(StandardScaler(), linear_model.LinearRegression())
score = cross_val_score(model, X2_train, y2_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X2_train, y2_train, cv=5)
elapse_time = time.time() - elapse_time
# this fit uses no_log_df, which was copied before the log transformation
model_perform = model_perform.append({'model': 'OLS (untransformed features)', 'rms_error': -score.mean(), 'time': elapse_time}, ignore_index=True)

fitter_metrics(y2_train, y_pred)

### Model Fitting: ElasticNet ###

An alternative to OLS is [Elastic Net][elastic_net], which the sklearn documentation states is a "linear regression with combined L1 and L2 priors as regularizer".
The default model is not as good as OLS (RMS 4.20). (I assume that some tweaking of the hyperparameters would fix that, but I did not do that here.)

[elastic_net]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

In [None]:
model = linear_model.ElasticNet()
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
model_perform = model_perform.append({'model': 'ElasticNet', 'rms_error': -score.mean(), 'time': elapse_time}, ignore_index=True)

fitter_metrics(y_train, y_pred)

### Model Fitting: Huber ###
The description on the sklearn web site mentions the [Huber model][huber_reg] is a "linear regression model that
is robust to outliers." For this dataset, the default hyperparameters result in a fit comparable
to OLS (RMS 3.68).

[huber_reg]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html

In [None]:
# Failed to converge with default of max_iter=100
model = linear_model.HuberRegressor(max_iter=1200)
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
model_perform = model_perform.append({'model': 'Huber Linear', 'rms_error': -score.mean(), 'time': elapse_time}, ignore_index=True)

fitter_metrics(y_train, y_pred)

### Model Fitting: K Nearest Neighbors ###
The regression method based on [k-nearest neighbors][knn_method] is an improvement over the OLS fit (RMS 1.48).

Since the life expectancy
and features on a national level change slowly over time (in general), and each country has up to 17 years of entries, it is not surprising this model does well on this data. I ended up not choosing this model, as I am concerned that the fit does well on existing data, but perhaps not as well on predicting new results.

[knn_method]: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

In [None]:
# default is n_neighbors=5
model = KNeighborsRegressor()
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
model_perform = model_perform.append({'model': 'KNeighbors', 'rms_error': -score.mean(), 'time': elapse_time}, ignore_index=True)

fitter_metrics(y_train, y_pred)

### Model Fitting: Support Vector ###
The documentation for [Support Vector Machine methods][svr_method] mentions that the variables have to be scaled before
being used in the model fit. In this particular situation, I found that not standardizing the features results in model predictions
that are always near 70.

The RMS is better than the OLS fit (2.81), but we can see that for life expectancy below 50 years, the
predictions (with default hyperparameters) are overestimated.

[svr_method]: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

In [None]:
# SVR with default hyperparameters (RBF kernel)
model = SVR()
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')

print('Array of cross_val_score results:',score)
print("Unweighted Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
model_perform = model_perform.append({'model': 'Support Vector', 'rms_error': -score.mean(), 'time': elapse_time}, ignore_index=True)

fitter_metrics(y_train, y_pred)

### Model Fitting: Random Forest ###
The last two models that I tried use ensemble-based methods. The [Random Forest Regressor][rfr_method] by default has an unlimited depth, and nodes are expanded until all leaves are pure.

The results are good (RMS error 1.51), but this is probably close to the best result possible for this model. The other models have default hyperparameters set to limit the execution time of the code, so this is an apples-to-oranges comparison.

[rfr_method]: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

In [None]:
# The default value of n_estimators changed from 10 to 100 in version 0.22

model = ensemble.RandomForestRegressor()
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
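# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
new_row = pd.DataFrame([{'model': 'Random Forest', 'rms_error': -score.mean(), 'time': elapse_time}])
model_perform = pd.concat([model_perform, new_row], ignore_index=True)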

fitter_metrics(y_train, y_pred)

I wanted to see how this model performs with a limit on the depth, which will speed up the processing and make it more like the other models I used in this section.

The results with a depth of 4 are better than OLS (RMS error 3.23), but the model doesn't make any predictions below 50 years. As the depth increases, the model predictions improve at the lowest
and highest life expectancy values.

In [None]:
# The default value of n_estimators changed from 10 to 100 in version 0.22

model = ensemble.RandomForestRegressor(max_depth=4)
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
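# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
new_row = pd.DataFrame([{'model': 'Random Forest (depth 4)', 'rms_error': -score.mean(), 'time': elapse_time}])
model_perform = pd.concat([model_perform, new_row], ignore_index=True)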

fitter_metrics(y_train, y_pred)
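
A likely reason a shallow forest cannot reach the extremes is that each tree leaf predicts the mean of the training targets that fall into it, and a depth of 4 allows at most 16 leaves per tree. The sketch below is a tiny one-feature regression tree written from scratch (not scikit-learn's implementation) on made-up life expectancies; it only illustrates how the lowest prediction moves toward the lowest target as the depth limit is relaxed.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(points, max_depth):
    """Fit a 1-D regression tree on (x, y) pairs that are sorted by x.

    Returns ('leaf', value) or ('split', threshold, left, right).
    """
    xs = [x for x, _ in points]
    if max_depth == 0 or xs[0] == xs[-1]:
        return ("leaf", mean([y for _, y in points]))
    best = None
    for i in range(1, len(points)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical x values
        cost = sse([y for _, y in points[:i]]) + sse([y for _, y in points[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    _, i = best
    threshold = (xs[i - 1] + xs[i]) / 2
    return ("split", threshold,
            fit_tree(points[:i], max_depth - 1),
            fit_tree(points[i:], max_depth - 1))


def predict(tree, x):
    while tree[0] == "split":
        _, threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree[1]


# Hypothetical data: one feature (0..7) and life expectancy in years.
y = [45.0, 48.0, 55.0, 60.0, 66.0, 70.0, 75.0, 80.0]
points = list(zip(range(8), y))

shallow = fit_tree(points, max_depth=1)
deep = fit_tree(points, max_depth=8)
shallow_preds = [predict(shallow, x) for x, _ in points]
deep_preds = [predict(deep, x) for x, _ in points]
print("lowest target:", min(y))
print("lowest prediction, depth 1:", min(shallow_preds))
print("lowest prediction, depth 8:", min(deep_preds))
```

A random forest averages many such trees, so the same averaging effect applies to its predictions.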

### Model Fitting: Gradient Boosting ###
The last model I tried is [Gradient Boosting Regressor][gbr_method]. It is more of a "black-box" algorithm (less intuition about how to improve the fit), but it tends to perform well on machine learning problems.

[gbr_method]: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

In [None]:
# default n_estimators=100, max_depth=3

model = ensemble.GradientBoostingRegressor()
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

elapse_time = time.time()
y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)
elapse_time = time.time() - elapse_time
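# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
new_row = pd.DataFrame([{'model': 'Gradient Boost', 'rms_error': -score.mean(), 'time': elapse_time}])
model_perform = pd.concat([model_perform, new_row], ignore_index=True)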

fitter_metrics(y_train, y_pred)

## Optimizing the Gradient Boost Model ##

I decided to use Gradient Boosting as the model to fit this data set. Even with default hyperparameters, it had one of the lower RMS errors. Later on, we will see that the optimized hyperparameters result in a better fit than even the Random Forest (unlimited depth) result.

In [None]:
print(model_perform.round(decimals=3).head(20))

The default hyperparameters for GBR are depth of 3 and 100 estimators.

I ran GridSearchCV to see if more resources will improve the model predictions.
The model seems stable over the range of hyperparameters used, so there is some
flexibility on what parameters to use.

I chose a depth of 6 and 300 estimators. While not the absolute best test score rank,
the test scores are similar (top four are all 0.978 for the mean R² score, the default scoring for a regressor), and choosing
300 estimators is faster than the best rank (depth 6, 500 estimators).

In [None]:
params = {'n_estimators': [100, 200, 300, 500],
          'max_depth': [3, 4, 5, 6, 7]}

model = ensemble.GradientBoostingRegressor()
clf = GridSearchCV(model, params)
clf.fit(X_train_sc, y_train)

optimize_df = pd.DataFrame.from_dict(clf.cv_results_)
del optimize_df['params']
#rank_col = optimize_df.pop("rank_test_score")
#optimize_df = optimize_df.insert(1, rank_col.name, rank_col)
optimize_df.round(decimals=3).head(20)
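
For a sense of the cost of this search, the grid can be enumerated by hand. The sketch below uses only `itertools` and the same `params` dictionary; it does not call scikit-learn, and the fold count of 5 is the `GridSearchCV` default when `cv` is not given.

```python
from itertools import product

params = {'n_estimators': [100, 200, 300, 500],
          'max_depth': [3, 4, 5, 6, 7]}

# Every combination of the listed values, as one dict per candidate.
names = list(params)
candidates = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 5  # GridSearchCV's default when cv is not given
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
```

That is why `head(20)` shows the full results table. Because no `scoring` argument is passed to `GridSearchCV`, the `mean_test_score` column holds the estimator's default score (R² for a regressor) rather than an RMS error.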

Using the chosen hyperparameters instead of the defaults improves the RMS error from
2.26 to 1.34.

In [None]:
params = {'n_estimators': 300,
          'max_depth': 6}

model = ensemble.GradientBoostingRegressor(**params)
score = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

y_pred = cross_val_predict(model, X_train_sc, y_train, cv=5)

fitter_metrics(y_train, y_pred)

The GBR code does return scores for the features to indicate relative importance in the fit. For this fit, "basic_water" has a much higher "importance" than the other 10 features.

However, I suspect the features are correlated enough that the model could have arrived at other solutions with similar results. If I use GBR with "basic_water" removed (leaving 10 features), we see that the order of the features is not preserved. For example, "doctors" is now the most important feature, rather than the previous second-best feature of GNI.

In [None]:
params = {'n_estimators': 300,
          'max_depth': 6}

# Initialize and fit the model.
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train_sc, y_train)

feature_importance = clf.feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance depth 6')

params = {'n_estimators': 300,
          'max_depth': 6}

X_no_h2o = X_train_sc.copy()
del X_no_h2o['basic_water']

clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_no_h2o, y_train)

feature_importance = clf.feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_no_h2o.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance w/o basic_water')

plt.subplots_adjust(left=0.5, right=1.1)
plt.tight_layout()
plt.show()
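
The `feature_importances_` values plotted here are impurity-based, so they describe which features this particular fit happened to split on. A different technique, permutation importance, asks how much the error grows when one column is shuffled. The sketch below is a small pure-Python version with a toy stand-in for a fitted model and made-up rows; scikit-learn offers `sklearn.inspection.permutation_importance` for real estimators.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Mean increase in RMSE when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(rmse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy stand-in for a fitted model: it only uses the first feature.
def toy_predict(row):
    return 50.0 + 0.3 * row[0]


# Hypothetical rows: (basic_water, doctors). The second column is strongly
# correlated with the first but is never used by the toy model.
rows = [(float(w), w / 25.0) for w in range(20, 101, 10)]
y = [toy_predict(r) for r in rows]

importances = permutation_importance(toy_predict, rows, y)
print(importances)
```

The second feature scores zero even though it tracks the first one exactly, which mirrors the point above: with correlated features, an importance ranking describes one fitted model, not necessarily the underlying drivers.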

Using GBR with only 10 features does increase the RMS error from 1.34 to 1.44. "basic_water" is important, just not as overwhelmingly important as we might naively think based on the plot of relative importance.

In [None]:
params = {'n_estimators': 300,
          'max_depth': 6}

model = ensemble.GradientBoostingRegressor(**params)
score = cross_val_score(model, X_no_h2o, y_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

y_pred = cross_val_predict(model, X_no_h2o, y_train, cv=5)

fitter_metrics(y_train, y_pred)

I was curious about the effect of changing the depth on the GBR. While "basic_water" stays on top, the importances of the other features
shift in order and relative magnitude.

(Future analyses might want to explore the correlation between the features, or try PCA in more detail.)
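
As a starting point for that kind of follow-up, pairwise Pearson correlations can be computed directly. The sketch below uses made-up values for three of the feature names, not numbers from the data set; with the actual DataFrame, pandas' `DataFrame.corr()` produces the full matrix in one call.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical values for three features (not taken from the data set).
features = {
    "basic_water": [40.0, 55.0, 70.0, 85.0, 99.0],
    "doctors": [0.2, 0.6, 1.5, 2.4, 3.9],
    "alcohol": [6.0, 1.0, 8.0, 2.0, 5.0],
}

# Print each pair once.
names = list(features)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(features[first], features[second])
        print("%s vs %s: r = %0.2f" % (first, second, r))
```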

In [None]:
params = {'n_estimators': 300,
          'max_depth': 4}

# Initialize and fit the model.
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train_sc, y_train)
# score the depth-4 model defined in this cell (not the stale `model` from an earlier cell)
score = cross_val_score(clf, X_train_sc, y_train, cv=5, scoring='neg_root_mean_squared_error')


feature_importance = clf.feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance depth 4')

params = {'n_estimators': 300,
          'max_depth': 8}

clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train_sc, y_train)

feature_importance = clf.feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance depth 8')

plt.subplots_adjust(left=0.5, right=1.1)
plt.tight_layout()
plt.show()

## Additional Features ##
I want to check how the results change if we include more features, at the cost of removing countries that don't have that information. Doing so reduces the number of countries from 176 to 61, but increases the number of available features from 11 to 18.

Keep in mind that we only had 130 entries for "hospitals", so most of the information for that feature was interpolated from one or two entries per country. (I won't be exploring in detail the effects of the features that have sparse entries.)

In [None]:
# use the copy before we dropped features and countries
clean_df = df_before_remove.copy()

more_features = list(clean_df.columns)
more_features.remove(target_feature)

#remove_list = ['hepatitis', 'une_hiv', 'une_poverty', 'une_edu_spend', 'une_literacy',
#               'gni_capita', 'une_pop']
#for col_name in remove_list:
#    del clean_df[col_name]

# drop remaining NaN rows
clean_df = clean_df.dropna(axis=0)

print(clean_df['country_code'].nunique(),"countries to be analyzed")
list_use_country = clean_df['country_code'].unique()

# make another dataframe with the countries we just dropped
unclean_df = df_before_remove.copy()
unclean_df = unclean_df[~unclean_df.country_code.isin(list_use_country)]

# drop remaining NaN rows
#unclean_df = unclean_df.dropna(axis=0)

print(unclean_df['country_code'].nunique(),"countries excluded")

more_features.remove('country')
more_features.remove('country_code')
more_features.remove('region')
more_features.remove('year')

print("Target feature:",target_feature)
print("Larger set of numeric features:",len(more_features), more_features)

Using the 61 countries that have at least some information for the 18 features, the fit accuracy improved.
The RMS error is 0.92 for the training set. (Anything near 1.0 suggests that we are near the limit for improving the fit model.)

In [None]:
Y3 = clean_df[target_feature]
X3 = clean_df[more_features]

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, Y3, test_size = 0.2, random_state = rand_seed)

params = {'n_estimators': 300,
          'max_depth': 6}

model = make_pipeline(StandardScaler(), ensemble.GradientBoostingRegressor(**params))
score = cross_val_score(model, X3_train, y3_train, cv=5, scoring='neg_root_mean_squared_error')
print('Array of cross_val_score results:',score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

y_pred = cross_val_predict(model, X3_train, y3_train, cv=5)

fitter_metrics(y3_train, y_pred)
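
In this cell the scaler is wrapped together with the regressor by `make_pipeline`, so during cross-validation the scaling statistics are learned from the training part of each fold and then applied to the held-out part. A minimal sketch of that fit-then-apply pattern, using made-up values for a single feature and hypothetical helper names:

```python
import statistics


def fit_scaler(train_column):
    """Learn the mean and population std from the training fold only."""
    return statistics.fmean(train_column), statistics.pstdev(train_column)


def apply_scaler(column, mean, std):
    """Scale any column with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in column]


# Hypothetical values of one feature, split into training and validation folds.
train_fold = [60.0, 70.0, 80.0]
valid_fold = [90.0]

mean, std = fit_scaler(train_fold)  # the validation fold is never seen here
scaled_train = apply_scaler(train_fold, mean, std)
scaled_valid = apply_scaler(valid_fold, mean, std)

print(scaled_train, scaled_valid)
```

For tree-based models such as gradient boosting, the scaling itself should matter little, since splits depend on the ordering of feature values rather than their magnitude; the pipeline mainly keeps the workflow consistent with the scaled models used earlier.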

I was curious if the selection of those 61 countries, rather than adding 7 more features, might have been the reason for the improvement.

Using the set of 11 features used earlier, the fit of the 61 countries has an RMS error of 1.30, similar to the earlier result (with 176 countries), and higher than the 18-feature fit (RMS 0.92).

In [None]:
Y4 = clean_df[target_feature]
X4 = clean_df[num_features]

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, Y4, test_size = 0.2, random_state = rand_seed)
# 
params = {'n_estimators': 300,
          'max_depth': 6}

model = make_pipeline(StandardScaler(), ensemble.GradientBoostingRegressor(**params))
score = cross_val_score(model, X4_train, y4_train, cv=5, scoring='neg_root_mean_squared_error')
print(score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

y_pred = cross_val_predict(model, X4_train, y4_train, cv=5)

fitter_metrics(y4_train, y_pred)

I then redid the fit (with 11 features) with the 115 countries that were excluded from the 18-feature fit. The RMS error is 1.14.

I don't have the time to explore this further. If some of the 7 additional features are not important for the GBR model, then those could be excluded in order to include more countries in the fit.
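
A rough way to start on that trade-off would be to count how many countries survive the NaN drop when each optional feature is left out in turn. The sketch below mimics the row-wise `dropna` on a handful of made-up rows; the values and the helper name are hypothetical, and `None` stands in for a missing entry.

```python
def countries_kept(rows, features):
    """Countries that still have at least one row after dropping rows
    with a missing value (None) in any of the listed features."""
    return {
        row["country_code"]
        for row in rows
        if all(row[f] is not None for f in features)
    }


# Hypothetical rows; None marks a missing value.
rows = [
    {"country_code": "AAA", "basic_water": 90.0, "hospitals": 1.2, "une_literacy": 95.0},
    {"country_code": "BBB", "basic_water": 60.0, "hospitals": None, "une_literacy": 70.0},
    {"country_code": "CCC", "basic_water": 75.0, "hospitals": 0.8, "une_literacy": None},
    {"country_code": "DDD", "basic_water": 85.0, "hospitals": None, "une_literacy": None},
]
base = ["basic_water"]
extras = ["hospitals", "une_literacy"]

print("all extras:", len(countries_kept(rows, base + extras)), "countries")
for dropped in extras:
    kept_features = base + [f for f in extras if f != dropped]
    print("without", dropped + ":", len(countries_kept(rows, kept_features)), "countries")
```

Combining such counts with the importance of each extra feature would show which features cost the most countries for the least gain.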

Future analyses might try an unsupervised classification, to test if some countries have different factors that influence the life expectancy. (Rather than trying a single global fit with all countries, we might want to try a model with fits on two distinct subsets of countries.)

In [None]:
Y5 = unclean_df[target_feature]
X5 = unclean_df[num_features]

X5_train, X5_test, y5_train, y5_test = train_test_split(X5, Y5, test_size = 0.2, random_state = rand_seed)
# 
params = {'n_estimators': 300,
          'max_depth': 6}

model = make_pipeline(StandardScaler(), ensemble.GradientBoostingRegressor(**params))
score = cross_val_score(model, X5_train, y5_train, cv=5, scoring='neg_root_mean_squared_error')
print(score)
print("Mean score (negative RMSE): %0.2f (+/- %0.2f)" % (score.mean(), score.std() ))

y_pred = cross_val_predict(model, X5_train, y5_train, cv=5)

fitter_metrics(y5_train, y_pred)

## Final Model Fit ##
I used the Gradient Boosting Regressor fit on the set of 176 countries with 11 features. The reserved test data is then run with the trained model.

The RMS error is 1.2, indicating a reasonably good fit, with no obvious bias when comparing the predictions to the true values. The RMS error is comparable to the errors seen with cross-validation with the training data. (If the model were overfitting, I would expect the error to be significantly higher for the test data.)

The data set has a small selection of features available from the WHO data servers. The model predicts the national life expectancies well with only 11 features, and without using features that depend directly on mortality rates.

This model suggests that countries with lower life expectancy would be best served by improving access to water (potable and sanitation), doctors (basic health care), and overall income (GNI per capita). More study of the feature correlations would be needed to see what other features are important.

Even this basic analysis shows there is a lot that can be investigated in this data set. If you find this interesting,
you are encouraged to look at the GHO and UNESCO data servers, which have thousands of variables.

In [None]:
params = {'n_estimators': 300,
          'max_depth': 6}

# Initialize and fit the model.
clf = ensemble.GradientBoostingRegressor(**params)
clf.fit(X_train_sc, y_train)

predict_test = clf.predict(X_test_sc)
fitter_metrics(y_test, predict_test)

feature_importance = clf.feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_test.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()