# Life Expectancy Regressor
__Shuyan(Dawn) Li__<br>
Deepnote link: https://deepnote.com/project/07719c8b-d1bf-4a88-9582-e7e42cd58ee8#%2Flife_expectancy.ipynb

In [None]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

In [None]:
life = pd.read_csv('Life Expectancy Data.csv')

## Introduction
This machine learning project trains a model to predict life expectancy. We use the dataset from https://www.kaggle.com/kumarajarshi/life-expectancy-who, which includes 2938 observations and 21 raw features.<br>
Although many studies have been undertaken on this topic, we are still interested in which factors influence life expectancy. These factors matter because they can inform government policy, for example on healthcare expenditure. The dataset has many interesting variables, including income factors, immunization factors, mortality factors, economic factors, social factors and other health-related factors.

## 1. Load data 
In this step, I load the data and treat life expectancy as the target variable and all other columns except Year as predictors. There are 2 categorical variables, Country and Status; the rest are numeric.

In [None]:
life.head(3)

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9


In [None]:
life.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

In [None]:
life.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

In [None]:
# The target variable is life expectancy, so drop the rows where it is missing
# (.copy() avoids a SettingWithCopyWarning when columns are reassigned later)
life = life[life['Life expectancy '].notnull()].copy()

In [None]:
life.shape

(2928, 22)

In [None]:
# Cast the integer columns (except Year) to float so they are treated as numeric
int_cols = ['infant deaths', 'Measles ', 'under-five deaths ']
life[int_cols] = life[int_cols].astype(float)

In [None]:
# Set life expectancy variable as target variable
y = life['Life expectancy ']

In [None]:
cols = list(life.columns)

In [None]:
cols.remove('Life expectancy ')

In [None]:
X = life[cols]

## 1.1 Split Data for training and testing
Here we use the data before 2015 for model training and the 2015 data for model testing.

In [None]:
X_train = X[X.Year != 2015]
X_test = X[X.Year == 2015]

In [None]:
y_train = y[X.Year != 2015]
y_test = y[X.Year == 2015]

In [None]:
cols.remove('Year')
X_train = X_train[cols]
X_test = X_test[cols]

In [None]:
X_train.shape, X_test.shape

((2745, 20), (183, 20))
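
The year-based holdout above can be wrapped in a small helper. This is a minimal sketch on a made-up toy frame; `split_by_year` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def split_by_year(df, target, holdout_year, year_col="Year"):
    """Return X_train, X_test, y_train, y_test with `holdout_year` as the test set."""
    is_test = df[year_col] == holdout_year
    # Drop the target and the year column so neither is used as a predictor.
    features = df.drop(columns=[target, year_col])
    return (features[~is_test], features[is_test],
            df.loc[~is_test, target], df.loc[is_test, target])

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "Schooling": [10.0, 10.1, 12.0, 12.2],
    "Life expectancy ": [60.0, 61.0, 75.0, 75.5],
})
X_tr, X_te, y_tr, y_te = split_by_year(toy, "Life expectancy ", 2015)
print(X_tr.shape, X_te.shape)
```

Holding out the most recent year keeps every test row later in time than the training rows, which a random split would not guarantee.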

In [None]:
X_train.columns

Index(['Country', 'Status', 'Adult Mortality', 'infant deaths', 'Alcohol',
       'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ',
       'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ',
       ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years',
       ' thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')

## 2. Data Preprocessing
In this step, we impute missing values, encode the categorical variables and scale the numeric ones so that the data is ready for modeling.

In [None]:
# Split the columns into numeric and categorical by dtype
columns_num = X_train.select_dtypes(include='number').columns.tolist()
columns_cat = X_train.select_dtypes(exclude='number').columns.tolist()

print('-Num', columns_num)
print('-Cat', columns_cat, '\n')

-Num ['Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling']
-Cat ['Country', 'Status'] 



In [None]:
life[columns_cat].head(3)

Unnamed: 0,Country,Status
0,Afghanistan,Developing
1,Afghanistan,Developing
2,Afghanistan,Developing


### 2.1 Imputation

In [None]:
cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=np.nan,
                                               strategy='most_frequent')),
                     # Encode categories not seen during fit (e.g. a country
                     # absent from a CV training fold) as all zeros instead of raising
                     ('ohe', OneHotEncoder(handle_unknown='ignore'))
                     ])
con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=np.nan,
                                               strategy='median')),
                     ('scaler', StandardScaler())
                     ])
preprocessing = ColumnTransformer([('categorical', cat_pipe, columns_cat),
                                   ('continuous',  con_pipe, columns_num)
                                   ])
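
To see what this kind of preprocessing does, here is a minimal sketch on a made-up two-column frame. The toy values, the `toy_*` names, and the choices of `handle_unknown='ignore'` and `sparse_threshold=0` (to force a dense array for printing) are assumptions of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): one categorical and one numeric column, each with a gap.
toy_train = pd.DataFrame({
    "Status": pd.Series(["Developing", "Developed", np.nan, "Developing"], dtype=object),
    "GDP": [500.0, np.nan, 700.0, 900.0],
})
# A test row with a category never seen in training and a missing number.
toy_test = pd.DataFrame({
    "Status": pd.Series(["Unknown"], dtype=object),
    "GDP": [np.nan],
})

toy_cat = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                    ("ohe", OneHotEncoder(handle_unknown="ignore"))])
toy_num = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler())])
toy_prep = ColumnTransformer([("categorical", toy_cat, ["Status"]),
                              ("continuous", toy_num, ["GDP"])],
                             sparse_threshold=0)  # always return a dense array

out_train = toy_prep.fit_transform(toy_train)
out_test = toy_prep.transform(toy_test)
print(out_train)
print(out_test)
```

In the test row, the unseen category encodes as all zeros and the missing GDP is filled with the training median, which becomes 0 after scaling because the median equals the imputed training mean here.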

## 3. Model Training
In this section, I compare Linear Regression, Ridge, Lasso, Decision Tree Regressor and Random Forest Regressor. As the selection metric I chose the mean absolute error (MAE), for the following reasons. First, life expectancy varies within a small range, so MAE is a good and interpretable measure, expressed directly in years. Second, MAE is less sensitive to outliers than MSE. Moreover, in practice, selecting on MSE or the R2 score tends to produce a less general model, which can lead to overfitting.
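
The difference in outlier sensitivity can be checked with a small worked example. The numbers below are made up for illustration: both prediction sets are off by one year everywhere, except that the second has a single 11-year miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([70.0, 72.0, 65.0, 80.0, 75.0])
# Every prediction is off by exactly one year.
y_pred_close = y_true + np.array([1.0, -1.0, 1.0, -1.0, 1.0])
# Same errors, except the last prediction is off by 11 years.
y_pred_outlier = y_true + np.array([1.0, -1.0, 1.0, -1.0, 11.0])

scores = {}
for name, y_pred in [("close", y_pred_close), ("outlier", y_pred_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    scores[name] = (mae, rmse)
    print(f"{name}: MAE={mae:.2f}  RMSE={rmse:.2f}")
# MAE goes from 1.0 to 3.0, while RMSE goes from 1.0 to 5.0.
```

The single large error moves the RMSE further than the MAE because squaring weights it more heavily.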

In [None]:
# Create a search space for randomized search
search_space = [
    {
        'clf': [LinearRegression()]
    },
    {
        'clf': [Ridge()],
        'clf__alpha': [200, 230, 250,265, 270, 275, 290, 300, 500]
    },
    {
        'clf': [Lasso()],
        'clf__alpha': [0.02, 0.024, 0.025, 0.026, 0.03]
    },
    {
        'clf': [DecisionTreeRegressor()],
        # 'mse'/'mae' were renamed 'squared_error'/'absolute_error' in scikit-learn 1.0
        'clf__criterion': ["mse", "mae"],
        'clf__min_samples_split': [10, 20, 40],
        'clf__max_depth': [2, 6, 8],
        'clf__min_samples_leaf': [20, 40, 100],
        'clf__max_leaf_nodes': [5, 20, 100]
        
    },
    {
        'clf': [RandomForestRegressor()],
        'clf__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
        # 'auto' (all features, for regressors) was removed in scikit-learn 1.3
        'clf__max_features': ['auto', 'sqrt'],
        'clf__min_samples_leaf': [1, 2, 4],
        'clf__min_samples_split': [2, 5, 10],
        'clf__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
    }
]

In [None]:
class DummyEstimator(BaseEstimator):
    """Placeholder for the 'clf' step; the search swaps in each candidate model."""
    def fit(self, X, y=None): return self
    def score(self, X, y=None): pass

In [None]:
pipe = Pipeline([('preprocessing', preprocessing),
                ('clf', DummyEstimator())])

In [None]:
clf_algos_rand = RandomizedSearchCV(estimator=pipe,
                                    param_distributions=search_space,
                                    n_iter=25,
                                    cv=5,
                                    n_jobs=-1,
                                    verbose=1,
                                    scoring='neg_mean_absolute_error')
# Fit the randomized search
best_model = clf_algos_rand.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
 nan nan nan nan nan nan nan]
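
The `nan` values in the log suggest that some candidates did not produce a finite score. One way to check this is to inspect `cv_results_` after the search. The sketch below uses synthetic data and a smaller search space, so `X_toy`, `toy_pipe`, `toy_space` and `search` are stand-ins for the notebook's objects, not the originals.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real training set.
X_toy, y_toy = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=0)

# The 'clf' step is a placeholder that each candidate replaces.
toy_pipe = Pipeline([("scaler", StandardScaler()),
                     ("clf", LinearRegression())])
toy_space = [
    {"clf": [LinearRegression()]},
    {"clf": [Ridge()], "clf__alpha": [0.1, 1.0, 10.0]},
    {"clf": [Lasso()], "clf__alpha": [0.01, 0.1, 1.0]},
]
search = RandomizedSearchCV(toy_pipe, toy_space, n_iter=5, cv=5,
                            scoring="neg_mean_absolute_error",
                            random_state=0)
search.fit(X_toy, y_toy)

results = pd.DataFrame(search.cv_results_)
# Count candidates whose mean cross-validated score is not finite.
n_failed = int(results["mean_test_score"].isna().sum())
print("candidates with NaN score:", n_failed)
print(results.sort_values("rank_test_score")[["param_clf", "mean_test_score"]].head(3))
```

If every candidate scores `nan`, the reported best parameters carry no information, so this check is worth running before trusting `best_params_`. With the default `refit=True`, `search.best_estimator_` is also already refit on the full training data and can be used directly for prediction.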


In [None]:
best_model.best_params_

{'clf__n_estimators': 600,
 'clf__min_samples_split': 2,
 'clf__min_samples_leaf': 1,
 'clf__max_features': 'auto',
 'clf__max_depth': None,
 'clf': RandomForestRegressor(n_estimators=600)}

In [None]:
best_model.best_params_['clf']

RandomForestRegressor(n_estimators=600)

In [None]:
hyperparameters = best_model.best_params_['clf'].get_params()

In [None]:
hyperparameters

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 600,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [None]:
pipe = Pipeline([('preprocessing', preprocessing),
                ('rf', RandomForestRegressor(**hyperparameters))])
pipe.fit(X_train, y_train)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder())]),
                                                  ['Country', 'Status']),
                                                 ('continuous',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Adult Mortality',


## 4. Evaluation

In [None]:
rmse_train = np.sqrt(mean_squared_error(y_train, pipe.predict(X_train)))
pred = pipe.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))

In [None]:
print("Train RMSE of the best model is:", rmse_train)

Train RMSE of the best model is: 0.6372463790337067


In [None]:
print('Test RMSE of the best model is: ', rmse)

Test RMSE of the best model is:  1.8205568058934933


In [None]:
r2_train = r2_score(y_train, pipe.predict(X_train))

In [None]:
r2 = r2_score(y_test, pred)

In [None]:
print("Train R2 score of the best model is:", r2_train)

Train R2 score of the best model is: 0.995582956761182


In [None]:
print('R2 score of the best model is: ', r2)

R2 score of the best model is:  0.9495013509914836
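
To make the two metrics concrete, the sketch below recomputes RMSE and R2 by hand on three made-up values and compares them with the scikit-learn functions used above. R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([60.0, 70.0, 80.0])
y_hat = np.array([62.0, 69.0, 79.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # 4 + 1 + 1 = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 100 + 0 + 100 = 200

rmse_manual = np.sqrt(ss_res / len(y_true))     # sqrt(2)
r2_manual = 1 - ss_res / ss_tot                 # 0.97

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

Read this way, an R2 of 0.97 means the squared prediction errors are 3% of the squared deviations from the mean, that is, the model accounts for 97% of the variance in the target.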


## 5. Feature Importance

In [None]:
r = permutation_importance(pipe, 
                           X_test, y_test, 
                           n_repeats=30,
                           random_state=42)

In [None]:
feature_importance = pd.DataFrame()
mean = []
std = []
columns = []

In [None]:
for i in r.importances_mean.argsort()[::-1]:
    columns.append(X_train.columns[i])
    mean.append(round(r.importances_mean[i], 3))
    std.append(round(r.importances_std[i], 3))

In [None]:
feature_importance['Cols'] = columns
feature_importance['Mean'] = mean
feature_importance['Std'] = std

In [None]:
feature_importance

Unnamed: 0,Cols,Mean,Std
0,Income composition of resources,0.266,0.034
1,Adult Mortality,0.239,0.027
2,HIV/AIDS,0.2,0.036
3,thinness 5-9 years,0.014,0.003
4,Schooling,0.012,0.002
5,Country,0.009,0.001
6,under-five deaths,0.008,0.002
7,BMI,0.005,0.002
8,Polio,0.004,0.002
9,thinness 1-19 years,0.004,0.001
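
Permutation importance measures how much the score drops when one column is shuffled. With default scoring, as used above, that score is the estimator's own `score` method, which is R2 for a regressor. The sketch below reimplements the idea by hand on synthetic data; `permutation_drop` is a hypothetical helper, and the linear model and coefficients are assumptions chosen so that the expected ranking is obvious.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Column 0 matters most, column 1 a little, column 2 is pure noise.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)

def permutation_drop(model, X, y, col, n_repeats=10, rng=rng):
    """Mean drop in R2 after shuffling column `col`, over `n_repeats` shuffles."""
    baseline = r2_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # Shuffling breaks the link between this column and the target.
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - r2_score(y, model.predict(X_perm)))
    return float(np.mean(drops))

importances = [permutation_drop(model, X, y, c) for c in range(3)]
print(importances)
```

A large drop means the model relies heavily on that column; a drop near zero means the column can be scrambled without hurting the predictions.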


## 6. Conclusion

After fitting all the models, we find that `RandomForestRegressor` with `n_estimators=600` works best within the search space. On the 2015 test data, the root mean squared error is `1.82` and the R2 score is `0.95` (the values may differ slightly between runs because the forest is not seeded). This means the model explains about `95%` of the variance in 2015 life expectancy. The feature importance analysis shows, somewhat surprisingly, that `income composition of resources, adult mortality and HIV/AIDS` are the three most influential factors for life expectancy. Child health indicators, such as thinness at ages 5-9 and under-five deaths, also appear among the top-ranked features, although with much smaller importance.<br>

As a next step, I will look into reducing overfitting: the R2 score on the training set is currently very high (0.996, versus 0.95 on the test set), which may hurt how well the model generalizes. I will also add some time series analysis, since the dataset has a time dimension, which may further improve the performance of the final model.
