# WHO Life Expectancy Feature Engineering

The first step in feature engineering is to split the dataframe into training and test sets. Splitting before any transformation keeps information from the test set out of every fitted step, so the test set gives an honest estimate of how the model performs on unseen data.

After this, we apply feature engineering to the split data where necessary. This involves feature selection, linearization and scaling.

In [15]:
# Libraries
import pandas as pd   # For general data use
import numpy as np    # For mathematical operations
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import MinMaxScaler

In [2]:
# Creating dataframe from CSV file
df = pd.read_csv('Life Expectancy Data.csv')

## Train-test splitting

In [3]:
# Features
X = df.drop('Life_expectancy', axis=1)

# Target
y = df['Life_expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
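
As a quick sanity check on the split, the following sketch uses a small synthetic frame (the columns `feature_a` and `feature_b` are placeholders, not WHO columns). It illustrates that `test_size=0.2` holds out 20% of the rows and that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataframe (hypothetical columns and values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'feature_a': rng.normal(size=100),
    'feature_b': rng.normal(size=100),
    'Life_expectancy': rng.normal(70, 5, size=100),
})

X_toy = toy.drop('Life_expectancy', axis=1)
y_toy = toy['Life_expectancy']

# Same arguments as the notebook's split
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Repeating the call with the same random_state selects the same rows
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

print(Xa_train.shape, Xa_test.shape)
print(Xa_train.index.equals(Xb_train.index))
```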

## Basic feature selection
We dropped the `Economy_status_Developing` feature. As discovered during exploratory analysis, two binary features carry perfectly symmetrical information about economic status, so one of them is redundant. It is common practice in a binary feature for **1** to represent **Yes** and for **0** to represent **No**. Since `Economy_status_Developed` follows this convention, we chose to keep it in the dataframe and drop the other.

The feature `Country` was deemed to be equivalent to a unique identifier. Label encoding would have introduced a spurious ordinal relationship between countries, and one-hot encoding would have introduced so many new variables (179) that the model would be heavily weighted towards binary predictors, while making the coefficient output practically illegible. Therefore `Country` was dropped.
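
The two claims above can be checked directly. The sketch below uses a few illustrative rows rather than the real data: the two status flags are redundant if they always sum to 1, and the number of unique countries equals the number of dummy columns that one-hot encoding would create.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'C'],
    'Economy_status_Developed':  [1, 1, 0, 0],
    'Economy_status_Developing': [0, 0, 1, 1],
})

# The two flags are perfectly complementary if every row sums to exactly 1
status_sum = toy['Economy_status_Developed'] + toy['Economy_status_Developing']
is_complementary = bool((status_sum == 1).all())

# Number of dummy columns that one-hot encoding Country would create
n_country_dummies = toy['Country'].nunique()

print(is_complementary, n_country_dummies)
```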

In [4]:
# Function dropping all required features
def drop_all(df):
    df.drop(columns = ['Country','Economy_status_Developing'], inplace=True)    # Drops in place, do not run twice!
    return df

In [5]:
X_train_drop = drop_all(X_train)
X_test_drop = drop_all(X_test)

`Region` was one-hot encoded during early testing, but the resulting features did not have significant p-values, so `Region` was dropped as part of further feature selection.

In [6]:
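# One-hot encoding of Region as explored in early testing; the result is not assigned, so df is unchanged
pd.get_dummies(df, columns = ['Region'], prefix = 'Region', dtype=int)
# Region is dropped from both splits because it was not significant
X_train_drop.drop(columns=['Region'], inplace=True)
X_test_drop.drop(columns=['Region'], inplace=True)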
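
For reference, here is a minimal sketch of how one-hot encoding could be applied to both splits if `Region` were kept. The frames and region names are illustrative. `pd.get_dummies` returns a new dataframe, so its result has to be assigned. The test columns are then aligned to the training columns, because a category can appear in only one of the splits.

```python
import pandas as pd

# Toy train/test frames; region names and values are illustrative
train = pd.DataFrame({'Region': ['Asia', 'Africa', 'Asia'], 'x': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'Region': ['Africa', 'Oceania'], 'x': [4.0, 5.0]})

# get_dummies returns a new frame; the originals are not modified
train_enc = pd.get_dummies(train, columns=['Region'], prefix='Region', dtype=int)
test_enc = pd.get_dummies(test, columns=['Region'], prefix='Region', dtype=int)

# Align test to the training columns: categories unseen in training are
# dropped, and training categories missing from test are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```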

## Linearization

The `GDP_per_capita` feature displayed a logarithmic relationship with `Life_expectancy`. To linearize it, we took the natural log of `GDP_per_capita`, which gave a more linear relationship with the target. We then dropped the original `GDP_per_capita` feature.

In [7]:
# Function to log-transform GDP (natural log) and drop the original column
def log_GDP(df):
    df['GDP_per_capita_log'] = np.log(df['GDP_per_capita'])
    df.drop(columns = ['GDP_per_capita'], inplace = True)
    return df

In [8]:
# Apply function to X_train_drop and X_test_drop
X_train_fe = log_GDP(X_train_drop)
X_test_fe = log_GDP(X_test_drop)
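
The effect of the transform can be illustrated on synthetic data. In the sketch below the target is constructed, by assumption, to depend on the log of GDP, so the numbers are not WHO values. The Pearson correlation with the target is higher for the logged feature than for the raw one. Note that `np.log` requires strictly positive inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic GDP values spanning several orders of magnitude (not WHO data)
gdp = np.exp(rng.uniform(np.log(300), np.log(100_000), size=500))

# Target built to depend on log(GDP) plus noise (an assumption of this sketch)
life = 40 + 4 * np.log(gdp) + rng.normal(0, 1.5, size=500)

# Linear (Pearson) correlation before and after the log transform
r_raw = np.corrcoef(gdp, life)[0, 1]
r_log = np.corrcoef(np.log(gdp), life)[0, 1]

print(round(r_raw, 3), round(r_log, 3))
```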

## Further feature selection

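The following stepwise selection function was used to keep only those features that make a statistically significant contribution to the model. A feature is added when its p-value is below `threshold_in` (0.01) and removed again if its p-value later rises above `threshold_out` (0.05).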

In [9]:
def stepwise_selection(X, y, threshold_in = 0.01, threshold_out = 0.05, verbose = True):
    # The function is checking for p-values (whether features are statistically significant) - lower is better
    included = [] # this is going to be the list of features we keep
    while True:
        changed = False
        # forward step
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(index = excluded, dtype = 'float64')
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        # we add the feature with the lowest (best) p-value under the threshold to our 'included' list
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval)) # specifying the verbose text


        # backward step: removing features if new features added to the list make them statistically insignificant
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # NaN if pvalues is empty, so the comparison below is False
        # if the p-value exceeds the upper threshold, the feature will be dropped from the 'included' list
        if worst_pval > threshold_out:
            changed = True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

In [20]:
# Selects features based on p-values contributing to the model
selected_features = stepwise_selection(X_train, y_train)
# Trims features to those selected
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

Add  Under_five_deaths              with p-value 0.0
Add  Adult_mortality                with p-value 0.0
Add  Economy_status_Developed       with p-value 1.79712e-150
Add  GDP_per_capita_log             with p-value 2.42674e-52
Add  Infant_deaths                  with p-value 2.61209e-14
Add  BMI                            with p-value 1.03723e-10
Add  Schooling                      with p-value 8.55602e-11
Add  Thinness_ten_nineteen_years    with p-value 1.50631e-06
Add  Year                           with p-value 0.000120534
Add  Alcohol_consumption            with p-value 0.000514411
Add  Incidents_HIV                  with p-value 0.00695664
Add  Hepatitis_B                    with p-value 0.00628234
Add  Polio                          with p-value 0.00171435


Initial training of the model flagged a high Condition Number, so variance inflation factors (VIF) were calculated to check for multicollinearity.

In [21]:
def calculate_vif(X):
    X = sm.add_constant(X)    # Adds a constant to DataFrame X so that the function "variance_inflation_factor" can perform its tests
    vif_data = pd.DataFrame()    # Creates DataFrame that will be used to visualize vif data
    vif_data['Variable'] = X.columns
    vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data

In [22]:
# Runs vif tests to identify multicollinearity
vif_data = calculate_vif(X_train_selected)
vif_data

| | Variable | VIF |
|---|---|---|
| 0 | const | 203155.946488 |
| 1 | Under_five_deaths | 45.179675 |
| 2 | Adult_mortality | 8.33418 |
| 3 | Economy_status_Developed | 2.753713 |
| 4 | GDP_per_capita_log | 5.12624 |
| 5 | Infant_deaths | 47.007928 |
| 6 | BMI | 2.776837 |
| 7 | Schooling | 4.395716 |
| 8 | Thinness_ten_nineteen_years | 1.943154 |
| 9 | Year | 1.088183 |


VIF scores are highest for `Under_five_deaths` and `Infant_deaths`. Combined with the earlier finding that their correlation is **98.6%**, this indicates strong multicollinearity between the two. `Infant_deaths` was dropped because infant deaths are already counted within `Under_five_deaths`, so the latter carries more information.
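
To make the VIF figures easier to interpret, the sketch below computes the statistic from its definition, `1 / (1 - R^2)`, where `R^2` comes from regressing one feature on all the others plus an intercept. The data is synthetic and `vif_from_r2` is a helper written for this illustration, not part of statsmodels. Two nearly identical columns get a large VIF, while an independent column stays close to 1.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)
    and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)   # nearly a copy of a
c = rng.normal(size=n)                  # independent of a and b
X = np.column_stack([a, b, c])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```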

In [23]:
# Drops collinear variable
selected_features.remove('Infant_deaths')
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

In [24]:
# Reruns vif tests to check multicollinearity
vif_data = calculate_vif(X_train_selected)
vif_data

| | Variable | VIF |
|---|---|---|
| 0 | const | 202596.599505 |
| 1 | Under_five_deaths | 8.199744 |
| 2 | Adult_mortality | 8.326104 |
| 3 | Economy_status_Developed | 2.725756 |
| 4 | GDP_per_capita_log | 4.914016 |
| 5 | BMI | 2.771018 |
| 6 | Schooling | 4.395398 |
| 7 | Thinness_ten_nineteen_years | 1.928258 |
| 8 | Year | 1.086249 |
| 9 | Alcohol_consumption | 2.415537 |


After checking VIF scores again, only `Under_five_deaths` and `Adult_mortality` have VIF scores above 5. While this is not ideal, both lie in the acceptable range of 5 to 10, and both contribute substantially to bringing the model's error to within 2 years of the target without being strongly collinear. For this reason, both were kept for training the model.

## Scaling

Summary statistics of the data show large differences in scale between features. These must be handled before modelling so that the models are not biased towards or dominated by features with large numeric ranges. In particular, `Year`, `Under_five_deaths` and `Adult_mortality` have large maximum values. Once feature selection was finished, min-max scaling was applied to all features, since applying it universally lowered the Condition Number of the regression the most. Min-max scaling was chosen because most features exhibit non-normal distributions and it preserves the shape of each distribution. The scaler was applied twice, once for each of the feature selections used in the different models.


In [None]:
X_train_scale = X_train_selected.copy()
X_test_scale = X_test_selected.copy()

In [None]:
# Fits the MinMax scaler on training data only, so no test-set information leaks into the scaling
minmax = MinMaxScaler()
minmax.fit(X_train_scale)

In [None]:
# Perform minmaxing transformation
X_train_scale[selected_features] = minmax.transform(X_train_scale)
# Repeat above for testing data
X_test_scale[selected_features] = minmax.transform(X_test_scale)
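
As a small illustration of what the scaler does, the sketch below uses made-up year values. Min-max scaling maps each value to `(x - min) / (max - min)`, using the minimum and maximum of the training data. Because the scaler is fitted on training data only, test values outside the training range can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: training range is 2000 to 2010
train = np.array([[2000.0], [2005.0], [2010.0]])
test = np.array([[2003.0], [2015.0]])

# Fit on training data only, then transform both splits
scaler = MinMaxScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# The same transformation written out by hand with the training min and max
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel(), test_scaled.ravel())
```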

The following cells perform the same operation but drop variables that might be based on sensitive medical records.

In [None]:
# Drops medical variables
medical_features = ['BMI','Incidents_HIV','Polio','Hepatitis_B','Thinness_ten_nineteen_years']
for item in medical_features:
    selected_features.remove(item)
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

In [None]:
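# Copies the reduced feature set and fits a second scaler on training data only
X_train_scale2 = X_train_selected.copy()
X_test_scale2 = X_test_selected.copy()
minmax2 = MinMaxScaler().fit(X_train_scale2)
# Perform minmaxing transformation on training and testing data
X_train_scale2[selected_features] = minmax2.transform(X_train_scale2)
X_test_scale2[selected_features] = minmax2.transform(X_test_scale2)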

## Saving outputs

In [11]:
# Saving the dataframe to csv output, useable in the model notebook
X_train_scale.to_csv('X_train_scale.csv')
X_test_scale.to_csv('X_test_scale.csv')
# Separate file names so the second feature set does not overwrite the first
X_train_scale2.to_csv('X_train_scale2.csv')
X_test_scale2.to_csv('X_test_scale2.csv')
y_train.to_csv('y_train.csv')
y_test.to_csv('y_test.csv')
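
When these files are read back, note that `to_csv` writes the row index as an unnamed first column by default. The sketch below uses an in-memory buffer and made-up values. A plain `pd.read_csv` turns that index into a column named `Unnamed: 0`, while `index_col=0` restores it as the index.

```python
import io
import pandas as pd

# Made-up scaled values with a non-default index, as left by a shuffled split
frame = pd.DataFrame({'Year': [0.0, 0.5], 'Schooling': [0.2, 0.9]}, index=[17, 42])

buffer = io.StringIO()
frame.to_csv(buffer)                 # index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)          # index comes back as a column 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # index restored as the row index

print(list(naive.columns))
print(list(restored.columns), list(restored.index))
```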