# WHO Life Expectancy Feature Engineering

The first step in feature engineering is to train-test split the dataframe. Splitting before any transformation keeps the test set out of every engineering decision, so it can give an honest estimate of how the model performs on future unseen data.

After this, we can apply feature engineering to the split dataframe where necessary. This involves feature scaling, standardising, or removal.

In [1]:
# Importing necessary packages
import pandas as pd # For general data use
import seaborn as sns # For data visualisation
import matplotlib.pyplot as plt # For data visualisation
import numpy as np # For mathematical operations
from sklearn.model_selection import train_test_split

In [2]:
# Creating dataframe from CSV file
df = pd.read_csv('Life Expectancy Data.csv')

In [3]:
# Features (all columns except 'Life_expectancy')
X = df.drop('Life_expectancy', axis=1)

# Target (the 'Life_expectancy' column)
y = df['Life_expectancy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
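A quick sanity check on the split can catch mistakes early. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the WHO data) to confirm that an 80/20 split produces the expected sizes, that no row lands in both sets, and that features and target stay aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009],
    "Life_expectancy": [60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0],
})

X_toy = toy.drop("Life_expectancy", axis=1)
y_toy = toy["Life_expectancy"]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# With test_size=0.2, 10 rows split into 8 training rows and 2 test rows
split_sizes = (len(X_tr), len(X_te))

# No row index should appear in both sets
overlap = set(X_tr.index) & set(X_te.index)

# Features and target must share the same row order
aligned = list(X_tr.index) == list(y_tr.index)
```

The same three checks could be run on `X_train`, `X_test`, `y_train` and `y_test` from the cell above.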

# Dropped Columns
We want to drop the `Economy_status_Developing` feature. As discovered during EDA, there are two binary features that carry the same information about economic status with opposite encodings. It is common practice in a binary feature for **1** to represent **Yes** and for **0** to represent **No**. Since `Economy_status_Developed` follows this convention, we chose to keep it in the dataframe and drop the other.

The features `Country` and `Region` were deemed to act as equivalent identifiers. Although both could be useful for life expectancy prediction, they could also be hindrances. A country may have had high life expectancy in earlier years, with changes in areas such as medicine or finance leading to lower life expectancy in later years. A model that associates life expectancy with the country itself may not recognise these later changes, leading to inaccurate predictions. Because of this, we chose to remove `Country` and `Region` from the dataframe.

We observed a high correlation of **98.6%** between `Infant_deaths` and `Under_five_deaths`. The latter feature can be seen as inclusive of the former, since under-fives include infants. Removing `Infant_deaths` avoids collinearity in later modelling.

In [4]:
# Function dropping all required features (returns a new dataframe,
# leaving the one passed in unchanged)
def drop_all(df):
    cols = ['Economy_status_Developing', 'Country', 'Region', 'Infant_deaths']
    return df.drop(columns=cols)

In [5]:
X_train_drop = drop_all(X_train)
X_test_drop = drop_all(X_test)
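The collinearity check behind the `Infant_deaths` decision can be sketched as a small helper that lists strongly correlated numeric pairs. The helper name `high_corr_pairs`, the 0.95 threshold and the toy values are assumptions made for illustration; they are not part of the original EDA.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.95):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Made-up values: the two death-rate columns move together, Year does not
toy = pd.DataFrame({
    "Infant_deaths": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Under_five_deaths": [13.0, 27.0, 38.0, 52.0, 66.0],
    "Year": [2003, 2001, 2004, 2000, 2002],
})

pairs = high_corr_pairs(toy)
```

On the training data, any pair returned by such a helper would be a candidate for dropping one of the two features, as was done here.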

# Scaling

Summary statistics of the dataset show a large difference in scale between features. These differences must be treated with care before modelling, to ensure that models are not biased or dominated by the features with the largest values.
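One common way to handle differing scales is standardisation, where the mean and standard deviation are computed on the training set only and then reused for the test set. This notebook does not apply it at this stage, so the sketch below is only an illustration; the helper names and the columns `feature_small` and `feature_large` are hypothetical.

```python
import pandas as pd

def fit_standardiser(train_df, columns):
    """Compute per-column mean and standard deviation from the training data only."""
    return train_df[columns].mean(), train_df[columns].std(ddof=0)

def apply_standardiser(df, columns, means, stds):
    """Return a copy of df with the given columns rescaled to (x - mean) / std."""
    out = df.copy()
    out[columns] = (out[columns] - means) / stds
    return out

# Made-up data with two features on very different scales
train = pd.DataFrame({"feature_small": [1.0, 2.0, 3.0],
                      "feature_large": [10.0, 200.0, 1000.0]})
test = pd.DataFrame({"feature_small": [2.0], "feature_large": [200.0]})

cols = ["feature_small", "feature_large"]
means, stds = fit_standardiser(train, cols)

train_scaled = apply_standardiser(train, cols, means, stds)
test_scaled = apply_standardiser(test, cols, means, stds)  # reuses training statistics
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.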

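We observed various features with skewed distributions. Linear regression tends to perform best when features relate linearly to the target and the residuals are roughly normally distributed, and heavily skewed features often violate this. Therefore, skewed features should be transformed prior to modelling.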

The `GDP_per_capita` feature displayed a logarithmic relationship with `Life_expectancy`. To address this, we created a new feature, `GDP_per_capita_log`, holding the natural log of the original values. This presented a more linear relationship with the target. We dropped the original `GDP_per_capita` feature following this.

In [6]:
# Function to log-transform GDP (returns a new dataframe)
def log_GDP(df):
    df = df.copy()
    df['GDP_per_capita_log'] = np.log(df['GDP_per_capita'])
    return df.drop(columns=['GDP_per_capita'])

In [7]:
# Apply function to X_train_drop and X_test_drop
X_train_log = log_GDP(X_train_drop)
X_test_log = log_GDP(X_test_drop)
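The effect of the log transform on linearity can be illustrated with synthetic data. In the sketch below the target is logarithmic in GDP by construction, so the numbers are made up and only show the mechanism: the Pearson correlation with the target is higher for the logged feature than for the raw one.

```python
import numpy as np
import pandas as pd

# Made-up GDP values spanning two orders of magnitude
gdp = np.array([500.0, 1000.0, 5000.0, 10000.0, 50000.0])

# Synthetic target, logarithmic in GDP by construction
life = 40.0 + 4.0 * np.log(gdp)

toy = pd.DataFrame({"GDP_per_capita": gdp, "Life_expectancy": life})
toy["GDP_per_capita_log"] = np.log(toy["GDP_per_capita"])

# Pearson correlation measures the strength of the *linear* relationship
corr_raw = toy["GDP_per_capita"].corr(toy["Life_expectancy"])
corr_log = toy["GDP_per_capita_log"].corr(toy["Life_expectancy"])
```

Running the same comparison on `X_train` and `y_train` before the original column is dropped would show how much the real relationship improves.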

During EDA we observed a strong linear relationship between `Year` and `Life_expectancy`. `Year` ranges from 2000 to 2015, so its raw values are large compared with most other features. To address this, we subtract **2000** from each `Year` value. This brings the range to 0-15, which is lower and therefore less likely to create bias during modelling.

In [8]:
# Function to reduce `Year` scale (returns a new dataframe)
def year(df):
    df = df.copy()
    df['Year'] = df['Year'] - 2000
    return df

In [10]:
# Apply function to X_train_log and X_test_log
X_train_fe = year(X_train_log)
X_test_fe = year(X_test_log)
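The three steps (dropping columns, logging GDP, shifting `Year`) can also be chained into a single function so that the training and test sets are guaranteed to receive identical treatment. This is a sketch of one possible arrangement using `DataFrame.pipe`; the function names `drop_cols`, `add_log_gdp`, `shift_year` and `engineer` are hypothetical, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

DROP_COLS = ["Economy_status_Developing", "Country", "Region", "Infant_deaths"]

def drop_cols(df):
    """Return a new dataframe without the columns listed in DROP_COLS."""
    return df.drop(columns=DROP_COLS)

def add_log_gdp(df):
    """Replace GDP_per_capita with its natural log, without mutating the input."""
    out = df.copy()
    out["GDP_per_capita_log"] = np.log(out["GDP_per_capita"])
    return out.drop(columns=["GDP_per_capita"])

def shift_year(df, base_year=2000):
    """Express Year as the number of years since base_year."""
    out = df.copy()
    out["Year"] = out["Year"] - base_year
    return out

def engineer(df):
    """Apply all three feature engineering steps in order."""
    return df.pipe(drop_cols).pipe(add_log_gdp).pipe(shift_year)

# Made-up rows with the columns the steps rely on
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Region": ["R1", "R2"],
    "Year": [2000, 2015],
    "Infant_deaths": [10.0, 5.0],
    "Under_five_deaths": [12.0, 6.0],
    "GDP_per_capita": [1000.0, 20000.0],
    "Economy_status_Developed": [0, 1],
    "Economy_status_Developing": [1, 0],
})

result = engineer(toy)
```

Because every step returns a new dataframe, the input passed to `engineer` is left unchanged.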

In [11]:
# Saving the engineered features and targets to CSV
X_train_fe.to_csv('X_train_fe.csv')
X_test_fe.to_csv('X_test_fe.csv')
y_train.to_csv('y_train.csv')
y_test.to_csv('y_test.csv')
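Since `to_csv` is called without `index=False`, the row index from the split is written as the first column of each file. When reloading, passing `index_col=0` to `pd.read_csv` restores it. The sketch below shows the round trip with an in-memory buffer and made-up values in place of a real file.

```python
import io
import pandas as pd

# Made-up frame with a shuffled index, as produced by a train-test split
X_toy = pd.DataFrame({"Year": [3, 7]}, index=[41, 12])

buffer = io.StringIO()
X_toy.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)
```

Keeping the index this way lets the feature and target files be matched row for row after reloading.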