# Life Expectancy Models 📈


In [38]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split # To perform our train-test split
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm # For linear regression
import statsmodels.tools # For evaluation of our model

In [39]:
df = pd.read_csv('Life Expectancy Data.csv')

In [40]:
feature_cols = list(df.columns)
feature_cols.remove('Life_expectancy')
X = df[feature_cols]
y = df['Life_expectancy']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 104, stratify = X['Country'])
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [41]:
def scaling(df):
    ''' Return a copy of df with the numeric columns standardised (zero mean, unit variance) '''
    df = df.copy()
    scaled_col_names = ['Year', 'Infant_deaths', 'Under_five_deaths', 'Adult_mortality', 'Alcohol_consumption', 
                'Hepatitis_B', 'Measles', 'BMI', 'Polio', 'Diphtheria', 'Incidents_HIV', 'GDP_per_capita', 
                'Population_mln', 'Thinness_ten_nineteen_years', 'Thinness_five_nine_years', 'Schooling', 
                'Economy_status_Developed', 'Economy_status_Developing', 'GDP_per_capita_log', 'Incidents_HIV_log']
    features = df[scaled_col_names]
    scaler = StandardScaler().fit(features)
    scaled_features = scaler.transform(features)
    df[scaled_col_names] = scaled_features
    return df

In [42]:
def feature_eng(df):
    ''' Return a feature-engineered copy of df: log columns, scaling, region dummies, constant '''
    df = df.copy()  # Work on a copy so the caller's DataFrame is untouched

    # Log transformations of GDP per capita and HIV incidence
    # (the HIV column is negated, so larger values mean fewer incidents):
    df['GDP_per_capita_log'] = np.log(df['GDP_per_capita'])
    df['Incidents_HIV_log'] = -np.log(df['Incidents_HIV'])

    # Scale the data using a standard scaler:
    scaled_df = scaling(df)

    # One-hot encode Region (the first category is dropped and acts as the baseline).
    # dtype = float keeps the dummy columns numeric for statsmodels:
    scaled_df = pd.get_dummies(scaled_df, columns = ['Region'], drop_first = True, prefix = 'Region', dtype = float)
    #scaled_df = pd.get_dummies(scaled_df, columns = ['Country'], drop_first = True, prefix = 'Country')

    # Add the constant (statsmodels)
    scaled_df = sm.add_constant(scaled_df)

    # Return the feature engineered result
    return scaled_df

In [43]:
X_train_fe = feature_eng(X_train)
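
Note that `scaling` fits a fresh `StandardScaler` on whatever DataFrame it receives, so calling `feature_eng(X_test)` would standardise the test set with its own means and variances. A minimal sketch of one alternative, assuming hypothetical helpers `fit_scaler` and `scale_with` and toy data (none of these are part of the notebook): fit the scaler once on the training split and reuse it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def fit_scaler(train_df, cols):
    """Fit a StandardScaler on the training columns only (hypothetical helper)."""
    return StandardScaler().fit(train_df[cols])

def scale_with(scaler, df, cols):
    """Return a copy of df with cols transformed by an already-fitted scaler."""
    out = df.copy()
    out[cols] = scaler.transform(df[cols])
    return out

# Toy data standing in for the train and test splits (illustrative values)
toy_train = pd.DataFrame({'Schooling': [4.0, 8.0, 12.0]})
toy_test = pd.DataFrame({'Schooling': [8.0, 16.0]})

cols = ['Schooling']
scaler = fit_scaler(toy_train, cols)               # statistics come from train only
toy_train_scaled = scale_with(scaler, toy_train, cols)
toy_test_scaled = scale_with(scaler, toy_test, cols)  # test reuses the train statistics
```

With this approach a test value equal to the training mean maps to zero, which would not generally hold if the test set were scaled on its own.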


---
## <u>The Accurate Model</u>

We were tasked with producing an **elaborate model** for countries that have agreed to **share their sensitive data**. <br>
The aim was to produce the **most accurate model** able to **predict life expectancy** from several features relating to quality of life.<br>
These features include GDP, mortality rates, immunisation coverage and population. Although the data identifies each specific country, we excluded that feature first.

### Why Did We Remove Country?
This feature was found to be so powerful in predicting life expectancy that it overwhelms the rest of the dataset. <br>
Essentially, any prediction would depend on the country rather than on any of the other input features.

Several other features were excluded because we found them to be **insignificant** in predicting life expectancy while still **increasing the complexity** of the model. <br>
A feature was treated as significant when its p-value was **below 0.05**.
### Excluded Features:
* Alcohol consumption **(p = 0.099)**<br>
* Measles cases per 1000 **(p = 0.852)**<br>
* Diphtheria immunisation coverage % **(p = 0.132)**<br>
* Population **(p = 0.171)**<br>
* Thinness % aged 5-9 years **(p = 0.103)**<br>

Some features that we found to be **insignificant** were still included in the model to maintain **fairness**. <br>
### Notable Included Features:
* Middle Eastern Region **(p = 0.652)**<br>
* Rest of Europe Region **(p = 0.105)**
 
The alternatives were **removing only these two regions** or **removing all regions**. Neither was adopted: the other regions were found to have a strong impact on the model, so every region was kept and these two remain for fairness.
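
For context, `feature_eng` encodes `Region` with `drop_first = True`, so one region has no column of its own and acts as the baseline that every other region coefficient is compared against. A small sketch on a toy frame (the three region values here are only an illustration):

```python
import pandas as pd

# Toy frame with three regions; the alphabetically first one becomes the baseline
toy = pd.DataFrame({'Region': ['Africa', 'Asia', 'Middle East', 'Asia']})
encoded = pd.get_dummies(toy, columns=['Region'], drop_first=True, prefix='Region')

dummy_cols = list(encoded.columns)  # the baseline region has no column of its own
```

Dropping an individual region dummy from the model would therefore merge that region into the baseline group rather than remove its countries from the data.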

In [None]:
feature_cols = ['const', 'Year', 'Infant_deaths', 'Under_five_deaths',
           'Adult_mortality', 'Hepatitis_B', 'BMI', 'Polio', 'Incidents_HIV_log', 'GDP_per_capita_log',
           'Thinness_ten_nineteen_years', 'Schooling', 'Economy_status_Developed', 'Region_Asia',
           'Region_Central America and Caribbean', 'Region_European Union',
           'Region_Middle East', 'Region_North America', 'Region_Oceania',
           'Region_Rest of Europe', 'Region_South America'] # The features included in this linear regression

lin_reg = sm.OLS(y_train, X_train_fe[feature_cols]) # Define the model on the training data
results = lin_reg.fit() # Fit the model
results.summary() # Produce a summary of the statistics of the linear regression

**R-Squared = 0.984** <br>
This measures how well the model fits the data: 98.4% of the variation in life expectancy is explained by the model. <br>
**Condition Number = 33** <br>
This measures how sensitive the fitted coefficients are to small changes in the data. A value of 33 is low, so the model is stable. <br>
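
On a fitted statsmodels result these two figures are available as `results.rsquared` and `results.condition_number`. To make concrete what they measure, here is a NumPy-only sketch on synthetic data. It assumes the condition number is taken as the ratio of the largest to the smallest singular value of the design matrix, and every name in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Design matrix with a constant column, mirroring sm.add_constant
design = np.column_stack([np.ones(n), x1, x2])
target = 70.0 + 5.0 * x1 - 2.0 * x2 + rng.normal(scale=1.0, size=n)

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
fitted = design @ coef

# R-squared: share of the target's variation explained by the fit
ss_res = np.sum((target - fitted) ** 2)
ss_tot = np.sum((target - target.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Condition number: largest / smallest singular value of the design matrix
cond_number = np.linalg.cond(design)

# Adding a near-duplicate of x1 makes the columns almost collinear
collinear = np.column_stack([design, x1 + 1e-6 * rng.normal(size=n)])
cond_collinear = np.linalg.cond(collinear)
```

The near-duplicate column barely changes the predictions but inflates the condition number by several orders of magnitude, which is the kind of instability a low value rules out.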

---
## <u>The Ethical Model</u>

In this model, we were tasked with producing a **sensitive model** using the **least information necessary** to make a prediction for countries **not** willing to share their sensitive data. <br>
The aim was to identify the features capable of **accurately predicting life expectancy** while excluding features that could bring about **unwanted financial implications** for a country that shares them. <br>
Note that all features **previously excluded** are excluded here for the same reason, **except alcohol consumption**, which became significant in this model.

## <u>Excluded Features:</u>
#### **Child Death Statistics:**
* Infant Deaths per 1000
* Under Five Deaths per 1000

The number of child deaths per 1000 reflects a country's **access to healthcare** for vulnerable people. <br>
This is sensitive because exposing a high mortality rate would impose financial pressure on the country to improve healthcare.
#### **Immunisation Coverage and Disease Incidence:**
* Hepatitis B immunisation coverage %
* Polio immunisation coverage %
* HIV/AIDS deaths per 1000 aged 0-4

Immunisation coverage reflects the **healthcare infrastructure** for vaccinating children, <br>
while HIV/AIDS deaths relate to a country's **access to contraceptives and HIV awareness**. <br>
These are sensitive because they can put financial pressure on a country to fund its healthcare infrastructure.
#### **Weight Statistics:**
* Average BMI
* Thinness % aged 10-19

BMI and thinness can indicate the **levels of nutrition** across a country, with a low BMI and high thinness suggesting **malnutrition**. <br>
This is sensitive because a statistic indicating malnutrition puts financial pressure on a country to fund food provision and farming.
#### **Education Statistics:**
* Average Number of Years Spent Schooling

Schooling provides insight into **access to education** in a country. <br>
This is sensitive because revealing a low number of schooling years would create financial pressure to fund education.
#### **Geographical Data:**
* Region

Region was excluded from the model as it **introduced a strong bias**. <br>
The model became so heavily influenced by this feature that we deemed it important to exclude it.

## <u>Included Features:</u>
#### **Year:**

This carries **no implications** for a country's quality of life.
#### **Adult Mortality:**

A foundational death-related statistic, **necessary** for making any reasonable prediction of life expectancy. <br>
#### **Alcohol Consumption:**

Considered for exclusion on religious grounds; however, this feature carries no unwanted financial implications and actively improves the model. <br>
#### **GDP per Capita:**

A fundamental quality-of-life feature that, due to its **generality**, carries no implications about where finances should be allocated to improve it. <br>

In [45]:
feature_cols = ['const', 'Year', 'Adult_mortality', 'Alcohol_consumption', 'GDP_per_capita_log'] # The features included in this linear regression

lin_reg = sm.OLS(y_train, X_train_fe[feature_cols]) # Define the model on the training data
results = lin_reg.fit() # Fit the model
results.summary() # Produce a summary of the statistics of the linear regression

| | | | |
|---|---|---|---|
| Dep. Variable: | Life_expectancy | R-squared: | 0.944 |
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 9694.0 |
| Date: | Mon, 09 Dec 2024 | Prob (F-statistic): | 0.0 |
| Time: | 11:59:37 | Log-Likelihood: | -5070.7 |
| No. Observations: | 2291 | AIC: | 10150.0 |
| Df Residuals: | 2286 | BIC: | 10180.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 68.9270 | 0.046 | 1489.107 | 0.000 | 68.836 | 69.018 |
| Year | 0.3543 | 0.047 | 7.549 | 0.000 | 0.262 | 0.446 |
| Adult_mortality | -7.1839 | 0.065 | -110.289 | 0.000 | -7.312 | -7.056 |
| Alcohol_consumption | 0.8304 | 0.057 | 14.499 | 0.000 | 0.718 | 0.943 |
| GDP_per_capita_log | 2.0945 | 0.075 | 27.756 | 0.000 | 1.947 | 2.243 |

| | | | |
|---|---|---|---|
| Omnibus: | 241.768 | Durbin-Watson: | 1.944 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 621.995 |
| Skew: | -0.594 | Prob(JB): | 8.62e-136 |
| Kurtosis: | 5.26 | Cond. No. | 2.96 |


**R-Squared = 0.944** <br>
94.4% of the variation in our data is explained by the model. <br>
**Condition Number = 2.96** <br>
A condition number of 2.96 is extremely low, indicating a very stable model.
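
R-squared describes the fit in relative terms. An error in years, such as the root mean squared error, can be easier to interpret. A minimal sketch using `rmse` from `statsmodels.tools.eval_measures`, with made-up values standing in for actual and predicted life expectancy:

```python
import numpy as np
from statsmodels.tools.eval_measures import rmse

# Made-up actual and predicted life expectancies (years), for illustration only
toy_actual = np.array([70.0, 65.0, 80.0, 75.0])
toy_predicted = np.array([71.0, 64.0, 78.0, 75.0])

# Root mean squared error: typical size of a prediction error, in years
toy_rmse = rmse(toy_actual, toy_predicted)
```

With the notebook's own objects, the equivalent call would presumably be `rmse(y_train, results.predict(X_train_fe[feature_cols]))`; the same could be done for the test split once it has been passed through the feature engineering step.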