# Life Expectancy Predictor

# <u>Notebook Overview</u>

This notebook provides a predictive tool to estimate the **life expectancy** of a population based on WHO data from 2000–2015.

The notebook offers two Linear Regression prediction models:  
- **Simplified Model** – Uses four non-sensitive features  
- **Advanced Model** – Uses ten features  

---

## <u>How to Use</u>

1. Run all cells in order.  
2. When prompted, decide whether to use advanced population data (**Y/N**).  
3. Choose how to provide your data:  
   - **Manual input** – Enter data directly into the program.  
   - **CSV upload** – Provide a CSV file with the required features.  
4. The model will output a predicted life expectancy.  

---

## <u>Data Required</u>

### Simplified Model
This model takes the following features:

- Region  
- Year
- GDP_per_capita    
- Economy_status_Developed  

### Advanced Model
*(Uses sensitive demographic data)*  

This model requires all of the features from the **Simplified model**, plus:

- Schooling  
- Under_five_deaths  
- Adult_mortality   
- Hepatitis_B  
- HIV_AIDS  
- Income_Composition_of_Resources  

---

## <u>Requirements</u>

The models require the WHO Dataset which you can download from the link below:
-  [WHO Dataset Link](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who)

Once downloaded:
Save the file as **"Life Expectancy Data.csv"** in the same directory as this notebook.

This notebook requires the following Python libraries:  
- numpy  
- pandas  
- statsmodels  
- scikit-learn  

To install all required libraries, run the following command in a new code cell:

```python
!pip install numpy pandas statsmodels scikit-learn

```
---

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

import statsmodels.api as sm
import statsmodels.tools

In [2]:
# The main function that controls the flow of the program
def main():
    # Formating for visual clarity
    print("~"* 45)
    print("Welcome to the WHO Life Expectancy Predictor!")
    print("~"* 45)
    print("\n")

    # Loop to check whether the user consents to using advanced population data
    while True:
        print("Do you consent to using advanced population data, which may include protected information, for better accuracy? (Y/N)")
        while True:
            choice = input('Type here:')
            # Converts user input into lowercase and removes any extra spaces
            choice = choice.lower().strip()

             # Cathces any invalid input and re-prompts the user
            if choice != "y" and choice != "n":
                print('Invalid input: Please enter y or n')
            else:
                break
    
        # Asks the user how they would like to input their data        
        print("Would you like to upload a CSV file or enter data manually?")
        print("1. Manually input")
        print("2. CSV")
        while True:
            choice_input = input("Enter Choice:")
            # Validates the user's choice
            if choice_input == "1":
                break
            elif choice_input == "2":
                break
            else:
                print("Please enter 1 or 2")

        # If user consents, call the big_model() function  
        if choice == "y":
            # Handles the user input type
            if choice_input == "1":
                big_data = big_model()
            else:
                # Loads dataset from user's input
                file_path = input("Please enter the CSV file path: ").strip()
                try:
                    big_data = pd.read_csv(file_path)
                    # Aligns the columns with the training data
                    big_data = big_data.reindex(columns=X_train_advanced.columns)
                    # Prints the string if CSV was loaded correctly
                    print("Successfully loaded CSV")
                except:
                    print('Could not read CSV file')
                    print('Please manually input the data')
                    big_data = big_model()
                    
            # These are the features the advanced model uses for scaling
            float_cols = ['under_5_deaths_per_1000', 'adult_mortality_per_1000', 'hepatitis_b_immunization_%', 
                            'aids_deaths_per_1000',
                            'years_of_schooling']
            # Logs the data to reduce the skew
            big_data['bmi'] =  np.log(big_data['bmi'])
            big_data['gdp_per_capita'] =  np.log(big_data['gdp_per_capita'])
            
            # Initializes scaler
            scaler = RobustScaler()
            
            # Fits scaler on the required features
            scaler.fit(X_train[float_cols])
            
            # Scales users data
            big_data[float_cols] = scaler.transform(big_data[float_cols])

            # Reorders columns 
            big_data = big_data.reindex(columns=X_train_advanced.columns, fill_value=0)

            # Adds constant if there is one missing
            if 'const' in big_data.columns:
                big_data['const'] = 1
            else:
                big_data = sm.add_constant(big_data, has_constant='add')

            # Generates prdiction using the advanced model
            prediction = results_advanced.predict(big_data)
            print(f"Predicted average life expectancy: {prediction.values[0]:.2f} years")
            break
        # If user doesn't consent, call the small_model() function  
        elif choice == "n":
            # Handles the user input type
            if choice_input == "1":
                small_data = small_model()
            else:
                # Loads dataset from user's input
                file_path = input("Please enter the CSV file path: ").strip()
                try:
                    small_data = pd.read_csv(file_path)
                    # Prints the string if CSV was loaded correctly
                    print("Successfully loaded CSV")
                except:
                    print('Could not read CSV file')
                    print('Please manually input the data')
                    small_data = small_model()

            # Logs the 'gdp_per_capita' feature
            small_data['gdp_per_capita'] =  np.log(small_data['gdp_per_capita'])

            # Reorders columns 
            small_data = small_data.reindex(columns=X_train_simple.columns, fill_value=0)

            # Adds constant to dataframe if needed
            if 'const' in small_data.columns:
                small_data['const'] = 1
            else:
                small_data = sm.add_constant(small_data, has_constant='add')

            # Generates prdiction using the simple model
            prediction = results_simple.predict(small_data)
            print(f"Predicted average life expectancy: {prediction.values[0]:.2f} years")
            break
        # Cathces any invalid input and re-prompts the user
        else:
            print('Invalid Input. Please type Y or N')

In [3]:
# This function collects all the feature data needed for the advanced model
def big_model():
    # Captures user input for region
    print("Choose a region")
    print("1. Asia")
    print("2. Central America and Caribbean")
    print("3. European Union")
    print("4. Rest of Europe")
    print("5. Middle East")
    print("6. North America")
    print("7. South America")
    print("8. Africa")
    print("9. Oceania")
    Region = region_finder()      

    # Collecting the required features for the advanced model
    # Calling float_checker() for data validation
    under_5_deaths_per_1000 = float_checker("Enter under_5_deaths_per_1000: ")
    adult_mortality_per_1000 = float_checker("Enter adult_mortality_per_1000: ")
    hepatitis_b_immunization_percent = float_checker("Enter hepatitis_b_immunization_%: ")
    bmi = float_checker("Enter bmi: ")
    aids_deaths_per_1000 = float_checker("Enter aids_deaths_per_1000: ")
    gdp_per_capita = float_checker("Enter gdp_per_capita: ")
    years_of_schooling = float_checker("Enter years_of_schooling: ")


    # Loop to ensure valid input for Economy Status
    while True:
        Economy_status_Developed = input("Is Economy_status_Developed? (Y/N): ")
        Economy_status_Developed = Economy_status_Developed.lower().strip()
        if Economy_status_Developed == "n":
            developed_economy = 0
            break
        elif Economy_status_Developed == "y":
            developed_economy = 1
            break
        else:
            print('Invalid Input. Please type Y or N')

    # Creating a DataFrame with the collected user data
    big_data = pd.DataFrame([{
        'under_5_deaths_per_1000': under_5_deaths_per_1000,
        'adult_mortality_per_1000': adult_mortality_per_1000,
        'hepatitis_b_immunization_%': hepatitis_b_immunization_percent,
        'bmi': bmi,
        'aids_deaths_per_1000': aids_deaths_per_1000,
        'gdp_per_capita': gdp_per_capita,
        'years_of_schooling': years_of_schooling,
        'developed_economy': developed_economy,
        # Mapped regions to 0
        'region_Asia' : 0, 'region_Central America and Caribbean' : 0, 
        'region_European Union' : 0, 'region_Middle East' : 0,
        'region_North America' : 0,  'region_Oceania': 0,
        'region_Rest of Europe': 0, 'region_South America': 0, 
        'region_Africa' : 0
    }])

    big_data = region_finder2(Region, big_data)
    
    # Returns dataframe
    return big_data

In [4]:
# This function collects all the feature data needed for the smaller model
def small_model():
    # Captures user input for region
    print("Choose a region")
    print("1. Asia")
    print("2. Central America and Caribbean")
    print("3. European Union")
    print("4. Rest of Europe")
    print("5. Middle East")
    print("6. North America")
    print("7. South America")
    print("8. Africa")
    print("9. Oceania")
    Region = region_finder()      

    # Loop for capturing the year with a restriction of 2000-2015
    while True:   
        try:
            Year = int(input("Enter Year: (2001-2015)"))
            if (Year >= 2001) and (Year <= 2015):
                break
            else:
                print("Enter a year between 2001 and 2015 ")
        except:
            print('Please input a number')

    # Collecting the required features for the smaller model
    # Calling float_checker() for data validation
    gdp_per_capita = float_checker("Enter gdp_per_capita: ")

    # Loop to ensure valid input for Economy Status
    while True:
        developed_economy = input(" Is Economy_status_Developed? (Y/N): ")
        developed_economy = developed_economy.lower().strip()
        if developed_economy == "n":
            developed_economy = 0
            break
        elif developed_economy == "y":
            developed_economy = 1
            break
        else:
            print('Invalid Input. Please type Y or N')
            
    # Creating a DataFrame with the collected user data
    small_data = pd.DataFrame([{
        'gdp_per_capita': gdp_per_capita,
        'developed_economy': developed_economy,
        'region_Asia' : 0, 'region_Central America and Caribbean' : 0, 
        'region_European Union' : 0, 'region_Middle East' : 0,
        'region_North America' : 0,  'region_Oceania': 0,
        'region_Rest of Europe': 0, 'region_South America': 0,
        'region_Africa' : 0,
        'year_2001': 0, 'year_2002': 0, 'year_2003': 0, 'year_2004': 0, 'year_2005': 0, 
        'year_2006': 0, 'year_2007': 0, 'year_2008': 0, 'year_2009': 0, 'year_2010': 0, 
        'year_2011': 0, 'year_2012': 0, 'year_2013': 0, 'year_2014': 0, 'year_2015': 0
        }])
    # Year encoding
    small_data = year_finder(Year, small_data)

    # Region encoding
    small_data = region_finder2(Region, small_data)
    # Returns dataframe
    return small_data

In [5]:
# Function to check if the user's input is a float
def float_checker(feature):
    while True:
        try:
            check = float(input(feature))
            if check < 0:
                print("This feature requires a non-negative number")
                continue
            return check
        # Cathces any non-numeric input and re-prompts the user
        except:
            print("Incorrect input. Ensure you are inputting numbers")

In [6]:
# Function that returns the region
def region_finder():
    while True:
        Region = input("Enter Region: ")
        # Stripping input for data validation
        Region = Region.strip()
        if Region == "1":
            return 'Asia'
            break
        elif Region == "2":
            return 'Central America and Caribbean'
            break
        elif Region == "3":
            return 'European Union'
            break
        elif Region == "4":
            return 'Rest of Europe'
            break
        elif Region == "5":
            return 'Middle East'
            break
        elif Region == "6":
            return 'North America'
            break
        elif Region == "7":
            return 'South America'
            break
        elif Region == "8":
            return 'Africa'
            break
        elif Region == "9":
            return 'Oceania'
            break
        else:
            print('Please enter a number between 1-9')

In [7]:
# Function that finds the correct year 
def year_finder(Year, data):
    year_column = f"year_{Year}"
      # puts 1 in the inputted year
    if year_column in data.columns:
        data[year_column] = 1
    return data 

# Function that finds the correct region 
def region_finder2(Region, data):
    region_column = f"region_{Region}"
    # puts 1 in the inputted region
    if region_column in data.columns:
        data[region_column] = 1
    return data

# Modelling

In [8]:
# Import dataset and save to dataframe 'le' (Life Expectancy)
le = pd.read_csv("Life Expectancy Data.csv")

In [9]:
le.drop(columns = 'Country', inplace = True)
le.drop(columns = 'Economy_status_Developing', inplace = True) 


In [10]:
#Rename columns

le.rename(columns={'Region':'region',
                   'Year':'year',
                   'Infant_deaths':'infant_deaths_per_1000',
                   'Under_five_deaths':'under_5_deaths_per_1000',
                   'Adult_mortality':'adult_mortality_per_1000',
                   'Alcohol_consumption':'alcohol_consumption', 
                   'Hepatitis_B':'hepatitis_b_immunization_%',
                   'Measles' : 'measles_reported_per_1000', 
                   'Polio' : 'polio_immunization_%',
                   'Diphtheria' : 'diphtheria_immunization_%',
                   'Incidents_HIV' : 'aids_deaths_per_1000',
                   'Population_mln' : 'population_million',
                   'BMI' : 'bmi',
                   'GDP_per_capita' : 'gdp_per_capita',
                   'Thinness_ten_nineteen_years' : 'thinness_prevalence_10_19',
                   'Thinness_five_nine_years' : 'thinness_prevalence_5_9',
                   'Schooling':'years_of_schooling',
                   'Economy_status_Developed' : 'developed_economy',
                   'Economy_status_Developing' : 'developing_economy',
                   'Life_expectancy' : 'life_expectancy' }, 
                   inplace=True)   # set inplace = True to replace the original column names

In [11]:
# Define a function that performs the one-hot-encoding
def feature_eng(le):
    
    # Create a copy of the inputted dataframe
    df_local = le.copy()
    
    # Perform one-hot-encoding on the 'region' column
    # Drop the first column to prevent multicollinearity
    df_local = pd.get_dummies(df_local, columns=['region'], drop_first=True, dtype=int)
    
    # Repeat on the 'year' column
    df_local = pd.get_dummies(df_local, columns=['year'], drop_first=True, prefix='year', dtype=int)
    
    # Return the dataframe
    return df_local 

In [12]:
# Apply the feature engineering function to the dataframe
# And save as a new dataframe, le_fe
le_fe = feature_eng(le)

In [13]:
# The feature columns are all columns except 'life_expectancy'
# So we drop this column and save as a new dataframe, X
X = le_fe.drop('life_expectancy', axis=1)

# The target column in 'life_expectancy'
y = le_fe['life_expectancy']

# Create bin sizes for stratifying
y_strat_bin = pd.cut(y, bins=4, labels=False)

# Perform the split and assign to the corresponding variables
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2,        # set test_size to 20%
                                                    random_state = 100,     # fix the random state to a random number
                                                    stratify= y_strat_bin   # stratify the data
                                                   )

In [14]:
# Create a copy of the dataframe
X_train_rob = X_train.copy()

# Initialise the Robust Scaler
rob = RobustScaler()

# Define columns that need scaling (Non-Boolean columns)
rob_features = ['infant_deaths_per_1000',
                'under_5_deaths_per_1000',
                'adult_mortality_per_1000',
                'alcohol_consumption',
                'hepatitis_b_immunization_%',
                'measles_reported_per_1000',
                'polio_immunization_%',
                'diphtheria_immunization_%',
                'aids_deaths_per_1000',
                'population_million',
                'bmi',
                'gdp_per_capita',
                'thinness_prevalence_10_19',
                'thinness_prevalence_5_9',
                'years_of_schooling']

# Select these columns from the dataframe
# Then fit and transform these columns
X_train_rob[rob_features] = rob.fit_transform(X_train[rob_features])

In [15]:
# Drop the columns from train data
X_train_final = X_train_rob.drop(columns = ['under_5_deaths_per_1000',
                                            'diphtheria_immunization_%',
                                            'thinness_prevalence_10_19'])

# Drop the features that we deem to be sensitive from the train data
X_train_simple = X_train_final.drop(columns = ['alcohol_consumption',
                                               'hepatitis_b_immunization_%',
                                               'measles_reported_per_1000',
                                               'polio_immunization_%',
                                               'aids_deaths_per_1000',
                                               'bmi',
                                               'adult_mortality_per_1000',
                                               'thinness_prevalence_5_9', 
                                               'infant_deaths_per_1000',
                                              'population_million', 
                                              'years_of_schooling' ]
                                   )

In [16]:
# The feature columns are X_train_simple, and the target column is y_train
X_train_simple = sm.add_constant(X_train_simple)
# Add a constant to the train data to prepare for StatsModels method
X_train_final = sm.add_constant(X_train_final)
# Initialise the model
linreg_simple = sm.OLS(y_train, X_train_simple)

# Fit the model and save to 'results_simple'
results_simple = linreg_simple.fit()

In [17]:
# Drop from the train data
X_train_advanced = X_train_final.drop(columns = ['alcohol_consumption',
                                                 'measles_reported_per_1000', 
                                                 'polio_immunization_%',
                                                 'thinness_prevalence_5_9',
                                                 'population_million',
                                                 'infant_deaths_per_1000',
                                                 'year_2001', 'year_2002', 'year_2003', 'year_2004', 'year_2005', 
                                                 'year_2006', 'year_2007', 'year_2008', 'year_2009', 'year_2010', 
                                                 'year_2011', 'year_2012', 'year_2013', 'year_2014', 'year_2015'
                                                ]
                                     )

In [18]:
# Create the linear regression object
linreg_advanced = sm.OLS(y_train, X_train_advanced)

# Fit the model and save to 'results'
results_advanced = linreg_advanced.fit()
# Create a prediction of y using the model
y_pred_train_advanced = results_advanced.predict(X_train_advanced)

# Input here:

In [19]:
main()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Welcome to the WHO Life Expectancy Predictor!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Do you consent to using advanced population data, which may include protected information, for better accuracy? (Y/N)
Type here:Y
Would you like to upload a CSV file or enter data manually?
1. Manually input
2. CSV
Enter Choice:1
Choose a region
1. Asia
2. Central America and Caribbean
3. European Union
4. Rest of Europe
5. Middle East
6. North America
7. South America
8. Africa
9. Oceania
Enter Region: 5
Enter under_5_deaths_per_1000: 100
Enter adult_mortality_per_1000: 500
Enter hepatitis_b_immunization_%: 200
Enter bmi: 5
Enter aids_deaths_per_1000: 20
Enter gdp_per_capita: 505000
Enter years_of_schooling: 10
Is Economy_status_Developed? (Y/N): y
Predicted average life expectancy: 63.04 years


In [None]:
main()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Welcome to the WHO Life Expectancy Predictor!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Do you consent to using advanced population data, which may include protected information, for better accuracy? (Y/N)
