# Life Expectancy (WHO)

## Problem: 
Which area(s) should be prioritized in order to improve the life expectancy of a population efficiently?

## Business Understanding
Goal: To improve the life expectancy

Objective: To find the factor(s) that contribute to lower life expectancy

## Analytic Approach: 
Predictive model

## Data Requirement
Dataset related to life expectancy (health, economic, social and other factors affecting the life expectancy) for different countries 

## Data Collection
https://www.kaggle.com/kumarajarshi/life-expectancy-who
The data was collected from the WHO and United Nations websites for the years 2000-2015 and covers 193 countries. It is intended for analysing the factors that actually affect life expectancy.


In [None]:
# Libraries Import
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
# from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt 
from scipy.stats import pearsonr
from scipy.stats.mstats import winsorize
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error 


In [None]:
# Dataset Import
df = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv')

## Data Understanding

In [None]:
rows, cols = df.shape
print("Records:", rows)

In [None]:
df.head(rows)                               # view all data

In [None]:
print('\nLife Expectancy Factors:', cols, list(df.columns))
print('\nNumber of Countries:', len(df['Country'].unique()))
print('\nPeriod: %s - %s'%(min(df['Year']), max(df['Year'])))

In [None]:
df.info()                                   # dtypes and other info

In [None]:
df.apply(lambda x: len(x.unique()))         # unique values in dataset

In [None]:
df[df.duplicated()]                         # duplicate rows (empty if there are none)

In [None]:
# Missing Values Count & Percentage
null_vals = df.isna().sum().reset_index()
null_vals.columns = ['Factors', 'Missing Values']
null_vals["Missing %"] = round(null_vals['Missing Values']/rows*100, 2)
null_vals[ null_vals['Missing %'] > 0 ]

### Observations
Records: 2938

Period: 2000 - 2015

Countries: 193

Duplicate rows: 0

Duplicate Columns: 0

Life expectancy factors = 22 = (Country, Year, Status, Life expectancy, Adult Mortality, infant deaths, Alcohol, percentage expenditure, Hepatitis B, Measles, BMI, under-five deaths, Polio, Total expenditure, Diphtheria, HIV/AIDS, GDP, Population, thinness 1-19 years, thinness 5-9 years, Income composition of resources, Schooling)

Status: country status according to WHO standards, Developed or Developing

Life expectancy: life expectancy in years

Adult Mortality: probability of dying between 15 and 60 years per 1000 population

Infant deaths: infant deaths per 1000 population

Alcohol: alcohol consumption rate per capita (15+), measured as liters 

Percentage expenditure: expenditure on health as a percentage of GDP per capita(%)

Hepatitis B: HepB immunization coverage among 1-year-olds (%)

Measles: number of reported cases per 1000 population

BMI: average Body Mass Index of entire population

Under-five deaths: Number of under-five deaths per 1000 population

Polio: Pol3 immunization coverage among 1 year olds (%)

Total expenditure: government expenditure on health as a percentage of total government expenditure (%)

Diphtheria: diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

HIV/AIDS: deaths per 1000 live births HIV/AIDS (0-4 years)

GDP: Gross Domestic Product per capita (in USD)

Thinness 1-19 years: rate of thinness among people aged 10-19 

Thinness 5-9 years: rate of thinness among people aged 5-9

Income composition of resources: Human Development Index in terms of income composition of resources

Schooling: average number of years of schooling of a population



Thinness 1-19 years should be renamed to Thinness 10-19 years as it represents thinness in people aged 10-19

Polio, Diphtheria, Hepatitis B and Alcohol should be renamed to Pol3 Vaccination %, Diphtheria Vaccination %, HepB Vaccination % and Alcohol Intake(L) respectively, to represent the variables more accurately

Column names with stray leading or trailing spaces: Life expectancy, Measles, BMI, under-five deaths, HIV/AIDS, thinness 1-19, thinness 5-9 years, Diphtheria

Column names with inconsistent capitalization: Life expectancy, under-five deaths, thinness 1-19, thinness 5-9 years, infant deaths, percentage expenditure, Total expenditure

Column dtypes to review: Country(object), Year(int64), Status(object)

Columns with missing values: Life expectancy(10), Adult Mortality(10), Alcohol Intake(194), HepB Vaccination %(553), BMI(34), Pol3 Vaccination %(19), Total expenditure(226), Diphtheria Vaccination %(19), GDP(448), Population(652), thinness 10-19 years(34), thinness 5-9 years(34), Income composition of resources(167), Schooling(163)

Data isn't normalized

## Data Preprocessing

In [None]:
# Space and Case Correction
df.rename(columns={'Life expectancy ': 'Life Expectancy',
                   'infant deaths': 'Infant Deaths',
                   'percentage expenditure': 'Percentage Expenditure',
                   'Measles ': 'Measles',
                   ' BMI ': 'BMI',
                   'under-five deaths ': 'Under Five Deaths',
                   'Diphtheria ': 'Diphtheria Vaccination %',
                   ' HIV/AIDS': 'HIV/AIDS',
                   ' thinness  1-19 years': 'Thinness 10-19 years',
                   ' thinness 5-9 years': 'Thinness 5-9 years',
                   'Income composition of resources': 'Resources Income Composition',
                   'Total expenditure': 'Total Expenditure',
                   'Polio': 'Pol3 Vaccination %',
                   'Hepatitis B': 'HepB Vaccination %', 
                   'Alcohol': 'Alcohol Intake(L)'
                  },inplace=True)
df.columns

In [None]:
# Data Formatting
df['Country'] = df['Country'].astype('string')
df['Status'] = df['Status'].astype('string')
df.dtypes

In [None]:
# Dealing with Missing Values (replacing with mean value for the year)
null_col = ('Life Expectancy', 'Adult Mortality', 'Alcohol Intake(L)', 'HepB Vaccination %', 'BMI', 'Pol3 Vaccination %', 'Total Expenditure', 'Diphtheria Vaccination %', 'GDP', 'Population', 'Thinness 10-19 years', 'Thinness 5-9 years', 'Resources Income Composition', 'Schooling')
data_valid = []
for year in list(df.Year.unique()):
    year_data = df[df.Year == year].copy()
    for col in null_col:
        year_data[col] = year_data[col].fillna(year_data[col].mean())   # mean() already skips NaN
    data_valid.append(year_data)
df = pd.concat(data_valid).copy()
df.isnull().sum(axis = 0)
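
The year-wise mean imputation above can also be written without an explicit loop. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions for illustration, not taken from the dataset). It uses `groupby(...).transform('mean')`, which keeps the original row order.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two years, one missing GDP value in each
toy = pd.DataFrame({
    'Year': [2000, 2000, 2000, 2001, 2001, 2001],
    'GDP':  [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# Mean of each year, broadcast back to every row of that year (NaN is skipped)
year_mean = toy.groupby('Year')['GDP'].transform('mean')

# Fill each gap with the mean of its own year
toy['GDP'] = toy['GDP'].fillna(year_mean)

print(toy['GDP'].tolist())   # [1.0, 2.0, 3.0, 10.0, 20.0, 15.0]
```

If every value of a column is missing for a given year, the year mean is itself NaN and the gap stays unfilled, so the null count is still worth checking afterwards.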

In [None]:
# Label Encoding (transform non-numerical labels to numerical labels)
df['Status'] = LabelEncoder().fit_transform(df['Status'])
df['Status']
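
To make the meaning of the encoded `Status` values explicit, here is a small sketch on a hand-written list (the list itself is illustrative). `LabelEncoder` sorts the class labels, so the alphabetically first label receives code 0.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['Developing', 'Developed', 'Developing'])

# classes_ holds the labels in sorted order; the position is the numeric code
print(encoder.classes_.tolist())  # ['Developed', 'Developing']
print(codes.tolist())             # [1, 0, 1]
```

With this mapping, 0 stands for Developed and 1 for Developing.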

In [None]:
# Data Normalization/ Feature Scaling
#df_scale = df.drop(['Country', 'Status', 'Year'], axis='columns')
#df_scale = MinMaxScaler().fit_transform(df_scale)
for col in df.columns:
    if col not in ('Country', 'Status', 'Year'):
        df[col] = df[col] / df[col].max()
df
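
The scaling above divides each column by its maximum, which is not the same as the min-max scaling hinted at in the commented-out lines. A short sketch on three made-up values shows the difference; the numbers are assumptions chosen for easy arithmetic.

```python
import pandas as pd

values = pd.Series([20.0, 50.0, 80.0])

# Max scaling (the approach used in this notebook): largest value maps to 1
max_scaled = values / values.max()

# Min-max scaling: smallest value maps to 0, largest to 1
min_max = (values - values.min()) / (values.max() - values.min())

print(max_scaled.tolist())  # [0.25, 0.625, 1.0]
print(min_max.tolist())     # [0.0, 0.5, 1.0]
```

Max scaling keeps the ratio between values, so a scaled minimum close to 0 means the raw minimum is tiny relative to the maximum.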

In [None]:
# Rounding Float Values to 4 Decimals
for col in df:
    if df[col].dtype in (np.int64, np.float64):
        df[col] = round(df[col], 4)
df

## Exploratory Data Analysis

In [None]:
df.describe()                              # Statistical Info

In [None]:
# categorical columns & numerical columns
categ_cols = ['Country', 'Status']
numeric_cols = [] 
for i in df.columns:
    if i != 'Status' and df[i].dtype in (np.float64, np.int64):
        numeric_cols.append(i)

In [None]:
# Data Distribution
plt.rcParams.update({'figure.max_open_warning': 0})
for col in categ_cols:
    sns.countplot(x=col, data=df, dodge=True, palette="Set3")
    plt.title('%s Data Distribution'%col)
    plt.show()                              # show inside the loop so the plots are not drawn on the same axes
for col in numeric_cols:
    sns.displot(x=df[col])                  # displot opens a new figure for each column
    plt.title('%s Data Distribution'%col)
plt.show()

In [None]:
# Outliers 
plt.figure(figsize=(20,30))
for i,col in enumerate(numeric_cols, 1):
    plt.subplot(5, 4, i)
    plt.boxplot(df[col])
    plt.title(col)
plt.show()

In [None]:
# Outliers Lower & Upper Bound Percentage
percent_low = []
percent_high = []
for col in numeric_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    low = q1 - (iqr*1.5)
    high = q3 + (iqr*1.5)
    count_l = len(np.where(df[col] < low)[0])
    count_h = len(np.where(df[col] > high)[0])
    percent_low.append(round(count_l/len(df[col])*100, 2))
    percent_high.append(round(count_h/len(df[col])*100, 2))
outlier_table = pd.DataFrame({'Factor':numeric_cols, 'Lower Bound %':percent_low, 'Upper Bound %':percent_high})
outlier_table
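
The table above relies on the 1.5 × IQR rule. As a worked example, the sketch below applies the same rule to a small made-up series (the values are illustrative only), where only one value falls outside the bounds.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 3.0 and 7.0
iqr = q3 - q1                                 # 4.0
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # -3.0 and 13.0

# Values outside [low, high] are flagged as outliers
outliers = s[(s < low) | (s > high)].tolist()
print(low, high, outliers)                    # -3.0 13.0 [100]
```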

In [None]:
# Handling Outliers as the CORRELATION COEFFICIENT is highly sensitive to outliers

In [None]:
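# Effect of Outlier Trimming (remove outlier), using per-column IQR bounds
# (the scalar low/high left over from the loop above only hold the last column's bounds)
q1, q3 = df[numeric_cols].quantile(0.25), df[numeric_cols].quantile(0.75)
lows, highs = q1 - 1.5*(q3 - q1), q3 + 1.5*(q3 - q1)
df_test = df.drop(['Country'], axis='columns')
df_test = df_test[~((df[numeric_cols] < lows) | (df[numeric_cols] > highs)).any(axis=1)]
df_test.shape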

In [None]:
# Winsorization (same lower/upper limits for every column: the largest outlier share found above)
for col in numeric_cols:
    df[col] = winsorize(df[col], (max(percent_low)/100, max(percent_high)/100))
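
Winsorization replaces the most extreme values with the nearest remaining value instead of dropping rows. A minimal sketch on a made-up array (values chosen for illustration), with 10% clipped at each end:

```python
import numpy as np
from scipy.stats.mstats import winsorize

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=(0.1, 0.1): the lowest 10% and highest 10% of values are replaced
# by the nearest value that is kept
clipped = winsorize(a, limits=(0.1, 0.1))

print(clipped.tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(len(clipped))      # 10
```

The number of records is unchanged, which is the main reason to prefer this over trimming when many rows contain at least one extreme value.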
 

In [None]:
# Life Expectancy Correlation with Other Factors wrt Status
for col in df.columns:
    if col not in ('Country', 'Life Expectancy', 'Status'):      # Country is a text label, not a numeric factor
        plt.scatter(x=df[col], y=df['Life Expectancy'], c=df['Status'])
        plt.title('Life Expectancy Correlation with %s wrt Status'%col)
        plt.show()


In [None]:
# Correlation Matrix
plt.figure(figsize=(20,15))
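# Country is a text column; drop it so corr() only sees numeric data (required in pandas >= 2.0)
sns.heatmap(df.drop(columns='Country').corr(), square=True, annot=True, linewidths=.5, cmap="Blues")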
plt.title("Correlation matrix among variables")
plt.show()

In [None]:
# Correlation Statistics
p_coef, p_val, rel, res = [], [], [], []
y = df['Life Expectancy']
for col in numeric_cols:
    #if col == 'Life Expectancy': continue
    coef, val = pearsonr(df[col], y)
    p_coef.append(coef)
    p_val.append(val)
    if coef > 0:
        if coef > 0.5: rel.append('Strong Positive')
        else: rel.append('Weak Positive')
    elif coef < 0:
        if coef < -0.5: rel.append('Strong Negative')
        else: rel.append('Weak Negative')
    else:
        rel.append('Nil')
    if val < 0.001:
        res.append('Strong')
    elif val < 0.05:
        res.append('Moderate')
    elif val < 0.1:
        res.append('Weak')
    else:
        res.append('Nil')
correlation_table = pd.DataFrame({'Factor':numeric_cols, 'Coefficient':p_coef, 'P-value':p_val, 'Relation':rel, 'Result Certainty':res}) 
correlation_table
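
For reference, the Pearson coefficient reported in the table is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up arrays (the values are assumptions for illustration) and compares the result with `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the mean
dx, dy = x - x.mean(), y - y.mean()

# r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)) = 6 / sqrt(10 * 6)
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pearsonr returns the coefficient and a two-sided p-value
r_scipy, p_value = pearsonr(x, y)

print(round(r_manual, 4), round(r_scipy, 4))  # 0.7746 0.7746
```

The p-value tests the hypothesis of no linear relationship. With a few thousand records even small coefficients can produce very small p-values, so the coefficient and the p-value should be read together.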

In [None]:
# Important Variables Extraction
x = df.drop(['Country', 'Year', 'Life Expectancy', 'Population', 'Total Expenditure', 'Alcohol Intake(L)'], axis='columns')

### Observations

There are outliers in the data, as the summary statistics for some columns don't make sense:

Infant Deaths min is 0 per 1000, but Under Five Deaths is not 0

BMI min is 0.01 and max is 1 on the scaled data (raw values lower than 10 and greater than 80)

GDP per capita min is 0

Population min is 0

Infant Deaths of 0 are recorded


Columns with outlier % greater than 5: Infant Deaths(11.1), Percentage Expenditure(13.2), HepB Vaccination %(8.9), Measles(18.4), Under Five Deaths(13.5), Pol3 Vaccination %(9.5), HIV/AIDS(18.4), Diphtheria Vaccination %(10.1), GDP(12.4), Population(10)


Outlier trimming discards too many records to be practical

Winsorization was used for outlier handling instead


Life Expectancy correlation with other factors:

strong positive: BMI, Schooling, Resources Income Composition, Diphtheria Vaccination %, Pol3 Vaccination %

strong negative: Adult Mortality, Thinness 5-9 years, HIV/AIDS, Under Five Deaths, Infant Deaths

weak negative:   HepB Vaccination %, Measles, Thinness 10-19 years, Status (negative because status value 0 = developed country and 1 = developing)

weak positive:   GDP, Percentage Expenditure, HepB Vaccination %, Alcohol Intake (positive because more alcohol is consumed in developed countries, and developed countries have higher life expectancy)

negligible:      Year, Total Expenditure, Population 



Result certainty is strong for all factors except Population

Important Factors: BMI, Schooling, Resources Income Composition, Diphtheria Vaccination %, Pol3 Vaccination %, Adult Mortality, Thinness 5-9 years, HIV/AIDS, Under Five Deaths, Infant Deaths

Unimportant Factors: Country, Year, Total Expenditure, Population, Alcohol Intake


## Model Development 

In [None]:
# Dataset Splitting into Training & Test Sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

In [None]:
# Model Development
lm = LinearRegression().fit(x_train, y_train)
pred_train = lm.predict(x_train)
pred_test = lm.predict(x_test)
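
Since the objective is to find the factors that matter most, the fitted coefficients can be inspected once the model is trained. The following sketch uses synthetic data with known weights (the columns `a` and `b` and the weights 3 and -1 are assumptions for illustration, not results from this dataset). It ranks features by absolute coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: target depends on 'a' three times as strongly as on 'b'
rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.random(200), 'b': rng.random(200)})
target = 3.0 * X['a'] - 1.0 * X['b']

model = LinearRegression().fit(X, target)

# Pair each coefficient with its feature name and sort by magnitude
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs.round(3))
```

Coefficients are only comparable in this way when the features share a similar scale, which is one reason the scaling step matters. Strongly correlated features can also split or swap weight between them, so the ranking is a hint rather than proof of importance.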

## Model Evaluation

In [None]:
# Mean Squared Error & Mean Absolute Error
mse_train = round(mean_squared_error(y_train, pred_train), 4)
mse_test = round(mean_squared_error(y_test, pred_test), 4)
mae_train = round(mean_absolute_error(y_train, pred_train), 4)
mae_test = round(mean_absolute_error(y_test, pred_test), 4)
print('Mean Squared Error Training: %s\nMean Absolute Error Training: %s\nMean Squared Error Testing: %s\nMean Absolute Error Testing: %s'
      %(mse_train, mae_train, mse_test, mae_test))
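
As a worked example of the two metrics, the sketch below evaluates three made-up predictions by hand. It also computes R² via `r2_score`, which is not used in this notebook and is shown only as an optional extra metric.

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]     # errors: -1, 0, +2

mse = mean_squared_error(y_true, y_pred)    # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_true, y_pred)   # (1 + 0 + 2) / 3
r2 = r2_score(y_true, y_pred)               # 1 - 5/8, since SS_res = 5 and SS_tot = 8

print(round(mse, 4), round(mae, 4), round(r2, 4))  # 1.6667 1.0 0.375
```

Because the target was divided by its maximum during preprocessing, the errors reported above are in scaled units rather than years.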

## Conclusion

To improve the Life Expectancy:
    
Polio, Hepatitis B and Diphtheria vaccination coverage should be increased

Measures should be taken to ensure food security

Measures should be taken to provide education and reduce the risks of infant mortality

Resources should be utilized productively

AIDS awareness campaigns should be organized.
