# Life Expectancy (WHO)
* Statistical Analysis on factors influencing Life Expectancy



## Context

Many past studies have examined the factors affecting life expectancy using demographic variables, income composition and mortality rates. However, the effect of immunization and the human development index was not taken into account. In addition, some past research used multiple linear regression on a data set covering a single year for all countries. This motivates addressing both gaps by formulating a regression model based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors and other health-related factors. Since the observations in this dataset are organized by country, it is easier for a country to identify the predicting factors that contribute to a lower life expectancy. This helps suggest which areas a country should prioritize in order to efficiently improve the life expectancy of its population.


## Content

The project relies on the accuracy of the data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status and many other related factors for all countries. The data sets are made available to the public for the purpose of health data analysis. The data set on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the critical factors that are more representative were chosen.

It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, in this project we have considered data from 2000 to 2015 for 193 countries for further analysis. The individual data files were merged into a single data set.

Initial visual inspection of the data showed some missing values. As the data sets were from WHO, we found no evident errors. Missing data was handled in R using the missmap command. The result indicated that most of the missing data was for population, Hepatitis B and GDP. The missing data came from less known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all data for these countries was difficult, so it was decided to exclude these countries from the final model data set. The final merged file (final dataset) consists of 22 columns and 2938 rows, which means 20 predicting variables. All predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.
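
The missing-value inspection described above was done in R. As a rough Python counterpart, the sketch below counts missing values per column and per country with pandas. The frame `toy` and its values are hypothetical stand-ins for the merged data set, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the merged data set.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "GDP": [1200.0, np.nan, 560.0, np.nan],
    "Hepatitis B": [90.0, 92.0, np.nan, 85.0],
    "Population": [np.nan, 3.1e6, 8.0e5, np.nan],
})

# Number of missing values in each column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Number of missing cells per country, to spot countries with sparse records.
missing_per_country = (
    toy.drop(columns="Country").isna().groupby(toy["Country"]).sum().sum(axis=1)
)
print(missing_per_country)
```

Countries with many missing cells are the candidates for exclusion that the paragraph above mentions.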


## Acknowledgements

The data was collected from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.


## Inspiration

The data-set aims to answer the following key questions:

   - Do the various predicting factors that were chosen initially really affect life expectancy? Which predicting variables actually affect life expectancy?
   - Should a country with a lower life expectancy value (<65) increase its healthcare expenditure in order to improve its average lifespan?
   - How do infant and adult mortality rates affect life expectancy?
   - Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?
   - What is the impact of schooling on the lifespan of humans?
   - Does life expectancy have a positive or negative relationship with drinking alcohol?
   - Do densely populated countries tend to have lower life expectancy?
   - What is the impact of immunization coverage on life expectancy?


### Note: This study contains only simple visualizations; keep in mind that it is aimed at a basic level!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from sklearn.preprocessing import scale 
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score                # used for model evaluation and tuning

from warnings import filterwarnings
filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
Life_Expectancy_Data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")
data = Life_Expectancy_Data.copy()
data = data.dropna()            # Delete rows with missing or empty observations. Alternatively, 'data.fillna(data.mean(numeric_only=True), inplace=True)' replaces NaN values with column means.

lindata = data.copy()
multidata = data.copy()
polydata = data.copy()
RFdata = data.copy()
logdata = data.copy()
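
The loading cell above drops every row that has a missing value and mentions mean imputation as an alternative. The sketch below contrasts the two options on a hypothetical three-row frame (`toy` and its values are made up for illustration), so the trade-off between losing rows and inventing values is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns.
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing"],
    "GDP": [500.0, np.nan, 1500.0],
    "Schooling": [10.0, 15.0, np.nan],
})

# Option 1: keep only fully observed rows.
dropped = toy.dropna()

# Option 2: replace numeric NaNs with the column mean (text columns are untouched).
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(dropped), len(filled))
print(filled)
```

Dropping keeps only real observations but shrinks the sample; filling keeps every row but pulls the imputed values toward the mean.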

# Linear Regression

In [None]:
lindata.info()

In [None]:
lindata.head()

In [None]:
lindata.corr(numeric_only=True)     # correlations of numeric columns only (Country and Status are text)

Looking at the heatmap, the strongest relationship (correlation) in the Life Expectancy data is between 'GDP' and 'percentage expenditure'.


In [None]:
# plot the heatmap of correlations between numeric columns
corr = lindata.corr(numeric_only=True)
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)



Here it makes sense to fit a linear model between 'GDP' and 'percentage expenditure'. Let's see how 'percentage expenditure' changes as 'GDP' increases. Let's create and fit our linear model.

In [None]:
linear_reg = LinearRegression()
x = lindata.GDP.values.reshape(-1,1)
y = lindata['percentage expenditure'].values.reshape(-1,1)          

linear_reg.fit(x,y)

## y = b0 + b1*x is our linear regression model.
Let's see the estimated 'percentage expenditure' for a GDP of 10 thousand:

In [None]:
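b0 = linear_reg.intercept_                 # b0 is the intercept of the fitted line
print("b0: ", b0)

b1 = linear_reg.coef_                      # b1 is the slope
print("b1: ", b1)

print("Prediction at GDP = 10000: ", linear_reg.predict([[10000]]))   # equals b0 + b1*10000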
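
To make the roles of `b0` and `b1` in `y = b0 + b1*x` concrete, here is a minimal sketch that computes both from the closed-form least-squares formulas and compares them with `LinearRegression`. The toy data is hypothetical and lies exactly on the line `y = 2 + 0.5*x`, so the recovered values are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 2 + 0.5 * x.
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
y = 2.0 + 0.5 * x

# Closed-form least-squares estimates of slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The same fit with scikit-learn, for comparison.
reg = LinearRegression().fit(x.reshape(-1, 1), y)

print("closed form:", b0, b1)
print("sklearn:    ", reg.intercept_, reg.coef_[0])
```

A prediction at any new `x` is then `b0 + b1*x`, which is what `predict` returns.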

In [None]:
x_array = np.arange(min(lindata.GDP),max(lindata.GDP)).reshape(-1,1)  # GDP grid used to draw the fitted line

plt.scatter(x,y)
y_head = linear_reg.predict(x_array)                                 # predicted percentage expenditure on the grid
plt.plot(x_array,y_head,color="red")
plt.show()

from sklearn import metrics
y_fit = linear_reg.predict(x)                                        # errors compare observed y with predictions at the same x
print("Mean Absolute Error: ", metrics.mean_absolute_error(y,y_fit))
print("Mean Squared Error: ", metrics.mean_squared_error(y,y_fit))
print("Root Mean Squared Error: ", np.sqrt(metrics.mean_squared_error(y, y_fit)))



In [None]:
print(r2_score(y, linear_reg.predict(x)))

#### The conclusion here is: the model explains about 92% of the variance in 'percentage expenditure' ($r^2 \approx 0.92$).
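
The value returned by `r2_score` is the coefficient of determination, $r^2 = 1 - SS_{res}/SS_{tot}$, which measures explained variance rather than classification accuracy. The sketch below computes it by hand on hypothetical numbers and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```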

# Multi Linear Regression

* Here, let's look at how a dependent variable from the Life Expectancy data depends on more than one independent variable.
* Rows with missing or empty observations are deleted. Alternatively, 'data.fillna(data.mean(numeric_only=True), inplace=True)' replaces NaN values with column means.
* When we look at the data, the Country and Status columns are of type object. We drop them because the model needs int or float values.
* Let's take the last two columns (Income composition of resources, Schooling) as independent variables.

In [None]:
Life_Expectancy_Data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")
data = Life_Expectancy_Data.copy()
data = data.dropna()

multidata = data.copy()

multidata.drop(["Country", "Status"], axis=1, inplace=True)             # Country and Status are object columns; the model needs int or float values.

x = multidata.iloc[:, [-2,-1]].values                                   # last two columns (Income composition of resources, Schooling) as independent variables
y = multidata["percentage expenditure"].values.reshape(-1,1)            # our dependent variable


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state= 42)
lm = LinearRegression()
model = lm.fit(x_train,y_train)

In [None]:
print("b0: ", lm.intercept_)
print("b1,b2: ", lm.coef_)

Let's see what the model predicts for a few new observations that we create here.

In [None]:
new_data = [[0.4,8], [0.5,10]]            # each row: [Income composition of resources, Schooling]
new_data = np.array(new_data)             # rows are already observations, so no transpose is needed
model.predict(new_data) 

### Now let's check the quality of the model. If the train error and the test error are close, the model generalizes well.

In [None]:
rmse = np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
rmse

In [None]:
model.score(x_train, y_train) 

### CV $r^2$ value of the model:

In [None]:
cross_val_score(model, x_train,  y_train, cv= 10, scoring="r2").mean()
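
The check described above (train error close to test error) can be sketched end to end as follows. The data here is synthetic and generated from an assumed linear relationship, so the numbers say nothing about the WHO data; the point is only the comparison pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: two features, linear signal plus unit-variance noise (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# RMSE on data the model has seen versus data it has not seen.
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# 10-fold cross-validated r2 on the training set.
cv_r2 = cross_val_score(model, X_train, y_train, cv=10, scoring="r2").mean()

print("train RMSE:", rmse_train)
print("test RMSE: ", rmse_test)
print("CV r2:     ", cv_r2)
```

A test RMSE much larger than the train RMSE would suggest overfitting; similar values suggest the model generalizes.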

Predictions for the test set:

In [None]:
y_head = model.predict(x_test)
y_head[0:5]

In [None]:
y_test_1 =np.array(range(0,len(y_test)))

In [None]:
# r2 value on the test set: 
r2_degeri = r2_score(y_test, y_head)
print("Test r2 score = ",r2_degeri) 

plt.plot(y_test_1,y_test,color="r")
plt.plot(y_test_1,y_head,color="blue")
plt.show()

# Polynomial Regression

We will use the same data set.

In [None]:
from sklearn.preprocessing import PolynomialFeatures     # generates the polynomial terms of the features

Life_Expectancy_Data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")
data = Life_Expectancy_Data.copy()
data = data.dropna()        

polydata = data.copy()

Let's see how 'percentage expenditure' changes as 'GDP' increases. Let's create and fit our linear model.

In [None]:
linear_reg = LinearRegression()
x = polydata.GDP.values.reshape(-1,1)
y = polydata['percentage expenditure'].values.reshape(-1,1)          

linear_reg.fit(x,y)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state= 42)

Let's try a polynomial of degree 15. If the fit is not good, we should change the degree.

In [None]:
polynomial_regression = PolynomialFeatures(degree = 15)    
x_polynomial = polynomial_regression.fit_transform(x)

linear_reg2 = LinearRegression()
linear_reg2.fit(x_polynomial,y)

y_head = linear_reg2.predict(x_polynomial)

order = np.argsort(x.ravel())                             # sort by GDP so the curve is drawn from left to right
plt.plot(x[order],y_head[order],color="green",label="poly")
plt.legend()
plt.show()

The degree controls how flexible the fitted curve is. If it is too large, the model overfits and the forecast deteriorates, so the degree must be chosen according to the data.

In [None]:
pol_reg = PolynomialFeatures(degree = 8)                    

level_poly = pol_reg.fit_transform(x_train)                 # According to the polynomial, x_train is defined

lm = LinearRegression()                                     
lm.fit(level_poly,y_train)
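
One way to choose the degree according to the data is to fit several degrees and compare the error on held-out points. The sketch below does this on synthetic data generated from an assumed quadratic relationship; the candidate degrees, the data, and the names `val_rmse` and `best_degree` are illustrative choices, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic signal with a little noise (assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(120, 1))
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + rng.normal(0, 0.2, size=120)

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=42)

# Fit one pipeline per candidate degree and record the validation RMSE.
val_rmse = {}
for degree in [1, 2, 4, 8, 15]:
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    pipe.fit(x_train, y_train)
    val_rmse[degree] = np.sqrt(mean_squared_error(y_val, pipe.predict(x_val)))

best_degree = min(val_rmse, key=val_rmse.get)
print(val_rmse)
print("degree with lowest validation RMSE:", best_degree)
```

Using a pipeline keeps the polynomial expansion and the regression together, so the same transformation is applied to training and validation data.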

In [None]:
y_head = lm.predict(pol_reg.transform(x_train))      # reuse the transformer already fitted on x_train
order = np.argsort(x_train.ravel())                  # sort by GDP so lines are drawn from left to right

Consistency of the model ($r^2$) and scatter plot:

In [None]:
r2 = r2_score(y_train, y_head)
print("r2 value: ", r2)                               # share of variance explained on the training set


plt.scatter(x_train, y_train, color="red")            # observed values
plt.scatter(x_train, y_head, color = "g")             # fitted values
plt.xlabel("GDP")
plt.ylabel("percentage expenditure")
plt.show()

In [None]:
plt.plot(x_train[order], y_train[order], color="red")
plt.plot(x_train[order], y_head[order], color = "blue")
plt.xlabel("GDP")
plt.ylabel("percentage expenditure")
plt.show()

# Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor               # for our predict model

Life_Expectancy_Data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")
data = Life_Expectancy_Data.copy()
data = data.dropna()                                         # drop rows with missing values, as before

DTdata = data.copy()

In [None]:
x = DTdata.GDP.values.reshape(-1,1)                          # use the copy made for this section
y = DTdata['percentage expenditure'].values.reshape(-1,1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state= 42)

Let's see the estimated 'percentage expenditure' for a country with a "GDP" value of 1000:

In [None]:
DT_reg = DecisionTreeRegressor()           # created model
DT_reg.fit(x_train,y_train)                # fitted model according to train values

print(DT_reg.predict([[1000]]))            

In [None]:
x_array = np.arange(x.min(),x.max(),0.01).reshape(-1,1)   # GDP grid on which the prediction curve is drawn
y_head = DT_reg.predict(x_array)                        # predicted percentage expenditure

plt.scatter(x,y, color="red")
plt.plot(x_array,y_head,color="blue")
plt.xlabel("GDP")
plt.ylabel("percentage expenditure")
plt.show()

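### Result: The picture looks nice and the curve follows the data very closely. Keep in mind that an unpruned decision tree can fit its training points almost perfectly, so the score on the test set is the one that matters.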
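
A close visual fit from a decision tree is best confirmed on held-out data. The sketch below, on synthetic data generated from an assumed noisy sine curve, compares the train and test $r^2$ of a fully grown `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve with noise (assumption, not the WHO data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, size=300)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# A fully grown tree memorizes the training points.
tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

r2_train = tree.score(x_train, y_train)   # r2 on data the tree has seen
r2_test = tree.score(x_test, y_test)      # r2 on unseen data

print("train r2:", r2_train)
print("test r2: ", r2_test)
```

If the gap between the two scores is large, limiting the tree with parameters such as `max_depth` or `min_samples_leaf` is a common remedy.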

# Random Forest Regression
* Random forest builds on the logic of decision trees. For example, a sample of 3000 observations is drawn from 100 thousand, a tree is fitted on it, and the results of many such trees are combined.

In [None]:
from sklearn.ensemble import RandomForestRegressor           # for our predict model

Life_Expectancy_Data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")
data = Life_Expectancy_Data.copy()
data = data.dropna()                                         # drop rows with missing values, as before

RFdata = data.copy()

In [None]:
x = RFdata.GDP.values.reshape(-1,1)                          # use the copy made for this section
y = RFdata['percentage expenditure'].values.reshape(-1,1)

Create the regression with 100 decision trees (n_estimators=100) in the sklearn RandomForestRegressor model. We can use as many trees as we want. Fixing random_state makes the result reproducible: with the same value, the outcome does not change between runs.

In [None]:
RF_reg = RandomForestRegressor(n_estimators=100, random_state=42)          
RF_reg.fit(x,y.ravel())                                        # fit the forest; ravel() passes y as the 1-D array sklearn expects
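
The role of `random_state` can be checked directly: two forests built with the same seed on the same data give the same predictions, while a different seed generally gives slightly different ones. The sketch below uses synthetic data and a small number of trees purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic one-feature data (assumption, not the WHO data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 2.0, size=200)

# Same seed twice, then a different seed.
rf_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(x, y)
rf_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(x, y)
rf_c = RandomForestRegressor(n_estimators=20, random_state=7).fit(x, y)

grid = np.linspace(0, 10, 50).reshape(-1, 1)
same_seed_equal = np.allclose(rf_a.predict(grid), rf_b.predict(grid))
different_seed_equal = np.allclose(rf_a.predict(grid), rf_c.predict(grid))

print("same seed gives identical predictions:", same_seed_equal)
print("different seed gives identical predictions:", different_seed_equal)
```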

Expenditure percentage estimation of the country with "GDP" value of 1000:

In [None]:
print(RF_reg.predict([[1000]]))            

In [None]:
x_array = np.arange(x.min(),x.max(),0.01).reshape(-1,1)   # GDP grid on which the prediction curve is drawn
y_head = RF_reg.predict(x_array)                        # predicted percentage expenditure

plt.scatter(x,y, color="red")
plt.plot(x_array,y_head,color="blue")
plt.xlabel("GDP")
plt.ylabel("percentage expenditure")
plt.show()

### Result: This result is good, but not as good as the previous one.

# Logistic Regression Model

* The aim is to build a classifier that predicts the class of a set of x values that have not yet been observed.
* For the classification problem, we establish a linear model that defines the relationship between the dependent and independent variables.
* The dependent variable is binary: 1 or 0, yes or no.


In this data, we will encode the Status column as Developed = 0 and Developing = 1. The class we want to detect is coded as 1, so we look for predictions close to 1!

The Country column consists of objects, so let's drop it, because we need int or float values.

In [None]:
logdata.drop(["Country"], axis=1, inplace=True)  
logdata.head()

Let's examine our class variable, which will take the values 1 and 0.

In [None]:
logdata["Status"].value_counts()

Let's continue with the review.

In [None]:
logdata["Status"].value_counts().plot.barh();

We need a binary variable, that is, 0 or 1. Let's do the necessary transformation.

In [None]:
logdata.Status = [1 if each == "Developing" else 0 for each in logdata.Status]   

Let's look at their general statistical properties.

In [None]:

logdata.describe().T

Let's create our variables now.

In [None]:
y = logdata["Status"]
X_data = logdata.drop(["Status"], axis=1)

Let's do normalization in our data.

In [None]:
#*** Normalize ***#

X = (X_data - X_data.min())/(X_data.max() - X_data.min())     # column-wise min-max scaling to [0, 1]
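
The normalization above is min-max scaling: each column is shifted by its minimum and divided by its range, so every feature ends up between 0 and 1. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical toy frame with columns on very different scales.
toy = pd.DataFrame({
    "GDP": [100.0, 550.0, 1000.0],
    "Schooling": [8.0, 12.0, 16.0],
})

# Column-wise min-max scaling: (value - column min) / (column max - column min).
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Using the `DataFrame.min()` and `DataFrame.max()` methods keeps the operation column-wise, which is what puts GDP and Schooling on the same 0 to 1 scale.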

Let's build a model with statsmodels and fit it. The summary table shows whether the model is significant and how much each variable contributes.

In [None]:
loj = sm.Logit(y, X)
loj_model= loj.fit()
loj_model.summary()

Now let's fit the same kind of model with scikit-learn:

In [None]:
from sklearn.linear_model import LogisticRegression
loj = LogisticRegression(solver = "liblinear")
loj_model = loj.fit(X,y)
loj_model

In [None]:
# constant value
loj_model.intercept_

Coefficient values of all independent variables:

In [None]:
loj_model.coef_

# PREDICT and MODEL TUNING

In [None]:
y_pred = loj_model.predict(X)              # predict

Confusion Matrix: counts of cases predicted 1 when the actual value is 1 (PP), predicted 0 when it is 1 (PN), predicted 1 when it is 0 (NP), and predicted 0 when it is 0 (NN).

In [None]:
confusion_matrix(y, y_pred)
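
To read the matrix returned by `confusion_matrix`, it helps to try it on a handful of hypothetical labels. In scikit-learn the rows are the actual classes and the columns the predicted classes, so for 0/1 labels the layout is `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # unpack in scikit-learn's row-major order

print(cm)
# [[1 1]
#  [1 2]]
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```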

See accuracy score:

In [None]:
accuracy_score(y, y_pred)

The classification report is one of the most detailed outputs for evaluating the results of a classification algorithm.

In [None]:
print(classification_report(y, y_pred))

See the first 10 predicted values of the model:

In [None]:
loj_model.predict(X)[0:10]

* Use the 'predict_proba' method if we want probabilities rather than the 1 and 0 values given above.


* It returns the probability of class 0 in column index 0 (the left side of the matrix) and the probability of class 1 in column index 1 (the right side).

In [None]:

loj_model.predict_proba(X)[0:10][:,0:2]                # Top 10

Now let's turn the probabilities from 'predict_proba' into class predictions ourselves.

In [None]:

y_probs = loj_model.predict_proba(X)
y_probs = y_probs[:,1]

In [None]:
y_probs[0:10]               # top 10

Loop over the predicted probabilities and assign 1 to those above 0.5 and 0 to the rest.

In [None]:

y_pred = [1 if i > 0.5 else 0 for i in y_probs]

Let's look at the first values and compare them with the earlier predictions. Our purpose in doing this is to verify our model.

In [None]:

y_pred[0:10]
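
For a binary `LogisticRegression`, thresholding the class-1 probability at 0.5 reproduces what `predict` returns, and a stricter threshold labels fewer observations as 1. The sketch below illustrates this on synthetic data; the features, the labels and the 0.8 threshold are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption, not the WHO data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=200) > 0).astype(int)

clf = LogisticRegression(solver="liblinear").fit(X, y)
probs = clf.predict_proba(X)[:, 1]            # probability of class 1

labels_default = (probs > 0.5).astype(int)    # same rule that predict applies
labels_strict = (probs > 0.8).astype(int)     # stricter, hypothetical threshold

print("matches predict:", np.array_equal(labels_default, clf.predict(X)))
print("positives at 0.5:", labels_default.sum(), "at 0.8:", labels_strict.sum())
```

Moving the threshold trades false positives against false negatives, which is what the ROC curve summarizes over all thresholds.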


In [None]:

confusion_matrix(y, y_pred)


In [None]:
accuracy_score(y, y_pred)

In [None]:
print(classification_report(y, y_pred))

Let's take one more look at the first 5 probabilities we computed above.

In [None]:

loj_model.predict_proba(X)[:,1][0:5]

In [None]:
logit_roc_auc = roc_auc_score(y, loj_model.predict_proba(X)[:,1])     # AUC is computed from probabilities, not hard 0/1 labels

In [None]:
fpr, tpr, thresholds = roc_curve(y, loj_model.predict_proba(X)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='AUC (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()


Here, 

- blue line: the ROC curve of the model we have built, showing how well it separates the classes.
- red line: the baseline of random guessing; if the model learned nothing, its curve would look like this.
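
The area under the blue curve is the AUC. The sketch below, on synthetic data, checks that integrating the curve from `roc_curve` with `sklearn.metrics.auc` gives the same number as `roc_auc_score` when both use predicted probabilities. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Synthetic binary classification data (assumption, not the WHO data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(0, 1.0, size=300) > 0).astype(int)

clf = LogisticRegression(solver="liblinear").fit(X, y)
probs = clf.predict_proba(X)[:, 1]            # probability of class 1

fpr, tpr, thresholds = roc_curve(y, probs)    # points of the ROC curve
auc_from_curve = auc(fpr, tpr)                # trapezoidal area under those points
auc_from_probs = roc_auc_score(y, probs)      # same quantity computed directly

print("area under the plotted curve:", auc_from_curve)
print("roc_auc_score on probabilities:", auc_from_probs)
```

The diagonal baseline has an area of 0.5, so a useful model should score clearly above that.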

In [None]:
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)


# Let's create and fit our model.

In [None]:

loj = LogisticRegression(solver = "liblinear")
loj_model = loj.fit(X_train,y_train)
loj_model



Let's see accuracy score:

In [None]:
accuracy_score(y_test, loj_model.predict(X_test))


Finally, the cross-validated score of the tuned model:

In [None]:
cross_val_score(loj_model, X_test, y_test, cv = 10).mean()


### Result: From this data, we understand that the model classifies countries as developed or developing with about 89% accuracy, so the effects of the variables related to life expectancy can be examined.

# Conclusion
We examined the **Life Expectancy (WHO)** data set with the basic models in Machine Learning and made some comments.

Note:

   - After this notebook, my aim is to prepare a kernel on a data set that is not clean.

   - If you have any suggestions, could you please write to me? I will be happy to receive comments and criticism!

   - Thank you for your suggestions and votes ;)
