In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib as mplib
from matplotlib import pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

#import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
# /kaggle/input/life-expectancy-who/led.csv

### Understanding the dataset
The focus here is to take a look at the dataset and understand what sort of preprocessing it may need. I am interested to see where values are missing and what kinds of datatypes I will be working with. 

#### Metadata 
- Country -                       Country

- Year -                          Year

- Status -                        Developed or Developing status

- Lifeexpectancy -                Life expectancy in years

- AdultMortality -                Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)

- infantdeaths -                  Number of Infant Deaths per 1000 population

- Alcohol -                       Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)

- percentageexpenditure -         Expenditure on health as a percentage of Gross Domestic Product per capita(%)

- HepatitisB -                    Hepatitis B (HepB) immunization coverage among 1-year-olds (%)

- Measles -                       Measles - number of reported cases per 1000 population

- BMI -                           Average Body Mass Index of entire population

- under-fivedeaths -              Number of under-five deaths per 1000 population

- Polio -                         Polio (Pol3) immunization coverage among 1-year-olds (%)

- Totalexpenditure -              General government expenditure on health as a percentage of total government expenditure (%)

- Diphtheria -                    Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

- HIV/AIDS -                      Deaths per 1 000 live births HIV/AIDS (0-4 years)

- GDP -                           Gross Domestic Product per capita (in USD)

- Population -                    Population of the country

- thinness1-19years -             Prevalence of thinness among children and adolescents for Age 10 to 19 (%)

- thinness5-9years -              Prevalence of thinness among children for Age 5 to 9 (%)

- Incomecompositionofresources -  Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

- Schooling -                     Number of years of Schooling (years)

In [None]:
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
data_df = pd.read_csv("../input/life-expectancy-who/led.csv")


In [None]:
print(data_df.info())
print(data_df.describe())
# data_df.head()
print("Null values:\n", data_df.isnull().sum())
# isnull().mean() divides the missing count by the total number of rows
print("Percent missingness:\n", data_df.isnull().mean() * 100)
print("Shape:\n", data_df.shape)
print("Data Types:\n", data_df.dtypes)

### Initial thoughts


- There are missing values in the columns GDP, Population, Totalexpenditure, HepatitisB, Incomecompositionofresources, and Schooling. I could impute the missing data, or it might be easier to drop some of the columns altogether if they are missing too much data. 


- From some reading, a missingness of up to about 40% may be acceptable in a large dataset. In this case, we may be fine without imputing missing data. 


- We will use linear regression because life expectancy is a continuous variable. 
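
As a rough sketch of the drop-or-impute decision above, the fraction of missing values per column can be compared against a cutoff. The toy frame and its values are made up, and the 0.4 `threshold` is only an assumption taken from the 40% figure mentioned above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented)
toy = pd.DataFrame({
    "GDP": [1.0, np.nan, 3.0, np.nan],
    "Schooling": [10.0, 12.0, np.nan, 9.0],
    "Alcohol": [0.5, 1.5, 2.5, 3.5],
})

# Fraction of rows missing in each column (missing count / total rows)
missing_frac = toy.isnull().mean()

# Columns above the assumed cutoff are candidates for dropping
threshold = 0.4
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()

print(missing_frac)
print("Candidates to drop:", cols_to_drop)
```

Columns below the cutoff would then be candidates for imputation rather than removal.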



### Correlation Matrix
Next, I want to compute a correlation matrix over all the variables to get an understanding of which variables are most strongly related to life expectancy.
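
To read a correlation matrix with the target in mind, one option is to pull out the target's column and sort it by absolute value. This is only a sketch on invented numbers; the column names mirror the dataset but the values do not come from it.

```python
import pandas as pd

# Invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Lifeexpectancy": [60.0, 65.0, 70.0, 75.0, 80.0],
    "AdultMortality": [300.0, 250.0, 200.0, 150.0, 100.0],
    "Schooling": [8.0, 9.0, 12.0, 13.0, 15.0],
    "Population": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Correlation of every feature with the target, target itself removed
target_corr = toy.corr()["Lifeexpectancy"].drop("Lifeexpectancy")

# Rank by strength of the relationship, ignoring its sign
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```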

In [None]:
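# Restrict to numeric columns: Country and Status are strings and cannot be correlated
corrMatrix = data_df.select_dtypes(include='number').corr()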
corrMatrix.style.background_gradient(cmap='plasma', low=.5, high=0).highlight_null('red') # from https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html

In [None]:
df_last5 = data_df[data_df['Year'].isin([2011,2012,2013,2014,2015])]

In [None]:
# numeric_only=True skips the string column Status, which cannot be averaged
df_last5_avg = df_last5.groupby(['Country'], as_index=False).mean(numeric_only=True)

### Check to make sure data is accurate 


I will pick a country, Italy for example, average a variable over its last five years by hand, and check that the same value appears in my new dataframe, df_last5_avg.
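
The same check can be written as a direct comparison instead of reading two printouts. The sketch below uses a small invented frame; `manual` and `grouped` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Invented rows; in the notebook this role is played by df_last5
toy = pd.DataFrame({
    "Country": ["Italy", "Italy", "Italy", "Spain", "Spain"],
    "Year": [2013, 2014, 2015, 2014, 2015],
    "Status": ["Developed"] * 5,
    "GDP": [100.0, 200.0, 300.0, 50.0, 150.0],
})

# One averaged record per country, numeric columns only
avg = toy.groupby("Country", as_index=False).mean(numeric_only=True)

# Average computed by hand for one country
manual = toy.loc[toy["Country"] == "Italy", "GDP"].mean()

# Value stored in the grouped frame for the same country
grouped = avg.loc[avg["Country"] == "Italy", "GDP"].iloc[0]

print(manual, grouped, manual == grouped)
```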

In [None]:
print(df_last5[df_last5['Country']=='Italy'])
print(df_last5_avg[df_last5_avg['Country']=='Italy'])

### Take a look at the new dataset information


I will use the same method as before to check for missing data. Hopefully, by taking only the last 5 years for each country, we will have fewer missing values. 


A benefit of selecting and averaging the last 5 years of data for each country is that models become easier to run, as there is now only one record per country. 

In [None]:
# isnull().mean() divides the missing count by the total number of rows
print('Percent Missingness:\n', df_last5_avg.isnull().mean() * 100)

- Since there is such a large number of missing values in the Population column, and its correlation with life expectancy is so low (-0.02), I will drop it instead of imputing values. 


- However, I will want to impute missing values for the GDP column as it has a higher correlation (0.46). I think it's best to take the median value of developing and developed countries and insert it based on each country's status. The cells below use the overall column median, which is a simpler version of the same idea. 
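
A minimal sketch of the status-based median imputation described above, on a made-up frame. It assumes a `Status` column is still available next to `GDP`; the country labels and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: two missing GDP values, one per status group
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Status": ["Developed", "Developed", "Developed",
               "Developing", "Developing", "Developing"],
    "GDP": [40000.0, np.nan, 50000.0, 1000.0, 3000.0, np.nan],
})

# Median GDP of each status group, aligned row by row with the original frame
group_median = toy.groupby("Status")["GDP"].transform("median")

# Fill each gap with the median of that row's own group
toy["GDP"] = toy["GDP"].fillna(group_median)
print(toy)
```

Compared with a single overall median, this keeps an imputed developed country from being pulled toward developing-country values and vice versa.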

In [None]:
# Pass columns= explicitly; the positional axis argument was removed in pandas 2.0
df_last5_avg.drop(columns=['Population', 'Year'], inplace=True)

In [None]:
# pd.set_option("display.max_rows", None, "display.max_columns", None) # this allows you to see the full dataset

In [None]:
from sklearn.impute import SimpleImputer

# Columns that are filled with their own median
cols_to_impute = ['GDP', 'Lifeexpectancy', 'AdultMortality', 'Alcohol', 'HepatitisB',
                  'BMI', 'Totalexpenditure', 'thinness1-19years', 'thinness5-9years',
                  'Incomecompositionofresources', 'Schooling']

# SimpleImputer computes one median per column, so a single fit covers all of them
imp = SimpleImputer(missing_values=np.nan, strategy='median')
df_last5_avg[cols_to_impute] = imp.fit_transform(df_last5_avg[cols_to_impute])

In [None]:
# isnull().mean() divides the missing count by the total number of rows
print("Percent missingness:\n", df_last5_avg.isnull().mean() * 100)

### Some Visualization Before Multiple Linear Regression

I would like to run some visualizations to get a sense of the dataset and to check whether my intuitions are correct.

In [None]:
X = df_last5_avg['AdultMortality']
y = df_last5_avg['Lifeexpectancy']
plt.scatter(X,y)
plt.ylabel('Life Expectancy')
plt.xlabel('Adult Mortality per 1000 pop.')
plt.show()

In [None]:
X = df_last5_avg['GDP']
y = df_last5_avg['Lifeexpectancy']
plt.scatter(X,y)
plt.ylabel('Life Expectancy')
plt.xlabel('GDP')
plt.show()

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df_last5_avg[['AdultMortality']],df_last5_avg['Lifeexpectancy'])
prediction_space = np.linspace(min(df_last5_avg['AdultMortality']),max(df_last5_avg['AdultMortality'])).reshape(-1,1)
plt.scatter(df_last5_avg['AdultMortality'],df_last5_avg['Lifeexpectancy'],color='yellow')
plt.plot(prediction_space,reg.predict(prediction_space),color='blue',linewidth=3)
plt.show()

### Correlated Variables

Before I run variable selection, I would like to eliminate variables that are highly correlated with each other, since they carry largely redundant information. This step will also speed up variable selection. 

I will follow the approach demonstrated on this page:

https://towardsdatascience.com/feature-selection-in-python-recursive-feature-elimination-19f1c39b8d15
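
When deciding which member of a correlated pair to drop, it can help to see both column names and the correlation value rather than only the flagged column. The helper below is a hypothetical sketch (`correlated_pairs` is not from the linked article), run on invented numbers.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |corr|) for every pair above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i):  # lower triangle only, so each pair appears once
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[j], cols[i], round(value, 3)))
    return pairs


# Invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    "infantdeaths": [1.0, 2.0, 3.0, 4.0, 5.0],
    "under-fivedeaths": [1.5, 2.4, 3.6, 4.5, 5.6],
    "Alcohol": [3.0, 1.0, 4.0, 1.0, 5.0],
})

pairs = correlated_pairs(toy)
print(pairs)
```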

In [None]:
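correlated_features = set()
# Country is still a string column here, so restrict the matrix to numeric features
correlation_matrix = (df_last5_avg.drop('Lifeexpectancy', axis=1)
                      .select_dtypes(include='number').corr())

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        # .iloc selects by integer position; flag column i if it is
        # strongly correlated with any earlier column j
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
print(correlated_features)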

In [None]:
# Pass columns= explicitly; the positional axis argument was removed in pandas 2.0
df_last5_avg.drop(columns=['Diphtheria', 'thinness5-9years', 'under-fivedeaths', 'Schooling'],
                  inplace=True)

In [None]:
df_last5_avg.set_index('Country', inplace=True)
df_last5_avg.head()  # last expression in the cell, so the notebook displays it

### Feature Selection 
#### RFE: Recursive Feature Elimination from SKLearn

https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b

The next step before running a linear regression is variable selection. Variable selection is an important part of machine learning because it...
- reduces the chance of overfitting
- can improve the accuracy of your model
- improves the speed of training the model 


With the guidance provided in the link above, I will perform feature selection.

In this procedure, I will first get a ranking of the variables to make sure the feature selection is working correctly. Then, I will have the model tell me the optimal number of variables. Lastly, I will use the rfe.fit_transform method with that optimal number as the number of variables to select. Once it returns the selected variables, I will plot the coefficients to get a look at their values. These variables will be used to run the multiple linear regression. 
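
As an alternative to searching over the number of features by hand, scikit-learn's `RFECV` combines recursive elimination with cross-validation. This is a different technique from the single train/test split used in this notebook, shown here only as a sketch on synthetic data from `make_regression`; the sample sizes and noise level are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: 10 features, only 4 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# Eliminate one feature per step and score each subset size with 5-fold CV
selector = RFECV(estimator=LinearRegression(), step=1, cv=5, scoring="r2")
selector.fit(X_demo, y_demo)

print("Number of features kept:", selector.n_features_)
print("Mask of kept features:", selector.support_)
print("Ranking (1 = kept):", selector.ranking_)
```

Because every subset size is scored on several folds, the chosen number of features depends less on one particular split.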

In [None]:
from sklearn.feature_selection import RFE

X = df_last5_avg.drop('Lifeexpectancy',axis=1)
y = df_last5_avg['Lifeexpectancy']

lr = linear_model.LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=8, step=1)
rfe.fit(X, y)

#print(rfe.get_support)
print(rfe.ranking_)

In [None]:
from sklearn.model_selection import train_test_split

# Candidate numbers of features to keep
nof_list = np.arange(1, 13)
high_score = 0
nof = 0  # best number of features found so far
score_list = []

# The split does not depend on the loop variable, so it is done once
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n_features in nof_list:
    model = linear_model.LinearRegression()
    # n_features_to_select must be passed by keyword in scikit-learn >= 1.0
    rfe = RFE(model, n_features_to_select=int(n_features))
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)  # R^2 on the held-out set
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = n_features
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))

In [None]:
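cols = list(X.columns)
model = linear_model.LinearRegression()
# Initialize RFE with the number of features found above
# (n_features_to_select must be passed by keyword in scikit-learn >= 1.0)
rfe = RFE(model, n_features_to_select=7)
# Keep only the selected columns
X_rfe = rfe.fit_transform(X, y)
# Fit the model on the selected columns
model.fit(X_rfe, y)
temp = pd.Series(rfe.support_, index=cols)
selected_features_rfe = temp[temp].index  # names of the columns RFE kept
print(selected_features_rfe)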


In [None]:
coefs = model.fit(X_rfe, y).coef_
_ = plt.plot(coefs)
# Label the ticks from the RFE selection so names and coefficients stay aligned
_ = plt.xticks(np.arange(len(coefs)), selected_features_rfe, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

In [None]:
# Pair each selected feature name with its coefficient
coef_dict = dict(zip(selected_features_rfe, coefs))
coef_dict

In [None]:
import seaborn as sns
corr = df_last5_avg.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right')

### Linear Regression

I will run two different regularized regression models: Ridge and Lasso.
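
Both models have a regularization strength `alpha` that has to be chosen. One common option, not used in this notebook, is to let `RidgeCV` and `LassoCV` pick it by cross-validation. The sketch below runs on synthetic data; the alpha grid and data sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven selected features
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Candidate regularization strengths on a log scale
alphas = np.logspace(-3, 2, 20)

# Scale the features first so the penalty treats every column equally
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, random_state=0))
ridge_cv.fit(X_tr, y_tr)
lasso_cv.fit(X_tr, y_tr)

ridge_alpha = ridge_cv.named_steps["ridgecv"].alpha_
lasso_alpha = lasso_cv.named_steps["lassocv"].alpha_
ridge_r2 = ridge_cv.score(X_te, y_te)
lasso_r2 = lasso_cv.score(X_te, y_te)
print("Ridge: alpha =", ridge_alpha, "test R^2 =", ridge_r2)
print("Lasso: alpha =", lasso_alpha, "test R^2 =", lasso_r2)
```

Ridge shrinks all coefficients toward zero, while Lasso can set some exactly to zero, so the two can disagree on which features matter.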

In [None]:
df_regress = df_last5_avg[['AdultMortality', 'Alcohol', 'HepatitisB', 'Totalexpenditure',
       'HIV/AIDS', 'thinness1-19years', 'Incomecompositionofresources','Lifeexpectancy']]
X = df_regress.drop('Lifeexpectancy',axis=1)
y = df_regress['Lifeexpectancy']

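from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Ridge(normalize=True) was removed in scikit-learn 1.2. Standardizing in a pipeline
# and multiplying alpha by the number of training rows applies the same penalty.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.1 * len(X_train)))
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)     # predictions on the held-out set
ridge_pred1 = ridge.predict(X_train)   # predictions on the training set
print(ridge.score(X_train, y_train))   # training R^2
print(ridge.score(X_test, y_test))     # test R^2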

In [None]:
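from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Lasso(normalize=True) was removed in scikit-learn 1.2. Standardizing in a pipeline
# and multiplying alpha by sqrt(number of training rows) applies the same penalty.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1 * np.sqrt(len(X_train))))
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)     # predictions on the held-out set
lasso.score(X_test, y_test)            # test R^2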

### Final Conclusions

From the models we ran, we end up with very accurate predictions: Ridge R^2 = .89 and Lasso R^2 = .86. From the feature selection, we can also note that the most influential factors on life expectancy are:

{'AdultMortality': -0.04432281224565553,
 'Alcohol': 0.14676844206224568,
 'HepatitisB': 0.03433452490962368,
 'Totalexpenditure': 0.25275165023031065,
 'HIV/AIDS': -0.3195821205985184,
 'thinness1-19years': -0.08129266307963838,
 'Incomecompositionofresources': 21.411124433259623}
 
- Adult mortality - Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
- Alcohol - recorded per capita (15+) consumption (in litres of pure alcohol)
- Hepatitis B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
- total expenditure - General government expenditure on health as a percentage of total government expenditure (%)
- HIV/AIDS - Deaths per 1 000 live births HIV/AIDS (0-4 years)
- Thinness 1-19 years - Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
- Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

One reason why our predictive models are so accurate may be that I reduced the dataset considerably, to 192 records. However, overfitting does not appear to be a problem, because the training accuracy is very similar to the test accuracy.
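
With only a couple of hundred records, a single train/test split can be lucky or unlucky. A k-fold cross-validation run gives a spread of scores instead of one number. The sketch below is not part of the original analysis; it uses synthetic data with the same number of rows, and `alpha=1.0` is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with 192 rows and 7 features, mirroring the reduced dataset's shape
X_demo, y_demo = make_regression(n_samples=192, n_features=7, noise=10.0, random_state=0)

# Five shuffled folds; each record is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 per fold:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

A small standard deviation across folds would support the reading that the model is not overfitting to one particular split.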

### Project Report

1. First, I looked at the dataset and at the number of missing values.
2. Next, I reduced the dataset to one record per country by taking the last five years of data for each country and averaging them.
3. Then, I checked the amount of missing data again to see whether I needed to impute values or drop columns altogether. 
4. After getting a clean dataset, I ran some visualizations to test my intuition and make sure things were working correctly. 
5. Once that was done, I removed heavily correlated variables and performed feature selection. 
6. Finally, after feature selection, I ran the linear regressions. 
