### 1. Frame the problem and look at the big picture.


Dataset used : https://www.kaggle.com/kumarajarshi/life-expectancy-who

It is clearly a typical supervised learning task, since you are given labeled training examples (each instance comes with the expected output, i.e., Life expectancy).

Moreover, it is a typical regression task, since you are asked to predict a value. More specifically, it is a multiple regression problem, since the system will use multiple features to make a prediction (Adult mortality, Population, Income composition of resources, etc.). It is also a univariate regression problem, since only a single value is predicted for each instance.

Finally, there is no continuous flow of data coming in the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

### Select a Performance Measure

Your next step is to select a performance measure. A common performance measure for regression problems is the R² score (coefficient of determination), which is used throughout this notebook.
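
As a minimal sketch of what this measure computes, the R² score can be written by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2_by_hand` and the sample values are made up for illustration; the notebook itself relies on `sklearn.metrics.r2_score`.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical life-expectancy values, for illustration only
y_true = [60.0, 65.0, 70.0, 75.0, 80.0]
perfect = r2_by_hand(y_true, y_true)         # every prediction is exact
baseline = r2_by_hand(y_true, [70.0] * 5)    # always predict the mean
print(perfect, baseline)
```

A score of 1 means the predictions are exact, a score of 0 means the model does no better than always predicting the mean, and negative scores are possible for models that do worse than that.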

### 2. Get the Data

In [None]:
#import necessary modules
import pandas as pd

life_data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")


### Take a Quick Look at the Data Structure

In [None]:
life_data.head()          #View top 5 rows

In [None]:
#Lets look at the columns
life_data.columns

The column names are inconsistent: some start with a capital letter and some contain leading or trailing spaces. Let's make them consistent by stripping the leading and trailing spaces and capitalizing each name.

In [None]:
#Remove leading and trailing spaces with strip(), then capitalize each name
#so that all column names follow the same convention
life_data.columns = [name.strip().capitalize() for name in life_data.columns]

In [None]:
#Lets view our Columns
life_data.columns

 The data contains 22 columns and 2938 rows, not counting the header row. The table contains data about:
* country (Nominal) - the country in which the indicators are from (i.e. United States of America or Congo)
* year (Ordinal) - the calendar year the indicators are from (ranging from 2000 to 2015)
* status (Nominal) - whether a country is considered to be 'Developing' or 'Developed' by WHO standards
* life_expectancy (Ratio) - the life expectancy of people in years for a particular country and year
* adult_mortality (Ratio) - the adult mortality rate per 1000 population (i.e. number of people dying between 15 and 60 years per 1000 population); if the rate is 263 then that means 263 people will die out of 1000 between the ages of 15 and 60; another way to think of this is that the chance an individual will die between 15 and 60 is 26.3%
* infant_deaths (Ratio) - number of infant deaths per 1000 population; similar to above, but for infants
* alcohol (Ratio) - a country's alcohol consumption rate measured as liters of pure alcohol consumption per capita
* percentage_expenditure (Ratio) - expenditure on health as a percentage of Gross Domestic Product (gdp)
* hepatitis_b (Ratio) - number of 1 year olds with Hepatitis B immunization over all 1 year olds in population
* measles (Ratio) - number of reported Measles cases per 1000 population
* bmi (Interval/Ordinal) - average Body Mass Index (BMI) of a country's total population
* under-five_deaths (Ratio) - number of deaths of children under the age of five per 1000 population
* polio (Ratio) - number of 1 year olds with Polio immunization over the number of all 1 year olds in population
* total_expenditure (Ratio) - government expenditure on health as a percentage of total government expenditure
* diphtheria (Ratio) - Diphtheria tetanus toxoid and pertussis (DTP3) immunization rate of 1 year olds
* hiv/aids (Ratio) - deaths per 1000 live births caused by HIV/AIDS for people under 5; number of people under 5 who die due to HIV/AIDS per 1000 births
* gdp (Ratio) - Gross Domestic Product per capita
* population (Ratio) - population of a country
* thinness_1-19_years (Ratio) - rate of thinness among people aged 10-19 (Note: variable should be renamed to thinness_10-19_years to more accurately represent the variable)
* thinness_5-9_years (Ratio) - rate of thinness among people aged 5-9
* income_composition_of_resources (Ratio) - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
* schooling (Ratio) - average number of years of schooling of a population

With the exception of Country and Status (either Developed or Developing), all of the data is numeric. The values are in years, percentages, millions, or dollars in the case of Gross Domestic Product (GDP).

The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute's type and number of non-null values.

In [None]:
life_data.info()

There are 2,938 instances in the dataset, which means that it is fairly small by Machine Learning standards, but it's perfect to get started.

Notice that there are missing values in Life expectancy (our target variable), Adult mortality, Alcohol, Hepatitis b, Bmi, Gdp, Diphtheria, Hiv/aids, Population, Thinness 1-19 years, Thinness 5-9 years, Income composition of resources, and Schooling.


As stated above it would be useful to change the name of the variable thinness_1-19_years to thinness_10-19_years as it is a more accurate depiction of what the variable means.

In [None]:
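#After the clean-up above the column appears as 'Thinness  1-19 years' (note the double space);
#the single-space spelling is listed too, since rename() ignores keys that do not exist
life_data.rename(columns={'Thinness  1-19 years': 'Thinness 10-19 years',
                          'Thinness 1-19 years': 'Thinness 10-19 years'}, inplace=True)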

The describe() method shows a summary of the numerical attributes

In [None]:
life_data.describe()

Note that the standard deviation of Infant deaths, Percentage expenditure, Measles, Under-five deaths, Gdp, and Population is much larger than the mean (more than twice as large).

For non-negative quantities like these, a standard deviation that large means the values are not clustered around the mean: most observations are small, while a few very large ones stretch the range between the minimum and the maximum.
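
The comparison between standard deviation and mean can be read straight off the `describe()` output. The following is a small sketch on a toy frame; the values and the threshold of 2 are chosen for illustration and are not taken from the WHO data.

```python
import pandas as pd

# Hypothetical toy values, not taken from the WHO dataset
toy = pd.DataFrame({
    "Schooling": [10.0, 11.0, 12.0, 13.0, 14.0],   # tightly clustered
    "Measles":   [0.0, 0.0, 1.0, 2.0, 5000.0],     # one huge value dominates
})

summary = toy.describe().T
summary["std_over_mean"] = summary["std"] / summary["mean"]

# Columns whose standard deviation is more than twice the mean
spread_out = summary.index[summary["std_over_mean"] > 2].tolist()
print(summary[["mean", "std", "std_over_mean"]])
print(spread_out)
```

Running the same three lines on `life_data` instead of `toy` would list the widely spread columns without having to scan the table by eye.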

Things that may not make sense from above:

* Adult mortality of 1? This is likely an error in measurement, but what values make sense here? May need to change to null if under a certain threshold.
* Infant deaths as low as 0 per 1000? That just isn't plausible - I'm deeming those values to actually be null. Also on the other end 1800 is likely an outlier, but it is possible in a country with very high birthrates and perhaps a not very high population total - this can be dealt with later.
* BMI of 1 and 87.3? Pretty sure the whole population would not exist if that were the case. A BMI of 15 or lower is seriously underweight and a BMI of 40 or higher is morbidly obese, therefore a large number of these measurements just seem unrealistic...this variable might not be worth digging into at all.
* Under Five Deaths, similar to infant deaths just isn't likely (perhaps even impossible) to have values at zero.
* GDP per capita as low as 1.68 (USD) possible? Doubtful - but perhaps values this low are outliers.
* Population of 34 for an entire country? Hmm...

Let's plot a histogram of each attribute to get more insight into the data.

In [None]:
#import 
%matplotlib inline    

import matplotlib.pyplot as plt


In [None]:
#creating histogram for each numeric attribute
life_data.hist(bins = 50,
               figsize = (20,15))
plt.show()

#### Conclusions:
* All these attributes are in different scales. Feature scaling is needed.
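* Many histograms are tail-heavy: they extend much farther to the right of the median than to the left (right-skewed). Transforming them towards a more bell-shaped distribution may help some models.
* Diphtheria is left-skewed (most values sit near the top of the range, with a long tail towards low values), so it may benefit from a transformation as well.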
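
One common way to pull a long-tailed feature towards a bell shape is a log transform. The sketch below uses synthetic lognormal data as a stand-in for a feature with a long tail towards large values; the notebook does not apply this transform, so treat it as an option rather than as part of the pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic feature with a long tail towards large values (illustration only)
skewed = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=2000))

skew_before = skewed.skew()
skew_after = np.log1p(skewed).skew()   # log(1 + x) compresses the long tail
print(round(skew_before, 2), round(skew_after, 2))
```

A skewness close to 0 after the transform indicates a roughly symmetric distribution. `log1p` is used instead of `log` so that zeros in the data do not produce infinities.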

Let's look at our categorical attributes.

In [None]:
life_data["Country"].value_counts()

In [None]:
life_data["Status"].value_counts()

### EDA: Data Cleaning

In [None]:
#Work on a copy so the original data stays untouched
life_copy = life_data.copy()

#### Checking for outliers
A box plot is a convenient way to spot outliers.

In [None]:
plt.figure(figsize=(15,10))

for i,column in enumerate(['Adult mortality', 'Infant deaths', 'Bmi', 'Under-five deaths', 'Gdp', 'Population'],start=1):
    plt.subplot(2, 3,i)
    life_copy.boxplot(column)

There are a few of the above that could simply be outliers, but there are some that almost certainly have to be errors of some sort. Of the above variables, changes to null will be made for the following since these numbers don't make any sense:

* Adult mortality rates lower than the 5th percentile
* Infant deaths of 0
* BMI less than 10 and greater than 50
* Under Five deaths of 0

Let's replace these values with nulls (NaN).

In [None]:
#import
import numpy as np

In [None]:
#Adult mortality rates lower than the 5th percentile
mortality_less_5_per = np.percentile(life_copy["Adult mortality"].dropna(),5) 
#mask() sets the entries where the condition holds to NaN (vectorised, no row-wise apply)
life_copy["Adult mortality"] = life_copy["Adult mortality"].mask(life_copy["Adult mortality"] < mortality_less_5_per)


In [None]:
#Remove Infant deaths of 0
life_copy["Infant deaths"] = life_copy["Infant deaths"].replace(0,np.nan)

In [None]:
#Remove the invalid BMI values (below 10 or above 50)
bmi = life_copy["Bmi"]
life_copy["Bmi"] = bmi.mask((bmi < 10) | (bmi > 50))

In [None]:
#Remove Under five deaths
life_copy["Under-five deaths"] =life_copy["Under-five deaths"].replace(0,np.nan)

#### Dealing with missing values

After the transformations above, the number of missing values will have increased. The following function counts the missing values in each column of the dataset.

In [None]:
def count_null(df):
    df_cols = list(df.columns)
    cols_total_count = len(df_cols)
    cols_count = 0
    
    for loc,col in enumerate(df_cols):
        null_count = df[col].isnull().sum()                                  #total null values
        total_count = df[col].isnull().count()                               #Total rows
        percent_null = round(null_count/total_count*100, 2)                  #Percentage null 
      
        if null_count > 0:
            cols_count += 1
            print('[iloc = {}] {} has {} null values: {}% null'.format(loc, col, null_count, percent_null))
    
    cols_percent_null = round(cols_count/cols_total_count*100, 2)
    print('Out of {} total columns, {} contain null values; {}% columns contain null values.'.format(cols_total_count, cols_count, cols_percent_null))

In [None]:
count_null(life_copy)
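
For a quick overview without a helper function, the share of nulls per column can also be computed in one expression. This is a sketch on a toy frame with made-up values; the same expression would work on `life_copy`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "Bmi": [np.nan, 21.5, np.nan, 24.0],
    "Gdp": [584.3, np.nan, 612.7, 631.7],
    "Year": [2012, 2013, 2014, 2015],
})

# isnull() gives booleans, so the column mean is the fraction of nulls
null_percent = (toy.isnull().mean() * 100).round(2)
null_percent = null_percent[null_percent > 0].sort_values(ascending=False)
print(null_percent)
```

Sorting in descending order puts the columns that are candidates for removal at the top.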

The target, Life expectancy, has 10 missing values in the full dataset. They are filled in by the imputation step below, together with the other columns.

Nearly half of the BMI variable's values are null, so it is likely best to remove this variable altogether.

In [None]:
life_copy.drop(columns='Bmi', inplace=True)

###### Let's deal with the missing values
Many columns contain null values. Since this is time-series data grouped by country, the natural approach would be to interpolate within each country. In practice that fills in nothing, because a country that is missing a variable tends to be missing it for every year. Imputing by year is therefore the more workable option: each missing value is replaced below with the mean of that column for the same year.
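
The difference between the two strategies is easy to see on a toy frame. In this sketch (hypothetical countries and values), country A has no known Gdp at all, so interpolation within the country has nothing to work from, while the per-year mean borrows from the other countries.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Year":    [2000, 2001, 2000, 2001, 2000, 2001],
    "Gdp":     [np.nan, np.nan, 10.0, 20.0, 30.0, 40.0],
})

# Interpolating within each country cannot help country A: it has no known value
by_country = toy.groupby("Country")["Gdp"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Filling with the mean of the same year borrows information from other countries
by_year = toy["Gdp"].fillna(toy.groupby("Year")["Gdp"].transform("mean"))

print(by_country.tolist())
print(by_year.tolist())
```

The trade-off is that a year mean ignores how different the countries are from each other, so the imputed values are only a rough stand-in.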

In [None]:
imputed_data = []

for year in life_copy.Year.unique():
    year_data = life_copy[life_copy.Year == year].copy()

    #Skip Country, Year and Status; fill every other column with that year's mean
    for col in year_data.columns[3:]:
        year_data[col] = year_data[col].fillna(year_data[col].mean())

    imputed_data.append(year_data)
df = pd.concat(imputed_data)

In [None]:
count_null(df)

### Outliers Detection

First a boxplot and histogram will be created for each continuous variable in order to visually see if outliers exist.

In [None]:
life_numeric_data = df.drop(columns=["Year","Country","Status"])

In [None]:
%matplotlib inline

def plot_numeric_data(data):
    i = 0
    for col in data.columns:
        i += 1
        plt.subplot(9, 4, i)
        plt.boxplot(data[col])
        plt.title('{} boxplot'.format(col))
        i += 1
        plt.subplot(9, 4, i)
        plt.hist(data[col])
        plt.title('{} histogram'.format(col))
        
    plt.show()


In [None]:
plt.figure(figsize=(15,40))
plot_numeric_data(life_numeric_data)

Visually, it is plain to see that there are a number of outliers for all of these variables, including the target variable, life expectancy. The same check is done statistically below using Tukey's method, which treats any value more than 1.5 times the IQR below the first quartile or above the third quartile as an outlier.

In [None]:
def outlier_count(col, data=df):
    
    print("\n"+15*'-' + col + 15*'-'+"\n")
    
    q75, q25 = np.percentile(data[col], [75, 25])
    iqr = q75 - q25
    min_val = q25 - (iqr*1.5)                       #lower Tukey fence
    max_val = q75 + (iqr*1.5)                       #upper Tukey fence
    #Count the values outside the fences (named n_outliers so it does not shadow the function)
    n_outliers = len(np.where((data[col] > max_val) | (data[col] < min_val))[0])
    outlier_percent = round(n_outliers/len(data[col])*100, 2)
    print('Number of outliers: {}'.format(n_outliers))
    print('Percent of data that is outlier: {}%'.format(outlier_percent))

In [None]:
cont_vars = list(life_numeric_data)
for col in cont_vars:
    outlier_count(col)

Since each variable has a unique amount of outliers and also has outliers on different sides of the data, the best route to take is probably winsorizing (limiting) the values for each variable on its own until no outliers remain. The function below allows me to do exactly that by going variable by variable with the ability to use a lower limit and/or upper limit for winsorization. By default the function will show two boxplots side by side for the variable (one boxplot of the original data, and one with the winsorized change). Once a satisfactory limit is found (by visual analysis), the winsorized data will be saved in the wins_dict dictionary so the data can easily be accessed later.
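
To make the effect of the limits concrete, here is a small sketch of `winsorize` on a toy array (values invented for illustration). With an upper limit of 0.1 on ten points, the single largest value is replaced by the next largest one; nothing is dropped.

```python
import numpy as np
from scipy.stats.mstats import winsorize

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# limits=(lower, upper): fraction of points to clip at each end.
# Here the top 10% (one value out of ten) is pulled down to the next largest value.
clipped = winsorize(values, limits=(0, 0.1))
print(clipped.tolist())
```

The array keeps its length, which is why the winsorized columns can later be put back into a dataframe row by row.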

In [None]:
from scipy.stats.mstats import winsorize

def test_wins(col, lower_limit=0, upper_limit=0, show_plot=True):
    wins_data = winsorize(df[col], limits=(lower_limit, upper_limit))
    wins_dict[col] = wins_data
    if show_plot:
        plt.figure(figsize=(15,5))
        plt.subplot(121)
        plt.boxplot(df[col])
        plt.title('original {}'.format(col))
        plt.subplot(122)
        plt.boxplot(wins_data)
        plt.title('wins=({},{}) {}'.format(lower_limit, upper_limit, col))
        plt.show()

In [None]:
wins_dict = {}
test_wins(cont_vars[0], lower_limit=.01, show_plot=True)
test_wins(cont_vars[1], upper_limit=.04, show_plot=False)
test_wins(cont_vars[2], upper_limit=.05, show_plot=False)
test_wins(cont_vars[3], upper_limit=.0025, show_plot=False)
test_wins(cont_vars[4], upper_limit=.135, show_plot=False)
test_wins(cont_vars[5], lower_limit=.1, show_plot=False)
test_wins(cont_vars[6], upper_limit=.19, show_plot=False)
test_wins(cont_vars[7], upper_limit=.05, show_plot=False)
test_wins(cont_vars[8], lower_limit=.1, show_plot=False)
test_wins(cont_vars[9], upper_limit=.02, show_plot=False)
test_wins(cont_vars[10], lower_limit=.105, show_plot=False)
test_wins(cont_vars[11], upper_limit=.185, show_plot=False)
test_wins(cont_vars[12], upper_limit=.105, show_plot=False)
test_wins(cont_vars[13], upper_limit=.07, show_plot=False)
test_wins(cont_vars[14], upper_limit=.035, show_plot=False)
test_wins(cont_vars[15], upper_limit=.035, show_plot=False)
test_wins(cont_vars[16], lower_limit=.05, show_plot=False)
test_wins(cont_vars[17], lower_limit=.025, upper_limit=.005, show_plot=False)

All the variables have now been winsorized as little as possible in order to keep as much data intact as possible while still being able to eliminate the outliers. Finally, small boxplots will be shown for each variable's winsorized data to show that the outliers have indeed been dealt with.

In [None]:
plt.figure(figsize=(15,5))

for i, col in enumerate(cont_vars, 1):
    plt.subplot(2, 9, i)
    plt.boxplot(wins_dict[col])

plt.tight_layout()    #Adjust spacing once, after all subplots exist
plt.show()

In [None]:
#A new dataframe with the winsorized data 
wins_df = df.iloc[:, 0:3].copy()    #copy() avoids writing into a slice of df
for col in cont_vars:
    wins_df[col] = wins_dict[col]

#### Drop irrelevant features

In [None]:
dataset = wins_df.drop(columns=["Year", "Country"])

#### Dealing with categorical data

In [None]:
status = pd.get_dummies(dataset.Status)
dataset = pd.concat([dataset, status], axis = 1)
dataset= dataset.drop(['Status'], axis=1)

In [None]:
dataset.columns

### Let's split the data into training and test data

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(dataset.drop(columns = ["Life expectancy"]),
                                                 dataset["Life expectancy"],
                                                 test_size = 0.2,
                                                 random_state = 42)

### Feature Scaling 

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train)


## Train the model using Linear Regression
#### [UNDERFITTING]

In [None]:
#import necessary modules
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression()

linear_regressor.fit(X_train_scaled,y_train)


In [None]:
from sklearn.metrics import r2_score

#Make predictions
y_pred = linear_regressor.predict(X_train_scaled)

#Calculating the R^2 score on the training set
linear_r2_score = r2_score(y_train,y_pred)

print(linear_r2_score)

### Better Evaluation Using Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
scoring = make_scorer(r2_score)

linear_scores = cross_val_score(linear_regressor,X_train_scaled,y_train,
                       scoring = scoring,cv=10)
linear_scores
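
One caveat: the scaler above was fitted on the whole training set before cross-validation, so every validation fold has already contributed to the scaling statistics. A sketch of one way to avoid this is to wrap the scaler and the model in a scikit-learn pipeline, which is then re-fitted inside each fold. The data here is synthetic and the names `X_demo`, `y_demo` and `demo_scores` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the training data: 200 rows, 3 features on different scales
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, 0.03, 0.0004]) + rng.normal(scale=0.5, size=200)

# The scaler is re-fitted on the training part of every fold, so the held-out
# fold never influences the scaling statistics.
model = make_pipeline(StandardScaler(), LinearRegression())
demo_scores = cross_val_score(model, X_demo, y_demo, scoring="r2", cv=5)
print(demo_scores.mean())
```

For plain linear regression the effect on the scores is usually negligible, but the pattern carries over unchanged to models and preprocessing steps where it matters more.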

## Train the model using Decision Tree Regressor
#### [OVERFITTING]

In [None]:
#import necessary modules
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()

tree_reg.fit(X_train_scaled,y_train)


In [None]:
#Make predictions
y_pred = tree_reg.predict(X_train_scaled)

#Calculating the R^2 score on the training set
tree_r2_score = r2_score(y_train,y_pred)

print(tree_r2_score)

### Better Evaluation Using Cross-Validation

Clearly the Decision Tree Regressor is overfitting the data. Let's check the cross-validation scores.
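
The gap between the training score and the cross-validated score is what signals overfitting. A minimal sketch on synthetic noisy data (names and values are illustrative only) shows the typical pattern for an unrestricted tree:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Synthetic data: the target depends on the first feature plus substantial noise
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=2.0, size=300)

# An unrestricted tree can memorise every training point, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
train_r2 = tree.score(X_demo, y_demo)

# Scores on held-out folds show how much of that fit generalises
cv_r2 = cross_val_score(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo, scoring="r2", cv=10
).mean()
print(train_r2, cv_r2)
```

Limiting the tree, for example through `max_depth` or `min_samples_leaf`, is one way to narrow that gap.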

In [None]:
from sklearn.metrics import make_scorer
scoring = make_scorer(r2_score)
scores = cross_val_score(tree_reg,X_train_scaled,y_train,
                       scoring = scoring,cv=10)


In [None]:
scores

## Train the model using Random Forest Regressor
#### [Perfect Fit]

In [None]:
#RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()

forest_reg.fit(X_train_scaled,y_train)



In [None]:
#Make predictions
y_pred = forest_reg.predict(X_train_scaled)
#Calculating the R^2 score on the training set
forest_r2_score = r2_score(y_train,y_pred)

print(forest_r2_score)

### Better Evaluation Using Cross-Validation


In [None]:
forest_score = cross_val_score(forest_reg, X_train_scaled,y_train,
                              scoring=scoring,cv=10)

forest_score

## Evaluate Your System on the Test Set

In [None]:
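#Reuse the scaler fitted on the training set; do not refit it on the test data
X_test_scaled = std_scaler.transform(X_test)
y_pred = forest_reg.predict(X_test_scaled)

#Calculating the R^2 score on the test set
forest_test_r2 = r2_score(y_test,y_pred)

print("R^2 score: %.2f"%forest_test_r2)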

## References

Dataset used : https://www.kaggle.com/kumarajarshi/life-expectancy-who

Exploratory Data Analysis : https://www.kaggle.com/philbowman212/life-expectancy-exploratory-data-analysis