# Life Expectancy

The objective of this notebook is to predict life expectancy.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import seaborn as sns
plt.style.use("bmh")

In [None]:
# Load data
life_data = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv")

In [None]:
life_data.head(20)

In [None]:
life_data.shape

In [None]:
life_data.columns

### Variable notes:

* **Country:** Country
* **Year:** Year
* **Status:** Developed or Developing status
* **Life expectancy:** Life expectancy in years
* **Adult Mortality:** Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
* **infant deaths:** Number of Infant Deaths per 1000 population
* **Alcohol:** Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
* **percentage expenditure:** Expenditure on health as a percentage of Gross Domestic Product per capita (%)
* **Hepatitis B:** Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
* **Measles:** Measles - number of reported cases per 1000 population
* **BMI:** Average Body Mass Index of entire population
* **under-five deaths:** Number of under-five deaths per 1000 population
* **Polio:** Polio (Pol3) immunization coverage among 1-year-olds (%)
* **Total expenditure:** General government expenditure on health as a percentage of total government expenditure (%)
* **Diphtheria:** Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
* **HIV/AIDS:** Deaths from HIV/AIDS per 1,000 live births (0-4 years)
* **GDP:** Gross Domestic Product per capita (in USD)
* **Population:** Population of the country
* **thinness  1-19 years:** Prevalence of thinness among children and adolescents aged 10 to 19 (%). The column name says 1-19, but the description refers to ages 10 to 19.
* **thinness 5-9 years:** Prevalence of thinness among children aged 5 to 9 (%)
* **Income composition of resources:** Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
* **Schooling:** Number of years of schooling (years)

In [None]:
# check missing data
life_data.isnull().sum()

In [None]:
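# Fill missing values with the column mean.
# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to the original frame.
for column in life_data.columns:
    if life_data[column].isnull().sum() != 0:
        life_data[column] = life_data[column].fillna(life_data[column].mean())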

In [None]:
life_data.isnull().sum()
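
The mean imputation above can be illustrated on a small frame. This is a minimal sketch, not the notebook's data: the `toy` frame and its values are made up for illustration, and restricting the fill to numeric columns is an assumption that guards against text columns such as `Country`, for which a mean is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for life_data (values are invented)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [100.0, np.nan, 300.0, 200.0],
    "Schooling": [10.0, 12.0, np.nan, 14.0],
})

# Only numeric columns have a meaningful mean
numeric_cols = toy.select_dtypes(include="number").columns

# fillna with a Series of means fills each column with its own mean
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(toy)
print(toy.isnull().sum())
```

Mean imputation keeps every row but shrinks the variance of the filled columns, so it is worth checking how many values were missing before relying on it.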

Let's see how each variable relates to life expectancy.

In [None]:
fig, ax = plt.subplots(figsize=(25, 10))
g1 = life_data.groupby("Country")["Life expectancy "].mean()
g1.plot(kind="bar", alpha=0.3, ec="k", color="green")
plt.ylabel("Mean life expectancy")
plt.show()

In [None]:
plt.scatter(x=life_data["Year"], y=life_data["Life expectancy "], color="purple", alpha=0.3)
plt.ylabel("Life expectancy")
plt.show()

Life expectancy has been rising over the years. A likely explanation is progress in the health sector, although the plot alone does not establish the cause.

In [None]:
g2 = life_data.groupby("Status")["Life expectancy "].mean()
g2.plot(kind="bar", alpha=0.3, ec="k", color="brown")
plt.ylabel("Life expectancy")
plt.show()

On average, developed countries have a higher life expectancy than developing ones.

In [None]:
sns.pairplot(life_data, x_vars=["Adult Mortality", "infant deaths", "Alcohol", "percentage expenditure", "Hepatitis B"], y_vars=["Life expectancy "])

In [None]:
sns.pairplot(life_data, x_vars=["Measles ", " BMI ", "under-five deaths ", "Polio", "Total expenditure"], y_vars=["Life expectancy "])

In [None]:
sns.pairplot(life_data, x_vars=["Diphtheria ", " HIV/AIDS", "GDP", "Population", " thinness  1-19 years"], y_vars=["Life expectancy "])

In [None]:
sns.pairplot(life_data, x_vars=[" thinness 5-9 years", "Income composition of resources", "Schooling"], y_vars=["Life expectancy "])

Let's see the correlations between the variables.
* A value near 1.0 indicates a strong positive linear correlation.
* A value near -1.0 indicates a strong negative linear correlation.
* A value near zero indicates little or no linear correlation (a nonlinear relationship may still exist).
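
The three cases in the list above can be reproduced on synthetic data. This is a small sketch with invented columns (`up`, `down`, `sym` are hypothetical names), using the Pearson coefficient that `DataFrame.corr()` computes by default.

```python
import numpy as np
import pandas as pd

x = np.arange(10, dtype=float)

# Hypothetical columns built from x to show each case
toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,        # rises linearly with x
    "down": -3 * x + 5,     # falls linearly with x
    "sym": (x - 4.5) ** 2,  # depends on x, but symmetrically, not linearly
})

corr = toy.corr()  # Pearson correlation by default
print(corr["x"].round(3))
```

The `sym` column is fully determined by `x` yet its coefficient is essentially zero, which is why a near-zero value should be read as "no linear relationship" rather than "no relationship".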

In [None]:
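# Country and Status are still text here, so restrict the correlation to numeric columns
# (pandas 2.x raises an error on non-numeric columns otherwise).
corrMatrix = life_data.corr(numeric_only=True)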
fig, ax = plt.subplots(figsize=(18, 10))
sns.heatmap(corrMatrix, annot=True)

In [None]:
life_data.dtypes

In [None]:
# Country and Status are text columns, so map them to numeric codes before modelling.
# The integer codes for Country follow order of first appearance and carry no real ordering.

print(life_data["Country"].unique())
countries = life_data["Country"].unique()
country_codes = dict(zip(countries, range(len(countries))))
life_data["Country"] = life_data["Country"].map(country_codes)
print(life_data["Country"].unique())

print(life_data["Status"].unique())
mappingstatus = {"Developing": 0, "Developed": 1}
life_data["Status"] = life_data["Status"].map(mappingstatus)
print(life_data["Status"].unique())
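
Integer codes make a linear model treat country 10 as "twice" country 5, which has no meaning. One common alternative is one-hot encoding. The sketch below is an assumption about how that could look, not what this notebook does: the `toy` frame and its values are invented, and `pd.get_dummies` is swapped in for the dictionary mapping used above.

```python
import pandas as pd

# Hypothetical toy frame (values are invented for illustration)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "A"],
    "Status": ["Developing", "Developing", "Developed", "Developing"],
    "Schooling": [10.0, 11.0, 15.0, 10.5],
})

# One 0/1 indicator column per country instead of a single integer code
encoded = pd.get_dummies(toy, columns=["Country"], prefix="Country", dtype=int)

# Status has only two levels, so a single 0/1 column is enough
encoded["Status"] = encoded["Status"].map({"Developing": 0, "Developed": 1})

print(encoded)
```

With many countries this adds many columns, so it is a trade-off between model size and avoiding an artificial ordering.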

In [None]:
life_data.describe()

In [None]:
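# Divide every column whose maximum exceeds 1 by that maximum, which puts the variables on a comparable scale.
# For non-negative columns this yields a 0-to-1 range.
# The target column is rescaled too, so the error metrics below are in scaled units, not years.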

for column in life_data.columns:
    maxcolumn = life_data[column].max()
    if maxcolumn > 1:
        life_data[column] = life_data[column] / maxcolumn

In [None]:
life_data.describe()
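
Scaling the whole table before splitting lets the test rows influence the scaling constants. A possible alternative, sketched here on synthetic data, is to fit scikit-learn's `MinMaxScaler` on the training rows only and reuse it on the test rows. The column names and random values below are placeholders, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features and target (random placeholder values)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "feature_a": rng.uniform(100, 50000, size=50),
    "feature_b": rng.uniform(0, 20, size=50),
})
y = pd.Series(rng.uniform(40, 90, size=50), name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min and max from training rows only
X_test_scaled = scaler.transform(X_test)        # apply the same constants to test rows

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Test values can fall slightly outside 0 to 1 with this approach, which is expected: the scaler has never seen them. The target is left unscaled here so that errors stay in the original units.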

In [None]:
# Split data
life_data_X = life_data.drop("Life expectancy ", axis=1)
life_data_y = life_data["Life expectancy "]

In [None]:
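# Fix random_state so the split, and therefore the reported scores, are reproducible.
Xtrain, Xtest, ytrain, ytest = train_test_split(life_data_X, life_data_y, test_size=0.3, random_state=42)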

Since the target is known for every row and is continuous, this is a supervised regression problem, so I will use Linear Regression.

In [None]:
# The "normalize" option was removed from LinearRegression in scikit-learn 1.2, so it is not tuned here.
linearparam = {"fit_intercept": [True, False], "copy_X": [True, False]}
lineargrid = GridSearchCV(LinearRegression(), linearparam, cv=10)
lineargrid.fit(Xtrain, ytrain)
print("Best Linear Regression estimator:", lineargrid.best_estimator_)
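
To see what the grid search reports, here is a minimal sketch on synthetic data where the right answer is known in advance. The data are generated from a line with a non-zero intercept (an assumption made for illustration), so `fit_intercept=True` should win. `best_params_` and `best_score_` are standard `GridSearchCV` attributes; for a regressor the default score is R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data: y = 3x + 5 plus a little noise (placeholder values)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 0.1, size=200)

# Try the model with and without an intercept, scored by 5-fold cross-validation
grid = GridSearchCV(LinearRegression(), {"fit_intercept": [True, False]}, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Mean cross-validated R2 of best model:", grid.best_score_)
```

The same two attributes can be printed for `lineargrid` above to see which settings were selected and how well they scored during cross-validation.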

In [None]:
ypredictedlinear = lineargrid.best_estimator_.predict(Xtest)

mae = metrics.mean_absolute_error(ytest, ypredictedlinear)
mse = metrics.mean_squared_error(ytest, ypredictedlinear)
r2 = metrics.r2_score(ytest, ypredictedlinear)

print("Linear Regression performance:")
print("MAE:", mae)
print("MSE:", mse)
print("R2 score:", r2)
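
For reference, the three metrics can be computed by hand and checked against scikit-learn. The arrays below are made-up numbers chosen for easy arithmetic, not results from this notebook.

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred

# MAE: average absolute error
mae_manual = np.mean(np.abs(errors))

# MSE: average squared error, which penalises large errors more
mse_manual = np.mean(errors ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae_manual, metrics.mean_absolute_error(y_true, y_pred))
print("MSE:", mse_manual, metrics.mean_squared_error(y_true, y_pred))
print("R2:", r2_manual, metrics.r2_score(y_true, y_pred))
```

MAE and MSE depend on the units of the target, so after dividing life expectancy by its maximum they are expressed as fractions of that maximum. R² is unitless and unaffected by that rescaling.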