# Life Time:

# Dataset - Life expectancy 2000 - 2015

![](https://cdn.pixabay.com/photo/2018/04/29/01/23/skin-3358873__480.jpg)

### Import

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv')

### EDA (Exploratory Data Analysis)

In [None]:
df.head()

In [None]:
df.columns

We can see that some column names have leading or trailing spaces, so we apply .strip() to remove them.

In [None]:
df.rename(columns=lambda x: x.strip(), inplace=True)
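
As a quick check of what this rename does, here is a minimal sketch on a toy frame. The padded column names below are made up for illustration and are not the dataset's actual headers.

```python
import pandas as pd

# Toy frame with hypothetical, deliberately padded column names.
toy = pd.DataFrame({" Life expectancy ": [71.2], "Year ": [2000], " BMI": [19.1]})

# str.strip removes leading and trailing whitespace only; inner spaces are kept.
cleaned = toy.rename(columns=lambda name: name.strip())

print(list(toy.columns))
print(list(cleaned.columns))
```

The same result can be obtained with `toy.columns.str.strip()` assigned back to `toy.columns`.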

### NaN

In [None]:
df.isna().sum()

Since the dataset is small, we fill the missing values with the column mean instead of dropping rows with .dropna().

In [None]:
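# numeric_only=True skips text columns such as Country and Status,
# which recent pandas versions refuse to average.
df = df.fillna(df.mean(numeric_only=True))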
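
To make the effect of mean imputation concrete, here is a small sketch on a toy frame. The values are invented; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each numeric column.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Alcohol": [1.0, np.nan, 3.0, 5.0],
    "Schooling": [10.0, 12.0, np.nan, 14.0],
})

# Column means ignore NaN by default; text columns are skipped.
column_means = toy.mean(numeric_only=True)

# Each NaN is replaced by the mean of its own column.
filled = toy.fillna(column_means)
print(filled)
```

Note that every gap in a column receives the same value, so the column's variance shrinks a little after imputation.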

### Visualization

Let's visualize the life expectancy per country in 2000.

In [None]:
df_2000=(df[df.Year==2000]
    .groupby("Country")
    [["Life expectancy"]]
    .median()
    .sort_values(by="Life expectancy", ascending=True))

df_2000.plot(kind='bar', figsize=(50,10), fontsize=12)
plt.title("Life expectancy per Country in 2000",fontsize=30)
plt.xlabel("Country",fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.show()

We can see that Japan ranks first, with a clear gap over the second country.

What about 2011, the year of the Japan earthquake and tsunami?

In [None]:
df_2011=(df[df.Year==2011]
    .groupby("Country")
    [["Life expectancy"]]
    .median()
    .sort_values(by="Life expectancy", ascending=True))

df_2011.plot(kind='bar', figsize=(50,10), fontsize=12)
plt.title("Life expectancy per Country in 2011",fontsize=30)
plt.xlabel("Country",fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.show()

The life expectancy per country in 2015.

In [None]:
df_2015=(df[df.Year==2015]
    .groupby("Country")
    [["Life expectancy"]]
    .median()
    .sort_values(by="Life expectancy", ascending=True))

df_2015.plot(kind='bar', figsize=(50,10), fontsize=12)
plt.title("Life expectancy per Country for 2015",fontsize=30)
plt.xlabel("Country",fontsize=15)
plt.ylabel("Life expectancy 2015",fontsize=15)
plt.show()
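
The three cells above repeat the same filter, group and sort steps for different years. One possible way to factor them out is sketched below; the helper name `life_expectancy_by_country` is hypothetical and the toy values are invented.

```python
import pandas as pd

def life_expectancy_by_country(frame, year):
    """Median life expectancy per country for one year, sorted ascending."""
    return (frame[frame["Year"] == year]
            .groupby("Country")["Life expectancy"]
            .median()
            .sort_values())

# Toy data: three countries observed in two years.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Year": [2000, 2000, 2000, 2015, 2015, 2015],
    "Life expectancy": [60.0, 80.0, 70.0, 65.0, 82.0, 75.0],
})

ranking_2000 = life_expectancy_by_country(toy, 2000)
print(ranking_2000)
```

On the real data, `life_expectancy_by_country(df, 2011).plot(kind='bar', figsize=(50,10))` would produce the same kind of chart as above.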

Let's look at the mean life expectancy per country from 2000 to 2015.

In [None]:
life_expectancy_per_country = df.groupby('Country')['Life expectancy'].mean().sort_values(ascending=True)
life_expectancy_per_country.plot(kind='bar', figsize=(50,10), fontsize=12)
plt.title("Life expectancy mean per Country from 2000 to 2015",fontsize=30)
plt.xlabel("Country",fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.show()

We can visualize the difference between developed and developing countries.

In [None]:
plt.figure(figsize=(10,10))
# Group once and reuse the result for both the labels and the bar heights.
status_means = df.groupby('Status')['Life expectancy'].mean()
plt.bar(status_means.index, status_means)
plt.xlabel("Status",fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.title("Life expectancy for developed and developing countries",fontsize=20)
plt.show()

We can also look at how life expectancy relates to alcohol consumption and schooling.

In [None]:
plt.figure(figsize=(20,7))
plt.subplot(1, 2, 1)
plt.scatter(df["Alcohol"], df["Life expectancy"])
plt.xlabel("Alcohol",fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.title("Life expectancy - Alcohol",fontsize=17)

plt.subplot(1, 2, 2)
plt.scatter(df["Schooling"], df["Life expectancy"])
plt.xlabel("Schooling",fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.title("Life expectancy - Schooling",fontsize=17)
plt.show()

## Encoding strings

In [None]:
df

In [None]:
df['Status'].value_counts()

Let's use binary encoding for the Status column.

In [None]:
def encode_status(x):
    if x == 'Developed':
        return 1
    else:
        return 0

In [None]:
df['Status'] = df['Status'].apply(encode_status)
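
For reference, the same encoding can be written without a helper function. This is only an equivalent sketch on a toy Series, not a change to the cell above.

```python
import pandas as pd

# Toy Status column with the two labels found in the dataset.
status = pd.Series(["Developed", "Developing", "Developing", "Developed"])

# The comparison yields booleans; astype(int) turns True/False into 1/0.
encoded = (status == "Developed").astype(int)
print(encoded.tolist())
```

Like `encode_status`, this maps anything other than 'Developed' to 0, including missing values.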

And we use get_dummies for Country, since the dataset lists many countries.

In [None]:
df = pd.concat([df, pd.get_dummies(df['Country'], prefix='Country', drop_first=True)], axis=1)
df = df.drop(['Country'], axis=1)
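
To see what get_dummies with drop_first=True produces, here is a minimal sketch with three invented rows of countries.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["France", "Japan", "Peru", "Japan"],
    "Year": [2000, 2000, 2000, 2001],
})

# One indicator column per country; drop_first removes the first category
# (alphabetically 'France'), which becomes the all-zero baseline.
dummies = pd.get_dummies(toy["Country"], prefix="Country", drop_first=True)
encoded = pd.concat([toy.drop(columns="Country"), dummies], axis=1)

print(list(encoded.columns))
print(encoded)
```

Dropping one category avoids a redundant column: with an intercept in the model, the full set of indicators would always sum to 1.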

In [None]:
df

## Define X & y

Now that our dataset contains only numeric values, we can define our features X and our target y.

In [None]:
X = df.drop(['Life expectancy'], axis=1)

y = df['Life expectancy']

## Split

Now that our dataset is ready to feed our model, we need to split it and hide the test part, so we can later compare the predictions with the real values.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=21)
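
As a sanity check on what test_size=0.1 and random_state do, here is a small sketch on synthetic arrays (100 invented rows, not the real dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# 10% of the rows go to the test set; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=21)
print(X_tr.shape, X_te.shape)

# Calling again with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.1, random_state=21)
print(np.array_equal(X_te, X_te2))
```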

## Model fit & predict

Let's choose a model: a linear regression should be fine for this task.

In [None]:
lr = LinearRegression()

lr.fit(X_train, y_train)

Our model is trained, so it can now predict!

In [None]:
y_pred = lr.predict(X_test)
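
To illustrate what fit and predict do, here is a toy sketch where the true relation is known in advance (y = 2x + 1, an assumption made up for this example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four points lying exactly on the line y = 2x + 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

model = LinearRegression().fit(X_toy, y_toy)

# The fitted slope and intercept should recover the line.
print(model.coef_, model.intercept_)

# Prediction for an unseen input x = 4.
pred = model.predict(np.array([[4.0]]))
print(pred)
```

On the real data the fit is not exact, which is why we measure the error on the held-out test set.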

In [None]:
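# A cell only displays its last expression, so print both metrics explicitly.
print("MSE:", mean_squared_error(y_test, y_pred))

print("R2 (%):", r2_score(y_test, y_pred)*100)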
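
To make the two metrics less of a black box, the sketch below recomputes them by hand on four invented values and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented true values and predictions, in years.
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_hat = np.array([68.0, 66.0, 79.0, 77.0])

residuals = y_true - y_hat

# MSE: mean of the squared residuals.
mse = np.mean(residuals ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE is expressed in the same unit as the target (years).
rmse = np.sqrt(mse)

print(mse, r2, rmse)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

The square root of the MSE is often easier to read here, since it is an error in years of life expectancy.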