# Introduction

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        path = os.path.join(dirname, filename)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv(path)

**Country**
Country

**Year**
Year

**Status**
Developed or developing status of the country

**Life expectancy**
Life expectancy (years)

**Adult Mortality**
Adult mortality rates for both sexes (probability of dying between the ages of 15 and 60 per 1000 population)

**infant deaths**
Infant deaths per 1000 population

**Alcohol**
Recorded alcohol consumption per capita (ages 15+), in liters of pure alcohol

**percentage expenditure**
Health care expenditure as a percentage of gross domestic product per capita (%)

**Hepatitis B**
Immunization coverage against hepatitis B (HepB) among one year old children (%)

**Measles**
Number of reported measles cases per 1000 population

**BMI**
Average body mass index of the entire population

**under-five deaths**
Deaths of children under five years of age per 1000 population

**Polio**
Polio immunization coverage (Pol3) among one-year-old children (%)

**Total expenditure**
General government expenditure on health as a percentage of total government expenditure (%)

**Diphtheria**
Immunization coverage against diphtheria, tetanus toxoid and pertussis (DTP3) among one-year-old children (%)

**HIV/AIDS**
Deaths per 1,000 live births due to HIV/AIDS (0-4 years)

**GDP**
Gross Domestic Product per capita (in US dollars)

**Population**
Population of the country

**thinness 1-19 years**
Prevalence of thinness among children and adolescents aged 10 to 19 years (%)

**thinness 5-9 years**
Prevalence of thinness among children aged 5 to 9 (%)

**Income composition of resources**
Human Development Index in terms of income composition of resources (index from 0 to 1)

**Schooling**
Number of years of schooling

In [None]:
df.head()

Some column names contain leading or trailing spaces, which should be removed.

In [None]:
for col in df.columns:
    # Strip trailing whitespace from the column name and report the rename
    if col.endswith(' '):
        df = df.rename(columns={col: col.rstrip()})
        print({col: col.rstrip()})

In [None]:
for col in df.columns:
    # Strip leading whitespace from the column name and report the rename
    if col.startswith(' '):
        df = df.rename(columns={col: col.lstrip()})
        print({col: col.lstrip()})
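
The two loops above can also be collapsed into one vectorized call. The sketch below uses a small made-up frame (the values are hypothetical) to show `Index.str.strip()`, which removes any amount of leading and trailing whitespace from every column name at once.

```python
import pandas as pd

# Toy frame with padded headers like the ones described above (hypothetical values).
toy = pd.DataFrame({' BMI ': [19.1], 'Life expectancy ': [65.0], 'Country': ['A']})

# Index.str.strip() trims whitespace on both sides of every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

Note that `strip()` only touches the ends of a name, so interior whitespace, such as the double space in `'thinness  1-19 years'`, is left unchanged.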

In [None]:
# Use only part of the data for visualization (fixed seed so the sample is reproducible)
df_sample = df.sample(frac=0.15, random_state=0)

The columns of the table can be divided into the following groups:
* immunization factors
* mortality factors
* economic factors
* social factors

To answer the question of which predictive variables really affect life expectancy, we will examine them group by group.

# Immunization factors 

In [None]:
imm_cols = ['Hepatitis B', 'Polio', 'Diphtheria']

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

# One scatter plot per immunization column, coloured by health spending
for ax, col in zip(axis.flat, imm_cols):
    pcm = ax.scatter(df_sample[col], df_sample['Life expectancy'], s=18, c=df_sample['Total expenditure'], alpha=0.8)
    ax.set_xlabel(col)
axis[0].set_ylabel('Life expectancy')
fig.colorbar(pcm, ax=axis, label='Total expenditure')
plt.show()

The graphs show an association between the percentage of the population that is vaccinated and life expectancy: the higher the coverage, the higher the life expectancy. At this stage, however, it cannot be said that the relationship is causal. Both life expectancy and immunization coverage may be driven by economic factors: populations in countries with lower health spending tend to have shorter life expectancies.

It is also worth noting that in a number of countries vaccination coverage is reported as 10% or less while life expectancy is high. This may indicate either unreliable (possibly falsified) data or other decisive factors prevailing in these countries.
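
One way to probe whether spending confounds the vaccination effect is to recompute the correlation inside bins of similar spending. The following is a minimal sketch on synthetic data, not on the notebook's dataset: the helper `corr_within_bins` and the generated numbers are assumptions made for illustration. If the correlation shrinks noticeably within bins, part of the overall association is attributable to the binning variable.

```python
import numpy as np
import pandas as pd

def corr_within_bins(frame, x, y, by, q=3):
    """Pearson correlation of x and y inside each quantile bin of `by` (hypothetical helper)."""
    bins = pd.qcut(frame[by], q=q, duplicates='drop')
    return pd.Series({str(interval): grp[x].corr(grp[y])
                      for interval, grp in frame.groupby(bins, observed=True)})

# Synthetic data in which spending drives both coverage and life expectancy.
rng = np.random.default_rng(0)
spend = rng.uniform(1, 15, 600)
toy = pd.DataFrame({
    'Total expenditure': spend,
    'Polio': 40 + 4 * spend + rng.normal(0, 5, 600),
    'Life expectancy': 50 + 2 * spend + rng.normal(0, 3, 600),
})

overall = toy['Polio'].corr(toy['Life expectancy'])
within = corr_within_bins(toy, 'Polio', 'Life expectancy', by='Total expenditure')
print(f'overall: {overall:.2f}')
print(within.round(2))
```

On the real data the same helper could be called with `df` and any of the `imm_cols`; the result there may of course look different from this synthetic case.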

In [None]:
pdplot = pd.plotting.scatter_matrix(df[imm_cols])

The scatter matrix shows that some countries may lack mandatory vaccination against certain diseases, which would explain the outlying points in the previous graphs.

# Mortality factors

In [None]:
die_cols = ['infant deaths', 'under-five deaths', 'HIV/AIDS', 'Adult Mortality']

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=4, figsize=(12, 4))

# Adult mortality is plotted in full in the last panel
axis[3].scatter(df_sample['Adult Mortality'], df_sample['Life expectancy'], s=18, c=df_sample['Income composition of resources'], alpha=0.8)
axis[3].set_xlabel('Adult Mortality')

# The first three columns are restricted to values below 80 to keep extreme values off the plots
for ax, col in zip(axis[:3], die_cols[:3]):
    subset = df_sample[df_sample[col] < 80]
    pcm = ax.scatter(subset[col], subset['Life expectancy'], s=18, c=subset['Income composition of resources'], alpha=0.8)
    ax.set_xlabel(col)
axis[0].set_ylabel('Life expectancy')
fig.colorbar(pcm, ax=axis, label='Income composition of resources')
plt.show()

Life expectancy is clearly related to adult mortality (the probability of dying between the ages of 15 and 60). The infant mortality rate also correlates with the overall life expectancy of the population: in countries with high life expectancy, there is almost no child mortality. These parameters can also be linked to the Human Development Index.
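
The visual impression can be backed by a rank correlation, which does not assume a linear relationship. The sketch below runs `DataFrame.corrwith` with `method='spearman'` on synthetic data; the generated values are assumptions for illustration only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two of the mortality columns (hypothetical values).
rng = np.random.default_rng(1)
life = rng.uniform(45, 85, 300)
toy = pd.DataFrame({
    'Life expectancy': life,
    'Adult Mortality': 700 - 7 * life + rng.normal(0, 30, 300),
    'infant deaths': np.exp((90 - life) / 10) + rng.normal(0, 1, 300),
})

# Spearman correlation of each mortality column with life expectancy.
rho = toy[['Adult Mortality', 'infant deaths']].corrwith(toy['Life expectancy'], method='spearman')
print(rho.round(2))
```

With the notebook's data, the equivalent call would be `df[die_cols].corrwith(df['Life expectancy'], method='spearman')`.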

# Economic factors

In [None]:
ec_cols = ['percentage expenditure', 'Total expenditure', 'GDP', 'Income composition of resources']

In [None]:
fig = plt.figure(figsize=(10, 5))
grid = plt.GridSpec(4, 4, hspace=0.2, wspace=0.2)
main_ax = fig.add_subplot(grid[:-1, 1:])
y_hist = fig.add_subplot(grid[:-1, 0], xticklabels=[], sharey=main_ax)
x_hist = fig.add_subplot(grid[-1, 1:], yticklabels=[], sharex=main_ax)

main_ax.scatter(df_sample['Total expenditure'], df_sample['Life expectancy'], s = 18, alpha=0.8)
main_ax.plot([0, 17], [65, 65], c='red')
main_ax.text(10, 66, 'Life expectancy > 65', fontsize=12)


x_hist.hist(df['Total expenditure'], bins=40, histtype='stepfilled', orientation='vertical')
x_hist.invert_yaxis()
x_hist.set_xlabel('Total expenditure')

y_hist.hist(df['Life expectancy'], bins=40, histtype='stepfilled', orientation='horizontal')
y_hist.set_ylabel('Life expectancy')
plt.show()

Countries with the highest healthcare spending (rows where 'Total expenditure' is at least 10%), with the mean over those rows for each country:

In [None]:
df.loc[df['Total expenditure'] >= 10, ['Country', 'Total expenditure']].groupby('Country').mean()

The graphs show no clear relationship between health care spending and life expectancy. Countries with a low life expectancy (< 65) spend not much less on health care than countries in which people live longer.
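
The comparison around the 65-year line can be summarized numerically by grouping rows on either side of the threshold. This is a minimal sketch on a tiny made-up table; the numbers are hypothetical and only demonstrate the grouping.

```python
import numpy as np
import pandas as pd

# Tiny made-up table (hypothetical values) with the two columns being compared.
toy = pd.DataFrame({
    'Life expectancy': [52, 60, 64, 66, 72, 80],
    'Total expenditure': [5.0, 6.0, 4.0, 6.0, 5.0, 7.0],
})

# Label each row by the side of the 65-year line it falls on, then summarize spending.
group = np.where(toy['Life expectancy'] < 65, 'below 65', '65 and above')
summary = toy.groupby(group)['Total expenditure'].agg(['count', 'mean', 'median'])
print(summary)
```

Applied to `df`, a small gap between the two group means would support the statement above, while a large gap would contradict it.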

# Social factors

In [None]:
soc_cols = ['BMI', 'thinness  1-19 years', 'thinness 5-9 years', 'Alcohol', 'Population', 'Schooling']

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

# Only the first three social columns fit into the three panels
for ax, col in zip(axis.flat, soc_cols):
    ax.scatter(df_sample[col], df_sample['Life expectancy'], s=18, alpha=0.8)
    ax.set_xlabel(col)
axis[0].set_ylabel('Life expectancy')
plt.show()

Life expectancy is also higher in countries with a high average body mass index. Being overweight is unlikely to be a cause of longer life; more probably, both are influenced by the general well-being of the population.

In [None]:
plt.scatter(df_sample['Schooling'], df_sample['Life expectancy'],  c=df_sample['Income composition of resources'])
plt.xlabel('Schooling')
plt.ylabel('Life expectancy')
plt.colorbar(label='Income composition of resources')
plt.show()

The last graph shows a direct relationship between life expectancy and the average number of years that a country's citizens spend in education.

# Predictive model

We use LinearRegression from the scikit-learn library to build a predictive model of life expectancy and see which factors have a significant impact.

In [None]:
# Features used by the model
X_full = ['Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
            'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
            'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
            'thinness 5-9 years', 'Income composition of resources', 'Schooling']

In [None]:
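data = df.dropna(subset=['Life expectancy'])  # drop rows with a missing target instead of filling it with 0
X_train, X_test, y_train, y_test = train_test_split(data[X_full].fillna(0), data['Life expectancy'], random_state=0)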

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
print(f'R^2 on the training set: {lr.score(X_train, y_train):.2f}')
print(f'R^2 on the testing set: {lr.score(X_test, y_test):.2f}')
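
For reference, the `score` method of `LinearRegression` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (the data and the variable names are assumptions for illustration) and compares the result with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'manual: {r2_manual:.4f}, score(): {model.score(X, y):.4f}')
```

A value of 1 means a perfect fit, while 0 corresponds to always predicting the mean of the target; on held-out data the score can even be negative.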

Next, consider the correlations between the features.

In [None]:
plt.figure(figsize=(13, 10))
sns.heatmap(df[X_full+['Life expectancy']].corr(), annot=True, cmap='coolwarm')
plt.show()

Let's drop one of the two thinness features and two of the three vaccination rates (Hepatitis B and Diphtheria), as well as Total expenditure, thereby simplifying the model.
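
Candidates for removal can also be listed programmatically instead of being read off the heatmap. The following sketch uses a hypothetical helper, `correlated_pairs`, and synthetic columns that merely borrow names from the dataset; the threshold of 0.8 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Feature pairs whose absolute Pearson correlation exceeds `threshold` (hypothetical helper)."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so that each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic columns: the two thinness columns are near-duplicates, Alcohol is independent.
rng = np.random.default_rng(0)
base = rng.normal(5, 2, 300)
toy = pd.DataFrame({
    'thinness  1-19 years': base,
    'thinness 5-9 years': base + rng.normal(0, 0.3, 300),
    'Alcohol': rng.normal(4, 2, 300),
})

result = correlated_pairs(toy)
print(result)
```

On the notebook's data the call would be `correlated_pairs(df[X_full])`; from each reported pair, one feature is a candidate for removal.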

In [None]:
# Reduced feature set
X_sample = ['Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure',
            'Measles', 'BMI', 'under-five deaths', 'Polio',
            'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
            'Income composition of resources', 'Schooling']

In [None]:
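data = df.dropna(subset=['Life expectancy'])  # drop rows with a missing target instead of filling it with 0
X_train, X_test, y_train, y_test = train_test_split(data[X_sample].fillna(0), data['Life expectancy'], random_state=0)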

In [None]:
lr.fit(X_train, y_train)

In [None]:
print(f'R^2 on the training set: {lr.score(X_train, y_train):.2f}')
print(f'R^2 on the testing set: {lr.score(X_test, y_test):.2f}')

The result is a linear model that predicts life expectancy from social, economic and health factors.
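
To see which factors carry the most weight, the coefficients of the fitted model can be inspected. Raw coefficients depend on the units of each column, so the sketch below standardizes the features first; this is one possible approach on synthetic data, not the method used in the cells above, and the generated values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: life expectancy depends on schooling and adult mortality, not on population.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'Schooling': rng.normal(12, 3, n),
    'Adult Mortality': rng.normal(160, 100, n),
    'Population': rng.normal(1e7, 5e6, n),
})
toy['Life expectancy'] = (60 + 1.5 * toy['Schooling']
                          - 0.03 * toy['Adult Mortality']
                          + rng.normal(0, 2, n))

features = ['Schooling', 'Adult Mortality', 'Population']

# Scaling puts every feature on the same footing, so coefficient sizes become comparable.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(toy[features], toy['Life expectancy'])

coefs = pd.Series(pipe.named_steps['linearregression'].coef_, index=features)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

Each coefficient is the change in predicted life expectancy for an increase of one standard deviation in that feature. With strongly correlated features, individual coefficients become unstable and should be interpreted with caution.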