# Exploratory data analysis of life expectancy

**I took this dataset from Kaggle. It was originally provided by the Global Health Observatory (GHO) data repository of the World Health Organization (WHO), which keeps track of health status and many other related factors for all countries.**

### Data understanding

In [None]:
import pandas as pd
from pandas import DataFrame
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib import rcParams
import plotly.graph_objects as go
import plotly.express as px
from plotly.colors import n_colors
import numpy as np
import seaborn as sns
import pandas_profiling
%matplotlib inline
from matplotlib import rc
import scipy.stats
from scipy.stats.mstats import winsorize

In [None]:
life_expectancy = pd.read_csv("../input/life-expectancy-who/Life Expectancy Data.csv") #reading the file

In [None]:
life_expectancy.head(5)

**In order to understand the data we need to know the meaning of each column.**

Country - Country name

Year - Year

Status - Status of the given country (Either Developing or Developed)

Life expectancy - in years

Adult Mortality - Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)

Infant deaths - Number of Infant Deaths per 1000 population

Alcohol - Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)

Percentage expenditure - Expenditure on health as a percentage of Gross Domestic Product per capita(%)

Hepatitis B - Hepatitis B (HepB) immunization coverage among 1-year-olds (%)

Measles - Number of reported measles cases per 1000 population

BMI - Average Body Mass Index of entire population

Under-five deaths - Number of under-five deaths per 1000 population

Polio - Polio (Pol3) immunization coverage among 1-year-olds (%)

Total expenditure - General government expenditure on health as a percentage of total government expenditure (%)

Diphtheria - Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)

HIV/AIDS - Deaths per 1000 live births due to HIV/AIDS (0-4 years)

GDP - Gross Domestic Product per capita (in USD)

Population - Population of the country

thinness 1-19 years - Prevalence of thinness among children and adolescents aged 10 to 19 (%). Note that the column name says 1-19 although the description refers to ages 10 to 19.

thinness 5-9 years - Prevalence of thinness among children aged 5 to 9 (%)

Income composition of resources - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

Schooling - Number of years of schooling (years)

In [None]:
life_expectancy.describe()

In [None]:
life_expectancy.shape

In [None]:
life_expectancy.columns

**As we can see here, there are only two categorical variables: Country and Status.**

**Now let's change the names of the columns to make all variables uniform.**
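
Several raw headers carry stray leading or trailing spaces, so the names can also be tidied programmatically. The following is only a sketch on a hypothetical empty frame: the helper `tidy` is an illustrative name, and its output keeps hyphens (for example `Under-five_deaths`), so it does not reproduce the hand-written names used in this notebook exactly.

```python
import pandas as pd

# Hypothetical frame whose headers mimic the untidy ones in the raw file.
raw = pd.DataFrame(columns=["Country", "Life expectancy ", " BMI ",
                            "under-five deaths ", " thinness  1-19 years"])

def tidy(name: str) -> str:
    """Trim outer whitespace, join words with '_', capitalise the first letter."""
    joined = "_".join(name.strip().split())
    return joined[0].upper() + joined[1:]

# rename() accepts a function that is applied to every column label.
clean = raw.rename(columns=tidy)
print(list(clean.columns))
```

A manual mapping, as used in the next cell, gives full control over each final name.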

In [None]:
life_expectancy.rename(columns = {" BMI " :"BMI", 
                                  "Life expectancy ": "Life_expectancy",
                                  "Adult Mortality":"Adult_mortality",
                                  "infant deaths":"Infant_deaths",
                                  "percentage expenditure":"Percentage_expenditure",
                                  "Hepatitis B":"HepatitisB",
                                  "Measles ":"Measles",
                                  "under-five deaths ": "Under_five_deaths",
                                  "Total expenditure":"Total_expenditure",
                                  "Diphtheria ": "Diphtheria",
                                  " thinness  1-19 years":"Thinness_1-19_years",
                                  " thinness 5-9 years":"Thinness_5-9_years",
                                  " HIV/AIDS":"HIV/AIDS",
                                  "Income composition of resources":"Income_composition_of_resources"}, inplace = True)

### Handling missing values

Let's check the data types and the null/non-null counts of the dataset.

In [None]:
life_expectancy.info()

The columns with null values are Life_expectancy, Adult_mortality, Alcohol, HepatitisB, BMI, Polio, Total_expenditure, Diphtheria, GDP, Population, Thinness_1-19_years, Thinness_5-9_years, Income_composition_of_resources, and Schooling.

In [None]:
life_expectancy.nunique(axis=0)

In [None]:
print(life_expectancy.isnull().sum())

There are many columns with null values, but the amount of missing values is not big enough to drop the columns. So imputing the missing values would be a good idea.

We also know that all the columns with missing values are numerical continuous variables.

Filling missing values with the mean would not be a good idea because of the outliers.

We could also fill them with the median, but since the data is organised by country and year, the median value will vary from country to country.

The appropriate solution would be to interpolate the values within each country.

In [None]:
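life_expectancy.reset_index(inplace=True)
num_cols = life_expectancy.select_dtypes(include='number').columns.tolist()
# Assign the result back: interpolate() returns new values, and without assignment they are discarded.
life_expectancy[num_cols] = life_expectancy.groupby('Country')[num_cols].transform(lambda s: s.interpolate(method='linear'))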

In [None]:
print(life_expectancy.isnull().sum())

**Well! It turned out that interpolation didn't fill all the missing values. There are two reasons behind it.**

1) For some countries, a column is null in every year, so there is nothing to interpolate from.

2) Many countries have a null as their first value, and this method doesn't fill leading null entries.
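
Both reasons can be reproduced on a small example. This is a sketch on made-up numbers (the frame `toy` and its values are hypothetical, not taken from the WHO data), using linear interpolation within each country with pandas' default settings.

```python
import numpy as np
import pandas as pd

# Hypothetical values; three countries, three years each.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Year": [2015, 2014, 2013] * 3,
    "GDP": [100.0, np.nan, 300.0,     # A: interior gap
            np.nan, 200.0, 250.0,     # B: leading gap
            np.nan, np.nan, np.nan],  # C: no observations at all
})

# Interpolate linearly inside each country and keep the result.
toy["GDP_interp"] = toy.groupby("Country")["GDP"].transform(
    lambda s: s.interpolate(method="linear")
)
print(toy)
print("Nulls left:", toy["GDP_interp"].isna().sum())
```

Only the interior gap of country A is filled; the leading null of B and every value of C stay missing.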

So, the next possible method would be to fill the missing values using the median, but yearwise.
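
As a compact illustration of yearwise median imputation, the sketch below uses `groupby(...).transform("median")` on a made-up frame. The data and the frame name `toy` are hypothetical; this is an alternative way of expressing the idea, not the loop used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing entry in each year.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Year": [2014, 2014, 2014, 2015, 2015, 2015],
    "Alcohol": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# Median of each year, broadcast back to every row of that year (NaNs are skipped).
year_median = toy.groupby("Year")["Alcohol"].transform("median")

# Fill only the missing entries with the median of their own year.
toy["Alcohol"] = toy["Alcohol"].fillna(year_median)
print(toy)
```

Each gap receives the median of the other countries in the same year, so a value from 2014 never leaks into 2015.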

In [None]:
imputed_data = []
for year in list(life_expectancy.Year.unique()):
    year_data = life_expectancy[life_expectancy.Year == year].copy()
    for col in list(year_data.columns)[4:]:
        year_data[col] = year_data[col].fillna(year_data[col].dropna().median()).copy()
    imputed_data.append(year_data)
life_expectancy = pd.concat(imputed_data).copy()

In [None]:
life_expectancy.describe()

In [None]:
print(life_expectancy.isnull().sum())

### Dealing with outliers

**Now let's deal with outliers.**

In [None]:
col_dict = {'Life_expectancy':1,'Adult_mortality':2,'Infant_deaths':3,'Alcohol':4,'Percentage_expenditure':5,'HepatitisB':6,'Measles':7,'BMI':8,'Under_five_deaths':9,'Polio':10,'Total_expenditure':11,'Diphtheria':12,'HIV/AIDS':13,'GDP':14,'Population':15,'Thinness_1-19_years':16,'Thinness_5-9_years':17,'Income_composition_of_resources':18,'Schooling':19}

# Detect outliers in each variable using box plots.
fig = plt.figure(figsize=(20,30))

for variable, i in col_dict.items():
    plt.subplot(5, 4, i)
    plt.boxplot(life_expectancy[variable])
    plt.title(variable)
    plt.grid(True)
    
plt.show()


Infant_deaths represents the number of infant deaths per 1000 population, so values above 1000 are unrealistic. We will remove them as outliers.

The same applies to Measles and Under_five_deaths, because both are also numbers per 1000 population.

As we can see, some countries appear to spend as much as 20000% of their GDP on health, while most countries spend under 2500%. Since the values in the Percentage_expenditure, GDP, and Population columns are very large, it's better to take a log value or use winsorization if required.
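
One thing to watch with log transforms is that the plain logarithm of zero is negative infinity. A possible alternative, shown here only as a sketch on made-up numbers, is `np.log1p`, which computes log(1 + x) and therefore maps zero to zero. This is a swapped-in variant, not the transform applied in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical expenditure values, including a zero.
spend = pd.Series([0.0, 9.0, 99.0, 19999.0], name="Percentage_expenditure")

# log1p(x) = log(1 + x): defined at zero and still compresses large values.
log_spend = np.log1p(spend)
print(log_spend)
```

With `np.log`, the zero would become `-inf` and would need separate handling afterwards.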

The BMI values are very unrealistic: a value above 40 is considered extreme obesity, yet the median is more than 40 and some countries have a mean around 60, which is not possible. We may drop that entire column.

Since pretty much every other column has outliers, we can use winsorization.
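
To make the effect of winsorization concrete, here is a small sketch on a made-up array. The limits tuple is `(lower fraction, upper fraction)`: the given share of the smallest and largest values is replaced by the nearest value that is kept.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical sample with one extreme value at the top.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Cap the top 10% (one value out of ten); leave the lower end untouched.
capped = winsorize(values, limits=(0.0, 0.1))
print(np.asarray(capped))
```

The number of rows is unchanged; the extreme value is pulled in to the largest remaining value instead of being dropped.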


In [None]:
life_expectancy = life_expectancy[life_expectancy['Infant_deaths'] < 1001]
life_expectancy = life_expectancy[life_expectancy['Measles'] < 1001]
life_expectancy = life_expectancy[life_expectancy['Under_five_deaths'] < 1001]

In [None]:
life_expectancy.drop(['BMI'], axis=1, inplace=True)

In [None]:
life_expectancy['log_Percentage_expenditure'] = np.log(life_expectancy['Percentage_expenditure'])
life_expectancy['log_Population'] = np.log(life_expectancy['Population'])
life_expectancy['log_GDP'] = np.log(life_expectancy['GDP'])

In [None]:
life_expectancy = life_expectancy.replace([np.inf, -np.inf], 0)
life_expectancy['log_Percentage_expenditure']

In [None]:
life_expectancy['winz_Life_expectancy'] = winsorize(life_expectancy['Life_expectancy'], (0.05,0))
life_expectancy['winz_Adult_mortality'] = winsorize(life_expectancy['Adult_mortality'], (0,0.04))
life_expectancy['winz_Alcohol'] = winsorize(life_expectancy['Alcohol'], (0.0,0.01))
life_expectancy['winz_HepatitisB'] = winsorize(life_expectancy['HepatitisB'], (0.20,0.0))
life_expectancy['winz_Polio'] = winsorize(life_expectancy['Polio'], (0.20,0.0))
life_expectancy['winz_Total_expenditure'] = winsorize(life_expectancy['Total_expenditure'], (0.0,0.02))
life_expectancy['winz_Diphtheria'] = winsorize(life_expectancy['Diphtheria'], (0.11,0.0))
life_expectancy['winz_HIV/AIDS'] = winsorize(life_expectancy['HIV/AIDS'], (0.0,0.21))
life_expectancy['winz_Thinness_1-19_years'] = winsorize(life_expectancy['Thinness_1-19_years'], (0.0,0.04))
life_expectancy['winz_Thinness_5-9_years'] = winsorize(life_expectancy['Thinness_5-9_years'], (0.0,0.04))
life_expectancy['winz_Income_composition_of_resources'] = winsorize(life_expectancy['Income_composition_of_resources'], (0.05,0.0))
life_expectancy['winz_Schooling'] = winsorize(life_expectancy['Schooling'], (0.03,0.01))

In [None]:
col_dict_winz = {'winz_Life_expectancy':1,'winz_Adult_mortality':2,'Infant_deaths':3,'winz_Alcohol':4,
            'log_Percentage_expenditure':5,'winz_HepatitisB':6,'Measles':7,'Under_five_deaths':8,'winz_Polio':9,
            'winz_Total_expenditure':10,'winz_Diphtheria':11,'winz_HIV/AIDS':12,'log_GDP':13,'log_Population':14,
            'winz_Thinness_1-19_years':15,'winz_Thinness_5-9_years':16,'winz_Income_composition_of_resources':17,
            'winz_Schooling':18}


fig = plt.figure(figsize=(20,20))
for variable, i in col_dict_winz.items():
    plt.subplot(5, 6, i)
    sns.boxplot(y=life_expectancy[variable], color="pink")
    plt.title(variable)
    plt.ylabel('')
    plt.grid(True)
    
plt.show()


In [None]:
life_expectancy.shape

In [None]:
life_expectancy.Status.unique()

In [None]:
life_expectancy.Country.unique()

In [None]:
life_expectancy.Country.nunique()

In [None]:
print(life_expectancy.groupby('Status').Country.nunique())

In [None]:
life_expectancy.Year.unique()

**After imputing missing values and dealing with outliers, we are left with 2413 rows.**

### Data Exploration


In [None]:
fig = plt.figure(figsize=(20,20))
for variable, i in col_dict_winz.items():
    plt.subplot(5, 6, i)
    plt.hist(life_expectancy[variable])
    plt.title(variable)
    plt.ylabel('')
    plt.grid(True)
    
plt.show()


In [None]:
life_exp = life_expectancy[['Year', 'Country', 'Status','winz_Life_expectancy','winz_Adult_mortality','Infant_deaths','winz_Alcohol',
            'log_Percentage_expenditure','winz_HepatitisB','Measles','Under_five_deaths','winz_Polio',
            'winz_Total_expenditure','winz_Diphtheria','winz_HIV/AIDS','log_GDP','log_Population',
            'winz_Thinness_1-19_years','winz_Thinness_5-9_years','winz_Income_composition_of_resources',
            'winz_Schooling']]
plt.figure(figsize=(15,10))
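# Correlate numeric columns only; newer pandas versions raise an error on text columns such as Country.
sns.heatmap(life_exp.select_dtypes(include='number').corr(), annot=True, linewidths=4)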

**Some insights from the heatmap are following:**

Adult_mortality has negative relation with Schooling, Income_composition_of_resources and positive relation with HIV/AIDS.

Infant_deaths and Under_five_deaths have strong positive relationship.

Schooling and Alcohol have some positive relationship.

Percentage_expenditure has positive relation with Schooling, Income_composition_of_resources, GDP, and Life_expectancy. 

HepatitisB has strong positive relation with Polio and Diphtheria. Polio also has strong positive relation with Diphtheria, HepatitisB, and Life_expectancy.

Diphtheria has strong positive relation with Polio and Life_expectancy.

**Through data exploration we will try to explore life_expectancy.**

As we can see from the heatmap, Life_expectancy has a positive relation with Schooling, Income_composition_of_resources, GDP, Diphtheria, Polio, and Percentage_expenditure.

Life_expectancy has a negative relation with Adult_mortality, Thinness_1-19_years, Thinness_5-9_years, HIV/AIDS, Under_five_deaths, and Infant_deaths.
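
Relations like these can also be read off numerically by taking the target's column of the correlation matrix and sorting it. The sketch below uses a tiny made-up frame (all values are hypothetical) rather than the notebook's data.

```python
import pandas as pd

# Hypothetical values chosen so that the signs are obvious.
toy = pd.DataFrame({
    "Life_expectancy": [55.0, 60.0, 65.0, 70.0, 75.0, 80.0],
    "Schooling": [6.0, 8.0, 9.0, 12.0, 13.0, 16.0],
    "Adult_mortality": [400.0, 330.0, 280.0, 200.0, 150.0, 80.0],
    "Status": ["Developing"] * 4 + ["Developed"] * 2,
})

# Pearson correlation of every numeric column with the target, most negative first.
corr_with_target = (
    toy.select_dtypes(include="number")
       .corr()["Life_expectancy"]
       .drop("Life_expectancy")
       .sort_values()
)
print(corr_with_target)
```

Sorting puts the strongest negative relations at the top and the strongest positive ones at the bottom, which is easier to scan than a large heatmap.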

Let's explore them in detail.


In [None]:
# numeric_only=True skips text columns such as Country, which newer pandas versions refuse to average.
status_life_exp = life_expectancy.groupby(by=['Status']).mean(numeric_only=True).reset_index().sort_values('winz_Life_expectancy',ascending=False).reset_index(drop=True)

fig = px.bar(status_life_exp, x='Status', y='winz_Life_expectancy',color='winz_Life_expectancy')

fig.update_layout(
        title="Life expectancy according to status",
        xaxis_title="Status",
        yaxis_title="Average Life Expectancy",
        font=dict(
            family="Courier New",
            size=16,
            color="black"
        )
    )
fig.show()


In [None]:
life_year = life_expectancy.groupby(by = ['Year', 'Status']).mean(numeric_only=True).reset_index()  # average numeric columns only
Developed = life_year.loc[life_year['Status'] == 'Developed',:]
Developing = life_year.loc[life_year['Status'] == 'Developing',:]
fig1 = go.Figure()
for template in ["plotly_dark"]:
    fig1.add_trace(go.Scatter(x=Developing['Year'], y=Developing['winz_Life_expectancy'],
                    mode='lines',
                    name='Developing',
                    marker_color='#f075c2'))
    fig1.add_trace(go.Scatter(x=Developed['Year'], y=Developed['winz_Life_expectancy'],
                    mode='lines',
                    name='Developed',
                    marker_color='#28d2c2'))
    fig1.update_layout(
    height=500,
    xaxis_title="Years",
    yaxis_title='Life expectancy in age',
    title_text='Average Life expectancy of Developing and Developed countries over the years',
    template=template)
fig1.show()

**We can see from the two graphs above that developed countries have a higher life expectancy than developing countries.**

In [None]:
sns.pairplot(life_expectancy, x_vars=["winz_Income_composition_of_resources", "winz_Schooling","log_GDP","winz_Diphtheria"], y_vars=["winz_Life_expectancy"],
             hue="Status",markers=["o", "x"], height=6, aspect=.7, kind="reg");

**Schooling appears to affect life expectancy more in developing countries than in developed countries. This may be because education is already more established and prevalent in wealthier countries, which also tend to have less corruption and better infrastructure, healthcare, welfare, and so forth. The same applies to GDP, Diphtheria and Polio.**
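
The comparison of regression lines can be backed by numbers by fitting a separate line for each status and comparing the slopes. The following sketch uses `scipy.stats.linregress` on made-up points (the frame `toy` and its values are hypothetical), so the slopes are for illustration only.

```python
import pandas as pd
from scipy.stats import linregress

# Hypothetical points: a steep trend for one group, a flat one for the other.
toy = pd.DataFrame({
    "Status": ["Developing"] * 4 + ["Developed"] * 4,
    "Schooling": [4.0, 8.0, 10.0, 12.0, 14.0, 15.0, 16.0, 18.0],
    "Life_expectancy": [52.0, 60.0, 64.0, 68.0, 78.0, 78.5, 79.0, 80.0],
})

# Fit one least-squares line per status and keep its slope
# (years of life expectancy per extra year of schooling).
slopes = {
    status: linregress(group["Schooling"], group["Life_expectancy"]).slope
    for status, group in toy.groupby("Status")
}
print(slopes)
```

A larger slope for one group means the fitted line rises faster there, which is what the steeper regression line in the pair plot shows visually.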

In [None]:
sns.pairplot(life_expectancy, x_vars=["winz_Adult_mortality", "winz_Thinness_1-19_years","winz_Thinness_5-9_years","winz_HIV/AIDS"], y_vars=["winz_Life_expectancy"],
             hue="Status",markers=["o", "x"], height=6, aspect=.7, kind="reg")
plt.ylim(60, 90)

In [None]:
sns.pairplot(life_expectancy, x_vars=["winz_HIV/AIDS","Infant_deaths","winz_Polio","log_Percentage_expenditure"], y_vars=["winz_Life_expectancy"],
             hue="Status",markers=["o", "x"], height=8, aspect=.7, kind="reg")
plt.ylim(60, 90)


In [None]:
sns.pairplot(life_expectancy, x_vars=["winz_Alcohol"], y_vars=["winz_Life_expectancy"],
             hue="Status",markers=["o", "x"], height=8, aspect=.7, kind="reg")
plt.ylim(60, 90)


**I'm guessing that this is due to the fact that only wealthier countries can afford alcohol, or that the consumption of alcohol is more prevalent among wealthier populations.**

**That would explain why alcohol has a positive relation with life expectancy in developing countries and a negative relation in developed countries.**

The following graph shows country-wise life expectancy.

In [None]:
life_country = life_expectancy.groupby('Country')['winz_Life_expectancy'].mean()
life_country 
my_colors = list('rgbkymc')
life_country.plot(kind='bar', figsize=(50,15), fontsize=25,color=my_colors)
plt.title("Life_Expectancy Country wise",fontsize=40)
plt.xlabel("Country",fontsize=35)
plt.ylabel("Average Life expectancy",fontsize=35)
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()