In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import requests
%matplotlib inline

## Introduction ##

The data set used here is similar (but not identical) to a previously used [data set on Kaggle][kag_ds].
That Kaggle set contains bad data, and it is not clear how much of its information is valid.
Some errors may have been present in the older WHO data and corrected since, but a number of the
values are simply unrealistic.

Some examples of corruption problems with the previous Kaggle data set:
- For "percentage expenditure", a country is listed as spending 194 times its GDP on health care alone.
- India is listed as having 1500-1800 infant deaths per 1000 population, which is not possible with half the population being female and a 9-month pregnancy.
- The "infant deaths" column lists Uzbekistan with the values (17, 16, 15) for the years 2013-2015. These look reasonable, but the actual values on the WHO web page are (26.3, 24.6, 23.0).
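
Problems like the first two can be caught with simple plausibility bounds. The following is a minimal sketch on made-up rows: the column names follow the Kaggle file, but the values and the two thresholds are assumptions chosen for illustration, not something taken from either data set.

```python
import pandas as pd

# Toy rows for illustration only; these numbers are invented,
# not real WHO or Kaggle values.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "percentage expenditure": [5.2, 19400.0, 8.1],
    "infant deaths": [12, 30, 1600],
})

# Assumed plausibility limits: a share of GDP should not exceed 100 percent,
# and a rate per 1000 population cannot exceed 1000.
limits = {"percentage expenditure": 100.0, "infant deaths": 1000.0}

# One boolean column per check, True where the value exceeds its limit.
flags = pd.DataFrame({col: toy[col] > limit for col, limit in limits.items()})

# Keep the rows that fail at least one check.
suspect = toy[flags.any(axis=1)]
print(suspect)
```

A bounds check of this kind would not catch the Uzbekistan case, where the values are plausible but wrong; that requires comparing against the source.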

This notebook compares some of the features that are present in both the previous Kaggle data set
and the one I generated. Since I have only just built this data set and can compare it directly to the data on
the source server, I am confident that any differences from the previous Kaggle set are due to errors in
the older set (which does not document how its data was generated).

[who]: https://www.who.int
[whodb]: https://www.who.int/gho/database/en/
[unesco_ed]: https://en.unesco.org/themes/education/databases
[kag_ds]: https://www.kaggle.com/kumarajarshi/life-expectancy-who
[think_9_10]: https://courses.thinkful.com/dsbc-model-prep-v1/checkpoint/10

In [None]:
# *** Following lines if using Kaggle ***

data_new = pd.read_csv('../input/who-national-life-expectancy/who_life_exp.csv', skipinitialspace=True)
data_old = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv', skipinitialspace=True).rename(
    columns = {'Country':'country', 'Year':'year', 'Life expectancy ':'kag_life',
               'Adult Mortality':'kag_adult', 'Alcohol':'kag_alcohol',
               'BMI ':'kag_bmi', 'Polio':'kag_polio', 'Population':'kag_pop'})

# *** otherwise read from local version of file ***
#data_new = pd.read_csv('who_life_exp.csv', skipinitialspace=True)
#data_old = pd.read_csv('Life_Expectancy_Data.csv', skipinitialspace=True).rename(
#    columns = {'Country':'country', 'Year':'year', 'Life expectancy ':'kag_life',
#               'Adult Mortality':'kag_adult', 'Alcohol':'kag_alcohol',
#               'BMI ':'kag_bmi', 'Polio':'kag_polio', 'Population':'kag_pop'})

# Replace two of the country names, which changed since the Kaggle set was made
data_old['country'] = data_old['country'].replace(['Swaziland'],'Eswatini')
data_old['country'] = data_old['country'].replace(['The former Yugoslav republic of Macedonia'],'Republic of North Macedonia')

# Make new dataframes containing only the overlapping features.
# The column selections must be lists: recent pandas versions reject a set as an indexer.

data_old2 = data_old[['country', 'year', 'kag_life', 'kag_pop', 'kag_adult', 'kag_alcohol', 'kag_bmi', 'kag_polio']]

# .copy() so that adding a column below does not trigger SettingWithCopyWarning
data_new2 = data_new[['country', 'year', 'life_expect', 'une_pop', 'adult_mortality', 'alcohol', 'bmi', 'polio']].copy()
# the newer data has population in thousands; the older set does not
data_new2['population'] = 1000.0 * data_new2['une_pop']

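# merge tables on country and year, then remove any rows with missing values

data_new2 = data_new2.merge(data_old2, on=['country', 'year'], how='left')
clean_df = data_new2.dropna(axis=0)
# info() prints its summary directly and returns None, so it is not wrapped in print()
clean_df.info()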

I have picked out 6 features which should be almost the same between the two data sets.
I am using the country name and year to match the two sets. If either of the
sets is missing a value, that country-year row is removed. (I am not investigating why data
might be present in one set but absent in the other.)
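
The matching step described above can be illustrated on two tiny frames. This is a sketch with invented countries and values; only the column names `country`, `year`, `life_expect`, and `kag_life` come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames; the country labels and values are invented for illustration.
new = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2014, 2015, 2015],
    "life_expect": [70.1, 70.4, np.nan],
})
old = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2014, 2015],
    "kag_life": [70.0, 65.0],
})

# Left join on the two key columns: every row of `new` is kept,
# and rows without a match get NaN in the columns coming from `old`.
merged = new.merge(old, on=["country", "year"], how="left")

# Drop any country-year row that has a missing value in either set.
clean = merged.dropna(axis=0)
print(clean)
```

Here the row for A in 2015 is dropped because the older set has no match, and the row for B in 2015 is dropped because the newer set has no value.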

I say "almost the same", because I have seen differences between the UNESCO and GHO values
for some of the variables in my data set. However, those differences were relatively minor. Comparing
the previous Kaggle set to the recently generated data set shows significantly larger differences.

I am writing a few observations about the features, in the order they are plotted.
The left-most plot is the distribution from my data set, the middle plot is from the previous Kaggle set,
and the right-most plot is a scatter distribution of the previous Kaggle set vs. my set.

### Life Expectancy ###
The most obvious difference is the clustering of life expectancies in the previous Kaggle set in decades
(50, 60, 70, 80). I don't know the reason for it, but it might be due
to rounding errors or filling in missing values.

### Population ###
The scatter plot shows that there are many points where the population is the same in both sets
(along the drawn red line), but also quite a few where the previous Kaggle set appears to be off by a factor of 10 or 100.
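
One way to quantify such power-of-ten discrepancies is to look at the base-10 logarithm of the ratio between the two columns. The sketch below uses invented population pairs; the column names match the merged frame in this notebook, but the numbers and the direction of the errors are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented population pairs for illustration only.
pairs = pd.DataFrame({
    "population": [1_000_000.0, 5_000_000.0, 20_000_000.0, 3_000_000.0],
    "kag_pop":    [1_000_000.0,   500_000.0,    200_000.0, 3_100_000.0],
})

# log10 of the ratio: 0 means agreement, -1 means the older value is
# about 10 times too small, -2 about 100 times too small
# (positive values would mean the older value is too large).
log_ratio = np.log10(pairs["kag_pop"] / pairs["population"])

# Round to the nearest whole power of ten and count how often each occurs.
power = log_ratio.round().astype(int)
print(power.value_counts().sort_index())
```

Applied to `clean_df`, a count concentrated at 0 with smaller groups at other whole numbers would support the factor-of-ten reading of the scatter plot.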

### Adult Mortality ###
Like population, the previous Kaggle set has some data points near or at the correct value, but a large
number that are significantly wrong.

### Alcohol Consumption ###
Like population, the previous Kaggle set has some data points near or at the correct value, but a large
number that are significantly wrong. It may be that the older set had more missing values, and
replaced those with zeroes, but that is a guess.

### Polio Vaccination ###
There are not many data points along the red line, there is no consistent trend of over- or underestimating, and the previous Kaggle set has problems with values near zero.

In my set, there are only a few points less than 20, which are reported as near 40 in the previous Kaggle set.
(If I had to guess, these could be extrapolations from world regional averages.)

### Body Mass Index (BMI) ###
This is the most glaring example, as there is basically no correlation between the two sets for BMI.

For a sense of scale, note that this feature is a national average. WHO regards a BMI of less than 18.5 as underweight, while a BMI greater than 25 is considered overweight and above 30 is considered obese. In my data set it ranges from 20 to 32. In the Kaggle set, it goes from single digits up to 80.
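
The degree of agreement for each feature pair can also be summarized with a correlation coefficient. This is a minimal sketch on invented values, using Pearson correlation as one possible measure; the column names follow the merged frame, but none of the numbers come from either data set.

```python
import pandas as pd

# Invented values for illustration; not taken from either data set.
toy = pd.DataFrame({
    "life_expect": [60.0, 65.0, 70.0, 75.0, 80.0],
    "kag_life":    [60.5, 64.0, 71.0, 74.0, 81.0],
    "bmi":         [22.0, 24.0, 26.0, 23.0, 27.0],
    "kag_bmi":     [30.0, 12.0, 55.0, 70.0, 18.0],
})

pairs = [("life_expect", "kag_life"), ("bmi", "kag_bmi")]

# Pearson correlation of each new-set column with its older counterpart.
corr = {new: toy[new].corr(toy[old]) for new, old in pairs}
for name, r in corr.items():
    print(f"{name}: r = {r:.2f}")
```

The same loop could be run over `list_features` and `clean_df` to put a number on each scatter plot.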

In [None]:
list_features = [['life_expect', 'kag_life'], ['population', 'kag_pop'], ['adult_mortality', 'kag_adult'],
                 ['alcohol', 'kag_alcohol'], ['polio', 'kag_polio'], ['bmi', 'kag_bmi']]

print(list_features)

for feat1, feat2 in list_features:
    print("Plotting features:",feat1, feat2)
    plt.figure(figsize=(12,4))
    plt.subplot(1, 3, 1)
    plt.hist(clean_df[feat1])
    plt.xlabel(feat1)
    plt.subplot(1, 3, 2)
    plt.hist(clean_df[feat2])
    plt.xlabel(feat2)

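    # scatter of the previous Kaggle value against the value from the new set
    plt.subplot(1, 3, 3)
    plt.scatter(clean_df[feat1], clean_df[feat2])
    # red line marks y = x, i.e. perfect agreement between the two sets
    plt.plot(clean_df[feat1], clean_df[feat1], color="red")
    plt.xlabel(feat1)
    plt.ylabel(feat2)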
    
    plt.tight_layout()
    plt.show()