# WHO World Health Data Analysis

Welcome to my WHO world health data analysis notebook, in which we analyse several health statistics published by the World Health Organization (WHO). This analysis matters because it can show where health outcomes are poorest and therefore where better treatment is most needed.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter

In [None]:
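def lines(data):
    """Plot 'First Tooltip' over time, split by sex, for six selected countries."""
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    # Sort all 'Both sexes' rows by value. Note that the slices below skip the
    # single smallest and the single largest row.
    probs = data[data['Dim1']=='Both sexes']['First Tooltip'].sort_values()
    least_keys, most_keys = probs[1:4], probs[-4:-1]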
    least = data['Location'][least_keys.keys()]
    most = data['Location'][most_keys.keys()]
    freq = pd.DataFrame({'Least':least.reset_index(drop=True), 'Most':most.reset_index(drop=True)})

    # One row of axes per group ('Least', 'Most'), one column per country
    for row, title in enumerate(freq):
        for col, country in enumerate(freq[title]):
            ax = axes[row][col]
            df = data[data['Location']==country]

            both = df[df['Dim1']=='Both sexes']['First Tooltip']
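            # Each series must match its legend label ('Females', 'Males')
            fema = df[df['Dim1']=='Female']['First Tooltip']
            male = df[df['Dim1']=='Male']['First Tooltip']
            # Reverse the sorted years to match the row order of the values
            # (assumes the CSV lists the most recent period first)
            years = np.unique(df['Period'])[::-1]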

            ax.plot(years, both, label='Both sexes')
            ax.plot(years, fema, label='Females')
            ax.plot(years, male, label='Males')

            ax.set_title(title)
            ax.set_xlabel('Years in ' + country)
            ax.set_ylabel('Probability (%)')
            ax.legend()

    plt.suptitle(data['Indicator'].iloc[0])
    plt.show()

# Probability of disease

First, we look at the probability of disease in the three countries where it is lowest and the three where it is highest. The data covers the years 2000, 2005, 2010 and 2015.

As the plots below show, males are roughly 10% more likely to get a disease than females in the countries with the lowest probability, while in the countries where disease is most likely the gap is much larger, at around 20%.

The probability decreases over time in all six countries, which suggests that each of them is making progress in reducing the chance of sickness.

In [None]:
disease=pd.read_csv('../input/who-worldhealth-statistics-2020-complete/30-70cancerChdEtc.csv')
lines(disease)
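
The step of choosing the three lowest and three highest countries can also be sketched per country rather than per row. The example below is only an illustration on made-up data: the country names, values and the helper `extreme_countries` are hypothetical, and it assumes that ranking by the most recent period is acceptable.

```python
import pandas as pd

# Hypothetical toy data in the same column layout as the WHO CSV files
toy = pd.DataFrame({
    'Location': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Period': [2015, 2010, 2015, 2010, 2015, 2010, 2015, 2010],
    'Dim1': ['Both sexes'] * 8,
    'First Tooltip': [10.0, 12.0, 30.0, 33.0, 20.0, 22.0, 25.0, 27.0],
})

def extreme_countries(data, n=3):
    """Return (lowest, highest) country lists based on the latest period."""
    both = data[data['Dim1'] == 'Both sexes']
    latest = both[both['Period'] == both['Period'].max()]
    per_country = latest.set_index('Location')['First Tooltip']
    lowest = list(per_country.nsmallest(n).index)
    highest = list(per_country.nlargest(n).index)
    return lowest, highest

lowest, highest = extreme_countries(toy, n=2)
print(lowest, highest)
```

Ranking one value per country guarantees that the same country cannot be selected twice, which a row-based ranking over several years does not.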

# Life expectancy at birth

The next analysis takes a look at the estimated life span of a person at birth. Again we take the three countries with the lowest span and the three with the highest.

It's evident in all of our graphs that females tend to have a 5-6% higher life expectancy at birth than males. The countries with the lowest life expectancy have a life span ranging from 50 to 65 years, while people in the countries with the highest are expected to live to around 85 years of age.

Life expectancy increases over time in every plot, which suggests that health conditions are improving in each of these countries.

Another observation is that the three countries with the lowest life expectancy (Burundi, Central African Republic and Zambia) are all in Africa, whereas two of the three countries with the highest life expectancy (Republic of Korea and Japan) are in Asia. This could suggest that healthcare in these Asian countries is much better than in these African countries.

In [None]:
# The file holds life expectancy at birth, so name the variable accordingly
life_expectancy = pd.read_csv('../input/who-worldhealth-statistics-2020-complete/lifeExpectancyAtBirth.csv')
lines(life_expectancy)
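
The difference between the sexes can also be computed directly instead of being read off the plots. The sketch below uses made-up numbers in the same column layout as the WHO files; the countries, the period, the values and the helper `sex_gap` are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one period, two countries, three 'Dim1' categories
toy = pd.DataFrame({
    'Location': ['X'] * 3 + ['Y'] * 3,
    'Period': [2019] * 6,
    'Dim1': ['Both sexes', 'Male', 'Female'] * 2,
    'First Tooltip': [60.0, 58.0, 62.0, 84.0, 81.0, 87.0],
})

def sex_gap(data):
    """Return the female minus male value per country and period."""
    wide = data.pivot_table(index=['Location', 'Period'],
                            columns='Dim1', values='First Tooltip')
    return wide['Female'] - wide['Male']

gap = sex_gap(toy)
print(gap)
```

A positive gap means that the female value is higher than the male value for that country and period.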

# Road traffic deaths

In the next plots, we analyse the number of road traffic deaths per 100 000 population.

Three of the five countries with the most deaths are in Africa (Liberia, Zimbabwe and Burundi), while three of the five countries with the fewest deaths are in Europe (Norway, Switzerland and San Marino). This could suggest that road traffic is much more controlled in Europe than in Africa. However, part of the difference might be due to San Marino having a significantly smaller population (about 34,000).

In [None]:
traffic=pd.read_csv('../input/who-worldhealth-statistics-2020-complete/roadTrafficDeaths.csv')
traffic_sort = traffic.sort_values(by='First Tooltip', 
                                       ascending=False)
for df in traffic_sort[:90].reset_index(drop=True), traffic_sort[90:].reset_index(drop=True):
    fig, ax = plt.subplots(1, 1, figsize=(14, 8))
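    # Pass the columns by keyword; recent seaborn versions no longer accept them positionally
    bars = sns.barplot(x=df['Location'], y=df['First Tooltip'], ax=ax)

    # Annotate every second bar with its value to keep the labels readable
    for index in range(0, len(bars.patches), 2):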
        bar = bars.patches[index]
        bars.annotate(format(bar.get_height(), '.0f'), (bar.get_x()+bar.get_width()/2., 
                                        bar.get_height()), ha='center', va='bottom', size=12)

    plt.xticks(size=8, rotation=90)
    plt.title(traffic['Indicator'][0])
    plt.xlabel('Country')
    plt.ylabel('Number of road traffic deaths (per 100 000 population)')
    plt.show()
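
To see why a small population matters for a rate per 100 000 population, here is a short worked example. The population of about 34,000 for San Marino comes from the text above; the second population of 5,000,000 and the helper `rate_per_100k` are hypothetical and only serve as a comparison.

```python
def rate_per_100k(deaths, population):
    """Convert an absolute number of deaths into a rate per 100 000 population."""
    return deaths / population * 100_000

# One single death in a population of about 34,000
small = rate_per_100k(1, 34_000)

# One single death in a hypothetical population of 5,000,000
large = rate_per_100k(1, 5_000_000)

print(round(small, 2), round(large, 2))
```

In a very small country a single additional death shifts the rate by almost three per 100 000, so its position in the ranking can change noticeably from year to year.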

# Air pollution death rate

Next, we turn to air pollution and compare the different complications that arise from it. We look at the three countries with the highest air pollution death rate (Singapore, Nigeria and Chad) and the three countries with the lowest (Brunei Darussalam, Canada and New Zealand).

In the countries with the highest death rate, lower respiratory infections are the most common complication, followed by ischaemic heart disease. In the countries with the lowest death rate, ischaemic heart disease is the most common complication. Also, on average, females appear slightly more likely than males to suffer consequences from air pollution.

In [None]:
pollution = pd.read_csv('../input/who-worldhealth-statistics-2020-complete/airPollutionDeathRate.csv')
# Keep only the number before ' [' in each 'First Tooltip' string
pollution['First Tooltip'] = [float(i.split(' [')[0]) for i in pollution['First Tooltip']]

# Rank the countries by their mean death rate over all rows (sexes and causes)
country_means = pollution.groupby('Location')['First Tooltip'].mean().sort_values(ascending=False)
fig, axes = plt.subplots(2, 3, figsize=(13, 10))
fig.tight_layout(h_pad=18)

# First row of axes: three highest means; second row: three lowest
for h, group in enumerate([country_means.index[:3], country_means.index[-3:]]):
    for j, country in enumerate(group):
        df = pollution[pollution['Location']==country]
        # Mean per cause (Dim2) and sex (Dim1); groupby keeps each label aligned with its value
        data = df.groupby(['Dim2', 'Dim1'], as_index=False)['First Tooltip'].mean()

        sns.barplot(data=data, x='Dim2', y='First Tooltip', hue='Dim1', palette='twilight', ax=axes[h][j])
        axes[h][j].set_title('Air pollution death rate in ' + country)
        axes[h][j].set_xlabel('Type of disease')
        axes[h][j].set_ylabel('Death rate')
        axes[h][j].tick_params(axis='x', labelrotation=80, labelsize=8)

plt.show()
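
The air pollution values are stored as text and have to be converted to numbers before they can be averaged. The sketch below shows one way to split such strings into a point estimate and an interval. The exact format `value [low-high]`, the sample strings and the column names `value`, `low` and `high` are assumptions made for this illustration; only the ` [` separator is taken from the code above.

```python
import pandas as pd

# Hypothetical sample strings in the assumed 'value [low-high]' format
raw = pd.Series(['12.5 [10.1-15.2]', '0.8 [0.5-1.3]', '100 [90-110]'])

# Named groups become the column names of the extracted frame
pattern = r'^(?P<value>[\d.]+) \[(?P<low>[\d.]+)-(?P<high>[\d.]+)\]$'

parsed = raw.str.extract(pattern).astype(float)
print(parsed)
```

Keeping the interval bounds as separate columns makes it possible to show the uncertainty later, for example as error bars.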

# Violence against women

Next, we visualise the rate of violence against women in different countries and see which age groups are assaulted the most.

## Country comparison

In the following line graph, we will see which countries have the most violence against women.

Some of the countries with the most violence are in Central and West Africa (Democratic Republic of the Congo, Equatorial Guinea, Liberia), which may suggest that violence against women is more common in parts of Africa.

In [None]:
assault = pd.read_csv('../input/who-worldhealth-statistics-2020-complete/eliminateViolenceAgainstWomen.csv')

# Sum 'First Tooltip' over all rows of each country and rank the countries by that total
assault_sort = (assault.groupby('Location', as_index=False)['First Tooltip'].sum()
                .rename(columns={'Location': 'Name', 'First Tooltip': 'Value'})
                .sort_values(by='Value', ascending=False))

fig, ax = plt.subplots(1, 1, figsize=(20, 11))
plt.plot(assault_sort['Name'], assault_sort['Value'])
plt.title('Probability of assault per country', size=20)
plt.xlabel('Country', size=15)
plt.ylabel('Probability of assault', size=15)
plt.xticks(rotation=90, size=10)
plt.show()

## Three most violent countries

The final visualisation for this dataset compares the different age groups in the three countries with the most violence against women: Afghanistan, Democratic Republic of the Congo (DRC) and Equatorial Guinea.

In Afghanistan, the peak age groups for assault are 25-39, in DRC they are 20-29 and in Equatorial Guinea the most assaulted age groups are 15-29. The peak ages consistently include the 20s, suggesting that women in their 20s are the most at risk in these countries.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
countries = list(assault_sort[:3]['Name'])

for ax, country in zip(axes, countries):
    df = assault[assault['Location']==country]
    # Shorten each age-group label to its first five characters
    labels = [age[:5] for age in df['Dim2']]

    sns.barplot(data=df, x='Dim2', y='First Tooltip', palette='rocket', ax=ax)
    ax.set_title('Probability of female assault per age group')
    ax.set_xlabel('Age group in ' + country)
    ax.set_ylabel('Probability of female assault')
    ax.set_xticklabels(labels=labels, fontsize=9)
plt.show()
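
The peak age group of each country can also be found programmatically rather than by reading the bar charts. The sketch below is a minimal illustration on made-up data; the countries, the age-group labels, the values and the helper `peak_age_group` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two countries, three age groups each
toy = pd.DataFrame({
    'Location': ['P', 'P', 'P', 'Q', 'Q', 'Q'],
    'Dim2': ['15-19 years', '20-24 years', '25-29 years'] * 2,
    'First Tooltip': [20.0, 35.0, 30.0, 40.0, 25.0, 10.0],
})

def peak_age_group(data):
    """Return, per country, the age group with the highest value."""
    # idxmax gives the row label of the maximum within each country
    idx = data.groupby('Location')['First Tooltip'].idxmax()
    return data.loc[idx].set_index('Location')['Dim2']

peaks = peak_age_group(toy)
print(peaks)
```

This returns a single peak group per country, so neighbouring groups with similar values, such as the ranges mentioned above, still need to be checked in the plots.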

### Thank you for reading my notebook.

### If you enjoyed this notebook and found it helpful, please upvote it and give feedback as it would help me make more of these.