# Introduction
Hi everyone! This is my first notebook on Kaggle. This dataset doesn't contain enough features for meaningful machine learning, so the focus will be on exploring and visualizing the data. Please let me know if you have any feedback. I'm always looking to improve!

# Imports & Settings

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.set_theme(context='talk', style='darkgrid')

# Basic Data Information

In [None]:
df = pd.read_csv('../input/countries-life-expectancy/Life expectancy.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.Entity.nunique()

In [None]:
df.Entity.unique()

# Visualizations
In this section, I'll ask several different questions I'm curious about and then answer them with visualizations. Any interesting observations will also be noted.

## How has the median life expectancy for all 15 countries changed over time?

In [None]:
fig = plt.figure(figsize=(10, 6))
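# Select the numeric column before aggregating: in pandas 2.x, median() raises on text columns such as Entity
yearly_median = df.groupby('Year')['Life expectancy'].median().reset_index()
sns.lineplot(data=yearly_median, x='Year', y='Life expectancy')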
plt.title('15-Country Median Life Expectancy');

**Observations:**
- The lack of variance before the 1870s suggests that the early data was backfilled.
- WW1, the Spanish Flu, and WW2's impacts on life expectancy are all clearly visible.
- Rapid progress was made in the 20th century but appears to be cooling off at the start of the 21st. Are we reaching a natural boundary?
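
One way to probe the backfill suspicion is to count, for each year, how many countries report a value and how many of those values are identical to the previous year's. The sketch below uses a tiny made-up frame (`toy`, not the real dataset) that only assumes the same column names as `df`; swapping `toy` for `df` should give the real picture.

```python
import pandas as pd

# Toy data with made-up values (NOT from the dataset), same column names as df
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [1850, 1851, 1852, 1850, 1851, 1852],
    'Life expectancy': [40.0, 40.0, 40.0, 38.0, 38.5, 39.2],
})

# Flag values that did not change from the previous year within the same country
toy = toy.sort_values(['Entity', 'Year'])
toy['unchanged'] = toy.groupby('Entity')['Life expectancy'].diff().eq(0)

# Per year: how many countries report data, and how many repeat last year's value
summary = toy.groupby('Year').agg(
    countries=('Entity', 'nunique'),
    unchanged=('unchanged', 'sum'),
)
print(summary)
```

A long run of years where `unchanged` is close to `countries` would be consistent with backfilled values, although it would not prove it.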

## What are the current life expectancies for each country?

In [None]:
# Keep only the most recent year, sorted from highest to lowest life expectancy
df_2 = df[df.Year == df.Year.max()].sort_values('Life expectancy', ascending=False)
df_2

In [None]:
fig = plt.figure(figsize=(8, 10))
sns.barplot(data=df_2, x='Life expectancy', y='Entity', palette='crest', orient='h')
plt.title('2016 Life Expectancy');

**Observations:**
- Without additional data on things like GDP, healthcare quality index, etc., it's difficult to determine any clear pattern here.
- Continent alone does not seem to explain the ranking. Japan and India are both in Asia but are at opposite ends of the graph. Similar story for Switzerland and Russia.
- Not much variation at the top of the list, reinforcing the idea of approaching a natural boundary.

## What are the growth rates in life expectancy for each country since 1900?

In [None]:
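# Keep only 1900 and the latest year, ordered so each country's two rows are adjacent
df_3 = df[df.Year.isin([1900, df.Year.max()])].sort_values(['Entity', 'Year']).copy()
# Compute the change within each country so a value never carries over from the previous country
df_3['pct_chg'] = df_3.groupby('Entity')['Life expectancy'].pct_change() * 100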
df_3

In [None]:
fig = plt.figure(figsize=(8, 10))
sns.barplot(data=df_3[df_3.Year == 2016].sort_values('pct_chg', ascending=False), x='pct_chg', y='Entity', palette='crest', orient='h')
plt.title('Change in Life Expectancy (1900 - 2016)')
plt.xlabel('Percent Change');

**Observations:**
- The countries that were near the bottom of the current life expectancy visualization have actually improved the most since 1900.
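
Part of this pattern may simply come from the lower starting point: the same gain in years is a larger percentage of a smaller baseline. The sketch below illustrates that on made-up numbers (the `toy` frame is hypothetical and only borrows the column names of `df`), putting the percent change next to the absolute change in years.

```python
import pandas as pd

# Toy data with made-up values (NOT from the dataset)
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B'],
    'Year': [1900, 2016, 1900, 2016],
    'Life expectancy': [25.0, 75.0, 50.0, 80.0],
})

# One row per country, one column per year
wide = toy.pivot(index='Entity', columns='Year', values='Life expectancy')

# Absolute gain in years and relative gain in percent
wide['abs_chg'] = wide[2016] - wide[1900]
wide['pct_chg'] = wide['abs_chg'] / wide[1900] * 100
print(wide)
```

Here country A gains 50 years from a baseline of 25 and country B gains 30 years from a baseline of 50, so A's percent change is much larger than the gap in absolute gains alone would suggest. Looking at both columns side by side helps separate "started low" from "improved a lot".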

## How has the standard deviation of life expectancy changed over time?
Given the previous two charts, I'm curious how countries have "clustered" over time. One way to check for this in a general sense is by plotting the standard deviation of life expectancy across countries for each year. In years where certain countries have a much higher life expectancy than others we would expect the standard deviation to be high. Conversely, in years where countries begin to cluster together with similar life expectancies the standard deviation will be low.
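
To make the idea concrete, here is a minimal sketch on made-up numbers (the `toy` frame is hypothetical, not taken from the dataset): three countries that are far apart in one year and close together in another.

```python
import pandas as pd

# Toy data with made-up values (NOT from the dataset)
toy = pd.DataFrame({
    'Entity': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Year': [1950, 1950, 1950, 2000, 2000, 2000],
    'Life expectancy': [40.0, 60.0, 80.0, 76.0, 78.0, 80.0],
})

# Sample standard deviation across countries, computed separately for each year
spread = toy.groupby('Year')['Life expectancy'].std()
print(spread)
```

The spread is 20 years in the first toy year and 2 years in the second, which is the kind of drop we would expect to see if countries converge.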

In [None]:
df_4 = df.copy()
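# Select the numeric column before aggregating: in pandas 2.x, std() raises on text columns such as Entity
df_4 = df_4.groupby('Year')['Life expectancy'].std().reset_index()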
df_4.head()

In [None]:
fig = plt.figure(figsize=(10, 6))
sns.lineplot(data=df_4, x='Year', y='Life expectancy')
plt.title('Standard Deviation of Life Expectancy');

**Observations:**
- As mentioned before, the data pre-1870s was likely backfilled and has artificially low standard deviation.
- Up to and during both world wars, the standard deviation sharply increased. This is expected and indicates that certain countries were much more damaged than others.
- Post-WW2, the standard deviation has fallen dramatically as more and more countries close in on the natural boundary.

## How many countries had *decreasing* life expectancies per year?

In [None]:
df_5 = df.sort_values(['Entity', 'Year']).copy()
# Difference within each country, so a country's first year is NaN instead of being compared with another country
df_5['decr'] = df_5.groupby('Entity')['Life expectancy'].diff()
df_5 = df_5[df_5.decr < 0]
df_5 = df_5.groupby('Year').count()
df_5.head()

In [None]:
fig = plt.figure(figsize=(10, 6))
sns.lineplot(data=df_5, x='Year', y='decr')
plt.title('Number of Countries with Decreasing Life Expectancy')
plt.ylabel('Number of Countries');

In [None]:
df_5[df_5.decr == 15]

**Observations:**
- Life expectancy declined in all 15 countries in this dataset in 1918, which is consistent with the impact of the Spanish Flu.
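
One caveat on the count above: filtering to declining rows before grouping drops any year in which no country declined, so the line plot connects straight across those years instead of touching zero. A possible alternative, sketched here on made-up numbers (the `toy` frame is hypothetical and only assumes the same column names as `df`), is to sum a boolean flag per year so that zero-decline years are kept.

```python
import pandas as pd

# Toy data with made-up values (NOT from the dataset)
toy = pd.DataFrame({
    'Entity': ['A'] * 4 + ['B'] * 4,
    'Year': [1916, 1917, 1918, 1919] * 2,
    'Life expectancy': [50.0, 51.0, 40.0, 52.0, 45.0, 46.0, 39.0, 47.0],
})

# True where a country's value fell relative to its own previous year
toy = toy.sort_values(['Entity', 'Year'])
declined = toy.groupby('Entity')['Life expectancy'].diff() < 0

# Summing the flags per year keeps years with zero declines in the result
n_declining = declined.groupby(toy['Year']).sum()
print(n_declining)
```

In this toy frame only 1918 has declines (both countries), and the other three years still appear in the result with a count of zero.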