# Analysis of Causes of Death in the US from 1900-2018
The data comes from the National Center for Health Statistics and the site for this particular data is
[https://data.cdc.gov/NCHS/NCHS-Age-adjusted-Death-Rates-for-Selected-Major-C/6rkc-nb2q]
We've downloaded the data and stored it in ./data/deathrates.csv

The goal of this notebook is to give an example of how to use pandas and numpy to answer questions about a fairly large and interesting dataset.

The main question we want to answer is:
* How have the five main causes of death in the US changed in the past 118 years?

We will answer this by loading the data into a pandas data frame and then using pandas to create a plot
with one line per cause of death, where the x axis is the year and the y axis is the age adjusted death rate for that cause.

The more general skill we hope you will learn is how to use the pandas and numpy documentation to do more analysis than we can teach in this course. Here is the pandas user guide
* https://pandas.pydata.org/docs/user_guide/index.html

and here is the numpy user guide
* https://numpy.org/doc/stable/user/index.html


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reading the data into a data frame
We use the pd.read_csv function to read in the data.
It has three columns (Series)
* Year - an integer
* Cause - a string
* Age Adjusted Death Rate - a float

There are 595 rows

In [None]:
df = pd.read_csv('data/deathrates.csv')
df
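
As a quick sanity check after loading, it can help to confirm the shape and column types. The sketch below does not read the real file: it parses a tiny made-up CSV string (the numbers are invented for illustration, not CDC data) that has the same three column names, so it runs without `data/deathrates.csv`.

```python
import io
import pandas as pd

# Made-up rows in the same three-column layout; these numbers are NOT real CDC data.
csv_text = """Year,Cause,Age Adjusted Death Rate
1900,Heart Disease,100.0
1900,Cancer,50.0
1901,Heart Disease,110.0
1901,Cancer,55.0
"""

# read_csv accepts a file path or any file-like object, such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
print(sample.head(2))  # the first two rows
```

Running `df.shape` and `df.dtypes` on the real data frame should report the row count and column types listed above.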

# Getting the causes
We can pull out one column from the table using the [] notation.

In [None]:
df['Cause']

We can then use the .unique() method to get an array of the 5 unique values for cause and the 119 unique values for year (1900 through 2018).

In [None]:
causes = df['Cause'].unique()
causes

In [None]:
df['Year'].unique()
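
If only the counts are needed, `.nunique()` and `.value_counts()` give them directly. This is a minimal sketch on a hypothetical miniature table (the rows are invented for illustration); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature table; the rows are invented for illustration only.
sample = pd.DataFrame({
    "Year": [1900, 1900, 1901, 1901, 1902, 1902],
    "Cause": ["Cancer", "Stroke", "Cancer", "Stroke", "Cancer", "Stroke"],
})

n_causes = sample["Cause"].nunique()             # number of distinct causes
n_years = sample["Year"].nunique()               # number of distinct years
rows_per_cause = sample["Cause"].value_counts()  # how many rows each cause has

print(n_causes, n_years)
print(rows_per_cause)
```

When every cause has one row per year, the number of causes times the number of years equals the number of rows, which is a handy consistency check.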

# Selecting a subset of rows
We can select just those rows for Heart Disease using the boolean selector notation
```
df[df[COLUMN]==VALUE]
```



In [None]:
d1 = df[df['Cause']=='Heart Disease']
d1
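
To see why this notation works, it helps to look at the intermediate step: the comparison produces a boolean Series, and indexing with it keeps only the rows marked `True`. The sketch below uses a small made-up table (the values are not real data) and also shows one way to combine two conditions with `&`.

```python
import pandas as pd

# Made-up values for illustration; not taken from the CDC data.
sample = pd.DataFrame({
    "Year": [1900, 1900, 1950, 1950],
    "Cause": ["Heart Disease", "Cancer", "Heart Disease", "Cancer"],
    "Age Adjusted Death Rate": [100.0, 50.0, 200.0, 80.0],
})

# Step 1: the comparison yields one True/False value per row
mask = sample["Cause"] == "Heart Disease"
print(mask)

# Step 2: indexing with the mask keeps only the True rows
heart = sample[mask]
print(heart)

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
recent_heart = sample[(sample["Cause"] == "Heart Disease") & (sample["Year"] >= 1950)]
print(recent_heart)
```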

# Making the plot
We now know how to pull out just the rows we want for each cause
and to pull out the series for the year and for the death rate for those rows
which is all we need to know to be able to plot the data.

In [None]:
plt.figure(figsize=(15,10))
for cause in causes:
    d1 = df[df['Cause']==cause]  # remove all rows except those with the specified cause
    # plot the year vs the death rate, labeled with the cause
    plt.plot(d1['Year'],d1['Age Adjusted Death Rate'],label=cause)
plt.legend()
plt.grid()
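
An alternative to filtering inside the loop is to let pandas split the table for us with `groupby`, which yields one (key, sub-table) pair per cause. This is a sketch on invented values, not the notebook's own method; it collects one Series per cause instead of plotting so that it runs without matplotlib.

```python
import pandas as pd

# Invented values for illustration; not real CDC data.
sample = pd.DataFrame({
    "Year": [1900, 1900, 1901, 1901],
    "Cause": ["Heart Disease", "Cancer", "Heart Disease", "Cancer"],
    "Age Adjusted Death Rate": [100.0, 50.0, 110.0, 55.0],
})

series_by_cause = {}
# groupby yields each distinct cause together with the rows for that cause
for cause, group in sample.groupby("Cause"):
    # index the death rates by year so each Series is ready to plot or inspect
    series_by_cause[cause] = group.set_index("Year")["Age Adjusted Death Rate"]

for cause, rates in series_by_cause.items():
    print(cause)
    print(rates)
```

Inside such a loop, `group['Year']` and `group['Age Adjusted Death Rate']` could be passed to `plt.plot` exactly as `d1` is above.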

# Discussion of results
We see that there was a big spike for Influenza and Pneumonia around 1918. This was the Spanish Flu pandemic (which probably started in the US).
We also see that Heart Disease peaked around 1950, whereas stroke and accidents fell steadily.
Cancer rose steadily until about 1990, when it started to fall.
The more interesting question, which this data will not answer, is why we observe these patterns!


# A simpler approach to this analysis
Here we show how to use pivot tables to plot the data in a more elegant way.
Notice that our data table has a Cause column with 5 possible values.


In [None]:
df


# Creating a pivot table
We can create a "pivot table" in which those causes are the column headers and the values
stored in the columns are the death rates.

In [None]:
df2 = pd.pivot_table(df, values="Age Adjusted Death Rate", index=["Year"], columns=["Cause"])
df2
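
The reshaping is easier to follow on a tiny example. The sketch below pivots a made-up four-row table (the values are invented, not CDC data) from long format into one row per year and one column per cause. Note that `pd.pivot_table` averages duplicates by default; with exactly one row per year and cause, the values pass through unchanged.

```python
import pandas as pd

# Invented long-format rows: one row per (Year, Cause) pair. Not real data.
sample = pd.DataFrame({
    "Year": [1900, 1900, 1901, 1901],
    "Cause": ["Cancer", "Stroke", "Cancer", "Stroke"],
    "Age Adjusted Death Rate": [50.0, 70.0, 55.0, 65.0],
})

# Wide format: the index is the year, the columns are the causes
wide = pd.pivot_table(
    sample, values="Age Adjusted Death Rate", index="Year", columns="Cause"
)

print(wide)
print(wide.loc[1901, "Stroke"])  # look up one value by year and cause
```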

# Plotting a pivot table
Now that we have a table where the index is the year and each column holds the data for one line of the plot,
we can use the built-in plot method (which calls matplotlib.pyplot.plot behind the scenes).

In [None]:
df2.plot(figsize=(15,10))
plt.grid()
plt.title('Changes in most common deaths in the 1900s')
plt.xlabel('year')
plt.ylabel('age adjusted death rate per 100,000')
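
The wide layout also makes it easy to read off numbers behind the plot, such as the year in which each column peaks. This is a sketch on a hypothetical wide table with invented values; on the real data the same calls would be applied to `df2`.

```python
import pandas as pd

# Hypothetical wide table shaped like the pivot table; the values are invented.
wide = pd.DataFrame(
    {"Cancer": [50.0, 80.0, 60.0], "Stroke": [90.0, 70.0, 40.0]},
    index=pd.Index([1900, 1950, 2000], name="Year"),
)

peak_year = wide.idxmax()  # for each column, the index label (year) of its maximum
peak_rate = wide.max()     # for each column, the maximum value itself

print(peak_year)
print(peak_rate)
```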

# More info on pandas plotting
We can get more info on pandas plotting using this command
```
help(df2.plot)
```
or by looking at the [pandas API reference for plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).
