# Final Project: New York vs. The U.S.

Students: Michael Hernandez and Tim Lynch <br>
Professor: Charles Pak<br>

For the final project, we use Beautiful Soup and pandas `read_csv()` to pull data from two sources: the CDC and Data.gov. We then compare deaths in New York with deaths in other states.

In [6]:
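# Imports
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Scrape the HTML tables from the CDC website with Beautiful Soup
res = requests.get("https://www.cdc.gov/nchs/pressroom/states/newyork/newyork.htm")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string to read_html
df = pd.read_html(StringIO(str(table)))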
for dataframe in df:
    dataframe.drop(dataframe.tail(1).index,inplace=True)
ny_death_data_2017 = df[1]
ny_death_data_2016 = df[5]
ny_death_data_2015 = df[9]
ny_death_data_2014 = df[13]

# Read the CSV data downloaded from https://catalog.data.gov (for the full link, refer to the discussion post)
csv_data = pd.read_csv('csv_data.csv')
csv_data.head(10)

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2012,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,21,2.6
1,2016,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,30,3.7
2,2013,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,30,3.8
3,2000,"Intentional self-harm (suicide) (*U03,X60-X84,...",Suicide,District of Columbia,23,3.8
4,2014,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Arizona,325,4.1
5,2009,"Intentional self-harm (suicide) (*U03,X60-X84,...",Suicide,District of Columbia,29,4.4
6,2011,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,South Dakota,49,4.5
7,2015,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,39,4.5
8,2014,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,37,4.5
9,2013,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Arizona,374,4.9


__Several of these columns are not needed yet, so we will clean up the data.__

We start by dropping two columns: "113 Cause Name" and "Age-adjusted Death Rate".

In [7]:
csv_data.drop(['113 Cause Name', 'Age-adjusted Death Rate'], axis = 1, inplace = True)

In [8]:
agg_data = csv_data[csv_data['Cause Name'].str.contains("All causes")]
suc_data = csv_data[csv_data['Cause Name'].str.contains('Suicide')]

In [9]:
csv_data.head(5)

Unnamed: 0,Year,Cause Name,State,Deaths
0,2012,Kidney disease,Vermont,21
1,2016,Kidney disease,Vermont,30
2,2013,Kidney disease,Vermont,30
3,2000,Suicide,District of Columbia,23
4,2014,Kidney disease,Arizona,325


In [20]:
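# Index by Year, State, and Cause Name, and keep "Deaths" as the data column.
# The original version also put "Deaths" in the index, which left the frame with
# no columns and is the likely cause of "IndexError: list index out of range" when styling.
csv_grouped = csv_data.set_index(['Year', 'State', 'Cause Name'])
csv_grouped.style.set_properties(**{'text-align': 'right'})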

<pandas.io.formats.style.Styler at 0x11e8011d0>

__We split the CSV data into two subsets: suicide deaths (`suc_data`) and aggregate "All causes" deaths (`agg_data`).__
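
The split can be written either as an exact match or as a substring match. The sketch below shows both on a small made-up frame; `sample` and its numbers are hypothetical and only reuse the column names from `csv_data`.

```python
import pandas as pd

# Hypothetical rows that reuse the column names from csv_data; the numbers are made up.
sample = pd.DataFrame({
    "Year": [2016, 2016, 2016],
    "Cause Name": ["All causes", "Suicide", "Kidney disease"],
    "State": ["New York", "New York", "New York"],
    "Deaths": [1000, 50, 20],
})

# Exact match: keeps only rows whose label is exactly "Suicide".
suicide_rows = sample[sample["Cause Name"] == "Suicide"]

# Substring match: regex=False treats the pattern as literal text,
# not as a regular expression.
all_cause_rows = sample[sample["Cause Name"].str.contains("All causes", regex=False)]

print(suicide_rows)
print(all_cause_rows)
```

An exact match is the stricter choice when the category labels are known in advance, since a substring match would also pick up any longer label that happens to contain the search text.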


In [12]:
agg_data.head(10)

Unnamed: 0,Year,Cause Name,State,Deaths
9360,2016,All causes,Hawaii,10913
9361,2011,All causes,Hawaii,9923
9362,2012,All causes,Hawaii,10274
9363,2015,All causes,Hawaii,11053
9364,2014,All causes,Hawaii,10767
9365,2010,All causes,Hawaii,9617
9366,2013,All causes,Hawaii,10505
9367,2014,All causes,California,245929
9368,2008,All causes,Hawaii,9501
9369,2016,All causes,California,262240


In [13]:
suc_data.head(10)

Unnamed: 0,Year,Cause Name,State,Deaths
3,2000,Suicide,District of Columbia,23
5,2009,Suicide,District of Columbia,29
10,2015,Suicide,District of Columbia,34
12,1999,Suicide,District of Columbia,30
14,2016,Suicide,District of Columbia,40
15,2006,Suicide,District of Columbia,30
16,2002,Suicide,District of Columbia,31
22,2005,Suicide,District of Columbia,33
24,2004,Suicide,District of Columbia,33
27,2011,Suicide,District of Columbia,37
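
With the aggregate subset in hand, one possible way to set New York against the other states is sketched below. This is only an illustration on invented numbers: `agg_sample`, the state list, and the death counts are all hypothetical, and the comparison (New York versus the mean of the remaining states) is one assumption among several reasonable choices.

```python
import pandas as pd

# Hypothetical "All causes" rows; the death counts are invented for illustration.
agg_sample = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2016, 2016, 2016],
    "State": ["New York", "Hawaii", "Vermont"] * 2,
    "Deaths": [900, 100, 50, 950, 110, 60],
})

is_ny = agg_sample["State"] == "New York"

# New York's deaths per year.
ny_by_year = agg_sample[is_ny].set_index("Year")["Deaths"]

# Mean deaths per year across every other state in the sample.
others_by_year = agg_sample[~is_ny].groupby("Year")["Deaths"].mean()

# Align both series on Year and compute how many times larger New York is.
comparison = pd.DataFrame({
    "New York": ny_by_year,
    "Other states (mean)": others_by_year,
})
comparison["Ratio"] = comparison["New York"] / comparison["Other states (mean)"]

print(comparison)
```

Raw counts scale with population, so a rate such as the "Age-adjusted Death Rate" column dropped earlier would likely give a fairer comparison. If the real data contains a national total row, it would also need to be excluded before averaging.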


Mike's Portion of the code

In [14]:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# For the HTML data we used BeautifulSoup to pull the tables
res = requests.get("https://www.cdc.gov/nchs/pressroom/states/newyork/newyork.htm")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
# This builds a list of DataFrames, one per HTML table.
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string to read_html
df = pd.read_html(StringIO(str(table)))

# The final row in the table is not data I want (it's a subtitle), so I'll loop through the tables and delete that row
# Some extra cleanup - eliminated the numeric indexes from the labels using regex

for dataframe in df:
    dataframe.drop(dataframe.tail(1).index,inplace=True)
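    # Raw string avoids an invalid-escape warning; regex=True is required in pandas 2.0+, where the default is a literal match
    dataframe.iloc[:,0] = dataframe.iloc[:,0].str.replace(r'\d+\.', '', regex=True)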
    dataframe.rename(columns={ dataframe.columns[0]: "Cause of Death", dataframe.columns[2]: "Rate" }, inplace=True)

# Next I want to create the data frames from the list of data frames we created.  
# I'll pull the data using their index locations

ny_death_data_2017 = df[1]
ny_death_data_2016 = df[5]
ny_death_data_2015 = df[9]
ny_death_data_2014 = df[13]

frames = [df[1], df[5], df[9], df[13]]
yearly_dr = pd.concat(frames, keys=['2017','2016','2015','2014'], sort=True)
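
Passing `keys` to `pd.concat` adds an outer index level that records which table each row came from. The sketch below shows how that level can be used afterwards; the two small tables and the level names `Year` and `Row` are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical per-year tables with the same column names as the notebook's frames.
t2017 = pd.DataFrame({"Cause of Death": ["Heart Disease", "Cancer"], "Deaths": [10, 8]})
t2016 = pd.DataFrame({"Cause of Death": ["Heart Disease", "Cancer"], "Deaths": [9, 7]})

# keys= adds an outer index level, so each row remembers which table it came from.
# names= labels the two index levels (these labels are an assumption for readability).
combined = pd.concat([t2017, t2016], keys=["2017", "2016"], names=["Year", "Row"])

# Select one year's rows through the outer level.
only_2017 = combined.loc["2017"]

# Or turn the year into an ordinary column for later grouping or plotting.
flat = combined.reset_index(level="Year")

print(only_2017)
print(flat)
```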


In [15]:
ny_death_data_2017

Unnamed: 0,Cause of Death,Deaths,Rate,State Rank*,U.S. Rate**
0,Heart Disease,44092,171.2,17th,165.0
1,Cancer,34956,141.2,41st,152.5
2,Accidents,7687,35.5,49th,49.4
3,Chronic Lower Respiratory Diseases,7258,28.9,48th,40.9
4,Stroke,6264,24.6,50th,37.6
5,Flu/Pneumonia,4517,17.7,10th,14.3
6,Diabetes,4176,16.8,47th,21.5
7,Alzheimer’s disease,3521,13.2,50th,31.0
8,Hypertension,2699,10.4,11th,9.0


In [16]:
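# Group deaths by cause across all four years.
# A new name is used so the list of tables in df is not overwritten.
deaths_by_cause = yearly_dr.groupby('Cause of Death')['Deaths']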
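
A grouped object like the one above holds no results until an aggregation is applied to it. The following sketch shows one way to finish the step on made-up data; `yearly_sample` and its values are hypothetical, and summing or averaging is an assumption about what the grouping was meant for.

```python
import pandas as pd

# Hypothetical stand-in for yearly_dr with invented death counts.
yearly_sample = pd.DataFrame({
    "Year": ["2017", "2017", "2016", "2016"],
    "Cause of Death": ["Heart Disease", "Cancer", "Heart Disease", "Cancer"],
    "Deaths": [10, 8, 9, 7],
})

grouped = yearly_sample.groupby("Cause of Death")["Deaths"]

# Total deaths per cause across all years in the sample.
totals = grouped.sum()

# Several aggregations at once, one column per statistic.
summary = grouped.agg(["sum", "mean"])

print(totals)
print(summary)
```

Grouping relies on the labels matching exactly. In the CDC tables shown in this notebook, the 2017 table uses "Chronic Lower Respiratory Diseases" while the 2014 table uses "Chronic Lower Respiratory Disease", so those rows would land in separate groups unless the labels are normalized first.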

In [17]:
ny_death_data_2015['Deaths'].dtype


dtype('int64')

In [18]:
ny_death_data_2014.columns

Index(['Cause of Death', 'Deaths', 'Rate', 'State Rank*', 'U.S. Rate**'], dtype='object')

In [19]:
ny_death_data_2014

Unnamed: 0,Cause of Death,Deaths,Rate,State Rank*,U.S. Rate**
0,Heart Disease,43116,178.3,16th,167.0
1,Cancer,35392,151.8,42nd,161.2
2,Chronic Lower Respiratory Disease,6806,29.1,48th,36.5
3,Stroke,6212,26.1,49th,40.5
4,Accidents,5945,27.6,49th,40.5
5,Flu/Pneumonia,4702,19.5,8th,15.1
6,Diabetes,4064,17.4,45th,20.9
7,Alzheimer’s disease,2639,10.7,50th,25.4
8,Septicemia,2568,10.8,22nd (tie),10.7
