# Final Notebook Format

* Intro
* Hypothesis
* Explain data being used
* Explain EDA process
    * Justify why columns were dropped
    * What columns are being used and justify why they were kept
    * Explain data scope and how the columns are going to help with the hypothesis
    * Explain temporal data (probably just state that our data covers only one year)
    * Basically, every time we make an assumption it needs to be backed up and explained
* Explain visualizations and how they support our hypothesis
    * What is the conclusion of each visual?
    * Tie it back to our hypothesis
    * If we are comparing data, make sure it's placed strategically and close together
* Explain ML/Stats model and show visualizations
* Wrap up our big idea and conclude whether we found meaningful data
    

# Stroke/Heart Mortality Rate Trends
__Analysis of data about mortality amongst adults__

## Introduction

__Question: Is there a significant difference in mortality rates between groups based on race and ethnicity?__

We expect mortality rates to differ significantly between racial and ethnic groups. At this stage that is an assumption, so to explore and test it we pulled heart disease and stroke mortality data from the CDC website. Our target audience is health care leaders and the local authorities who handle budgeting. If our initial assumption is correct, then there are groups that need extra resources and assistance, because mortality rates that are high relative to other groups could indicate a lack of local funding, ineffective policies, or possibly another underlying issue.

## Exploratory Data Analysis

Stroke(2013) - [Stroke Data](https://data.world/us-hhs-gov/12ea7a13-b229-43b4-b19b-1459e9a64d3f)

Heart(2017) - [Heart Data](https://data.world/us-hhs-gov/01969266-32c7-4071-a84e-4fe524d472c2)

Both the Stroke and Heart data sets have similar columns. They contain the number of deaths per 100,000 population for each county in the US, which is why each data set has almost 60k rows. Columns for race/ethnicity and gender are also present. Each data set contains data for only the year listed.

In [17]:
import pandas as pd  # imported explicitly so `pd` does not depend on the star imports below

from DFfunctions import *
from MLfunctions import *

heart = pd.read_csv('heartmortality.csv')
stroke = pd.read_csv('strokemortality.csv')

heart.head()

| Unnamed: 0 | Year | LocationAbbr | LocationDesc | GeographicLevel | DataSource | Class | Topic | Data_Value | Data_Value_Unit | Data_Value_Type | Data_Value_Footnote_Symbol | Data_Value_Footnote | StratificationCategory1 | Stratification1 | StratificationCategory2 | Stratification2 | TopicID | LocationID | Location 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | AK | Aleutians East | County | NVSS | Cardiovascular Diseases | Heart Disease Mortality | 147.4 | per 100,000 population | Age-adjusted, Spatially Smoothed, 3-year Avera... |  |  | Gender | Overall | Race/Ethnicity | Overall | T2 | 2013 | (55.440626, -161.962562) |
| 1 | 2013 | AK | Aleutians West | County | NVSS | Cardiovascular Diseases | Heart Disease Mortality | 229.4 | per 100,000 population | Age-adjusted, Spatially Smoothed, 3-year Avera... |  |  | Gender | Overall | Race/Ethnicity | Overall | T2 | 2016 | (52.995403, -170.251538) |
| 2 | 2013 | AK | Anchorage | County | NVSS | Cardiovascular Diseases | Heart Disease Mortality | 255.5 | per 100,000 population | Age-adjusted, Spatially Smoothed, 3-year Avera... |  |  | Gender | Overall | Race/Ethnicity | Overall | T2 | 2020 | (61.159049, -149.103905) |
| 3 | 2013 | AK | Bethel | County | NVSS | Cardiovascular Diseases | Heart Disease Mortality | 305.5 | per 100,000 population | Age-adjusted, Spatially Smoothed, 3-year Avera... |  |  | Gender | Overall | Race/Ethnicity | Overall | T2 | 2050 | (60.924483, -159.749655) |
| 4 | 2013 | AK | Bristol Bay | County | NVSS | Cardiovascular Diseases | Heart Disease Mortality |  | per 100,000 population | Age-adjusted, Spatially Smoothed, 3-year Avera... | ~ | Insufficient Data | Gender | Overall | Race/Ethnicity | Overall | T2 | 2060 | (58.754192, -156.694709) |
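
The statements about row count, single-year coverage, and missing values can be checked directly rather than taken on trust. The following is a minimal sketch of such a check; the function name `summarize_dataset` is hypothetical, and the small frame is illustrative stand-in data, not the real CSV.

```python
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect a few facts that back up the written dataset description."""
    return {
        "rows": len(df),
        # Distinct years present; a single entry confirms one-year coverage.
        "years": sorted(df["Year"].dropna().unique().tolist()),
        # Rows with no mortality value (e.g. flagged "Insufficient Data").
        "missing_values": int(df["Data_Value"].isna().sum()),
        # Race/ethnicity labels that appear in the data.
        "race_groups": sorted(df["Stratification2"].dropna().unique().tolist()),
    }

# Illustrative stand-in for the real data set (values are for demonstration only).
sample = pd.DataFrame({
    "Year": [2013, 2013, 2013],
    "Data_Value": [147.4, None, 255.5],
    "Stratification2": ["Overall", "Overall", "White"],
})
summary = summarize_dataset(sample)
print(summary)
```

Running the same function on `heart` and `stroke` would give the numbers needed to justify the assumptions stated in this section.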


### Data Cleaning

The columns we want to keep are the deaths per 100k population, race/ethnicity, and location by state. Keeping the state lets us classify regions of the US, such as the Midwest, and look for a relationship between race/ethnicity and mortality rates per region. All other columns will be dropped because they do not directly help test our hypothesis. Since each data set only contains information for one year, we cannot analyze trends over time. The geolocation columns will also be dropped, since we use the state to classify a region.

In [18]:
heartNotUsed = ['Year','LocationDesc','GeographicLevel','DataSource','Class','Topic','Data_Value_Unit','Data_Value_Type',
                 'StratificationCategory1','Data_Value_Footnote_Symbol','StratificationCategory2',
                 'TopicID','LocationID','Location 1']
heartDf = removeC(heart, heartNotUsed)

strokeNotUsed = ['Year','LocationDesc','DataSource','Class','Topic','Data_Value_Unit','Data_Value_Type',
                 'StratificationCategory1','Data_Value_Footnote_Symbol','StratificationCategory2',
                 'TopicID','LocationID','Y_lat','X_lon','GeographicLevel']
strokeDf = removeC(stroke, strokeNotUsed)
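
The region classification described above is not shown in this notebook, so here is one possible sketch of it. It assumes the four U.S. Census Bureau regions as the grouping; the notebook does not say which region definition is intended. The names `CENSUS_REGIONS`, `STATE_TO_REGION`, and `add_region` are hypothetical.

```python
import pandas as pd

# Assumed grouping: the four U.S. Census Bureau regions, keyed by state abbreviation.
CENSUS_REGIONS = {
    "Northeast": ["CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"],
    "Midwest": ["IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD"],
    "South": ["DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV",
              "AL", "KY", "MS", "TN", "AR", "LA", "OK", "TX"],
    "West": ["AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA"],
}

# Invert the mapping so each state abbreviation points at its region.
STATE_TO_REGION = {
    state: region for region, states in CENSUS_REGIONS.items() for state in states
}

def add_region(df: pd.DataFrame, state_col: str = "LocationAbbr") -> pd.DataFrame:
    """Return a copy of df with a 'Region' column derived from the state column."""
    out = df.copy()
    # Abbreviations outside the mapping (e.g. territories) become NaN.
    out["Region"] = out[state_col].map(STATE_TO_REGION)
    return out

demo = add_region(pd.DataFrame({"LocationAbbr": ["AK", "OH", "TX", "NY", "PR"]}))
regions = demo["Region"].tolist()
print(demo)
```

Rows whose abbreviation is not in the mapping end up with a missing region, so they can be inspected or dropped explicitly instead of being silently assigned.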

Columns are renamed to state clearly what they hold, and rows with incomplete values are dropped.

In [23]:
heartDf = heartDf.rename(columns = {'Data_Value': 'Deaths per 100,000', 'Data_Value_Footnote': 'Sufficiency?'
                               ,'Stratification1': 'Gender', 'Stratification2': 'Race/Ethnicity'})

strokeDf = strokeDf.rename(columns = {'Data_Value': 'Deaths per 100,000', 'Data_Value_Footnote': 'Sufficiency?'
                                 ,'Stratification1': 'Gender','Stratification2': 'Race/Ethnicity'})

# Filtering data to obtain overall results for gender and clear any insufficient data from our dataset
heartDf = getSufficientData(heartDf)
strokeDf = getSufficientData(strokeDf)

notWanted = ['Sufficiency', 'Sufficiency?', 'Gender']

heartUpdated = removeC(heartDf, notWanted)
strokeUpdated = removeC(strokeDf, notWanted)

strokeUpdated.head()

| Unnamed: 0 | LocationAbbr | Deaths per 100,000 | Race/Ethnicity |
|---|---|---|---|
| 89 | AK | 55.7 | White |
| 90 | AK | 70.0 | White |
| 92 | AK | 73.3 | White |
| 94 | AK | 101.4 | White |
| 95 | AK | 59.6 | White |
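
The helpers `removeC` and `getSufficientData` come from `DFfunctions`, which is not shown here. The sketch below is a guess at what equivalent helpers might look like, written under different (hypothetical) names so it is not mistaken for the actual implementation. It assumes the renamed columns from the cell above, that a non-empty footnote marks insufficient data, and that rows with race/ethnicity "Overall" are excluded as well, since the groups listed below do not include it.

```python
import pandas as pd

def remove_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop the listed columns, skipping any name that is not present."""
    return df.drop(columns=columns, errors="ignore")

def keep_sufficient_overall_gender(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a usable rate, gender 'Overall', and a specific race group."""
    has_value = df["Deaths per 100,000"].notna()
    not_flagged = df["Sufficiency?"].isna()          # assumption: footnote set = insufficient
    overall_gender = df["Gender"] == "Overall"
    specific_race = df["Race/Ethnicity"] != "Overall"  # assumption, see lead-in
    return df[has_value & not_flagged & overall_gender & specific_race]

# Illustrative rows only; not taken from the real data set.
sample = pd.DataFrame({
    "LocationAbbr": ["AK", "AK", "AK", "AK"],
    "Deaths per 100,000": [55.7, None, 70.0, 80.0],
    "Sufficiency?": [None, "Insufficient Data", None, None],
    "Gender": ["Overall", "Overall", "Male", "Overall"],
    "Race/Ethnicity": ["White", "White", "White", "Overall"],
})

filtered = keep_sufficient_overall_gender(sample)
cleaned = remove_columns(filtered, ["Sufficiency", "Sufficiency?", "Gender"])
print(cleaned)
```

Ignoring missing column names matters here because the `notWanted` list contains both `'Sufficiency'` and `'Sufficiency?'`, and only one of them exists after the rename.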


What we are left with is a data set that contains the location, deaths per 100k, and our target column, Race/Ethnicity. These columns can now be used to look for correlations between mortality rates and race/ethnicity. For our intended scope we now have a complete data set that we can go on to visualize, since we can transform the deaths column into average deaths per 100k for each race/ethnicity group. The groups present in our data are White, Black, Hispanic, Asian and Pacific Islander, and American Indian and Alaskan Native.
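
The averaging step mentioned above could look like the following sketch. The function name `mean_rate_by_group` and the output column name are hypothetical, and the rows are illustrative values rather than real results. Note that this is an unweighted mean of county-level rates, so small and large counties count equally.

```python
import pandas as pd

def mean_rate_by_group(
    df: pd.DataFrame,
    group_cols=("Race/Ethnicity",),
    value_col: str = "Deaths per 100,000",
) -> pd.DataFrame:
    """Average the county-level rate within each group (unweighted by population)."""
    return (
        df.groupby(list(group_cols))[value_col]
        .mean()
        .reset_index(name="Mean deaths per 100,000")
    )

# Illustrative rows only; not taken from the real data set.
sample = pd.DataFrame({
    "LocationAbbr": ["AK", "AK", "AL", "AL"],
    "Deaths per 100,000": [50.0, 80.0, 70.0, 100.0],
    "Race/Ethnicity": ["White", "Black", "White", "Black"],
})

by_race = mean_rate_by_group(sample)
means = dict(zip(by_race["Race/Ethnicity"], by_race["Mean deaths per 100,000"]))
print(by_race)
```

Passing `group_cols=("LocationAbbr", "Race/Ethnicity")`, or a region column if one has been added, gives the same averages broken down by location.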

## Visualizations

## ML/Stats model 

## Conclusion