# **Analyzing 2015 mortality rates across the US**

## **Problem:** 
The goal of this project is to look at deaths in the US and the correlations between the descriptive statistics and mortality. Examples of questions we will answer and visualize include:

* Grouping the number of deaths by age and analyzing similarities within those groups
* Additional potential groupings by gender, education, time of year, and manner of death
* Using various types of data columns to visualize most frequent causes of death across the US


## **Data sources:**
This dataset was created by the CDC and was found on Kaggle.
* Kaggle https://www.kaggle.com/cdc/mortality?select=2015_data.csv
* CDC https://www.cdc.gov/nchs/nvss/mortality_public_use_data.html

## **Why this data set?**

The reason for analyzing any data set is to gain a deeper understanding of what the data represents. A story can be told from a data set, and this one provides interesting insights into the US population. Mortality records can tell us a great deal about a population when the facts about both the people and the circumstances of their deaths are properly recorded. Using this data, we examine relationships between features to understand who is dying and how those deaths can be grouped.

An example of a relationship that provides an interesting story is looking at education levels and seeing if there is a correlation between education and the age at which someone dies.

## **Exploratory Phase of our Project on mortality counts:**

---





### Overview of the tables with all data present:

In [None]:
# Authenticate this Colab session so the BigQuery cells below can connect to Google Cloud
from google.colab import auth
auth.authenticate_user()

In [None]:
%%bigquery --project=ba775-team9-b2
select * from `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality` limit 5;

Unnamed: 0,resident_status,education_2003_revision,month_of_death,sex,age_code,place_of_death,marital_status,day_of_week_of_death,injury_at_work,manner_of_death,activity_code,place_of_injury,race
0,1,1,1,M,2,2,S,5,U,7,,,1
1,2,1,1,M,3,1,S,4,U,7,,,1
2,3,1,1,M,2,1,S,7,U,7,,,3
3,1,9,1,F,3,4,S,3,U,1,9.0,0.0,1
4,1,1,1,M,3,2,S,1,U,7,,,2


In [None]:
%%bigquery --project=ba775-team9-b2
select * FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories` limit 5;

Unnamed: 0,education_level,month_of_death,sex,age_code,age_consolidated,day_of_week_of_death,injury_at_work,manner_of_death,race,activity_code
0,unknown or none,1,M,5-14years,0-14years,Sunday,U,Natural,Black,
1,unknown or none,1,M,1-4years,0-14years,Thursday,U,Natural,White,
2,unknown or none,1,F,5-14years,0-14years,Thursday,U,Natural,White,
3,unknown or none,1,M,1-4years,0-14years,Wednesday,N,Accident,Black,Unspecified Activities
4,unknown or none,1,M,5-14years,0-14years,Wednesday,U,Natural,White,


The first table is the raw data published by the CDC, which we found on Kaggle. We used the legends provided by the CDC to convert the coded categorical values back into strings so that the data is easier to read; the second table shows the result.
Below is the documentation we used for this conversion:
https://www.cdc.gov/nchs/data/dvs/Multiple_Cause_Record_Layout_2015.pdf
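
As a rough illustration of the conversion described above, the sketch below decodes two of the coded columns with plain Python dictionaries. The code-to-label mappings, the `decode_row` helper, and the choice to fall back to `"Unknown"` for unmapped codes are assumptions made for this example; they should be checked against the CDC record layout linked above and are not necessarily how the string table was actually built.

```python
# Assumed code-to-label mappings; verify each against the CDC record layout.
DAY_OF_WEEK = {
    1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday", 7: "Saturday", 9: "Unknown",
}
MANNER_OF_DEATH = {
    1: "Accident", 2: "Suicide", 3: "Homicide", 4: "Pending investigation",
    5: "Could not determine", 6: "Self-Inflicted", 7: "Natural",
}


def decode_row(row):
    """Return a copy of `row` with the two coded columns replaced by labels.

    Codes missing from a mapping fall back to "Unknown" (an assumption).
    """
    decoded = dict(row)
    decoded["day_of_week_of_death"] = DAY_OF_WEEK.get(
        row["day_of_week_of_death"], "Unknown"
    )
    decoded["manner_of_death"] = MANNER_OF_DEATH.get(
        row["manner_of_death"], "Unknown"
    )
    return decoded


# Values taken from one row of the raw table preview above.
raw_row = {"sex": "F", "day_of_week_of_death": 3, "manner_of_death": 1}
print(decode_row(raw_row))
```

Keeping each legend in its own dictionary makes it easy to compare the mapping line by line with the published layout.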

### Relative death counts between men and women:

In [14]:
%%bigquery --project=ba775-team9-b2
select 
sex,
count(*) as deaths_by_gender,
round((count(*)/total_death_count)*100,2) as percent_of_total
from( select count(*) as total_death_count
from `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality`),
 `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality`
group by sex,total_death_count 
order by deaths_by_gender;

Unnamed: 0,sex,deaths_by_gender,percent_of_total
0,F,1305420,49.35
1,M,1339735,50.65


We observed that males have a slightly higher death count (50.65% vs. 49.35%), but the difference is not substantial enough to draw any conclusions.


### Analyzing how age and gender impact % of deaths:

In [17]:
%%bigquery --project=ba775-team9-b2
select 
    sex,
    age_code, 
    count(*) as death_count, 
    ((count(*) / total_death_count)*100) as percent_of_total,
    round((count(case when sex = 'M' then sex else NULL end)/male_deaths)*100,2) as Male_Perc,
    round((count(case when sex = 'F' then sex else NULL end)/female_deaths)*100,2) as Female_Perc

from( select count(*) as total_death_count from `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality`),
    ( select count(*) as male_deaths from `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality` where sex = 'M'),
    ( select count(*) as female_deaths from `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality` where sex = 'F'),
    `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
group by sex,age_code,total_death_count, male_deaths, female_deaths 
order by percent_of_total desc limit 10;

Unnamed: 0,sex,age_code,death_count,percent_of_total,Male_Perc,Female_Perc
0,F,85years and over,526179,19.89218,0.0,40.31
1,M,75-84years,315376,11.92278,23.54,0.0
2,M,85years and over,315103,11.912459,23.52,0.0
3,F,75-84years,305163,11.536677,0.0,23.38
4,M,65-74years,274758,10.387217,20.51,0.0
5,M,55-64years,211962,8.013217,15.82,0.0
6,F,65-74years,206869,7.820676,0.0,15.85
7,F,55-64years,136204,5.149188,0.0,10.43
8,M,45-54years,102764,3.88499,7.67,0.0
9,F,45-54years,66989,2.532517,0.0,5.13


The age group with the most deaths for females is '85years and over', while the group with the highest deaths for males is '75-84years'.
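
The peak age group per sex can be re-derived from the rows returned above. The sketch below is a minimal check in plain Python; `peak_age_group` is a hypothetical helper, and the rows are copied from the first five lines of the query output rather than read from BigQuery.

```python
# (sex, age_code, death_count) copied from the top rows of the output above.
rows = [
    ("F", "85years and over", 526179),
    ("M", "75-84years", 315376),
    ("M", "85years and over", 315103),
    ("F", "75-84years", 305163),
    ("M", "65-74years", 274758),
]


def peak_age_group(rows):
    """Return the age group with the largest death count for each sex."""
    best = {}
    for sex, age_code, death_count in rows:
        if sex not in best or death_count > best[sex][1]:
            best[sex] = (age_code, death_count)
    return {sex: age_code for sex, (age_code, _count) in best.items()}


peaks = peak_age_group(rows)
print(peaks)

# Gap between the two largest male age groups, in number of deaths.
male_gap = 315376 - 315103
print(male_gap)
```

The male peak is narrow: the '75-84years' and '85years and over' groups differ by only 273 deaths, whereas the female peak at '85years and over' is far ahead of the next group.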

### Analyzing which manner of death is the most common among the total number of deaths reported:

In [18]:
%%bigquery --project=ba775-team9-b2
select manner_of_death, count(sex) as number_of_deaths, round((count(sex)/total_death_count)*100,2) as percent_of_total
FROM ( select count(*) as total_death_count
from `ba775-team9-b2.death_data_us.Sorted_2015_US_Mortality`),
 `ba775-team9-b2.death_data_us.Mortailty_with_string_categories` 
 where manner_of_death not in ('Unknown','Could not determine', 'Pending investigation')
GROUP BY manner_of_death,total_death_count
ORDER BY number_of_deaths DESC;

Unnamed: 0,manner_of_death,number_of_deaths,percent_of_total
0,Natural,2042686,77.22
1,Accident,140365,5.31
2,Suicide,43316,1.64
3,Homicide,18313,0.69


From this query we can see that natural causes are the most common manner of death, followed by accidents.
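
The four percentages above do not sum to 100 because the denominator counts every record while the `where` clause drops some manners of death. The sketch below recomputes the shares from the reported counts, assuming the total is the sum of the female and male counts from the earlier sex breakdown and that both tables hold the same rows.

```python
# Counts copied from the manner_of_death output above.
counts = {
    "Natural": 2042686,
    "Accident": 140365,
    "Suicide": 43316,
    "Homicide": 18313,
}

# Assumption: total deaths equals the F + M counts reported earlier.
total_deaths = 1305420 + 1339735

# Share of all deaths for each reported manner, rounded like the query.
shares = {
    manner: round(100 * count / total_deaths, 2)
    for manner, count in counts.items()
}

# Records not covered by the four reported manners.
excluded = total_deaths - sum(counts.values())
excluded_share = round(100 * excluded / total_deaths, 2)

print(shares)
print(excluded, excluded_share)
```

Under these assumptions about 15% of records (400,475 deaths) fall outside the four listed manners. These are the rows removed by the filter: the three excluded categories, plus any rows with a NULL manner of death, which `NOT IN` also drops.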

## **Digging Deeper into the Data:**

### Analyzing how the day of the week is related to number of deaths in a given age group:

In [19]:
%%bigquery --project=ba775-team9-b2
WITH weekday_age as (
    select
      a.age_consolidated as age_group, 
        a.day_of_week_of_death,  b.total_deaths_by_age,
        round((count(a.day_of_week_of_death)/total_deaths_by_age)*100,2) as percent_of_total,
    FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories` as a
    INNER JOIN (
        SELECT count(*) as total_deaths_by_age, age_consolidated
        FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
        GROUP BY age_consolidated
        
    ) as b
    on a.age_consolidated = b.age_consolidated
    WHERE a.age_consolidated <> 'Age not stated' and a.day_of_week_of_death <>'Unknown'
    GROUP BY age_group, day_of_week_of_death, total_deaths_by_age
    ORDER BY b.total_deaths_by_age DESC, percent_of_total DESC) 

    SELECT  * FROM weekday_age 
    order by percent_of_total desc
    limit 10;

Unnamed: 0,age_group,day_of_week_of_death,total_deaths_by_age,percent_of_total
0,15-34years,Sunday,80172,16.02
1,15-34years,Saturday,80172,15.74
2,0-14years,Saturday,32125,14.7
3,0-14years,Friday,32125,14.7
4,65+years,Thursday,1943448,14.57
5,35-54years,Saturday,240872,14.51
6,55-64years,Monday,348166,14.48
7,55-64years,Thursday,348166,14.47
8,0-14years,Thursday,32125,14.46
9,35-54years,Sunday,240872,14.44


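The above results show that the % of deaths among 15-34 year olds tends to be higher on the weekend (16.02% on Sunday and 15.74% on Saturday, compared with about 14.3% per day if deaths were spread evenly), while the % of deaths tends to be more even across the week for older age groups.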
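
One way to quantify the weekend effect is to compare each reported share with an even split of 100/7 per day. The sketch below does this for a few rows copied from the output above; `excess_over_uniform` is a hypothetical helper introduced only for this illustration.

```python
DAYS_PER_WEEK = 7

# Share each day would have if deaths were spread evenly across the week.
uniform_share = 100 / DAYS_PER_WEEK

# (age_group, day_of_week_of_death, percent_of_total) from the output above.
reported = [
    ("15-34years", "Sunday", 16.02),
    ("15-34years", "Saturday", 15.74),
    ("65+years", "Thursday", 14.57),
    ("35-54years", "Saturday", 14.51),
    ("55-64years", "Monday", 14.48),
]


def excess_over_uniform(percent):
    """Percentage points by which a daily share exceeds the even split."""
    return round(percent - uniform_share, 2)


for age_group, day, percent in reported:
    print(age_group, day, excess_over_uniform(percent))
```

For 15-34 year olds the Sunday share is about 1.7 percentage points above the even split, while the largest shares for the older groups listed here sit within roughly 0.3 points of it.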

### Initial analysis of death counts by education:



In [None]:
%%bigquery --project=ba775-team9-b2
select education_level,
     COUNT(*) as deaths_counts,
from `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
where education_level not like  '%unknown%'
GROUP BY education_level
ORDER BY education_level DESC;

Unnamed: 0,education_level,deaths_counts
0,master degree,103978
1,high school grad or GED,1077604
2,doctorate or professional degree,43688
3,"come college credit, no degree",315364
4,bachelor degreee,261106
5,associate degree,154326
6,"9-12th grade, no diploma",278916


Based on the above results, death counts are highest for individuals with a high school diploma or GED and lowest for those with a doctorate or professional degree; however, this comparison is affected by the fact that fewer people hold higher levels of education.

### Analyzing % deaths by education level for each age group:

In [20]:
%%bigquery --project=ba775-team9-b2
WITH C as (
    select
        a.age_code, 
        a.education_level, 
        count(a.sex) as deaths_by_age_and_education,
        b.total_deaths_by_education
    FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories` as a
    INNER JOIN (
        SELECT count(*) as total_deaths_by_education, education_level
        FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
        GROUP BY education_level 
        
    ) as b
    on a.education_level = b.education_level
    WHERE a.education_level <> 'unknown or none'
    GROUP BY age_code, education_level, total_deaths_by_education
    ORDER BY b.total_deaths_by_education DESC, deaths_by_age_and_education DESC 
)

SELECT *, (round((deaths_by_age_and_education  / total_deaths_by_education),3)*100) as percent_of_total
FROM C 
order by percent_of_total desc
limit 10;

Unnamed: 0,age_code,education_level,deaths_by_age_and_education,total_deaths_by_education,percent_of_total
0,85years and over,doctorate or professional degree,16380,43688,37.5
1,85years and over,bachelor degreee,86013,261106,32.9
2,85years and over,high school grad or GED,346300,1077604,32.1
3,85years and over,master degree,33344,103978,32.1
4,85years and over,"9-12th grade, no diploma",78285,278916,28.1
5,75-84years,doctorate or professional degree,12193,43688,27.9
6,75-84years,master degree,27848,103978,26.8
7,85years and over,"come college credit, no degree",83644,315364,26.5
8,85years and over,associate degree,39335,154326,25.5
9,75-84years,bachelor degreee,62485,261106,23.9


The analysis of age groups by education level indicated that, for people with a bachelor's degree or higher, a larger % of deaths fell in the 65+ year age groups than for people with fewer years of education.
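
To make the within-education shares concrete, the sketch below adds up the two oldest age groups for the education levels whose '75-84years' and '85years and over' rows both appear in the output above. The `share_at_75_plus` helper is hypothetical, and the figures are copied from the table, so levels with only one row in the top ten are left out.

```python
# (age_code, education_level, deaths in that cell, all deaths at that level)
reported_rows = [
    ("85years and over", "doctorate or professional degree", 16380, 43688),
    ("75-84years", "doctorate or professional degree", 12193, 43688),
    ("85years and over", "master degree", 33344, 103978),
    ("75-84years", "master degree", 27848, 103978),
    ("85years and over", "bachelor degreee", 86013, 261106),
    ("75-84years", "bachelor degreee", 62485, 261106),
]


def share_at_75_plus(rows):
    """Percent of each education level's deaths that fall in the given rows."""
    deaths = {}
    totals = {}
    for _age_code, education, count, education_total in rows:
        deaths[education] = deaths.get(education, 0) + count
        totals[education] = education_total
    return {
        education: round(100 * deaths[education] / totals[education], 1)
        for education in deaths
    }


shares_75_plus = share_at_75_plus(reported_rows)
for education, share in shares_75_plus.items():
    print(f"{education}: {share}% of deaths at age 75 or older")
```

With these figures, about 65.4% of deaths among people with a doctorate or professional degree occurred at age 75 or older. A full comparison against lower education levels would need the rows beyond the top ten shown here.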

### Investigating whether education level and age are related to injury at work and to higher or lower death counts:


In [13]:
%%bigquery --project=ba775-team9-b2
 select age_consolidated as age_during_death,education_level,injury_at_work ,((count(injury_at_work)/total_death_count)*100) as percent_of_total
 FROM( select count(*) as total_death_count
from `ba775-team9-b2.death_data_us.Mortailty_with_string_categories` where injury_at_work in ('Y','N') and education_level<>'unknown or none' and age_consolidated <> 'Age not stated'),
  `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
 where injury_at_work in ('Y','N') and education_level<>'unknown or none' and age_consolidated <> 'Age not stated'
 group by age_consolidated,education_level,injury_at_work,total_death_count
 order by percent_of_total desc
 limit 10;

Unnamed: 0,age_during_death,education_level,injury_at_work,percent_of_total
0,65+years,high school grad or GED,N,13.912719
1,35-54years,high school grad or GED,N,13.415004
2,15-34years,high school grad or GED,N,11.971686
3,55-64years,high school grad or GED,N,6.449709
4,15-34years,"9-12th grade, no diploma",N,5.803059
5,15-34years,"come college credit, no degree",N,5.211651
6,35-54years,"come college credit, no degree",N,4.567167
7,35-54years,"9-12th grade, no diploma",N,4.386278
8,65+years,bachelor degreee,N,4.062412
9,65+years,"come college credit, no degree",N,3.928099


The results of the above query suggest that the % of deaths involving an injury at work is lower for people with a doctorate or master's degree than for people with a high school diploma or less. Note that the ten rows displayed all have injury_at_work = 'N', so any 'Y' rows rank lower than these. An important external variable that we did not have visibility into is the type of occupation held by people in these age groups.

### Analyzing death counts by specific age groups and activity codes along with manner of death:

In [23]:
%%bigquery --project=ba775-team9-b2
WITH activity_code_age as (
    select
      a.age_consolidated as age_group, 
        a.activity_code,  a.manner_of_death, b.total_deaths_by_age,
        (round((count(a.activity_code)/total_deaths_by_age),2)*100) as percent_of_total,
    FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories` as a
    INNER JOIN (
        SELECT count(*) as total_deaths_by_age, age_consolidated
        FROM `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
        where age_consolidated <> 'Age not stated'  and activity_code<>'N/A' and activity_code<>'Unspecified Activities'
        GROUP BY age_consolidated
        
    ) as b
    on a.age_consolidated = b.age_consolidated
    WHERE a.age_consolidated <> 'Age not stated'  and a.activity_code<>'N/A' and a.activity_code not in ('Unspecified Activities','Other-Specified Activities')
    and manner_of_death <>'Could not determine'
    GROUP BY age_group, a.activity_code, manner_of_death, total_deaths_by_age
    )

    SELECT  * FROM activity_code_age order by percent_of_total desc
    limit 10;


Unnamed: 0,age_group,activity_code,manner_of_death,total_deaths_by_age,percent_of_total
0,0-14years,Vital Activities,Accident,24,54.0
1,65+years,Vital Activities,Accident,360,44.0
2,15-34years,Leisure Activity,Accident,177,25.0
3,55-64years,Vital Activities,Accident,110,22.0
4,35-54years,Leisure Activity,Accident,204,22.0
5,0-14years,Leisure Activity,Accident,24,17.0
6,15-34years,Vital Activities,Accident,177,16.0
7,55-64years,Leisure Activity,Accident,110,16.0
8,35-54years,Working for Income,Accident,204,15.0
9,55-64years,Working for Income,Accident,110,15.0


The data indicates that for people aged 65+ years and 0-14 years, the highest % of deaths came from accidents while engaged in vital activities.

### Analyzing how place of injury is related to % of deaths for a given age group:


In [25]:
%%bigquery --project=ba775-team9-b2
select age_consolidated,
        place_of_injury,
        count(place_of_injury) as death_by_place_of_injury,
        (round((count(place_of_injury)/total_death_count),2)*100) as percent_of_total
 FROM ( select count(*) as total_death_count
from `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
where place_of_injury not in ('Other or Not Recorded','Other Specified Places')
),
 `ba775-team9-b2.death_data_us.Mortailty_with_string_categories`
 where place_of_injury not in ('Other or Not Recorded','Other Specified Places')
 group by age_consolidated,place_of_injury,total_death_count
 order by percent_of_total desc
 limit 10;

Unnamed: 0,age_consolidated,place_of_injury,death_by_place_of_injury,percent_of_total
0,35-54years,Home,29389,24.0
1,65+years,Home,29720,24.0
2,15-34years,Home,21432,17.0
3,55-64years,Home,14572,12.0
4,65+years,Residential institution,8059,6.0
5,15-34years,Street and highway,3551,3.0
6,35-54years,Trade and service area,2393,2.0
7,0-14years,Home,2846,2.0
8,35-54years,Street and highway,2148,2.0
9,55-64years,Trade and service area,936,1.0


From the above query we observe that, across age groups, injuries leading to death occurred most often at home and least often in sports areas.

## **Summary and Conclusions:**

Through our analyses we investigated the relationship between several demographic variables and mortality data in 2015. After initially observing a difference in death counts between males and females across age groups, we proceeded to analyze additional factors such as education level, manner of death, and place of injury.

Males have slightly higher death counts overall, and their deaths peak at a younger age than those of females: the largest age group for males is 75-84 years, while for females it is 85 years and over. When looking at deaths within each age group by day of the week, we saw that 15-34 year olds tended to have a higher % of deaths on the weekends, while older age groups had a more consistent % across the entire week. While analyzing death counts by manner of death, the data indicated that natural causes are the most common manner of death, whereas homicide was the least common of the manners reported. Lastly, the most common place of injury leading to death was at home for all age groups, and the least common place of injury was in a sports area.

From our analyses we observed varying relationships between death counts and other variables in the dataset. While some of the conclusions were different by age group, such as day of the week, others showed more consistency across age groups, like place of injury. 

Dashboard Link:

https://public.tableau.com/views/Project_Tableau_Team9/All_dashboards?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link

![title](overview_deaths.jpeg)

![title](age_education.jpeg)

![title](education_injuryatwork.jpeg)

![title](manner_death_place_of_injury.jpeg)

![title](manner_death_activity.jpeg)