# Source Code:  Mortality in the United States 

## Introduction 
This project contains the source code to support the `Mortality in the United States` report created by the following individuals, as a part of Washington University's Foundations of Analytics course. 

* Kunihiro Fujita 
* Qi Lu
* Segun Akinyemi
* Gowtham Anbumani


## Key Problems & Questions  
The following are the key questions that were the focus of the analysis presented in our report.

1. What are the major causes of death in the U.S? Answered [here](#Section-1:-Major-Causes-of-Death-in-the-United-States).<br/><br/>
2. For the major causes of death in the U.S, what does the death distribution look like when plotted against age? (For example, histogram of 5 year age band). <br/><br/>
3. For each 5-year age band, what are the top 3 causes of death? Do they differ? <br/><br/>
4. Are the causes of death in the United States changing over time? Are there any significant increasing or decreasing trends in the prevalence of some causes? <br/><br/>
5. Given the medical transcripts of 5000 patients in the file `medicaltranscriptions.csv`, determine if any of those patients have medical conditions that are associated with the `ICD` codes of major causes of death. 
    * Design a similarity measure metric comparing `ICD` codes to medical transcripts.  
    * Calculate the similarity measure between `ICD` code descriptions and medical transcripts. 
    * Assign `ICD` codes to a medical transcript only if the similarity score is above certain threshold.
    

## Data & Library Imports
This section contains the various libraries and data files that we're used throughout our code. 

In [519]:
import re
import sys
import nltk 
import json
import warnings
import numpy as np
import pandas as pd
from numpy import genfromtxt
from tensorflow import keras
from tabulate import tabulate
import numpy.linalg as linalg
from collections import Counter
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from collections import defaultdict
from prettytable import PrettyTable
from tensorflow.keras import layers
import statsmodels.api as statsmodels
from gensim.models import KeyedVectors, Word2Vec
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize, wordpunct_tokenize

# Ensuring that output warnings are not displayed and setting some formatting options
warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:,}'.format

#cdc_data_2005 = pd.read_csv('./mortality-data/2005_data.csv', na_values=['NA','?'])
#cdc_data_2006 = pd.read_csv( './mortality-data/2006_data.csv', na_values=['NA','?'])
#cdc_data_2007 = pd.read_csv( './mortality-data/2007_data.csv', na_values=['NA','?'])
# cdc_data_2008 = pd.read_csv( './mortality-data/2008_data.csv', na_values=['NA','?'])
# cdc_data_2009 = pd.read_csv( './mortality-data/2009_data.csv', na_values=['NA','?'])
# cdc_data_2010 = pd.read_csv( './mortality-data/2010_data.csv', na_values=['NA','?'])
# cdc_data_2011 = pd.read_csv( './mortality-data/2011_data.csv', na_values=['NA','?'])
# cdc_data_2012 = pd.read_csv( './mortality-data/2012_data.csv', na_values=['NA','?'])
# cdc_data_2013 = pd.read_csv( './mortality-data/2013_data.csv', na_values=['NA','?'])
# cdc_data_2014 = pd.read_csv( './mortality-data/2014_data.csv', na_values=['NA','?'])
# cdc_data_2015 = pd.read_csv( './mortality-data/2015_data.csv', na_values=['NA','?'])

# Importing ICD Codes
# with open ('./mortality-data/2005_codes.json') as json_file: 
#     icd_codes_2005 = json.load(json_file)   
# with open ('./mortality-data/2006_codes.json') as json_file: 
#     icd_codes_2006 = json.load(json_file)
# with open ('./mortality-data/2007_codes.json') as json_file: 
#     icd_codes_2007 = json.load(json_file)
# with open ('./mortality-data/2008_codes.json') as json_file: 
#     icd_codes_2008 = json.load(json_file)
# with open ('./mortality-data/2009_codes.json') as json_file: 
#     icd_codes_2009 = json.load(json_file)
# with open ('./mortality-data/2010_codes.json') as json_file: 
#     icd_codes_2010 = json.load(json_file)
# with open ('./mortality-data/2011_codes.json') as json_file: 
#     icd_codes_2011 = json.load(json_file)
# with open ('./mortality-data/2012_codes.json') as json_file: 
#     icd_codes_2012 = json.load(json_file)
# with open ('./mortality-data/2013_codes.json') as json_file: 
#     icd_codes_2013 = json.load(json_file)
# with open ('./mortality-data/2014_codes.json') as json_file: 
#     icd_codes_2014 = json.load(json_file)
# with open ('./mortality-data/2015_codes.json') as json_file: 
#     icd_codes_2015 = json.load(json_file)

In [316]:
pd.set_option('precision', 0)

## Function Definitions
This section is meant to provide an overview of some of the key functions that were used to generate data and perform analysis. 

**Function Definition: Retrieving the full name of an ICD code.**

In [None]:
    def FindFullNameFromCode(target_code, icd_codes): 
        target_code = str(target_code).zfill(3)
        for code in icd_codes: 
            if code == target_code: 
                return icd_codes[code]
    
    # Example Usage 
    print(FindFullNameFromCode('8' , icd_codes_2005["39_cause_recode"]))

**Function Definition: Finding the single most frequent cause of death in a data set**

In [None]:
    def MostFrequentCauseOfDeath(data, icd_codes): 
        most_frequent_death = str(int(data.mode()[0])).zfill(3)
        for code in icd_codes: 
            if code == most_frequent_death: 
                return icd_codes[code]

    # Example Usage 
    mostFrequentDeath2005 = MostFrequentCauseOfDeath(cdc_data_2005['130_infant_cause_recode'], icd_codes_2005['130_infant_cause_recode'])

**Function Definition: Finding the top `n` causes of death in a data set** 

In [387]:
    def TopCausesOfDeath(cdc_data, icd_descriptions, n = 10):
        
        deathsByFrequency = cdc_data.value_counts()
        top_n_deaths = deathsByFrequency.head(10).rename_axis('Code').reset_index(name='Deaths')
        
        codeDescriptions = [icd_descriptions[code] for code in top_n_deaths['Code']]
        top_n_deaths["Description"] = codeDescriptions
        
        return top_n_deaths

    # Example Usage 
    #result = TopCausesOfDeath(cdc_data_2005["113_cause_recode"], icdLookupTable_2005)
    #result.style.hide_index()

**Function Definition: Creating a dictionary of ICD codes and their associated named descriptions.**

In [231]:
    def MapIcdToDesc(cdc_data, icd_codes): 
        codeToDescDict = {}
        
        for code in set(cdc_data): 
            zeroPaddedCode = str(code).zfill(3)
            codeToDescDict.update({
                code: icd_codes[icd] 
                for icd in icd_codes if icd == zeroPaddedCode
            }) 

        return codeToDescDict


## Major Causes of Death in the United States
This section contains the source code for the conclusions, visualizations and understandings our report presented on the major causes of death in the United States. 

In [520]:
# Dictionaries to hold {key, value} pairs containing {icd code, description} for each year. 
icd_desc_2005 = MapIcdToDesc(cdc_data_2005["113_cause_recode"], icd_codes_2005["113_cause_recode"])
icd_desc_2006 = MapIcdToDesc(cdc_data_2006["113_cause_recode"], icd_codes_2006["113_cause_recode"])
icd_desc_2007 = MapIcdToDesc(cdc_data_2007["113_cause_recode"], icd_codes_2007["113_cause_recode"])
icd_desc_2008 = MapIcdToDesc(cdc_data_2008["113_cause_recode"], icd_codes_2008["113_cause_recode"])
icd_desc_2009 = MapIcdToDesc(cdc_data_2009["113_cause_recode"], icd_codes_2009["113_cause_recode"])
icd_desc_2010 = MapIcdToDesc(cdc_data_2010["113_cause_recode"], icd_codes_2010["113_cause_recode"])
icd_desc_2011 = MapIcdToDesc(cdc_data_2011["113_cause_recode"], icd_codes_2011["113_cause_recode"])
icd_desc_2012 = MapIcdToDesc(cdc_data_2012["113_cause_recode"], icd_codes_2012["113_cause_recode"])
icd_desc_2013 = MapIcdToDesc(cdc_data_2013["113_cause_recode"], icd_codes_2013["113_cause_recode"])
icd_desc_2014 = MapIcdToDesc(cdc_data_2014["113_cause_recode"], icd_codes_2014["113_cause_recode"])
icd_desc_2015 = MapIcdToDesc(cdc_data_2015["113_cause_recode"], icd_codes_2015["113_cause_recode"])

In [521]:
# The top 10 causes of death for each year. 
top_ten_2005 = TopCausesOfDeath(cdc_data_2005["113_cause_recode"], icd_desc_2005)
top_ten_2006 = TopCausesOfDeath(cdc_data_2006["113_cause_recode"], icd_desc_2006)
top_ten_2007 = TopCausesOfDeath(cdc_data_2007["113_cause_recode"], icd_desc_2007)
top_ten_2008 = TopCausesOfDeath(cdc_data_2008["113_cause_recode"], icd_desc_2008)
top_ten_2009 = TopCausesOfDeath(cdc_data_2009["113_cause_recode"], icd_desc_2009)
top_ten_2010 = TopCausesOfDeath(cdc_data_2010["113_cause_recode"], icd_desc_2010)
top_ten_2011 = TopCausesOfDeath(cdc_data_2011["113_cause_recode"], icd_desc_2011)
top_ten_2012 = TopCausesOfDeath(cdc_data_2012["113_cause_recode"], icd_desc_2012)
top_ten_2013 = TopCausesOfDeath(cdc_data_2013["113_cause_recode"], icd_desc_2013)
top_ten_2014 = TopCausesOfDeath(cdc_data_2014["113_cause_recode"], icd_desc_2014)
top_ten_2015 = TopCausesOfDeath(cdc_data_2015["113_cause_recode"], icd_desc_2015)

In [389]:
top_ten_2005.style.hide_index()

Code,Deaths,Description
63,228656,"All other forms of chronic ischemic heart disease (I20,I25.1-I25.9)"
111,217898,"All other diseases (Residual) (D65-E07,E15-E34,E65-F99,G04-G12,G23-G25,G31-H93, K00-K22,K29-K31,K50-K66,K71-K72,K75-K76,K83-M99, N13.0-N13.5,N13.7-N13.9, N14,N15.0,N15.8-N15.9,N20-N23,N28-N39,N41-N64,N80-N98)"
27,159405,"Malignant neoplasms of trachea, bronchus and lung (C33-C34)"
59,151312,Acute myocardial infarction (I21-I22)
70,143789,Cerebrovascular diseases (I60-I69)
86,112239,"Other chronic lower respiratory diseases (J44,J47)"
68,109736,"All other forms of heart disease (I26-I28,I34-I38,I42-I49,I51)"
46,75207,Diabetes mellitus (E10-E14)
52,71611,Alzheimer's disease (G30)
62,63082,"Atherosclerotic cardiovascular disease, so described (I25.0)"


In [487]:
top_ten_2011

Unnamed: 0,Code,Deaths,Description
0,111,290219,"All other diseases (Residual) (D65-E07,E15-E34..."
1,63,192773,All other forms of chronic ischemic heart dise...
2,27,157148,"Malignant neoplasms of trachea, bronchus and l..."
3,86,129709,"Other chronic lower respiratory diseases (J44,..."
4,70,129143,Cerebrovascular diseases (I60-I69)
5,68,120786,"All other forms of heart disease (I26-I28,I34-..."
6,59,120126,Acute myocardial infarction (I21-I22)
7,52,84991,Alzheimer's disease (G30)
8,46,73916,Diabetes mellitus (E10-E14)
9,43,65554,All other and unspecified malignant neoplasms ...


In [527]:
topTenAllYears = pd.concat([top_ten_2005, top_ten_2006, top_ten_2007, top_ten_2008, top_ten_2009, top_ten_2010,
                           top_ten_2011, top_ten_2012, top_ten_2013, top_ten_2014, top_ten_2015])\
                   .groupby(['Code']).sum()\
                   .sort_values(by='Deaths', ascending=False)\
                   .reset_index()

# topTenAllYears = pd.concat([top_ten_2005, top_ten_2006, top_ten_2007, top_ten_2008, top_ten_2009, top_ten_2010, 
#                             top_ten_2011])\

# topTenAllYears = pd.concat([top_ten_2010, top_ten_2011])\
#                     .groupby(['Code', 'Description']).sum()\
#                     .sort_values(by='Deaths', ascending=False)\
#                     .reset_index()


#grouped.columns=np.where(grouped.columns==0, 'count', grouped.columns) #replace the default 0 to 'count'


# topTenAllYears = pd.concat([top_ten_2010, top_ten_2011])\
#                    .groupby('Code')['Description'].sum()\
#                    .reset_index()

#topTenAllYears = top_ten_2010.groupby(['Code', 'Description']).sum().add(top_ten_2011.groupby(['Code', 'Description']).sum(), fill_value=0).reset_index()

topTenAllYears['Deaths'] = topTenAllYears['Deaths'].apply("{:,}".format)
topTenAllYears = topTenAllYears.style.hide_index()
topTenAllYears

Code,Deaths
111,3028258
63,2212520
27,1733447
70,1471444
59,1392415
86,1378763
68,1327848
52,921265
46,807571
43,586856


In [453]:
pd.reset_option('colheader_justify')

In [376]:

# top102005 = cdc_data_2005["113_cause_recode"].value_counts().head(10)
# top102006 = cdc_data_2006["113_cause_recode"].value_counts().head(10)
# result = top102005.combine(top102006, min, fill_value=0).sort_values(ascending=False)
# result

#deathsByFrequency = cdc_data_2005["113_cause_recode"].value_counts()

#top_n_deaths = deathsByFrequency.head(10)
#type(top_n_deaths)
df = pd.concat([top_ten_2005, top_ten_2006]).groupby(['Code', 'Description']).sum()
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Frequency
Code,Description,Unnamed: 2_level_1
27,"Malignant neoplasms of trachea, bronchus and lung (C33-C34)",318191
43,"All other and unspecified malignant neoplasms (C17,C23-C24,C26-C31,C37-C41, C44-C49,C51-C52,C57-C60,C62-C63,C66,C68-C69,C73-C80,C97)",63193
46,Diabetes mellitus (E10-E14),147732
52,Alzheimer's disease (G30),144057
59,Acute myocardial infarction (I21-I22),293050
62,"Atherosclerotic cardiovascular disease, so described (I25.0)",63082
63,"All other forms of chronic ischemic heart disease (I20,I25.1-I25.9)",447960
68,"All other forms of heart disease (I26-I28,I34-I38,I42-I49,I51)",217780
70,Cerebrovascular diseases (I60-I69),281118
86,"Other chronic lower respiratory diseases (J44,J47)",219980


5. Given the medical transcripts of 5000 patients in the file `medicaltranscriptions.csv`, determine if any of those patients have medical conditions that are associated with the `ICD` codes of major causes of death. 
    * Design a similarity measure metric comparing `ICD` codes to medical transcripts.  
    * Calculate the similarity measure between `ICD` code descriptions and medical transcripts. 
    * Assign `ICD` codes to a medical transcript only if the similarity score is above certain threshold.
    

In [7]:
# Importing Medical Data
medical_data = pd.read_csv('./data-from-canvas/medicaltranscriptions.csv', sep=',', header=0)
medical_data_desc = ''.join(medical_data['description'])
pub_med_model = KeyedVectors.load_word2vec_format('./data-from-canvas/PubMed-and-PMC-w2v.bin', binary=True)
stop_words = set(stopwords.words('english'))

In [30]:
# This is a list of lists, each list in here is a tokenized description. Length of this should be 4999 (zero indexed so
# it's actually 5000), one for each description in the medicaltranscipts.csv file. 
medical_desc_tokenized = []

# Cleaning up the descriptions and removing stuff that is junk 
for desc in medical_data['description']: 
    token_desc = word_tokenize(desc) 
    token_desc = [word for word in token_desc if word]
    token_desc = [word.lower() for word in token_desc] 
    token_desc = [word.lower() for word in token_desc if not word in stop_words] 
    token_desc = [word.lower() for word in token_desc if word.isalpha()]
    medical_desc_tokenized.append(token_desc)


4999

In [43]:
medical_pos = nltk.pos_tag(medical_desc_tokenized)
MedicalPosNounVocab = []
accpetable_tags = ['NN', 'NNP', 'NNS', 'NNPS']

for word, pos in medical_pos: 
    if (pos in accpetable_tags):
        MedicalPosNounVocab.append(word)

MedicalPosNounVocab

['female',
 'presents',
 'complaint',
 'allergies',
 'bypass',
 'consult',
 'laparoscopic',
 'bypass',
 'doppler',
 'echocardiogram',
 'morbid',
 'obesity',
 'bypass',
 'eea',
 'anastomosis',
 'years',
 'diets',
 'abdomen',
 'revision',
 'breast',
 'reconstruction',
 'excision',
 'tissue',
 'abdomen',
 'lipodystrophy',
 'abdomen',
 'doppler',
 'morbid',
 'obesity',
 'laparoscopic',
 'bypass',
 'eea',
 'anastamosis',
 'esophagogastroduodenoscopy',
 'ventricle',
 'enlargement',
 'mild',
 'regurgitation',
 'increase',
 'heart',
 'angiogram',
 'moyamoya',
 'disease',
 'patient',
 'surgery',
 'service',
 'consideration',
 'laparoscopic',
 'roux',
 'bypass',
 'surgery',
 'removal',
 'teeth',
 'teeth',
 'visit',
 'management',
 'laparoscopic',
 'gastric',
 'neck',
 'exploration',
 'tracheostomy',
 'bronchoscopy',
 'site',
 'body',
 'stent',
 'material',
 'dilation',
 'trachea',
 'placement',
 'shiley',
 'cannula',
 'tracheostomy',
 'tube',
 'patient',
 'status',
 'post',
 'lap',
 'band',
 'pl

In [46]:
medical_desc_vectors = []
countConverted = 0 

for word in MedicalPosNounVocab: 
    try:
        medical_desc_vectors.append(pub_med_model[word])
        countConverted += 1
    except KeyError:
        continue


NameError: name 'icd_codes_2005' is not defined

In [78]:
icd_all_desc = {}
icd_all_desc.update(icd_codes_2005['39_cause_recode'])
icd_all_desc.update(icd_codes_2005['113_cause_recode'])
icd_all_desc.update(icd_codes_2005['130_infant_cause_recode'])
icd_all_desc.update(icd_codes_2005['358_cause_recode'])
icd_all_desc = list(icd_all_desc.values())
icd_all_desc = ''.join(icd_all_desc)

icd_codes_tokenized = word_tokenize(icd_all_desc) 
icd_codes_tokenized = [word for word in icd_codes_tokenized if word]
icd_codes_tokenized = [word.lower() for word in icd_codes_tokenized] 
icd_codes_tokenized = [word.lower() for word in icd_codes_tokenized if not word in stop_words] 
icd_codes_tokenized = [word.lower() for word in icd_codes_tokenized if word.isalpha()]
len(icd_codes_tokenized)



1581

In [1]:
icd_pos = nltk.pos_tag(medical_desc_tokenized)
IcdPosVocab = []

for word, pos in icd_pos: 
    if (pos in accpetable_tags):
        IcdPosVocab.append(word)


NameError: name 'nltk' is not defined

In [84]:
icd_desc_vectors = []
countConverted = 0 

for word in IcdPosVocab: 
    try:
        icd_desc_vectors.append(pub_med_model[word])
        countConverted += 1
    except KeyError:
        continue


31149


31149

In [93]:
# only use one recode (113), compare the top 10 causes of death, their descriptions, to every row in the medical desc
# as a single record for a patient, Not all the words in teh medical desc, but one row as a vector. 
# cosine_similarity(icd_desc_vectors, medical_desc_vectors)

array([[ 0.9999998 ,  0.05249518,  0.21974863, ..., -0.01770275,
         0.02033191, -0.05484248],
       [ 0.05249518,  0.99999976,  0.26056474, ...,  0.00570566,
         0.13618785, -0.12366677],
       [ 0.21974863,  0.26056474,  1.0000001 , ...,  0.06785312,
         0.25297388, -0.03772742],
       ...,
       [-0.01770275,  0.00570566,  0.06785312, ...,  1.0000004 ,
         0.01982268,  0.02522024],
       [ 0.02033191,  0.13618785,  0.25297388, ...,  0.01982268,
         1.0000001 , -0.00119112],
       [-0.05484248, -0.12366677, -0.03772742, ...,  0.02522024,
        -0.00119112,  0.99999994]], dtype=float32)