# Final Project: ICD Codes and Mortality

## Introduction 
This project was created by Group A. The members of the group are as follows.  
* Kunihiro Fujita 
* Qi Lu
* Segun Akinyemi
* Gowtham Anbumani

## Background
Every year, the Centers for Disease Control and Prevention (CDC) publishes detailed statistics on deaths in the United States and their underlying causes. The CDC mortality data is used by industries such as medicine, health care, and insurance to provide better services. It forms the basis of numerous research studies and is widely cited in published papers. Because of your outstanding knowledge, you have been hired as a data scientist by a prestigious insurance company. The VP of your department wants to launch a life insurance product but would first like an analysis of the causes of death in the US population. The following questions need to be answered:

## Problems 
This project addresses the following problems.

1. What are the major causes of death in the US? <br/><br/>
2. For different causes of death, what does the distribution of deaths look like across age (for example, a histogram of 5-year age bands)? <br/><br/>
3. For each age band (5-year band), what are the top 3 causes of death? Do they differ? <br/><br/>
4. Are the causes of death changing over time? Are there any significant increasing or decreasing trends in some causes? <br/><br/>
5. You have medical transcripts of 5000 patients in `medicaltranscriptions.csv`. You want to determine if any patients have medical conditions that are associated with ICD codes of major causes of death. 
    * Design a similarity measure. 
    * Calculate the similarity measure between ICD code descriptions and medical transcripts. 
    * Assign an ICD code to a medical transcript only if the similarity score is above a certain threshold.
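
The three steps of problem 5 can be sketched as follows. This is only a minimal illustration, not the method used later in the project: it assumes a bag-of-words cosine similarity as the measure, and the function names, the sample descriptions, the sample transcript, and the threshold of 0.2 are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def assign_icd_codes(transcript, icd_descriptions, threshold=0.2):
    """Return (code, score) pairs scoring above the threshold, best first."""
    scored = [(code, cosine_similarity(transcript, description))
              for code, description in icd_descriptions.items()]
    matches = [(code, score) for code, score in scored if score > threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs for illustration only (simplified descriptions).
descriptions = {
    "016": "Diabetes mellitus",
    "024": "Cerebrovascular diseases",
}
transcript = "Patient has a long history of diabetes mellitus and hypertension."
print(assign_icd_codes(transcript, descriptions))
```

Raw word counts favor short descriptions and ignore synonyms, so a weighted scheme such as TF-IDF may be a better fit for real transcripts. The threshold would need to be tuned against the data.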

### Source Code

In [1]:
import sys
import json
import warnings
import numpy as np
import pandas as pd
from numpy import genfromtxt
from tabulate import tabulate
import numpy.linalg as linalg
import matplotlib.pyplot as plt
from collections import Counter
from prettytable import PrettyTable
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

# Suppress all warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

In [2]:
# Importing CDC data. 
cdc_data_2005 = pd.read_csv( './mortality-data/2005_data.csv', na_values=['NA','?'])
#cdc_data_2006 = pd.read_csv( './mortality-data/2006_data.csv', na_values=['NA','?'])
#cdc_data_2007 = pd.read_csv( './mortality-data/2007_data.csv', na_values=['NA','?'])
#cdc_data_2008 = pd.read_csv( './mortality-data/2008_data.csv', na_values=['NA','?'])
#cdc_data_2009 = pd.read_csv( './mortality-data/2009_data.csv', na_values=['NA','?'])
#cdc_data_2010 = pd.read_csv( './mortality-data/2010_data.csv', na_values=['NA','?'])
#cdc_data_2011 = pd.read_csv( './mortality-data/2011_data.csv', na_values=['NA','?'])
#cdc_data_2012 = pd.read_csv( './mortality-data/2012_data.csv', na_values=['NA','?'])
#cdc_data_2013 = pd.read_csv( './mortality-data/2013_data.csv', na_values=['NA','?'])
#cdc_data_2014 = pd.read_csv( './mortality-data/2014_data.csv', na_values=['NA','?'])
#cdc_data_2015 = pd.read_csv( './mortality-data/2015_data.csv', na_values=['NA','?'])

# Importing ICD Codes
with open ('./mortality-data/2005_codes.json') as json_file: 
    icd_codes_2005 = json.load(json_file)   
with open ('./mortality-data/2006_codes.json') as json_file: 
    icd_codes_2006 = json.load(json_file)
with open ('./mortality-data/2007_codes.json') as json_file: 
    icd_codes_2007 = json.load(json_file)
with open ('./mortality-data/2008_codes.json') as json_file: 
    icd_codes_2008 = json.load(json_file)
with open ('./mortality-data/2009_codes.json') as json_file: 
    icd_codes_2009 = json.load(json_file)
with open ('./mortality-data/2010_codes.json') as json_file: 
    icd_codes_2010 = json.load(json_file)
with open ('./mortality-data/2011_codes.json') as json_file: 
    icd_codes_2011 = json.load(json_file)
with open ('./mortality-data/2012_codes.json') as json_file: 
    icd_codes_2012 = json.load(json_file)
with open ('./mortality-data/2013_codes.json') as json_file: 
    icd_codes_2013 = json.load(json_file)
with open ('./mortality-data/2014_codes.json') as json_file: 
    icd_codes_2014 = json.load(json_file)
with open ('./mortality-data/2015_codes.json') as json_file: 
    icd_codes_2015 = json.load(json_file)

### Function Definition: Getting the ICD code death name.

In [3]:
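def FindDeathNameFromCode(target_code, icd_codes): 
    # Convert to str before padding so that integer codes are accepted too;
    # the keys being matched are zero-padded to three digits (e.g. '008').
    target_code = str(target_code).zfill(3)
    # Returns None when the code has no entry in icd_codes.
    return icd_codes.get(target_code)

# Usage 
print(FindDeathNameFromCode('8' , icd_codes_2005["39_cause_recode"]))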

Malignant neoplasms of trachea, bronchus and lung (C33-C34)


### Function Definition: Finding the most frequent cause of death in a data set

In [4]:
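def MostFrequentCauseOfDeath(data, icd_codes): 
    # mode() returns the most frequent value(s); take the first one and
    # zero-pad it to three digits so it can be matched against the keys.
    most_frequent_death = str(int(data.mode()[0])).zfill(3)
    # Returns None when the code has no entry in icd_codes.
    return icd_codes.get(most_frequent_death)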

# Usage 
mostFrequent2005 = MostFrequentCauseOfDeath(cdc_data_2005['130_infant_cause_recode'], icd_codes_2005['130_infant_cause_recode'])
print(mostFrequent2005)

Extremely low birthweight or extreme immaturity (P07.0,P07.2)


### Function Definition: Finding the top `n` causes of death in a data set, defaults to top 10

In [5]:
def TopCausesOfDeath(data, n = 10):
    counter = Counter(data)
    most_common_causes = [key for key, val in counter.most_common(n)]
    return most_common_causes

# Usage 
print(TopCausesOfDeath(cdc_data_2005["39_cause_recode"]))

[21, 37, 22, 8, 24, 15, 28, 39, 16, 17]
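
`TopCausesOfDeath` returns only the codes, so the ranking does not show how dominant each cause is. A possible variant, sketched below, also keeps each code's count and its share of all deaths. The function name `top_causes_with_share` and the sample values are hypothetical.

```python
from collections import Counter


def top_causes_with_share(codes, n=10):
    """Return (code, count, share) tuples for the n most frequent codes."""
    counter = Counter(codes)
    total = sum(counter.values())
    return [(code, count, count / total)
            for code, count in counter.most_common(n)]


# Hypothetical recode values, for illustration only.
sample_codes = [21, 21, 21, 37, 37, 8]
for code, count, share in top_causes_with_share(sample_codes, n=2):
    print(f"Code {code}: {count} deaths ({share:.1%})")
```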


### Example: Mapping the top `n` causes of death to the ICD code name

In [18]:
top_causes_2005 = TopCausesOfDeath(cdc_data_2005["39_cause_recode"])

final_list = []
for cause in top_causes_2005: 
    final_list.append("Code " + str(cause) + ": " + FindDeathNameFromCode(str(cause) , icd_codes_2005["39_cause_recode"])) 

for death in final_list: 
    print(death + "\n")

Code 21: Ischemic heart diseases (I20-I25)

Code 37: All other diseases (Residual) (A00-A09,A20-A49,A54-B19,B25-B99,D00-E07, E15-G25,G31-H93,I80-J06,J20-J39,J60-K22,K29-K66,K71-K72, K75-M99,N10-N15,N20-N23,N28-N98)

Code 22: Other diseases of heart (I00-I09,I26-I51)

Code 8: Malignant neoplasms of trachea, bronchus and lung (C33-C34)

Code 24: Cerebrovascular diseases (I60-I69)

Code 15: Other malignant neoplasms (C00-C15,C17,C22-C24,C26-C32,C37-C49,C51-C52, C57-C60,C62-C63,C69-C81,C88,C90,C96-C97)

Code 28: Chronic lower respiratory diseases (J40-J47)

Code 39: All other and unspecified accidents and adverse effects (V01,V05-V06,V09.1,V09.3-V09.9,V10-V11,V15-V18,V19.3,V19.8-V19.9, V80.0-V80.2,V80.6-V80.9,V81.2-V81.9,V82.2-V82.9,V87.9,V88.9,V89.1, V89.3,V89.9,V90-X59,Y40-Y86,Y88)

Code 16: Diabetes mellitus (E10-E14)

Code 17: Alzheimer's disease (G30)
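
Problem 3 asks for the top 3 causes of death in each 5-year age band. The `Counter`-based ranking above could be extended to that case roughly as follows. This is a sketch under assumptions: ages are taken to be available in whole years (the CDC files may need their age column decoded first), and the helper names and sample records are hypothetical.

```python
from collections import Counter, defaultdict


def age_band(age, width=5):
    """Return the label of the width-year band containing age, e.g. '65-69'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def top_causes_by_age_band(records, n=3, width=5):
    """Map each age band to its n most frequent cause codes."""
    counters = defaultdict(Counter)
    for age, code in records:
        counters[age_band(age, width)][code] += 1
    return {band: [code for code, _ in counter.most_common(n)]
            for band, counter in counters.items()}


# Hypothetical (age in years, cause recode) pairs, for illustration only.
sample_records = [(66, 21), (67, 21), (68, 37), (72, 8), (70, 8), (71, 24)]
for band, codes in sorted(top_causes_by_age_band(sample_records).items()):
    print(band, codes)
```

Each resulting code can then be passed to `FindDeathNameFromCode` to obtain a readable name.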

