# Discharge Notes Analysis
### with Olivia Tang and Brian Jhong
### BIME 498
### University of Washington

In [8]:
#import libraries
import glob
import pandas as pd
import numpy as np 
import re
from collections import Counter

## (2) Analysis:

A challenge in analyzing clinical data is that much of it is free text, as in these notes. Thus, although there is a well-recognized standard terminology for coding diagnoses and conditions (ICD-10), it can be challenging for a computer (or sometimes a human) to understand what diagnoses the patient has at the time of discharge. Your "simple" analysis task is to search through all of the discharge summaries of all of your patients (recall that not all patients have such a summary) and find the terms following "discharge diagnosis". (If the patient dies in the hospital, there are no discharge diagnoses.) Then, produce a table that lists the most common discharge diagnoses among your patients (e.g. "Congestive heart failure -- 10 patients; Diabetes mellitus -- 7 patients; ..."). Also include the number of patients with discharge diagnoses and the average number of diagnoses per patient. For the highest grade, try to find and recognize some synonyms (e.g. "afib" is the same as "atrial fibrillation").

This analysis is a simple initial example of the sort of data analysis you will do in project 2. To help you learn the technologies for project 2, I am asking that you carry out this analysis with Python using Jupyter notebooks. We recommend that you install Anaconda as the easiest way to use both Jupyter and Python.  If you'd prefer to use R for project 1, that's okay, but project 2 will require Python. 

In [46]:
#get list of all .txt files
files = glob.glob('./DischargeNotes-24/*.txt')
#get number of files: 153
len(files)

153

In [47]:
#master list of every diagnosis line found across all notes
diagnoses = []
#number of diagnoses found in each file (one entry per note)
avgdiag = []
stripchars = '0123456789.'  #leading list numbering such as "1." or "12."
#section labels that may share a line with a diagnosis
prefixes = ('primary:', 'primary diagnoses:', 'secondary:')
for file in files:
    with open(file, "r") as f:
        copy = False  #True while inside the "discharge diagnosis:" section
        diag = []
        for line in f:
            #clean the line before deciding what to do with it
            line = line.strip().lstrip(stripchars).strip().lower()
            line = re.sub(r"[\(\[].*?[\)\]]", '', line)  #drop (...) and [...] spans
            line = re.sub(r'\s.*\]', '', line)  #drop text running up to a stray "]"
            line = line.strip()
            if line == "discharge diagnosis:":
                copy = True
            elif not line or line == "discharge condition:":
                copy = False  #a blank line or the next section ends the list
            elif copy:
                if not (line.startswith(('primary', 'secondary', '[')) or line == 'na'):
                    diag.append(line)
                else:
                    #keep any diagnosis that follows a section label on the same line
                    for prefix in prefixes:
                        if line.startswith(prefix):
                            line = line.replace(prefix, '').strip()
                            if line:
                                diag.append(line)
                            break
        diagnoses.extend(diag)
        avgdiag.append(len(diag))
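
One way to sanity-check the section parsing without touching the real notes is to run the same idea on an in-memory string. The following is a minimal sketch, not the cell above: the helper name `extract_diagnoses` and the sample note are made up for illustration, and it assumes the section starts at "discharge diagnosis:" and ends at a blank line or at "discharge condition:".

```python
import re

def extract_diagnoses(note_text):
    """Return cleaned lines found under 'discharge diagnosis:' (hypothetical helper)."""
    found = []
    in_section = False
    for raw in note_text.splitlines():
        # Drop list numbering such as "1." and normalise case.
        line = raw.strip().lstrip('0123456789.').strip().lower()
        # Remove bracketed or parenthesised spans.
        line = re.sub(r"[\(\[].*?[\)\]]", "", line).strip()
        if line == "discharge diagnosis:":
            in_section = True
        elif not line or line == "discharge condition:":
            in_section = False
        elif in_section:
            found.append(line)
    return found

# Made-up note, for illustration only.
sample_note = """Discharge Disposition:
Home

Discharge Diagnosis:
1. Atrial fibrillation
2. Hypertension (poorly controlled)

Discharge Condition:
Stable
"""

print(extract_diagnoses(sample_note))
```

Because a blank line closes the section, a note that leaves an empty line directly after the header would yield no diagnoses under this rule, which is worth keeping in mind when reading the per-patient counts.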
            

In [9]:
len(diagnoses)

254

In [10]:
#convert to np array
np_diag = np.array(diagnoses)

In [15]:
#used numpy unique to filter same diagnoses
uniquediag = np.unique(np_diag)

In [20]:
#strip whitespace in front and behind, then de-duplicate again
#(stripping can turn two previously distinct strings into the same one)
uniquediag = np.unique(np.char.strip(uniquediag))

## Manual Sorting and Filtering
We wrote the diagnoses and the unique diagnoses to two separate .txt files for manual sorting and filtering (we later found it easier to simply go through the full diagnoses list). We chose manual cleanup because the representation of diagnoses varied from note to note: some lines listed several diagnoses, some diagnoses were misspelled, and many carried qualifiers such as 'right', 'left', 'acute', 'chronic', 'prior', or 's/p'. These qualifiers matter for an individual diagnosis but are less valuable when counting diagnoses across many patients.

**General Edits**
* We simplified diagnoses such as 'right subarachnoid hemorrhage/intrparenchymal hemorrhage' to 'hemorrhage'; 'left rib fracture' became 'rib fracture'.
* We generalized all cancers (lymphoma, carcinoma, leukemia) to 'cancer'.
* Some entries combined several diagnoses, so we split them (e.g. 'multifocal aspiration pneumonia with sepsis' became 'pneumonia' and 'sepsis').
* Some entries included an explanation for the diagnosis, such as 'history of falls' for 'senile dementia'; the explanation ('history of falls') was not counted.
* Abbreviations such as 'htn' were written out ('hypertension') for clarity; the only diagnoses left as abbreviations were 'uti' and 'a-fib'.
* Every representation of diabetes was shortened to 'diabetes'.
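
Some of these rules could be applied programmatically. The sketch below is illustrative only and is not what we ran: the function `normalize` and the lookup tables are hypothetical, the table entries are limited to terms mentioned in this notebook, and mapping 'urinary tract infection' to 'uti' is an assumption based on 'uti' being kept as an abbreviation.

```python
# Hypothetical lookup tables built from the edits listed above.
QUALIFIERS = ("right", "left", "acute", "chronic", "prior", "s/p")
SYNONYMS = {
    "htn": "hypertension",
    "afib": "a-fib",
    "atrial fibrillation": "a-fib",
    "urinary tract infection": "uti",  # assumption, see lead-in
}
CANCER_TERMS = ("lymphoma", "carcinoma", "leukemia")

def normalize(diagnosis):
    """Map one raw diagnosis string to a simplified label (illustrative only)."""
    text = diagnosis.lower().strip()
    # Drop qualifier words such as 'left' or 'chronic'.
    text = " ".join(w for w in text.split() if w not in QUALIFIERS)
    # Collapse every cancer type into a single label.
    if any(term in text for term in CANCER_TERMS):
        return "cancer"
    # Collapse every representation of diabetes.
    if "diabetes" in text:
        return "diabetes"
    # Fall back to the synonym table, or keep the text unchanged.
    return SYNONYMS.get(text, text)

print([normalize(d) for d in ["left rib fracture", "htn", "afib"]])
# prints ['rib fracture', 'hypertension', 'a-fib']
```

A function like this does not cover splitting combined entries or fixing misspellings, which is part of why manual review was still needed.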

In [13]:
#write every extracted diagnosis to a file, one per line
with open("diagnoses.txt", "w") as f:
    for d in diagnoses:
        f.write(d + '\n')

In [22]:
#write the unique diagnoses to a file, one per line
with open("uniquediag.txt", "w") as f:
    for d in uniquediag:
        f.write(d + '\n')

In [41]:
# read the manually cleaned diagnoses file back in
finaldiagnoses = []
with open('finaldiagnoses.txt', "r") as f:
    for line in f:
        line = line.lower()
        finaldiagnoses.append(line.strip())

In [42]:
#count each unique diagnosis once, then split into labels (keys) and counts (values)
diag_counts = Counter(finaldiagnoses)
keys = list(diag_counts.keys())
values = list(diag_counts.values())

In [43]:
#write keys and values to df
df = pd.DataFrame({'diagnosis': keys, 'counts': values})
df = df.sort_values(by = 'counts', ascending=False)

In [45]:
#lists top 20 values
df.head(20)

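|    | diagnosis                | counts |
|----|--------------------------|--------|
| 8  | hypertension             | 11     |
| 39 | uti                      | 10     |
| 19 | pneumonia                | 7      |
| 44 | cancer                   | 7      |
| 3  | respiratory failure      | 7      |
| 4  | coronary artery disease  | 7      |
| 42 | a-fib                    | 7      |
| 85 | congestive heart failure | 6      |
| 24 | myocardial infarction    | 6      |
| 14 | diabetes                 | 5      |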
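
The assignment asks for lines of the form "Congestive heart failure -- 10 patients". As a sketch, `Counter.most_common` yields the ranked pairs directly. The list here is a small made-up stand-in for `finaldiagnoses`, not the real data, and `report_lines` is a name introduced only for this example.

```python
from collections import Counter

# Made-up stand-in for the cleaned diagnosis list (not the real data).
example_diagnoses = ["hypertension", "uti", "hypertension", "a-fib", "uti", "hypertension"]

counts = Counter(example_diagnoses)
# most_common(n) returns (label, count) pairs, highest count first.
report_lines = [f"{label} -- {n} patients" for label, n in counts.most_common(2)]
for line in report_lines:
    print(line)
```

Counting list entries equals counting patients only if no diagnosis appears twice for the same patient, so de-duplicating within each note first would make the "patients" label safer.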


In [50]:
avgdiag = np.array(avgdiag)

In [57]:
len(avgdiag)

153

In [53]:
#count how many patients have each number of diagnoses
unique, counts = np.unique(avgdiag, return_counts=True)

In [56]:
print(unique)
print(counts)  #88 patients have no diagnoses, so 153 - 88 = 65 patients have at least one

[ 0  1  2  3  4  5  6  7  9 10 12 13 18]
[88 27  4  5  6  6  7  2  2  2  1  2  1]


In [58]:
#mean diagnoses among patients with at least one diagnosis (153 - 88 = 65)
sum(avgdiag)/65

3.9076923076923076

In [51]:
#mean diagnoses per patient
np.mean(avgdiag)

1.6601307189542485
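
As a cross-check, the summary figures can be recomputed from the printed distribution alone. This is a small sketch that re-enters the `unique` and `counts` arrays shown above by hand; the names `total_patients`, `total_diagnoses`, and `patients_with_dx` are introduced here for illustration.

```python
import numpy as np

# Distribution printed above: diagnoses per note, and how many notes had that many.
unique = np.array([0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 13, 18])
counts = np.array([88, 27, 4, 5, 6, 6, 7, 2, 2, 2, 1, 2, 1])

total_patients = int(counts.sum())               # every note, with or without diagnoses
total_diagnoses = int((unique * counts).sum())   # should match len(diagnoses)
patients_with_dx = int(counts[unique > 0].sum()) # notes with at least one diagnosis

print(total_patients, total_diagnoses, patients_with_dx)
print(round(total_diagnoses / total_patients, 2))    # mean over all patients
print(round(total_diagnoses / patients_with_dx, 2))  # mean over patients with diagnoses
```

The totals agree with the earlier cells: 153 notes, 254 diagnoses, and 65 patients with at least one diagnosis, giving means of about 1.66 and 3.91 respectively.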