# Machine Learning Case Study
***
Cohere’s mission is to reduce complexity and waste in clinical care. To that end, we are seeking to build a flexible and expressive technology that codifies clinical rules and patient observations in order to automate clinical insurance approval. Often the most valuable clinical observation data is locked in clinical narratives. We want to apply creative and effective natural language processing techniques to clinical notes to gain a better understanding of their clinical content.
## Task
Using the data in [sampleclinicalnotes.zip](https://drive.google.com/file/d/1HFzT2bWkK9idNVbxMySAB8oNEAY0KYBZ/view?usp=sharing), you are expected to apply Graph Machine Learning techniques to uncover the common underlying factors for a given medical condition. Each txt file contains pertinent sections such as the Discharge Diagnosis, Chief Complaint, and History of Present Illness, which are the focal point of this exercise. Accompanying each file are annotations representing the output of a named entity recognition process. These should help complement the factors found during modeling.

Once you have related the underlying factors and conditions, choose one of the questions below to answer:
1. What are the differences between the graphs for male and female patients?
2. What variations are present between the graphs for patients of different ages?
3. How do the graphs vary for patients who are taking medications that they're allergic to vs those who are not?
4. Given a medication, provide a list of the most similar medications, based on associated underlying factors.

## Setup
The code below is meant to help you get started. Please feel free to modify it to meet your needs.
***

## Imports

In [1]:
import os
import re
import spacy
import scispacy
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [2]:
nlp = spacy.load("en_core_sci_md")

## Read Data Files

In [3]:
# Set file path for training data
data_files = 'Downloads/cohere_task/training_20180910'

# Get a list of all of the files provided
files = os.listdir(data_files)
len(files)

607

In [4]:
# Make a list of the text files only
text_files = [file for file in files if file.endswith('.txt')]
len(text_files)

303

In [5]:
# Make a list of the annotation files only
ann_files = [file for file in files if file.endswith('.ann')]
len(ann_files)

303
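
The starter code lists the annotation files but does not read them. The sketch below shows one way to parse them, under the assumption that the `.ann` files use the brat standoff layout (tab-separated lines such as `T1<TAB>Drug 100 107<TAB>aspirin`); the notebook does not state the format, so check a file before relying on this. The `Annotation` tuple, the `parse_ann` helper, and the sample labels are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # entity type assigned by the NER process
    spans: List[Tuple[int, int]]  # character offsets into the matching .txt file
    text: str                     # the annotated surface text


def parse_ann(content: str) -> List[Annotation]:
    """Parse the text-bound ("T") lines of a brat-style .ann file (assumed format)."""
    annotations = []
    for line in content.splitlines():
        # Relation, attribute and note lines start with other letters; skip them
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        ann_id, meta, text = parts[0], parts[1], parts[2]
        label, _, offsets = meta.partition(" ")
        # Discontinuous entities list several "start end" pairs separated by ";"
        spans = []
        for chunk in offsets.split(";"):
            start, end = chunk.split()
            spans.append((int(start), int(end)))
        annotations.append(Annotation(ann_id, label, spans, text))
    return annotations


# Made-up sample content, not taken from the dataset
sample = (
    "T1\tDrug 100 107\taspirin\n"
    "R1\tReason-Drug Arg1:T2 Arg2:T1\n"
    "T2\tReason 80 85;90 94\tchest pain\n"
)
parsed = parse_ann(sample)
```

To apply it to a real file, read the file contents with `open(...).read()` and pass the string to `parse_ann`.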

## Get Text by Section Headings

In [6]:
# A list of headings found in the text documents.
use_headings = ['SERVICE',
                'CHIEF COMPLAINT',
                'ADMISSION DATE',
                'ALLERGIES',
                'PAST MEDICAL HISTORY',
                'HISTORY OF PRESENT ILLNESS',
                'SOCIAL HISTORY',
                'DISCHARGE MEDICATIONS',
                'DISCHARGE DISPOSITION',
                'ATTENDING',
                'DISCHARGE DIAGNOSIS',
                'DISCHARGE CONDITION',
                'MEDICATIONS ON ADMISSION',
                'BRIEF HOSPITAL COURSE',
                'DISCHARGE INSTRUCTIONS',
                'FAMILY HISTORY',
                'MAJOR SURGICAL OR INVASIVE PROCEDURE',
                'PHYSICAL EXAM',
                'DATE OF BIRTH',
                'FOLLOWUP INSTRUCTIONS',
                'PERTINENT RESULTS']

print("You've specified",len(use_headings),"headings")

You've specified 21 headings


In [7]:
# This block of code goes through each text file line by line.
# It splits the text by heading and converts it to a dataframe with columns: File, Heading, Lines, Text and Entities.
# In the end you will have one row for each heading, with its text contents and the entities found in that text.
# There is also some cleanup to remove special characters and numbers.

# Initialize dataframe
text_df = pd.DataFrame()

# Regex pattern for keeping only alpha characters and removing all special characters and numbers
pattern = r'[^A-Za-z]+'


for file in text_files:
    # Read in text file
    with open(os.path.join(data_files, file)) as f:
        lines = f.readlines()
        
    # Initialize headings dictionary - will be used to create dataframe
    headings = {}
    # Setup headings list to keep track of headings we've used
    headings_list = []
    # Find the lines that contain headings, then add them to the headings dictionary and lists.
    for i, line in enumerate(lines):
        heading = line.split(":")[0]
        potential_heading = heading.upper()
        # Keep only the first occurrence of each recognised heading in the file
        if ":" in line and potential_heading in use_headings and potential_heading not in headings_list:
            headings[i] = {'File': file, 'Heading': heading}
            headings_list.append(potential_heading)

    # Overwrite headings list - to have the final list of headings found in the document
    headings_list = list(headings.keys())
    
    # For each heading, find the corresponding range of lines, grab the text for those lines, and get the entities
    for i, start in enumerate(headings_list):
        # A section runs up to the next heading, or to the end of the file for the last heading
        end = headings_list[i+1] if i < len(headings_list)-1 else len(lines)
        section = headings[start]
        section['Lines'] = [start, end]
        text = " ".join(lines[start:end]).replace(section['Heading']+":","").replace("\n","").strip()
        text = re.sub(pattern, ' ', text).strip()
        section['Text'] = text
        section['Entities'] = list(nlp(text).ents)
        section['Heading'] = section['Heading'].upper()
    
    # Update the dataframe with headings data found for the text file
    text_df = pd.concat([text_df,pd.DataFrame.from_dict(headings, orient='index')], ignore_index=True)

text_df.head()

| | File | Heading | Lines | Text | Entities |
|---|---|---|---|---|---|
| 0 | 110727.txt | ADMISSION DATE | [0, 2] | Discharge Date | [(Discharge)] |
| 1 | 110727.txt | DATE OF BIRTH | [2, 4] | Sex M | [(Sex, M)] |
| 2 | 110727.txt | SERVICE | [4, 6] | MEDICINE | [(MEDICINE)] |
| 3 | 110727.txt | ALLERGIES | [6, 9] | Keflex Orencia Remicade | [(Keflex), (Orencia, Remicade)] |
| 4 | 110727.txt | ATTENDING | [9, 10] | First Name LF | [(LF)] |
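
Questions 1 and 2 need each patient's sex and age. In the sample output above, the `DATE OF BIRTH` row contains the text `Sex M`, so the sex appears to be recoverable from that row. The age is harder: the cleanup regex removes all digits, so it would have to be read from the raw lines before cleanup. The sketch below is one possible approach, assuming ages are written in forms such as `52 yo`, `52 y/o` or `52-year-old`; that assumption has not been checked against the notes. `extract_sex` and `extract_age` are hypothetical helper names.

```python
import re

# Matches the "Sex M" / "Sex F" text left in the DATE OF BIRTH row after cleanup
SEX_PATTERN = re.compile(r"\bSex\s+([MF])\b")

# Assumed age spellings: "52 yo", "52yoM", "52 y/o", "52 y.o.", "52 year old", "52-year-old"
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})\s*-?\s*(?:years?\s*-?\s*old|y\.?\s*/?\s*o)",
    re.IGNORECASE,
)


def extract_sex(cleaned_text):
    """Return 'M' or 'F' from cleaned DATE OF BIRTH text, or None if absent."""
    match = SEX_PATTERN.search(cleaned_text)
    return match.group(1) if match else None


def extract_age(raw_text):
    """Return the first age-like number in raw (uncleaned) text, or None."""
    match = AGE_PATTERN.search(raw_text)
    return int(match.group(1)) if match else None


print(extract_sex("Sex M"))                        # M
print(extract_age("52 yo male with DM"))           # 52
print(extract_age("This is a 73-year-old woman"))  # 73
```

The resulting values can be attached to each file and used to split the notes into groups before building one graph per group.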


## Get the conditions
Choose which columns you want to include for conditions.

In [8]:
conditions_headings = ['DISCHARGE DIAGNOSIS', 'CHIEF COMPLAINT']
condition_df = text_df[text_df['Heading'].isin(conditions_headings)].pivot(index='File', columns='Heading', values=['Text', 'Entities'])

condition_df.head()

| File | Text: CHIEF COMPLAINT | Text: DISCHARGE DIAGNOSIS | Entities: CHIEF COMPLAINT | Entities: DISCHARGE DIAGNOSIS |
|---|---|---|---|---|
| 100035.txt | Post cardiac arrest asthma exacerbation | Anoxic Brain Injury s p PEA arrest x Status As... | [(Post, cardiac, arrest), (asthma, exacerbation)] | [(Anoxic), (Brain, Injury), (PEA, arrest), (St... |
| 100039.txt | Abdominal Pain | Primary Abdominal Pain Acute on chronic renal ... | [(Abdominal, Pain)] | [(Primary, Abdominal, Pain), (Acute), (chronic... |
| 100187.txt | SOB | Primary Pulmonary Embolism with history of DVT... | [(SOB)] | [(Primary, Pulmonary, Embolism), (history), (D... |
| 100229.txt | Hypotension with elevated lactate code sepsis | Primary Sepsis Shock liver Heparin induced thr... | [(Hypotension), (elevated, lactate, code, seps... | [(Primary, Sepsis), (Shock, liver, Heparin), (... |
| 100564.txt | SVC thrombosis | Deep Vein Thrombosis of subclavian vein Rectal... | [(SVC, thrombosis)] | [(Deep, Vein), (Thrombosis), (subclavian, vein... |
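
The entity lists above contain spaCy spans with mixed casing, section labels such as `Primary` fused onto the diagnosis, and generic terms such as `Acute` or `history`. Before using them as graph nodes it may help to normalise them to plain strings. The sketch below is one possible cleanup; the `QUALIFIERS` and `GENERIC` word lists and the `normalise_entities` helper are hypothetical and would need tuning against the real notes.

```python
# Hypothetical word lists, chosen from the sample output above
QUALIFIERS = {"primary", "secondary"}          # section labels fused onto a diagnosis
GENERIC = {"acute", "history", "patient", "year", "yo", "pmh", "pt"}


def normalise_entities(entities):
    """Lowercase entity strings, strip leading qualifiers, drop generic and duplicate terms.

    `entities` is an iterable of strings, e.g. [ent.text for ent in row_entities].
    Order of first appearance is preserved.
    """
    seen = []
    for ent in entities:
        tokens = ent.lower().split()
        # Remove leading section labels such as "primary"
        while tokens and tokens[0] in QUALIFIERS:
            tokens.pop(0)
        label = " ".join(tokens)
        if label and label not in GENERIC and label not in seen:
            seen.append(label)
    return seen


example = ["Primary Abdominal Pain", "Acute", "Abdominal Pain", "history", "SOB"]
print(normalise_entities(example))  # ['abdominal pain', 'sob']
```

Working with strings rather than span objects also makes it straightforward to compare entities across different notes.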


## Get the underlying factors
Choose which columns you want to include for underlying factors.

In [9]:
factors_headings = ['HISTORY OF PRESENT ILLNESS']
factor_df = text_df[text_df['Heading'].isin(factors_headings)].pivot(index='File', columns='Heading', values=['Text', 'Entities'])
factor_df

| File | Text: HISTORY OF PRESENT ILLNESS | Entities: HISTORY OF PRESENT ILLNESS |
|---|---|---|
| 100035.txt | Mr Known lastname is a year old gentleman with... | [(year), (gentleman), (PMH), (signifciant), (d... |
| 100039.txt | yo F w h o ALL in remission s p cord transplan... | [(yo), (ALL), (remission), (cord, transplant),... |
| 100187.txt | yo woman w h o recurrent PEs s Initials NamePa... | [(yo), (woman), (recurrent), (PEs), (Initials)... |
| 100229.txt | yoM PMH ESRD secondary to Brights disease on H... | [(yoM), (PMH), (ESRD), (secondary, to, Brights... |
| 100564.txt | yo male with hx of rectal CA DMII and histopla... | [(yo), (male), (rectal, CA), (DMII), (histopla... |
| ... | ... | ... |
| 195689.txt | Pt is a y o African American gentleman who has... | [(Pt), (African, American), (gentleman), (medi... |
| 195784.txt | The patient is a year old male with a history ... | [(patient), (year), (male), (history), (hepati... |
| 196798.txt | This is a year old woman with history of CAD C... | [(year), (woman), (history), (CAD), (CHF), (co... |
| 197869.txt | yo male with a history of DM CAD s p CABG peri... | [(yo), (male), (history), (DM), (CAD), (CABG, ... |
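
With conditions and factors available per file, one simple way to relate them is a weighted bipartite co-occurrence graph: an edge links a condition to a factor, and its weight counts the notes in which both appear. The sketch below uses plain dictionaries and made-up entity strings; it is only a starting point under those assumptions, not a prescribed method, and `build_cooccurrence_graph` and `top_factors` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def build_cooccurrence_graph(records):
    """Build graph[condition][factor] = number of notes mentioning both.

    `records` is an iterable of (conditions, factors) pairs, one pair per note,
    where each element is a list of entity strings.
    """
    graph = defaultdict(Counter)
    for conditions, factors in records:
        # Sets make sure each pair is counted at most once per note
        for condition in set(conditions):
            for factor in set(factors):
                graph[condition][factor] += 1
    return graph


def top_factors(graph, condition, n=3):
    """Return the n factors most often seen alongside a condition."""
    return graph.get(condition, Counter()).most_common(n)


# Made-up records for illustration, not taken from the dataset
records = [
    (["sepsis"], ["esrd", "hypotension", "fever"]),
    (["sepsis", "pneumonia"], ["fever", "cough"]),
    (["pulmonary embolism"], ["dvt", "shortness of breath"]),
]
graph = build_cooccurrence_graph(records)
print(top_factors(graph, "sepsis", n=1))  # [('fever', 2)]
```

Building one such graph per patient group (for example by sex or age band) gives a direct way to compare edge weights between groups, and the nested dictionary can be loaded into a graph library if richer graph algorithms are needed.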


***
***
# Your Code
***
***