## Simplify Note Texts using GPT-4o

**Goal of this notebook:**  
Use AI to extract ONLY the important components of the MIMIC-III NOTEEVENTS `TEXT` field. Removing details such as admission dates and medication dosages reduces the token count and simplifies the text _while_ keeping the relevant context and the original sentences, which would be lost with a parser such as the [mednlp](https://github.com/plandes/mednlp) Python package.

The extracted data will be used to...  
- Simplify the note text before attempting to code it. This should improve classification results.

**Requirements**  
- Set up an Azure OpenAI resource with a GPT-4o deployment, and ensure the [.env](./.env.sample) file is populated and up to date
- Set up an Azure Language resource in the same manner

In [1]:
from dotenv import load_dotenv, find_dotenv
from openai import AzureOpenAI
from textwrap import dedent
from fuzzywuzzy import process
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient, ExtractiveSummaryAction

import pandas as pd
import os
import tiktoken
from tqdm.notebook import tqdm

load_dotenv(find_dotenv(), override=True)

pd.set_option('display.max_colwidth', None)

#### Prepare Data

In [None]:
# Get medical coding data, take a small subsample

df = pd.read_csv("data/joined/dataset_single_notes_full.csv.gz").sample(1, replace=False, random_state=1234)
print(df.shape)
display(df.dtypes)
display(df.head(1))

In [8]:
# Helper Functions

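# Function to get token counts
# Note: cl100k_base is the GPT-4 / GPT-3.5 encoding. GPT-4o models use o200k_base,
# so these counts are approximate for a GPT-4o deployment.
def get_token_counts(text):
    encoding = tiktoken.get_encoding('cl100k_base')
    num_tokens = len(encoding.encode(text))
    return num_tokens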

#### Use AOAI to Simplify Note Text

In [9]:
aoai_client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_BASE"), 
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01"
)

In [10]:
# Function to build prompt

def build_prompt(note):
    sys = """
    Parse the following medical note. Return any sentences that relate to a diagnosis. Ignore other information such as patient name, dates, medicine types, dosage amounts, prior medical history, and prior treatment.
    DO NOT ADD ANY INFORMATION TO THE NOTE. ONLY RETURN THE RELEVANT SENTENCES. 
    """
    prompt = f"{note}"

    return (sys, prompt)

In [11]:
def aoai_extract(note):
    sys, prompt = build_prompt(note)
    response = aoai_client.chat.completions.create(
        model=os.getenv("AOAI_MAIN_DEPLOYMENT_NAME"), # model = "deployment_name".
        messages=[
            {"role": "system", "content": dedent(sys)},
            {"role": "user", "content": dedent(prompt)}
        ],
    )

    return response.choices[0].message.content


In [None]:
tqdm.pandas()
df["AOAI_EX"] = df["TEXT"].progress_apply(lambda x: aoai_extract(x))

##### Examine Results

In [None]:
# Text analysis - Is any information lost?
sub_df=df[['ICD9_CODE','TEXT','AOAI_EX']]
display(sub_df.iloc[[0]])

In [None]:
# TODO: Detect hallucinations - ensure that only sentences from the original notes are returned

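def validate_subset(original, extracted):
    # Normalize both texts: drop bracket/asterisk markers and newlines before comparing
    original = original.replace('[','').replace(']','').replace('*','').replace('\n',' ').strip()
    extracted = extracted.replace('[','').replace(']','').replace('*','').replace('\n',' ').strip()
    # Naive sentence split on periods; split the original once instead of on every iteration
    original_sentences = original.split('.')
    extracted_list = list(filter(None,extracted.split('.')))
    true_list = []
    falses = 0
    
    for sentence in extracted_list:
        # Best fuzzy match as (matched_sentence, score), with score in the range 0-100
        fuzzy_match = process.extractOne(sentence, original_sentences)
        # print(fuzzy_match)
        if fuzzy_match[1] > 85: # Tune this threshold accordingly
            true_list.append(sentence)
        else:
            falses = falses+1
            #print(f"False: {sentence}")

    # Returns (number of unmatched sentences, matched sentences rejoined into one string)
    return (falses, ". ".join(true_list))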

#Test
result = validate_subset(sub_df['TEXT'].iloc[0], sub_df['AOAI_EX'].iloc[0])
print(result)
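
The same idea can be sketched without the `fuzzywuzzy` dependency. The example below swaps in the standard library's `difflib.SequenceMatcher`, so its scores will not match `process.extractOne` exactly and the threshold would need re-tuning. The helper names (`split_sentences`, `best_ratio`, `flag_unsupported`) and the sample texts are hypothetical and only illustrate the check; this is a sketch, not a replacement for the cell above.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive period-based split; clinical abbreviations will over-split."""
    cleaned = re.sub(r"[\[\]*]", "", text).replace("\n", " ")
    return [s.strip() for s in cleaned.split(".") if s.strip()]


def best_ratio(sentence, candidates):
    """Highest similarity (0-100) between a sentence and any candidate sentence."""
    return max(
        (100 * SequenceMatcher(None, sentence.lower(), c.lower()).ratio()
         for c in candidates),
        default=0.0,
    )


def flag_unsupported(original, extracted, threshold=85):
    """Return extracted sentences with no close match in the original note."""
    source = split_sentences(original)
    return [s for s in split_sentences(extracted)
            if best_ratio(s, source) <= threshold]


# Made-up texts for illustration only (not MIMIC-III data)
original_note = "Patient admitted with chest pain. Troponin was elevated. Discharged on aspirin."
model_output = "Troponin was elevated. Patient has a history of diabetes mellitus."

unsupported = flag_unsupported(original_note, model_output)
print(unsupported)
```

Returning the unsupported sentences, rather than only a count, makes it easier to inspect what the model added when tuning the threshold.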

In [None]:
tqdm.pandas()
df["AOAI_EX_CLEAN"] = df.progress_apply(lambda x: validate_subset(x['TEXT'], x['AOAI_EX']), axis=1)
df["AOAI_MISMATCH"] = df["AOAI_EX_CLEAN"].apply(lambda x: x[0])
df["AOAI_EX_CLEAN"] = df["AOAI_EX_CLEAN"].apply(lambda x: x[1])


In [None]:
display(df[["ICD9_CODE","TEXT","AOAI_EX_CLEAN","AOAI_MISMATCH"]].head(1))

In [None]:
# Avg token count difference - How much efficiency is gained?
df["AOAI_TOKENS"] = df["AOAI_EX_CLEAN"].apply(lambda x: get_token_counts(x))
df["TEXT_TOKENS"] = df["TEXT"].apply(lambda x: get_token_counts(x))

print(f"Average Original Text tokens: {df['TEXT_TOKENS'].mean()}")
print(f"Average AOAI Extracted tokens: {df['AOAI_TOKENS'].mean()}")
print(f"Average token count difference: {(df['TEXT_TOKENS'] - df['AOAI_TOKENS']).mean()}")
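
The absolute difference above depends on note length, so a relative reduction can be easier to compare across notes. The sketch below computes the mean per-note fraction of tokens removed. The function name `mean_relative_reduction` and the token counts are hypothetical and are not results from this notebook.

```python
def mean_relative_reduction(original_counts, reduced_counts):
    """Mean per-note fraction of tokens removed (0.0 = nothing removed)."""
    fractions = [
        (orig - red) / orig
        for orig, red in zip(original_counts, reduced_counts)
        if orig > 0  # skip empty notes to avoid division by zero
    ]
    return sum(fractions) / len(fractions) if fractions else 0.0


# Illustrative numbers only
text_tokens = [2000, 1000, 500]
aoai_tokens = [500, 400, 250]

reduction = mean_relative_reduction(text_tokens, aoai_tokens)
print(f"Mean relative reduction: {reduction:.1%}")
```

With the DataFrame in this notebook, the call would presumably be `mean_relative_reduction(df["TEXT_TOKENS"], df["AOAI_TOKENS"])`.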

#### Use Azure AI Language _Extractive_ Summarization to Simplify Note Text

In some cases, in the interest of extreme risk aversion, using a GPT-based LLM approach may not be acceptable. One alternative is the Azure AI Language service. This implementation uses **extractive** summaries to simplify the note text. Unlike **abstractive** summaries, **extractive** summaries only use text from the original corpus and do not alter that text or create new text in any way.
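
To make the extractive idea concrete, here is a toy sketch: score each sentence, keep the top few, and return them unchanged in their original order (similar in spirit to limiting the sentence count and ordering by offset). The word-frequency scoring is an assumption chosen for simplicity and is not how the Azure AI Language service ranks sentences. `toy_extractive_summary` and the sample note are hypothetical.

```python
import re
from collections import Counter


def toy_extractive_summary(text, max_sentence_count=2):
    """Pick the highest-scoring sentences verbatim, returned in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequency over the whole text acts as a crude importance signal
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the top k, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentence_count])
    return [sentences[i] for i in keep]


# Made-up note for illustration only
note = "The patient reports chest pain. Chest pain worsens on exertion. Family visited today."
summary = toy_extractive_summary(note, max_sentence_count=2)
print(summary)
```

Because every returned sentence is copied from the input, the output cannot contain text that was not in the note, which is the property that makes extractive summaries attractive here.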

In [13]:
# Authenticate the client using your key and endpoint 
key = os.getenv("LANGUAGE_KEY")
endpoint = os.getenv("LANGUAGE_ENDPOINT")

client = TextAnalyticsClient(
        endpoint=endpoint, 
        credential=AzureKeyCredential(key)
        )

In [None]:
def extract_summary(client, text):

    poller = client.begin_analyze_actions(
        [text],
        actions=[
            ExtractiveSummaryAction(max_sentence_count=20, # Note: Tune this based on output results
                                    order_by="Offset",
                                    disable_service_logs=True,
                                    model_version="latest"),
        ],
    )
    extract_summary_results = poller.result()
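    for result in extract_summary_results:
        # One action was submitted, so the first entry is the extractive summary result
        extract_summary_result = result[0]
        if extract_summary_result.is_error:
            # Read the error from the per-document result, not from the enclosing list
            print("...Is an error with code '{}' and message '{}'".format(
                extract_summary_result.error.code, extract_summary_result.error.message
            ))
            return None
        else:
            # print("Summary extracted.")
            return " ".join([sentence.text for sentence in extract_summary_result.sentences])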
    
        
# Test
text = df['TEXT'].iloc[0]
summary = extract_summary(client, text)
print(summary)

In [None]:
tqdm.pandas()
df["AZ_LANG_EX"] = df["TEXT"].progress_apply(lambda x: extract_summary(client,x))

In [16]:
df["AZ_LANG_TOKENS"] = df["AZ_LANG_EX"].apply(lambda x: get_token_counts(x))
df["TEXT_TOKENS"] = df["TEXT"].apply(lambda x: get_token_counts(x))

##### Examine Results

In [None]:
# Text analysis - Is any information lost?
sub_df=df[['ICD9_CODE','TEXT','AZ_LANG_EX']]
display(sub_df.iloc[[0]])

In [None]:
# Avg token count difference - How much efficiency is gained?
print(f"Average Original Text tokens: {df['TEXT_TOKENS'].mean()}")
print(f"Average Azure Language Extracted tokens: {df['AZ_LANG_TOKENS'].mean()}")
print(f"Average token count difference: {(df['TEXT_TOKENS'] - df['AZ_LANG_TOKENS']).mean()}")

#### Observations

1. The Azure Language extractive summary is not a great fit for medical coding use cases. It does not rank symptom and diagnosis information highly, focusing instead on the human behavior and processes described in the note.

2. The LLM-based extract can be prompt-engineered to capture information that is much more relevant to medical coding. However, it takes additional work to ensure that the LLM does not hallucinate additional information. A rudimentary approach is demonstrated in this notebook, but further engineering may be required for a production system.