## Simplify Note Texts using GPT-4o

**Goal of this notebook:**  
Use GPT-4o to extract ONLY the important components of the MIMIC-III NOTEEVENTS `TEXT` field. Removing details such as admission dates and medication dosages reduces the token count and simplifies the text _while_ keeping the relevant context and the original sentences. Both would be lost with a parser such as [Azure Text Analytics for Health](./01_az_text_analytics.ipynb) or the [mednlp](https://github.com/plandes/mednlp) Python package.

The extracted data will be used to:  
- Simplify the note text before attempting to assign medical codes. This is expected to improve classification results.

**Requirements**  
- Set up an Azure OpenAI resource with a GPT-4o deployment, and ensure the [.env](./.env.sample) file is populated and up to date.

In [2]:
from dotenv import load_dotenv, find_dotenv
from openai import AzureOpenAI
from textwrap import dedent

import pandas as pd
import os
import tiktoken

load_dotenv(find_dotenv(), override=True)

pd.set_option('display.max_colwidth', None)

#### Prepare Data

In [None]:
# Get medical coding data, take a small subsample

df = pd.read_csv("data/joined/dataset_single_notes_full.csv.gz").sample(5, replace=False, random_state=1234)
print(df.shape)
display(df.dtypes)
display(df.head(1))

#### Use AOAI to Simplify Note Text

In [5]:
aoai_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_BASE"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01"
)

In [6]:
# Function to get token counts

def get_token_counts(text):
    # GPT-4o uses the o200k_base encoding; cl100k_base is the GPT-4 / GPT-3.5 tokenizer
    encoding = tiktoken.get_encoding('o200k_base')
    return len(encoding.encode(text))

In [7]:
# Function to build prompt

def build_prompt(note):
    sys = """
    Parse the following medical note. Return any sentences that relate to a diagnosis. Ignore other information such as patient name, dates, medicine types, dosage amounts, prior medical history, prior treatment, etc.
    DO NOT ADD ANY INFORMATION TO THE NOTE. ONLY RETURN THE RELEVANT SENTENCES. 
    """
    prompt = str(note)  # the user message is the raw note text

    return (sys, prompt)

In [8]:
def aoai_extract(note):
    sys, prompt = build_prompt(note)
    response = aoai_client.chat.completions.create(
        model=os.getenv("AOAI_MAIN_DEPLOYMENT_NAME"),  # for Azure OpenAI, "model" is the deployment name
        messages=[
            {"role": "system", "content": dedent(sys)},
            {"role": "user", "content": dedent(prompt)}
        ],
    )

    return response.choices[0].message.content


In [9]:
df["AOAI_EX"] = df["TEXT"].apply(aoai_extract)

In [10]:
df["AOAI_TOKENS"] = df["AOAI_EX"].apply(get_token_counts)
df["TEXT_TOKENS"] = df["TEXT"].apply(get_token_counts)

#### Examine Results

In [None]:
# Avg token count difference - How much efficiency is gained?
print(f"Average Original Text tokens: {df['TEXT_TOKENS'].mean()}")
print(f"Average AOAI Extracted tokens: {df['AOAI_TOKENS'].mean()}")
print(f"Average token count difference: {(df['TEXT_TOKENS'] - df['AOAI_TOKENS']).mean()}")
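
The absolute difference above depends on note length, so a relative figure can be easier to compare across notes. The following is a minimal sketch, not part of the original notebook: `token_reduction_pct` is a hypothetical helper, and the toy counts stand in for the `TEXT_TOKENS` and `AOAI_TOKENS` columns.

```python
def token_reduction_pct(original_tokens: int, extracted_tokens: int) -> float:
    """Percentage of tokens removed by the extraction step for one note."""
    if original_tokens <= 0:
        # Avoid division by zero for empty notes (assumed convention: no reduction)
        return 0.0
    return (original_tokens - extracted_tokens) / original_tokens * 100.0


# Toy (original, extracted) token counts; these are made-up values, not results
toy_counts = [(2000, 500), (1500, 300), (800, 400)]

reductions = [token_reduction_pct(orig, ext) for orig, ext in toy_counts]
mean_reduction = sum(reductions) / len(reductions)

print(f"Per-note reduction (%): {reductions}")
print(f"Average reduction (%): {mean_reduction:.1f}")
```

With the notebook's DataFrame, the same helper could be applied row-wise, for example `df.apply(lambda r: token_reduction_pct(r["TEXT_TOKENS"], r["AOAI_TOKENS"]), axis=1)`.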

In [12]:
# Text analysis - Is any information lost?
sub_df = df[['ICD9_CODE', 'TEXT', 'AOAI_EX']]

In [None]:
# Example 0
display(sub_df.iloc[[0]])
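
Because the system prompt asks the model to return only sentences from the note and to add nothing, one rough check alongside manual inspection is to flag extracted sentences that do not appear in the original text. This is a sketch under stated assumptions, not the notebook's method: `normalize`, `split_sentences`, and `unsupported_sentences` are hypothetical helpers, and the regex sentence splitter is naive (abbreviations such as "Dr." will split incorrectly).

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def unsupported_sentences(original: str, extracted: str) -> List[str]:
    """Return extracted sentences that are not found verbatim in the original note."""
    haystack = normalize(original)
    return [s for s in split_sentences(extracted) if normalize(s) not in haystack]


# Made-up example texts, for illustration only
original_note = "Patient admitted on 2101-01-01.\nDiagnosis: acute\nrenal failure. Given 40 mg lasix."
extracted_note = "Diagnosis: acute renal failure. Patient has sepsis."

flagged = unsupported_sentences(original_note, extracted_note)
print(flagged)
```

In the notebook this could be run per row on `TEXT` and `AOAI_EX`. A flagged sentence is not necessarily an error, since the model may rephrase or trim a sentence slightly, but it marks where to look first.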

---
### Detailed Analysis (TODO)