![electronic_medical_records](./images/electronic_medical_records.png)

## Brief

Medical professionals often document patient encounters in natural language transcripts, detailing symptoms, diagnoses, and treatments. While these transcripts are essential for medical documentation, including insurance claims, extracting key information from them can be difficult due to the dense medical content.

To address this challenge, your team at Lakeside Healthcare Network aims to use the OpenAI API to automatically extract relevant medical data from these transcripts and match it with the corresponding ICD-10 codes. ICD-10 is a globally recognised coding system used for diagnosis and billing, particularly in insurance claim processing.

## The Data
The dataset contains anonymised medical transcriptions categorised by specialty.

## transcriptions.csv
| Column     | Description              |
|------------|--------------------------|
| `"medical_specialty"` | The medical specialty associated with each transcription.  |
| `"transcription"` | Detailed medical transcription texts, with insights into the medical case. |

In [None]:
# Import the necessary libraries
import pandas as pd
import uuid
import json
import os

from tabulate import tabulate
from openai import OpenAI
from dotenv import load_dotenv

In [2]:
# Load the data
transcriptions_df = pd.read_csv("data/transcriptions.csv")
transcriptions_df.head()

Unnamed: 0,medical_specialty,transcription
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Orthopedic,"CHIEF COMPLAINT:, Achilles ruptured tendon.,H..."
2,Bariatrics,"PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST..."
3,Cardiovascular / Pulmonary,"PREOPERATIVE DIAGNOSES,Airway obstruction seco..."
4,Urology,"CHIEF COMPLAINT:, Urinary retention.,HISTORY ..."


In [3]:
# Overview of the dataset's structure and data types

transcriptions_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   medical_specialty  5 non-null      object
 1   transcription      5 non-null      object
dtypes: object(2)
memory usage: 208.0+ bytes


### Define the OpenAI Client

In [4]:
# Define the model to use
model = "gpt-4o-mini"

load_dotenv()

# Define the client
api_key = os.getenv("OPENAI_API_KEY")
organization_id = os.getenv('OPENAI_ORG_ID')
project_id = os.getenv('OPENAI_PROJ_ID')

client = OpenAI(
    api_key=api_key,
    organization=organization_id,
    project=project_id,
)

### Define the `system_prompt` content to set the context for the AI medical assistant

In [None]:
# Base guideline for the AI-powered medical assistant chatbot
base_system_prompt = """
Act as a healthcare professional.
Your task is to extract the 'age' and 'recommended treatment' from a given medical record transcript. Follow these strict rules:
- Always extract the age as a standalone number, with no additional text, units, or formatting (e.g., 23).
- If the age is mentioned in formats like 'XX-year-old', extract only the number 'XX'.
- If the age is given indirectly (e.g., birth year), convert it to the current age using the current year.
- If the age cannot be determined, return 'Not Found'.

For the 'recommended treatment', extract the treatment description verbatim. If no treatment is mentioned, return 'Not Found'.
"""

# Response guidelines
response_guidelines = """
Response guidelines:
- Format your response as: 'Age: [age], Recommended Treatment: [treatment]'.
- Only provide direct responses as per the extraction rules.
"""

# Example text to guide the extraction of age-related information
example = """
Examples of age extraction:
- Input: '23-year-old'
  Output: '23'
- Input: 'born in 1980'
  Output: '44' (assuming the current year is 2024)
- Input: 'Patient's age at diagnosis: 35 years'
  Output: '35'
- Input: 'The patient is in their mid-30s'
  Output: 'Not Found'
"""

# Combine all parts to form the final system prompt
system_prompt = base_system_prompt + response_guidelines + example

# Define the conversation flow for the chatbot system
# Initialise the conversation with a system message that sets the context
def generate_response_messages(transcription):
    return [{"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Return the age and recommended treatment for the patient from the body of the following transcription: {transcription}."}]
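The prompt above asks the model to convert a birth year "using the current year", but the only year it mentions is the hard-coded 2024 in the example. One possible way to keep the two in sync is to generate that part of the prompt at run time. The sketch below is an illustration only: `build_year_context` is a hypothetical helper, not part of the original notebook, and the age it computes is a simple year difference, as in the original example.

```python
from datetime import date
from typing import Optional


def build_year_context(birth_year: int = 1980, current_year: Optional[int] = None) -> str:
    """Return a prompt fragment stating the current year plus a matching worked example.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    # Fall back to today's year when no explicit year is supplied
    year = current_year if current_year is not None else date.today().year
    age = year - birth_year  # year difference only; ignores whether the birthday has passed
    return (
        f"The current year is {year}.\n"
        f"- Input: 'born in {birth_year}'\n"
        f"  Output: '{age}'\n"
    )


# Fixed year here so the printed fragment matches the example in the prompt
print(build_year_context(current_year=2024))
```

The returned string could be appended to `system_prompt` in place of the fixed "born in 1980" example.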

### Define the `function_definition` with its parameters and descriptions

In [6]:
function_definition = [
    {
        'type': 'function',
        'function': {
            'name': 'extract_medical_info',
            'description': 'Get the age and recommended treatment from the input text. Always return both age and recommended treatment. If any of the fields is missing in the transcript, return Not Found.',
            'parameters': {
                'type': 'object',
                'properties': {
                    'age': {
                        'type': 'string',
                        'description': 'Age of the patient'
                    },
                    'recommended_treatment': {
                        'type': 'string',
                        'description': 'The recommended treatment for the patient'
                    }
                },
                # Mark both fields as required, matching the description above
                'required': ['age', 'recommended_treatment']
            }
        }
    }
]

### Define a function to retrieve AI responses

In [None]:

# Generate a unique ID to pass as the `user` identifier on each request
unique_id = str(uuid.uuid4())

# Create a response using the AI assistant
def get_response(transcription):
    """Extract age and recommended treatment; return None if no valid tool call is produced."""
    response = client.chat.completions.create(
        model=model,
        messages=generate_response_messages(transcription),
        tools=function_definition,
        user=unique_id,
    )
    # Check if the response indicates a tool call
    if response.choices[0].finish_reason == 'tool_calls':
        # Extract the function call details
        function_call = response.choices[0].message.tool_calls[0].function
        # Check the function name
        if function_call.name == 'extract_medical_info':
            return json.loads(function_call.arguments)
        print("I am sorry, but I could not extract the medical info.")
    else:
        print("I am sorry, but I could not understand your request.")
    return None
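The tool-call arguments arrive as a JSON string, so a field can be absent or the string can be malformed. As a rough sketch of a more defensive parser, the helper below reads `message.tool_calls` directly and fills any missing field with 'Not Found'. `parse_extraction` and `fake_response` are hypothetical names introduced here for illustration; the stand-in object only mimics the attributes the notebook already accesses (`choices[0].message.tool_calls[0].function.name` and `.arguments`).

```python
import json
from types import SimpleNamespace

EXPECTED_FIELDS = ("age", "recommended_treatment")


def parse_extraction(response, function_name="extract_medical_info"):
    """Return the tool-call arguments as a dict, using 'Not Found' for absent fields."""
    message = response.choices[0].message
    for call in message.tool_calls or []:
        if call.function.name != function_name:
            continue
        try:
            arguments = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            arguments = {}
        if not isinstance(arguments, dict):
            arguments = {}
        return {field: str(arguments.get(field, "Not Found")) for field in EXPECTED_FIELDS}
    # No matching tool call: report every field as missing
    return {field: "Not Found" for field in EXPECTED_FIELDS}


def fake_response(arguments, name="extract_medical_info"):
    """Build a stand-in response object for offline experiments (assumed shape)."""
    function = SimpleNamespace(name=name, arguments=arguments)
    message = SimpleNamespace(tool_calls=[SimpleNamespace(function=function)])
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


print(parse_extraction(fake_response('{"age": "23"}')))
```

Returning a complete dictionary in every case means downstream code can index `"recommended_treatment"` without first checking for `None`.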
    

In [None]:
# Define the messages argument for the `match_icd_codes` function
def generate_icd_messages_input(treatment):
    example_structure = """
    1. ICD Code: XXX. Description: XXX.
    2. ICD Code: XXX. Description: XXX.
    3. ICD Code: XXX. Description: XXX.
    """
    
    content = f"""Provide the ICD codes for the following treatment or procedure: {treatment}. 
    Return the answer as a list of codes with a concise corresponding definition with the format:
    {example_structure}
    """
    
    return [{"role": "user", "content": content}]

In [None]:
# Define function to match a treatment or procedure to ICD codes
def match_icd_codes(treatment):
    """Retrieves ICD codes for a given treatment using OpenAI."""
    response = client.chat.completions.create(
        model=model,
        messages=generate_icd_messages_input(treatment),
        temperature=0.2,
        user=unique_id
    )
    return response.choices[0].message.content
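`match_icd_codes` returns the model's answer as free text, which may wrap the requested list in extra sentences. If structured codes are needed later, one option is to pull them out with a regular expression that follows the "ICD Code: XXX. Description: XXX." format requested in the prompt. This is a minimal sketch: `parse_icd_response` is a hypothetical helper, the sample text is abridged from the kind of answer the model returns, and the pattern does not validate that a code is a real ICD-10 entry.

```python
import re

# Matches lines shaped like "1. ICD Code: J30.9. Description: Allergic rhinitis, unspecified."
ICD_LINE = re.compile(r"ICD Code:\s*([A-Za-z0-9][A-Za-z0-9.]*?)\.\s*Description:\s*(.+)")


def parse_icd_response(text):
    """Return (code, description) pairs found in the model's free-text answer."""
    pairs = []
    for code, description in ICD_LINE.findall(text):
        # Drop the trailing full stop that ends each description line
        pairs.append((code.upper(), description.strip().rstrip(".")))
    return pairs


sample = (
    "Here are the relevant ICD codes:\n\n"
    "1. ICD Code: J30.9. Description: Allergic rhinitis, unspecified.\n"
    "2. ICD Code: Z79.899. Description: Other long term (current) drug therapy.\n\n"
    "These codes reflect the conditions being treated."
)

for code, description in parse_icd_response(sample):
    print(code, "->", description)
```

Text that does not follow the requested format is simply skipped, so an empty list signals that the answer needs manual review.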
    

In [10]:
# Start an empty list to store processed data
processed_data = []

for index, row in transcriptions_df.iterrows():
    transcription = row['transcription']
    medical_specialty = row['medical_specialty']

    try:
        response_data = get_response(transcription)

        # get_response returns None when no usable tool call came back
        if response_data is None:
            print(f"Skipping row {index}: no medical info extracted.")
            continue

        recommended_treatment = response_data["recommended_treatment"]

        # Retrieve the ICD codes
        icd_code = match_icd_codes(recommended_treatment)

        response_data["medical_specialty"] = medical_specialty.replace('/', 'or')
        response_data["icd_code"] = icd_code

        print(json.dumps(response_data, indent=2))

        # Append the extracted information as a new row in the list
        processed_data.append(response_data)
    except Exception as e:
        # Report the failing row and continue with the remaining rows
        print(f"An error occurred while processing row {index}: {e}")


{
  "age": "23",
  "recommended_treatment": "She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well.",
  "medical_specialty": "Allergy or Immunology",
  "icd_code": "Here are the relevant ICD codes based on the treatment and procedures described:\n\n1. ICD Code: J30.9. Description: Allergic rhinitis, unspecified.\n2. ICD Code: J45.909. Description: Unspecified asthma, uncomplicated.\n3. ICD Code: Z79.899. Description: Other long term (current) drug therapy.\n\nThese codes reflect the conditions being treated (allergic rhinitis and possibly asthma) and the ongoing medication management."
}
{
  "age": "41",
  "recommended_treatment": "operatiive fixation, nonweightbearing in a splint for 10 days, nonweightbearing in a dynamic brace for 4 weeks, a walking boot for another six weeks with a lift until three months postop when we can get him into a shoe wit

## Solution

In [11]:
# Convert the list to a DataFrame
df_structured = pd.DataFrame(processed_data)

# Define the maximum width for the truncated text columns
max_width = 50

# Truncate the text in the columns
df_structured['recommended_treatment'] = df_structured['recommended_treatment'].apply(
    lambda x: x if len(x) <= max_width else x[:max_width] + '...')

df_structured['icd_code'] = df_structured['icd_code'].apply(
    lambda x: x if len(x) <= max_width else x[:max_width] + '...')

# Format the new structured DataFrame with tabulate (psql style) for better readability
formatted_table = tabulate(df_structured, headers='keys', tablefmt='psql')

print("new structured DataFrame:")
print(formatted_table)

new structured DataFrame:
+----+-------+-------------------------------------------------------+-----------------------------+-------------------------------------------------------+
|    |   age | recommended_treatment                                 | medical_specialty           | icd_code                                              |
|----+-------+-------------------------------------------------------+-----------------------------+-------------------------------------------------------|
|  0 |    23 | She will try Zyrtec instead of Allegra again. Anot... | Allergy or Immunology       | Here are the relevant ICD codes based on the treat... |
|  1 |    41 | operatiive fixation, nonweightbearing in a splint ... | Orthopedic                  | Here are the ICD codes that correspond to the trea... |
|  2 |    30 | Laparoscopic antecolic antegastric Roux-en-Y gastr... | Bariatrics                  | Here are the ICD codes related to the procedure yo... |
|  3 |    50 | Neck exploration;