![electronic_medical_records](electronic_medical_records.png)

Medical professionals often summarize patient encounters in transcripts written in natural language, which include details about symptoms, diagnoses, and treatments. These transcripts can feed other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.

You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to automatically extract medical information from these transcripts and match it with the appropriate ICD-10 codes. ICD-10 is a standardized classification system used worldwide to code diagnoses, including for billing purposes such as insurance claims processing.

## The Data
The dataset contains anonymized medical transcriptions categorized by specialty.

## transcriptions.csv
| Column     | Description              |
|------------|--------------------------|
| `"medical_specialty"` | The medical specialty associated with each transcription.  |
| `"transcription"` | Detailed medical transcription texts, with insights into the medical case. |

In [58]:
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json

In [59]:
# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()

|   | medical_specialty | transcription |
|---|-------------------|---------------|
| 0 | Allergy / Immunology | SUBJECTIVE:, This 23-year-old white female pr... |
| 1 | Orthopedic | CHIEF COMPLAINT:, Achilles ruptured tendon.,H... |
| 2 | Bariatrics | PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST... |
| 3 | Cardiovascular / Pulmonary | PREOPERATIVE DIAGNOSES,Airway obstruction seco... |
| 4 | Urology | CHIEF COMPLAINT:, Urinary retention.,HISTORY ... |


In [60]:
# Initialize the OpenAI client
client = OpenAI()

### Define the Schema for GPT Function Calling

In [61]:
# Schema describing the structured output the model must return.
# The specialty key is spelled "medical_speciality"; extract_data() reads it under that name.
function_definition = {
    "name": "extract_patient_data",
    "description": "Extract patient age, medical specialty, and recommended treatment.",
    "parameters": {
        "type": "object",
        "properties": {
            "age": {"type": "string", "description": "Age of the patient"},
            "medical_speciality": {"type": "string", "description": "Medical specialty"},
            "recommendation": {"type": "string", "description": "Recommended treatment or procedure"},
        },
        # "required" sits inside "parameters" and names keys defined in "properties"
        "required": ["age", "medical_speciality", "recommendation"],
    },
}
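A schema like this is easy to get subtly wrong, for example by listing a required name that has no matching property. The following is a minimal sketch of a sanity check for that case; the helper `missing_required_keys` and the `example_schema` dictionary are hypothetical names introduced only for illustration and are not part of the OpenAI library.

```python
def missing_required_keys(function_schema: dict) -> list[str]:
    """Return names listed under "required" that are not defined in "properties"."""
    parameters = function_schema.get("parameters", {})
    properties = parameters.get("properties", {})
    required = parameters.get("required", [])
    return [name for name in required if name not in properties]


# Hypothetical schema used only to demonstrate the check
example_schema = {
    "name": "extract_patient_data",
    "parameters": {
        "type": "object",
        "properties": {
            "age": {"type": "string"},
            "recommendation": {"type": "string"},
        },
        # "recommended_treatment" has no matching property above
        "required": ["age", "recommended_treatment"],
    },
}

print(missing_required_keys(example_schema))  # ['recommended_treatment']
```

Running the same check on `function_definition` before sending any request should return an empty list if the schema is consistent.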

### Define GPT-based Extraction Function

In [62]:
import json

def extract_data(transcription):
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful medical assistant."},
                {"role": "user", "content": f"""
You will receive a medical transcription. Extract and return the following:

1. Age of the patient (as a number)
2. Medical Specialty (e.g., Allergy, Cardiology, Orthopedic)
3. Recommended treatment or procedure (if any, from the plan section)

ONLY return this information using the schema provided to you.

Transcription:
{transcription}
"""}
            ],
            functions=[function_definition],
            function_call={"name": "extract_patient_data"},
            temperature=0.2
        )
        
        # Read the function-call arguments once, then log and parse them
        arguments = response.choices[0].message.function_call.arguments
        print(f"\n[DEBUG] Function call arguments:\n{arguments}\n")

        parsed = json.loads(arguments)
        return {
            "age": parsed.get("age"),
            # The schema key is spelled "medical_speciality"; map it to "medical_specialty"
            "medical_specialty": parsed.get("medical_speciality"),
            # The schema key is "recommendation"
            "recommended_treatment": parsed.get("recommendation"),
        }

    except Exception as e:
        # Report the failure and fall back to empty fields
        print(f"[EXTRACTION ERROR] {e}")
        return {
            "age": None,
            "medical_specialty": None,
            "recommended_treatment": None
        }
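The prompt asks for the age as a number, while the schema declares it as a string, so the returned value may need normalizing. Below is a minimal sketch of one way to parse the function-call arguments defensively and coerce the age to an integer. The helper name `normalize_extraction` is an assumption introduced here, and the sample string mirrors the shape of the debug output rather than a live API response.

```python
import json
import re


def normalize_extraction(arguments: str) -> dict:
    """Parse function-call arguments and coerce the age to an integer when possible."""
    try:
        parsed = json.loads(arguments)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    # Accept values such as "41" or "41-year-old" by taking the first run of digits
    age_text = str(parsed.get("age") or "")
    match = re.search(r"\d+", age_text)

    return {
        "age": int(match.group()) if match else None,
        "medical_specialty": parsed.get("medical_speciality"),
        "recommended_treatment": parsed.get("recommendation"),
    }


sample = '{"age":"41","medical_speciality":"Orthopedic","recommendation":"operative fixation"}'
print(normalize_extraction(sample))
```

Malformed JSON yields a dictionary of `None` values instead of raising, which matches the fallback that `extract_data` already returns on errors.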

### Define ICD-10 Matching Using GPT

In [63]:
def get_icd_code(treatment):
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a medical coder."},
                {"role": "user", "content": (
                    "Give the most appropriate ICD-10 code and condition for the "
                    f"following treatment or procedure:\n\n{treatment}\n\n"
                    "Only return in this format:\n"
                    "ICD-10 Code: <code>\nDescription: <condition>"
                )},
            ],
            temperature=0.2
        )
        return response.choices[0].message.content.strip()

    except Exception as e:
        print(f"Error fetching ICD code: {e}")
        return None
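Because `get_icd_code` returns a single free-text string, it can help to split it into separate code and description fields. This is a minimal sketch, assuming the model follows the requested `ICD-10 Code: ... / Description: ...` format. The helper `parse_icd_response` and the sample reply are hypothetical and do not come from the dataset.

```python
import re

# Matches the two-line format requested in the prompt
ICD_PATTERN = re.compile(
    r"ICD-10 Code:\s*(?P<code>\S+)\s*Description:\s*(?P<description>.+)",
    re.DOTALL,
)


def parse_icd_response(text):
    """Split a formatted reply into code and description; return None fields if it does not match."""
    empty = {"icd_code": None, "icd_description": None}
    if not text:
        return empty
    match = ICD_PATTERN.search(text)
    if match is None:
        return empty
    return {
        "icd_code": match.group("code"),
        "icd_description": match.group("description").strip(),
    }


# Hypothetical reply in the requested format
sample_reply = "ICD-10 Code: E11.9 \nDescription: Type 2 diabetes mellitus without complications"
print(parse_icd_response(sample_reply))
```

A reply that ignores the format produces `None` for both fields, so unexpected model output is easy to spot downstream.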

### Apply to the DataFrame

In [64]:
# Initialize empty list to collect results
structured_data = []

for index, row in df.iterrows():
    transcription = row["transcription"]
    specialty = row["medical_specialty"]

    extracted = extract_data(transcription)
    treatment = extracted.get("recommended_treatment")

    if treatment:
        icd_info = get_icd_code(treatment)
    else:
        icd_info = None
        print(f"[WARNING] Missing treatment in row {index}. Skipping ICD lookup.")

    structured_data.append({
        "age": extracted.get("age"),
        "medical_specialty": extracted.get("medical_specialty"),
        "recommended_treatment": treatment,
        "icd_code": icd_info
    })


[DEBUG] Function call arguments:
{"age":"23","medical_speciality":"Allergy","recommendation":"She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well."}


[DEBUG] Function call arguments:
{"age":"41","medical_speciality":"Orthopedic","recommendation":"operative fixation"}


[DEBUG] Function call arguments:
{"age":"30","medical_speciality":"Bariatric Surgery","recommendation":"Laparoscopic antecolic antegastric Roux-en-Y gastric bypass with EEA anastomosis."}


[DEBUG] Function call arguments:
{"age":"50","medical_speciality":"Laryngology and Thoracic Surgery","recommendation":"Placement of #8 Shiley single cannula tracheostomy tube."}


[DEBUG] Function call arguments:
{"age":"66","medical_speciality":"Urology","recommendation":"Self-catheterize in the event that he does incur urinary retention again; prescription for 6 months of Flomax and Proscar."}
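The loop above calls the API twice per row, which makes it slow and costly to rerun while debugging. As a rough sketch, the same control flow can be wrapped in a function that takes the extraction and lookup steps as arguments, so it can be exercised offline with stand-ins. The names `build_records`, `fake_extract`, and `fake_lookup` are hypothetical and exist only for this illustration.

```python
def build_records(transcriptions, extract, lookup_icd):
    """Run extraction and ICD lookup over an iterable of transcription texts."""
    records = []
    for position, text in enumerate(transcriptions):
        extracted = extract(text)
        treatment = extracted.get("recommended_treatment")

        # Only look up an ICD code when a treatment was extracted
        if treatment:
            icd_info = lookup_icd(treatment)
        else:
            icd_info = None
            print(f"[WARNING] Missing treatment in row {position}. Skipping ICD lookup.")

        records.append({
            "age": extracted.get("age"),
            "medical_specialty": extracted.get("medical_specialty"),
            "recommended_treatment": treatment,
            "icd_code": icd_info,
        })
    return records


# Stand-in functions (hypothetical) that avoid any API call
def fake_extract(text):
    if "plan" in text.lower():
        return {"age": "41", "medical_specialty": "Orthopedic",
                "recommended_treatment": "operative fixation"}
    return {"age": None, "medical_specialty": None, "recommended_treatment": None}


def fake_lookup(treatment):
    return f"lookup result for: {treatment}"


records = build_records(
    ["PLAN: operative fixation", "No recommendation recorded"],
    fake_extract,
    fake_lookup,
)
print(records)
```

With the real functions, the call would look like `build_records(df["transcription"], extract_data, get_icd_code)`.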

### Create Final DataFrame

In [65]:
# Create the structured DataFrame
df_structured = pd.DataFrame(structured_data)

# Display the top rows
df_structured.head()

|   | age | medical_specialty | recommended_treatment | icd_code |
|---|-----|-------------------|-----------------------|----------|
| 0 | 23 | Allergy | She will try Zyrtec instead of Allegra again. ... | ICD-10 Code: J30.9 \nDescription: Allergic rh... |
| 1 | 41 | Orthopedic | operative fixation | ICD-10 Code: S82.92XA \nDescription: Unspecif... |
| 2 | 30 | Bariatric Surgery | Laparoscopic antecolic antegastric Roux-en-Y g... | ICD-10 Code: E66.01 \nDescription: Morbid obe... |
| 3 | 50 | Laryngology and Thoracic Surgery | Placement of #8 Shiley single cannula tracheos... | ICD-10 Code: J95.851 \nDescription: Tracheost... |
| 4 | 66 | Urology | Self-catheterize in the event that he does inc... | ICD-10 Code: R33.9 \nDescription: Urinary ret... |
