![electronic_medical_records](electronic_medical_records.png)

Medical professionals often summarize patient encounters in natural-language transcripts that include details about symptoms, diagnoses, and treatments. These transcripts feed into other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.

You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts and match it to the appropriate ICD-10 codes automatically. ICD-10 is a standardized coding system used worldwide for recording diagnoses and for billing, such as insurance claims processing.

## The Data
The dataset contains anonymized medical transcriptions categorized by specialty.

## transcriptions.csv
| Column     | Description              |
|------------|--------------------------|
| `"medical_specialty"` | The medical specialty associated with each transcription.  |
| `"transcription"` | Detailed medical transcription texts, with insights into the medical case. |

In [2]:
import pandas as pd
from openai import OpenAI
import json

In [33]:
df = pd.read_csv("data/transcriptions.csv")
df.head()

Unnamed: 0,medical_specialty,transcription
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Orthopedic,"CHIEF COMPLAINT:, Achilles ruptured tendon.,H..."
2,Bariatrics,"PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST..."
3,Cardiovascular / Pulmonary,"PREOPERATIVE DIAGNOSES,Airway obstruction seco..."
4,Urology,"CHIEF COMPLAINT:, Urinary retention.,HISTORY ..."


In [34]:
client = OpenAI()

function_definitions = [
    {
        "type": "function",
        "function": {
            "name": "extract_patient_data",
            "description": "Extracts patient age and recommended treatment or procedure.",
            "parameters": {
                "type": "object",
                "properties": {
                    "age" : {"type": "integer", "description": "Age of the patient."},
                    "recommended_treatment": {"type": "string", "description": "The recommended treatment or procedure for the disease."}
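                },
                # Require both fields so the model cannot omit either one
                "required": ["age", "recommended_treatment"]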
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "extract_icd_code",
            "description": "Extracts the correct International Classification of Diseases (ICD) code for the given treatment.",
            "parameters": {
                "type": "object",
                "properties": {
                    "icd" : {"type": "string", "description": "The official ICD-10 code corresponding to the recommended treatment"}
                },
                # Require the code field so the parsed result always has an "icd" key
                "required": ["icd"]
            }
        }
    }
]


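def get_recommended_treatment(client, text):
    """Ask the model to extract the patient's age and recommended treatment from a transcript."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": text}],
        tools=function_definitions,
        # Force a call to extract_patient_data so the reply is always structured
        tool_choice={"type": "function", "function": {"name": "extract_patient_data"}}
    )

    return response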


def get_icd_code(client, text):
    """Ask the model for the ICD-10 code that corresponds to a treatment description."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a medical assistant. Given a treatment description, return the corresponding ICD-10 code."},
            {"role": "user", "content": text}
        ],
        tools=function_definitions,
        # Force a call to extract_icd_code so the reply is always structured
        tool_choice={"type": "function", "function": {"name": "extract_icd_code"}}
    )

    return response
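
Both helpers above send one network request per transcript, so a single transient failure would abort the whole loop. The following is a minimal sketch of a retry wrapper with exponential backoff, not part of the original notebook. The name `call_with_retries` and all of its parameters are hypothetical, and which exception types are worth retrying depends on the client library version, so they are passed in by the caller rather than assumed here.

```python
import time


def call_with_retries(func, *args, retry_on=(Exception,), max_attempts=3,
                      base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying when one of `retry_on` is raised.

    Hypothetical helper for illustration only. `sleep` is injectable so the
    waiting behaviour can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Under these assumptions, a call such as `call_with_retries(get_recommended_treatment, client, text)` could stand in for the direct call inside the extraction loop.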

In [35]:
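patient_data_extract = []
for text in df["transcription"]:
    # Skip missing (non-string) transcriptions before calling the API,
    # and keep a placeholder so the results stay aligned with the rows
    if not isinstance(text, str):
        patient_data_extract.append({"age": None, "recommended_treatment": None})
        continue
    response = get_recommended_treatment(client, text)
    # The forced tool call returns its arguments as a JSON string
    arguments = response.choices[0].message.tool_calls[0].function.arguments
    patient_data_extract.append(json.loads(arguments))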

print(patient_data_extract)

[{'age': 23, 'recommended_treatment': 'Zyrtec, Nasonex'}, {'age': 41, 'recommended_treatment': 'operative fixation of Achilles tendon rupture'}, {'age': 30, 'recommended_treatment': 'Laparoscopic antecolic antegastric Roux-en-Y gastric bypass with EEA anastomosis.'}, {'age': 50, 'recommended_treatment': 'Neck exploration; tracheostomy; urgent flexible bronchoscopy; removal of foreign body, tracheal metallic stent material; dilation distal trachea; placement of #8 Shiley single cannula tracheostomy tube.'}, {'age': 66, 'recommended_treatment': 'Flomax and Proscar'}]
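
The loop above trusts that every tool call returns well-formed JSON containing the expected keys. As a hedged sketch, the check below decodes the arguments string defensively and returns `None` when the payload is unusable. The helper name `parse_tool_arguments` is hypothetical and not part of the original notebook.

```python
import json


def parse_tool_arguments(raw_arguments, required_keys):
    """Return the decoded arguments dict, or None if it is unusable.

    Hypothetical helper: `raw_arguments` is the JSON string found in
    tool_calls[0].function.arguments, `required_keys` the keys that must exist.
    """
    try:
        parsed = json.loads(raw_arguments)
    except (TypeError, json.JSONDecodeError):
        # Not a string at all, or not valid JSON
        return None
    if not isinstance(parsed, dict):
        return None
    if any(key not in parsed for key in required_keys):
        return None
    return parsed
```

A caller could then record a placeholder for any row where the result is `None` instead of letting a `KeyError` surface later, when the DataFrame is built.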


In [36]:
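icd_codes_extract = []
for record in patient_data_extract:
    # Rows without an extracted treatment get a placeholder so the lists stay aligned
    treatment = record.get("recommended_treatment") if isinstance(record, dict) else None
    if not treatment:
        icd_codes_extract.append({"icd": None})
        continue
    response = get_icd_code(client, treatment)
    arguments = response.choices[0].message.tool_calls[0].function.arguments
    icd_codes_extract.append(json.loads(arguments))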

print(icd_codes_extract)

[{'icd': 'Zyrtec, Nasonex'}, {'icd': 'S86.011A'}, {'icd': '0DB78ZZ'}, {'icd': 'J95.851'}, {'icd': 'N40.0'}]
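
In the output above, the first entry repeats the treatment text instead of giving a code, and `0DB78ZZ` has the seven-character shape of an ICD-10-PCS procedure code rather than an ICD-10-CM diagnosis code. A shape check can flag such rows for review. The sketch below is an assumption-laden illustration: the patterns test only the general layout of a code, not whether it exists in the official code sets, and the name `classify_code_shape` is hypothetical.

```python
import re

# General shape of an ICD-10-CM diagnosis code: letter, digit, alphanumeric,
# then optionally a dot and one to four more alphanumerics (e.g. S86.011A).
ICD10_CM_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")

# General shape of an ICD-10-PCS procedure code: seven characters drawn from
# digits and letters, with I and O excluded (e.g. 0DB78ZZ).
ICD10_PCS_SHAPE = re.compile(r"[0-9A-HJ-NP-Z]{7}")


def classify_code_shape(value):
    """Return 'ICD-10-CM', 'ICD-10-PCS', or 'invalid' based on layout alone."""
    if not isinstance(value, str):
        return "invalid"
    code = value.strip().upper()
    if ICD10_CM_SHAPE.fullmatch(code):
        return "ICD-10-CM"
    if ICD10_PCS_SHAPE.fullmatch(code):
        return "ICD-10-PCS"
    return "invalid"
```

Applied to the five results printed above, `[classify_code_shape(r["icd"]) for r in icd_codes_extract]` yields `['invalid', 'ICD-10-CM', 'ICD-10-PCS', 'ICD-10-CM', 'ICD-10-CM']`. A code that passes this check can still be wrong for the patient, so it narrows manual review rather than replacing it.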


In [37]:
# All three collections must have one entry per transcript before they are combined
if not len(df) == len(patient_data_extract) == len(icd_codes_extract):
    raise ValueError("Number of extracted results does not match the DataFrame row count.")

df_structured = pd.DataFrame(
    {
        "age": [i["age"] for i in patient_data_extract],
        "recommended_treatment": [i["recommended_treatment"] for i in patient_data_extract],
        "icd_code": [i["icd"] for i in icd_codes_extract],
        "medical_specialty": df["medical_specialty"]
    }
)

pd.set_option('display.max_columns', None)
print(df_structured)

   age                              recommended_treatment         icd_code           medical_specialty
0   23                                    Zyrtec, Nasonex  Zyrtec, Nasonex        Allergy / Immunology
1   41      operative fixation of Achilles tendon rupture         S86.011A                  Orthopedic
2   30  Laparoscopic antecolic antegastric Roux-en-Y g...          0DB78ZZ                  Bariatrics
3   50  Neck exploration; tracheostomy; urgent flexibl...          J95.851  Cardiovascular / Pulmonary
4   66                                 Flomax and Proscar            N40.0                     Urology
