## 🧠 Script Overview: Medical Transcription Extractor

**Filename:** `extract_transcriptions.py`  
**Author:** *Ankur Bhattacharjee*  
**Date:** *29-JUL-2025*  

### 📋 Description

This script uses the **OpenAI API** (model `gpt-4o-mini`, via function calling) to extract structured clinical information from unstructured medical transcriptions. It extracts three fields:

- `age`
- `recommended_treatment`
- `ICD_10_Code`

The extracted values are returned for each transcription and then collected into a pandas DataFrame for downstream analysis.

---

### ⚙️ Requirements

- `openai >= 1.0.0`
- `pandas`
- `python-dotenv`

Also, create a `.env` file in your project root with your API key:

```env
OPENAI_API_KEY=your-openai-api-key-here
```

In [1]:
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json
import os
from dotenv import load_dotenv

In [2]:
# Load the data
df = pd.read_csv("transcriptions.csv")
df.head()

| Unnamed: 0 | medical_specialty | transcription |
|---|---|---|
| 0 | Allergy / Immunology | SUBJECTIVE:, This 23-year-old white female pr... |
| 1 | Orthopedic | CHIEF COMPLAINT:, Achilles ruptured tendon.,H... |
| 2 | Bariatrics | PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST... |
| 3 | Cardiovascular / Pulmonary | PREOPERATIVE DIAGNOSES,Airway obstruction seco... |
| 4 | Urology | CHIEF COMPLAINT:, Urinary retention.,HISTORY ... |


In [3]:
# Pass the key securely into the client
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)
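
`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file surfaces later as a client error rather than a clear message. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_env` that is not part of the original script:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; the original script calls
    os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

With this in place, the cell above could read `api_key = require_env("OPENAI_API_KEY")` after `load_dotenv()`.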

In [4]:
# Define the function schema passed to `tools`
function_definition = [{
    "type": "function",
    "function": {
        "name": "get_med_info",
        "description": "Get the medical information from medical transcriptions",
        "parameters": {
            "type": "object",
            "properties": {
                "age": {"type": "string", "description": "age of the patient"},
                "recommended_treatment": {"type": "string", "description": "recommended treatment"},
                # The key must match the lookup in get_response (no leading space)
                "ICD_10_Code": {"type": "string", "description": "ICD 10 code for recommended treatment"},
            },
            # Ask the model to always return all three fields
            "required": ["age", "recommended_treatment", "ICD_10_Code"],
        },
    },
}]
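
The tool arguments come back as a JSON string whose keys normally mirror the property names in the schema, so a stray space in a property name can become a key the calling code never finds. A small sketch of defensive parsing, assuming a hypothetical helper `parse_med_info` and a made-up argument string:

```python
import json

# Field names the rest of the script looks up
EXPECTED_KEYS = ("age", "recommended_treatment", "ICD_10_Code")


def parse_med_info(arguments: str) -> dict:
    """Parse a tool-call arguments string into a dict with the expected keys.

    Keys are stripped of surrounding whitespace; missing keys map to None.
    Hypothetical helper for illustration only.
    """
    raw = json.loads(arguments)
    cleaned = {key.strip(): value for key, value in raw.items()}
    return {key: cleaned.get(key) for key in EXPECTED_KEYS}


# Made-up payload with a stray leading space in one key
example = '{"age": "23", "recommended_treatment": "1. Zyrtec", " ICD_10_Code": "J30.9"}'
print(parse_med_info(example))
```

Returning `None` for a missing field keeps one malformed response from aborting the whole run.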


In [5]:
# Define the response extraction function
def get_response(user_prompt):
    """Return (age, recommended_treatment, ICD_10_Code) for one transcription."""
    try:
        # Instructions go in the system role; the transcription is the user turn.
        # system_promt is defined in the next cell and looked up at call time.
        messages = [{"role": "system", "content": system_promt},
                    {"role": "user", "content": user_prompt}]
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=function_definition,
            tool_choice={"type": "function", "function": {"name": "get_med_info"}}
        )
        tool_call = response.choices[0].message.tool_calls[0]
        # Parse the JSON arguments once; .get() yields None for a missing key
        args = json.loads(tool_call.function.arguments)
        return (args.get("age"),
                args.get("recommended_treatment"),
                args.get("ICD_10_Code"))

    except Exception as e:
        print(f"Error: {e}")
        # Return three values so the caller's tuple unpacking still works
        return None, None, None
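
Because `get_response` is called once per row, a transient failure such as a rate limit leaves that row empty. One option is to retry with exponential backoff. The sketch below is a generic wrapper, assuming a hypothetical `with_retries` helper and illustrative delay values; the original script does not retry:

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Re-raises the last exception once all attempts are used.
    Hypothetical helper; attempts and base_delay are illustrative defaults.
    """
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x ... between attempts
            sleep(base_delay * (2 ** attempt))
```

Since `get_response` catches exceptions itself, a wrapper like this would have to go around the `client.chat.completions.create` call inside the `try` block rather than around `get_response`.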

In [6]:
# Defining the system prompt

system_promt= """You are an AI assistant specialized in medical transcription analysis. Your task is to interpret each user-provided medical transcription and extract the following information:

1. Patient Age
2. Recommended Treatment(s) 
3. For each recommended treatment, identify and match the corresponding ICD (International Classification of Diseases) code

Ensure that the extracted data is accurate, concise, and aligned with standard medical terminology.
Notes:
1. Recommended Treatment(s) should be precise and based on the recommendations in the transcription. If several are available, number them (e.g. 1. ..., 2. ...) and keep the Recommended Treatment(s) field within about 120 tokens.
2. The Medical Specialty value will be taken from the dataframe column df['medical_specialty'] and does not need to be extracted. """

In [7]:
# run the model to extract required information from dataframe

structured = []

for index, row in df.iterrows():
    transcription = row['transcription']
    specialty = row['medical_specialty']
    age, recommended_treatment, ICD_code = get_response(transcription)
    structured.append({
        'medical_specialty': specialty,
        'transcription': transcription,
        'age': age,
        'recommended_treatment': recommended_treatment,
        'ICD_10_code': ICD_code,
    })
df_structured = pd.DataFrame(structured)
df_structured.head()



| Unnamed: 0 | medical_specialty | transcription | age | recommended_treatment | ICD_10_code |
|---|---|---|---|---|---|
| 0 | Allergy / Immunology | SUBJECTIVE:, This 23-year-old white female pr... | 23 | 1. Try Zyrtec instead of Allegra. 2. Use lorat... | J30.9 |
| 1 | Orthopedic | CHIEF COMPLAINT:, Achilles ruptured tendon.,H... | 41 | 1. Operative fixation for right Achilles tendo... | S86.001 |
| 2 | Bariatrics | PREOPERATIVE DIAGNOSIS: , Morbid obesity.,POST... | 30 | 1. Laparoscopic antecolic antegastric Roux-en-... | E66.01 |
| 3 | Cardiovascular / Pulmonary | PREOPERATIVE DIAGNOSES,Airway obstruction seco... | 50 | 1. Neck exploration; 2. Tracheostomy; 3. Urgen... | J38.1 |
| 4 | Urology | CHIEF COMPLAINT:, Urinary retention.,HISTORY ... | 66 | 1. Flomax, 2. Proscar | N40 |
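
Model-generated codes are not guaranteed to be real ICD-10 entries, so a cheap check on the shape of each code can flag rows for manual review. The sketch below assumes a hypothetical helper `looks_like_icd10` and a simplified pattern (letter, digit, alphanumeric, then an optional dot and up to four alphanumerics). It checks format only, not whether the code exists or fits the treatment:

```python
import re

# Simplified shape of an ICD-10 code; an assumption, not an official grammar
ICD10_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def looks_like_icd10(code) -> bool:
    """Return True if code is a string shaped like an ICD-10 code."""
    if not isinstance(code, str):
        return False  # covers None from failed extractions
    return ICD10_SHAPE.fullmatch(code.strip().upper()) is not None


# Codes from the output above, plus two values that should be rejected
for code in ["J30.9", "S86.001", "E66.01", "N40", "unknown", None]:
    print(code, looks_like_icd10(code))
```

Applied to the result, something like `df_structured['ICD_10_code'].map(looks_like_icd10)` would give a boolean column marking which rows pass the check.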
