In [4]:
!git clone https://github.com/sweatyrichard/tribune7283.git

fatal: destination path 'tribune7283' already exists and is not an empty directory.


# Part 1: Extraction

## Loading Dataset


In [5]:
from google.colab import drive
import pandas as pd

# Mounting Google Drive is not needed here: the file comes from the cloned repository
#drive.mount('/content/drive')

# Path to the XLSX file inside the cloned repository
file_path = '/content/tribune7283/synthetic_clinical_notes.xlsx'

# Read the XLSX file into a pandas DataFrame
try:
    df = pd.read_excel(file_path)
    print("File read successfully!")
    # Display the first 5 rows of the DataFrame
    display(df.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

File read successfully!


Unnamed: 0,note
0,"Patient: John Smith, 58-year-old male\nMedical..."
1,"Patient: Linda Green, 45-year-old female\nMedi..."
2,"Patient: Michael Brown, 62-year-old male\nMedi..."
3,"Patient: Sarah Johnson, 50-year-old female\nMe..."
4,"Patient: Carlos Ramirez, 55-year-old male\nMed..."


## RegEx Based Extraction

In [6]:
import re
import pandas as pd

def extract_patient_info(note):
    """Extracts patient information using regex."""
    # Capture given and family names separately; assumes a "Given Family," layout on the Patient line
    name_match = re.search(r"Patient: (.*?)\s(.*?)[,\n]", note)
    age_match = re.search(r"(\d+)-year-old", note)
    gender_match = re.search(r"-old (male|female)", note)
    mrn_match = re.search(r"Medical Record #: (\d+)", note)

    given_name = name_match.group(1).strip() if name_match else None
    family_name = name_match.group(2).strip() if name_match else None
    age = int(age_match.group(1)) if age_match else None
    gender = gender_match.group(1) if gender_match else None
    mrn = mrn_match.group(1) if mrn_match else None

    return given_name, family_name, age, gender, mrn

# Apply the function to the 'note' column and create new columns
df[['given_name', 'family_name', 'age', 'gender', 'mrn']] = df['note'].apply(lambda x: pd.Series(extract_patient_info(x)))

# Display the updated DataFrame with the new columns
display(df.head())

Unnamed: 0,note,given_name,family_name,age,gender,mrn
0,"Patient: John Smith, 58-year-old male\nMedical...",John,Smith,58,male,1234
1,"Patient: Linda Green, 45-year-old female\nMedi...",Linda,Green,45,female,5678
2,"Patient: Michael Brown, 62-year-old male\nMedi...",Michael,Brown,62,male,9102
3,"Patient: Sarah Johnson, 50-year-old female\nMe...",Sarah,Johnson,50,female,3344
4,"Patient: Carlos Ramirez, 55-year-old male\nMed...",Carlos,Ramirez,55,male,2211
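
The name pattern above relies on a space between two name tokens. On a single-token name such as the hypothetical `Patient: Cher, 70-year-old female`, it would capture `Cher,` as the given name and the age phrase as the family name. A minimal sketch of an alternative, assuming the header always reads `Patient: <name>, <age>-year-old <gender>`, anchors the name on the comma first and splits it afterwards. `SAMPLE_NOTE`, `HEADER_RE`, and `parse_header` are illustrative names that are not part of the notebook, and treating the first token as the given name is itself an assumption.

```python
import re

# Hypothetical sample note; the layout mirrors the previews shown above.
SAMPLE_NOTE = "Patient: Mary Ann Van Der Berg, 71-year-old female\nMedical Record #: 4321\n"

# The name is everything up to the first comma, so it cannot swallow the age phrase.
HEADER_RE = re.compile(
    r"Patient:\s*(?P<full_name>[^,\n]+),\s*"
    r"(?P<age>\d{1,3})-year-old\s+(?P<gender>male|female)"
)

def parse_header(note):
    """Return a dict of header fields, or None if the header line does not match."""
    m = HEADER_RE.search(note)
    if m is None:
        return None
    parts = m.group("full_name").split()
    if not parts:
        return None
    return {
        "given_name": parts[0],
        # Assumption: every token after the first belongs to the family name.
        "family_name": " ".join(parts[1:]) or None,
        "age": int(m.group("age")),
        "gender": m.group("gender"),
    }

print(parse_header(SAMPLE_NOTE))
```

A note that does not match the header layout yields `None` for the whole record, which makes malformed headers easy to count before validation.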


In [7]:
import re
import pandas as pd
from typing import Union

def validate_given_name(name: Union[str, None]) -> bool:
    """Validates the patient's given name."""
    return isinstance(name, str) and 1 <= len(name) <= 50 # Assuming a reasonable length for a given name

def validate_family_name(name: Union[str, None]) -> bool:
    """Validates the patient's family name."""
    return isinstance(name, str) and 1 <= len(name) <= 50 # Assuming a reasonable length for a family name

def validate_age(age: Union[int, None]) -> bool:
    """Validates the patient's age."""
    # Check if age is an integer and within a reasonable range
    return isinstance(age, int) and 0 <= age <= 120

def validate_gender(gender: Union[str, None]) -> bool:
    """Validates the patient's gender."""
    # Check if gender is a string and is 'male' or 'female' (case-insensitive)
    return isinstance(gender, str) and gender.lower() in ['male', 'female']

def validate_mrn(mrn: Union[str, None]) -> bool:
    """Validates the patient's medical record number (MRN)."""
    # Check that mrn is a string of exactly four digits
    return isinstance(mrn, str) and bool(re.fullmatch(r'\d{4}', mrn))


def validate_patient_data(df: pd.DataFrame) -> pd.DataFrame:
    """Adds validation flags to the DataFrame."""
    df['given_name_valid'] = df['given_name'].apply(validate_given_name)
    df['family_name_valid'] = df['family_name'].apply(validate_family_name)
    df['age_valid'] = df['age'].apply(validate_age)
    df['gender_valid'] = df['gender'].apply(validate_gender)
    df['mrn_valid'] = df['mrn'].apply(validate_mrn)

    return df

# Apply the validation function to the DataFrame
df = validate_patient_data(df.copy())

# Display the DataFrame with validation flags
display(df)

Unnamed: 0,note,given_name,family_name,age,gender,mrn,given_name_valid,family_name_valid,age_valid,gender_valid,mrn_valid
0,"Patient: John Smith, 58-year-old male\nMedical...",John,Smith,58,male,1234,True,True,True,True,True
1,"Patient: Linda Green, 45-year-old female\nMedi...",Linda,Green,45,female,5678,True,True,True,True,True
2,"Patient: Michael Brown, 62-year-old male\nMedi...",Michael,Brown,62,male,9102,True,True,True,True,True
3,"Patient: Sarah Johnson, 50-year-old female\nMe...",Sarah,Johnson,50,female,3344,True,True,True,True,True
4,"Patient: Carlos Ramirez, 55-year-old male\nMed...",Carlos,Ramirez,55,male,2211,True,True,True,True,True
5,"Patient: Rebecca Lee, 29-year-old female\nMedi...",Rebecca,Lee,29,female,7788,True,True,True,True,True
6,"Patient: Thomas Wilson, 67-year-old male\nMedi...",Thomas,Wilson,67,male,9890,True,True,True,True,True
7,"Patient: Emily Dawson, 36-year-old female\nMed...",Emily,Dawson,36,female,5566,True,True,True,True,True
8,"Patient: Robert Kim, 48-year-old male\nMedical...",Robert,Kim,48,male,3030,True,True,True,True,True
9,"Patient: Diane Carter, 60-year-old female\nMed...",Diane,Carter,60,female,1212,True,True,True,True,True
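
`validate_age` accepts only Python `int` values. If some notes lack an age, the column can end up holding floats (`58.0`) or `NaN`, depending on how pandas infers the dtype, and such rows would then be flagged invalid for the wrong reason. The sketch below is one possible lenient variant, not the notebook's validator; `validate_age_lenient` is a hypothetical name, and the 0 to 120 range is carried over from the cell above.

```python
import math
import numbers

def validate_age_lenient(age):
    """Accept whole-number ages in 0..120 stored as int or float; reject everything else."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, numbers.Real):
        return False
    # NaN (a missing value) and infinities are never valid ages.
    if math.isnan(age) or math.isinf(age):
        return False
    return float(age).is_integer() and 0 <= age <= 120

for candidate in (58, 58.0, float("nan"), None, "58", 130):
    print(repr(candidate), validate_age_lenient(candidate))
```

It can be dropped into `validate_patient_data` in place of `validate_age` if float-typed ages are expected.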


## LLM Based Extraction

### Set up the environment

In [8]:
%pip install semantic-kernel openai transformers semantic-kernel[anthropic]

Collecting semantic-kernel
  Downloading semantic_kernel-1.35.3-py3-none-any.whl.metadata (12 kB)
Collecting azure-ai-projects>=1.0.0b12 (from semantic-kernel)
  Downloading azure_ai_projects-1.1.0b2-py3-none-any.whl.metadata (23 kB)
Collecting azure-ai-agents>=1.1.0b4 (from semantic-kernel)
  Downloading azure_ai_agents-1.2.0b2-py3-none-any.whl.metadata (71 kB)
Collecting cloudevents~=1.0 (from semantic-kernel)
  Downloading cloudevents-1.12.0-py3-none-any.whl.metadata (7.2 kB)
Collecting pydantic-settings~=2.0 (from semantic-kernel)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting azure-identity>=1.13 (from semantic-kernel)
  Downloading azure_identity-1.24.0-py3-none-any.whl.metadata (86 kB)
Collecting openapi_c

### Configure semantic kernel


In [10]:
import os
from google.colab import userdata # Import userdata to access Colab secrets

import semantic_kernel as sk

# Import the OpenAI chat connector; the import path can differ between semantic-kernel versions
try:
    from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
    print("Successfully imported OpenAIChatCompletion from semantic_kernel.connectors.ai.open_ai")
    OpenAI_Connector = OpenAIChatCompletion # Alias for easier use
except ImportError as e:
    print(f"Import Error: {e}")
    print("Could not import OpenAIChatCompletion from semantic_kernel.connectors.ai.open_ai.")
    print("Please ensure your semantic-kernel version is compatible with this import path.")
    OpenAI_Connector = None # Set to None if import fails


# 2. Initialize a Semantic Kernel instance, reusing the existing one if this cell is re-run.
if 'kernel' not in globals() or not isinstance(kernel, sk.Kernel):
    kernel = sk.Kernel()
    print("Semantic Kernel initialized.")
else:
    print("Semantic Kernel already initialized.")


# 3. Get the OpenAI API key from Colab secrets using userdata.get().
openai_api_key = userdata.get("OPENAI_API_KEY")

# 4. Add the GPT-4o model to the kernel's services.
gpt4o_model_name = "gpt-4o-2024-08-06" # Pinned GPT-4o snapshot
gpt_service_id = "gpt_service" # ID used to look the service up later

if OpenAI_Connector and openai_api_key:
    try:
        # Build the OpenAI chat completion service
        chat_completion_service = OpenAI_Connector(
            ai_model_id=gpt4o_model_name,
            api_key=openai_api_key,
            service_id=gpt_service_id,
        )
        # Register the chat service with the kernel
        kernel.add_service(chat_completion_service)

        print(f"'{gpt4o_model_name}' (service_id: {gpt_service_id}) added to the kernel as a chat service.")
    except Exception as e:
        print(f"Error adding OpenAI service '{gpt4o_model_name}': {e}")
        print("OpenAI service not added. Please check your OPENAI_API_KEY in Colab secrets and the semantic-kernel version compatibility.")
else:
    if not OpenAI_Connector:
        print("OpenAI connector class not available due to import failure.")
    else:
        print("OpenAI API key not found in Colab secrets. OpenAI service not added.")

print("\nOpenAI service addition attempt complete.")
# print("Kernel services:", kernel.get_service_ids()) # get_service_ids might cause AttributeError
print("Check kernel services manually if needed based on your SK version.")

Successfully imported OpenAIChatCompletion from semantic_kernel.connectors.ai.open_ai
Semantic Kernel already initialized.
'gpt-4o-2024-08-06' (service_id: gpt_service) added to the kernel as a chat service.

OpenAI service addition attempt complete.
Check kernel services manually if needed based on your SK version.


### Create semantic functions


In [12]:
import semantic_kernel as sk
import asyncio # Import asyncio for running async functions
from typing import Dict, Any
from google.colab import drive # Import drive to mount Google Drive
from semantic_kernel.contents.chat_history import ChatHistory # Import ChatHistory
from semantic_kernel.connectors.ai.open_ai import OpenAIChatPromptExecutionSettings # Import execution settings class

# Define execution settings for OpenAI chat completion
execution_settings = OpenAIChatPromptExecutionSettings()


# Path to the extraction prompt file in the cloned repository
prompt_file_path = '/content/tribune7283/json_extraction_prompt_v2.txt'

# Read the prompt from the file
try:
    with open(prompt_file_path, 'r') as f:
        json_extraction_prompt = f.read()
    print(f"Prompt read successfully from {prompt_file_path}")
except FileNotFoundError:
    print(f"Error: The prompt file was not found at {prompt_file_path}")
    json_extraction_prompt = None # Set prompt to None if file not loaded
except Exception as e:
    print(f"An error occurred while reading the prompt file: {e}")
    json_extraction_prompt = None # Set prompt to None if error occurs


# Asynchronous GPT-4o call through the kernel's chat completion service
async def process_note_gpt4o(note: str) -> str:
    """Processes the medical note using the GPT-4o model via Semantic Kernel."""
    print("Processing note with GPT-4o...")
    if json_extraction_prompt is None:
        return "Error: Prompt file not loaded."

    try:
        # Look up the chat completion service by its service ID
        gpt_service = kernel.get_service(gpt_service_id)
        if gpt_service:
            # Create a ChatHistory object and add messages
            chat_history = ChatHistory()
            chat_history.add_system_message("You are a helpful medical assistant that extracts information.")
            chat_history.add_user_message(f"{json_extraction_prompt}\n\nMedical Note:\n{note}\n\nExtracted Information:")


            # Prefer the chat completion API, which chat services such as OpenAI's expose
            if hasattr(gpt_service, 'get_chat_message_content'):
                # Request a single chat completion using execution_settings
                chat_completion = await gpt_service.get_chat_message_content(chat_history, settings=execution_settings)
                gpt_output = chat_completion
            elif hasattr(gpt_service, 'generate_text'):
                 # Fallback to generate_text if it exists (less likely for gpt-4o)
                 print("Warning: Using generate_text for GPT-4o service, but it might be a chat model.")
                 gpt_output = await gpt_service.generate_text(chat_history.messages_to_prompt()) # Convert ChatHistory to prompt string
            else:
                 raise AttributeError(f"Service '{gpt_service_id}' does not have expected generation method.")


            print("GPT-4o processing successful.")
            return str(gpt_output)
        else:
            return f"Error processing with GPT-4o: Service '{gpt_service_id}' not found in kernel."
    except AttributeError as e:
        print(f"AttributeError processing with GPT-4o: {e}")
        return f"AttributeError processing with GPT-4o: {e}"
    except Exception as e:
        print(f"An unexpected error occurred during GPT-4o processing: {e}")
        return f"An unexpected error occurred during GPT-4o processing: {e}"



Prompt read successfully from /content/tribune7283/json_extraction_prompt_v2.txt


In [13]:
import asyncio # Ensure asyncio is imported for running async functions
import json # Import json for handling JSON data
import os # Import os for path joining
import pandas as pd # Ensure pandas is imported


async def process_medical_notes(df: pd.DataFrame, sample_size: int = 0, seed: int = 42):
    """
    Processes medical notes from a DataFrame using GPT-4o via Semantic Kernel.

    Args:
        df: The input DataFrame containing medical notes in a 'note' column.
        sample_size: The number of notes to sample. If 0 or None, process the full dataset.
        seed: The random seed for reproducible sampling.
    """
    # --- Select notes for processing ---
    if sample_size and sample_size > 0:
        print(f"Processing a reproducible sample of {sample_size} notes (seed: {seed}).")
        # Ensure sample size does not exceed the number of notes available
        num_notes = len(df['note'])
        actual_sample_size = min(sample_size, num_notes)
        if actual_sample_size < sample_size:
            print(f"Warning: Requested sample size ({sample_size}) is larger than available notes ({num_notes}). Processing all available notes.")
            notes_to_process = df['note'].tolist()
        else:
             # Select a reproducible sample of medical notes from the 'note' column
            notes_to_process = df['note'].sample(n=actual_sample_size, random_state=seed).tolist()
    else:
        print("Processing the full dataset.")
        # Use the entire 'note' column
        notes_to_process = df['note'].tolist()


    # Define a directory to save the outputs (optional, but good practice)
    output_dir = "extracted_outputs"
    os.makedirs(output_dir, exist_ok=True)
    print(f"Output directory '{output_dir}' created or already exists.")

    # Define the path for the single output JSON file
    single_output_json_file = os.path.join(output_dir, "all_extracted_gpt4o_outputs.json")


    # 2. Run the asynchronous processing function for the selected notes with GPT-4o only
    results = []
    successfully_parsed_json_outputs = [] # List to collect successfully parsed JSON objects
    print(f"\n--- Running Semantic Kernel Processing on {len(notes_to_process)} Notes ---")

    async def process_list_of_notes(notes):
        processed_count = 0
        for i, note in enumerate(notes):
            processed_count += 1
            print(f"\n--- Processing Note {i+1}/{len(notes)} ---")
            gpt_output = await process_note_gpt4o(note)

            results.append({
                "original_note": note, # Store the full original note
                "gpt_output": gpt_output, # Store the full GPT-4o output
            })
            print(f"Finished processing Note {i+1}. Total processed: {processed_count}")

            # Attempt to parse the GPT-4o output as JSON and add to the collection list
            full_gpt_output_str = gpt_output
            try:
                json_start = full_gpt_output_str.find('{')
                json_end = full_gpt_output_str.rfind('}')
                if json_start != -1 and json_end != -1 and json_end > json_start:
                    json_string = full_gpt_output_str[json_start : json_end + 1]
                    gpt_output_json = json.loads(json_string)
                    successfully_parsed_json_outputs.append(gpt_output_json)
                    print(f"Successfully parsed JSON for Note {i+1}.")
                else:
                     print(f"Warning: Could not find a valid JSON object in GPT-4o output for Note {i+1}. Skipping JSON collection for this note.")

            except json.JSONDecodeError as e:
                print(f"Error decoding JSON from GPT-4o output for Note {i+1}: {e}. Skipping JSON collection for this note.")
            except Exception as e:
                 print(f"An unexpected error occurred during JSON parsing for Note {i+1}: {e}. Skipping JSON collection for this note.")


        return results

    # Run the asynchronous note processing
    processed_results = await process_list_of_notes(notes_to_process)


    # 3. Display the outputs (optional, but good for monitoring)
    print("\n--- Semantic Kernel Processing Results Preview ---")
    for i, result_data in enumerate(processed_results):
        if i >= 3: # Display preview for only the first few notes
            break
        print(f"\n--- Results for Note {i+1} ---")
        # Display full original note
        full_original_note = result_data.get("original_note", "N/A")
        # Display full GPT-4o output (truncated for display)
        full_gpt_output_str = result_data.get("gpt_output", "N/A")


        print("Original Note (first 100 chars):", full_original_note[:100] + "...")
        print("GPT-4o Output (first 500 chars):", full_gpt_output_str[:500] + "...")


    # 4. Save all successfully parsed JSON outputs to a single file
    print(f"\n--- Saving All Parsed JSON Outputs to {single_output_json_file} ---")
    if successfully_parsed_json_outputs:
        try:
            with open(single_output_json_file, 'w', encoding='utf-8') as f:
                json.dump(successfully_parsed_json_outputs, f, indent=4)
            print(f"Successfully saved {len(successfully_parsed_json_outputs)} parsed JSON objects to {single_output_json_file}")
        except Exception as e:
            print(f"Error saving the single JSON output file {single_output_json_file}: {e}")
    else:
        print("No valid JSON outputs were successfully parsed to save.")


    print("\n--- Processing and Saving Complete ---")
    return processed_results # Return the list of raw results for potential downstream use (like evaluation)


# Example of how to call the function:
# await process_medical_notes(df, sample_size=5, seed=123) # Process a sample of 5 notes
# await process_medical_notes(df) # Process the full dataset (if sample_size is 0 or None)

result_list = await process_medical_notes(df, sample_size=0, seed=42)

Processing the full dataset.
Output directory 'extracted_outputs' created or already exists.

--- Running Semantic Kernel Processing on 20 Notes ---

--- Processing Note 1/20 ---
Processing note with GPT-4o...
GPT-4o processing successful.
Finished processing Note 1. Total processed: 1
Successfully parsed JSON for Note 1.

--- Processing Note 2/20 ---
Processing note with GPT-4o...
GPT-4o processing successful.
Finished processing Note 2. Total processed: 2
Successfully parsed JSON for Note 2.

--- Processing Note 3/20 ---
Processing note with GPT-4o...
GPT-4o processing successful.
Finished processing Note 3. Total processed: 3
Successfully parsed JSON for Note 3.

--- Processing Note 4/20 ---
Processing note with GPT-4o...
GPT-4o processing successful.
Finished processing Note 4. Total processed: 4
Successfully parsed JSON for Note 4.

--- Processing Note 5/20 ---
Processing note with GPT-4o...
GPT-4o processing successful.
Finished processing Note 5. Total processed: 5
Successfully 
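
The parsing step above slices from the first `{` to the last `}`, which fails when the model adds prose containing braces after the JSON. A minimal sketch of a more tolerant alternative uses the standard library's `json.JSONDecoder.raw_decode`, which decodes one JSON value starting at a given index and ignores whatever follows. `extract_first_json_object` and the sample reply are hypothetical and not part of the notebook.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This brace did not open a valid object; try the next one.
        start = text.find("{", start + 1)
    return None

# Hypothetical model reply: JSON followed by prose that also contains braces.
reply = 'Result:\n{"given_name": "John", "age": 58}\nFields in {braces} are optional.'
print(extract_first_json_object(reply))
```

The function returns `None` instead of raising, so the caller can keep the notebook's "skip this note" behaviour.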

In [15]:
import os
from google.colab import userdata # Import userdata to access Colab secrets

import semantic_kernel as sk

from semantic_kernel.connectors.ai.anthropic import AnthropicChatCompletion
print("Successfully imported AnthropicChatCompletion from semantic_kernel.connectors.ai.anthropic")
Anthropic_Connector = AnthropicChatCompletion # Alias for easier use


if 'kernel' not in globals() or not isinstance(kernel, sk.Kernel):
    print("Semantic Kernel not initialized. Please run the previous cell to initialize the kernel.")


# 3. Get the Anthropic API key from Colab secrets using userdata.get().
anthropic_api_key = userdata.get("ANTHROPIC_API_KEY")

# 4. Add the Anthropic model to the kernel's services.
anthropic_model_name = "claude-opus-4-1-20250805"
anthropic_service_id = "anthropic_service"

if Anthropic_Connector and anthropic_api_key:
    try:
        # Use the Anthropic connector to add the service
        anthropic_chat_service = Anthropic_Connector(
            ai_model_id=anthropic_model_name,
            api_key=anthropic_api_key,
            service_id=anthropic_service_id,
        )
        # Add the chat service to the kernel
        kernel.add_service(anthropic_chat_service)

        print(f"'{anthropic_model_name}' (service_id: {anthropic_service_id}) added to the kernel as a chat service.")
    except Exception as e:
        print(f"Error adding Anthropic service '{anthropic_model_name}': {e}")
        print("Anthropic service not added. Please check your ANTHROPIC_API_KEY in Colab secrets and the semantic-kernel version compatibility.")
else:
    if not Anthropic_Connector:
        print("Anthropic connector class not available due to import failure.")
    else:
        print("Anthropic API key not found in Colab secrets. Anthropic service not added.")


Successfully imported AnthropicChatCompletion from semantic_kernel.connectors.ai.anthropic
'claude-opus-4-1-20250805' (service_id: anthropic_service) added to the kernel as a chat service.


# Part 2: Evaluation


In [16]:
import asyncio
import json
import pandas as pd
import semantic_kernel as sk
import traceback # Import traceback to get stack traces
from semantic_kernel.contents.chat_history import ChatHistory # Import ChatHistory explicitly
from semantic_kernel.connectors.ai.anthropic import AnthropicChatPromptExecutionSettings


# Define the asynchronous function for Anthropic processing with the LLM judge prompt
async def judge_with_anthropic(original_note: str, gpt_output: str) -> str:
    """Processes the GPT-4o output using the Anthropic model with the LLM judge prompt."""
    print("Processing output with Anthropic (LLM Judge)...")

    # llm_judge_prompt, kernel, and anthropic_service_id are expected to exist from other cells
    if 'llm_judge_prompt' not in globals() or llm_judge_prompt is None:
         return "Error: LLM judge prompt not loaded. Please run the cell that loads the LLM judge prompt first."
    if 'kernel' not in globals() or not isinstance(kernel, sk.Kernel):
         return "Error processing with Anthropic: Semantic Kernel not initialized."
    if 'anthropic_service_id' not in globals():
         return "Error processing with Anthropic: Anthropic service ID not initialized."

    prompt_content = f"{llm_judge_prompt}\n\nOriginal Medical Note:\n{original_note}\n\nOutput to Evaluate (from GPT-4o):\n{gpt_output}\n\nEvaluation Output (JSON):\n"

    try:
        anthropic_service = kernel.get_service(anthropic_service_id)
        if anthropic_service:
            chat_history = ChatHistory()
            chat_history.add_system_message("You are an expert medical text extraction evaluator. Review the provided original note and the extracted output, then provide a structured JSON evaluation.")
            chat_history.add_user_message(prompt_content)

            execution_settings = AnthropicChatPromptExecutionSettings()
            if hasattr(anthropic_service, 'get_chat_message_content'):
                anthropic_output = await anthropic_service.get_chat_message_content(chat_history, settings=execution_settings)
                print("Anthropic processing successful.")
                return str(anthropic_output)
            elif hasattr(anthropic_service, 'generate_text'):
                 print("Warning: Using generate_text for Anthropic service, but it might be a chat model.")
                 anthropic_output = await anthropic_service.generate_text(chat_history.messages_to_prompt())
                 print("Anthropic processing successful (using generate_text).")
                 return str(anthropic_output)
            else:
                 # This should not happen if the service is added correctly, but good for robustness
                 error_msg = f"Service '{anthropic_service_id}' does not have expected generation method (get_chat_message_content or generate_text)."
                 print(error_msg)
                 return error_msg

        else:
            return f"Error processing with Anthropic: Service '{anthropic_service_id}' not found in kernel."
    except Exception as e:
        print(f"An error occurred during Anthropic processing: {e}")
        # Print the traceback for unexpected errors
        traceback_str = traceback.format_exc()
        print(f"Traceback:\n{traceback_str}")
        return f"Error during Anthropic processing: {e}\nTraceback:\n{traceback_str}"


# --- Process GPT-4o Outputs with Anthropic LLM Judge ---
print("\n--- Running Anthropic LLM Judge on GPT-4o Outputs ---")

# Ensure result_list is available from the extraction step
if 'result_list' not in globals() or not isinstance(result_list, list):
    print("Error: 'result_list' not found or not a list. Please run the note-processing cell first.")
    result_list = [] # Initialize as empty list to avoid further errors

anthropic_judge_results = []

async def evaluate_outputs_with_anthropic(results):
    evaluated_count = 0
    for i, result_data in enumerate(results):
        evaluated_count += 1
        print(f"\n--- Evaluating Output for Note {i+1}/{len(results)} ---")
        original_note = result_data.get("original_note", "N/A")
        gpt_output = result_data.get("gpt_output", "N/A")

        judge_output = await judge_with_anthropic(original_note, gpt_output)

        anthropic_judge_results.append({
            "note_index": i + 1,
            "original_note": original_note,
            "gpt_output": gpt_output[:200], # Keep only a 200-character preview of the GPT-4o output
            "anthropic_judge_raw_output": judge_output,
            "anthropic_judge_parsed_json": None # Placeholder; not populated in this cell
        })
        print(f"Finished evaluating Output for Note {i+1}. Total evaluated: {evaluated_count}")
    return anthropic_judge_results # Return the list of judge results

# Run the asynchronous evaluation
evaluated_results = await evaluate_outputs_with_anthropic(result_list)




--- Running Anthropic LLM Judge on GPT-4o Outputs ---

--- Evaluating Output for Note 1/20 ---
Processing output with Anthropic (LLM Judge)...
Finished evaluating Output for Note 1. Total evaluated: 1

--- Evaluating Output for Note 2/20 ---
Processing output with Anthropic (LLM Judge)...
Finished evaluating Output for Note 2. Total evaluated: 2

--- Evaluating Output for Note 3/20 ---
Processing output with Anthropic (LLM Judge)...
Finished evaluating Output for Note 3. Total evaluated: 3

--- Evaluating Output for Note 4/20 ---
Processing output with Anthropic (LLM Judge)...
Finished evaluating Output for Note 4. Total evaluated: 4

--- Evaluating Output for Note 5/20 ---
Processing output with Anthropic (LLM Judge)...
Finished evaluating Output for Note 5. Total evaluated: 5

--- Evaluating Output for Note 6/20 ---
Processing output with Anthropic (LLM Judge)...
Finished evaluating Output for Note 6. Total evaluated: 6

--- Evaluating Output for Note 7/20 ---
Processing output with
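
Each record built above stores `anthropic_judge_parsed_json` as `None`, and the cells shown here do not fill it in, so a later step that expects a parsed dict would fall back to the raw text. The sketch below shows one way the field could be populated, reusing the same first-brace to last-brace slicing as the extraction cell. `parse_judge_records` is a hypothetical helper, and the sample records are invented for illustration.

```python
import json

def parse_judge_records(records):
    """Fill 'anthropic_judge_parsed_json' in place; return how many records were parsed."""
    parsed = 0
    for record in records:
        raw = record.get("anthropic_judge_raw_output") or ""
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            continue  # No JSON object, e.g. an error string returned by the judge.
        try:
            candidate = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            continue  # Leave the field as None when the slice is not valid JSON.
        if isinstance(candidate, dict):
            record["anthropic_judge_parsed_json"] = candidate
            parsed += 1
    return parsed

# Invented records for illustration only.
sample_records = [
    {"anthropic_judge_raw_output": 'Evaluation: {"score": 4, "missing_fields": []}',
     "anthropic_judge_parsed_json": None},
    {"anthropic_judge_raw_output": "Error: LLM judge prompt not loaded.",
     "anthropic_judge_parsed_json": None},
]
print(parse_judge_records(sample_records), "of", len(sample_records), "records parsed")
```

Comparing the returned count with `len(records)` gives a quick signal that the judge returned error strings rather than evaluations.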

In [17]:
import semantic_kernel as sk
import asyncio
from google.colab import drive
import pandas as pd
import json

# Assume kernel and anthropic_service_id are initialized and available from previous cells.
# Assumes df is loaded and available.


# Path to the LLM judge prompt file in Colab's local /content directory
llm_judge_prompt_file_path = '/content/llm_judge_prompt.md'

# Read the LLM judge prompt from the file
llm_judge_prompt = None
try:
    with open(llm_judge_prompt_file_path, 'r') as f:
        llm_judge_prompt = f.read()
    print(f"LLM judge prompt read successfully from {llm_judge_prompt_file_path}")
except FileNotFoundError:
    print(f"Error: The LLM judge prompt file was not found at {llm_judge_prompt_file_path}")
    llm_judge_prompt = None # Set prompt to None if file not found
except Exception as e:
    print(f"An error occurred while reading the LLM judge prompt file: {e}")
    llm_judge_prompt = None # Set prompt to None if error occurs


print("LLM judge prompt loaded. The judge_with_anthropic function has been moved to the cell below.")

Error: The LLM judge prompt file was not found at /content/llm_judge_prompt.md
LLM judge prompt loaded. The judge_with_anthropic function has been moved to the cell below.
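
In the run above the file was not found, yet the cell still printed that the prompt was loaded, and `judge_with_anthropic` returns an error string whenever `llm_judge_prompt` is `None`. A minimal sketch of a loader that stops immediately is shown below. `load_prompt` is a hypothetical helper, and raising on an empty file is an added assumption about what counts as a usable prompt.

```python
from pathlib import Path

def load_prompt(path):
    """Return the prompt text, raising instead of continuing with a missing or empty prompt."""
    prompt_path = Path(path)
    if not prompt_path.is_file():
        raise FileNotFoundError(f"Prompt file not found: {prompt_path}")
    text = prompt_path.read_text(encoding="utf-8")
    if not text.strip():
        # Assumption: a whitespace-only prompt is as unusable as a missing one.
        raise ValueError(f"Prompt file is empty: {prompt_path}")
    return text
```

Calling `llm_judge_prompt = load_prompt(llm_judge_prompt_file_path)` would then halt the cell with a traceback rather than letting the evaluation loop run without a prompt.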


In [18]:
import os
import json
import pandas as pd

# Define the output markdown file path
output_markdown_file = "/content/anthropic_judge_evaluation_summary.md"

print(f"Exporting Anthropic judge evaluation summary to '{output_markdown_file}'...")

# Check if evaluated_results is available and is a list
if 'evaluated_results' in globals() and isinstance(evaluated_results, list):
    if evaluated_results:
        with open(output_markdown_file, 'w', encoding='utf-8') as f:
            f.write("# Anthropic LLM Judge Evaluation Summary\n\n")
            f.write("This document summarizes the evaluations from the Anthropic LLM Judge for GPT-4o outputs.\n\n")

            for i, result_data in enumerate(evaluated_results):
                note_index = result_data.get("note_index", i + 1)

                # Get the full original_note and relevant judge results
                full_original_note = result_data.get("original_note")
                raw_judge_output = result_data.get("anthropic_judge_raw_output", "N/A")
                parsed_judge_json = result_data.get("anthropic_judge_parsed_json")

                # Handle potentially missing or empty notes for display
                if full_original_note is None or full_original_note.strip() == "":
                    full_original_note_display = "[Original note is missing or empty]"
                else:
                    full_original_note_display = full_original_note.strip()

                # Add separator between notes (except before the first one)
                if i > 0:
                    f.write("\n---\n\n")

                f.write(f"## Note {note_index}\n\n")

                # Present the entire original note
                f.write("### Original Medical Note\n\n")
                f.write(f"```\n{full_original_note_display}\n```\n\n")

                # Present the Anthropic LLM Judge Output Summary
                f.write("### Anthropic LLM Judge Evaluation\n\n")

                if parsed_judge_json and isinstance(parsed_judge_json, dict):
                    f.write("#### Evaluation Results\n\n")

                    # Create a cleaner table format
                    evaluation_items = []

                    # Add key fields in a logical order
                    field_mappings = {
                        "score": "Overall Score",
                        "validation_status": "Validation Status",
                        "critique_summary": "Summary",
                        "specific_issues": "Specific Issues",
                        "missing_fields": "Missing Fields",
                        "accuracy_assessment": "Accuracy Assessment",
                        "completeness_score": "Completeness Score"
                    }

                    for key, display_name in field_mappings.items():
                        if key in parsed_judge_json:
                            value = parsed_judge_json[key]

                            # Format the value appropriately
                            if isinstance(value, list):
                                if value:  # Non-empty list
                                    formatted_value = "• " + "<br>• ".join(str(item) for item in value)
                                else:
                                    formatted_value = "None identified"
                            elif isinstance(value, dict):
                                formatted_value = json.dumps(value, indent=2)
                            else:
                                formatted_value = str(value)

                            evaluation_items.append([display_name, formatted_value])

                    # Add any remaining fields not in the mappings
                    for key, value in parsed_judge_json.items():
                        if key not in field_mappings:
                            display_name = key.replace("_", " ").title()
                            if isinstance(value, list):
                                if value:
                                    formatted_value = "• " + "<br>• ".join(str(item) for item in value)
                                else:
                                    formatted_value = "None"
                            elif isinstance(value, dict):
                                formatted_value = json.dumps(value, indent=2)
                            else:
                                formatted_value = str(value)
                            evaluation_items.append([display_name, formatted_value])

                    if evaluation_items:
                        # Create markdown table
                        f.write("| Field | Value |\n")
                        f.write("|-------|-------|\n")
                        for field, value in evaluation_items:
                            # Escape pipes in the value to prevent table formatting issues
                            escaped_value = str(value).replace("|", "\\|").replace("\n", "<br>")
                            f.write(f"| {field} | {escaped_value} |\n")
                        f.write("\n")
                    else:
                        f.write("*No evaluation data available*\n\n")

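                    # If the judge returned fields beyond the expected mappings, also show the complete JSON.
                    # (Comparing lengths never triggers, because every key is already added to evaluation_items.)
                    if any(key not in field_mappings for key in parsed_judge_json):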
                        f.write("#### Complete Evaluation Data\n\n")
                        f.write("```json\n")
                        f.write(json.dumps(parsed_judge_json, indent=2, ensure_ascii=False))
                        f.write("\n```\n\n")

                else:
                    f.write("#### Raw Evaluation Output\n\n")
                    if raw_judge_output and raw_judge_output != "N/A":
                        f.write("```json\n")
                        f.write(str(raw_judge_output))
                        f.write("\n```\n\n")
                    else:
                        f.write("*No evaluation output available*\n\n")

            # Final separator
            f.write("\n---\n\n")
            f.write(f"*Report generated with {len(evaluated_results)} evaluations*\n")
            print(f"Evaluation summary exported successfully to '{output_markdown_file}'")

    else:
        print("The 'evaluated_results' list is empty. No results to export.")
else:
    print("Error: 'evaluated_results' not found or not a list. Please run the evaluation cell first.")

Exporting Anthropic judge evaluation summary to '/content/anthropic_judge_evaluation_summary.md'...
Evaluation summary exported successfully to '/content/anthropic_judge_evaluation_summary.md'


## Demonstrating how validation with Pydantic would work

In [19]:
from typing import List, Optional, Dict, Any, Union, Literal
from pydantic import BaseModel, Field, field_validator # Import field_validator instead of validator
import re

# Define Pydantic models corresponding to the JSON schema structure

class Condition(BaseModel):
    snomed_code: Optional[str] = Field(None, description="SNOMED CT code for the condition")
    snomed_term: Optional[str] = Field(None, description="SNOMED CT term for the condition")

class Diagnosis(BaseModel):
    condition: Condition
    status: Optional[str] = Field(None, description="Status of the diagnosis (e.g., 'active', 'resolved', 'suspected')")
    diagnosis_date: Optional[str] = Field(None, description="Date of diagnosis or relative time (e.g., 'YYYY-MM-DD', 'YYYY', 'X years ago')")
    approx_diagnosis_date: Optional[str] = Field(None, description="Approximate date of diagnosis if specific date is not available")
    approx_onset: Optional[str] = Field(None, description="Approximate onset of the condition if inferrable")
    is_type2_diabetes: Optional[bool] = Field(None, description="Indicates if the diagnosis is Type 2 Diabetes")
    notes: Optional[str] = Field(None, description="Additional notes about the diagnosis")

    @field_validator('status')
    @classmethod
    def validate_status(cls, v):
        if v is not None and v not in ["active", "resolved", "suspected", "inactive", "unknown"]:
            raise ValueError("Invalid diagnosis status")
        return v

    @field_validator('diagnosis_date')
    @classmethod
    def validate_diagnosis_date(cls, v):
        if v is not None and not re.fullmatch(r"^(\d{4}-\d{2}-\d{2}|\d{4}|.*ago.*)?$", v):
            raise ValueError("Invalid diagnosis_date format")
        return v


class RecordMetadata(BaseModel):
    patient_name: str = Field(..., min_length=1, description="Full name of the patient")
    medical_record_number: str = Field(..., min_length=1, description="Unique medical record identifier")
    visit_type: Optional[str] = Field(None, description="Type of medical visit")

    @field_validator('visit_type')
    @classmethod
    def validate_visit_type(cls, v):
        if v is not None and v not in ["follow-up", "initial consultation", "telehealth", "emergency", "routine check-up"]:
            raise ValueError("Invalid visit_type")
        return v

class SocioeconomicIndicators(BaseModel):
    occupation: Optional[str] = Field(None, description="Patient occupation")
    insurance: Optional[str] = Field(None, description="Insurance status or type")
    other: Optional[str] = Field(None, description="Other socioeconomic indicators")

class Demographics(BaseModel):
    age_years: int = Field(..., ge=0, le=150, description="Patient age in years") # Changed field name
    sex: str = Field(..., description="Patient gender") # Changed field name
    race_ethnicity: Optional[str] = Field(None, description="Patient race/ethnicity if specified")
    socioeconomic_indicators: Optional[SocioeconomicIndicators] = None

    @field_validator('sex') # Updated validator field
    @classmethod
    def validate_sex(cls, v):
        if v not in ["male", "female", "other", "unknown"]:
            raise ValueError("Invalid sex")
        return v

class Medication(BaseModel):
    drug_name: str = Field(..., min_length=1, description="Medication name")
    dosage: str = Field(..., description="Dosage with units")
    frequency: str = Field(..., description="Dosing frequency")
    drug_class: Optional[str] = Field(None, description="Drug classification")
    adherence_issues: Optional[str] = Field(None, description="Any reported adherence problems")
    is_current: Optional[bool] = Field(None, description="Indicates if the medication is currently being taken")
    notes: Optional[str] = Field(None, description="Additional notes about the medication")

    @field_validator('dosage')
    @classmethod
    def validate_dosage(cls, v):
        if not re.fullmatch(r"^\d+(\.\d+)?\s*(mg|g|mcg|IU|mL|units?).*$", v):
            raise ValueError("Invalid dosage format")
        return v

    @field_validator('frequency')
    @classmethod
    def validate_frequency(cls, v):
        if v not in ["daily", "BID", "TID", "QID", "weekly", "monthly", "as needed", "PRN", "twice daily", "three times daily", "four times daily", "occasional"]:
            raise ValueError("Invalid frequency")
        return v

    @field_validator('drug_class')
    @classmethod
    def validate_drug_class(cls, v):
        if v is not None and v not in [
            "biguanide", "sulfonylurea", "DPP-4 inhibitor", "GLP-1 agonist",
            "SGLT2 inhibitor", "insulin", "ACE inhibitor", "ARB", "beta blocker",
            "calcium channel blocker", "diuretic", "statin", "antiplatelet",
            "vitamin", "supplement", "other"
        ]:
            raise ValueError("Invalid drug_class")
        return v

class Adherence(BaseModel):
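    # Literal enforces the allowed values; an `enum=` keyword on Field is not validated in Pydantic v2
    status: Literal["good", "moderate", "poor", "unknown"] = Field(..., description="Overall adherence status")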
    evidence: Optional[str] = Field(None, description="Evidence or notes about adherence")

class Vitals(BaseModel):
    blood_pressure: Optional[str] = Field(None, description="Blood pressure reading (e.g., '120/80 mmHg')")
    pulse_bpm: Optional[int] = Field(None, description="Pulse in beats per minute")
    temperature: Optional[str] = Field(None, description="Temperature reading (e.g., '98.6 F')")

class Lab(BaseModel):
    test_name: str = Field(..., description="Name of the lab test")
    value: Optional[str] = Field(None, description="Test result value as string")
    normalized_value: Optional[float] = Field(None, description="Normalized numeric value of the test result")
    unit: Optional[str] = Field(None, description="Unit of measurement")
    date: Optional[str] = Field(None, description="Date of the lab test")
    status: Optional[str] = Field(None, json_schema_extra={"enum": ["normal", "high", "low", "pending", "unknown"]})

    @field_validator('status')
    @classmethod
    def validate_lab_status(cls, v):
        if v is not None and v not in ["normal", "high", "low", "pending", "unknown"]:
            raise ValueError("Invalid lab status")
        return v

class Finding(BaseModel):
    condition: Condition
    status: Optional[str] = Field(None, description="Status of the finding (e.g., 'present', 'absent', 'suspected')")
    severity: Optional[str] = Field(None, description="Severity of the finding (e.g., 'mild', 'moderate', 'severe')")
    notes: Optional[str] = Field(None, description="Additional notes about the finding")

    @field_validator('status')
    @classmethod
    def validate_finding_status(cls, v):
         if v is not None and v not in ["present", "absent", "suspected", "resolved", "unknown"]:
            raise ValueError("Invalid finding status")
         return v

    @field_validator('severity')
    @classmethod
    def validate_finding_severity(cls, v):
         if v is not None and v not in ["mild", "moderate", "severe", "early", "advanced", "unknown"]:
            raise ValueError("Invalid finding severity")
         return v


class SubjectReported(BaseModel):
    symptoms: List[str] = []
    factors: List[str] = [] # e.g., stress, sleep issues

class MentalHealth(BaseModel):
    stress: Optional[str] = Field(None, json_schema_extra={"enum": ["present", "absent", "unknown"]})
    notes: Optional[str] = Field(None, description="Notes about mental health")

    @field_validator('stress')
    @classmethod
    def validate_stress(cls, v):
        if v is not None and v not in ["present", "absent", "unknown"]:
            raise ValueError("Invalid stress status")
        return v

class Sleep(BaseModel):
    issue: Optional[str] = Field(None, json_schema_extra={"enum": ["insomnia", "sleep apnea", "restless leg syndrome", "circadian rhythm disorder", "other", "absent", "unknown"]})
    notes: Optional[str] = Field(None, description="Notes about sleep issues")

    @field_validator('issue')
    @classmethod
    def validate_sleep_issue(cls, v):
        if v is not None and v not in ["insomnia", "sleep apnea", "restless leg syndrome", "circadian rhythm disorder", "other", "absent", "unknown"]:
            raise ValueError("Invalid sleep issue type")
        return v


class PhysicalActivity(BaseModel):
    notes: Optional[str] = Field(None, description="Notes about physical activity level or type")

class Lifestyle(BaseModel):
    smoking_status: str = Field(..., json_schema_extra={"enum": ["current", "former", "never", "unknown", "occasional"]})
    alcohol_use: Optional[str] = Field(None, json_schema_extra={"enum": ["none", "social", "moderate", "heavy", "occasional", "unknown"]})
    diet_notes: Optional[str] = Field(None, description="Notes about diet or adherence")

    @field_validator('smoking_status')
    @classmethod
    def validate_smoking_status(cls, v):
        if v not in ["current", "former", "never", "unknown", "occasional"]:
            raise ValueError("Invalid smoking status")
        return v

    @field_validator('alcohol_use')
    @classmethod
    def validate_alcohol_use(cls, v):
        if v is not None and v not in ["none", "social", "moderate", "heavy", "occasional", "unknown"]:
            raise ValueError("Invalid alcohol use")
        return v


class MajorDiabeticComplication(BaseModel):
    complication: Condition
    had_complication: bool = Field(..., description="Indicates if the patient had this specific major complication")

class MajorDiabeticComplicationsSummary(BaseModel):
     had_complication: bool = Field(..., description="Overall indicator if the patient had any major diabetic complication")
     complications: List[MajorDiabeticComplication] = []


class Negation(BaseModel):
    finding_or_symptom: str = Field(..., description="Finding or symptom that was negated")
    evidence: Optional[str] = Field(None, description="Evidence for the negation")

class Uncertainty(BaseModel):
    finding_or_symptom: str = Field(..., description="Finding or symptom with uncertainty")
    evidence: Optional[str] = Field(None, description="Evidence for the uncertainty")

class IgnoredNonclinical(BaseModel):
     text: str = Field(..., description="Text that was ignored as non-clinical")
     reason: Optional[str] = Field(None, description="Reason for ignoring the text")


class PlanNextSteps(BaseModel):
    step: str = Field(..., description="A planned next step for patient care")


class MedicalRecordExtractionSchema(BaseModel): # Renamed the main model for clarity
    """Schema for extracted medical record information focusing on Type 2 Diabetes patients"""
    record_metadata: RecordMetadata
    demographics: Demographics
    diagnoses: List[Diagnosis] = [] # Changed field name and type
    comorbidities: List[Condition] = [] # Changed field type (assuming comorbidities are conditions)
    medications: List[Medication] = []
    adherence: Optional[Adherence] = None # Added Adherence model
    vitals: Optional[Vitals] = None # Added Vitals model
    labs: List[Lab] = [] # Changed field name
    findings: List[Finding] = [] # Added Findings model
    subject_reported: Optional[SubjectReported] = None # Added SubjectReported model
    mental_health: Optional[MentalHealth] = None # Added MentalHealth model
    sleep: Optional[Sleep] = None # Added Sleep model
    physical_activity: Optional[PhysicalActivity] = None # Added PhysicalActivity model
    lifestyle: Optional[Lifestyle] = None # Added Lifestyle model
    major_diabetic_complication: MajorDiabeticComplicationsSummary # Changed field name and type
    negations: List[Negation] = [] # Added Negations model
    uncertainties: List[Uncertainty] = [] # Added Uncertainties model
    ignored_nonclinical: List[IgnoredNonclinical] = [] # Added IgnoredNonclinical model
    plan_next_steps: List[PlanNextSteps] = [] # Changed field name and type


# Example of a valid data structure (as a dictionary) based on the new schema
valid_sample_output_data = {
  "record_metadata": {
    "patient_name": "John Smith",
    "medical_record_number": "1234"
  },
  "demographics": {
    "age_years": 58,
    "sex": "male",
    "race_ethnicity": None,
    "socioeconomic_indicators": None
  },
  "diagnoses": [
    {
      "condition": {
        "snomed_code": None,
        "snomed_term": "Type 2 diabetes mellitus"
      },
      "status": "active",
      "diagnosis_date": None,
      "approx_diagnosis_date": None,
      "approx_onset": "3 years ago",
      "is_type2_diabetes": True,
      "notes": "Well-controlled"
    },
     {
      "condition": {
        "snomed_code": "38341003", # Example SNOMED code for Hypertension
        "snomed_term": "Hypertension"
      },
      "status": "active",
      "diagnosis_date": None,
      "approx_diagnosis_date": None,
      "approx_onset": None,
      "is_type2_diabetes": False,
      "notes": "Managed with medication"
    }
  ],
  "comorbidities": [
    {
      "snomed_code": "38341003",
      "snomed_term": "Hypertension"
    },
     {
      "snomed_code": "114490001", # Placeholder code for Hyperlipidemia; verify against SNOMED CT before use
      "snomed_term": "Hyperlipidemia"
    }
  ],
  "medications": [
    {
      "drug_name": "Metformin",
      "dosage": "1000 mg",
      "frequency": "daily",
      "drug_class": "biguanide",
      "adherence_issues": "none",
      "is_current": True,
      "notes": None
    },
    {
      "drug_name": "Lisinopril",
      "dosage": "10 mg",
      "frequency": "daily",
      "drug_class": "ACE inhibitor",
      "adherence_issues": "none",
      "is_current": True,
      "notes": None
    }
  ],
  "adherence": {
    "status": "good",
    "evidence": "Patient reports taking all medications as prescribed"
  },
  "vitals": {
    "blood_pressure": "130/85 mmHg",
    "pulse_bpm": 72,
    "temperature": None
  },
  "labs": [
    {
      "test_name": "A1C",
      "value": "7.2%",
      "normalized_value": 7.2,
      "unit": "%",
      "date": "2024-07-01",
      "status": "high"
    },
     {
      "test_name": "LDL Cholesterol",
      "value": "110 mg/dL",
      "normalized_value": 110.0,
      "unit": "mg/dL",
      "date": "2024-07-01",
      "status": "normal"
    }
  ],
  "findings": [
    {
        "condition": {
            "snomed_code": None,
            "snomed_term": "Peripheral neuropathy"
        },
        "status": "present",
        "severity": "mild",
        "notes": "Reported tingling in feet"
    }
  ],
  "subject_reported": {
    "symptoms": ["fatigue", "increased thirst"],
    "factors": ["stress from work"]
  },
  "mental_health": {
    "stress": "present",
    "notes": "Patient reports high stress levels due to job demands"
  },
  "sleep": {
    "issue": "insomnia",
    "notes": "Difficulty falling asleep"
  },
  "physical_activity": {
    "notes": "Minimal exercise since winter started"
  },
  "lifestyle": {
    "smoking_status": "never",
    "alcohol_use": "social",
    "diet_notes": "Reports trying to eat healthier",
    "medication_compliance": "good"
  },
  "major_diabetic_complication": {
     "had_complication": False,
     "complications": []
  },
  "negations": [
    {
        "finding_or_symptom": "chest pain",
        "evidence": "Patient denies chest pain"
    }
  ],
  "uncertainties": [],
  "ignored_nonclinical": [
    {
        "text": "He mentions his pet dog, a golden retriever.",
        "reason": "Non-clinical personal information"
    }
  ],
  "plan_next_steps": [
    {
        "step": "Encourage more physical activity, potentially indoors"
    },
     {
        "step": "Order new labs to monitor lipid profile and blood glucose control"
    },
     {
        "step": "Discuss possible addition of medication for better glycemic management if needed"
    }
  ]
}

# Example of an invalid data structure (missing required field in nested model)
invalid_sample_output_data = {
  "record_metadata": {
    "patient_name": "Jane Doe",
    "medical_record_number": "5678"
    # visit_type is optional, so its absence is fine
  },
  "demographics": {
    "age_years": 55,
    # sex is required but missing
  },
  "diagnoses": [],
  "comorbidities": [],
  "medications": [],
  # The next two keys are not defined in the schema; Pydantic ignores unknown keys by default
  "laboratory_values": {},
  "clinical_findings": {},
  "major_diabetic_complication": {
     "had_complication": False,
     "complications": []
  },
  "negations": [],
  "uncertainties": [],
  "ignored_nonclinical": [],
  "plan_next_steps": []
  # All remaining top-level fields are optional or have defaults, so the only expected error is the missing demographics.sex
}


# Validate the sample outputs using Pydantic
try:
    valid_output_model = MedicalRecordExtractionSchema(**valid_sample_output_data)
    print("\nValid sample output is valid according to the Pydantic schema.")
    # You can access validated data as attributes:
    # print(valid_output_model.record_metadata.patient_name)
except Exception as e:
    print(f"\nValid sample output validation failed: {e}")

try:
    invalid_output_model = MedicalRecordExtractionSchema(**invalid_sample_output_data)
    print("\nInvalid sample output is valid according to the Pydantic schema (unexpected).")
except Exception as e:
    print(f"\nInvalid sample output validation failed: {e}")


Valid sample output is valid according to the Pydantic schema.

Invalid sample output validation failed: 1 validation error for MedicalRecordExtractionSchema
demographics.sex
  Field required [type=missing, input_value={'age_years': 55}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing




> **The cell above should produce a "validation failed" error to show what would happen if an invalid output was passed to it 😀**
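
As a complement to the validators above, here is a minimal sketch of how a single Pydantic definition can both enforce allowed values and produce the JSON schema, assuming Pydantic v2. `AdherenceSketch` is a hypothetical stand-in used only for illustration; it is not the notebook's actual model.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class AdherenceSketch(BaseModel):
    """Hypothetical stand-in for an adherence record (illustration only)."""

    # A Literal type is enforced at validation time and is emitted as an enum in the schema.
    status: Literal["good", "moderate", "poor", "unknown"]
    evidence: Optional[str] = Field(None, description="Exact quote supporting the status")


# The JSON schema is derived from the same class, so the two cannot drift apart.
schema = AdherenceSketch.model_json_schema()
print(schema["properties"]["status"]["enum"])

# A value outside the Literal is rejected without a hand-written validator.
try:
    AdherenceSketch(status="excellent")
except ValidationError as exc:
    print(f"Rejected with {exc.error_count()} validation error(s)")
```

The derived schema could then be embedded in the extraction prompt, which is one possible way to keep the prompt and the validation model in sync.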



In [20]:
import os
import json
from pydantic import ValidationError

# Assuming MedicalRecordExtractionSchema is defined in a previous cell (e.g., cell 5H0sUcGRtcLI)
# Make sure the cell defining the schema has been executed before running this cell.

output_dir = "extracted_outputs"
validated_outputs = []
validation_errors = {}

print(f"Loading JSON outputs from '{output_dir}' and validating against Pydantic schema...")

# Iterate through files in the output directory
if os.path.exists(output_dir):
    for filename in os.listdir(output_dir):
        if filename.endswith("_gpt4o_output.json"):
            file_path = os.path.join(output_dir, filename)
            print(f"\nValidating file: {filename}")

            try:
                # Load the JSON data from the file
                with open(file_path, 'r') as f:
                    extracted_data = json.load(f)

                # Validate the loaded data using the Pydantic schema
                validated_model = MedicalRecordExtractionSchema(**extracted_data)
                validated_outputs.append({"filename": filename, "data": validated_model})
                print(f"Validation successful for {filename}")

            except FileNotFoundError:
                print(f"Error: File not found at {file_path}")
                validation_errors[filename] = f"File not found: {file_path}"
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON from {filename}: {e}")
                validation_errors[filename] = f"JSON Decode Error: {e}"
            except ValidationError as e:
                print(f"Pydantic Validation Error for {filename}:\n{e}")
                validation_errors[filename] = f"Validation Error: {e}"
            except Exception as e:
                print(f"An unexpected error occurred while processing {filename}: {e}")
                validation_errors[filename] = f"Unexpected Error: {e}"
else:
    print(f"Output directory '{output_dir}' not found. Please ensure the previous processing step was successful.")


print("\n--- Validation Summary ---")
print(f"Successfully validated {len(validated_outputs)} files.")
if validation_errors:
    print(f"Validation failed for {len(validation_errors)} files:")
    for filename, error_msg in validation_errors.items():
        print(f"- {filename}: {error_msg}")
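elif validated_outputs:
    print("All processed GPT-4o JSON outputs passed Pydantic validation.")
else:
    print(f"No '*_gpt4o_output.json' files were found in '{output_dir}'; nothing was validated.")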


Loading JSON outputs from 'extracted_outputs' and validating against Pydantic schema...

--- Validation Summary ---
Successfully validated 0 files.
All processed GPT-4o JSON outputs passed Pydantic validation.
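
Schema validation checks structure, not whether the extracted content is grounded in the note. As a rough sketch of a complementary check, the snippet below walks an extraction and flags any `evidence` quote that does not appear verbatim in the source note (after whitespace and case normalization). The helper names, the sample note, and the sample extraction are all hypothetical, and a paraphrased quote would be flagged even when its meaning is correct.

```python
import re
from typing import Any, Iterator, List, Tuple


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def iter_evidence(node: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, quote) for every non-empty string stored under an 'evidence' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "evidence" and isinstance(value, str) and value.strip():
                yield child, value
            else:
                yield from iter_evidence(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_evidence(item, f"{path}[{index}]")


def find_ungrounded_evidence(extraction: dict, note: str) -> List[str]:
    """Return the paths whose evidence quote does not appear verbatim in the note."""
    haystack = _normalize(note)
    return [p for p, quote in iter_evidence(extraction) if _normalize(quote) not in haystack]


# Hypothetical note and extraction, for illustration only.
note = "Patient denies chest pain.\nReports taking metformin as prescribed."
extraction = {
    "adherence": {"status": "good", "evidence": "Reports taking metformin as prescribed"},
    "negations": [
        {"finding_or_symptom": "chest pain", "evidence": "Patient denies chest pain"},
        {"finding_or_symptom": "nausea", "evidence": "No nausea reported"},
    ],
}
print(find_ungrounded_evidence(extraction, note))  # prints ['negations[1].evidence']
```

Flagged paths could be routed to manual review or to a retry of the extraction rather than being dropped automatically.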


# Part 3: Output
- Model extraction outputs produced above and saved as **`all_extracted_gpt4o_outputs.json`**
- Evaluation outputs produced in "Evaluation" section above and saved as **`anthropic_judge_evaluation_summary.md`**


# Part 4: Summary


## Approach

1. **Orientation**
    - I first briefly reviewed the research literature for benchmarks directly aligned with the task at hand. In the short time allotted, this produced mixed signals. Some results suggest that for NER, encoder-only BERT-style models (e.g. ClinicalBERT, BiomedBERT) still outperform larger but still relatively small general-purpose LLMs such as the 7B-parameter Llama-2 and Mistral models. Xie et al. also report that GPT-4 performance was “poor” on the i2b2 NER task.
        - Source: Wu, Jiageng, et al. "BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text." arXiv preprint arXiv:2504.19467 (2025).
        - Source: Xie, Qianqian, et al. "Me-llama: Foundation large language models for medical applications." Research square (2024): rs-3.
    - I ran a small automated systematic review (n=10) on Elicit, which suggested that much larger models such as LLaMA-65B and GPT-4 did outperform ClinicalBERT and BiomedBERT. Ideally, I would have run a 10x larger review and closely compared the specific benchmarks mentioned. With automation tools and a pro account, this could be done fairly hands-off in a few hours.
    - To decide on a path, I drew on my experience fine-tuning BioBERT a few years back. That was a lengthy process requiring careful development of a dataset, setup of the training infrastructure, and significant time developing and testing the training code. Based on that experience and the micro literature review, I decided that for the purpose of this exercise it would be fastest and most interesting to look primarily at the most performant off-the-shelf models. To this end I chose GPT-4o. OpenAI also provides a fairly simple interface that lets end users fine-tune a model without most of the complexity typically involved, though this still requires the non-trivial work of preparing an appropriate dataset.
2. **Extraction Approach**
    - The information contained in clinical notes is very heterogeneous, and extraction approaches vary in how well they suit a particular information type. Despite this, LLMs have proven broadly applicable across a wide range of information types.
    - For this exercise I chose GPT-4o, which comes with significant trade-offs in speed and hallucination risk. In a real-life scenario I would want to use the fastest, most reliable methods for as much as possible, and reserve the LLM for elements that are more complex and difficult to extract. I would also want to evaluate the model’s performance for each information type. For example, in the notebook I begin by extracting names, gender, age, and medical record numbers using RegEx: a cursory check of the records showed that these fields follow a very consistent pattern, so it makes more sense to extract them directly than to risk the model doing something unexpected with them.
    - Beyond the attributes provided, the clinical notes contain other pieces of information that might be of particular interest to clinical researchers studying diabetic subjects. I collected a few of these by quickly reviewing some of the clinical research on the topic. Those attributes are listed below and were incorporated into the extraction prompts.
    - Attributes to extract from notes
        
        ### Patient Demographics & Identification
        
        - **Record Metadata**
            - Patient name
            - Medical record number
        - **Demographics**
            - Race/ethnicity
            - Socioeconomic class
        
        ### Clinical Information
        
        - **Medications**
            - Current medications
            - Medication adherence
        - **Medical History**
            - Concomitant diseases
            - Clinical findings (SNOMED CT coded)
        
        ### Disease-Specific Data
        
        - **Diagnoses**
            - Primary disease diagnoses
            - Disease diagnosis date
            - Approximate disease onset
        - **Complications**
            - Major diabetic complications (True/False)
            - Specific examples: Amputation, kidney damage, skin conditions, retinopathy, nephropathy, neuropathy
        
        ### Patient-Reported Information
        
        - **Symptoms & Factors**
            - Subject-reported symptoms
            - Subject-reported contributing factors
        - **Lifestyle & Behavioral Factors**
            - Mental health status (including stress levels)
            - Sleep patterns
            - Physical activity levels
            - Lifestyle habits (smoking, alcohol consumption)
3. **Model Selection**
    - I’ll say directly that I didn’t go through a thorough model selection process. Doing that would have entailed setting up the most promising candidates with an eval dataset and some well-defined performance metrics closely aligned with the downstream application. This was of course not possible here, so I chose a model based on instant accessibility (which ruled out some potentially interesting models) and mostly intuition.
4. **Model Orchestration**
    - There are a few good orchestration options out there, but I’ve used Semantic Kernel from Microsoft on a previous project and had a relatively good experience with it, so I decided it would be sufficient for the task. Semantic Kernel is a rather lightweight framework, and I probably could have used something even lighter, but in the interest of time I stuck with something I knew. It is well supported, and the development team at Microsoft has been very receptive to the community, holding regular office hours that I’ve frequently used to learn more about the framework and to ask questions when we needed guidance.
5. **Hallucination Mitigation**
    - To address hallucinations, I chose a few approaches that could be implemented quickly, including a JSON schema with expected data types. Additionally, for some fields I provide an “evidence” field in which the model cites exact quotes from the clinical note that support the selected attribute value.
    - In the notebook, I show an incomplete example of how Pydantic could be used to validate the model’s JSON outputs. Note that the Pydantic structure and the JSON schema provided in the prompt are currently not in sync; with more time one would design the desired data model once and derive both from that single source. Another reason to use Pydantic is that some model APIs and orchestration frameworks directly accept Pydantic classes, which relieves the developer of some of the verification burden. One would also want to choose appropriate validation criteria for each field carefully; the criteria shown in the notebook are mostly a toy example of what validation could look like. Ideally one could lean on an existing standard like i2b2 to provide a reasonable structure and constraints for the most common information types.
    - Lastly, it would be very interesting to explore some of the emerging approaches to uncertainty estimation, which could help detect, revise, and/or remove LLM outputs that the model is uncertain about. These range from Bayesian inference-based methods to ensemble methods and information-theoretic approaches. Some require open weights, but others are also applicable to closed-weight models.
6. **Evaluation**
    - I performed an LLM-as-a-judge evaluation, using Claude Opus 4.1 to rate the correctness of the extraction. I provided the model with a rough rubric by which to evaluate the extractions. To reduce the required time, I restricted this eval to a “challenging” subset of the fields related to medication adherence, physical activity, and nutrition. These were chosen because this type of information is especially hard for typical information retrieval methods to make sense of. Ideally one would also perform a manual assessment in which a human reviews the quality of the extraction, and then calculate correlation coefficients between the LLM judge and the human evaluations. Scaling this approach would mean expanding to all fields, using a significantly larger dataset, and using multiple human evaluators so that inter-evaluator agreement can be measured for different extractions, which would help identify inherently challenging aspects of the task.
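
The judge-versus-human comparison described under Evaluation could be sketched roughly as follows, assuming both raters assign an ordinal score to each note. Spearman's rank correlation is used here as one reasonable choice for ordinal rubric scores, not as the method used in the notebook, and the score lists are made-up placeholders rather than real results.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one rater gives constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two scores")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for six notes (hypothetical, not measured results).
judge_scores = [5, 4, 2, 5, 3, 1]
human_scores = [4, 4, 2, 5, 3, 2]
print(f"Spearman rho: {spearman(judge_scores, human_scores):.3f}")
```

With several human evaluators, the same function could be applied pairwise to get a first impression of inter-evaluator agreement before moving to a dedicated agreement statistic.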