### About the dataset
**ACI-Bench** (Ambient Clinical Intelligence Benchmark) is a research dataset designed for developing and evaluating AI systems that convert doctor–patient dialogues into structured clinical documentation.  
It provides pairs of:

- **Narrative Conversations:** Natural doctor–patient transcripts. These are unstructured, free-flowing dialogues where a doctor interviews a patient, asks clarifying questions, discusses medical history, symptoms, lab tests, and next steps.
  
- **Structured Clinical Notes:** Professionally authored SOAP notes that summarize the conversation in a standardized medical format. SOAP stands for:
  - **S (Subjective):** The patient’s reported symptoms, history, complaints.
  - **O (Objective):** Clinician’s observations, exam findings, vital signs, test results.
  - **A (Assessment):** Diagnoses and clinical reasoning.
  - **P (Plan):** Recommended treatments, further tests, follow-up instructions.

### 📌 Key Details
- **Source:** [ACI-Bench GitHub](https://github.com/wyim/aci-bench)
- **Format:** CSV file with four columns: `dataset`, `encounter_id`, `dialogue` (the conversation), and `note` (the reference SOAP note).
- **Use Case:** Automatic clinical note generation, summarization, and faithful text generation for medical AI research.
  

## Load and Inspect Dataset
In this step, we load the ACI-Bench dataset directly from the GitHub source.  
We print the first few rows to verify the data structure and ensure the required columns (`dialogue` and `note`) are present.  
This helps us understand how each doctor–patient conversation maps to a structured clinical note, which is crucial for building our generation and evaluation pipeline.

In [1]:
import pandas as pd

# Load the CSV from GitHub 
url = "https://raw.githubusercontent.com/wyim/aci-bench/main/data/challenge_data/train.csv"

df = pd.read_csv(url)
print(df.head())
print(df.columns)

      dataset encounter_id                                           dialogue  \
0  virtassist       D2N001  [doctor] hi , martha . how are you ?\n[patient...   
1  virtassist       D2N002  [doctor] hi , andrew , how are you ?\n[patient...   
2  virtassist       D2N003  [doctor] hi , john . how are you ?\n[patient] ...   
3  virtassist       D2N004  [doctor] hi , james , how are you ?\n[patient]...   
4  virtassist       D2N005  [doctor] hey , ms. hill . nice to see you .\n[...   

                                                note  
0  CHIEF COMPLAINT\n\nAnnual exam.\n\nHISTORY OF ...  
1  CHIEF COMPLAINT\n\nJoint pain.\n\nHISTORY OF P...  
2  CHIEF COMPLAINT\n\nBack pain.\n\nHISTORY OF PR...  
3  CHIEF COMPLAINT\n\nBack pain.\n\nHISTORY OF PR...  
4  CC:\n\nRight middle finger pain.\n\nHPI:\n\nMs...  
Index(['dataset', 'encounter_id', 'dialogue', 'note'], dtype='object')
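
The column check described above can also be made explicit in code. The sketch below is a hypothetical helper, not part of ACI-Bench or of this notebook: `find_row_problems` and `REQUIRED_COLUMNS` are illustrative names, and the toy rows are made up. It works on plain dictionaries so it runs without pandas; with the DataFrame loaded above, `df.to_dict("records")` would supply the rows.

```python
from typing import Iterable, Mapping

REQUIRED_COLUMNS = ("dialogue", "note")  # columns the pipeline relies on


def find_row_problems(rows: Iterable[Mapping[str, object]],
                      required: tuple = REQUIRED_COLUMNS) -> list:
    """Return readable problems: missing columns or empty text fields."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if col not in row:
                problems.append(f"row {i}: missing column '{col}'")
            elif not isinstance(row[col], str) or not row[col].strip():
                problems.append(f"row {i}: empty '{col}'")
    return problems


# Toy rows standing in for df.to_dict("records"); the text is made up.
rows = [
    {"dialogue": "[doctor] hi .\n[patient] hello .", "note": "CHIEF COMPLAINT\n\nCough."},
    {"dialogue": "[doctor] hi .", "note": ""},
    {"dialogue": "[doctor] hi ."},
]
problems = find_row_problems(rows)
print(problems)
```

An empty list means every row has non-empty text in both required columns.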


## View a Sample Doctor–Patient Dialogue and Corresponding Clinical Note

Here, we extract the first example from the dataset to inspect its full conversation and reference SOAP note.  
This qualitative check helps us understand:
- How real medical conversations are structured.
- How detailed the SOAP note is.
- What information must be preserved during generation.

This insight will guide how we design our prompt for the LLM and how we evaluate the outputs.

In [2]:
sample = df.iloc[0]
dialogue = sample['dialogue']
note = sample['note']

print("\n=== DOCTOR–PATIENT CONVERSATION ===")
print(dialogue)
print("\n=== STRUCTURED CLINICAL NOTE ===")
print(note)


=== DOCTOR–PATIENT CONVERSATION ===
[doctor] hi , martha . how are you ?
[patient] i'm doing okay . how are you ?
[doctor] i'm doing okay . so , i know the nurse told you about dax . i'd like to tell dax a little bit about you , okay ?
[patient] okay .
[doctor] martha is a 50-year-old female with a past medical history significant for congestive heart failure , depression and hypertension who presents for her annual exam . so , martha , it's been a year since i've seen you . how are you doing ?
[patient] i'm doing well . i've been traveling a lot recently since things have , have gotten a bit lighter . and i got my , my vaccine , so i feel safer about traveling . i've been doing a lot of hiking . uh , went to washington last weekend to hike in northern cascades, like around the mount baker area .
[doctor] nice . that's great . i'm glad to hear that you're staying active , you know . i , i just love this weather . i'm so happy the summer is over . i'm definitely more of a fall person .
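
In the transcript above, each turn starts with a speaker tag such as `[doctor]` or `[patient]`. A minimal sketch of splitting a dialogue into turns follows, assuming every turn starts on its own line with a bracketed tag (inferred from this one sample; other files may use additional speakers). `split_turns` is a hypothetical helper, not a function from the dataset repository.

```python
import re
from collections import Counter

TURN_RE = re.compile(r"^\[(\w+)\]\s*(.*)$")


def split_turns(dialogue: str) -> list:
    """Split a transcript into (speaker, utterance) pairs.

    Lines without a leading [speaker] tag are appended to the previous turn.
    """
    turns = []
    for line in dialogue.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns:
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
    return turns


# First lines of the transcript printed above.
excerpt = (
    "[doctor] hi , martha . how are you ?\n"
    "[patient] i'm doing okay . how are you ?\n"
    "[doctor] i'm doing okay . so , i know the nurse told you about dax . "
    "i'd like to tell dax a little bit about you , okay ?\n"
    "[patient] okay ."
)
turns = split_turns(excerpt)
speaker_counts = Counter(speaker for speaker, _ in turns)
print(speaker_counts)
```

Turn counts and utterance lengths of this kind give a quick sense of how much text the model has to condense into a note.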

In [3]:
apikey = "apikey"  # placeholder; replace with your OpenRouter API key

## ✏️ Step 4 — Define Prompt Template for SOAP Note Generation

In this cell, we write a detailed system prompt.
It instructs the LLM to behave as a professional medical scribe and convert natural doctor–patient conversations into formal SOAP notes.
The prompt includes strong constraints to reduce hallucinations, preserve factual consistency, and enforce a clear structure.
We also embed a few high-quality example notes to guide the model (few-shot prompting). The final prompt was chosen after trying several variants.
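
The prompt described here has three parts: role instructions, a list of required sections, and worked examples. When comparing prompt variants, it can help to assemble those parts programmatically. The sketch below is one hypothetical way to do that; `build_system_prompt` is an illustrative name, the example notes are shortened stand-ins, and the notebook itself writes the prompt as a single literal string.

```python
def build_system_prompt(instructions: str, section_headers: list, examples: list) -> str:
    """Assemble role instructions, the required section list, and few-shot examples."""
    parts = [instructions.strip(), "Always use this exact format:", ""]
    parts.extend(f"- {header}:" for header in section_headers)
    for number, example in enumerate(examples, start=1):
        parts.append("")
        parts.append(f"Example{number}:")
        parts.append(example.strip())
    return "\n".join(parts) + "\n"


# Shortened stand-ins for the section list and the full example notes.
headers = ["CHIEF COMPLAINT", "HISTORY OF PRESENT ILLNESS", "ASSESSMENT AND PLAN"]
examples = ["CHIEF COMPLAINT\n\nBack pain.", "CHIEF COMPLAINT\n\nKidney stones."]
prompt = build_system_prompt("You are a professional medical scribe.", headers, examples)
print(prompt)
```

Keeping the parts separate makes it straightforward to vary the number of examples or the wording of the constraints between runs.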

In [29]:
dialogue = sample['dialogue']

system_content = """
You are a professional medical scribe.
Your job is to convert doctor–patient conversations into detailed, structured SOAP notes.
First, read the conversation carefully.
Finally, write the SOAP note with sections, write like medical professional
Avoid hallucinations
-Include all reported symptoms, findings, medications, and durations mentioned.
- NEVER fabricate labs, medications, or vital signs not clearly stated.
- Omit sections ONLY if there is no data available.
Always use this exact format:

- CHIEF COMPLAINT:
- HISTORY OF PRESENT ILLNESS:
- REVIEW OF SYSTEMS:
- PHYSICAL EXAMINATION:
- VITALS REVIEWED:
- RESULTS:
- ASSESSMENT AND PLAN:
-INSTRUCTIONS

Do NOT repeat sentences.
Do NOT invent information not in the dialogue.
Write clear, full medical sentences.
Example1:
CHIEF COMPLAINT

Back pain.

HISTORY OF PRESENT ILLNESS

Mr. John Perry is a 61-year-old male with a past medical history significant for kidney stones, migraines, and gastroesophageal reflux, who presents with some back pain.

The patient reports that he is feeling a lot of the same pain that he had when he had kidney stones about 2 years ago, so he is a little concerned. The pain started from the right side and moved over and he feels it on the left side of his back. This has been going on for the last 4 days. Initially, the pain was intermittent, but over the last 48 hours it has been constant. He has taken Tylenol, but it does not seem to help. He thinks he has hematuria, but it is hard to detect but it does look a little off color. He endorses nausea and vomiting if he exerts himself or climbs the stairs to his apartment or runs to catch the bus. He also endorses dizziness and lightheadedness with pain in his abdomen.

Regarding his migraines, he has been diligent about taking the Imitrex. He denies issues with the migraines.

Regarding his gastroesophageal reflux, he reports that he has been doing well with his diet, but notes lately with his pain, he has been eating more fast food and takeout since these options come with delivery. He is staying hydrated. He is taking Protonix 40 mg daily as directed.

REVIEW OF SYSTEMS

    Gastrointestinal: Endorses abdominal pain. Endorses nausea and vomiting with exertion.
    Genitourinary: Endorses urine discoloration.
    Musculoskeletal: Endorses back pain. Endorses body aches.
    Neurological: Denies headaches. Endorses dizziness and lightheadedness.

PHYSICAL EXAMINATION

    Respiratory: Lungs are clear to auscultation bilaterally. No wheezes, rales, or rhonchi.
    Cardiovascular: No murmurs, gallops, or rubs. No extra heart sounds.
    Gastrointestinal: Tender to palpation to the right lower quadrant. CVA tenderness on the right.

VITALS REVIEWED

    Blood Pressure: Elevated.

RESULTS

Creatinine level slightly elevated.
Abdominal x-ray demonstrates possible kidney stone.

ASSESSMENT AND PLAN

Mr. John Perry is a 61-year-old male with a past medical history significant for kidney stones, migraines, and gastroesophageal reflux, who presents with back pain.

Kidney stones.
    Medical Reasoning: He is experiencing pain in his back that is similar to his previous kidney stone pain. His recent abdominal x-ray demonstrates what appears to be a recurrent kidney stone.
    Additional Testing: I have ordered a CT scan of the abdomen and pelvis without contrast.
    Medical Treatment: We will start him on Ultram 50 mg as needed every 6 hours for pain.
    Patient Education and Counseling: I advised the patient to stay well hydrated and to strain his urine.

Migraines.
    Medical Reasoning: He has been compliant with Imitrex and is doing well at this time.
    Medical Treatment: Continue Imitrex.

Reflux.
    Medical Reasoning: This is typically well-controlled with dietary modifications.
    Medical Treatment: Continue with Protonix 40 mg daily. A refill was provided.

Patient Agreements: The patient understands and agrees with the recommended medical treatment plan.
Example2:
CHIEF COMPLAINT

Hospital follow-up after an anterior STEMI.

MEDICAL HISTORY

Patient reports history of CAD status post prior RCA stent in 2018, hypertension, and diabetes mellitus.

SURGICAL HISTORY

Patient reports history of RCA stent in 2018 and most recently underwent drug-eluting stent placement in the LAD.

SOCIAL HISTORY

Patient reports enjoying walking outside, gardening, and nature photography.

MEDICATIONS

Patient reports taking aspirin 81 mg daily, Brilinta 90 mg twice a day, Lipitor 80 mg daily, Toprol 50 mg daily, and lisinopril 20 mg a day.

REVIEW OF SYSTEMS

Constitutional: Reports fatigue. Denies changes in sleep.
Cardiovascular: Denies chest pain.
Respiratory: Denies shortness of breath.
Musculoskeletal: Denies lower extremity swelling.

VITALS

Vital signs look good today.

PHYSICAL EXAM

Neck
- General Examination: No carotid bruits.

Respiratory
- Auscultation of Lungs: Clear bilaterally.

Cardiovascular
- Auscultation of Heart: Grade 3/6 systolic ejection murmur, heard at the left base.

Musculoskeletal
- Examination of the right upper extremity reveals no swelling or edema on the right radial artery. Cath site is clean, dry, and intact. No hematoma. Palpable right radial artery pulse.

RESULTS

Electrocardiogram is reviewed and revealed normal sinus rhythm with good R wave progression and evolutionary changes, which are anticipated.

ASSESSMENT AND PLAN

1. Coronary artery disease.
- Medical Reasoning: The patient's exam is consistent with coronary artery disease.
- Patient Education and Counseling: We discussed that he should continue to watch his diet and salt intake. We also discussed that the cardiac rehab should help with his confidence with exercising regularly and for his education.
- Medical Treatment: Continue taking aspirin 81 mg daily Continue taking Brilinta 90 mg twice a day. Continue taking Lipitor 80 mg daily. Continue taking Toprol 50 mg daily. I will refer him to cardiac rehab.

2. Newly reduced left ventricular dysfunction and moderate mitral regurgitation.
- Medical Reasoning: The patient's physical exam is consistent with this diagnosis.
- Patient Education and Counseling: We discussed that his pumping function should improve in time. We also discussed that since he is compliant with his medications and presented to the cardiac cath lab quickly, he should recover. I advised the patient that he does not need to start a diuretic at this time.
- Medical Treatment: Continue taking lisinopril 20 mg a day. Prescription for Aldactone 12.5 mg daily provided. Order for labs provided. Repeat echocardiogram ordered to be completed in 2 months.

3. Hypertension.
- Medical Reasoning: This seems stable at this time.
- Medical Treatment: Continue home blood pressure monitoring.

Patient Agreements: The patient understands and agrees with the recommended medical treatment plan.

Example3:
CHIEF COMPLAINT

Kidney stones.

HISTORY OF PRESENT ILLNESS

Mason Ward is a pleasant 80-year-old male who presents to the clinic today for the evaluation of kidney stones. The patient was referred from his primary care physician. The onset of his pain began 1 week ago when he was in his barn moving hay when he had a sudden onset of right back pain. The patient initially thought his pain was due to throwing hay; however, he broke out into a sweat and became nauseated. He was seen by his primary care physician, who ordered a CT scan and told him that he had a kidney stone. He denies having kidney stones before, but states that his father has a history of kidney stones in the past. He explains that when he had pain, which has now resolved, it would radiate almost to his groin. The patient describes the pain as intermittent after he found out it was a kidney stone. He explains that he has been straining his urine, but has not seen anything. He denies any hematuria.

REVIEW OF SYSTEMS

Musculoskeletal: Reports right back pain.

VITALS

Vitals look good, blood pressure and hear rate are within normal limits. Temperature is within normal limits.

PHYSICAL EXAM

MSK: Examination of the abdomen: No pain with palpation of the abdomen. No rebound or guarding. There is CVA tenderness on the right side.

RESULTS

The CT scan of the abdomen revealed a stone that is measuring 0.5 cm located in the proximal right ureter. There is no evidence of hydronephrosis.

ASSESSMENT

Right kidney stone.

PLAN

We reviewed the patient's CT results in detail today. I have recommended that we treat the patient conservatively. I have prescribed the patient oxycodone 5 mg every 6 to 8 hours for pain. He may continue to take Tylenol between the oxycodone doses for any breakthrough pain. The patient should continue to use the strainer when he urinates until the stone passes. I have also recommended that we obtain a BMP, urinalysis, and urine culture to evaluate for any signs of infection.

INSTRUCTIONS

The patient will follow up with me in 1 to 2 weeks to check on his progress. If his symptoms have not improved, we will discuss further treatment options including lithotripsy.

"""

user_content = f"""
Below is a real doctor–patient conversation.

Write the SOAP note in the exact format.

Conversation:
{dialogue}

SOAP Note:
"""


## 📝 Step 5 — Generate SOAP Note for a Single Example

We test the LLM by sending one real conversation and receiving the generated SOAP note.
This helps us validate that:
- The prompt format is effective.
- The LLM respects medical conventions.
- The output is well-structured.

We print the result to manually inspect its coherence, faithfulness, and format.


In [19]:
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so we point the client at its base URL
client = OpenAI(
    api_key=apikey,
    base_url="https://openrouter.ai/api/v1",
)
response = client.chat.completions.create(
    model="meta-llama/llama-3-70b-instruct",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
)
generated_note = response.choices[0].message.content

print("\n=== GENERATED SOAP NOTE ===")
print(generated_note)



=== GENERATED SOAP NOTE ===
Here is the SOAP note in the exact format:

CHIEF COMPLAINT:

Annual exam.

HISTORY OF PRESENT ILLNESS:

Martha, a 50-year-old female with a past medical history significant for congestive heart failure, depression, and hypertension, presents for her annual exam. She reports doing well, having traveled recently, and getting her vaccine, which makes her feel safer about traveling. She has been doing a lot of hiking, including a recent trip to Washington to hike in the northern Cascades.

REVIEW OF SYSTEMS:

Constitutional: Denies fatigue.
Cardiovascular: Reports no chest pains, shortness of breath, or swelling in her legs.
Respiratory: Endorses nasal congestion from fall pollen and allergies.
Gastrointestinal: Denies nausea, vomiting, or abdominal pain.
Neurological: No issues, no feelings of wanting to harm herself or others.

PHYSICAL EXAMINATION:

Cardiovascular: Appreciates a 3 out of 6 systolic ejection murmur, heard at the apex.
Lower Extremities: Appr
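
The generated note above opens with a conversational preamble ("Here is the SOAP note in the exact format:") that is not part of the note itself, and it is worth confirming that the requested sections actually appear. The following is a minimal sketch of both checks. `strip_preamble` and `missing_sections` are hypothetical helpers, the section list is copied from the system prompt, and the sample text is a shortened excerpt of the output above.

```python
import re

REQUIRED_SECTIONS = [
    "CHIEF COMPLAINT",
    "HISTORY OF PRESENT ILLNESS",
    "REVIEW OF SYSTEMS",
    "PHYSICAL EXAMINATION",
    "VITALS REVIEWED",
    "RESULTS",
    "ASSESSMENT AND PLAN",
    "INSTRUCTIONS",
]


def strip_preamble(note: str, first_header: str = "CHIEF COMPLAINT") -> str:
    """Drop any text before the first expected header; return the note unchanged if absent."""
    position = note.find(first_header)
    return note[position:] if position >= 0 else note


def missing_sections(note: str, required: list = REQUIRED_SECTIONS) -> list:
    """List required headers that never appear on a line of their own."""
    missing = []
    for header in required:
        # Allow an optional leading "- " and an optional trailing colon.
        pattern = rf"^[-\s]*{re.escape(header)}\s*:?\s*$"
        if not re.search(pattern, note, flags=re.MULTILINE):
            missing.append(header)
    return missing


# Shortened excerpt of the generated output shown above.
raw = (
    "Here is the SOAP note in the exact format:\n\n"
    "CHIEF COMPLAINT:\n\nAnnual exam.\n\n"
    "REVIEW OF SYSTEMS:\n\nConstitutional: Denies fatigue.\n"
)
cleaned = strip_preamble(raw)
absent = missing_sections(cleaned)
print(absent)
```

Since the prompt allows sections to be omitted when there is no data, a missing header is a signal to review the note, not proof of an error.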

In [20]:
import evaluate
from bert_score import score

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

pred = generated_note
ref = sample['note']

# BLEU: takes raw strings and tokenizes them internally
bleu_result = bleu.compute(predictions=[pred], references=[ref])
print("BLEU:", bleu_result)

# ROUGE:
rouge_result = rouge.compute(predictions=[pred], references=[ref])
print("ROUGE:", rouge_result)

# BERTScore:
P, R, F1 = score([pred], [ref], lang="en", verbose=True)
print("BERTScore F1:", F1.mean().item())


BLEU: {'bleu': 0.1656177855922421, 'precisions': [0.7222222222222222, 0.38498789346246975, 0.23300970873786409, 0.145985401459854], 'brevity_penalty': 0.5310759470298738, 'length_ratio': 0.6124260355029586, 'translation_length': 414, 'reference_length': 676}
ROUGE: {'rouge1': 0.5398230088495576, 'rouge2': 0.27272727272727276, 'rougeL': 0.3716814159292035, 'rougeLsum': 0.5243362831858406}


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.47 seconds, 2.14 sentences/sec
BERTScore F1: 0.8739206194877625
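
The BLEU output above reports its components, so the final score can be reproduced by hand. The sketch below uses the standard BLEU definition (brevity penalty times the geometric mean of the 1- to 4-gram precisions, with uniform weights and no smoothing) and the values printed above. It is a worked check, not part of the evaluation pipeline.

```python
import math

# Values copied from the BLEU output printed above.
precisions = [0.7222222222222222, 0.38498789346246975,
              0.23300970873786409, 0.145985401459854]
translation_length = 414   # tokens in the generated note
reference_length = 676     # tokens in the reference note

# Brevity penalty: 1 if the candidate is at least as long as the reference,
# otherwise exp(1 - reference_length / translation_length).
if translation_length >= reference_length:
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - reference_length / translation_length)

# Geometric mean of the n-gram precisions (uniform weights).
geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu_score = brevity_penalty * geometric_mean
print(f"brevity penalty = {brevity_penalty:.4f}")
print(f"geometric mean  = {geometric_mean:.4f}")
print(f"BLEU            = {bleu_score:.4f}")
```

The brevity penalty of roughly 0.53 shows that much of the low BLEU score comes from length: the generated note has about 61% as many tokens as the reference.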


## 🔄 Batch Generate SOAP Notes

Here we loop over the first 20 samples of the validation split (`valid.csv`).
For each, we send the conversation to the LLM and generate a structured SOAP note, then score it against the reference note with BLEU, ROUGE, and BERTScore.
Evaluating several cases gives a broader picture than a single example.
Each generated note and its scores are stored and written to `generated_notes_evaluation.csv`.
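
A batch run makes one network request per sample, and in a loop like this a single failed request would stop the run. One common safeguard is to retry with exponential backoff. The sketch below is a hypothetical wrapper, not something the notebook uses; `call_with_retries` and its parameters are illustrative, and the flaky function stands in for the API call.

```python
import time


def call_with_retries(func, max_attempts: int = 3, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt and try again.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# Demonstration with a stand-in for the API call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "generated note"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_request, sleep=delays.append)
print(result, delays)
```

In the loop, the request could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, with the same arguments the notebook already passes.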


In [31]:
from openai import OpenAI
import pandas as pd
import evaluate
from bert_score import score

# Setup: OpenRouter exposes an OpenAI-compatible API
client = OpenAI(api_key=apikey, base_url="https://openrouter.ai/api/v1")

# Load data
url = "https://raw.githubusercontent.com/wyim/aci-bench/main/data/challenge_data/valid.csv"
df = pd.read_csv(url)
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

num_samples = 20
MODEL_ID = "meta-llama/llama-3-70b-instruct"
results = []

for idx in range(num_samples):
    sample = df.iloc[idx]
    dialogue = sample['dialogue']
    reference = sample['note']

    # system_content is the few-shot system prompt defined in the prompt-template cell above;
    # it is identical for every sample, so it is reused here rather than redefined.

    user_content = f"""
    Below is a real doctor–patient conversation.
    
    Write the SOAP note in the exact format.
    
    Conversation:
    {dialogue}
    
    SOAP Note:
    """

    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
        ]
    )

    generated_note = response.choices[0].message.content

    bleu_result = bleu.compute(predictions=[generated_note], references=[reference])
    rouge_result = rouge.compute(predictions=[generated_note], references=[reference])
    P, R, F1 = score([generated_note], [reference], lang="en")

    results.append({
        "Sample_ID": idx,
        "BLEU": bleu_result["bleu"],
        "ROUGE1": rouge_result["rouge1"],
        "ROUGE2": rouge_result["rouge2"],
        "ROUGEL": rouge_result["rougeL"],
        "BERTScore_P": P.mean().item(),
        "BERTScore_R": R.mean().item(),
        "BERTScore_F1": F1.mean().item(),
        "Generated_Note": generated_note,
        "Reference_Note": reference
    })

    print(f"✅ Done: Sample {idx} | BLEU: {bleu_result['bleu']:.4f} | ROUGE-L: {rouge_result['rougeL']:.4f} | BERTScore-F1: {F1.mean().item():.4f}")

df_results = pd.DataFrame(results)
df_results.to_csv("generated_notes_evaluation.csv", index=False)
print("\n✅ All done! Results saved to: generated_notes_evaluation.csv")


✅ Done: Sample 0 | BLEU: 0.1845 | ROUGE-L: 0.3651 | BERTScore-F1: 0.8835


✅ Done: Sample 1 | BLEU: 0.1386 | ROUGE-L: 0.3542 | BERTScore-F1: 0.8802


✅ Done: Sample 2 | BLEU: 0.2543 | ROUGE-L: 0.4523 | BERTScore-F1: 0.8789


✅ Done: Sample 3 | BLEU: 0.1160 | ROUGE-L: 0.3197 | BERTScore-F1: 0.8727


✅ Done: Sample 4 | BLEU: 0.1916 | ROUGE-L: 0.4548 | BERTScore-F1: 0.8811


✅ Done: Sample 5 | BLEU: 0.1835 | ROUGE-L: 0.3704 | BERTScore-F1: 0.8668


✅ Done: Sample 6 | BLEU: 0.1768 | ROUGE-L: 0.3660 | BERTScore-F1: 0.8754


✅ Done: Sample 7 | BLEU: 0.1038 | ROUGE-L: 0.2701 | BERTScore-F1: 0.8619


✅ Done: Sample 8 | BLEU: 0.1782 | ROUGE-L: 0.3549 | BERTScore-F1: 0.8511


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 9 | BLEU: 0.1342 | ROUGE-L: 0.2939 | BERTScore-F1: 0.8831


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 10 | BLEU: 0.1879 | ROUGE-L: 0.4326 | BERTScore-F1: 0.8847


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 11 | BLEU: 0.1901 | ROUGE-L: 0.3989 | BERTScore-F1: 0.8981


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 12 | BLEU: 0.1414 | ROUGE-L: 0.2737 | BERTScore-F1: 0.8566


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 13 | BLEU: 0.1794 | ROUGE-L: 0.3623 | BERTScore-F1: 0.8736


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 14 | BLEU: 0.1620 | ROUGE-L: 0.3455 | BERTScore-F1: 0.8787


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 15 | BLEU: 0.2732 | ROUGE-L: 0.4763 | BERTScore-F1: 0.9066


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 16 | BLEU: 0.1458 | ROUGE-L: 0.2967 | BERTScore-F1: 0.8721


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 17 | BLEU: 0.2124 | ROUGE-L: 0.4708 | BERTScore-F1: 0.8759


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 18 | BLEU: 0.0638 | ROUGE-L: 0.2636 | BERTScore-F1: 0.8574


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Done: Sample 19 | BLEU: 0.1362 | ROUGE-L: 0.3301 | BERTScore-F1: 0.8591

✅ All done! Results saved to: generated_notes_evaluation.csv


## Review and Inspect Generated Notes

In this step, we manually compare a few generated SOAP notes side by side with their references.
We look for common shortcomings:

- Hallucinated facts
- Missing medical details
- Incoherent phrasing

These findings help us design better prompts and pre- or post-processing steps.


In [32]:
# NOTE: this cell reuses `idx`, `dialogue`, `generated_note`, `reference`,
# `results`, `bleu`, `rouge`, and `score` defined in earlier cells.

# Calculate metrics for the current sample
bleu_result = bleu.compute(predictions=[generated_note], references=[reference])
rouge_result = rouge.compute(predictions=[generated_note], references=[reference])
P, R, F1 = score([generated_note], [reference], lang="en")

# Convert the BERTScore tensors to plain floats once
bert_p = P.mean().item()
bert_r = R.mean().item()
bert_f1 = F1.mean().item()

# Check for obvious failures (the thresholds are heuristic cut-offs)
review = ""
if bleu_result["bleu"] < 0.2:
    review += "Low BLEU; "
if rouge_result["rougeL"] < 0.3:
    review += "Low ROUGE-L; "
if bert_f1 < 0.75:
    review += "Low BERTScore; "

# Print a truncated side-by-side view for any flagged sample
if review:
    print("\n🔍 Potential problem in Sample", idx)
    print(dialogue[:500], "...")
    print("\n=== Generated ===")
    print(generated_note[:500], "...")
    print("\n=== Reference ===")
    print(reference[:500], "...")
    print("---")

# Store all metrics, both notes, and the review flag
results.append({
    "Sample_ID": idx,
    "BLEU": bleu_result["bleu"],
    "ROUGE1": rouge_result["rouge1"],
    "ROUGE2": rouge_result["rouge2"],
    "ROUGEL": rouge_result["rougeL"],
    "BERTScore_P": bert_p,
    "BERTScore_R": bert_r,
    "BERTScore_F1": bert_f1,
    "Generated_Note": generated_note,
    "Reference_Note": reference,
    "Review_Flag": review.strip()
})





🔍 Potential problem in Sample 19
[doctor] hi richard how are you the medical assistant told me that you have a tick bite is that what happened
[patient] i really do n't know where i got it but i i had i do get out in the woods and i do spend a lot of time out in the yard but yeah i've got a tick bite around my knee and and it's been it's been over a week and and just it just burns and just quite annoying
[doctor] okay and have you had any fever or chills
[patient] i have not at this point it just feels warm on that spot
[doctor ...

=== Generated ===
Here is the SOAP note in the exact format:

**CHIEF COMPLAINT**

Tick bite on the knee.

**HISTORY OF PRESENT ILLNESS**

Richard, a pleasant male, presents to the clinic today with a tick bite on his knee that has been present for over a week. He reports that the bite has been burning and feeling warm to the touch, but denies fever or chills. He also reports a headache and generally not feeling well. He spends a lot of time outdoors and h
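
The generated note above opens with a conversational preamble ("Here is the SOAP note in the exact format:") that is not part of the note itself and can lower overlap-based scores. Below is a minimal post-processing sketch. It assumes section headers are bold, upper-case lines such as `**CHIEF COMPLAINT**`, as in this output. `strip_preamble` is a hypothetical helper, not part of the original pipeline.

```python
import re

# Assumption: a section header is a line made only of bold, upper-case text,
# e.g. "**CHIEF COMPLAINT**" (the style seen in the generated note above).
HEADER_PATTERN = re.compile(r"^\*\*[A-Z][A-Z /&]+\*\*\s*$", flags=re.MULTILINE)

def strip_preamble(note: str) -> str:
    """Drop any text that precedes the first section header.

    If no header is found, the note is returned unchanged (apart from
    surrounding whitespace), so nothing is lost on unexpected formats.
    """
    match = HEADER_PATTERN.search(note)
    if match is None:
        return note.strip()
    return note[match.start():].strip()

raw = (
    "Here is the SOAP note in the exact format:\n\n"
    "**CHIEF COMPLAINT**\n\n"
    "Tick bite on the knee."
)
print(strip_preamble(raw).splitlines()[0])  # **CHIEF COMPLAINT**
```

Applying a step like this before computing the metrics would keep the comparison focused on the note content rather than on the model's framing sentence.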

In [33]:
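import spacy

# Load the small general-purpose English pipeline (not a clinical NER model)
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    """Return the set of lowercased named-entity strings spaCy finds in `text`."""
    doc = nlp(text)
    return {ent.text.lower() for ent in doc.ents}

# Run per sample; here `dialogue` and `generated_note` hold the values
# left over from the earlier cells.
dialogue_entities = extract_entities(dialogue)
note_entities = extract_entities(generated_note)

# Entities tagged in the note but not tagged in the dialogue.
# This is only a rough signal: exact string matching treats
# "one thousand" and "1000" as different entities.
hallucinated = note_entities - dialogue_entities

print("Entities in dialogue:", dialogue_entities)
print("Entities in generated note:", note_entities)
print("Possible hallucinated entities:", hallucinated)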




Entities in dialogue: {'the other day', 'about two months ago', 'a week', 'about three weeks', 'about a year ago two years ago', 'anesthesia', 'one hundred', 'twenty', 'about one twenty two', 'two', 'first', 'today', 'one', 'sixty seven', 'ninety eight', 'bull', 'one thousand', 'four', 'second', 'third', 'richard'}
Entities in generated note: {'gallop', '1000', '3 weeks', '120s', 'a week', '2', 'tick', '3', '100', 'soap', '20', 'additional testing: lyme', 'western', '67', 'additional testing: lipid', 'today', 'lyme', '122/70', '98.4', '1', 'additional testing: hemoglobin a1c', 'richard'}
Possible hallucinated entities: {'gallop', '1000', '3 weeks', '120s', '2', 'tick', '3', '100', 'soap', '20', 'additional testing: lyme', 'western', '67', 'additional testing: lipid', 'lyme', '122/70', '98.4', '1', 'additional testing: hemoglobin a1c'}
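
Many of the entities flagged above are likely false positives. For example, 'tick' is flagged even though the dialogue mentions a tick bite, because the set difference only compares spans that spaCy tagged as entities on both sides. One possible refinement, sketched below, is to check each note entity against the full dialogue text instead. `normalize` and `unsupported_entities` are hypothetical helpers, the dialogue string is only the opening snippet printed earlier, and number words ('sixty seven' vs. '67') are still not reconciled.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation (except . and /) with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9./ ]", " ", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def unsupported_entities(note_entities, dialogue_text):
    """Return note entities whose normalized form never appears in the dialogue.

    Matching is done on whole tokens (space-padded), so "1" does not
    match inside "10".
    """
    haystack = f" {normalize(dialogue_text)} "
    unsupported = set()
    for entity in note_entities:
        needle = normalize(entity)
        if needle and f" {needle} " not in haystack:
            unsupported.add(entity)
    return unsupported

# Opening lines of the Sample 19 dialogue, as printed earlier.
snippet = (
    "[doctor] hi richard how are you the medical assistant told me that "
    "you have a tick bite is that what happened"
)
# A few entities taken from the generated-note output above.
print(sorted(unsupported_entities({"tick", "richard", "lyme", "gallop"}, snippet)))
# ['gallop', 'lyme']
```

With the full dialogue rather than this snippet, some of these entities might turn out to be supported, so the remaining set should still be reviewed manually.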


# Conclusion and Future Plan

To wrap up:

- **What worked well:** Few-shot prompting worked well.
- **What the evaluation covered:**
  - BLEU
  - ROUGE
  - BERTScore
  - Hallucinated entities
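
To back this summary with numbers, the per-sample entries could be aggregated. The sketch below assumes a list of dictionaries with the same keys as the `results` entries built in the review cell (`BLEU`, `ROUGEL`, `BERTScore_F1`, `Review_Flag`). `summarize_metrics` is a hypothetical helper, and the example rows are made-up values for illustration, not results from this run.

```python
from statistics import mean

def summarize_metrics(results, metric_keys=("BLEU", "ROUGEL", "BERTScore_F1")):
    """Return per-metric mean/min/max and the share of samples with a review flag."""
    summary = {}
    for key in metric_keys:
        values = [row[key] for row in results]
        summary[key] = {"mean": mean(values), "min": min(values), "max": max(values)}
    # A sample counts as flagged when its Review_Flag string is non-empty.
    flagged = sum(1 for row in results if row.get("Review_Flag"))
    return summary, flagged / len(results)

# Illustrative rows only (made-up numbers).
example_rows = [
    {"BLEU": 0.10, "ROUGEL": 0.30, "BERTScore_F1": 0.80, "Review_Flag": "Low BLEU;"},
    {"BLEU": 0.30, "ROUGEL": 0.50, "BERTScore_F1": 0.90, "Review_Flag": ""},
]
summary, flagged_share = summarize_metrics(example_rows)
for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.4f} min={stats['min']:.4f} max={stats['max']:.4f}")
print(f"Flagged samples: {flagged_share:.0%}")
```

Reporting the flagged share next to the averages makes it easier to see whether a threshold is too strict, for instance when nearly every sample ends up flagged.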

   
# How we could reach better factual consistency in the future

- **Add More Data:** Combine ACI-Bench with other clinical datasets (e.g., MIMIC-III clinical notes, which require credentialed access, or other SOAP note corpora) to increase coverage and variability.
- **Clean & Preprocess:** Ensure all additional data is de-identified, high-quality, and matches the target output style.
- **Domain Fine-Tuning:** Fine-tune a suitable open-weight LLM (e.g., the medical model MedAlpaca, or a general-purpose model such as LLaMA or Mistral) on this larger dataset. This should help the model handle medical terminology and the SOAP format, and may reduce hallucinations.
- **Pipeline Integration:** Use Retrieval-Augmented Generation (RAG) with medical knowledge bases to fact-check generated notes and improve factual accuracy during generation.
- **Monitor Factuality:** Continue evaluating with both automatic metrics and manual review to ensure patient safety and clinical reliability.

