# Synthea Clinical Note Construction

This notebook constructs synthetic clinical narratives from Synthea-generated EHR records by aggregating encounter context, reasons for visit, and clinically meaningful conditions. These note-like texts are used as a controlled evaluation dataset for clinical symptom identification.

In [18]:
import pandas as pd
from pathlib import Path

In [19]:
# Define input and output paths (Path is already imported above)
DATA_DIR = Path("../data/synthea")
OUTPUT_PATH = DATA_DIR / "synthea_clinical_notes.csv"

In [20]:
# Load the Synthea CSV files
encounters = pd.read_csv(DATA_DIR / "encounters.csv")
conditions = pd.read_csv(DATA_DIR / "conditions.csv")

print("Encounters:", encounters.shape)
print("Conditions:", conditions.shape)

Encounters: (21131, 15)
Conditions: (13564, 7)


### Source Data Statistics

The Synthea simulator generated 21,131 encounters and 13,564 condition records across the synthetic patient population. Each encounter is turned into a concise, note-like clinical narrative for evaluation purposes. Linked condition records are included only when they pass a soft filter that removes administrative and socioeconomic entries, so many notes contain the encounter context alone.
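To see how many encounters actually carry linked condition records, a quick coverage check can help. The sketch below runs on a tiny hypothetical stand-in for the two tables (the IDs `e1` to `e4` are made up); it assumes only the `Id` and `ENCOUNTER` columns used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy tables; the real ones come from encounters.csv and conditions.csv
encounters = pd.DataFrame({"Id": ["e1", "e2", "e3", "e4"]})
conditions = pd.DataFrame({"ENCOUNTER": ["e1", "e1", "e3"]})

# True for every encounter that has at least one condition record pointing at it
has_condition = encounters["Id"].isin(conditions["ENCOUNTER"])

n_linked = int(has_condition.sum())
share_linked = n_linked / len(encounters)

print(f"Encounters with linked conditions: {n_linked} of {len(encounters)} ({share_linked:.0%})")
```

Running the same three lines on the real DataFrames would show what fraction of notes can contain a "Conditions noted" sentence at all.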

In [21]:
# Keep only the relevant columns
encounters = encounters[
    ["Id", "ENCOUNTERCLASS", "DESCRIPTION", "REASONDESCRIPTION"]
]

conditions = conditions[
    ["ENCOUNTER", "DESCRIPTION"]
]

encounters.head(), conditions.head()

(                                     Id ENCOUNTERCLASS  \
 0  0db52a7b-1110-058e-ba4c-1d6a5b00234a       wellness   
 1  0db52a7b-1110-058e-fa3a-6a2a73825d85       wellness   
 2  0db52a7b-1110-058e-3c03-e02ebebf0f9e       wellness   
 3  0db52a7b-1110-058e-23af-75b12f848543     ambulatory   
 4  0db52a7b-1110-058e-e93b-a3a1ca713120     ambulatory   
 
                                   DESCRIPTION  \
 0                Well child visit (procedure)   
 1  General examination of patient (procedure)   
 2  General examination of patient (procedure)   
 3           Encounter for symptom (procedure)   
 4           Encounter for symptom (procedure)   
 
                     REASONDESCRIPTION  
 0                                 NaN  
 1                                 NaN  
 2                                 NaN  
 3  Acute viral pharyngitis (disorder)  
 4          Viral sinusitis (disorder)  ,
                               ENCOUNTER                          DESCRIPTION
 0  0db52a7b-1110

In [22]:
# Soft filter: reject condition descriptions that mention administrative or socioeconomic terms
def clinically_relevant(text):
    if not isinstance(text, str):
        return False
    
    text = text.lower()
    
    exclude_terms = [
        "education",
        "employment",
        "insurance",
        "review due",
        "social contact"
    ]
    
    return not any(term in text for term in exclude_terms)
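A short check on a few sample descriptions shows how the filter behaves. The sample strings below are illustrative and not taken from the dataset; the function body is restated so the block runs on its own.

```python
def clinically_relevant(text):
    """Return True for string descriptions that contain no excluded term."""
    if not isinstance(text, str):
        return False
    lowered = text.lower()
    exclude_terms = ["education", "employment", "insurance", "review due", "social contact"]
    return not any(term in lowered for term in exclude_terms)

# Hypothetical sample descriptions, including a missing value
samples = [
    "Viral sinusitis (disorder)",
    "Full-time employment (finding)",
    "Medication review due (situation)",
    "Stress (finding)",
    float("nan"),
]

kept = [s for s in samples if clinically_relevant(s)]
print(kept)
```

Because the match is a plain substring test, any description that merely contains one of the terms is dropped, which is why the filter is best treated as a coarse first pass.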

In [23]:
# Construct note-like clinical text
notes = []

def article_for(word):
    """Return 'an' if the word starts with a vowel letter, else 'a' (spelling heuristic, not true pronunciation)."""
    if not isinstance(word, str) or len(word) == 0:
        return "a"
    return "an" if word[0].lower() in "aeiou" else "a"

for _, enc in encounters.iterrows():
    encounter_id = enc["Id"]
    parts = []

    # Encounter context
    if pd.notna(enc["ENCOUNTERCLASS"]):
        enc_class = enc["ENCOUNTERCLASS"].lower()
        article = article_for(enc_class)
        parts.append(
            f"The patient presented for {article} {enc_class} encounter."
        )

    # Reason for visit
    if pd.notna(enc["REASONDESCRIPTION"]):
        parts.append(
            f"The reason for visit was {enc['REASONDESCRIPTION']}."
        )

    # Conditions linked to this encounter
    conds = conditions[
        conditions["ENCOUNTER"] == encounter_id
    ]["DESCRIPTION"].dropna().tolist()

    conds = [c for c in conds if clinically_relevant(c)]

    if conds:
        parts.append(
            "Conditions noted during the visit include " +
            ", ".join(sorted(set(conds))) + "."
        )

    if parts:
        notes.append({
            "encounter_id": encounter_id,
            "note_text": " ".join(parts)
        })

notes_df = pd.DataFrame(notes)
notes_df.head()

                           encounter_id                                          note_text
0  0db52a7b-1110-058e-ba4c-1d6a5b00234a    The patient presented for a wellness encounter.
1  0db52a7b-1110-058e-fa3a-6a2a73825d85  The patient presented for a wellness encounter...
2  0db52a7b-1110-058e-3c03-e02ebebf0f9e    The patient presented for a wellness encounter.
3  0db52a7b-1110-058e-23af-75b12f848543  The patient presented for an ambulatory encoun...
4  0db52a7b-1110-058e-e93b-a3a1ca713120  The patient presented for an ambulatory encoun...
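The loop above rescans the whole `conditions` table once per encounter, which can get slow on larger exports. One possible alternative, sketched here on hypothetical toy data, indexes the filtered conditions by encounter ID once and then looks them up while building each note. The helper names `build_note` and `conds_by_enc` and the `EXCLUDE` constant are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy stand-ins for the Synthea tables
encounters = pd.DataFrame({
    "Id": ["e1", "e2", "e3"],
    "ENCOUNTERCLASS": ["wellness", "ambulatory", "emergency"],
    "REASONDESCRIPTION": [None, "Viral sinusitis (disorder)", None],
})
conditions = pd.DataFrame({
    "ENCOUNTER": ["e1", "e2", "e2", "e1"],
    "DESCRIPTION": [
        "Stress (finding)",
        "Viral sinusitis (disorder)",
        "Viral sinusitis (disorder)",
        "Full-time employment (finding)",
    ],
})

EXCLUDE = ["education", "employment", "insurance", "review due", "social contact"]

def clinically_relevant(text):
    return isinstance(text, str) and not any(t in text.lower() for t in EXCLUDE)

def article_for(word):
    return "an" if word and word[0].lower() in "aeiou" else "a"

# Single pass over conditions: encounter id -> set of relevant descriptions
conds_by_enc = {}
for enc_id, desc in zip(conditions["ENCOUNTER"], conditions["DESCRIPTION"]):
    if clinically_relevant(desc):
        conds_by_enc.setdefault(enc_id, set()).add(desc)

def build_note(enc):
    parts = []
    if pd.notna(enc["ENCOUNTERCLASS"]):
        cls = enc["ENCOUNTERCLASS"].lower()
        parts.append(f"The patient presented for {article_for(cls)} {cls} encounter.")
    if pd.notna(enc["REASONDESCRIPTION"]):
        parts.append(f"The reason for visit was {enc['REASONDESCRIPTION']}.")
    conds = sorted(conds_by_enc.get(enc["Id"], ()))
    if conds:
        parts.append("Conditions noted during the visit include " + ", ".join(conds) + ".")
    return " ".join(parts)

notes_df = pd.DataFrame({
    "encounter_id": encounters["Id"],
    "note_text": encounters.apply(build_note, axis=1),
})
# Mirror the original loop: skip encounters that produced no text at all
notes_df = notes_df[notes_df["note_text"] != ""].reset_index(drop=True)

for text in notes_df["note_text"]:
    print(text)
```

The note wording is the same as in the loop above; only the lookup strategy differs, so the two approaches should yield the same text for the same input.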


In [24]:
# Inspect the generated note_text
for i, row in notes_df.head(5).iterrows():
    print(f"\n--- NOTE {i} ---")
    print(row["note_text"])


--- NOTE 0 ---
The patient presented for a wellness encounter.

--- NOTE 1 ---
The patient presented for a wellness encounter. Conditions noted during the visit include Stress (finding).

--- NOTE 2 ---
The patient presented for a wellness encounter.

--- NOTE 3 ---
The patient presented for an ambulatory encounter. The reason for visit was Acute viral pharyngitis (disorder). Conditions noted during the visit include Acute viral pharyngitis (disorder).

--- NOTE 4 ---
The patient presented for an ambulatory encounter. The reason for visit was Viral sinusitis (disorder). Conditions noted during the visit include Viral sinusitis (disorder).


In [25]:
# Save the notes to CSV
notes_df.to_csv(OUTPUT_PATH, index=False)
print(f"Saved {len(notes_df)} synthetic clinical notes to:")
print(OUTPUT_PATH)

Saved 21131 synthetic clinical notes to:
../data/synthea/synthea_clinical_notes.csv
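After saving, it can be worth reloading the file to confirm that it round-trips cleanly. The sketch below uses a small hypothetical `notes_df` and a temporary directory rather than the real `OUTPUT_PATH`, so it can run anywhere; the variable names for the checks are chosen for this illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical notes table with the same two columns as the saved file
notes_df = pd.DataFrame({
    "encounter_id": ["e1", "e2"],
    "note_text": [
        "The patient presented for a wellness encounter.",
        "The patient presented for an ambulatory encounter. "
        "Conditions noted during the visit include Stress (finding), Viral sinusitis (disorder).",
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "synthea_clinical_notes.csv"
    notes_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Basic sanity checks on the reloaded file
same_shape = reloaded.shape == notes_df.shape
no_missing_text = bool(reloaded["note_text"].notna().all())
unique_ids = bool(reloaded["encounter_id"].is_unique)
same_text = reloaded["note_text"].tolist() == notes_df["note_text"].tolist()

print(same_shape, no_missing_text, unique_ids, same_text)
```

Commas inside `note_text` are quoted by `to_csv`, so notes listing several conditions survive the round trip as a single field.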
