# Transformer BERT-based Classifier (attention model)
**Update:**
- Added dropout + noise to the training data
- Evaluate on Stanford cases only



BioBERT Symptom ➜ Syndrome Classifier (Top-K Metrics)

- Takes free-text as input
- Uses attention (self-attention) to understand the meaning of the text
- Produces one vector representing the entire text
- Feeds that vector into a small classifier (MLP / linear layer)
- Outputs a label (e.g., a diagnosis)
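
The last three steps of this pipeline (one vector, a small classifier, a label) can be illustrated without any model weights. The following is a toy sketch, not the notebook's model: it assumes mean pooling over token vectors (BERT-style classifiers usually read the `[CLS]` position instead), and the vectors, weights, and label names are made up for illustration.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors into one vector for the whole text."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

def linear(vec, weights, bias):
    """One linear layer: produces one logit per label."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k):
    """Return the k most probable (label, probability) pairs."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical 2-dimensional token vectors and a 3-label head (illustrative values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = ["DEE9", "DEE8", "DEE2"]
W = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

pooled = mean_pool(tokens)              # one vector for the entire text
probs = softmax(linear(pooled, W, b))   # classifier head + softmax
print(top_k(probs, labels, k=2))        # ranked candidate labels
```

Ranking the labels instead of taking only the arg-max is what makes Top-K metrics possible: a prediction counts as a hit if the true syndrome appears among the first K candidates.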

**Transformer:**
- Looks at all words at the same time
- Decides which words are important through self-attention
- Captures long-distance relationships
  (e.g., “seizures since infancy … developmental delay … unlikely psychogenic”)
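
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the input vectors themselves (a real transformer applies learned projections, several heads, and positional information), and the three 2-dimensional "token" vectors are invented for the example.

```python
import math

def softmax(xs):
    """Normalize scores into attention weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:  # every token queries every token, all positions at once
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, X)) for i in range(d)])
    return outputs, weights

# Tokens 0 and 2 are similar; token 1 sits between them and is different
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(X)
print([round(w, 3) for w in weights[0]])
```

Token 0 gives token 2 the same weight as itself and less weight to the token in between, because the weights depend on similarity and not on distance in the sequence.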




## Step 0. Setup & Imports

In [1]:
# !pip install -q transformers datasets accelerate scikit-learn

import random
import numpy as np
import pandas as pd
import torch

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, accuracy_score

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

pd.set_option("display.max_colwidth", None)   # or 0 in older pandas
pd.set_option("display.width", None)          # optional, for wide tables

In [2]:
# Mount Google Drive
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
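# ✅ Install PyTorch (CUDA 12.1 wheels) + sentence-transformers
# Works on A100 / L4 in Colab. Without a GPU the install still succeeds; torch.cuda.is_available() then returns False.
# --index-url replaces PyPI for that command, so sentence-transformers is installed separately from PyPI.
!pip -q install --index-url https://download.pytorch.org/whl/cu121 \
    torch torchvision torchaudio
!pip -q install sentence-transformers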

# Quick sanity check
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Torch: 2.9.0+cu126
CUDA available: True
GPU: NVIDIA L4


## Step 1. Load / Define Base Canonical Data

In [4]:
# folder path
from pathlib import Path

folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project')
folder.mkdir(parents=True, exist_ok=True)

In [5]:
# Path to the jsonl file
data_path = folder / '1-Embedding' / 'rawdata_clean.jsonl'

# Read JSON Lines file
df_raw = pd.read_json(data_path, lines=True)
df_raw

Unnamed: 0,mimNumber,preferredTitle,clinicalSynopsis
0,300088,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,"[delayed development variable severity from birth in, developmental regression in about 50 of patients, normal development in, seizures convulsive, seizures tonic-clonic, seizures partial, seizures absence, seizures atonic, seizures myoclonic, status epilepticus, autistic features, aggression, psychosis, obsessive features, carrier males show rigid personality, carrier males show obsessive features, carrier males show controlling and inflexible traits, seizures often associated with fever, intellectual disability is variable, seizure onset at a mean of 14 months, have cessation of seizures at a mean of 12 years, carrier males are unaffected except for psychiatric/behavioral abnormalities]"
1,300491,"EPILEPSY, X-LINKED 1, WITH VARIABLE LEARNING DISABILITIES AND BEHAVIOR DISORDERS; EPILX1","[macrocephaly, seizures, complex partial epilepsy, bathing epilepsy, seizures triggered by water, unprovoked seizures, nocturnal seizures, febrile seizures, secondary generalization, eeg abnormalities in the temporal region, learning difficulties, dyslexia, impaired intellectual development variable, speech delay, autism spectrum disorders, aggressive behavior, affected patients have various combinations of the main clinical features, seizures are often triggered by exposure to water, carrier females may be affected]"
2,300607,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 8; DEE8,"[hypertonia, exaggerated startle response, seizures tonic hyperekplectic, seizures provoked by tactile stimulation or extreme emotion, seizures are poorly controlled, impaired psychomotor development, mental retardation severe, onset at birth]"
3,300672,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 2; DEE2,"[microcephaly progressive, broad forehead, prominent forehead, deep-set eyes, large-appearing eyes, well-defined eyebrows, anteverted nares, full lips, breath-holding episodes, hyperventilation, constipation, gastroesophageal reflux, scoliosis, small hands, tapering fingers, small feet, hypotonia, seizures infantile-onset, infantile spasms, multifocal seizures, generalized seizures, tonic-clonic seizures, myoclonic seizures, hypsarrhythmia, delayed psychomotor development, psychomotor regression, mental retardation profound, lack of speech development, poor eye contact, motor dyspraxia, myoclonus, inability to walk independently, eeg abnormalities, sleep difficulties, autonomic disturbances, autistic features, stereotyped behaviors, hand-wringing, onset in infancy, seizures are usually refractory, males are more severely affected, dysmorphic facial features are subtle, some phenotypic overlap with rett syndrome]"
4,300884,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 36; DEE36,"[microcephaly, dysmorphic facial features, coarse features, low-set ears, hypertelorism, poor eye contact, cortical visual impairment, optic nerve atrophy, nystagmus, swelling of the eyelids, upturned nose, high-arched palate, hepatomegaly, feeding difficulties, gastroesophageal reflux, tube feeding, joint contractures, osteopenia, scoliosis, swelling of the hands, swelling of the feet, seizures refractory, delayed psychomotor development, developmental regression after onset of seizures, intellectual disability severe, poor or absent speech, motor difficulties, inability to walk, hypsarrhythmia seen on eeg, multifocal discharges seen on eeg, hypotonia, hydrocephalus, cerebral atrophy, delayed myelination, extrapyramidal signs, pyramidal signs, prolonged appt, coagulation defects, recurrent infections, serum transferrin n-glycosylation defect consistent with type i cdg, decreased alg13 activity, onset in infancy, mean onset of seizures at 65 months of age, most patient are female, rare patients are male]"
...,...,...,...
1655,620653,"INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL RECESSIVE 80, WITH VARIANT LISSENCEPHALY; MRT80","[dysmorphic facial features, optic atrophy, developmental delay mild, impaired intellectual development mild to moderate, speech delay, hypotonia, mild motor disabilities, uncoordinated gait, spasticity, hyperreflexia, seizures, lissencephaly on brain imaging, frontotemporal pachygyria, mild cortical thickening, hydrocephalus, behavioral abnormalities, autistic features, adhd, aggression, frustration, onset in infancy or early childhood]"
1656,620655,ALFADHEL SYNDROME; AFDL,"[short stature, decreased body weight, decreased head circumference, high forehead, triangular face, flat philtrum, short philtrum, retrognathia, low-set ears, everted ears, arched eyebrows, bulbous nose, thin lips, club feet, global developmental delay, impaired intellectual development, motor delay, expressive speech delay, seizures, regression, unsteady gait, hypotonia, hyporeflexia, delayed myelination, behavioral abnormalities, stereotypy]"
1657,620669,NEURODEGENERATION WITH BRAIN IRON ACCUMULATION 9; NBIA9,"[poor overall growth, failure to thrive, microcephaly, poor vision, cortical visual impairment, poor visual tracking, dysphagia, feeding difficulties, tube-feeding, global developmental delay, neurodegeneration, delayed walking, inability to walk, loss of ambulation, axial hypotonia, speech delay, poor or absent speech, dysarthria, impaired intellectual development variable, cognitive decline, involuntary movements, spasticity, hyperreflexia, ataxia, dystonia, dysmetria, exaggerated startle, seizures, supratentorial white matter loss on brain imaging, iron accumulation in the basal ganglia, cerebellar hypoplasia, cerebellar atrophy, hot-cross bun sign in the pontocerebellar region, figure-of-eight sign in the pontocerebellar region, eye of the tiger sign in the globus pallidus, intracytoplasmic and intranuclear eosinophilic fth1-positive inclusions in neurons throughout the brain seen on neuropathology, iron levels may be low or normal, serum ferritin may be low or normal, transferrin saturation may be low or normal, onset in infancy, progressive disorder]"
1658,620675,"LEUKODYSTROPHY, HYPOMYELINATING, 27; HLD27","[short stature, poor overall growth, head titubation, gaze palsy, nystagmus, optic atrophy, strabismus, cataracts, feeding difficulties, tube feeding, scoliosis, foot deformities, global developmental delay severe, delayed walking, inability to sit or walk, loss of ambulation, poor head control, impaired intellectual development severe to profound, poor or absent speech, neurologic regression, ataxia, dystonia, spasticity, hyperreflexia, extensor plantar responses, pyramidal signs, hypotonia, hypertonia, seizures, hypomyelinating leukodystrophy seen on brain imaging, cerebral atrophy, enlarged ventricles, cerebellar atrophy, brainstem atrophy, thin corpus callosum, onset in infancy, progressive disorder]"


In [6]:
# Confirm symptoms are lists
print(df_raw.iloc[0]["clinicalSynopsis"])
print(type(df_raw.iloc[0]["clinicalSynopsis"]))

['delayed development variable severity from birth in', 'developmental regression in about 50 of patients', 'normal development in', 'seizures convulsive', 'seizures tonic-clonic', 'seizures partial', 'seizures absence', 'seizures atonic', 'seizures myoclonic', 'status epilepticus', 'autistic features', 'aggression', 'psychosis', 'obsessive features', 'carrier males show rigid personality', 'carrier males show obsessive features', 'carrier males show controlling and inflexible traits', 'seizures often associated with fever', 'intellectual disability is variable', 'seizure onset at a mean of 14 months', 'have cessation of seizures at a mean of 12 years', 'carrier males are unaffected except for psychiatric/behavioral abnormalities']
<class 'list'>


In [7]:
# create df_base
df_base = pd.DataFrame({
    "syndrome": df_raw["preferredTitle"].astype(str),
    "symptoms": df_raw["clinicalSynopsis"].apply(lambda x: [s.strip() for s in x if isinstance(s, str)])
})

df_base

Unnamed: 0,syndrome,symptoms
0,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,"[delayed development variable severity from birth in, developmental regression in about 50 of patients, normal development in, seizures convulsive, seizures tonic-clonic, seizures partial, seizures absence, seizures atonic, seizures myoclonic, status epilepticus, autistic features, aggression, psychosis, obsessive features, carrier males show rigid personality, carrier males show obsessive features, carrier males show controlling and inflexible traits, seizures often associated with fever, intellectual disability is variable, seizure onset at a mean of 14 months, have cessation of seizures at a mean of 12 years, carrier males are unaffected except for psychiatric/behavioral abnormalities]"
1,"EPILEPSY, X-LINKED 1, WITH VARIABLE LEARNING DISABILITIES AND BEHAVIOR DISORDERS; EPILX1","[macrocephaly, seizures, complex partial epilepsy, bathing epilepsy, seizures triggered by water, unprovoked seizures, nocturnal seizures, febrile seizures, secondary generalization, eeg abnormalities in the temporal region, learning difficulties, dyslexia, impaired intellectual development variable, speech delay, autism spectrum disorders, aggressive behavior, affected patients have various combinations of the main clinical features, seizures are often triggered by exposure to water, carrier females may be affected]"
2,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 8; DEE8,"[hypertonia, exaggerated startle response, seizures tonic hyperekplectic, seizures provoked by tactile stimulation or extreme emotion, seizures are poorly controlled, impaired psychomotor development, mental retardation severe, onset at birth]"
3,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 2; DEE2,"[microcephaly progressive, broad forehead, prominent forehead, deep-set eyes, large-appearing eyes, well-defined eyebrows, anteverted nares, full lips, breath-holding episodes, hyperventilation, constipation, gastroesophageal reflux, scoliosis, small hands, tapering fingers, small feet, hypotonia, seizures infantile-onset, infantile spasms, multifocal seizures, generalized seizures, tonic-clonic seizures, myoclonic seizures, hypsarrhythmia, delayed psychomotor development, psychomotor regression, mental retardation profound, lack of speech development, poor eye contact, motor dyspraxia, myoclonus, inability to walk independently, eeg abnormalities, sleep difficulties, autonomic disturbances, autistic features, stereotyped behaviors, hand-wringing, onset in infancy, seizures are usually refractory, males are more severely affected, dysmorphic facial features are subtle, some phenotypic overlap with rett syndrome]"
4,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 36; DEE36,"[microcephaly, dysmorphic facial features, coarse features, low-set ears, hypertelorism, poor eye contact, cortical visual impairment, optic nerve atrophy, nystagmus, swelling of the eyelids, upturned nose, high-arched palate, hepatomegaly, feeding difficulties, gastroesophageal reflux, tube feeding, joint contractures, osteopenia, scoliosis, swelling of the hands, swelling of the feet, seizures refractory, delayed psychomotor development, developmental regression after onset of seizures, intellectual disability severe, poor or absent speech, motor difficulties, inability to walk, hypsarrhythmia seen on eeg, multifocal discharges seen on eeg, hypotonia, hydrocephalus, cerebral atrophy, delayed myelination, extrapyramidal signs, pyramidal signs, prolonged appt, coagulation defects, recurrent infections, serum transferrin n-glycosylation defect consistent with type i cdg, decreased alg13 activity, onset in infancy, mean onset of seizures at 65 months of age, most patient are female, rare patients are male]"
...,...,...
1655,"INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL RECESSIVE 80, WITH VARIANT LISSENCEPHALY; MRT80","[dysmorphic facial features, optic atrophy, developmental delay mild, impaired intellectual development mild to moderate, speech delay, hypotonia, mild motor disabilities, uncoordinated gait, spasticity, hyperreflexia, seizures, lissencephaly on brain imaging, frontotemporal pachygyria, mild cortical thickening, hydrocephalus, behavioral abnormalities, autistic features, adhd, aggression, frustration, onset in infancy or early childhood]"
1656,ALFADHEL SYNDROME; AFDL,"[short stature, decreased body weight, decreased head circumference, high forehead, triangular face, flat philtrum, short philtrum, retrognathia, low-set ears, everted ears, arched eyebrows, bulbous nose, thin lips, club feet, global developmental delay, impaired intellectual development, motor delay, expressive speech delay, seizures, regression, unsteady gait, hypotonia, hyporeflexia, delayed myelination, behavioral abnormalities, stereotypy]"
1657,NEURODEGENERATION WITH BRAIN IRON ACCUMULATION 9; NBIA9,"[poor overall growth, failure to thrive, microcephaly, poor vision, cortical visual impairment, poor visual tracking, dysphagia, feeding difficulties, tube-feeding, global developmental delay, neurodegeneration, delayed walking, inability to walk, loss of ambulation, axial hypotonia, speech delay, poor or absent speech, dysarthria, impaired intellectual development variable, cognitive decline, involuntary movements, spasticity, hyperreflexia, ataxia, dystonia, dysmetria, exaggerated startle, seizures, supratentorial white matter loss on brain imaging, iron accumulation in the basal ganglia, cerebellar hypoplasia, cerebellar atrophy, hot-cross bun sign in the pontocerebellar region, figure-of-eight sign in the pontocerebellar region, eye of the tiger sign in the globus pallidus, intracytoplasmic and intranuclear eosinophilic fth1-positive inclusions in neurons throughout the brain seen on neuropathology, iron levels may be low or normal, serum ferritin may be low or normal, transferrin saturation may be low or normal, onset in infancy, progressive disorder]"
1658,"LEUKODYSTROPHY, HYPOMYELINATING, 27; HLD27","[short stature, poor overall growth, head titubation, gaze palsy, nystagmus, optic atrophy, strabismus, cataracts, feeding difficulties, tube feeding, scoliosis, foot deformities, global developmental delay severe, delayed walking, inability to sit or walk, loss of ambulation, poor head control, impaired intellectual development severe to profound, poor or absent speech, neurologic regression, ataxia, dystonia, spasticity, hyperreflexia, extensor plantar responses, pyramidal signs, hypotonia, hypertonia, seizures, hypomyelinating leukodystrophy seen on brain imaging, cerebral atrophy, enlarged ventricles, cerebellar atrophy, brainstem atrophy, thin corpus callosum, onset in infancy, progressive disorder]"


In [8]:
pd.set_option("display.max_colwidth", None)   # or 0 in older pandas
pd.set_option("display.width", None)          # optional, for wide tables

df_base.head()   # or whatever slice you want

Unnamed: 0,syndrome,symptoms
0,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,"[delayed development variable severity from birth in, developmental regression in about 50 of patients, normal development in, seizures convulsive, seizures tonic-clonic, seizures partial, seizures absence, seizures atonic, seizures myoclonic, status epilepticus, autistic features, aggression, psychosis, obsessive features, carrier males show rigid personality, carrier males show obsessive features, carrier males show controlling and inflexible traits, seizures often associated with fever, intellectual disability is variable, seizure onset at a mean of 14 months, have cessation of seizures at a mean of 12 years, carrier males are unaffected except for psychiatric/behavioral abnormalities]"
1,"EPILEPSY, X-LINKED 1, WITH VARIABLE LEARNING DISABILITIES AND BEHAVIOR DISORDERS; EPILX1","[macrocephaly, seizures, complex partial epilepsy, bathing epilepsy, seizures triggered by water, unprovoked seizures, nocturnal seizures, febrile seizures, secondary generalization, eeg abnormalities in the temporal region, learning difficulties, dyslexia, impaired intellectual development variable, speech delay, autism spectrum disorders, aggressive behavior, affected patients have various combinations of the main clinical features, seizures are often triggered by exposure to water, carrier females may be affected]"
2,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 8; DEE8,"[hypertonia, exaggerated startle response, seizures tonic hyperekplectic, seizures provoked by tactile stimulation or extreme emotion, seizures are poorly controlled, impaired psychomotor development, mental retardation severe, onset at birth]"
3,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 2; DEE2,"[microcephaly progressive, broad forehead, prominent forehead, deep-set eyes, large-appearing eyes, well-defined eyebrows, anteverted nares, full lips, breath-holding episodes, hyperventilation, constipation, gastroesophageal reflux, scoliosis, small hands, tapering fingers, small feet, hypotonia, seizures infantile-onset, infantile spasms, multifocal seizures, generalized seizures, tonic-clonic seizures, myoclonic seizures, hypsarrhythmia, delayed psychomotor development, psychomotor regression, mental retardation profound, lack of speech development, poor eye contact, motor dyspraxia, myoclonus, inability to walk independently, eeg abnormalities, sleep difficulties, autonomic disturbances, autistic features, stereotyped behaviors, hand-wringing, onset in infancy, seizures are usually refractory, males are more severely affected, dysmorphic facial features are subtle, some phenotypic overlap with rett syndrome]"
4,DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 36; DEE36,"[microcephaly, dysmorphic facial features, coarse features, low-set ears, hypertelorism, poor eye contact, cortical visual impairment, optic nerve atrophy, nystagmus, swelling of the eyelids, upturned nose, high-arched palate, hepatomegaly, feeding difficulties, gastroesophageal reflux, tube feeding, joint contractures, osteopenia, scoliosis, swelling of the hands, swelling of the feet, seizures refractory, delayed psychomotor development, developmental regression after onset of seizures, intellectual disability severe, poor or absent speech, motor difficulties, inability to walk, hypsarrhythmia seen on eeg, multifocal discharges seen on eeg, hypotonia, hydrocephalus, cerebral atrophy, delayed myelination, extrapyramidal signs, pyramidal signs, prolonged appt, coagulation defects, recurrent infections, serum transferrin n-glycosylation defect consistent with type i cdg, decreased alg13 activity, onset in infancy, mean onset of seizures at 65 months of age, most patient are female, rare patients are male]"


## Step 2. Synthetic Database and training set split

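1. Synthetic dataset: for each syndrome, create synthetic records by sampling 3 to 6 of its true symptoms (20 draws per core size) and mixing in 1 to 3 noise symptoms that are not among its true symptoms.

2. Train/Val Split
- Per-syndrome split
- Shuffle all of the syndrome's synthetic samples
- train_df: ~80% per syndrome
- val_df: ~20% per syndrome
- Duplicate symptom sets are dropped at generation time, so the same sample never lands in both subsets
- Both subsets are non-empty whenever a syndrome has at least two samples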
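
The split properties listed above can be verified after the fact. The helper below is a hypothetical sanity check, not part of the original pipeline: `check_split` is an invented name, and it assumes each row is a dict with `diagnosis` and `symptom_text` keys, the same fields the synthetic rows carry.

```python
from collections import Counter

def check_split(train_rows, val_rows):
    """Return (syndromes present in only one subset, samples present in both)."""
    train_counts = Counter(r["diagnosis"] for r in train_rows)
    val_counts = Counter(r["diagnosis"] for r in val_rows)
    # Symmetric difference: syndromes that are missing from train or from val
    one_sided = sorted(set(train_counts) ^ set(val_counts))

    train_samples = {(r["diagnosis"], r["symptom_text"]) for r in train_rows}
    val_samples = {(r["diagnosis"], r["symptom_text"]) for r in val_rows}
    # Exact-text overlap between the two subsets would be leakage
    shared = sorted(train_samples & val_samples)
    return one_sided, shared

# Toy rows (illustrative only)
train = [
    {"diagnosis": "A", "symptom_text": "x, y"},
    {"diagnosis": "B", "symptom_text": "z"},
]
val = [{"diagnosis": "A", "symptom_text": "x, w"}]

one_sided, shared = check_split(train, val)
print(one_sided, shared)
```

With pandas DataFrames, the same check could be run as `check_split(train_df.to_dict("records"), val_df.to_dict("records"))`. Note that it compares exact strings only, so two samples with the same symptoms in a different order would not be flagged.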

In [9]:
# import math, random
# import pandas as pd
# from tqdm.auto import tqdm

# # Keep the expanded noise list (it's good, just wasn't used correctly before)
# # NOISE_SYMPTOMS = [
# #     # General / Constitutional
# #     "fatigue", "fever", "lethargy", "poor weight gain", "failure to thrive",
# #     "feeding difficulties", "vomiting", "constipation", "gastroesophageal reflux",
# #     "sleep disturbance", "irritability", "dehydration",

# #     # Neurological / Developmental
# #     "developmental delay", "global developmental delay", "hypotonia",
# #     "speech delay", "motor delay", "poor head control",

# #     # Other Systems
# #     "recurrent infections", "short stature", "strabismus", "scoliosis"
# # ]


# # NOISE_SYMPTOMS = [
# #     # keep generic stuff
# #     "fatigue", "fever", "poor weight gain", "failure to thrive",
# #     "feeding difficulties", "vomiting", "constipation",
# #     "gastroesophageal reflux", "sleep disturbance", "irritability",
# #     "dehydration", "recurrent infections",
# #     "short stature", "strabismus", "scoliosis",
# # ]

# NOISE_SYMPTOMS = [
#     "fatigue", "feeding difficulties", "vomiting", "sleep disturbance",
#     "poor weight gain", "fever", "irritability"
# ]



# # Strictly Raw Templates (as discussed)
# TEMPLATES = [
#     "{symptoms}",
#     "{symptoms}."
# ]

# def format_symptom_list(sym_list):
#     if not sym_list: return ""
#     if len(sym_list) == 1: return sym_list[0]
#     # Mix of "and" vs "comma-only"
#     return ", ".join(sym_list[:-1]) + " and " + sym_list[-1] if random.random() < 0.5 else ", ".join(sym_list)

# def pick_k(n, p, rounding="ceil"):
#     k = math.ceil(p * n) if rounding == "ceil" else round(p * n)
#     if p > 0 and k == 0: k = 1
#     return max(0, min(k, n))

# def make_synthetic_rows(
#     row,
#     percents=(0.2, 0.3, 0.4, 0.5, 0.6),
#     reps_per_level=20,
#     include_full=False,
#     max_noise=3,
# ):
#     syndrome = row["syndrome"]
#     base_symptoms = list(row["symptoms"])
#     n = len(base_symptoms)

#     out = []
#     seen = set()

#     for p in percents:
#         k = pick_k(n, p)
#         for _ in range(reps_per_level):
#             core = random.sample(base_symptoms, k=k)
#             n_noise = random.randint(0, max_noise)
#             noise = random.sample(NOISE_SYMPTOMS, k=n_noise)

#             all_syms = core + noise

#             # ✅ CRITICAL FIX: Shuffle the list so noise isn't always at the end
#             random.shuffle(all_syms)

#             # Deduplicate based on content
#             key = tuple(sorted(all_syms))
#             if key in seen: continue
#             seen.add(key)

#             # Format
#             sym_text = format_symptom_list(all_syms)
#             template = random.choice(TEMPLATES)
#             final_text = template.format(symptoms=sym_text)

#             out.append({
#                 "symptom_text": final_text,
#                 "diagnosis": syndrome,
#                 "pct": p,
#             })

#     if include_full:
#         # For the full list, we also shuffle it so the model doesn't memorize OMIM order
#         full_shuffled = base_symptoms[:]
#         random.shuffle(full_shuffled)
#         sym_text = format_symptom_list(full_shuffled)

#         out.append({
#             "symptom_text": sym_text,
#             "diagnosis": syndrome,
#             "pct": 1.0,
#         })

#     return out

In [10]:
import math, random
import pandas as pd
from tqdm.auto import tqdm


# core: 3–6 original symptoms
# noise: 1–3 symptoms
# noise never duplicates a true symptom


NOISE_SYMPTOMS = [
    "fatigue", "sleep disturbance", "irritability", "poor weight gain",
    "feeding difficulties", "fever", "vomiting", "constipation",
    "diarrhea", "headache", "abdominal pain", "runny nose",
    "cough", "rash", "muscle aches", "back pain",
    "dizziness", "blurred vision", "joint pain", "mild tremor",
    "mood changes", "anxiety", "difficulty concentrating",
    "mild developmental delay", "clumsiness"
]

TEMPLATES = [
    "{symptoms}",
    "{symptoms}."
]

def format_symptom_list(sym_list):
    if not sym_list:
        return ""
    if len(sym_list) == 1:
        return sym_list[0]
    return (
        ", ".join(sym_list[:-1]) + " and " + sym_list[-1]
        if random.random() < 0.5
        else ", ".join(sym_list)
    )


def make_synthetic_rows(
    row,
    min_core=3,
    max_core=6,
    reps_per_core=20,
    max_noise=3,      # 1–3 noise symptoms
    include_full=False,
):
    """
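    For each syndrome with n true symptoms:
      - for each core size k from min_core to min(max_core, n), draw
        reps_per_core random subsets of k true symptoms
      - add 1 to max_noise noise symptoms that are not true symptoms
      - shuffle, and skip any symptom set that was already generated
      - pct = len(core) / n
    A syndrome with fewer than min_core symptoms yields no rows, because
    the range of core sizes is then empty.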
    """
    syndrome = row["syndrome"]
    base_symptoms = list(row["symptoms"])
    n = len(base_symptoms)

    out = []
    seen = set()

    if n == 0:
        return out

    # Adjust core size bounds based on how many symptoms exist
    min_k = max(1, min_core)
    max_k = min(max_core, n)

    # noise pool that excludes true symptoms
    noise_pool = [s for s in NOISE_SYMPTOMS if s not in base_symptoms]

    for k in range(min_k, max_k + 1):
        for _ in range(reps_per_core):
            # pick k original symptoms
            core = random.sample(base_symptoms, k=k)

            # pick 1–3 noises, but not more than we actually have
            if noise_pool:
                n_noise = random.randint(1, min(max_noise, len(noise_pool)))
                noise = random.sample(noise_pool, k=n_noise)
            else:
                noise = []

            all_syms = core + noise
            random.shuffle(all_syms)

            # deduplicate by content
            key = tuple(sorted(all_syms))
            if key in seen:
                continue
            seen.add(key)

            sym_text = format_symptom_list(all_syms)
            template = random.choice(TEMPLATES)
            final_text = template.format(symptoms=sym_text)

            out.append({
                "symptom_text": final_text,
                "diagnosis": syndrome,
                "pct": k / n,  # fraction of original symptoms kept
            })

    if include_full:
        full_shuffled = base_symptoms[:]
        random.shuffle(full_shuffled)
        sym_text = format_symptom_list(full_shuffled)
        out.append({
            "symptom_text": sym_text,
            "diagnosis": syndrome,
            "pct": 1.0,
        })

    return out

In [11]:
# Regenerate synthetic dataset with a progress monitor
from tqdm.auto import tqdm

print(f"Regenerating data with {len(NOISE_SYMPTOMS)} noise terms and max_noise={3}...")
synthetic_rows = []
for _, row in tqdm(df_base.iterrows(), total=len(df_base)):
    synthetic_rows.extend(make_synthetic_rows(row))

df_syn_noice = pd.DataFrame(synthetic_rows)

print(f"Generated {len(df_syn_noice)} rows.")
print("Sample:", df_syn_noice["symptom_text"].iloc[5])

Regenerating data with 25 noise terms and max_noise=3...


  0%|          | 0/1660 [00:00<?, ?it/s]

Generated 132003 rows.
Sample: feeding difficulties, poor weight gain, aggression, seizure onset at a mean of 14 months and seizures partial.
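
As a rough cross-check of the row count, the sketch below computes the most rows the generator can emit for a syndrome with `n` symptoms under its default arguments, before duplicates are skipped. The helper name `max_rows_per_syndrome` is hypothetical; the formula simply mirrors the two nested loops (core sizes times repetitions).

```python
def max_rows_per_syndrome(n, min_core=3, max_core=6, reps_per_core=20):
    """Upper bound on generated rows for n true symptoms (before deduplication)."""
    # Number of core sizes k in range(max(1, min_core), min(max_core, n) + 1)
    n_sizes = max(0, min(max_core, n) - max(1, min_core) + 1)
    return n_sizes * reps_per_core

print(max_rows_per_syndrome(22))         # 80
print(max_rows_per_syndrome(4))          # 40
print(max_rows_per_syndrome(2))          # 0
print(1660 * max_rows_per_syndrome(22))  # 132800
```

With the 1660 syndromes shown by the progress bar, the ceiling is 132800 rows, so the 132003 rows reported above fall 797 short. The shortfall would come from syndromes with fewer than six symptoms and from skipped duplicates; the output alone does not show how it divides between the two.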


In [12]:
# # Export the synthetic cases
# out_path = folder / 'BioBERT+synthetic_cases_Attention' / 'synthetic_101x_back2normalnoise_166051.csv'
# jsonl_path = folder / 'BioBERT+synthetic_cases_Attention' / 'synthetic_101x_back2normalnoise_166051.jsonl'

# # Make sure the subfolder exists (optional but nice)
# out_path.parent.mkdir(parents=True, exist_ok=True)
# jsonl_path.parent.mkdir(parents=True, exist_ok=True)

# # Save to CSV
# df_syn_noice.to_csv(out_path, index=False)

# # Save to jsonl
# df_syn_noice.to_json(jsonl_path, orient="records", lines=True, force_ascii=False)

# # Print out results
# print("dataset size:  ", len(df_syn_noice))
# print("Saved csv to:", out_path)
# print("Saved JSONL to:", jsonl_path)

In [13]:
import math, random
from collections import defaultdict

GLOBAL_SEED = 42
random.seed(GLOBAL_SEED)

# Group by syndrome
group_by_syndrome = defaultdict(list)
for _, row in df_syn_noice.iterrows():
    group_by_syndrome[row["diagnosis"]].append(row)

train_rows = []
val_rows = []

for syndrome, rows in group_by_syndrome.items():
    # local RNG per syndrome; a string seed is hashed deterministically,
    # whereas built-in hash() of a str changes between Python processes
    rnd = random.Random(f"{GLOBAL_SEED}:{syndrome}")

    # shuffle copies
    shuffled = rows[:]
    rnd.shuffle(shuffled)

    n = len(shuffled)

    # 80/20 split (floor to train), minimum one sample each if possible
    n_train = max(1, int(math.floor(0.8 * n))) if n > 1 else 1
    if n >= 2 and n_train == n:
        n_train = n - 1

    # split
    train_rows.extend(shuffled[:n_train])
    val_rows.extend(shuffled[n_train:])


train_df = pd.DataFrame(train_rows)
val_df = pd.DataFrame(val_rows)

print("Train size:", len(train_df))
print("Val size:  ", len(val_df))
train_df

Train size: 105591
Val size:   26412


Unnamed: 0,symptom_text,diagnosis,pct
16,"obsessive features, seizures often associated with fever, normal development in, rash",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.136364
75,"seizures partial, delayed development variable severity from birth in, cough, obsessive features, carrier males show controlling and inflexible traits, intellectual disability is variable, status epilepticus",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.272727
74,"intellectual disability is variable, abdominal pain, obsessive features, normal development in, irritability, status epilepticus, rash, carrier males show controlling and inflexible traits and delayed development variable severity from birth in.",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.272727
40,"clumsiness, aggression, seizure onset at a mean of 14 months, have cessation of seizures at a mean of 12 years, developmental regression in about 50 of patients, status epilepticus, sleep disturbance, muscle aches",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.227273
70,"diarrhea, have cessation of seizures at a mean of 12 years, abdominal pain, normal development in, psychosis, seizure onset at a mean of 14 months, delayed development variable severity from birth in, carrier males show controlling and inflexible traits.",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.272727
...,...,...,...
131999,"variable manifestations, irritability, onset in infancy, prominent nose, poor overall growth, absent speech, inability to walk and back pain.","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.176471
131978,"hypertonia, poor weight gain, impaired intellectual development mild to severe, motor tics, onset in infancy and poor speech.","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.147059
131989,"tube feeding, poor attention span, hypotonia, deep-set eyes, happy demeanor, dysarthria and sleep disturbance","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.176471
131929,"impaired intellectual development mild to severe, motor tics, sleep disturbance, hypotelorism, runny nose.","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.088235


In [14]:
# # Save training/testing set
# train_path = folder / 'BioBERT+synthetic_cases_Attention' / 'train_101x_back2normalnoise_131724.jsonl'
# val_path   = folder / 'BioBERT+synthetic_cases_Attention' / 'val_101x_back2normalnoise_34327.jsonl'

# train_df.to_json(train_path, orient="records", lines=True, force_ascii=False)
# val_df.to_json(val_path, orient="records", lines=True, force_ascii=False)

# print("Saved:")
# print("  Train JSONL →", train_path)
# print("  Val JSONL   →", val_path)
# print("Train size:", len(train_df))
# print("Val size:  ", len(val_df))

## Step 3. Training preparation


1.   Tokenizer & Dataset Class
- Uses the dmis-lab/biobert-base-cased-v1.1 or emilyalsentzer/Bio_ClinicalBERT checkpoint (loaded in Step 4)

In [15]:
# Tokenizer & Dataset Class - using dmis-lab/biobert-base-cased-v1.1 or Bio_ClinicalBERT model

# # Load data from google drive
# from google.colab import drive
# drive.mount('/content/drive')

# from pathlib import Path
# import pandas as pd

# folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project')
# folder.mkdir(parents=True, exist_ok=True)


# train_path = folder / 'BioBERT+synthetic_cases_Attention' / 'train_51x_noise_65911.jsonl'
# val_path   = folder / 'BioBERT+synthetic_cases_Attention' / 'val_51x_noise_17955.jsonl'

# # Load jsonl files
# train_df = pd.read_json(train_path, lines=True)
# val_df = pd.read_json(val_path, lines=True)

# print("Train_df shape:", train_df.shape)
# print("Val_df shape:  ", val_df.shape)

# train_df

In [16]:
# Create label from diagnosis:
# Combine to build a consistent mapping
all_df = pd.concat([train_df, val_df], ignore_index=True)

# Build mapping diagnosis → integer ID
labels = sorted(all_df["diagnosis"].unique())
label2id = {lab: i for i, lab in enumerate(labels)}
id2label = {i: lab for lab, i in label2id.items()}

# Add numeric label column
train_df["label"] = train_df["diagnosis"].map(label2id)
val_df["label"]   = val_df["diagnosis"].map(label2id)

train_df

Unnamed: 0,symptom_text,diagnosis,pct,label
16,"obsessive features, seizures often associated with fever, normal development in, rash",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.136364,452
75,"seizures partial, delayed development variable severity from birth in, cough, obsessive features, carrier males show controlling and inflexible traits, intellectual disability is variable, status epilepticus",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.272727,452
74,"intellectual disability is variable, abdominal pain, obsessive features, normal development in, irritability, status epilepticus, rash, carrier males show controlling and inflexible traits and delayed development variable severity from birth in.",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.272727,452
40,"clumsiness, aggression, seizure onset at a mean of 14 months, have cessation of seizures at a mean of 12 years, developmental regression in about 50 of patients, status epilepticus, sleep disturbance, muscle aches",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.227273,452
70,"diarrhea, have cessation of seizures at a mean of 12 years, abdominal pain, normal development in, psychosis, seizure onset at a mean of 14 months, delayed development variable severity from birth in, carrier males show controlling and inflexible traits.",DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 9; DEE9,0.272727,452
...,...,...,...,...
131999,"variable manifestations, irritability, onset in infancy, prominent nose, poor overall growth, absent speech, inability to walk and back pain.","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.176471,827
131978,"hypertonia, poor weight gain, impaired intellectual development mild to severe, motor tics, onset in infancy and poor speech.","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.147059,827
131989,"tube feeding, poor attention span, hypotonia, deep-set eyes, happy demeanor, dysarthria and sleep disturbance","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.176471,827
131929,"impaired intellectual development mild to severe, motor tics, sleep disturbance, hypotelorism, runny nose.","INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 74; MRD74",0.088235,827
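Because the mapping above is built from the sorted set of diagnosis strings, the integer IDs depend only on which diagnoses are present, not on row order. A minimal sketch of that property, using made-up placeholder names (`SYNDROME A` etc. and `build_label_maps` are hypothetical, not part of the notebook):

```python
import random

def build_label_maps(diagnoses):
    """Build diagnosis -> id and id -> diagnosis maps from sorted unique names."""
    labels = sorted(set(diagnoses))
    label_to_id = {lab: i for i, lab in enumerate(labels)}
    id_to_label = {i: lab for lab, i in label_to_id.items()}
    return label_to_id, id_to_label

# Toy diagnosis names (placeholders, not from the dataset).
toy_diagnoses = ["SYNDROME B", "SYNDROME A", "SYNDROME C", "SYNDROME A"]
toy_label2id, toy_id2label = build_label_maps(toy_diagnoses)

# Shuffling the rows must not change the mapping.
shuffled = toy_diagnoses[:]
random.Random(0).shuffle(shuffled)
shuffled_label2id, _ = build_label_maps(shuffled)

print(toy_label2id)
print("Same mapping after shuffling:", toy_label2id == shuffled_label2id)
```

The mapping does change if a diagnosis is added or removed, so a saved model should always be reloaded together with the `label2id` / `id2label` it was trained with.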


## Step 4. Model Training


1.   BioBERT Sequence Classifier
2.   Training process
- Training Setup (Trainer API)
- Evaluation on Test Set
- Top-K Metrics (Top-1, Top-10, Top-20)

In [17]:
from transformers import AutoTokenizer
import torch

# Load Bio_ClinicalBERT tokenizer
model_name = "emilyalsentzer/Bio_ClinicalBERT"  # or BioBERT if you prefer
# model_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len = 128  # should be plenty for symptom lists


# Build the PyTorch Dataset class
class SymptomDataset(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text = str(self.df.loc[idx, "symptom_text"])
        label = int(self.df.loc[idx, "label"])

        enc = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )

        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }


# Create train & validation dataset objects
train_ds = SymptomDataset(train_df, tokenizer, max_len)
val_ds   = SymptomDataset(val_df,   tokenizer, max_len)

# Print out results
len(train_ds), len(val_ds)

(105591, 26412)
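The cell above sets `max_len = 128` on the assumption that symptom lists are short; any longer text is silently truncated by the tokenizer. A rough sketch for estimating how many texts exceed a token budget is shown below. `truncation_rate` and `count_tokens` are hypothetical helpers: the default counter is a plain whitespace split, which undercounts WordPiece tokens, so in the notebook it could be replaced by something like `lambda t: len(tokenizer(t)["input_ids"])`.

```python
def truncation_rate(texts, max_len, count_tokens=lambda t: len(t.split())):
    """Return the fraction of texts whose token count exceeds max_len."""
    if not texts:
        return 0.0
    too_long = sum(count_tokens(t) > max_len for t in texts)
    return too_long / len(texts)

# Toy inputs: two short symptom lists and one artificially long text.
toy_texts = [
    "seizures, hypotonia, sleep disturbance",
    "word " * 200,
    "developmental delay, absent speech",
]

rate = truncation_rate(toy_texts, max_len=128)
print(f"{rate:.1%} of texts exceed the token budget")
```

Running this on both the synthetic training text and the real evaluation text would show whether the two sets are affected by truncation to a similar degree.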

In [18]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [19]:
from transformers import AutoModelForSequenceClassification

num_labels = len(labels) # from the label2id mapping

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [20]:
# !pip install -U transformers accelerate datasets

from transformers import TrainingArguments, Trainer
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score

# ---------- Top-K Metrics ----------
def topk_accuracy_np(logits, labels, k):
    logits_t = torch.tensor(logits)
    probs = torch.softmax(logits_t, dim=1)
    k = min(k, probs.shape[1])
    topk_idx = torch.topk(probs, k=k, dim=1).indices.numpy()

    labels = np.array(labels)
    correct = 0
    for i in range(len(labels)):
        if labels[i] in topk_idx[i]:
            correct += 1
    return correct / len(labels)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "top1":  topk_accuracy_np(logits, labels, 1),
        "top10": topk_accuracy_np(logits, labels, 10),
        "top20": topk_accuracy_np(logits, labels, 20),
    }
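The Top-K metric above loops over rows in Python. Since softmax preserves the ordering of the logits, the same hit rate can be computed directly on the logits in vectorized NumPy. The following is only a sketch of that alternative (`topk_accuracy_vectorized` is a hypothetical name, and the logits are toy values):

```python
import numpy as np

def topk_accuracy_vectorized(logits, labels, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    k = min(k, logits.shape[1])
    # argpartition puts the k largest logits (smallest negated values) first.
    topk_idx = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    hits = (topk_idx == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 samples, 3 classes.
toy_logits = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.9],
    [0.0, 0.1, 3.0],
]
toy_labels = [1, 2, 2]

toy_top1 = topk_accuracy_vectorized(toy_logits, toy_labels, 1)
toy_top2 = topk_accuracy_vectorized(toy_logits, toy_labels, 2)
print(f"Top-1: {toy_top1:.4f}  Top-2: {toy_top2:.4f}")
```

With 26,412 validation rows this avoids the per-row Python loop each time `compute_metrics` is called.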

In [21]:
# ---------- Training Arguments (minimal, no evaluation_strategy) ----------
training_args = TrainingArguments(
    output_dir="/content/bioclinicalbert_epilepsy_results",

    num_train_epochs=3,                 # can bump to 4–5 later
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,

    learning_rate=2e-5,
    weight_decay=0.01,

    logging_steps=200,
    report_to="none",
    fp16=torch.cuda.is_available(),     # mixed precision if GPU
)

# ---------- Trainer ----------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,    # still used when we call trainer.evaluate()
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer

<transformers.trainer.Trainer at 0x790c0dda0cb0>

**Using CrossEntropyLoss**

For a random classifier over \(1660\) classes:

$$
P(\text{correct}) = \frac{1}{1660} \approx 0.0006
$$

$$
\text{loss}_{\text{random}}
= -\log\!\left(\frac{1}{1660}\right)
= \log(1660)
\approx 7.41
$$

PyTorch's `CrossEntropyLoss` uses the natural log, so with \(1660\) labels an untrained classifier is expected to start at a loss of about \(7.4\).

The training log:

- starts around \(\mathbf{7.48}\) (very close to \(7.41\)),
- then steadily goes down to \(\sim 5.5\) and keeps decreasing, which is the expected behavior for a model that is learning.
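The baseline figure can be reproduced in a couple of lines. This sketch takes the class count of 1660 from the note above and the first logged loss from the training output; neither is read from the model here.

```python
import math

num_classes = 1660          # class count quoted in the note above
first_logged_loss = 7.4809  # training loss at step 200

# Cross-entropy of a uniform prediction over num_classes labels (natural log).
random_loss = math.log(num_classes)
gap = first_logged_loss - random_loss

print(f"Uniform-guess loss: {random_loss:.4f}")
print(f"First logged loss is {gap:.4f} above that baseline")
```

If the number of labels changes (for example after adding syndromes), the expected starting loss moves with `log(num_classes)`, so the same check applies.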



In [22]:
trainer.train()

Step,Training Loss
200,7.4809
400,7.4538
600,7.4431
800,7.4769
1000,7.4587
1200,7.4601
1400,7.4277
1600,7.409
1800,7.3873
2000,7.3644


TrainOutput(global_step=39597, training_loss=5.389325227238155, metrics={'train_runtime': 2689.9888, 'train_samples_per_second': 117.76, 'train_steps_per_second': 14.72, 'total_flos': 2.1146430025852416e+16, 'train_loss': 5.389325227238155, 'epoch': 3.0})
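As a sanity check, the reported `global_step` and throughput can be reproduced from the training set size and batch size. This sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept:

```python
import math

train_size = 105591      # rows in train_df
batch_size = 8           # per_device_train_batch_size
num_epochs = 3
train_runtime = 2689.9888  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = train_size * num_epochs / train_runtime

print("Steps per epoch:", steps_per_epoch)
print("Total steps:", total_steps)
print(f"Samples per second: {samples_per_second:.2f}")
```

Both values agree with the `TrainOutput` above (39597 steps, about 117.76 samples per second), which suggests all three epochs ran over the full training set.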

In [23]:
trainer.evaluate()

{'eval_loss': 4.219723701477051,
 'eval_accuracy': 0.40553536271391794,
 'eval_macro_f1': 0.35676893056865633,
 'eval_top1': 0.40553536271391794,
 'eval_top10': 0.6800318037255793,
 'eval_top20': 0.7549977283053158,
 'eval_runtime': 36.5211,
 'eval_samples_per_second': 723.197,
 'eval_steps_per_second': 45.207,
 'epoch': 3.0}

In [24]:
# Save the trained model and tokenizer
from pathlib import Path

folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project')
folder.mkdir(parents=True, exist_ok=True)

# save_dir = folder / "BioBERT+synthetic_cases_Attention" / "bioclinicalbert_attention_model"
save_dir = folder / "BioBERT+synthetic_cases_Attention" / "biobert_attention_model"
save_dir.mkdir(parents=True, exist_ok=True)

# Save model + tokenizer
trainer.save_model(str(save_dir))
tokenizer.save_pretrained(str(save_dir))

print("Model saved to:", save_dir)

Model saved to: /content/drive/MyDrive/Colab Notebooks/CS229/Final Project/BioBERT+synthetic_cases_Attention/biobert_attention_model


## Step 5. Evaluation

In [51]:
# Read evaluation data
eval_path = folder / '0-Baseline Setting' / 'test-cases-PUBMED-Stanford-combined-11-2025-deleted12cases-cleaned.xlsx'
df_eval = pd.read_excel(eval_path, sheet_name='Sheet1', engine='openpyxl')

# Change column names
df_eval = df_eval.rename(columns={"Symptoms": "symptom_text",
                                  "OMIM-Diagnosis": "diagnosis"})

# Check results
df_eval

Unnamed: 0,Case number,"Case Group (1=pubmed,2=Stanford)",Source Identifier,Number/PMID,symptom_text,OMIM link,diagnosis,Diagnosis-pubmedcases,Gene 1,Gene 1 Mutation
0,1,2,,lyg10,"tonic seizures, febrile seizures during, abnormal EEG, developmental delays, anxiety,Short statue, dysmorphic face,cleft palate",https://www.omim.org/clinicalSynopsis/611867?highlight=22q112,"CHROMOSOME 22q11.2 DELETION SYNDROME, DISTAL",,22q11.2,2.4Mb deletion
1,2,2,,lybb32,"seizure, developmental delay, speech delay, craniosynostosis, Chiari type 1 malformation, truncus arteriosus",https://www.omim.org/clinicalSynopsis/611867?highlight=22q112,"CHROMOSOME 22q11.2 DELETION SYNDROME, DISTAL",,22q11.2 deletion,
2,3,2,,lybb166,"sizure onset since infant, febrile seizure, intellectual disability, cerebral palsy, developmental delay, autism spectrum disorder, delayed motor development, speech delay, failure to thrive",https://www.omim.org/clinicalSynopsis/613443?highlight=5q143,"NEURODEVELOPMENTAL DISORDER WITH HYPOTONIA, STEREOTYPIC HAND MOVEMENTS, AND IMPAIRED LANGUAGE; NEDHSIL",,5q14.3,"5q14.3(87,148,105-88,698,587)x1 deletion"
3,4,2,,lybb132,"refractory epilepsy, intellectual disability, bilateral vision impairment, pachygyria, polymicrogyria,various seizure types, including suspected tonic and atonic seizures, as well as drop attacks, speech delay, walk with assistance",https://omim.org/clinicalSynopsis/606854?highlight=adgrg1,"CORTICAL DYSPLASIA, COMPLEX, WITH OTHER BRAIN MALFORMATIONS 14A (BILATERAL FRONTOPARIETAL); CDCBM14A",,ADGRG1,"c.407T>A (p.Leu136*),heterozygous,Pathogenic"
4,5,2,,lyd52,"intellectual disability, learning disability, Generalized Tonic Clonic seizure, atypical Absence seizure,\ntonic seizure",https://omim.org/entry/604352,"FEBRILE SEIZURES, FAMILIAL, 4; FEB4",,ADGRV1,"c.12211C>T (p.R4071X),heterozygous,Pathogenic"
...,...,...,...,...,...,...,...,...,...,...
191,202,1,"Macha et al., 2022",34617111,Impaired GABAergic inhibition; severe early onset DEE; deficits in postsynaptic clustering (Gephyrin); impairment of inhibitory signal transmission; Moco-deficiency; loss-of-activity in Moco-dependent enzymes,,"MELANOCYTIC NEVUS SYNDROME, CONGENITAL; CMNS",Developmental and epileptic encephalopathy (DEE); Dravet-like patient (referencing Geph G375D variant); genetic epilepsies,,
192,203,1,"Zeka et al., 2023",37732012,"Severe seizures (combined focal and generalized onset); metabolic dysfunction; neurodevelopmental abnormalities; epileptiform abnormality (EEG, inter-ictal from left frontal-central, ictal from left mid-temporal); volume loss in posterior periventricular area and parietal parenchyma (Brain MRI); myelin destruction (Brain MRI, no hypoxic involvement); left dominant enlargement of lateral ventricles secondary to loss of central parenchyma (Brain MRI); refractory seizures; profound developmental delay; microcephaly; neurodevelopmental regression; high mortality; SUDEP (sudden unexpected death in epilepsy)",,SIFRIM-HITZ-WEISS SYNDROME; SIHIWES,Sifrim–Hitz–Weiss syndrome (SIHIWES); Development and epileptic encephalopathy-14 (DEE14); Medium chain acyl-CoA dehydrogenase deficiency (MCADD); early infantile epileptic encephalopathy 52 (EIEE52); epilepsy of infancy migrating focal seizures (EIMFS); malignant migrating partial seizures of infancy (MMPSI),,
193,204,1,"Maroofian et al., 2021",32719099,"Parental consanguinity; sloping forehead; upslanting palpebral fissures; telecanthus; full cheeks; short philtrum; tented upper lip vermilion; pointed chin; pointed and indented helices; microcephaly; muscle wasting; distal contractures of upper extremities; diffuse cerebral atrophy (MRI, with loss of subcortical white matter); enlargement of ventricular system and subarachnoid spaces (MRI); corpus callosum hypoplasia (MRI); slowing of background cerebral activity (EEG); bilateral focal epileptic discharges (EEG); frequent multifocal epileptic discharges (EEG); recurrent seizures (at 4–5 months); tonic seizures; myoclonic seizures (including chin myoclonus); axial hypotonia; progressive spastic tetraplegia; hyperreflexia; high forehead; coarse faces; epicanthal folds; hypertelorism; long flat philtrum; puffy cheeks; open bite with thick alveolar ridge; gingival hyperplasia; strabismus; cortical blindness; slow-wave elements (EEG); low-voltage to medium-voltage epileptiform activity (EEG, predominant in frontal region bilaterally); multifocal origin (EEG); hippocampal atrophy; linear T2-weighted hyperintensities in lentiform nuclei; normal metabolic screening; apnoea; staring; head deviation; vomiting; eye deviation; twitching",,"NEURODEVELOPMENTAL DISORDER WITH HYPOTONIA, MICROCEPHALY, AND SEIZURES; NEDHYMS",Severe developmental and epileptic encephalopathy (DEE); ADARB1-related DEE,,
194,205,1,"Narayanan et al., 2023",34871784,Tented vermilion of upper lip; medical flaring of eyebrows; prominent frontotemporal subarachnoid spaces (MRI brain at five months); mild delayed myelination (axial T2-weighted MRI at five months); thin corpus callosum (MRI brain at one year); developmental delay (DD); intellectual disability (ID),,FIBROBLAST GROWTH FACTOR 13; FGF13,X-linked DEE90; developmental and epileptic encephalopathy,,


In [52]:
import numpy as np
from torch.utils.data import DataLoader

# 1. Split the dataframe
df_pubmed = df_eval[df_eval["Case Group (1=pubmed,2=Stanford)"] == 1].reset_index(drop=True)
df_eval = df_eval[df_eval["Case Group (1=pubmed,2=Stanford)"] == 2].reset_index(drop=True)

print(f"Stanford Cases (Group 2): {len(df_eval)}")

Stanford Cases (Group 2): 107


In [45]:
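# Keep PubMed cases only. Run this on the full df_eval as loaded from the Excel file;
# if df_eval was already filtered to Stanford cases, reload it first or this yields 0 rows.
df_eval = df_eval[df_eval["Case Group (1=pubmed,2=Stanford)"] == 1].reset_index(drop=True)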

df_eval["symptom_text"] = (
    df_eval["symptom_text"]
    .str.replace(";", ",", regex=False)   # or " and "
    .str.replace(r"\s+", " ", regex=True) # clean extra spaces
    .str.strip()
)

print(f"PubMed Cases (Group 1):   {len(df_eval)}")

PubMed Cases (Group 1):   89


In [46]:
# Map ground-truth diagnosis to label IDs
df_eval["label"] = df_eval["diagnosis"].map(label2id)

missing = df_eval[df_eval["label"].isna()]
print("Unmapped syndromes:", len(missing))
missing["diagnosis"].unique()[:10]

Unmapped syndromes: 8


array(['CORONARY ARTERY DISEASE, AUTOSOMAL DOMINANT 2; ADCAD2',
       'INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 39; MRD39',
       'INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 61; MRD61',
       'INFLAMMATORY BOWEL DISEASE (CROHN DISEASE) 1; IBD1',
       'MELANOCYTIC NEVUS SYNDROME, CONGENITAL; CMNS',
       'SIFRIM-HITZ-WEISS SYNDROME; SIHIWES',
       'FIBROBLAST GROWTH FACTOR 13; FGF13'], dtype=object)
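Evaluation cases whose diagnosis string is not a key in `label2id` get a NaN label and are dropped before scoring, so it helps to report label coverage explicitly. A minimal sketch with placeholder names (`label_coverage` and the `SYNDROME` strings are hypothetical, not from the data):

```python
def label_coverage(diagnoses, label_to_id):
    """Return (fraction of cases with a known label, sorted unmapped diagnosis names)."""
    if not diagnoses:
        return 0.0, []
    n_mapped = sum(d in label_to_id for d in diagnoses)
    unmapped = sorted({d for d in diagnoses if d not in label_to_id})
    return n_mapped / len(diagnoses), unmapped

# Toy training label space and toy evaluation diagnoses.
toy_train_label2id = {"SYNDROME A": 0, "SYNDROME B": 1}
toy_eval_diagnoses = ["SYNDROME A", "SYNDROME B", "SYNDROME Z", "SYNDROME Z", "SYNDROME A"]

coverage, unmapped_names = label_coverage(toy_eval_diagnoses, toy_train_label2id)
print(f"Coverage: {coverage:.1%}")
print("Unmapped:", unmapped_names)
```

In the run above, 8 of the 89 PubMed cases (7 distinct diagnoses) have no training label, so the reported accuracies are computed over the remaining 81 cases.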

In [47]:
# # Reload the saved model later
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# from pathlib import Path

# base_folder = Path('/content/drive/MyDrive/Colab Notebooks/CS229/Final Project')
# model_dir = base_folder / "BioBERT+synthetic_cases_Attention" / "bioclinicalbert_attention_model"

# tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = AutoModelForSequenceClassification.from_pretrained(model_dir)
# model.to(device)

In [48]:
# 1. Drop unlabeled rows from eval set
df_eval_clean = df_eval.dropna(subset=["label"]).copy()
print(len(df_eval), "→", len(df_eval_clean), "after dropping NaNs")

# 2. Make sure labels are ints
df_eval_clean["label"] = df_eval_clean["label"].astype(int)

# 3. Rebuild eval dataset + loader
eval_ds = SymptomDataset(df_eval_clean, tokenizer, max_len)

from torch.utils.data import DataLoader
eval_loader = DataLoader(eval_ds, batch_size=16)

all_labels = df_eval_clean["label"].tolist()

# 4. Run your manual eval loop again
model.eval()
model.to(device)

all_preds = []

with torch.no_grad():
    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = logits.cpu().numpy()
        all_preds.append(preds)

all_preds = np.vstack(all_preds)  # shape (N, num_labels)


89 → 81 after dropping NaNs


In [49]:
# # Tokenize evaluation set (just like train/test)
# eval_ds = SymptomDataset(df_eval, tokenizer, max_len)

# from torch.utils.data import DataLoader
# import torch
# import numpy as np

# eval_loader = DataLoader(eval_ds, batch_size=16)

# all_preds = []
# all_labels = df_eval["label"].tolist()

# model.eval()
# model.to(device)

# for batch in eval_loader:
#     input_ids = batch["input_ids"].to(device)
#     attention_mask = batch["attention_mask"].to(device)

#     with torch.no_grad():
#         outputs = model(input_ids=input_ids, attention_mask=attention_mask)
#         logits = outputs.logits  # (batch_size, num_classes)
#         preds = logits.cpu().numpy()
#         all_preds.append(preds)

# all_preds = np.vstack(all_preds)   # shape: (N, num_labels)

In [50]:
# Compute Top-K accuracy manually
def topk_accuracy(preds, labels, k):
    topk = np.argsort(preds, axis=1)[:, -k:]  # last k = highest scores
    correct = sum(labels[i] in topk[i] for i in range(len(labels)))
    return correct / len(labels)

top1 = topk_accuracy(all_preds, all_labels, 1)
top10 = topk_accuracy(all_preds, all_labels, 10)
top20 = topk_accuracy(all_preds, all_labels, 20)

top1, top10, top20

print("Real data evaluation:")
print(f"Top-1 accuracy:  {top1:.4f}")
print(f"Top-10 accuracy: {top10:.4f}")
print(f"Top-20 accuracy: {top20:.4f}")


Real data evaluation:
Top-1 accuracy:  0.0123
Top-10 accuracy: 0.0494
Top-20 accuracy: 0.0741
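With only 81 scored cases, each accuracy above corresponds to a handful of correct cases, and it is useful to compare against the chance level of a uniformly random ranking, which is `k / num_classes`. The sketch below assumes the 1660-class figure quoted in the cross-entropy note; the case count and accuracies are copied from the outputs above.

```python
n_cases = 81        # evaluation cases left after dropping unmapped labels
num_classes = 1660  # assumption: class count quoted in the cross-entropy note

reported_accuracy = {1: 0.0123, 10: 0.0494, 20: 0.0741}

hits_by_k = {}
for k, acc in reported_accuracy.items():
    hits = round(acc * n_cases)  # number of correct cases behind the rate
    chance = k / num_classes     # Top-k accuracy of a random ranking
    hits_by_k[k] = hits
    print(f"Top-{k}: {hits}/{n_cases} cases, chance level {chance:.4f}, "
          f"about {acc / chance:.1f}x chance")
```

At this sample size a single additional correct case moves each metric by roughly 1.2 percentage points, so small differences between runs should be read with caution.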


In [34]:
import torch
import pandas as pd
import numpy as np

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Let's look at the first 10 examples in the evaluation set
num_examples = 10
results = []

print(f"Inspecting first {num_examples} examples from Evaluation set (Checking Top-20 Accuracy):\n")

for i in range(num_examples):
    # Prepare input
    row = df_eval.iloc[i]
    text = str(row["symptom_text"])
    true_label_id = row["label"]
    true_diagnosis = row["diagnosis"]

    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )

    # Move inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)

        # Get Top-20 predictions
        top_k = 20
        top_probs, top_ids = torch.topk(probs, k=top_k, dim=1)

        # Check if correct label is in the top 20
        top_ids_list = top_ids[0].cpu().numpy()
        is_correct_top20 = true_label_id in top_ids_list

        # Get Top-1 for display
        top1_id = top_ids_list[0]
        top1_prob = top_probs[0][0].item()
        pred_diagnosis_top1 = id2label[top1_id]

    results.append({
        "Input Text (truncated)": text[:100] + "...",
        "True Diagnosis": true_diagnosis,
        "Predicted (Top-1)": pred_diagnosis_top1,
        "Confidence (Top-1)": f"{top1_prob:.4f}",
        "In Top-20?": "✅" if is_correct_top20 else "❌"
    })

# Display as a clean dataframe
results_df = pd.DataFrame(results)
display(results_df)

Inspecting first 10 examples from Evaluation set (Checking Top-20 Accuracy):



Unnamed: 0,Input Text (truncated),True Diagnosis,Predicted (Top-1),Confidence (Top-1),In Top-20?
0,"tonic seizures, febrile seizures during, abnormal EEG, developmental delays, anxiety,Short statue, ...","CHROMOSOME 22q11.2 DELETION SYNDROME, DISTAL",CHROMOSOME Xp11.23-p11.22 DUPLICATION SYNDROME,0.0203,❌
1,"seizure, developmental delay, speech delay, craniosynostosis, Chiari type 1 malformation, truncus ...","CHROMOSOME 22q11.2 DELETION SYNDROME, DISTAL",OCULOGASTROINTESTINAL NEURODEVELOPMENTAL SYNDROME; OGIN,0.019,❌
2,"sizure onset since infant, febrile seizure, intellectual disability, cerebral palsy, developmental d...","NEURODEVELOPMENTAL DISORDER WITH HYPOTONIA, STEREOTYPIC HAND MOVEMENTS, AND IMPAIRED LANGUAGE; NEDHSIL","LODDER-MERLA SYNDROME, TYPE 1, WITH IMPAIRED INTELLECTUAL DEVELOPMENT AND CARDIAC ARRHYTHMIA; LDMLS1",0.0323,❌
3,"refractory epilepsy, intellectual disability, bilateral vision impairment, pachygyria, polymicrogyri...","CORTICAL DYSPLASIA, COMPLEX, WITH OTHER BRAIN MALFORMATIONS 14A (BILATERAL FRONTOPARIETAL); CDCBM14A","POLYMICROGYRIA, BILATERAL PERISYLVIAN, X-LINKED; BPPX",0.023,❌
4,"intellectual disability, learning disability, Generalized Tonic Clonic seizure, atypical Absence se...","FEBRILE SEIZURES, FAMILIAL, 4; FEB4","EPILEPSY, IDIOPATHIC GENERALIZED, SUSCEPTIBILITY TO, 16; EIG16",0.0301,❌
5,"seizure, Developmental Delay, hypotonia, delayed speech, behavioral issues...",XIA-GIBBS SYNDROME; XIGIS,"INTELLECTUAL DEVELOPMENTAL DISORDER, AUTOSOMAL DOMINANT 52; MRD52",0.0243,❌
6,"global developmental delay, seizures, intellectual disability, migraines...",KBG SYNDROME; KBGS,PAROXYSMAL NONKINESIGENIC DYSKINESIA 2; PNKD2,0.0428,❌
7,"Developmental delay, Intellectual disability, clonic seizures, hypotonia, Autism spectrum disorder, ...",BAINBRIDGE-ROPERS SYNDROME; BRPS,"EPILEPSY, X-LINKED 1, WITH VARIABLE LEARNING DISABILITIES AND BEHAVIOR DISORDERS; EPILX1",0.0455,❌
8,"dystonia, episodes of hemiplegia, focal epilepsy, possible left mesial temporal sclerosis on MRI, ...",ALTERNATING HEMIPLEGIA OF CHILDHOOD 2; AHC2,"EPILEPSY, FAMILIAL ADULT MYOCLONIC, 5; FAME5",0.0337,❌
9,"seizures, Moya Moya disease, beta thalassemia,liver fibrosis,...",IMMUNODEFICIENCY 47; IMD47,GLYCOSYLPHOSPHATIDYLINOSITOL BIOSYNTHESIS DEFECT 1; GPIBD1,0.0367,❌


In [None]:
# # Individual checkpoint
# import pandas as pd
# import torch
# import re

# # 1. Create a lookup dictionary for canonical symptoms from the base data
# #    (Map: Diagnosis Name -> List of Symptoms)
# symptom_lookup = pd.Series(
#     df_base.symptoms.values, index=df_base.syndrome
# ).to_dict()

# def normalize_text(text):
#     """Simple normalization: lowercase, remove punctuation for matching."""
#     text = text.lower()
#     text = re.sub(r'[^a-z0-9\s]', ' ', text)
#     return set(text.split())

# def predict_and_analyze(input_text, ground_truth=None, top_k=5):
#     """
#     1. Run Model inference.
#     2. Show Prediction vs Ground Truth.
#     3. Show Canonical Symptoms for the predicted disease.
#     4. Show Matching Symptoms (overlap between input and canonical).
#     """
#     model.eval()

#     # --- 1. Inference ---
#     inputs = tokenizer(
#         input_text,
#         truncation=True,
#         padding="max_length",
#         max_length=128,
#         return_tensors="pt"
#     )
#     inputs = {k: v.to(device) for k, v in inputs.items()}

#     with torch.no_grad():
#         logits = model(**inputs).logits
#         probs = torch.softmax(logits, dim=1)
#         top_probs, top_ids = torch.topk(probs, k=top_k, dim=1)

#     # --- 2. Build Analysis Table ---
#     results = []
#     input_tokens = normalize_text(input_text)

#     print(f"\n📝 Input Symptoms: \"{input_text}\"")
#     if ground_truth:
#         print(f"🏆 Ground Truth:    {ground_truth}")

#         # Check if GT is in top_k
#         top_ids_list = top_ids[0].cpu().tolist()
#         gt_id = label2id.get(ground_truth)
#         if gt_id in top_ids_list:
#             rank = top_ids_list.index(gt_id) + 1
#             print(f"✅ Ground Truth found at Rank #{rank}")
#         else:
#             print(f"❌ Ground Truth NOT in Top-{top_k}")

#     print("-" * 80)

#     for i in range(top_k):
#         pred_id = top_ids[0][i].item()
#         prob = top_probs[0][i].item()
#         pred_name = id2label[pred_id]

#         # Retrieve Canonical Symptoms
#         canonical_symptoms = symptom_lookup.get(pred_name, [])

#         # Find "Matching" symptoms (Simple token overlap for display)
#         # (Checking which canonical symptoms appear in the input text)
#         matches = []
#         for sym in canonical_symptoms:
#             # simplistic check: if the symptom phrase (or part of it) is in input
#             sym_norm = normalize_text(sym)
#             if sym_norm & input_tokens: # if overlap exists
#                 matches.append(sym)

#         results.append({
#             "Rank": i+1,
#             "Confidence": f"{prob:.4f}",
#             "Predicted Diagnosis": pred_name,
#             "Matching Symptoms (Input ∩ Database)": ", ".join(matches) if matches else "(No direct text match)",
#             "Canonical Symptoms (First 5)": ", ".join(canonical_symptoms[:5]) + "..."
#         })

#     display(pd.DataFrame(results))

# # --- Example Usage using a case from the Eval set ---
# print("--- Test Run with an Example from the Evaluation Set ---")
# sample_row = df_eval.iloc[0] # Grab the first row
# predict_and_analyze(sample_row["symptom_text"], ground_truth=sample_row["diagnosis"])



---


# Appendix: Training Log
**Target:**
- Top-1: 10–20%
- Top-10: 40–60%
- Top-20: 60–80%




**Training log:**
1. 20 per p synthetic, emilyalsentzer/Bio_ClinicalBERT:
- Top-1 accuracy:  0.0104
- Top-10 accuracy: 0.0729
- Top-20 accuracy: 0.1354

2. Adding noise, 101x, PubMed cases only:
- Top-1 accuracy:  0.0227
- Top-10 accuracy: 0.1250
- Top-20 accuracy: 0.1932

3. Adding noise, 51x, Stanford cases only:
- Top-1 accuracy:  0.0294
- Top-10 accuracy: 0.1569
- Top-20 accuracy: 0.2157

4. Adding noise, 101x, Stanford cases only:
- Top-1 accuracy:  0.0686
- Top-10 accuracy: 0.2353
- Top-20 accuracy: 0.2745

5. Expanded noise list, 101x, Stanford cases only:
- Top-1 accuracy:  0.0490
- Top-10 accuracy: 0.2157
- Top-20 accuracy: 0.2843

6. Back to regular noise list, 101x, Stanford cases only:
- Top-1 accuracy:  0.0686
- Top-10 accuracy: 0.1667
- Top-20 accuracy: 0.2255

7. Back to regular noise list, 100x, 20%-60%, Stanford cases only:
- Top-1 accuracy:  0.0784
- Top-10 accuracy: 0.2843
- Top-20 accuracy: 0.3431

8. Back to regular noise list, 150x, 20%-60%, Stanford cases only:
- Top-1 accuracy:  0.0588
- Top-10 accuracy: 0.2353
- Top-20 accuracy: 0.2941

9. Regular noise list, 100x, 20%-60%, Stanford cases only, dmis-lab/biobert-base-cased-v1.1:
- Top-1 accuracy:  0.0882
- Top-10 accuracy: 0.1863
- Top-20 accuracy: 0.2941

10. Regular noise list, 100x, 3-6 original symptoms & 1-3 noises, Stanford cases only, Bio_ClinicalBERT:
- Top-1 accuracy:  0.0588
- Top-10 accuracy: 0.2941
- Top-20 accuracy: 0.3529

11. Regular noise list, 100x, 3-6 original symptoms & 1-3 noises, PubMed cases only, Bio_ClinicalBERT:
- Top-1 accuracy:  0.0123
- Top-10 accuracy: 0.0370
- Top-20 accuracy: 0.0741

12. Regular noise list, 100x, 3-6 original symptoms & 1-3 noises, PubMed cases only, Bio_ClinicalBERT, with ";" replaced by ",":
- Top-1 accuracy:  0.0123
- Top-10 accuracy: 0.0494
- Top-20 accuracy: 0.0741

**Next step:**
- ✅ add more synthetic cases per syndrome (×30 or ×40): tried 30, which performed worse than 20
- keep only 3-5 true symptoms per syndrome and train
- ✅ change the correctness criterion from Top-1 to Top-20
- ✅ set up an individual checkpoint that takes free-text symptoms and outputs prediction vs. ground truth, with the associated symptoms and the matching symptoms
- try BioBERT + LLaMA
- try ChatGPT for embeddings


In [None]:
# 💤 Auto-Shutdown Script
# This will disconnect the runtime immediately when executed.
# Make sure all training and saving is done before this runs!

from google.colab import runtime
print("Job finished. Shutting down runtime to save compute units...")
runtime.unassign()