# LLM Tutorial

**Objective**: Predict whether a COVID‑19 patient had a severe outcome (an inpatient or emergency encounter) within 30 days of COVID onset.

**Data Source**: SyntheticMass [COVID-19 10K, CSV](https://mitre.box.com/shared/static/9iglv8kbs1pfi7z8phjl9sbpjk08spze.zip)

## Import Dataset

We'll use three tables: `patients`, `encounters` and `conditions`.

In [68]:
import pandas as pd

patients = pd.read_csv("../10k_synthea_covid19_csv/patients.csv")
encounters = pd.read_csv("../10k_synthea_covid19_csv/encounters.csv")
conditions = pd.read_csv("../10k_synthea_covid19_csv/conditions.csv")

## Prepare Dataset

Prepare the dataset for use in our prompts:

- **Make dates easier to work with**: Convert date strings to datetime
- **Determine first onset of COVID**: Identify patients with COVID conditions and determine the date of first onset
- **Label training examples**: by whether or not they had a severe encounter within 30 days of first onset of COVID
- **Build Featureset**: Identify features for each training example based on demographic and comorbidity info
- **Convert Dataframe to Text**: Convert examples from dataframe rows to a consistent text format "Patient Cards"

### Make dates easier to work with

In [69]:
parse_dates = ["START","STOP","DATE","BIRTHDATE","DEATHDATE","ONSET","RECORDED_DATE"]

# normalize date columns
for df in (patients, encounters, conditions):
    for c in [c for c in parse_dates if c in df.columns]:
        df[c] = pd.to_datetime(
            df[c],
            errors="coerce",
            utc=True,
        ).dt.tz_localize(None)

### Determine first onset of COVID per patient

In [70]:
covid_cond = conditions[
    conditions["DESCRIPTION"].str.contains("COVID", case=False, na=False)
].copy()
first_onset = covid_cond.groupby("PATIENT")["START"].min().rename("COVID_ONSET")

### Label training examples

In [71]:
# Join encounters with the date of first COVID onset for each patient
enc_join_covid = encounters.merge(
    first_onset,
    left_on="PATIENT",
    right_index=True,
    how="inner")


# determine if each row is within the 30 day window of the
# first onset of COVID
window = (
    enc_join_covid["START"] >= enc_join_covid["COVID_ONSET"]
) & (
    enc_join_covid["START"] <= enc_join_covid["COVID_ONSET"] +
    pd.Timedelta(days=30)
)

# determine if encounters marked within the window are severe
severe_hit = enc_join_covid.loc[
    window & enc_join_covid["ENCOUNTERCLASS"].str.lower().isin(
        ["inpatient","emergency"]
    )
].groupby("PATIENT").size().gt(0)

# create a label for each patient based on whether they had a severe encounter
# within 30 days of first onset of COVID
label = severe_hit.reindex(
    first_onset.index,
    fill_value=False
).astype(
        int
).rename("severe_30d")
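To make the labeling rule concrete, here is a minimal pure-Python sketch of the same window test on hand-made records. The helper name `severe_within_window` and the sample encounters are illustrative assumptions, not part of the notebook. The rule itself (an inpatient or emergency encounter starting between onset and onset + 30 days, inclusive) mirrors the cell above.

```python
from datetime import datetime, timedelta

SEVERE_CLASSES = {"inpatient", "emergency"}

def severe_within_window(onset, encounters, days=30):
    """Return 1 if any severe encounter starts in [onset, onset + days], else 0."""
    end = onset + timedelta(days=days)
    for start, encounter_class in encounters:
        if onset <= start <= end and encounter_class.lower() in SEVERE_CLASSES:
            return 1
    return 0

onset = datetime(2020, 3, 1)
# hypothetical encounters as (start, ENCOUNTERCLASS) pairs
patient_a = [(datetime(2020, 3, 10), "inpatient")]    # severe, inside the window
patient_b = [(datetime(2020, 2, 20), "emergency"),    # severe, but before onset
             (datetime(2020, 3, 5), "ambulatory")]    # inside the window, not severe
patient_c = [(datetime(2020, 4, 15), "Emergency")]    # severe, after the window closes

labels = [severe_within_window(onset, p) for p in (patient_a, patient_b, patient_c)]
print(labels)
```

Only the first patient is labeled positive: the other two either have no severe encounter or have one outside the 30-day window.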

### Build Featureset

In [72]:
# determine age at onset
age_at_onset = (
    (
        first_onset - first_onset.to_frame().join(
            patients.set_index("Id"),
            how="left")["BIRTHDATE"]
    ).dt.days / 365.25
).round()

# get a list of comorbidities
COMORBID_KEYWORDS = [
    "diabetes",
    "hypertension",
    "asthma",
    "copd",
    "coronary artery",
    "heart failure",
    "obesity",
    "chronic kidney",
    "ckd",
    "cancer",
    "immunodefic",
    "hyperlipid"
]
# merge other conditions with first onset of COVID
pre = conditions.merge(
    first_onset,
    left_on="PATIENT",
    right_index=True,
    how="inner"
)
# filter out conditions that occurred after onset of COVID
pre = pre[pre["START"] < pre["COVID_ONSET"]]
# for each patient, create column with flags for each comorbidity
# indicating whether the patient has the comorbidity before onset of COVID
flags = pre.assign(**{
        kw: pre["DESCRIPTION"].str.contains(kw, na=False) for kw in COMORBID_KEYWORDS
}).groupby("PATIENT")[COMORBID_KEYWORDS].max().astype(bool)

# construct our dataset
frame = pd.DataFrame(index=first_onset.index)
frame["age_at_onset"] = age_at_onset.astype("Int64")
frame["gender"] = patients.set_index("Id").reindex(frame.index)["GENDER"]
# make sure flags has an entry for every patient in frame
flags = flags.reindex(frame.index, fill_value=False)
frame = frame.join(flags, how="left", on="PATIENT")
frame = frame.join(label, how="left").dropna(subset=["severe_30d"])

frame.head()

PATIENT,age_at_onset,gender,diabetes,hypertension,asthma,copd,coronary artery,heart failure,obesity,chronic kidney,ckd,cancer,immunodefic,hyperlipid,severe_30d
0000b247-1def-417a-a783-41c8682be022,12,F,False,False,False,False,False,False,False,False,False,False,False,False,0
00049ee8-5953-4edd-a277-b9c1b1a7f16b,35,M,True,False,False,False,False,False,True,False,False,False,False,False,0
00079a57-24a8-430f-b4f8-a1cf34f90060,29,F,False,False,False,False,False,False,False,False,False,False,False,False,1
0008a63c-c95c-46c2-9ef3-831d68892019,27,M,False,False,False,False,False,False,True,False,False,False,False,False,1
00093cdd-a9f0-4ad8-87e9-53534501f008,67,F,False,False,False,False,False,False,True,False,False,False,False,False,1
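One thing worth checking in the flags above: `Series.str.contains` is case-sensitive by default and matches substrings, so a lowercase keyword misses a capitalized description, and `diabetes` also matches `Prediabetes`. The plain-Python sketch below mimics both behaviours. The description strings are made up for illustration (they are not rows from the dataset), and `flag` is a hypothetical helper.

```python
def flag(descriptions, keyword, ignore_case=False):
    """Return True if any description contains the keyword as a substring."""
    if ignore_case:
        keyword = keyword.casefold()
        descriptions = [d.casefold() for d in descriptions]
    return any(keyword in d for d in descriptions)

# made-up condition descriptions for one patient
history = ["Hypertension", "Prediabetes", "Chronic sinusitis (disorder)"]
keywords = ("hypertension", "diabetes")

case_sensitive = {kw: flag(history, kw) for kw in keywords}
case_insensitive = {kw: flag(history, kw, ignore_case=True) for kw in keywords}
print(case_sensitive)
print(case_insensitive)
```

If the descriptions in `conditions.csv` turn out to be capitalized, passing `case=False` to `str.contains` would be the equivalent change in the notebook. Whether a prediabetes record should count as diabetes is a separate modeling decision.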


### Convert Dataframe to Text

We now have all of our examples labeled, but LLMs work better with consistent, compact text inputs, so we convert each row into a short "patient card".

In [73]:
import textwrap

def patient_card(df, pid) -> str:
    row = df.loc[pid]
    comorbs = [kw for kw in COMORBID_KEYWORDS if bool(row.get(kw, False))]
    comorb_str = ", ".join(sorted(set(c.title() for c in comorbs))) or "None"
    age = int(row["age_at_onset"]) if pd.notna(row["age_at_onset"]) else "Unknown"
    gender = str(row["gender"])
    return textwrap.dedent(f"""
    Patient Info:
      - Age at onset of COVID: {age}
      - Gender: {gender}
      - Known Comorbidities: {comorb_str}
    """)

patient_zero_id = frame.index[0]
print(patient_card(frame, patient_zero_id))


Patient Info:
  - Age at onset of COVID: 12
  - Gender: F
  - Known Comorbidities: None



## Zero-Shot Prompts

Now that we have a dataset and a way to generate examples, we'll start with simple zero-shot prompts.

First we need to set up the OpenAI API.

In [7]:
import getpass
# get the OpenAI API key and confirm it is set
openai_key = getpass.getpass("MY_API_KEY: ")
if openai_key and openai_key.strip():
    print("OPENAI_KEY is non-empty")
else:
    print("OPENAI_KEY is empty or not set")


OPENAI_KEY is non-empty


### Ask GPT 5 Nano to classify some patients without providing any examples.

In [74]:
from openai import OpenAI
import os, json, random, hashlib, pathlib as pl
client = OpenAI(api_key=openai_key)

MODEL = os.getenv("OPENAI_MODEL", "gpt-5-nano")
RNG = random.Random(3)

PROJ = pl.Path.cwd()
CACHE = PROJ / "_cache"
CACHE.mkdir(exist_ok=True)

# simple cache to control costs
def _key(s: str) -> str:
    return hashlib.sha1(s.encode()).hexdigest()[:16]

def cache_call(tag: str, prompt: str, fn):
    path = CACHE / f"{tag}-{_key(prompt)}.json"
    if path.exists():
        cached = path.read_text()
        try:
            return json.loads(cached)
        except json.JSONDecodeError:
            return cached
    out = fn()
    if isinstance(out, (dict, list)):
        path.write_text(json.dumps(out))
    else:
        path.write_text(str(out))
    return out

BASE_SYSTEM = (
    "You are a clinical risk scorer working on determining whether a patient is likely to have a severe outcome within 30 days of COVID onset. "
    "A severe outcome is defined as an inpatient or emergency encounter. "
    "Task: in a single word, classify the patient as 'severe' or 'non-severe'."
)

def llm_predict(card_text: str, shots=None) -> list[str]:
    messages = []
    if shots:
        for ex_card, ex_out in shots:
            messages.append({"role": "user", "content": ex_card})
            messages.append({"role": "assistant", "content": str(ex_out)})
    messages.append({"role": "user", "content": card_text + "Predict severe_30d as 'severe' or 'non-severe'."})

    def _call():
        r = client.responses.create(
            model=MODEL,
            input=[{"role":"system","content": BASE_SYSTEM}, *messages],
        )
        return r.output_text.strip() if hasattr(r, "output_text") else ""

    out = cache_call("risk", json.dumps(messages), _call)
    if isinstance(out, list):
        return out
    if isinstance(out, dict):
        return [json.dumps(out)]
    return [str(out).strip()]

# sanity check
llm_predict(patient_card(frame, patient_zero_id))[0]


'non-severe'
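The cache key above is derived from the messages only. If the system prompt or the model changes between runs, a stale answer could in principle be returned, so a key that covers all three may be safer. A minimal sketch, where `request_key` is a hypothetical helper and not part of the notebook:

```python
import hashlib
import json

def request_key(model: str, system: str, messages: list) -> str:
    """Hash everything that influences the response into a short cache key."""
    payload = json.dumps(
        {"model": model, "system": system, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode()).hexdigest()[:16]

# illustrative inputs; the prompts here are placeholders, not the notebook's
msgs = [{"role": "user", "content": "Patient Info: ..."}]
key_a = request_key("gpt-5-nano", "Answer in a single word.", msgs)
key_b = request_key("gpt-5-nano", "Reason first, then answer.", msgs)
print(key_a != key_b)
```

The same messages sent under two different system prompts now map to different cache files, while repeated identical requests still share one.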

Run zero-shot prediction on 10 random patients

In [75]:
# prepare ten patients for zero-shot prompting
NUM_ZERO_SHOT = 10
zero_shot_ids = RNG.sample(list(frame.index), NUM_ZERO_SHOT)
zero_shot_cards = [
    {
        "patient_id": pid,
        "card": patient_card(frame, pid)
    }
    for pid in zero_shot_ids
]
zero_shot_cards

# predict on zero-shot patients
zero_shot_preds = [llm_predict(card["card"])[0] for card in zero_shot_cards]
zero_shot_preds


['severe',
 'severe',
 'severe',
 'severe',
 'severe',
 'severe',
 'non-severe',
 'severe',
 'non-severe',
 'severe']

### Evaluate zero-shot results

In [76]:
actual = frame.loc[zero_shot_ids, "severe_30d"].to_list()
actual_mapped = [ {1: "severe", 0: "non-severe"}.get(x, x) for x in actual ]
actual_mapped

['non-severe',
 'severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'severe']
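Comparing the two lists by eye is error-prone, so a small scoring helper can make the comparison explicit. The function `score` below is a hypothetical helper; the two lists are copied from the outputs above.

```python
def score(preds, actual, positive="severe"):
    """Return accuracy and confusion counts for a binary labeling."""
    pairs = list(zip(preds, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    tn = len(pairs) - tp - fp - fn
    return {"accuracy": (tp + tn) / len(pairs), "tp": tp, "fp": fp, "fn": fn, "tn": tn}

# the two lists printed above, copied verbatim
zero_shot_preds = ["severe", "severe", "severe", "severe", "severe",
                   "severe", "non-severe", "severe", "non-severe", "severe"]
actual_mapped = ["non-severe", "severe", "non-severe", "non-severe", "non-severe",
                 "non-severe", "non-severe", "non-severe", "non-severe", "severe"]

result = score(zero_shot_preds, actual_mapped)
print(result)
```

On these ten patients the model catches both severe cases but also flags six non-severe patients as severe, which gives an accuracy of 0.4.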

## In-Context, Few-Shot Prompts

The zero-shot predictions perform poorly: only 4 of the 10 match the actual labels, and the model labels most patients 'severe'. Do our results improve if we add some labeled examples?

In [None]:
# few-shot prep: 10 prediction cards + 3 labeled shots
NUM_SHOTS = 3
NUM_FEWSHOT_TARGETS = 10

# ensure we don't reuse patients already selected for zero-shot evaluation
used_ids = set(zero_shot_ids)
remaining_ids = [pid for pid in frame.index if pid not in used_ids]
selected_ids = RNG.sample(remaining_ids, NUM_SHOTS + NUM_FEWSHOT_TARGETS)

fewshot_shot_ids = selected_ids[:NUM_SHOTS]
fewshot_target_ids = selected_ids[NUM_SHOTS:]

fewshot_shots = []
for pid in fewshot_shot_ids:
    # use a new name so the `label` Series built earlier is not overwritten
    shot_label = "severe" if frame.loc[pid, "severe_30d"] == 1 else "non-severe"
    fewshot_shots.append((patient_card(frame, pid), shot_label))

fewshot_targets = [
    {
        "patient_id": pid,
        "card": patient_card(frame, pid)
    }
    for pid in fewshot_target_ids
]

print(json.dumps(fewshot_shots, indent=2))
print(json.dumps(fewshot_targets, indent=2))

few_shot_preds = [llm_predict(card["card"], fewshot_shots)[0] for card in fewshot_targets]
few_shot_preds

[
  [
    "\nPatient Info:\n  - Age at onset of COVID: 23\n  - Gender: M\n  - Known Comorbidities: None\n",
    "non-severe"
  ],
  [
    "\nPatient Info:\n  - Age at onset of COVID: 9\n  - Gender: F\n  - Known Comorbidities: None\n",
    "non-severe"
  ],
  [
    "\nPatient Info:\n  - Age at onset of COVID: 38\n  - Gender: F\n  - Known Comorbidities: Obesity\n",
    "severe"
  ]
]
[
  {
    "patient_id": "f6dee18e-eacc-4e66-908f-e3aa8fabd502",
    "card": "\nPatient Info:\n  - Age at onset of COVID: 28\n  - Gender: F\n  - Known Comorbidities: None\n"
  },
  {
    "patient_id": "0e55ddcc-ada3-46bc-870b-b43e2e73443c",
    "card": "\nPatient Info:\n  - Age at onset of COVID: 47\n  - Gender: M\n  - Known Comorbidities: Diabetes\n"
  },
  {
    "patient_id": "5a4b442d-b339-4ae9-960d-673c6eeb049c",
    "card": "\nPatient Info:\n  - Age at onset of COVID: 28\n  - Gender: M\n  - Known Comorbidities: None\n"
  },
  {
    "patient_id": "bdd35a56-ab19-4e3d-91ea-2d66961c5a9d",
    "card": "\nPati

['non-severe',
 'severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'severe',
 'non-severe',
 'severe',
 'non-severe',
 'severe']

### Evaluate few-shot results

In [31]:
few_shot_actual = frame.loc[fewshot_target_ids, "severe_30d"].to_list()
few_shot_actual_mapped = [ {1: "severe", 0: "non-severe"}.get(x, x) for x in few_shot_actual ]
few_shot_actual_mapped

['non-severe',
 'severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'non-severe',
 'severe',
 'non-severe',
 'non-severe',
 'non-severe']

## In-Context, Few-Shot with Similar examples

At first glance, the few-shot results look a little better (6 of the 10 predictions listed above match), but they are still not great. With a small, random set of examples, that isn't too surprising. To provide better examples, for each query we could include the K most‑similar training patients as shots and see whether local examples beat global random ones.

In [77]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import classification_report

patient_ids = [pid for pid in frame.index]
feat_cols = ["age_at_onset", "gender"] + COMORBID_KEYWORDS
X = frame.loc[patient_ids, feat_cols].copy()
X["gender"] = X["gender"].astype("category").cat.codes
X = pd.get_dummies(X, columns=["gender"], drop_first=True)

# train_test_split is used here only to carve out a very small "test" set
# (test_size=0.001 leaves about 10 patients); the rest of the data serves
# as the pool from which similar example patients are retrieved.
train_ids, test_ids = train_test_split(
    patient_ids,
    test_size=0.001,
    random_state=3,
    stratify=[frame.loc[i, "severe_30d"] for i in patient_ids])
X_train = X.loc[train_ids]
X_test  = X.loc[test_ids]

# index the training set for nearest-neighbor lookup (3 neighbors per query)
knn = NearestNeighbors(n_neighbors=3).fit(X_train)

true_y = {pid: ("severe" if frame.loc[pid, "severe_30d"] == 1 else "non-severe") for pid in patient_ids}

def predict_with_retrieval(pid):
    _, idx = knn.kneighbors(X_test.loc[[pid]])
    nn_ids = [X_train.index[i] for i in idx[0]]
    shots_dyn = [(patient_card(frame, nn), true_y[nn]) for nn in nn_ids]
    return llm_predict(patient_card(frame, pid), shots=shots_dyn)[0]

X_test.shape

(10, 14)
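A caveat on the neighbor search: `NearestNeighbors` uses Euclidean distance by default, and `age_at_onset` is in years while the comorbidity flags are 0/1, so age differences can dominate the ranking. The sketch below uses made-up two-feature patients (age, diabetes flag) and an assumed age range of 0 to 100 for rescaling. It illustrates the effect and is not the notebook's method.

```python
import math

query = (40, 1)  # (age, diabetes flag), hypothetical patient
candidates = {
    "near_age_no_diabetes": (41, 0),
    "older_with_diabetes": (45, 1),
}

def nearest(query, candidates, age_scale=1.0):
    """Return the candidate name with the smallest Euclidean distance to the query."""
    def scaled(v):
        # divide age by age_scale; leave the 0/1 flag untouched
        return (v[0] / age_scale, v[1])
    return min(
        candidates,
        key=lambda name: math.dist(scaled(query), scaled(candidates[name])),
    )

raw_choice = nearest(query, candidates)                    # age in years
scaled_choice = nearest(query, candidates, age_scale=100)  # age mapped to roughly [0, 1]
print(raw_choice, scaled_choice)
```

With raw ages, the one-year age gap wins and the non-diabetic patient is retrieved. After rescaling, the shared comorbidity matters more and the diabetic patient becomes the nearest neighbor. Whether that is the desired notion of "similar" is a design choice for the retrieval step.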

### Make predictions and evaluate

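Predict each held-out patient using its three nearest training patients as shots, then compare the predictions with the true labels. These predictions use the single-word system prompt; the chain-of-thought variant is introduced afterwards.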

In [78]:
preds_ret = {pid: predict_with_retrieval(pid) for pid in test_ids}
print(classification_report([true_y[i] for i in test_ids], [preds_ret[i] for i in test_ids], digits=3))

              precision    recall  f1-score   support

  non-severe      0.857     0.750     0.800         8
      severe      0.333     0.500     0.400         2

    accuracy                          0.700        10
   macro avg      0.595     0.625     0.600        10
weighted avg      0.752     0.700     0.720        10



### Update System Role to provide chain-of-thought to prompt

In [85]:
BASE_SYSTEM = (
    "You are a clinical risk scorer working on determining whether a patient is likely to have a severe outcome within 30 days of COVID onset. "
    "A severe outcome is defined as an inpatient or emergency encounter. "
    "First reason through the key risk factors in 2-3 short bullet points, then finish with a single word classification ('severe' or 'non-severe')."
)

def llm_predict(card_text: str, shots=None) -> str:
    messages = []
    if shots:
        for ex_card, ex_out in shots:
            messages.append({"role": "user", "content": ex_card})
            messages.append({"role": "assistant", "content": str(ex_out)})
    messages.append({"role": "user", "content": card_text + "Predict severe_30d as 'severe' or 'non-severe'."})

    r = client.responses.create(
        model=MODEL,
        input=[{"role":"system","content": BASE_SYSTEM}, *messages],
    )
    return r.output_text


# use a separate name so the dict of retrieval predictions is not overwritten
cot_output = llm_predict(patient_card(frame, test_ids[0]))
print(cot_output)


- Young age (26) is associated with lower risk of severe COVID-19 outcomes.
- Absence of known comorbidities reduces likelihood of progression to severe disease.
- No high-risk factors identified (no stated immunosuppression, pregnancy, etc.).

non-severe
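With reasoning in the output, the raw response can no longer be compared directly with `true_y`. One option is to read the label from the last non-empty line. `extract_label` below is a hypothetical helper, and it assumes the model follows the instruction to end with a single-word classification; anything else is returned as `None` rather than guessed.

```python
def extract_label(response: str):
    """Return 'severe' or 'non-severe' from the last non-empty line, else None."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    # drop surrounding punctuation and quotes, ignore capitalization
    last = lines[-1].lower().strip(" .'\"*")
    if last in ("non-severe", "nonsevere", "non severe"):
        return "non-severe"
    if last == "severe":
        return "severe"
    return None

sample = (
    "- Young age (26) is associated with lower risk of severe COVID-19 outcomes.\n"
    "- Absence of known comorbidities reduces likelihood of progression to severe disease.\n"
    "\n"
    "non-severe\n"
)
print(extract_label(sample))
```

Returning `None` for unparseable responses keeps them visible, so they can be counted or retried instead of silently scored as one of the two classes.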


## Use Embeddings to classify patient cards

Now we'll try embedding our whole dataset with an OpenAI model, so each of our patient cards is embedded into a numerical representation of the text that can be used to quantify how related our patient examples are. Once we have an embedding for our dataset, we'll use a simple classification model to make predictions based on these numerical representations.

In [53]:
# embed every patient card with text-embedding-3-small
EMBED_MODEL = "text-embedding-3-small"
# batch up patient cards to try to stay under token limits
EMBED_BATCH_SIZE = 32

# pre-build the full set of patient cards
all_cards = [
    {"patient_id": pid, "card": patient_card(frame, pid)}
    for pid in frame.index
]

print(f"Preparing embeddings for {len(all_cards)} patients using {EMBED_MODEL}...")

def chunk(items, size):
    """Yield successive chunks from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

embedding_rows = []
for batch in chunk(all_cards, EMBED_BATCH_SIZE):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=[entry["card"] for entry in batch],
    )
    for entry, data in zip(batch, response.data):
        embedding_rows.append(
            {
                "patient_id": entry["patient_id"],
                "card_text": entry["card"],
                "embedding": data.embedding,
            }
        )

patient_embeddings = (
    pd.DataFrame(embedding_rows)
    .set_index("patient_id")
    .sort_index()
)

patient_embeddings.head()


Preparing embeddings for 9106 patients using text-embedding-3-small...


patient_id,card_text,embedding
0000b247-1def-417a-a783-41c8682be022,\nPatient Info:\n - Age at onset of COVID: 12...,"[0.024089792743325233, -0.06078261509537697, 0..."
00049ee8-5953-4edd-a277-b9c1b1a7f16b,\nPatient Info:\n - Age at onset of COVID: 35...,"[-0.003965935669839382, -0.043785903602838516,..."
00079a57-24a8-430f-b4f8-a1cf34f90060,\nPatient Info:\n - Age at onset of COVID: 29...,"[0.003200449049472809, -0.06400898098945618, -..."
0008a63c-c95c-46c2-9ef3-831d68892019,\nPatient Info:\n - Age at onset of COVID: 27...,"[-0.004471933469176292, -0.04955108091235161, ..."
00093cdd-a9f0-4ad8-87e9-53534501f008,\nPatient Info:\n - Age at onset of COVID: 67...,"[0.006540210917592049, -0.06269494444131851, 0..."
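To see how embedding vectors quantify how related two cards are, a common choice is cosine similarity. The sketch below uses made-up three-dimensional vectors in place of real embeddings, which are far longer; `cosine_similarity` is a hypothetical helper written in plain Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors standing in for real embeddings (values are made up)
card_a = [0.2, 0.1, 0.9]
card_b = [0.25, 0.05, 0.85]   # points in nearly the same direction as card_a
card_c = [0.9, 0.3, -0.1]     # points in a very different direction

sim_ab = cosine_similarity(card_a, card_b)
sim_ac = cosine_similarity(card_a, card_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

In the notebook, the same function could be applied to any two entries of `patient_embeddings["embedding"]`; values close to 1 would indicate cards the embedding model considers similar.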


### Train a Logistic Regression model on embeddings

In [62]:
# train a Logistic Regression classifier on top of the embeddings
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# align embeddings with the severe_30d label
aligned = patient_embeddings.join(frame["severe_30d"], how="inner")
X = np.vstack(aligned["embedding"].to_list())
y = aligned["severe_30d"].astype(int).values

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=41,
    stratify=y,
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=2000, class_weight="balanced")
log_reg.fit(X_train_scaled, y_train)

y_pred = log_reg.predict(X_test_scaled)
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=["non-severe", "severe"]))


accuracy: 0.605
              precision    recall  f1-score   support

  non-severe       0.85      0.61      0.71      1423
      severe       0.30      0.60      0.40       399

    accuracy                           0.61      1822
   macro avg       0.57      0.60      0.55      1822
weighted avg       0.73      0.61      0.64      1822



### Observations

This approach still seems to be affected by the class imbalance. With balanced class weights the model recovers about 60% of the severe cases, but precision on that class is only 0.30, and overall accuracy (0.605) is below the roughly 0.78 that always predicting non-severe would achieve on this test split (1423 of 1822). Improvements would likely come from a richer feature set in the patient cards; there is more demographic information in the patients table that has not yet been utilized. Even with these preliminary results, we've shown how to repurpose a pretrained model as a feature extractor, turning each patient card into an embedding and then training a logistic regression classifier on those vectors. In effect, we’re approximating “using the last layer” of the LLM for classification.

In [65]:
frame["severe_30d"].value_counts()

severe_30d
0    7111
1    1995
Name: count, dtype: int64
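These counts also show what `class_weight="balanced"` does: scikit-learn documents that option as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the full-dataset counts shown above. The classifier itself is fit on the stratified 80% training split, so its actual weights should be very close to these but not identical.

```python
counts = {0: 7111, 1: 1995}   # severe_30d value counts from the cell above
n_samples = sum(counts.values())
n_classes = len(counts)

# weight for each class: n_samples / (n_classes * class_count)
balanced_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
print({cls: round(w, 3) for cls, w in balanced_weights.items()})
```

Each severe example counts roughly three and a half times as much as a non-severe one in the loss, which is why recall on the severe class rises at the cost of precision.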