# Exploratory Data Analysis (EDA)

This notebook performs initial EDA on the adverse event detection dataset.

In [1]:
import pandas as pd

# Load dataset
df = pd.read_csv('../data/synthetic_ehr.csv')

# Quick overview
print("Shape:", df.shape)
display(df.head(10))

# Describe numerical columns
display(df.describe())

# Check for missing values
print("Missing values per column:\n", df.isnull().sum())

# Print 3 example clinical notes
print("Sample Clinical Notes:")
for idx, row in df.sample(3).iterrows():
    print(f"- {row['note']} (Adverse Event: {row['adverse_event']})")

# Document observations
observations = []

# 1. Check balance of adverse_event labels
label_counts = df['adverse_event'].value_counts()
observations.append(f"Adverse event label counts: {dict(label_counts)}")

# 2. Unique diagnoses and medications
observations.append(f"Unique conditions: {df['condition'].unique()}")
observations.append(f"Unique medications: {df['medication'].unique()}")

# 3. Range of ages
observations.append(f"Age range: {df['age'].min()} to {df['age'].max()}")

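# 4. Flag class imbalance using the minority-class share, not a raw count
#    (the 0.2 cutoff is an arbitrary choice)
minority_share = label_counts.min() / label_counts.sum()
if minority_share < 0.2:
    observations.append(f"Warning: adverse_event is imbalanced (minority share {minority_share:.1%}).")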

print("\nObservations:")
for obs in observations:
    print("-", obs)


Shape: (200, 10)


| | patient_id | age | sex | condition | medication | bp_sys | bp_dia | heart_rate | note | adverse_event |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 23 | F | Diabetes | Albuterol | 150 | 94 | 55 | Elevated blood pressure, monitoring closely. O... | 0 |
| 1 | 1 | 50 | M | COPD | Atorvastatin | 145 | 100 | 99 | Complains of chest pain after receiving Atorva... | 0 |
| 2 | 2 | 26 | F | COPD | Metformin | 114 | 98 | 58 | Patient with COPD prescribed Metformin. | 0 |
| 3 | 3 | 77 | M | Asthma | Atorvastatin | 116 | 75 | 103 | PT with Asthma denies new symptoms. | 0 |
| 4 | 4 | 23 | F | Hypertension | Albuterol | 90 | 60 | 108 | Reported dizziness after starting Albuterol. | 1 |
| 5 | 5 | 64 | F | Cancer | Albuterol | 97 | 94 | 75 | Elevated blood pressure, monitoring closely. O... | 0 |
| 6 | 6 | 88 | M | Asthma | Metformin | 131 | 107 | 116 | Patient with Asthma prescribed Metformin. | 0 |
| 7 | 7 | 84 | F | Hypertension | Aspirin | 106 | 109 | 116 | Vitals stable, no complaints today. | 0 |
| 8 | 8 | 44 | F | Asthma | Aspirin | 128 | 93 | 71 | Developed rash and swelling post Aspirin. | 1 |
| 9 | 9 | 26 | M | COPD | Albuterol | 175 | 106 | 115 | Developed rash and swelling post Albuterol. | 0 |


| | patient_id | age | bp_sys | bp_dia | heart_rate | adverse_event |
|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 99.5 | 52.135 | 133.77 | 84.07 | 86.35 | 0.295 |
| std | 57.879185 | 21.187765 | 27.323248 | 15.146358 | 19.542185 | 0.457187 |
| min | 0.0 | 18.0 | 90.0 | 60.0 | 55.0 | 0.0 |
| 25% | 49.75 | 34.0 | 110.0 | 71.75 | 69.0 | 0.0 |
| 50% | 99.5 | 51.0 | 134.0 | 84.0 | 84.0 | 0.0 |
| 75% | 149.25 | 69.0 | 158.0 | 96.0 | 104.0 | 1.0 |
| max | 199.0 | 90.0 | 180.0 | 110.0 | 120.0 | 1.0 |


Missing values per column:
 patient_id        0
age               0
sex               0
condition        38
medication       32
bp_sys            0
bp_dia            0
heart_rate        0
note              0
adverse_event     0
dtype: int64
Sample Clinical Notes:
- Normal exam, no adverse reaction noted. (Adverse Event: 0)
- Patient with Cancer prescribed Albuterol. (Adverse Event: 0)
- Normal exam, no adverse reaction noted. (Adverse Event: 0)

Observations:
- Adverse event label counts: {0: np.int64(141), 1: np.int64(59)}
- Unique conditions: ['Diabetes' 'COPD' 'Asthma' 'Hypertension' 'Cancer' nan]
- Unique medications: ['Albuterol' 'Atorvastatin' 'Metformin' 'Aspirin' nan]
- Age range: 18 to 90


### Observations

- **Dataset shape:** 200 rows × 10 columns.  
- **No missing values** in patient_id, age, sex, bp_sys, bp_dia, heart_rate, note, adverse_event.
- **Missing values detected:**  
  - `condition` has 38 missing entries  
  - `medication` has 32 missing entries  
  *Consider how to handle these missing values (impute, drop, or flag).*

- **Adverse event label counts:**  
  - 0: 141  
  - 1: 59  
  *About 30% adverse events (59 of 200): a moderate imbalance that is workable for binary classification.*

- **Unique conditions:** Diabetes, COPD, Asthma, Hypertension, Cancer, plus missing values (NaN)  
- **Unique medications:** Albuterol, Atorvastatin, Metformin, Aspirin, plus missing values (NaN)

- **Age range:** 18 to 90

- **Sample notes** (`df.sample(3)` is unseeded, so these differ from the three notes printed above):  
    - "Patient with COPD prescribed Aspirin."  
    - "Patient with Cancer prescribed Aspirin."  
    - "Elevated blood pressure, monitoring closely. On Metformin."
  *These notes are varied and suitable for prompt engineering with an LLM.*

---

### Data Quirks & Considerations

- **Missing condition/medication:** Some records are missing these key fields. These could affect model performance, especially for feature engineering or LLM prompt context.
  - *Next step: Decide on imputation or removal strategy for these rows.*

- **Label balance:** Roughly 70/30 (141 vs. 59). This is workable for adverse event prediction, but use stratified sampling in the train/test split so that both sets keep this ratio.

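- **Notes quality:** Notes generally match the diagnosis and medication context, with two caveats visible in the outputs. First, some notes contain the literal string "None" (e.g. "PT with None denies new symptoms."), apparently where `condition` or `medication` is missing. Second, the `adverse_event` label does not always agree with the note text (e.g. row 9, "Developed rash and swelling post Albuterol.", is labeled 0). Both matter for LLM extraction/classification task design.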
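
One way to act on the missing-field item above is to treat missing values as an explicit category and keep a flag column, so that no rows are dropped. The sketch below is an assumption, not a decision made in this notebook: the small frame stands in for `df`, and the `flag_and_fill` helper, the `*_missing` column names, and the "Unknown" label are all hypothetical.

```python
import pandas as pd

# Tiny stand-in for df; None marks a missing condition/medication.
toy = pd.DataFrame({
    "condition": ["Diabetes", None, "COPD"],
    "medication": ["Albuterol", "Aspirin", None],
})

def flag_and_fill(frame, columns, fill_value="Unknown"):
    """Return a copy with <col>_missing flags and missing values filled."""
    out = frame.copy()
    for col in columns:
        out[f"{col}_missing"] = out[col].isna()  # keep the missingness signal
        out[col] = out[col].fillna(fill_value)   # explicit category instead of NaN
    return out

filled = flag_and_fill(toy, ["condition", "medication"])
print(filled)
```

Keeping the flag columns means a later model or prompt can still tell "Unknown" apart from a real category; dropping the rows instead would discard up to 38 of the 200 records for `condition` alone.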

---

### Next Steps

- Decide how to handle missing field values (impute, drop, or treat as unknown).
- Prepare prompt templates for sending notes to Gemini for entity extraction.
- Explore correlations and feature importance in subsequent EDA.
- Confirm label distribution in train/test splits for modeling.
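
To make the last item concrete, a stratified hold-out can be drawn with pandas alone by sampling the same fraction from each label group. This is a minimal sketch, assuming a 20% test fraction and a toy frame with the same 141/59 label counts reported above. `stratified_split` is a hypothetical helper; `sklearn.model_selection.train_test_split` with its `stratify` argument would be a common alternative.

```python
import pandas as pd

# Toy frame with the same label counts as the dataset (141 zeros, 59 ones).
toy = pd.DataFrame({"adverse_event": [0] * 141 + [1] * 59})

def stratified_split(frame, label_col, test_frac=0.2, seed=42):
    """Sample test_frac of each label group for test; the rest is train."""
    test = frame.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train = frame.drop(test.index)
    return train, test

train, test = stratified_split(toy, "adverse_event")

# Compare the positive rate in each split against the full data.
print("full :", toy["adverse_event"].mean())
print("train:", train["adverse_event"].mean())
print("test :", test["adverse_event"].mean())
```

If the three printed rates are close, the split has preserved the label distribution.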


### Step 2: Sample Clinical Notes

You want a variety of notes, including ones mentioning adverse events, medications, and normal findings.

#### Option 1: Random Sample

In [2]:
sample_notes = df['note'].sample(5, random_state=42).tolist()
for i, note in enumerate(sample_notes, 1):
    print(f"{i}. {note}")


1. PT with None denies new symptoms.
2. Developed rash and swelling post Atorvastatin.
3. Complains of chest pain after receiving None.
4. Developed rash and swelling post Albuterol.
5. Vitals stable, no complaints today.


#### Option 2: Diverse Examples by Class and Content

- One with adverse event
- One with medication mention only
- One with diagnosis only
- One normal/negative finding

In [3]:
examples = []

# 1. Note with adverse event
adv = df[df['adverse_event'] == 1]['note'].sample(1, random_state=1).values[0]
examples.append(adv)

# 2. Note from a record with a medication on file
#    (column filter only, so the note text may not mention the drug)
med = df[df['medication'].notnull()]['note'].sample(1, random_state=2).values[0]
examples.append(med)

# 3. Note from a record with a condition on file
#    (column filter only, so the note text may not mention the diagnosis)
diag = df[df['condition'].notnull()]['note'].sample(1, random_state=3).values[0]
examples.append(diag)

# 4. Note with no event or complaints
#    (select the 'note' column before sampling; otherwise .values[0] returns the whole row)
norm = df[df['note'].str.contains('no complaints|normal', case=False, na=False)]['note'].sample(1, random_state=4).values[0]
examples.append(norm)

for i, note in enumerate(examples, 1):
    print(f"{i}. {note}")


1. Complains of chest pain after receiving Aspirin.
2. Elevated blood pressure, monitoring closely. On Aspirin.
3. Vitals stable, no complaints today.
4. Normal exam, no adverse reaction noted.
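
The filters in the cell above select on the structured columns, so they cannot guarantee what the note text itself says. A rough alternative is to bucket notes by keywords in the text. The sketch below is only an assumption: the keyword patterns and the `categorize_note` helper are hypothetical, chosen to fit the phrases visible in this notebook's outputs, and would need review against the full dataset.

```python
import re

# Hypothetical keyword patterns, based on phrases seen in the sample notes.
# Order matters: the first matching pattern wins.
PATTERNS = [
    ("adverse", re.compile(r"rash|swelling|dizziness|chest pain", re.I)),
    ("normal", re.compile(r"no complaints|normal exam|denies new symptoms", re.I)),
    ("medication", re.compile(r"prescribed|\bon \w+", re.I)),
]

def categorize_note(note):
    """Return the label of the first pattern found in the note, else 'other'."""
    for label, pattern in PATTERNS:
        if pattern.search(note):
            return label
    return "other"

notes = [
    "Developed rash and swelling post Aspirin.",
    "Vitals stable, no complaints today.",
    "Patient with COPD prescribed Metformin.",
    "PT with Asthma denies new symptoms.",
]
for note in notes:
    print(categorize_note(note), "-", note)
```

On the real data, `df['note'].map(categorize_note)` would give a category per row, from which one note per category could be sampled.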


#### Sample Clinical Notes for Prompt Testing

1. Patient with hypertension prescribed Aspirin. Developed rash and swelling post Aspirin.
2. Complains of chest pain after receiving Atorvastatin.
3. Vitals stable, no complaints today.
4. PT with Asthma denies new symptoms.



#### Extraction Targets for Gemini

For each clinical note, extract:
- diagnosis (string, e.g. "hypertension")
- medications (list of strings, e.g. ["Aspirin"])
- symptoms_side_effects (list of strings, e.g. ["rash", "swelling"])
- adverse_event (object: {present: true/false, description: string})


In [4]:
extraction_targets = {
    "diagnosis": None,
    "medications": [],
    "symptoms_side_effects": [],
    "adverse_event": {
        "present": False,
        "description": ""
    }
}
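
The `extraction_targets` template can drive both the prompt text and a sanity check on whatever comes back. The sketch below is an assumption about how that might look, not this notebook's method: `build_prompt` and `parse_extraction` are hypothetical helpers, no Gemini client call is made, and the example reply is hand-written for illustration rather than real model output.

```python
import copy
import json

extraction_targets = {
    "diagnosis": None,
    "medications": [],
    "symptoms_side_effects": [],
    "adverse_event": {"present": False, "description": ""},
}

def build_prompt(note):
    """Ask for JSON with exactly the keys in extraction_targets."""
    schema = json.dumps(extraction_targets, indent=2)
    return (
        "Extract the following fields from the clinical note and reply "
        "with JSON only, using this structure:\n"
        f"{schema}\n\nClinical note: {note}"
    )

def parse_extraction(raw_reply):
    """Parse a JSON reply, keeping template defaults for any missing keys."""
    data = json.loads(raw_reply)
    result = copy.deepcopy(extraction_targets)  # never mutate the template
    for key in result:
        if key in data:
            result[key] = data[key]
    if not isinstance(result["adverse_event"], dict):
        raise ValueError("adverse_event must be an object")
    return result

prompt = build_prompt("Developed rash and swelling post Aspirin.")

# Hand-written stand-in for a model reply (not real output).
reply = (
    '{"medications": ["Aspirin"], '
    '"symptoms_side_effects": ["rash", "swelling"], '
    '"adverse_event": {"present": true, "description": "rash and swelling"}}'
)
parsed = parse_extraction(reply)
print(parsed)
```

Falling back to the template defaults keeps downstream code simple, since every parsed result has the same four keys even when the reply omits one.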
