# Create Y and E Arrays from ICD10 Data

This notebook converts a dataframe with patient diagnoses (eid, diag_icd10, age_diag) into:
- **Y array**: Binary array of shape (N, D, T) where Y[n, d, t] = 1 if patient n had disease d at age (30+t)
- **E array**: Exposure/censor array of shape (N, D, T) indicating when patients are at risk
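
The Y layout described above can be illustrated on a toy example. This is a minimal sketch, not the code inside `create_Y_E_arrays`: the records, the `p_index`/`d_index` mappings, and the choice to skip ages outside the window are all assumptions made for illustration.

```python
import numpy as np

# Toy stand-in for (eid, diag_icd10, age_diag) rows; all values are made up.
records = [
    (1, "A", 31),  # patient 1, disease A, age 31 -> t = 1
    (1, "B", 29),  # age 29 is before the window and is skipped
    (2, "A", 30),  # age 30 -> t = 0
    (2, "B", 33),  # age 33 -> t = 3
]
age_offset = 30
T = 5  # ages 30..34 in this toy example

# Hypothetical index mappings: sorted order, not the reference disease order.
patients = sorted({eid for eid, _, _ in records})
diseases = sorted({code for _, code, _ in records})
p_index = {eid: i for i, eid in enumerate(patients)}
d_index = {code: j for j, code in enumerate(diseases)}

Y = np.zeros((len(patients), len(diseases), T), dtype=np.int8)
for eid, code, age in records:
    t = age - age_offset
    if 0 <= t < T:  # assumption: ages outside the window are ignored
        Y[p_index[eid], d_index[code], t] = 1

print(Y.shape)   # (N, D, T)
print(Y[0, 0])   # patient 1, disease A: a single 1 at t = 1
print(Y[1, 1])   # patient 2, disease B: a single 1 at t = 3
```

In the real run the disease axis follows the order given by the disease order CSV rather than sorted order.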

Then computes:
- Max censor replacement
- Corrected E matrix
- Prevalence estimates
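
The censoring convention can be sketched for a single (patient, disease) pair. The rule below is inferred from the verification output later in this notebook (a max censor age of 44 gives an E row holding 15 for the first 15 timepoints and 0 afterwards); it is an assumption about the behaviour for a patient without an event, not a copy of the script's logic.

```python
import numpy as np

age_offset = 30
T = 76               # ages 30..105, as reported in this notebook's run
max_censor_age = 44  # value printed for the example patient in the checks

# Timepoint of the last observed age, capped to the end of the array.
censor_tp = min(max_censor_age - age_offset, T - 1)

E_row = np.zeros(T, dtype=np.int64)
if censor_tp >= 0:
    # Assumed convention: every at-risk timepoint stores censor_tp + 1,
    # and timepoints after censoring stay 0.
    E_row[: censor_tp + 1] = censor_tp + 1

print(censor_tp)   # 14
print(E_row[:20])  # fifteen entries equal to 15, then zeros
```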


In [1]:
# Import the script
import sys
sys.path.append('/Users/sarahurbut/aladynoulli2/pyScripts/dec_6_revision/new_notebooks/')
from create_Y_E_from_icd10 import create_Y_E_arrays

import pandas as pd
import numpy as np
import torch
from pathlib import Path

print("Setup complete")


Setup complete


## Step 1: Load ICD10 Data


In [2]:
# Load ICD10 data
# Format: eid, diag_icd10, age_diag
icd10_path = '/Users/sarahurbut/aladynoulli2/aou_icd10.csv'  # Change for AOU or MGB

df = pd.read_csv(icd10_path)
print(f"Loaded {len(df):,} diagnosis records")
print(f"Columns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
print(df.head())

print(f"\nUnique patients: {df['eid'].nunique():,}")
print(f"Unique diseases: {df['diag_icd10'].nunique():,}")
print(f"Age range: {df['age_diag'].min()} - {df['age_diag'].max()}")


Loaded 5,875,618 diagnosis records
Columns: ['eid', 'diag_icd10', 'age_diag']

First few rows:
       eid  diag_icd10  age_diag
0  1000000      550.20        64
1  1000000      601.12        63
2  1000000      599.40        64
3  1000000      565.00        64
4  1000000      455.00        64

Unique patients: 243,303
Unique diseases: 348
Age range: 2 - 105
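
The observed age range starts below the age offset of 30 used later, so some records fall outside the modelled window. The sketch below counts them and derives the number of timepoints. It assumes T is computed from the maximum age (which matches the 76 timepoints reported in Step 2), and the `ages` array is made up for illustration.

```python
import numpy as np

age_offset = 30
# Made-up ages spanning the observed range of 2 to 105.
ages = np.array([2, 29, 30, 44, 64, 105])

# Assumed rule: one timepoint per year from age_offset to the maximum age.
T = int(ages.max()) - age_offset + 1

timepoints = ages - age_offset
in_window = (timepoints >= 0) & (timepoints < T)

print(T)                        # 76
print(int((~in_window).sum()))  # 2 records fall before age 30
```

With the real data, `ages` could be taken from `df['age_diag'].to_numpy()`.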


## Step 2: Create Y and E Arrays


In [3]:
output_dir = Path('/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/allofusbigdata')

# Use disease order CSV to match original Y_binary.rds order
disease_order_csv = '/Users/sarahurbut/aladynoulli2/aou_diag_names.csv'  # Change for AOU or MGB

results = create_Y_E_arrays(
    df, 
    age_offset=30,
    output_dir=str(output_dir),
    disease_order_csv=disease_order_csv  # This ensures diseases are in the same order as original Y
)

# Extract results
Y = results['Y']
E = results['E']
E_corrected = results['E_corrected']
prevalence_t = results['prevalence_t']
patient_names = results['patient_names']
disease_names = results['disease_names']
max_censor_df = results['max_censor_df']

print(f"\nResults:")
print(f"  Y shape: {Y.shape}")
print(f"  E shape: {E.shape}")
print(f"  E_corrected shape: {E_corrected.shape}")
print(f"  Prevalence shape: {prevalence_t.shape}")
print(f"  Patient names: {len(patient_names)}")
print(f"  Disease names: {len(disease_names)}")


Creating Y and E arrays from ICD10 data

Step 1: Creating mappings...
  Loading disease order from CSV: /Users/sarahurbut/aladynoulli2/aou_diag_names.csv
  Reference order has 348 diseases
  Patients: 243303
  Diseases: 348

Step 2: Time dimension
  Age range: 2 to 105
  Timepoints T: 76 (ages 30 to 105)

Step 3: Creating Y array (243303, 348, 76)...
  Processed 100,000 records...
  Processed 200,000 records...
  Processed 300,000 records...
  Processed 400,000 records...
  Processed 500,000 records...
  Processed 600,000 records...
  Processed 700,000 records...
  Processed 800,000 records...
  Processed 900,000 records...
  Processed 1,000,000 records...
  Processed 1,100,000 records...
  Processed 1,200,000 records...
  Processed 1,300,000 records...
  Processed 1,400,000 records...
  Processed 1,500,000 records...
  Processed 1,600,000 records...
  Processed 1,700,000 records...
  Processed 1,800,000 records...
  Processed 1,900,000 records...
  Processed 2,000,000 records...
  Pro

In [6]:

# ============================================================================
# 1. VERIFY EVENT PATIENT - Person who had an event
# ============================================================================

# Find a patient with an event
patient_idx = 100  # Change this to a patient index
disease_idx = 10  # Change this to a disease index

# Check Y - should have 1 at the event timepoint
Y_patient_disease = Y[patient_idx, disease_idx, :]
event_timepoints = np.where(Y_patient_disease == 1)[0]

print("1. EVENT PATIENT VERIFICATION")
print("="*60)
print(f"Patient index: {patient_idx}, Disease index: {disease_idx}")
print(f"Y values: {Y_patient_disease[:20]}...")
print(f"Event occurred at timepoints: {event_timepoints}")
if len(event_timepoints) > 0:
    for t in event_timepoints[:5]:  # Show first 5 events
        age_at_event = 30 + t
        print(f"  Timepoint {t} = Age {age_at_event}: Y = {Y_patient_disease[t]}")

# Check E - should be >= event timepoint
E_patient_disease = E[patient_idx, disease_idx, :]
E_corrected_patient_disease = E_corrected[patient_idx, disease_idx]

print(f"\nE_corrected (max censor age): {E_corrected_patient_disease}")
print(f"E array (first 20): {E_patient_disease[:20]}")
if len(event_timepoints) > 0:
    first_event_tp = event_timepoints[0]
    print(f"\nVerification:")
    print(f"  First event at timepoint {first_event_tp} (age {30 + first_event_tp})")
    print(f"  E_corrected = {E_corrected_patient_disease} (timepoint {E_corrected_patient_disease - 30})")
    print(f"  E_corrected >= first_event_age? {E_corrected_patient_disease >= (30 + first_event_tp)}")
    if first_event_tp > 0:
        E_before_event = E_patient_disease[:first_event_tp]
        print(f"  E values before event: {E_before_event[:5]}...")
        print(f"  All E >= event timepoint + 1? {np.all(E_before_event >= first_event_tp+1)}")


# ============================================================================
# 2. VERIFY EARLY/LATE DIAGNOSIS - Person diagnosed before/after time period
# ============================================================================
icd10_df = df
# Look up the raw diagnoses for the selected patient, then check for any outside the age window
patient_id = patient_names[patient_idx]  # Use patient_idx from above or set new one
patient_diagnoses = icd10_df[icd10_df['eid'] == patient_id].copy()

print("\n\n2. EARLY/LATE DIAGNOSIS VERIFICATION")
print("="*60)
print(f"Patient ID: {patient_id}")
print(f"Patient diagnoses:")
print(patient_diagnoses.head(10))

# Check for diagnoses before age 30
early_diagnoses = patient_diagnoses[patient_diagnoses['age_diag'] < 30]
print(f"\nDiagnoses BEFORE age 30: {len(early_diagnoses)}")
if len(early_diagnoses) > 0:
    print(early_diagnoses[['diag_icd10', 'age_diag']].head())
    print(f"  Earliest diagnosis age: {early_diagnoses['age_diag'].min()}")

# Check for diagnoses after max age
max_age = 30 + Y.shape[2] - 1
late_diagnoses = patient_diagnoses[patient_diagnoses['age_diag'] > max_age]
print(f"\nDiagnoses AFTER age {max_age}: {len(late_diagnoses)}")
if len(late_diagnoses) > 0:
    print(late_diagnoses[['diag_icd10', 'age_diag']].head())

# Check Y for disease_idx - the sum is 0 if this patient has no diagnosis of that disease inside the window
Y_patient_disease = Y[patient_idx, disease_idx, :]
print(f"\nY values for disease {disease_idx}:")
print(f"  Y sum (should be 0 if before/after): {Y_patient_disease.sum()}")
print(f"  Y values: {Y_patient_disease[:10]}...")

# Check E
E_corrected_patient_disease = E_corrected[patient_idx, disease_idx]
max_censor_age = patient_diagnoses['age_diag'].max() if len(patient_diagnoses) > 0 else None

print(f"\nE values:")
print(f"  Max censor age from data: {max_censor_age}")
print(f"  E_corrected: {E_corrected_patient_disease}")
if max_censor_age is not None:
    if max_censor_age < 30:
        print(f"  ⚠️ Max censor ({max_censor_age}) < age_offset (30) - E should be 0 or set to 30")
    elif max_censor_age > max_age:
        print(f"  ⚠️ Max censor ({max_censor_age}) > max age ({max_age}) - E should be capped")


# ============================================================================
# 3. VERIFY CENSORED PATIENT - Person who left before end of follow-up
# ============================================================================

# Find a patient who was censored (max_censor < max_age)
max_age = 30 + Y.shape[2] - 1
censored_patients = max_censor_df[max_censor_df['max_censor'] < max_age]
if len(censored_patients) > 0:
    censored_patient_id = censored_patients.iloc[0]['eid']
    censored_patient_idx = patient_names.index(censored_patient_id)
    censored_disease_idx = 0
    
    print("\n\n3. CENSORED PATIENT VERIFICATION")
    print("="*60)
    print(f"Patient ID: {censored_patient_id}")
    print(f"Patient index: {censored_patient_idx}")
    
    # Get max censor
    max_censor_row = max_censor_df[max_censor_df['eid'] == censored_patient_id]
    max_censor_age = max_censor_row['max_censor'].values[0]
    max_censor_timepoint = int(max_censor_age - 30)
    T = E.shape[2]
    max_timepoint_in_array = min(max_censor_timepoint, T - 1)
    
    print(f"Max censor age: {max_censor_age}")
    print(f"Max censor timepoint: {max_censor_timepoint}")
    print(f"Max timepoint in array: {max_timepoint_in_array}")
    print(f"Total timepoints T: {T}")
    
    # Check E_corrected
    E_corrected_patient_disease = E_corrected[censored_patient_idx, censored_disease_idx]
    print(f"\nE_corrected: {E_corrected_patient_disease}")
    print(f"Should equal max_censor_age ({max_censor_age}): {E_corrected_patient_disease == max_censor_age}")
    
    # Check E array
    E_patient_disease = E[censored_patient_idx, censored_disease_idx, :]
    print(f"\nE array:")
    print(f"  E[0:5]: {E_patient_disease[:5]}")
    if max_timepoint_in_array >= 0:
        window_start = max(0, max_timepoint_in_array - 2)  # a negative start would wrap around to the end of the array
        print(f"  E[{window_start}:{max_timepoint_in_array+3}]: {E_patient_disease[window_start:max_timepoint_in_array+3]}")
        if max_timepoint_in_array < T - 1:
            print(f"  E[{max_timepoint_in_array+1}:{max_timepoint_in_array+5}]: {E_patient_disease[max_timepoint_in_array+1:max_timepoint_in_array+5]}")
        
        # Verify: E should be constant up to max_timepoint, then 0
        E_before_censor = E_patient_disease[:max_timepoint_in_array+1]
        expected_value = max_timepoint_in_array + 1
        print(f"\nVerification:")
        print(f"  E values up to timepoint {max_timepoint_in_array} should all be {expected_value}")
        print(f"  Actual: {E_before_censor[:5]}... (last: {E_before_censor[-1]})")
        print(f"  All equal to {expected_value}? {np.all(E_before_censor == expected_value)}")
        
        if max_timepoint_in_array < T - 1:
            E_after_censor = E_patient_disease[max_timepoint_in_array+1:]
            print(f"  E values after timepoint {max_timepoint_in_array} should all be 0")
            print(f"  Actual: {E_after_censor[:5]}...")
            print(f"  All equal to 0? {np.all(E_after_censor == 0)}")
    else:
        print(f"  ⚠️ Max censor timepoint ({max_censor_timepoint}) is negative!")
        print(f"  Patient was censored before observation window starts")
        print(f"  E should be all 0s: {np.all(E_patient_disease == 0)}")



1. EVENT PATIENT VERIFICATION
Patient index: 100, Disease index: 10
Y values: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]...
Event occurred at timepoints: []

E_corrected (max censor age): 44
E array (first 20): [15 15 15 15 15 15 15 15 15 15 15 15 15 15 15  0  0  0  0  0]


2. EARLY/LATE DIAGNOSIS VERIFICATION
Patient ID: 1000717
Patient diagnoses:
          eid  diag_icd10  age_diag
2591  1000717      495.00        44
2592  1000717      340.00        44
2593  1000717      622.10        44
2594  1000717      619.20        44
2595  1000717      626.00        43
2596  1000717      626.10        43
2597  1000717      626.12        43
2598  1000717      611.30        44

Diagnoses BEFORE age 30: 0

Diagnoses AFTER age 105: 0

Y values for disease 10:
  Y sum (should be 0 if before/after): 0
  Y values: [0 0 0 0 0 0 0 0 0 0]...

E values:
  Max censor age from data: 44
  E_corrected: 44


3. CENSORED PATIENT VERIFICATION
Patient ID: 1000000
Patient index: 0
Max censor age: 64
Max censor time