# NeMo Data Designer for Stroke Prediction Class Imbalance

This notebook uses NeMo Data Designer (NDD) to generate synthetic stroke patients and augment the minority class.

**⚠️ Important Workflow Note:**
- This notebook uses **UNSCALED** data (real ages, glucose levels, BMI) for NDD generation
- The LLM validation evaluates medical plausibility with actual values (age=67, not 0.72)
- After generation, we **scale** the synthetic samples using the same `RobustScaler` fitted during feature engineering
- Finally, we combine **scaled** synthetic + **scaled** original data for model training
- This ensures the LLM can meaningfully evaluate medical coherence while maintaining proper ML preprocessing
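
The generate-then-scale order described in this note can be illustrated with a small sketch. It is a pure-Python stand-in, not the notebook's scaler: `fit_robust` and `transform_robust` are hypothetical helpers that mimic the default behaviour of `RobustScaler` (subtract the median, divide by the interquartile range), and all ages are illustrative values.

```python
import statistics

def fit_robust(values):
    """Return (median, IQR) of the fitting data, mimicking RobustScaler defaults."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1

def transform_robust(values, center, scale):
    """Apply already-fitted statistics; the scaler is never refit on synthetic rows."""
    return [(v - center) / scale for v in values]

# Illustrative ages only (years).
original_ages = [67.0, 61.0, 80.0, 49.0, 79.0]
synthetic_ages = [59.5, 81.3]  # the unscaled values the LLM would see and judge

center, scale = fit_robust(original_ages)  # fitted once, on real data
scaled_synthetic = transform_robust(synthetic_ages, center, scale)

print(center, scale)
print([round(v, 3) for v in scaled_synthetic])
```

The point of the ordering is that the LLM judges values such as 59.5 years, while the model only ever sees values on the scale learned from the original data.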


## 1. Setup and Load Packages


In [1]:
import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score, 
    precision_score, recall_score, f1_score, roc_auc_score, 
    average_precision_score
)
import xgboost as xgb

plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Load environment variables
load_dotenv()

print("All packages loaded successfully!")

All packages loaded successfully!


In [2]:
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    NeMoDataDesignerClient,
    UniformSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

data_designer_client = NeMoDataDesignerClient(
    base_url="https://ai.api.nvidia.com/v1/nemo/dd",
    default_headers={"Authorization": f"Bearer {os.getenv('NVIDIA_API_KEY')}"}
)

model_alias = "nemotron-super"  

print(f"✓ NeMo Data Designer client initialized!")
print(f"✓ Using model: {model_alias}")

✓ NeMo Data Designer client initialized!
✓ Using model: nemotron-super


## 2. Load and Analyze Data

We need to understand the distributions of stroke patients to configure our NDD samplers


In [3]:
# Load UNSCALED data for NeMo Data Designer
# NDD needs real values (age in years, glucose in mg/dL, BMI) for LLM validation
df = pd.read_csv('../data/stroke_data_unscaled.csv')

print(f"📊 Dataset shape: {df.shape}")
print(f"\n📋 Features ({len(df.columns)} total):")
print(f"   {df.columns.tolist()}")
print(f"\n🔍 Using UNSCALED data for NDD (real ages, glucose, BMI)")
print(f"   This allows LLM to evaluate medical plausibility with actual values")
print(f"\n📈 Class distribution:")
print(df['stroke'].value_counts())
print(f"\n📊 Class distribution (%):")
print(df['stroke'].value_counts(normalize=True) * 100)

stroke_cases = df[df['stroke'] == 1]
non_stroke_cases = df[df['stroke'] == 0]

print(f"\n✓ Stroke cases: {len(stroke_cases)} ({len(stroke_cases)/len(df)*100:.2f}%)")
print(f"✓ Non-stroke cases: {len(non_stroke_cases)} ({len(non_stroke_cases)/len(df)*100:.2f}%)")

df.head()

📊 Dataset shape: (5110, 17)

📋 Features (17 total):
   ['id', 'age', 'avg_glucose_level', 'bmi', 'stroke', 'gender_Male', 'hypertension_Yes', 'heart_disease_Yes', 'ever_married_Yes', 'work_type_Never_worked', 'work_type_Private', 'work_type_Self-employed', 'work_type_children', 'Residence_type_Urban', 'smoking_status_formerly smoked', 'smoking_status_never smoked', 'smoking_status_smokes']

🔍 Using UNSCALED data for NDD (real ages, glucose, BMI)
   This allows LLM to evaluate medical plausibility with actual values

📈 Class distribution:
stroke
0    4861
1     249
Name: count, dtype: int64

📊 Class distribution (%):
stroke
0    95.127202
1     4.872798
Name: proportion, dtype: float64

✓ Stroke cases: 249 (4.87%)
✓ Non-stroke cases: 4861 (95.13%)


Unnamed: 0,id,age,avg_glucose_level,bmi,stroke,gender_Male,hypertension_Yes,heart_disease_Yes,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,9046,67.0,228.69,36.6,1,1,0,1,1,0,1,0,0,1,1,0,0
1,51676,61.0,202.21,28.1,1,0,0,0,1,0,0,1,0,0,0,1,0
2,31112,80.0,105.92,32.5,1,1,0,1,1,0,1,0,0,0,0,1,0
3,60182,49.0,171.23,34.4,1,0,0,0,1,0,1,0,0,1,0,0,1
4,1665,79.0,174.12,24.0,1,0,1,0,1,0,0,1,0,0,0,1,0


In [4]:
print("=" * 70)
print("Numerical Feature Statistics - STROKE PATIENTS (UNSCALED)")
print("=" * 70)
print(stroke_cases[['age', 'avg_glucose_level', 'bmi']].describe())
print("\nℹ️  These are REAL values:")
print("   - Age: in years")
print("   - Glucose: in mg/dL")
print("   - BMI: body mass index")

print("\n" + "=" * 70)
print("Numerical Feature Statistics - NON-STROKE PATIENTS (UNSCALED)")
print("=" * 70)
print(non_stroke_cases[['age', 'avg_glucose_level', 'bmi']].describe())

Numerical Feature Statistics - STROKE PATIENTS (UNSCALED)
              age  avg_glucose_level         bmi
count  249.000000         249.000000  249.000000
mean    67.728193         132.544739   30.090361
std     12.727419          61.921056    5.861877
min      1.320000          56.110000   16.900000
25%     59.000000          79.790000   27.000000
50%     71.000000         105.220000   28.100000
75%     78.000000         196.710000   32.500000
max     82.000000         271.740000   56.600000

ℹ️  These are REAL values:
   - Age: in years
   - Glucose: in mg/dL
   - BMI: body mass index

Numerical Feature Statistics - NON-STROKE PATIENTS (UNSCALED)
               age  avg_glucose_level          bmi
count  4861.000000        4861.000000  4861.000000
mean     41.971545         104.795513    28.799115
std      22.291940          43.846069     7.777269
min       0.080000          55.120000    10.300000
25%      24.000000          77.120000    23.600000
50%      43.000000          91.47000

### Categorical Feature Distributions for Stroke Patients


In [5]:
# Analyze categorical feature distributions for stroke patients
binary_features = ['gender_Male', 'hypertension_Yes', 'heart_disease_Yes', 
                    'ever_married_Yes', 'Residence_type_Urban']

print("Binary Feature Distributions (Stroke Patients):")
print("=" * 70)
for feature in binary_features:
    dist = stroke_cases[feature].value_counts(normalize=True)
    print(f"\n{feature}:")
    print(f"  0: {dist.get(0, 0):.2%}")
    print(f"  1: {dist.get(1, 0):.2%}")

print("\n" + "=" * 70)
print("Work Type Distribution (Stroke Patients):")
print("=" * 70)
work_cols = ['work_type_Never_worked', 'work_type_Private', 'work_type_Self-employed', 'work_type_children']
for col in work_cols:
    pct = stroke_cases[col].sum() / len(stroke_cases)
    print(f"  {col}: {pct:.2%}")

print("\n" + "=" * 70)
print("Smoking Status Distribution (Stroke Patients):")
print("=" * 70)
smoking_cols = ['smoking_status_formerly smoked', 'smoking_status_never smoked', 'smoking_status_smokes']
for col in smoking_cols:
    pct = stroke_cases[col].sum() / len(stroke_cases)
    print(f"  {col}: {pct:.2%}")

Binary Feature Distributions (Stroke Patients):

gender_Male:
  0: 56.63%
  1: 43.37%

hypertension_Yes:
  0: 73.49%
  1: 26.51%

heart_disease_Yes:
  0: 81.12%
  1: 18.88%

ever_married_Yes:
  0: 11.65%
  1: 88.35%

Residence_type_Urban:
  0: 45.78%
  1: 54.22%

Work Type Distribution (Stroke Patients):
  work_type_Never_worked: 0.00%
  work_type_Private: 59.84%
  work_type_Self-employed: 26.10%
  work_type_children: 0.80%

Smoking Status Distribution (Stroke Patients):
  smoking_status_formerly smoked: 28.11%
  smoking_status_never smoked: 36.14%
  smoking_status_smokes: 16.87%
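
The work-type and smoking-status shares printed above do not add up to 100%, which suggests that each group has a baseline category without a one-hot column of its own. A quick check with the percentages copied from the output (reading the gap as a dropped baseline is an assumption, and `residual_share` is a hypothetical helper):

```python
# Shares copied from the printed output above (percent of stroke patients).
work_type_pct = {"Never_worked": 0.00, "Private": 59.84,
                 "Self-employed": 26.10, "children": 0.80}
smoking_pct = {"formerly smoked": 28.11, "never smoked": 36.14, "smokes": 16.87}

def residual_share(shares):
    """Percentage of patients not covered by any listed category."""
    return round(100.0 - sum(shares.values()), 2)

work_residual = residual_share(work_type_pct)
smoking_residual = residual_share(smoking_pct)
print(work_residual, smoking_residual)
```

Roughly 13% and 19% of stroke patients therefore fall into a category that a sampler built only from the listed columns could never produce.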


## 3. Configure NeMo Data Designer

Now we'll configure NDD samplers based on the stroke patient distributions we just analyzed

### Step 1: Initialize Config Builder and Configure Numerical Features


In [6]:
# Initialize configuration builder
config_builder = DataDesignerConfigBuilder()

print("⚙️  Configuring NDD samplers for synthetic stroke patients...")
print("\n1️⃣ Adding numerical features (UNSCALED - real values)...")

# Add Age (unscaled - in years)
age_min = float(stroke_cases['age'].min())
age_max = float(stroke_cases['age'].max())
config_builder.add_column(
    SamplerColumnConfig(
        name="age",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=age_min, high=age_max),
    )
)
print(f"   Age: {age_min:.1f} - {age_max:.1f} years")

# Add Average Glucose Level (unscaled - in mg/dL)
glucose_min = float(stroke_cases['avg_glucose_level'].min())
glucose_max = float(stroke_cases['avg_glucose_level'].max())
config_builder.add_column(
    SamplerColumnConfig(
        name="avg_glucose_level",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=glucose_min, high=glucose_max),
    )
)
print(f"   Glucose: {glucose_min:.1f} - {glucose_max:.1f} mg/dL")

# Add BMI (unscaled - body mass index)
bmi_min = float(stroke_cases['bmi'].min())
bmi_max = float(stroke_cases['bmi'].max())
config_builder.add_column(
    SamplerColumnConfig(
        name="bmi",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=bmi_min, high=bmi_max),
    )
)
print(f"   BMI: {bmi_min:.1f} - {bmi_max:.1f}")

print("\n✓ Numerical features configured with REAL VALUES")

⚙️  Configuring NDD samplers for synthetic stroke patients...

1️⃣ Adding numerical features (UNSCALED - real values)...
   Age: 1.3 - 82.0 years
   Glucose: 56.1 - 271.7 mg/dL
   BMI: 16.9 - 56.6

✓ Numerical features configured with REAL VALUES
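
A uniform sampler between the observed minimum and maximum reproduces the range of the stroke-patient data but not its shape. The arithmetic below uses only the summary statistics printed in Section 2 to give a rough sense of the gap for age. It is a back-of-the-envelope check, not part of the NDD configuration.

```python
# Summary statistics for stroke patients, copied from the describe() output.
age_min, age_max = 1.32, 82.0
observed_mean = 67.73
observed_q1 = 59.0  # 25% of real stroke patients are at or below this age

# Properties of Uniform(age_min, age_max).
uniform_mean = (age_min + age_max) / 2
share_below_q1 = (observed_q1 - age_min) / (age_max - age_min)

print(f"Uniform mean age: {uniform_mean:.2f} vs observed {observed_mean:.2f}")
print(f"Uniform share at or below {observed_q1:.0f}: {share_below_q1:.1%} vs observed 25%")
```

Under these numbers about 71% of uniformly sampled ages fall at or below 59, against 25% in the real stroke group. The plausibility score added in Step 4 is the mechanism this notebook relies on to screen out such rows.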


### Step 2: Configure Binary Categorical Features

In [7]:
print("2️⃣ Adding binary categorical features...")

binary_features = ['gender_Male', 'hypertension_Yes', 'heart_disease_Yes', 
                    'ever_married_Yes', 'Residence_type_Urban']

for feature in binary_features:
    dist = stroke_cases[feature].value_counts(normalize=True).to_dict()
    config_builder.add_column(
        SamplerColumnConfig(
            name=feature,
            sampler_type=SamplerType.CATEGORY,
            params=CategorySamplerParams(
                values=[0, 1],
                weights=[dist.get(0, 0.0), dist.get(1, 0.0)]  # a value absent from the data gets weight 0
            ),
            convert_to="int"
        )
    )
    
print(f"✓ Configured {len(binary_features)} binary features")


2️⃣ Adding binary categorical features...
✓ Configured 5 binary features


### Step 3: Configure Multi-Class Categorical Features (Work Type & Smoking Status)

In [8]:
print("3️⃣ Adding multi-class categorical features...")

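# Work Type
# The four one-hot columns leave out a baseline category (assumed here to be
# "Govt_job", dropped during encoding), so their shares sum to less than 1.
# The remainder is assigned to that baseline so it can be sampled too.
work_types = ["Never_worked", "Private", "Self-employed", "children", "Govt_job"]
work_weights = [
    stroke_cases['work_type_Never_worked'].sum() / len(stroke_cases),
    stroke_cases['work_type_Private'].sum() / len(stroke_cases),
    stroke_cases['work_type_Self-employed'].sum() / len(stroke_cases),
    stroke_cases['work_type_children'].sum() / len(stroke_cases),
]
work_weights.append(max(0.0, 1.0 - sum(work_weights)))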

config_builder.add_column(
    SamplerColumnConfig(
        name="work_type",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=work_types,
            weights=work_weights
        ),
    )
)

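# Smoking Status
# The one-hot columns omit a baseline category (assumed here to be "Unknown");
# it receives the remaining share so it can be sampled too.
smoking_statuses = ["formerly smoked", "never smoked", "smokes", "Unknown"]
smoking_weights = [
    stroke_cases['smoking_status_formerly smoked'].sum() / len(stroke_cases),
    stroke_cases['smoking_status_never smoked'].sum() / len(stroke_cases),
    stroke_cases['smoking_status_smokes'].sum() / len(stroke_cases),
]
smoking_weights.append(max(0.0, 1.0 - sum(smoking_weights)))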

config_builder.add_column(
    SamplerColumnConfig(
        name="smoking_status",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=smoking_statuses,
            weights=smoking_weights
        ),
    )
)

print("✓ Work type and smoking status configured")

3️⃣ Adding multi-class categorical features...
✓ Work type and smoking status configured


### Step 4: (Optional) Add LLM Medical Coherence Validation

This is where NDD goes beyond the statistical samplers configured so far: an LLM scores whether each sampled feature combination makes medical sense.


In [9]:
# Set to True to enable LLM-based medical coherence check (slower, more expensive, but validates data quality)
use_llm_validation = True  # Change to False to skip

if use_llm_validation:
    print("4️⃣ Adding LLM medical coherence validation...")
    
    config_builder.add_column(
        LLMTextColumnConfig(
            name="medical_plausibility",
            prompt=(
                "Given a stroke patient with:\n"
                "- Age: {{ age }} years\n"
                "- Average glucose level: {{ avg_glucose_level }} mg/dL\n"
                "- BMI: {{ bmi }}\n"
                "- Hypertension: {{ hypertension_Yes }} (0=No, 1=Yes)\n"
                "- Heart disease: {{ heart_disease_Yes }} (0=No, 1=Yes)\n"
                "- Smoking status: {{ smoking_status }}\n\n"
                "Rate the medical plausibility of this combination (1-10). "
                "Higher scores mean more realistic/typical stroke patient profile. "
                "Consider how these risk factors commonly occur together in actual stroke patients. "
                "Respond with ONLY a number from 1 to 10."
            ),
            system_prompt=(
                "You are a medical expert evaluating stroke patient profiles. "
                "Consider realistic medical scenarios and how risk factors "
                "(age, glucose, BMI, hypertension, heart disease, smoking) "
                "typically present together in actual stroke patients."
            ),
            model_alias=model_alias,
        )
    )
    
    print("✓ LLM validation configured - will generate plausibility scores (1-10)")
    print("   LLM will see ACTUAL medical values (age in years, glucose in mg/dL, etc.)")
else:
    print("4️⃣ Skipping LLM validation")

print("\n✅ Configuration complete!")


4️⃣ Adding LLM medical coherence validation...
✓ LLM validation configured - will generate plausibility scores (1-10)
   LLM will see ACTUAL medical values (age in years, glucose in mg/dL, etc.)

✅ Configuration complete!


In [14]:
def extract_plausibility_score(text):
    """
    Extract numeric plausibility score (1-10) from LLM response
    Handles responses with <think> tags and explanations
    
    The LLM may include its thinking process in <think>...</think> tags,
    followed by the final numeric score. This function extracts just the number.
    
    Args:
        text: Raw LLM response text
        
    Returns:
        Numeric score (1-10) or NaN if extraction fails
    """
    if pd.isna(text):
        return np.nan
    
    text = str(text).strip()
    
    # Remove <think>...</think> tags and their content
    text_cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    
    # Find all numbers in the remaining text (after removing think tags)
    numbers = re.findall(r'\b([1-9]|10)\b', text_cleaned)
    
    if numbers:
        # Return the first valid number found (should be the final answer)
        return int(numbers[0])
    
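    # Fallback: search the original text (e.g. the answer sits inside <think> tags)
    # and take the last in-range number, which is more likely to be the conclusion
    numbers = re.findall(r'\b([1-9]|10)\b', text)
    if numbers:
        return int(numbers[-1])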
    
    return np.nan
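
A few sample responses make the parsing behaviour easier to see. The sketch below is a standalone, stdlib-only variant rather than the function above: the name `parse_score` is hypothetical, it returns `None` instead of NaN, it also drops an unclosed `<think>` block, and it takes the last in-range number in case the model restates the scale before answering. The sample strings are invented for illustration.

```python
import re
from typing import Optional

_SCORE = re.compile(r"\b(10|[1-9])\b")

def parse_score(text: Optional[str]) -> Optional[int]:
    """Pull a 1-10 score out of an LLM reply, ignoring <think> reasoning."""
    if text is None:
        return None
    # Remove closed <think>...</think> blocks, then any unclosed trailing one.
    cleaned = re.sub(r"<think>.*?</think>", "", str(text), flags=re.DOTALL)
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL)
    matches = _SCORE.findall(cleaned)
    return int(matches[-1]) if matches else None

samples = [
    "7",
    "<think>Age 67 with glucose 228 is typical.</think>\n8",
    "On a scale of 1 to 10 I would say 6",
    "<think>still reasoning about age 5",
    "no score given",
]
for reply in samples:
    print(repr(reply), "->", parse_score(reply))
```

A reply phrased as "7 out of 10" would defeat the last-number rule, so the prompt's instruction to respond with only a number still matters.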

## 4. Generate Preview (Test Configuration)

First, generate a small preview to verify everything works correctly


In [10]:
print("🔍 Generating preview (10 synthetic stroke patients)...")
print("⏱️  This may take 1-2 minutes...\n")

preview = data_designer_client.preview(config_builder, num_records=10)

print("\n✓ Preview generated successfully!")


[16:30:21] [INFO] ✅ Validation passed
[16:30:21] [INFO] 🚀 Starting preview generation


🔍 Generating preview (10 synthetic stroke patients)...
⏱️  This may take 1-2 minutes...



[16:30:22] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[16:30:22] [INFO] 🩺 Running health checks for models...
[16:30:24] [INFO]   |-- 👀 Checking 'nvidia/nvidia-nemotron-nano-9b-v2'...
[16:30:24] [INFO]   |-- ✅ Passed!
[16:30:26] [INFO]   |-- 👀 Checking 'nvidia/llama-3.3-nemotron-super-49b-v1.5'...
[16:30:26] [INFO]   |-- ✅ Passed!
[16:30:27] [INFO]   |-- 👀 Checking 'mistralai/mistral-small-24b-instruct'...
[16:30:27] [INFO]   |-- ✅ Passed!
[16:30:28] [INFO]   |-- 👀 Checking 'openai/gpt-oss-20b'...
[16:30:28] [INFO]   |-- ✅ Passed!
[16:30:29] [INFO]   |-- 👀 Checking 'openai/gpt-oss-120b'...
[16:30:29] [INFO]   |-- ✅ Passed!
[16:30:30] [INFO]   |-- 👀 Checking 'meta/llama-4-scout-17b-16e-instruct'...
[16:30:30] [INFO]   |-- ✅ Passed!
[16:30:30] [INFO] ⏳ Processing batch 1 of 1
[16:30:30] [INFO] 🎲 Preparing samplers to generate 10 records across 10 columns
[16:30:30] [INFO] 📝 Preparing llm-text column generation
[16:30:30] [INFO]   |-- column name: 'medical_plausibility


✓ Preview generated successfully!


In [12]:
preview.display_sample_record()

In [15]:
preview_df = preview.dataset
print(f"Preview dataset shape: {preview_df.shape}")
print(f"\nPreview columns: {preview_df.columns.tolist()}")
preview_df['medical_plausibility'] = preview_df['medical_plausibility'].apply(extract_plausibility_score)
preview_df.head(n=10)

Preview dataset shape: (10, 11)

Preview columns: ['age', 'avg_glucose_level', 'bmi', 'gender_Male', 'hypertension_Yes', 'heart_disease_Yes', 'ever_married_Yes', 'Residence_type_Urban', 'work_type', 'smoking_status', 'medical_plausibility']


Unnamed: 0,age,avg_glucose_level,bmi,gender_Male,hypertension_Yes,heart_disease_Yes,ever_married_Yes,Residence_type_Urban,work_type,smoking_status,medical_plausibility
0,59.522645,86.555396,44.485638,1,0,0,1,0,Private,never smoked,6
1,36.106907,262.348128,52.238289,0,0,0,1,0,Private,never smoked,5
2,38.081645,186.497012,17.270724,1,1,0,0,1,Private,formerly smoked,6
3,21.715803,264.722253,27.61964,0,0,0,1,1,Private,never smoked,5
4,47.167543,259.310869,39.984499,1,0,0,1,0,Private,never smoked,6
5,15.826247,137.896119,55.229181,1,0,0,1,0,Private,formerly smoked,1
6,61.609836,67.032915,36.118685,0,0,1,1,0,Private,formerly smoked,7
7,81.303494,184.890897,56.377818,1,0,1,1,1,Self-employed,never smoked,7
8,40.747273,64.155775,36.614394,1,0,0,1,1,Private,formerly smoked,5
9,56.117794,147.119623,52.858248,1,1,0,1,1,Self-employed,never smoked,8
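
The scores in this preview give a first, very rough estimate of how many rows would survive the plausibility filter applied later (`min_plausibility=7`). The sketch below copies the ten scores from the table above. Ten rows is a tiny sample, so the resulting request size is an illustration of the arithmetic rather than a recommendation.

```python
import math

# medical_plausibility column from the 10-row preview above.
preview_scores = [6, 5, 6, 5, 6, 1, 7, 7, 5, 8]
min_plausibility = 7
target_rows = 249  # conservative strategy: one synthetic row per real stroke case

kept = [s for s in preview_scores if s >= min_plausibility]
keep_rate = len(kept) / len(preview_scores)

# Rows to request so that roughly target_rows survive the filter
# (integer arithmetic avoids floating-point rounding in the ceiling).
rows_to_request = math.ceil(target_rows * len(preview_scores) / len(kept))

print(f"Kept {len(kept)} of {len(preview_scores)} preview rows ({keep_rate:.0%})")
print(f"Request about {rows_to_request} rows to end up with {target_rows}")
```

If the keep rate on a larger sample stays near 30%, generating exactly 249 rows would leave well under 100 synthetic patients after filtering.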


## 5. Generate Full Synthetic Dataset

**⚠️ Important**: This step will take time (10-20 minutes) and consume API credits (~$0.50-2.00)

Choose your augmentation strategy:
- **Conservative** (2x minority): 249 samples (recommended for testing)
- **Moderate** (4x minority): 747 samples
- **Aggressive** (full balance): ~4,612 samples
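
The three options translate into the following stroke-class shares after augmentation. This is a small arithmetic sketch based on the class counts from Section 2. It assumes every generated row is kept, which will not hold once a plausibility filter is applied.

```python
minority, majority = 249, 4861  # stroke / non-stroke counts from Section 2

strategies = {
    "Conservative (2x minority)": minority,            # doubles the stroke class
    "Moderate (4x minority)": minority * 3,            # adds three times the original count
    "Aggressive (full balance)": majority - minority,  # matches the majority class
}

for name, n_synthetic in strategies.items():
    stroke_total = minority + n_synthetic
    share = stroke_total / (stroke_total + majority)
    print(f"{name}: +{n_synthetic} synthetic -> {stroke_total} stroke rows ({share:.1%})")
```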


In [13]:
# Calculate generation options
num_minority = len(stroke_cases)
num_majority = len(non_stroke_cases)

print(f"Original minority class size: {num_minority}")
print(f"Original majority class size: {num_majority}")
print(f"\nGeneration options:")
print(f"  Conservative (2x minority): {num_minority} samples")
print(f"  Moderate (4x minority): {num_minority * 3} samples")
print(f"  Aggressive (full balance): {num_majority - num_minority} samples")

# Choose strategy - CONSERVATIVE by default (recommended for testing)
num_to_generate = num_minority  # Start with conservative

print(f"\n🎯 Will generate: {num_to_generate} synthetic stroke patients")
print(f"⚠️  Uncomment the cell below to run the generation")

Original minority class size: 249
Original majority class size: 4861

Generation options:
  Conservative (2x minority): 249 samples
  Moderate (4x minority): 747 samples
  Aggressive (full balance): 4612 samples

🎯 Will generate: 249 synthetic stroke patients
⚠️  Uncomment the cell below to run the generation


In [16]:
# DEBUG: Let's understand what's different between preview and create
# First, let's enable debug logging to see what endpoints are being called

import logging
logging.basicConfig(level=logging.DEBUG)

# Also let's check if there's a way to see the actual request being made
import httpx

# Create a custom client with request logging
print("=" * 70)
print("DEBUGGING: What endpoints are being called?")
print("=" * 70)
print("\nClient base_url:", data_designer_client.base_url)
print("Client has .preview method:", hasattr(data_designer_client, 'preview'))
print("Client has .create method:", hasattr(data_designer_client, 'create'))
print("\nThe preview() call succeeded, which means the API was working")
print("The create() call fails with 404, which suggests:")
print("  1. Different endpoint path for create vs preview")
print("  2. Service became unavailable between calls")
print("  3. Different authentication/quota for create")
print("=" * 70)


DEBUGGING: What endpoints are being called?


AttributeError: 'NeMoDataDesignerClient' object has no attribute 'base_url'

In [19]:
# UNCOMMENT AND RUN THIS CELL TO GENERATE SYNTHETIC DATA
# ⚠️ This will take 10-20 minutes and consume API credits!

print(f"🏭 Generating {num_to_generate} synthetic stroke patients...")
print(f"⏱️  This will take several minutes. Please wait...\n")

synthetic_result = data_designer_client.create(config_builder, num_records=num_to_generate, wait_until_done=True)
synthetic_df = synthetic_result.dataset

print(f"\n✓ Generated {len(synthetic_df)} synthetic samples!")
print(f"Synthetic dataset shape: {synthetic_df.shape}")

[16:18:06] [INFO] 🎨 Creating Data Designer generation job
[16:18:06] [INFO] ✅ Validation passed


🏭 Generating 249 synthetic stroke patients...
⏱️  This will take several minutes. Please wait...



DataDesignerClientError: ‼️ Something went wrong!
404 page not found

## 6. Load Scaler for Synthetic Data

We need to scale the synthetic data using the same scaler fitted on the original training data


In [None]:
import joblib

# Load the fitted scaler from feature engineering
scaler_path = '../data/robust_scaler.pkl'
robust_scaler = joblib.load(scaler_path)

print(f"✓ Loaded RobustScaler from: {scaler_path}")
print(f"   This scaler will be applied to numerical features in synthetic data")
print(f"   to ensure consistency with the training data")


## 7. Process Generated Data

Convert NDD output to match original dataset format and **scale numerical features**


In [None]:
def process_generated_data(generated_df, scaler, min_plausibility=None):
    """
    Convert generated data to match original dataset format and scale numerical features
    
    Args:
        generated_df: DataFrame from NDD (with UNSCALED numerical features)
        scaler: Fitted RobustScaler to apply to numerical features
        min_plausibility: Minimum plausibility score to keep (if using LLM validation)
    """
    print("🔄 Processing generated data...")
    
    # Filter by plausibility if specified
    if min_plausibility is not None and 'medical_plausibility' in generated_df.columns:
        original_len = len(generated_df)
        # Work on a copy so the caller's DataFrame is not modified
        generated_df = generated_df.copy()
        # Extract just the numeric value from the plausibility column
        # (handles LLM responses with <think> tags and explanations)
        generated_df['plausibility_score'] = generated_df['medical_plausibility'].apply(
            extract_plausibility_score
        )
        # Rows whose score could not be parsed (NaN) are dropped by this comparison
        generated_df = generated_df[
            generated_df['plausibility_score'] >= min_plausibility
        ]
        print(f"   ✓ Extracted plausibility scores (1-10) from LLM responses")
        print(f"   Filtered by plausibility ≥{min_plausibility}: {original_len} → {len(generated_df)}")
    
    processed = pd.DataFrame()
    
    # Numerical features (UNSCALED from NDD)
    numerical_features = ['age', 'avg_glucose_level', 'bmi']
    processed[numerical_features] = generated_df[numerical_features].copy()
    
    print(f"   📊 Before scaling: age range [{processed['age'].min():.1f}, {processed['age'].max():.1f}]")
    
    # Apply the fitted scaler to numerical features
    processed[numerical_features] = scaler.transform(processed[numerical_features])
    
    print(f"   ✓ Scaled numerical features using fitted RobustScaler")
    print(f"   📊 After scaling: age range [{processed['age'].min():.3f}, {processed['age'].max():.3f}]")
    
    # Binary features
    processed['gender_Male'] = generated_df['gender_Male'].astype(int)
    processed['hypertension_Yes'] = generated_df['hypertension_Yes'].astype(int)
    processed['heart_disease_Yes'] = generated_df['heart_disease_Yes'].astype(int)
    processed['ever_married_Yes'] = generated_df['ever_married_Yes'].astype(int)
    processed['Residence_type_Urban'] = generated_df['Residence_type_Urban'].astype(int)
    
    # One-hot encode work_type
    processed['work_type_Never_worked'] = (generated_df['work_type'] == 'Never_worked').astype(int)
    processed['work_type_Private'] = (generated_df['work_type'] == 'Private').astype(int)
    processed['work_type_Self-employed'] = (generated_df['work_type'] == 'Self-employed').astype(int)
    processed['work_type_children'] = (generated_df['work_type'] == 'children').astype(int)
    
    # One-hot encode smoking_status
    processed['smoking_status_formerly smoked'] = (generated_df['smoking_status'] == 'formerly smoked').astype(int)
    processed['smoking_status_never smoked'] = (generated_df['smoking_status'] == 'never smoked').astype(int)
    processed['smoking_status_smokes'] = (generated_df['smoking_status'] == 'smokes').astype(int)
    
    # Add stroke label (all synthetic samples are stroke cases)
    processed['stroke'] = 1
    
    print(f"   ✓ Processed and scaled {len(processed)} samples")
    return processed

# Test with preview data
processed_preview = process_generated_data(
    preview_df, 
    scaler=robust_scaler,
    min_plausibility=7 if use_llm_validation else None
)
print(f"\nProcessed preview shape: {processed_preview.shape}")
print(f"Processed columns ({len(processed_preview.columns)}): {processed_preview.columns.tolist()}")
processed_preview.head()


In [None]:
# Process the full synthetic dataset (uncomment after generation)
# processed_synthetic = process_generated_data(
#     synthetic_df,
#     scaler=robust_scaler,
#     min_plausibility=7 if use_llm_validation else None
# )
# print(f"Final processed synthetic data shape: {processed_synthetic.shape}")


## 8. Combine Original and Synthetic Data

Create the augmented dataset by combining **scaled** original data with **scaled** synthetic samples


In [None]:
# UNCOMMENT AFTER GENERATING AND PROCESSING SYNTHETIC DATA

# print("🔗 Combining original and synthetic data...")

# # Load the SCALED original data (not the unscaled version we used for NDD)
# df_original = pd.read_csv('../data/stroke_data_prepared.csv')

# # Remove ID column from original if present
# df_original = df_original.drop('id', axis=1) if 'id' in df_original.columns else df_original.copy()

# # Ensure column order matches
# processed_synthetic = processed_synthetic[df_original.columns]

# # Combine SCALED original + SCALED synthetic
# df_augmented = pd.concat([df_original, processed_synthetic], ignore_index=True)

# print(f"\n📊 Dataset Summary:")
# print(f"   Original (scaled): {len(df_original)} samples")
# print(f"   Synthetic (scaled): {len(processed_synthetic)} samples")
# print(f"   Augmented: {len(df_augmented)} samples")

# print(f"\n📈 Augmented Class Distribution:")
# print(df_augmented['stroke'].value_counts())
# print(f"\n📊 Augmented Class Distribution (%):")
# print(df_augmented['stroke'].value_counts(normalize=True) * 100)

# print(f"\n✓ Augmented dataset created (all features properly scaled)!")

print("⚠️ Uncomment after generating synthetic data")


## 9. Train XGBoost Model on Augmented Data


In [None]:
# UNCOMMENT AFTER CREATING AUGMENTED DATASET

# print("🤖 Training XGBoost model on NDD-augmented data...\n")

# # Split features and target
# X_augmented = df_augmented.drop('stroke', axis=1)
# y_augmented = df_augmented['stroke']

# # Train-test split
# X_train, X_test, y_train, y_test = train_test_split(
#     X_augmented, y_augmented,
#     test_size=0.2,
#     random_state=42,
#     stratify=y_augmented
# )

# print(f"Training set shape: {X_train.shape}")
# print(f"Testing set shape: {X_test.shape}")
# print(f"\nTraining class distribution:")
# print(y_train.value_counts(normalize=True) * 100)

# # Train model
# ndd_model = xgb.XGBClassifier(
#     tree_method='hist',
#     eval_metric='aucpr',
#     random_state=42
# )

# print("\n⏳ Training model...")
# ndd_model.fit(X_train, y_train)
# print("✓ Training complete!")

print("⚠️ Uncomment after creating augmented dataset")
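
The commented cell above splits the augmented data, so synthetic rows can land in the test set and the reported metrics would partly measure performance on generated patients. One common alternative is to hold out real rows first and add synthetic rows to the training portion only. The sketch below shows that ordering with plain Python lists. `split_then_augment` and the toy rows are hypothetical; in the notebook the same ordering could be applied with `train_test_split` on `df_original` followed by `pd.concat` on the training part.

```python
import random

def split_then_augment(original, synthetic, test_fraction=0.2, seed=42):
    """Stratified hold-out on real rows; synthetic rows join training only.

    Each row is a dict with a 'stroke' label (0 or 1).
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        rows = [r for r in original if r["stroke"] == label]
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    train.extend(synthetic)  # augmentation happens after the split
    rng.shuffle(train)
    return train, test

# Toy data: 20 real rows (4 stroke) and 6 synthetic stroke rows.
original = [{"id": i, "stroke": int(i < 4), "synthetic": False} for i in range(20)]
synthetic = [{"id": 100 + i, "stroke": 1, "synthetic": True} for i in range(6)]

train, test = split_then_augment(original, synthetic)
print(len(train), len(test), any(r["synthetic"] for r in test))
```

With this ordering the test set keeps the original class balance, which also makes the metrics comparable with techniques that were evaluated on real data only.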


## 10. Evaluate Model Performance


In [None]:
# UNCOMMENT AFTER TRAINING MODEL

# print("📊 Evaluating model performance...\n")

# # Make predictions
# y_pred = ndd_model.predict(X_test)
# y_pred_proba = ndd_model.predict_proba(X_test)[:, 1]

# # Calculate metrics
# accuracy = accuracy_score(y_test, y_pred)
# precision = precision_score(y_test, y_pred)
# recall = recall_score(y_test, y_pred)
# f1 = f1_score(y_test, y_pred)
# roc_auc = roc_auc_score(y_test, y_pred_proba)
# pr_auc = average_precision_score(y_test, y_pred_proba)

# print("=" * 60)
# print("NeMo Data Designer Model Performance")
# print("=" * 60)
# print(f"Accuracy:  {accuracy:.4f}")
# print(f"Precision: {precision:.4f}")
# print(f"Recall:    {recall:.4f}")
# print(f"F1 Score:  {f1:.4f}")
# print(f"ROC-AUC:   {roc_auc:.4f}")
# print(f"PR-AUC:    {pr_auc:.4f}")
# print("=" * 60)

# print("\n📋 Detailed Classification Report:")
# print(classification_report(y_test, y_pred, target_names=['No Stroke', 'Stroke']))

# print("\n📊 Confusion Matrix:")
# cm = confusion_matrix(y_test, y_pred)
# print(cm)
# print(f"\nTrue Negatives: {cm[0,0]}")
# print(f"False Positives: {cm[0,1]}")
# print(f"False Negatives: {cm[1,0]}")
# print(f"True Positives: {cm[1,1]}")

print("⚠️ Uncomment after training model")
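
For reference, the metrics in the commented cell above can all be derived from the four confusion-matrix entries. The sketch below uses hypothetical counts (not results from this notebook) to show the relationships, and why accuracy alone says little on imbalanced data. `metrics_from_confusion` is an illustrative helper, not a scikit-learn function.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (stroke) class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts in the same layout as cm: [[TN, FP], [FN, TP]].
cm = [[950, 22], [30, 20]]
(tn, fp), (fn, tp) = cm
result = metrics_from_confusion(tn, fp, fn, tp)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In this made-up case accuracy is close to 0.95 while recall is 0.40, so most stroke cases would still be missed. That gap is why the comparison in Section 12 tracks recall, F1, and PR-AUC alongside accuracy.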


## 11. Visualize Results


In [None]:
# UNCOMMENT AFTER EVALUATION

# # Plot confusion matrix
# fig, ax = plt.subplots(figsize=(8, 6))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, cbar=True)
# ax.set_xlabel('Predicted Label', fontsize=12)
# ax.set_ylabel('True Label', fontsize=12)
# ax.set_title('Confusion Matrix - NeMo Data Designer Model', fontsize=14, fontweight='bold')
# ax.set_xticklabels(['No Stroke', 'Stroke'])
# ax.set_yticklabels(['No Stroke', 'Stroke'])
# plt.tight_layout()
# plt.show()

print("⚠️ Uncomment after evaluation")


## 12. Compare with Other Techniques

Compare NDD results with Base Model, Scale Pos Weight, and SMOTE approaches


In [None]:
# Create a comparison dataframe
# Fill in your actual values from other notebooks

comparison_df = pd.DataFrame({
    'Technique': [
        'Base Model',
        'Scale Pos Weight',
        'SMOTE',
        'NeMo Data Designer'
    ],
    'Accuracy': [
        0.0,  # Fill in from base-model.ipynb
        0.0,  # Fill in from scale-pos-weight-model.ipynb
        0.0,  # Fill in from smote-model.ipynb
        0.0,  # Will be filled after running NDD model
    ],
    'Precision': [
        0.0,  # Fill in from base-model.ipynb
        0.0,  # Fill in from scale-pos-weight-model.ipynb
        0.0,  # Fill in from smote-model.ipynb
        0.0,  # Will be filled after running NDD model
    ],
    'Recall': [
        0.0,  # Fill in from base-model.ipynb
        0.0,  # Fill in from scale-pos-weight-model.ipynb
        0.0,  # Fill in from smote-model.ipynb
        0.0,  # Will be filled after running NDD model
    ],
    'F1 Score': [
        0.0,  # Fill in from base-model.ipynb
        0.0,  # Fill in from scale-pos-weight-model.ipynb
        0.0,  # Fill in from smote-model.ipynb
        0.0,  # Will be filled after running NDD model
    ],
    'ROC-AUC': [
        0.0,  # Fill in from base-model.ipynb
        0.0,  # Fill in from scale-pos-weight-model.ipynb
        0.0,  # Fill in from smote-model.ipynb
        0.0,  # Will be filled after running NDD model
    ],
    'PR-AUC': [
        0.0,  # Fill in from base-model.ipynb
        0.0,  # Fill in from scale-pos-weight-model.ipynb
        0.0,  # Fill in from smote-model.ipynb
        0.0,  # Will be filled after running NDD model
    ],
})

# After running NDD model, update the last row with your actual metrics:
# comparison_df.loc[3, 'Accuracy'] = accuracy
# comparison_df.loc[3, 'Precision'] = precision
# comparison_df.loc[3, 'Recall'] = recall
# comparison_df.loc[3, 'F1 Score'] = f1
# comparison_df.loc[3, 'ROC-AUC'] = roc_auc
# comparison_df.loc[3, 'PR-AUC'] = pr_auc

print("\n" + "=" * 80)
print("Performance Comparison Across All Techniques")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("=" * 80)
print("\n⚠️ Fill in actual values from your other notebooks after running all models")
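
One way to avoid copying numbers by hand is to compute all six metrics in a single helper whose keys match the `comparison_df` column names. The sketch below is illustrative only: `metrics_row` is a hypothetical helper (not part of the notebook), the labels and probabilities are toy values rather than real results, and a 0.5 decision threshold is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
)


def metrics_row(y_true, y_proba, threshold=0.5):
    """Hypothetical helper: the six metrics keyed by comparison column name."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    # Hard predictions come from thresholding the positive-class probability
    y_pred = (y_proba >= threshold).astype(int)
    return {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, zero_division=0),
        'Recall': recall_score(y_true, y_pred, zero_division=0),
        'F1 Score': f1_score(y_true, y_pred, zero_division=0),
        # Ranking metrics use the probabilities, not the hard predictions
        'ROC-AUC': roc_auc_score(y_true, y_proba),
        'PR-AUC': average_precision_score(y_true, y_proba),
    }


# Toy labels and probabilities, for illustration only (NOT real results)
toy_y_true = [0, 0, 0, 1, 1, 0, 1, 0]
toy_y_proba = [0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05]

# Small stand-in for comparison_df with placeholder zeros
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC', 'PR-AUC']
demo_df = pd.DataFrame({'Technique': ['SMOTE', 'NeMo Data Designer']})
for name in metric_names:
    demo_df[name] = 0.0

# Update the row by technique name instead of a hard-coded index
row = metrics_row(toy_y_true, toy_y_proba)
mask = demo_df['Technique'] == 'NeMo Data Designer'
for column, value in row.items():
    demo_df.loc[mask, column] = value

print(demo_df.to_string(index=False))
```

In the notebook, the same loop could target `comparison_df`, with the model's test labels and predicted stroke probabilities in place of the toy arrays. Matching on the technique name rather than row index 3 keeps the update correct if the rows are ever reordered.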


In [None]:
# Visual comparison of metrics (uncomment after filling in comparison_df with actual values)

# fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# axes = axes.ravel()

# metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC', 'PR-AUC']
# colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

# for idx, metric in enumerate(metrics):
#     ax = axes[idx]
#     bars = ax.bar(comparison_df['Technique'], comparison_df[metric], color=colors)
#     ax.set_title(metric, fontsize=12, fontweight='bold')
#     ax.set_ylabel('Score', fontsize=10)
#     ax.set_ylim(0, 1)
#     ax.tick_params(axis='x', rotation=45)
#     ax.grid(axis='y', alpha=0.3)
    
#     # Add value labels on bars
#     for bar in bars:
#         height = bar.get_height()
#         if height > 0:  # Only show if value is filled in
#             ax.text(bar.get_x() + bar.get_width()/2., height,
#                     f'{height:.3f}',
#                     ha='center', va='bottom', fontsize=9)

# plt.tight_layout()
# plt.suptitle('Performance Comparison: All Class Imbalance Techniques', 
#              fontsize=14, fontweight='bold', y=1.02)
# plt.show()

print("⚠️ Uncomment after filling in comparison metrics")


## 13. Key Insights & Next Steps


In [None]:
print("=" * 80)
print("WORKFLOW SUMMARY - NeMo Data Designer for Stroke Prediction")
print("=" * 80)

print("\n✅ What You've Done:")
print("   1. Loaded UNSCALED data for NDD (real ages, glucose, BMI)")
print("   2. Configured NDD samplers based on actual stroke patient distributions")
print("   3. Added LLM-based medical coherence validation with real values")
print("   4. Generated preview to verify configuration")
print("   5. Loaded fitted scaler to apply to synthetic data")
print("   6. Set up processing pipeline that scales synthetic samples")

print("\n🔄 To Complete the Experiment:")
print("   1. Uncomment Cell 25 to generate full synthetic dataset (~10-20 min)")
print("   2. Uncomment Cell 30 to process and SCALE the generated data")
print("   3. Uncomment Cell 32 to combine SCALED original + SCALED synthetic")
print("   4. Uncomment Cell 34 to train XGBoost model")
print("   5. Uncomment Cell 36 to evaluate and get metrics")
print("   6. Fill in comparison table with metrics from all models")
print("   7. Uncomment visualization cells to compare results")

print("\n💡 Expected Insights:")
print("   • NDD may perform similarly to SMOTE (possibly slightly worse)")
print("   • LLM coherence validation is NDD's unique advantage")
print("   • SMOTE is faster and more cost-effective for numerical data")
print("   • NDD would excel with text-based medical features (clinical notes)")

print("\n📊 Key Considerations:")
print("   • Time: 10-20 minutes for generation")
print("   • Cost: ~$0.50-2.00 in API credits")
print("   • Learning: Valuable experience with LLM-based synthetic data")
print("   • IMPORTANT: NDD generates from unscaled data; synthetic samples are scaled afterwards")
print("     This lets the LLM evaluate real medical values (age=67, not 0.72)")

print("\n🎯 Recommendations:")
print("   • Use conservative generation (249 samples) for initial testing")
print("   • Keep plausibility threshold at 7+ for quality")
print("   • Compare honestly with SMOTE - document when each is better")
print("   • Save this approach for future text-based medical projects")

print("\n" + "=" * 80)
print("✓ Notebook Ready! Uncomment cells as you progress through the workflow.")
print("=" * 80)
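
The generate-unscaled, then-scale step from the summary can be sketched as follows. This is a minimal illustration under stated assumptions: the column names (`age`, `avg_glucose_level`, `bmi`, `stroke`) and the toy rows are hypothetical, and the scaler is fitted inline only so the example runs on its own. In the notebook, the already-fitted `RobustScaler` from feature engineering is loaded instead.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Assumed names for the continuous features; adjust to the real dataset
numeric_cols = ['age', 'avg_glucose_level', 'bmi']

# Toy UNSCALED original data (illustrative values only)
original = pd.DataFrame({
    'age': [45.0, 67.0, 80.0, 52.0, 33.0],
    'avg_glucose_level': [85.0, 228.7, 105.9, 171.2, 95.0],
    'bmi': [24.0, 36.6, 32.5, 34.4, 22.0],
    'stroke': [0, 1, 1, 1, 0],
})

# Toy UNSCALED synthetic stroke patients, as the LLM would see them
synthetic = pd.DataFrame({
    'age': [71.0, 63.0],
    'avg_glucose_level': [190.0, 140.0],
    'bmi': [30.1, 28.4],
    'stroke': [1, 1],
})

# Stand-in for the scaler fitted during feature engineering
scaler = RobustScaler().fit(original[numeric_cols])

original_scaled = original.copy()
original_scaled[numeric_cols] = scaler.transform(original[numeric_cols])

# Synthetic rows are only transformed, never used to refit the scaler,
# so both sets share the same median and IQR
synthetic_scaled = synthetic.copy()
synthetic_scaled[numeric_cols] = scaler.transform(synthetic[numeric_cols])

# Combine scaled original + scaled synthetic for model training
combined = pd.concat([original_scaled, synthetic_scaled], ignore_index=True)
print(combined)
```

Refitting the scaler on the synthetic rows would shift the median and interquartile range, putting the two sets on different scales, which is why only `transform` is applied to them here.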


## Additional Resources

### 📚 Documentation
- **NDD Implementation Guide**: `../NDD_IMPLEMENTATION_GUIDE.md`
- **Quick Start Guide**: `../README_NDD.md`
- **Python Script Version**: `../ndd_stroke_augmentation.py`

### 🔧 Configuration Options

**Model Selection** (Cell 3):
```python
model_alias = "nemotron-super"  # Recommended
# Options: nemotron-nano-v2, nemotron-super, mistral-small, gpt-oss-120b
```

**LLM Validation** (Cell 18):
```python
use_llm_validation = True  # Set False to skip (faster, cheaper)
```

**Plausibility Threshold** (Cell 27):
```python
min_plausibility=7  # Only keep samples with score ≥7 (1-10 scale)
```
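
Applying the threshold amounts to a row filter on the LLM's score. A minimal sketch, assuming the scores sit in a column named `plausibility_score` (a hypothetical name) and that `filter_by_plausibility` is an illustrative helper rather than the notebook's own function:

```python
import pandas as pd


def filter_by_plausibility(df, min_plausibility=7, score_col='plausibility_score'):
    """Keep rows whose plausibility score is at least min_plausibility."""
    # Coerce to numbers so missing or unparseable scores become NaN
    scores = pd.to_numeric(df[score_col], errors='coerce')
    # NaN >= threshold evaluates to False, so unscored rows are dropped
    return df[scores >= min_plausibility].reset_index(drop=True)


# Toy generated samples (illustrative values only)
toy = pd.DataFrame({
    'age': [67, 12, 74, 58],
    'plausibility_score': [9, 3, 7, None],
})

kept = filter_by_plausibility(toy, min_plausibility=7)
print(f"Kept {len(kept)} of {len(toy)} samples")
```

Dropping rows without a usable score is a conservative choice: a sample the LLM could not rate is treated the same as one it rated below the threshold.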

**Generation Amount** (Cell 24):
```python
num_to_generate = num_minority  # Conservative (249)
# num_to_generate = num_minority * 3  # Moderate (747)
# num_to_generate = num_majority - num_minority  # Aggressive (4,612)
```
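
To see what each option does to the class balance, the arithmetic can be worked through directly. The sketch assumes `num_majority = 4861`, which is implied by the aggressive figure above (4,612 = `num_majority` - 249), and that every generated sample survives plausibility filtering, so the stroke shares shown are upper bounds.

```python
num_minority = 249
num_majority = 4861  # assumption: implied by 4,612 = num_majority - num_minority

options = {
    'Conservative': num_minority,
    'Moderate': num_minority * 3,
    'Aggressive': num_majority - num_minority,
}

summary = {}
for name, num_to_generate in options.items():
    # Stroke cases after augmentation, if no synthetic sample is filtered out
    total_minority = num_minority + num_to_generate
    share = total_minority / (total_minority + num_majority)
    summary[name] = (num_to_generate, total_minority, round(share, 3))
    print(f"{name:<12} generate={num_to_generate:>5,}  "
          f"stroke cases={total_minority:>5,}  stroke share={share:.1%}")
```

Only the aggressive option fully balances the classes. The conservative option doubles the stroke cases but leaves them below 10% of the combined data.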

### ⚠️ Important Notes

1. **Generation is commented out by default** - Uncomment Cell 25 when ready
2. **Cost**: ~$0.50-2.00 for conservative generation with nemotron-super
3. **Time**: ~10-20 minutes for 249 samples
4. **Quality over quantity**: Use plausibility filtering for better results
5. **Realistic expectations**: NDD may not outperform SMOTE for numerical data

### 💭 Honest Assessment

**When NDD is worth it:**
- Text-based features (clinical notes, patient narratives)
- Need medical coherence validation
- Learning LLM-based synthetic data generation
- Complex categorical relationships

**When SMOTE is better:**
- Mostly numerical tabular data (like this stroke dataset)
- Need speed and efficiency
- Limited budget/API credits
- Production pipelines

### 📝 Citation

If you use NeMo Data Designer in your research:
```
NVIDIA NeMo Microservices - Data Designer
https://build.nvidia.com/
```

---

**Ready to proceed?** Start by running cells 1-20 to generate and verify your preview, then uncomment the remaining cells as needed.
