## Data Preprocessing for Training

The only source data used here is the Synthea sample set `synthea_sample_data_csv_apr2020`, taken from 
https://github.com/yuan-code/Healthcare_Data_in_SQL_and_Visualization_in_Tableau/tree/main/synthea_sample_data_csv_apr2020 

Because this data is not enough on its own, we generate synthetic data from it, and from that synthetic data we build the training, validation, and test sets.



## Available Data:
- **Patients**: Demographics
- **Observations**: Vital signs (BP, heart rate), BMI, lab results (glucose, HbA1c, lipids)
- **Conditions**: Medical diagnoses
- **Medications**: Current treatments

## Required Processing Steps:

### 1. **Clinical Guidelines Implementation**
- Implement ACC/AHA BP categories
- ADA diabetes/HbA1c categories  
- NCEP ATP III lipid categories
- WHO BMI categories

### 2. **Synthetic Training Data Generation**
- Create patient profiles with clinical metrics
- Generate realistic medical questions
- Craft appropriate medical responses
- Ensure proper risk assessment language

### 3. **Training Format**
- Structure data for QLoRA fine-tuning
- Create instruction-following format
- Split into train/validation/test sets
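
As a rough illustration of the instruction-following format, the sketch below renders one record into a single training string. The field names `instruction`, `input`, and `output` are the ones this notebook's generator emits; the Alpaca-style `PROMPT_TEMPLATE`, the `render_example` helper, and the sample patient values are assumptions made for illustration only, and the template should be swapped for whatever the chosen base model expects.

```python
# Hypothetical Alpaca-style template (an assumption, not taken from this notebook)
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def render_example(record):
    """Render one instruction/input/output record into a single training string."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"],
        input=record["input"],
        output=record["output"],
    )


# Illustrative record; the patient values are made up
example = {
    "instruction": "You are a healthcare assistant. Based on the patient's profile "
                   "and question, provide appropriate medical guidance.",
    "input": "Patient Profile:\nAge: 54, Gender: F\n"
             "Blood Pressure: 134/82 mmHg (Stage 1)\n\n"
             "Patient Question: What do my blood pressure readings mean for my health?",
    "output": "Based on your health profile, here are my observations and recommendations: ...",
}

text = render_example(example)
print(text)
```

Keeping the template in one constant makes it easy to reuse the same rendering at inference time, so prompts seen in training and in deployment stay identical.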


In [27]:
# Analyze existing data to determine if it meets training requirements
import pandas as pd
import os

# Path to your CSV data
csv_path = "../../csv"


# Load and examine key datasets
try:
    # Load patients data
    patients = pd.read_csv(f"{csv_path}/patients.csv")
    print(f"Columns: {list(patients.columns)}")
    
    # Load observations data  
    observations = pd.read_csv(f"{csv_path}/observations.csv")
    print(f"Columns: {list(observations.columns)}")
    
    # Check what types of observations we have
    if 'DESCRIPTION' in observations.columns:
        obs_types = observations['DESCRIPTION'].value_counts().head(10)
        print(f"\nTop observation types:")
        for obs, count in obs_types.items():
            print(f"  • {obs}: {count}")
    
    # Load conditions
    conditions = pd.read_csv(f"{csv_path}/conditions.csv") 
    
    # Load medications
    medications = pd.read_csv(f"{csv_path}/medications.csv")
    
except Exception as e:
    print(f"Error loading data: {e}")

Columns: ['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE', 'COUNTY', 'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE']
Columns: ['DATE', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION', 'VALUE', 'UNITS', 'TYPE']

Top observation types:
  • Pain severity - 0-10 verbal numeric rating [Score] - Reported: 16820
  • Diastolic Blood Pressure: 12963
  • Systolic Blood Pressure: 12963
  • Body Height: 12552
  • Tobacco smoking status NHIS: 12552
  • Body Weight: 12552
  • Heart rate: 12552
  • Respiratory rate: 12552
  • Body Mass Index: 11451
  • QOLS: 10121


In [41]:
# Check if data has the required clinical metrics for training
def check_required_metrics():
    """
    Check if existing data contains the required clinical metrics:
    - Age, Sex, BMI (height + weight) - Basic info
    - Blood pressure (SBP/DBP) - To assess hypertension
    - Heart rate, SpO₂ - Vital signs
    - Fasting glucose, HbA1c - Diabetes indicators
    - Lipid panel (LDL, HDL, TG, Total Cholesterol) - Cardiovascular risk factors
    """
    
    required_metrics = {
        'Age/Demographics': ['age', 'birthdate', 'gender', 'sex'],
        'BMI Components': ['height', 'weight', 'body mass index', 'bmi'],
        'Blood Pressure': ['blood pressure', 'systolic', 'diastolic', 'sbp', 'dbp'],
        'Vital Signs': ['heart rate', 'pulse', 'oxygen saturation', 'spo2'],
        'Glucose Metrics': ['glucose', 'fasting glucose', 'blood sugar'],
        'HbA1c': ['hemoglobin a1c', 'hba1c', 'glycated hemoglobin'],
        'Lipid Panel': ['cholesterol', 'ldl', 'hdl', 'triglyceride']
    }
    
    # Check observations for clinical metrics (observations variable should exist)
    try:
        obs_descriptions = observations['DESCRIPTION'].str.lower().unique()
        
        found_metrics = {}
        missing_metrics = {}
        
        for category, keywords in required_metrics.items():
            found = []
            for keyword in keywords:
                matches = [desc for desc in obs_descriptions if keyword in desc]
                found.extend(matches)
            
            if found:
                found_metrics[category] = found[:3]  # Show first 3 matches
                print(f"{category}: Found {len(found)} related observations")
                for match in found[:3]:
                    print(f"   • {match}")
            else:
                missing_metrics[category] = keywords
                print(f"{category}: No matching observations found")
        
        return found_metrics, missing_metrics
        
    except NameError:
        print("Observations data not loaded")
        return {}, required_metrics
    except Exception as e:
        print(f"Error checking metrics: {e}")
        return {}, required_metrics

found, missing = check_required_metrics()

Age/Demographics: Found 10 related observations
   • body mass index (bmi) [percentile] per age and gender
   • weight-for-length per age and sex
   • stage group.clinical cancer
BMI Components: Found 7 related observations
   • body height
   • body weight
   • weight-for-length per age and sex
Blood Pressure: Found 4 related observations
   • diastolic blood pressure
   • systolic blood pressure
   • systolic blood pressure
Vital Signs: Found 2 related observations
   • heart rate
   • oxygen saturation in arterial blood
Glucose Metrics: Found 3 related observations
   • glucose
   • glucose [mass/volume] in urine by test strip
   • glucose [presence] in urine by test strip
HbA1c: Found 1 related observations
   • hemoglobin a1c/hemoglobin.total in blood
Lipid Panel: Found 4 related observations
   • total cholesterol
   • low density lipoprotein cholesterol
   • high density lipoprotein cholesterol
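
The output above shows two side effects of plain substring matching: the keyword `age` matches "stage group.clinical cancer", and "systolic blood pressure" is listed twice because it matches two keywords. A minimal sketch of a stricter matcher follows, assuming whole-word matching is what is wanted; `match_descriptions` is a hypothetical helper, not part of the cell above.

```python
import re


def match_descriptions(descriptions, keywords):
    """Return descriptions that contain any keyword as a whole word, without duplicates."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    )
    matched = []
    for desc in descriptions:
        if pattern.search(desc) and desc not in matched:
            matched.append(desc)
    return matched


descriptions = [
    "stage group.clinical cancer",
    "systolic blood pressure",
    "diastolic blood pressure",
    "body mass index (bmi) [percentile] per age and gender",
]

age_matches = match_descriptions(descriptions, ["age", "gender"])
bp_matches = match_descriptions(descriptions, ["blood pressure", "systolic"])
print(age_matches)
print(bp_matches)
```

Whole-word matching drops the "stage" false positive, but the percentile row still matches `age`. Age and sex are not observations at all: they come from the `BIRTHDATE` and `GENDER` columns of `patients.csv`, so that category is better checked against the patients table.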


In [42]:
# BUILD Q&A TRAINING DATA GENERATOR FROM CSV DATA
import pandas as pd
import numpy as np
import random
from datetime import datetime
import json

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("BUILDING Q&A TRAINING DATA GENERATOR")

# Use the already loaded data (from previous cells)
print("Using previously loaded CSV datasets...")
patients_df = patients  # Use the loaded patients dataframe
observations_df = observations  # Use the loaded observations dataframe
conditions_df = conditions  # Use the loaded conditions dataframe
medications_df = medications  # Use the loaded medications dataframe

    
# Define clinical guidelines for classifications
CLINICAL_GUIDELINES = {
    # All ranges are half-open: min <= value < max
    'bp_categories': {
        'Normal': {'sbp': (0, 120), 'dbp': (0, 80)},
        'Elevated': {'sbp': (120, 130), 'dbp': (0, 80)},
        'Stage 1': {'sbp': (130, 140), 'dbp': (80, 90)},
        'Stage 2': {'sbp': (140, 999), 'dbp': (90, 999)}
    },
    'bmi_categories': {
        'Underweight': (0, 18.5),
        'Normal': (18.5, 25.0),
        'Overweight': (25.0, 30.0),
        'Obese': (30.0, 999)
    },
    'hba1c_categories': {
        'Normal': (0, 5.7),
        'Prediabetes': (5.7, 6.5),
        'Diabetes': (6.5, 999)
    },
    'ldl_categories': {
        'Optimal': (0, 100),
        'Near Optimal': (100, 130),
        'Borderline High': (130, 160),
        'High': (160, 190),
        'Very High': (190, 999)
    },
    'heart_rate_categories': {
        'Bradycardia': (0, 60),
        'Normal': (60, 100),
        'Tachycardia': (100, 999)
    },
    'spo2_categories': {
        'Critical': (0, 90),
        'Low': (90, 95),
        'Normal': (95, 100),
        'Excellent': (100, 101)  # Handle perfect 100% readings
    }
}

## For future cases

# Vitals
# 'HeartRate': [211, 220045],
# 'SysBP': [51, 442, 455, 6701, 220179, 220050],
# 'DiasBP': [8368, 8440, 8441, 8555, 220180, 220051],
# 'RespRate': [618, 615, 220210, 224690],
# 'TempC': [223762, 676],
# 'SpO2': [646, 220277],
# 'GCS': [198, 220739, 223900, 223901], # Total GCS

# # Labs
# 'Lactate': [50813, 52442],
# 'Creatinine': [50912, 52546],
# 'WBC': [51300, 51301],
# 'BUN': [51006],
# 'pH': [50820, 50831]

print("Clinical guidelines defined")
print("Ready to generate Q&A training data")

# Check what observations we have available for training
print(f"\nAVAILABLE CLINICAL METRICS:")
top_observations = observations_df['DESCRIPTION'].value_counts().head(15)
for obs_type, count in top_observations.items():
    print(f"  • {obs_type}: {count} measurements")

BUILDING Q&A TRAINING DATA GENERATOR
Using previously loaded CSV datasets...
Clinical guidelines defined
Ready to generate Q&A training data

AVAILABLE CLINICAL METRICS:
  • Pain severity - 0-10 verbal numeric rating [Score] - Reported: 16820 measurements
  • Diastolic Blood Pressure: 12963 measurements
  • Systolic Blood Pressure: 12963 measurements
  • Body Height: 12552 measurements
  • Tobacco smoking status NHIS: 12552 measurements
  • Body Weight: 12552 measurements
  • Heart rate: 12552 measurements
  • Respiratory rate: 12552 measurements
  • Body Mass Index: 11451 measurements
  • QOLS: 10121 measurements
  • DALY: 10121 measurements
  • QALY: 10121 measurements
  • Chloride: 6515 measurements
  • Sodium: 6515 measurements
  • Calcium: 6515 measurements
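
The classifier later in this notebook matches each lookup table with a half-open test (`min_val <= value < max_val`), so neighbouring ranges need to share an endpoint; otherwise a value such as 129.5 falls between two categories. The sketch below is one way to check a table for such gaps. `find_range_gaps` and both example tables are illustrative assumptions, not part of the notebook.

```python
def find_range_gaps(categories):
    """Return (upper, next_lower) pairs where consecutive ranges do not meet."""
    bounds = sorted(categories.values())
    return [
        (prev_hi, next_lo)
        for (_, prev_hi), (next_lo, _) in zip(bounds, bounds[1:])
        if prev_hi != next_lo
    ]


# Illustrative table written with inclusive-style endpoints (leaves gaps)
inclusive_style = {
    'Optimal': (0, 100),
    'Near Optimal': (100, 129),
    'Borderline High': (130, 159),
    'High': (160, 189),
    'Very High': (190, 999),
}

# The same categories written as contiguous half-open ranges
half_open = {
    'Optimal': (0, 100),
    'Near Optimal': (100, 130),
    'Borderline High': (130, 160),
    'High': (160, 190),
    'Very High': (190, 999),
}

gaps = find_range_gaps(inclusive_style)
no_gaps = find_range_gaps(half_open)
print(gaps)
print(no_gaps)
```

Running a check like this once over every flat table in `CLINICAL_GUIDELINES` is cheap and guards against silent misclassification at category boundaries.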


In [44]:
# PATIENT PROFILE BUILDER - Extract comprehensive patient data

# This builder was adapted from various online resources to fit our needs

class PatientProfileBuilder:
    def __init__(self, patients_df, observations_df, conditions_df, medications_df):
        self.patients_df = patients_df
        self.observations_df = observations_df
        self.conditions_df = conditions_df
        self.medications_df = medications_df
        
    def extract_patient_observations(self, patient_id):
        """Extract a patient's observations, keeping the most recent value of each type"""
        patient_obs = self.observations_df[self.observations_df['PATIENT'] == patient_id]
        
        # Sort chronologically so that later readings overwrite earlier ones
        if 'DATE' in patient_obs.columns:
            patient_obs = patient_obs.sort_values('DATE')
        
        obs_dict = {}
        for _, row in patient_obs.iterrows():
            desc = str(row['DESCRIPTION']).lower()
            value = row.get('VALUE', None)
            units = row.get('UNITS', '')
            
            # Skip growth-chart rows (e.g. "weight-for-length per age and sex") that
            # would otherwise be mistaken for raw weight or BMI measurements
            if 'percentile' in desc or 'weight-for-length' in desc:
                continue
            
            # Categorize observations
            if 'height' in desc:
                obs_dict['height'] = {'value': value, 'units': units}
            elif 'weight' in desc:
                obs_dict['weight'] = {'value': value, 'units': units}
            elif 'body mass index' in desc or 'bmi' in desc:
                obs_dict['bmi'] = {'value': value, 'units': units}
            elif 'systolic' in desc:
                obs_dict['sbp'] = {'value': value, 'units': units}
            elif 'diastolic' in desc:
                obs_dict['dbp'] = {'value': value, 'units': units}
            elif 'heart rate' in desc:
                obs_dict['heart_rate'] = {'value': value, 'units': units}
            elif 'glucose' in desc and 'fasting' in desc:
                obs_dict['fasting_glucose'] = {'value': value, 'units': units}
            elif 'hemoglobin a1c' in desc or 'hba1c' in desc:
                obs_dict['hba1c'] = {'value': value, 'units': units}
            # The data spells lipoproteins out ("low density lipoprotein cholesterol"),
            # so match the full name as well as the abbreviation
            elif 'low density lipoprotein' in desc or ('cholesterol' in desc and 'ldl' in desc):
                obs_dict['ldl'] = {'value': value, 'units': units}
            elif 'high density lipoprotein' in desc or ('cholesterol' in desc and 'hdl' in desc):
                obs_dict['hdl'] = {'value': value, 'units': units}
            elif 'cholesterol' in desc and ('total' in desc or 'serum' in desc):
                obs_dict['total_cholesterol'] = {'value': value, 'units': units}
            elif 'triglyceride' in desc:
                obs_dict['triglycerides'] = {'value': value, 'units': units}
            elif 'oxygen saturation' in desc or 'spo2' in desc:
                obs_dict['spo2'] = {'value': value, 'units': units}
                
        return obs_dict
    
    def get_patient_conditions(self, patient_id):
        """Get all conditions for a patient"""
        patient_conditions = self.conditions_df[self.conditions_df['PATIENT'] == patient_id]
        return patient_conditions['DESCRIPTION'].tolist()
    
    def get_patient_medications(self, patient_id):
        """Get all medications for a patient"""
        patient_meds = self.medications_df[self.medications_df['PATIENT'] == patient_id]
        if 'DESCRIPTION' in patient_meds.columns:
            return patient_meds['DESCRIPTION'].tolist()
        elif 'CODE' in patient_meds.columns:
            return patient_meds['CODE'].tolist()
        else:
            return []
    
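    def calculate_age(self, birthdate_str):
        """Calculate age in whole years from birthdate"""
        birth = pd.to_datetime(birthdate_str, errors='coerce')
        if pd.isna(birth):
            return None
        today = datetime.now()
        # Subtract one year if this year's birthday has not happened yet
        return today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))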
    
    def build_complete_profile(self, patient_id):
        """Build a complete patient profile with all available data"""
        
        # Get basic patient info
        patient_info = self.patients_df[self.patients_df['Id'] == patient_id].iloc[0]
        
        profile = {
            'patient_id': patient_id,
            'age': self.calculate_age(patient_info.get('BIRTHDATE', '')),
            'gender': patient_info.get('GENDER', 'Unknown'),
            'race': patient_info.get('RACE', 'Unknown'),
            'ethnicity': patient_info.get('ETHNICITY', 'Unknown'),
            'observations': self.extract_patient_observations(patient_id),
            'conditions': self.get_patient_conditions(patient_id),
            'medications': self.get_patient_medications(patient_id)
        }
        
        # Calculate BMI if height/weight available
        if 'height' in profile['observations'] and 'weight' in profile['observations']:
            try:
                height_cm = float(profile['observations']['height']['value'])
                weight_kg = float(profile['observations']['weight']['value'])
                
                # Convert height to meters if needed
                height_m = height_cm / 100 if height_cm > 3 else height_cm
                bmi = weight_kg / (height_m ** 2)
                profile['observations']['calculated_bmi'] = {'value': round(bmi, 1), 'units': 'kg/m2'}
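            except (TypeError, ValueError, ZeroDivisionError):
                pass  # Leave BMI uncalculated when height or weight is missing or non-numeric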
        
        return profile

# Initialize the patient profile builder
profile_builder = PatientProfileBuilder(patients_df, observations_df, conditions_df, medications_df)
print("Patient Profile Builder initialized")

Patient Profile Builder initialized
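
The builder above filters the observations table and iterates row by row for every patient, which gets slow on a table with hundreds of thousands of rows. As an alternative, the sketch below picks the latest value per patient and observation type in one pass with pandas. It assumes the `DATE`, `PATIENT`, `DESCRIPTION`, and `VALUE` columns listed earlier and that `DATE` sorts chronologically as a string (ISO 8601); `latest_observations` and the demo frame are hypothetical.

```python
import pandas as pd


def latest_observations(observations_df):
    """Return a PATIENT x DESCRIPTION table holding the most recent non-null VALUE."""
    latest = (
        observations_df
        .sort_values("DATE")
        .groupby(["PATIENT", "DESCRIPTION"], as_index=False)
        .last()  # last non-null entry per column within each group
    )
    return latest.pivot(index="PATIENT", columns="DESCRIPTION", values="VALUE")


# Tiny illustrative frame with made-up values
demo = pd.DataFrame({
    "DATE": ["2019-01-01", "2020-01-01", "2019-06-01"],
    "PATIENT": ["p1", "p1", "p1"],
    "DESCRIPTION": ["Body Weight", "Body Weight", "Body Height"],
    "VALUE": [80.0, 82.5, 175.0],
})

wide = latest_observations(demo)
print(wide)
```

The resulting wide table can be looked up by patient ID, which replaces one filtered scan per patient with a single grouped pass over the data.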


In [46]:
# CLINICAL CLASSIFIER - Apply medical guidelines to patient data
# This classifier was adapted from various online resources to fit our needs
class ClinicalClassifier:
    def __init__(self, guidelines):
        self.guidelines = guidelines
    
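    def classify_blood_pressure(self, sbp, dbp):
        """Classify blood pressure according to ACC/AHA guidelines"""
        if sbp is None or dbp is None:
            return "Unknown"
        
        try:
            sbp, dbp = float(sbp), float(dbp)
        except (TypeError, ValueError):
            return "Unknown"
        if pd.isna(sbp) or pd.isna(dbp):
            return "Unknown"
        
        # ACC/AHA assigns the higher category when systolic and diastolic disagree,
        # so test from most to least severe using each category's lower bound
        bp = self.guidelines['bp_categories']
        if sbp >= bp['Stage 2']['sbp'][0] or dbp >= bp['Stage 2']['dbp'][0]:
            return "Stage 2"
        if sbp >= bp['Stage 1']['sbp'][0] or dbp >= bp['Stage 1']['dbp'][0]:
            return "Stage 1"
        if sbp >= bp['Elevated']['sbp'][0]:
            return "Elevated"
        return "Normal"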
    
    def _classify_range(self, value, guideline_key, overflow_category):
        """Return the category whose half-open range [min, max) contains value"""
        try:
            value = float(value)
        except (TypeError, ValueError):
            return "Unknown"
        # float() accepts NaN, which would otherwise fall through to the overflow category
        if pd.isna(value):
            return "Unknown"
        
        for category, (min_val, max_val) in self.guidelines[guideline_key].items():
            if min_val <= value < max_val:
                return category
        return overflow_category  # Value lies above the highest defined range
    
    def classify_bmi(self, bmi):
        """Classify BMI according to WHO guidelines"""
        return self._classify_range(bmi, 'bmi_categories', "Obese")
    
    def classify_hba1c(self, hba1c):
        """Classify HbA1c according to ADA guidelines"""
        return self._classify_range(hba1c, 'hba1c_categories', "Diabetes")
    
    def classify_ldl(self, ldl):
        """Classify LDL cholesterol according to NCEP ATP III guidelines"""
        return self._classify_range(ldl, 'ldl_categories', "Very High")
    
    def classify_heart_rate(self, heart_rate):
        """Classify heart rate according to clinical guidelines"""
        return self._classify_range(heart_rate, 'heart_rate_categories', "Tachycardia")
    
    def classify_spo2(self, spo2):
        """Classify oxygen saturation according to clinical guidelines"""
        return self._classify_range(spo2, 'spo2_categories', "Excellent")
    
    def get_risk_assessment(self, profile):
        """Generate comprehensive risk assessment for a patient"""
        obs = profile['observations']
        
        # Extract values
        sbp = obs.get('sbp', {}).get('value')
        dbp = obs.get('dbp', {}).get('value')
        bmi = obs.get('bmi', {}).get('value') or obs.get('calculated_bmi', {}).get('value')
        hba1c = obs.get('hba1c', {}).get('value')
        ldl = obs.get('ldl', {}).get('value')
        heart_rate = obs.get('heart_rate', {}).get('value')
        spo2 = obs.get('spo2', {}).get('value')
        
        # Classify each metric
        classifications = {
            'bp_category': self.classify_blood_pressure(sbp, dbp),
            'bmi_category': self.classify_bmi(bmi),
            'hba1c_category': self.classify_hba1c(hba1c),
            'ldl_category': self.classify_ldl(ldl),
            'heart_rate_category': self.classify_heart_rate(heart_rate),
            'spo2_category': self.classify_spo2(spo2)
        }
        
        return classifications

# Initialize classifier
classifier = ClinicalClassifier(CLINICAL_GUIDELINES)
print("Clinical Classifier initialized")

Clinical Classifier initialized


In [47]:
# Q&A GENERATOR - Create realistic medical conversations
class MedicalQAGenerator:
    def __init__(self):
        self.question_templates = {
            'risk_assessment': [
                "What are my health risks based on my recent test results?",
                "Can you explain my cardiovascular risk factors?",
                "Am I at risk for diabetes based on my lab values?",
                "What do my blood pressure readings mean for my health?",
                "Should I be concerned about my cholesterol levels?",
                "What does my BMI indicate about my health status?",
                "Can you assess my overall health risk profile?",
                "What are the implications of my HbA1c level?",
                "How do my current health metrics affect my long-term health?",
                "Is my heart rate normal? What does it indicate about my health?",
                "Should I be concerned about my oxygen levels?"
            ],
            'lifestyle_advice': [
                "What lifestyle changes should I make to improve my health?",
                "What diet modifications would you recommend for my condition?",
                "How much exercise should I be doing with my current health status?",
                "What can I do to lower my blood pressure naturally?",
                "How can I manage my prediabetes through lifestyle changes?",
                "What foods should I avoid with my current cholesterol levels?",
                "Can lifestyle changes help me avoid medication?",
                "What daily habits would improve my health markers?",
                "How can I improve my heart rate and cardiovascular fitness?"
            ],
            'specific_concerns': [
                "I'm worried about my family history of heart disease. What should I know?",
                "My doctor mentioned my blood sugar is elevated. What does this mean?",
                "I've been experiencing some symptoms. Could they be related to my test results?",
                "Should I be taking medication based on these numbers?",
                "When should I schedule my next health screening?",
                "What warning signs should I watch for with my current health status?",
                "My heart rate seems unusual - is this something to worry about?"
            ]
        }
        
    def format_patient_data(self, profile, classifications):
        """Format patient data for inclusion in conversation"""
        obs = profile['observations']
        
        data_text = f"Patient Profile:\n"
        data_text += f"Age: {profile['age']}, Gender: {profile['gender']}\n"
        
        # Add clinical metrics if available
        if 'bmi' in obs or 'calculated_bmi' in obs:
            bmi_val = obs.get('bmi', {}).get('value') or obs.get('calculated_bmi', {}).get('value')
            data_text += f"BMI: {bmi_val} kg/m² ({classifications['bmi_category']})\n"
        
        if 'sbp' in obs and 'dbp' in obs:
            sbp = obs['sbp']['value']
            dbp = obs['dbp']['value']
            data_text += f"Blood Pressure: {sbp}/{dbp} mmHg ({classifications['bp_category']})\n"
        
        if 'heart_rate' in obs:
            hr = obs['heart_rate']['value']
            data_text += f"Heart Rate: {hr} bpm ({classifications['heart_rate_category']})\n"
        
        if 'spo2' in obs:
            spo2 = obs['spo2']['value']
            data_text += f"Oxygen Saturation: {spo2}% ({classifications['spo2_category']})\n"
        
        if 'hba1c' in obs:
            hba1c = obs['hba1c']['value']
            data_text += f"HbA1c: {hba1c}% ({classifications['hba1c_category']})\n"
        
        if 'ldl' in obs:
            ldl = obs['ldl']['value']
            data_text += f"LDL Cholesterol: {ldl} mg/dL ({classifications['ldl_category']})\n"
        
        if 'hdl' in obs:
            hdl = obs['hdl']['value']
            data_text += f"HDL Cholesterol: {hdl} mg/dL\n"
        
        if 'fasting_glucose' in obs:
            glucose = obs['fasting_glucose']['value']
            data_text += f"Fasting Glucose: {glucose} mg/dL\n"
        
        # Add conditions if present
        if profile['conditions']:
            conditions_text = ", ".join(profile['conditions'][:3])  # Limit to first 3
            data_text += f"Known Conditions: {conditions_text}\n"
        
        # Add medications if present
        if profile['medications']:
            meds_text = ", ".join(profile['medications'][:3])  # Limit to first 3
            data_text += f"Current Medications: {meds_text}\n"
        
        return data_text.strip()
    
    def generate_medical_response(self, profile, classifications, question_type):
        """Generate appropriate medical response based on patient data and question type"""
        
        response = "Based on your health profile, here are my observations and recommendations:\n\n"
        
        # Risk assessment based on classifications
        risks = []
        if classifications['bp_category'] in ['Stage 1', 'Stage 2']:
            risks.append("elevated blood pressure")
        if classifications['bmi_category'] in ['Overweight', 'Obese']:
            risks.append("excess weight")
        if classifications['hba1c_category'] in ['Prediabetes', 'Diabetes']:
            risks.append("elevated blood glucose")
        if classifications['ldl_category'] in ['High', 'Very High']:
            risks.append("high cholesterol")
        if classifications['heart_rate_category'] in ['Bradycardia', 'Tachycardia']:
            risks.append("abnormal heart rate")
        if classifications['spo2_category'] in ['Critical', 'Low']:
            risks.append("low oxygen saturation")
        
        if risks:
            response += f"**Current Risk Factors:** You have {', '.join(risks)}. "
            response += "These factors can increase your risk of cardiovascular disease and other complications.\n\n"
        
        # Lifestyle recommendations
        response += "**Lifestyle Recommendations:**\n"
        
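        # Exclude 'Unknown' so patients without a BP reading do not receive BP advice
        if classifications['bp_category'] in ['Elevated', 'Stage 1', 'Stage 2']: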
            response += "• Focus on reducing sodium intake and increasing physical activity to help manage blood pressure\n"
        
        if classifications['bmi_category'] in ['Overweight', 'Obese']:
            response += "• Consider a structured weight management program with balanced nutrition and regular exercise\n"
        
        if classifications['hba1c_category'] in ['Prediabetes', 'Diabetes']:
            response += "• Monitor carbohydrate intake and consider working with a nutritionist for diabetes management\n"
        
        if classifications['ldl_category'] in ['Borderline High', 'High', 'Very High']:
            response += "• Adopt a heart-healthy diet low in saturated fats and high in fiber\n"
        
        if classifications['heart_rate_category'] == 'Bradycardia':
            response += "• Monitor for symptoms and discuss with your healthcare provider as bradycardia may require evaluation\n"
        elif classifications['heart_rate_category'] == 'Tachycardia':
            response += "• Consider stress reduction techniques and limit caffeine intake; discuss with your healthcare provider\n"
        
        if classifications['spo2_category'] in ['Critical', 'Low']:
            response += "• Seek immediate medical attention for low oxygen levels - this requires urgent evaluation\n"
        
        response += "• Regular monitoring and follow-up with your healthcare provider is important\n\n"
        
        response += "**Important Note:** This guidance is for educational purposes. Please consult with your healthcare provider for personalized medical advice and treatment decisions."
        
        return response
    
    def generate_qa_pair(self, profile, classifications):
        """Generate a complete question-answer pair for training"""
        
        # Select random question type and template
        question_type = random.choice(list(self.question_templates.keys()))
        question_template = random.choice(self.question_templates[question_type])
        
        # Format patient data
        patient_data = self.format_patient_data(profile, classifications)
        
        # Create the complete question with patient context
        full_question = f"{patient_data}\n\nPatient Question: {question_template}"
        
        # Generate appropriate response
        response = self.generate_medical_response(profile, classifications, question_type)
        
        return {
            'instruction': "You are a healthcare assistant. Based on the patient's profile and question, provide appropriate medical guidance.",
            'input': full_question,
            'output': response,
            'patient_id': profile['patient_id'],
            'question_type': question_type,
            'classifications': classifications
        }

# Initialize Q&A generator
qa_generator = MedicalQAGenerator()
print(" Medical Q&A Generator initialized")

 Medical Q&A Generator initialized
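
Because `generate_qa_pair` draws its question at random and the response depends only on the classifications, generating several pairs for the same patient can produce exact duplicates. A minimal sketch for filtering them, assuming the `input` and `output` fields returned by `generate_qa_pair`; `deduplicate_examples` and the demo records are hypothetical.

```python
def deduplicate_examples(examples):
    """Drop examples whose (input, output) text was already seen, preserving order."""
    seen = set()
    unique = []
    for ex in examples:
        key = (ex["input"], ex["output"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


# Illustrative records; the second one repeats the first
demo_examples = [
    {"input": "Profile A\n\nPatient Question: Is my heart rate normal?", "output": "resp A"},
    {"input": "Profile A\n\nPatient Question: Is my heart rate normal?", "output": "resp A"},
    {"input": "Profile A\n\nPatient Question: What does my BMI indicate?", "output": "resp A"},
]

unique_examples = deduplicate_examples(demo_examples)
print(len(unique_examples))
```

Counting examples before and after this step also gives a quick measure of how much real variety the templates provide.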


In [50]:
# TRAINING DATA GENERATOR - Create up to roughly 10,000 training examples
# TODO: review whether this is enough for training
def generate_training_dataset(target_size=10000):
    """Generate Q&A training examples from patient profiles (splitting is done separately)"""
    
    # Get all unique patient IDs
    unique_patients = patients_df['Id'].unique()
    
    training_examples = []
    patients_processed = 0
    
    # Generate multiple Q&A pairs per patient to reach target size
    examples_per_patient = max(1, target_size // len(unique_patients))
    
    
    for i, patient_id in enumerate(unique_patients):
        try:
            # Build complete patient profile
            profile = profile_builder.build_complete_profile(patient_id)
            
            # Skip patients with insufficient data
            if not profile['observations'] or len(profile['observations']) < 2:
                continue
            
            # Get clinical classifications
            classifications = classifier.get_risk_assessment(profile)
            
            # Skip if no meaningful classifications
            if all(v == 'Unknown' for v in classifications.values()):
                continue
            
            # Generate multiple Q&A pairs for this patient
            for _ in range(examples_per_patient):
                try:
                    qa_pair = qa_generator.generate_qa_pair(profile, classifications)
                    training_examples.append(qa_pair)
                except Exception as e:
                    continue
            
            patients_processed += 1
            
            # Progress update
            if patients_processed % 100 == 0:
                print(f"  Processed {patients_processed} patients, Generated {len(training_examples)} examples")
            
            # Stop when we reach target size
            if len(training_examples) >= target_size:
                break
                
        except Exception as e:
            print(f" Skipping patient {patient_id}: {str(e)}")
            continue
    
    print(f"\nDATASET GENERATION COMPLETE!")
    print(f"Total examples generated: {len(training_examples)}")
    print(f"Patients processed: {patients_processed}")
    
    return training_examples

# Generate the dataset
print("Starting dataset generation...")
training_data = generate_training_dataset(target_size=10000)

Starting dataset generation...
  Processed 100 patients, Generated 800 examples
  Processed 200 patients, Generated 1600 examples
  Processed 300 patients, Generated 2400 examples
  Processed 400 patients, Generated 3200 examples
  Processed 500 patients, Generated 4000 examples
  Processed 600 patients, Generated 4800 examples
  Processed 700 patients, Generated 5600 examples
  Processed 800 patients, Generated 6400 examples
  Processed 900 patients, Generated 7200 examples
  Processed 1000 patients, Generated 8000 examples
  Processed 1000 pat
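
The generation loop above skips any Q&A pair that raises an exception without recording why. A minimal sketch of tallying those failures by exception type is shown below. The helper `generate_with_failure_counts` and the stand-in `flaky_generator` are hypothetical names introduced only for illustration; in the notebook the callable would wrap `qa_generator.generate_qa_pair(profile, classifications)`.

```python
from collections import Counter


def generate_with_failure_counts(generate, n_attempts):
    """Call `generate` n_attempts times; collect results and count failures by exception type."""
    examples = []
    failures = Counter()
    for _ in range(n_attempts):
        try:
            examples.append(generate())
        except Exception as exc:  # broad on purpose, mirroring the loop above
            failures[type(exc).__name__] += 1
    return examples, failures


# Hypothetical stand-in for the real generator: every third call fails
def flaky_generator():
    flaky_generator.calls += 1
    if flaky_generator.calls % 3 == 0:
        raise KeyError("glucose")
    return {"instruction": "...", "input": "...", "output": "..."}


flaky_generator.calls = 0

examples, failures = generate_with_failure_counts(flaky_generator, n_attempts=8)
print(len(examples), dict(failures))
```

Printing the counter at the end of a run makes it easier to tell whether a shortfall against `target_size` comes from missing patient data or from a bug in the generator.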

In [51]:
# CREATE TRAIN/VALIDATION/TEST SPLITS
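import random

def create_dataset_splits(training_data, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15):
    """Split data into train/validation/test sets"""
    
    # The test split takes the remainder, so the three ratios must sum to 1
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train_ratio, val_ratio and test_ratio must sum to 1")
    
    # Shuffle a copy so the caller's list is not reordered in place
    shuffled = list(training_data)
    random.shuffle(shuffled)
    
    total_size = len(shuffled)
    train_size = int(total_size * train_ratio)
    val_size = int(total_size * val_ratio)
    
    # Create splits; the test set receives any rounding remainder
    train_data = shuffled[:train_size]
    val_data = shuffled[train_size:train_size + val_size]
    test_data = shuffled[train_size + val_size:]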
    
    print(f"  Training: {len(train_data)} ({len(train_data)/total_size*100:.1f}%)")
    print(f"  Validation: {len(val_data)} ({len(val_data)/total_size*100:.1f}%)")
    print(f"  Test: {len(test_data)} ({len(test_data)/total_size*100:.1f}%)")
    print(f"  Total: {total_size}")
    
    return train_data, val_data, test_data

# Create the splits
train_split, val_split, test_split = create_dataset_splits(training_data)

# Save datasets
import json
import os
from datetime import datetime

def save_datasets(train_data, val_data, test_data, base_path="../data"):
    """Save the datasets to JSON files"""
    
    os.makedirs(base_path, exist_ok=True)
    
    # Save each split
    datasets = {
        'train': train_data,
        'validation': val_data, 
        'test': test_data
    }
    
    print(f"\n SAVING DATASETS to {base_path}/")
    
    for split_name, data in datasets.items():
        filename = f"{base_path}/{split_name}_medical_qa.json"
        
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)
        
    
    # Also create a combined dataset file
    combined_filename = f"{base_path}/combined_medical_qa.json"
    all_data = {
        'train': train_data,
        'validation': val_data,
        'test': test_data,
        'metadata': {
            'total_examples': len(train_data) + len(val_data) + len(test_data),
            'train_size': len(train_data),
            'val_size': len(val_data),
            'test_size': len(test_data),
            'generation_date': datetime.now().isoformat()
        }
    }
    
    with open(combined_filename, 'w') as f:
        json.dump(all_data, f, indent=2)
    
    
    return datasets

# Save the datasets
saved_datasets = save_datasets(train_split, val_split, test_split)

  Training: 6557 (70.0%)
  Validation: 1405 (15.0%)
  Test: 1406 (15.0%)
  Total: 9368

 SAVING DATASETS to ../data/
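
Because several Q&A pairs are generated per patient, shuffling individual examples can place the same patient's profile in both the training and the test set. A minimal sketch of a patient-level split follows, assuming each example carries the `patient_id` key. The function name `split_by_patient`, the `seed` parameter, and the toy data are assumptions introduced for illustration, not part of the notebook.

```python
import random
from collections import defaultdict


def split_by_patient(examples, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Split examples so that each patient_id lands in exactly one split."""
    by_patient = defaultdict(list)
    for ex in examples:
        by_patient[ex["patient_id"]].append(ex)

    patient_ids = sorted(by_patient)          # sort first so the seed fully determines the order
    random.Random(seed).shuffle(patient_ids)  # local RNG; the global random state is untouched

    n_train = int(len(patient_ids) * train_ratio)
    n_val = int(len(patient_ids) * val_ratio)
    groups = {
        "train": patient_ids[:n_train],
        "validation": patient_ids[n_train:n_train + n_val],
        "test": patient_ids[n_train + n_val:],  # remainder goes to the test set
    }
    return {name: [ex for pid in ids for ex in by_patient[pid]]
            for name, ids in groups.items()}


# Toy data: 20 hypothetical patients with 8 examples each
toy = [{"patient_id": f"p{i:02d}", "question_type": q}
       for i in range(20) for q in range(8)]
splits = split_by_patient(toy)
ids = {name: {ex["patient_id"] for ex in data} for name, data in splits.items()}
print({name: len(data) for name, data in splits.items()})
```

The ratios here apply to patients rather than to examples, so the example-level percentages can drift slightly from 70/15/15 when patients contribute different numbers of Q&A pairs.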


In [52]:
# PREVIEW TRAINING EXAMPLES
def preview_training_examples(train_data, num_examples=3):
    """Show sample training examples"""
    for example in train_data[:num_examples]:
        print(f"Instruction: {example['instruction']}")
        print(f"\n Input:\n{example['input']}")
        print(f"\n Output:\n{example['output']}")
        print(f"\n Metadata: Patient {example['patient_id']}, Type: {example['question_type']}")
        print(f"Classifications: {example['classifications']}")

# Preview examples
if train_split:
    preview_training_examples(train_split, num_examples=2)
else:
    print("No training data generated yet - run the generation cell above!")

Instruction: You are a healthcare assistant. Based on the patient's profile and question, provide appropriate medical guidance.

 Input:
Patient Profile:
Age: 70, Gender: M
BMI: 30.2 kg/m² (Obese)
Blood Pressure: 115.0/76.0 mmHg (Normal)
Heart Rate: 73.0 bpm (Normal)
Known Conditions: Hypertension, Chronic sinusitis (disorder), Body mass index 30+ - obesity (finding)
Current Medications: Hydrochlorothiazide 25 MG Oral Tablet, Hydrochlorothiazide 25 MG Oral Tablet, Hydrochlorothiazide 25 MG Oral Tablet

Patient Question: Am I at risk for diabetes based on my lab values?

 Output:
Based on your health profile, here are my observations and recommendations:

**Current Risk Factors:** You have excess weight. These factors can increase your risk of cardiovascular disease and other complications.

**Lifestyle Recommendations:**
• Consider a structured weight management program with balanced nutrition and regular exercise
• Regular monitoring and follow-up with your healthcare provider is impo