# Chapter 1: Clinical Informatics Foundations - Interactive Tutorial

This notebook provides hands-on examples for the concepts covered in Chapter 1 of the Healthcare AI Implementation Guide.

## Learning Objectives
- Implement clinical data processing pipelines
- Build FHIR-compliant data structures
- Create clinical decision support systems
- Develop healthcare data quality assessment tools

## Setup and Imports

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn scikit-learn fhir.resources
!pip install plotly dash jupyter-dash

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from datetime import datetime, timedelta
import json
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Healthcare AI Tutorial - Chapter 1: Clinical Informatics")
print("=" * 60)

## 1. Clinical Data Generation and Processing

Let's start by generating synthetic clinical data whose distributions loosely resemble an older inpatient population. Every value is simulated; no real patient data is used.

In [None]:
def generate_synthetic_clinical_data(n_patients=1000):
    """
    Generate synthetic clinical data for demonstration purposes.
    """
    np.random.seed(42)
    
    # Patient demographics
    patient_ids = [f"P{i:06d}" for i in range(1, n_patients + 1)]
    ages = np.random.normal(65, 15, n_patients).astype(int)
    ages = np.clip(ages, 18, 100)
    
    genders = np.random.choice(['M', 'F'], n_patients, p=[0.48, 0.52])
    races = np.random.choice(['White', 'Black', 'Hispanic', 'Asian', 'Other'], 
                           n_patients, p=[0.6, 0.2, 0.15, 0.04, 0.01])
    
    # Clinical measurements
    systolic_bp = np.random.normal(140, 20, n_patients)
    diastolic_bp = np.random.normal(90, 15, n_patients)
    heart_rate = np.random.normal(75, 12, n_patients)
    temperature = np.random.normal(98.6, 1.2, n_patients)
    
    # Laboratory values
    glucose = np.random.lognormal(np.log(120), 0.3, n_patients)
    creatinine = np.random.lognormal(np.log(1.0), 0.4, n_patients)
    hemoglobin = np.random.normal(13.5, 2.0, n_patients)
    
    # Comorbidities (correlated with age)
    diabetes_prob = 0.1 + 0.01 * (ages - 40)
    diabetes_prob = np.clip(diabetes_prob, 0, 0.8)
    diabetes = np.random.binomial(1, diabetes_prob, n_patients)
    
    hypertension_prob = 0.2 + 0.015 * (ages - 30)
    hypertension_prob = np.clip(hypertension_prob, 0, 0.9)
    hypertension = np.random.binomial(1, hypertension_prob, n_patients)
    
    # Outcomes (30-day readmission)
    risk_score = (0.02 * ages + 0.5 * diabetes + 0.3 * hypertension + 
                 0.01 * (systolic_bp - 120) + 0.5 * (creatinine > 1.5))
    readmission_prob = 1 / (1 + np.exp(-risk_score + 3))
    readmission = np.random.binomial(1, readmission_prob, n_patients)
    
    # Create DataFrame
    clinical_data = pd.DataFrame({
        'patient_id': patient_ids,
        'age': ages,
        'gender': genders,
        'race': races,
        'systolic_bp': systolic_bp,
        'diastolic_bp': diastolic_bp,
        'heart_rate': heart_rate,
        'temperature': temperature,
        'glucose': glucose,
        'creatinine': creatinine,
        'hemoglobin': hemoglobin,
        'diabetes': diabetes,
        'hypertension': hypertension,
        'readmission_30d': readmission
    })
    
    return clinical_data

# Generate the dataset
clinical_df = generate_synthetic_clinical_data(2000)

print(f"Generated clinical data for {len(clinical_df)} patients")
print(f"Readmission rate: {clinical_df['readmission_30d'].mean():.1%}")
print("\nFirst 5 rows:")
clinical_df.head()
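
To see how the outcome model inside `generate_synthetic_clinical_data` turns patient attributes into a readmission probability, the sketch below re-evaluates the same risk score and logistic transform for two hand-picked patients. The helper name `readmission_probability` and both example patients are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

def readmission_probability(age, diabetes, hypertension, systolic_bp, creatinine):
    """Hypothetical helper: same risk score and logistic link as the generator."""
    risk_score = (0.02 * age + 0.5 * diabetes + 0.3 * hypertension +
                  0.01 * (systolic_bp - 120) + 0.5 * (creatinine > 1.5))
    # Subtracting 3 shifts the curve so that a score of 3.0 maps to 50% risk
    return 1 / (1 + np.exp(-risk_score + 3))

# Older patient with both comorbidities, elevated BP and creatinine:
# score = 1.4 + 0.5 + 0.3 + 0.3 + 0.5 = 3.0, so the probability is about 0.50
p_high = readmission_probability(70, 1, 1, 150, 1.8)

# Younger patient with no risk factors: score = 0.8, probability about 0.10
p_low = readmission_probability(40, 0, 0, 120, 0.9)

print(f"Higher-risk example: {p_high:.3f}")
print(f"Lower-risk example:  {p_low:.3f}")
```

Working through one or two cases by hand like this is a quick way to confirm that the simulated readmission rate is in a plausible range before any model is trained on it.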

## 2. Data Quality Assessment

Data quality is critical for healthcare AI applications. Let's implement three kinds of checks: completeness (missing values), validity (clinically plausible ranges), and consistency (cross-field rules).

In [None]:
class ClinicalDataQualityAssessment:
    """
    Comprehensive data quality assessment for clinical data.
    """
    
    def __init__(self, data):
        self.data = data
        self.quality_report = {}
    
    def assess_completeness(self):
        """Assess data completeness."""
        missing_data = self.data.isnull().sum()
        missing_percent = (missing_data / len(self.data)) * 100
        
        self.quality_report['completeness'] = {
            'missing_counts': missing_data.to_dict(),
            'missing_percentages': missing_percent.to_dict()
        }
        
        return missing_percent
    
    def assess_validity(self):
        """Assess data validity using clinical ranges."""
        validity_issues = {}
        
        # Define clinical ranges
        clinical_ranges = {
            'age': (0, 120),
            'systolic_bp': (60, 250),
            'diastolic_bp': (30, 150),
            'heart_rate': (30, 200),
            'temperature': (95, 110),
            'glucose': (30, 800),
            'creatinine': (0.1, 15),
            'hemoglobin': (3, 20)
        }
        
        for column, (min_val, max_val) in clinical_ranges.items():
            if column in self.data.columns:
                out_of_range = ((self.data[column] < min_val) | 
                              (self.data[column] > max_val)).sum()
                validity_issues[column] = {
                    'out_of_range_count': int(out_of_range),  # plain int keeps the report JSON-serializable
                    'out_of_range_percent': (out_of_range / len(self.data)) * 100
                }
        
        self.quality_report['validity'] = validity_issues
        return validity_issues
    
    def assess_consistency(self):
        """Assess data consistency."""
        consistency_issues = {}
        
        # Check systolic vs diastolic BP
        bp_inconsistent = (self.data['systolic_bp'] <= self.data['diastolic_bp']).sum()
        consistency_issues['bp_inconsistency'] = {
            'count': int(bp_inconsistent),  # plain int keeps the report JSON-serializable
            'percent': (bp_inconsistent / len(self.data)) * 100
        }
        
        self.quality_report['consistency'] = consistency_issues
        return consistency_issues
    
    def generate_quality_report(self):
        """Generate comprehensive quality report."""
        print("Clinical Data Quality Assessment Report")
        print("=" * 50)
        
        # Completeness
        completeness = self.assess_completeness()
        print("\n1. Data Completeness:")
        for col, pct in completeness.items():
            if pct > 0:
                print(f"   {col}: {pct:.1f}% missing")
        if completeness.max() == 0:
            print("   ✓ No missing data found")
        
        # Validity
        validity = self.assess_validity()
        print("\n2. Data Validity:")
        for col, issues in validity.items():
            if issues['out_of_range_count'] > 0:
                print(f"   {col}: {issues['out_of_range_count']} values out of range ({issues['out_of_range_percent']:.1f}%)")
        
        # Consistency
        consistency = self.assess_consistency()
        print("\n3. Data Consistency:")
        for issue, data in consistency.items():
            if data['count'] > 0:
                print(f"   {issue}: {data['count']} inconsistencies ({data['percent']:.1f}%)")
        
        return self.quality_report

# Perform quality assessment
qa = ClinicalDataQualityAssessment(clinical_df)
quality_report = qa.generate_quality_report()
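
The synthetic dataset contains no missing values, so the completeness check has nothing to report on it. The minimal sketch below applies the same three kinds of checks to a tiny hand-made table that does contain gaps and implausible values. The `demo` DataFrame and its values are invented for illustration; the temperature limits reuse those from `assess_validity`.

```python
import numpy as np
import pandas as pd

# Invented example rows: one missing systolic BP, one missing temperature,
# one implausible temperature (112 F), one systolic value below its diastolic
demo = pd.DataFrame({
    'systolic_bp':  [120.0, 135.0, np.nan, 88.0],
    'diastolic_bp': [80.0, 140.0, 70.0, 60.0],
    'temperature':  [98.6, 99.1, 112.0, np.nan],
})

# Completeness: percentage of missing values per column
missing_percent = demo.isnull().mean() * 100

# Validity: values outside the 95-110 F range used in assess_validity.
# Comparisons against NaN are False, so missing values are not counted here.
temp_out_of_range = int(((demo['temperature'] < 95) |
                         (demo['temperature'] > 110)).sum())

# Consistency: systolic pressure should exceed diastolic pressure
bp_inconsistent = int((demo['systolic_bp'] <= demo['diastolic_bp']).sum())

print(missing_percent)
print(f"Temperature out of range: {temp_out_of_range}")
print(f"BP inconsistencies: {bp_inconsistent}")
```

Because NaN comparisons evaluate to False, a missing value is reported only by the completeness check and never double-counted as a validity or consistency problem.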

## 3. Clinical Decision Support System

Let's build a simple clinical decision support system for readmission risk prediction.

In [None]:
class ClinicalDecisionSupport:
    """
    Clinical Decision Support System for readmission risk prediction.
    """
    
    def __init__(self):
        self.model = None
        self.feature_columns = None
        self.risk_thresholds = {
            'low': 0.2,
            'moderate': 0.4,
            'high': 0.6
        }
    
    def prepare_features(self, data):
        """Prepare features for modeling."""
        features = data.copy()
        
        # Encode categorical variables
        features['gender_M'] = (features['gender'] == 'M').astype(int)
        features['race_Black'] = (features['race'] == 'Black').astype(int)
        features['race_Hispanic'] = (features['race'] == 'Hispanic').astype(int)
        
        # Create derived features
        features['pulse_pressure'] = features['systolic_bp'] - features['diastolic_bp']
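        # Placeholder only: random noise standing in for a measured BMI. It carries no signal
        # and is redrawn on every call, so repeated predictions for one patient can differ slightly.
        features['bmi_estimated'] = np.random.normal(28, 5, len(features))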
        features['age_diabetes_interaction'] = features['age'] * features['diabetes']
        
        # Select feature columns
        self.feature_columns = [
            'age', 'gender_M', 'race_Black', 'race_Hispanic',
            'systolic_bp', 'diastolic_bp', 'heart_rate', 'temperature',
            'glucose', 'creatinine', 'hemoglobin',
            'diabetes', 'hypertension',
            'pulse_pressure', 'bmi_estimated', 'age_diabetes_interaction'
        ]
        
        return features[self.feature_columns]
    
    def train_model(self, data):
        """Train the readmission risk prediction model."""
        # Prepare features
        X = self.prepare_features(data)
        y = data['readmission_30d']
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )
        
        # Train model
        self.model = RandomForestClassifier(
            n_estimators=100, random_state=42, max_depth=10
        )
        self.model.fit(X_train, y_train)
        
        # Evaluate model
        y_pred = self.model.predict(X_test)
        y_pred_proba = self.model.predict_proba(X_test)[:, 1]
        
        print("Model Training Results:")
        print("=" * 30)
        print(classification_report(y_test, y_pred))
        
        # Feature importance
        feature_importance = pd.DataFrame({
            'feature': self.feature_columns,
            'importance': self.model.feature_importances_
        }).sort_values('importance', ascending=False)
        
        print("\nTop 10 Most Important Features:")
        print(feature_importance.head(10))
        
        return X_test, y_test, y_pred_proba
    
    def predict_risk(self, patient_data):
        """Predict readmission risk for a single patient."""
        if self.model is None:
            raise ValueError("Model must be trained first")
        
        # Prepare features
        features = self.prepare_features(patient_data)
        
        # Get prediction
        risk_probability = self.model.predict_proba(features)[:, 1][0]
        
        # Determine risk category
        if risk_probability < self.risk_thresholds['low']:
            risk_category = 'Low'
        elif risk_probability < self.risk_thresholds['moderate']:
            risk_category = 'Moderate'
        elif risk_probability < self.risk_thresholds['high']:
            risk_category = 'High'
        else:
            risk_category = 'Very High'
        
        return {
            'risk_probability': risk_probability,
            'risk_category': risk_category,
            'recommendations': self._generate_recommendations(risk_category, patient_data)
        }
    
    def _generate_recommendations(self, risk_category, patient_data):
        """Generate clinical recommendations based on risk category."""
        recommendations = []
        
        if risk_category in ['High', 'Very High']:
            recommendations.append("Consider discharge planning consultation")
            recommendations.append("Schedule follow-up within 7 days")
            recommendations.append("Medication reconciliation required")
            
            # Specific recommendations based on patient characteristics
            if patient_data['diabetes'].iloc[0] == 1:
                recommendations.append("Diabetes education and glucose monitoring")
            
            if patient_data['creatinine'].iloc[0] > 1.5:
                recommendations.append("Nephrology consultation if not already done")
            
            if patient_data['age'].iloc[0] > 75:
                recommendations.append("Geriatric assessment recommended")
        
        elif risk_category == 'Moderate':
            recommendations.append("Standard discharge planning")
            recommendations.append("Follow-up within 14 days")
        
        else:
            recommendations.append("Standard discharge process")
            recommendations.append("Routine follow-up as needed")
        
        return recommendations

# Train the clinical decision support system
cds = ClinicalDecisionSupport()
X_test, y_test, y_pred_proba = cds.train_model(clinical_df)
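
`predict_risk` maps one probability at a time onto the four risk categories. When a whole array of predictions needs categorizing, the same cutoffs can be applied in one step. The sketch below is one possible approach using `np.digitize`; the names `categorize_risk`, `RISK_CUTOFFS`, and `RISK_LABELS` are hypothetical and not part of the class above.

```python
import numpy as np

# Same cutoffs as ClinicalDecisionSupport.risk_thresholds
RISK_CUTOFFS = [0.2, 0.4, 0.6]
RISK_LABELS = np.array(['Low', 'Moderate', 'High', 'Very High'])

def categorize_risk(probabilities):
    """Hypothetical helper: map probabilities to risk category labels."""
    # np.digitize returns, for each value, how many cutoffs it meets or exceeds,
    # so a probability exactly at a cutoff falls into the higher category,
    # matching the strict '<' comparisons in predict_risk
    idx = np.digitize(probabilities, RISK_CUTOFFS)
    return RISK_LABELS[idx]

labels = categorize_risk([0.05, 0.2, 0.45, 0.6, 0.9])
print(labels.tolist())
```

Note that a probability of exactly 0.6 is labeled 'Very High', not 'High', which is worth keeping in mind when reporting counts per category.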

## 4. Interactive Risk Assessment

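Let's assess risk for a few individual patients sampled from the dataset. Because the sample is drawn from the full dataset, some of these patients were likely part of the training split, so their predicted probabilities may look better than they would on unseen patients.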

In [None]:
# Select a few example patients for risk assessment
example_patients = clinical_df.sample(5, random_state=42)

print("Individual Patient Risk Assessments")
print("=" * 50)

for idx, (_, patient) in enumerate(example_patients.iterrows()):
    patient_df = pd.DataFrame([patient])
    risk_assessment = cds.predict_risk(patient_df)
    
    print(f"\nPatient {idx + 1} (ID: {patient['patient_id']}):")
    print(f"  Age: {patient['age']}, Gender: {patient['gender']}, Race: {patient['race']}")
    print(f"  Diabetes: {'Yes' if patient['diabetes'] else 'No'}, Hypertension: {'Yes' if patient['hypertension'] else 'No'}")
    print(f"  Creatinine: {patient['creatinine']:.2f}, Glucose: {patient['glucose']:.1f}")
    print(f"  \n  Risk Assessment:")
    print(f"    Probability: {risk_assessment['risk_probability']:.1%}")
    print(f"    Category: {risk_assessment['risk_category']}")
    print(f"    Actual Readmission: {'Yes' if patient['readmission_30d'] else 'No'}")
    print(f"  \n  Recommendations:")
    for rec in risk_assessment['recommendations']:
        print(f"    • {rec}")
    print("-" * 40)

## 5. Data Visualization and Analytics

Let's create comprehensive visualizations to understand our clinical data and model performance.

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Age distribution by readmission status
axes[0, 0].hist(clinical_df[clinical_df['readmission_30d'] == 0]['age'], 
                alpha=0.7, label='No Readmission', bins=20)
axes[0, 0].hist(clinical_df[clinical_df['readmission_30d'] == 1]['age'], 
                alpha=0.7, label='Readmission', bins=20)
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution by Readmission Status')
axes[0, 0].legend()

# 2. Readmission rates by demographic groups
readmission_by_race = clinical_df.groupby('race')['readmission_30d'].mean()
readmission_by_race.plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_xlabel('Race')
axes[0, 1].set_ylabel('Readmission Rate')
axes[0, 1].set_title('Readmission Rates by Race')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Clinical parameters correlation
clinical_params = ['age', 'systolic_bp', 'glucose', 'creatinine', 'readmission_30d']
corr_matrix = clinical_df[clinical_params].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[0, 2])
axes[0, 2].set_title('Clinical Parameters Correlation')

# 4. Model performance - ROC curve
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

axes[1, 0].plot(fpr, tpr, color='darkorange', lw=2, 
                label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[1, 0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[1, 0].set_xlim([0.0, 1.0])
axes[1, 0].set_ylim([0.0, 1.05])
axes[1, 0].set_xlabel('False Positive Rate')
axes[1, 0].set_ylabel('True Positive Rate')
axes[1, 0].set_title('ROC Curve - Readmission Prediction')
axes[1, 0].legend(loc="lower right")

# 5. Risk distribution
axes[1, 1].hist(y_pred_proba, bins=20, alpha=0.7, edgecolor='black')
axes[1, 1].axvline(x=0.2, color='green', linestyle='--', label='Low/Moderate cutoff (0.2)')
axes[1, 1].axvline(x=0.4, color='orange', linestyle='--', label='Moderate/High cutoff (0.4)')
axes[1, 1].axvline(x=0.6, color='red', linestyle='--', label='High/Very High cutoff (0.6)')
axes[1, 1].set_xlabel('Predicted Risk Probability')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Risk Predictions')
axes[1, 1].legend()

# 6. Feature importance
feature_importance = pd.DataFrame({
    'feature': cds.feature_columns,
    'importance': cds.model.feature_importances_
}).sort_values('importance', ascending=True).tail(10)

axes[1, 2].barh(range(len(feature_importance)), feature_importance['importance'])
axes[1, 2].set_yticks(range(len(feature_importance)))
axes[1, 2].set_yticklabels(feature_importance['feature'])
axes[1, 2].set_xlabel('Feature Importance')
axes[1, 2].set_title('Top 10 Feature Importances')

plt.tight_layout()
plt.show()

# Print summary statistics
print("\nModel Performance Summary:")
print(f"AUC-ROC: {roc_auc:.3f}")
print(f"Total patients: {len(clinical_df)}")
print(f"Readmission rate: {clinical_df['readmission_30d'].mean():.1%}")
print(f"Very-high-risk patients (>=60% risk): {(y_pred_proba >= 0.6).sum()} ({(y_pred_proba >= 0.6).mean():.1%})")
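
The AUC reported above can be read as the probability that a randomly chosen readmitted patient receives a higher predicted risk than a randomly chosen non-readmitted patient. The small sketch below checks that interpretation on four invented scores by computing the AUC two ways: from the ROC curve and by directly counting correctly ranked pairs. The toy labels and scores are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy data: two non-readmitted (0) and two readmitted (1) patients
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Method 1: area under the ROC curve, as in the plotting cell
fpr, tpr, _ = roc_curve(y_true, scores)
auc_from_curve = auc(fpr, tpr)

# Method 2: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_from_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"AUC from ROC curve:      {auc_from_curve:.2f}")
print(f"AUC from pair counting:  {auc_from_pairs:.2f}")
```

Three of the four positive/negative pairs are ordered correctly here (only 0.35 versus 0.4 is not), so both methods give 0.75.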

## 6. FHIR Data Structure Example

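Let's see how the same clinical data can be represented as FHIR resources. The examples below build plain Python dictionaries that mirror the FHIR JSON format; they are not validated against the FHIR specification.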

In [None]:
def create_fhir_patient_resource(patient_row):
    """
    Create a FHIR Patient resource from clinical data.
    """
    patient_resource = {
        "resourceType": "Patient",
        "id": patient_row['patient_id'],
        "identifier": [
            {
                "use": "usual",
                "system": "http://hospital.example.com/patient-ids",
                "value": patient_row['patient_id']
            }
        ],
        "gender": "male" if patient_row['gender'] == 'M' else "female",
        "birthDate": str(datetime.now().year - patient_row['age']) + "-01-01",
        "extension": [
            {
                "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race",
                "extension": [
                    {
                        "url": "text",
                        "valueString": patient_row['race']
                    }
                ]
            }
        ]
    }
    
    return patient_resource

def create_fhir_observation_resource(patient_id, observation_type, value, unit):
    """
    Create a FHIR Observation resource.
    """
    observation_resource = {
        "resourceType": "Observation",
        "status": "final",
        "subject": {
            "reference": f"Patient/{patient_id}"
        },
        # FHIR dateTime values that include a time must carry a timezone offset
        "effectiveDateTime": datetime.now().astimezone().isoformat(),
        "code": {
            "coding": [
                {
                    "system": "http://loinc.org",
                    "code": observation_type['code'],
                    "display": observation_type['display']
                }
            ]
        },
        "valueQuantity": {
            "value": value,
            "unit": unit,
            "system": "http://unitsofmeasure.org"
        }
    }
    
    return observation_resource

# Create FHIR resources for the first patient
sample_patient = clinical_df.iloc[0]

# Patient resource
patient_fhir = create_fhir_patient_resource(sample_patient)

# Observation resources
observations = []

# Blood pressure
bp_systolic = create_fhir_observation_resource(
    sample_patient['patient_id'],
    {'code': '8480-6', 'display': 'Systolic blood pressure'},
    sample_patient['systolic_bp'],
    'mmHg'
)
observations.append(bp_systolic)

# Glucose
glucose_obs = create_fhir_observation_resource(
    sample_patient['patient_id'],
    {'code': '2345-7', 'display': 'Glucose [Mass/volume] in Serum or Plasma'},
    sample_patient['glucose'],
    'mg/dL'
)
observations.append(glucose_obs)

# Creatinine
creatinine_obs = create_fhir_observation_resource(
    sample_patient['patient_id'],
    {'code': '2160-0', 'display': 'Creatinine [Mass/volume] in Serum or Plasma'},
    sample_patient['creatinine'],
    'mg/dL'
)
observations.append(creatinine_obs)

print("FHIR Patient Resource Example:")
print("=" * 40)
print(json.dumps(patient_fhir, indent=2))

print("\n\nFHIR Observation Resource Example (Systolic BP):")
print("=" * 50)
print(json.dumps(bp_systolic, indent=2))
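
The example above sends systolic pressure as a standalone Observation. FHIR's vital-signs guidance generally represents a blood pressure reading as a single Observation with systolic and diastolic components. The sketch below shows what that could look like, again as a plain dictionary without validation. The function name `create_bp_panel_observation`, the patient ID, and the readings are hypothetical; the LOINC codes (85354-9, 8480-6, 8462-4) and the UCUM code `mm[Hg]` are the ones commonly used for this purpose.

```python
import json
from datetime import datetime, timezone

def create_bp_panel_observation(patient_id, systolic, diastolic):
    """Hypothetical helper: one Observation holding both BP components."""
    def component(code, display, value):
        return {
            "code": {"coding": [{"system": "http://loinc.org",
                                 "code": code, "display": display}]},
            "valueQuantity": {"value": float(value), "unit": "mmHg",
                              "system": "http://unitsofmeasure.org",
                              "code": "mm[Hg]"},
        }

    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "vital-signs"}]}],
        "code": {"coding": [{
            "system": "http://loinc.org", "code": "85354-9",
            "display": "Blood pressure panel with all children optional"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        # Timezone-aware timestamp in UTC
        "effectiveDateTime": datetime.now(timezone.utc).isoformat(),
        "component": [
            component("8480-6", "Systolic blood pressure", systolic),
            component("8462-4", "Diastolic blood pressure", diastolic),
        ],
    }

# Invented patient ID and readings, for illustration only
bp_panel = create_bp_panel_observation("P000001", 142.3, 88.1)
print(json.dumps(bp_panel, indent=2))
```

Casting each value with `float()` keeps the dictionary JSON-serializable even when the readings come from NumPy or pandas.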

## 7. Summary and Next Steps

This tutorial has demonstrated key concepts in clinical informatics including:

1. **Clinical Data Generation**: Creating realistic synthetic healthcare data
2. **Data Quality Assessment**: Implementing comprehensive quality checks
3. **Clinical Decision Support**: Building predictive models for clinical use
4. **Risk Assessment**: Individual patient risk evaluation
5. **Data Visualization**: Creating meaningful clinical analytics
6. **FHIR Standards**: Structuring data according to healthcare interoperability standards

### Key Takeaways:
- Healthcare data requires specialized quality assessment approaches
- Clinical decision support systems must provide actionable recommendations
- FHIR standards enable interoperability across healthcare systems
- Model interpretability is crucial for clinical acceptance

### Next Steps:
1. Explore Chapter 2 for mathematical foundations
2. Learn about advanced ML techniques in subsequent chapters
3. Implement real-world clinical validation frameworks
4. Study regulatory compliance requirements for healthcare AI

In [None]:
# Save the clinical dataset for use in other tutorials
clinical_df.to_csv('clinical_data_tutorial.csv', index=False)
print("Clinical dataset saved as 'clinical_data_tutorial.csv'")
print("\nTutorial completed successfully!")
print("Ready to proceed to Chapter 2: Mathematical Foundations")