# Clinical Exploratory Data Analysis

This notebook performs exploratory data analysis on synthetic clinical data for the Nexora healthcare ML platform. We analyze patient demographics, diagnoses, lab values, and outcomes to identify patterns that can inform our predictive modeling approach.

## Setup and Data Loading

In [None]:
# Import necessary libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import json
from collections import Counter
import warnings

# Add project root to path
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))

# Import project modules
from src.utils.fhir_connector import FHIRConnector
from src.data_pipeline.data_validation import DataValidator
from src.utils.healthcare_metrics import HealthcareMetrics

# Set plotting style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)
warnings.filterwarnings("ignore")

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

In [None]:
# Load sample clinical data
# In a real environment, this would connect to a FHIR server or database
# For this notebook, we'll create synthetic data


# Function to generate synthetic patient data
def generate_synthetic_patient_data(n_patients=1000, seed=42):
    np.random.seed(seed)

    # Generate patient IDs
    patient_ids = [f"P{i:06d}" for i in range(n_patients)]

    # Generate demographics
    ages = np.random.normal(65, 15, n_patients).astype(int)
    ages = np.clip(ages, 18, 100)  # Clip to reasonable age range

    genders = np.random.choice(["M", "F"], size=n_patients, p=[0.48, 0.52])

    races = np.random.choice(
        ["White", "Black", "Hispanic", "Asian", "Other"],
        size=n_patients,
        p=[0.65, 0.13, 0.12, 0.06, 0.04],
    )

    # Generate admission dates (within last 2 years)
    # Note: dates are relative to the run time, so they are not fixed by the seed
    today = datetime.now()
    admission_days_ago = np.random.randint(1, 730, n_patients)  # 1 to 729 days ago
    admission_dates = [today - timedelta(days=days) for days in admission_days_ago]

    # Generate length of stay (1-30 days, with most stays being shorter)
    length_of_stay = np.random.exponential(scale=5, size=n_patients).astype(int) + 1
    length_of_stay = np.clip(length_of_stay, 1, 30)

    # Calculate discharge dates
    discharge_dates = [
        admission_dates[i] + timedelta(days=length_of_stay[i])
        for i in range(n_patients)
    ]

    # Generate insurance types
    insurance_types = np.random.choice(
        ["Medicare", "Medicaid", "Private", "Self-Pay", "Other"],
        size=n_patients,
        p=[0.45, 0.15, 0.30, 0.05, 0.05],
    )

    # Generate primary diagnoses (ICD-10 codes)
    # Common conditions in elderly population
    primary_diagnoses = np.random.choice(
        [
            "I25.10",
            "I10",
            "E11.9",
            "J44.9",
            "I50.9",
            "M17.9",
            "G20",
            "F03",
            "N18.9",
            "C50.919",
        ],
        size=n_patients,
        p=[0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04, 0.02],
    )

    # Map diagnoses to descriptions
    diagnosis_map = {
        "I25.10": "Coronary artery disease",
        "I10": "Essential hypertension",
        "E11.9": "Type 2 diabetes mellitus",
        "J44.9": "COPD",
        "I50.9": "Heart failure",
        "M17.9": "Osteoarthritis of knee",
        "G20": "Parkinson's disease",
        "F03": "Dementia",
        "N18.9": "Chronic kidney disease",
        "C50.919": "Breast cancer",
    }

    diagnosis_descriptions = [diagnosis_map[code] for code in primary_diagnoses]

    # Generate comorbidity count (0-5)
    comorbidity_count = np.random.poisson(lam=2, size=n_patients)
    comorbidity_count = np.clip(comorbidity_count, 0, 5)

    # Generate medication count (0-7: comorbidity count plus 0-2 additional medications)
    medication_count = comorbidity_count + np.random.randint(0, 3, n_patients)
    medication_count = np.clip(medication_count, 0, 10)

    # Generate lab values (e.g., hemoglobin A1c for diabetes patients)
    hba1c_values = np.where(
        primary_diagnoses == "E11.9",  # Diabetes patients
        np.random.normal(8.0, 1.5, n_patients),  # Higher values for diabetics
        np.random.normal(5.5, 0.5, n_patients),  # Normal values for non-diabetics
    )
    hba1c_values = np.clip(hba1c_values, 4.0, 14.0)  # Clip to reasonable range

    # Generate systolic blood pressure
    systolic_bp = np.where(
        primary_diagnoses == "I10",  # Hypertension patients
        np.random.normal(150, 15, n_patients),  # Higher values for hypertensives
        np.random.normal(125, 10, n_patients),  # Normal values for non-hypertensives
    )
    systolic_bp = np.clip(systolic_bp, 90, 200)  # Clip to reasonable range

    # Generate diastolic blood pressure
    diastolic_bp = systolic_bp * 0.6 + np.random.normal(10, 5, n_patients)
    diastolic_bp = np.clip(diastolic_bp, 50, 120)  # Clip to reasonable range

    # Generate outcomes
    # 30-day readmission (higher for certain conditions and older patients)
    readmission_base_prob = 0.15
    readmission_prob = (
        readmission_base_prob
        + 0.01 * (ages > 75).astype(int)
        + 0.02 * (primary_diagnoses == "I50.9").astype(int)
        + 0.02 * (primary_diagnoses == "J44.9").astype(int)
        + 0.01 * (comorbidity_count >= 3).astype(int)
    )

    readmission_30d = np.random.binomial(1, readmission_prob, n_patients)

    # Mortality (higher for certain conditions, older patients, and longer stays)
    mortality_base_prob = 0.05
    mortality_prob = (
        mortality_base_prob
        + 0.02 * (ages > 80).astype(int)
        + 0.03 * (primary_diagnoses == "I50.9").astype(int)
        + 0.03 * (primary_diagnoses == "C50.919").astype(int)
        + 0.01 * (length_of_stay > 14).astype(int)
        + 0.02 * (comorbidity_count >= 4).astype(int)
    )

    mortality = np.random.binomial(1, mortality_prob, n_patients)

    # Create DataFrame
    df = pd.DataFrame(
        {
            "patient_id": patient_ids,
            "age": ages,
            "gender": genders,
            "race": races,
            "admission_date": admission_dates,
            "discharge_date": discharge_dates,
            "length_of_stay": length_of_stay,
            "insurance_type": insurance_types,
            "primary_diagnosis_code": primary_diagnoses,
            "primary_diagnosis": diagnosis_descriptions,
            "comorbidity_count": comorbidity_count,
            "medication_count": medication_count,
            "hba1c": hba1c_values,
            "systolic_bp": systolic_bp,
            "diastolic_bp": diastolic_bp,
            "readmission_30d": readmission_30d,
            "mortality": mortality,
        }
    )

    return df


# Generate synthetic data
clinical_data = generate_synthetic_patient_data(n_patients=1000)

# Display the first few rows
clinical_data.head()

## Data Overview and Summary Statistics

In [None]:
# Get basic information about the dataset
print(f"Dataset shape: {clinical_data.shape}")
print(f"Number of patients: {clinical_data['patient_id'].nunique()}")
print("\nData types:")
print(clinical_data.dtypes)

# Summary statistics for numerical variables
print("\nSummary statistics for numerical variables:")
numerical_cols = [
    "age",
    "length_of_stay",
    "comorbidity_count",
    "medication_count",
    "hba1c",
    "systolic_bp",
    "diastolic_bp",
]
print(clinical_data[numerical_cols].describe())

# Summary statistics for categorical variables
print("\nSummary statistics for categorical variables:")
categorical_cols = ["gender", "race", "insurance_type", "primary_diagnosis"]
for col in categorical_cols:
    print(f"\n{col}:")
    print(clinical_data[col].value_counts(normalize=True).round(3) * 100)

In [None]:
# Check for missing values
print("Missing values per column:")
print(clinical_data.isnull().sum())

# Check for duplicates
print(f"\nNumber of duplicate rows: {clinical_data.duplicated().sum()}")
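
Beyond missing values and duplicates, a few range and consistency checks can catch implausible records before analysis. The sketch below is one hedged way to do this with plain pandas. The function name `basic_clinical_checks` and the plausibility limits are illustrative assumptions; it does not use or represent the project's `DataValidator`.

```python
import pandas as pd


def basic_clinical_checks(df):
    """Return a dict mapping check name -> number of failing rows.

    Hypothetical helper: the column names match this notebook's synthetic
    data, and the plausibility limits are illustrative assumptions.
    """
    return {
        "age_out_of_range": int((~df["age"].between(18, 100)).sum()),
        "discharge_before_admission": int(
            (df["discharge_date"] < df["admission_date"]).sum()
        ),
        "diastolic_not_below_systolic": int(
            (df["diastolic_bp"] >= df["systolic_bp"]).sum()
        ),
        "duplicate_patient_id": int(df["patient_id"].duplicated().sum()),
    }


# Tiny hand-made example with one deliberately bad row per check
demo = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P2"],
        "age": [70, 17, 45],
        "admission_date": pd.to_datetime(["2024-01-10", "2024-02-01", "2024-03-05"]),
        "discharge_date": pd.to_datetime(["2024-01-08", "2024-02-04", "2024-03-09"]),
        "systolic_bp": [130.0, 120.0, 80.0],
        "diastolic_bp": [85.0, 75.0, 95.0],
    }
)
check_results = basic_clinical_checks(demo)
print(check_results)
```

Returning counts rather than raising keeps the check usable in an exploratory setting, where the goal is to see how many rows are affected before deciding how to handle them.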

## Patient Demographics Analysis

In [None]:
# Age distribution
plt.figure(figsize=(12, 6))
sns.histplot(clinical_data["age"], bins=20, kde=True)
plt.title("Age Distribution", fontsize=16)
plt.xlabel("Age", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.axvline(
    clinical_data["age"].mean(),
    color="red",
    linestyle="--",
    label=f'Mean: {clinical_data["age"].mean():.1f}',
)
plt.axvline(
    clinical_data["age"].median(),
    color="green",
    linestyle="--",
    label=f'Median: {clinical_data["age"].median():.1f}',
)
plt.legend()
plt.show()

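# Create age groups
# Use left-closed bins so that age 18 falls in "18-34" and age 100 in "80+"
# (the default right-closed bins would drop 18 and shift every edge value down a group)
age_bins = [18, 35, 50, 65, 80, 101]
age_labels = ["18-34", "35-49", "50-64", "65-79", "80+"]
clinical_data["age_group"] = pd.cut(
    clinical_data["age"], bins=age_bins, labels=age_labels, right=False
)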

# Age group distribution
plt.figure(figsize=(12, 6))
sns.countplot(x="age_group", data=clinical_data, palette="viridis")
plt.title("Age Group Distribution", fontsize=16)
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.xticks(rotation=0)
for p in plt.gca().patches:
    plt.gca().annotate(
        f"{p.get_height()}",
        (p.get_x() + p.get_width() / 2.0, p.get_height()),
        ha="center",
        va="center",
        xytext=(0, 10),
        textcoords="offset points",
    )
plt.show()
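
`pd.cut` treats bins as right-closed by default, so with edges starting at 18 a value of exactly 18 falls outside the first bin, and a value of 35 lands in the group labeled `18-34`. The sketch below shows the difference on a few made-up ages. Using `right=False` with a top edge of 101 is one option for making the bins match labels such as `18-34`.

```python
import pandas as pd

ages = pd.Series([18, 34, 35, 80, 100])
labels = ["18-34", "35-49", "50-64", "65-79", "80+"]

# Default: right-closed intervals (18, 35], (35, 50], ..., (80, 100]
default_groups = pd.cut(ages, bins=[18, 35, 50, 65, 80, 100], labels=labels)

# Left-closed intervals [18, 35), [35, 50), ..., [80, 101)
left_closed_groups = pd.cut(
    ages, bins=[18, 35, 50, 65, 80, 101], labels=labels, right=False
)

comparison = pd.DataFrame(
    {"age": ages, "default": default_groups, "left_closed": left_closed_groups}
)
print(comparison)
```

Passing `include_lowest=True` would rescue the value 18 under the default bins, but it would still place 35, 50, 65, and 80 in the lower group, so it does not match these labels.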

In [None]:
# Gender distribution
plt.figure(figsize=(10, 6))
gender_counts = clinical_data["gender"].value_counts()
plt.pie(
    gender_counts,
    labels=gender_counts.index,
    autopct="%1.1f%%",
    startangle=90,
    colors=["#ff9999", "#66b3ff"],
)
plt.title("Gender Distribution", fontsize=16)
plt.axis("equal")
plt.show()

# Race distribution
plt.figure(figsize=(12, 6))
race_counts = clinical_data["race"].value_counts()
sns.barplot(x=race_counts.index, y=race_counts.values, palette="viridis")
plt.title("Race Distribution", fontsize=16)
plt.xlabel("Race", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.xticks(rotation=0)
for i, v in enumerate(race_counts.values):
    plt.text(i, v + 5, str(v), ha="center")
plt.show()

# Insurance type distribution
plt.figure(figsize=(12, 6))
insurance_counts = clinical_data["insurance_type"].value_counts()
sns.barplot(x=insurance_counts.index, y=insurance_counts.values, palette="viridis")
plt.title("Insurance Type Distribution", fontsize=16)
plt.xlabel("Insurance Type", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.xticks(rotation=0)
for i, v in enumerate(insurance_counts.values):
    plt.text(i, v + 5, str(v), ha="center")
plt.show()

In [None]:
# Demographics by age group
# Gender distribution by age group
# DataFrame.plot opens a new figure unless it is given an Axes, so pass one explicitly
fig, ax = plt.subplots(figsize=(14, 6))
gender_age = pd.crosstab(
    clinical_data["age_group"], clinical_data["gender"], normalize="index"
)
gender_age.plot(kind="bar", stacked=True, colormap="viridis", ax=ax)
plt.title("Gender Distribution by Age Group", fontsize=16)
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Proportion", fontsize=14)
plt.xticks(rotation=0)
plt.legend(title="Gender")
plt.show()

# Insurance type by age group
# Pass the Axes explicitly so the plot is drawn on the figure created here
fig, ax = plt.subplots(figsize=(14, 8))
insurance_age = pd.crosstab(
    clinical_data["age_group"], clinical_data["insurance_type"], normalize="index"
)
insurance_age.plot(kind="bar", stacked=True, colormap="viridis", ax=ax)
plt.title("Insurance Type Distribution by Age Group", fontsize=16)
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Proportion", fontsize=14)
plt.xticks(rotation=0)
plt.legend(title="Insurance Type", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

## Clinical Characteristics Analysis

In [None]:
# Primary diagnosis distribution
plt.figure(figsize=(14, 8))
diagnosis_counts = clinical_data["primary_diagnosis"].value_counts()
sns.barplot(x=diagnosis_counts.values, y=diagnosis_counts.index, palette="viridis")
plt.title("Primary Diagnosis Distribution", fontsize=16)
plt.xlabel("Count", fontsize=14)
plt.ylabel("Diagnosis", fontsize=14)
for i, v in enumerate(diagnosis_counts.values):
    plt.text(v + 5, i, str(v), va="center")
plt.tight_layout()
plt.show()

# Length of stay distribution
plt.figure(figsize=(12, 6))
sns.histplot(clinical_data["length_of_stay"], bins=30, kde=True)
plt.title("Length of Stay Distribution", fontsize=16)
plt.xlabel("Length of Stay (days)", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.axvline(
    clinical_data["length_of_stay"].mean(),
    color="red",
    linestyle="--",
    label=f'Mean: {clinical_data["length_of_stay"].mean():.1f} days',
)
plt.axvline(
    clinical_data["length_of_stay"].median(),
    color="green",
    linestyle="--",
    label=f'Median: {clinical_data["length_of_stay"].median():.1f} days',
)
plt.legend()
plt.show()

In [None]:
# Comorbidity and medication count distributions
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Comorbidity count
sns.countplot(x="comorbidity_count", data=clinical_data, ax=axes[0], palette="viridis")
axes[0].set_title("Comorbidity Count Distribution", fontsize=14)
axes[0].set_xlabel("Number of Comorbidities", fontsize=12)
axes[0].set_ylabel("Count", fontsize=12)
for p in axes[0].patches:
    axes[0].annotate(
        f"{p.get_height()}",
        (p.get_x() + p.get_width() / 2.0, p.get_height()),
        ha="center",
        va="center",
        xytext=(0, 10),
        textcoords="offset points",
    )

# Medication count
sns.countplot(x="medication_count", data=clinical_data, ax=axes[1], palette="viridis")
axes[1].set_title("Medication Count Distribution", fontsize=14)
axes[1].set_xlabel("Number of Medications", fontsize=12)
axes[1].set_ylabel("Count", fontsize=12)
for p in axes[1].patches:
    axes[1].annotate(
        f"{p.get_height()}",
        (p.get_x() + p.get_width() / 2.0, p.get_height()),
        ha="center",
        va="center",
        xytext=(0, 10),
        textcoords="offset points",
    )

plt.tight_layout()
plt.show()

In [None]:
# Lab values distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# HbA1c
sns.histplot(clinical_data["hba1c"], bins=20, kde=True, ax=axes[0])
axes[0].set_title("HbA1c Distribution", fontsize=14)
axes[0].set_xlabel("HbA1c (%)", fontsize=12)
axes[0].set_ylabel("Count", fontsize=12)
axes[0].axvline(5.7, color="green", linestyle="--", label="Normal < 5.7%")
axes[0].axvline(6.5, color="red", linestyle="--", label="Diabetic ≥ 6.5%")
axes[0].legend()

# Systolic BP
sns.histplot(clinical_data["systolic_bp"], bins=20, kde=True, ax=axes[1])
axes[1].set_title("Systolic Blood Pressure Distribution", fontsize=14)
axes[1].set_xlabel("Systolic BP (mmHg)", fontsize=12)
axes[1].set_ylabel("Count", fontsize=12)
axes[1].axvline(120, color="green", linestyle="--", label="Normal < 120")
axes[1].axvline(140, color="red", linestyle="--", label="Hypertension ≥ 140")
axes[1].legend()

# Diastolic BP
sns.histplot(clinical_data["diastolic_bp"], bins=20, kde=True, ax=axes[2])
axes[2].set_title("Diastolic Blood Pressure Distribution", fontsize=14)
axes[2].set_xlabel("Diastolic BP (mmHg)", fontsize=12)
axes[2].set_ylabel("Count", fontsize=12)
axes[2].axvline(80, color="green", linestyle="--", label="Normal < 80")
axes[2].axvline(90, color="red", linestyle="--", label="Hypertension ≥ 90")
axes[2].legend()

plt.tight_layout()
plt.show()

In [None]:
# HbA1c by diagnosis
plt.figure(figsize=(14, 6))
sns.boxplot(x="primary_diagnosis", y="hba1c", data=clinical_data, palette="viridis")
plt.title("HbA1c by Primary Diagnosis", fontsize=16)
plt.xlabel("Primary Diagnosis", fontsize=14)
plt.ylabel("HbA1c (%)", fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.axhline(6.5, color="red", linestyle="--", label="Diabetic threshold (6.5%)")
plt.legend()
plt.tight_layout()
plt.show()

# Blood pressure by diagnosis
plt.figure(figsize=(14, 6))
sns.boxplot(
    x="primary_diagnosis", y="systolic_bp", data=clinical_data, palette="viridis"
)
plt.title("Systolic Blood Pressure by Primary Diagnosis", fontsize=16)
plt.xlabel("Primary Diagnosis", fontsize=14)
plt.ylabel("Systolic BP (mmHg)", fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.axhline(140, color="red", linestyle="--", label="Hypertension threshold (140 mmHg)")
plt.legend()
plt.tight_layout()
plt.show()

## Outcome Analysis

In [None]:
# Overall outcome rates
readmission_rate = clinical_data["readmission_30d"].mean() * 100
mortality_rate = clinical_data["mortality"].mean() * 100

print(f"30-day readmission rate: {readmission_rate:.1f}%")
print(f"Mortality rate: {mortality_rate:.1f}%")

# Visualize outcome rates
outcome_data = pd.DataFrame(
    {
        "Outcome": ["30-day Readmission", "Mortality"],
        "Rate (%)": [readmission_rate, mortality_rate],
    }
)

plt.figure(figsize=(10, 6))
sns.barplot(x="Outcome", y="Rate (%)", data=outcome_data, palette="viridis")
plt.title("Overall Outcome Rates", fontsize=16)
plt.xlabel("Outcome", fontsize=14)
plt.ylabel("Rate (%)", fontsize=14)
for i, v in enumerate(outcome_data["Rate (%)"]):
    plt.text(i, v + 0.5, f"{v:.1f}%", ha="center")
plt.ylim(0, max(outcome_data["Rate (%)"]) * 1.2)
plt.show()

In [None]:
# Outcomes by age group
age_outcomes = (
    clinical_data.groupby("age_group")[["readmission_30d", "mortality"]].mean() * 100
)
age_outcomes = age_outcomes.reset_index()
age_outcomes = pd.melt(
    age_outcomes,
    id_vars=["age_group"],
    value_vars=["readmission_30d", "mortality"],
    var_name="Outcome",
    value_name="Rate (%)",
)
age_outcomes["Outcome"] = age_outcomes["Outcome"].map(
    {"readmission_30d": "30-day Readmission", "mortality": "Mortality"}
)

plt.figure(figsize=(14, 6))
sns.barplot(
    x="age_group", y="Rate (%)", hue="Outcome", data=age_outcomes, palette="viridis"
)
plt.title("Outcomes by Age Group", fontsize=16)
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Rate (%)", fontsize=14)
plt.legend(title="Outcome")
plt.show()

In [None]:
# Outcomes by primary diagnosis
diagnosis_outcomes = (
    clinical_data.groupby("primary_diagnosis")[["readmission_30d", "mortality"]].mean()
    * 100
)
diagnosis_outcomes = diagnosis_outcomes.reset_index()

# Sort by mortality rate
diagnosis_outcomes = diagnosis_outcomes.sort_values("mortality", ascending=False)

# Plot readmission rates
plt.figure(figsize=(14, 6))
sns.barplot(
    x="primary_diagnosis",
    y="readmission_30d",
    data=diagnosis_outcomes,
    palette="viridis",
)
plt.title("30-day Readmission Rates by Primary Diagnosis", fontsize=16)
plt.xlabel("Primary Diagnosis", fontsize=14)
plt.ylabel("Readmission Rate (%)", fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.axhline(
    readmission_rate,
    color="red",
    linestyle="--",
    label=f"Overall rate: {readmission_rate:.1f}%",
)
for i, v in enumerate(diagnosis_outcomes["readmission_30d"]):
    plt.text(i, v + 0.5, f"{v:.1f}%", ha="center")
plt.legend()
plt.tight_layout()
plt.show()

# Plot mortality rates
plt.figure(figsize=(14, 6))
sns.barplot(
    x="primary_diagnosis", y="mortality", data=diagnosis_outcomes, palette="viridis"
)
plt.title("Mortality Rates by Primary Diagnosis", fontsize=16)
plt.xlabel("Primary Diagnosis", fontsize=14)
plt.ylabel("Mortality Rate (%)", fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.axhline(
    mortality_rate,
    color="red",
    linestyle="--",
    label=f"Overall rate: {mortality_rate:.1f}%",
)
for i, v in enumerate(diagnosis_outcomes["mortality"]):
    plt.text(i, v + 0.5, f"{v:.1f}%", ha="center")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Outcomes by length of stay
# Create length of stay bins
los_bins = [0, 3, 7, 14, 30]
los_labels = ["1-3 days", "4-7 days", "8-14 days", "15+ days"]
clinical_data["los_group"] = pd.cut(
    clinical_data["length_of_stay"], bins=los_bins, labels=los_labels
)

los_outcomes = (
    clinical_data.groupby("los_group")[["readmission_30d", "mortality"]].mean() * 100
)
los_outcomes = los_outcomes.reset_index()
los_outcomes = pd.melt(
    los_outcomes,
    id_vars=["los_group"],
    value_vars=["readmission_30d", "mortality"],
    var_name="Outcome",
    value_name="Rate (%)",
)
los_outcomes["Outcome"] = los_outcomes["Outcome"].map(
    {"readmission_30d": "30-day Readmission", "mortality": "Mortality"}
)

plt.figure(figsize=(14, 6))
sns.barplot(
    x="los_group", y="Rate (%)", hue="Outcome", data=los_outcomes, palette="viridis"
)
plt.title("Outcomes by Length of Stay", fontsize=16)
plt.xlabel("Length of Stay", fontsize=14)
plt.ylabel("Rate (%)", fontsize=14)
plt.legend(title="Outcome")
plt.show()

In [None]:
# Outcomes by comorbidity count
comorbidity_outcomes = (
    clinical_data.groupby("comorbidity_count")[["readmission_30d", "mortality"]].mean()
    * 100
)
comorbidity_outcomes = comorbidity_outcomes.reset_index()
comorbidity_outcomes = pd.melt(
    comorbidity_outcomes,
    id_vars=["comorbidity_count"],
    value_vars=["readmission_30d", "mortality"],
    var_name="Outcome",
    value_name="Rate (%)",
)
comorbidity_outcomes["Outcome"] = comorbidity_outcomes["Outcome"].map(
    {"readmission_30d": "30-day Readmission", "mortality": "Mortality"}
)

plt.figure(figsize=(14, 6))
sns.barplot(
    x="comorbidity_count",
    y="Rate (%)",
    hue="Outcome",
    data=comorbidity_outcomes,
    palette="viridis",
)
plt.title("Outcomes by Comorbidity Count", fontsize=16)
plt.xlabel("Number of Comorbidities", fontsize=14)
plt.ylabel("Rate (%)", fontsize=14)
plt.legend(title="Outcome")
plt.show()

## Correlation Analysis

In [None]:
# Correlation matrix for numerical variables
numerical_vars = [
    "age",
    "length_of_stay",
    "comorbidity_count",
    "medication_count",
    "hba1c",
    "systolic_bp",
    "diastolic_bp",
    "readmission_30d",
    "mortality",
]
corr_matrix = clinical_data[numerical_vars].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap="viridis", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of Numerical Variables", fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Scatter plots for key relationships
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Age vs Length of Stay
sns.scatterplot(
    x="age",
    y="length_of_stay",
    data=clinical_data,
    hue="mortality",
    palette={0: "blue", 1: "red"},
    alpha=0.7,
    ax=axes[0, 0],
)
axes[0, 0].set_title("Age vs Length of Stay", fontsize=14)
axes[0, 0].set_xlabel("Age", fontsize=12)
axes[0, 0].set_ylabel("Length of Stay (days)", fontsize=12)
axes[0, 0].legend(title="Mortality")

# Comorbidity Count vs Length of Stay
sns.scatterplot(
    x="comorbidity_count",
    y="length_of_stay",
    data=clinical_data,
    hue="readmission_30d",
    palette={0: "green", 1: "orange"},
    alpha=0.7,
    ax=axes[0, 1],
)
axes[0, 1].set_title("Comorbidity Count vs Length of Stay", fontsize=14)
axes[0, 1].set_xlabel("Comorbidity Count", fontsize=12)
axes[0, 1].set_ylabel("Length of Stay (days)", fontsize=12)
axes[0, 1].legend(title="30-day Readmission")

# Age vs HbA1c
sns.scatterplot(
    x="age",
    y="hba1c",
    data=clinical_data,
    hue="primary_diagnosis",
    palette="viridis",
    alpha=0.7,
    ax=axes[1, 0],
)
axes[1, 0].set_title("Age vs HbA1c", fontsize=14)
axes[1, 0].set_xlabel("Age", fontsize=12)
axes[1, 0].set_ylabel("HbA1c (%)", fontsize=12)
axes[1, 0].legend(title="Primary Diagnosis", bbox_to_anchor=(1.05, 1), loc="upper left")

# Systolic BP vs Diastolic BP
sns.scatterplot(
    x="systolic_bp",
    y="diastolic_bp",
    data=clinical_data,
    hue="primary_diagnosis",
    palette="viridis",
    alpha=0.7,
    ax=axes[1, 1],
)
axes[1, 1].set_title("Systolic BP vs Diastolic BP", fontsize=14)
axes[1, 1].set_xlabel("Systolic BP (mmHg)", fontsize=12)
axes[1, 1].set_ylabel("Diastolic BP (mmHg)", fontsize=12)
axes[1, 1].legend(title="Primary Diagnosis", bbox_to_anchor=(1.05, 1), loc="upper left")

plt.tight_layout()
plt.show()

## Risk Factor Analysis for Outcomes

In [None]:
# Analyze risk factors for readmission
# Create a function to calculate odds ratios
def calculate_odds_ratio(df, outcome_col, factor_col, factor_value=None):
    if factor_value is not None:
        # Binary factor with specific value
        exposed_outcome = df[df[factor_col] == factor_value][outcome_col].sum()
        exposed_no_outcome = (df[factor_col] == factor_value).sum() - exposed_outcome
        unexposed_outcome = df[df[factor_col] != factor_value][outcome_col].sum()
        unexposed_no_outcome = (
            df[factor_col] != factor_value
        ).sum() - unexposed_outcome
    else:
        # Assuming factor_col is already binary (0/1)
        exposed_outcome = df[df[factor_col] == 1][outcome_col].sum()
        exposed_no_outcome = (df[factor_col] == 1).sum() - exposed_outcome
        unexposed_outcome = df[df[factor_col] == 0][outcome_col].sum()
        unexposed_no_outcome = (df[factor_col] == 0).sum() - unexposed_outcome

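    # Calculate odds ratio
    # NumPy scalars return inf/nan on division by zero instead of raising
    # ZeroDivisionError, so guard against empty cells explicitly
    if 0 in (exposed_no_outcome, unexposed_outcome, unexposed_no_outcome):
        return np.nan
    odds_exposed = exposed_outcome / exposed_no_outcome
    odds_unexposed = unexposed_outcome / unexposed_no_outcome
    return odds_exposed / odds_unexposed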


# Create binary features for risk factor analysis
clinical_data["age_over_75"] = (clinical_data["age"] > 75).astype(int)
clinical_data["high_comorbidity"] = (clinical_data["comorbidity_count"] >= 3).astype(
    int
)
clinical_data["long_stay"] = (clinical_data["length_of_stay"] > 7).astype(int)
clinical_data["high_hba1c"] = (clinical_data["hba1c"] >= 6.5).astype(int)
clinical_data["hypertension"] = (clinical_data["systolic_bp"] >= 140).astype(int)
clinical_data["heart_condition"] = (
    clinical_data["primary_diagnosis"]
    .isin(["Coronary artery disease", "Heart failure"])
    .astype(int)
)

# Calculate odds ratios for readmission
risk_factors = [
    "age_over_75",
    "gender",
    "high_comorbidity",
    "long_stay",
    "high_hba1c",
    "hypertension",
    "heart_condition",
]
readmission_odds = {}

for factor in risk_factors:
    if factor == "gender":
        # Special handling for gender
        odds_ratio = calculate_odds_ratio(
            clinical_data, "readmission_30d", "gender", "M"
        )
        readmission_odds[f"{factor} (M vs F)"] = odds_ratio
    else:
        odds_ratio = calculate_odds_ratio(clinical_data, "readmission_30d", factor)
        readmission_odds[factor] = odds_ratio

# Create DataFrame for visualization
readmission_odds_df = pd.DataFrame(
    {
        "Risk Factor": list(readmission_odds.keys()),
        "Odds Ratio": list(readmission_odds.values()),
    }
).sort_values("Odds Ratio", ascending=False)

# Plot odds ratios for readmission
plt.figure(figsize=(12, 6))
sns.barplot(
    x="Odds Ratio", y="Risk Factor", data=readmission_odds_df, palette="viridis"
)
plt.title("Risk Factors for 30-day Readmission (Odds Ratios)", fontsize=16)
plt.xlabel("Odds Ratio", fontsize=14)
plt.ylabel("Risk Factor", fontsize=14)
plt.axvline(1.0, color="red", linestyle="--", label="No effect (OR=1)")
for i, v in enumerate(readmission_odds_df["Odds Ratio"]):
    plt.text(v + 0.05, i, f"{v:.2f}", va="center")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Calculate odds ratios for mortality
mortality_odds = {}

for factor in risk_factors:
    if factor == "gender":
        # Special handling for gender
        odds_ratio = calculate_odds_ratio(clinical_data, "mortality", "gender", "M")
        mortality_odds[f"{factor} (M vs F)"] = odds_ratio
    else:
        odds_ratio = calculate_odds_ratio(clinical_data, "mortality", factor)
        mortality_odds[factor] = odds_ratio

# Create DataFrame for visualization
mortality_odds_df = pd.DataFrame(
    {
        "Risk Factor": list(mortality_odds.keys()),
        "Odds Ratio": list(mortality_odds.values()),
    }
).sort_values("Odds Ratio", ascending=False)

# Plot odds ratios for mortality
plt.figure(figsize=(12, 6))
sns.barplot(x="Odds Ratio", y="Risk Factor", data=mortality_odds_df, palette="viridis")
plt.title("Risk Factors for Mortality (Odds Ratios)", fontsize=16)
plt.xlabel("Odds Ratio", fontsize=14)
plt.ylabel("Risk Factor", fontsize=14)
plt.axvline(1.0, color="red", linestyle="--", label="No effect (OR=1)")
for i, v in enumerate(mortality_odds_df["Odds Ratio"]):
    plt.text(v + 0.05, i, f"{v:.2f}", va="center")
plt.legend()
plt.tight_layout()
plt.show()
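
The odds ratios plotted above are point estimates, and with modest sample sizes they can sit well away from 1 by chance. A confidence interval makes that uncertainty visible. The sketch below uses the standard Woolf (log-based) interval for a 2x2 table. The function name `odds_ratio_with_ci` and the example counts are illustrative assumptions, not values from this notebook.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    A Haldane-Anscombe correction (add 0.5 to every cell) is applied when
    any cell is zero so that the ratio and its variance stay finite.
    """
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative counts (not taken from the notebook's data)
or_value, ci_low, ci_high = odds_ratio_with_ci(a=30, b=120, c=130, d=720)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that spans 1, as in this example, indicates that the data are compatible with no association even though the point estimate is above 1.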

## Temporal Analysis

In [None]:
# Convert admission dates to month-year format for temporal analysis
clinical_data["admission_month"] = clinical_data["admission_date"].dt.to_period("M")

# Count admissions by month
monthly_admissions = (
    clinical_data.groupby("admission_month").size().reset_index(name="count")
)
monthly_admissions["admission_month"] = monthly_admissions["admission_month"].astype(
    str
)

# Calculate monthly outcome rates
monthly_outcomes = (
    clinical_data.groupby("admission_month")[["readmission_30d", "mortality"]].mean()
    * 100
)
monthly_outcomes = monthly_outcomes.reset_index()
monthly_outcomes["admission_month"] = monthly_outcomes["admission_month"].astype(str)

# Plot monthly admissions
plt.figure(figsize=(14, 6))
sns.lineplot(
    x="admission_month", y="count", data=monthly_admissions, marker="o", linewidth=2
)
plt.title("Monthly Admissions", fontsize=16)
plt.xlabel("Month", fontsize=14)
plt.ylabel("Number of Admissions", fontsize=14)
plt.xticks(rotation=45)
plt.grid(True, linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

# Plot monthly outcome rates
fig, ax1 = plt.subplots(figsize=(14, 6))

color1 = "tab:blue"
ax1.set_xlabel("Month", fontsize=14)
ax1.set_ylabel("Readmission Rate (%)", color=color1, fontsize=14)
ax1.plot(
    monthly_outcomes["admission_month"],
    monthly_outcomes["readmission_30d"],
    color=color1,
    marker="o",
    linewidth=2,
    label="30-day Readmission",
)
ax1.tick_params(axis="y", labelcolor=color1)

ax2 = ax1.twinx()  # Create a second y-axis sharing the same x-axis
color2 = "tab:red"
ax2.set_ylabel("Mortality Rate (%)", color=color2, fontsize=14)
ax2.plot(
    monthly_outcomes["admission_month"],
    monthly_outcomes["mortality"],
    color=color2,
    marker="s",
    linewidth=2,
    label="Mortality",
)
ax2.tick_params(axis="y", labelcolor=color2)

plt.title("Monthly Outcome Rates", fontsize=16)
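# After twinx() the current axes is ax2, whose x-axis is hidden, so rotate ax1's labels
ax1.tick_params(axis="x", labelrotation=45)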
plt.grid(True, linestyle="--", alpha=0.7)

# Add legend
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc="upper right")

plt.tight_layout()
plt.show()

## Summary of Findings

### Key Insights from the Exploratory Data Analysis:

1. **Patient Demographics**:
   - The patient population has a mean age of approximately 65 years, with a significant proportion of elderly patients.
   - Gender distribution is relatively balanced with a slight majority of females.
   - The most common insurance type is Medicare, followed by private insurance.

2. **Clinical Characteristics**:
   - The most common primary diagnoses are coronary artery disease, hypertension, and type 2 diabetes.
   - The average length of stay is approximately 6 days, with most stays being under 10 days.
   - Patients have an average of about 2 comorbidities and about 3 medications.
   - Lab values show expected patterns, with diabetic patients having higher HbA1c and hypertensive patients having higher blood pressure.

3. **Outcomes**:
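   - The expected 30-day readmission rate is approximately 16%: the generator uses a 15% base probability plus 1-2 percentage points for each risk factor present.
   - The expected mortality rate is approximately 6%: a 5% base probability plus 1-3 percentage points for each risk factor present.
   - Readmission probability rises with age over 75, three or more comorbidities, and a heart failure or COPD diagnosis. Mortality probability rises with age over 80, four or more comorbidities, stays longer than 14 days, and a heart failure or breast cancer diagnosis.
   - Heart failure and cancer diagnoses are associated with the highest expected mortality rates, although only about 2% of patients have a cancer diagnosis, so that subgroup estimate is noisy.
   - COPD and heart failure are associated with the highest expected readmission rates.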

4. **Risk Factors**:
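   - Age over 75, high comorbidity count, and heart failure are risk factors for both readmission and mortality, but the simulated effects are small (1-3 percentage points), so odds ratios estimated from 1,000 patients are close to 1 and noisy.
   - Long hospital stays (over 14 days) raise mortality risk slightly; the generator does not link length of stay to readmission.
   - Hypertension and high HbA1c are not linked to outcomes in the generator, so any apparent association reflects sampling noise or confounding with the primary diagnosis.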

5. **Temporal Patterns**:
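   - Admission dates are drawn uniformly over the past two years, so month-to-month differences in admission volume reflect sampling noise rather than seasonality.
   - Outcome probabilities do not depend on admission date; monthly outcome rates fluctuate because each month contains only about 40 admissions.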
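
The outcome rates summarized above are sample proportions, and rates for small subgroups (for example, a diagnosis carried by only about 2% of patients) can swing widely by chance. A Wilson score interval is one common way to attach uncertainty to such a rate. The sketch below is a minimal illustration; the function name `wilson_interval` and the counts are assumptions chosen for the example.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion events / n."""
    if n == 0:
        return (float("nan"), float("nan"))
    p_hat = events / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Illustrative counts: the same 10% observed rate in a small and a large group
small = wilson_interval(events=2, n=20)
large = wilson_interval(events=100, n=1000)
print(f"n=20:   ({small[0]:.3f}, {small[1]:.3f})")
print(f"n=1000: ({large[0]:.3f}, {large[1]:.3f})")
```

The interval for the 20-patient group is several times wider than the one for 1,000 patients, which is why per-diagnosis rates for rare diagnoses should be read with caution.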

### Implications for Predictive Modeling:

1. **Feature Selection**:
   - Key predictors should include age, comorbidity count, length of stay, primary diagnosis, and lab values.
   - Interaction terms between age and comorbidities may be valuable.

2. **Model Development**:
   - Separate models for readmission and mortality may be appropriate given their different risk factor profiles.
   - Consider stratified models for different age groups or diagnostic categories.
   - Time-based features may improve prediction accuracy.

3. **Validation Strategy**:
   - Temporal validation should be used to account for potential drift in patient characteristics or outcomes.
   - Stratified sampling should ensure adequate representation of high-risk subgroups.

4. **Clinical Implementation**:
   - Risk scores should be calibrated to specific patient populations.
   - Interpretability will be important for clinical adoption.
   - Models should be regularly updated to account for changes in practice patterns.
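
The temporal validation idea above can be sketched as a split on admission date, so that the model is always evaluated on admissions that occur after the ones it was trained on. This is a minimal sketch under stated assumptions: the helper name `temporal_train_test_split` and the 80/20 proportion are illustrative, and the demo frame is made up.

```python
import pandas as pd


def temporal_train_test_split(df, date_col, test_fraction=0.2):
    """Split rows so that no test row is earlier than any train row.

    Hypothetical helper: the most recent `test_fraction` of rows, ordered
    by `date_col`, form the test set.
    """
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_fraction))
    n_train = len(ordered) - n_test
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


# Made-up example: ten admissions on consecutive days, in shuffled order
demo = pd.DataFrame(
    {
        "admission_date": pd.date_range("2024-01-01", periods=10, freq="D"),
        "readmission_30d": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    }
).sample(frac=1, random_state=0)

train, test = temporal_train_test_split(demo, "admission_date")
print(train["admission_date"].max(), test["admission_date"].min())
```

Unlike a random split, this arrangement lets drift in patient characteristics or outcome rates show up as a drop in test performance.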

## Next Steps

1. **Feature Engineering**:
   - Create interaction terms between key variables
   - Develop temporal features from admission patterns
   - Extract additional features from clinical notes and structured data

2. **Model Development**:
   - Train and evaluate multiple model types (logistic regression, random forest, gradient boosting, neural networks)
   - Optimize hyperparameters for best performance
   - Develop ensemble approaches combining multiple models

3. **Model Validation**:
   - Perform cross-validation with temporal and random splits
   - Evaluate model performance on specific patient subgroups
   - Assess calibration and discrimination metrics

4. **Model Interpretation**:
   - Generate feature importance rankings
   - Create partial dependence plots for key features
   - Develop patient-specific explanations for predictions

5. **Clinical Integration**:
   - Design risk score visualizations for clinical use
   - Develop intervention recommendations based on risk factors
   - Plan for continuous model monitoring and updating
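
The discrimination and calibration metrics listed under Model Validation can be illustrated with two standard quantities: the area under the ROC curve (the probability that a randomly chosen positive case is scored above a randomly chosen negative one) and the Brier score (the mean squared error of predicted probabilities). The sketch below uses only the standard library. The function names and the four example predictions are illustrative assumptions.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(f"AUC = {auc:.3f}, Brier score = {brier:.4f}")
```

The pairwise AUC is quadratic in the number of cases, which is fine for a sketch; for full-size datasets a rank-based implementation from an established library would be the more practical choice.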