# Disease Surveillance and Outbreak Analysis

## Introduction to Epidemiological Methods

This notebook demonstrates fundamental epidemiological concepts and methods used in disease surveillance and outbreak investigation. We'll analyze a simulated disease outbreak dataset and apply standard public health metrics to characterize the epidemic.

### Dataset Description

Our dataset contains **500 confirmed disease cases** collected over **12 months** across **5 geographic regions**. Each case record includes:
- **Temporal data**: Date of symptom onset and diagnosis
- **Geographic data**: Region of residence
- **Demographics**: Age and sex
- **Clinical outcomes**: Hospitalization status (severity proxy)

### Epidemiological Methods Covered

1. **Epidemic Curves (Epi Curves)**: Visualizing disease incidence over time to identify outbreak patterns
2. **Incidence Rates**: Calculating new case rates standardized per 100,000 population
3. **Attack Rates**: Measuring the proportion of a population affected during an outbreak
4. **Case Severity Rates**: Assessing disease severity using hospitalization as a proxy
5. **Outbreak Detection**: Identifying epidemic periods using moving averages and thresholds
6. **Temporal Analysis**: Growth rates, doubling times, and epidemic phases
7. **Geographic Comparison**: Regional variation in outbreak timing and magnitude
8. **Demographic Risk Factors**: Age and sex-specific rates

### Key Epidemiological Concepts

- **Incidence vs. Prevalence**: We focus on *incidence* (new cases over time) for acute outbreak analysis
- **Standardization**: Rates per 100,000 population allow fair comparison across different population sizes
- **Epidemic Phases**: Growth (exponential increase), peak (maximum incidence), and decline phases
- **Public Health Surveillance**: Systematic collection and analysis of disease data to guide interventions

## Setup and Import Libraries

We'll import essential Python libraries for data analysis, statistical computation, and visualization.

In [None]:
import warnings
from datetime import timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Load and Explore Data

First, we'll load the disease surveillance dataset and perform initial exploratory analysis to understand the data structure and key variables.
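
Before computing any rates, it can help to run a few sanity checks on the line list. The sketch below is one possible approach, not part of the original analysis: it uses a small made-up DataFrame with the column names this notebook assumes (`case_id`, `onset_date`, `diagnosis_date`), and the helper name `check_line_list` is hypothetical.

```python
import pandas as pd


def check_line_list(cases: pd.DataFrame) -> dict:
    """Return simple data-quality counts for a case line list (illustrative helper)."""
    onset = pd.to_datetime(cases["onset_date"], errors="coerce")
    diagnosis = pd.to_datetime(cases["diagnosis_date"], errors="coerce")
    return {
        "duplicate_case_ids": int(cases["case_id"].duplicated().sum()),
        "missing_onset_dates": int(onset.isna().sum()),
        # Diagnosis would not normally precede symptom onset in a line list.
        "diagnosis_before_onset": int((diagnosis < onset).sum()),
    }


# Toy records, invented for illustration only.
toy = pd.DataFrame(
    {
        "case_id": [1, 2, 2, 4],
        "onset_date": ["2023-01-05", "2023-01-07", None, "2023-01-10"],
        "diagnosis_date": ["2023-01-08", "2023-01-06", "2023-01-09", "2023-01-12"],
    }
)
quality_report = check_line_list(toy)
print(quality_report)
```

Non-zero counts do not necessarily mean the data are wrong, but they are worth reviewing before the records feed into epi curves and rates.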

In [None]:
# Load the disease surveillance data
df = pd.read_csv("../data/disease_outbreak.csv")

# Convert date columns to datetime
df["onset_date"] = pd.to_datetime(df["onset_date"])
df["diagnosis_date"] = pd.to_datetime(df["diagnosis_date"])

# Display basic information
print("Dataset Overview")
print("=" * 60)
print(f"Total cases: {len(df)}")
print(f"Date range: {df['onset_date'].min().date()} to {df['onset_date'].max().date()}")
print(f"Duration: {(df['onset_date'].max() - df['onset_date'].min()).days} days")
print("\nFirst few records:")
display(df.head(10))

print("\nData Types and Missing Values:")
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

print("\nBasic Statistics:")
display(df.describe())

In [None]:
# Group by region to understand geographic distribution
print("Cases by Region")
print("=" * 60)
region_summary = (
    df.groupby("region")
    .agg({"case_id": "count", "age": "mean", "hospitalized": "sum"})
    .rename(
        columns={"case_id": "total_cases", "age": "mean_age", "hospitalized": "hospitalizations"}
    )
)

region_summary["hospitalization_rate"] = (
    region_summary["hospitalizations"] / region_summary["total_cases"] * 100
).round(2)
display(region_summary)

# Visualize regional distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Case counts by region
region_summary["total_cases"].plot(kind="bar", ax=axes[0], color="steelblue")
axes[0].set_title("Total Cases by Region", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Region")
axes[0].set_ylabel("Number of Cases")
axes[0].tick_params(axis="x", rotation=45)

# Hospitalization rate by region
region_summary["hospitalization_rate"].plot(kind="bar", ax=axes[1], color="coral")
axes[1].set_title("Hospitalization Rate by Region (%)", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Region")
axes[1].set_ylabel("Hospitalization Rate (%)")
axes[1].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()

## Epidemic Curve (Epi Curve)

The **epidemic curve** is the fundamental visualization in outbreak investigation. It plots the number of cases over time and reveals:
- **Outbreak onset**: When the epidemic began
- **Peak timing**: Maximum disease incidence
- **Outbreak pattern**: Point source, continuous common source, or propagated (person-to-person)
- **Epidemic phases**: Growth, peak, and decline

We'll create both weekly and monthly epi curves to examine temporal patterns at different scales.
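
One detail worth keeping in mind when building an epi curve: grouping cases by week only yields weeks that contain at least one case, so quiet weeks vanish from the table. The following minimal sketch, using invented onset dates, shows one way to reinsert those weeks as zeros; the variable names are illustrative.

```python
import pandas as pd

# Toy onset dates, invented for illustration; no case falls in the week of 9 January.
onsets = pd.to_datetime(
    ["2023-01-02", "2023-01-03", "2023-01-05", "2023-01-17", "2023-01-18", "2023-01-24"]
)
toy = pd.DataFrame({"onset_date": onsets})

# Count cases per weekly period; weeks without cases are simply missing here.
counts = toy.groupby(toy["onset_date"].dt.to_period("W")).size()

# Reindex over the full span of weeks so that empty weeks appear with a count of 0.
all_weeks = pd.period_range(counts.index.min(), counts.index.max())
weekly_filled = counts.reindex(all_weeks, fill_value=0)
print(weekly_filled)
```

Zero-filled series matter most for moving averages and week-over-week growth, where a missing week would otherwise make two non-adjacent weeks look consecutive.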

In [None]:
# Create weekly epidemic curve
df["week"] = df["onset_date"].dt.to_period("W")
weekly_cases = df.groupby("week").size().reset_index(name="cases")
weekly_cases["week_start"] = weekly_cases["week"].dt.start_time

# Create monthly epidemic curve
df["month"] = df["onset_date"].dt.to_period("M")
monthly_cases = df.groupby("month").size().reset_index(name="cases")
monthly_cases["month_start"] = monthly_cases["month"].dt.start_time

# Plot epidemic curves
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Weekly epi curve
axes[0].bar(
    weekly_cases["week_start"],
    weekly_cases["cases"],
    width=5,
    color="steelblue",
    alpha=0.7,
    edgecolor="black",
)
axes[0].set_title("Epidemic Curve - Weekly Cases", fontsize=16, fontweight="bold", pad=20)
axes[0].set_xlabel("Week of Symptom Onset", fontsize=12)
axes[0].set_ylabel("Number of Cases", fontsize=12)
axes[0].grid(True, alpha=0.3)

# Add peak annotation
peak_week = weekly_cases.loc[weekly_cases["cases"].idxmax()]
axes[0].annotate(
    f"Peak: {int(peak_week['cases'])} cases",
    xy=(peak_week["week_start"], peak_week["cases"]),
    xytext=(20, 20),
    textcoords="offset points",
    bbox={"boxstyle": "round,pad=0.5", "fc": "yellow", "alpha": 0.7},
    arrowprops={"arrowstyle": "->", "connectionstyle": "arc3,rad=0"},
)

# Monthly epi curve
axes[1].bar(
    monthly_cases["month_start"],
    monthly_cases["cases"],
    width=20,
    color="coral",
    alpha=0.7,
    edgecolor="black",
)
axes[1].set_title("Epidemic Curve - Monthly Cases", fontsize=16, fontweight="bold", pad=20)
axes[1].set_xlabel("Month of Symptom Onset", fontsize=12)
axes[1].set_ylabel("Number of Cases", fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nEpidemic Curve Summary:")
print("=" * 60)
print(f"Peak week: {peak_week['week']} with {int(peak_week['cases'])} cases")
print(f"Mean weekly cases: {weekly_cases['cases'].mean():.1f}")
print(f"Maximum weekly cases: {weekly_cases['cases'].max()}")
print(f"Minimum weekly cases: {weekly_cases['cases'].min()}")

## Calculate Epidemiological Metrics

### Key Public Health Metrics

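1. **Incidence Rate**: Number of new cases per 100,000 population during a time period
   - Formula: (New cases / Population at risk) × 100,000
   - Standardized measure allowing comparison across different populations
   - With a fixed population denominator, as used here, this is strictly a cumulative incidence (incidence proportion) over the study period rather than a person-time rate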

2. **Attack Rate**: Proportion of the population that develops disease during an outbreak
   - Formula: (Number of cases / Population at risk) × 100
   - Expressed as a percentage

3. **Case Severity Rate**: Proportion of cases with severe outcomes
   - Using hospitalization as a proxy for severity
   - Formula: (Hospitalized cases / Total cases) × 100

4. **Age-Specific Rates**: Rates calculated for specific age groups
   - Identifies high-risk populations
   - Guides targeted interventions
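
As a worked illustration of the first three formulas, the sketch below applies them to hypothetical figures for a single region. The function names and the numbers are assumptions chosen to keep the arithmetic easy to check by hand.

```python
def rate_per_100k(cases: int, population: int) -> float:
    """Cumulative incidence expressed per 100,000 population."""
    return cases / population * 100_000


def attack_rate_percent(cases: int, population: int) -> float:
    """Share of the population at risk that became a case, as a percentage."""
    return cases / population * 100


def severity_rate_percent(hospitalized: int, cases: int) -> float:
    """Share of cases that were hospitalized, as a percentage."""
    return hospitalized / cases * 100


# Hypothetical figures for one region (not taken from the dataset).
cases, hospitalized, population = 120, 18, 250_000

incidence = rate_per_100k(cases, population)
attack = attack_rate_percent(cases, population)
severity = severity_rate_percent(hospitalized, cases)
print(f"{incidence:.1f} per 100,000; attack rate {attack:.3f}%; severity {severity:.1f}%")
# → 48.0 per 100,000; attack rate 0.048%; severity 15.0%
```

Note that the incidence and attack rate share the same numerator and denominator and differ only in the multiplier, while the severity rate uses cases, not population, as its denominator.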

In [None]:
# Define regional populations (simulated)
regional_populations = {
    "North": 250000,
    "South": 300000,
    "East": 200000,
    "West": 275000,
    "Central": 225000,
}

# Calculate incidence rate per 100,000 population
print("Incidence Rates by Region")
print("=" * 60)

incidence_data = []
for region, population in regional_populations.items():
    cases = len(df[df["region"] == region])
    incidence_rate = (cases / population) * 100000
    incidence_data.append(
        {
            "region": region,
            "population": population,
            "cases": cases,
            "incidence_per_100k": round(incidence_rate, 2),
        }
    )

incidence_df = pd.DataFrame(incidence_data)
display(incidence_df)

# Visualize incidence rates
plt.figure(figsize=(12, 6))
plt.bar(
    incidence_df["region"],
    incidence_df["incidence_per_100k"],
    color="darkgreen",
    alpha=0.7,
    edgecolor="black",
)
plt.title("Incidence Rate per 100,000 Population by Region", fontsize=16, fontweight="bold")
plt.xlabel("Region", fontsize=12)
plt.ylabel("Cases per 100,000 Population", fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

print(
    f"\nOverall incidence rate: {(len(df) / sum(regional_populations.values()) * 100000):.2f} per 100,000"
)

In [None]:
# Calculate attack rate by region (percentage of population affected)
print("Attack Rates by Region")
print("=" * 60)

attack_rate_data = []
for region, population in regional_populations.items():
    cases = len(df[df["region"] == region])
    attack_rate = (cases / population) * 100
    attack_rate_data.append({"region": region, "attack_rate_percent": round(attack_rate, 3)})

attack_rate_df = pd.DataFrame(attack_rate_data)
display(attack_rate_df)

# Visualize attack rates
plt.figure(figsize=(12, 6))
plt.bar(
    attack_rate_df["region"],
    attack_rate_df["attack_rate_percent"],
    color="purple",
    alpha=0.7,
    edgecolor="black",
)
plt.title("Attack Rate by Region (%)", fontsize=16, fontweight="bold")
plt.xlabel("Region", fontsize=12)
plt.ylabel("Attack Rate (%)", fontsize=12)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()

overall_attack_rate = (len(df) / sum(regional_populations.values())) * 100
print(f"\nOverall attack rate: {overall_attack_rate:.3f}%")

In [None]:
# Calculate case severity rate (using hospitalization as proxy)
print("Case Severity Analysis")
print("=" * 60)

severity_by_region = (
    df.groupby("region")
    .agg({"case_id": "count", "hospitalized": "sum"})
    .rename(columns={"case_id": "total_cases", "hospitalized": "hospitalizations"})
)

severity_by_region["severity_rate_percent"] = (
    severity_by_region["hospitalizations"] / severity_by_region["total_cases"] * 100
).round(2)

display(severity_by_region)

# Overall severity
overall_severity = (df["hospitalized"].sum() / len(df)) * 100
print(f"\nOverall severity rate (hospitalization): {overall_severity:.2f}%")
print(f"Total hospitalizations: {df['hospitalized'].sum()} out of {len(df)} cases")

In [None]:
# Calculate age-specific rates
print("Age-Specific Analysis")
print("=" * 60)

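# Create age groups. Bins are right-inclusive, so age 65 falls in "51-65" and the
# last group starts at 66; include_lowest=True keeps age 0 in the first group.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 50, 65, 100],
    labels=["0-18", "19-35", "36-50", "51-65", "66+"],
    include_lowest=True,
)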

age_analysis = (
    df.groupby("age_group")
    .agg({"case_id": "count", "hospitalized": "sum", "age": "mean"})
    .rename(columns={"case_id": "cases", "hospitalized": "hospitalizations", "age": "mean_age"})
)

age_analysis["hospitalization_rate"] = (
    age_analysis["hospitalizations"] / age_analysis["cases"] * 100
).round(2)

age_analysis["percent_of_cases"] = (
    age_analysis["cases"] / age_analysis["cases"].sum() * 100
).round(2)

display(age_analysis)

# Visualize age distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Cases by age group
age_analysis["cases"].plot(kind="bar", ax=axes[0], color="teal", alpha=0.7, edgecolor="black")
axes[0].set_title("Cases by Age Group", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Age Group")
axes[0].set_ylabel("Number of Cases")
axes[0].tick_params(axis="x", rotation=45)
axes[0].grid(True, alpha=0.3, axis="y")

# Hospitalization rate by age group
age_analysis["hospitalization_rate"].plot(
    kind="bar", ax=axes[1], color="crimson", alpha=0.7, edgecolor="black"
)
axes[1].set_title("Hospitalization Rate by Age Group", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Age Group")
axes[1].set_ylabel("Hospitalization Rate (%)")
axes[1].tick_params(axis="x", rotation=45)
axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

## Geographic Analysis

Geographic analysis helps identify:
- **Spatial patterns**: Where the outbreak is most severe
- **Timing differences**: When different regions experienced peak transmission
- **Spread patterns**: How the disease moved between regions
- **Resource allocation**: Which regions need more public health resources

We'll compare outbreak timing and magnitude across all five regions.
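
A compact way to compare timing across regions is a week-by-region table, from which each region's peak week can be read off directly. The sketch below uses a tiny invented line list; the region names mirror those in this notebook, but the dates and counts are illustrative only.

```python
import pandas as pd

# Toy line list, invented for illustration.
toy = pd.DataFrame(
    {
        "onset_date": pd.to_datetime(
            ["2023-01-02", "2023-01-03", "2023-01-10", "2023-01-10", "2023-01-11", "2023-01-17"]
        ),
        "region": ["North", "North", "North", "South", "South", "South"],
    }
)
toy["week"] = toy["onset_date"].dt.to_period("W")

# Rows are weeks, columns are regions; week/region pairs without cases become 0.
week_by_region = toy.groupby(["week", "region"]).size().unstack(fill_value=0)

# For each region: the week with the most cases and the count in that week.
peak_table = pd.DataFrame(
    {"peak_week": week_by_region.idxmax(), "peak_cases": week_by_region.max()}
)
print(peak_table)
```

In this toy example the North peaks one week before the South, the kind of lag that can hint at the direction of spread, although timing alone does not establish it.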

In [None]:
# Create epidemic curves for each region
df["week"] = df["onset_date"].dt.to_period("W")
regional_weekly = df.groupby(["week", "region"]).size().reset_index(name="cases")
regional_weekly["week_start"] = regional_weekly["week"].dt.start_time

# Plot regional epidemic curves
fig, axes = plt.subplots(3, 2, figsize=(16, 12))
axes = axes.flatten()

regions = df["region"].unique()
colors = ["steelblue", "coral", "green", "purple", "orange"]

for idx, region in enumerate(regions):
    region_data = regional_weekly[regional_weekly["region"] == region]
    axes[idx].bar(
        region_data["week_start"],
        region_data["cases"],
        width=5,
        color=colors[idx],
        alpha=0.7,
        edgecolor="black",
    )
    axes[idx].set_title(f"{region} Region - Weekly Cases", fontsize=12, fontweight="bold")
    axes[idx].set_xlabel("Week")
    axes[idx].set_ylabel("Cases")
    axes[idx].grid(True, alpha=0.3)

    # Add peak annotation
    if len(region_data) > 0:
        peak = region_data.loc[region_data["cases"].idxmax()]
        axes[idx].annotate(
            f"Peak: {int(peak['cases'])}",
            xy=(peak["week_start"], peak["cases"]),
            xytext=(10, 10),
            textcoords="offset points",
            bbox={"boxstyle": "round,pad=0.3", "fc": "yellow", "alpha": 0.5},
            fontsize=9,
        )

# Remove extra subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

# Regional comparison summary
print("\nRegional Outbreak Timing Comparison")
print("=" * 60)

regional_summary = []
for region in regions:
    region_df = df[df["region"] == region]
    region_weekly = region_df.groupby("week").size()
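    # Use a region-specific name so the overall peak_week row computed for the
    # weekly epidemic curve is not overwritten (the summary report reads it later).
    region_peak_week = region_weekly.idxmax()
    peak_cases = region_weekly.max()

    regional_summary.append(
        {
            "region": region,
            "first_case": region_df["onset_date"].min().date(),
            "last_case": region_df["onset_date"].max().date(),
            "peak_week": str(region_peak_week),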
            "peak_cases": peak_cases,
            "total_cases": len(region_df),
        }
    )

regional_summary_df = pd.DataFrame(regional_summary)
display(regional_summary_df)

In [None]:
# Overlay all regional curves for comparison
plt.figure(figsize=(14, 7))

for idx, region in enumerate(regions):
    region_data = regional_weekly[regional_weekly["region"] == region]
    plt.plot(
        region_data["week_start"],
        region_data["cases"],
        marker="o",
        linewidth=2,
        label=region,
        color=colors[idx],
        alpha=0.8,
    )

plt.title("Regional Outbreak Comparison - All Regions", fontsize=16, fontweight="bold")
plt.xlabel("Week of Symptom Onset", fontsize=12)
plt.ylabel("Number of Cases", fontsize=12)
plt.legend(title="Region", fontsize=11, title_fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Outbreak Detection and Monitoring

### Surveillance Methods

Public health surveillance uses statistical methods to detect when disease incidence exceeds expected levels:

1. **Moving Averages**: Smooth out random variation to identify trends
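2. **Threshold Detection**: Flag an outbreak when cases exceed the baseline mean plus a multiple of the standard deviation (commonly mean + 2 SD; the code in this section uses the more sensitive mean + 1.5 SD)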
3. **Outbreak Periods**: Consecutive time periods above the threshold

These methods help public health officials:
- Detect outbreaks early
- Trigger investigation and response
- Monitor intervention effectiveness
- Declare when outbreak is over
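
The threshold idea can be sketched on a short invented series. Two choices here are assumptions for illustration rather than the notebook's method: the baseline is estimated from an initial quiet period only, and the moving average is trailing rather than centered, so it uses only weeks already observed.

```python
import pandas as pd

# Toy weekly counts, invented for illustration: a quiet baseline followed by a surge.
weekly = pd.Series([4, 5, 3, 4, 6, 5, 4, 5, 12, 20, 15, 6])

# Assumption: the first 8 weeks are treated as the non-outbreak baseline.
baseline = weekly.iloc[:8]
threshold = baseline.mean() + 2 * baseline.std()

# A trailing window averages the current week with the two before it.
trailing_avg = weekly.rolling(window=3).mean()

flagged_weeks = weekly[weekly > threshold].index.tolist()
print(f"threshold = {threshold:.2f}, flagged weeks = {flagged_weeks}")
# → threshold = 6.35, flagged weeks = [8, 9, 10]
```

Estimating the baseline from quiet weeks keeps the surge itself from inflating the mean and standard deviation, which would otherwise raise the threshold and delay detection.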

In [None]:
# Calculate moving averages and detect outbreak periods
weekly_cases_sorted = weekly_cases.sort_values("week_start").copy()
weekly_cases_sorted.reset_index(drop=True, inplace=True)

# Calculate 3-week moving average
weekly_cases_sorted["moving_avg_3wk"] = (
    weekly_cases_sorted["cases"].rolling(window=3, center=True).mean()
)

# Calculate 4-week moving average
weekly_cases_sorted["moving_avg_4wk"] = (
    weekly_cases_sorted["cases"].rolling(window=4, center=True).mean()
)

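# Define outbreak threshold (mean + 1.5 standard deviations).
# Note: the baseline is computed over all weeks, outbreak weeks included, which
# raises the threshold; a baseline from non-outbreak weeks would be more sensitive.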
baseline_mean = weekly_cases_sorted["cases"].mean()
baseline_std = weekly_cases_sorted["cases"].std()
outbreak_threshold = baseline_mean + (1.5 * baseline_std)

# Identify outbreak periods
weekly_cases_sorted["is_outbreak"] = weekly_cases_sorted["cases"] > outbreak_threshold

# Visualize outbreak detection
plt.figure(figsize=(14, 8))

# Plot actual cases
plt.bar(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["cases"],
    width=5,
    color="lightblue",
    alpha=0.5,
    label="Weekly Cases",
    edgecolor="black",
)

# Plot moving averages
plt.plot(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["moving_avg_3wk"],
    color="blue",
    linewidth=2.5,
    label="3-Week Moving Average",
    marker="o",
)
plt.plot(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["moving_avg_4wk"],
    color="green",
    linewidth=2.5,
    label="4-Week Moving Average",
    marker="s",
)

# Plot threshold line
plt.axhline(
    y=outbreak_threshold,
    color="red",
    linestyle="--",
    linewidth=2.5,
    label=f"Outbreak Threshold ({outbreak_threshold:.1f})",
)
plt.axhline(
    y=baseline_mean,
    color="orange",
    linestyle=":",
    linewidth=2,
    label=f"Baseline Mean ({baseline_mean:.1f})",
)

# Highlight outbreak periods
outbreak_weeks = weekly_cases_sorted[weekly_cases_sorted["is_outbreak"]]
for _, row in outbreak_weeks.iterrows():
    plt.axvspan(
        row["week_start"] - timedelta(days=3),
        row["week_start"] + timedelta(days=3),
        alpha=0.2,
        color="red",
    )

plt.title("Outbreak Detection - Moving Averages and Threshold", fontsize=16, fontweight="bold")
plt.xlabel("Week", fontsize=12)
plt.ylabel("Number of Cases", fontsize=12)
plt.legend(fontsize=11, loc="upper left")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Summary statistics
print("Outbreak Detection Summary")
print("=" * 60)
print(f"Baseline mean: {baseline_mean:.1f} cases/week")
print(f"Baseline standard deviation: {baseline_std:.1f}")
print(f"Outbreak threshold: {outbreak_threshold:.1f} cases/week")
print(
    f"\nWeeks above threshold: {weekly_cases_sorted['is_outbreak'].sum()} out of {len(weekly_cases_sorted)} weeks"
)
print(
    f"Percentage of time in outbreak: {(weekly_cases_sorted['is_outbreak'].sum() / len(weekly_cases_sorted) * 100):.1f}%"
)

## Demographics Analysis

Understanding demographic patterns helps identify:
- **High-risk groups**: Which populations are most affected
- **Severity by demographics**: Which groups have worse outcomes
- **Targeted interventions**: Where to focus prevention and treatment efforts
- **Health disparities**: Differences in disease burden across populations
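
Identifying high-risk groups requires population denominators, because the group with the most cases is not necessarily the group at highest risk. The sketch below makes the point with hypothetical case counts and age-group populations; none of these numbers come from the dataset.

```python
import pandas as pd

# Hypothetical case counts and population denominators per age group (illustrative only).
age_table = pd.DataFrame(
    {
        "cases": [40, 60, 50],
        "population": [200_000, 150_000, 50_000],
    },
    index=["0-18", "19-65", "66+"],
)

# Share of all cases vs. rate per 100,000: the two can rank groups differently.
age_table["percent_of_cases"] = age_table["cases"] / age_table["cases"].sum() * 100
age_table["rate_per_100k"] = age_table["cases"] / age_table["population"] * 100_000
print(age_table.round(1))
```

Here the 19-65 group contributes the most cases, yet the 66+ group has the highest rate per 100,000 because its population is much smaller.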

In [None]:
# Age and sex distribution
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Age distribution histogram
axes[0, 0].hist(df["age"], bins=20, color="skyblue", edgecolor="black", alpha=0.7)
axes[0, 0].set_title("Age Distribution of Cases", fontsize=14, fontweight="bold")
axes[0, 0].set_xlabel("Age (years)")
axes[0, 0].set_ylabel("Number of Cases")
axes[0, 0].axvline(
    df["age"].mean(),
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {df['age'].mean():.1f}",
)
axes[0, 0].axvline(
    df["age"].median(),
    color="green",
    linestyle="--",
    linewidth=2,
    label=f"Median: {df['age'].median():.1f}",
)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Sex distribution
sex_counts = df["sex"].value_counts()
axes[0, 1].pie(
    sex_counts.values,
    labels=sex_counts.index,
    autopct="%1.1f%%",
    colors=["lightcoral", "lightskyblue"],
    startangle=90,
)
axes[0, 1].set_title("Cases by Sex", fontsize=14, fontweight="bold")

# Age by sex
df.boxplot(column="age", by="sex", ax=axes[1, 0])
fig.suptitle("")  # Remove the automatic "Boxplot grouped by sex" figure title added by pandas
axes[1, 0].set_title("Age Distribution by Sex", fontsize=14, fontweight="bold")
axes[1, 0].set_xlabel("Sex")
axes[1, 0].set_ylabel("Age (years)")
plt.sca(axes[1, 0])
plt.xticks(rotation=0)

# Hospitalization by age group
hosp_by_age = df.groupby("age_group")["hospitalized"].mean() * 100
hosp_by_age.plot(kind="bar", ax=axes[1, 1], color="darkred", alpha=0.7, edgecolor="black")
axes[1, 1].set_title("Hospitalization Rate by Age Group", fontsize=14, fontweight="bold")
axes[1, 1].set_xlabel("Age Group")
axes[1, 1].set_ylabel("Hospitalization Rate (%)")
axes[1, 1].tick_params(axis="x", rotation=45)
axes[1, 1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

# Summary statistics
print("Demographic Summary")
print("=" * 60)
print("\nAge Statistics:")
print(f"  Mean age: {df['age'].mean():.1f} years")
print(f"  Median age: {df['age'].median():.1f} years")
print(f"  Age range: {df['age'].min()}-{df['age'].max()} years")
print("\nSex Distribution:")
for sex, count in sex_counts.items():
    print(f"  {sex}: {count} ({count / len(df) * 100:.1f}%)")

In [None]:
# Detailed hospitalization analysis
print("Hospitalization Patterns Analysis")
print("=" * 60)

# By sex
hosp_by_sex = (
    df.groupby("sex")
    .agg({"case_id": "count", "hospitalized": "sum"})
    .rename(columns={"case_id": "total", "hospitalized": "hospitalized_count"})
)
hosp_by_sex["hospitalization_rate"] = (
    hosp_by_sex["hospitalized_count"] / hosp_by_sex["total"] * 100
).round(2)

print("\nBy Sex:")
display(hosp_by_sex)

# By age group and sex
hosp_by_age_sex = (
    df.groupby(["age_group", "sex"])
    .agg({"case_id": "count", "hospitalized": "sum"})
    .rename(columns={"case_id": "total", "hospitalized": "hospitalized_count"})
)
hosp_by_age_sex["hospitalization_rate"] = (
    hosp_by_age_sex["hospitalized_count"] / hosp_by_age_sex["total"] * 100
).round(2)

print("\nBy Age Group and Sex:")
display(hosp_by_age_sex)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Hospitalization rate by sex
hosp_by_sex["hospitalization_rate"].plot(
    kind="bar", ax=axes[0], color=["lightcoral", "lightskyblue"], alpha=0.7, edgecolor="black"
)
axes[0].set_title("Hospitalization Rate by Sex", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Sex")
axes[0].set_ylabel("Hospitalization Rate (%)")
axes[0].tick_params(axis="x", rotation=0)
axes[0].grid(True, alpha=0.3, axis="y")

# Hospitalization rate by age group and sex
hosp_pivot = hosp_by_age_sex.reset_index().pivot(
    index="age_group", columns="sex", values="hospitalization_rate"
)
hosp_pivot.plot(
    kind="bar", ax=axes[1], color=["lightcoral", "lightskyblue"], alpha=0.7, edgecolor="black"
)
axes[1].set_title("Hospitalization Rate by Age Group and Sex", fontsize=14, fontweight="bold")
axes[1].set_xlabel("Age Group")
axes[1].set_ylabel("Hospitalization Rate (%)")
axes[1].tick_params(axis="x", rotation=45)
axes[1].legend(title="Sex")
axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()

## Temporal Patterns and Growth Dynamics

### Epidemic Growth Metrics

1. **Week-over-Week Growth Rate**: Percentage change in cases from one week to the next
   - Positive values indicate epidemic growth
   - Negative values indicate decline
   
2. **Doubling Time**: Time required for cases to double during exponential growth phase
   - Shorter doubling time = faster epidemic spread
   - Critical for projecting healthcare capacity needs
   
3. **Epidemic Phases**:
   - **Growth phase**: Exponential increase in cases
   - **Peak**: Maximum incidence
   - **Decline phase**: Decreasing cases

These metrics help predict outbreak trajectory and evaluate intervention effectiveness.
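
Doubling time can also be estimated by fitting a straight line to the logarithm of weekly counts, which is less sensitive to single noisy weeks than averaging week-over-week percentages. The sketch below is an alternative technique shown on invented data constructed to double every two weeks; it is not the calculation used later in this notebook.

```python
import numpy as np

# Toy weekly counts, invented for illustration: cases double every two weeks.
weeks = np.arange(6)
cases = 10 * 2 ** (weeks / 2)

# Fit log(cases) = intercept + r * week; r is the exponential growth rate per week.
r, intercept = np.polyfit(weeks, np.log(cases), 1)
doubling_time_weeks = np.log(2) / r

# Equivalent form for a constant week-over-week growth fraction g: ln(2) / ln(1 + g).
g = np.exp(r) - 1
print(f"r = {r:.3f}/week, g = {g * 100:.1f}%/week, doubling time = {doubling_time_weeks:.2f} weeks")
```

The two formulas agree because 1 + g = exp(r); dividing ln(2) by the percentage growth rate g itself would overstate how fast cases double.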

In [None]:
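# Calculate week-over-week growth rate.
# Note: weeks with no cases are absent from weekly_cases, so this compares
# consecutive reported weeks, which may not be adjacent calendar weeks.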
weekly_cases_sorted["week_over_week_growth"] = weekly_cases_sorted["cases"].pct_change() * 100

# Calculate cumulative cases
weekly_cases_sorted["cumulative_cases"] = weekly_cases_sorted["cases"].cumsum()

# Identify growth phase (consecutive weeks with positive growth)
weekly_cases_sorted["is_growth"] = weekly_cases_sorted["week_over_week_growth"] > 0

# Visualize temporal patterns
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Weekly cases with growth/decline coloring
colors_growth = [
    "green" if x > 0 else "red" if x < 0 else "gray"
    for x in weekly_cases_sorted["week_over_week_growth"]
]
axes[0].bar(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["cases"],
    width=5,
    color=colors_growth,
    alpha=0.6,
    edgecolor="black",
)
axes[0].set_title("Weekly Cases (Green=Growth, Red=Decline)", fontsize=14, fontweight="bold")
axes[0].set_ylabel("Cases")
axes[0].grid(True, alpha=0.3)

# Week-over-week growth rate
axes[1].plot(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["week_over_week_growth"],
    color="blue",
    marker="o",
    linewidth=2,
    markersize=6,
)
axes[1].axhline(y=0, color="black", linestyle="-", linewidth=1)
axes[1].fill_between(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["week_over_week_growth"],
    0,
    where=(weekly_cases_sorted["week_over_week_growth"] > 0),
    alpha=0.3,
    color="green",
    label="Growth",
)
axes[1].fill_between(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["week_over_week_growth"],
    0,
    where=(weekly_cases_sorted["week_over_week_growth"] < 0),
    alpha=0.3,
    color="red",
    label="Decline",
)
axes[1].set_title("Week-over-Week Growth Rate", fontsize=14, fontweight="bold")
axes[1].set_ylabel("Growth Rate (%)")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Cumulative cases over time (not an epidemic curve, which plots new cases per period)
axes[2].plot(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["cumulative_cases"],
    color="darkred",
    marker="o",
    linewidth=2.5,
    markersize=6,
)
axes[2].fill_between(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["cumulative_cases"],
    alpha=0.3,
    color="red",
)
axes[2].set_title("Cumulative Cases Over Time", fontsize=14, fontweight="bold")
axes[2].set_xlabel("Week")
axes[2].set_ylabel("Cumulative Cases")
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("Temporal Growth Analysis")
print("=" * 60)
print(
    f"\nMean week-over-week growth rate: {weekly_cases_sorted['week_over_week_growth'].mean():.2f}%"
)
print(f"Maximum growth rate: {weekly_cases_sorted['week_over_week_growth'].max():.2f}%")
print(f"Minimum growth rate: {weekly_cases_sorted['week_over_week_growth'].min():.2f}%")
print(
    f"\nWeeks with positive growth: {weekly_cases_sorted['is_growth'].sum()} out of {len(weekly_cases_sorted) - 1} weeks"
)

In [None]:
# Calculate doubling time during growth phase
print("Doubling Time Analysis")
print("=" * 60)

# Identify consecutive growth periods
growth_periods = weekly_cases_sorted[weekly_cases_sorted["is_growth"]].copy()

if len(growth_periods) > 0:
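    # Calculate doubling time using the discrete exponential growth formula:
    # doubling time = ln(2) / ln(1 + r), where r is the mean week-over-week
    # growth rate (as a fraction) across all growth weeks.

    # Label each run of consecutive growth weeks. Runs are identified on the full
    # series, because the filtered frame contains growth weeks only.
    run_id = (weekly_cases_sorted["is_growth"] != weekly_cases_sorted["is_growth"].shift()).cumsum()
    growth_periods["period_id"] = run_id.loc[growth_periods.index]

    avg_growth_rate = growth_periods["week_over_week_growth"].mean() / 100  # Convert to decimal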

    if avg_growth_rate > 0:
        doubling_time_weeks = np.log(2) / np.log(1 + avg_growth_rate)
        doubling_time_days = doubling_time_weeks * 7

        print(f"\nAverage growth rate during growth periods: {avg_growth_rate * 100:.2f}% per week")
        print(
            f"Estimated doubling time: {doubling_time_weeks:.2f} weeks ({doubling_time_days:.1f} days)"
        )
        print(
            f"\nInterpretation: During the growth phase, cases doubled approximately every {doubling_time_days:.1f} days."
        )
    else:
        print("No significant growth period detected for doubling time calculation.")

    # Visualize growth periods
    plt.figure(figsize=(14, 6))
    plt.semilogy(
        weekly_cases_sorted["week_start"],
        weekly_cases_sorted["cumulative_cases"],
        color="darkblue",
        marker="o",
        linewidth=2.5,
        markersize=6,
        label="Cumulative Cases (log scale)",
    )

    # Highlight growth periods
    for _, row in growth_periods.iterrows():
        plt.axvspan(
            row["week_start"] - timedelta(days=3),
            row["week_start"] + timedelta(days=3),
            alpha=0.2,
            color="green",
        )

    plt.title(
        "Cumulative Cases (Log Scale) - Growth Periods Highlighted", fontsize=16, fontweight="bold"
    )
    plt.xlabel("Week", fontsize=12)
    plt.ylabel("Cumulative Cases (log scale)", fontsize=12)
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3, which="both")
    plt.tight_layout()
    plt.show()
else:
    print("No growth periods detected in the data.")

## Summary Statistics and Outbreak Characterization

This final section consolidates the metrics computed above into a single outbreak summary report and dashboard.

In [None]:
# Create comprehensive summary table
print("OUTBREAK SUMMARY REPORT")
print("=" * 80)

# Overall metrics
total_cases = len(df)
total_population = sum(regional_populations.values())
outbreak_start = df["onset_date"].min()
outbreak_end = df["onset_date"].max()
outbreak_duration_days = (outbreak_end - outbreak_start).days

print("\n1. TEMPORAL CHARACTERISTICS")
print("-" * 80)
print(f"   Outbreak period: {outbreak_start.date()} to {outbreak_end.date()}")
print(f"   Duration: {outbreak_duration_days} days ({outbreak_duration_days / 7:.1f} weeks)")
print(f"   Peak week: {peak_week['week']} with {int(peak_week['cases'])} cases")
print(f"   Mean weekly cases: {weekly_cases['cases'].mean():.1f}")

print("\n2. INCIDENCE AND ATTACK RATES")
print("-" * 80)
print(f"   Total cases: {total_cases}")
print(f"   Total population at risk: {total_population:,}")
print(f"   Overall incidence rate: {(total_cases / total_population * 100000):.2f} per 100,000")
print(f"   Overall attack rate: {overall_attack_rate:.3f}%")
print(
    f"   Highest regional incidence: {incidence_df['incidence_per_100k'].max():.2f} per 100,000 ({incidence_df.loc[incidence_df['incidence_per_100k'].idxmax(), 'region']})"
)

print("\n3. SEVERITY AND OUTCOMES")
print("-" * 80)
total_hospitalizations = df["hospitalized"].sum()
print(f"   Total hospitalizations: {total_hospitalizations} ({overall_severity:.2f}%)")
print(
    f"   Highest hospitalization rate by age: {age_analysis['hospitalization_rate'].max():.2f}% ({age_analysis['hospitalization_rate'].idxmax()})"
)
print(
    f"   Lowest hospitalization rate by age: {age_analysis['hospitalization_rate'].min():.2f}% ({age_analysis['hospitalization_rate'].idxmin()})"
)

print("\n4. DEMOGRAPHIC PATTERNS")
print("-" * 80)
print(f"   Mean age: {df['age'].mean():.1f} years")
print(f"   Age range: {df['age'].min()}-{df['age'].max()} years")
print(
    f"   Most affected age group: {age_analysis['cases'].idxmax()} ({age_analysis['cases'].max()} cases, {age_analysis.loc[age_analysis['cases'].idxmax(), 'percent_of_cases']:.1f}%)"
)
male_pct = len(df[df["sex"] == "Male"]) / len(df) * 100
female_pct = len(df[df["sex"] == "Female"]) / len(df) * 100
print(f"   Sex distribution: Male {male_pct:.1f}%, Female {female_pct:.1f}%")

print("\n5. GEOGRAPHIC DISTRIBUTION")
print("-" * 80)
print(f"   Number of regions affected: {df['region'].nunique()}")
print(
    f"   Region with most cases: {region_summary['total_cases'].idxmax()} ({region_summary['total_cases'].max()} cases)"
)
print(
    f"   Region with fewest cases: {region_summary['total_cases'].idxmin()} ({region_summary['total_cases'].min()} cases)"
)

print("\n6. EPIDEMIC DYNAMICS")
print("-" * 80)
print(f"   Outbreak threshold: {outbreak_threshold:.1f} cases/week")
print(f"   Weeks above threshold: {weekly_cases_sorted['is_outbreak'].sum()} weeks")
if len(growth_periods) > 0 and avg_growth_rate > 0:
    print(f"   Average growth rate (growth phase): {avg_growth_rate * 100:.2f}% per week")
    print(f"   Estimated doubling time: {doubling_time_days:.1f} days")
else:
    print("   Growth dynamics: Insufficient growth period for doubling time calculation")

print("\n" + "=" * 80)

In [None]:
# Create comprehensive summary dashboard
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# 1. Main epidemic curve
ax1 = fig.add_subplot(gs[0, :])
ax1.bar(
    weekly_cases["week_start"],
    weekly_cases["cases"],
    width=5,
    color="steelblue",
    alpha=0.7,
    edgecolor="black",
)
ax1.axhline(
    y=outbreak_threshold, color="red", linestyle="--", linewidth=2, label="Outbreak Threshold"
)
ax1.set_title("EPIDEMIC CURVE - Weekly Cases", fontsize=16, fontweight="bold")
ax1.set_xlabel("Week")
ax1.set_ylabel("Cases")
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Regional comparison
ax2 = fig.add_subplot(gs[1, 0])
region_summary["total_cases"].plot(kind="barh", ax=ax2, color="coral", alpha=0.7, edgecolor="black")
ax2.set_title("Cases by Region", fontsize=12, fontweight="bold")
ax2.set_xlabel("Cases")
ax2.grid(True, alpha=0.3, axis="x")

# 3. Age distribution
ax3 = fig.add_subplot(gs[1, 1])
age_analysis["cases"].plot(kind="bar", ax=ax3, color="teal", alpha=0.7, edgecolor="black")
ax3.set_title("Cases by Age Group", fontsize=12, fontweight="bold")
ax3.set_xlabel("Age Group")
ax3.set_ylabel("Cases")
ax3.tick_params(axis="x", rotation=45)
ax3.grid(True, alpha=0.3, axis="y")

# 4. Hospitalization rate by age
ax4 = fig.add_subplot(gs[1, 2])
age_analysis["hospitalization_rate"].plot(
    kind="bar", ax=ax4, color="crimson", alpha=0.7, edgecolor="black"
)
ax4.set_title("Hospitalization Rate by Age", fontsize=12, fontweight="bold")
ax4.set_xlabel("Age Group")
ax4.set_ylabel("Rate (%)")
ax4.tick_params(axis="x", rotation=45)
ax4.grid(True, alpha=0.3, axis="y")

# 5. Growth rate over time
ax5 = fig.add_subplot(gs[2, 0])
ax5.plot(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["week_over_week_growth"],
    color="blue",
    marker="o",
    linewidth=2,
)
ax5.axhline(y=0, color="black", linestyle="-", linewidth=1)
ax5.set_title("Week-over-Week Growth Rate", fontsize=12, fontweight="bold")
ax5.set_xlabel("Week")
ax5.set_ylabel("Growth Rate (%)")
ax5.grid(True, alpha=0.3)

# 6. Cumulative cases
ax6 = fig.add_subplot(gs[2, 1])
ax6.plot(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["cumulative_cases"],
    color="darkred",
    marker="o",
    linewidth=2.5,
)
ax6.fill_between(
    weekly_cases_sorted["week_start"],
    weekly_cases_sorted["cumulative_cases"],
    alpha=0.3,
    color="red",
)
ax6.set_title("Cumulative Cases", fontsize=12, fontweight="bold")
ax6.set_xlabel("Week")
ax6.set_ylabel("Cumulative Cases")
ax6.grid(True, alpha=0.3)

# 7. Key metrics summary box
ax7 = fig.add_subplot(gs[2, 2])
ax7.axis("off")
summary_text = f"""
KEY METRICS

Total Cases: {total_cases}
Duration: {outbreak_duration_days} days

Incidence Rate:
{(total_cases / total_population * 100000):.1f} per 100,000

Attack Rate: {overall_attack_rate:.3f}%

Hospitalization: {overall_severity:.1f}%

Peak: Week {peak_week["week"]}
{int(peak_week["cases"])} cases
"""
ax7.text(
    0.1,
    0.5,
    summary_text,
    fontsize=11,
    verticalalignment="center",
    bbox={"boxstyle": "round", "facecolor": "wheat", "alpha": 0.5},
    family="monospace",
)

plt.suptitle("DISEASE OUTBREAK SUMMARY DASHBOARD", fontsize=18, fontweight="bold", y=0.995)
plt.show()

## Public Health Implications and Conclusions

### Key Findings

This epidemiological analysis characterizes a simulated disease outbreak of 500 confirmed cases across five regions over 12 months. The analysis shows:

1. **Outbreak Pattern**: The epidemic curve shows a classic outbreak pattern with distinct growth, peak, and decline phases

2. **Geographic Variation**: Regional differences in attack rates and outbreak timing are consistent with differences in transmission dynamics or intervention effectiveness, although surveillance data alone cannot distinguish between these explanations

3. **Risk Groups**: Age-specific analysis identifies populations at higher risk for both infection and severe outcomes (hospitalization)

4. **Epidemic Dynamics**: Growth rate and doubling time calculations quantify how quickly case counts rose during the growth phase and indicate the potential for rapid spread

5. **Surveillance Thresholds**: Moving averages and threshold-based detection flag the weeks in which case counts exceed the outbreak threshold
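
The relationship between a weekly growth rate and a doubling time (finding 4) can be sketched as follows. This is a minimal illustration that assumes constant compound growth, $N(t) = N_0 (1 + r)^t$, so the doubling time is $\ln 2 / \ln(1 + r)$ periods. The function name `doubling_time_from_growth` and the 25% example rate are hypothetical, and the formula used earlier in this notebook to compute `doubling_time_days` may differ (for example, the continuous-growth approximation $\ln 2 / r$).

```python
import math


def doubling_time_from_growth(growth_rate: float, days_per_period: float = 7.0) -> float:
    """Return the doubling time in days for a constant fractional growth rate per period.

    Assumes compound growth N(t) = N0 * (1 + r)**t, where t counts periods
    (weeks by default). This is an illustrative helper, not the notebook's own code.
    """
    if growth_rate <= 0:
        raise ValueError("Doubling time is only defined for positive growth rates")
    periods_to_double = math.log(2) / math.log(1 + growth_rate)
    return periods_to_double * days_per_period


# Hypothetical example: 25% week-over-week growth (not taken from the dataset)
example_rate = 0.25
example_doubling = doubling_time_from_growth(example_rate)
print(f"Doubling time at {example_rate:.0%} weekly growth: {example_doubling:.1f} days")
```

A useful sanity check is that 100% weekly growth gives a doubling time of exactly one week. The guard on non-positive rates mirrors the notebook's own check before it reports a doubling time.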

### Public Health Recommendations

Based on these findings, public health officials should consider:

- **Targeted Interventions**: Focus resources on high-incidence regions and high-risk age groups
- **Continued Surveillance**: Maintain monitoring to detect potential resurgence
- **Preparedness Planning**: Use doubling time estimates to anticipate case growth and plan healthcare capacity
- **Health Equity**: Investigate reasons for regional disparities and address underlying factors
- **Communication**: Share findings with stakeholders to guide policy and resource allocation
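
Targeting high-incidence regions depends on ranking by rate rather than by raw case count. The sketch below shows why, using a small made-up table. The region names, case counts, populations, and the helper `rank_regions_by_incidence` are all hypothetical and are not taken from the notebook's dataset.

```python
import pandas as pd


def rank_regions_by_incidence(regions: pd.DataFrame) -> pd.DataFrame:
    """Add an incidence-per-100,000 column and sort regions from highest to lowest.

    Expects columns 'region', 'cases', and 'population' (assumed names).
    """
    ranked = regions.copy()
    ranked["incidence_per_100k"] = ranked["cases"] * 100_000 / ranked["population"]
    return ranked.sort_values("incidence_per_100k", ascending=False).reset_index(drop=True)


# Hypothetical regions: A has the most cases, but B has the highest rate
toy_regions = pd.DataFrame(
    {
        "region": ["A", "B", "C"],
        "cases": [120, 80, 45],
        "population": [400_000, 150_000, 300_000],
    }
)

ranked_regions = rank_regions_by_incidence(toy_regions)
print(ranked_regions[["region", "cases", "incidence_per_100k"]])
```

In this toy table the region with the most cases is not the region with the highest incidence, which is the situation where standardized rates change the prioritization.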

### Epidemiological Concepts Demonstrated

This notebook has illustrated fundamental epidemiological methods:

- Epidemic curves for visualizing outbreak patterns
- Standardized incidence rates (per 100,000) for fair comparison
- Attack rates to measure population impact
- Severity assessment using hospitalization data
- Outbreak detection using statistical thresholds
- Growth dynamics and doubling time calculations
- Demographic analysis to identify vulnerable populations
- Geographic comparison to understand spatial patterns
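
As a compact recap of threshold-based outbreak detection, the sketch below smooths weekly counts with a moving average and flags weeks where the smoothed value exceeds a threshold. The threshold rule here (overall mean plus a multiple of the standard deviation), the window size, the helper name `flag_outbreak_weeks`, and the toy counts are assumptions for illustration. The notebook's own `outbreak_threshold` may be defined differently.

```python
import pandas as pd


def flag_outbreak_weeks(weekly_counts: pd.Series, window: int = 3, n_sd: float = 1.0):
    """Flag weeks whose moving-average case count exceeds mean + n_sd * std.

    Returns a DataFrame with 'cases', 'moving_avg', and 'is_outbreak' columns,
    plus the threshold value. Illustrative only; parameters are assumptions.
    """
    result = pd.DataFrame({"cases": weekly_counts})
    # Trailing moving average; min_periods=1 avoids NaN in the first weeks
    result["moving_avg"] = result["cases"].rolling(window=window, min_periods=1).mean()
    threshold = result["cases"].mean() + n_sd * result["cases"].std()
    result["is_outbreak"] = result["moving_avg"] > threshold
    return result, threshold


# Hypothetical weekly counts with a single epidemic wave
toy_weekly = pd.Series([2, 3, 2, 4, 12, 25, 30, 18, 6, 3, 2, 1])
flagged, toy_threshold = flag_outbreak_weeks(toy_weekly)
print(f"Threshold: {toy_threshold:.1f} cases/week")
print(flagged[flagged["is_outbreak"]])
```

Smoothing before thresholding reduces false alarms from single noisy weeks, at the cost of flagging the outbreak slightly later than a rule applied to raw counts would.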

These methods form the foundation of disease surveillance and outbreak investigation in public health practice.