# COVID-19 Trends Analysis (OWID)

**Goal:** Load a real dataset, run a few clean analyses, and visualize the big picture, keeping everything simple.

**What you’ll see:**
- Clean code cells (pandas + matplotlib)
- Clear charts (one per figure)
- Short, human explanations in Markdown
- Country comparisons (peaks, fastest rollouts)
- A quick look for anomalies (data corrections and big spikes)

*Dataset:* `owid-covid-data.csv` (Our World in Data)  
*Generated:* 2025-08-27 00:09

## 1) Setup

We'll import just what we need: `pandas` for data, `matplotlib` for charts, and a bit of `numpy` for simple math. No fancy styles, no complications.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Make charts a bit bigger by default
plt.rcParams['figure.figsize'] = (10, 5)

# Load data
DATA_PATH = "/mnt/data/owid-covid-data.csv"
df = pd.read_csv(DATA_PATH, parse_dates=["date"])

# Keep only countries (drop OWID_ aggregates)
countries = df[~df["iso_code"].astype(str).str.startswith("OWID_")].copy()

# Fill NaNs for daily flows to avoid sum issues
for col in ["new_cases", "new_deaths", "new_vaccinations"]:
    if col in countries.columns:
        countries[col] = countries[col].fillna(0)

df.head(3)
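
To see why the aggregate rows are dropped, here is a minimal sketch on a made-up three-row frame. The values and the `toy` names are illustrative assumptions, not taken from the dataset. An aggregate row such as `OWID_WRL` already contains the country totals, so summing it alongside the countries counts the same cases twice.

```python
import pandas as pd

# Toy data (hypothetical values): two countries plus a world aggregate row
toy = pd.DataFrame({
    "iso_code": ["FRA", "DEU", "OWID_WRL"],
    "location": ["France", "Germany", "World"],
    "new_cases": [100.0, 200.0, 300.0],
})

# Same filter as in the setup cell: drop rows whose iso_code starts with OWID_
is_aggregate = toy["iso_code"].astype(str).str.startswith("OWID_")
toy_countries = toy[~is_aggregate]

total_all = toy["new_cases"].sum()                   # includes the aggregate row
total_countries = toy_countries["new_cases"].sum()   # countries only
print("With aggregates:", total_all, "| Countries only:", total_countries)
```

One caveat: OWID also gives `OWID_`-prefixed codes to a few non-aggregate locations (Kosovo, for example, appears as `OWID_KOS`), so a prefix filter drops those too.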

## 2) Quick Data Check

Let’s check the shape and a few key columns so we know what we’re working with.

In [None]:
print("Rows:", len(df), "| Columns:", len(df.columns))
df[["iso_code","continent","location","date","new_cases","new_deaths","people_fully_vaccinated_per_hundred"]].head(10)
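
A quick data check often includes the share of missing values per column. The sketch below shows the idea on a small made-up frame (`toy` and its values are assumptions for illustration); the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with some gaps
toy = pd.DataFrame({
    "new_cases": [1.0, np.nan, 3.0, 4.0],
    "new_deaths": [np.nan, np.nan, 0.0, 1.0],
})

# isna() gives a boolean frame; the column mean is the fraction of missing rows
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```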

## 3) Global Trends (Big Picture)

First, how did the pandemic evolve globally? We'll sum daily new cases and vaccinations across all countries. The sums use `countries` rather than `df`, because the `OWID_` aggregate rows (world, continents, income groups) already contain country totals and would otherwise be counted again.

In [None]:
# Global new cases per day (countries only, so aggregate rows are not double counted)
global_cases = countries.groupby("date", as_index=False)["new_cases"].sum()

plt.figure()
plt.plot(global_cases["date"], global_cases["new_cases"])
plt.title("Global New COVID-19 Cases per Day")
plt.xlabel("Date")
plt.ylabel("New Cases")
plt.tight_layout()
plt.show()

# Global vaccinations per day (using smoothed if available)
vacc_col = "new_vaccinations_smoothed" if "new_vaccinations_smoothed" in countries.columns else "new_vaccinations"
global_vax = countries.groupby("date", as_index=False)[vacc_col].sum()

plt.figure()
plt.plot(global_vax["date"], global_vax[vacc_col])
plt.title("Global New Vaccinations per Day")
plt.xlabel("Date")
# Only label the axis as smoothed when the smoothed column was actually used
plt.ylabel("New Vaccinations (Smoothed)" if vacc_col.endswith("_smoothed") else "New Vaccinations")
plt.tight_layout()
plt.show()

## 4) Country Comparisons

We’ll find:
- **Top 10 peaks** in daily new cases *per million* (normalizes by population).
- **Fastest vaccine rollout**: countries with the highest peak of smoothed daily vaccinations *per million*. The peak captures the fastest pace a country reached at any point, not how quickly it completed its rollout.

In [None]:
# Top peaks: cases per million
if "new_cases_per_million" in countries.columns:
    peak_cases_pm = (
        countries.groupby("location")["new_cases_per_million"]
        .max()
        .sort_values(ascending=False)
        .head(10)
        .reset_index()
    )
else:
    if "population" in countries.columns:
        tmp = countries[["location","date","new_cases","population"]].dropna(subset=["population"]).copy()
        tmp["new_cases_per_million"] = (tmp["new_cases"] / tmp["population"]) * 1_000_000
        peak_cases_pm = tmp.groupby("location")["new_cases_per_million"].max().sort_values(ascending=False).head(10).reset_index()
    else:
        peak_cases_pm = pd.DataFrame(columns=["location","new_cases_per_million"])

# A bare expression in the middle of a cell is not displayed, so print the table
print(peak_cases_pm)

plt.figure()
plt.barh(peak_cases_pm["location"][::-1], peak_cases_pm["new_cases_per_million"][::-1])
plt.title("Top 10 Countries by Peak New Cases per Million (Daily)")
plt.xlabel("Peak Daily New Cases per Million")
plt.ylabel("Country")
plt.tight_layout()
plt.show()

# Fastest vaccine rollout: peak smoothed daily vaccinations per million
vpm_col = "new_vaccinations_smoothed_per_million" if "new_vaccinations_smoothed_per_million" in countries.columns else None

if vpm_col:
    fastest_vax = (
        countries.groupby("location")[vpm_col]
        .max()
        .sort_values(ascending=False)
        .head(10)
        .reset_index()
    )
else:
    if "population" in countries.columns and "new_vaccinations_smoothed" in countries.columns:
        tmp = countries[["location","date","new_vaccinations_smoothed","population"]].dropna(subset=["population"]).copy()
        tmp["new_vax_per_million"] = (tmp["new_vaccinations_smoothed"] / tmp["population"]) * 1_000_000
        fastest_vax = tmp.groupby("location")["new_vax_per_million"].max().sort_values(ascending=False).head(10).reset_index()
        vpm_col = "new_vax_per_million"
    else:
        fastest_vax = pd.DataFrame(columns=["location","new_vaccinations_smoothed_per_million"])
        vpm_col = "new_vaccinations_smoothed_per_million"

# A bare expression in the middle of a cell is not displayed, so print the table
print(fastest_vax)

plt.figure()
plt.barh(fastest_vax["location"][::-1], fastest_vax[vpm_col][::-1])
plt.title("Fastest Vaccine Rollout (Peak Daily New Vaccinations per Million, Smoothed)")
plt.xlabel("Peak Daily New Vaccinations per Million")
plt.ylabel("Country")
plt.tight_layout()
plt.show()
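
The per-million normalization used in the fallback branches above can be checked by hand. This is a minimal sketch with two invented locations (`Small` and `Large`, with made-up populations and case counts); it shows how a small population can top the per-million ranking despite far fewer absolute cases.

```python
import pandas as pd

# Toy data (hypothetical values): a small and a large location, two days each
toy = pd.DataFrame({
    "location": ["Small", "Small", "Large", "Large"],
    "new_cases": [50.0, 200.0, 10_000.0, 40_000.0],
    "population": [100_000, 100_000, 50_000_000, 50_000_000],
})

# Same formula as the notebook's fallback: cases / population * 1,000,000
toy["new_cases_per_million"] = toy["new_cases"] / toy["population"] * 1_000_000

# Peak per location, highest first
peaks = (
    toy.groupby("location")["new_cases_per_million"]
    .max()
    .sort_values(ascending=False)
)
print(peaks)
```

Here `Small` peaks at about 2,000 per million from only 200 cases, while `Large` peaks at about 800 per million from 40,000 cases.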

## 5) Quick Anomaly Scan

Data isn't perfect. We'll check two simple things:
- **Negative daily values** (usually corrections).
- **Extreme spikes** in cases per million (z-score > 5 within a country).

In [None]:
notes = []

# Negative corrections
if "new_cases_per_million" in countries.columns:
    neg_corr = countries[countries["new_cases_per_million"] < 0].copy()
    if not neg_corr.empty:
        notes.append(f"Found {len(neg_corr)} rows with negative new_cases_per_million (data corrections).")
    else:
        notes.append("No negative new_cases_per_million values found.")
else:
    notes.append("Column new_cases_per_million not available.")

# Spike detection

def spikes_for_group(g):
    vals = g["new_cases_per_million"].fillna(0).values if "new_cases_per_million" in g else np.array([])
    if len(vals) < 10 or np.nanstd(vals) == 0:
        return pd.Series([0, np.nan, pd.NaT], index=["num_spikes","max_val","date_max"])
    z = (vals - np.nanmean(vals)) / (np.nanstd(vals) + 1e-9)
    spikes_idx = np.where(z > 5)[0]
    if len(spikes_idx) == 0:
        return pd.Series([0, np.nan, pd.NaT], index=["num_spikes","max_val","date_max"])
    max_idx = spikes_idx[np.argmax(vals[spikes_idx])]
    return pd.Series([len(spikes_idx), vals[max_idx], g.iloc[max_idx]["date"]], index=["num_spikes","max_val","date_max"])

if "new_cases_per_million" in countries.columns:
    spike_summary = countries.groupby("location").apply(spikes_for_group).reset_index()
    big_spikers = spike_summary.sort_values(by=["num_spikes","max_val"], ascending=False).head(10)
else:
    big_spikers = pd.DataFrame(columns=["location","num_spikes","max_val","date_max"])

# Print the notes as readable lines, then show the table as the cell output
print("\n".join(notes))
big_spikers
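
The z-score rule used above can be sanity-checked on a toy series. This is a minimal sketch with invented numbers (29 quiet days and one large reporting dump); it uses the same population standard deviation that `np.nanstd` computes by default.

```python
import numpy as np

# Toy series (hypothetical values): 29 quiet days at 10, then one dump of 1000
vals = np.array([10.0] * 29 + [1000.0])

# z-score of each day relative to the whole series
z = (vals - vals.mean()) / vals.std()

# Days flagged by the same threshold as the notebook (z > 5)
spike_idx = np.where(z > 5)[0]
print("Flagged positions:", spike_idx, "| z of last day:", round(z[-1], 2))
```

Because the mean and standard deviation include the spike itself, a very large value inflates the spread it is measured against. A median-based score would be more robust, but the plain z-score keeps the scan simple.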

## 6) What Stood Out

- The **global cases line** has clear waves: big winter spikes in many regions, then declines, then fresh peaks as new variants emerged.
- Vaccination picked up fast after initial rollouts, then **slowed** as demand and logistics varied by region.
- The **peak cases per million** list is often dominated by **small countries/territories** or those with intense short waves. Per-million metrics can look huge even when absolute numbers are small.
- **Fast vaccine rollouts** appear to have been driven by early access + efficient logistics (e.g., small, high-income countries or places with centralized healthcare).
- **Anomalies** (negatives/spikes) are normal in public health data; they usually mean backfills, audits, or method changes.

## 7) Next Steps (Optional)

- Slice by **continent** or **income group** for fairer comparisons.
- Compare **cases vs. deaths** to see severity patterns across waves.
- Add **rolling averages** for smoother lines when presenting to non-technical audiences.
- Export a **PDF** or a **PowerPoint** deck with screenshots (we included a ready-made PPT with key charts).
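
As a starting point for the rolling-average idea, here is a minimal sketch of a 7-day rolling mean computed per location. The frame, the location `A`, and the column name `new_cases_7d` are assumptions for illustration; on the real data the same pattern could be applied to `countries` after sorting by location and date.

```python
import pandas as pd

# Toy data (hypothetical values): one location, eight days, lumpy reporting
toy = pd.DataFrame({
    "location": ["A"] * 8,
    "date": pd.date_range("2021-01-01", periods=8, freq="D"),
    "new_cases": [0, 0, 70, 0, 0, 0, 0, 70],
}).sort_values(["location", "date"])

# 7-day rolling mean within each location; the first six days stay NaN
# because a full 7-day window is not yet available
toy["new_cases_7d"] = (
    toy.groupby("location")["new_cases"]
    .transform(lambda s: s.rolling(7, min_periods=7).mean())
)
print(toy[["date", "new_cases", "new_cases_7d"]])
```

Grouping before rolling keeps one location's window from spilling into the next location's rows.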

**Done**