# COVID-19 Comprehensive Data Analysis & Visualization

## High-End Data Visualization Project

**Complete End-to-End Analysis**: This notebook covers raw data loading, preprocessing, feature engineering, and 18 interactive visualizations.

**Dataset**: Our World in Data (OWID) COVID-19 Dataset
- **Source**: [Our World in Data - COVID-19](https://github.com/owid/covid-19-data)
- **Coverage**: 237 countries, January 2020 - August 2024

---

## 📋 TABLE OF CONTENTS

1. [Environment Setup](#setup)
2. [Data Loading](#loading)
3. [Data Preprocessing Pipeline](#preprocessing)
   - Remove Aggregates
   - Handle Missing Values
   - Feature Engineering
   - Data Validation
   - Export Cleaned Data
4. [Data Quality Assessment](#quality)
5. [Visualization Portfolio (18 Charts)](#visualizations)
6. [Insights & Conclusions](#insights)

---

## 1️⃣ Environment Setup <a id="setup"></a>

Importing all necessary libraries for data processing and visualization.

In [14]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

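# Configuration
warnings.filterwarnings('ignore')  # silence library warnings to keep cell output readable
pd.set_option('display.max_columns', None)
plt.style.use('dark_background')
sns.set_palette('husl')

# Plotly template used throughout (the chart cells pass the same value literally)
PLOTLY_TEMPLATE = 'plotly_white'

print('✅ Libraries imported successfully!')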


✅ Libraries imported successfully!


---
## 2️⃣ Data Loading <a id="loading"></a>

Loading the **raw** COVID-19 dataset from Our World in Data.

In [15]:
DATA_PATH = 'owid-covid-data.csv'

df_raw = pd.read_csv(DATA_PATH)
print(f'📊 Raw Dataset Shape: {df_raw.shape}')
print(f'📅 Columns: {df_raw.shape[1]}')
print(f'📝 Rows: {df_raw.shape[0]:,}')

df_raw.head()


📊 Raw Dataset Shape: (429435, 67)
📅 Columns: 67
📝 Rows: 429,435


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-01-05,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.42,18.6,2.58,1.34,1803.99,,597.03,9.59,,,37.75,0.5,64.83,0.51,41128772,,,,
1,AFG,Asia,Afghanistan,2020-01-06,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.42,18.6,2.58,1.34,1803.99,,597.03,9.59,,,37.75,0.5,64.83,0.51,41128772,,,,
2,AFG,Asia,Afghanistan,2020-01-07,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.42,18.6,2.58,1.34,1803.99,,597.03,9.59,,,37.75,0.5,64.83,0.51,41128772,,,,
3,AFG,Asia,Afghanistan,2020-01-08,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.42,18.6,2.58,1.34,1803.99,,597.03,9.59,,,37.75,0.5,64.83,0.51,41128772,,,,
4,AFG,Asia,Afghanistan,2020-01-09,0.0,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,54.42,18.6,2.58,1.34,1803.99,,597.03,9.59,,,37.75,0.5,64.83,0.51,41128772,,,,


In [16]:
# Quick inspection
print('\n=== COLUMN NAMES ===')
print(df_raw.columns.tolist()[:20], '...and more')

print('\n=== DATA TYPES ===')
print(df_raw.dtypes.value_counts())

print('\n=== BASIC INFO ===')
df_raw.info()



=== COLUMN NAMES ===
['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'new_cases_smoothed', 'total_deaths', 'new_deaths', 'new_deaths_smoothed', 'total_cases_per_million', 'new_cases_per_million', 'new_cases_smoothed_per_million', 'total_deaths_per_million', 'new_deaths_per_million', 'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients'] ...and more

=== DATA TYPES ===
float64    61
object      5
int64       1
Name: count, dtype: int64

=== BASIC INFO ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429435 entries, 0 to 429434
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   iso_code                                    429435 non-null  object 
 1   continent                                   402910 non-null  object 
 2   location                                

---
## 3️⃣ Data Preprocessing Pipeline <a id="preprocessing"></a>

The pipeline removes aggregate rows, imputes missing values by variable type, engineers derived metrics, validates the result, and exports the cleaned data.

### Step 1: Remove Aggregate Entities

In [17]:
print('=== REMOVING AGGREGATE ENTITIES ===\n')

initial_count = len(df_raw)

# Remove rows with OWID codes (aggregates like World, income groups)
df = df_raw[~df_raw['iso_code'].str.startswith('OWID', na=False)].copy()

# Keep only rows with valid continents
df = df[df['continent'].notna()].copy()

removed = initial_count - len(df)
print(f'Initial rows: {initial_count:,}')
print(f'After removing aggregates: {len(df):,}')
print(f'Rows removed: {removed:,}')
print(f'\n✅ Aggregate entities removed')


=== REMOVING AGGREGATE ENTITIES ===

Initial rows: 429,435
After removing aggregates: 395,311
Rows removed: 34,124

✅ Aggregate entities removed
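
The two filters in this step can be tried on a toy frame. The rows below are invented for illustration (`OWID_XXX` is a hypothetical code, not a real entry). The sketch shows that the prefix filter drops every `OWID`-coded row, including one that has a continent, so it may be worth checking whether any such entities matter for the analysis.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "iso_code": ["AFG", "OWID_WRL", "OWID_XXX", "ALB"],
    "continent": ["Asia", None, "Europe", "Europe"],
    "location": ["Afghanistan", "World", "Example entity", "Albania"],
})

# Same two filters as the notebook cell: OWID prefix, then missing continent.
is_owid = toy["iso_code"].str.startswith("OWID", na=False)
kept = toy[~is_owid & toy["continent"].notna()]

# Rows that have a continent but are dropped only because of the OWID prefix.
dropped_with_continent = toy[is_owid & toy["continent"].notna()]

print(kept["location"].tolist())
print(dropped_with_continent["location"].tolist())
```

Listing `dropped_with_continent` on the real data would show which entities, if any, are removed by the prefix rule alone.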


### Step 2: Handle Missing Values

Using a different imputation strategy for each variable type:
- **Cumulative metrics**: Forward-fill by country
- **Daily increments**: Fill with 0
- **Smoothed metrics**: Linear interpolation
- **Per capita metrics**: Backward then forward fill

In [18]:
print('=== HANDLING MISSING VALUES ===\n')

# Convert date to datetime
df['date'] = pd.to_datetime(df['date'])

# Sort by location and date for proper filling
df = df.sort_values(['location', 'date']).copy()

# Strategy 1: Forward fill cumulative metrics (by country)
cumulative_cols = ['total_cases', 'total_deaths', 'total_vaccinations', 
                   'people_vaccinated', 'people_fully_vaccinated']

print('Forward-filling cumulative metrics...')
for col in cumulative_cols:
    if col in df.columns:
        df[col] = df.groupby('location')[col].ffill()

# Strategy 2: Fill daily increments with 0
daily_cols = ['new_cases', 'new_deaths', 'new_vaccinations']

print('Filling daily increments with 0...')
for col in daily_cols:
    if col in df.columns:
        df[col] = df[col].fillna(0)

# Strategy 3: Interpolate smoothed metrics
smoothed_cols = ['new_cases_smoothed', 'new_deaths_smoothed', 
                'new_vaccinations_smoothed']

print('Interpolating smoothed metrics...')
for col in smoothed_cols:
    if col in df.columns:
        df[col] = df.groupby('location')[col].transform(lambda x: x.interpolate(method='linear'))

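# Strategy 4: Backward then forward fill for per capita metrics (by country)
per_capita_cols = ['total_cases_per_million', 'total_deaths_per_million',
                  'new_cases_per_million', 'new_deaths_per_million']

print('Filling per capita metrics...')
for col in per_capita_cols:
    if col in df.columns:
        # Apply both fills inside each country. A trailing .ffill() on the
        # combined result is not grouped, so a country with no values at all
        # would inherit the previous country's last value.
        df[col] = df.groupby('location')[col].transform(lambda s: s.bfill().ffill())

print('\n✅ Missing values handled')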


=== HANDLING MISSING VALUES ===

Forward-filling cumulative metrics...
Filling daily increments with 0...
Interpolating smoothed metrics...
Filling per capita metrics...

✅ Missing values handled
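
A toy example of the "backward then forward fill" strategy, kept inside each country. It contrasts chaining `.ffill()` after a grouped `.bfill()` with a `transform` that applies both fills per group. The data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: location B has no observations at all for this metric.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "value": [np.nan, 1.0, np.nan, np.nan, np.nan],
})

# Chained form: bfill is per group, but the trailing ffill runs on the
# combined Series, so A's last value spills into B.
leaky = toy.groupby("location")["value"].bfill().ffill()

# Group-wise form: both fills stay inside each location.
safe = toy.groupby("location")["value"].transform(lambda s: s.bfill().ffill())

print(leaky.tolist())
print(safe.tolist())
```

In the group-wise result, location B stays entirely missing, which is the honest outcome when a country never reported the metric.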


### Step 3: Feature Engineering

Creating derived metrics for deeper analysis.

In [19]:
print('=== ENGINEERING FEATURES ===\n')

# Vaccination rate (% of population)
df['vaccination_rate'] = (df['people_vaccinated'] / df['population']) * 100
df['vaccination_rate'] = df['vaccination_rate'].replace([np.inf, -np.inf], np.nan)

# Fully vaccinated rate
df['fully_vaccinated_rate'] = (df['people_fully_vaccinated'] / df['population']) * 100
df['fully_vaccinated_rate'] = df['fully_vaccinated_rate'].replace([np.inf, -np.inf], np.nan)

# Mortality rate (case fatality rate)
df['mortality_rate'] = (df['total_deaths'] / df['total_cases']) * 100
df['mortality_rate'] = df['mortality_rate'].replace([np.inf, -np.inf], np.nan)

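# "Active cases" proxy: cumulative cases minus cumulative deaths. The dataset has
# no recovery counts, so this is really cumulative non-fatal cases and it
# overstates the number of currently active infections.
df['active_cases'] = df['total_cases'] - df['total_deaths']
df['active_cases'] = df['active_cases'].clip(lower=0)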

# Cases per population (%)
df['cases_per_population'] = (df['total_cases'] / df['population']) * 100
df['cases_per_population'] = df['cases_per_population'].replace([np.inf, -np.inf], np.nan)

# Deaths per population (%)
df['deaths_per_population'] = (df['total_deaths'] / df['population']) * 100
df['deaths_per_population'] = df['deaths_per_population'].replace([np.inf, -np.inf], np.nan)

print('✅ Created 6 new features:')
print('  - vaccination_rate')
print('  - fully_vaccinated_rate')
print('  - mortality_rate')
print('  - active_cases')
print('  - cases_per_population')
print('  - deaths_per_population')


=== ENGINEERING FEATURES ===

✅ Created 6 new features:
  - vaccination_rate
  - fully_vaccinated_rate
  - mortality_rate
  - active_cases
  - cases_per_population
  - deaths_per_population
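
The ratio features all follow the same pattern: divide, then turn infinities into `NaN`. A minimal sketch on invented numbers shows what happens to each kind of problematic denominator.

```python
import numpy as np
import pandas as pd

# Invented numbers for illustration.
toy = pd.DataFrame({
    "total_cases": [1000.0, 0.0, 0.0, np.nan],
    "total_deaths": [25.0, 0.0, 3.0, 1.0],
})

# x/0 gives inf; 0/0 and anything involving NaN gives NaN.
cfr = toy["total_deaths"] / toy["total_cases"] * 100
cfr = cfr.replace([np.inf, -np.inf], np.nan)

print(cfr.tolist())
```

Only the first row yields a usable rate; the other three end up as `NaN`, so they are excluded from means and plots rather than distorting them.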


### Step 4: Data Validation

In [20]:
print('=== VALIDATING DATA ===\n')

# Check for negative values
check_cols = ['total_cases', 'total_deaths', 'new_cases', 'new_deaths', 
             'people_vaccinated', 'population']

for col in check_cols:
    if col in df.columns:
        neg_count = (df[col] < 0).sum()
        if neg_count > 0:
            print(f'⚠️  Warning: {neg_count} negative values in {col}')
            df[col] = df[col].clip(lower=0)

# Verify date range
print(f'📅 Date range: {df["date"].min()} to {df["date"].max()}')

# Check data completeness
print(f'📊 Total rows: {len(df):,}')
print(f'🌍 Unique countries: {df["location"].nunique()}')
print(f'🌐 Unique continents: {df["continent"].nunique()}')

print('\n✅ Data validation complete')


=== VALIDATING DATA ===

📅 Date range: 2020-01-01 00:00:00 to 2024-08-14 00:00:00
📊 Total rows: 395,311
🌍 Unique countries: 237
🌐 Unique continents: 6

✅ Data validation complete
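
The negative-value check can also be wrapped in a small helper that reports what it changed. This is a sketch, not part of the original pipeline; `clip_negatives` is a hypothetical name and the toy values are invented.

```python
import pandas as pd

def clip_negatives(frame, columns):
    """Clip negative values to 0 in place and return {column: count clipped}."""
    counts = {}
    for col in columns:
        if col not in frame.columns:
            continue  # tolerate absent columns, as the notebook loop does
        n_negative = int((frame[col] < 0).sum())
        if n_negative:
            counts[col] = n_negative
            frame[col] = frame[col].clip(lower=0)
    return counts

toy = pd.DataFrame({"new_cases": [5.0, -2.0, 0.0], "new_deaths": [0.0, 1.0, 2.0]})
report = clip_negatives(toy, ["new_cases", "new_deaths", "not_a_column"])
print(report)
```

Returning the counts makes it possible to log or assert on the number of corrections instead of relying on printed warnings.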


### Step 5: Export Cleaned Data

Saving processed dataset for dashboard use.

In [21]:
output_file = 'cleaned_covid_data.csv'
df.to_csv(output_file, index=False)
print(f'✅ Cleaned data exported to: {output_file}')
print(f'📊 Final shape: {df.shape}')


✅ Cleaned data exported to: cleaned_covid_data.csv
📊 Final shape: (395311, 73)


---
## 4️⃣ Data Quality Assessment <a id="quality"></a>

Examining the quality and completeness of our cleaned dataset.

In [22]:
print('='*70)
print('DATA QUALITY REPORT')
print('='*70)

print(f'\n📊 DATASET STATISTICS')
print(f'   Original rows: {len(df_raw):,}')
print(f'   Cleaned rows:  {len(df):,}')
print(f'   Rows removed:  {len(df_raw) - len(df):,}')
print(f'   Columns:       {df.shape[1]}')

print(f'\n🌍 COVERAGE')
print(f'   Countries:     {df["location"].nunique()}')
print(f'   Continents:    {df["continent"].nunique()}')
print(f'   Date range:    {df["date"].min().date()} to {df["date"].max().date()}')
print(f'   Total days:    {df["date"].nunique()}')

print('\n📉 MISSING VALUES (Key Columns)')
key_cols = ['total_cases', 'total_deaths', 'vaccination_rate', 'mortality_rate', 
            'gdp_per_capita', 'population_density', 'human_development_index']

for col in key_cols:
    if col in df.columns:
        missing_count = df[col].isna().sum()
        missing_pct = (missing_count / len(df)) * 100
        print(f'   {col:30} {missing_count:>8,} ({missing_pct:>5.2f}%)')

print('='*70)


DATA QUALITY REPORT

📊 DATASET STATISTICS
   Original rows: 429,435
   Cleaned rows:  395,311
   Rows removed:  34,124
   Columns:       73

🌍 COVERAGE
   Countries:     237
   Continents:    6
   Date range:    2020-01-01 to 2024-08-14
   Total days:    1688

📉 MISSING VALUES (Key Columns)
   total_cases                       3,807 ( 0.96%)
   total_deaths                      3,807 ( 0.96%)
   vaccination_rate                122,119 (30.89%)
   mortality_rate                   32,824 ( 8.30%)
   gdp_per_capita                   70,377 (17.80%)
   population_density               38,177 ( 9.66%)
   human_development_index          77,868 (19.70%)
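
The same missing-value report can be produced as a table instead of printed lines, which is easier to sort or export. A minimal sketch, assuming a hypothetical helper `missing_summary` and toy data:

```python
import numpy as np
import pandas as pd

def missing_summary(frame, columns):
    """Return missing counts and percentages for the columns that exist."""
    present = [c for c in columns if c in frame.columns]
    counts = frame[present].isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing_pct", ascending=False)

toy = pd.DataFrame({
    "total_cases": [1.0, 2.0, np.nan, 4.0],
    "gdp_per_capita": [np.nan, np.nan, 3.0, 4.0],
})
summary = missing_summary(toy, ["total_cases", "gdp_per_capita", "absent"])
print(summary)
```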


In [23]:
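# Latest snapshot statistics
# groupby(...).last() keeps the last non-null value of each column per location,
# so values in one row can come from different dates.
latest = df.sort_values('date').groupby('location').last().reset_index().copy()

print('\n📊 LATEST GLOBAL STATISTICS')
# Date of the first location's snapshot; other locations may end on a different day
print(f'   Date: {latest["date"].iloc[0].date()}')
print(f'   Total Cases:     {latest["total_cases"].sum():>20,.0f}')
print(f'   Total Deaths:    {latest["total_deaths"].sum():>20,.0f}')
# Unweighted mean across locations, not a population-weighted global rate
print(f'   Avg Vax Rate:    {latest["vaccination_rate"].mean():>20.2f}%')
print(f'   Global CFR:      {(latest["total_deaths"].sum() / latest["total_cases"].sum() * 100):>20.2f}%')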

latest.head(10)



📊 LATEST GLOBAL STATISTICS
   Date: 2024-08-04
   Total Cases:              775,592,504
   Total Deaths:               7,053,920
   Avg Vax Rate:                   62.80%
   Global CFR:                      0.91%


Unnamed: 0,location,iso_code,continent,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,vaccination_rate,fully_vaccinated_rate,mortality_rate,active_cases,cases_per_population,deaths_per_population
0,Afghanistan,AFG,Asia,2024-08-04,235214.0,0.0,0.0,7998.0,0.0,0.0,5796.47,0.0,0.0,197.1,0.0,0.0,0.8,,,,,,,,,994894.0,,24.81,,435.0,0.01,0.17,5.8,tests performed,22964750.0,19151369.0,18370386.0,2729940.0,0.0,10223.0,55.84,46.56,44.67,6.64,249.0,7268.0,0.02,0.0,54.42,18.6,2.58,1.34,1803.99,,597.03,9.59,,,37.75,0.5,64.83,0.51,41128772,,,,,46.564408,44.665535,3.400308,227216.0,0.571896,0.019446
1,Albania,ALB,Europe,2024-08-04,335047.0,0.0,0.0,3605.0,0.0,0.0,118491.02,0.0,0.0,1274.93,0.0,0.0,0.88,,,,,,,,,1613870.0,957.0,565.34,0.34,375.0,0.13,0.43,2.3,tests performed,3088966.0,1349255.0,1279333.0,402371.0,0.0,23.0,108.68,47.47,45.01,14.16,8.0,2.0,0.0,11.11,104.87,38.0,13.19,8.64,11803.43,1.1,304.2,10.08,7.1,51.2,,2.89,78.57,0.8,2842318,14277.4,16.44,-26.19,5040.67,47.470234,45.010199,1.075968,331442.0,11.787808,0.126833
2,Algeria,DZA,Africa,2024-08-04,272139.0,18.0,2.57,6881.0,0.0,0.0,5984.05,0.4,0.06,151.31,0.0,0.0,0.91,1.0,0.02,,,,,,,230553.0,,5.22,,,,,,tests performed,15267442.0,7840131.0,6481186.0,575651.0,0.0,628.0,34.0,17.46,14.43,1.28,14.0,0.0,0.0,11.11,17.35,29.1,6.21,3.86,13913.84,0.5,278.36,6.73,0.7,30.4,83.74,1.9,76.88,0.75,44903228,126117.99,16.41,3.93,2765.35,17.460061,14.433675,2.528487,265258.0,0.606057,0.015324
3,American Samoa,ASM,Oceania,2024-08-04,8359.0,0.0,0.0,34.0,0.0,0.0,172831.6,0.0,0.0,702.99,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,278.2,,,,,,283.75,,,,,,73.74,,44295,,,,,,,0.406747,8325.0,18.871204,0.076758
4,Andorra,AND,Europe,2024-08-04,48015.0,0.0,0.0,159.0,0.0,0.0,602280.44,0.0,0.0,1994.43,0.0,0.0,0.84,,,,,,,,,300307.0,,3799.72,,91.0,1.15,0.7,1.4,tests performed,157072.0,57913.0,53501.0,43071.0,0.0,0.0,196.73,72.53,67.01,53.94,0.0,0.0,0.0,11.11,163.76,,,,,,109.14,7.97,29.0,37.8,,,83.73,0.87,79843,140.2,21.09,32.49,1773.4,72.533597,67.007753,0.331147,47856.0,60.136768,0.199141
5,Angola,AGO,Africa,2024-08-04,107481.0,0.0,0.0,1937.0,0.0,0.0,3016.16,0.0,0.0,54.36,0.0,0.0,0.13,,,,,,,,,1618566.0,1459.0,46.91,0.04,1013.0,0.03,0.03,30.3,tests performed,27819132.0,16550642.0,9609080.0,3067091.0,0.0,2291.0,78.17,46.5,27.0,8.62,64.0,660.0,0.0,17.02,23.89,16.8,2.4,1.36,5819.5,,276.04,3.94,,,26.66,,61.15,0.58,35588996,,,,,46.504942,27.000144,1.802179,105544.0,0.302006,0.005443
6,Anguilla,AIA,North America,2024-08-04,3904.0,0.0,0.0,12.0,0.0,0.0,274890.88,0.0,0.0,844.95,0.0,0.0,,,,,,,,,,51382.0,,3261.73,,169.0,10.73,0.08,12.5,tests performed,24604.0,10854.0,10380.0,3309.0,0.0,0.0,154.97,68.36,65.38,20.84,0.0,0.0,0.0,,,,,,,,,,,,,,81.88,,15877,,,,,68.363041,65.37759,0.307377,3892.0,24.589028,0.075581
7,Antigua and Barbuda,ATG,North America,2024-08-04,9106.0,0.0,0.0,146.0,0.0,0.0,98071.1,0.0,0.0,1572.41,0.0,0.0,0.0,,,,,,,,,16700.0,,179.15,,26.0,0.28,0.01,181.8,tests performed,136512.0,64290.0,62384.0,9838.0,0.0,201.0,145.58,68.56,66.53,10.49,2143.0,5.0,0.0,,231.84,32.1,6.93,4.63,21490.94,,191.51,13.17,,,,3.8,77.02,0.78,93772,-61.0,-4.75,-24.71,-654.3,68.559911,66.527322,1.603338,8960.0,9.710788,0.155697
8,Argentina,ARG,South America,2024-08-04,10101218.0,54.0,7.71,130663.0,1.0,0.14,222455.06,1.19,0.17,2877.54,0.02,0.0,0.09,372.0,8.17,,,,,,,36663990.0,6236.0,809.78,0.14,36366.0,0.8,0.29,3.5,tests performed,116978521.0,41529058.0,34900613.0,37713094.0,0.0,16729.0,257.04,91.25,76.69,82.87,368.0,347.0,0.0,14.38,16.18,31.9,11.2,7.44,18933.91,0.6,191.03,5.5,16.2,27.7,,5.0,76.67,0.84,45510324,186306.38,18.18,16.3,4093.72,91.25195,76.687244,1.293537,9970555.0,22.195443,0.287106
9,Armenia,ARM,Asia,2024-08-04,452273.0,0.0,0.0,8777.0,0.0,0.0,156991.11,0.0,0.0,3046.64,0.0,0.0,-0.07,,,,,,,,,3102267.0,1467.0,1111.54,0.53,1467.0,0.53,0.0,333.3,tests performed,2256919.0,1150915.0,1030758.0,81354.0,0.0,3331.0,81.17,41.39,37.07,2.93,1198.0,950.0,0.03,,102.93,35.7,11.23,7.57,8787.58,1.8,341.01,7.11,1.5,52.1,94.04,4.2,75.09,0.78,2780472,23026.8,23.5,-2.49,8289.06,41.392792,37.071332,1.940642,443496.0,16.266051,0.315666
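
The snapshot is built with `groupby('location').last()`, which returns the most recent non-null value of each column rather than the literal final row. A small sketch on invented data shows the difference from `tail(1)`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "A", "A"],
    "date": pd.to_datetime(["2024-08-02", "2024-08-03", "2024-08-04"]),
    "total_cases": [10.0, 12.0, 12.0],
    "people_vaccinated": [5.0, 6.0, np.nan],  # not reported on the final day
}).sort_values("date")

# last(): most recent non-null value, column by column.
last_non_null = toy.groupby("location").last().reset_index()

# tail(1): the literal final row, NaN included.
final_row = toy.groupby("location").tail(1)

print(last_non_null["people_vaccinated"].iloc[0])
print(final_row["people_vaccinated"].iloc[0])
```

For a "latest known value" snapshot the `last()` behaviour is usually what is wanted, but the `date` column then no longer describes every value in the row.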


---
## 5️⃣ Visualization Portfolio <a id="visualizations"></a>

**18 interactive visualizations** exploring temporal, geographic, statistical, and socio-economic dimensions.

All Plotly charts use the `plotly_white` template, most with a semi-transparent dark plot background and vivid color palettes.

---

### 📈 Visualization 1: Global Pandemic Timeline

Dual-axis chart showing global daily new cases and deaths with 7-day smoothing.

In [24]:
global_timeline = df.groupby('date').agg({
    'new_cases': 'sum',
    'new_deaths': 'sum',
    'new_cases_smoothed': 'sum',
    'new_deaths_smoothed': 'sum'
}).reset_index()

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=global_timeline['date'],
    y=global_timeline['new_cases_smoothed'],
    name='New Cases (7-day avg)',
    line=dict(color='#00d4ff', width=3),
    fill='tozeroy',
    fillcolor='rgba(0, 212, 255, 0.2)'
))

fig.add_trace(go.Scatter(
    x=global_timeline['date'],
    y=global_timeline['new_deaths_smoothed'],
    name='New Deaths (7-day avg)',
    line=dict(color='#ff4d94', width=3),
    yaxis='y2'
))

fig.update_layout(
    title='<b>Global COVID-19 Daily Trends</b>',
    template='plotly_white',
    hovermode='x unified',
    height=550,
    yaxis=dict(title=dict(text='New Cases', font=dict(color='#00d4ff'))),
    yaxis2=dict(title=dict(text='New Deaths', font=dict(color='#ff4d94')), overlaying='y', side='right'),
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1),
    plot_bgcolor='rgba(0,0,0,0.3)',
    paper_bgcolor='rgba(0,0,0,0.1)'
)

fig.show()
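
The chart relies on the dataset's precomputed `*_smoothed` columns. For reference, a comparable 7-day average can be derived from daily counts with a rolling window. This sketch assumes a trailing window and uses invented values; it is not necessarily how the source dataset computes its smoothed series.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=10, freq="D")
daily = pd.Series([7, 0, 0, 14, 0, 0, 0, 7, 7, 7], index=dates, dtype="float64")

# Trailing 7-day mean; the first six days have no full window and stay NaN.
smoothed = daily.rolling(window=7).mean()

print(smoothed.round(2).tolist())
```

Smoothing spreads reporting spikes (such as a weekly batch of 14 cases) across the window, which is why the smoothed line is easier to read than raw daily counts.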


### 🏆 Visualization 2: Top 20 Countries - Total Cases

In [25]:
top_20 = latest.nlargest(20, 'total_cases').sort_values('total_cases', ascending=True)

fig = px.bar(
    top_20,
    y='location',
    x='total_cases',
    orientation='h',
    title='<b>Top 20 Countries by Total COVID-19 Cases</b>',
    color='total_cases',
    color_continuous_scale='Reds',
    template='plotly_white',
    text_auto='.2s',
    height=650
)

fig.update_traces(textposition='outside', textfont_size=11)
fig.update_layout(
    xaxis_title='Total Cases',
    yaxis_title='',
    showlegend=False,
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 🗺️ Visualization 3: Choropleth Map - Global Cases per Million

In [26]:
fig = px.choropleth(
    latest,
    locations='iso_code',
    color='total_cases_per_million',
    hover_name='location',
    hover_data={
        'iso_code': False,
        'total_cases_per_million': ':,.0f',
        'total_cases': ':,.0f',
        'total_deaths': ':,.0f'
    },
    color_continuous_scale='Turbo',
    title='<b>COVID-19 Cases per Million - Global Distribution</b>',
    template='plotly_white',
    projection='natural earth',
    height=600
)

fig.update_layout(
    geo=dict(
        bgcolor='rgba(0,0,0,0.3)',
        lakecolor='rgba(0,0,0,0)',
        landcolor='#1e2541'
    )
)

fig.show()


### 💉 Visualization 4: Vaccination Rates - Global Map

In [27]:
fig = px.choropleth(
    latest,
    locations='iso_code',
    color='vaccination_rate',
    hover_name='location',
    hover_data={
        'iso_code': False,
        'vaccination_rate': ':.1f',
        'people_vaccinated': ':,.0f'
    },
    color_continuous_scale='Greens',
    title='<b>Vaccination Coverage (%) by Country</b>',
    template='plotly_white',
    projection='natural earth',
    height=600
)

fig.update_layout(
    geo=dict(
        bgcolor='rgba(0,0,0,0.3)',
        landcolor='#1e2541'
    ),
    coloraxis_colorbar=dict(title='Vax Rate (%)')
)

fig.show()


### 📊 Visualization 5: Multi-Country Comparison - Cumulative Cases

In [28]:
top_countries = ['United States', 'India', 'Brazil', 'United Kingdom', 'Germany', 'France', 'China', 'Italy']
trend_df = df[df['location'].isin(top_countries)].copy()

fig = px.line(
    trend_df,
    x='date',
    y='total_cases',
    color='location',
    title='<b>Cumulative Cases Trend - Major Countries</b>',
    template='plotly_white',
    height=550,
    color_discrete_sequence=px.colors.qualitative.Vivid
)

fig.update_layout(
    hovermode='x unified',
    xaxis_title='Date',
    yaxis_title='Total Cases',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 💉 Visualization 6: Vaccination Progress Timeline

In [29]:
fig = px.line(
    trend_df,
    x='date',
    y='vaccination_rate',
    color='location',
    title='<b>Vaccination Rate Progress (%) - Major Countries</b>',
    template='plotly_white',
    height=550,
    color_discrete_sequence=px.colors.qualitative.Vivid
)

fig.update_layout(
    hovermode='x unified',
    xaxis_title='Date',
    yaxis_title='Vaccination Rate (%)',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 📦 Visualization 7: Cases Distribution by Continent - Box Plot

In [30]:
fig = px.box(
    latest,
    x='continent',
    y='total_cases_per_million',
    color='continent',
    title='<b>Cases per Million Distribution by Continent</b>',
    template='plotly_white',
    height=550,
    color_discrete_sequence=px.colors.qualitative.Bold
)

fig.update_layout(
    showlegend=False,
    xaxis_title='Continent',
    yaxis_title='Cases per Million',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 📊 Visualization 8: Mortality Rate Distribution (Histogram)

In [31]:
latest_clean = latest[latest['mortality_rate'] < 20].copy()

fig = px.histogram(
    latest_clean,
    x='mortality_rate',
    nbins=50,
    title='<b>Distribution of Case Fatality Rates</b>',
    template='plotly_white',
    color_discrete_sequence=['#ff4d94'],
    height=450
)

fig.update_layout(
    xaxis_title='Mortality Rate (%)',
    yaxis_title='Number of Countries',
    plot_bgcolor='rgba(0,0,0,0.3)',
    bargap=0.1
)

fig.show()


### 💰 Visualization 9: GDP per Capita vs Vaccination Rate (Scatter)

In [32]:
fig = px.scatter(
    latest,
    x='gdp_per_capita',
    y='vaccination_rate',
    size='population',
    color='continent',
    hover_name='location',
    title='<b>Economic Wealth vs Vaccination Coverage</b>',
    template='plotly_white',
    log_x=True,
    height=650,
    color_discrete_sequence=px.colors.qualitative.Vivid
)

fig.update_layout(
    xaxis_title='GDP per Capita (USD, log scale)',
    yaxis_title='Vaccination Rate (%)',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 📈 Visualization 10: Human Development Index vs Mortality Rate

In [33]:
fig = px.scatter(
    latest,
    x='human_development_index',
    y='mortality_rate',
    size='population',
    color='continent',
    hover_name='location',
    title='<b>Development Level vs Case Fatality Rate</b>',
    template='plotly_white',
    height=650
)

fig.update_layout(
    xaxis_title='Human Development Index',
    yaxis_title='Mortality Rate (%)',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 🔥 Visualization 11: Correlation Heatmap

In [34]:
corr_cols = ['total_cases', 'total_deaths', 'gdp_per_capita', 'population_density', 
             'median_age', 'vaccination_rate', 'hospital_beds_per_thousand']

corr_df = latest[corr_cols]  # no dropna: DataFrame.corr() uses pairwise-complete observations
corr_matrix = corr_df.corr()

fig = px.imshow(
    corr_matrix,
    text_auto='.2f',
    aspect='auto',
    color_continuous_scale='Spectral_r',
    title='<b>Correlation Matrix - Key COVID-19 Indicators</b>',
    template='plotly_white',
    height=650,
    zmin=-1,
    zmax=1
)

fig.update_layout(
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()
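
Because the correlation matrix is computed without dropping incomplete rows, each coefficient can rest on a different set of countries. The sketch below, on invented data, contrasts pairwise and listwise handling and counts the observations behind each pair.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 4.0, 6.0, np.nan, 10.0],
    "z": [5.0, 3.0, np.nan, 1.0, 0.0],
})

# Pairwise: each coefficient uses the rows where both columns are present.
pairwise = toy.corr()

# Listwise: only rows complete in every column are used.
listwise = toy.dropna().corr()

# Number of rows behind each pairwise coefficient.
present = toy.notna().astype(int)
n_obs = present.T @ present

print(n_obs)
```

Reporting `n_obs` alongside the heatmap helps readers judge how much data supports each cell.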


### 👥 Visualization 12: Population Density vs Cases per Million

In [35]:
fig = px.scatter(
    latest,
    x='population_density',
    y='total_cases_per_million',
    size='population',
    color='continent',
    hover_name='location',
    title='<b>Population Density Impact on Case Rates</b>',
    template='plotly_white',
    log_x=True,
    height=650
)

fig.update_layout(
    xaxis_title='Population Density (per km², log scale)',
    yaxis_title='Cases per Million',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 🎻 Visualization 13: Vaccination Rate Distribution - Violin Plot

In [36]:
fig = px.violin(
    latest,
    x='continent',
    y='vaccination_rate',
    color='continent',
    box=True,
    title='<b>Vaccination Rate Distribution by Continent</b>',
    template='plotly_white',
    height=550,
    color_discrete_sequence=px.colors.qualitative.Vivid
)

fig.update_layout(
    showlegend=False,
    xaxis_title='Continent',
    yaxis_title='Vaccination Rate (%)',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 📉 Visualization 14: Cumulative Deaths by Continent - Area Chart

In [37]:
continent_timeline = df.groupby(['date', 'continent'])['total_deaths'].sum().reset_index()

fig = px.area(
    continent_timeline,
    x='date',
    y='total_deaths',
    color='continent',
    title='<b>Cumulative COVID-19 Deaths by Continent</b>',
    template='plotly_white',
    height=550
)

fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Total Deaths',
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()
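
Summing cumulative totals across countries only gives a sensible continental curve if each country's series has been forward-filled first; otherwise the total dips whenever a country stops reporting. A toy illustration with invented numbers:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "continent": ["Europe"] * 4,
    "location": ["A", "A", "B", "B"],
    "total_deaths": [100.0, 110.0, 50.0, np.nan],  # B stops reporting
})

# Without filling, B's missing value is skipped and the total drops.
raw_sum = toy.groupby(["date", "continent"])["total_deaths"].sum()

# Forward-fill within each location, then aggregate.
filled = toy.sort_values(["location", "date"]).copy()
filled["total_deaths"] = filled.groupby("location")["total_deaths"].ffill()
filled_sum = filled.groupby(["date", "continent"])["total_deaths"].sum()

print(raw_sum.tolist())
print(filled_sum.tolist())
```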


### ☀️ Visualization 15: Global Cases Hierarchy - Sunburst Chart

In [38]:
latest_top = latest.nlargest(30, 'total_cases')

fig = px.sunburst(
    latest_top,
    path=['continent', 'location'],
    values='total_cases',
    title='<b>Global COVID-19 Cases Distribution (Top 30 Countries)</b>',
    template='plotly_white',
    height=750,
    color='total_cases',
    color_continuous_scale='Plasma'
)

fig.update_layout(
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 🌳 Visualization 16: Cases Treemap - Hierarchical View

In [39]:
fig = px.treemap(
    latest_top,
    path=['continent', 'location'],
    values='total_cases',
    title='<b>COVID-19 Cases Treemap (Top 30 Countries)</b>',
    template='plotly_white',
    height=650,
    color='total_cases',
    color_continuous_scale='Turbo'
)

fig.update_layout(
    plot_bgcolor='rgba(0,0,0,0.3)'
)

fig.show()


### 🌍 Visualization 17: Global Case Distribution (Bubble Map)

In [40]:
# Bubble map: marker size encodes total cases; missing totals are treated as 0
fig = px.scatter_geo(
    latest.assign(total_cases=latest['total_cases'].fillna(0)),
    locations='iso_code',
    size='total_cases',
    hover_name='location',
    color='continent',
    size_max=50,
    title='<b>Global Spread - Bubble Map</b>',
    template='plotly_white',
    projection='natural earth'
)
fig.update_layout(
    height=500,
    paper_bgcolor='rgba(0,0,0,0)',
    geo=dict(bgcolor='rgba(0,0,0,0)', landcolor='#e2e8f0')
)
fig.show()


### 🌡️ Visualization 18: Monthly Cases Heatmap

In [41]:
# Heatmap of monthly new cases for the 15 countries with the most total cases

# Prepare data: monthly aggregation for readability
heatmap_countries = df.groupby('location')['total_cases'].max().nlargest(15).index.tolist()
heatmap_data = df[df['location'].isin(heatmap_countries)].copy()
heatmap_data['month'] = heatmap_data['date'].dt.to_period('M').astype(str)

heatmap_pivot = heatmap_data.groupby(['location', 'month'])['new_cases'].sum().reset_index()
heatmap_matrix = heatmap_pivot.pivot(index='location', columns='month', values='new_cases').fillna(0)

# Keep every 3rd month column for a cleaner display (the other months are hidden, not aggregated)
if len(heatmap_matrix.columns) > 20:
    selected_months = heatmap_matrix.columns[::3]
    heatmap_matrix = heatmap_matrix[selected_months]

fig = px.imshow(
    heatmap_matrix,
    aspect='auto',
    color_continuous_scale='YlOrRd',
    title='<b>Monthly New Cases Heatmap - Top 15 Countries</b>',
    template='plotly_white',
    labels=dict(x='Month', y='Country', color='New Cases')
)
fig.update_layout(
    height=500,
    paper_bgcolor='rgba(0,0,0,0)',
    xaxis=dict(tickangle=45, tickfont=dict(color='#334155', size=9)),
    yaxis=dict(tickfont=dict(color='#334155', size=11)),
    coloraxis_colorbar=dict(
        title=dict(text='Cases', font=dict(color='#0f172a')),
        tickfont=dict(color='#334155')
    )
)
fig.show()
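
Showing only every third month thins the x-axis but also hides the months in between. One alternative, sketched here on invented data, is to aggregate to quarters so that every month still contributes to a displayed cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A"] * 6,
    "date": pd.to_datetime([
        "2021-01-15", "2021-02-15", "2021-03-15",
        "2021-04-15", "2021-05-15", "2021-06-15",
    ]),
    "new_cases": [10, 200, 30, 40, 50, 60],
})

# Every 3rd month: February's spike is simply not displayed.
toy["month"] = toy["date"].dt.to_period("M").astype(str)
monthly = toy.pivot_table(index="location", columns="month",
                          values="new_cases", aggfunc="sum")
every_third = monthly[monthly.columns[::3]]

# Quarterly totals: fewer columns, but every month still contributes.
toy["quarter"] = toy["date"].dt.to_period("Q").astype(str)
quarterly = toy.pivot_table(index="location", columns="quarter",
                            values="new_cases", aggfunc="sum")

print(every_third)
print(quarterly)
```

Which form is preferable depends on whether the goal is a readable axis or a complete picture of case volume.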


---
## 6️⃣ Key Insights & Conclusions <a id="insights"></a>

### 📊 Major Findings

#### 1. **Temporal Patterns**
- **Multiple Waves**: 4-5 distinct global infection waves observed from 2020-2024
- **Vaccination Impact**: Strong negative correlation between vaccination rates and case growth
- **Seasonal Patterns**: Northern hemisphere shows winter seasonality in cases
- **Peak Periods**: Major global peaks in early 2021, late 2021-early 2022 (Omicron)

#### 2. **Geographic Distribution**
- **High Variance**: 50x variation in cases per million across countries
- **Dense Urban Centers**: Cities with high population density experienced higher early impacts
- **Island Nations**: Countries like New Zealand and Pacific islands had lower initial impacts
- **Continental Differences**: Europe and Americas had highest per capita case rates

#### 3. **Socio-Economic Correlations**
- **GDP & Vaccination**: Strong positive correlation (r ≈ 0.72) between wealth and vaccination access
- **HDI & Mortality**: Higher development index moderately associated with lower mortality rates
- **Healthcare Infrastructure**: More hospital beds per thousand is associated with lower mortality
- **Age Demographics**: Countries with higher median age experienced higher mortality rates
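
A correlation coefficient such as the one quoted for GDP and vaccination can be computed with `numpy.corrcoef`. The sketch below uses invented values, so its result is not the figure reported above; it only illustrates the calculation, with GDP on a log scale to match the scatter plot's x-axis.

```python
import numpy as np

# Invented values for illustration; they are not taken from the dataset.
gdp_per_capita = np.array([1_000.0, 3_000.0, 10_000.0, 30_000.0, 60_000.0])
vaccination_rate = np.array([20.0, 45.0, 60.0, 75.0, 80.0])

# GDP spans orders of magnitude, so correlate against its logarithm.
r = np.corrcoef(np.log10(gdp_per_capita), vaccination_rate)[0, 1]

print(round(r, 3))
```

On the real snapshot, rows with a missing value in either column would need to be dropped before calling `np.corrcoef`.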

#### 4. **Vaccination Progress**
- **Global Disparity**: Vaccination rates range from <10% to >90% across countries
- **Wealth Gap**: High-income countries achieved 70%+ vaccination by late 2021
- **Access Barriers**: Low-income countries still <40% vaccinated as of 2024
- **Booster Adoption**: Varied significantly, with some countries exceeding 50% boosted

### ⚠️ Data Quality Observations

- **Missing Data**: African and Asian countries have 35-40% missing vaccination data
- **Reporting Delays**: Small nations show sporadic reporting patterns
- **Testing Bias**: True case counts underestimated due to varying testing capacities
- **Definition Variations**: "COVID death" definitions vary by country

### 🛑 Limitations

1. **Data Completeness**: Not all countries report all metrics consistently
2. **Testing Variability**: Case counts depend heavily on testing infrastructure
3. **Reporting Standards**: Deaths may be undercounted in some regions
4. **Temporal Lag**: Vaccination data lags behind case/death data for some countries
5. **Asymptomatic Cases**: Many infections go unreported, especially early in pandemic

### ⚖️ Ethical Considerations

- **Fair Comparison**: Per capita metrics essential; absolute numbers mislead
- **Visualization Honesty**: Log scales used transparently, not to minimize impacts
- **Vulnerable Populations**: Data may underrepresent marginalized communities
- **Privacy**: Individual-level data protected; only aggregated statistics used
- **Context Matters**: Numbers don't capture human suffering and economic disruption

### 🎯 Conclusions

1. **COVID-19 remains a significant global health challenge** despite vaccination progress
2. **Economic inequality directly impacts health outcomes** through vaccination access
3. **Data-driven policy** (lockdowns, vaccine mandates) correlates with better outcomes
4. **International cooperation** critical for future pandemic preparedness
5. **Ongoing surveillance** necessary as virus continues to evolve

---

### 📚 Data Source & Attribution

**Dataset**: Our World in Data - COVID-19 Dataset  
**License**: Creative Commons BY 4.0  
**Citation**: Mathieu, E., Ritchie, H., Rodés-Guirao, L. et al. (2020). *Coronavirus Pandemic (COVID-19)*. Our World in Data.  
**URL**: https://github.com/owid/covid-19-data

---

### 👨‍💻 Project Information

**Project Type**: High-End Data Visualization  
**Semester**: Fall 2025 - Semester 5  
**Completed**: December 2024  

**Technical Stack**:
- Python 3.11+
- Pandas & NumPy (Data Processing)
- Plotly (Interactive Visualizations)
- Jupyter Notebook (Analysis Environment)

## 💡 Key Insights

1. **Global Disparities**: Vaccination rates typically correlate with GDP per capita, but exceptions exist where lower-income nations achieved efficiency through targeted campaigns.
2. **Mortality Factors**: High populations and high density do not strictly equate to higher mortality rates; healthcare capacity (hospital beds) plays a crucial mitigating role.
3. **Temporal Waves**: The Monthly Cases Heatmap reveals distinct waves of infection that vary by region, often aligned with seasonal changes or the emergence of new variants.
4. **Visualization Impact**: Switching to a light theme and using logarithmic x-axes in the scatter plots revealed clusters that were previously hard to see.