# COVID-19 Global Data Tracker

This Jupyter Notebook, authored by Shamim Gitungo, analyzes global COVID-19 trends, including cases and deaths, using the [Johns Hopkins University (JHU) CSSE COVID-19 dataset](https://github.com/CSSEGISandData/COVID-19). The project involves data cleaning, exploratory data analysis (EDA), visualizations, and a narrative report summarizing key insights.

## Objectives
- Import and clean COVID-19 global data from JHU CSSE.
- Analyze time trends for cases and deaths.
- Compare metrics across countries (Kenya, United States, India).
- Visualize trends using charts and a choropleth map.
- Summarize findings in a clear, reproducible report.

## Prerequisites
- **Dataset**: The notebook uses JHU’s time-series data, fetched from:
  - [Confirmed cases](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv)
  - [Deaths](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv)
- No local dataset files are required, as the notebook loads data from URLs.
- **Libraries**: Ensure `pandas`, `matplotlib`, `seaborn`, and `plotly` are installed. Install them using:
  ```bash
  pip install pandas matplotlib seaborn plotly
  ```

Let's begin!

## Step 1: Data Loading & Exploration

Load the JHU CSSE dataset using pandas and explore its structure.

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load JHU datasets from URLs
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

df_confirmed = pd.read_csv(confirmed_url)
df_deaths = pd.read_csv(deaths_url)

# Preview the datasets
print("Confirmed Cases Preview:")
print(df_confirmed.head())
print("\nDeaths Preview:")
print(df_deaths.head())

# Display column names
print("\nConfirmed Columns:")
print(df_confirmed.columns.tolist())
print("\nDeaths Columns:")
print(df_deaths.columns.tolist())

# Check for missing values
print("\nMissing Values in Confirmed:")
print(df_confirmed.isnull().sum())
print("\nMissing Values in Deaths:")
print(df_deaths.isnull().sum())
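
When exploring the structure, it helps to separate the metadata columns from the per-date columns, since later steps only reshape the dates. The sketch below assumes the JHU global files carry four metadata columns (`Province/State`, `Country/Region`, `Lat`, `Long`) followed by one column per date. The helper name `split_columns` and the sample values are hypothetical.

```python
import pandas as pd

# Assumed layout of the JHU global time-series files: four metadata columns
# followed by one column per date (e.g. '1/22/20').
META_COLS = ['Province/State', 'Country/Region', 'Lat', 'Long']

def split_columns(frame):
    """Return (metadata columns, date columns) for a JHU-style wide frame."""
    meta = [col for col in frame.columns if col in META_COLS]
    dates = [col for col in frame.columns if col not in META_COLS]
    return meta, dates

# Tiny synthetic frame standing in for df_confirmed (values are made up)
sample = pd.DataFrame({
    'Province/State': [None, None],
    'Country/Region': ['Kenya', 'India'],
    'Lat': [0.0, 20.0],
    'Long': [38.0, 79.0],
    '1/22/20': [0, 0],
    '1/23/20': [0, 1],
})

meta_cols, date_cols = split_columns(sample)
print("Metadata columns:", meta_cols)
print("Date columns:", date_cols)

# Missing values restricted to the date columns only
missing_in_dates = int(sample[date_cols].isnull().sum().sum())
print("Missing values in date columns:", missing_in_dates)
```

Missing values in `Province/State` are expected for countries reported as a single row, so counting nulls in the date columns alone gives a clearer picture of data quality.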

## Step 2: Data Cleaning

Clean the dataset by filtering relevant countries, reshaping the time-series format, and handling missing values.

In [None]:
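# Select countries of interest (the JHU files label the United States as 'US')
countries = ['Kenya', 'US', 'India']

# Keep the country column plus the date columns. The metadata columns
# 'Province/State' and 'Country/Region' also contain '/', so exclude them by name.
meta_cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
confirmed_dates = [col for col in df_confirmed.columns if col not in meta_cols]
deaths_dates = [col for col in df_deaths.columns if col not in meta_cols]
df_confirmed_filtered = df_confirmed[df_confirmed['Country/Region'].isin(countries)][['Country/Region'] + confirmed_dates]
df_deaths_filtered = df_deaths[df_deaths['Country/Region'].isin(countries)][['Country/Region'] + deaths_dates]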

# Reshape to long format
df_confirmed_long = df_confirmed_filtered.melt(id_vars=['Country/Region'], var_name='date', value_name='total_cases')
df_deaths_long = df_deaths_filtered.melt(id_vars=['Country/Region'], var_name='date', value_name='total_deaths')

# Merge datasets
df = pd.merge(df_confirmed_long, df_deaths_long, on=['Country/Region', 'date'], how='inner')
df = df.rename(columns={'Country/Region': 'location'})

# Convert date column to datetime (JHU column labels look like '1/22/20')
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

# Handle missing values
df['total_cases'] = df['total_cases'].fillna(0)
df['total_deaths'] = df['total_deaths'].fillna(0)

# Verify cleaning
print("Cleaned Dataset Preview:")
print(df[['date', 'location', 'total_cases', 'total_deaths']].head())
print("\nMissing Values After Cleaning:")
print(df.isnull().sum())
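
Some countries appear in the JHU files as several rows, one per province or territory, so a plain filter followed by a melt would leave duplicate dates for them. A minimal sketch of a reshape that first sums rows per country is shown below. The function name `to_long`, the country `Exampleland`, and all values are hypothetical.

```python
import pandas as pd

META_COLS = ('Province/State', 'Country/Region', 'Lat', 'Long')

# Synthetic wide frame in the JHU layout (values are made up for illustration)
wide = pd.DataFrame({
    'Province/State': ['A', 'B', None],
    'Country/Region': ['Exampleland', 'Exampleland', 'Kenya'],
    'Lat': [0.0, 1.0, 2.0],
    'Long': [0.0, 1.0, 2.0],
    '1/22/20': [1, 2, 0],
    '1/23/20': [3, 4, 5],
})

def to_long(frame, value_name):
    """Sum provinces per country, then reshape from wide to long format."""
    date_cols = [c for c in frame.columns if c not in META_COLS]
    per_country = frame.groupby('Country/Region')[date_cols].sum().reset_index()
    long = per_country.melt(id_vars='Country/Region', var_name='date',
                            value_name=value_name)
    long['date'] = pd.to_datetime(long['date'], format='%m/%d/%y')
    long = long.rename(columns={'Country/Region': 'location'})
    return long.sort_values(['location', 'date']).reset_index(drop=True)

long_cases = to_long(wide, 'total_cases')
print(long_cases)
```

Kenya, the US, and India each occupy a single row in the global files, so the aggregation is a no-op for them. It matters once the country list grows to include entries such as the United Kingdom or France.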

## Step 3: Exploratory Data Analysis (EDA)

Analyze trends in cases and deaths, and calculate death rates.

In [None]:
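# Calculate death rate (total_deaths / total_cases). Rows with zero cases
# would divide by zero, so treat them as missing and then set the rate to 0.
df['death_rate'] = df['total_deaths'] / df['total_cases'].replace(0, float('nan'))
df['death_rate'] = df['death_rate'].fillna(0)

# Summary statistics for selected countries (maximum over the whole period;
# for the cumulative totals this is normally the latest value)
print("Summary Statistics:")
print(df.groupby('location')[['total_cases', 'total_deaths', 'death_rate']].max())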

# Plot total cases over time
plt.figure(figsize=(12, 6))
for country in countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country)
plt.title('Total COVID-19 Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Total Cases')
plt.legend()
plt.grid(True)
plt.show()

# Plot total deaths over time
plt.figure(figsize=(12, 6))
for country in countries:
    country_data = df[df['location'] == country]
    plt.plot(country_data['date'], country_data['total_deaths'], label=country)
plt.title('Total COVID-19 Deaths Over Time')
plt.xlabel('Date')
plt.ylabel('Total Deaths')
plt.legend()
plt.grid(True)
plt.show()
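
The plots above show cumulative totals, so a surge appears only as a steeper slope. To see daily spikes directly, the cumulative series can be differenced per country. The sketch below is one way to do that on a long-format frame with `location`, `date`, and `total_cases` columns. The function `add_daily`, the column names `new_cases` and `new_cases_smoothed`, and the sample values are assumptions for illustration.

```python
import pandas as pd

# Synthetic cumulative counts for two locations (values are made up)
cum = pd.DataFrame({
    'location': ['A'] * 4 + ['B'] * 4,
    'date': list(pd.date_range('2021-04-01', periods=4)) * 2,
    'total_cases': [10, 15, 15, 30, 0, 2, 5, 4],
})

def add_daily(frame, window=7):
    """Add daily new cases and a rolling mean, computed per location."""
    out = frame.sort_values(['location', 'date']).copy()
    daily = out.groupby('location')['total_cases'].diff()
    # Assumption: the first day's new cases equal its cumulative count.
    # Negative differences (downward data corrections) are clipped to 0.
    out['new_cases'] = daily.fillna(out['total_cases']).clip(lower=0)
    out['new_cases_smoothed'] = (
        out.groupby('location')['new_cases']
           .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return out

daily_df = add_daily(cum, window=2)
print(daily_df)
```

A 7-day window is a common choice because it evens out weekday reporting effects. The example uses `window=2` only to keep the synthetic data short.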

## Step 4: Choropleth Map

Visualize total cases by country on a world map using Plotly.

In [None]:
# Prepare data for choropleth (latest date, all countries)
latest_date = df_confirmed.columns[-1]  # Last date column
latest_df = df_confirmed[['Country/Region', latest_date]].copy()
latest_df = latest_df.groupby('Country/Region').sum().reset_index()
latest_df = latest_df.rename(columns={'Country/Region': 'location', latest_date: 'total_cases'})

# Add ISO codes (expanded mapping for better coverage)
iso_mapping = {
    'Kenya': 'KEN',
    'US': 'USA',  # JHU labels the United States as 'US'
    'India': 'IND',
    'Brazil': 'BRA',
    'France': 'FRA',
    'Germany': 'DEU',
    'Italy': 'ITA',
    'Russia': 'RUS',
    'South Africa': 'ZAF',
    'United Kingdom': 'GBR'
    # Expand as needed or use pycountry for complete mapping
}
latest_df['iso_code'] = latest_df['location'].map(iso_mapping)
latest_df = latest_df.dropna(subset=['iso_code', 'total_cases'])

# Create choropleth map
fig = px.choropleth(
    latest_df,
    locations='iso_code',
    color='total_cases',
    hover_name='location',
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Global COVID-19 Total Cases (Latest Date)'
)
fig.show()
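
Because rows without an ISO code are dropped before plotting, any country name that does not match a key in the mapping silently disappears from the map. A small sketch that reports the unmatched names is shown below. The helper `attach_iso`, the placeholder country `Atlantis`, and the sample values are hypothetical.

```python
import pandas as pd

# Subset of the mapping, keyed by the names used in the JHU files
iso_mapping = {'Kenya': 'KEN', 'US': 'USA', 'India': 'IND'}

def attach_iso(frame, mapping):
    """Map location names to ISO codes and list the names left unmapped."""
    out = frame.copy()
    out['iso_code'] = out['location'].map(mapping)
    unmapped = sorted(out.loc[out['iso_code'].isna(), 'location'].unique())
    return out.dropna(subset=['iso_code']), unmapped

# Synthetic data (values are made up); 'Atlantis' stands in for an unmapped name
sample = pd.DataFrame({
    'location': ['Kenya', 'US', 'India', 'Atlantis'],
    'total_cases': [1, 2, 3, 4],
})

mapped, unmapped = attach_iso(sample, iso_mapping)
print("Mapped rows:", len(mapped))
print("Unmapped locations:", unmapped)
```

Printing the unmapped list after each run makes it easy to see which entries still need to be added to the mapping.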

## Step 5: Insights & Narrative

### Key Insights
1. **Case Trends**: The United States reported the highest total cases, exceeding 100 million by late 2022, followed by India with over 40 million. Kenya reported far fewer (under 1 million), likely reflecting its smaller population and more limited testing capacity.
2. **Death Rates**: The death rate (total deaths / total cases) was highest in Kenya, at around 1.7% by late 2022, compared with about 1.2% in the United States and 1.1% in India, reflecting variations in healthcare systems, testing coverage, and reporting accuracy.
3. **Anomalies**: India saw a sharp surge in April and May 2021, coinciding with the Delta variant wave, with daily new cases peaking at over 400,000.
4. **Global Perspective**: The choropleth map highlights high total case counts in North America, Europe, and parts of Asia, with lower reported counts in Africa, possibly due to limited testing infrastructure.
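5. **Data Limitations**: The JHU dataset provides robust case and death data but lacks vaccination data, limiting analysis of immunization trends. JHU also stopped updating the repository on March 10, 2023, so the "latest date" used for the map is that final snapshot.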
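
The death rates quoted in insight 2 are point-in-time ratios of cumulative deaths to cumulative cases, not averages over the period. A minimal sketch of reading such a snapshot from a long-format frame is shown below. The function `latest_death_rate` and the sample values are hypothetical and do not reproduce the figures above.

```python
import pandas as pd

# Synthetic long-format data (values are made up for illustration)
long_df = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2022-12-30', '2022-12-31',
                            '2022-12-30', '2022-12-31']),
    'total_cases': [100, 200, 0, 50],
    'total_deaths': [1, 3, 0, 1],
})

def latest_death_rate(frame):
    """Death rate in percent on each location's most recent date."""
    latest = (frame.sort_values('date')
                   .groupby('location')
                   .tail(1)
                   .set_index('location'))
    # Mask zero case counts so the division yields NaN instead of inf
    cases = latest['total_cases'].where(latest['total_cases'] > 0)
    return (latest['total_deaths'] / cases * 100).round(2)

rates = latest_death_rate(long_df)
print(rates)
```

Filtering the frame to dates on or before a chosen cutoff before calling the function gives the rate as of that date.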

### Conclusion
This analysis reveals significant variations in COVID-19 impacts across countries, driven by factors like population, healthcare infrastructure, and testing capacity. The visualizations and metrics provide a clear picture of global trends, suitable for policymakers, public health researchers, or data enthusiasts.

### Future Work
- Integrate vaccination data from sources like Our World in Data to analyze immunization trends.
- Develop an interactive dashboard using Streamlit for user-driven exploration.
- Investigate the impact of specific policies (e.g., lockdowns, travel bans) on case trends.