# COVID-19 Data Preprocessing and Analysis

## 1. Introduction
This notebook covers the data cleaning and preprocessing pipeline for the COVID-19 dataset. The goal is to prepare the raw data for visualization in a Streamlit dashboard.

**Dataset:** `owid-covid-data.csv`

## 2. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import warnings

# Note: this silences every warning, including pandas' SettingWithCopyWarning
warnings.filterwarnings('ignore')

# Load the dataset
file_path = 'owid-covid-data.csv'
df = pd.read_csv(file_path)

# Optimize memory usage by converting object columns to categories where appropriate
df['iso_code'] = df['iso_code'].astype('category')
df['continent'] = df['continent'].astype('category')

print(f"Data Shape: {df.shape}")
df.head()
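
If memory is a concern, the category conversion can also be requested at load time. The following is a minimal sketch on a made-up two-row CSV held in memory; the `csv_text` contents and the `toy` name are illustrative assumptions, not part of the real dataset.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the real file (illustrative values only)
csv_text = (
    "iso_code,continent,location,date,total_cases\n"
    "FRA,Europe,France,2021-01-01,10\n"
    "KEN,Africa,Kenya,2021-01-01,5\n"
)

# usecols limits what is parsed; dtype stores repeated strings as categories
toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["iso_code", "continent", "location", "date", "total_cases"],
    dtype={"iso_code": "category", "continent": "category"},
)

print(toy.dtypes)
```

Whether this saves a meaningful amount of memory depends on how many columns are skipped and how often the string values repeat.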

## 3. Data Inspection
Checking for missing values and understanding the structure of the data.

In [None]:
# Count missing values per column and show the first 10 columns that have any
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0].head(10))

# Inspect data types
df.info()
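
Raw counts are hard to compare across columns, so a percentage view can help when deciding which columns to fill or drop. This is a small sketch on a toy frame with made-up values; `toy` and `missing_pct` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "total_cases": [1.0, np.nan, 5.0, 7.0],
    "new_cases": [np.nan, np.nan, np.nan, 2.0],
})

# isna().mean() is the fraction of missing values per column
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

print(missing_pct.round(1))
```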

## 4. Data Cleaning

### 4.1. Filtering Non-Country Entries
The dataset includes aggregate rows for the world, continents, and income groups (e.g., 'World', 'Europe', 'High income'). We filter these out to focus on country-level analysis. Aggregate rows have no `continent` value, whereas country rows usually do, so a null `continent` serves as the filter condition.

In [None]:
# Remove rows where continent is null (these are usually aggregates like 'World', 'Asia', etc.)
df_clean = df[df['continent'].notna()].copy()

print(f"Shape after filtering aggregates: {df_clean.shape}")
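
Because the filter relies on a heuristic, it is worth listing what it removes before trusting it. The sketch below does this on a toy frame; the rows are illustrative and `dropped` and `kept` are hypothetical names.

```python
import pandas as pd

# Toy frame: two aggregate rows (no continent) and two country rows
toy = pd.DataFrame({
    "continent": [None, None, "Europe", "Africa"],
    "location": ["World", "Europe", "France", "Kenya"],
})

# Locations that the null-continent filter would remove
dropped = sorted(toy.loc[toy["continent"].isna(), "location"].unique())

# Rows that survive the filter
kept = toy[toy["continent"].notna()].copy()

print("Dropped:", dropped)
print("Kept:", kept["location"].tolist())
```

Running the same check on the real data shows whether any genuine country lacks a continent and would be dropped by mistake.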

### 4.2. Handling Dates
Convert the `date` column to datetime objects for time-series operations.

In [None]:
df_clean['date'] = pd.to_datetime(df_clean['date'])
df_clean = df_clean.sort_values(by=['location', 'date'])
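
If the file might contain malformed dates, a stricter conversion can surface them instead of failing partway through. This is a sketch that assumes dates are stored as `YYYY-MM-DD` strings; the toy rows and the `n_bad` name are illustrative.

```python
import pandas as pd

# Toy frame with one malformed date (made-up values)
toy = pd.DataFrame({
    "location": ["B", "A", "A"],
    "date": ["2021-01-02", "2021-01-03", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")

# Count the rows that failed to parse
n_bad = int(toy["date"].isna().sum())
print(f"Unparseable dates: {n_bad}")

# Sort by location, then date; NaT values go last within each location
toy = toy.sort_values(by=["location", "date"]).reset_index(drop=True)
```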

### 4.3. Handling Missing Values
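Cumulative columns (like `total_cases`, `people_vaccinated`) are forward-filled within each location, on the assumption that a missing value means the total has not changed since the last report. Any gap that remains at the start of a location's series, before its first reported value, is set to 0. Daily-change columns (like `new_cases`) are filled with 0, which treats a missing report as no change.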

In [None]:
# Columns to fix
fill_0_cols = ['new_cases', 'new_deaths', 'new_vaccinations']
ffill_cols = ['total_cases', 'total_deaths', 'people_vaccinated', 'people_fully_vaccinated']

# Fill daily changes with 0 (assumption: null means no report/0 change)
for col in fill_0_cols:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].fillna(0)

# Forward fill cumulative stats within each location; leading gaps become 0
for col in ffill_cols:
    if col in df_clean.columns:
        df_clean[col] = df_clean.groupby('location')[col].ffill().fillna(0)
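
The reason for grouping by `location` before forward-filling can be seen on a toy frame: a plain forward fill carries the last value of one location into the first rows of the next. The values and the names `plain` and `grouped` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Location B starts with a missing value right after location A ends
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "total_cases": [10.0, np.nan, 12.0, np.nan, 3.0],
})

# Plain forward fill: A's final total (12) leaks into B's first row
plain = toy["total_cases"].ffill()

# Grouped forward fill: B's leading gap stays empty, then becomes 0
grouped = toy.groupby("location")["total_cases"].ffill().fillna(0)

print(pd.DataFrame({"plain": plain, "grouped": grouped}))
```

This only works as intended when rows are already sorted by location and date, as done in section 4.2.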

### 4.4. Feature Selection
Select only the columns necessary for our analysis to keep the file size manageable.

In [None]:
cols_to_keep = [
    'iso_code', 'continent', 'location', 'date',
    'total_cases', 'new_cases',
    'total_deaths', 'new_deaths',
    'people_vaccinated', 'people_fully_vaccinated',
    'population'
]

# Take an explicit copy so the new column below is not written to a slice of df_clean
final_df = df_clean[cols_to_keep].copy()

# Vaccination rate: people with at least one dose as a percentage of the population
final_df['vaccination_rate'] = (final_df['people_vaccinated'] / final_df['population']) * 100
final_df['vaccination_rate'] = final_df['vaccination_rate'].fillna(0)

final_df.head()
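
One edge case in the rate calculation is a zero or missing `population`: dividing by zero yields `inf`, which `fillna(0)` does not catch. A possible guard, sketched on made-up values with a hypothetical `rate` variable:

```python
import numpy as np
import pandas as pd

# Toy rows: a normal case, a zero population, and a missing population
toy = pd.DataFrame({
    "people_vaccinated": [50.0, 10.0, 5.0],
    "population": [200.0, 0.0, np.nan],
})

rate = toy["people_vaccinated"] / toy["population"] * 100

# Turn +/-inf from division by zero into NaN, then fill every gap with 0
rate = rate.replace([np.inf, -np.inf], np.nan).fillna(0)

print(rate.tolist())
```

Whether 0 is the right placeholder for an unknown rate is a modelling choice; leaving it as NaN may be preferable if the dashboard can handle gaps.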

## 5. Export Data
Saving the cleaned dataset for the Streamlit application.

In [None]:
output_file = 'cleaned_covid_data.csv'
final_df.to_csv(output_file, index=False)
print(f"Cleaned data saved to {output_file}")
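
CSV does not store column types, so the dashboard has to parse dates again when it reads the file. The sketch below illustrates the round trip with an in-memory buffer in place of the real file; the toy values and the `reloaded` name are assumptions.

```python
import io

import pandas as pd

# Toy frame with a proper datetime column (made-up values)
toy = pd.DataFrame({
    "location": ["A", "A"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "total_cases": [1.0, 2.0],
})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot carry
reloaded = pd.read_csv(buffer, parse_dates=["date"])

print(reloaded.dtypes)
```

The same applies to the `category` columns, which come back as plain strings unless a `dtype` mapping is passed on load.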