# India Air Quality Data Exploration

This notebook explores multiple datasets to understand air quality trends in India.

## Datasets:
1. **India-Air-Quality-Dataset** - CPCB AQI data for major metros (Delhi, Mumbai, Bangalore, Chennai, Hyderabad)
2. **AQI_bulletins** - UrbanEmissions daily AQI bulletins (2015-2025)
3. **cpcb_air_quality** - PM2.5 data from various stations

## Goal:
- Understand data structure and coverage
- Identify usable date ranges
- Check PM2.5 availability (needed for the cigarette-equivalent calculation)
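
The notebook does not state which conversion the cigarette-equivalent calculation uses. The following is a minimal sketch, assuming the commonly cited Berkeley Earth rule of thumb that breathing about 22 µg/m³ of PM2.5 for a day is roughly comparable to smoking one cigarette. `PM25_PER_CIGARETTE` and `cigarettes_per_day` are illustrative names, not part of any dataset here.

```python
# Assumption: ~22 µg/m³ of PM2.5 sustained over a day ≈ one cigarette
# (Berkeley Earth rule of thumb; swap in another factor if the analysis uses one).
PM25_PER_CIGARETTE = 22.0


def cigarettes_per_day(pm25_ugm3):
    """Convert a daily-mean PM2.5 concentration (µg/m³) to cigarette equivalents per day.

    Accepts a float, a NumPy array, or a pandas Series, since it is a plain division.
    """
    return pm25_ugm3 / PM25_PER_CIGARETTE


# Hypothetical daily mean, chosen only to illustrate the arithmetic
example = cigarettes_per_day(110.0)
print(f"110 µg/m³ is about {example:.1f} cigarettes per day")
```

Because the function is a plain division, it can be applied directly to a PM2.5 column once one is confirmed to exist.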

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', None)

DATA_DIR = Path('../data')

## 1. India-Air-Quality-Dataset (Metro Cities)

In [None]:
# Load Delhi data as representative sample
delhi_path = DATA_DIR / 'India-Air-Quality-Dataset' / 'Delhi_AQI_Dataset.csv'
delhi_df = pd.read_csv(delhi_path)

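print("Dataset: Delhi AQI")
print(f"Shape: {delhi_df.shape}")
print(f"\nColumns: {delhi_df.columns.tolist()}")
# Parse dates before taking min/max; raw strings would be compared lexicographically
delhi_dates = pd.to_datetime(delhi_df['Date'], errors='coerce')
print(f"\nDate range: {delhi_dates.min().date()} to {delhi_dates.max().date()}")
print("\nSample:")
delhi_df.head(10)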

In [None]:
# Load all metro cities
cities = ['Delhi', 'Mumbai', 'Bangalore', 'Chennai', 'Hyderabad']
metro_data = {}

for city in cities:
    path = DATA_DIR / 'India-Air-Quality-Dataset' / f'{city}_AQI_Dataset.csv'
    df = pd.read_csv(path)
    df['Date'] = pd.to_datetime(df['Date'])
    metro_data[city] = df
    print(f"{city}: {len(df)} rows, {df['Date'].min().date()} to {df['Date'].max().date()}")

print(f"\nBasic stats for Delhi PM2.5 (µg/m³):")
delhi_df['PM2.5'].describe()
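
Since one goal is to check PM2.5 availability, it may help to compare how complete the column is in each city rather than looking at Delhi alone. This is a sketch only: `pm25_coverage` is a hypothetical helper, the column name `'PM2.5'` is assumed to be the same in every city file, and the demo frames contain made-up values standing in for `metro_data`.

```python
import pandas as pd


def pm25_coverage(frames, column="PM2.5"):
    """Summarise how much of `column` is populated in each city's DataFrame."""
    rows = []
    for city, df in frames.items():
        total = len(df)
        # A missing column counts as zero available values
        non_null = int(df[column].notna().sum()) if column in df.columns else 0
        rows.append({
            "city": city,
            "rows": total,
            "non_null": non_null,
            "pct_available": round(100 * non_null / total, 1) if total else 0.0,
        })
    return pd.DataFrame(rows).set_index("city")


# Tiny synthetic frames (made-up values) standing in for metro_data
demo = {
    "A": pd.DataFrame({"PM2.5": [10.0, None, 30.0, 40.0]}),
    "B": pd.DataFrame({"AQI": [100, 120]}),
}
coverage = pm25_coverage(demo)
print(coverage)
```

In the notebook, the same call would be `pm25_coverage(metro_data)`, assuming the loading loop above ran successfully.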

## 2. AQI Bulletins Master Dataset (2015-2025)

In [None]:
# Load the master AQI bulletins file
bulletins_path = DATA_DIR / 'AQI_bulletins' / 'data' / 'Processed' / 'AllIndiaBulletins_Master2025.csv'
bulletins_df = pd.read_csv(bulletins_path)

print(f"Dataset: AQI Bulletins Master")
print(f"Shape: {bulletins_df.shape}")
print(f"\nColumns: {bulletins_df.columns.tolist()}")
bulletins_df.head()

In [None]:
# Check coverage by year
if 'Date' in bulletins_df.columns or 'date' in bulletins_df.columns:
    date_col = 'Date' if 'Date' in bulletins_df.columns else 'date'
    bulletins_df[date_col] = pd.to_datetime(bulletins_df[date_col], errors='coerce')
    bulletins_df['Year'] = bulletins_df[date_col].dt.year
    print("Records per year:")
    print(bulletins_df['Year'].value_counts().sort_index())
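elif 'year' in bulletins_df.columns:
    print("Records per year:")
    print(bulletins_df['year'].value_counts().sort_index())
else:
    # Avoid a silent no-op when neither a date nor a year column exists
    print("No date or year column found; check the column list above.")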
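
Counts per year show volume but not gaps. To identify usable date ranges, one option is to compare the number of distinct days present with the number of calendar days between the first and last record. This is a sketch under assumptions: `date_coverage` is a hypothetical helper, and the demo input is made up rather than taken from the bulletins file.

```python
import pandas as pd


def date_coverage(dates):
    """Return first/last date and the fraction of calendar days that have at least one record."""
    # Unparseable entries become NaT and are dropped; times are truncated to midnight
    parsed = pd.to_datetime(pd.Series(dates), errors="coerce").dropna().dt.normalize()
    if parsed.empty:
        return {"first": None, "last": None,
                "days_present": 0, "days_expected": 0, "fraction": 0.0}
    first, last = parsed.min(), parsed.max()
    days_expected = (last - first).days + 1
    days_present = int(parsed.nunique())
    return {
        "first": first.date(),
        "last": last.date(),
        "days_present": days_present,
        "days_expected": days_expected,
        "fraction": days_present / days_expected,
    }


# Made-up input: one duplicate day, one unparseable entry, two missing days
demo = date_coverage(["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-05", "not a date"])
print(demo)
```

Applied per city (for example inside a `groupby`), a low `fraction` would flag ranges that look long but are sparsely populated.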

In [None]:
# Check unique cities
city_col = [col for col in bulletins_df.columns if 'city' in col.lower()]
if city_col:
    print(f"\nUnique cities: {bulletins_df[city_col[0]].nunique()}")
    print(f"\nSample cities:")
    print(bulletins_df[city_col[0]].value_counts().head(20))

## 3. City-Year AQI Summary

In [None]:
# Load city-year aggregated data
city_year_path = DATA_DIR / 'AQI_bulletins' / 'scripts' / 'city_year_aqi.csv'
city_year_df = pd.read_csv(city_year_path)

print(f"Dataset: City-Year AQI Summary")
print(f"Shape: {city_year_df.shape}")
print(f"\nColumns: {city_year_df.columns.tolist()}")
city_year_df.head(10)

In [None]:
# Descriptive stats
city_year_df.describe()

## 4. Delhi PM2.5 Annual Averages (Historical)

In [None]:
# Load Delhi historical PM2.5 data
delhi_annual_path = DATA_DIR / 'cpcb_air_quality' / 'delhi-pm25-annual-avgs-data.txt'
with open(delhi_annual_path, 'r') as f:
    print(f"Delhi PM2.5 Annual Averages:")
    print(f.read())

## Summary: Data Availability

| Dataset | Coverage | Key Variables | Granularity |
|---------|----------|---------------|-------------|
| India-Air-Quality-Dataset | 5 metros | AQI, PM2.5, PM10, NO2, SO2, CO, O3 | Daily |
| AQI Bulletins | 100+ cities | AQI, Prominent Pollutant | Daily (2015-2025) |
| cpcb_air_quality (CPCB scraped data) | Multiple cities | PM2.5 | Daily |

### Key Finding: PM2.5 data is available for the cigarette-equivalent calculation

In [None]:
# Mark the end of the exploration (nothing is written to disk here)
print("Data exploration complete. Key datasets identified for analysis.")