# Air Pollution NO2 Data Analysis

## README

### Overview
This notebook conducts a comprehensive analysis of nitrogen dioxide (NO₂) pollution using Sentinel-5P data, with a focus on Ethiopia (Addis Ababa) and Iraq (Baghdad). It covers the full workflow, including data retrieval, preprocessing, aggregation, and visualisation.

### Objective
The aim is to assess spatial and temporal patterns in NO₂ levels as a proxy for air pollution and economic activity. 

### Workflow
The notebook is structured into three main parts:
1. **Data Download** – Retrieves NO₂ data from Google Earth Engine (Sentinel-5P).
2. **Data Processing** – Fills missing values, clips to boundaries, and aggregates to mesh/grid.
3. **Visualisation** – Produces spatial plots and animated GIFs for temporal dynamics.

### Outputs
- **Filled NO₂ Data**: Exported in TIFF format for spatial analyses.
- **Aggregated NO₂ Values**: Saved in GeoParquet format by mesh for efficiency.
- **Visual Animations**: NO₂ variation over time shown in GIF format.


## 0 Prepare Packages and Path

Uncomment the lines below to install the required packages and libraries.

In [None]:
# ! pip install rasterio matplotlib rasterstats ipynbname imageio tqdm
# ! pip install numpy==1.24.4

### Path Management

Resolve the current, repository, and data paths on the local machine so that the custom scripts in `src` can be imported.

In [1]:
from pathlib import Path
import sys

curr_root = Path().resolve()    # current file path
repo_root = curr_root.parent    # current repository path
data_root = repo_root / "data"  # path for saving the data
src_root = repo_root / "src"    # path for other sources
sys.path.append(str(src_root))  # add src to system path to import custom functions

# Import customised scripts
from animation import *
from aggregation import *

# print(repo_root)

### Generate Meshes

Copy the base mesh once per day from 2023-01-01 to 2024-12-31, so that each day has its own mesh file.

In [51]:
import shutil
from datetime import datetime, timedelta
import fiona

mesh_addis = data_root / "mesh-grid" / "grid_addis_ababa.gpkg"
mesh_baghdad = data_root / "mesh-grid" / "grid_baghdad.gpkg"

lyr_addis_name = fiona.listlayers(mesh_addis)[0]  # name of the first (only) layer
lyr_baghdad_name = fiona.listlayers(mesh_baghdad)[0]

# start and end date
start_date = datetime.strptime("2023-01-01", "%Y-%m-%d")
end_date = datetime.strptime("2024-12-31", "%Y-%m-%d")

addis_meshes_path = data_root / 'addis-mesh-data'
baghdad_meshes_path = data_root / 'baghdad-mesh-data'

addis_meshes_path.mkdir(exist_ok=True)
baghdad_meshes_path.mkdir(exist_ok=True)

delta = end_date - start_date
days_count = delta.days + 1

# For Addis Ababa
for i in range(days_count):
    current_date = start_date + timedelta(days=i)
    date_str = current_date.strftime("%Y-%m-%d")
    filename = f"addis-ababa-{date_str}.gpkg"
    dest_path = addis_meshes_path / filename

    shutil.copy(mesh_addis, dest_path)

print(f"Complete Generating meshes for Addis Ababa!")

# For Baghdad
for i in range(days_count):
    current_date = start_date + timedelta(days=i)
    date_str = current_date.strftime("%Y-%m-%d")
    filename = f"baghdad-{date_str}.gpkg"
    dest_path = baghdad_meshes_path / filename

    shutil.copy(mesh_baghdad, dest_path)


print(f"Complete Generating meshes for Baghdad!")


Complete Generating meshes for Addis Ababa!
Complete Generating meshes for Baghdad!


In [None]:
mesh_addis = data_root / "mesh-grid" / "grid_addis_ababa.gpkg"
mesh_baghdad = data_root / "mesh-grid" / "grid_baghdad.gpkg"

lyr_addis_name = fiona.listlayers(mesh_addis)  # list all layer names in the base mesh
lyr_baghdad_name = fiona.listlayers(mesh_baghdad)

# list the layer names of one copied daily mesh as a sanity check
rd_name = fiona.listlayers(data_root / 'addis-mesh-data' / 'addis-ababa-2023-02-25.gpkg')
rd1 = fiona.listlayers(data_root / 'baghdad-mesh-data' / 'baghdad-2023-02-25.gpkg')


In [54]:
lyr_addis_name

['grid_addis_ababa']

In [55]:
lyr_baghdad_name

['grid_badhdad']

In [56]:
rd_name

['grid_addis_ababa']

In [57]:
rd1

['grid_badhdad']

## 1 Download Data

In this chapter, NO2 pollution data is downloaded from [Google Earth Engine Sentinel 5P](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_NRTI_L3_NO2) at country level for both Ethiopia and Iraq.

Following related literature, we use **tropospheric_NO2_column_number_density** as the proxy for the NO2 concentration level.
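
The Earth Engine catalogue lists this band in mol/m², while many studies report NO2 columns in molecules/cm². The sketch below shows the unit conversion; it is not part of the original workflow, and the helper name `mol_m2_to_molec_cm2` is hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)
CM2_PER_M2 = 1.0e4        # square centimetres in one square metre


def mol_m2_to_molec_cm2(value_mol_m2: float) -> float:
    """Convert a column density from mol/m^2 to molecules/cm^2."""
    return value_mol_m2 * AVOGADRO / CM2_PER_M2


# Illustrative input value only, not taken from the downloaded data
col = mol_m2_to_molec_cm2(1.0e-4)
print(f"{col:.3e} molecules/cm^2")
```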

### 1.1 Custom Functions

Custom function to generate the list of dates covering the desired time period of NO2 data.

In [None]:
import pandas as pd
from typing import List

import ee
ee.Authenticate()  # On first use, log in with your own Google Earth Engine account
ee.Initialize()

# Function: generate desired time period of NO2 data  
def specific_date(start_date: str, end_date: str, time_resolution: str = 'D') -> List[str]:
    """
    Generate a list of dates within specified time period and resolution.

    Parameters:
    - start_date: str
        Start date, format: 'YYYY-MM-DD'.
    - end_date: str
        End date, format: 'YYYY-MM-DD'.
    - time_resolution: str
        Time resolution (e.g., 'D' for daily, 'W' for weekly, 'M' for monthly). Default is 'D'.
    
    Return:
    - dates(list): List of date strings marking the ends of each time segment, format: 'YYYY-MM-DD'.
    
    """
    dates = (
        pd.date_range(start_date, end_date, freq = time_resolution)
        .strftime('%Y-%m-%d')
        .tolist()
    )
    return dates
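
As a quick illustration of how the date list is consumed downstream, the sketch below pairs consecutive dates into export windows. `filterDate` treats the start as inclusive and the end as exclusive, so n dates give n-1 daily windows. The variable names are illustrative only.

```python
import pandas as pd

# Same construction as specific_date(), shown on a short range
dates = pd.date_range("2023-01-01", "2023-01-05", freq="D").strftime("%Y-%m-%d").tolist()

# Consecutive pairs: each (start, end) is one window [start, end)
windows = list(zip(dates[:-1], dates[1:]))
for start, end in windows:
    print(f"[{start}, {end})")

# The full study period: both end points are included in the date list
full = pd.date_range("2023-01-01", "2025-01-01", freq="D")
print(len(full), "dates ->", len(full) - 1, "daily windows")
```

Passing `'2025-01-01'` as the end date is what makes 2024-12-31 the start of the last window.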


Submit export tasks that save the data to Google Drive.

In [None]:
# Function: download NO2 data
def download_no2_country(country_name: str, dates: list):
    """
    Request NO2 data download from Earth Engine for a specified country and time period

    Parameters:
    - country_name: str
        Name of the target country. Must match the format used by Earth Engine.
    - dates: list
        Ordered list of date strings ('YYYY-MM-DD'). Each consecutive pair
        (dates[i], dates[i+1]) defines one export window.

    Return:
    - None. Submits one export request per window to Earth Engine.
        Exported files are saved under a folder named 'NO2_<country_name>' in the top-level Google Drive directory.
        Each exported .tif file is named using the start date of its window.
    """
    
    countries = ee.FeatureCollection('USDOS/LSIB_SIMPLE/2017')
    country = countries.filter(ee.Filter.eq('country_na', country_name)).geometry()

    n_dates = len(dates)

    for i in range(n_dates-1):

        date_start, date_end = dates[i], dates[i+1]

        no2 = (ee.ImageCollection('COPERNICUS/S5P/NRTI/L3_NO2')
            .select('tropospheric_NO2_column_number_density')
            .filterDate(date_start, date_end)
            .mean())

        task = ee.batch.Export.image.toDrive(
            image=no2,
            description=f'{country_name}_NO2_{date_start}_{date_end}',
            folder=f'NO2_{country_name}',
            fileNamePrefix=f'{country_name}_NO2_{date_start}',
            region=country,
            scale=1000,
            maxPixels=1e13
        )

        try:
            task.start()
            print(f'{country_name}: The export task for {date_start} has been submitted, please check the results in Google Drive.')
        except Exception as e:
            print(f'Failed to submit task: {e}')
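
With several hundred export tasks queued, it can help to summarise their states before looking in Google Drive. The sketch below is an assumption-laden helper, not part of the original notebook: `summarise_states` is a hypothetical name, and the sample dictionaries stand in for the status records that Earth Engine returns.

```python
from collections import Counter
from typing import Dict, Iterable


def summarise_states(statuses: Iterable[Dict]) -> Counter:
    """Count tasks per state; records without a 'state' key count as UNKNOWN."""
    return Counter(status.get("state", "UNKNOWN") for status in statuses)


# In an authenticated session the records could be collected with
# (assumes ee.Initialize() has already been called):
#   import ee
#   statuses = [task.status() for task in ee.batch.Task.list()]
# Hypothetical sample records used here so the sketch runs offline:
sample = [
    {"description": "Iraq_NO2_2023-01-01_2023-01-02", "state": "COMPLETED"},
    {"description": "Iraq_NO2_2023-01-02_2023-01-03", "state": "RUNNING"},
    {"description": "Iraq_NO2_2023-01-03_2023-01-04", "state": "COMPLETED"},
]
summary = summarise_states(sample)
print(dict(summary))
```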

### 1.2 Call and Download Data

In [None]:
dates = specific_date('2023-01-01', '2025-01-01')
len(dates) # 732 dates, i.e. 731 daily windows

# Download Ethiopia Data
download_no2_country('Ethiopia', dates)

# Download Iraq Data
download_no2_country('Iraq', dates)

## 2 Data Process Pipeline

This chapter processes the NO2 data downloaded in Chapter 1 through the following steps:

- **(1) Filling Missing Values**: Locate the missing values in each raster and fill them iteratively, using the **mean** of the neighbouring cells as the fill value.

- **(2) Clipping to Region**: Clip the data to the area of interest and output the filled raster.

- **(3) Aggregation**: Import the generated mesh and aggregate the raster to the mesh level.

Steps 2 and 3 are carried out by selecting and aggregating the data within the (synthesised) mesh grid.

Output at the end of the process:

- The processed data is exported in GeoParquet format (*.gpq*), an open, efficient and modern file format designed for storing geospatial vector data.

*Note: Since the official meshes for the two regions (Baghdad and Addis Ababa) have not yet been provided, we synthesised a mesh to establish the workflow.*
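
To make the aggregation idea concrete, the sketch below averages a small raster over non-overlapping square blocks while ignoring missing pixels. This is a simplified stand-in: the notebook itself aggregates to a hexagon mesh through `aggregate_data` in `src/aggregation.py`, and the function name `block_nanmean` is hypothetical.

```python
import numpy as np


def block_nanmean(raster: np.ndarray, block: int) -> np.ndarray:
    """Average `raster` over non-overlapping block x block cells, ignoring NaN."""
    rows, cols = raster.shape
    # Trim the edges so both dimensions divide evenly by the block size
    rows_t, cols_t = rows - rows % block, cols - cols % block
    trimmed = raster[:rows_t, :cols_t]
    # Reshape to (row_blocks, block, col_blocks, block), then average inside each block
    blocks = trimmed.reshape(rows_t // block, block, cols_t // block, block)
    return np.nanmean(blocks, axis=(1, 3))


raster = np.arange(16, dtype=float).reshape(4, 4)
raster[0, 0] = np.nan  # one missing pixel
coarse = block_nanmean(raster, block=2)
print(coarse)
```

The top-left block contains the missing pixel, so its value is the mean of the three remaining pixels rather than NaN.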

### 2.1 Fill Missing Data

In this section, missing data in each raster is filled using neighbouring values, and the filled raster is saved in a new, separate folder under the data directory: *Ethiopia-no2-filled* and *Iraq-no2-filled*.

#### 1) Define Custom Functions

Define functions to read and iteratively fill missing data.

In [None]:
import rasterio
import numpy as np
import matplotlib.pyplot as plt
import os

# Function: read tiff files
def read_tiff(filename):
    with rasterio.open(filename) as src:
        band = src.read(1)          # first band
        profile = src.profile       # metadata
        nodata_value = src.nodata   # nodata (missing) value, may be None

    # note: src is closed once the with-block exits; use band/profile/nodata_value instead
    return src, band, profile, nodata_value


# Iterative interpolation of missing data
from scipy.ndimage import generic_filter

# Function: use the mean of the neighbouring cells as the interpolation value
def fill_nan_with_mean(arr):
    center = arr[len(arr) // 2]
    if np.isnan(center):
        mean = np.nanmean(arr)
        return mean if not np.isnan(mean) else np.nan
    return center

# Function: iteratively interpolate missing values in single tiff file
def iterative_fill(data, max_iter=10, window_size=9):
    """
    Parameters:
    -----------
    data: 2D array with missing values stored as np.nan
    max_iter: maximum number of filling passes
    window_size: side length (in pixels) of the square moving window; use an odd number so the window is centred on the cell
    
    """
    filled = data.copy()
    for i in range(max_iter):
        prev_nan = np.isnan(filled).sum()
        filled = generic_filter(filled, function=fill_nan_with_mean, size=window_size, mode='nearest')
        new_nan = np.isnan(filled).sum()
        # print(f"Iteration {i+1}: remaining NaNs = {new_nan}")
        if new_nan == 0 or new_nan == prev_nan:
            break
    return filled

# Function: fill missing values in all the tiff files under same path
def fill_missing_data(country, data_tiff_path):
    """
    Parameters:
    ----------
    country: str
        - Name of the country, capitalised, such as 'Iraq'.

    data_tiff_path:
        - Path of the folder containing the tiff files to be processed.

    Output:
    --------
    Writes the filled tiff files to a newly created folder data/<country>-no2-filled
        
    """

    # Get the paths of the to be processed tiff files
    tiffs = [f for f in os.listdir(data_tiff_path) if f.lower().endswith('.tif')]
    abs_tiff_paths = [os.path.join(data_tiff_path, f) for f in tiffs]  # absolute paths
    n_task = len(tiffs)

    # Create a folder to save filled data
    output_dir = data_root / f'{country}-no2-filled'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        
    # Loop to process each tiff file
    for index, tiff_path in enumerate(abs_tiff_paths):

        # Get the tiff date from the file name
        date = tiffs[index].split('_')[2].split('.')[0]
        
        # Trace progress
        print(f"currently working on: {index+1}/{n_task}, {date}")

        # Skip files that are too small to contain valid data
        file_size_mb = os.path.getsize(tiff_path) / (1024 * 1024)
        if file_size_mb < 1:
            print(f"File size {file_size_mb:.2f}MB < 1MB, skipping {date} file.")
            continue

        # Read raster data
        src, band, profile, nodata_value = read_tiff(tiff_path)
        if nodata_value is not None:    # replace no_data as np.nan
            band = np.where(band == nodata_value, np.nan, band)

        # Missing Value replacement
        band_filled = iterative_fill(band, max_iter=10, window_size=9)

        # Save filled data
        output_file = output_dir / f'{country}_NO2_{date}_filled.tif'

        with rasterio.open(output_file, 'w', **profile) as dst:
            # restore the nodata value for cells that are still NaN (only when one is defined)
            if nodata_value is not None:
                band_filled = np.where(np.isnan(band_filled), nodata_value, band_filled)
            dst.write(band_filled.astype(profile['dtype']), 1)
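
To see why the fill has to be iterative, the sketch below applies the same idea to a toy 5x5 grid with a 3x3 hole and a 3x3 window. The helper name `fill_center_with_nanmean` is illustrative and mirrors `fill_nan_with_mean` above; the grid values are made up.

```python
import numpy as np
from scipy.ndimage import generic_filter


def fill_center_with_nanmean(values: np.ndarray) -> float:
    """Return the window centre, or the mean of its valid neighbours if it is NaN."""
    center = values[values.size // 2]
    if not np.isnan(center):
        return center
    valid = values[~np.isnan(values)]
    return valid.mean() if valid.size else np.nan


grid = np.ones((5, 5))
grid[1:4, 1:4] = np.nan  # 3x3 hole of missing values

nan_counts = [int(np.isnan(grid).sum())]
filled = grid.copy()
for _ in range(10):
    # Each pass only reaches NaN cells that have at least one valid neighbour
    filled = generic_filter(filled, fill_center_with_nanmean, size=3, mode="nearest")
    nan_counts.append(int(np.isnan(filled).sum()))
    if nan_counts[-1] in (0, nan_counts[-2]):  # done, or no further progress
        break

print(nan_counts)
```

The first pass fills the rim of the hole; the centre cell sees only NaN in its window and is filled on the second pass. A larger window, such as the 9x9 one used above, closes holes in fewer passes at the cost of stronger smoothing.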


#### 2) Fill Missing Data in Ethiopia

The following cell took over 8 hours to run, so comment it out once it has finished.

In [None]:
eth_tiff_path = data_root / 'Ethiopia_NO2'
fill_missing_data('Ethiopia', eth_tiff_path)

#### 3) Fill Missing Data in Iraq

Comment out the following cell once it has finished running.

In [None]:
# iraq_tiff_path = data_root / 'Iraq_NO2'
# fill_missing_data('Iraq', iraq_tiff_path)

#### 4) Demonstrate Purpose

Use the *Ethiopia_NO2_2018-07-12.tif* file as an example to show the effect of the iterative missing-data filling.

In [None]:
# Set working directory
import os
from pathlib import Path

# get demo working directory
demo_path = data_root / "demo-data"

# original image
src, band, profile, nodata_value = read_tiff(demo_path / 'Ethiopia_NO2_2018-07-12.tif')

plt.imshow(band, cmap='gray') # 'gray'
plt.title("Original Image")
plt.colorbar()
plt.show()

In [None]:
# filled image
band_filled = iterative_fill(band, max_iter=10, window_size=9)

plt.imshow(band_filled, cmap='gray')
plt.title("Filled Image")
plt.colorbar()
plt.show()

### 2.2 Aggregate Based on Mesh Grid

In [3]:
from aggregation import *

addis_meshes_path = data_root / 'addis-mesh-data'
baghdad_meshes_path = data_root / 'baghdad-mesh-data'

mesh_addis = data_root / "mesh-grid" / "grid_addis_ababa.gpkg"
mesh_baghdad = data_root / "mesh-grid" / "grid_baghdad.gpkg"

lyr_addis_name = fiona.listlayers(mesh_addis)[0]  # name of the first (only) layer
lyr_baghdad_name = fiona.listlayers(mesh_baghdad)[0]

#### 1) Aggregate Ethiopia - Addis Ababa

In [None]:
# Aggregate Ethiopia - Addis Ababa
eth_no2_filled_path = data_root / 'eth-no2-filled'
aggregate_data(
    data_tiff_path=eth_no2_filled_path, 
    mesh_path=addis_meshes_path, 
    layer_name=lyr_addis_name
    )

currently working on: 1/711, 2023-01-01
currently working on: 2/711, 2023-01-02
currently working on: 3/711, 2023-01-03
...

#### Demonstrate Purpose
Show the aggregated result for 2023-01-01.

#### 2) Aggregate Iraq - Baghdad

In [None]:
# Aggregate Iraq - Baghdad
iraq_no2_filled_path = data_root / 'iraq-no2-filled'
aggregate_data(
    data_tiff_path=iraq_no2_filled_path, 
    mesh_path=baghdad_meshes_path, 
    layer_name=lyr_baghdad_name
    )

#### Demonstrate Purpose

Show aggregated value in Addis Ababa.

In [None]:
# read the mesh and file path
demo_mesh = gpd.read_file(data_root / 'mesh-grid' / 'grid_addis_ababa.gpkg')
raster_path = data_root / 'demo-data' / 'Ethiopia_NO2_2018-07-12_filled.tif'

# use the mean as the representative value
stats = zonal_stats(demo_mesh, raster_path, stats=["mean"], nodata=np.nan)  # other alternatives: "std", "max", "min", "sum"
demo_mesh["mean"] = [s["mean"] for s in stats]

# visual
demo_mesh.plot(column="mean", edgecolor="grey", legend=True)
plt.title("Addis Ababa 5km$^2$ Hexagon Aggregated NO2")
plt.axis("off")
plt.show()

In [None]:
demo_mesh

Show aggregated value in Baghdad.

In [None]:
# read the mesh and file path
demo_mesh = gpd.read_file(data_root / 'mesh-grid' / 'grid_baghdad.gpkg')
raster_path = data_root / 'demo-data' / 'Iraq_NO2_2018-07-12_filled.tif'

# use the mean as the representative value
stats = zonal_stats(demo_mesh, raster_path, stats=["mean"], nodata=np.nan)  # other alternatives: "std", "max", "min", "sum"
demo_mesh["mean"] = [s["mean"] for s in stats]

# visual
demo_mesh.plot(column="mean", edgecolor='gray', legend=True)
plt.title("Baghdad 5km$^2$ Hexagon Aggregated NO2")
plt.axis("off")
plt.show()

## 3 Data Visualisation

This chapter generates animated figures that show how the NO2 distribution changes over time.

Note:

- Percentile clipping and contrast stretching are used in the colour mapping to improve the visual quality of the images.

- The dynamic distribution of NO2 is exported in GIF format.
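
The percentile clipping and contrast stretching mentioned above can be sketched as follows. This is only an illustration: the actual implementation lives in `src/animation.py`, which is not shown here, and the 2nd/98th percentile bounds and the name `percentile_stretch` are assumptions.

```python
import numpy as np


def percentile_stretch(values: np.ndarray, lower: float = 2.0, upper: float = 98.0) -> np.ndarray:
    """Clip to the [lower, upper] percentiles and rescale linearly to [0, 1]."""
    lo, hi = np.nanpercentile(values, [lower, upper])
    if hi <= lo:  # flat image: nothing to stretch
        return np.zeros_like(values, dtype=float)
    clipped = np.clip(values, lo, hi)  # NaN cells stay NaN
    return (clipped - lo) / (hi - lo)


band = np.arange(101, dtype=float)  # toy values 0..100
stretched = percentile_stretch(band)
print(stretched.min(), stretched.max())
```

Clipping at percentiles rather than at the raw minimum and maximum keeps a few extreme pixels from compressing the colour range of the rest of the image.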

In [None]:
# Import the animation functions from src/animation.py
from animation import tiff_2_gif, mesh_2_gif

#### 1) NO2 Distribution in Ethiopia

In [None]:
# no2_eth_tif_dir= data_root / 'Ethiopia_NO2_filled'  
# tiff_2_gif(no2_eth_tif_dir, output_path=data_root, output_name="ethiopia-no2-animation", fps = 8)

# total NO2 distribution animation
total_no2_eth_tif_dir = data_root / 'eth-total-no2-filled'  
tiff_2_gif(total_no2_eth_tif_dir, output_path=data_root, output_name="ethiopia-total-no2-animation", fps = 8)

#### 2) NO2 Distribution in Iraq

In [None]:
no2_iraq_tif_dir= data_root / 'iraq-no2-filled'  
tiff_2_gif(no2_iraq_tif_dir, output_path=data_root, output_name="iraq-no2-animation", fps = 8)

# total NO2 distribution animation
total_no2_iraq_tif_dir = data_root / 'iraq-total-no2-filled'  
tiff_2_gif(total_no2_iraq_tif_dir, output_path=data_root, output_name="iraq-total-no2-animation", fps = 8)