# Partial Report - Notebook 4 

## 1 Methodology

### 1.1 NO₂ Explanatory Model Architecture Selection

Given (1) the large sample size (on the order of a million samples) and (2) the non-linear relationships among features, **machine learning models are an appropriate choice for this task.**

Among them, **tree-based algorithms** were selected on the basis of both literature insights and empirical suitability, as they offer a good trade-off between **accuracy, interpretability, and computational efficiency.**
We restricted the modelling shortlist to two proven ensemble-tree families:

| Algorithm          | Architecture | Selection Rationale | Key Advantages |
|-----------         |--------------|------------------|-----------------------------|
| **Random Forest**  | *Parallel / bagging* &nbsp;→ each tree is trained on a bootstrap sample and votes independently.       | Fast to train, low variance, and effective at capturing non-linear feature interactions with minimal tuning.      | Provides a robust baseline and generates clear SHAP-based explanations for stakeholders. |
| **XGBoost**        | *Sequential / boosting* &nbsp;→ each new tree learns the residual errors of the current ensemble.            | Delivers state-of-the-art accuracy on structured data; built-in regularization helps prevent overfitting.         | Achieves lower RMSE; alignment with RF improves interpretability and model trust. |

*Note – Deep learning architectures were deliberately excluded at this stage due to time and hardware constraints.*  

> This dual-algorithm strategy enables a direct comparison between a **variance-reduction approach** (Random Forest) and a **bias-reduction approach** (XGBoost), allowing us to select the best-performing model for each city.


### 1.2 Tree-based Model Fine-tuning

Tree-based models consist of multiple decision trees and are well-suited for capturing nonlinear relationships and complex feature interactions. 
However, their performance is highly sensitive to structural hyperparameters, such as tree depth, learning rate, and the number of estimators. 
In this project, we focus on fine-tuning these models with the following goals:

- Improving generalization

- Reducing computational cost

- Maintaining interpretability

To achieve these objectives, we employed **Randomized Search** as the hyperparameter optimization strategy. 
Unlike exhaustive methods such as grid search, Randomized Search efficiently explores the hyperparameter space by randomly sampling parameter combinations. 
This approach is especially useful when computational resources are limited, as it can yield good results more quickly.
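
The search procedure described above can be sketched as follows. This is a minimal illustration using scikit-learn's `RandomizedSearchCV`, not the project's actual tuning script: the synthetic data, the parameter ranges, and the number of sampled combinations (`n_iter`) are assumptions chosen so the example runs quickly.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the grid-cell feature table (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Distributions to sample from; the ranges are illustrative assumptions.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 12),
    "max_features": uniform(0.3, 0.7),  # uniform on [0.3, 1.0]
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of randomly sampled combinations
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # scores are negated RMSE values
print(best_params, best_rmse)
```

The cost of the search is controlled by `n_iter` rather than by the size of the parameter grid, which is what makes this strategy practical on large datasets.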

### 1.3 SHapley Additive exPlanations (SHAP)

Because many machine learning algorithms behave as black boxes, it is often difficult to interpret the importance of their input features. 
**SHapley Additive exPlanations (SHAP)** provide a principled, theoretically grounded way to attribute a model's output to individual input features, improving model interpretability. 
The method is based on **Shapley values** from cooperative game theory, which fairly distribute the "payout" (here, the difference between a prediction and the average prediction) among all the "players" (i.e., features) that contribute to it.
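
To make the "fair payout" idea concrete, the sketch below computes exact Shapley values by brute force for a small toy model. It illustrates the definition only and is not the algorithm used by the `shap` library for tree models; `shapley_values`, `toy_model`, and the all-zero baseline are hypothetical names and choices made for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features absent from a coalition take their baseline value.
    The cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of every coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model with an interaction between features 0 and 1.
def toy_model(z):
    return 2.0 * z[0] + z[0] * z[1] + 3.0 * z[2]


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

For this toy model the attributions come out at approximately 3, 1, and 3: the interaction term `z[0] * z[1]` is split equally between the two features involved, and the three values add up to the gap between the prediction for `x` and the prediction for the baseline.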

One common way to visualize SHAP values is through a violin-like scatter plot (see figure below). 

<p align="center">
  <img src="../data/demo-data/Baghdad - SHAP Feature Impact - Random Forest Model.png" alt="SHAP Feature Impact" width="1000"/>
</p>

<p align="center">
  <em>SHAP Feature Impact Random Forest Results for Baghdad</em>
</p>


The components of the plot above are interpreted as follows:

| **Plot element**             | **Meaning**        |
| ------------------------- | --------       |
| **Dot**                   | One single datapoint, in this case, one grid-cell on one day.                                                                                      |
| **X-position**            | How much this feature increases (► right) or decreases (◄ left) the predicted output compared to the average output. |
| **Colour**                | The **raw feature value** for that dot (sample): <br> blue = low <br>red/pink = high.                                      |
| **Row order**             | Features are sorted from **most to least influential** by their average absolute SHAP value.                     |
| **Width of the "violin"** | **Spread of impacts**: <br>  wide = feature effect varies a lot across space/time; <br>narrow = stable effect.  |

## 2 Results
### 2.1 Best Model Configuration

After tuning both the Random Forest and XGBoost models, we identified the best-performing hyperparameter configuration found for each. The corresponding performance metrics, Root Mean Squared Error (RMSE) and R-squared (R²), are presented in the tables below.

**Model for Addis Ababa**

| Model Type      | Scaling       | RMSE              | R²        | Best Parameters |
|-----------------|----------     |----------------   |-----------|-----------------|
| Random Forest   | Unscaled      | 1.84221e-05       | 0.21495   |'n_estimators': 200, 'max_depth': 15, <br> 'max_features': 0.5, 'min_samples_leaf': 4|
| XGBoost         | Unscaled      | 1.85828e-05       | 0.20120   |'subsample': 1.0, 'min_child_weight': 3, <br> 'max_depth': 12, 'eta': 0.2, <br> 'colsample_bytree': 1.0|
| XGBoost*        | Scale X & y   | 1.84037e-05       | 0.21652   |'subsample': 0.7, 'min_child_weight': 1, <br> 'max_depth': 8, 'eta': 0.01, <br> 'colsample_bytree': 1.0|
| XGBoost         | Only Scale X  | 1.85835e-05       | 0.20114   |'subsample': 1.0, 'min_child_weight': 3, <br> 'max_depth': 12, 'eta': 0.2, <br> 'colsample_bytree': 0.7|

*Note: the model marked with an asterisk (\*) is the final model selected for explaining NO₂ concentration.*

**Model for Baghdad**

Due to the large dataset size in Baghdad (over four million samples) and limited computational resources, we reduced the complexity of the hyperparameter search. The resulting model performance is summarized below.

| Model Type      | Scaling       | RMSE              | R²        | Best Parameters |
|-----------------|----------     |----------------   |-----------|-----------------|
| Random Forest   | Unscaled      | 1.32521e-04       | 0.09575   |'n_estimators': 50, 'max_depth': 10, <br> 'max_features': 0.5, 'min_samples_leaf': 500|
| XGBoost         | Unscaled      | 1.31575e-04       | 0.10857   |'subsample': 0.7, 'min_child_weight': 5, <br> 'max_depth': 12, 'eta': 0.01, <br> 'colsample_bytree': 0.7|
| XGBoost*        | Scale X & y   | 1.31435e-04       | 0.11045   |'subsample': 0.7, 'min_child_weight': 5, <br> 'max_depth': 12, 'eta': 0.01, <br> 'colsample_bytree': 0.7|
| XGBoost         | Only Scale X  | 1.31583e-04       | 0.10845   |'subsample': 0.7, 'min_child_weight': 5, <br> 'max_depth': 12, 'eta': 0.01, <br> 'colsample_bytree': 0.7|

*Note: the model marked with an asterisk (\*) is the final model selected for explaining NO₂ concentration.*

### 2.2 NO₂ Concentration Drivers

To better understand the key drivers of NO₂ concentration dynamics in the study regions, we carried out an explanatory analysis of NO₂ levels using Random Forest (RF) and XGBoost (XGB) models. 
Model interpretation was conducted via SHAP (SHapley Additive exPlanations) values. 

The target variable is grid-level NO₂ concentration, with values on the order of 10⁻⁵. 
For feature preprocessing, the RF model utilized raw input data, while all features in the XGB model were normalized to the [0, 1] range. 
This decision was based on comparative performance evaluation across different preprocessing strategies, including unscaled inputs, scaling only the features, and scaling both features and the target variable.
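
The target-scaling part of the comparison can be sketched as below. This is a minimal illustration under two assumptions the report does not state explicitly: that min-max scaling was also used for the target, and that errors are reported after mapping predictions back to the original units. The helper names and the synthetic values are hypothetical.

```python
import numpy as np


def minmax_fit(a):
    """Return (min, range) used for min-max scaling along axis 0."""
    lo = a.min(axis=0)
    span = a.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # guard against constant columns
    return lo, span


def minmax_apply(a, lo, span):
    return (a - lo) / span


def minmax_invert(a_scaled, lo, span):
    return a_scaled * span + lo


rng = np.random.default_rng(0)
y_train = rng.uniform(1e-5, 9e-5, size=100)  # hypothetical NO2 values

y_lo, y_span = minmax_fit(y_train)
y_scaled = minmax_apply(y_train, y_lo, y_span)

# Stand-in for predictions made by a model trained on the scaled target.
pred_scaled = np.clip(y_scaled + rng.normal(scale=0.05, size=100), 0, 1)

# Map predictions back so the RMSE is comparable with the unscaled runs.
pred = minmax_invert(pred_scaled, y_lo, y_span)
rmse = float(np.sqrt(np.mean((pred - y_train) ** 2)))
print(rmse)
```

Reporting the error in original units is what allows the scaled and unscaled rows of a results table to be compared directly.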

#### Addis Ababa

The SHAP violin plots of the two best models are shown below.

<p align="center">
  <img src="../data/demo-data/Addis Ababa - SHAP Feature Impact - Random Forest Model.png" alt="Addis Ababa - SHAP Feature Impact" width="900"/>
</p>

<p align="center">
  <em>Addis Ababa - SHAP Feature Impact - Random Forest Model</em>
</p>

<p align="center">
  <img src="../data/demo-data/Addis Ababa - SHAP Feature Impact - XGBoost (Scaled).png" alt="Addis Ababa - SHAP Feature Impact" width="900"/>
</p>

<p align="center">
  <em>Addis Ababa - SHAP Feature Impact - XGBoost (Scaled)</em>
</p>


Both models consistently identified the lagged NO₂ concentration in neighboring grids (`no2_neighbor_lag1`) as the most influential predictor, highlighting the significant role of spatial diffusion in pollutant dynamics. 
Additionally, features reflecting human activity levels - such as `cloud_category`, `pop_sum_m`, and `NTL_mean` - ranked highly in both models, underscoring the non-negligible contribution of anthropogenic factors to air pollution levels.

Of particular interest is the strong influence of nighttime light intensity (`NTL_mean`) observed in both models. 
This variable commonly serves as a proxy for night-time economic activity and population aggregation, capturing composite effects of commercial vibrancy, traffic density, and industrial lighting. 
SHAP analysis reveals that **areas with higher NTL values (the reddish points in the SHAP plots) are positively associated with elevated NO₂ concentrations**, indicating that emission sources such as night-time traffic and industrial activity may play a critical role in the spatio-temporal distribution of NO₂.

When it comes to meteorological factors, the average temperature (`temp_mean`) stands out as an important variable in the XGB model, but it shows little influence in the RF model. 
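This difference is unlikely to be caused by feature normalization alone: tree splits depend only on the ordering of feature values, and the unscaled and feature-scaled XGBoost runs in Section 2.1 perform almost identically. A more plausible explanation is the different way the two algorithms build their trees, together with the target scaling used in the final XGB model. 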
It may also reflect the more complex and nonlinear role temperature plays in NO₂ behaviour. 
On the one hand, higher temperatures can boost air circulation and speed up chemical reactions that involve NO₂. 
On the other hand, in some situations, high temperatures can increase the formation of ground-level ozone, which may reduce NO₂ levels. 
This kind of two-sided effect might be better captured by the XGB model, which is more sensitive to subtle patterns in the data.

In addition, road-related variables closely linked to transportation activity (e.g., `road_residential_len`, `road_len`, `road_primary_len`) demonstrate medium-to-high importance in both models. 
Road length not only reflects the density of transportation infrastructure but also indirectly indicates the frequency of vehicular movement and emission sources. 
In the RF model in particular, road-related features exhibit stronger SHAP responses, suggesting a stable contribution to NO₂ levels; because tree splits are invariant to feature rescaling, this is more likely a property of the RF fit than of the larger raw value scales.

**In summary, both models reveal the multifactorial drivers of NO₂ concentration variability, including spatial lag effects, night-time economic activity, meteorological conditions, and transportation infrastructure.** 
While the exact rankings of feature importance differ slightly between models, the core influential variables remain consistent. 
These findings offer actionable insights for urban air pollution mitigation, suggesting that policy efforts should focus on controlling emissions in high-NTL areas, regulating night-time economic activities, and fostering regional coordination in response to spatial diffusion of pollutants.

#### Baghdad

The SHAP violin plots of the two best models are shown below.

<p align="center">
  <img src="../data/demo-data/Baghdad - SHAP Feature Impact - Random Forest Model.png" alt="Baghdad - SHAP Feature Impact" width="900"/>
</p>

<p align="center">
  <em>Baghdad - SHAP Feature Impact - Random Forest Model</em>
</p>

<p align="center">
  <img src="../data/demo-data/Baghdad - SHAP Feature Impact - XGBoost (Scaled).png" alt="Baghdad - SHAP Feature Impact" width="900"/>
</p>

<p align="center">
  <em>Baghdad - SHAP Feature Impact - XGBoost (Scaled)</em>
</p>

Similarly to Addis Ababa, both models consistently highlight spatial lag effects, confirming the critical role of regional pollutant spillover. 
Indicators of human activity, such as nighttime light intensity (`NTL_mean`), also show strong and consistent influence across models, reflecting the contribution of nocturnal economic and transportation activity to urban NO₂ emissions. 
These shared findings underscore a common underlying structure in both models, where spatial dependence and anthropogenic factors are central to explaining pollutant variation.

In the case of Baghdad, **the XGBoost model further emphasizes the role of dynamic environmental factors**. 
In particular, mean temperature (`temp_mean`) emerges as a highly influential feature, likely due to its role in modulating vertical mixing, atmospheric stability, and the photochemical transformation of NO₂. 
Elevated temperatures can reduce pollutant dispersion and intensify local accumulation of NO₂, especially under stagnant meteorological conditions. 
The model also ranks the Traffic Congestion Index (`TCI`) among the top predictors, capturing the real-time impact of mobility bottlenecks on transport emissions. 
These results suggest that the XGBoost model is especially sensitive to temporally variable features, which is consistent with boosting's ability to capture complex nonlinear interactions by fitting successive trees to residual errors.

By contrast, **the Random Forest model shows a tendency to prioritize structural and infrastructural features** - such as `road_residential_len` and total road length `road_len` - over meteorological or dynamic urban variables. 
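This is unlikely to be a direct effect of the raw feature scales, because tree splits depend only on the ordering of feature values; it more plausibly reflects the shallow, heavily regularized RF configuration used for Baghdad (`max_depth` of 10, `min_samples_leaf` of 500), which favours features with broad, stable spatial gradients. 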
As a result, Random Forest may underrepresent the relative impact of high-frequency or small-scale fluctuations (e.g., TCI or temperature), and instead overemphasize more stable, cumulative spatial attributes. 
Nevertheless, RF still captures the importance of key variables such as `NTL_mean` and `no2_neighbor_lag1`, suggesting broad alignment in the most essential predictors.

The divergence in feature prioritization between the two models reveals their complementary strengths: 
**XGBoost appears better suited for capturing short-term, high-variability drivers of NO₂ (e.g., meteorology and traffic dynamics), while Random Forest offers a more stable representation of long-term or structural determinants (e.g., built environment and infrastructure).**
These differences are not only methodological - influenced by model architecture and preprocessing pipelines - but also conceptual, highlighting how different modeling approaches may uncover distinct but meaningful layers of insight in urban air pollution dynamics.

**Overall, the results from both models illustrate a multifactorial landscape driving NO₂ variability in Baghdad, shaped by spatial spillover, human mobility, and atmospheric regulation.** 


# Methodology from other notebooks

### Methodology - 1
#### Data Process Pipeline

This notebook processes the air pollution data downloaded in *appendix_preparation.ipynb* through the following steps:

- **(1) Filling Missing Values**: Identify missing values in each raster and fill them iteratively, using the **mean** of the neighbouring pixels as the replacement value.

- **(2) Clipping to Region**: Clip the data to the area of interest and output the filled raster.

- **(3) Aggregation**: Import the generated mesh and aggregate the raster to the mesh level.

Steps 2 and 3 are implemented by selecting and aggregating the data within the mesh grid. 


#### Iterative Interpolation  
To handle missing values in satellite-based NO₂ pollution data, we developed an iterative gap-filling method using spatial neighborhood statistics. 
The approach replaces missing pixels (NaN or NoData) with the **mean** of surrounding valid pixels within a square window (typically 9×9). 
This process is applied using a sliding window filter and repeated iteratively until most gaps are filled or no further changes occur.
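
The core of this procedure can be sketched with NumPy as follows. This is a simplified illustration rather than the notebook's implementation: `fill_gaps` is a hypothetical name, NoData is assumed to be already encoded as NaN, and GeoTIFF reading and writing are left out.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def fill_gaps(raster, window=9, max_iter=10):
    """Iteratively replace NaN pixels with the mean of valid neighbours."""
    filled = np.asarray(raster, dtype=float).copy()
    pad = window // 2
    for _ in range(max_iter):
        missing = np.isnan(filled)
        if not missing.any():
            break
        # Pad with NaN so that border pixels also get a full-size window.
        padded = np.pad(filled, pad, constant_values=np.nan)
        windows = sliding_window_view(padded, (window, window))
        valid = ~np.isnan(windows)
        counts = valid.sum(axis=(2, 3))
        sums = np.where(valid, windows, 0.0).sum(axis=(2, 3))
        can_fill = missing & (counts > 0)
        if not can_fill.any():
            break  # remaining gaps have no valid neighbours at all
        # Only originally missing pixels change; observed values are kept.
        filled[can_fill] = sums[can_fill] / counts[can_fill]
    return filled


demo = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])
filled_demo = fill_gaps(demo, window=3)
print(filled_demo[1, 1])
```

Each pass only fills pixels that have at least one valid neighbour inside the window, so large gaps are closed from their edges inward over several iterations.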

The method is robust and preserves local spatial patterns. 
It avoids excessive smoothing by only filling values based on nearby observations. 
To improve efficiency, corrupted or empty files are skipped automatically. 
Each filled image is saved as a new GeoTIFF file, maintaining original metadata and georeferencing.

This approach produces continuous, high-quality NO₂ surfaces suitable for visualization, time-series analysis, and spatial modeling.


### Methodology - 4

We employ both univariate spatio-temporal analysis and multivariate correlation analysis to comprehensively understand NO₂ pollution dynamics.

#### Univariate Spatio-temporal Analysis

For each selected variable (e.g., NO₂ concentration), we perform detailed temporal and spatial analyses to characterize its behavior over time and across locations. 
Temporal patterns are examined using time series methods such as **Partial Autocorrelation Function (PACF)** to identify persistence and lag effects. 
Spatial patterns are assessed through spatial autocorrelation metrics like **Local Moran’s I**, revealing clustering and hotspots at the mesh-grid level.

#### Multivariate Correlation and Synergistic Analysis

The main objective of this notebook is to analyze the spatio-temporal patterns of NO₂ pollution and its relationships with urban features through both single-variable and multivariate exploratory analyses to support modeling and decision-making.

Together, these analyses provide a robust framework to capture both isolated and combined influences shaping urban air pollution patterns.


#### Temporal Autocorrelation (PACF)

We used the **Partial Autocorrelation Function (PACF)** to evaluate how past NO₂ values influence current concentrations within each spatial cell. 
The PACF measures the direct correlation between a time series and its lagged values after removing the influence of all intermediate lags.

For example:

- The PACF at lag 1 shows how today's NO₂ is directly related to yesterday's value.

- The PACF at lag 3 shows how today's NO₂ is related to the value 3 days ago, after accounting for the influence of lags 1 and 2.
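
The definition above can be illustrated with a small NumPy sketch that estimates the PACF through successive autoregressive fits. It is one of several possible estimators and is not necessarily the one used in the notebook (for instance, `statsmodels` provides a `pacf` function); `pacf_ols` and the synthetic AR(1) series are hypothetical.

```python
import numpy as np


def pacf_ols(series, nlags):
    """PACF via successive AR(k) least-squares fits.

    The partial autocorrelation at lag k is the coefficient on the
    lag-k term when the series is regressed on its first k lags.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    values = [1.0]  # lag 0 is 1 by convention
    for k in range(1, nlags + 1):
        # Columns are x[t-1], ..., x[t-k]; the target is x[t].
        lags = np.column_stack([x[k - j: len(x) - j] for j in range(1, k + 1)])
        target = x[k:]
        coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
        values.append(float(coef[-1]))
    return np.array(values)


# Synthetic daily series in which only yesterday's value matters directly.
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
no2 = np.zeros(2000)
for t in range(1, 2000):
    no2[t] = 0.7 * no2[t - 1] + noise[t]

partial = pacf_ols(no2, nlags=3)
print(partial)
```

For a series generated this way, the lag-1 value should be close to 0.7 while lags 2 and 3 should be close to zero, which is the cut-off pattern one looks for when choosing the autoregressive order.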

**PACF Advantages:**

- Interpretability: PACF helps us pinpoint how many previous days have a meaningful direct influence on the current NO₂ level.

- Modeling insight: In time series modeling (e.g., ARIMA), PACF is used to identify the appropriate number of lag terms (p) for autoregressive models.

- Environmental understanding: Identifying strong temporal autocorrelation implies persistence in pollution, which could be due to meteorological stability, continuous emissions, or local topography.

#### Spatial Autocorrelation (Local Moran’s I)
We employed **Local Moran’s I** to detect spatial clusters and outliers in NO₂ concentrations across the study area. 
This method evaluates **whether a grid cell's value is significantly similar to or different from its neighbors**.

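- Positive values indicate clusters of similar values (e.g., high-high pollution hotspots or low-low cold spots).
- Negative values reveal spatial outliers, i.e., cells whose values differ markedly from those of their neighbors.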
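
A minimal sketch of the statistic on a regular grid is given below. It assumes rook (4-neighbour) contiguity with row-standardised weights and omits the permutation-based significance test; the weighting scheme actually used in the notebook is not stated here, and `local_morans_i` is a hypothetical helper.

```python
import numpy as np


def local_morans_i(grid):
    """Local Moran's I on a regular grid with rook (4-neighbour) weights.

    Weights are row-standardised, so each cell is compared with the
    mean deviation of its available neighbours. The grid needs at
    least two cells so that every cell has a neighbour.
    """
    grid = np.asarray(grid, dtype=float)
    z = grid - grid.mean()
    m2 = (z ** 2).mean()
    padded = np.pad(z, 1, constant_values=0.0)
    present = np.pad(np.ones_like(z), 1, constant_values=0.0)
    neighbour_sum = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                     + padded[1:-1, :-2] + padded[1:-1, 2:])
    neighbour_cnt = (present[:-2, 1:-1] + present[2:, 1:-1]
                     + present[1:-1, :-2] + present[1:-1, 2:])
    spatial_lag = neighbour_sum / neighbour_cnt
    return z * spatial_lag / m2


# Hypothetical field: a 2x2 hotspot in one corner, an isolated spike in another.
field = np.ones((5, 5))
field[0:2, 0:2] = 10.0
field[4, 4] = 10.0

moran = local_morans_i(field)
print(moran[0, 0], moran[4, 4])
```

In this example the corner of the hotspot receives a positive value, because it is high and surrounded by high cells, while the isolated spike receives a negative value, because it is high and surrounded by low cells.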

#### Correlation Matrix Analysis
To explore interdependencies between different features, we computed correlation matrices at the spatial grid-cell level. 
This analysis helps identify which factors are most strongly associated with NO₂ concentrations, providing insight into potential drivers of urban air pollution.

We calculated pairwise **Pearson correlation coefficients** among selected variables. The correlation matrix quantifies linear relationships, where values range from -1 (perfect negative) to +1 (perfect positive). 

- Strong positive correlations suggest shared spatial patterns or common underlying causes. 
- Strong negative correlations point to inverse relationships, while coefficients near zero suggest no linear association.
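
The computation can be sketched as follows. The synthetic columns only borrow feature names that appear in this report, the relationships between them are invented for illustration, and the 0.7 cut-off for flagging collinear pairs is an assumed threshold rather than one taken from the notebook.

```python
import numpy as np

# Synthetic grid-cell table (hypothetical values, invented relationships).
rng = np.random.default_rng(0)
n = 500
ntl = rng.normal(size=n)
road_len = 0.8 * ntl + rng.normal(scale=0.6, size=n)
no2 = 0.5 * ntl + rng.normal(scale=0.8, size=n)

names = ["no2", "NTL_mean", "road_len"]
data = np.column_stack([no2, ntl, road_len])

# Pearson correlation matrix; columns are variables, rows are grid cells.
corr = np.corrcoef(data, rowvar=False)

# Flag strongly correlated pairs as candidates for multicollinearity.
threshold = 0.7  # assumed cut-off for illustration
pairs = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(np.round(corr, 2))
print(pairs)
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle needs to be scanned when looking for collinear pairs.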

The resulting matrix is visualized using heatmaps to facilitate interpretation. This allows us to:

- Identify variables with the strongest associations to NO₂

- Detect multicollinearity, which informs feature selection in modeling

- Understand cross-variable interactions that shape pollution distribution

This correlation analysis supports subsequent explanatory modeling and helps prioritize key predictors for intervention or policy consideration.