# PM2.5 concentration spatial distribution estimation in China 

## 1 Introduction

### 1.1 Background


China, the largest developing country in the world, has faced increasingly severe air quality problems as industrialization and urbanization have advanced. According to the “2017 China Environmental Situation Bulletin” released by the Ministry of Ecology and Environment, 239 of the 338 cities at the prefecture level and above exceeded the national ambient air quality standards, a share of more than 70%. Poor air quality affects daily travel and physical health, constrains sustainable economic development, and has become a major concern for both the public and the government.

PM2.5 refers to particulate matter with an aerodynamic diameter of no more than 2.5 micrometers and is one of the main indicators used to evaluate air quality. A comprehensive understanding of the spatial distribution of PM2.5 concentration, which reflects the spatial process and environmental behavior of atmospheric pollution, is valuable for pollution monitoring and early warning, comprehensive treatment, the protection of human health, and sustainable social development.

By the end of 2017, the China National Environmental Monitoring Center had established over 400 ground-level air quality monitoring stations and released hourly air quality data, including PM2.5, providing accurate and reliable real-time measurements. However, because the ground stations are unevenly distributed and cover only a small part of the country, it is difficult to analyze and mine the spatiotemporal patterns in their data effectively. Unlike ground monitoring, satellite remote sensing yields atmospheric datasets with high spatial coverage, such as aerosol optical depth (AOD). Numerous studies have shown a strong correlation between AOD and PM2.5 concentration, so modeling the spatial regression relationship between PM2.5 concentration and factors such as remotely sensed AOD offers an effective way to obtain the spatial distribution of PM2.5 across the entire study area.

Building on the geographically weighted idea behind geographically weighted regression (GWR), Wu SenSen combined ordinary linear regression (OLR) with a neural network to propose the Geographically Neural Network Weighted Regression (GNNWR) model. By exploiting the learning ability of neural networks, the model handles both the spatial heterogeneity and the complex nonlinear characteristics of regression relationships, and it achieves better fitting accuracy and prediction performance than models such as OLR and GWR. The purpose of this case is to build a GNNWR-based spatial estimation model for PM2.5 concentration that accurately fits the spatial heterogeneity and nonlinearity of the PM2.5 regression relationship, and then to obtain a high-precision, physically reasonable spatial distribution of PM2.5 concentration in China.

![graph](https://pub.mdpi-res.com/remotesensing/remotesensing-13-01979/article_deploy/html/images/remotesensing-13-01979-ag.png?1621475926)  

### 1.2 Data description    

Many studies have shown that integrating meteorological conditions such as temperature, precipitation, wind speed, and wind direction, together with surface elevation, can further improve the accuracy of PM2.5 spatial estimation. In this case, in addition to AOD as an auxiliary factor, temperature (TEMP), precipitation (TP), wind speed (WS), wind direction (WD), and surface elevation (DEM) are used as input variables for the model. The study works at an annual time scale, using 2017 averages:

(1) PM2.5 monitoring site data. Hourly PM2.5 concentration observations from January 1, 2017 to December 31, 2017 were obtained from the China National Environmental Monitoring Center. PM2.5 concentrations were measured using the tapered element oscillating microbalance (TEOM) method or the beta attenuation method, following national standard GB3095-2012. The PM2.5 data were then averaged over the full year.

![Image](https://www.mdpi.com/remotesensing/remotesensing-13-01979/article_deploy/html/images/remotesensing-13-01979-g001.png)  
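The annual averaging step can be illustrated with a small pandas sketch. The column names (`station_id`, `time`, `pm25`) and the values are hypothetical and do not come from the monitoring archive; the point is only that hourly records are grouped by station and averaged, with missing readings skipped.

```python
import pandas as pd

# Hypothetical hourly records; column names and values are illustrative only.
hourly = pd.DataFrame({
    "station_id": ["1001A", "1001A", "1001A", "1002A", "1002A"],
    "time": pd.to_datetime([
        "2017-01-01 00:00", "2017-06-01 12:00", "2017-12-31 23:00",
        "2017-01-01 00:00", "2017-12-31 23:00",
    ]),
    "pm25": [80.0, 40.0, 60.0, 30.0, None],  # None marks a missing reading
})

# Keep 2017 only, then average per station; mean() skips missing values.
in_2017 = hourly[hourly["time"].dt.year == 2017]
annual_mean = in_2017.groupby("station_id")["pm25"].mean()
print(annual_mean)
```

In a real workflow one would probably also require a minimum number of valid hours per station before trusting its annual mean.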

(2) Aerosol data. Aerosol data were obtained from the LAADS website and include the Terra and Aqua Dark Target retrieval products at 3 km resolution (MOD04_3K and MYD04_3K) as well as the Deep Blue retrieval products at 10 km resolution (MOD04_L2 and MYD04_L2). In this case, the 3 km AOD products are the main data source for PM2.5 estimation. Where the 3 km data have missing values, the 10 km product is resampled to the matching resolution and substituted wherever possible, to keep the AOD data reliable.
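The gap-filling rule described above can be sketched with NumPy. The arrays are made-up values, and the 10 km product is assumed to have been resampled onto the 3 km grid beforehand; the resampling itself is not shown.

```python
import numpy as np

# Hypothetical AOD grids on the same 3 km grid; NaN marks a missing retrieval.
aod_3km = np.array([[0.52, np.nan], [np.nan, 0.61]])
# 10 km product, assumed to be already resampled onto the 3 km grid.
aod_10km_resampled = np.array([[0.50, 0.47], [np.nan, 0.60]])

# Prefer the 3 km value; fall back to the resampled 10 km value where it is missing.
aod_filled = np.where(np.isnan(aod_3km), aod_10km_resampled, aod_3km)
print(aod_filled)
```

Cells that are missing in both products stay NaN, so they still have to be masked or handled separately downstream.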

(3) DEM data. DEM data were obtained from NOAA's ETOPO1 global surface elevation model at a resolution of 1 arc-minute.

(4) Meteorological data. Temperature, precipitation, wind speed, and wind direction were obtained from the ERA5 global climate reanalysis provided by ECMWF, as hourly gridded data at a resolution of 0.5 degrees.
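ERA5 distributes 10 m wind as eastward (`u10`) and northward (`v10`) components, so wind speed and direction have to be derived from them. The sketch below shows one common way to do this, using the meteorological convention for direction. How the `w10` and `d10` columns of this dataset were actually computed is not stated in the source, so treat the function as an assumption.

```python
import math

def wind_speed_direction(u, v):
    """Return (speed, direction) from eastward (u) and northward (v) wind components.

    Direction follows the meteorological convention: the bearing the wind
    blows FROM, in degrees clockwise from north.
    """
    speed = math.hypot(u, v)
    direction = (180.0 + math.degrees(math.atan2(u, v))) % 360.0
    return speed, direction

# u10 / v10 values taken from the first row of the sample table in section 3.1.
speed, direction = wind_speed_direction(0.425366, 0.170262)
print(f"speed={speed:.3f} m/s, direction={direction:.1f} deg")
```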


### 1.3 Model Introduction  

Following a geographically weighted idea similar to GWR, the GNNWR model treats the spatial heterogeneity of the regression relationship as location-dependent levels of spatial nonstationarity acting on the global "OLR regression relationship". In the PM2.5 spatial estimation experiment of this case, the model structure of GNNWR is therefore defined as follows:
 
![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059648746_1.png)  

In this equation, $(u_i, v_i)$ are the spatial coordinates of the $i$-th sample point, and $\beta = (\beta_0, \beta_1, \ldots, \beta_6)$ are the regression coefficients of the OLR model, reflecting the average level of the PM2.5 regression relationship over the entire region. The OLR coefficient estimate is written in matrix form as follows:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059665465_2.png)  

where:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059673642_3.png)  

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694003342595_1.png)  
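To make the structure in this section concrete, here is a minimal NumPy sketch under stated assumptions: the OLR coefficients are obtained by least squares, and each location's prediction is the sum of spatial weight times OLR coefficient times predictor value. The data are synthetic, and the weight matrix `W` is a placeholder of ones, whereas in GNNWR it would be produced by the neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 6  # 50 samples and 6 predictors (illustrative sizes)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
true_beta = np.arange(1, k + 2, dtype=float)
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# OLR estimate: least-squares solution, equivalent to (X^T X)^{-1} X^T y.
beta_olr, *_ = np.linalg.lstsq(X, y, rcond=None)

# Placeholder spatial weights: all ones reproduce plain OLR.
# In GNNWR each row would hold the network output for one location.
W = np.ones((n, k + 1))
y_hat = np.sum(W * beta_olr * X, axis=1)
print(beta_olr.round(2))
```

With weights fixed at one the model collapses to OLR, which is a convenient sanity check; location-specific weights are what let the coefficients vary across space.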

### 1.4 References

Du Z H, Wu S S, Wang Z Y, et al. Estimating ground-level PM2.5 concentrations across China using geographically neural network weighted regression[J]. Journal of Geo-information Science, 2020, 22(1): 122-135.

[DOI:10.12082/dqxxkx.2020.190533](https://www.researching.cn/ArticlePdf/m40005/2020/22/1/01000122.pdf)  

### 1.5 Main Content  


1. Model Training 
2. Result Storage, Loading, and Visualization
3. Estimation 

![Model diagram](https://www.mdpi.com/remotesensing/remotesensing-13-01979/article_deploy/html/images/remotesensing-13-01979-g002.png)  


## 2 Preparation

Import Necessary Packages

In [1]:
from gnnwr import models, datasets, utils
import pandas as pd
import numpy as np
import folium
import torch.nn as nn
from sklearn.metrics import r2_score as r2
import matplotlib.pyplot as plt

## 3 Model Training

### 3.1 Import Training Data

In [2]:
data = pd.read_csv('../data/pm25_data.csv')
data.head(5)

|  | station_id | lng | lat | date | PM2_5 | row_index | col_index | proj_x | proj_y | dem | ... | t2m | sp | tp | blh | e | r | u10 | v10 | aod_sat | ndvi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1001A | 116.366 | 39.8673 | 20170601 | 54.733894 | 2201 | 6867 | 1650847.552 | 1370268.366 | 46 | ... | 284.561066 | 100809.2734 | 0.001006 | 134.995636 | -7e-06 | 46.315975 | 0.425366 | 0.170262 | 0.870967 | 2401 |
| 1 | 1002A | 116.17 | 40.2865 | 20170601 | 48.080737 | 2134 | 6835 | 1625003.973 | 1416959.964 | 420 | ... | 282.907684 | 97125.08594 | 0.001044 | 157.77597 | -6e-06 | 53.605503 | 0.211734 | -0.676848 | 0.71208 | 5255 |
| 2 | 1003A | 116.434 | 39.9522 | 20170601 | 54.898592 | 2188 | 6877 | 1653776.71 | 1381524.305 | 48 | ... | 284.492249 | 100830.9688 | 0.001002 | 129.971298 | -7e-06 | 45.537464 | 0.266666 | 0.069172 | 0.875811 | 2609 |
| 3 | 1004A | 116.434 | 39.8745 | 20170601 | 52.266382 | 2200 | 6877 | 1655828.045 | 1372270.098 | 45 | ... | 284.6362 | 100936.8047 | 0.00101 | 138.793961 | -7e-06 | 45.387913 | 0.299403 | 0.22795 | 0.869679 | 2420 |
| 4 | 1005A | 116.473 | 39.9716 | 20170601 | 53.189076 | 2185 | 6884 | 1656224.681 | 1384491.842 | 40 | ... | 284.506561 | 100880.1797 | 0.001019 | 130.520599 | -7e-06 | 44.790119 | 0.169121 | 0.079546 | 0.873232 | 3296 |


### 3.2 Partition Datasets

In [3]:
train_dataset, val_dataset, test_dataset = datasets.init_dataset(data=data,
                                                                 test_ratio=0.15,
                                                                 valid_ratio=0.15,
                                                                 x_column=[
                                                                     'dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp'],
                                                                 y_column=[
                                                                     'PM2_5'],
                                                                 spatial_column=[
                                                                     'proj_x', 'proj_y'],
                                                                 sample_seed=23,
                                                                 batch_size=64)
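The call above reserves 15% of the samples for testing and 15% for validation, with a fixed seed for reproducibility. The following is only a generic sketch of such a seeded three-way split, not how `datasets.init_dataset` is implemented internally. The function name `split_indices` is hypothetical, and 1407 is the number of rows implied by the id range (0 to 1406) in the regression result table of section 3.5.

```python
import numpy as np

def split_indices(n_samples, test_ratio=0.15, valid_ratio=0.15, seed=23):
    """Shuffle sample indices and cut them into train / valid / test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(n_samples * test_ratio)
    n_valid = int(n_samples * valid_ratio)
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = split_indices(1407)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index sets are disjoint and together cover every sample, which is the property that keeps the validation and test scores honest.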

### 3.3 Initialize GNNWR Model

In [4]:
gnnwr = models.GNNWR(train_dataset=train_dataset,
                     valid_dataset=val_dataset,
                     test_dataset=test_dataset,
                     dense_layers=[512, 256, 128],
                     start_lr=0.2,
                     optimizer="Adadelta",
                     activate_func=nn.PReLU(init=0.1),
                     model_name="GNNWR_PM25",
                     write_path="../demo_result/gnnwr/runs/",
                     model_save_path="../demo_result/gnnwr/models/",
                     log_path="../demo_result/gnnwr/logs/",
                     )
gnnwr.add_graph()

Add Graph Successfully
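For intuition about what `dense_layers=[512, 256, 128]` describes, the sketch below runs an untrained fully connected network in NumPy. It assumes, following the general GNNWR idea, that the network input for a location is its vector of distances to the training points and that the output is one weight per regression coefficient (7 here: intercept plus six predictors). The parameters are random, dropout and normalization are omitted, and the function `swnn_forward` is hypothetical, so this is not the library's implementation.

```python
import numpy as np

def prelu(x, a=0.1):
    """PReLU activation with a fixed slope `a` for negative inputs."""
    return np.where(x > 0, x, a * x)

def swnn_forward(distances, layer_sizes=(512, 256, 128), n_out=7, seed=0):
    """Map a batch of distance vectors to per-coefficient weights (untrained sketch)."""
    rng = np.random.default_rng(seed)
    h = distances
    sizes = [distances.shape[1], *layer_sizes, n_out]
    for i, (n_in, n_next) in enumerate(zip(sizes[:-1], sizes[1:])):
        # Random parameters stand in for trained ones.
        weight = rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_next))
        bias = np.zeros(n_next)
        h = h @ weight + bias
        if i < len(sizes) - 2:  # no activation after the output layer
            h = prelu(h)
    return h

# 4 query locations, each described by its distance to 985 hypothetical training points.
d = np.abs(np.random.default_rng(1).normal(size=(4, 985)))
weights = swnn_forward(d)
print(weights.shape)
```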


### 3.4 Model Training

In [5]:
gnnwr.run(max_epoch=200, early_stop=50)

 60%|██████    | 121/200 [01:14<00:48,  1.62it/s, Train Loss=57.153108, Train R2=0.716219, Train AIC=tensor(6998.9004, device='cuda:0', grad_fn=<AddBackward0>), Valid Loss=54.7, Valid R2=0.711, Best Valid R2=0.745, Learning Rate=0.2]     


Training stop! Model has not been improved for over 50 epochs.
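The stop message reflects an early-stopping rule: training ends once the validation score has failed to improve for more than `early_stop` epochs. A generic version of that rule can be sketched as follows. The function `train_with_early_stop` is hypothetical and takes precomputed validation scores instead of training a model, so it illustrates the rule rather than the library's code.

```python
def train_with_early_stop(valid_scores, max_epoch=200, early_stop=50):
    """Return (stop_epoch, best_score) for a sequence of per-epoch validation R2 values."""
    best_score = float("-inf")
    epochs_without_gain = 0
    stop_epoch = 0
    for epoch, score in enumerate(valid_scores[:max_epoch], start=1):
        stop_epoch = epoch
        if score > best_score:
            best_score = score
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
        if epochs_without_gain > early_stop:
            break  # no improvement for more than `early_stop` epochs
    return stop_epoch, best_score

# Made-up scores: steady improvement for 70 epochs, then a plateau.
scores = [i / 100 for i in range(1, 71)] + [0.5] * 130
print(train_with_early_stop(scores))
```

In this made-up example the best score occurs at epoch 70, so the run stops at epoch 121, once 51 epochs have passed without improvement.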


### 3.5 Model Evaluation and Analysis

Output the training result

In [6]:
gnnwr.result()


--------------------Result Information----------------
Test Loss: |                  61.43194
Test R2  : |                   0.71589
Train R2 : |                   0.76827
Valid R2 : |                   0.74480
RMSE: |                        7.83785
MAE:  |                        5.66026
AICc: |                     1476.05457
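The reported metrics can be related to each other with a few lines of plain Python. The sketch assumes the loss is the mean squared error, which is consistent with the output above because the square root of the test loss reproduces the reported RMSE. The helper `regression_metrics` is hypothetical and is applied to toy values.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 for two equally long sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}

# Toy values, not taken from the experiment.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)

# Consistency check on the reported numbers: sqrt(Test Loss) should match RMSE.
print(math.sqrt(61.43194))
```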


Save the training result

In [7]:
gnnwr.reg_result('../demo_result/gnnwr/gnnwr_result.csv').sort_values(by='id')

|  | coef_dem | coef_w10 | coef_d10 | coef_t2m | coef_aod_sat | coef_tp | bias | Pred_PM2_5 | id | dataset_belong | denormalized_pred_result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 219 | -5.189742 | 0.935647 | -0.137950 | 2.293738 | 48.971241 | 5.535703 | 11.860765 | 53.828934 | 0 | train | 53.828934 |
| 750 | -5.154523 | 0.116095 | -0.110708 | 2.498437 | 53.065498 | 5.724678 | 7.454807 | 43.898289 | 1 | train | 43.898289 |
| 1159 | -5.211779 | 0.878466 | -0.127922 | 1.940047 | 50.589832 | 5.825291 | 10.285083 | 53.616673 | 2 | valid | 53.616673 |
| 95 | -5.189971 | 0.985495 | -0.134363 | 2.037937 | 49.562389 | 5.737452 | 11.302744 | 53.605171 | 3 | train | 53.605171 |
| 1381 | -5.218955 | 0.885996 | -0.124537 | 1.782828 | 51.135056 | 5.938444 | 9.762604 | 53.371277 | 4 | test | 53.371277 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1352 | -4.957616 | 6.753003 | -0.278863 | -2.363364 | 26.483797 | 6.888298 | 33.389709 | 53.420815 | 1403 | test | 53.420815 |
| 466 | 3.573628 | -27.792097 | 0.209455 | 24.910294 | 72.651123 | 3.309302 | 7.845873 | 7.839865 | 1404 | train | 7.839865 |
| 991 | -4.950811 | 6.717587 | -0.279985 | -2.304465 | 26.285727 | 6.852742 | 33.594601 | 53.537739 | 1405 | train | 53.537739 |
| 1029 | -4.297856 | 1.223342 | -0.209996 | 15.557784 | 34.421749 | -2.929730 | 13.141288 | 15.076088 | 1406 | valid | 15.076088 |
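Once the result is on disk it can be explored with ordinary pandas operations. The sketch below rebuilds a tiny frame from five of the rows shown above (three columns only) instead of reading the CSV, so it runs without the file. In practice one would presumably load `../demo_result/gnnwr/gnnwr_result.csv` with `pd.read_csv`.

```python
import pandas as pd

# Five rows copied from the output above (three of the columns only).
result = pd.DataFrame({
    "coef_aod_sat": [48.971241, 53.065498, 50.589832, 26.483797, 72.651123],
    "Pred_PM2_5": [53.828934, 43.898289, 53.616673, 53.420815, 7.839865],
    "dataset_belong": ["train", "train", "valid", "test", "train"],
})

# Spread of the AOD coefficient across locations: a wide range points to spatial heterogeneity.
print(result["coef_aod_sat"].agg(["min", "max", "mean"]))

# Number of rows per split.
counts = result["dataset_belong"].value_counts()
print(counts)
```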


## 4 Saving and Loading

### 4.1 Saving Datasets

In [8]:
train_dataset.save('../demo_result/gnnwr/dataset/train_dataset/', exist_ok=True)
val_dataset.save('../demo_result/gnnwr/dataset/val_dataset/', exist_ok=True)
test_dataset.save('../demo_result/gnnwr/dataset/test_dataset/', exist_ok=True)

### 4.2 Loading Datasets

In [9]:
train_dataset_load = datasets.load_dataset(
    '../demo_result/gnnwr/dataset/train_dataset/')


val_dataset_load = datasets.load_dataset(
    '../demo_result/gnnwr/dataset/val_dataset/')


test_dataset_load = datasets.load_dataset(
    '../demo_result/gnnwr/dataset/test_dataset/')

### 4.3 Loading Model

Initialize the model

In [10]:
gnnwr_load = models.GNNWR(train_dataset=train_dataset_load,
                          valid_dataset=val_dataset_load,
                          test_dataset=test_dataset_load,
                          dense_layers=[512, 256, 128],
                          start_lr=0.2,
                          optimizer="Adadelta",
                          activate_func=nn.PReLU(init=0.1),
                          model_name="GNNWR_PM25",
                          model_save_path="../demo_result/gnnwr/models/",
                          log_path="../demo_result/gnnwr/logs/",
                          write_path="../demo_result/gnnwr/writes/"
                          )

Loading Parameters

In [11]:
gnnwr_load.load_model('../demo_result/gnnwr/models/GNNWR_PM25.pkl')
gnnwr_load.result()


--------------------Result Information----------------
Test Loss: |                  61.43194
Test R2  : |                   0.71589
Train R2 : |                   0.76827
Valid R2 : |                   0.74480
RMSE: |                        7.83785
MAE:  |                        5.66026
AICc: |                     1476.05457


## 5 Estimation

### 5.1 Import Estimation Data

In [12]:
pred_data = pd.read_csv('../data/pm25_predict_data.csv')

### 5.2 Initialize Estimation Dataset

In [13]:
pred_dataset = datasets.init_predict_dataset(data=pred_data, train_dataset=train_dataset, x_column=[
                                             'dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp'],
                                             spatial_column=['proj_x', 'proj_y'])

### 5.3 Estimate

In [14]:
pred_res = gnnwr_load.predict(pred_dataset)
pred_res.head(5)

|  | station_id | lng | lat | date | PM2.5 | row_index | col_index | proj_x | proj_y | dem | ... | tp | blh | e | r | u10 | v10 | aod_sat | ndvi | pred_result | denormalized_pred_result |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1001A | 116.366 | 39.8673 | 20170930 | 56.357143 | 2201.0 | 6867.0 | 1650848.0 | 1370268.0 | 46 | ... | 5.1e-05 | 64.583054 | -7e-06 | 52.682091 | 0.384257 | 0.784808 | 0.762762 | 3443 | 48.213795 | 48.213795 |
| 1 | 1002A | 116.17 | 40.2865 | 20170930 | 47.148148 | 2134.0 | 6835.0 | 1625004.0 | 1416960.0 | 420 | ... | 0.000304 | 40.62114 | -7e-06 | 62.529091 | -0.156175 | -0.537717 | 0.574785 | 7810 | 36.449612 | 36.449612 |
| 2 | 1003A | 116.434 | 39.9522 | 20170930 | 53.857143 | 2188.0 | 6877.0 | 1653777.0 | 1381524.0 | 48 | ... | 5.8e-05 | 60.242908 | -7e-06 | 52.12664 | 0.093867 | 0.617515 | 0.796827 | 3328 | 49.067551 | 49.067551 |
| 3 | 1004A | 116.434 | 39.8745 | 20170930 | 46.333333 | 2200.0 | 6877.0 | 1655828.0 | 1372270.0 | 45 | ... | 4.7e-05 | 69.535637 | -8e-06 | 51.301529 | 0.197439 | 0.893495 | 0.758839 | 4535 | 47.682178 | 47.682178 |
| 4 | 1005A | 116.473 | 39.9716 | 20170930 | 52.203704 | 2185.0 | 6884.0 | 1656225.0 | 1384492.0 | 40 | ... | 5.9e-05 | 62.281456 | -7e-06 | 51.071964 | -0.060543 | 0.634863 | 0.760148 | 3901 | 46.906914 | 46.906914 |
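Because the estimation file also carries observed `PM2.5` values, the estimates can be checked against them directly. The sketch below uses only the five rows displayed above, so the numbers describe those rows and not the whole dataset.

```python
import numpy as np

# Observed ('PM2.5') and estimated ('pred_result') values from the five rows shown above.
observed = np.array([56.357143, 47.148148, 53.857143, 46.333333, 52.203704])
estimated = np.array([48.213795, 36.449612, 49.067551, 47.682178, 46.906914])

residual = estimated - observed
mean_error = residual.mean()  # negative means the model underestimates on average
rmse = np.sqrt(np.mean(residual ** 2))
print(f"mean error = {mean_error:.2f}, RMSE = {rmse:.2f}")
```

On the full frame the same calculation would be `pred_res['pred_result'] - pred_res['PM2.5']`, using the column names visible in the output.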


## 6 Visualization

### 6.1 Introductions to the Visualization Module

**Initialization (utils.Visualize)**

params:  
- data: An instance of the GNNWR or its derived model (required)  
- lon_lat_columns: The longitude and latitude column names in the dataset. If not provided, the first two columns of `spatial_columns` will be used as longitude and latitude by default.

```python
visualizer = utils.Visualize(data = gnnwr, lon_lat_columns = ['lon','lat'])
```

**Dataset Visualization (display_dataset)**

params:  
- name: The dataset to display. Options are 'all', 'train', 'valid', 'test', representing the full dataset, training set, validation set, and test set respectively. Default is 'all'.  
- y_column: The field to measure. Defaults to the first column in the dataset's `y_columns`.  
- colors: Pass a color array to customize the color palette. Default is a yellow -> red palette.  
- steps: Pass a positive integer to set the number of color gradations. Default is 20.  
- vmin: Set the minimum value for the color palette. Defaults to the minimum value of the measurement data.  
- vmax: Set the maximum value for the color palette. Defaults to the maximum value of the measurement data.

```python
visualizer.display_dataset(name='train', y_column='PM2_5', colors=['blue','green','yellow','red'], steps=50, vmin=0, vmax=100)
```

**Coefficient Visualization (coefs_heatmap)**

params:  
- data_column: Select the coefficient field, for example `coef_dem` (required).  
- colors: Pass a color array to customize the color palette. Default is a yellow -> red palette.  
- steps: Pass a positive integer to set the number of color gradations. Default is 20.  
- vmin: Set the minimum value for the color palette. Defaults to the minimum value of the measurement data.  
- vmax: Set the maximum value for the color palette. Defaults to the maximum value of the measurement data.

```python
visualizer.coefs_heatmap(data_column='coef_dem', colors=['blue','green','yellow','red'], steps=50, vmin=0, vmax=100)
```

**Custom Point Data Visualization (dot_map)**

params:  
- data: Pass the DataFrame used for visualization (required).  
- lon_column: Longitude field name (required).  
- lat_column: Latitude field name (required).  
- y_column: Measurement value field name (required).  
- zoom: Initial zoom level for the map. Default is 4.  
- colors: Pass a color array to customize the color palette. Default is a yellow -> red palette.  
- steps: Pass a positive integer to set the number of color gradations. Default is 20.  
- vmin: Set the minimum value for the color palette. Defaults to the minimum value of the measurement data.  
- vmax: Set the maximum value for the color palette. Defaults to the maximum value of the measurement data.

```python
visualizer.dot_map(data=df, lon_column='lon', lat_column='lat', y_column='res', zoom=1, colors=['blue','green','yellow','red'], steps=50, vmin=0, vmax=100)
```

In [15]:
visualizer = utils.Visualize(data=gnnwr, lon_lat_columns=['lng', 'lat'])
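The `colors`, `steps`, `vmin`, and `vmax` parameters interact in a way that is easier to see in code. The pure-Python sketch below is an illustration of a stepped palette, not the library's rendering code: a value is clamped to `[vmin, vmax]`, assigned to one of `steps` equal-width bins, and given a color interpolated between the anchor colors. The helper names and hex codes are assumptions.

```python
def hex_to_rgb(h):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def stepped_color(value, colors, steps=20, vmin=0.0, vmax=100.0):
    """Return the hex color of the bin that `value` falls into."""
    # Clamp, then find which of the `steps` equal-width bins the value belongs to.
    clamped = min(max(value, vmin), vmax)
    bin_index = min(int((clamped - vmin) / (vmax - vmin) * steps), steps - 1)
    # Position of the bin centre along the palette, from 0 to 1.
    t = (bin_index + 0.5) / steps
    # Linear interpolation between the two neighbouring anchor colors.
    scaled = t * (len(colors) - 1)
    lo = min(int(scaled), len(colors) - 2)
    frac = scaled - lo
    c0, c1 = hex_to_rgb(colors[lo]), hex_to_rgb(colors[lo + 1])
    rgb = tuple(round(a + (b - a) * frac) for a, b in zip(c0, c1))
    return "#{:02x}{:02x}{:02x}".format(*rgb)

palette = ["#0000ff", "#008000", "#ffff00", "#ff0000"]  # blue, green, yellow, red
for pm in (5, 35, 75, 150):
    print(pm, stepped_color(pm, palette, steps=50, vmin=0, vmax=100))
```

Values outside the range share the color of the nearest end bin, which is why fixing `vmin` and `vmax` explicitly matters when several maps are to be compared.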

### 6.2 Drawing the distribution of datasets  

All data

In [16]:
visualizer.display_dataset(name='all', y_column='PM2_5', colors=[
                           'blue', 'green', 'yellow', 'red'])

Train

In [17]:
visualizer.display_dataset(name='train', y_column='PM2_5', colors=[
                           'blue', 'green', 'yellow', 'red'])

Validation

In [18]:
visualizer.display_dataset(name='valid', y_column='PM2_5', colors=[
                           'blue', 'green', 'yellow', 'red'])

Test

In [19]:
visualizer.display_dataset(name='test', y_column='PM2_5', colors=[
                           'blue', 'green', 'yellow', 'red'])

Estimation

In [20]:
visualizer.dot_map(data=pred_res, lon_column='lng', lat_column='lat',
                   y_column='pred_result', colors=['blue', 'green', 'yellow', 'red'])

### 6.3 Drawing heat maps of the coefficient distribution for each variable

DEM

In [21]:
visualizer.coefs_heatmap('coef_dem', colors=['blue', 'green', 'yellow', 'red'])

AOD

In [22]:
visualizer.coefs_heatmap('coef_aod_sat', colors=[
                         'blue', 'green', 'yellow', 'red'])

Precipitation

In [23]:
visualizer.coefs_heatmap('coef_tp', colors=['blue', 'green', 'yellow', 'red'])

Temperature

In [24]:
visualizer.coefs_heatmap('coef_t2m', colors=['blue', 'green', 'yellow', 'red'])

Wind Speed

In [25]:
visualizer.coefs_heatmap('coef_w10', colors=['blue', 'green', 'yellow', 'red'])

Wind Direction

In [26]:
visualizer.coefs_heatmap('coef_d10', colors=['blue', 'green', 'yellow', 'red'])

Bias

In [27]:
visualizer.coefs_heatmap('bias', colors=['blue', 'green', 'yellow', 'red'])

## 7 Results Comparison

True data

In [28]:
visualizer.display_dataset(name='valid', y_column='PM2_5', colors=[
                           'blue', 'green', 'yellow', 'red'])

Estimated Data

In [29]:
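# Use the same projected coordinates as in training (section 3.2) so that
# distances to the training points are computed in a consistent coordinate system.
valid_pred_dst = datasets.init_predict_dataset(data=val_dataset.dataframe,
                                               train_dataset=train_dataset,
                                               x_column=[
                                                   'dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp'],
                                               spatial_column=['proj_x', 'proj_y'])
valid_pred_res = gnnwr.predict(valid_pred_dst)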
visualizer.dot_map(data=valid_pred_res, lon_column='lng', lat_column='lat',
                   y_column='pred_result', colors=['blue', 'green', 'yellow', 'red'], vmin=4, vmax=81)
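For the two maps to be visually comparable they should share one color range, which is presumably why `vmin=4` and `vmax=81` are set explicitly above. A small helper for deriving such a range is sketched below. The function `shared_color_range` and the sample numbers are illustrative only.

```python
def shared_color_range(*series):
    """Return a (vmin, vmax) pair covering every value in all given sequences."""
    values = [v for s in series for v in s]
    return min(values), max(values)

# Illustrative numbers only; in the notebook these would be the 'PM2_5' and
# 'pred_result' columns of the validation data.
observed = [12.4, 53.6, 80.9, 4.1]
estimated = [15.0, 49.2, 71.3, 7.8]
vmin, vmax = shared_color_range(observed, estimated)
print(vmin, vmax)
```

Passing the same `vmin` and `vmax` to both the observed and the estimated map keeps a given color tied to the same concentration in each.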