# PM2.5 concentration spatial distribution estimation in China 
## Abstract  

China, as the largest developing country in the world, has been facing increasingly severe air quality problems accompanying the continuous advancement of industrialization and urbanization. According to the “2017 China Environmental Situation Bulletin” released by the Ministry of Ecology and Environment, among the 338 cities at the prefecture level and above, 239 cities have exceeded the national ambient air quality standards, with a proportion of more than 70%. Air quality problems have seriously affected people's daily travel and physical health, restricted sustainable economic development, and become a hot issue for public and government attention. PM2.5 is a respirable particle with a diameter of no more than 2.5 micrometers in the aerodynamic field, and is one of the main indicators for evaluating air quality. Comprehensive understanding of the spatial distribution of PM2.5 concentration, representing the spatial process and environmental behavior of atmospheric pollution, is of great significance and guidance value for supporting atmospheric pollution monitoring and early warning, comprehensive treatment, protecting human health and social sustainable development. By the end of 2017, the China National Environmental Monitoring Center had established over 400 ground-level air quality monitoring stations, and released hourly air quality monitoring data including PM2.5, providing high-precision and reliable real-time monitoring results. However, due to the uneven spatial distribution and low coverage of ground monitoring stations, it is difficult for existing studies to effectively analyze and deeply mine the spatiotemporal data of their monitoring data. Unlike ground monitoring, remote sensing observation based on satellites can obtain high-coverage atmospheric environment spatial datasets, such as atmospheric aerosol optical depth (AOD). Numerous studies have shown that there is a strong correlation between AOD and PM2.5 concentration. Research on PM2.5 concentration Spatial regression relationships with factors such as AOD retrieved from remote sensing inversion can provide effective solutions for obtaining the spatial distribution of PM2.5 concentration in the entire study area. Methodology Based on GWR geographical weighted regression thinking, Wu SenSen combines OLR with neural network model to propose a Geographically Neural Network Weighted Regression (GNNWR) model. By utilizing the learning ability of neural networks, this model can handle the spatial heterogeneity and complex nonlinear characteristics of regression relationships, which has better fitness accuracy and prediction performance compared to models such as OLR and GWR. The purpose of this case is to establish a PM2.5 concentration spatial estimation model based on GNNWR to achieve accurate fitting of spatial heterogeneity and nonlinear characteristics in PM2.5 regression relationships, and then obtain high-precision and high-reasonability PM2.5 concentration spatial distribution in China.  

## Data description  

Many studies have shown that integrating meteorological conditions such as temperature, precipitation, wind speed, wind direction, and surface elevation factors can further improve the accuracy of PM2.5 spatial estimation. In this case, in addition to selecting AOD data as an auxiliary factor, temperature (TEMP), precipitation (TP), wind speed (WS), wind direction (WD), and surface elevation (DEM) factors are added as input variables for the model. The research time scale is the average of 2017 year scale:
(1) PM2.5 monitoring site data. The hourly PM2.5 concentration observation values from January 1, 2017 to December 31, 2017 were obtained from the China Environmental Monitoring Station. PM2.5 concentrations were measured using cone-shaped element oscillation trace or beta attenuation methods following national standard GB3095-2012. PM2.5 data were averaged for one year's time scale
(2) Aerosol data. Aerosol data are obtained from the LAADS website including both Terra and Aqua dark pixel inversion products with a resolution of 3 km (MOD04_3K and MYD04_3k), as well as deep blue algorithm inversion products with a resolution of 10 km (MOD04_L2 and MYD04_L2). In this article, the 3 km resolution AOD products are the main data source for PM2.5 estimation. When there is a missing value in the 3 km resolution data, a resolution-matching product will be used as much as possible using a 10 km resolution product for resampling substitution to ensure the reliability of AOD data.
(3) DEM data. DEM data are obtained from the ETO-PO1 global surface elevation model of NOAA with a resolution of 1 arc minute
(4) Meteorological data including temperature, precipitation, wind speed, and wind direction are obtained from the ERA5 global climate reanalysis modele provided by ECMWF with hourly gridded data at a resolution of 0.5 degrees.   

## Model Introduction  

Based on the geographical weighted idea similar to GWR, the GNNWR model believes that the spatial heterogeneity of the regression relationship can be regarded as the varying levels of spatial nonstationarity at different locations that affect the "OLR regression relationship". Therefore, in the spatial estimation experiment of PM2.5 concentration in this case, the model structure of GNNWR is defined as follows:  
 
![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059648746_1.png)  

In this equation, (u_i, v_i) are the spatial coordinates of the i-th sample point, and β = (β_0, β_1, ..., β_6) are the regression coefficients of the OLR model, reflecting the average level of the PM2.5 regression relationship for the entire region. The estimation matrix of OLR coefficients is represented as follows:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059665465_2.png)  

of which: 

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059673642_3.png)  


<br/><br/>  


![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694003342595_1.png)  
The spatial estimation model for PM2.5 concentration based on GNNWR

## Main Content  
1. Model Training 
2. Result Storage, Loading, and Visualization
3. Estimation 

# Part A：Preparation

## Import Necessary Packages

In [1]:
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
from gnnwr import models,datasets,utils
import pandas as pd
import numpy as np
import folium
from folium.plugins import HeatMap, MarkerCluster
from demo_utils import marker_map,Heatmap
import torch.nn as nn
from sklearn.metrics import r2_score as r2
import matplotlib.pyplot as plt

# Part B：Model Training

## Step 1：Import Training Data

In [2]:
data = pd.read_csv(u'../data/pm25_data.csv')
data.head(5)

Unnamed: 0,station_id,lng,lat,date,PM2_5,row_index,col_index,proj_x,proj_y,dem,...,t2m,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi
0,1001A,116.366,39.8673,20170601,54.733894,2201,6867,1650847.552,1370268.366,46,...,284.561066,100809.2734,0.001006,134.995636,-7e-06,46.315975,0.425366,0.170262,0.870967,2401
1,1002A,116.17,40.2865,20170601,48.080737,2134,6835,1625003.973,1416959.964,420,...,282.907684,97125.08594,0.001044,157.77597,-6e-06,53.605503,0.211734,-0.676848,0.71208,5255
2,1003A,116.434,39.9522,20170601,54.898592,2188,6877,1653776.71,1381524.305,48,...,284.492249,100830.9688,0.001002,129.971298,-7e-06,45.537464,0.266666,0.069172,0.875811,2609
3,1004A,116.434,39.8745,20170601,52.266382,2200,6877,1655828.045,1372270.098,45,...,284.6362,100936.8047,0.00101,138.793961,-7e-06,45.387913,0.299403,0.22795,0.869679,2420
4,1005A,116.473,39.9716,20170601,53.189076,2185,6884,1656224.681,1384491.842,40,...,284.506561,100880.1797,0.001019,130.520599,-7e-06,44.790119,0.169121,0.079546,0.873232,3296


### View the distribution of training data

## Step 2：Partition Datasets

In [3]:
import numpy as np
import pandas as pd
from gnnwr.datasets import init_dataset
data = pd.read_csv('../data/pm25_data.csv')
x_column=['dem', 'w10','d10','t2m','aod_sat','tp']
y_column=['PM2_5']
spatial_column=['proj_x','proj_y']
train_set, val_set, test_set = init_dataset(data=data,
                                            test_ratio=0.2,
                                            valid_ratio=0.2,
                                            x_column=x_column,
                                            y_column=y_column,
                                            spatial_column=spatial_column,
                                            sample_seed=42)

x_min:[-5.0000000e+00  4.1591436e-02  3.9565850e-02  2.6959613e+02
  5.6254357e-02  3.8816700e-05];  x_max:[4.52000000e+03 3.20341086e+00 3.59605225e+02 2.97242950e+02
 1.06999075e+00 4.07377200e-03]
y_min:[3.85633803];  y_max:[133.8005618]


## Step 3：Initialize GNNWR Model

In [4]:
gnnwr = models.GNNWR(train_dataset = train_set,
                     valid_dataset = val_set, 
                     test_dataset = test_set,
                     dense_layers = [1024, 256, 128],
                     activate_func = nn.PReLU(init=0.2),
                     start_lr = 0.1,
                     optimizer = "Adadelta",
                     model_name = "GNNWR_PM25",
                     model_save_path = "./demo_result/gnnwr_models",
                     log_path = "./demo_result/gnnwr_logs",
                     write_path = "./demo_result/gnnwr_runs"
                     )

In [5]:
gnnwr.add_graph()

Add Graph Successfully


In [6]:
gnnwr.run(max_epoch = 4000,early_stop=1000,print_frequency = 500)

 13%|█▎        | 501/4000 [01:09<06:48,  8.57it/s]


Epoch:  500
learning rate:  0.09733963460294015
Train Loss:  50.73307395113692
Train R2: 0.73732
Train RMSE: 7.12272
Train AIC: 6101.77410
Train AICc: 6091.70068
Valid Loss:  45.88246536254883
Valid R2: 0.79676 

Best R2: 0.82613 



 25%|██▌       | 1001/4000 [02:10<05:37,  8.90it/s]


Epoch:  1000
learning rate:  0.03263617175376001
Train Loss:  38.383435544639525
Train R2: 0.80126
Train RMSE: 6.19544
Train AIC: 5850.43834
Train AICc: 5826.89355
Valid Loss:  37.54179763793945
Valid R2: 0.83370 

Best R2: 0.85574 



 38%|███▊      | 1501/4000 [03:10<05:04,  8.20it/s]


Epoch:  1500
learning rate:  0.09879906455148443
Train Loss:  41.26976883239407
Train R2: 0.78632
Train RMSE: 6.42415
Train AIC: 5915.76447
Train AICc: 5889.42578
Valid Loss:  37.087890625
Valid R2: 0.83571 

Best R2: 0.85800 



 50%|█████     | 2001/4000 [04:10<03:53,  8.57it/s]


Epoch:  2000
learning rate:  0.08591893798619905
Train Loss:  31.962803975061888
Train R2: 0.83451
Train RMSE: 5.65357
Train AIC: 5685.50729
Train AICc: 5652.50391
Valid Loss:  30.41059112548828
Valid R2: 0.86529 

Best R2: 0.87319 



 63%|██████▎   | 2501/4000 [05:10<03:02,  8.20it/s]


Epoch:  2500
learning rate:  0.06286572710711905
Train Loss:  30.742439585441225
Train R2: 0.84082
Train RMSE: 5.54459
Train AIC: 5650.43265
Train AICc: 5612.86621
Valid Loss:  32.82408142089844
Valid R2: 0.85460 

Best R2: 0.87730 



 75%|███████▌  | 3001/4000 [06:10<01:56,  8.61it/s]


Epoch:  3000
learning rate:  0.03722450026559757
Train Loss:  30.531753777134565
Train R2: 0.84192
Train RMSE: 5.52555
Train AIC: 5644.23654
Train AICc: 5606.05908
Valid Loss:  32.232173919677734
Valid R2: 0.85722 

Best R2: 0.87827 



 88%|████████▊ | 3501/4000 [07:09<00:57,  8.65it/s]


Epoch:  3500
learning rate:  0.01743184615314671
Train Loss:  27.582468657329528
Train R2: 0.85719
Train RMSE: 5.25190
Train AIC: 5552.70672
Train AICc: 5514.57812
Valid Loss:  32.61893844604492
Valid R2: 0.85551 

Best R2: 0.87865 



100%|██████████| 4000/4000 [08:07<00:00,  8.20it/s]



Epoch:  4000
learning rate:  0.010000030461738542
Train Loss:  26.019797914697644
Train R2: 0.86528
Train RMSE: 5.10096
Train AIC: 5500.15812
Train AICc: 5463.43359
Valid Loss:  29.981163024902344
Valid R2: 0.86719 

Best R2: 0.87865 

Best_r2: 0.8786540030865484


In [7]:
gnnwr.load_model('./demo_result/gnnwr_models/GNNWR_PM25.pkl')

## Step 4：Model Training

In [8]:
# gnnwr.run(max_epoch = 200,print_frequency = 100)

## Step 5：模型评价与结果分析

### 输出模型训练结果

In [9]:
gnnwr.result()

--------------------Model Information-----------------
Model Name:           | GNNWR_PM25
independent variable: | ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp']
dependent variable:   | ['PM2_5']

OLS coefficients: 
x0: 7.20676
x1: -7.51571
x2: 0.00989
x3: 21.75958
x4: 30.29816
x5: -26.90607
Intercept: 22.79182

--------------------Result Information----------------
Test Loss: |                  40.03035
Test R2  : |                   0.81037
Train R2 : |                   0.85771
Valid R2 : |                   0.87865
RMSE: |                        6.32695
AIC:  |                     1847.75923
AICc: |                     1820.60583
F1:   |                        0.05237
F2:   |                        1.65990
f3_param_0: |                  8.97603
f3_param_1: |                  1.52405
f3_param_2: |                  3.20749
f3_param_3: |                 45.54015
f3_param_4: |                321.37238
f3_param_5: |                 98.94161
f3_param_6: |                 18.05067


### Save the training result

In [10]:
gnnwr.reg_result('./demo_result/GNNWR_PM25_Result.csv')

Unnamed: 0,coef_dem,coef_w10,coef_d10,coef_t2m,coef_aod_sat,coef_tp,bias,Pred_PM2_5,id
0,-42.245953,32.271889,-0.114344,25.495701,4.644829,-2.301092,26.216631,48.535564,708
1,-33.179379,-11.717217,-0.105434,29.955183,45.663036,-5.036797,6.118102,52.999641,473
2,-32.749878,34.850971,-0.103790,22.118444,11.623776,-17.240992,24.765287,47.247192,688
3,-42.951084,-8.170590,-0.087614,23.301256,-18.591471,-21.067556,48.883209,31.642212,718
4,8.625327,14.034409,-0.060852,-23.038494,91.108368,-41.954235,5.234749,23.922737,907
...,...,...,...,...,...,...,...,...,...
1403,-32.997318,-11.724401,-0.105112,30.158096,44.292564,-5.277888,6.934898,53.038467,474
1404,-12.890089,-0.051342,-0.061370,-4.298569,43.124130,19.377064,7.950440,40.401588,795
1405,-56.515850,1.228940,-0.123811,52.481209,18.853447,-42.570831,20.560272,53.832516,1059
1406,-20.160530,33.000599,-0.103459,0.267865,48.386425,12.603077,5.744659,55.579384,492


# Part C：Visualization

In [11]:
visualizer = utils.Visualize(data=gnnwr,lon_lat_columns=['lng','lat'])

### Drawing the distribution of datasets

### Universal

In [12]:
visualizer.display_dataset(name='all',y_column='PM2_5')

### Train

In [13]:
visualizer.display_dataset(name='train',y_column='PM2_5')

### Validation

In [14]:
visualizer.display_dataset(name='valid',y_column='PM2_5')

### Test

In [15]:
visualizer.display_dataset(name='test',y_column='PM2_5')

### Drawing the heat map for the distribution of weights of each variables 

#### DEM

In [16]:
visualizer.coefs_heatmap('coef_dem')

#### AOD

In [17]:
visualizer.coefs_heatmap('coef_aod_sat')

### Precipitation

In [18]:
visualizer.coefs_heatmap('coef_tp')

#### Temperature

In [19]:
visualizer.coefs_heatmap('coef_t2m')

#### Wind Speed

In [20]:
visualizer.coefs_heatmap('coef_w10')

#### Wind Direction

In [21]:
visualizer.coefs_heatmap('coef_d10')

#### Bias

In [22]:
visualizer.coefs_heatmap('bias')

# Part D：Saving and Loading

## Step 1：Saving Datasets

In [24]:
# make sure dir is not exist
train_set.save('./demo_result/gnnwr_datasets/train_dataset')
val_set.save('./demo_result/gnnwr_datasets/val_dataset')
test_set.save('./demo_result/gnnwr_datasets/test_dataset')

## Step 2：Loading Datasets

In [25]:
train_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/train_dataset/')
val_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/val_dataset/')
test_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/test_dataset/')

## Step 3：Loading Model

### Initialize GNNWR Model

In [26]:
gnnwr_load = models.GNNWR(train_dataset = train_dataset_load,
                     valid_dataset = val_dataset_load, 
                     test_dataset = test_dataset_load,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "./demo_result/gnnwr_models",
                     log_path = "./demo_result/gnnwr_logs",
                     write_path = "./demo_result/gnnwr_runs"
                     )

### Loading Parameters

In [27]:
gnnwr_load.load_model('./demo_result/gnnwr_models/GNNWR_PM25.pkl')
gnnwr_load.result()

--------------------Model Information-----------------
Model Name:           | GNNWR_PM25
independent variable: | ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp']
dependent variable:   | ['PM2_5']

OLS coefficients: 
x0: 7.20675
x1: -7.51571
x2: 0.00989
x3: 21.75957
x4: 30.29816
x5: -26.90607
Intercept: 22.79182

--------------------Result Information----------------
Test Loss: |                  40.03030
Test R2  : |                   0.81037
RMSE: |                        6.32695
AIC:  |                     1847.75885
AICc: |                     1820.60559
F1:   |                        0.05237
F2:   |                        1.66006
f3_param_0: |                  8.97735
f3_param_1: |                  1.52390
f3_param_2: |                  3.20786
f3_param_3: |                 45.54019
f3_param_4: |                321.38257
f3_param_5: |                 98.94106
f3_param_6: |                 18.04847


# Part E：Estimation

## Step 1：Import Estimation Data

In [28]:
pred_data = pd.read_csv(u'../data/pm25_predict_data.csv')

## Step 2：Initialize Estimation Dataset

In [29]:
pred_dataset = datasets.init_predict_dataset(data = pred_data,train_dataset = train_set,x_column=['dem', 'w10','d10','t2m','aod_sat','tp'],spatial_column=['lng','lat'])

## Step 3：Estimate

In [30]:
res = gnnwr.predict(pred_dataset)
res.head(5)

Unnamed: 0,station_id,lng,lat,date,PM2.5,row_index,col_index,proj_x,proj_y,dem,...,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi,pred_result
0,1001A,116.366,39.8673,20170930,56.357143,2201.0,6867.0,1650848.0,1370268.0,46,...,100287.671875,5.1e-05,64.583054,-7e-06,52.682091,0.384257,0.784808,0.762762,3443,81.87262
1,1002A,116.17,40.2865,20170930,47.148148,2134.0,6835.0,1625004.0,1416960.0,420,...,96752.507812,0.000304,40.62114,-7e-06,62.529091,-0.156175,-0.537717,0.574785,7810,81.095139
2,1003A,116.434,39.9522,20170930,53.857143,2188.0,6877.0,1653777.0,1381524.0,48,...,100307.703125,5.8e-05,60.242908,-7e-06,52.12664,0.093867,0.617515,0.796827,3328,62.781219
3,1004A,116.434,39.8745,20170930,46.333333,2200.0,6877.0,1655828.0,1372270.0,45,...,100410.367188,4.7e-05,69.535637,-8e-06,51.301529,0.197439,0.893495,0.758839,4535,84.883621
4,1005A,116.473,39.9716,20170930,52.203704,2185.0,6884.0,1656225.0,1384492.0,40,...,100355.054688,5.9e-05,62.281456,-7e-06,51.071964,-0.060543,0.634863,0.760148,3901,74.731354


## Step 4：Display

In [31]:
visualizer.dot_map(data=res,lon_column='lng',lat_column='lat',y_column='pred_result',colors=['blue','green','red'])