# PM2.5 concentration spatial distribution estimation in China 
## Abstract  

As the world's largest developing country, China has faced increasingly severe air quality problems as industrialization and urbanization have advanced. According to the “2017 China Environmental Situation Bulletin” released by the Ministry of Ecology and Environment, 239 of the 338 cities at prefecture level and above exceeded the national ambient air quality standards, a share of more than 70%. Poor air quality affects people's daily travel and physical health, constrains sustainable economic development, and has become a focus of public and government attention.

PM2.5, particulate matter with an aerodynamic diameter of no more than 2.5 micrometers, is one of the main indicators used to evaluate air quality. A comprehensive picture of the spatial distribution of PM2.5 concentration, which reflects the spatial processes and environmental behavior of atmospheric pollution, is valuable for pollution monitoring and early warning, for comprehensive treatment, and for protecting human health and sustainable social development.

By the end of 2017, the China National Environmental Monitoring Center had established over 400 ground-level air quality monitoring stations and was releasing hourly air quality data, including PM2.5, that provide high-precision, reliable, real-time measurements. However, because ground stations are unevenly distributed and cover only a small part of the country, station data alone make it difficult to analyze and mine the spatiotemporal patterns of PM2.5. Satellite remote sensing, by contrast, provides high-coverage spatial datasets of the atmospheric environment, such as aerosol optical depth (AOD). Numerous studies have shown a strong correlation between AOD and PM2.5 concentration, so modeling the spatial regression relationship between PM2.5 concentration and factors such as remotely sensed AOD offers an effective way to obtain the spatial distribution of PM2.5 over the entire study area.

Building on the idea of geographically weighted regression (GWR), Wu SenSen combined ordinary linear regression (OLR) with a neural network to propose the Geographically Neural Network Weighted Regression (GNNWR) model. By exploiting the learning ability of neural networks, the model can handle the spatial heterogeneity and complex nonlinearity of regression relationships, and it achieves better fitting accuracy and prediction performance than models such as OLR and GWR. The purpose of this case is to build a GNNWR-based spatial estimation model of PM2.5 concentration that accurately fits the spatial heterogeneity and nonlinearity of the PM2.5 regression relationship, and then to obtain a high-precision, plausible spatial distribution of PM2.5 concentration across China.

## Data description  

Many studies have shown that incorporating meteorological conditions such as temperature, precipitation, wind speed, and wind direction, together with surface elevation, can further improve the accuracy of PM2.5 spatial estimation. In this case, in addition to AOD as an auxiliary factor, temperature (TEMP), precipitation (TP), wind speed (WS), wind direction (WD), and surface elevation (DEM) are used as input variables for the model. The temporal scale of the study is the 2017 annual average:

(1) PM2.5 monitoring site data. Hourly PM2.5 concentration observations from January 1, 2017 to December 31, 2017 were obtained from the China National Environmental Monitoring Center. PM2.5 concentrations were measured with the tapered element oscillating microbalance (TEOM) method or the beta attenuation method, in accordance with national standard GB3095-2012. The PM2.5 data were averaged to an annual time scale.

(2) Aerosol data. Aerosol data were obtained from the LAADS website and include the Terra and Aqua Dark Target retrieval products at 3 km resolution (MOD04_3K and MYD04_3K) as well as the Deep Blue retrieval products at 10 km resolution (MOD04_L2 and MYD04_L2). The 3 km AOD products are the main data source for PM2.5 estimation. Where the 3 km data have missing values, the 10 km product is resampled to the matching resolution and substituted wherever possible, to keep the AOD data reliable.

(3) DEM data. DEM data were obtained from NOAA's ETOPO1 global surface elevation model, at a resolution of 1 arc-minute.

(4) Meteorological data. Temperature, precipitation, wind speed, and wind direction were obtained from the ERA5 global climate reanalysis model provided by ECMWF, as hourly gridded data at a resolution of 0.5 degrees.
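Two of the preprocessing steps described above, averaging hourly PM2.5 to an annual value and filling gaps in the 3 km AOD with resampled 10 km values, can be sketched as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: the column names `station`, `time`, and `pm25`, both function names, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def annual_mean_pm25(hourly: pd.DataFrame) -> pd.DataFrame:
    """Average hourly PM2.5 readings to one value per station.

    The column names 'station' and 'pm25' are hypothetical.
    """
    return (
        hourly.dropna(subset=["pm25"])          # ignore missing hourly readings
        .groupby("station", as_index=False)["pm25"]
        .mean()
    )


def fill_aod_gaps(aod_3km: np.ndarray, aod_10km_resampled: np.ndarray) -> np.ndarray:
    """Keep the 3 km AOD where it exists, otherwise use the resampled 10 km value."""
    return np.where(np.isnan(aod_3km), aod_10km_resampled, aod_3km)


# Toy hourly records for two stations (made-up values).
hourly = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "time": pd.to_datetime([
        "2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 02:00",
        "2017-01-01 00:00", "2017-01-01 01:00",
    ]),
    "pm25": [50.0, 70.0, np.nan, 20.0, 40.0],
})
annual = annual_mean_pm25(hourly)

# Toy AOD grids, already on the same grid (made-up values).
aod_3km = np.array([0.8, np.nan, 0.5])
aod_10km = np.array([0.7, 0.6, np.nan])
aod = fill_aod_gaps(aod_3km, aod_10km)
```

The gap fill only helps where the 10 km product has a valid value; cells missing in both products stay missing.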

## Model Introduction  

Following a geographically weighted idea similar to GWR, the GNNWR model treats the spatial heterogeneity of the regression relationship as spatial nonstationarity that acts on the "OLR regression relationship" to a different degree at each location. Accordingly, in the PM2.5 concentration spatial estimation experiment in this case, the model structure of GNNWR is defined as follows:  
 
![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059648746_1.png)  

In this equation, (u_i, v_i) are the spatial coordinates of the i-th sample point, and β = (β_0, β_1, ..., β_6) are the regression coefficients of the OLR model, reflecting the average level of the PM2.5 regression relationship for the entire region. The estimation matrix of OLR coefficients is represented as follows:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059665465_2.png)  

where:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059673642_3.png)  


<br/><br/>  


![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694003342595_1.png)  
The spatial estimation model for PM2.5 concentration based on GNNWR
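
The equations in this section are embedded as images, so the following NumPy sketch is only an assumption about their general form, based on the surrounding text: global OLR coefficients are estimated by least squares, and each sample point then scales them by its own spatial weights. The weight matrix `W` is a hypothetical placeholder (all ones here); in GNNWR the weights would come from the neural network. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6                                   # six predictors, matching beta_0..beta_6
X = rng.normal(size=(n, p))
X1 = np.column_stack([np.ones(n), X])           # prepend the intercept column
true_beta = np.array([30.0, 1.0, -2.0, 0.5, 3.0, -1.0, 0.2])
y = X1 @ true_beta + rng.normal(scale=0.1, size=n)

# Global OLR estimate: least-squares solution, equivalent to (X^T X)^-1 X^T y.
beta_olr, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Hypothetical spatial weights w_k(u_i, v_i): one row per sample point,
# one column per coefficient. All ones reduces the model to plain OLR.
W = np.ones((n, p + 1))

# Location-specific coefficients and the resulting fitted values.
beta_local = W * beta_olr                       # shape (n, p + 1)
y_hat = np.sum(beta_local * X1, axis=1)
```

With `W` fixed at one, `y_hat` equals the OLR fit, which is the "average level" of the regression relationship mentioned above; any spatial variation enters only through `W`.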

## Main Content  
1. Model Training 
2. Result Storage, Loading, and Visualization
3. Estimation 

# Part A：Preparation

## Import Necessary Packages

In [1]:
from gnnwr import models,datasets
import pandas as pd
import numpy as np
import folium
import torch.nn as nn
from sklearn.metrics import r2_score as r2
import matplotlib.pyplot as plt

# Part B：Model Training

## Step 1：Import Training Data

In [2]:
data = pd.read_csv(u'../data/pm25_data.csv')
print(data.head(5))

   监测点编码 监测点名称  城市       经度       纬度      date      PM2_5  row_index  \
0  1001A  万寿西宫  北京  116.366  39.8673  20170601  54.733894       2201   
1  1002A    定陵  北京  116.170  40.2865  20170601  48.080737       2134   
2  1003A    东四  北京  116.434  39.9522  20170601  54.898592       2188   
3  1004A    天坛  北京  116.434  39.8745  20170601  52.266382       2200   
4  1005A   农展馆  北京  116.473  39.9716  20170601  53.189076       2185   

   col_index       proj_x  ...         t2m            sp        tp  \
0       6867  1650847.552  ...  284.561066  100809.27340  0.001006   
1       6835  1625003.973  ...  282.907684   97125.08594  0.001044   
2       6877  1653776.710  ...  284.492249  100830.96880  0.001002   
3       6877  1655828.045  ...  284.636200  100936.80470  0.001010   
4       6884  1656224.681  ...  284.506561  100880.17970  0.001019   

          blh         e          r       u10       v10   aod_sat  ndvi  
0  134.995636 -0.000007  46.315975  0.425366  0.170262  0.870967  2401  
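
The table above lists the 10 m wind components `u10` and `v10`, while the Data description refers to wind speed and wind direction. If the `w10` and `d10` columns used in the model hold speed and direction derived from those components (an assumption, since the source does not say how they were computed), the conversion could look like this minimal sketch. The function name is hypothetical.

```python
import numpy as np


def wind_speed_direction(u10, v10):
    """Convert eastward (u) and northward (v) wind components to speed and direction.

    Direction follows the meteorological convention: the direction the wind
    blows FROM, in degrees clockwise from north.
    """
    u = np.asarray(u10, dtype=float)
    v = np.asarray(v10, dtype=float)
    speed = np.hypot(u, v)
    direction = (180.0 + np.degrees(np.arctan2(u, v))) % 360.0
    return speed, direction


# Toy components: southerly, westerly, south-westerly, and easterly winds.
speed, direction = wind_speed_direction([0.0, 3.0, 3.0, -4.0], [3.0, 0.0, 3.0, 0.0])
```

Because direction is circular (359 degrees and 1 degree are neighbours), averaging speed and direction directly over a year can mislead; averaging the components first is a common alternative.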


### View the distribution of training data

In [13]:
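lon_center, lat_center = data['经度'].mean(), data['纬度'].mean()
# Named station_map to avoid shadowing Python's built-in map()
station_map = folium.Map(location=[lat_center, lon_center], zoom_start=4, tiles="Stamen Terrain")
# Add one marker per monitoring station, with its name and PM2.5 value in the popup
data.apply(lambda x: folium.Marker(location=[x['纬度'], x['经度']], popup=x['监测点名称'] + '\n PM2.5: ' + str(x['PM2_5'])).add_to(station_map), axis=1)
station_map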

## Step 2：Partition Datasets

In [3]:
train_dataset, val_dataset, test_dataset = datasets.init_dataset(data=data,
                                                        test_ratio=0.15,
                                                        valid_ratio=0.15,
                                                        x_column=['dem', 'w10','d10','t2m','aod_sat','tp'],
                                                        y_column=['PM2_5'],
                                                        spatial_column=['经度','纬度'],
                                                        sample_seed=23,
                                                        batch_size=64)

x_min:[-5.0000000e+00  4.1591436e-02  3.9565850e-02  2.6959613e+02
  5.6254357e-02  3.8816700e-05];  x_max:[4.52000000e+03 3.20341086e+00 3.59605225e+02 2.97242950e+02
 1.06999075e+00 4.07377200e-03]
y_min:[3.85633803];  y_max:[133.8005618]
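
The printed `x_min` and `x_max` vectors suggest that each input variable is rescaled with its minimum and maximum, although the source does not show the scaling code. A minimal sketch of min-max scaling, assuming the statistics are taken from the training set and reused for other data, is shown below. The function names and sample values are hypothetical.

```python
import numpy as np


def fit_min_max(x_train):
    """Column-wise minimum and maximum of the training features."""
    x_train = np.asarray(x_train, dtype=float)
    return x_train.min(axis=0), x_train.max(axis=0)


def apply_min_max(x, x_min, x_max):
    """Scale features to [0, 1] using previously fitted statistics."""
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard constant columns
    return (np.asarray(x, dtype=float) - x_min) / span


# Two illustrative columns (elevation in m, temperature in K); values are made up.
x_train = np.array([[-5.0, 269.6], [4520.0, 297.2], [100.0, 284.5]])
x_min, x_max = fit_min_max(x_train)
x_scaled = apply_min_max(x_train, x_min, x_max)

# New data reuse the training statistics rather than their own min and max.
x_new = apply_min_max([[2257.5, 283.4]], x_min, x_max)
```

Reusing the training statistics keeps validation, test, and prediction inputs on the same scale the model was trained on.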


### Checking the partitioning of the datasets

In [15]:
map_1 = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
# Red markers: training set
train_dataset.dataframe.apply(lambda x:folium.Marker(location=[x['纬度'],x['经度']],icon=folium.Icon(color='red'),popup=x['监测点名称']+'\n PM2.5: '+str(x['PM2_5'])).add_to(map_1),axis=1)
# Green markers: validation set
val_dataset.dataframe.apply(lambda x:folium.Marker(location=[x['纬度'],x['经度']],icon=folium.Icon(color='green'),popup=x['监测点名称']+'\n PM2.5: '+str(x['PM2_5'])).add_to(map_1),axis=1)
# Blue markers: test set
test_dataset.dataframe.apply(lambda x:folium.Marker(location=[x['纬度'],x['经度']],icon=folium.Icon(color='blue'),popup=x['监测点名称']+'\n PM2.5: '+str(x['PM2_5'])).add_to(map_1),axis=1)

map_1
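
To make the role of `test_ratio`, `valid_ratio`, and `sample_seed` concrete, here is a minimal sketch of a seeded random split. It illustrates the general idea only; how `datasets.init_dataset` samples internally is not shown in this tutorial, and the function below is hypothetical.

```python
import numpy as np


def split_indices(n, test_ratio=0.15, valid_ratio=0.15, seed=23):
    """Shuffle row indices once with a fixed seed, then cut them into three parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_test = int(round(n * test_ratio))
    n_valid = int(round(n * valid_ratio))
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = split_indices(1000)
```

Fixing the seed makes the partition reproducible, so the map above shows the same red, green, and blue stations on every run.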

## Step 3：Initialize GNNWR Model

In [5]:
gnnwr = models.GNNWR(train_dataset = train_dataset,
                     valid_dataset = val_dataset, 
                     test_dataset = test_dataset,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "./demo_result/gnnwr_models",
                     log_path = "./demo_result/gnnwr_logs",
                     write_path = "./demo_result/gnnwr_runs"
                     )
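
`dense_layers` sets the hidden layer sizes, but this tutorial does not spell out what the weighting network receives as input. In the published GNNWR design the network takes the distances between a sample and the reference points; assuming the package follows that design (not confirmed here), the input features could be computed as in this sketch. The function name and coordinates are hypothetical.

```python
import numpy as np


def distance_features(points, reference_points):
    """Euclidean distance from every point to every reference point."""
    p = np.asarray(points, dtype=float)[:, None, :]             # (n, 1, 2)
    r = np.asarray(reference_points, dtype=float)[None, :, :]   # (1, m, 2)
    return np.sqrt(((p - r) ** 2).sum(axis=-1))                 # (n, m)


# Toy projected coordinates (made-up values).
train_xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
query_xy = np.array([[0.0, 0.0], [3.0, 0.0]])
d = distance_features(query_xy, train_xy)
```

Each row of `d` has one entry per reference point. Euclidean distance on longitude and latitude in degrees is only a rough proxy for ground distance, so projected coordinates would be the safer input.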

## Step 4：Model Training

In [20]:
gnnwr.run(max_epoch = 20000,early_stop = 5000,print_frequency = 1000)

  5%|▌         | 1002/20000 [01:13<22:40, 13.96it/s]


Epoch:  1000
learning rate:  0.05778747370238225
Train Loss:  31.147884695227518
Train R2: 0.84534
Train RMSE: 5.58103
Train AIC: 8120.81376
Train AICc: 8019.90674
Valid Loss:  64.07957458496094
Valid R2: 0.66154 

Best R2: 0.76761 



 10%|█         | 2002/20000 [02:25<21:34, 13.90it/s]


Epoch:  2000
learning rate:  0.17027331352642022
Train Loss:  28.358025259207366
Train R2: 0.85919
Train RMSE: 5.32523
Train AIC: 8146.22004
Train AICc: 8041.55957
Valid Loss:  74.40351104736328
Valid R2: 0.60700 

Best R2: 0.80477 



 15%|█▌        | 3002/20000 [03:36<20:03, 14.12it/s]


Epoch:  3000
learning rate:  0.06747394500515043
Train Loss:  22.30566325229881
Train R2: 0.88925
Train RMSE: 4.72289
Train AIC: 8146.20345
Train AICc: 8042.20996
Valid Loss:  55.48660659790039
Valid R2: 0.70692 

Best R2: 0.80477 



 20%|██        | 4002/20000 [04:47<18:48, 14.17it/s]


Epoch:  4000
learning rate:  0.0100000643081147
Train Loss:  22.034473642369875
Train R2: 0.89059
Train RMSE: 4.69409
Train AIC: 8161.67873
Train AICc: 8057.35156
Valid Loss:  50.88096618652344
Valid R2: 0.73125 

Best R2: 0.80477 



 25%|██▌       | 5002/20000 [05:59<17:38, 14.17it/s]


Epoch:  5000
learning rate:  0.19295770555694952
Train Loss:  22.542947341558857
Train R2: 0.88807
Train RMSE: 4.74794
Train AIC: 8186.84538
Train AICc: 8126.89160
Valid Loss:  49.98702621459961
Valid R2: 0.73597 

Best R2: 0.82692 



 30%|███       | 6002/20000 [07:10<16:28, 14.16it/s]


Epoch:  6000
learning rate:  0.17284911977727405
Train Loss:  21.214358492938466
Train R2: 0.89467
Train RMSE: 4.60591
Train AIC: 8180.24136
Train AICc: 8079.28711
Valid Loss:  48.3480110168457
Valid R2: 0.74463 

Best R2: 0.82896 



 35%|███▌      | 7002/20000 [08:21<15:19, 14.14it/s]


Epoch:  7000
learning rate:  0.14266140738628752
Train Loss:  18.282467396678943
Train R2: 0.90922
Train RMSE: 4.27580
Train AIC: 8168.61928
Train AICc: 8065.77100
Valid Loss:  48.88898849487305
Valid R2: 0.74177 

Best R2: 0.82980 



 40%|████      | 8002/20000 [09:32<14:18, 13.98it/s]


Epoch:  8000
learning rate:  0.10687901529049952
Train Loss:  17.857205364673657
Train R2: 0.91133
Train RMSE: 4.22578
Train AIC: 8200.79551
Train AICc: 8073.67920
Valid Loss:  44.31189727783203
Valid R2: 0.76595 

Best R2: 0.82980 



 45%|████▌     | 9002/20000 [10:44<13:00, 14.09it/s]


Epoch:  9000
learning rate:  0.07081749159978878
Train Loss:  16.27283736597472
Train R2: 0.91920
Train RMSE: 4.03396
Train AIC: 8195.23885
Train AICc: 8069.27393
Valid Loss:  40.816829681396484
Valid R2: 0.78441 

Best R2: 0.82980 



 50%|█████     | 10002/20000 [11:55<12:07, 13.74it/s]


Epoch:  10000
learning rate:  0.03983384999787906
Train Loss:  15.583019815480581
Train R2: 0.92263
Train RMSE: 3.94753
Train AIC: 8198.67855
Train AICc: 8065.83643
Valid Loss:  44.36006546020508
Valid R2: 0.76569 

Best R2: 0.82980 



 55%|█████▌    | 11002/20000 [13:06<10:37, 14.11it/s]


Epoch:  11000
learning rate:  0.018530774316340596
Train Loss:  15.395371305790153
Train R2: 0.92356
Train RMSE: 3.92369
Train AIC: 8214.84805
Train AICc: 8087.76025
Valid Loss:  37.47509002685547
Valid R2: 0.80206 

Best R2: 0.82980 



 58%|█████▊    | 11630/20000 [13:51<09:58, 13.99it/s]

Training stop! Model has not been improved for over 5000 epochs.
Best_r2: 0.8297950475900042
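
The final log message matches `early_stop = 5000`: training ends once the best validation score has not improved for that many epochs. Stopping at iteration 11630 would place the last improvement near epoch 6630, which is consistent with `Best R2` changing between the reports at epochs 6000 and 7000. A minimal sketch of such a patience rule follows; the function is hypothetical and the package's own implementation may differ.

```python
def run_with_early_stop(valid_scores, patience):
    """Return (best_score, best_epoch, stop_epoch) for a sequence of validation scores."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(valid_scores):
        if score > best_score:
            # New best: remember it and reset the patience window.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_score, best_epoch, epoch
    return best_score, best_epoch, len(valid_scores) - 1


# Toy validation R2 values (made-up).
scores = [0.60, 0.70, 0.83, 0.80, 0.79, 0.81, 0.82, 0.78]
best, best_epoch, stop_epoch = run_with_early_stop(scores, patience=3)
```

The returned best score belongs to an earlier epoch than the stopping epoch, which is why the model saved at the best epoch, not the last one, is the one worth keeping.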





## Step 5：Model Evaluation and Result Analysis

### Output the Model Training Results

In [7]:
gnnwr.result()


Test Loss:  47.317161560058594  Test R2:  0.7811654004960318
--------------------Result Table--------------------

Model Name:           | GNNWR_PM25
Model Structure:      |
 DataParallel(
  (module): DataParallel(
    (module): SWNN(
      (activate_func): PReLU(num_parameters=1)
      (fc): Sequential(
        (swnn_full0): Linear(in_features=1017, out_features=512, bias=True)
        (swnn_batc0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (swnn_acti0): PReLU(num_parameters=1)
        (swnn_drop0): Dropout(p=0.2, inplace=False)
        (swnn_full1): Linear(in_features=512, out_features=256, bias=True)
        (swnn_batc1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (swnn_acti1): PReLU(num_parameters=1)
        (swnn_drop1): Dropout(p=0.2, inplace=False)
        (swnn_full2): Linear(in_features=256, out_features=128, bias=True)
        (swnn_batc2): BatchNorm1d(128, eps=1e-05, momentum=0.1, af
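
For reference, R2 and RMSE can be computed as in the sketch below. In the training log, Train RMSE appears to be the square root of Train Loss (for example 5.58103² ≈ 31.148 at epoch 1000), which suggests the loss is a mean squared error. Under that assumption the reported test loss corresponds to an RMSE of roughly 6.88. The function names and toy values are illustrative.

```python
import math

import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy observations and predictions (made-up values).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
toy_r2 = r2_score(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)

# Assuming the reported test loss is an MSE, its square root is the test RMSE.
test_rmse = math.sqrt(47.317161560058594)
```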

### Save the training result

In [8]:
gnnwr.reg_result('/home/mw/project/GNNWR_PM25_Result.csv')

Unnamed: 0,weight_dem,weight_w10,weight_d10,weight_t2m,weight_aod_sat,weight_tp,bias,PM2_5,id
0,3.784909,57.760181,-1.433208,-19.105408,-2.353813,-3.010631,49.843536,40.463448,1117.0
1,-13.033020,14.112597,-2.673133,-30.197380,91.433136,15.328211,-8.095282,32.490837,575.0
2,-41.840466,-8.272955,-2.228050,4.549059,55.323395,-3.228616,9.572063,35.180138,294.0
3,-68.594635,6.753700,-3.319684,48.954319,-1.862861,-11.950991,22.822838,49.503555,1074.0
4,-24.279854,15.080664,0.021832,15.526450,65.970024,-22.426077,-5.697450,50.360462,26.0
...,...,...,...,...,...,...,...,...,...
1403,-16.598175,13.226131,-4.792192,-40.767525,97.900589,10.031476,-2.454238,25.699091,605.0
1404,-24.854321,0.351533,-1.632044,-8.446840,100.554344,1.967967,-18.279560,35.503094,262.0
1405,-85.850227,3.005556,-3.068398,-3.918420,6.176850,-18.631191,77.973106,58.579792,1333.0
1406,-19.934078,41.431381,-4.646597,-8.662212,-10.243143,39.647686,26.006897,35.435493,1105.0
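
The coefficients in this table change in both size and sign from one sample point to the next, which is the spatial heterogeneity the model is meant to capture. A quick way to summarize that variation is sketched below with pandas, using two columns copied from the first five rows shown above; the summary layout is an illustrative choice.

```python
import pandas as pd

# Values copied from the first five rows of the result table above.
sample = pd.DataFrame({
    "weight_aod_sat": [-2.353813, 91.433136, 55.323395, -1.862861, 65.970024],
    "weight_dem": [3.784909, -13.033020, -41.840466, -68.594635, -24.279854],
})

# Range of each coefficient and the share of points where it is positive.
summary = pd.DataFrame({
    "min": sample.min(),
    "max": sample.max(),
    "share_positive": (sample > 0).mean(),
})
```

Five rows are far too few to characterize the full result; the same summary applied to the whole CSV gives a fairer picture.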


### Draw heat maps of the weight distribution for each variable

In [10]:
result_data = gnnwr.getWeights()


#### DEM

In [14]:
from folium.plugins import HeatMap
hmap_dem = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_dem = [[row['纬度'],row['经度'],row['weight_dem']]for index,row in result_data.iterrows()]
hmap_dem.add_child(HeatMap(weight_dem))
hmap_dem

#### AOD

In [15]:
hmap_aod = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_aod = [[row['纬度'],row['经度'],row['weight_aod_sat']]for index,row in result_data.iterrows()]
hmap_aod.add_child(HeatMap(weight_aod))
hmap_aod

#### Precipitation

In [16]:
hmap_tp = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_tp = [[row['纬度'],row['经度'],row['weight_tp']]for index,row in result_data.iterrows()]
hmap_tp.add_child(HeatMap(weight_tp))
hmap_tp

#### Temperature

In [17]:
hmap_t2m = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_t2m = [[row['纬度'],row['经度'],row['weight_t2m']]for index,row in result_data.iterrows()]
hmap_t2m.add_child(HeatMap(weight_t2m))
hmap_t2m

#### Wind Speed

In [18]:
hmap_w10 = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_w10 = [[row['纬度'],row['经度'],row['weight_w10']]for index,row in result_data.iterrows()]
hmap_w10.add_child(HeatMap(weight_w10))
hmap_w10

#### Wind Direction

In [19]:
hmap_d10 = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_d10 = [[row['纬度'],row['经度'],row['weight_d10']]for index,row in result_data.iterrows()]
hmap_d10.add_child(HeatMap(weight_d10))
hmap_d10

#### Bias

In [20]:
hmap_bias = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_bias = [[row['纬度'],row['经度'],row['bias']]for index,row in result_data.iterrows()]
hmap_bias.add_child(HeatMap(weight_bias))
hmap_bias
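
The seven cells above differ only in the column they plot, so they could be folded into one helper. The sketch below is a possible refactoring, not part of the original tutorial. It also rescales values to the 0 to 1 range, on the assumption that a heat-map layer reads the third value as a non-negative intensity, so negative coefficients may not display as intended without rescaling. Both function names are hypothetical.

```python
def heat_rows(records, value_key, lat_key="纬度", lon_key="经度"):
    """Build [lat, lon, intensity] rows, rescaling the values to the 0-1 range."""
    values = [float(r[value_key]) for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                      # guard against a constant column
    return [[r[lat_key], r[lon_key], (v - lo) / span] for r, v in zip(records, values)]


def weight_heatmap(records, value_key, center, zoom_start=4):
    """Draw one heat map for the chosen column."""
    # Imported lazily so heat_rows can be used without folium installed.
    import folium
    from folium.plugins import HeatMap

    fmap = folium.Map(location=center, zoom_start=zoom_start)
    HeatMap(heat_rows(records, value_key)).add_to(fmap)
    return fmap


# Toy records with made-up coefficient values.
records = [
    {"纬度": 39.8673, "经度": 116.366, "weight_dem": -10.0},
    {"纬度": 40.2865, "经度": 116.170, "weight_dem": 0.0},
    {"纬度": 39.9522, "经度": 116.434, "weight_dem": 30.0},
]
rows = heat_rows(records, "weight_dem")
```

With the notebook's variables, a call such as `weight_heatmap(result_data.to_dict("records"), "weight_dem", [lat_center, lon_center])` would replace one of the cells above. Rescaling hides the sign of a coefficient, so a diverging color scale may suit signed weights better.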

# Part C：Saving and Loading

## Step 1：Saving Datasets

In [21]:
# Make sure the target directories do not already exist
train_dataset.save('./demo_result/gnnwr_datasets/train_dataset')
val_dataset.save('./demo_result/gnnwr_datasets/val_dataset')
test_dataset.save('./demo_result/gnnwr_datasets/test_dataset')
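
Since the save step expects target directories that do not exist yet, a small guard can make reruns less error-prone. The helper below is a hypothetical sketch using only the standard library. It assumes the save call creates the final directory itself, which the source does not confirm.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_save_dir(path, overwrite=False):
    """Return a path that does not exist yet, ready to be passed to a save call."""
    target = Path(path)
    if target.exists():
        if not overwrite:
            raise FileExistsError(f"{target} already exists")
        shutil.rmtree(target)                    # remove the previous run's output
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demonstration in a temporary directory rather than ./demo_result.
base = Path(tempfile.mkdtemp())
fresh = prepare_save_dir(base / "gnnwr_datasets" / "train_dataset")
```

Raising by default avoids silently deleting an earlier result; `overwrite=True` has to be requested explicitly.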

## Step 2：Loading Datasets

In [22]:
train_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/train_dataset/')
val_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/val_dataset/')
test_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/test_dataset/')

## Step 3：Loading Model

### Initialize GNNWR Model

In [23]:
gnnwr_load = models.GNNWR(train_dataset = train_dataset_load,
                     valid_dataset = val_dataset_load, 
                     test_dataset = test_dataset_load,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "./demo_result/gnnwr_models",
                     log_path = "./demo_result/gnnwr_logs",
                     write_path = "./demo_result/gnnwr_runs"
                     )

### Loading Parameters

In [24]:
gnnwr_load.load_model('./demo_result/gnnwr_models/GNNWR_PM25.pkl')
gnnwr_load.result()

Test Loss:  47.31715774536133  Test R2:  0.7811654041026812
--------------------Result Table--------------------

Model Name:           | GNNWR_PM25
Model Structure:      |
 DataParallel(
  (module): DataParallel(
    (module): SWNN(
      (activate_func): PReLU(num_parameters=1)
      (fc): Sequential(
        (swnn_full0): Linear(in_features=1017, out_features=512, bias=True)
        (swnn_batc0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (swnn_acti0): PReLU(num_parameters=1)
        (swnn_drop0): Dropout(p=0.2, inplace=False)
        (swnn_full1): Linear(in_features=512, out_features=256, bias=True)
        (swnn_batc1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (swnn_acti1): PReLU(num_parameters=1)
        (swnn_drop1): Dropout(p=0.2, inplace=False)
        (swnn_full2): Linear(in_features=256, out_features=128, bias=True)
        (swnn_batc2): BatchNorm1d(128, eps=1e-05, momentum=0.1, aff
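
The reloaded model reports almost the same test loss and R2 as the original run, differing only in the last digits, as expected from floating-point rounding. A tolerance-based comparison, sketched below with the two sets of numbers printed in this tutorial, is a simple way to confirm that a reload succeeded. The helper name and the tolerance are illustrative choices.

```python
import math

# Metrics printed by the trained model and by the reloaded model.
original = {"test_loss": 47.317161560058594, "test_r2": 0.7811654004960318}
reloaded = {"test_loss": 47.31715774536133, "test_r2": 0.7811654041026812}


def same_metrics(a, b, rel_tol=1e-6):
    """True if every metric in `a` matches `b` within a relative tolerance."""
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)


matches = same_metrics(original, reloaded)
```

An exact equality check would fail here even though the reload worked, which is why a tolerance is used.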

# Part D：Estimation

## Step 1：Import Estimation Data

In [25]:
pred_data = pd.read_csv(u'../data/pm25_predict_data.csv')

## Step 2：Initialize Estimation Dataset

In [26]:
pred_dataset = datasets.init_predict_dataset(data = pred_data,train_dataset = train_dataset,x_column=['dem', 'w10','d10','t2m','aod_sat','tp'],spatial_column=['经度','纬度'])
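
Before building the estimation dataset, it can help to confirm that the new table contains every column named in `x_column` and `spatial_column`, with no missing values. The check below is a hypothetical pre-flight sketch with pandas and is not part of the `gnnwr` package; the sample values are made up.

```python
import pandas as pd

X_COLUMNS = ["dem", "w10", "d10", "t2m", "aod_sat", "tp"]
SPATIAL_COLUMNS = ["经度", "纬度"]


def check_predict_frame(frame, required=tuple(X_COLUMNS + SPATIAL_COLUMNS)):
    """Return (missing_columns, columns_with_nan) for the required inputs."""
    missing = [c for c in required if c not in frame.columns]
    present = [c for c in required if c in frame.columns]
    with_nan = [c for c in present if frame[c].isna().any()]
    return missing, with_nan


# Toy table that lacks 'd10' and has one missing 'w10' value.
frame = pd.DataFrame({
    "dem": [50.0, 60.0],
    "w10": [1.2, None],
    "t2m": [284.5, 283.9],
    "aod_sat": [0.87, 0.57],
    "tp": [0.001, 0.001],
    "经度": [116.366, 116.170],
    "纬度": [39.8673, 40.2865],
})
missing, with_nan = check_predict_frame(frame)
```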

## Step 3：Estimate

In [27]:
res = gnnwr.predict(pred_dataset)
res.head(5)

Unnamed: 0,监测点编码,监测点名称,城市,经度,纬度,date,PM2.5,row_index,col_index,proj_x,...,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi,pred_result
0,1001A,万寿西宫,北京,116.366,39.8673,20170930,56.357143,2201.0,6867.0,1650848.0,...,100287.671875,5.1e-05,64.583054,-7e-06,52.682091,0.384257,0.784808,0.762762,3443,61.780437
1,1002A,定陵,北京,116.17,40.2865,20170930,47.148148,2134.0,6835.0,1625004.0,...,96752.507812,0.000304,40.62114,-7e-06,62.529091,-0.156175,-0.537717,0.574785,7810,49.10614
2,1003A,东四,北京,116.434,39.9522,20170930,53.857143,2188.0,6877.0,1653777.0,...,100307.703125,5.8e-05,60.242908,-7e-06,52.12664,0.093867,0.617515,0.796827,3328,62.518009
3,1004A,天坛,北京,116.434,39.8745,20170930,46.333333,2200.0,6877.0,1655828.0,...,100410.367188,4.7e-05,69.535637,-8e-06,51.301529,0.197439,0.893495,0.758839,4535,61.710503
4,1005A,农展馆,北京,116.473,39.9716,20170930,52.203704,2185.0,6884.0,1656225.0,...,100355.054688,5.9e-05,62.281456,-7e-06,51.071964,-0.060543,0.634863,0.760148,3901,61.778942
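
Because this table carries both an observed `PM2.5` column and `pred_result`, residuals can be inspected directly. The sketch below uses the five rows shown above; the variable names are illustrative.

```python
import numpy as np

# Observed PM2.5 and pred_result copied from the five rows shown above.
observed = np.array([56.357143, 47.148148, 53.857143, 46.333333, 52.203704])
predicted = np.array([61.780437, 49.10614, 62.518009, 61.710503, 61.778942])

residual = predicted - observed                  # positive means overestimation
mean_bias = float(residual.mean())
mean_abs_error = float(np.abs(residual).mean())
```

On these five Beijing stations every residual is positive, so the mean bias equals the mean absolute error. Five neighbouring rows cannot say whether the same holds elsewhere, so the comparison should be repeated on the full table.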


In [31]:
hmap_pred = folium.Map(location=[lat_center,lon_center],zoom_start=4,tiles = "Stamen Terrain")
weight_pred = [[row['纬度'],row['经度'],row['pred_result']]for index,row in res.iterrows()]
hmap_pred.add_child(HeatMap(weight_pred,gradient={.3:"blue",.6:"green",.9:"yellow",1:"red"}))
hmap_pred