# PM2.5 concentration spatial distribution estimation in China 
## Abstract  

China, as the largest developing country in the world, has been facing increasingly severe air quality problems accompanying the continuous advancement of industrialization and urbanization. According to the “2017 China Environmental Situation Bulletin” released by the Ministry of Ecology and Environment, among the 338 cities at the prefecture level and above, 239 cities have exceeded the national ambient air quality standards, with a proportion of more than 70%. Air quality problems have seriously affected people's daily travel and physical health, restricted sustainable economic development, and become a hot issue for public and government attention. PM2.5 is a respirable particle with a diameter of no more than 2.5 micrometers in the aerodynamic field, and is one of the main indicators for evaluating air quality. Comprehensive understanding of the spatial distribution of PM2.5 concentration, representing the spatial process and environmental behavior of atmospheric pollution, is of great significance and guidance value for supporting atmospheric pollution monitoring and early warning, comprehensive treatment, protecting human health and social sustainable development. By the end of 2017, the China National Environmental Monitoring Center had established over 400 ground-level air quality monitoring stations, and released hourly air quality monitoring data including PM2.5, providing high-precision and reliable real-time monitoring results. However, due to the uneven spatial distribution and low coverage of ground monitoring stations, it is difficult for existing studies to effectively analyze and deeply mine the spatiotemporal data of their monitoring data. Unlike ground monitoring, remote sensing observation based on satellites can obtain high-coverage atmospheric environment spatial datasets, such as atmospheric aerosol optical depth (AOD). Numerous studies have shown that there is a strong correlation between AOD and PM2.5 concentration. Research on PM2.5 concentration Spatial regression relationships with factors such as AOD retrieved from remote sensing inversion can provide effective solutions for obtaining the spatial distribution of PM2.5 concentration in the entire study area. Methodology Based on GWR geographical weighted regression thinking, Wu SenSen combines OLR with neural network model to propose a Geographically Neural Network Weighted Regression (GNNWR) model. By utilizing the learning ability of neural networks, this model can handle the spatial heterogeneity and complex nonlinear characteristics of regression relationships, which has better fitness accuracy and prediction performance compared to models such as OLR and GWR. The purpose of this case is to establish a PM2.5 concentration spatial estimation model based on GNNWR to achieve accurate fitting of spatial heterogeneity and nonlinear characteristics in PM2.5 regression relationships, and then obtain high-precision and high-reasonability PM2.5 concentration spatial distribution in China.  

## Data description  

Many studies have shown that integrating meteorological conditions such as temperature, precipitation, wind speed, wind direction, and surface elevation factors can further improve the accuracy of PM2.5 spatial estimation. In this case, in addition to selecting AOD data as an auxiliary factor, temperature (TEMP), precipitation (TP), wind speed (WS), wind direction (WD), and surface elevation (DEM) factors are added as input variables for the model. The research time scale is the average of 2017 year scale:
(1) PM2.5 monitoring site data. The hourly PM2.5 concentration observation values from January 1, 2017 to December 31, 2017 were obtained from the China Environmental Monitoring Station. PM2.5 concentrations were measured using cone-shaped element oscillation trace or beta attenuation methods following national standard GB3095-2012. PM2.5 data were averaged for one year's time scale
(2) Aerosol data. Aerosol data are obtained from the LAADS website including both Terra and Aqua dark pixel inversion products with a resolution of 3 km (MOD04_3K and MYD04_3k), as well as deep blue algorithm inversion products with a resolution of 10 km (MOD04_L2 and MYD04_L2). In this article, the 3 km resolution AOD products are the main data source for PM2.5 estimation. When there is a missing value in the 3 km resolution data, a resolution-matching product will be used as much as possible using a 10 km resolution product for resampling substitution to ensure the reliability of AOD data.
(3) DEM data. DEM data are obtained from the ETO-PO1 global surface elevation model of NOAA with a resolution of 1 arc minute
(4) Meteorological data including temperature, precipitation, wind speed, and wind direction are obtained from the ERA5 global climate reanalysis modele provided by ECMWF with hourly gridded data at a resolution of 0.5 degrees.   

## Model Introduction  

Based on the geographical weighted idea similar to GWR, the GNNWR model believes that the spatial heterogeneity of the regression relationship can be regarded as the varying levels of spatial nonstationarity at different locations that affect the "OLR regression relationship". Therefore, in the spatial estimation experiment of PM2.5 concentration in this case, the model structure of GNNWR is defined as follows:  
 
![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059648746_1.png)  

In this equation, (u_i, v_i) are the spatial coordinates of the i-th sample point, and β = (β_0, β_1, ..., β_6) are the regression coefficients of the OLR model, reflecting the average level of the PM2.5 regression relationship for the entire region. The estimation matrix of OLR coefficients is represented as follows:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059665465_2.png)  

of which: 

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059673642_3.png)  


<br/><br/>  


![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694003342595_1.png)  
The spatial estimation model for PM2.5 concentration based on GNNWR

## Main Content  
1. Model Training 
2. Result Storage, Loading, and Visualization
3. Estimation 

# Part A：Preparation

## Import Necessary Packages

In [6]:
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
from src.gnnwr import models,datasets,utils
import pandas as pd
import torch.nn as nn

# Part B：Model Training

## Step 1：Import Training Data

In [7]:
data = pd.read_csv(u'../data/pm25_data.csv')
data.head(5)

Unnamed: 0,station_id,lng,lat,date,PM2_5,row_index,col_index,proj_x,proj_y,dem,...,t2m,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi
0,1001A,116.366,39.8673,20170601,54.733894,2201,6867,1650847.552,1370268.366,46,...,284.561066,100809.2734,0.001006,134.995636,-7e-06,46.315975,0.425366,0.170262,0.870967,2401
1,1002A,116.17,40.2865,20170601,48.080737,2134,6835,1625003.973,1416959.964,420,...,282.907684,97125.08594,0.001044,157.77597,-6e-06,53.605503,0.211734,-0.676848,0.71208,5255
2,1003A,116.434,39.9522,20170601,54.898592,2188,6877,1653776.71,1381524.305,48,...,284.492249,100830.9688,0.001002,129.971298,-7e-06,45.537464,0.266666,0.069172,0.875811,2609
3,1004A,116.434,39.8745,20170601,52.266382,2200,6877,1655828.045,1372270.098,45,...,284.6362,100936.8047,0.00101,138.793961,-7e-06,45.387913,0.299403,0.22795,0.869679,2420
4,1005A,116.473,39.9716,20170601,53.189076,2185,6884,1656224.681,1384491.842,40,...,284.506561,100880.1797,0.001019,130.520599,-7e-06,44.790119,0.169121,0.079546,0.873232,3296


In [8]:
data = data.sample(frac=1,random_state=42)
train_data = data[:int(0.7*len(data))]
val_data = data[int(0.7*len(data)):int(0.8*len(data))]
test_data = data[int(0.8*len(data)):]

### View the distribution of training data

## Step 2：Partition Datasets

In [None]:
x_columns=['dem', 'w10','d10','t2m','aod_sat','tp']
y_column=['PM2_5']
spatial_columns=['proj_x','proj_y']
# train_set, val_set, test_set = datasets.init_dataset(data=data,
#                                             test_ratio=0.2,
#                                             valid_ratio=0.2,
#                                             x_column=x_column,
#                                             y_column=y_column,
#                                             spatial_column=spatial_column,
#                                             use_model="gnnwr",
#                                             sample_seed=42)
train_set, val_set, test_set = datasets.init_dataset_split(
                                            train_data=train_data,
                                            val_data=val_data,
                                            test_data=test_data,
                                            x_columns=x_columns,
                                            y_column=y_column,
                                            spatial_columns=spatial_columns,
                                            batch_size = 1024,
                                            use_model="gnnwr")

## Step 3：Initialize GNNWR Model

In [10]:
optimizer_params = {
    "scheduler":"MultiStepLR",
    "scheduler_milestones":[500, 1000, 1500, 2000],
    "scheduler_gamma":0.75,
}
gnnwr = models.GNNWR(train_dataset = train_set,
                     valid_dataset = val_set, 
                     test_dataset = test_set,
                     dense_layers = [1024, 512, 256],
                     activate_func = nn.PReLU(init=0.4),
                     start_lr = 0.1,
                     optimizer = "Adadelta",
                     model_name = "GNNWR_PM25",
                     model_save_path = "../demo_result/gnnwr_models",
                     log_path = "../demo_result/gnnwr_logs",
                     write_path = "../demo_result/gnnwr_runs", # 这里需要修改
                     optimizer_params = optimizer_params
                     )

In [11]:
gnnwr.add_graph()

Add Graph Successfully


## Step 4：Model Training

In [12]:
gnnwr.run(max_epoch = 4000,early_stop=1000,print_frequency = 500)

Consider using tensor.detach() first. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\autograd\generated\python_variable_methods.cpp:837.)
  return float(1 - torch.sum(self.__residual ** 2) / torch.sum((self.__y_data - torch.mean(self.__y_data)) ** 2))
100%|██████████| 4000/4000 [08:17<00:00,  8.04it/s, Train Loss=24.3178, Train R2=0.8849, Train RMSE=4.8204, Train AIC=5894.8164, Valid Loss=25.8734, Valid R2=0.8359, Valid RMSE=5.4935, Best Valid R2=0.8453, Learning Rate=0.0316]     


In [13]:
gnnwr.load_model('../demo_result/gnnwr_models/GNNWR_PM25.pkl')

## Step 5：模型评价与结果分析

### 输出模型训练结果

In [14]:
gnnwr.result()

--------------------Model Information-----------------
Model Name:           | GNNWR_PM25
independent variable: | ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp']
dependent variable:   | ['PM2_5']

OLS coefficients: 
x0: 5.00462
x1: -10.35234
x2: -0.98542
x3: 23.47110
x4: 29.09492
x5: -27.67621
Intercept: 24.20848

--------------------Result Information----------------
Test Loss: |                  23.94857
Test R2  : |                   0.88655
Train R2 : |                   0.91600
Valid R2 : |                   0.84527
RMSE: |                        4.89373
AIC:  |                     1698.32324
AICc: |                     1702.86877
F1:   |                        0.03772
F2:   |                        2.46101
f3_param_0: |                 15.00357
f3_param_1: |                  2.54759
f3_param_2: |                  5.36124
f3_param_3: |                 76.11821
f3_param_4: |                537.18396
f3_param_5: |                165.37825
f3_param_6: |                 34.26014


### Save the training result

In [15]:
gnnwr.reg_result('../demo_result/GNNWR_PM25_Result.csv')

Result saved as e:\CODE\gnnwr\demo_result\GNNWR_PM25_Result.csv


Unnamed: 0,coef_dem,coef_w10,coef_d10,coef_t2m,coef_aod_sat,coef_tp,bias,Pred_PM2_5,id,dataset_belong,denormalized_pred_result
0,-49.313916,-22.290796,-15.578921,36.983446,-12.335181,-18.373330,46.407154,38.334366,0.0,train,38.334366
1,-8.194478,5.993950,1.468095,12.559563,50.717017,-12.320839,2.974349,17.835045,1.0,train,17.835045
2,-93.491765,-12.015135,-27.624726,52.494813,-27.746685,-29.813959,69.009952,40.526413,2.0,train,40.526413
3,-65.672850,8.099174,-7.004443,64.534123,34.086476,-51.721898,5.943659,27.826311,3.0,train,27.826311
4,-24.080847,0.334543,-2.982499,-3.888858,136.517129,25.274273,-51.125342,21.929756,4.0,train,21.929756
...,...,...,...,...,...,...,...,...,...,...,...
1403,-49.036025,36.874089,-10.851839,-20.096102,5.372617,-5.225683,58.580327,42.818748,1403.0,test,42.818748
1404,-56.665277,-12.391780,-5.938049,-0.082357,36.562125,0.363843,28.869351,37.978157,1404.0,test,37.978157
1405,-79.741967,10.254953,7.688930,26.065527,-31.575601,-18.840222,53.086404,40.074440,1405.0,test,40.074440
1406,-45.037643,3.993817,2.193668,57.281200,9.316445,-36.755505,23.070454,25.302567,1406.0,test,25.302567


# Part C：Visualization

In [16]:
visualizer = utils.Visualize(data=gnnwr,lon_lat_columns=['lng','lat'])

### Drawing the distribution of datasets

### Universal

In [17]:
visualizer.display_dataset(name='all',y_column='PM2_5')

### Train

In [18]:
visualizer.display_dataset(name='train',y_column='PM2_5')

### Validation

In [19]:
visualizer.display_dataset(name='valid',y_column='PM2_5')

### Test

In [20]:
visualizer.display_dataset(name='test',y_column='PM2_5')

### Drawing the heat map for the distribution of weights of each variables 

#### DEM

In [21]:
visualizer._result_data['coef_dem']

id
0      -49.313916
1       -8.194478
2      -93.491765
3      -65.672850
4      -24.080847
          ...    
1403   -49.036025
1404   -56.665277
1405   -79.741967
1406   -45.037643
1407   -53.791648
Name: coef_dem, Length: 1408, dtype: float64

In [22]:
visualizer.coefs_heatmap('coef_dem')

#### AOD

In [23]:
visualizer.coefs_heatmap('coef_aod_sat')

### Precipitation

In [24]:
visualizer.coefs_heatmap('coef_tp')

#### Temperature

In [25]:
visualizer.coefs_heatmap('coef_t2m')

#### Wind Speed

In [26]:
visualizer.coefs_heatmap('coef_w10')

#### Wind Direction

In [27]:
visualizer.coefs_heatmap('coef_d10')

#### Bias

In [28]:
visualizer.coefs_heatmap('bias')

# Part D：Saving and Loading

## Step 1：Saving Datasets

In [29]:
# make sure dir is not exist
train_set.save('../demo_result/gnnwr_datasets/train_dataset', exist_ok=True)
val_set.save('../demo_result/gnnwr_datasets/val_dataset', exist_ok=True)
test_set.save('../demo_result/gnnwr_datasets/test_dataset', exist_ok=True)

## Step 2：Loading Datasets

In [30]:
train_dataset_load = datasets.BaseDataset.load('../demo_result/gnnwr_datasets/train_dataset/')
val_dataset_load = datasets.BaseDataset.load('../demo_result/gnnwr_datasets/val_dataset/')
test_dataset_load = datasets.BaseDataset.load('../demo_result/gnnwr_datasets/test_dataset/')

## Step 3：Loading Model

### Initialize GNNWR Model

In [31]:
gnnwr_load = models.GNNWR(train_dataset = train_dataset_load,
                     valid_dataset = val_dataset_load, 
                     test_dataset = test_dataset_load,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "../demo_result/gnnwr_models",
                     log_path = "../demo_result/gnnwr_logs",
                     write_path = "../demo_result/gnnwr_runs"
                     )

### Loading Parameters

In [32]:
gnnwr_load.load_model('../demo_result/gnnwr_models/GNNWR_PM25.pkl')
gnnwr_load.result()

--------------------Model Information-----------------
Model Name:           | GNNWR_PM25
independent variable: | ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp']
dependent variable:   | ['PM2_5']

OLS coefficients: 
x0: 5.00462
x1: -10.35234
x2: -0.98542
x3: 23.47110
x4: 29.09492
x5: -27.67621
Intercept: 24.20848

--------------------Result Information----------------
Test Loss: |                  23.94857
Test R2  : |                   0.88655
Train R2 : |                   0.91600
Valid R2 : |                   0.84527
RMSE: |                        4.89373
AIC:  |                     1698.32324
AICc: |                     1702.86877
F1:   |                        0.03772
F2:   |                        2.46101
f3_param_0: |                 15.00357
f3_param_1: |                  2.54759
f3_param_2: |                  5.36124
f3_param_3: |                 76.11821
f3_param_4: |                537.18396
f3_param_5: |                165.37825
f3_param_6: |                 34.26014


# Part E：Estimation

## Step 1：Import Estimation Data

In [33]:
pred_data = pd.read_csv('../data/pm25_predict_data.csv')
pred_data

Unnamed: 0,station_id,lng,lat,date,PM2.5,row_index,col_index,proj_x,proj_y,dem,...,t2m,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi
0,1001A,116.3660,39.8673,20170930,56.357143,2201.0,6867.0,1.650848e+06,1.370268e+06,46,...,294.224304,100287.671875,0.000051,64.583054,-0.000007,52.682091,0.384257,0.784808,0.762762,3443
1,1002A,116.1700,40.2865,20170930,47.148148,2134.0,6835.0,1.625004e+06,1.416960e+06,420,...,292.293274,96752.507812,0.000304,40.621140,-0.000007,62.529091,-0.156175,-0.537717,0.574785,7810
2,1003A,116.4340,39.9522,20170930,53.857143,2188.0,6877.0,1.653777e+06,1.381524e+06,48,...,294.010468,100307.703125,0.000058,60.242908,-0.000007,52.126640,0.093867,0.617515,0.796827,3328
3,1004A,116.4340,39.8745,20170930,46.333333,2200.0,6877.0,1.655828e+06,1.372270e+06,45,...,294.296631,100410.367188,0.000047,69.535637,-0.000008,51.301529,0.197439,0.893495,0.758839,4535
4,1005A,116.4730,39.9716,20170930,52.203704,2185.0,6884.0,1.656225e+06,1.384492e+06,40,...,293.959381,100355.054688,0.000059,62.281456,-0.000007,51.071964,-0.060543,0.634863,0.760148,3901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1405,3059A,117.7181,36.2092,20170930,37.803571,2787.0,7083.0,1.855411e+06,9.608211e+05,203,...,293.371246,98542.398438,0.000471,162.070038,-0.000009,72.421783,-0.726250,0.820030,0.607255,3994
1406,3061A,119.7512,49.1577,20170930,7.196429,715.0,7408.0,1.623754e+06,2.533062e+06,613,...,282.403046,93387.531250,0.001373,347.414734,-0.000011,63.961544,1.295761,0.804451,0.138062,2808
1407,3064A,117.6983,36.2211,20170930,27.846154,2785.0,7080.0,1.853493e+06,9.618501e+05,212,...,293.411987,98633.031250,0.000472,160.446884,-0.000009,72.422592,-0.701013,0.841404,0.633351,3521
1408,3065A,101.9533,29.9972,20170930,12.333333,3781.0,4561.0,6.252221e+05,2.179838e+04,3047,...,279.839996,62988.273438,0.002760,187.390488,-0.000008,99.212151,-0.841196,0.436649,0.209526,6370


## Step 2：Initialize Estimation Dataset

In [None]:
pred_dataset = datasets.init_predict_dataset(data = pred_data, reference_data=train_data, x_columns=['dem', 'w10','d10','t2m','aod_sat','tp'],spatial_columns=['lng','lat'])



## Step 3：Estimate

In [44]:
res, coeff = gnnwr.predict(pred_dataset)
res



Unnamed: 0,station_id,lng,lat,date,PM2.5,row_index,col_index,proj_x,proj_y,dem,...,blh,e,r,u10,v10,aod_sat,ndvi,__idx__,pred_result,denormalized_pred_result
0,1001A,116.3660,39.8673,20170930,56.357143,2201.0,6867.0,1.650848e+06,1.370268e+06,46,...,64.583054,-0.000007,52.682091,0.384257,0.784808,0.762762,3443,0,-6.822932e+05,-6.822932e+05
1,1002A,116.1700,40.2865,20170930,47.148148,2134.0,6835.0,1.625004e+06,1.416960e+06,420,...,40.621140,-0.000007,62.529091,-0.156175,-0.537717,0.574785,7810,1,-2.345016e+06,-2.345016e+06
2,1003A,116.4340,39.9522,20170930,53.857143,2188.0,6877.0,1.653777e+06,1.381524e+06,48,...,60.242908,-0.000007,52.126640,0.093867,0.617515,0.796827,3328,2,-6.717021e+05,-6.717021e+05
3,1004A,116.4340,39.8745,20170930,46.333333,2200.0,6877.0,1.655828e+06,1.372270e+06,45,...,69.535637,-0.000008,51.301529,0.197439,0.893495,0.758839,4535,3,-6.638534e+05,-6.638534e+05
4,1005A,116.4730,39.9716,20170930,52.203704,2185.0,6884.0,1.656225e+06,1.384492e+06,40,...,62.281456,-0.000007,51.071964,-0.060543,0.634863,0.760148,3901,4,-8.131592e+05,-8.131592e+05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1405,3059A,117.7181,36.2092,20170930,37.803571,2787.0,7083.0,1.855411e+06,9.608211e+05,203,...,162.070038,-0.000009,72.421783,-0.726250,0.820030,0.607255,3994,1405,-1.304270e+06,-1.304270e+06
1406,3061A,119.7512,49.1577,20170930,7.196429,715.0,7408.0,1.623754e+06,2.533062e+06,613,...,347.414734,-0.000011,63.961544,1.295761,0.804451,0.138062,2808,1406,-2.769652e+06,-2.769652e+06
1407,3064A,117.6983,36.2211,20170930,27.846154,2785.0,7080.0,1.853493e+06,9.618501e+05,212,...,160.446884,-0.000009,72.422592,-0.701013,0.841404,0.633351,3521,1407,-1.342005e+06,-1.342005e+06
1408,3065A,101.9533,29.9972,20170930,12.333333,3781.0,4561.0,6.252221e+05,2.179838e+04,3047,...,187.390488,-0.000008,99.212151,-0.841196,0.436649,0.209526,6370,1408,-7.246872e+06,-7.246872e+06


## Step 4：Display

In [45]:
visualizer.dot_map(data=res,lon_column='lng',lat_column='lat',y_column='pred_result',colors=['blue','green','red'])