# PM2.5 concentration spatial distribution estimation in China 
## Abstract  

China, the world's largest developing country, has faced increasingly severe air quality problems as industrialization and urbanization have advanced. According to the “2017 China Environmental Situation Bulletin” released by the Ministry of Ecology and Environment, 239 of the 338 cities at the prefecture level and above exceeded the national ambient air quality standards, a share of more than 70%. Poor air quality affects people's daily travel and physical health, constrains sustainable economic development, and has become a major concern for both the public and the government. PM2.5, respirable particulate matter with an aerodynamic diameter of no more than 2.5 micrometers, is one of the main indicators used to evaluate air quality. A comprehensive understanding of the spatial distribution of PM2.5 concentration, which reflects the spatial process and environmental behavior of atmospheric pollution, is important for pollution monitoring and early warning, comprehensive treatment, the protection of human health, and sustainable social development.

By the end of 2017, the China National Environmental Monitoring Center had established over 400 ground-level air quality monitoring stations and was releasing hourly air quality data, including PM2.5, that provide accurate and reliable real-time measurements. However, because the stations are unevenly distributed and their spatial coverage is low, station data alone are not sufficient for analyzing and mining the spatiotemporal patterns of PM2.5 in depth. Satellite remote sensing, by contrast, yields high-coverage spatial datasets of the atmospheric environment, such as aerosol optical depth (AOD). Numerous studies have shown a strong correlation between AOD and PM2.5 concentration, so modeling the spatial regression relationship between PM2.5 concentration and factors such as remotely sensed AOD offers an effective way to obtain the spatial distribution of PM2.5 concentration across the entire study area.

Building on the idea behind geographically weighted regression (GWR), Wu SenSen combined ordinary linear regression (OLR) with a neural network to propose the Geographically Neural Network Weighted Regression (GNNWR) model. By exploiting the learning ability of neural networks, the model can handle both the spatial heterogeneity and the complex nonlinear characteristics of regression relationships, and it achieves better fitting accuracy and prediction performance than models such as OLR and GWR. The purpose of this case is to build a GNNWR-based spatial estimation model for PM2.5 concentration that accurately fits the spatial heterogeneity and nonlinearity of the PM2.5 regression relationship, and then to obtain an accurate and reasonable spatial distribution of PM2.5 concentration in China.

## Data description  

Many studies have shown that incorporating meteorological conditions (temperature, precipitation, wind speed, and wind direction) and surface elevation can further improve the accuracy of PM2.5 spatial estimation. In this case, in addition to AOD, temperature (TEMP), precipitation (TP), wind speed (WS), wind direction (WD), and surface elevation (DEM) are used as input variables for the model. The analysis uses annual averages for 2017:

(1) PM2.5 monitoring site data. Hourly PM2.5 concentration observations from January 1, 2017 to December 31, 2017 were obtained from the China National Environmental Monitoring Center. PM2.5 concentrations were measured using the tapered element oscillating microbalance (TEOM) method or the beta attenuation method, in accordance with national standard GB 3095-2012. The PM2.5 data were averaged over the year.

(2) Aerosol data. Aerosol data were obtained from the LAADS website, including the Terra and Aqua Dark Target retrieval products at 3 km resolution (MOD04_3K and MYD04_3K) and the Deep Blue retrieval products at 10 km resolution (MOD04_L2 and MYD04_L2). In this case, the 3 km AOD products are the main data source for PM2.5 estimation. Where a value is missing in the 3 km data, the 10 km product is resampled to the matching resolution and substituted wherever possible, to keep the AOD data reliable.

(3) DEM data. DEM data were obtained from NOAA's ETOPO1 global surface elevation model, which has a resolution of 1 arc minute.

(4) Meteorological data. Temperature, precipitation, wind speed, and wind direction were obtained from the ERA5 global climate reanalysis provided by ECMWF, as hourly gridded data at a resolution of 0.5 degrees.
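
ERA5 provides near-surface wind as an eastward and a northward component (`u10` and `v10`), so wind speed and wind direction have to be derived from them. The sketch below shows one common way to do this. The function name `wind_speed_direction` is hypothetical, and the meteorological convention used for direction (the bearing the wind blows from) is an assumption, since this case does not state how its wind speed and direction inputs were computed.

```python
import math

def wind_speed_direction(u10, v10):
    """Return (speed, direction_deg) from eastward (u10) and northward (v10) components.

    Assumed convention: direction is the compass bearing the wind blows FROM,
    in degrees clockwise from north (0 = north, 90 = east).
    """
    speed = math.hypot(u10, v10)
    direction = (180.0 + math.degrees(math.atan2(u10, v10))) % 360.0
    return speed, direction

# A wind blowing toward the east (u10 > 0, v10 = 0) comes from the west
speed, direction = wind_speed_direction(1.0, 0.0)
print(speed, direction)
```

Because direction is a circular quantity, averaging it over a year is usually done on the components rather than on the angles themselves.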

## Model Introduction  

Following the same geographically weighted idea as GWR, the GNNWR model treats the spatial heterogeneity of the regression relationship as spatial nonstationarity that modulates the global "OLR regression relationship" to a different degree at each location. For the PM2.5 spatial estimation experiment in this case, the GNNWR model structure is therefore defined as follows:
 
![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059648746_1.png)  

In this equation, (u_i, v_i) are the spatial coordinates of the i-th sample point, and β = (β_0, β_1, ..., β_6) are the regression coefficients of the OLR model, reflecting the average level of the PM2.5 regression relationship for the entire region. The estimation matrix of OLR coefficients is represented as follows:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059665465_2.png)  

where:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059673642_3.png)  




![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694003342595_1.png)  
The spatial estimation model for PM2.5 concentration based on GNNWR
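
The structure described above can be illustrated numerically. In the sketch below, the global OLR coefficients are estimated by least squares, and the prediction at each point is formed as ŷ_i = Σ_k w_k(u_i, v_i) · β_k · x_ik with x_i0 = 1. This is a simplified illustration, not the library's implementation: the function names are hypothetical, the data are synthetic, and the spatial weights are passed in directly as an array, whereas in GNNWR they would come from the neural network.

```python
import numpy as np

def ols_coefficients(X, y):
    """Global OLR coefficients (intercept first) estimated by least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def gnnwr_predict(X, beta_ols, weights):
    """y_i = sum_k weights[i, k] * beta_ols[k] * x_ik, with x_i0 = 1."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.sum(weights * beta_ols * X1, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                  # six predictors, as in this case
true_beta = np.array([2.0, 1.0, -1.0, 0.5, 0.0, 3.0, -2.0])
y = true_beta[0] + X @ true_beta[1:]          # noise-free synthetic target

beta_ols = ols_coefficients(X, y)
ones = np.ones((50, 7))                       # a weight of 1 for every point and term
y_hat = gnnwr_predict(X, beta_ols, ones)      # reduces to the plain OLR prediction
print(np.allclose(y_hat, y))
```

With all weights equal to 1 the model collapses to OLR, which makes the role of the weights clear: they rescale each global coefficient locally.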

## Main Content  
1. Model Training 
2. Result Storage, Loading, and Visualization
3. Estimation 

# Part A：Preparation

## Import Necessary Packages

In [1]:
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
from src.gnnwr import models,datasets,utils
import pandas as pd
import torch.nn as nn

# Part B：Model Training

## Step 1：Import Training Data

In [2]:
data = pd.read_csv('../data/pm25_data.csv')
data.head(5)

Unnamed: 0,station_id,lng,lat,date,PM2_5,row_index,col_index,proj_x,proj_y,dem,...,t2m,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi
0,1001A,116.366,39.8673,20170601,54.733894,2201,6867,1650847.552,1370268.366,46,...,284.561066,100809.2734,0.001006,134.995636,-7e-06,46.315975,0.425366,0.170262,0.870967,2401
1,1002A,116.17,40.2865,20170601,48.080737,2134,6835,1625003.973,1416959.964,420,...,282.907684,97125.08594,0.001044,157.77597,-6e-06,53.605503,0.211734,-0.676848,0.71208,5255
2,1003A,116.434,39.9522,20170601,54.898592,2188,6877,1653776.71,1381524.305,48,...,284.492249,100830.9688,0.001002,129.971298,-7e-06,45.537464,0.266666,0.069172,0.875811,2609
3,1004A,116.434,39.8745,20170601,52.266382,2200,6877,1655828.045,1372270.098,45,...,284.6362,100936.8047,0.00101,138.793961,-7e-06,45.387913,0.299403,0.22795,0.869679,2420
4,1005A,116.473,39.9716,20170601,53.189076,2185,6884,1656224.681,1384491.842,40,...,284.506561,100880.1797,0.001019,130.520599,-7e-06,44.790119,0.169121,0.079546,0.873232,3296


In [3]:
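# Split the rows in file order: first 80% for training, next 10% for validation, last 10% for testing
train_data = data[:int(0.8*len(data))]
print(len(train_data))
val_data = data[int(0.8*len(data)):int(0.9*len(data))]
print(len(val_data))
test_data = data[int(0.9*len(data)):]
print(len(test_data))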

1126
141
141
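
The slicing above takes rows in file order. If the file is sorted in some way (for example by station), a shuffled split may be preferable. The sketch below reproduces the same cut points on row indices and adds an optional reproducible shuffle. The function name `split_indices` and the seed value are hypothetical, and 1408 is the total row count implied by the three counts printed above.

```python
import random

def split_indices(n_rows, train_end=0.8, val_end=0.9, shuffle=False, seed=42):
    """Return (train, val, test) index lists using int() cut points as above."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(indices)  # reproducible shuffle
    cut_train = int(train_end * n_rows)
    cut_val = int(val_end * n_rows)
    return indices[:cut_train], indices[cut_train:cut_val], indices[cut_val:]

train_idx, val_idx, test_idx = split_indices(1408)
print(len(train_idx), len(val_idx), len(test_idx))  # 1126 141 141
```

The index lists can then be applied to the table with `data.iloc[train_idx]` and so on.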


### View the distribution of training data

## Step 2：Partition Datasets

In [6]:
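# Predictors, target, and the projected coordinates used as spatial inputs
x_column=['dem', 'w10','d10','t2m','aod_sat','tp']
y_column=['PM2_5']
spatial_column=['proj_x','proj_y']
# Alternative (disabled): pass the full table and let the library split it by ratio
# train_set, val_set, test_set = datasets.init_dataset_usedata(data=data,
#                                             test_ratio=0.2,
#                                             valid_ratio=0.2,
#                                             x_column=x_column,
#                                             y_column=y_column,
#                                             spatial_column=spatial_column,
#                                             use_model="gnnwr",
#                                             sample_seed=42)
# Build the three datasets from the pre-split tables created in Step 1
train_set, val_set, test_set = datasets.init_dataset_usedata(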
                                            train_data=train_data,
                                            val_data=val_data,
                                            test_data=test_data,
                                            x_column=x_column,
                                            y_column=y_column,
                                            spatial_column=spatial_column,
                                            use_model="gnnwr")

1126 141 141
1126 141 141
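
In the GNNWR design, the spatial weights of a sample are learned from its distances to the training points, which is why `spatial_column` matters. The sketch below computes such a distance matrix in plain Python. It is an assumption about what this preprocessing amounts to, not the library's code: the function name `distance_matrix` is hypothetical, and treating `proj_x`/`proj_y` as metres is a guess based on the magnitude of the values.

```python
import math

def distance_matrix(points, reference_points):
    """Euclidean distance from every point to every reference point."""
    return [[math.dist(p, r) for r in reference_points] for p in points]

# Projected coordinates (proj_x, proj_y) of the first three stations in the table above
train_xy = [(1650847.552, 1370268.366),
            (1625003.973, 1416959.964),
            (1653776.710, 1381524.305)]

dist = distance_matrix(train_xy, train_xy)
for row in dist:
    # kilometres, assuming the coordinates are in metres
    print([round(d / 1000.0, 1) for d in row])
```

Using projected coordinates rather than longitude and latitude keeps these distances in consistent linear units.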


## Step 3：Initialize GNNWR Model

In [7]:
optimizer_params = {
    "scheduler":"MultiStepLR",
    "scheduler_milestones":[500, 1000, 1500, 2000],
    "scheduler_gamma":0.75,
}
gnnwr = models.GNNWR(train_dataset = train_set,
                     valid_dataset = val_set, 
                     test_dataset = test_set,
                     dense_layers = [1024, 512, 256],
                     activate_func = nn.PReLU(init=0.4),
                     start_lr = 0.1,
                     optimizer = "Adadelta",
                     model_name = "GNNWR_PM25",
                     model_save_path = "../demo_result/gnnwr_models",
                     log_path = "../demo_result/gnnwr_logs",
                     write_path = "../demo_result/gnnwr_runs", # change this path as needed
                     optimizer_params = optimizer_params
                     )
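
To see what the scheduler settings imply, the sketch below evaluates a multi-step schedule in plain Python: the learning rate is multiplied by `gamma` each time a milestone epoch is reached, which is how PyTorch's `MultiStepLR` behaves. It is assumed here that `scheduler_milestones` and `scheduler_gamma` are passed through to that scheduler; the helper name `multistep_lr` is hypothetical.

```python
from bisect import bisect_right

def multistep_lr(epoch, start_lr=0.1, milestones=(500, 1000, 1500, 2000), gamma=0.75):
    """Learning rate at `epoch`: start_lr * gamma ** (number of milestones reached)."""
    return start_lr * gamma ** bisect_right(milestones, epoch)

for epoch in (0, 499, 500, 1036, 2500):
    print(epoch, f"{multistep_lr(epoch):.5f}")
```

At epoch 1036 two milestones have been passed, giving 0.1 × 0.75² = 0.05625, which agrees with the `Learning Rate=0.0563` reported in the training log of Step 4.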

In [8]:
gnnwr.add_graph()

Add Graph Successfully


## Step 4：Model Training

In [9]:
gnnwr.run(max_epoch = 4000,early_stop=1000,print_frequency = 500)

  0%|          | 0/4000 [00:00<?, ?it/s]

 26%|██▌       | 1036/4000 [18:54<54:06,  1.10s/it, Train Loss=20.382665, Train R2=0.882499, Train AIC=tensor(6590.9253, grad_fn=<AddBackward0>), Valid Loss=145, Valid R2=-0.069, Best Valid R2=0.402, Learning Rate=0.0563]    


Training stop! Model has not been improved for over 1000 epochs.
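
The message above reflects patience-based early stopping: training halts once the validation score has gone more than `early_stop` epochs without a new best. The sketch below illustrates that logic on a short list of scores. The function name is hypothetical, and monitoring a higher-is-better score such as validation R2 is an assumption suggested by the `Best Valid R2` field in the log.

```python
def epochs_until_stop(valid_scores, early_stop):
    """Return the 1-based epoch at which training stops, or None if it never stops.

    valid_scores: validation score per epoch (higher is better, e.g. R2).
    early_stop:   number of epochs without a new best score that is tolerated.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(valid_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement > early_stop:
            return epoch
    return None

scores = [0.10, 0.30, 0.40, 0.35, 0.20, 0.38, 0.39]
print(epochs_until_stop(scores, early_stop=3))  # 7
```

In the example the best score occurs at epoch 3, and epoch 7 is the fourth consecutive epoch without improvement, so training stops there.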


In [7]:
gnnwr.load_model('../demo_result/gnnwr_models/GNNWR_PM25.pkl')

In [8]:
# gnnwr.run(max_epoch = 200,print_frequency = 100)

## Step 5：Model Evaluation and Result Analysis

### Output the model training results

In [9]:
gnnwr.result()

--------------------Model Information-----------------
Model Name:           | GNNWR_PM25
independent variable: | ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp']
dependent variable:   | ['PM2_5']

OLS coefficients: 
x0: 7.20675
x1: -7.51571
x2: 0.00989
x3: 20.22680
x4: 29.89851
x5: -26.85803
Intercept: 24.67621

--------------------Result Information----------------
Test Loss: |                  48.51078
Test R2  : |                   0.77020
Train R2 : |                   0.83541
Valid R2 : |                   0.87624
RMSE: |                        6.96497
AIC:  |                     1901.94502
AICc: |                     1861.75977
F1:   |                        0.03645
F2:   |                        1.14678
f3_param_0: |                  7.40749
f3_param_1: |                  1.25752
f3_param_2: |                  2.64681
f3_param_3: |                 37.57907
f3_param_4: |                265.19223
f3_param_5: |                 81.64454
f3_param_6: |                 24.89878
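
Two of the reported metrics are easy to reproduce by hand. The sketch below defines mean squared error and R2 in plain Python and checks that the reported RMSE is the square root of the reported test loss, which suggests (but does not prove) that the loss is a mean squared error. The sample values in `y_true` and `y_pred` are illustrative only and are not taken from the model output.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [54.7, 48.1, 54.9, 52.3, 53.2]   # illustrative values only
y_pred = [53.0, 49.5, 55.4, 51.0, 54.0]
print(math.sqrt(mse(y_true, y_pred)), r2_score(y_true, y_pred))

# Consistency check on the reported numbers: sqrt(Test Loss) vs. RMSE
print(round(math.sqrt(48.51078), 5))
```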


### Save the training result

In [10]:
gnnwr.reg_result('../demo_result/GNNWR_PM25_Result.csv')

Unnamed: 0,coef_dem,coef_w10,coef_d10,coef_t2m,coef_aod_sat,coef_tp,bias,Pred_PM2_5,id,dataset_belong,denormalized_pred_result
0,-45.557182,-0.775715,-0.149798,27.540590,31.191809,-12.291560,13.318878,45.728504,133,train,45.728504
1,-34.701672,-1.025062,-0.140959,13.398714,64.744911,0.236278,-0.443000,50.383171,762,train,50.383171
2,-33.250767,-9.450571,-0.085379,9.757741,28.645323,1.396420,12.682632,25.448702,242,train,25.448702
3,-52.963104,28.796860,-0.076136,23.482132,-15.975051,-15.203238,41.262192,38.852043,850,train,38.852043
4,-34.232338,-10.483487,-0.094262,7.047032,38.991272,-1.042216,12.211046,28.968010,1388,train,28.968010
...,...,...,...,...,...,...,...,...,...,...,...
1403,-23.156355,-10.124598,-0.127857,8.816700,35.762028,8.281764,16.007805,22.852810,1197,test,22.852810
1404,-19.226147,2.163573,-0.098970,-7.150615,45.480740,18.506145,8.557326,39.327358,130,test,39.327358
1405,-85.870758,24.653465,-0.161381,24.354698,-16.807131,2.630647,29.552315,41.787109,288,test,41.787109
1406,-11.194925,20.109470,-0.071278,-8.602604,63.145027,-7.025927,-1.339907,28.228340,883,test,28.228340
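
Each row of the saved result holds the local coefficients for one sample, so a quick first look is to average a coefficient within each split. The sketch below does this with the standard library on a small in-memory excerpt copied from the rows shown above (three columns, five rows). The helper name `mean_by_group` is hypothetical; in practice the same function could be applied to the text of the saved CSV file.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the saved result (three columns only), copied from the rows above
excerpt = """coef_aod_sat,Pred_PM2_5,dataset_belong
31.191809,45.728504,train
64.744911,50.383171,train
28.645323,25.448702,train
35.762028,22.852810,test
45.480740,39.327358,test
"""

def mean_by_group(csv_text, value_column, group_column="dataset_belong"):
    """Average `value_column` separately for each value of `group_column`."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row[group_column]]
        entry[0] += float(row[value_column])
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

print(mean_by_group(excerpt, "coef_aod_sat"))
```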


# Part C：Visualization

In [11]:
visualizer = utils.Visualize(data=gnnwr,lon_lat_columns=['lng','lat'])

### Drawing the distribution of datasets

### All Data

In [12]:
visualizer.display_dataset(name='all',y_column='PM2_5')

### Train

In [13]:
visualizer.display_dataset(name='train',y_column='PM2_5')

### Validation

In [14]:
visualizer.display_dataset(name='valid',y_column='PM2_5')

### Test

In [15]:
visualizer.display_dataset(name='test',y_column='PM2_5')

### Drawing heat maps of the spatial distribution of each variable's coefficient

#### DEM

In [16]:
visualizer.coefs_heatmap('coef_dem')

#### AOD

In [17]:
visualizer.coefs_heatmap('coef_aod_sat')

#### Precipitation

In [18]:
visualizer.coefs_heatmap('coef_tp')

#### Temperature

In [19]:
visualizer.coefs_heatmap('coef_t2m')

#### Wind Speed

In [20]:
visualizer.coefs_heatmap('coef_w10')

#### Wind Direction

In [21]:
visualizer.coefs_heatmap('coef_d10')

#### Bias

In [22]:
visualizer.coefs_heatmap('bias')

# Part D：Saving and Loading

## Step 1：Saving Datasets

In [23]:
# exist_ok=True allows saving even if the target directory already exists
train_set.save('../demo_result/gnnwr_datasets/train_dataset', exist_ok=True)
val_set.save('../demo_result/gnnwr_datasets/val_dataset', exist_ok=True)
test_set.save('../demo_result/gnnwr_datasets/test_dataset', exist_ok=True)

## Step 2：Loading Datasets

In [24]:
train_dataset_load = datasets.load_dataset('../demo_result/gnnwr_datasets/train_dataset/')
val_dataset_load = datasets.load_dataset('../demo_result/gnnwr_datasets/val_dataset/')
test_dataset_load = datasets.load_dataset('../demo_result/gnnwr_datasets/test_dataset/')

## Step 3：Loading Model

### Initialize GNNWR Model

In [25]:
gnnwr_load = models.GNNWR(train_dataset = train_dataset_load,
                     valid_dataset = val_dataset_load, 
                     test_dataset = test_dataset_load,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "../demo_result/gnnwr_models",
                     log_path = "../demo_result/gnnwr_logs",
                     write_path = "../demo_result/gnnwr_runs"
                     )

### Loading Parameters

In [26]:
gnnwr_load.load_model('../demo_result/gnnwr_models/GNNWR_PM25.pkl')
gnnwr_load.result()

--------------------Model Information-----------------
Model Name:           | GNNWR_PM25
independent variable: | ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp']
dependent variable:   | ['PM2_5']

OLS coefficients: 
x0: 7.20675
x1: -7.51571
x2: 0.00989
x3: 20.22680
x4: 29.89851
x5: -26.85803
Intercept: 24.67621

--------------------Result Information----------------
Test Loss: |                  48.51079
Test R2  : |                   0.77020
Train R2 : |                   0.83541
Valid R2 : |                   0.87624
RMSE: |                        6.96497
AIC:  |                     1901.94505
AICc: |                     1861.75977
F1:   |                        0.03645
F2:   |                        1.14681
f3_param_0: |                  7.40758
f3_param_1: |                  1.25753
f3_param_2: |                  2.64686
f3_param_3: |                 37.57855
f3_param_4: |                265.19437
f3_param_5: |                 81.64403
f3_param_6: |                 24.89906
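
The reloaded model reproduces the earlier metrics up to small differences in the last digits. The sketch below compares a few values from the two printouts with a relative tolerance. The helper name `mismatches` and the chosen tolerances are illustrative assumptions; the numbers are copied from the two result listings in this case.

```python
import math

original = {"Test Loss": 48.51078, "Test R2": 0.77020, "RMSE": 6.96497,
            "AIC": 1901.94502, "F2": 1.14678, "f3_param_3": 37.57907}
reloaded = {"Test Loss": 48.51079, "Test R2": 0.77020, "RMSE": 6.96497,
            "AIC": 1901.94505, "F2": 1.14681, "f3_param_3": 37.57855}

def mismatches(a, b, rel_tol=1e-4):
    """Names of metrics whose values differ by more than rel_tol (relative)."""
    return [name for name in a if not math.isclose(a[name], b[name], rel_tol=rel_tol)]

print(mismatches(original, reloaded))                 # []
print(mismatches(original, reloaded, rel_tol=1e-6))   # ['F2', 'f3_param_3']
```

At a relative tolerance of 1e-4 every listed metric agrees; only at a much tighter tolerance do the diagnostic statistics show up as different.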


# Part E：Estimation

## Step 1：Import Estimation Data

In [27]:
pred_data = pd.read_csv('../data/pm25_predict_data.csv')

## Step 2：Initialize Estimation Dataset

In [28]:
pred_dataset = datasets.init_predict_dataset(data = pred_data,
                                             train_dataset = train_set,
                                             x_column = ['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp'],
                                             spatial_column = ['lng', 'lat'])

## Step 3：Estimate

In [29]:
res = gnnwr.predict(pred_dataset)
res.head(5)

Unnamed: 0,station_id,lng,lat,date,PM2.5,row_index,col_index,proj_x,proj_y,dem,...,tp,blh,e,r,u10,v10,aod_sat,ndvi,pred_result,denormalized_pred_result
0,1001A,116.366,39.8673,20170930,56.357143,2201.0,6867.0,1650848.0,1370268.0,46,...,5.1e-05,64.583054,-7e-06,52.682091,0.384257,0.784808,0.762762,3443,100.403412,100.403412
1,1002A,116.17,40.2865,20170930,47.148148,2134.0,6835.0,1625004.0,1416960.0,420,...,0.000304,40.62114,-7e-06,62.529091,-0.156175,-0.537717,0.574785,7810,91.639618,91.639618
2,1003A,116.434,39.9522,20170930,53.857143,2188.0,6877.0,1653777.0,1381524.0,48,...,5.8e-05,60.242908,-7e-06,52.12664,0.093867,0.617515,0.796827,3328,86.25441,86.25441
3,1004A,116.434,39.8745,20170930,46.333333,2200.0,6877.0,1655828.0,1372270.0,45,...,4.7e-05,69.535637,-8e-06,51.301529,0.197439,0.893495,0.758839,4535,102.564621,102.564621
4,1005A,116.473,39.9716,20170930,52.203704,2185.0,6884.0,1656225.0,1384492.0,40,...,5.9e-05,62.281456,-7e-06,51.071964,-0.060543,0.634863,0.760148,3901,96.772766,96.772766


## Step 4：Display

In [30]:
visualizer.dot_map(data=res,lon_column='lng',lat_column='lat',y_column='pred_result',colors=['blue','green','red'])