# PM2.5 concentration spatial distribution estimation in China 
## Abstract  

As the world's largest developing country, China has faced increasingly severe air quality problems as industrialization and urbanization have advanced. According to the “2017 China Environmental Situation Bulletin” released by the Ministry of Ecology and Environment, 239 of the 338 cities at the prefecture level and above exceeded the national ambient air quality standards, a share of more than 70%. Poor air quality affects daily travel and physical health, constrains sustainable economic development, and has become a major concern for the public and the government. PM2.5, respirable particulate matter with an aerodynamic diameter of no more than 2.5 micrometers, is one of the main indicators for evaluating air quality. A comprehensive understanding of the spatial distribution of PM2.5 concentration, which reflects the spatial processes and environmental behavior of atmospheric pollution, is valuable for pollution monitoring and early warning, comprehensive treatment, the protection of human health, and sustainable social development.

By the end of 2017, the China National Environmental Monitoring Center had established over 400 ground-level air quality monitoring stations and released hourly air quality monitoring data including PM2.5, providing high-precision, reliable real-time measurements. However, because ground monitoring stations are unevenly distributed and cover only a small part of the country, station data alone are difficult to analyze and mine for spatiotemporal patterns. Satellite remote sensing, by contrast, yields high-coverage spatial datasets of the atmospheric environment, such as aerosol optical depth (AOD). Numerous studies have shown a strong correlation between AOD and PM2.5 concentration, so modeling the spatial regression relationship between PM2.5 concentration and factors such as remotely sensed AOD offers an effective way to obtain the spatial distribution of PM2.5 concentration over an entire study area.

Building on the idea of geographically weighted regression (GWR), Wu SenSen combined ordinary linear regression (OLR) with a neural network to propose the Geographically Neural Network Weighted Regression (GNNWR) model. By exploiting the learning ability of neural networks, this model can handle both the spatial heterogeneity and the complex nonlinearity of regression relationships, giving it better fitting accuracy and predictive performance than models such as OLR and GWR. The purpose of this case is to build a GNNWR-based spatial estimation model for PM2.5 concentration that accurately fits the spatial heterogeneity and nonlinearity of the PM2.5 regression relationship, and then to derive a high-precision, plausible spatial distribution of PM2.5 concentration across China.

## Data description  

Many studies have shown that incorporating meteorological conditions such as temperature, precipitation, wind speed, and wind direction, together with surface elevation, can further improve the accuracy of PM2.5 spatial estimation. In this case, in addition to AOD as an auxiliary factor, temperature (TEMP), precipitation (TP), wind speed (WS), wind direction (WD), and surface elevation (DEM) are used as input variables for the model. All variables are averaged over the year 2017:

(1) PM2.5 monitoring site data. Hourly PM2.5 concentration observations from January 1, 2017 to December 31, 2017 were obtained from the China National Environmental Monitoring Center. PM2.5 concentrations were measured using the tapered element oscillating microbalance (TEOM) or beta attenuation method, following national standard GB3095-2012. The hourly values were averaged to an annual mean.

(2) Aerosol data. Aerosol data were obtained from the LAADS website and include the Terra and Aqua Dark Target retrieval products at 3 km resolution (MOD04_3K and MYD04_3K), as well as the Deep Blue retrieval products at 10 km resolution (MOD04_L2 and MYD04_L2). The 3 km AOD products are the main data source for PM2.5 estimation. Where the 3 km data have a missing value, the 10 km product is resampled to the matching resolution and substituted wherever possible, to keep the AOD data reliable.

(3) DEM data. DEM data were obtained from NOAA's ETOPO1 global surface elevation model at a resolution of 1 arc-minute.

(4) Meteorological data. Temperature, precipitation, wind speed, and wind direction were obtained from the ERA5 global climate reanalysis provided by ECMWF, as hourly gridded data at a resolution of 0.5 degrees.
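The AOD gap-filling rule in item (2) can be sketched as follows. This is a minimal illustration, assuming the 10 km product has already been resampled onto the 3 km grid and that missing retrievals are stored as NaN. The array names and the helper `fill_aod_gaps` are hypothetical and do not come from any product API or from this case's preprocessing code.

```python
import numpy as np

def fill_aod_gaps(aod_3km, aod_10km_resampled):
    """Prefer the 3 km AOD value; fall back to the resampled 10 km value where it is missing."""
    aod_3km = np.asarray(aod_3km, dtype=float)
    aod_10km_resampled = np.asarray(aod_10km_resampled, dtype=float)
    if aod_3km.shape != aod_10km_resampled.shape:
        raise ValueError("both grids must have the same shape")
    # np.where keeps the 3 km value wherever it is valid (not NaN).
    return np.where(np.isnan(aod_3km), aod_10km_resampled, aod_3km)

# Toy 2x2 grids (illustrative values, not real retrievals).
aod_3km = np.array([[0.80, np.nan], [np.nan, 0.60]])
aod_10km = np.array([[0.75, 0.70], [np.nan, 0.65]])
filled = fill_aod_gaps(aod_3km, aod_10km)
print(filled)
```

Cells that are missing in both products stay NaN, so they can still be identified and excluded later.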

## Model Introduction  

Following a geographical weighting idea similar to that of GWR, the GNNWR model treats the spatial heterogeneity of the regression relationship as location-dependent spatial nonstationarity that modulates the "OLR regression relationship". In the PM2.5 concentration spatial estimation experiment of this case, the GNNWR model structure is therefore defined as follows:  
 
![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059648746_1.png)  

In this equation, (u_i, v_i) are the spatial coordinates of the i-th sample point, and β = (β_0, β_1, ..., β_6) are the regression coefficients of the OLR model (the intercept plus one coefficient for each of the six input variables), which reflect the average level of the PM2.5 regression relationship over the entire region. The estimate of the OLR coefficients is expressed as follows:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059665465_2.png)  

where:

![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694059673642_3.png)  


<br/><br/>  


![Image Name](https://mydde.deep-time.org/s3/static-files/upload/upload/1694003342595_1.png)  
The spatial estimation model for PM2.5 concentration based on GNNWR
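
The idea described in this section, global OLR coefficients rescaled by location-specific weights, can be sketched numerically. This is a simplified illustration rather than the library's implementation: it assumes the prediction takes the form ŷ_i = Σ_k w_k(u_i, v_i) · β_k · x_ik with x_i0 = 1, and it replaces the spatial weighting network with random placeholder weights. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 6                              # 50 toy samples, 6 predictors as in this case
X = rng.normal(size=(n, p))
X1 = np.hstack([np.ones((n, 1)), X])      # prepend the intercept column -> shape (n, p + 1)
y = X1 @ rng.normal(size=p + 1) + 0.1 * rng.normal(size=n)

# Global OLR estimate, beta = (X^T X)^{-1} X^T y, solved by least squares for stability.
beta_olr, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Placeholder for the network output: one weight per sample and per coefficient.
# In GNNWR these weights are learned from spatial information; here they are random values near 1.
W = 1.0 + 0.1 * rng.normal(size=(n, p + 1))

# Spatially weighted prediction: y_hat_i = sum_k w_ik * beta_k * x_ik
y_hat = np.sum(W * beta_olr * X1, axis=1)

# With every weight equal to 1 the model collapses to plain OLR.
y_olr = np.sum(np.ones_like(W) * beta_olr * X1, axis=1)
collapses_to_olr = np.allclose(y_olr, X1 @ beta_olr)
print(collapses_to_olr)
```

The last check makes the role of the weights explicit: they express how far the local regression relationship departs from the region-wide OLR relationship.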

## Main Content  
1. Model Training 
2. Result Storage, Loading, and Visualization
3. Estimation 

# Part A：Preparation

## Import Necessary Packages

In [1]:
from gnnwr import models,datasets,utils
import pandas as pd
import numpy as np
import folium
from folium.plugins import HeatMap, MarkerCluster
from demo_utils import marker_map,Heatmap
import torch.nn as nn
from sklearn.metrics import r2_score as r2
import matplotlib.pyplot as plt

# Part B：Model Training

## Step 1：Import Training Data

In [2]:
data = pd.read_csv(u'../data/pm25_data.csv')
data.head(5)

Unnamed: 0,监测点编码,监测点名称,城市,经度,纬度,date,PM2_5,row_index,col_index,proj_x,...,t2m,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi
0,1001A,万寿西宫,北京,116.366,39.8673,20170601,54.733894,2201,6867,1650847.552,...,284.561066,100809.2734,0.001006,134.995636,-7e-06,46.315975,0.425366,0.170262,0.870967,2401
1,1002A,定陵,北京,116.17,40.2865,20170601,48.080737,2134,6835,1625003.973,...,282.907684,97125.08594,0.001044,157.77597,-6e-06,53.605503,0.211734,-0.676848,0.71208,5255
2,1003A,东四,北京,116.434,39.9522,20170601,54.898592,2188,6877,1653776.71,...,284.492249,100830.9688,0.001002,129.971298,-7e-06,45.537464,0.266666,0.069172,0.875811,2609
3,1004A,天坛,北京,116.434,39.8745,20170601,52.266382,2200,6877,1655828.045,...,284.6362,100936.8047,0.00101,138.793961,-7e-06,45.387913,0.299403,0.22795,0.869679,2420
4,1005A,农展馆,北京,116.473,39.9716,20170601,53.189076,2185,6884,1656224.681,...,284.506561,100880.1797,0.001019,130.520599,-7e-06,44.790119,0.169121,0.079546,0.873232,3296
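
The table shows the `u10` and `v10` wind components, while the model inputs used later are `w10` (wind speed) and `d10` (wind direction), which fall in the elided columns. How those two columns were derived is not shown in this notebook. The sketch below uses one common conversion, assuming `u10` is the eastward component, `v10` the northward component, and direction is expressed in degrees in the meteorological convention. The helper name is hypothetical.

```python
import numpy as np

def wind_speed_direction(u10, v10):
    """Convert eastward (u) and northward (v) wind components to speed and direction.

    Direction is in degrees clockwise from north and gives where the wind
    blows FROM, in the range [0, 360).
    """
    u10 = np.asarray(u10, dtype=float)
    v10 = np.asarray(v10, dtype=float)
    speed = np.hypot(u10, v10)
    direction = (180.0 + np.degrees(np.arctan2(u10, v10))) % 360.0
    return speed, direction

# First sample: air moving toward the east, i.e. a westerly wind from 270 degrees.
# Second sample: air moving toward the north, i.e. a southerly wind from 180 degrees.
speed, direction = wind_speed_direction([3.0, 0.0], [0.0, 4.0])
print(speed, direction)
```

Note that averaging directions over a year is not the same as converting averaged components, so the order of averaging and conversion matters when preparing annual data.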


### View the distribution of training data

## Step 2：Partition Datasets

In [3]:
train_dataset, val_dataset, test_dataset = datasets.init_dataset(data=data,
                                                        test_ratio=0.15,
                                                        valid_ratio=0.15,
                                                        x_column=['dem', 'w10','d10','t2m','aod_sat','tp'],
                                                        y_column=['PM2_5'],
                                                        spatial_column=['经度','纬度'],
                                                        sample_seed=23,
                                                        batch_size=64)

x_min:[-5.0000000e+00  4.1591436e-02  3.9565850e-02  2.6959613e+02
  5.6254357e-02  3.8816700e-05];  x_max:[4.52000000e+03 3.20341086e+00 3.59605225e+02 2.97242950e+02
 1.06999075e+00 4.07377200e-03]
y_min:[3.85633803];  y_max:[133.8005618]
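
In the call above, the spatial columns `经度` and `纬度` are longitude and latitude. The printed `x_min`/`x_max` and `y_min`/`y_max` values suggest that the variables are min-max scaled, and the ratios imply a 70/15/15 split. The sketch below illustrates both steps under those assumptions. It does not reproduce the internals of `datasets.init_dataset`, and the helper names are hypothetical.

```python
import numpy as np

def split_indices(n, valid_ratio=0.15, test_ratio=0.15, seed=23):
    """Shuffle row indices with a fixed seed and cut them into train/valid/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_ratio))
    n_valid = int(round(n * valid_ratio))
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return train, valid, test

def min_max_scale(x, x_min, x_max):
    """Scale each column with the given per-column minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

X = np.random.default_rng(0).uniform(0.0, 100.0, size=(200, 6))  # toy predictors
train, valid, test = split_indices(len(X))

# Assumption: scaling statistics are taken from the training rows only.
x_min, x_max = X[train].min(axis=0), X[train].max(axis=0)
X_train = min_max_scale(X[train], x_min, x_max)
X_valid = min_max_scale(X[valid], x_min, x_max)  # may fall slightly outside [0, 1]
print(len(train), len(valid), len(test))
```

Fixing the seed, as `sample_seed=23` does in the call above, keeps the partition reproducible between runs.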


## Step 3：Initialize GNNWR Model

In [4]:
gnnwr = models.GNNWR(train_dataset = train_dataset,
                     valid_dataset = val_dataset, 
                     test_dataset = test_dataset,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "./demo_result/gnnwr_models",
                     log_path = "./demo_result/gnnwr_logs",
                     write_path = "./demo_result/gnnwr_runs"
                     )

## Step 4：Model Training

In [5]:
gnnwr.run(max_epoch = 200,print_frequency = 100)

 50%|█████     | 100/200 [00:39<00:38,  2.61it/s]


Epoch:  100
learning rate:  0.010046876765255498
Train Loss:  69.7564330312003
Train R2: 0.65364
Train RMSE: 8.35203
Train AIC: 8114.44862
Train AICc: 8124.33545
Valid Loss:  55.193511962890625
Valid R2: 0.70847 

Best R2: 0.71955 



100%|██████████| 200/200 [01:17<00:00,  2.58it/s]


Epoch:  200
learning rate:  0.1533589344962853
Train Loss:  52.21087652860958
Train R2: 0.74076
Train RMSE: 7.22571
Train AIC: 8117.78826
Train AICc: 8047.70508
Valid Loss:  66.37403869628906
Valid R2: 0.64942 

Best R2: 0.75463 






Best_r2: 0.7546305752231601
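
The log reports R2 and RMSE per epoch, and the final "Best R2" (0.75463) is higher than the validation R2 at epoch 200 (0.64942), which suggests that the best validation score seen during training is tracked separately. The sketch below shows how these quantities are commonly computed. It is an assumption about the general technique, not the library's code, and the per-epoch scores are made up.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Keep the epoch with the highest validation R2 (illustrative scores only).
valid_r2_per_epoch = [0.61, 0.72, 0.75, 0.65]
best_r2, best_epoch = -np.inf, None
for epoch, score in enumerate(valid_r2_per_epoch, start=1):
    if score > best_r2:
        best_r2, best_epoch = score, epoch  # a real loop would save a checkpoint here

print(best_epoch, best_r2)
```

Because the last epoch is not necessarily the best one, evaluating the saved best checkpoint rather than the final weights is the usual choice.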


## Step 5：Model Evaluation and Result Analysis

### Output the Model Training Results

In [6]:
gnnwr.result()

Test Loss:  58.860477447509766  Test R2:  0.7277793475083416
--------------------Result Table--------------------

Model Name:           | GNNWR_PM25
Model Structure:      |
 SWNN(
  (activate_func): PReLU(num_parameters=1)
  (fc): Sequential(
    (swnn_full0): Linear(in_features=1017, out_features=512, bias=True)
    (swnn_batc0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (swnn_acti0): PReLU(num_parameters=1)
    (swnn_drop0): Dropout(p=0.2, inplace=False)
    (swnn_full1): Linear(in_features=512, out_features=256, bias=True)
    (swnn_batc1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (swnn_acti1): PReLU(num_parameters=1)
    (swnn_drop1): Dropout(p=0.2, inplace=False)
    (swnn_full2): Linear(in_features=256, out_features=128, bias=True)
    (swnn_batc2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (swnn_acti2): PReLU(num_parameters=1)
    (swnn_drop2): Dropo
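
The first layer in the printed structure has `in_features=1017`. A plausible reading, stated here as an inference rather than something the output confirms, is that each sample is described to the network by its distances to the training points. A minimal sketch of building such a distance matrix follows. The helper name is hypothetical, and plain Euclidean distance is used for simplicity.

```python
import numpy as np

def distance_features(points, reference_points):
    """Return an (n_points, n_reference) matrix of Euclidean distances."""
    points = np.asarray(points, dtype=float)
    reference_points = np.asarray(reference_points, dtype=float)
    # Broadcasting builds all pairwise coordinate differences at once.
    diff = points[:, None, :] - reference_points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

train_xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # toy training coordinates
query_xy = np.array([[0.0, 0.0], [3.0, 0.0]])              # toy query coordinates
D = distance_features(query_xy, train_xy)
print(D)
```

Each row of `D` has one entry per training point, so the input width of such a network grows with the size of the training set.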

### Save the training result

In [7]:
gnnwr.reg_result('./demo_result/GNNWR_PM25_Result.csv')

Unnamed: 0,weight_dem,weight_w10,weight_d10,weight_t2m,weight_aod_sat,weight_tp,bias,Pred_PM2_5,id
0,-13.644902,10.718448,-0.464332,30.875172,19.719229,-5.688736,19.945663,37.840584,1294
1,-34.087116,-6.749558,-0.739227,68.362503,8.330478,-38.193954,30.013891,59.529022,652
2,-11.040514,7.844609,-0.496154,28.518747,21.163504,-2.790599,12.264205,41.560345,1185
3,-10.852250,4.566343,-0.234378,6.573699,53.894478,4.825507,-3.395134,31.345348,206
4,-14.732536,7.610086,-0.525586,42.784279,-0.110088,-6.700081,12.342926,42.178513,689
...,...,...,...,...,...,...,...,...,...
1403,-30.813250,-6.261174,-0.634639,63.929161,34.519855,-27.180733,15.000223,43.382423,849
1404,-11.073834,9.934203,-0.347188,5.773351,75.298485,6.035664,-17.502312,40.865799,149
1405,-10.515162,5.089436,-0.231722,3.804466,56.608685,8.547994,-4.869211,30.178919,216
1406,-6.759485,2.657905,-0.410325,14.731380,30.914356,9.443642,3.575957,29.868340,1204
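
One quick way to read the result table is to summarize how much each weight column varies across stations, since a wide range indicates a strongly location-dependent relationship. The sketch below does this with pandas on the first three rows copied from the table above, using a subset of the columns. It makes no assumption about how the library scales the weights.

```python
import pandas as pd

# First three rows of the saved result (subset of columns, values copied from the table).
result = pd.DataFrame({
    "weight_dem": [-13.644902, -34.087116, -11.040514],
    "weight_aod_sat": [19.719229, 8.330478, 21.163504],
    "weight_t2m": [30.875172, 68.362503, 28.518747],
    "bias": [19.945663, 30.013891, 12.264205],
})

# One row per variable with its minimum, mean and maximum over stations.
summary = result.agg(["min", "mean", "max"]).T
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

On the full file, the same summary can be computed after reading `./demo_result/GNNWR_PM25_Result.csv` with `pd.read_csv`.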


# Part C：Visualization

In [8]:
visualizer = utils.Visualize(data=gnnwr,lon_lat_columns=['经度','纬度'])

### Drawing the distribution of datasets

#### All Data

In [9]:
visualizer.display_dataset(name='all',y_column='PM2_5')

#### Train

In [10]:
visualizer.display_dataset(name='train',y_column='PM2_5')

#### Validation

In [11]:
visualizer.display_dataset(name='valid',y_column='PM2_5')

#### Test

In [12]:
visualizer.display_dataset(name='test',y_column='PM2_5')

### Drawing heat maps of the spatial distribution of each variable's weights

#### DEM

In [13]:
visualizer.weights_heatmap('weight_dem')


#### AOD

In [14]:
visualizer.weights_heatmap('weight_aod_sat')

#### Precipitation

In [15]:
visualizer.weights_heatmap('weight_tp')

#### Temperature

In [16]:
visualizer.weights_heatmap('weight_t2m')

#### Wind Speed

In [17]:
visualizer.weights_heatmap('weight_w10')

#### Wind Direction

In [18]:
visualizer.weights_heatmap('weight_d10')

#### Bias

In [19]:
visualizer.weights_heatmap('bias')

# Part D：Saving and Loading

## Step 1：Saving Datasets

In [21]:
# Make sure the target directories do not already exist before saving
train_dataset.save('./demo_result/gnnwr_datasets/train_dataset')
val_dataset.save('./demo_result/gnnwr_datasets/val_dataset')
test_dataset.save('./demo_result/gnnwr_datasets/test_dataset')
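
The comment in the cell above asks that the target directories not exist before saving. What `save` does when a directory is already present is not shown here, so a defensive check such as the following can be run first. It is a small sketch using only the standard library, and `require_fresh_dir` is a hypothetical helper.

```python
from pathlib import Path
import tempfile

def require_fresh_dir(path):
    """Raise if `path` already exists, so a save cannot silently mix with old files."""
    path = Path(path)
    if path.exists():
        raise FileExistsError(f"{path} already exists; remove it or choose another name")
    return path

# Demonstration inside a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = require_fresh_dir(Path(tmp) / "train_dataset")  # passes: not created yet
    target.mkdir()
    try:
        require_fresh_dir(target)                            # fails: now it exists
        already_existed = False
    except FileExistsError:
        already_existed = True

print(already_existed)
```

Raising an error instead of deleting the directory is a deliberate choice here: it avoids removing an earlier experiment's datasets by accident.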

## Step 2：Loading Datasets

In [22]:
train_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/train_dataset/')
val_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/val_dataset/')
test_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/test_dataset/')

## Step 3：Loading Model

### Initialize GNNWR Model

In [23]:
gnnwr_load = models.GNNWR(train_dataset = train_dataset_load,
                     valid_dataset = val_dataset_load, 
                     test_dataset = test_dataset_load,
                     dense_layers = [512, 256, 128],
                     start_lr = 0.2,
                     optimizer = "Adadelta",
                     activate_func = nn.PReLU(init=0.1),
                     model_name = "GNNWR_PM25",
                     model_save_path = "./demo_result/gnnwr_models",
                     log_path = "./demo_result/gnnwr_logs",
                     write_path = "./demo_result/gnnwr_runs"
                     )

### Loading Parameters

In [24]:
gnnwr_load.load_model('./demo_result/gnnwr_models/GNNWR_PM25.pkl')
gnnwr_load.result()

Test Loss:  58.860511779785156  Test R2:  0.7277791691585838
--------------------Result Table--------------------

Model Name:           | GNNWR_PM25
Model Structure:      |
 SWNN(
  (activate_func): PReLU(num_parameters=1)
  (fc): Sequential(
    (swnn_full0): Linear(in_features=1017, out_features=512, bias=True)
    (swnn_batc0): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (swnn_acti0): PReLU(num_parameters=1)
    (swnn_drop0): Dropout(p=0.2, inplace=False)
    (swnn_full1): Linear(in_features=512, out_features=256, bias=True)
    (swnn_batc1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (swnn_acti1): PReLU(num_parameters=1)
    (swnn_drop1): Dropout(p=0.2, inplace=False)
    (swnn_full2): Linear(in_features=256, out_features=128, bias=True)
    (swnn_batc2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (swnn_acti2): PReLU(num_parameters=1)
    (swnn_drop2): Dropo
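
The reloaded model reproduces the earlier test metrics only up to small floating-point differences (for example, a test loss of 58.860511... against 58.860477...). When checking that a model was restored correctly, a tolerance-based comparison is therefore more suitable than exact equality. A minimal sketch using the values printed in this notebook follows. The tolerance of 1e-6 is an arbitrary choice for illustration.

```python
import math

# Metrics printed by gnnwr.result() before saving and by gnnwr_load.result() after loading.
original = {"test_loss": 58.860477447509766, "test_r2": 0.7277793475083416}
reloaded = {"test_loss": 58.860511779785156, "test_r2": 0.7277791691585838}

# Compare each metric with a relative tolerance instead of ==.
matches = {key: math.isclose(original[key], reloaded[key], rel_tol=1e-6)
           for key in original}
print(matches)
```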

# Part E：Estimation

## Step 1：Import Estimation Data

In [25]:
pred_data = pd.read_csv(u'../data/pm25_predict_data.csv')

## Step 2：Initialize Estimation Dataset

In [26]:
pred_dataset = datasets.init_predict_dataset(data=pred_data,
                                             train_dataset=train_dataset,
                                             x_column=['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp'],
                                             spatial_column=['经度', '纬度'])

## Step 3：Estimate

In [27]:
res = gnnwr.predict(pred_dataset)
res.head(5)

Unnamed: 0,监测点编码,监测点名称,城市,经度,纬度,date,PM2.5,row_index,col_index,proj_x,...,sp,tp,blh,e,r,u10,v10,aod_sat,ndvi,pred_result
0,1001A,万寿西宫,北京,116.366,39.8673,20170930,56.357143,2201.0,6867.0,1650848.0,...,100287.671875,5.1e-05,64.583054,-7e-06,52.682091,0.384257,0.784808,0.762762,3443,57.820961
1,1002A,定陵,北京,116.17,40.2865,20170930,47.148148,2134.0,6835.0,1625004.0,...,96752.507812,0.000304,40.62114,-7e-06,62.529091,-0.156175,-0.537717,0.574785,7810,46.632584
2,1003A,东四,北京,116.434,39.9522,20170930,53.857143,2188.0,6877.0,1653777.0,...,100307.703125,5.8e-05,60.242908,-7e-06,52.12664,0.093867,0.617515,0.796827,3328,57.993763
3,1004A,天坛,北京,116.434,39.8745,20170930,46.333333,2200.0,6877.0,1655828.0,...,100410.367188,4.7e-05,69.535637,-8e-06,51.301529,0.197439,0.893495,0.758839,4535,57.3871
4,1005A,农展馆,北京,116.473,39.9716,20170930,52.203704,2185.0,6884.0,1656225.0,...,100355.054688,5.9e-05,62.281456,-7e-06,51.071964,-0.060543,0.634863,0.760148,3901,56.253838
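
Since the estimation table contains both the observed `PM2.5` column and the model's `pred_result`, the two can be compared directly. The sketch below computes residuals and the mean absolute error for the five rows shown above. Five stations in one city are far too few to judge the model, so this only illustrates the calculation.

```python
import pandas as pd

# Observed and estimated values copied from the five rows displayed above.
sample = pd.DataFrame({
    "PM2.5": [56.357143, 47.148148, 53.857143, 46.333333, 52.203704],
    "pred_result": [57.820961, 46.632584, 57.993763, 57.3871, 56.253838],
})

# Positive residuals mean the model overestimates the observed concentration.
sample["residual"] = sample["pred_result"] - sample["PM2.5"]
mae = sample["residual"].abs().mean()
print(sample)
print(round(mae, 3))
```

The same two lines applied to the full `res` DataFrame would give the error over all estimation points.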
