## C-SGEN: Molecule Property Prediction Based on Spatial Graph Embedding

ABSTRACT: Accurate prediction of molecular properties is important for new compound design, which is a crucial step in drug discovery. In this paper, molecular graph data is utilized for property prediction based on graph convolution neural networks. In addition, a convolution spatial graph embedding layer (C-SGEL) is introduced to retain the spatial connection information on molecules. And, multiple C-SGELs are stacked to construct a convolution spatial graph embedding network (C-SGEN) for end-to-end representation learning. In order to enhance the robustness of the network, molecular fingerprints are also combined with C-SGEN to build a composite model for predicting molecular properties. Our comparative experiments have shown that our method is accurate and achieves the best results on some open benchmark datasets.

Link to paper: https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.9b00410?rand=oin4mnup

Credit: https://github.com/wxfsd/C-SGEN

In [1]:
# Clone the repository and cd into directory
!git clone https://github.com/wxfsd/C-SGEN.git
%cd C-SGEN

Cloning into 'C-SGEN'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 32 (delta 10), reused 32 (delta 10), pack-reused 0
Unpacking objects: 100% (32/32), done.
/Users/saams4u/chemlabs/playground/C-SGEN


In [None]:
# Install dependencies / requirements
!pip install theano==1.0.3 numpy==1.16.4 scipy==1.3.0
!pip install sklearn==0.0 deepchem torch==1.4.0 torchvision==0.5.0 torchtext==0.5.0

!pip install torch-geometric \
  torch-sparse==latest+cu101 \
  torch-scatter==latest+cu101 \
  torch-cluster==latest+cu101 \
  torch-spline-conv==latest+cu101 \
  -f https://pytorch-geometric.com/whl/torch-1.4.0.html

# Install RDKit 
!pip install rdkit-pypi==2021.3.1.5

### Datasets
Input for the C-SGEN model: `Feature.npy`, `Normed_adj.npy`, `fingerprint_stand.npy` and `Interactions.npy` hold the molecular features, adjacency matrices, molecular fingerprints and corresponding target values, respectively.
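
As a rough illustration of how these four files fit together, the sketch below loads them with NumPy and checks that they describe the same number of molecules. The helper names (`load_csgen_inputs`, `check_aligned`), the directory argument, the use of `allow_pickle=True` and the synthetic array shapes are all assumptions made for the example; the repository's own loading code may differ.

```python
import tempfile
from pathlib import Path

import numpy as np

# File names come from the dataset description; the dictionary keys are
# illustrative labels, not names used by the repository.
CSGEN_FILES = {
    "features": "Feature.npy",
    "adjacency": "Normed_adj.npy",
    "fingerprints": "fingerprint_stand.npy",
    "targets": "Interactions.npy",
}


def load_csgen_inputs(data_dir):
    """Load the four C-SGEN input arrays from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    # allow_pickle=True is an assumption: molecules differ in atom count, so
    # per-molecule arrays may have been saved as object arrays.
    return {
        name: np.load(data_dir / fname, allow_pickle=True)
        for name, fname in CSGEN_FILES.items()
    }


def check_aligned(inputs):
    """Return the number of molecules, or raise if the arrays disagree."""
    sizes = {name: len(arr) for name, arr in inputs.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"arrays disagree on molecule count: {sizes}")
    return next(iter(sizes.values()))


# Demo on synthetic stand-in data: 3 molecules, 4 atoms, 5 atom features.
with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    np.save(Path(tmp) / "Feature.npy", rng.random((3, 4, 5)))
    np.save(Path(tmp) / "Normed_adj.npy", rng.random((3, 4, 4)))
    np.save(Path(tmp) / "fingerprint_stand.npy", rng.random((3, 16)))
    np.save(Path(tmp) / "Interactions.npy", rng.random((3, 1)))
    demo_inputs = load_csgen_inputs(tmp)
    n_molecules = check_aligned(demo_inputs)
```

To inspect the real data, point `load_csgen_inputs` at whichever folder in the cloned repository contains the `.npy` files.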

`full_feature`, `edge` and `Interactions.npy` hold the molecular features, adjacency information and corresponding target values, respectively, in the data format used by `pytorch_geometric`.
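
For context, `pytorch_geometric` represents graph connectivity as an `edge_index` array of shape `[2, num_edges]` rather than as a dense matrix. The sketch below shows that conversion in plain NumPy for a toy molecule. Whether the repository's `edge` file was produced this way is an assumption, and `dense_to_edge_index` is a hypothetical helper name.

```python
import numpy as np


def dense_to_edge_index(adj):
    """Convert a dense (N, N) adjacency matrix to a (2, E) COO edge list."""
    adj = np.asarray(adj)
    if adj.ndim != 2 or adj.shape[0] != adj.shape[1]:
        raise ValueError("adjacency matrix must be square")
    # np.nonzero returns the row and column indices of every non-zero entry,
    # in row-major order; each (row, col) pair is one directed edge.
    rows, cols = np.nonzero(adj)
    return np.stack([rows, cols]).astype(np.int64)


# Toy 3-atom chain 0-1-2; each bond appears in both directions.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
edge_index = dense_to_edge_index(adj)
```

Each undirected bond contributes two columns to `edge_index`, one per direction, so the toy chain with two bonds yields four edges.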


#### Model Hyper-Parameters
```
  --epochs                      INT     Number of epochs.                              Default is 33.
  --batch-size                  INT     Number of molecules per batch.                 Default is 8.
  --C-SGEL-layers               INT     Number of C-SGELs.                             Default is 2.
  --ch_num                      INT     Number of neurons in Graph embedding layer.    Default is 16.
  --k                           INT     Number of filters in conv1d.                   Default is 4.
  --lr_decay                    FLOAT   Learning-rate decay factor / 10 epochs.        Default is 0.5.
  --learning-rate               FLOAT   Adam learning rate.                            Default is 5e-4.
```
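
To make the interaction between `--learning-rate` and `--lr_decay` concrete, here is a minimal sketch of a step-decay schedule using the defaults from the table above. The `HyperParams` container and `lr_at_epoch` function are hypothetical, and the exact epoch at which the decay is applied (here, whenever the 1-based epoch number reaches a multiple of the interval) is an assumption rather than something the table specifies.

```python
from dataclasses import dataclass


@dataclass
class HyperParams:
    # Defaults copied from the hyper-parameter table.
    epochs: int = 33
    batch_size: int = 8
    csgel_layers: int = 2
    ch_num: int = 16
    k: int = 4
    lr_decay: float = 0.5
    decay_interval: int = 10  # the "/ 10 epochs" part of the lr_decay entry
    learning_rate: float = 5e-4


def lr_at_epoch(hp, epoch):
    """Learning rate for a 1-based epoch under step decay (assumed schedule)."""
    # Assumption: the rate is multiplied by lr_decay once per completed interval.
    n_decays = epoch // hp.decay_interval
    return hp.learning_rate * hp.lr_decay ** n_decays


hp = HyperParams()
schedule = {epoch: lr_at_epoch(hp, epoch) for epoch in (1, 10, 20, 30, 33)}
```

Under these assumptions the rate is halved three times over the 33 default epochs, so the final epochs train at one eighth of the initial learning rate.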

### Examples

The following commands train a model and save the predictions. The data for the default dataset (FreeSolv) is already prepared and saved in a folder, so the C-SGEN training script can be run directly.

In [11]:
!python C-SGEN_trian.py

FreeSolv----batch8--k4--lr-0.0005--iteration-33--ch_num-16--decay_interval-10
batch: 8
k: 4
ch_num: 16
decay_interval: 10
lr: 0.0005
lr_decay: 0.5
iteration: 33
Current python interpreter path：
/usr/bin/python3
Epoch Time(sec) Loss_train Loss_dev Loss_test RMSE_train RMSE_dev RMSE_test
epoch:1-train loss: 0.365,valid loss: 0.238,test loss: 0.165, valid rmse: 1.874, test rmse: 1.564, time: 0.505
epoch:2-train loss: 0.132,valid loss: 0.140,test loss: 0.093, valid rmse: 1.438, test rmse: 1.173, time: 0.508
epoch:3-train loss: 0.081,valid loss: 0.138,test loss: 0.059, valid rmse: 1.429, test rmse: 0.935, time: 0.503
epoch:4-train loss: 0.072,valid loss: 0.126,test loss: 0.091, valid rmse: 1.364, test rmse: 1.162, time: 0.499
epoch:5-train loss: 0.059,valid loss: 0.108,test loss: 0.062, valid rmse: 1.261, test rmse: 0.957, time: 0.477
epoch:6-train loss: 0.041,valid loss: 0.227,test loss: 0.149, valid rmse: 1.833, test rmse: 1.483, time: 0.482
epoch:7-train loss: 0.033,valid loss: 0.108,tes
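
The per-epoch lines in the log above can be turned into structured records for comparison or plotting. The sketch below parses them with a regular expression and picks the epoch with the lowest validation RMSE. Selecting by validation RMSE is a common convention, not necessarily what the training script does, and `parse_log` is a hypothetical helper. The sample lines are copied from the output above, including the truncated last line, which is skipped because it does not match.

```python
import re

# One complete per-epoch line; \D*? absorbs the separator after the epoch number.
LOG_LINE = re.compile(
    r"epoch:(?P<epoch>\d+)\D*?train loss: (?P<train_loss>[\d.]+),"
    r"\s*valid loss: (?P<valid_loss>[\d.]+),"
    r"\s*test loss: (?P<test_loss>[\d.]+),"
    r"\s*valid rmse: (?P<valid_rmse>[\d.]+),"
    r"\s*test rmse: (?P<test_rmse>[\d.]+),"
    r"\s*time: (?P<time>[\d.]+)"
)


def parse_log(text):
    """Return one dict per complete epoch line; incomplete lines are ignored."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue
        record = {key: float(value) for key, value in match.groupdict().items()}
        record["epoch"] = int(record["epoch"])
        records.append(record)
    return records


sample = """\
epoch:1-train loss: 0.365,valid loss: 0.238,test loss: 0.165, valid rmse: 1.874, test rmse: 1.564, time: 0.505
epoch:2-train loss: 0.132,valid loss: 0.140,test loss: 0.093, valid rmse: 1.438, test rmse: 1.173, time: 0.508
epoch:3-train loss: 0.081,valid loss: 0.138,test loss: 0.059, valid rmse: 1.429, test rmse: 0.935, time: 0.503
epoch:4-train loss: 0.072,valid loss: 0.126,test loss: 0.091, valid rmse: 1.364, test rmse: 1.162, time: 0.499
epoch:5-train loss: 0.059,valid loss: 0.108,test loss: 0.062, valid rmse: 1.261, test rmse: 0.957, time: 0.477
epoch:6-train loss: 0.041,valid loss: 0.227,test loss: 0.149, valid rmse: 1.833, test rmse: 1.483, time: 0.482
epoch:7-train loss: 0.033,valid loss: 0.108,tes
"""

records = parse_log(sample)
best = min(records, key=lambda record: record["valid_rmse"])
```

Among the six complete epochs shown, `best` is epoch 5, with a validation RMSE of 1.261 and a test RMSE of 0.957.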

Training a PyG model directly.

In [18]:
!python pyg_train.py

decay_interval: 10
lr: 0.01
ARMA-epoch:1,---train loss: 3.035,valid loss: 0.748,test loss: 0.665, valid rmse: 3.325, test rmse: 3.112, time: 0.653
ARMA-epoch:2,---train loss: 0.674,valid loss: 0.581,test loss: 0.387, valid rmse: 2.931, test rmse: 2.516, time: 0.441
ARMA-epoch:3,---train loss: 0.424,valid loss: 0.389,test loss: 0.312, valid rmse: 2.398, test rmse: 1.936, time: 0.448
ARMA-epoch:4,---train loss: 0.363,valid loss: 0.399,test loss: 0.216, valid rmse: 2.428, test rmse: 1.881, time: 0.439
ARMA-epoch:5,---train loss: 0.267,valid loss: 0.239,test loss: 0.109, valid rmse: 1.878, test rmse: 1.302, time: 0.441
ARMA-epoch:6,---train loss: 0.198,valid loss: 0.290,test loss: 0.216, valid rmse: 2.069, test rmse: 1.608, time: 0.454
ARMA-epoch:7,---train loss: 0.228,valid loss: 0.260,test loss: 0.127, valid rmse: 1.960, test rmse: 1.368, time: 0.444
ARMA-epoch:8,---train loss: 0.242,valid loss: 0.427,test loss: 0.319, valid rmse: 2.512, test rmse: 2.164, time: 0.453
ARMA-epoch:9,---trai

Load data from DeepChem.

In [19]:
!python load_FreeSolv.py

2021-06-02 22:36:28.141690: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
'split' is deprecated.  Use 'splitter' instead.


This work was inspired by Large-scale learnable graph convolutional networks [(LGCN)](https://github.com/divelab/lgcn).