## Chemical-protein Interaction Extraction via Gaussian Probability Distribution and External Biomedical Knowledge

<b>Motivation:</b> The biomedical literature contains a wealth of chemical-protein interactions (CPIs).
Automatically extracting CPIs described in biomedical literature is essential for drug discovery, precision
medicine, as well as basic biomedical research. Most existing methods focus only on the sentence
sequence to identify these CPIs. However, the local structure of sentences and external biomedical
knowledge also contain valuable information. Effective use of such information may improve the
performance of CPI extraction.

<b>Results:</b> In this paper, we propose a novel neural network-based approach to improve CPI extraction.
Specifically, the approach first employs BERT to generate high-quality contextual representations of the
title sequence, instance sequence, and knowledge sequence. Then, the Gaussian probability distribution
is introduced to capture the local structure of the instance. Meanwhile, the attention mechanism is applied
to fuse the title information and biomedical knowledge, respectively. Finally, the related representations
are concatenated and fed into the softmax function to extract CPIs. We evaluate our proposed model on
the CHEMPROT corpus. Our proposed model outperforms other state-of-the-art models.
The experimental results show that the Gaussian probability distribution and external
knowledge are complementary to each other. Integrating them can effectively improve the CPI extraction
performance. Furthermore, the Gaussian probability distribution can effectively improve the extraction
performance of sentences with overlapping relations in biomedical relation extraction tasks.
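
To make the Gaussian weighting idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes that each token position receives a weight from a Gaussian centered on an entity position, so that tokens near the entity contribute more to the instance representation. The function names, the `center` and `sigma` parameters, and the toy vectors are illustrative assumptions; the paper and `modeling.py` define the actual formulation.

```python
import math


def gaussian_weights(seq_len, center, sigma=2.0):
    """Return one weight per token position, peaking (value 1.0) at `center`."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
        for i in range(seq_len)
    ]


def reweight(token_vectors, center, sigma=2.0):
    """Scale each token vector by the Gaussian weight of its position."""
    weights = gaussian_weights(len(token_vectors), center, sigma)
    return [[w * x for x in vec] for w, vec in zip(weights, token_vectors)]


# Toy example (hypothetical data): 5 tokens with 2-dimensional
# representations and an entity mention at position 2.
tokens = [[1.0, 1.0] for _ in range(5)]
weighted = reweight(tokens, center=2, sigma=1.0)
for position, vec in enumerate(weighted):
    print(position, [round(x, 3) for x in vec])
```

A smaller `sigma` concentrates the weight on the entity and its immediate neighbors, while a larger one approaches uniform weighting.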

Link to paper: https://arxiv.org/pdf/1911.09487v2.pdf

Credit: https://github.com/CongSun-dlut/CPI_extraction

Google Colab: https://colab.research.google.com/drive/1FrGx-P9ENCKalNiMb8q-YSWW7K8lkGR4?usp=sharing

In [4]:
# Clone the repository and cd into directory
!git clone https://github.com/CongSun-dlut/CPI_extraction.git
%cd CPI_extraction/SourceCode

/content/CPI_extraction/SourceCode


In [None]:
# Install dependencies / requirements
!pip install pytorch-pretrained-bert==0.6.1 torch==1.1.0

### Code
This repository provides the code of the proposed model. \
The `pytorch_model.bin` files under `Resources` and `Records` are not part of the repository; they can be downloaded from [pytorch_models](https://drive.google.com/drive/folders/15o_h-_YQUgccvc9202hTGrSyZfrzOPFX?usp=sharing).
```
CPI Extraction
  -Resources
    -NCBI_BERT_pubmed_uncased_L-12_H-768_A-12
      -vocab.txt
      -pytorch_model.bin
      -bert_config.json
  -Records
    -record_76.56%
      -pytorch_model.bin
      -bert_config.json
      -eval_results.txt
      -test_results.txt
  -SourceCode
    -BioRE.py
    -BioRE_BG.py
    -modeling.py
    -modeling_BG.py
    -file_utils.py
  -ProcessedData
    -CHEMPROT
      -train.tsv
      -dev.tsv
      -test.tsv
      -test_overlapping.tsv
      -test_normal.tsv
    -DDIExtraction2013
      -train.tsv
      -dev.tsv
      -test.tsv
      -test_overlapping.tsv
      -test_normal.tsv
```
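
Before running the scripts, it can help to confirm that the downloaded files landed where the layout above expects them. The following is a small sketch under stated assumptions: the root `/content/CPI_extraction` is taken from the Colab cell output and may differ on your machine, and `missing_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Assumed clone location (from the Colab cell output); adjust as needed.
ROOT = Path("/content/CPI_extraction")

# A subset of the files listed in the directory layout above.
EXPECTED = [
    "Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12/vocab.txt",
    "Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12/pytorch_model.bin",
    "Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12/bert_config.json",
    "ProcessedData/CHEMPROT/train.tsv",
    "ProcessedData/CHEMPROT/dev.tsv",
    "ProcessedData/CHEMPROT/test.tsv",
]


def missing_files(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


missing = missing_files(ROOT)
if missing:
    print("Missing files:")
    for rel in missing:
        print("  ", rel)
else:
    print("All expected files are present.")
```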

### Run models
Because the model contains multiple layers, training generally takes some time. If you do not have time to train the model, you can load the saved model in `Records` and run the evaluation directly.
Some examples of execution instructions are listed below.

#### Run the proposed model

In [7]:
!python BioRE.py \
  --task_name cpi \
  --do_train \
  --do_eval \
  --do_predict \
  --do_lower_case \
  --data_dir /content/CPI_extraction/ProcessedData/CHEMPROT \
  --bert_model /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12 \
  --max_seq_length 128 \
  --train_batch_size 16 \
  --eval_batch_size 8 \
  --predict_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 2.0 \
  --seed 47 \
  --output_dir CPI_extraction/results

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
06/05/2021 00:18:56 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
06/05/2021 00:18:56 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12/vocab.txt
06/05/2021 00:18:56 - INFO - modeling -   loading archive file /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12
06/05/2021 00:18:56 - INFO - modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

06/05/2021 00:19:01 - INFO - __main__ -   Writing example 0 of 19460
06/05/2021 00:19:01 - INFO - __main__

#### Load the record of the proposed model

In [11]:
!python BioRE.py \
  --task_name cpi \
  --do_eval \
  --do_predict \
  --do_lower_case \
  --data_dir /content/CPI_extraction/ProcessedData/CHEMPROT \
  --bert_model /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12 \
  --saved_model CPI_extraction/results \
  --max_seq_length 128 \
  --train_batch_size 16 \
  --eval_batch_size 8 \
  --predict_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 2.0 \
  --output_dir CPI_extraction/results

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
06/05/2021 00:59:30 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
06/05/2021 00:59:30 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12/vocab.txt
06/05/2021 00:59:30 - INFO - modeling -   loading archive file /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12
06/05/2021 00:59:30 - INFO - modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

06/05/2021 00:59:38 - INFO - __main__ -   Writing example 0 of 11820
06/05/2021 00:59:38 - INFO - __main__

#### Run the `BERT+Gaussian` model on the CHEMPROT dataset

In [None]:
!python BioRE_BG.py \
  --task_name cpi \
  --do_train \
  --do_eval \
  --do_predict \
  --do_lower_case \
  --data_dir /content/CPI_extraction/ProcessedData/CHEMPROT \
  --bert_model /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12 \
  --max_seq_length 128 \
  --train_batch_size 16 \
  --eval_batch_size 8 \
  --predict_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 2.0 \
  --seed 47 \
  --output_dir CPI_extraction/output

#### Run the `BERT+Gaussian` model on the DDIExtraction dataset

In [None]:
!python BioRE_BG.py \
  --task_name ddi \
  --do_train \
  --do_eval \
  --do_predict \
  --do_lower_case \
  --data_dir /content/CPI_extraction/ProcessedData/DDIExtraction2013 \
  --bert_model /content/CPI_extraction/Resources/NCBI_BERT_pubmed_uncased_L-12_H-768_A-12 \
  --max_seq_length 128 \
  --train_batch_size 16 \
  --eval_batch_size 8 \
  --predict_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --seed 17 \
  --output_dir CPI_extraction/output
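
The four commands above differ only in the script, task name, dataset, seed, epoch count, and output directory. As a convenience, the sketch below assembles the same argument list from those few values. `build_command` and `REPO` are hypothetical names introduced here; `REPO` assumes the clone location shown in the log output, and every flag is copied from the commands above rather than from the scripts' own help text.

```python
import shlex

# Assumed clone location (matches the paths printed in the log output).
REPO = "/content/CPI_extraction"
BERT_DIR = "NCBI_BERT_pubmed_uncased_L-12_H-768_A-12"


def build_command(script, task_name, dataset, output_dir, epochs,
                  seed=None, train=True, saved_model=None):
    """Assemble the argument list for BioRE.py or BioRE_BG.py."""
    args = ["python", script, "--task_name", task_name]
    if train:
        args.append("--do_train")
    args += [
        "--do_eval", "--do_predict", "--do_lower_case",
        "--data_dir", f"{REPO}/ProcessedData/{dataset}",
        "--bert_model", f"{REPO}/Resources/{BERT_DIR}",
    ]
    if saved_model is not None:
        args += ["--saved_model", saved_model]
    args += [
        "--max_seq_length", "128",
        "--train_batch_size", "16",
        "--eval_batch_size", "8",
        "--predict_batch_size", "8",
        "--learning_rate", "2e-5",
        "--num_train_epochs", str(epochs),
    ]
    if seed is not None:
        args += ["--seed", str(seed)]
    args += ["--output_dir", output_dir]
    return args


# Reproduce the BERT+Gaussian run on the DDIExtraction dataset.
ddi_cmd = build_command("BioRE_BG.py", "ddi", "DDIExtraction2013",
                        "CPI_extraction/output", epochs=3.0, seed=17)
print(shlex.join(ddi_cmd))
```

The returned list can be passed to `subprocess.run(ddi_cmd, check=True)` from the `SourceCode` directory, or the printed string can be pasted into a notebook cell after `!`.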