<a href="https://colab.research.google.com/github/ribeaud/NLP_Workshop/blob/master/BioBERT_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Diego Saldana, November 2019**

# PyPharma NLP 2019 Tutorial: Text Classification with BioBERT

The objective of this exercise is to take scientific Python users through the process of fitting a text classification model with the BioBERT pre-trained model. The model detects mentions of adverse events in case reports published in the biomedical literature. By the end of this exercise, you should be familiar with:

* The ADE Corpus
* The pypharma_nlp package tools to download and explore the ADE corpus
* The command line interface to fit text classifiers with the BioBERT pre-trained model
* The pypharma_nlp package tools to load the BioBERT pre-trained checkpoints and predict on new text

# Running this Notebook

In order to run the code in this notebook, you will first have to save a copy of it in your Google Drive. Please do this by clicking on the "File" menu and then "Save a copy in Drive" as shown in the figure below.

![](https://drive.google.com/uc?export=view&id=1hjFEhg7ML5AOzu-cBBThJ2QFtZvTQRHk)

# Recommended Reading

* The [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) by Devlin *et al.* (2018)
* The [BioBERT paper](https://arxiv.org/abs/1901.08746) by Lee *et al.* (2019)

# Glossary and Acronyms

- **ADE:** Adverse Drug Event
- **Neg:** Negative (not an ADE)
- **token:** In NLP algorithms, sentences are divided into logical blocks called tokens. Tokens may correspond to words, word-pieces, or characters. In the case of BioBERT, they are word-pieces.
- **checkpoint:** A file containing the saved state of a machine learning model including weights and other trainable parameters. The file can be used to restore the state of the trained model without re-training it from scratch. It is analogous to the set of coefficients and intercept in a regression model.
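
To make the **token** entry above more concrete, here is a minimal sketch of greedy longest-match-first word-piece splitting, the general scheme used by BERT-style tokenizers. The vocabulary below is a toy one invented for illustration; the real BioBERT `vocab.txt` will split words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into word-pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with "##".
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word becomes the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (not the BioBERT vocabulary).
toy_vocab = {"hydroxy", "##urea", "leuk", "##emia", "acute"}

print(wordpiece("hydroxyurea", toy_vocab))
print(wordpiece("leukemia", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Rare biomedical terms that are missing from the vocabulary as whole words can still be represented as a sequence of known pieces.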

# The **pypharma_nlp Package**

The pypharma_nlp package is a set of tools that we have developed to make the use of the datasets and models demonstrated in these notebooks easier. Among other things, it allows you to

* Download the datasets that will be used in these tutorials (and more).
* Download abstracts from PubMed.
* Download the pre-trained checkpoints originally fit by the BioBERT authors.
* Download our pre-trained checkpoints, which we have fit after fine tuning BioBERT models in a similar way as was done in the original BioBERT paper.
* Use wrapper classes to easily perform text classification, named entity recognition, relation extraction, or question answering on new data after having fit a model using BioBERT.

In [1]:
# Install pypharma-nlp

%tensorflow_version 1.x
%cd /content/
!git clone https://github.com/openpharma/pypharma_nlp.git

%cd /content/pypharma_nlp

!pip install -e .
%cd ..

import nltk
nltk.download("punkt")

/content
Cloning into 'pypharma_nlp'...
remote: Enumerating objects: 617, done.
remote: Counting objects: 100% (617/617), done.
remote: Compressing objects: 100% (275/275), done.
remote: Total 617 (delta 345), reused 591 (delta 319), pack-reused 0
Receiving objects: 100% (617/617), 22.32 MiB | 19.42 MiB/s, done.
Resolving deltas: 100% (345/345), done.
/content/pypharma_nlp
Obtaining file:///content/pypharma_nlp
Collecting biopython==1.74
  Downloading https://files.pythonhosted.org/packages/ed/77/de3ba8f3d3015455f5df859c082729198ee6732deaeb4b87b9cfbfbaafe3/biopython-1.74-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
     |████████████████████████████████| 2.2MB 9.1MB/s
Collecting ipykernel==5.1.2
  Downloading https://files.pythonhosted.org/packages/d4/16/43f51f65a8a08addf04f909a0938b06ba1ee1708b398a9282474531bd893/ipykernel-5.1.2-py3-none-any.whl (116kB)
     |████████████████████████████████| 122kB 67.2MB/s
Collecting pandas==0.25.0
  Downloading http

/content


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
## Use this if you need to refresh the repository
## You may have to restart the runtime
#
#!cd /content/pypharma_nlp && \
#git pull && \
#pip install -e .

# Restarting the Runtime [IMPORTANT]

After installing pypharma_nlp, you now have to restart the runtime in order for pypharma_nlp to be loaded properly. After restarting, you can start working on the notebook from this point onward. A "Restart Runtime" button should have appeared in the previously executed cell, where we installed pypharma_nlp. If not, please restart by clicking on the "Runtime" menu and then on "Restart Runtime", as shown in the figure below.

![](https://drive.google.com/uc?export=view&id=1Req14nraAWQ1g82n7Xw3uHX7ZPEu8RWc)

# BioBERT: Open Source pre-trained Biomedical NLP model

We'll be using BioBERT (Lee *et al.* 2019), an open source pre-trained biomedical model for Natural Language Understanding tasks. It is based on the original BERT pre-trained model (Devlin *et al.* 2018). The model was pre-trained on PubMed abstracts and PubMed Central articles by performing two tasks:

* The Masked Language Model (Masked LM), which consists of masking randomly selected tokens in sentences and training the model to predict the missing tokens.

* Next Sentence Prediction, which consists of training the model on a dataset of sentence pairs in which some pairs were adjacent in the original text and others were randomly paired. The model is trained to predict whether the second sentence followed the first one in the original text.
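
The two pre-training tasks can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the BioBERT pre-training code. The 15% masking rate, the 80/10/10 replacement rule and the 50/50 pairing follow the description in the BERT paper; the function names are made up for this sketch. Random second sentences are drawn from the same abstract here, whereas BERT draws them from other documents.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked tokens, {position: original token}) for the Masked LM task."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model must predict
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, targets


def make_nsp_pairs(sentences, seed=0):
    """Return (first, second, is_next) triples for Next Sentence Prediction."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Pick any sentence other than the first one or its true successor.
            others = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(others)], False))
    return pairs


tokens = "prolonged use of hydroxyurea may lead to acute leukemia".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked, targets)

abstract = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
nsp_pairs = make_nsp_pairs(abstract)
for first, second, is_next in nsp_pairs:
    print(first, "|", second, "|", is_next)
```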

In order to use BioBERT, we will clone the original repository and download their checkpoints, which were made available by the original authors of the paper.

In [0]:
# Clone BioBERT

%cd /content/
!git clone https://github.com/diego-s/biobert.git
%cd /content/biobert

/content
Cloning into 'biobert'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 212 (delta 6), reused 9 (delta 2), pack-reused 199
Receiving objects: 100% (212/212), 123.51 KiB | 342.00 KiB/s, done.
Resolving deltas: 100% (121/121), done.
/content/biobert


# Text Classification

In text classification, the goal is to train an algorithm that, given some text, can accurately classify it according to a pre-defined categorization. For example, one may wish to classify journal abstracts as being relevant to specific biomedical topics, such as cancer, cardiovascular disease, autoimmune diseases, etc. Or one may wish to detect sentences that mention adverse events in case reports published in the biomedical literature. This is the example that will be explored in this exercise.

Each case report is composed of a set of sentences. The task is to train a model that automatically categorizes each sentence and assigns it a probability of containing an Adverse Drug Event (ADE) mention.

![](https://drive.google.com/uc?export=view&id=1XZSjDvr2aFyP3aD1zJfmSVf_HLPUaCNE)


# Dataset: The ADE Corpus

The ADE corpus was introduced by Gurulingappa *et al.* (2012) in order to provide a benchmark dataset for the development of algorithms for the detection of ADEs in case reports published in the biomedical literature. The original source of the data was 2972 MEDLINE case reports. The data was labelled by three trained annotators and their annotation results were consolidated into a final dataset including 6728 ADE relations (in 4272 sentences), as well as 16688 sentences with no ADE mention.
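
As a quick sanity check on the corpus figures above, the short calculation below works out the class balance at the sentence level. It uses only the two sentence counts quoted in this section. The split that the classifier actually trains and evaluates on may differ, so treat the result as a rough indication.

```python
# Sentence counts quoted above for the ADE corpus.
ade_sentences = 4272
non_ade_sentences = 16688

total_sentences = ade_sentences + non_ade_sentences
ade_fraction = ade_sentences / total_sentences

# Accuracy of a trivial model that always predicts "Neg".
majority_baseline = non_ade_sentences / total_sentences

print(f"Total sentences:   {total_sentences}")
print(f"ADE fraction:      {ade_fraction:.1%}")
print(f"Majority baseline: {majority_baseline:.1%}")
```

Roughly one sentence in five mentions an ADE, so accuracy alone can look good even for a weak model. Precision and recall are more informative for this task.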

We will use our pypharma_nlp package to download the annotations that were made available by the authors as well as to download the abstracts from PubMed. We will also take a look at the sentences contained in the dataset and their labels. Later on, we will use this data to train a sentence classification model that can predict whether a sentence contains an ADE or not.

In [0]:
# We need to specify an email address which may be used by Entrez to contact 
# us in case of issues

print("Your email (this is needed for Entrez):")
import os
os.environ["ENTREZ_EMAIL"] = input()

Your email (this is needed for Entrez):
diegovs87@yahoo.fr


In [0]:
# Download the source data

%cd /content/biobert
from pypharma_nlp.data.ade_corpus import download_source_data

download_source_data("data/ade_corpus")

/content/biobert


In [0]:
# Let's create a generator to read the examples

%cd /content/biobert
from pypharma_nlp.data.ade_corpus import get_classification_examples

batches = get_classification_examples("data/ade_corpus")

/content/biobert


In [0]:
# Now let's look at some examples, click 'run' again to see the next abstract

%cd /content/biobert
from IPython.display import display
from IPython.display import HTML
import pandas as pd

pmids, sentences, labels = next(batches)
table = pd.DataFrame.from_dict({
    "PMID" : pmids, 
    "Sentence" : sentences, 
    "Labels" : labels, 
})
display(HTML(table.to_html(index=False)))

/content/biobert


| PMID | Sentence | Labels |
| --- | --- | --- |
| 8579054 | Acute myeloid leukemia evolving from essential thrombocythemia in two patients treated with hydroxyurea. | AE |
| 8579054 | Essential thrombocythemia (ET) is an uncommon myeloproliferative disorder, which is thought to develop from a multipotent stem cell. | Neg |
| 8579054 | Like other myeloproliferative diseases, ET is associated with an increased risk of development of acute leukemia (AL). | Neg |
| 8579054 | However, the large majority of cases of leukemic transformation in ET are thought to be related to prior therapy, usually radioactive phosphorous or alkylating chemotherapy, and the development of AL in ET is extremely rare in the untreated patient. | Neg |
| 8579054 | In this report, two cases of ET which evolved into AL without prior exposure to radiation or alkylating agents, and which were treated with long-term hydroxyurea therapy, are described. | AE |
| 8579054 | The first case had cytogenetic changes in the bone marrow suggestive of therapy-associated leukemia, and the second developed myelodysplastic syndrome on therapy which was likely chemotherapy-induced and led to acute leukemia. | Neg |
| 8579054 | Prolonged used of hydroxyurea in patients with ET may lead to therapy-associated acute leukemia. | AE |


# BioBERT for Text Classification

Reminder: the BioBERT model is first pre-trained on a large dataset by (1) learning to predict the masked tokens (the "Masked Language Model"); and (2) learning to predict whether one sentence follows another ("Next Sentence Prediction"). The model produces a vector of outputs for each token in the input; these vectors are called "hidden states". A dummy token, denoted "CLS", is always added at the beginning of the input.

Task 2 is predicted using the hidden state at the CLS token. By convention, the hidden state at the CLS token is called the "pooled output", and the full set of hidden states corresponding to all the tokens in the input is known as the "sequence output".

![Architecture of our ADE detector](https://drive.google.com/uc?export=view&id=1BVCQ_uauSEnR3pwIjrj1_JlNMlrHEyhu)

**Simplified Architecture of the ADE detector**


When doing text classification, BioBERT's pooled output (the hidden state corresponding to the CLS token) is passed through a dropout layer, then a fully connected layer with a bias term, and finally a softmax function to obtain class probabilities. The two possible classes are: (1) AE, indicating the mention of an adverse event; and (2) Neg, corresponding to no adverse event.
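
The classification head described above can be sketched in plain Python. This is a toy illustration, not the BioBERT code. The hidden size of 8, the random weights and the dropout rate of 0.1 are assumptions chosen for the example; BERT-base models use a hidden size of 768 and learn the weights during fine-tuning.

```python
import math
import random


def dropout(vec, rate, rng, training):
    """Zero out each value with probability `rate` during training only."""
    if not training:
        return list(vec)
    keep = 1.0 - rate
    # Surviving values are rescaled so the expected sum stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def dense(vec, weights, bias):
    """Fully connected layer: one weighted sum plus bias per output class."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden_size = 8  # toy value for illustration
class_names = ["AE", "Neg"]

# Stand-in for the pooled output that BioBERT would produce for a sentence.
pooled_output = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]

# One row of weights and one bias per class (random here, learned in practice).
weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
           for _ in class_names]
bias = [0.0 for _ in class_names]

# At prediction time dropout is disabled.
hidden = dropout(pooled_output, rate=0.1, rng=rng, training=False)
probabilities = softmax(dense(hidden, weights, bias))
predicted = class_names[probabilities.index(max(probabilities))]
print(dict(zip(class_names, probabilities)), predicted)
```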

# Training the Model

**Important**

```
This section will not be used during the tutorial, since it will take a long 
time to run. But we encourage you to go back to this notebook and run this 
section by yourself to familiarize yourself with the BioBERT CLI. 
```


In this section, we will first download the pre-trained checkpoint from BioBERT, which was trained to perform the two tasks (the Masked Language Model and Next Sentence Prediction) as previously described, on PubMed abstracts and PubMed Central articles.

We will then use the BioBERT classification code to train an ADE detector model. This process can take a long time (1-2 hours). However, we also provide pre-trained checkpoints that can be used to quickly load a model like the one that you would obtain after training the model as shown here.

In [0]:
# Download the BioBERT checkpoint

%cd /content/biobert
from pypharma_nlp.biobert.checkpoints.base import download_checkpoint

download_checkpoint("checkpoints/biobert/", checkpoint="biobert_v1.1_pubmed")

/content/biobert
Downloading file from Google Drive


In [0]:
# Let's train a classification model

%cd /content/biobert
!python run_classifier.py \
    --task_name="ade" \
    --do_train="true" \
    --do_eval="true" \
    --do_predict="true" \
    --data_dir="data/ade_corpus" \
    --vocab_file="checkpoints/biobert/biobert_v1.1_pubmed/vocab.txt" \
    --bert_config_file="checkpoints/biobert/biobert_v1.1_pubmed/bert_config.json" \
    --init_checkpoint="checkpoints/biobert/biobert_v1.1_pubmed/model.ckpt-1000000" \
    --max_seq_length="128" \
    --train_batch_size="32" \
    --learning_rate="2e-5" \
    --num_train_epochs="3.0" \
    --do_lower_case="false" \
    --output_dir="checkpoints/fine_tuning/"

/content/biobert



W1117 14:03:40.369320 139939077236608 module_wrapper.py:139] From run_classifier.py:964: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1117 14:03:40.369553 139939077236608 module_wrapper.py:139] From run_classifier.py:964: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1117 14:03:40.369997 139939077236608 module_wrapper.py:139] From /content/biobert/modeling.py:92: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1117 14:03:40.370758 139939077236608 module_wrapper.py:139] From run_classifier.py:990: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/t

In [0]:
# Show the evaluation results on the test set

%cd /content/biobert
!cat checkpoints/fine_tuning/eval_results.txt

/content/biobert
eval_accuracy = 0.96382785
eval_loss = 0.13599786
eval_precision = 0.87904966
eval_recall = 0.9465116
global_step = 1771
loss = 0.13599786
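
The `global_step` value above is the number of optimizer steps taken during fine-tuning. The sketch below shows how such a number follows from the training flags, assuming the BioBERT script keeps the step computation of the original BERT `run_classifier.py` (examples divided by batch size, times epochs, truncated to an integer) and its default warmup proportion of 0.1. The training set size of 18,900 is a hypothetical value chosen for illustration; it is not taken from the corpus files.

```python
def num_train_steps(n_examples, batch_size, epochs):
    """Total optimizer steps, truncated to an integer."""
    return int(n_examples / batch_size * epochs)


def num_warmup_steps(total_steps, warmup_proportion=0.1):
    """Steps during which the learning rate ramps up linearly."""
    return int(total_steps * warmup_proportion)


n_examples = 18900  # hypothetical training set size
steps = num_train_steps(n_examples, batch_size=32, epochs=3.0)
warmup = num_warmup_steps(steps)
print(steps, warmup)
```

With a batch size of 32 and 3 epochs, a training set of about 18,900 sentences gives a step count matching the `global_step` reported above.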


# Results

After fitting the model, you should obtain performance on the test set close to the following:

* Accuracy: 96%
* Precision: 89%
* F1-Score: 0.90
* Recall: 92%
* Specificity: 97%
* AUC: 0.99
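
The F1-score is the harmonic mean of precision and recall. As a small worked example, the sketch below parses text in the `name = value` format of `eval_results.txt` and computes F1 from the precision and recall reported by the evaluation cell earlier in this notebook. The helper names are made up for this example.

```python
# Contents of eval_results.txt as printed by the evaluation cell.
eval_results = """\
eval_accuracy = 0.96382785
eval_loss = 0.13599786
eval_precision = 0.87904966
eval_recall = 0.9465116
global_step = 1771
loss = 0.13599786
"""


def parse_eval_results(text):
    """Parse 'name = value' lines into a dictionary of floats."""
    metrics = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            metrics[name.strip()] = float(value)
    return metrics


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


metrics = parse_eval_results(eval_results)
f1 = f1_score(metrics["eval_precision"], metrics["eval_recall"])
print(f"F1 = {f1:.3f}")
```

For that run the F1-score comes out at about 0.91, close to the value listed above.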

# Load a pre-trained Checkpoint

We have trained models using the same code as shown here and stored the resulting checkpoints in our PyPharma NLP 2019 Google Drive. You can use the pypharma_nlp package to download them, so you can use the models without training them again.

Run the following cell to download a pre-trained checkpoint that is equivalent to what you would obtain when running the previous training task in full.

In [0]:
# Recover a pre-fit checkpoint

%cd /content/biobert/
from pypharma_nlp.checkpoints import download_checkpoint

!rm -rf checkpoints/fine_tuning 
download_checkpoint("checkpoints", "biobert_v1.1_pubmed_classification_ade")
!mv checkpoints/biobert_v1.1_pubmed_classification_ade/ checkpoints/fine_tuning/

/content/biobert
Downloading file from Google Drive


# Prediction

You can use the following code to predict the presence of adverse event mentions in case reports published in the biomedical literature. We have included one example (PMID: 31574875) to get you started. Remember that the model was trained on case reports dating up to 2010, whereas the example given here is from 2017. Even so, the model should be able to detect adverse event mentions in its sentences without issues. You can add more PMIDs in the code below if you wish to try other case reports.

One simple way to find adverse event related case reports in PubMed is to use the following query: 

```
"adverse effects"[sh] AND (hasabstract[text] AND Case Reports[ptyp]) AND "drug therapy"[sh] AND English[lang] AND (Case Reports[ptyp])
```

You can click [here](https://www.ncbi.nlm.nih.gov/pubmed/?term=%22adverse+effects%22%5Bsh%5D+AND+(hasabstract%5Btext%5D+AND+Case+Reports%5Bptyp%5D)+AND+%22drug+therapy%22%5Bsh%5D+AND+English%5Blang%5D+AND+(Case+Reports%5Bptyp%5D)) to perform this search.
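
If you want to adapt the query, the search link can be rebuilt with the Python standard library. The sketch below URL-encodes the query shown above and appends it to the PubMed search address used in the link. Parentheses are left unescaped to match that link; this is a choice made for the example.

```python
from urllib.parse import quote_plus

PUBMED_SEARCH_URL = "https://www.ncbi.nlm.nih.gov/pubmed/?term="

query = (
    '"adverse effects"[sh] AND (hasabstract[text] AND Case Reports[ptyp]) '
    'AND "drug therapy"[sh] AND English[lang] AND (Case Reports[ptyp])'
)

# Spaces become "+", quotes and brackets are percent-encoded,
# and parentheses are kept as they are.
search_url = PUBMED_SEARCH_URL + quote_plus(query, safe="()")
print(search_url)
```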


In [0]:
# Let us now create a new set of examples

%cd /content/biobert/
from pypharma_nlp.pubmed import get_publications
from pypharma_nlp.pubmed import get_publication_sentences

records = get_publications(pmids=["31574875"])
documents = get_publication_sentences(records, include_title=True)
sentences = next(documents)

/content/biobert


In [0]:
# Let's now predict on new sentences and look at the results

%cd /content/biobert/
from pypharma_nlp.biobert.wrappers import BioBertWrapper
from IPython.display import display
from IPython.display import HTML
import pandas as pd

model = BioBertWrapper()                                                  
model.build(
    "classification", 
    "ade", 
    "/content/biobert/checkpoints/biobert/biobert_v1.1_pubmed", 
    "/content/biobert/checkpoints/fine_tuning", 
)
labels, probabilities = model.classify(sentences)
prediction_data = pd.DataFrame.from_dict(
{
    "Sentence" : sentences, 
    "Predicted Label" : labels, 
})
display(HTML(prediction_data.to_html(index=False)))

/content/biobert


| Sentence | Predicted Label |
| --- | --- |
| A case report of glecaprevir/pibrentasvir-induced severe hyperbilirubinemia in a patient with compensated liver cirrhosis. | AE |
| RATIONALE: Glecaprevir/pibrentasvir, a pan-genotypic and ribavirin-free direct acting antiviral agent regimen, has shown significant efficacy and very few serious complications. | Neg |
| However, as the drug metabolizes in the liver, it is not recommended in patients with decompensated liver cirrhosis. | Neg |
| Herein, we report the case of a patient with compensated liver cirrhosis who developed severe jaundice after glecaprevir/pibrentasvir medication. | AE |
| PATIENT CONCERNS: A 77-year-old man diagnosed with chronic hepatitis C-related compensated liver cirrhosis visited hospital due to severe jaundice after 12 weeks of glecaprevir/pibrentasvir medication. | AE |
| DIAGNOSES: On the laboratory work-up, the total/direct bilirubin level was markedly elevated to 21.56/11.68 from 1.81 mg/dL; the alanine aminotransferase and aspartate aminotransferase levels were within the normal range. | Neg |
| We checked the plasma drug concentration level of glecaprevir, and 18,500 ng/mL was detected, which was more than 15 times higher than the drug concentration level verified in normal healthy adults. | Neg |
| INTERVENTIONS: Glecaprevir/pibrentasvir was abruptly stopped and after 6 days, the drug concentration level decreased to 35 ng/mL and the serum total/direct bilirubin decreased to 7.49/4.06 mg/dL. | Neg |
| OUTCOMES: Three months after drug cessation, the serum total bilirubin level normalized to 1.21 mg/dL and HCV RNA was not detected. | Neg |
| LESSONS: We report what is likely the first known case of severe jaundice after medication with glecaprevir/pibrentasvir in a patient with compensated liver cirrhosis. | AE |


# Conclusion

We have taken you through the process of downloading the ADE corpus, exploring it, fitting a text classifier to predict the presence of adverse event mentions at the sentence level and, finally, predicting on a new unseen abstract. By now, you should be familiar with

* The ADE Corpus
* The pypharma_nlp package tools to download and explore the ADE corpus
* The command line interface to fit text classifiers with the BioBERT pre-trained model
* The pypharma_nlp package tools to load the BioBERT pre-trained checkpoints and predict on new text