<a href="https://colab.research.google.com/github/ribeaud/NLP_Workshop/blob/master/BioBERT_ner_re.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Diego Saldana, November 2019**

# PyPharma NLP 2019 Tutorial: Named Entity Recognition (NER) and Relation Extraction (RE) with BioBERT

The objective of this notebook is to take scientific Python users through the process of fitting (1) a named entity recognition (NER) model to extract disease mentions from biomedical abstracts; and (2) a relation extraction (RE) model to detect gene-disease associations in sentences, both using the BioBERT pre-trained model. By the end of this exercise, you should be familiar with:

Part I
* The BioCreative V Chemical-Disease Relation (BC5CDR) corpus.
* The pypharma_nlp package tools to download and explore the BC5CDR corpus
* The command line interface to fit named entity recognition models with BioBERT
* The pypharma_nlp package tools to load the BioBERT pre-trained checkpoints and extract named entities from new text

Part II
* The Gene-Disease Association Database (GAD) corpus.
* The pypharma_nlp package tools to download and explore the GAD corpus
* The command line interface to fit relation extraction models with BioBERT
* The pypharma_nlp package tools to load the BioBERT pre-trained checkpoints and extract relations from new text

# Running this Notebook

In order to run the code in this notebook, you will first have to save a copy of it in your Google Drive. Please do this by clicking on the "File" menu and then "Save a copy in Drive" as shown in the figure below.

![](https://drive.google.com/uc?export=view&id=1hjFEhg7ML5AOzu-cBBThJ2QFtZvTQRHk)

# Recommended Reading

* The [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) by Devlin *et al.* (2018)
* The [BioBERT paper](https://arxiv.org/abs/1901.08746) by Lee *et al.* (2019)

# Glossary and Acronyms

- **Named Entity:** A mention of an entity belonging to a particular category (e.g. a protein, a gene, a disease, a drug, etc.) in a sentence or text passage.
- **Relation:** An association between two entities belonging to a defined category (e.g. gene-disease associations).
- **token:** In NLP algorithms, sentences are divided into logical blocks called tokens. Tokens may correspond to words, word-pieces, or characters. In the case of BioBERT, they are word-pieces.
- **checkpoint:** A file containing the saved state of a machine learning model including weights and other trainable parameters. The file can be used to restore the state of the trained model without re-training it from scratch. It is analogous to the set of coefficients and intercept in a regression model.
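
To make the notion of word-piece tokens more concrete, here is a minimal sketch of greedy longest-match-first word-piece splitting. The tiny vocabulary is made up for illustration and is not BioBERT's actual `vocab.txt`; the function name is hypothetical, and the `##` prefix on continuation pieces follows the BERT convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into word-pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: treat the word as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hypo", "##tension", "##ten", "##sion", "disease"}

print(wordpiece_tokenize("hypotension", toy_vocab))
# → ['hypo', '##tension']
print(wordpiece_tokenize("selegiline", toy_vocab))
# → ['[UNK]']
```

With a real vocabulary, rare biomedical terms are typically split into several pieces rather than mapped to the unknown token.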

# The **pypharma_nlp Package**

The pypharma_nlp package is a set of tools that we have developed to make the use of the datasets and models demonstrated in these notebooks easier. Among other things, it allows you to

* Download the datasets that will be used in these tutorials (and more).
* Download abstracts from PubMed.
* Download the pre-trained checkpoints originally fit by the BioBERT authors.
* Download our pre-trained checkpoints, which we obtained by fine-tuning BioBERT models in a way similar to the original BioBERT paper.
* Use wrapper classes to easily perform text classification, named entity recognition, relation extraction, or question answering on new data after having fit a model using BioBERT.

In [0]:
# Install pypharma-nlp

%tensorflow_version 1.x
%cd /content/
!git clone https://github.com/openpharma/pypharma_nlp.git

%cd /content/pypharma_nlp

!pip install -e .
%cd ..

import nltk
nltk.download("punkt")

/content
Cloning into 'pypharma_nlp'...
remote: Enumerating objects: 613, done.
remote: Counting objects: 100% (613/613), done.
remote: Compressing objects: 100% (272/272), done.
remote: Total 613 (delta 342), reused 589 (delta 318), pack-reused 0
Receiving objects: 100% (613/613), 21.88 MiB | 8.26 MiB/s, done.
Resolving deltas: 100% (342/342), done.
/content/pypharma_nlp
Obtaining file:///content/pypharma_nlp
Collecting biopython==1.74
  Downloading https://files.pythonhosted.org/packages/ed/77/de3ba8f3d3015455f5df859c082729198ee6732deaeb4b87b9cfbfbaafe3/biopython-1.74-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
     |████████████████████████████████| 2.2MB 2.8MB/s
Collecting ipykernel==5.1.2
  Downloading https://files.pythonhosted.org/packages/d4/16/43f51f65a8a08addf04f909a0938b06ba1ee1708b398a9282474531bd893/ipykernel-5.1.2-py3-none-any.whl (116kB)
     |████████████████████████████████| 122kB 69.5MB/s
Collecting pandas==0.25.0
  Downloading https

/content


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
## Use this if you need to refresh the repository
## You may have to restart the runtime
#
#!cd /content/pypharma_nlp && \
#git pull && \
#pip install -e .

# Restarting the Runtime [IMPORTANT]

After installing pypharma_nlp, you now have to restart the runtime in order for pypharma_nlp to be loaded properly. After restarting, you can start working on the notebook from this point onward. A "Restart Runtime" button should have appeared in the previously executed cell, where we installed pypharma_nlp. If not, please restart by clicking on the "Runtime" menu and then on "Restart Runtime", as shown in the figure below.

![](https://drive.google.com/uc?export=view&id=1Req14nraAWQ1g82n7Xw3uHX7ZPEu8RWc)

# BioBERT: Open Source pre-trained Biomedical NLP model

We'll be using BioBERT (Lee *et al.* 2019), an open source pre-trained biomedical model for Natural Language Understanding tasks. It is based on the original BERT pre-trained model (Devlin *et al.* 2018). The model was pre-trained on PubMed abstracts and PubMed Central articles by performing two tasks:

* The Masked Language Model (Masked LM), which consists of masking randomly selected tokens in sentences and training the model to predict the missing tokens.

* Next Sentence Prediction, which consists of training the model on pairs of sentences, where in some cases the sentences were next to each other in the original text and in other cases they are randomly paired. The model is trained to predict whether the second sentence followed the first one in the original text.
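
The two pre-training tasks can be illustrated with a rough sketch of how their training examples might be built. This is a simplification under stated assumptions: the function names are hypothetical, every selected token is replaced by `[MASK]` (BERT's actual procedure sometimes substitutes a random token or keeps the original), and the default masking probability of 0.15 follows the BERT paper.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace randomly selected tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            targets[i] = token
    return masked, targets


def make_nsp_pairs(sentences, seed=0):
    """Build (first, second, is_next) examples from consecutive sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw random pairs")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than this one or its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


tokens = "Selegiline - induced postural hypotension in Parkinson ' s disease".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)

# Placeholder sentences standing in for consecutive sentences of an abstract.
toy_sentences = ["s0", "s1", "s2", "s3"]
for first, second, is_next in make_nsp_pairs(toy_sentences):
    print(first, second, is_next)
```

During pre-training, the model only sees `masked` and must recover the entries of `targets`; for the sentence pairs it must predict `is_next`.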

In order to use BioBERT, we will clone the original repository and download their checkpoints, which were made available by the original authors of the paper.

In [0]:
# Clone BioBERT

%cd /content/
!git clone https://github.com/diego-s/biobert.git

/content
fatal: destination path 'biobert' already exists and is not an empty directory.


# Part I: Named Entity Recognition

In Named Entity Recognition, the goal is to train an algorithm that, given a text document, can extract mentions of specific entities. Entities can be objects belonging to a specific class, such as Diseases, Drugs, Proteins, Genes, etc. The algorithm should be able to not only detect the presence of the entity, but also detect the start and end (the boundaries) of the entity mention within the text.

In order to do this, sentences are broken down into pieces, or "tokens". In the case of BERT, tokens are word-pieces. We need to assign a label to each token in order to encode the presence or absence of an entity mention. A common way to encode this is using BIO notation. In BIO notation, the start of an entity mention is encoded as B, the continuation of a previously started entity is encoded as I, and the absence of an entity mention is encoded as O. You can see an example of this below.

![](https://drive.google.com/uc?export=view&id=1kRTziWhLAOG6phoTHpWVFaEiTZrmlEc9)
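
As a small illustration of how BIO labels map back to entity mentions, the following sketch groups labelled tokens into mentions. The helper name `bio_to_spans` is hypothetical and not part of pypharma_nlp; the example sentence and labels come from the BC5CDR disease training data.

```python
def bio_to_spans(tokens, labels):
    """Group tokens labelled B/I/O into entity mentions (joined with spaces)."""
    spans = []
    current = []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new mention starts; close the previous one if it is still open.
            if current:
                spans.append(current)
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open mention: close any open mention.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return [" ".join(span) for span in spans]


tokens = ["Selegiline", "-", "induced", "postural", "hypotension",
          "in", "Parkinson", "'", "s", "disease"]
labels = ["O", "O", "O", "B", "I", "O", "B", "I", "I", "I"]
print(bio_to_spans(tokens, labels))
# → ['postural hypotension', "Parkinson ' s disease"]
```

Because every mention starts with its own B label, two adjacent mentions remain distinguishable even when no O token separates them.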

## Dataset: The BC5CDR Corpus

The BC5CDR corpus was introduced by Li et al. (2016) as part of the BioCreative V CDR task. It consists of 1500 PubMed abstracts, and annotations corresponding to 4409 chemical entities, 5818 disease entities, and 3116 interactions between chemicals and diseases. We make available the NER version used to fit the models in BioBERT, by Lee et al. (2019). The datasets are labelled using BIO notation.

In [0]:
# Download the source data

%cd /content/biobert
from pypharma_nlp.biobert.data.ner_corpora import download_source_data

download_source_data("data/ner_corpora")

/content/biobert
Downloading file from Google Drive


In [0]:
# Let's create a generator to read the training examples

%cd /content/biobert
import pandas as pd
from IPython.display import display
from IPython.display import HTML
from pypharma_nlp.biobert.data.ner_corpora import get_ner_examples

sentences = get_ner_examples("data/ner_corpora", "BC5CDR-disease", "train")

/content/biobert


In [0]:
# Now let's look at some examples, click 'run' again to see the next sentence

sentence_ids, tokens, labels = next(sentences)
table = pd.DataFrame.from_dict({
    "Sentence ID" : sentence_ids, 
    "Sentence" : tokens, 
    "Labels" : labels, 
})
display(HTML(table.to_html(index=False)))

Sentence ID,Sentence,Labels
1,Selegiline,O
1,-,O
1,induced,O
1,postural,B
1,hypotension,I
1,in,O
1,Parkinson,B
1,',I
1,s,I
1,disease,I


## BioBERT for Named Entity Recognition

Reminder: The BioBERT model is first pre-trained on a large dataset by (1) learning to predict the masked tokens (the "Masked Language Model"); and (2) learning to predict whether a sentence follows another sentence ("Next Sentence Prediction"). There is a vector of outputs for each of the tokens in the input; these vectors are called "Hidden States". A dummy token, denoted "CLS", is always added at the beginning of the inputs.

Task 2 is predicted using the hidden state at the CLS token. By convention, the hidden state at the CLS token is called the "pooled output", and the full set of hidden states corresponding to all the tokens in the input is known as the "sequence output".

![Architecture of the Disease NER model](https://drive.google.com/uc?export=view&id=1ov3UBFaW-jGxlcE-oCCWeFn-pXg-uB1B)

**Simplified Architecture of the Disease NER model**

Named Entity Recognition is performed in a similar way to classification, except that the sequence outputs (as opposed to the pooled outputs) are used to assign class probabilities at the token level (note that tokens are word-pieces in BERT and BioBERT). That is, each token has a probability of having B, I or O as a label. It's then possible to extract the tokens having either B or I as the most likely labels and consider them extracted entities.
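
The token-level labelling step can be sketched in plain Python. The scores below are made up and stand in for the output of a classification layer applied to each token's hidden state; the label order and function names are assumptions for illustration, not the BioBERT implementation.

```python
import math

LABELS = ["B", "I", "O"]


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_tokens(token_scores):
    """Pick the most probable label for each token from its per-label scores."""
    predicted = []
    for scores in token_scores:
        probs = softmax(scores)
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        predicted.append(LABELS[best])
    return predicted


# Made-up scores: one row per token, one column per label in LABELS order.
tokens = ["postural", "hypotension", "in"]
token_scores = [
    [2.1, 0.3, -1.0],
    [0.2, 1.9, -0.5],
    [-1.2, -0.8, 3.0],
]
print(list(zip(tokens, label_tokens(token_scores))))
# → [('postural', 'B'), ('hypotension', 'I'), ('in', 'O')]
```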

## Training the Model

**Important**

```
This section will not be used during the tutorial, since it will take a long 
time to run. But we encourage you to go back to this notebook and run this 
section by yourself to familiarize yourself with the BioBERT CLI.
```

In this section, we will first download the pre-trained checkpoint from BioBERT, which was trained to perform the two tasks (the Masked Language Model and Next Sentence Prediction) as previously described, on PubMed abstracts and PubMed Central articles.

We will then use the BioBERT named entity recognition script to train a disease entity extraction model. This process can take a long time (1-2 hours). However, we also provide pre-trained checkpoints that let you quickly load a model like the one you would obtain after training as shown here.

In [0]:
# Download the BioBERT checkpoint

%cd /content/biobert
from pypharma_nlp.biobert.checkpoints.base import download_checkpoint

download_checkpoint("checkpoints/biobert/", checkpoint="biobert_v1.1_pubmed")

/content/biobert
Downloading file from Google Drive


In [0]:
# Let's see what it's like to train an NER model

%cd /content/biobert/
!mkdir -p checkpoints/fine_tuning/ && \
python run_ner.py \
    --do_train="true" \
    --do_eval="true" \
    --vocab_file="checkpoints/biobert/biobert_v1.1_pubmed/vocab.txt" \
    --bert_config_file="checkpoints/biobert/biobert_v1.1_pubmed/bert_config.json" \
    --init_checkpoint="checkpoints/biobert/biobert_v1.1_pubmed/model.ckpt-1000000" \
    --num_train_epochs="10.0" \
    --data_dir="data/ner_corpora/BC5CDR-disease" \
    --output_dir="checkpoints/fine_tuning"

## Results

After fitting the model, the output of the previous cell should report results similar to the following:

- F-score: 90.3%
- Precision: 88.4%
- Recall: 92.3%
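
As a quick sanity check, the F-score can be recomputed from precision and recall. This sketch assumes the reported F-score is the usual F1, the harmonic mean of the two, which is consistent with the numbers listed.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(88.4, 92.3), 1))
# → 90.3
```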

## Load a pre-trained Checkpoint

We have trained models using the same code as shown here and stored the results in our PyPharma NLP 2019 Google Drive. You can use the pypharma_nlp package to download the checkpoints. With these checkpoints, you can use the models without training them again.

Run the following cell to download a pre-trained checkpoint that is equivalent to what you would obtain when running the previous training task in full.

In [0]:
# Recover a pre-fit checkpoint

%cd /content/biobert/
from pypharma_nlp.checkpoints import download_checkpoint

download_checkpoint("checkpoints", "biobert_v1.1_pubmed_ner_bc5cdr_disease")
!mv checkpoints/biobert_v1.1_pubmed_ner_bc5cdr_disease/ checkpoints/fine_tuning/

/content/biobert
Downloading file from Google Drive


## Prediction

You can use the following code to extract disease entities in the test set, which has not been used to train the model. You can also visualize the gold standard labels and compare them with the predicted labels. As an additional exercise, you could also try to modify this code to extract entities from an abstract or text of your choice and see how the model performs on such unseen text.

In [0]:
# Let's load the model we just trained using pypharma_nlp's wrapper
# class

%cd /content/biobert
from pypharma_nlp.biobert.wrappers import BioBertWrapper

model = BioBertWrapper()
model.build(
    "ner", 
    "ner", 
    "checkpoints/biobert/biobert_v1.1_pubmed", 
    "checkpoints/fine_tuning", 
)

/content/biobert



Enter an e-mail for Entrez:
diegovs87@yahoo.fr


In [0]:
# Let's create a generator to read the test examples

%cd /content/biobert
import pandas as pd
from IPython.display import display
from IPython.display import HTML
from pypharma_nlp.biobert.data.ner_corpora import get_ner_examples

sentences = get_ner_examples("data/ner_corpora", "BC5CDR-disease", "test")

/content/biobert


In [0]:
# Now let's look at some of the predictions of BioBERT on these examples
# click 'run' again to see the next sentence

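sentence_ids, tokens, true_labels = next(sentences)
predictions = model.extract_entities([tokens])
# Drop the "X" placeholder labels (in BioBERT's NER script these mark
# word-piece continuations)
predicted_labels = [label for label in predictions[0]["prediction"] if label != "X"]
# Shift the labels left by one position and pad the end with "O"
predicted_labels = predicted_labels[1:] + ["O"]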

table = pd.DataFrame.from_dict({
    "Sentence ID" : sentence_ids, 
    "Token" : tokens, 
    "True Label" : true_labels, 
    "Predicted Label" : predicted_labels, 
})
display(HTML(table.to_html(index=False)))

Sentence ID,Token,True Label,Predicted Label
1,Torsade,B,B
1,de,I,I
1,pointes,I,I
1,ventricular,B,B
1,tachycardia,I,I
1,during,O,O
1,low,O,O
1,dose,O,O
1,intermittent,O,O
1,dobutamine,O,O


## Conclusion

We have taken you through the process of downloading the BC5CDR corpus, exploring it, fitting a named entity recognizer to extract mentions of entities at the sentence level and, finally, predicting on test set abstracts. By now, you should be familiar with:

* The BioCreative V Chemical-Disease Relation (BC5CDR) corpus.
* The pypharma_nlp package tools to download and explore the BC5CDR corpus
* The command line interface to fit named entity recognition models with BioBERT
* The pypharma_nlp package tools to load the BioBERT pre-trained checkpoints and extract named entities from new text

# Part II: Relation Extraction

In Relation Extraction (RE), the goal is to detect relationships between entities. The entities may have been extracted prior to the relation extraction, for example by running a previously fit named entity recognition model to extract the entities. Examples of potential relations include

* Protein-Protein Interactions (PPI)
* Adverse Events (Drug-Disease/Symptom Associations)
* Gene-Disease Associations

In this exercise, we will look at the process of fitting a model for the last of these examples, using the Gene-Disease Association Database (GAD).

![Architecture of our ADE detector](https://drive.google.com/uc?export=view&id=1mZ37M5h2tpmkA0TIm1E36xn3IsoLrNsJ)

## Dataset: The GAD Relation Extraction Dataset

The Gene-Disease Association Database (GAD) Relation Extraction Dataset consists of 5329 sentences containing mentions of genes and diseases, labelled by curators as either positive for a relationship between the gene-disease pair or not. In these sentences, the gene and disease entities have been replaced by the tokens @GENE$ and @DISEASE$. Note that the semantic association itself may be positive or negative, but the sentence is still labelled as positive for the presence of the relation.

We will use our pypharma_nlp package to download the GAD dataset as made available by the authors of BioBERT. We will also take a look at the relation annotations contained in the dataset. Later on, we will use this data to train a relation extraction model that can extract Gene-Disease Associations from a given sentence.
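
To apply such a model to your own text, the gene and disease mentions would first need to be replaced by the same placeholder tokens. The sketch below shows one simple way to do this; the helper name, the example sentence, and the entity mentions are made up for illustration, and plain string replacement stands in for the character offsets that a real NER step would provide.

```python
def mask_entities(sentence, gene, disease):
    """Replace the first mention of a gene and of a disease with placeholders."""
    masked = sentence.replace(gene, "@GENE$", 1)
    masked = masked.replace(disease, "@DISEASE$", 1)
    return masked


# Hypothetical sentence and entity mentions, for illustration only.
sentence = "Mutations in BRCA1 are associated with breast cancer."
print(mask_entities(sentence, gene="BRCA1", disease="breast cancer"))
# → Mutations in @GENE$ are associated with @DISEASE$.
```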

In [0]:
# Download the source data

%cd /content/biobert
from pypharma_nlp.biobert.data.re_corpora import download_source_data

download_source_data("data/re_corpora")

/content/biobert
Downloading file from Google Drive


In [0]:
# Let's now read the training examples

%cd /content/biobert
import pandas as pd
from IPython.display import display
from IPython.display import HTML
from pypharma_nlp.biobert.data.re_corpora import get_re_examples

sentence_ids, sentences, labels = get_re_examples("data/re_corpora", "GAD", "1", "train")

/content/biobert


In [0]:
# Now let's look at the examples

table = pd.DataFrame.from_dict({
    "Sentence ID" : sentence_ids, 
    "Sentence" : sentences, 
    "Label" : labels, 
})
display(HTML(table.to_html(index=False)))

Sentence ID,Sentence,Label
1,this study proposes that A/A genotype at position -607 in @GENE$ gene can be used as a new genetic maker in Thai population for predicting @DISEASE$ development.,1
2,Common polymorphisms in the genes @GENE$ and LOC387715 are independently related to @DISEASE$ progression after adjustment for other known AMD risk factors.,1
3,Results do not support any overall association of the Ala-9Val @GENE$ polymorphism to the development of @DISEASE$.,1
4,@GENE$ methylation occurs frequently in human colonic @DISEASE$ and cancers and is closely associated with K-ras mutations.,0
5,"In conclusion, @GENE$ 8092C > A polymorphism may modify the associations between cumulative cigarette smoking and @DISEASE$ risk.",1
6,"Allele A in @GENE$ gene +252 site can significantly increase the relative risk of @DISEASE$ in women in Guangdong, among which TNF-beta AA genotype might be one of the genetic susceptible factors for endometriosis.",1
7,Our data indicate that the -160 single nucleotide polymorphism in @GENE$ is a low-penetrant @DISEASE$ susceptibility gene that might explain a proportion of familial and notably hereditary prostate cancer.,1
8,These results suggest that the @GENE$/-159 polymorphism is an important marker for the @DISEASE$ of IgAN and may modulate the level of the inflammatory responses.,0
9,there is no evidence for an association of @GENE$ alleles with @DISEASE$ in our study groups.,1
10,The association between the @GENE$ G allele and early RA is largely explained by individuals with @DISEASE$ who have coexisting autoimmune endocrinopathies.,1


## BioBERT for Relation Extraction

Reminder: The BioBERT model is first pre-trained on a large dataset by (1) learning to predict the masked tokens (the "Masked Language Model"); and (2) learning to predict whether a sentence follows another sentence ("Next Sentence Prediction"). There is a vector of outputs for each of the tokens in the input; these vectors are called "Hidden States". A dummy token, denoted "CLS", is always added at the beginning of the inputs.

Task 2 is predicted using the hidden state at the CLS token. By convention, the hidden state at the CLS token is called the "pooled output", and the full set of hidden states corresponding to all the tokens in the input is known as the "sequence output".

![Architecture of the Gene Disease Association Relation Extractor](https://drive.google.com/uc?export=view&id=1BFLzmS65cIcj41gQHOP7UDYHI6vaNJMF)

**Architecture of the Gene Disease Association Relation Extractor**

We can use BioBERT to treat the problem as a sentence classification problem. The model should be able to understand the association between the GENE and DISEASE entities, and assign an appropriate probability of presence of a relation between the two entities. As with the ADE detection text classifier, BioBERT's pooled output (the hidden state corresponding to the CLS token) is passed through a dropout, fully connected + bias layers, and finally a softmax function to obtain class probabilities. The two possible classes are: (1) Gene Disease Association; and (2) Neg, not a Gene Disease Association.
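
The classification head described above can be sketched in plain Python. All numbers are made up: a real pooled output has the model's full hidden size rather than 4 dimensions, and the weights would be learned during fine-tuning. Dropout is only active during training, so it is left out of this inference-time sketch.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled_output, weights, bias):
    """Fully connected layer plus bias, followed by a softmax over the classes."""
    logits = [
        sum(w * x for w, x in zip(class_weights, pooled_output)) + b
        for class_weights, b in zip(weights, bias)
    ]
    return softmax(logits)


# Made-up numbers: a 4-dimensional "pooled output" and a 2-class layer.
pooled_output = [0.5, -1.0, 0.25, 2.0]
weights = [
    [0.1, 0.0, -0.2, 0.4],   # class 0: no association
    [-0.3, 0.2, 0.1, 0.9],   # class 1: gene-disease association
]
bias = [0.0, 0.1]

probabilities = classify(pooled_output, weights, bias)
predicted_label = max(range(len(probabilities)), key=lambda i: probabilities[i])
print(predicted_label)
# → 1
```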

## Training the Model

**Important**

```
This section will not be used during the tutorial, since it will take a long 
time to run. But we encourage you to go back to this notebook and run this 
section by yourself to familiarize yourself with the BioBERT CLI.
```

In this section, we will first download the pre-trained checkpoint from BioBERT, which was trained to perform the two tasks (the Masked Language Model and Next Sentence Prediction) as previously described, on PubMed abstracts and PubMed Central articles.

We will then use the BioBERT relation extraction script to train a biomedical relation extraction model. This process can take a long time (1-2 hours). However, we also provide pre-trained checkpoints that let you quickly load a model like the one you would obtain after training as shown here.

In [0]:
# Fit relation extraction model

%cd /content/biobert
!python run_re.py \
    --task_name="GAD" \
    --do_train="true" \
    --do_eval="true" \
    --do_predict="true" \
    --vocab_file="checkpoints/biobert/biobert_v1.1_pubmed/vocab.txt" \
    --bert_config_file="checkpoints/biobert/biobert_v1.1_pubmed/bert_config.json" \
    --init_checkpoint="checkpoints/biobert/biobert_v1.1_pubmed/model.ckpt-1000000" \
    --max_seq_length="128" \
    --train_batch_size="32" \
    --learning_rate="2e-5" \
    --num_train_epochs="3.0" \
    --do_lower_case="false" \
    --data_dir="data/re_corpora/GAD/1" \
    --output_dir="checkpoints/fine_tuning_2/" 

/content/biobert



W1029 09:53:40.905786 140363246684032 module_wrapper.py:139] From run_re.py:907: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1029 09:53:40.906021 140363246684032 module_wrapper.py:139] From run_re.py:907: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1029 09:53:40.906419 140363246684032 module_wrapper.py:139] From /content/biobert/modeling.py:92: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1029 09:53:40.907037 140363246684032 module_wrapper.py:139] From run_re.py:937: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O re

In [0]:
# Evaluate performance of the model on the test predictions written to the
# --output_dir of the previous cell

!python biocodes/re_eval.py \
    --output_path="checkpoints/fine_tuning_2/test_results.tsv" \
    --answer_path="data/re_corpora/GAD/1/test.tsv"

f1 score    : 83.52%
recall      : 91.10%
precision   : 77.11%
specificity : 69.96%


## Results

After fitting the model, you should obtain performances on the test set close to the following:

- Recall: 92.88%
- Specificity: 67.19%
- F1 score: 83.52%
- Precision: 75.87%
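
For reference, the four metrics can be derived from the counts of true and false positives and negatives. The sketch below uses made-up labels and a hypothetical helper name; it is not the code of `re_eval.py`.

```python
def binary_metrics(true_labels, predicted_labels):
    """Compute recall, specificity, precision and F1 for 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Made-up labels, for illustration only.
true_labels = [1, 1, 1, 0, 0, 1, 0, 1]
predicted_labels = [1, 1, 0, 1, 0, 1, 0, 1]
metrics = binary_metrics(true_labels, predicted_labels)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

A high recall combined with a lower specificity, as in the results above, means the model rarely misses a true association but labels a fair share of negative sentences as positive.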

## Load a pre-trained Checkpoint

We have trained models using the same code as shown here and stored the results in our PyPharma NLP 2019 Google Drive. You can use the pypharma_nlp package to download the checkpoints. With these checkpoints, you can use the models without training them again.

Run the following cell to download a pre-trained checkpoint that is equivalent to what you would obtain when running the previous training task in full.

In [0]:
# Recover a pre-fit checkpoint

%cd /content/biobert/
from pypharma_nlp.checkpoints import download_checkpoint

download_checkpoint("checkpoints", "biobert_v1.1_pubmed_re_gad")
!mv checkpoints/biobert_v1.1_pubmed_re_gad checkpoints/fine_tuning_2

/content/biobert
Downloading file from Google Drive


## Prediction

You can use the following code to extract Gene Disease Association relations in the test set, which has not been used to train the model. You can also visualize the gold standard labels and compare them with the predicted labels. As an additional exercise, you could also try to modify this code to extract relations from an abstract or text of your choice and see how the model performs on such unseen text.

In [0]:
# Let's load the model we just trained using pypharma_nlp's wrapper
# class

%cd /content/biobert
from pypharma_nlp.biobert.wrappers import BioBertWrapper

model = BioBertWrapper()
model.build(
    "relation_extraction", 
    "gad", 
    "checkpoints/biobert/biobert_v1.1_pubmed", 
    "checkpoints/fine_tuning_2/", 
    do_lower_case=False, 
)

/content/biobert


In [0]:
# Let's create a generator to read the test examples and predict

%cd /content/biobert
import pandas as pd
from IPython.display import display
from IPython.display import HTML
from pypharma_nlp.biobert.data.re_corpora import get_re_examples

sentence_ids, sentences, true_labels = get_re_examples("data/re_corpora", 
    "GAD", "1", "test")
predicted_labels, probabilities = model.extract_relations(sentences)
prediction_data = pd.DataFrame.from_dict({
    "Sentence ID" : sentence_ids, 
    "Sentence" : sentences, 
    "True Label" : true_labels, 
    "Predicted Label" : predicted_labels, 
})
display(HTML(prediction_data.to_html(index=False)))



/content/biobert


Sentence ID,Sentence,True Label,Predicted Label
1,These results suggest that the C1772T polymorphism in @GENE$ is not involved in progression or metastasis of @DISEASE$.,1,1
2,"In our setting, @DISEASE$ among alcoholic individuals seems to be independent of the presence of mutations C282Y, H63D and S65C in the @GENE$ gene.",1,1
3,MPO genotype GG is associated with @DISEASE$ in patients with hereditary @GENE$.,1,0
4,These three studies do not provide consistent evidence supporting the hypothesis that @GENE$ mutations are associated with an increased risk of @DISEASE$ and with the development of arteriosclerosis.,1,1
5,Our prospective findings suggest that individuals carrying the @GENE$ C282Y mutation may be at increased risk of @DISEASE$.,1,1
6,"We conclude that homozygosity for the G1514-->A mutation is exclusively responsible for the adult form of @DISEASE$ in this family, and that the A619-->G substitution is not a deleterious mutation but rather a common @GENE$ polymorphism.",1,1
7,The data suggest that the @GENE$ gene or a linked locus significantly modulates the risk for @DISEASE$.,1,1
8,The novel gene @GENE$ may be related with the infiltration and proliferation of @DISEASE$.,1,1
9,Our findings suggest that the genetic variants of the @GENE$ but not the TIM-3 gene contribute to @DISEASE$ susceptibility in this African-American population.,1,1
10,"In conclusion, the M416V polymorphism of @GENE$ gene is not associated with insulin resistance in @DISEASE$.",1,1


## Conclusion

We have taken you through the process of downloading the GAD corpus, exploring it, fitting a relation extractor to predict the presence of gene-disease associations at the sentence level and, finally, predicting on new unseen sentences. By now, you should be familiar with:

* The Gene-Disease Association Database (GAD) corpus.
* The pypharma_nlp package tools to download and explore the GAD corpus
* The command line interface to fit relation extraction models with BioBERT
* The pypharma_nlp package tools to load the BioBERT pre-trained checkpoints and extract relations from new text