# A Notebook to demonstrate BioBERT using NER - edr

BioBERT is a text-mining model pre-trained on biomedical datasets. Pre-training starts from the weights of the regular BERT model, which are then further pre-trained on medical corpora (PubMed abstracts and PMC). This domain-specific pre-trained model can be fine-tuned for many tasks such as NER (Named Entity Recognition), RE (Relation Extraction) and QA (Question Answering). The reported analysis shows that a fine-tuned BioBERT model outperforms a fine-tuned BERT model on biomedical domain-specific NLP tasks.
In this notebook, we will use a pre-trained deep learning model to process some text for NER. We will then use the output of that model to classify the text. The text comes from sentences in the NCBI dataset.

## Models: Token Classification
Our goal is to create a model that takes a token (just like the ones in our dataset) and produces one of three labels: B or I (the token begins or continues a biomedical entity) or O (the token is outside any entity). We can think of it as looking like this:

<img src="https://www.pragnakalp.com/wp-content/uploads/2020/05/BIOBERT-NER-Demo.jpg" />

By having a pre-trained model that encompasses both general and biomedical domain corpora, developers and practitioners could now encapsulate biomedical terms that would have been incredibly difficult for a general language model to comprehend.

<img src="https://miro.medium.com/max/631/1*w-68faGJlR23sbvUEADrPw.png" />

## Dataset
The dataset we will use in this example is [BioBERT-Base v1.1 (+ PubMed 1M)](https://drive.google.com/file/d/1R84voFKHfWV9xjzeLzWBbmY1uOMYpnyD/view), which contains sentences with symptoms, diagnosis...

## Installing the transformers library
Let's start by installing the huggingface transformers library so we can load our deep learning NLP model.

In [1]:
!pip install transformers



In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
Use pandas to read the dataset and load it into a dataframe.

In [3]:
df = pd.read_csv('train.tsv', delimiter='\t', header=None)

Use the first 2,000 rows of the dataset for demonstration. Each row holds one token and its tag.

First, list the first 10 rows:

In [18]:
df.head(10)

Unnamed: 0,0,1
0,Identification,O
1,of,O
2,APC2,O
3,",",O
4,a,O
5,homologue,O
6,of,O
7,the,O
8,adenomatous,B
9,polyposis,I


In [4]:
batch_1 = df[:2000]

The datasets used to evaluate NER are structured in the BIO (Beginning, Inside, Outside) schema, the most common tagging format for sentence tokens in this task. That way we can note the positional prefix and entity type being predicted from the training data. B and I mark tokens that begin or continue an entity (13 entity types with 27 tags under the BIO schema), and O marks tokens outside any biomedical entity.

<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    text
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      the
    </td>
    <td class="mdc-bg-purple-50">
      O
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      adenomatous
    </td>
    <td class="mdc-bg-purple-50">
      B
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      polyposis
    </td>
    <td class="mdc-bg-purple-50">
      I
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      coli
    </td>
    <td class="mdc-bg-purple-50">
      I
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      tumour
    </td>
    <td class="mdc-bg-purple-50">
      I
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      ,
    </td>
    <td class="mdc-bg-purple-50">
      O
    </td>
  </tr>
</table>
<img src="BIO.png" />
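
To make the schema concrete, here is a minimal sketch of how B/I/O tags can be grouped back into entity mentions. The helper name `bio_to_spans` is hypothetical (it is not part of the notebook or of any library), and the tokens are the ones from the table above.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into entity mentions using B/I/O tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts here; close any entity still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the entity that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with nothing open) ends the open entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["the", "adenomatous", "polyposis", "coli", "tumour", ","]
tags = ["O", "B", "I", "I", "I", "O"]
entities = bio_to_spans(tokens, tags)
print(entities)  # ['adenomatous polyposis coli tumour']
```

The single B tag opens the mention and the following I tags extend it, so the four tagged tokens come back as one entity.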

We can ask pandas how many tokens are labeled "B", "I" and "O".


In [5]:
batch_1[1].value_counts()

O    1788
I     123
B      89
Name: 1, dtype: int64

## Loading the Pre-trained BioBERT model
Now load a pre-trained BioBERT model. 

In [6]:
# Load from BERT with transformers:
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

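Right now, the variable `model` holds a pretrained model. Note that `'bert-base-uncased'` is the general-domain BERT checkpoint; to run the rest of the notebook on BioBERT, set `pretrained_weights` to a BioBERT checkpoint instead (for example `dmis-lab/biobert-base-cased-v1.1` on the Hugging Face Hub).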

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

In [7]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
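
As a rough illustration of what "words and subwords" means, the sketch below implements greedy longest-match-first splitting in the style of WordPiece. It is a simplification under stated assumptions: the function name `wordpiece` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also normalizes text, uses a vocabulary of tens of thousands of entries, and maps pieces to ids.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"the", "poly", "##pos", "##is"}

print(wordpiece("the", toy_vocab))        # ['the']
print(wordpiece("polyposis", toy_vocab))  # ['poly', '##pos', '##is']
print(wordpiece("coli", toy_vocab))       # ['[UNK]']
```

With `add_special_tokens=True`, the encoded sequence is additionally wrapped in the `[CLS]` and `[SEP]` markers.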

### Padding
After tokenization, `tokenized` is a list of sentences -- each sentence is represented as a list of tokens. We want BioBERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths). In this example we pad to the longest text.

In [8]:
# Length of the longest tokenized entry in the batch
max_len = max(len(ids) for ids in tokenized.values)

# Right-pad every entry with 0 (BERT's [PAD] id) up to max_len
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

Our dataset is now in the `padded` variable. We can view its dimensions with NumPy:

In [9]:
np.array(padded).shape

(2000, 8)
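
The padding step can be seen on a small scale in the following sketch. The id lists are made up for illustration (101 and 102 stand in for `[CLS]` and `[SEP]`); only the padding logic mirrors the cell above.

```python
# Toy token-id lists of different lengths (illustrative values only).
toy_tokenized = [
    [101, 7, 102],
    [101, 8, 9, 10, 102],
    [101, 102],
]

# Pad every list on the right with 0 up to the longest length.
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = [ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized]

for row in toy_padded:
    print(row)
# [101, 7, 102, 0, 0]
# [101, 8, 9, 10, 102]
# [101, 102, 0, 0, 0]
```

Every row now has the same length, which is what allows the batch to be stored as a single rectangular array.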

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [10]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 8)

The result is a 2000x8 matrix, the same shape as `padded`, which is used in the next step.
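
On a toy batch, the mask looks like this. The array values are invented for illustration; the `np.where` call is the same one used in the cell above.

```python
import numpy as np

# Toy padded batch: 0 marks padding positions (illustrative ids only).
toy_batch = np.array([
    [101, 7, 102, 0, 0],
    [101, 8, 9, 10, 102],
    [101, 102, 0, 0, 0],
])

# 1 where there is a real token, 0 where there is padding.
toy_mask = np.where(toy_batch != 0, 1, 0)
print(toy_mask)

# Summing each row recovers the unpadded length of each entry.
real_lengths = toy_mask.sum(axis=1)
print(real_lengths)  # [3 5 2]
```

This relies on 0 being used only for padding, which holds here because 0 is the id reserved for `[PAD]`.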

## Deep Learning
Now that we have our model and inputs ready, let's run our model!
<img src="DeepLearning.png" />
Calling `model()` runs our sentences through BERT. The results of the processing are stored in `last_hidden_states`.

In [11]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need. That is the output corresponding to the first token of each sentence. The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [12]:
features = last_hidden_states[0][:,0,:].numpy()
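
The slice `[:, 0, :]` can be checked on a small stand-in array. The sketch below uses NumPy instead of a real model output, with made-up sizes (a BERT-base hidden state would have 768 values per token rather than 4).

```python
import numpy as np

# Stand-in for the hidden states: (batch, sequence length, hidden size).
n_batch, n_tokens, n_hidden = 2, 3, 4
toy_hidden = np.arange(n_batch * n_tokens * n_hidden).reshape(n_batch, n_tokens, n_hidden)

# Keep every example, only position 0 (the [CLS] slot), and every hidden unit.
cls_vectors = toy_hidden[:, 0, :]

print(cls_vectors.shape)  # (2, 4)
print(cls_vectors)
# [[ 0  1  2  3]
#  [12 13 14 15]]
```

The result has one row per example, which is the 2-d layout that scikit-learn classifiers expect as input features.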

The labels indicating which tokens are biomedical (B, I) and non-biomedical (O) now go into the `labels` variable:

In [13]:
labels = batch_1[1]

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set (we are using the 2,000 rows selected above).

In [14]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

We now train the LogisticRegression model. If you run a grid search over the regularization strength `C`, you can plug the best value into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [15]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression()

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />
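
The notebook imports `GridSearchCV` but does not show the search itself. One way it could look is sketched below, using synthetic data from `make_classification` as a stand-in for `train_features` and `train_labels`; the grid of `C` values and the `max_iter` setting are assumptions chosen for illustration, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (train_features, train_labels); not the notebook's data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate regularization strengths (an assumed, illustrative grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search over C.
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, y)

best_C = grid.best_params_["C"]
print("best C:", best_C)
print("mean CV accuracy:", round(grid.best_score_, 3))
```

On the real data, `X` and `y` would be replaced by `train_features` and `train_labels`, and `best_C` passed to `LogisticRegression(C=best_C)`.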

## Evaluating Model #2
Check the accuracy against the testing dataset:

In [16]:
lr_clf.score(test_features, test_labels)

0.94

What can we compare it against? Let's first look at a dummy classifier:

In [17]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.894 (+/- 0.00)
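
The dummy score is consistent with simply predicting the most frequent tag. A quick check, using the label counts printed earlier in the notebook (the counts cover the whole 2,000-row batch, so this is an approximation of what the cross-validated dummy classifier sees):

```python
from collections import Counter

# Label counts reported by value_counts() above.
label_counts = Counter({"O": 1788, "I": 123, "B": 89})

total = sum(label_counts.values())
majority_label, majority_count = label_counts.most_common(1)[0]

# Accuracy of always predicting the most frequent label.
baseline_accuracy = majority_count / total
print(majority_label, round(baseline_accuracy, 3))  # O 0.894
```

Because almost 90% of the tokens are tagged O, accuracy alone says little about how well the B and I tags are recognized; the 0.94 above should be read against this 0.894 baseline.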


## Improving the score
To improve the score on this task, we can apply a process called **fine-tuning**, which updates BioBERT's weights so that it achieves better performance on this NER task.