### Named Entity Recognition with BioBERT over biomedical journal corpus

##### Homework 3, Fall, 2021
##### Prof. James H. Martin
###### author: Sushma Akoju


Notebook to train/fine-tune a BioBERT model to perform named entity recognition (NER). 

Required features:
  - Sentence id
  - Word
  - POS
  - Tag

The dataset did not include POS tags, so they were generated with the NLTK library. [Using a NLTK Tagger](https://www.nltk.org/book/ch05.html)

Steps:
* Getting Data
* Training and validating the Model
* Model Inference over Test data

#### Intuition and background
Consider the least common words in the dataset provided for this homework: 'K713', 'hypercholesterolemic', 'lutein', 'P69', 'conference', 'Talk', 'Tele', 'cruciform', 'TE105'. Many of them are not only rare; they also need domain-specific knowledge for subword tokenization, which is both different from and harder than tokenizing common English words. This makes BioBERT a reasonable model to explore for this task. It is also consistent with the idea that domain expertise adds information needed to understand the named entity tags, and vice versa: to represent knowledge and "reason" over the context of a medical journal text corpus, named entity recognition is a prerequisite for domain-specific knowledge mining.
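The subword tokenization point can be illustrated with a toy greedy longest-match-first splitter in the style of WordPiece. This is only a sketch: `toy_vocab` is a tiny hypothetical vocabulary made up for the example, not BioBERT's actual vocabulary, and `wordpiece_split` is an illustrative helper rather than a library function.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece of the vocabulary fits: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen only to make the example work.
toy_vocab = {"hyper", "##chol", "##ester", "##ol", "##emic", "conference"}

common = wordpiece_split("conference", toy_vocab)
rare = wordpiece_split("hypercholesterolemic", toy_vocab)
unknown = wordpiece_split("K713", toy_vocab)
print(common, rare, unknown)
```

A common word stays whole, a rare biomedical word is broken into several pieces, and an identifier with no matching pieces falls back to the unknown token. A vocabulary learned from biomedical text is more likely to contain useful pieces for such words.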

#### Task Description

> Named entity recognition (NER) is the task of tagging entities in text with their corresponding type. Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities. O is used for non-entity tokens.
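As a small illustration of the BIO notation, the sketch below groups a tagged sentence into entity spans. It assumes the untyped `B`/`I`/`O` labels used in this dataset; `bio_to_spans` is a hypothetical helper written for this example, and treating a stray `I` as the start of an entity is an assumption.

```python
def bio_to_spans(words, tags):
    """Group B/I/O tags into (start, end_exclusive, text) entity spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new entity begins; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i, " ".join(words[start:i])))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i, " ".join(words[start:i])))
            start = None
        elif tag == "I" and start is None:
            # Assumption: an I without a preceding B opens a new entity.
            start = i
    if start is not None:
        spans.append((start, len(tags), " ".join(words[start:])))
    return spans

# First sentence of the dataset, with its gold tags.
words = ["Comparison", "with", "alkaline", "phosphatases", "and",
         "5", "-", "nucleotidase", "."]
tags = ["O", "O", "B", "I", "O", "B", "I", "I", "O"]
spans = bio_to_spans(words, tags)
print(spans)
```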

#### Install Dependencies and Restart Runtime

In [None]:
!pip install -q transformers
!pip install -q simpletransformers


### Getting Data

#### Loading the data from Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
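d = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/bio_ner.csv")
# Each row is [ind, Sentence #, Line, Word, POS, Tag]
rows = d.to_numpy().tolist()
new_list = []
counter = 0
for row in rows:
  # A Line value of 1 marks the first token of a new sentence
  if row[2] == 1:
    counter += 1
  # Overwrite the sentence id with the running sentence count
  row[1] = counter
  new_list.append(row)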

##### The data was converted to a csv file with Sentence ID, Line, Word, POS tag and the IOB Tag

In [None]:
d = pd.DataFrame.from_records(new_list, columns = ['ind', 'Sentence #','Line','Word', 'POS', 'Tag'])
d.head(30)

Unnamed: 0,ind,Sentence #,Line,Word,POS,Tag
0,1,1,1,Comparison,NNP,O
1,2,1,2,with,IN,O
2,3,1,3,alkaline,JJ,B
3,4,1,4,phosphatases,NNS,I
4,5,1,5,and,CC,O
5,6,1,6,5,CD,B
6,7,1,7,-,:,I
7,8,1,8,nucleotidase,NN,I
8,9,1,9,.,.,O
9,10,2,1,Pharmacologic,NNP,O


In [None]:
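d = d.drop('ind', axis=1)
df = d[['Sentence #','Word', 'POS', 'Tag']]
# Row positions of the first token of each boundary sentence
start = df[df['Sentence #']==6897].index.values.astype(int)[0]
second_start = df[df['Sentence #']==10347].index.values.astype(int)[0]
end = df[df['Sentence #']==13794].index.values.astype(int)[0]
# Sentences 1-6896 -> train, 6897-10346 -> dev, 10347-13793 -> test.
# The test slice stops before the first row of sentence 13794, so that sentence is left out.
train = df.iloc[0:start]
train_dev = df.iloc[start:second_start]
test = df.iloc[second_start:end]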

In [None]:
df.head(13)

Unnamed: 0,Sentence #,Word,POS,Tag
0,1,Comparison,NNP,O
1,1,with,IN,O
2,1,alkaline,JJ,B
3,1,phosphatases,NNS,I
4,1,and,CC,O
5,1,5,CD,B
6,1,-,:,I
7,1,nucleotidase,NN,I
8,1,.,.,O
9,2,Pharmacologic,NNP,O


In [None]:
train.to_csv('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/train.tsv', sep="\t", header=False, index = False)
test.to_csv('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/test.tsv', sep="\t", header=False, index = False)
train_dev.to_csv('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/train_dev.tsv', sep="\t", header=False, index = False)

In [None]:
import pandas as pd
def read_conll(filename):
    df = pd.read_csv(filename,
                    sep = '\t', header = None, keep_default_na = False,
                    names = ['sentence_id','words', 'pos', 'labels'],
                    quoting = 3, skip_blank_lines = False)
    df = df[~df['words'].astype(str).str.startswith('-DOCSTART- ')] 
    return df[df.words != '']

#### For this task, the data is split into train, dev and test datasets.
- Train: 6896 sentences
- Dev: 3450 sentences
- Test: 3447 sentences
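The split is made by sentence id, using the row position of the first token of each boundary sentence. The toy sketch below mirrors that idea on made-up rows and shows that slicing up to the first row of the last sentence leaves that sentence out of the test set. The helper names and the `include_last` flag are hypothetical and are not part of the notebook.

```python
def first_row_of(rows, sentence_id):
    """Position of the first row that belongs to the given sentence."""
    return next(i for i, (sid, _) in enumerate(rows) if sid == sentence_id)

def split_rows(rows, dev_start, test_start, last_sentence, include_last=False):
    """Split (sentence_id, word) rows into train, dev and test by sentence id."""
    start = first_row_of(rows, dev_start)
    second_start = first_row_of(rows, test_start)
    # Stopping at the first row of the last sentence excludes that sentence.
    end = len(rows) if include_last else first_row_of(rows, last_sentence)
    return rows[:start], rows[start:second_start], rows[second_start:end]

def n_sentences(part):
    """Number of distinct sentence ids in a split."""
    return len({sid for sid, _ in part})

# Toy data: six sentences with two tokens each.
rows = [(sid, f"w{sid}_{k}") for sid in range(1, 7) for k in range(2)]

toy_train, toy_dev, toy_test = split_rows(rows, dev_start=4, test_start=5, last_sentence=6)
_, _, toy_test_full = split_rows(rows, dev_start=4, test_start=5, last_sentence=6,
                                 include_last=True)
print(n_sentences(toy_train), n_sentences(toy_dev),
      n_sentences(toy_test), n_sentences(toy_test_full))
```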

In [None]:
train_df = read_conll('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/train.tsv')
test_df = read_conll('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/test.tsv')
dev_df = read_conll('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/train_dev.tsv')
test_df.head(20)

Unnamed: 0,sentence_id,words,pos,labels
0,10347,In,IN,O
1,10347,some,DT,O
2,10347,cases,NNS,O
3,10347,",",",",O
4,10347,the,DT,O
5,10347,aberrant,JJ,O
6,10347,methylation,NN,O
7,10347,of,IN,O
8,10347,CpGs,NNP,O
9,10347,within,IN,O


We now print out the statistics (number of sentences) of the train, dev and test sets.

In [None]:
data = [[train_df['sentence_id'].nunique(), test_df['sentence_id'].nunique(), dev_df['sentence_id'].nunique()]]

# Shows the number of unique sentences in the train, test and dev sets.
pd.DataFrame(data, columns=["Train", "Test", "Dev"])

Unnamed: 0,Train,Test,Dev
0,6896,3447,3450


# Training and Testing the Model

#### Set up the Training Arguments

We set up the training arguments. We train for 10 epochs to get scores close to the state of the art (SOTA). The train, dev and test sets are relatively small, so training does not take too long. We enable a sliding window because NER sequences can be quite long and, with limited GPU memory, we cannot make `max_seq_length` much larger.
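The idea behind a sliding window can be sketched in a few lines: a long token sequence is cut into overlapping windows that each fit within a maximum length. This is a generic illustration and not the Simple Transformers implementation; `sliding_windows`, the window size and the stride are hypothetical choices made for the example.

```python
def sliding_windows(tokens, window_size, stride):
    """Cut tokens into windows of at most window_size, one every `stride` tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the sequence.
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows

tokens = [f"t{i}" for i in range(10)]
# With stride < window_size, consecutive windows share one token of context.
windows = sliding_windows(tokens, window_size=4, stride=3)
for w in windows:
    print(w)
```

A stride smaller than the window size keeps some shared context between neighbouring windows, at the cost of processing the overlapping tokens more than once.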

In [None]:
train_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'sliding_window': True,
    'max_seq_length': 64,
    'num_train_epochs': 10,
    'train_batch_size': 32,
    'fp16': True,
    'output_dir': '/outputs/',
    'best_model_dir': '/outputs/best_model/',
    'evaluate_during_training': True,
}

The following line of code stores the list of NER tags/labels in the variable `custom_labels`. The three labels are hard-coded here; the commented-out expression would derive them from the training data instead.

In [None]:
custom_labels = ['I', 'B', 'O'] #list(train_df['labels'].unique())
print(custom_labels)

['I', 'B', 'O']


#### Train the Model

###### We use the pre-trained BioBERT model (by [DMIS Lab, Korea University](https://huggingface.co/dmis-lab)) from the [Hugging Face Transformers](https://github.com/huggingface/transformers) library as the base, and the [Simple Transformers library](https://simpletransformers.ai/docs/classification-models/) on top of it, to train the NER (sequence tagging) model with just a few lines of code.

In [None]:
from simpletransformers.ner import NERModel
from transformers import AutoTokenizer
import pandas as pd
import logging

logging.basicConfig(level=logging.DEBUG)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

# We use the BioBERT pre-trained model.
model = NERModel('bert', 'dmis-lab/biobert-v1.1', labels=custom_labels, args=train_args)

# Train the model, evaluating on the dev set during training
# https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping
model.train_model(train_df, eval_data=dev_df)

# Evaluate the model on the test set (loss, precision, recall and F1-score)
result, model_outputs, preds_list = model.eval_model(test_df)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:simpletransformers.ner.ner_model: Training of bert model complete. Saved to /outputs/.
INFO:simpletransformers.ner.ner_model:{'eval_loss': 0.14850225066391373, 'precision': 0.830317147464612, 'recall': 0.8575129533678757, 'f1_score': 0.8436959490213929}


The F1-score of the model on the test set is **84.4%** ('f1_score': 0.8436959490213929).
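As a quick consistency check, the F1-score can be recomputed from the precision and recall reported in the evaluation output, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is introduced here only for illustration.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation output above.
precision = 0.830317147464612
recall = 0.8575129533678757

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.4f}")
```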

# Using the Model (Running Inference)

Running predictions/inference with the model is as simple as calling `model.predict(samples)`. First we rebuild each sentence of the test set as a single string, then we predict the tags for every sentence.

In [None]:
ids = test_df.sentence_id.unique().tolist()
len(ids)

3447

In [None]:
preds = []
samples = []  # one space-joined string per test sentence
for sent_id in ids:
  sample = test_df[test_df.sentence_id == sent_id].words.str.cat(sep=' ')
  samples.append(sample)

predictions, _ = model.predict(samples)
print(predictions, samples)
for idx, sample in enumerate(samples):
  counter = 1  # 1-based position of the word within its sentence
  print('{}: '.format(idx))
  for word in predictions[idx]:
    # Each prediction is a single-entry dict of the form {word: tag}
    w = list(word.keys())[0]
    tag = list(word.values())[0]
    preds.append([counter, w, tag])
    print('{}'.format(word), type(word), word)
    counter += 1

INFO:simpletransformers.ner.ner_model: Converting to features started.


  0%|          | 0/5 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1725 [00:00<?, ?it/s]

Streaming output truncated to the last 5000 lines.
{'activity': 'O'} <class 'dict'> {'activity': 'O'}
{'.': 'O'} <class 'dict'> {'.': 'O'}
13628: 
{'Effects': 'O'} <class 'dict'> {'Effects': 'O'}
{'of': 'O'} <class 'dict'> {'of': 'O'}
{'spatial': 'O'} <class 'dict'> {'spatial': 'O'}
{'and': 'O'} <class 'dict'> {'and': 'O'}
{'temporal': 'O'} <class 'dict'> {'temporal': 'O'}
{'smoothing': 'O'} <class 'dict'> {'smoothing': 'O'}
{'on': 'O'} <class 'dict'> {'on': 'O'}
{'stimulated': 'O'} <class 'dict'> {'stimulated': 'O'}
{'brillouin': 'O'} <class 'dict'> {'brillouin': 'O'}
{'scattering': 'O'} <class 'dict'> {'scattering': 'O'}
{'in': 'O'} <class 'dict'> {'in': 'O'}
{'the': 'O'} <class 'dict'> {'the': 'O'}
{'independent': 'O'} <class 'dict'> {'independent': 'O'}
{'-': 'O'} <class 'dict'> {'-': 'O'}
{'hot': 'O'} <class 'dict'> {'hot': 'O'}
{'-': 'O'} <class 'dict'> {'-': 'O'}
{'spot': 'O'} <class 'dict'> {'spot': 'O'}
{'model': 'O'} <class 'dict'> {'model': 'O'}
{'limit': 'O'} 

In [None]:
df_preds = pd.DataFrame.from_records(preds)
df_preds.head(30)

Unnamed: 0,0,1,2
0,1,The,O
1,2,genomic,O
2,3,fragments,O
3,4,were,O
4,5,fused,O
5,6,upstream,O
6,7,of,O
7,8,the,O
8,9,luciferase,B
9,10,reporter,I


In [None]:
df_preds.to_csv('/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/preds_biobert.csv', sep="\t", header=False, index = False)

In [None]:
# `sample` still holds the last sentence from the loop above
samples = [sample]
predictions, _ = model.predict(samples)
print(predictions, samples)
for idx, sample in enumerate(samples):
  print('{}: '.format(idx))
  for word in predictions[idx]:
    print('{}'.format(word))

INFO:simpletransformers.ner.ner_model: Converting to features started.


  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

[[{'Deletion': 'O'}, {'of': 'O'}, {'the': 'O'}, {'last': 'O'}, {'two': 'O'}, {'Ser': 'O'}, {'residues': 'O'}, {',': 'O'}, {'including': 'O'}, {'one': 'O'}, {'PKC': 'B'}, {'consensus': 'I'}, {'site': 'I'}, {'in': 'O'}, {'the': 'O'}, {'receptor': 'O'}, {'tail': 'O'}, {',': 'O'}, {'prevented': 'O'}, {'only': 'O'}, {'phorbol': 'O'}, {'12': 'O'}, {'-': 'O'}, {'myristate': 'O'}, {'13': 'O'}, {'-': 'O'}, {'acetate': 'O'}, {'-': 'O'}, {'induced': 'O'}, {'desensitization': 'O'}, {'by': 'O'}, {'30': 'O'}, {'%.': 'O'}]] ['Deletion of the last two Ser residues , including one PKC consensus site in the receptor tail , prevented only phorbol 12 - myristate 13 - acetate - induced desensitization by 30 %.']
0: 
{'Deletion': 'O'}
{'of': 'O'}
{'the': 'O'}
{'last': 'O'}
{'two': 'O'}
{'Ser': 'O'}
{'residues': 'O'}
{',': 'O'}
{'including': 'O'}
{'one': 'O'}
{'PKC': 'B'}
{'consensus': 'I'}
{'site': 'I'}
{'in': 'O'}
{'the': 'O'}
{'receptor': 'O'}
{'tail': 'O'}
{',': 'O'}
{'prevented': 'O'}
{'only': 'O'}
{'ph
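The prediction output shown above is a list of sentences, each a list of single-entry `{word: tag}` dicts. The sketch below flattens that structure into rows that also keep the sentence index, so that sentences stay separable after export. `predictions_to_rows` is a hypothetical helper, and the sample input is a short excerpt in the same format as the outputs above rather than a real model call.

```python
def predictions_to_rows(predictions):
    """Flatten [[{word: tag}, ...], ...] into (sentence, position, word, tag) rows."""
    rows = []
    for sent_idx, sentence in enumerate(predictions):
        for position, item in enumerate(sentence, start=1):
            # Each item is a dict with exactly one word -> tag entry.
            word, tag = next(iter(item.items()))
            rows.append((sent_idx, position, word, tag))
    return rows

# Excerpt in the same shape as the model output (not a real prediction call).
predictions = [
    [{'PKC': 'B'}, {'consensus': 'I'}, {'site': 'I'}, {'in': 'O'}],
    [{'luciferase': 'B'}, {'reporter': 'I'}],
]

rows = predictions_to_rows(predictions)
for row in rows:
    print(row)
```

Such rows can be passed to `pd.DataFrame.from_records` with explicit column names if a table is needed.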

You can move the model checkpoint files, which are saved in the `/outputs/` directory, to your Google Drive.

In [None]:
import shutil
shutil.move('/outputs/', "/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/outputs/")

'/content/drive/MyDrive/Colab Notebooks/entity-extraction/hw3/data_ner_hw/outputs/'

#### The results and analysis
For comparison, the results are analyzed alongside the other approaches implemented for this homework. Although the validation results look impressive, the performance is still far from the expected or reasonable F1-scores. The results are covered in detail in the final report.