# Bio Named Entity Recognition
- Named Entity Recognition (NER) is the task of finding the entity mentions in a sentence and assigning each one a type.
- In practice, each word is classified into one of a predefined set of entity labels.
- BERT models can be used to solve the NER task by adding a softmax classification layer on top of the last hidden layer.
- The embeddings generated by BERT already encode the relation between each word and the other words in the sentence, which makes them suitable for NER, where the entity label can depend heavily on context.
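
To make the "softmax layer on top of the embeddings" idea concrete, here is a minimal sketch in plain Python. It is not the code used by `transformers`. The 2-dimensional "embeddings", the weights, and the three labels are hand-picked, hypothetical numbers that stand in for BERT's per-token output vectors and a trained classification layer.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify_tokens(embeddings, weights, bias, labels):
    """Apply one linear layer followed by softmax to every token embedding.

    embeddings: one vector per token (stand-ins for BERT's last-layer outputs)
    weights:    one weight vector per label
    bias:       one scalar per label
    """
    predicted = []
    for vector in embeddings:
        scores = [sum(w * x for w, x in zip(row, vector)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(scores)
        # Each token independently gets the label with the highest probability
        predicted.append(labels[probs.index(max(probs))])
    return predicted

# Toy values (hypothetical): 3 tokens, 2-dimensional embeddings, 3 labels
labels = ["O", "B-protein", "I-protein"]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
embeddings = [[2.0, 0.1], [0.2, 3.0], [-1.5, -2.0]]
print(classify_tokens(embeddings, weights, bias, labels))
```

In the real model the same layer is applied to every sub-word token, so one word can receive several predictions.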

## TLDR
The fine-tuned model is not yet good enough to be used. Open question: is BERT actually good enough for this task?

## Download the JNLPBA dataset
- TODO: Check the release dates of each dataset and investigate the entities in each dataset.
- https://www.aclweb.org/anthology/W04-1213.pdf


In [None]:
! mkdir JNLPBA

In [None]:
! wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/JNLPBA/train.txt -O JNLPBA/train.raw

In [None]:
! wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/JNLPBA/test.txt -O JNLPBA/test.raw

In [None]:
! wget https://raw.githubusercontent.com/allenai/scibert/master/data/ner/JNLPBA/dev.txt -O JNLPBA/dev.raw

In [None]:
! ls JNLPBA

In [None]:
! cat JNLPBA/*.raw | cut -f 4 | sort | grep -v "^$" | uniq > JNLPBA/labels.txt

In [None]:
! cat JNLPBA/labels.txt

In [None]:
! cat JNLPBA/train.raw | cut -f 1,4 | tr '\t' ' ' > JNLPBA/train.txt.tmp
! cat JNLPBA/test.raw | cut -f 1,4 | tr '\t' ' ' > JNLPBA/test.txt.tmp
! cat JNLPBA/dev.raw | cut -f 1,4 | tr '\t' ' ' > JNLPBA/dev.txt.tmp

In [None]:
! head JNLPBA/train.txt.tmp

In [None]:
! head JNLPBA/test.txt.tmp

In [None]:
! head JNLPBA/dev.txt.tmp

# Fine-tune the model
Starting checkpoint: https://huggingface.co/mrm8488/scibert_scivocab-finetuned-CORD19

In [None]:
# ! git clone https://github.com/huggingface/transformers
! git clone https://github.com/AMR-KELEG/transformers.git

In [None]:
! pip install transformers

In [None]:
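# `! cd` runs in a throwaway subshell, so it does not change the notebook's working directory.
# The cells below use absolute paths, so no directory change is needed.
# ! cd transformers/examples/ner/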

In [None]:
! wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"

In [None]:
%env MAX_LENGTH=128
%env BERT_MODEL=mrm8488/scibert_scivocab-finetuned-CORD19

In [None]:
! python preprocess.py /kaggle/working/JNLPBA/train.txt.tmp $BERT_MODEL $MAX_LENGTH > /kaggle/working/JNLPBA/train.txt
! python preprocess.py /kaggle/working/JNLPBA/test.txt.tmp $BERT_MODEL $MAX_LENGTH > /kaggle/working/JNLPBA/test.txt
! python preprocess.py /kaggle/working/JNLPBA/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > /kaggle/working/JNLPBA/dev.txt

In [None]:
! head /kaggle/working/JNLPBA/train.txt

In [None]:
%env OUTPUT_DIR=roberta
%env BATCH_SIZE=32
%env NUM_EPOCHS=2
%env SAVE_STEPS=750
%env SEED=1

In [None]:
! pip install -r /kaggle/working/transformers/examples/requirements.txt

In [None]:
! python /kaggle/working/transformers/examples/ner/run_ner.py --data_dir /kaggle/working/JNLPBA/ \
--model_type bert \
--labels /kaggle/working/JNLPBA/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length  $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir \
--evaluate_during_training \
--logging_steps 4000

# Test the model on a single sample

In [None]:
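from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned weights from the training output directory ($OUTPUT_DIR)
model = AutoModelForTokenClassification.from_pretrained('roberta')
# Use the tokenizer of the checkpoint the model was fine-tuned from ($BERT_MODEL)
tokenizer = AutoTokenizer.from_pretrained('mrm8488/scibert_scivocab-finetuned-CORD19')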


In [None]:
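import torch

# Label ids are assumed to follow the line order of labels.txt (the file passed to run_ner.py)
with open('JNLPBA/labels.txt', 'r') as f:
    labels = [l.strip() for l in f]

label_id_to_label = {i: l for i, l in enumerate(labels)}

def tokenize(sample):
    # encode() already adds the [CLS] and [SEP] special tokens
    return tokenizer.encode(sample)

def get_prediction(sample):
    tokens = tokenize(sample)
    attention_mask = [1] * len(tokens)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_ids=torch.LongTensor([tokens]),
                       attention_mask=torch.LongTensor([attention_mask]))[0]
    predictions = logits.argmax(dim=2).tolist()[0]
    # Drop the predictions for [CLS] and [SEP] so the list lines up with tokenizer.tokenize(sample)
    return [label_id_to_label[i] for i in predictions[1:-1]]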

In [None]:
sample = 'IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.'
predictions = get_prediction(sample)
for token, pred in zip(tokenizer.tokenize(sample), predictions):
    print(token, pred)

In [None]:
print('''IL-2 B-DNA
gene I-DNA
expression O
and O
NF-kappa B-protein
B I-protein
activation O
through O
CD28 B-protein
requires O
reactive O
oxygen O
production O
by O
5-lipoxygenase B-protein
. O''')
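
The labels printed above use the B-/I-/O scheme: `B-` opens an entity, `I-` continues it, and `O` marks a word outside any entity. As a rough sketch (the helper name `bio_to_spans` is hypothetical and not part of the notebook or of `transformers`), the tags can be grouped into entity spans like this. An `I-` tag that does not continue an open span of the same type is treated as outside, which is one possible convention among several.

```python
def bio_to_spans(words, tags):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []

    def close_span():
        # Store the span collected so far, if there is one
        if current_words:
            spans.append((current_type, " ".join(current_words)))

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            close_span()
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue the open span
            close_span()
            current_type, current_words = None, []
    close_span()
    return spans

# The word/label pairs printed in the cell above
annotated = """IL-2 B-DNA
gene I-DNA
expression O
and O
NF-kappa B-protein
B I-protein
activation O
through O
CD28 B-protein
requires O
reactive O
oxygen O
production O
by O
5-lipoxygenase B-protein
. O"""
pairs = [line.split() for line in annotated.splitlines()]
words = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]
for entity_type, text in bio_to_spans(words, tags):
    print(entity_type, "|", text)
```

Comparing spans rather than individual tags makes it easier to see which whole entities the model found or missed.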

## Conclusion
- The sample shows that the model's predictions are not yet good enough.
- Additionally, BERT splits words into sub-word tokens, which introduces the tricky problem of aligning the predictions with the original words.
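
One common way to handle the alignment problem is to keep only the prediction of each word's first sub-token. The sketch below illustrates that idea under some assumptions. `align_to_words` and `toy_tokenize` are hypothetical helpers. `token_labels` is assumed to hold one label per sub-token with the `[CLS]` and `[SEP]` positions already removed. Tokenizing word by word is assumed to give the same pieces as tokenizing the whole sentence. With the real model, `tokenizer.tokenize` would take the place of `toy_tokenize`.

```python
import re

def toy_tokenize(word):
    """Stand-in for a sub-word tokenizer: splits off punctuation such as '-'."""
    return re.findall(r"\w+|[^\w\s]", word)

def align_to_words(words, tokenize, token_labels):
    """Return one label per word: the label predicted for its first sub-token."""
    word_labels = []
    position = 0  # index of the current word's first sub-token
    for word in words:
        pieces = tokenize(word)
        if position >= len(token_labels):
            break  # the input was truncated, so no prediction exists for this word
        word_labels.append(token_labels[position])
        position += len(pieces)
    return word_labels

words = ["IL-2", "gene", "expression"]
# One made-up label per sub-token: IL, -, 2, gene, expression
token_labels = ["B-DNA", "B-DNA", "I-DNA", "I-DNA", "O"]
for word, label in zip(words, align_to_words(words, toy_tokenize, token_labels)):
    print(word, label)
```

Taking the first sub-token is a design choice. Other options, such as a majority vote over a word's sub-tokens, are possible as well.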