# **Introduction**

This notebook fine-tunes ELECTRA and BERT on named entity recognition (NER) and relation extraction (RE) tasks.
If you plan to use a GPU instead of a TPU, Google Colab Pro is recommended because it gives you access to a V100 GPU.

To activate the TPU, go to Runtime -> Change runtime type and choose TPU with high memory. It is also better to create your Google Cloud Storage bucket in the Iowa region to avoid additional cost.


If you plan to use a GPU, you need to install TensorFlow-GPU. In that case, upload the pretrained weights to this notebook's local storage, create a new folder named "finetunning", and copy all datasets there.

We added sliders to each form so you can try different hyperparameters on the dev dataset. Use "do_eval":true,"do_test":false for that, because hyperparameters must not be tuned on the test dataset.

For the NER task, the ELECTRA implementation uses entity-level evaluation. You can confirm that by looking at "finetune/tagging/tagging_metrics.py", line 99:
class EntityLevelF1Scorer(F1Scorer):
  "Computes F1 score for entity-level tasks such as NER."
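
To illustrate what entity-level scoring means, here is a minimal Python sketch. It is not the code from tagging_metrics.py: the helper names `extract_entities` and `entity_f1` are hypothetical, and it assumes BIO tags in which a stray I- tag opens a new entity. A predicted entity counts as correct only when both its span and its type match a gold entity exactly.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes a span that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.add((start, i - 1, ent_type))
            start, ent_type = None, None
        # Assumption: an I- tag with no open span starts a new entity.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: only exact span-and-type matches count as correct."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-disease", "I-disease", "O", "B-disease"]
pred = ["B-disease", "O", "O", "B-disease"]
print(entity_f1(gold, pred))  # 0.5
```

In this example the prediction gets three of the four tokens right, yet only one of the two gold entities is matched exactly, so the entity-level F1 is 0.5.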


# **Environmental Setup**

In [None]:
!git clone https://github.com/salrowili/BioM-Transformers

Cloning into 'BioM-Transformers'...
remote: Enumerating objects: 51, done.
remote: Counting objects: 100% (51/51), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 51 (delta 6), reused 44 (delta 4), pack-reused 0
Unpacking objects: 100% (51/51), done.


Establish the connection with the TPU. All credit goes to the ALBERT team for this code, which can be found at https://github.com/google-research/albert/blob/master/albert_glue_fine_tuning_tutorial.ipynb

In [None]:
%tensorflow_version 1.x
import os
import pprint
import json
import tensorflow as tf

assert "COLAB_TPU_ADDR" in os.environ, "ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!"
TPU_ADDRESS = "grpc://" + os.environ["COLAB_TPU_ADDR"]
os.environ['TPU_ADDRESS'] = TPU_ADDRESS  # exported so the fine-tuning shell commands can read $TPU_ADDRESS
TPU_TOPOLOGY = "2x2"
print("TPU address is", TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.
os.environ['TPU_Name'] = TPU_ADDRESS
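
The cell above stops with an assertion error when no TPU runtime is attached. If you would rather detect that case and branch to a GPU setup, a small helper along the following lines may help. This is only a sketch: `resolve_tpu_address` is a hypothetical name, and the address used in the example is made up.

```python
import os


def resolve_tpu_address(env=None):
    """Return the gRPC address of the Colab TPU, or None if no TPU is attached."""
    env = os.environ if env is None else env
    addr = env.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None


# Example with an explicit mapping instead of the real environment.
print(resolve_tpu_address({"COLAB_TPU_ADDR": "10.0.0.2:8470"}))
print(resolve_tpu_address({}))
```

The first call prints the address with the `grpc://` prefix, and the second prints `None`, which a caller can use to decide whether to pass `"use_tpu":true` later on.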

# **Pre-Processing NER and ChemProt Dataset**

We need to download the pre-processing script from the BLURB site at https://microsoft.github.io/BLURB/submit.html (direct link: https://microsoft.github.io/BLURB/sample_code/data_generation.tar.gz). We also need to download the Punkt sentence tokenizer.

In [None]:
!wget https://microsoft.github.io/BLURB/sample_code/data_generation.tar.gz
!tar -xvf data_generation.tar.gz
import nltk
nltk.download('punkt')

--2021-06-16 17:32:41--  https://microsoft.github.io/BLURB/sample_code/data_generation.tar.gz
Resolving microsoft.github.io (microsoft.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to microsoft.github.io (microsoft.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54142 (53K) [application/gzip]
Saving to: ‘data_generation.tar.gz’


2021-06-16 17:32:41 (5.34 MB/s) - ‘data_generation.tar.gz’ saved [54142/54142]

data_generation/
data_generation/download_BLURB_data.sh
data_generation/indexing/
data_generation/indexing/BioASQ/
data_generation/indexing/BioASQ/dev_id.tsv
data_generation/indexing/BioASQ/test_id.tsv
data_generation/indexing/BioASQ/train_id.tsv
data_generation/indexing/DDI/
data_generation/indexing/DDI/dev_files.tsv
data_generation/indexing/DDI/test_files.tsv
data_generation/indexing/DDI/train_files.tsv
data_generation/indexing/ebmnlp/
data_generation/indexing/ebmnlp/dev_pmid.tsv
data_generatio

True

In [None]:
%%shell
pip3 install -r /content/data_generation/requirements.txt
cd data_generation
sh download_BLURB_data.sh
wget https://biocreative.bioinformatics.udel.edu/media/store/files/2017/ChemProt_Corpus.zip
mv ChemProt_Corpus.zip raw_data
sh preprocess_BLURB_data.sh

Collecting nltk==3.5
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
     |████████████████████████████████| 1.4MB 6.7MB/s 
Collecting lxml==4.5.2
  Downloading https://files.pythonhosted.org/packages/de/3c/fa420469c0d4f62ae39f19ee6505f90d00ae469f6264f4f54e61ed9d9a2c/lxml-4.5.2-cp37-cp37m-manylinux1_x86_64.whl (5.5MB)
     |████████████████████████████████| 5.5MB 18.6MB/s 
Collecting defusedxml==0.6.0
  Downloading https://files.pythonhosted.org/packages/06/74/9b387472866358ebc08732de3da6dc48e44b0aacd2ddaa5cb85ab7e986a2/defusedxml-0.6.0-py2.py3-none-any.whl
Collecting python-docx==0.8.10
  Downloading https://files.pythonhosted.org/packages/e4/83/c66a1934ed5ed8ab1dbb9931f1779079f8bca0f6bbc5793c06c4b5e7d671/python-docx-0.8.10.tar.gz (5.5MB)
     |████████████████████████████████| 5.5MB 53.8MB/s 
Collecting unrar==0.4
  Downloading https://files.pytho



NER labels must be in the B-xx or I-xx format, but the pre-processed files contain bare B and I labels, so we append the entity type to each label.

In [None]:
!sed -i "s/\tB/\tB-disease/g" /content/data_generation/data/NCBI-disease/*.tsv
!sed -i "s/\tI/\tI-disease/g" /content/data_generation/data/NCBI-disease/*.tsv
!sed -i "s/\tB/\tB-chem/g" /content/data_generation/data/BC5CDR-chem/*.tsv
!sed -i "s/\tI/\tI-chem/g" /content/data_generation/data/BC5CDR-chem/*.tsv
!sed -i "s/\tB/\tB-disease/g" /content/data_generation/data/BC5CDR-disease/*.tsv
!sed -i "s/\tI/\tI-disease/g" /content/data_generation/data/BC5CDR-disease/*.tsv
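
Note that running the sed commands above a second time would append the suffix again (for example B-disease-disease). As an alternative, here is a Python sketch of the same relabeling that is safe to repeat. It assumes each line has the form token, tab, label with labels B, I, or O, and `relabel_line` is a hypothetical helper name.

```python
def relabel_line(line, entity_type):
    """Append the entity type to a bare B or I label in a 'token<TAB>label' line."""
    if "\t" not in line:
        return line  # blank sentence separators pass through unchanged
    token, label = line.rsplit("\t", 1)
    if label in ("B", "I"):
        label = label + "-" + entity_type
    return token + "\t" + label


# Lines are expected without their trailing newline (e.g. from str.splitlines()).
sample = ["breast\tB", "cancer\tI", "is\tO", ""]
relabeled = [relabel_line(line, "disease") for line in sample]
print(relabeled)
```

Because only the exact labels B and I are rewritten, applying the function to already relabeled lines leaves them unchanged.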

Here we need to log in to the Google account that owns the Google Cloud Storage bucket.

In [None]:
!gcloud auth login

Copy all pre-processed datasets to our bucket. Please note that the ELECTRA code looks for the finetuning_data folder to locate the dataset for each task.

In [None]:
!gsutil -m cp -r /content/data_generation/data/* gs://biom-transformers/ELECTRA/finetuning_data/
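
To make the bucket layout concrete, the following Python sketch derives the locations that appear in this notebook: the per-task data folder under finetuning_data, and the model_dir printed in the fine-tuning logs further down. The function name `finetuning_paths` is hypothetical, and the layout is inferred from the commands and log output in this notebook rather than taken from the ELECTRA source.

```python
def finetuning_paths(data_dir, model, scale, task, max_seq_length):
    """Derive the bucket paths used for one fine-tuning run."""
    data_dir = data_dir.rstrip("/")
    # Same naming scheme as the model_name built in the fine-tuning cells.
    model_name = "-".join([model, scale, task, str(max_seq_length)])
    return {
        "model_name": model_name,
        "task_data": f"{data_dir}/finetuning_data/{task}",
        "model_dir": f"{data_dir}/models/{model_name}/finetuning_models/{task}_model",
    }


paths = finetuning_paths("gs://biom-transformers/ELECTRA/", "electra", "large", "NCBI-disease", "256")
for key, value in paths.items():
    print(key, value)
```

Replace the bucket name with your own when adapting the paths.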

Copy the TensorFlow checkpoint together with the vocab.txt file. You will find the checkpoint files on our GitHub page. For example, the BioM-ELECTRA-Large model is available at https://drive.google.com/file/d/1-60kzBf7X8Y5XiZPdNIQHql82zpOYEnE/view

In [None]:
!gsutil cp -r /content/BioM-ELECTRA-Large-TF gs://biom-transformers/ELECTRA/models/

# **Fine-Tuning**

If you want to search for the best hyperparameters:

*   Check both train and dev and uncheck the test option.
*   Find the best hyperparameters using the dev dataset.
*   Use those hyperparameters to evaluate the model on the test dataset by checking both train and test checkboxes.

We use default hyperparameters in our paper to help you replicate the results.
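
The search procedure above comes down to which of the do_train, do_eval, and do_test flags are set. The sketch below builds the corresponding --hparams JSON in Python instead of by hand-quoting it in the shell. The function name `build_hparams` and the phase names "search" and "final" are hypothetical; the JSON keys are the ones used by the fine-tuning commands in this notebook.

```python
import json
import shlex


def build_hparams(phase, task, learning_rate, batch_size, epochs, trials=1):
    """Build the --hparams JSON for a dev-set search run or a final test run."""
    if phase not in ("search", "final"):
        raise ValueError("phase must be 'search' or 'final'")
    hparams = {
        "task_names": [task],
        "num_trials": trials,
        "do_train": True,
        "do_eval": phase == "search",  # evaluate on dev while searching
        "do_test": phase == "final",   # touch the test set only once
        "learning_rate": learning_rate,
        "train_batch_size": batch_size,
        "num_train_epochs": epochs,
    }
    return json.dumps(hparams)


search_json = build_hparams("search", "NCBI-disease", 7e-05, 32, 5)
print("--hparams " + shlex.quote(search_json))
```

`json.dumps` writes the booleans as lowercase true and false, and `shlex.quote` wraps the string so it survives the shell unchanged. The remaining keys (checkpoint, vocabulary, TPU settings) would be added to the dictionary in the same way.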




In [None]:
train = True #@param {type:"boolean"}
dev = False #@param {type:"boolean"}
test = True #@param {type:"boolean"}
model= "electra" #@param ["electra", "bert"]
scale= "large" #@param ["base", "large"]
task= "NCBI-disease" #@param ["BC5CDR-disease", "BC5CDR-chem", "NCBI-disease","chemprot","bioasq7b-squadv2"]
model_path = 'gs://biom-transformers/ELECTRA/models/BioM-ELECTRA-Large-TF/model.ckpt-434200' #@param {type:"string"}
vocab_path = 'gs://biom-transformers/ELECTRA/models/BioM-ELECTRA-Large-TF/vocab.txt' #@param {type:"string"}
max_seq_length= "256" #@param ["128", "256", "384","512"]
trials= "5" #@param ["1", "2", "3","4","5","10"]
batch_size = 32 #@param {type:"slider", min:8, max:128, step:8}
learning_rate = 0.00007 #@param {type:"slider", min:0.00001, max:0.0009, step:0.00001}
epochs_num = 5 #@param {type:"slider", min:1, max:50, step:1}
os.environ['max_seq_length'] = str(max_seq_length)
os.environ['batch_size'] = str(batch_size)
os.environ['learning_rate'] = str(learning_rate)
os.environ['epochs_num'] = str(epochs_num)
os.environ['task'] = str(task)
os.environ['model_path'] = str(model_path)
os.environ['vocab_path'] = str(vocab_path)
os.environ['scale'] = str(scale)
os.environ['trials'] = str(trials)
os.environ['train'] = str.lower(str(train))
os.environ['dev'] = str.lower(str(dev))
os.environ['test'] = str.lower(str(test))
os.environ['model_name'] = str(model+"-"+scale+"-"+task+"-"+max_seq_length)
!python3 BioM-Transformers/electra/run_finetuning.py --data-dir gs://biom-transformers/ELECTRA/ \
--model-name $model_name --hparams '{"num_trials":'"$trials"',"model_size":"'$scale'", "task_names": ["'$task'"],"do_train":'"$train"',"do_eval":'"$dev"',"do_test":'"$test"' ,"init_checkpoint":"'$model_path'","vocab_file":"'$vocab_path'","vocab_size":28895,"max_seq_length":'"$max_seq_length"',"learning_rate":'"$learning_rate"',"train_batch_size":'"$batch_size"',"num_train_epochs":'"$epochs_num"',"use_tpu":true,"tpu_name":"'$TPU_ADDRESS'","num_tpu_cores":8}'

Config: model=electra-large-NCBI-disease-256, trial 1/5
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir gs://biom-transformers/ELECTRA/
debug False
do_eval False
do_lower_case True
do_test True
do_train True
doc_stride 128
double_unordered True
embedding_size None
eval_batch_size 32
evaluate_bioasq False
gcp_project None
init_checkpoint gs://biom-transformers/ELECTRA/models/BioM-ELECTRA-Large-TF/model.ckpt-434200
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.9
learning_rate 7e-05
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 256
model_dir gs://biom-transformers/ELECTRA/models/electra-large-NCBI-disease-256/finetuning_models/NCBI-disease_model
model_hparam_overrides {}
model_name electra-large-NCBI-disease-256
model_size large
n_best_size 20
n_writes_test 5
num_tpu_cores 8
num_train_epochs 5
num_trials 5
predict_batch_size 32
preprocessed_data_dir gs://biom-tr

In [None]:
train = True #@param {type:"boolean"}
dev = False #@param {type:"boolean"}
test = True #@param {type:"boolean"}
model= "electra" #@param ["electra", "bert"]
scale= "large" #@param ["base", "large"]
task= "chemprot" #@param ["BC5CDR-disease", "BC5CDR-chem", "NCBI-disease","chemprot","bioasq7b-squadv2"]
model_path = 'gs://biom-transformers/ELECTRA/models/BioM-ELECTRA-Large-TF/model.ckpt-434200' #@param {type:"string"}
vocab_path = 'gs://biom-transformers/ELECTRA/models/BioM-ELECTRA-Large-TF/vocab.txt' #@param {type:"string"}
max_seq_length= "256" #@param ["128", "256", "384","512"]
trials= "5" #@param ["1", "2", "3","4","5","10"]
batch_size = 32 #@param {type:"slider", min:8, max:128, step:8}
learning_rate = 0.00007 #@param {type:"slider", min:0.00001, max:0.0009, step:0.00001}
epochs_num = 4 #@param {type:"slider", min:1, max:50, step:1}
os.environ['max_seq_length'] = str(max_seq_length)
os.environ['batch_size'] = str(batch_size)
os.environ['learning_rate'] = str(learning_rate)
os.environ['epochs_num'] = str(epochs_num)
os.environ['task'] = str(task)
os.environ['model_path'] = str(model_path)
os.environ['vocab_path'] = str(vocab_path)
os.environ['scale'] = str(scale)
os.environ['trials'] = str(trials)
os.environ['train'] = str.lower(str(train))
os.environ['dev'] = str.lower(str(dev))
os.environ['test'] = str.lower(str(test))
os.environ['model_name'] = str(model+"-"+scale+"-"+task+"-"+max_seq_length)
!python3 BioM-Transformers/electra/run_finetuning.py --data-dir gs://biom-transformers/ELECTRA/ \
--model-name $model_name --hparams '{"num_trials":'"$trials"',"model_size":"'$scale'", "task_names": ["'$task'"],"do_train":'"$train"',"do_eval":'"$dev"',"do_test":'"$test"' ,"init_checkpoint":"'$model_path'","vocab_file":"'$vocab_path'","vocab_size":28895,"max_seq_length":'"$max_seq_length"',"learning_rate":'"$learning_rate"',"train_batch_size":'"$batch_size"',"num_train_epochs":'"$epochs_num"',"use_tpu":true,"tpu_name":"'$TPU_ADDRESS'","num_tpu_cores":8}'

Config: model=electra-large-chemprot-256, trial 1/5
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir gs://biom-transformers/ELECTRA/
debug False
do_eval False
do_lower_case True
do_test True
do_train True
doc_stride 128
double_unordered True
embedding_size None
eval_batch_size 32
evaluate_bioasq False
gcp_project None
init_checkpoint gs://biom-transformers/ELECTRA/models/BioM-ELECTRA-Large-TF/model.ckpt-434200
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.9
learning_rate 7e-05
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 256
model_dir gs://biom-transformers/ELECTRA/models/electra-large-chemprot-256/finetuning_models/chemprot_model
model_hparam_overrides {}
model_name electra-large-chemprot-256
model_size large
n_best_size 20
n_writes_test 5
num_tpu_cores 8
num_train_epochs 4
num_trials 5
predict_batch_size 32
preprocessed_data_dir gs://biom-transformers/ELECT

In [None]:
train = True #@param {type:"boolean"}
dev = False #@param {type:"boolean"}
test = True #@param {type:"boolean"}
model= "bert" #@param ["electra", "bert"]
scale= "large" #@param ["base", "large"]
task= "chemprot" #@param ["BC5CDR-disease", "BC5CDR-chem", "NCBI-disease","chemprot","bioasq7b-squadv2"]
model_path = 'gs://biom-transformers/ELECTRA/models/BioM-BERT-Large-TF/model.ckpt-690000' #@param {type:"string"}
vocab_path = 'gs://biom-transformers/ELECTRA/models/BioM-BERT-Large-TF/vocab.txt' #@param {type:"string"}
max_seq_length= "256" #@param ["128", "256", "384","512"]
trials= "5" #@param ["1", "2", "3","4","5","10"]
batch_size = 32 #@param {type:"slider", min:8, max:128, step:8}
learning_rate = 0.00007 #@param {type:"slider", min:0.00001, max:0.0009, step:0.00001}
epochs_num = 4 #@param {type:"slider", min:1, max:50, step:1}
os.environ['max_seq_length'] = str(max_seq_length)
os.environ['batch_size'] = str(batch_size)
os.environ['learning_rate'] = str(learning_rate)
os.environ['epochs_num'] = str(epochs_num)
os.environ['task'] = str(task)
os.environ['model_path'] = str(model_path)
os.environ['vocab_path'] = str(vocab_path)
os.environ['scale'] = str(scale)
os.environ['trials'] = str(trials)
os.environ['train'] = str.lower(str(train))
os.environ['dev'] = str.lower(str(dev))
os.environ['test'] = str.lower(str(test))
os.environ['model_name'] = str(model+"-"+scale+"-"+task+"-"+max_seq_length)
!python3 BioM-Transformers/electra/run_finetuning.py --data-dir gs://biom-transformers/ELECTRA/ \
--model-name $model_name --hparams '{"num_trials":'"$trials"',"model_size":"'$scale'", "task_names": ["'$task'"],"do_train":'"$train"',"do_eval":'"$dev"',"do_test":'"$test"' ,"init_checkpoint":"'$model_path'","vocab_file":"'$vocab_path'","vocab_size":30000,"max_seq_length":'"$max_seq_length"',"learning_rate":'"$learning_rate"',"train_batch_size":'"$batch_size"',"num_train_epochs":'"$epochs_num"',"use_tpu":true,"tpu_name":"'$TPU_ADDRESS'","num_tpu_cores":8}'

Config: model=bert-large-chemprot-256, trial 1/5
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir gs://biom-transformers/ELECTRA/
debug False
do_eval False
do_lower_case True
do_test True
do_train True
doc_stride 128
double_unordered True
embedding_size None
eval_batch_size 32
evaluate_bioasq False
gcp_project None
init_checkpoint gs://biom-transformers/ELECTRA/models/BioM-BERT-Large-TF/model.ckpt-690000
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.9
learning_rate 7e-05
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 256
model_dir gs://biom-transformers/ELECTRA/models/bert-large-chemprot-256/finetuning_models/chemprot_model
model_hparam_overrides {}
model_name bert-large-chemprot-256
model_size large
n_best_size 20
n_writes_test 5
num_tpu_cores 8
num_train_epochs 4
num_trials 5
predict_batch_size 32
preprocessed_data_dir gs://biom-transformers/ELECTRA/models/be