This tutorial shows how to fine-tune BERT-like models (e.g., ELECTRA, ALBERT, BioBERT, BioM-Transformers) on a TPU using PyTorch code.


A Tensor Processing Unit (TPU) is a specialized accelerator designed by Google to speed up the matrix multiplications that most AI workloads rely on during pre-training and fine-tuning. See this page for a better idea of TPUv3-8 performance compared to GPUs:

 https://wccftech.com/nvidia-ampere-a100-fastest-ai-gpu-up-to-4-times-faster-than-volta-v100/


TPUs were designed to work with native TensorFlow code. However, with the Torch XLA library (https://github.com/pytorch/xla), we can run PyTorch code on them as well, as shown in this example.

First, we need to install the Torch XLA library (version 1.11 here, together with the matching torch release):

In [None]:
!pip install cloud-tpu-client==0.10 torch==1.11.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch-xla==1.11
  Downloading https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl (152.9 MB)
Collecting cloud-tpu-client==0.10
  Downloading cloud_tpu_client-0.10-py3-none-any.whl (7.4 kB)
Collecting google-api-python-client==1.8.0
  Downloading google_api_python_client-1.8.0-py3-none-any.whl (57 kB)
Installing collected packages: google-api-python-client, torch-xla, cloud-tpu-client
  Attempting uninstall: google-api-python-client
    Found existing installation: google-api-python-client 1.12.11
    Uninstalling google-api-python-client-1.12.11:
      Successfully uninstalled google-api-python-client-1.12.11
ERROR: pip's dependency resolver does not currently take into account all the packages that are

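As a quick sanity check after the install, the sketch below compares the installed torch and torch-xla versions. It assumes that the two packages should share the same major.minor release (1.11 in the command above). The helper names `installed_version` and `same_release` are hypothetical and not part of either library.

```python
from typing import Optional

try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7, as used by the cp37 wheels above
    import importlib_metadata as metadata  # backport package, assumed to be installed


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_release(version_a: str, version_b: str) -> bool:
    """True when both versions share the same major.minor release (e.g. 1.11)."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


torch_version = installed_version("torch")
xla_version = installed_version("torch-xla")  # distribution name as shown in the pip log

if torch_version is None or xla_version is None:
    print("torch or torch-xla is not installed in this environment")
elif same_release(torch_version, xla_version):
    print(f"OK: torch {torch_version} matches torch-xla {xla_version}")
else:
    print(f"Mismatch: torch {torch_version} vs torch-xla {xla_version}")
```

If the check reports a mismatch, reinstalling both packages with the pinned versions from the command above is usually the simplest fix.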
Next, we install the Transformers library, which provides models and example scripts for NLP tasks (e.g., question answering, text classification, and Seq2Seq models). We also clone the repository to get its example scripts.

In [None]:
!pip3 install git+https://github.com/huggingface/transformers
!git clone https://github.com/huggingface/transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-b8e3rrhu
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-b8e3rrhu
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.man

In [None]:
!pip3 install -r /content/transformers/examples/pytorch/question-answering/requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.10.0-py3-none-any.whl (117 kB)
Collecting datasets>=1.8.0
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

In [None]:
!python /content/transformers/examples/pytorch/question-answering/run_qa.py --help

usage: run_qa.py [-h] --model_name_or_path MODEL_NAME_OR_PATH
                 [--config_name CONFIG_NAME] [--tokenizer_name TOKENIZER_NAME]
                 [--cache_dir CACHE_DIR] [--model_revision MODEL_REVISION]
                 [--use_auth_token [USE_AUTH_TOKEN]]
                 [--dataset_name DATASET_NAME]
                 [--dataset_config_name DATASET_CONFIG_NAME]
                 [--train_file TRAIN_FILE] [--validation_file VALIDATION_FILE]
                 [--test_file TEST_FILE] [--overwrite_cache [OVERWRITE_CACHE]]
                 [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS]
                 [--max_seq_length MAX_SEQ_LENGTH]
                 [--pad_to_max_length [PAD_TO_MAX_LENGTH]]
                 [--no_pad_to_max_length]
                 [--max_train_samples MAX_TRAIN_SAMPLES]
                 [--max_eval_samples MAX_EVAL_SAMPLES]
                 [--max_predict_samples MAX_PREDICT_SAMPLES]
                 [--version_2_with_negative [VERSION_2_WITH_NEGATIV

We need to use the xla_spawn script to launch our code on all 8 cores of the TPU. If you do not use the xla_spawn script, run_qa.py will use only one TPU core. Training is slow at the beginning because Torch XLA has to compile the XLA graphs first (this can be much slower when tensor shapes are dynamic).
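To illustrate why dynamic shapes slow things down, here is a minimal pure-Python sketch. It assumes that XLA compiles a separate program for each distinct input shape it encounters, so fewer distinct batch shapes means fewer compilations. The function `batch_shapes` and the example sequence lengths are hypothetical and only count shapes; they do not call Torch XLA.

```python
from typing import List, Set, Tuple


def batch_shapes(lengths: List[int], batch_size: int, pad_to: int = 0) -> Set[Tuple[int, int]]:
    """Collect the distinct (batch, sequence) shapes produced by a list of sequence lengths.

    With pad_to=0, each batch is padded to its longest sequence (dynamic padding).
    With pad_to>0, every batch is padded to that fixed length.
    """
    shapes = set()
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = pad_to if pad_to else max(batch)
        shapes.add((len(batch), width))
    return shapes


# Hypothetical tokenized sequence lengths, capped at --max_seq_length 384.
lengths = [57, 120, 384, 201, 96, 310, 45, 384]

dynamic = batch_shapes(lengths, batch_size=2)
fixed = batch_shapes(lengths, batch_size=2, pad_to=384)

print(len(dynamic), "distinct shapes with dynamic padding")
print(len(fixed), "distinct shape with fixed padding")
```

Padding every example to the same length, which is what the --pad_to_max_length flag in the help output above refers to, keeps the shape constant. A final batch that is smaller than the others would still introduce one more shape.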

If you get this message:

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.


This means your batch size is larger than the TPU memory can handle, so you need to reduce the value of:

--per_device_train_batch_size


The --per_device_train_batch_size value applies to a single core, so when we use xla_spawn the effective batch size is this number multiplied by 8. In our example, we fine-tune the model with a batch size of 32 (--per_device_train_batch_size 4).
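The arithmetic can be sketched as follows. The helper names `effective_batch_size` and `per_device_for` are hypothetical. The gradient accumulation factor is included as an assumption about how the overall batch size is usually computed; it is 1 in this tutorial's run (see gradient_accumulation_steps=1 in the training log).

```python
def effective_batch_size(per_device: int, num_cores: int = 8, grad_accum_steps: int = 1) -> int:
    """Number of examples contributing to one optimizer step across all TPU cores."""
    return per_device * num_cores * grad_accum_steps


def per_device_for(target_batch_size: int, num_cores: int = 8) -> int:
    """Per-core batch size needed to reach a target overall batch size."""
    if target_batch_size % num_cores != 0:
        raise ValueError("target batch size must be divisible by the number of cores")
    return target_batch_size // num_cores


print(effective_batch_size(4))  # 4 per core x 8 cores = 32
print(per_device_for(32))       # 32 overall / 8 cores = 4
```

If an out-of-memory error forces a smaller per-core batch size, raising the accumulation factor is one way to keep the overall batch size unchanged, for example `effective_batch_size(2, grad_accum_steps=2)` also gives 32.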

If the Google Colab instance runs out of RAM, reduce --preprocessing_num_workers from 4 to a smaller value.

In [None]:
!python /content/transformers/examples/pytorch/xla_spawn.py --num_cores=8 /content/transformers/examples/pytorch/question-answering/run_qa.py --model_name_or_path sultan/BioM-ELECTRA-Large-Discriminator \
--dataset_name squad_v2 \
--do_train \
--do_eval \
--dataloader_num_workers 4 \
--preprocessing_num_workers 4 \
--version_2_with_negative \
--num_train_epochs 2 \
--learning_rate 5e-5 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--logging_steps 500 \
--save_steps 1000 \
--overwrite_output_dir \
--output_dir out

06/17/2022 16:44:23 - INFO - run_qa - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=

In this example we use the flag --dataset_name squad_v2 because SQuAD v2 is already available in the Hugging Face datasets library: https://huggingface.co/datasets/squad_v2

However, if your dataset (e.g., BioASQ) is not part of the Hugging Face datasets library, use these flags instead:


--train_file 

--validation_file 

--test_file 


These flags should point to the dataset files you uploaded to Google Colab. All datasets should be in SQuAD format.
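One practical detail when preparing such files: the original SQuAD JSON is nested (articles, then paragraphs, then questions), while the squad_v2 dataset on the Hugging Face Hub stores one flat record per question with the columns id, title, context, question, and answers. The sketch below converts the nested layout into flat records under a top-level "data" key. It assumes that your version of run_qa.py reads local JSON files from a "data" field and expects those column names; please check this against the script you cloned. The function name `flatten_squad` and the example record are hypothetical.

```python
import json
from typing import Any, Dict, List


def flatten_squad(nested: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Convert nested SQuAD-style JSON into one flat record per question."""
    records = []
    for article in nested["data"]:
        title = article.get("title", "")
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable questions (SQuAD v2) simply have no answers.
                answers = qa.get("answers", [])
                records.append({
                    "id": qa["id"],
                    "title": title,
                    "context": context,
                    "question": qa["question"],
                    "answers": {
                        "text": [a["text"] for a in answers],
                        "answer_start": [a["answer_start"] for a in answers],
                    },
                })
    return {"data": records}


# Hypothetical example in the nested SQuAD v2 layout.
context = "Aspirin inhibits the enzyme cyclooxygenase."
answer = "cyclooxygenase"
nested_example = {
    "version": "v2.0",
    "data": [{
        "title": "Aspirin",
        "paragraphs": [{
            "context": context,
            "qas": [
                {
                    "id": "q1",
                    "question": "Which enzyme does aspirin inhibit?",
                    "is_impossible": False,
                    # answer_start is the character offset of the answer in the context
                    "answers": [{"text": answer, "answer_start": context.index(answer)}],
                },
                {
                    "id": "q2",
                    "question": "When was aspirin first synthesized?",
                    "is_impossible": True,
                    "answers": [],
                },
            ],
        }],
    }],
}

flat = flatten_squad(nested_example)
print(json.dumps(flat["data"][0], indent=2))
```

The resulting dictionary can be written out with `json.dump` and passed to --train_file or --validation_file.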