# __How to Fine-Tune BERT Transformer with spaCy 3__

- [Article](https://towardsdatascience.com/how-to-fine-tune-bert-transformer-with-spacy-3-6a90bfe57647)
- [GitHub](https://github.com/UBIAI/Fine_tune_BERT_with_spacy3)
- [Follow-up on NER + RE](https://towardsdatascience.com/how-to-train-a-joint-entities-and-relation-extraction-classifier-using-bert-transformer-with-spacy-49eb08d91b5c)

Named entity recognition (NER)
- Identifies entities in a text so the data can be stored for advanced querying and filtering
- [Code for fine-tuning BERT for NER](https://github.com/UBIAI/Fine_tune_BERT_with_spacy3)
- NER alone is not enough, since it does not tell us how the entities are related to each other
  - The second part (the follow-up article) covers joint NER and relation extraction
  - Relations open up a whole new way of retrieving information through knowledge graphs
  - where you can navigate across different nodes to discover hidden relationships

## ___Setup___

### Note

The original implementation is based on older versions of CUDA, torch, transformers, etc. I used more recent versions and they worked fine.

### Cuda

CUDA Toolkit 11.7
- [download configuration](https://developer.nvidia.com/cuda-11-7-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_network)
- Linux, x86_64, WSL-Ubuntu, 2.0, runfile(local)

```bash
wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
sudo sh cuda_11.7.1_515.65.01_linux.run
```

Modify `.bashrc` so that:
 - `PATH` includes `/usr/local/cuda/bin`
 - `LD_LIBRARY_PATH` includes `/usr/local/cuda/lib64`
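
One way to confirm that the `.bashrc` changes took effect is to inspect the environment from Python after opening a new shell. This is only a minimal sketch: the helper name `missing_cuda_entries` is hypothetical, and `/usr/local/cuda` is assumed to be the install prefix used above.

```python
import os


def missing_cuda_entries(env, cuda_home="/usr/local/cuda"):
    """Return the CUDA directories that are absent from PATH / LD_LIBRARY_PATH.

    `env` is any mapping of environment variables (e.g. os.environ).
    `cuda_home` is an assumed install prefix, not something detected here.
    """
    required = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = {}
    for variable, directory in required.items():
        # Linux/WSL uses ":" to separate entries; ignore trailing slashes.
        entries = [entry.rstrip("/") for entry in env.get(variable, "").split(":")]
        if directory not in entries:
            missing[variable] = directory
    return missing


# An empty dict means both variables contain the expected CUDA directories.
print(missing_cuda_entries(os.environ))
```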

### Environment

```bash
conda create -n torch_spacy python
conda activate torch_spacy
```

#### spaCy and transformer model

Follow the [spaCy install quickstart](https://spacy.io/usage#quickstart)

In [1]:
%pip install -U pip setuptools wheel
%pip install -U 'spacy[cuda-autodetect,transformers,lookups]'
!python -m spacy download en_core_web_trf

Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-trf==3.6.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.6.1/en_core_web_trf-3.6.1-py3-none-any.whl (460.3 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


#### Pytorch

Follow [this](https://pytorch.org/get-started/locally/)
- For cuda 11.7, ubuntu

In [2]:
%pip install torch torchvision torchaudio

Note: you may need to restart the kernel to use updated packages.


#### cupy

CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. 
- [Install](https://docs.cupy.dev/en/stable/install.html#install-cupy-with-cuda)
- CuPy did not work for me unless the `CUDA_PATH` environment variable was set.

In [3]:
!python -m pip install -U setuptools pip
# `!export` only affects a throwaway subshell; `%env` sets the variable for the kernel itself
%env CUDA_PATH=/usr/local/cuda
%pip install cupy-cuda11x

Note: you may need to restart the kernel to use updated packages.


### Import packages

In [9]:
import os, spacy
from pathlib import Path
from shutil import copy
from torch import cuda

### Set up working directories

In [5]:
work_dir = Path.home() / "proj_local/joint_ner_re"
work_dir.mkdir(parents=True, exist_ok=True)

# For NER
data_dir = work_dir / "data"

os.chdir(work_dir)
os.getcwd()

'/home/shius/proj_local/joint_ner_re'

## ___Data___

### Data labeling and download

Used [UBIAI](https://ubiai.tools/) for annotation
- ___BUT THIS TOOL COSTS___
- [Tutorial](https://chatbotslife.com/introducing-ubiai-easy-to-use-text-annotation-for-nlp-applications-74a2401fa725)
- Using the regular expression feature in UBIAI, I pre-annotated all the experience mentions that follow the pattern `\d.*\+.*`, such as “5 + years of experience in C++”.
- Uploaded a CSV dictionary containing all the software languages and assigned it to the entity SKILLS.
- Pre-annotation saves a lot of time and helps minimize manual annotation.
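
The two pre-annotation steps above can be approximated outside UBIAI. The sketch below is not UBIAI's implementation: it simply applies the same regular expression with Python's `re` module and looks tokens up in a small dictionary. `SKILL_DICTIONARY` and `pre_annotate` are hypothetical names, and the dictionary entries are made up for illustration.

```python
import re

# Same pattern as in the notes: a digit, anything, a literal "+", anything.
EXPERIENCE_PATTERN = re.compile(r"\d.*\+.*")

# Hypothetical stand-in for the uploaded CSV dictionary of software languages.
SKILL_DICTIONARY = {"c++", "python", "java"}


def pre_annotate(line):
    """Return (text, label) suggestions for one line of a job description."""
    annotations = []
    match = EXPERIENCE_PATTERN.search(line)
    if match:
        # The pattern is greedy, so it covers everything from the first
        # digit to the end of the line once a "+" is present.
        annotations.append((match.group(0), "EXPERIENCE"))
    for token in line.split():
        if token.lower().strip(".,;") in SKILL_DICTIONARY:
            annotations.append((token, "SKILLS"))
    return annotations


print(pre_annotate("5 + years of experience in C++"))
```

Because the pattern is greedy, the suggested EXPERIENCE span usually needs trimming by hand, which is consistent with treating these as pre-annotations rather than final labels.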

Data files are in [this repo](https://github.com/UBIAI/Fine_tune_BERT_with_spacy3)
- In [IOB format](https://www.geeksforgeeks.org/nlp-iob-tags/)
  - The tags denote the (I)nside, (O)utside, and (B)eginning of a chunk

E.g., an excerpt from the TSV in the repo:
```
will    O
have    O
:       O
5       B-EXPERIENCE
+       I-EXPERIENCE
years   I-EXPERIENCE
of      O
industry        B-SKILLS
experience      O
developing      B-SKILLS
and     I-SKILLS
implementing    I-SKILLS
tools   I-SKILLS
and     O
applications    O
```
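
To make the tagging scheme concrete, the following sketch groups IOB-tagged tokens back into entity spans. It is an illustration written for these notes, not part of spaCy or UBIAI; the function name `iob_to_entities` is hypothetical, and the rows are copied from the excerpt above.

```python
def iob_to_entities(rows):
    """Group (token, tag) pairs into (entity_text, label) tuples."""
    entities = []
    current_tokens, current_label = [], None

    def close_entity():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in rows:
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_entity()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and tag[2:] == current_label:
            # An I- tag continues the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends any open entity.
            close_entity()
            current_tokens, current_label = [], None
    close_entity()
    return entities


rows = [
    ("will", "O"), ("have", "O"), (":", "O"),
    ("5", "B-EXPERIENCE"), ("+", "I-EXPERIENCE"), ("years", "I-EXPERIENCE"),
    ("of", "O"), ("industry", "B-SKILLS"), ("experience", "O"),
    ("developing", "B-SKILLS"), ("and", "I-SKILLS"),
    ("implementing", "I-SKILLS"), ("tools", "I-SKILLS"),
    ("and", "O"), ("applications", "O"),
]
print(iob_to_entities(rows))
```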

In [5]:
!git clone https://github.com/UBIAI/Fine_tune_BERT_with_spacy3.git

Cloning into 'Fine_tune_BERT_with_spacy3'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 19 (delta 3), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (19/19), 48.40 KiB | 707.00 KiB/s, done.


In [7]:
os.rename("Fine_tune_BERT_with_spacy3", "data")

### Format data

Provide the training and dev data in spaCy JSON format
- which is then converted to a `.spacy` binary file.
- [spaCy data format](https://spacy.io/api/data-formats#json-input)
  - According to the docs, the JSON training format is deprecated and replaced by the [binary format](https://spacy.io/api/data-formats#binary-training).


In [8]:
# Convert the IOB file exported from the UBIAI annotation tool to spaCy JSON
!python -m spacy convert data/train.tsv ./data -t json -n 1 -c iob
!python -m spacy convert data/test.tsv ./data -t json -n 1 -c iob

ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.
✔ Generated output file (1 documents): data/train.json
ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n`
disabled.
✔ Generated output file (1 documents): data/test.json


In [9]:
# Convert JSON to spaCy binary file
!python -m spacy convert data/train.json ./data -t spacy
!python -m spacy convert data/test.json ./data -t spacy

✔ Generated output file (77 documents): data/train.spacy
✔ Generated output file (11 documents): data/test.spacy


## ___Model training___

### Generate spacy config

Generate a base config with the [quickstart widget](https://spacy.io/usage/training#quickstart)
- Select `ner`, `GPU (transformer)`, `efficiency`
- Save as `base_config.cfg` and modify:
  - `train = "./data/train.spacy"`
  - `dev = "./data/test.spacy"`
  - `batch_size = 512` (default: 128)
  - `max_epochs = 600` (add to the `[training]` section; the default, 0, means unlimited)
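
The edits above can also be applied programmatically. The sketch below uses Python's standard `configparser` as a stand-in on a small, made-up config fragment; it is not spaCy's own config loader, and the helper name `apply_overrides` is hypothetical. It assumes the keys live at `paths.train`, `paths.dev`, `nlp.batch_size`, and `training.max_epochs` (the notes call the last one `epochs`; `max_epochs` is assumed here to be the setting meant).

```python
import configparser
import io

# Made-up fragment in the INI-like style of a spaCy config (not a full config).
BASE_CONFIG = """\
[paths]
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128

[training]
max_epochs = 0
"""

# Values are written verbatim, so string values keep their quotes.
OVERRIDES = {
    "paths.train": '"./data/train.spacy"',
    "paths.dev": '"./data/test.spacy"',
    "nlp.batch_size": "512",
    "training.max_epochs": "600",
}


def apply_overrides(config_text, overrides):
    """Return config_text with each 'section.option' override applied."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(config_text)
    for dotted_key, value in overrides.items():
        section, option = dotted_key.rsplit(".", 1)
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, option, value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(apply_overrides(BASE_CONFIG, OVERRIDES))
```

For a real `base_config.cfg`, editing the file by hand as described above is just as good; the point of the sketch is only to show which section each setting is assumed to belong to.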

In [None]:
# Fill default parameters and rename the file as config.cfg
!python -m spacy init fill-config base_config.cfg config.cfg

### Debug configuration file

`spacy debug data` complains about:
- A low number of examples to train a new pipeline (77)

In [10]:
!python -m spacy debug data config.cfg

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Pipeline can be

### Train the model

Best model
- Saved under folder model-best
- Model scores: in `meta.json` inside the `model-best` folder
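
To look at those scores without opening the file by hand, something like the following could be used. It is a sketch that assumes `meta.json` stores the scores under a top-level `performance` key; `read_scores` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_scores(model_dir):
    """Return the numeric entries of the 'performance' block in meta.json."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Assumption: scores are stored under a top-level "performance" key.
    performance = meta.get("performance", {})
    # Keep only plain numbers; nested per-type breakdowns are skipped.
    return {
        name: value
        for name, value in performance.items()
        if isinstance(value, (int, float))
    }


model_dir = Path("output/model-best")
if (model_dir / "meta.json").exists():
    print(read_scores(model_dir))
```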

Training pipeline [output](https://support.prodi.gy/t/understanding-the-different-terminology-in-the-command-line-output-of-a-training-pipeline/5719)
- E: number of completed epochs
- #: number of iterations, or documents, that were passed through in training
  - This number keeps counting documents when they are re-used in later epochs.
  - For example, if you have 114 training documents, your first epoch will be completed after iteration 114, your second epoch will be completed after iteration 228, etc.
- LOSS TRANS: loss value for the transformer component
- LOSS NER: loss value for the ner component
- ENTS_F: F-score
- ENTS_P: precision
- ENTS_R: recall
- SCORE: evaluation score from 0.0 to 1.0, rounded to two decimal places. It is based on the `training.score_weights` defined in `config.cfg`.
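
The relationship between the last four columns can be sketched in a few lines. This is an illustration, not spaCy's code: the F-score is the harmonic mean of precision and recall, and SCORE is treated as a weighted sum of the individual scores. The weights below (`ents_f` = 1.0, the others 0.0) are an assumption about what a typical NER config contains; check `[training.score_weights]` in your own `config.cfg`. The score values are made up.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_score(scores, score_weights):
    """Weighted sum of scores; weights set to None are ignored."""
    return sum(
        scores.get(name, 0.0) * weight
        for name, weight in score_weights.items()
        if weight is not None
    )


# Made-up example values for one evaluation step.
precision, recall = 0.75, 0.6
scores = {
    "ents_p": precision,
    "ents_r": recall,
    "ents_f": f_score(precision, recall),
}

# Assumed weights: only the F-score contributes to SCORE.
score_weights = {"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0}

print(round(weighted_score(scores, score_weights), 2))
```

With weights like these, SCORE is just ENTS_F rounded to two decimals, which is why the two columns tend to move together in the training log.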

In [7]:
cuda.is_available(), cuda.device_count(), cuda.current_device()

(True, 1, 0)

In [8]:
!python -m spacy train ./config.cfg --gpu-id 0 --output ./output

ℹ Saving to output directory: output
ℹ Using GPU: 0
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task t

## ___Test the model___

In [12]:
nlp = spacy.load('./output/model-best')



In [13]:
text = [
'''Qualifications
   - A thorough understanding of C# and .NET Core
   - Knowledge of good database design and usage
   - An understanding of NoSQL principles
   - Excellent problem solving and critical thinking skills
   - Curious about new technologies
   - Experience building cloud hosted, scalable web services
   - Azure experience is a plusRequirements
   - Bachelor's degree in Computer Science or related field(Equivalent experience can substitute for earned educational qualifications)
   - Minimum 4 years experience with C# and .NET
   - Minimum 4 years overall experience in developing commercial software
''']

In [None]:
# Print each predicted entity and its label, tab-separated as in the output below
for doc in nlp.pipe(text, disable=["tagger", "parser"]):
  for ent in doc.ents:
    print(f"text:{ent.text}\t label:{ent.label_}")

text:C	 label:SKILLS
text:#	 label:SKILLS
text:.NET Core	 label:SKILLS
text:database design	 label:SKILLS
text:usage	 label:SKILLS
text:NoSQL	 label:SKILLS
text:problem solving	 label:SKILLS
text:critical thinking	 label:SKILLS
text:building cloud hosted, scalable web services	 label:SKILLS
text:Azure	 label:SKILLS
text:Bachelor	 label:DIPLOMA
text:'s	 label:DIPLOMA
text:Computer Science	 label:DIPLOMA_MAJOR
text:4 years	 label:EXPERIENCE
text:C	 label:SKILLS
text:#	 label:SKILLS
text:.NET	 label:SKILLS
text:4 years	 label:EXPERIENCE
text:developing commercial software	 label:SKILLS
