<a href="https://colab.research.google.com/github/salmenhsairi/EndOfStudiesProjectNotebooks/blob/main/BERTUBIAArticleTBS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mounting Drive FS

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Pulling the `UBIAI` assets needed for the training demo

In [None]:
# remove any previous clone; -f keeps the cell from failing on a fresh runtime
! rm -rf Fine_tune_BERT_with_spacy3
! git clone https://github.com/UBIAI/Fine_tune_BERT_with_spacy3.git

## Installing dependencies
* Restart the runtime after this cell finishes so that the newly installed package versions are loaded.

In [None]:
! pip install -U pip setuptools wheel
! pip install 'spacy[transformers]'
# ! python -m spacy download en_core_web_lg

## Processing the Data for the Model

### Convert TSV files to JSON format

In [None]:
!python -m spacy convert Fine_tune_BERT_with_spacy3/train.tsv ./ -t json -n 1 -c iob
!python -m spacy convert Fine_tune_BERT_with_spacy3/test.tsv ./ -t json -n 1 -c iob

ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n` disabled.
✔ Generated output file (1 documents): train.json
ℹ Auto-detected token-per-line NER format
⚠ Document delimiters found, automatic document segmentation with `-n` disabled.
✔ Generated output file (1 documents): test.json
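
To make the "token-per-line NER format" concrete, here is a minimal sketch of how IOB tags collapse into entity spans. It assumes each line holds a token and its tag separated by a tab, which may not match the exact layout of `train.tsv`. The sample rows and the helper `iob_to_spans` are hypothetical; only the label names come from this notebook.

```python
from typing import List, Optional, Tuple


def iob_to_spans(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse (token, IOB tag) pairs into (entity text, label) spans."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in rows:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close the entity in progress, then open a new one.
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            tokens.append(token)
        else:
            # "O" tag: the token is outside any entity.
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# Hypothetical sample in an assumed "token<TAB>tag" layout.
sample = (
    "Bachelor\tB-DIPLOMA\n"
    "degree\tO\n"
    "in\tO\n"
    "Computer\tB-DIPLOMA_MAJOR\n"
    "Science\tI-DIPLOMA_MAJOR\n"
)
rows = [tuple(line.split("\t")) for line in sample.splitlines() if line.strip()]
spans = iob_to_spans(rows)
print(spans)
```

The real conversion is done by `spacy convert`; this sketch only illustrates the tagging scheme.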


### Convert JSON files to spaCy binary format

In [None]:
# create the output directory if it doesn't exist (-p makes the cell safe to re-run)
! mkdir -p drive/MyDrive/NER_data

In [None]:
# convert the JSON files to spaCy's binary (.spacy) format
!python -m spacy convert train.json "/content/drive/MyDrive/NER_data/" -t spacy
!python -m spacy convert test.json "/content/drive/MyDrive/NER_data/" -t spacy

✔ Generated output file (77 documents): /content/drive/MyDrive/NER_data/train.spacy
✔ Generated output file (11 documents): /content/drive/MyDrive/NER_data/test.spacy


## Fill the remaining config defaults

In [None]:
# fill config file for the ner model from the base config
! python -m spacy init fill-config Fine_tune_BERT_with_spacy3/base_config.cfg spacy_config_origin.cfg

✔ Auto-filled config with all values
✔ Saved config
spacy_config_origin.cfg
You can now add your data and train your pipeline:
python -m spacy train spacy_config_origin.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
# Optional: validate the training data against the new config file
! python -m spacy debug data spacy_config_origin.cfg

In [None]:
# check whether the GPU is accessible to spaCy (raises an error if it is not)
import spacy
spacy.require_gpu()

True

In [None]:
!python -m spacy train \
/content/spacy_config_origin.cfg \
--gpu-id 0 \
--training.max_epochs=20 \
--components.transformer.max_batch_items=2048 \
--training.patience=500 \
--training.eval_frequency=50 \
--training.batcher.size=1000 \
--training.logger.progress_bar='true' \
--output='./'

ℹ Saving to output directory: .
ℹ Using GPU: 0
[2022-07-07 16:47:29,874] [INFO] Set up nlp object from config
[2022-07-07 16:47:29,884] [INFO] Pipeline: ['transformer', 'ner']
[2022-07-07 16:47:29,889] [INFO] Created vocabulary
[2022-07-07 16:47:29,890] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSeq
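
The training cell passes several config overrides as dotted `--section.key` options. As a rough sketch, the same overrides can be kept in a dictionary and rendered into the command line, which makes it easier to vary one setting between runs. The helper `build_train_command` is hypothetical and not part of spaCy; the values are the ones used in the cell above.

```python
import shlex
from typing import Dict, List

# Overrides copied from the training cell; each key is a dotted config path.
overrides: Dict[str, object] = {
    "training.max_epochs": 20,
    "components.transformer.max_batch_items": 2048,
    "training.patience": 500,
    "training.eval_frequency": 50,
    "training.batcher.size": 1000,
    "training.logger.progress_bar": "true",
}


def build_train_command(
    config_path: str, output_dir: str, gpu_id: int, overrides: Dict[str, object]
) -> List[str]:
    """Hypothetical helper: assemble the `spacy train` argument list."""
    cmd = [
        "python", "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--gpu-id", str(gpu_id),
    ]
    for key, value in overrides.items():
        cmd += [f"--{key}", str(value)]
    return cmd


command = build_train_command("/content/spacy_config_origin.cfg", "./", 0, overrides)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` in place of the shell cell, assuming spaCy is installed in the same environment.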

## Evaluate the pipeline after training

In [None]:
from spacy.cli.evaluate import evaluate
result = evaluate(
    'model-best',
    '/content/drive/MyDrive/NER_data/test.spacy',
    output='/content/metrics.json',
    use_gpu=-1,  # -1 runs the evaluation on CPU
)

In [None]:
result

{'ents_f': 0.6315789473684211,
 'ents_p': 0.6591549295774648,
 'ents_per_type': {'DIPLOMA': {'f': 0.8387096774193549,
   'p': 0.8125,
   'r': 0.8666666666666667},
  'DIPLOMA_MAJOR': {'f': 0.7887323943661971,
   'p': 0.8,
   'r': 0.7777777777777778},
  'EXPERIENCE': {'f': 0.8333333333333334, 'p': 0.9375, 'r': 0.75},
  'SKILLS': {'f': 0.5903814262023217,
   'p': 0.6180555555555556,
   'r': 0.5650793650793651}},
 'ents_r': 0.6062176165803109,
 'speed': 6048.513434116675,
 'token_acc': 1.0,
 'token_f': 1.0,
 'token_p': 1.0,
 'token_r': 1.0}
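
As a quick sanity check on the numbers above, each reported F-score should be the harmonic mean of its precision and recall, F = 2PR / (P + R). The sketch below recomputes it from the values in the `result` dictionary, copied by hand; the `f1` helper is written here for illustration.

```python
import math


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F-score) copied from the result above.
reported = {
    "overall": (0.6591549295774648, 0.6062176165803109, 0.6315789473684211),
    "DIPLOMA": (0.8125, 0.8666666666666667, 0.8387096774193549),
    "DIPLOMA_MAJOR": (0.8, 0.7777777777777778, 0.7887323943661971),
    "EXPERIENCE": (0.9375, 0.75, 0.8333333333333334),
    "SKILLS": (0.6180555555555556, 0.5650793650793651, 0.5903814262023217),
}

for name, (p, r, f) in reported.items():
    recomputed = f1(p, r)
    status = "ok" if math.isclose(recomputed, f, abs_tol=1e-9) else "MISMATCH"
    print(f"{name:14s} reported={f:.4f} recomputed={recomputed:.4f} {status}")
```

The per-type scores show that `SKILLS` is the weakest label in both precision and recall.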

## Get model inference results on unseen data

In [None]:
import spacy
nlp = spacy.load("./model-best")
text = [
'''Qualifications
- A thorough understanding of C# and .NET Core
- Knowledge of good database design and usage
- An understanding of NoSQL principles
- Excellent problem solving and critical thinking skills
- Curious about new technologies
- Experience building cloud hosted, scalable web services
- Azure experience is a plus
Requirements
- Bachelor's degree in Computer Science or related field
(Equivalent experience can substitute for earned educational qualifications)
- Minimum 4 years experience with C# and .NET
- Minimum 4 years overall experience in developing commercial software
'''
]
# the trained pipeline only contains 'transformer' and 'ner', so nothing needs to be disabled
for doc in nlp.pipe(text):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('C', 'SKILLS'), ('#', 'SKILLS'), ('.NET', 'SKILLS'), ('database design', 'SKILLS'), ('usage', 'SKILLS'), ('NoSQL principles', 'SKILLS'), ('problem solving', 'SKILLS'), ('critical thinking', 'SKILLS'), ('building cloud hosted', 'SKILLS'), ('web services', 'SKILLS'), ('Azure experience', 'SKILLS'), ('Bachelor', 'DIPLOMA'), ("'s", 'DIPLOMA'), ('Computer Science', 'DIPLOMA_MAJOR'), ('4 years', 'EXPERIENCE'), ('C', 'SKILLS'), ('#', 'SKILLS'), ('.NET', 'SKILLS'), ('4 years', 'EXPERIENCE'), ('developing commercial software', 'SKILLS')]
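
The flat list above is easier to scan when grouped by label. The following sketch does that for the printed output, dropping repeated mentions while keeping first-seen order. The `group_by_label` helper is hypothetical, and the entity list is copied by hand from the output rather than read from `doc.ents`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Entities copied from the inference output above.
ents: List[Tuple[str, str]] = [
    ("C", "SKILLS"), ("#", "SKILLS"), (".NET", "SKILLS"),
    ("database design", "SKILLS"), ("usage", "SKILLS"),
    ("NoSQL principles", "SKILLS"), ("problem solving", "SKILLS"),
    ("critical thinking", "SKILLS"), ("building cloud hosted", "SKILLS"),
    ("web services", "SKILLS"), ("Azure experience", "SKILLS"),
    ("Bachelor", "DIPLOMA"), ("'s", "DIPLOMA"),
    ("Computer Science", "DIPLOMA_MAJOR"), ("4 years", "EXPERIENCE"),
    ("C", "SKILLS"), ("#", "SKILLS"), (".NET", "SKILLS"),
    ("4 years", "EXPERIENCE"), ("developing commercial software", "SKILLS"),
]


def group_by_label(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entity texts by label, keeping first-seen order without repeats."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(ents)
for label, texts in grouped.items():
    print(f"{label}: {texts}")
```

Grouping also makes the fragmented spans easy to spot, such as `C` and `#` being predicted as two separate `SKILLS` entities and `Bachelor` and `'s` as two `DIPLOMA` entities.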
