# __Train a Joint Entities and Relation Extraction Classifier__

- [article](https://towardsdatascience.com/how-to-train-a-joint-entities-and-relation-extraction-classifier-using-bert-transformer-with-spacy-49eb08d91b5c)

## ___Setup___

The environment and module setup is described in the `tutorial_ner_bert_spacy.ipynb` notebook.
- Conda environment: `torch_spacy`

### Import packages

In [1]:
import os, spacy
from pathlib import Path
from shutil import copy
from torch import cuda

### Set up working directories

In [2]:
work_dir = Path.home() / "proj_local/joint_ner_re"
work_dir.mkdir(parents=True, exist_ok=True)

# Directory for the relation extraction (RE) data
data_re_dir = work_dir / "data_re"

os.chdir(work_dir)
os.getcwd()

'/home/shius/proj_local/joint_ner_re'

## ___Intro___

A relation extraction model is a classifier:
- It predicts a relation r for a given pair of entities {e1, e2}.
- With transformers, this classifier is added on top of the output hidden states.
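
The idea can be sketched in plain Python, assuming each entity has already been pooled into a fixed-size vector. The names `score_pair`, `weights` and `bias` are hypothetical, and the toy vectors only stand in for transformer hidden states; this is an illustration, not the spaCy implementation.

```python
import math
from itertools import permutations


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score_pair(vec_e1, vec_e2, weights, bias):
    """Score every relation label for the ordered pair (e1, e2).

    The two entity vectors are concatenated and passed through a single
    logistic layer with one output per relation label.
    """
    features = vec_e1 + vec_e2  # list concatenation
    return {
        label: sigmoid(sum(w * f for w, f in zip(ws, features)) + bias[label])
        for label, ws in weights.items()
    }


# Toy 2-dimensional "hidden states" for three entities (illustrative values).
entities = {"5+ years": [1.0, 0.0], "Python": [0.0, 1.0], "BSc": [0.5, 0.5]}
# Hypothetical, hand-picked parameters; a real model learns these.
weights = {
    "Experience_in": [2.0, 0.0, 0.0, 2.0],
    "Degree_in": [-1.0, 0.0, 0.0, -1.0],
}
bias = {"Experience_in": -2.0, "Degree_in": 0.0}

# Every ordered pair of distinct entities is a candidate for classification.
scores = {
    (e1, e2): score_pair(entities[e1], entities[e2], weights, bias)
    for e1, e2 in permutations(entities, 2)
}
```

Each ordered pair gets an independent score per label, so a single pair can in principle receive several relations or none.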

Pretrained model
- roberta-base

The goal here is to extract the relations between:
- {Experience, Skills} as Experience_in,
- {Diploma, Diploma_major} as Degree_in.
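
As a small sketch, the two target relations can be written as a lookup from entity-type pair to relation label. `RELATION_SCHEMA` and `candidate_relation` are hypothetical names, and the ordering of each pair (which entity comes first) is an assumption made for illustration.

```python
# Entity-type pair -> relation label, as listed above.
# The order inside each pair is assumed, not taken from the dataset.
RELATION_SCHEMA = {
    ("Experience", "Skills"): "Experience_in",
    ("Diploma", "Diploma_major"): "Degree_in",
}


def candidate_relation(first_type, second_type):
    """Return the relation label allowed for this entity-type pair, or None."""
    return RELATION_SCHEMA.get((first_type, second_type))
```

For example, `candidate_relation("Experience", "Skills")` returns `"Experience_in"`, while a pair that is not in the schema returns `None`.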

## ___Data___

#### Clone repo with RE data

In [3]:
!git clone https://github.com/walidamamou/relation_extraction_transformer

Cloning into 'relation_extraction_transformer'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 34 (delta 11), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (34/34), 439.36 KiB | 6.10 MiB/s, done.


In [4]:
os.rename("relation_extraction_transformer", "data_re")

#### Clone spaCy’s relation extraction project

Clone the `rel_component` tutorial project, then move the relevant datasets into it:
- Create a folder named `data` inside `rel_component`.
- Copy the training, dev and test binary files into it.

In [5]:
!python -m spacy project clone tutorials/rel_component

✔ Cloned 'tutorials/rel_component' from 'explosion/projects' (branch
'v3')
/home/shius/proj_local/joint_ner_re/rel_component
✔ Your project is now ready!
To fetch the assets, run:
python -m spacy project assets /home/shius/proj_local/joint_ner_re/rel_component


In [10]:
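# destination data dir: the "data" folder inside rel_component
dir_rel_comp = work_dir / "rel_component"
dir_data = dir_rel_comp / "data"
# create it if needed, otherwise the copies below fail
dir_data.mkdir(parents=True, exist_ok=True)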

# file names
file_dev   = "relations_dev.spacy"
file_train = "relations_training.spacy"
file_test  = "relations_test.spacy"

copy(data_re_dir / file_dev,   dir_rel_comp / "data" / file_dev)
copy(data_re_dir / file_train, dir_rel_comp / "data" / file_train)
copy(data_re_dir / file_test,  dir_rel_comp / "data" / file_test)

PosixPath('/home/shius/proj_local/joint_ner_re/rel_component/data/relations_test.spacy')
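
Before moving on, it can help to confirm that all three files landed in the `data` folder. The helper below is a minimal sketch; `missing_files` is a hypothetical name, not part of spaCy or the cloned repos.

```python
from pathlib import Path


def missing_files(data_dir, names):
    """Return the names in `names` that do not exist as files under `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


# Usage with the variables defined in the cell above (an empty list means
# every file was copied):
# missing_files(dir_rel_comp / "data", [file_dev, file_train, file_test])
```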

### Update config file

In `project.yml` (in `dir_rel_comp`), set the following variables:
- train_file: "data/relations_train.spacy"
- dev_file: "data/relations_dev.spacy"
- test_file: "data/relations_test.spacy"

In `configs/rel_trf.cfg` (under `dir_rel_comp`), set:
- max_length = 20
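
To see why `max_length` matters, here is a minimal sketch. It assumes `max_length` is the maximum token distance between two entities for them to be considered a candidate pair, which is how the tutorial's instance generator appears to use it. `candidate_pairs` and the token offsets are hypothetical.

```python
from itertools import permutations


def candidate_pairs(entity_starts, max_length):
    """Return ordered pairs of entity start offsets at most `max_length` tokens apart."""
    return [
        (a, b)
        for a, b in permutations(entity_starts, 2)
        if abs(b - a) <= max_length
    ]


starts = [0, 5, 18, 40]  # hypothetical token offsets of four entities
pairs_20 = candidate_pairs(starts, 20)    # pairs kept with max_length = 20
pairs_100 = candidate_pairs(starts, 100)  # pairs kept with a looser limit
```

Under this assumption, a smaller `max_length` prunes distant entity pairs and so reduces the number of instances the classifier has to score.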

## ___spacy RE pipeline___

See the [GitHub README](https://github.com/explosion/projects/tree/v3/tutorials/rel_component).

The following three steps can also be run with a single command:
- `spacy project run all_gpu`

### data step

Parse annotations

In [11]:
os.chdir(dir_rel_comp)
os.getcwd()

'/home/shius/proj_local/joint_ner_re/rel_component'

In [13]:
# This step is not mentioned in the original article
!spacy project run data

Running command: /home/shius/miniconda3/envs/torch_spacy/bin/python ./scripts/parse_data.py assets/annotations.jsonl data/relations_train.spacy data/relations_dev.spacy data/relations_test.spacy
ℹ 102 training sentences from 43 articles, 209/2346 pos instances.
ℹ 27 dev sentences from 5 articles, 56/710 pos instances.
ℹ 20 test sentences from 6 articles, 30/340 pos instances.


### train_gpu

Train the REL model with a Transformer on a GPU and evaluate on the dev corpus.

In [14]:
!spacy project run train_gpu

Running command: /home/shius/miniconda3/envs/torch_spacy/bin/python -m spacy train configs/rel_trf.cfg --output training --paths.train data/relations_train.spacy --paths.dev data/relations_dev.spacy -c ./scripts/custom_functions.py --gpu-id 0
ℹ Saving to output directory: training
ℹ Using GPU: 0
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassificatio

### evaluate

Apply the best model to new, unseen text, and measure micro-averaged precision, recall and F-score at different thresholds.
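
The threshold sweep can be illustrated with a small sketch. It assumes a relation counts as predicted when its score is at or above the threshold; `micro_prf` and the toy scores are hypothetical and are not taken from the project's `evaluate.py`.

```python
def micro_prf(gold, predicted, threshold):
    """Micro precision, recall and F1 over scored (pair, label) relations.

    gold:      set of (pair, label) tuples that are true relations
    predicted: dict mapping (pair, label) -> score in [0, 1]
    """
    accepted = {key for key, score in predicted.items() if score >= threshold}
    tp = len(accepted & gold)
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: two true relations and four scored candidates.
gold = {((0, 1), "Experience_in"), ((2, 3), "Degree_in")}
predicted = {
    ((0, 1), "Experience_in"): 0.9,
    ((2, 3), "Degree_in"): 0.4,
    ((1, 0), "Experience_in"): 0.6,
    ((3, 2), "Degree_in"): 0.1,
}

for threshold in (0.0, 0.5, 0.95):
    p, r, f = micro_prf(gold, predicted, threshold)
    print(f"threshold {threshold:.2f}\tp={p:.2f} r={r:.2f} f={f:.2f}")
```

At threshold 0 every candidate is accepted, so recall is 100% and precision equals the share of positive instances. The same pattern shows in the first row of the output that follows.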

In [15]:
!spacy project run evaluate

Running command: /home/shius/miniconda3/envs/torch_spacy/bin/python ./scripts/evaluate.py training/model-best data/relations_test.spacy False
ℹ Could not determine any instances in doc - returning doc as is.

Random baseline:
threshold 0.00 	 {'rel_micro_p': '15.22', 'rel_micro_r': '100.00', 'rel_micro_f': '26.42'}
threshold 0.05 	 {'rel_micro_p': '15.64', 'rel_micro_r': '100.00', 'rel_micro_f': '27.05'}
threshold 0.10 	 {'rel_micro_p': '15.38', 'rel_micro_r': '92.86', 'rel_micro_f': '26.40'}
threshold 0.20 	 {'rel_micro_p': '14.10', 'rel_micro_r': '78.57', 'rel_micro_f': '23.91'}
threshold 0.30 	 {'rel_micro_p': '14.71', 'rel_micro_r': '71.43', 'rel_micro_f': '24.39'}
threshold 0.40 	 {'rel_micro_p': '14.75', 'rel_micro_r': '64.29', 'rel_micro_f': '24.00'}
threshold 0.50 	 {'rel_micro_p': '18.28', 'rel_micro_r': '60.71', 'rel_micro_f': '28.10'}
threshold 0.60 	 {'rel_micro_p': '19.44', 'rel_micro_r': '50.00', 'rel_micro_f': '28.00'}
threshold 0.70 	 {'rel_micro_p': '