# 10 - Pretraining CamemBERT

Pretrain the Hugging Face `Jean-Baptiste/camembert-ner` model on a large set of trade directory entries extracted with OCR.
The resulting model is saved to disk in the folder `10-camembert_pretrained_model`.

In [1]:
""" RUN THIS BLOCK ONLY ON GOOGLE COLAB """

# `GDRIVE_PAPER_FOLDER` is the path, relative to the root of your Google Drive,
# of the folder containing the code of the paper.
# ADAPT TO YOUR SITUATION!
%env GDRIVE_PAPER_FOLDER=TEST

# Mount Google Drive in your Colab environment. You may be asked to log in to Google.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Copy the Python modules in `PATH_TO_SOURCES/src/ner/util` to GColab
# to enable import.
!cp -r /content/drive/MyDrive/$GDRIVE_PAPER_FOLDER/src/ner/util .

# Install dependencies
!pip install -q datasets transformers[sentencepiece]
# Force update SpaCy to v3 and NLTK
!pip install -qU spacy
!pip install -qU nltk

env: GDRIVE_PAPER_FOLDER=TEST
Mounted at /content/drive

In [2]:
""" Loads the configuration """

# Set to 1 or true to set the logging level of nerlogger to DEBUG
# and to save the spaCy datasets as TXT along with the .spacy file,
# for easier debugging of the training set generation.
%env DEBUG=1

# If True, activates a set of assertions in the notebooks to ensure
# that the scripts run with the parameters used in the paper.
%env AS_IN_THE_PAPER = True

import util.config as config

config.show()

23/05/2022 03:51:01 ; INFO ; BASEDIR: /content/drive/MyDrive/TEST
23/05/2022 03:51:01 ; INFO ; Input datasets will be loaded from DATASETDIR /content/drive/MyDrive/TEST/dataset
23/05/2022 03:51:01 ; INFO ; Training data and models will be saved to NERDIR /content/drive/MyDrive/TEST/src/ner
23/05/2022 03:51:01 ; INFO ; Debug mode is ON
23/05/2022 03:51:01 ; INFO ; Random seed: 42
23/05/2022 03:51:01 ; INFO ; Enable reproducibility checks: True


env: DEBUG=1
env: AS_IN_THE_PAPER=True
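
Both settings above are passed as plain strings through the environment. How `util.config` and `assert_expected` interpret them is not shown in this notebook. The following is only a minimal sketch, assuming a case-insensitive "truthy string" convention and a check that is skipped when the flag is off. The names `env_flag`, `check_expected` and `TRUTHY` are hypothetical and are not the project's API.

```python
import os

# Assumed convention for "true" values; not taken from util.config.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean flag."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def check_expected(actual, expected, enabled):
    """Raise if `actual` differs from `expected`, but only when checks are enabled."""
    if enabled and actual != expected:
        raise AssertionError(f"expected {expected!r}, got {actual!r}")


# Example with an explicit mapping instead of the real process environment.
fake_env = {"DEBUG": "1", "AS_IN_THE_PAPER": "True"}
debug = env_flag("DEBUG", environ=fake_env)
strict = env_flag("AS_IN_THE_PAPER", environ=fake_env)
check_expected(actual=7, expected=7, enabled=strict)  # matching values pass silently
```

With such a guard, changing a parameter while the flag is on fails fast, whereas turning the flag off lets you experiment with other values.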


In [3]:
""" Import all modules at once """

# General imports
import nltk
import gzip
import tempfile 
import pathlib

# NER imports
from util.as_in_the_paper import assert_expected


## 11 - Preparation of the pre-training data

Starting from the raw entries in `dataset/unsupervised_pretraining/10-normalized/all.txt.gz`, this block extracts the subset of entries that contain at least `MIN_WORDS_PER_ENTRY` tokens. Tokens are those produced by `nltk.word_tokenize`, so punctuation marks count as well.

The resulting set is written to the text file `pretraining_data.txt` in the temporary folder of your file system.
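
To illustrate how the threshold behaves, here is a small sketch of the length filter. It uses a simple regular-expression tokenizer as a stand-in for `nltk.word_tokenize`, so the exact token boundaries may differ from NLTK's on other inputs. The second sample entry and the helper names `simple_tokenize` and `keep_entry` are made up for this illustration.

```python
import re

MIN_WORDS_PER_ENTRY = 7  # same threshold as in the notebook


def simple_tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def keep_entry(entry, min_tokens=MIN_WORDS_PER_ENTRY):
    """Return True when the stripped entry has at least `min_tokens` tokens."""
    return len(simple_tokenize(entry.strip())) >= min_tokens


samples = [
    "Morel abbé ,Tournon, 14.",  # 4 word-like tokens + 3 punctuation marks = 7 tokens
    "Dupont, 3.",                # 2 word-like tokens + 2 punctuation marks = 4 tokens
]
kept = [s for s in samples if keep_entry(s)]
```

Because punctuation is counted, a short but complete entry such as the first sample just reaches the threshold, while fragments like the second are dropped.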

In [4]:
""" Prepare the pre-training dataset """

pretraining_set_path = config.DATASETDIR / "unsupervised_pretraining/10-normalized"

# Filter the normalized pre-training set:
# keep only entries containing at least MIN_WORDS_PER_ENTRY tokens.
MIN_WORDS_PER_ENTRY = 7  # Keeps simple entries like "Morel abbé ,Tournon, 14."

assert_expected(actual=MIN_WORDS_PER_ENTRY, expected=7)

nltk.download("punkt")
number_of_entries = 0
valid_entries = []

with gzip.open(pretraining_set_path / "all.txt.gz", "rt") as all_txt:
    for entry in all_txt:
        entry = entry.strip() # Sanitize
        words = nltk.word_tokenize(entry, language="fr", preserve_line=True)
        if len(words) >= MIN_WORDS_PER_ENTRY:
            valid_entries.append(entry)
        number_of_entries += 1
        


# Get the name of the temp folder and save the file to that folder.
temp_dir = pathlib.Path(tempfile.gettempdir())
with open(temp_dir / "pretraining_data.txt", "w") as fp:
    fp.write("\n".join(valid_entries))
    config.logger.debug(f"Stored the pretraining data in {temp_dir}")


config.logger.info("Valid entries: %d, %.2f percent of the total" % (len(valid_entries), 100 * len(valid_entries) / number_of_entries))
assert_expected(actual=len(valid_entries), expected=845014)
assert_expected(actual=number_of_entries, expected=1045674)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
23/05/2022 03:52:51 ; DEBUG ; Stored the pretraining data in /tmp
23/05/2022 03:52:51 ; INFO ; Valid entries: 845014, 80.81 percent of the total
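
The share of retained entries follows directly from the two counts checked in the cell above, and comes to roughly 80.81 percent. A small worked example using only those two numbers is given below; the variable names are illustrative. In printf-style formatting the precision goes between `%` and `f` (`%.2f`), and the fraction has to be multiplied by 100 to read as a percentage.

```python
valid_entries_count = 845014   # entries kept by the filter
total_entries_count = 1045674  # entries read from all.txt.gz

ratio = valid_entries_count / total_entries_count
percent_text = "%.2f" % (100 * ratio)  # two decimals, expressed as a percentage
print(f"Valid entries: {valid_entries_count}, {percent_text} percent of the total")
```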


## 12 - Pretraining process

Pretrain the base model `Jean-Baptiste/camembert-ner` on the 845k raw entries in `pretraining_data.txt`.
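
The pretraining script is launched with a fairly long command line. As a reading aid, the sketch below assembles the same flags as an argument list, which can be easier to review and to adapt than a single shell string. It is only an illustration: the helper `build_pretraining_command` is hypothetical, the base folder is the `BASEDIR` value logged earlier in this notebook, and nothing is executed here.

```python
import pathlib
import shlex
import tempfile


def build_pretraining_command(base_dir, data_dir):
    """Assemble the argument list for util/pretrain_camembert.py."""
    train_file = pathlib.Path(data_dir) / "pretraining_data.txt"
    output_dir = pathlib.Path(base_dir) / "10-camembert_pretrained_model"
    return [
        "python", "util/pretrain_camembert.py",
        "--model_name_or_path", "Jean-Baptiste/camembert-ner",
        "--train_file", str(train_file),
        "--do_train",
        "--do_eval",
        "--line_by_line",
        "--output_dir", str(output_dir),
        "--save_total_limit", "1",
        "--load_best_model_at_end", "True",
        "--evaluation_strategy", "steps",
        "--save_strategy", "steps",
    ]


# Base folder as logged by config.show() above; adapt to your own setup.
command = build_pretraining_command("/content/drive/MyDrive/TEST", tempfile.gettempdir())
print(shlex.join(command))
# To launch it for real: subprocess.run(command, check=True)
```

Passing the arguments as a list also avoids shell quoting issues when a path contains spaces.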

In [10]:
""" Actually runs the pretraining.
On GColab you'd certainly want to activate GPU acceleration before training !
"""

# Get the name of the temp folder that holds pretraining_data.txt.
temp_dir = tempfile.gettempdir()
# IPython expands {expression} in shell commands; a leading `$` would be passed on literally.
!python util/pretrain_camembert.py --model_name_or_path "Jean-Baptiste/camembert-ner" --train_file "{temp_dir}/pretraining_data.txt" --do_train --do_eval --line_by_line --output_dir "{config.BASEDIR}/10-camembert_pretrained_model" --save_total_limit 1 --load_best_model_at_end True --evaluation_strategy "steps" --save_strategy "steps"

05/23/2022 15:56:36 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_met