In this task, we complete French proverbs. We fine-tune a transformer encoder so that it fills in the missing word of a proverb, then we analyze the results obtained.

### Import the necessary modules

We also define the functions that read the data files. The test file needs some extra manipulation before it can be loaded as a DataFrame.

In [1]:
import transformers

import ast
import json
import spacy

import pandas as pd

In [2]:
def load_proverbs(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        raw_lines = f.readlines()
    return [x.strip() for x in raw_lines]

def load_tests(filename):
    # Use a context manager so the file handle is always closed
    with open(filename, encoding='utf-8') as fp:
        test_data = json.load(fp)
    return test_data

In [3]:
proverbs_fn = "data_proverbes/proverbes.txt"
test1_fn = "data_proverbes/test_proverbes.json"

corpus = load_proverbs(proverbs_fn)
test = load_proverbs(test1_fn)

In [4]:
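# load_proverbs returned the test file as a list of lines; join them back into one string
a = ''.join(str(x) for x in test)

# Split the text into one chunk per record
t = a.split('},{')

# Drop the first two and last two characters (presumably the enclosing '[{' and '}]')
t[0] = t[0][2:]
t[-1] = t[-1][:-2]

# Restore the braces and parse each chunk as a Python dict literal
for i in range(len(t)):
    t[i] = ast.literal_eval("{" + t[i] + '}')

test = pd.DataFrame.from_dict(t)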

In [5]:
test.head()

| | test | choices | solution |
|---|---|---|---|
| 0 | a beau mentir qui *** de loin | [vient, part, revient, programme] | vient |
| 1 | a beau *** qui vient de loin | [mentir, savoir, temps, dire] | mentir |
| 2 | année de gelée, *** de blé | [année, absence, saison, mois] | année |
| 3 | après la pluie, le *** temps | [beau, bon, meilleur, mauvais] | beau |
| 4 | aux échecs, les *** sont les plus près des rois | [fous, joueurs, dames, femmes] | fous |
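The manual string splitting used to build this DataFrame may be avoidable. The following is a minimal sketch, assuming the test file is valid JSON holding a list of objects with the keys `test`, `choices` and `solution` (the original file may not satisfy this, which would explain the workaround). The helper name `load_tests_as_frame` and the temporary sample file are illustrative only.

```python
import json
import os
import tempfile

import pandas as pd


def load_tests_as_frame(filename):
    """Read a JSON list of records and return it as a DataFrame (hypothetical helper)."""
    with open(filename, encoding="utf-8") as fp:
        records = json.load(fp)
    return pd.DataFrame(records)


# Two records copied from the table above, written to a temporary file
# so that the sketch runs without the real data directory.
sample = [
    {"test": "a beau mentir qui *** de loin",
     "choices": ["vient", "part", "revient", "programme"],
     "solution": "vient"},
    {"test": "a beau *** qui vient de loin",
     "choices": ["mentir", "savoir", "temps", "dire"],
     "solution": "mentir"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_proverbes.json")
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(sample, fp, ensure_ascii=False)
    frame = load_tests_as_frame(path)

print(frame.columns.tolist())
```

If the real file is not strict JSON (for example, if it uses single quotes), `json.load` will raise an error and the `ast.literal_eval` route remains necessary.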


# Without Fine-Tuning

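We chose `dbmdz/bert-base-french-europeana-cased`, a French BERT model available on Hugging Face. Without any fine-tuning it reaches a score of about 55%, where a prediction counts as correct if the expected word appears among the five candidates that the `fill-mask` pipeline returns by default.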

In [6]:
from transformers import pipeline
from transformers import AutoTokenizer

model_checkpoint = "dbmdz/bert-base-french-europeana-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

generator = pipeline(task="fill-mask", model=model_checkpoint, tokenizer=tokenizer)

Some weights of the model checkpoint at dbmdz/bert-base-french-europeana-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
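score = 0
for sentence, word in zip(list(test['test']), list(test['solution'])) :
    # Replace the placeholder with BERT's mask token
    sentence = sentence.replace("***","[MASK]")
    # The pipeline returns its 5 best candidates by default
    results = generator(sentence)
    target = [result['token_str'] for result in results]
    # Count a hit if the expected word is among those candidates
    if word in target :
        score = score + 1
        
score / len(list(test['test']))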

0.5535714285714286

The results are promising: the model captures French well, but proverbs use a rather particular, formal register of French, which may explain this score (see the Analysis section for details).
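Since the score depends on how many candidates are inspected, it can help to make that number explicit. The sketch below is one possible way to do so; `top_k_accuracy` is a hypothetical helper, and `fake_fill_mask` is a toy stand-in for the real pipeline so that the example runs without downloading a model. With the real `generator`, the `top_k` argument of the `fill-mask` pipeline would play the same role.

```python
def top_k_accuracy(fill_mask, sentences, solutions, k=5):
    """Fraction of sentences whose solution is among the k best candidates."""
    hits = 0
    for sentence, solution in zip(sentences, solutions):
        masked = sentence.replace("***", "[MASK]")
        predictions = fill_mask(masked, top_k=k)
        candidates = [p["token_str"].strip() for p in predictions]
        hits += solution in candidates
    return hits / len(sentences)


def fake_fill_mask(sentence, top_k=5):
    """Toy stand-in for the pipeline: always returns the same made-up ranking."""
    ranked = ["part", "vient", "va"]
    return [{"token_str": word} for word in ranked[:top_k]]


sentences = ["a beau mentir qui *** de loin"]
solutions = ["vient"]

acc_top1 = top_k_accuracy(fake_fill_mask, sentences, solutions, k=1)
acc_top5 = top_k_accuracy(fake_fill_mask, sentences, solutions, k=5)
print(acc_top1, acc_top5)
```

Reporting both top-1 and top-5 accuracy would show how often the expected word is the model's first choice rather than merely one of its candidates.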

# Let's Fine-Tune

We use the masked language modeling (MLM) approach: we load BERT's tokenizer and the MLM-specific model class (**BertForMaskedLM**) from the same French checkpoint, using the **from_pretrained** function.

In [8]:
import pandas as pd
from tqdm import tqdm
import time
from transformers import BertTokenizer
from transformers import LineByLineTextDataset
from transformers import BertForMaskedLM
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [9]:
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-french-europeana-cased")
model = BertForMaskedLM.from_pretrained("dbmdz/bert-base-french-europeana-cased")

Some weights of the model checkpoint at dbmdz/bert-base-french-europeana-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We create the dataset from the proverbs file. We use **LineByLineTextDataset** because the text file is organized in separate lines, so treating each line as its own example suits our task better than concatenating all the data. In our experiments it also gave better results.

In [10]:
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="data_proverbes/proverbes.txt",
    block_size=128,
)
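To make the difference between the two dataset strategies concrete, here is a simplified sketch. It is not the library's implementation: the functions `line_by_line` and `concatenated_blocks` are hypothetical, and `str.split` stands in for the real BERT tokenizer.

```python
def line_by_line(lines, tokenize, block_size):
    """One example per non-empty line, truncated to block_size tokens."""
    return [tokenize(line)[:block_size] for line in lines if line.strip()]


def concatenated_blocks(lines, tokenize, block_size):
    """All lines merged into one token stream, then cut into fixed-size blocks."""
    stream = [token for line in lines for token in tokenize(line)]
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


lines = ["mieux vaut tard que jamais", "", "après la pluie, le beau temps"]

per_line = line_by_line(lines, str.split, block_size=4)
blocks = concatenated_blocks(lines, str.split, block_size=4)

print(per_line)
print(blocks)
```

In the concatenated version, the second block mixes the end of one proverb with the start of the next, which is the kind of cross-proverb context that the line-by-line approach avoids.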



The data collator assembles the training batches. Because we use **DataCollatorForLanguageModeling** with `mlm=True`, it also creates the masks for the MLM objective: each token is selected for masking with a probability of 15%.

In [11]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
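As an illustration of what this kind of masking does, the following pure-Python sketch applies the standard BERT recipe: each non-special token is selected with probability 15%, and a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function `mask_tokens` and all token ids below are hypothetical; the real collator works on tensors.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 marks positions ignored by the loss
    for i, token in enumerate(token_ids):
        if token in special_ids or rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in the input
    return inputs, labels


# Made-up ids: 2 and 3 play the role of special tokens, 4 the mask token.
ids = [2, 101, 102, 103, 104, 105, 3]
masked_inputs, labels = mask_tokens(
    ids, mask_id=4, vocab_size=32000, special_ids={2, 3}, rng=random.Random(0)
)
print(masked_inputs)
print(labels)
```

Because the selection is random, a short proverb may end up with no masked token in a given epoch; the masks are redrawn each time a batch is built.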

Fine-tuning parameters (only 10 epochs to limit overfitting, since the dataset is not very large).

In [12]:
training_args = TrainingArguments(
    output_dir="./proverb_bert",
    # overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

Creating the PyTorch-based `Trainer`.

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Fine-tuning ...

In [14]:
trainer.train()

***** Running training *****
  Num examples = 3108
  Num Epochs = 10
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 490
  Number of trainable parameters = 110650880


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=490, training_loss=1.7918042241310588, metrics={'train_runtime': 285.2451, 'train_samples_per_second': 108.959, 'train_steps_per_second': 1.718, 'total_flos': 710554212237312.0, 'train_loss': 1.7918042241310588, 'epoch': 10.0})
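The 490 optimization steps reported in the log follow directly from the dataset size and the training arguments. A quick check, assuming a single device and no gradient accumulation (as the log indicates):

```python
import math

num_examples = 3108  # "Num examples" in the training log
batch_size = 64      # per_device_train_batch_size
num_epochs = 10      # num_train_epochs

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 49 490
```

Since `save_steps=10_000` is far above 490, no intermediate checkpoint is written during this run.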

Pushing the model to the Hugging Face Hub.

In [15]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


In [16]:
model.push_to_hub("rasta/proverbes-french-IFT-7022")

Configuration saved in /tmp/tmp209glfrb/config.json
Model weights saved in /tmp/tmp209glfrb/pytorch_model.bin
Uploading the following files to rasta/proverbes-french-IFT-7022: pytorch_model.bin,config.json


CommitInfo(commit_url='https://huggingface.co/rasta/proverbes-french-IFT-7022/commit/b5e5bc2889a24b0474adb60f8a6e2d427a8e628a', commit_message='Upload BertForMaskedLM', commit_description='', oid='b5e5bc2889a24b0474adb60f8a6e2d427a8e628a', pr_url=None, pr_revision=None, pr_num=None)

# Results after fine-tuning

In [14]:
from transformers import pipeline
from transformers import AutoTokenizer

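model_checkpoint = "dbmdz/bert-base-french-europeana-cased"
finetuned_checkpoint = "rasta/proverbes-french-IFT-7022"

# Only the model weights and config were pushed to the Hub,
# so the tokenizer is still loaded from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

generator_pretrained = pipeline(task="fill-mask", model=finetuned_checkpoint, tokenizer=tokenizer)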

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--dbmdz--bert-base-french-europeana-cased/snapshots/b895c3cf291f7bf4c15639078a6bee0b3e272c5b/config.json
Model config BertConfig {
  "_name_or_path": "dbmdz/bert-base-french-europeana-cased",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--dbmdz--bert-base-french-europeana-cased/snapshots/b895c3cf291f7bf4c15639078a6bee0b3e272c5b/vocab.tx

In [15]:
score = 0
for sentence, word in zip(list(test['test']), list(test['solution'])) :
    sentence = sentence.replace("***","[MASK]")
    results = generator_pretrained(sentence)
    target = [result['token_str'] for result in results]
    if word in target :
        score = score + 1
        
score / len(list(test['test']))

0.8571428571428571

Result after fine-tuning: about 86% (0.857), measured the same way as before. This is a clear improvement, which suggests that the model has adapted to the language of the proverbs.
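The test set also provides a `choices` column, which the score above does not use. One possible complementary evaluation is to let the model pick among the proposed choices only. The sketch below is an assumption about how this could be done: `choice_accuracy` is a hypothetical helper, and `fake_fill_mask` is a toy stand-in with made-up scores so that the example runs offline. With the real pipeline, the `targets` argument of the `fill-mask` call restricts scoring to the given words, with the caveat that a word absent from the vocabulary is reduced to its first sub-token.

```python
def choice_accuracy(fill_mask, sentences, choices_list, solutions):
    """Fraction of sentences where the best-scoring choice is the solution."""
    hits = 0
    for sentence, choices, solution in zip(sentences, choices_list, solutions):
        masked = sentence.replace("***", "[MASK]")
        ranked = fill_mask(masked, targets=choices)
        best = max(ranked, key=lambda p: p["score"])["token_str"].strip()
        hits += best == solution
    return hits / len(sentences)


def fake_fill_mask(sentence, targets):
    """Toy stand-in for the pipeline: scores are invented for illustration."""
    toy_scores = {"vient": 0.6, "part": 0.3, "revient": 0.08, "programme": 0.02}
    return [{"token_str": t, "score": toy_scores.get(t, 0.0)} for t in targets]


accuracy = choice_accuracy(
    fake_fill_mask,
    ["a beau mentir qui *** de loin"],
    [["vient", "part", "revient", "programme"]],
    ["vient"],
)
print(accuracy)
```

This measures a different quantity from the top-5 score: the model is forced to choose exactly one word, but only among four options.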

# Analysis

The fine-tuned model obtains better results (about 55% vs 86%) because it has learned the language of proverbs more closely, in particular how to predict the last word of a sentence. Without fine-tuning, the completions are coherent and syntactically plausible French, but they do not fit the proverb. The following examples illustrate this. In each cell, the expected word is printed when it appears among the five returned candidates; otherwise the top candidate is printed.

In [16]:
sentence = 'quand la poire est mûre, elle [MASK]'
word = "tombe"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])


********** SANS FINE TUNE *********************

quand la poire est mûre, elle [MASK]
mot correct:  tombe
mots prédit :  .

********** AVEC FINE TUNE *********************

quand la poire est mûre, elle [MASK]
mot correct:  tombe
mots prédit :  tombe


In [17]:
sentence = 'à maison laide arbre [MASK]'
word = "mort"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])


********** SANS FINE TUNE *********************

à maison laide arbre [MASK]
mot correct:  mort
mots prédit :  .

********** AVEC FINE TUNE *********************

à maison laide arbre [MASK]
mot correct:  mort
mots prédit :  mort


In [18]:
sentence = 'mieux vaut [MASK] que jamais'
word = "tard"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])

********** SANS FINE TUNE *********************

mieux vaut [MASK] que jamais
mot correct:  tard
mots prédit :  plus

********** AVEC FINE TUNE *********************

mieux vaut [MASK] que jamais
mot correct:  tard
mots prédit :  tard


In [19]:
sentence = 'année de gelée, [MASK] de blé'
word = "année"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])

********** SANS FINE TUNE *********************

année de gelée, [MASK] de blé
mot correct:  année
mots prédit :  beaucoup

********** AVEC FINE TUNE *********************

année de gelée, [MASK] de blé
mot correct:  année
mots prédit :  année


Finally, here are some examples that both models get right; these generally involve recognizing a verb, and auxiliaries in particular.

In [20]:
sentence = 'a beau mentir qui [MASK] de loin'
word = "vient"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])

********** SANS FINE TUNE *********************

a beau mentir qui [MASK] de loin
mot correct:  vient
mots prédit :  vient

********** AVEC FINE TUNE *********************

a beau mentir qui [MASK] de loin
mot correct:  vient
mots prédit :  vient


In [21]:
sentence = 'ce que [MASK] veut, dieu le veut'
word = "femme"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])

********** SANS FINE TUNE *********************

ce que [MASK] veut, dieu le veut
mot correct:  femme
mots prédit :  femme

********** AVEC FINE TUNE *********************

ce que [MASK] veut, dieu le veut
mot correct:  femme
mots prédit :  femme


In [22]:
sentence = "d'un sac [MASK] ne peut tirer deux moutures"
word = "on"

results = generator(sentence)
target = [result['token_str'] for result in results]

print("********** SANS FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])
    

results = generator_pretrained(sentence)
target = [result['token_str'] for result in results]

print("\n********** AVEC FINE TUNE *********************\n")
print(sentence)
print("mot correct: ",word)
if word in target:
    print("mots prédit : ", word)
else :
    print("mots prédit : ", target[0])

********** SANS FINE TUNE *********************

d'un sac [MASK] ne peut tirer deux moutures
mot correct:  on
mots prédit :  on

********** AVEC FINE TUNE *********************

d'un sac [MASK] ne peut tirer deux moutures
mot correct:  on
mots prédit :  on
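The comparison cells above repeat the same code and print the expected word whenever it appears anywhere among the five candidates, which hides its actual rank. A possible refactoring is sketched below. `compare_models` is a hypothetical helper, and `model_a` and `model_b` are toy stand-ins whose rankings are invented for the example; they are not outputs of the real models.

```python
def compare_models(sentence, word, models, k=5):
    """For each model, report its top candidate and the rank of the expected word."""
    report = {}
    for name, fill_mask in models.items():
        ranked = [p["token_str"].strip() for p in fill_mask(sentence, top_k=k)]
        rank = ranked.index(word) + 1 if word in ranked else None
        report[name] = {"top1": ranked[0], "rank": rank}
    return report


def model_a(sentence, top_k=5):
    """Toy stand-in with an invented ranking."""
    return [{"token_str": w} for w in ["plus", "moins", "tard"][:top_k]]


def model_b(sentence, top_k=5):
    """Toy stand-in with an invented ranking."""
    return [{"token_str": w} for w in ["tard", "plus"][:top_k]]


report = compare_models(
    "mieux vaut [MASK] que jamais", "tard", {"a": model_a, "b": model_b}
)
for name, row in report.items():
    print(name, row)
```

With the real pipelines, the call would pass `generator` and `generator_pretrained` in the `models` dictionary.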
