# Do CamemBERT, FlauBERT and BERT understand negation?

Short answer: no.

This notebook replicates the negation section of Ettinger (2020) on a French corpus. The idea is to negate simple propositions and test whether BERT, without fine-tuning, switches its answer accordingly.

This principle closely resembles the Winograd Schema Challenge.

Here is an example in French:

```
id;masked;tgt;item;cond;right_answer;options
0;La truite est un <mask>;poisson;0;TA;poisson;poisson outil
1;La truite n'est pas un <mask>;poisson;0;FN;outil;poisson outil
```
The `<mask>` token should be `poisson` in the first example and `outil` in the second.
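
One way to force a choice between the two candidates is to look up each option among the model's predictions for the masked position and keep the higher-scoring one. The sketch below assumes a fill-mask callable that returns a list of dictionaries with `token_str` and `score` keys, which is the shape returned by Hugging Face `transformers` fill-mask pipelines. `pick_option` and `fake_fill_mask` are hypothetical names, the scores are made up for illustration, and the wrappers in `frenchnlp` may work differently.

```python
from typing import Callable, Dict, List, Optional

Prediction = Dict[str, object]

def pick_option(fill_mask: Callable[[str], List[Prediction]],
                sentence: str,
                options: List[str]) -> Optional[str]:
    """Return the best-scoring option among the predictions, or None if no option appears."""
    best, best_score = None, float("-inf")
    for pred in fill_mask(sentence):
        token = str(pred["token_str"]).strip()
        score = float(pred["score"])
        if token in options and score > best_score:
            best, best_score = token, score
    return best

def fake_fill_mask(sentence: str) -> List[Prediction]:
    # Stand-in for a real model (assumption): it ignores negation entirely
    # and returns the same illustrative scores for every sentence.
    return [
        {"token_str": "poisson", "score": 0.62},
        {"token_str": "animal", "score": 0.20},
        {"token_str": "outil", "score": 0.01},
    ]

options = "poisson outil".split()
print(pick_option(fake_fill_mask, "La truite est un <mask>.", options))
print(pick_option(fake_fill_mask, "La truite n'est pas un <mask>.", options))
```

With this negation-blind stand-in, both sentences receive the same answer, which is exactly the failure mode the task is designed to expose.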

However, on a corpus of 18 sentence pairs, BERT failed to switch its response for any pair.

There's no reason for CamemBERT/FlauBERT to behave differently on this task.

This notebook shows that the French counterparts of BERT (CamemBERT and FlauBERT) indeed perform no better on a French corpus.

Please note that I wrote some wrapper functions to facilitate the use of French BERT models, published in a package named `frenchnlp`. Be sure to install it before running this notebook on your computer.

`!pip install frenchnlp`

In [1]:
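# Some utility functions

from frenchnlp import *

def change_verb(sent, new_verb):
    # Swap the copula for new_verb; the negated form "n'est" is handled first.
    # Matching " est " with spaces avoids touching words that merely contain "est".
    return sent.replace("n'est", "ne " + new_verb).replace(" est ", " " + new_verb + " ")

def change_header(sent, new_term):
    # Replace the leading article (La/Le) with new_term, e.g. "Le terme".
    if sent.startswith("La "):
        return sent.replace("La ", new_term + " ", 1)
    return sent.replace("Le ", new_term + " ", 1)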

## Corpus and variations

I translated the items of Ettinger (2020) into French and added some variations to test the effect of minor modifications to the original utterance.

For a sentence like `La truite est un <mask>`, the variants are:

* La truite représente un `<mask>`.
* La truite représente un `<mask>` en français.
* Le terme truite désigne un `<mask>`.
* Le terme truite désigne un `<mask>` en français.
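
The four variants can be derived from the base sentence with plain string substitution. The sketch below is a self-contained illustration; `swap_verb`, `swap_header` and `variants` are hypothetical names, and it assumes every item has the form `Le/La X est un <mask>` or `Le/La X n'est pas un <mask>`.

```python
from typing import List

def swap_verb(sentence: str, new_verb: str) -> str:
    """Replace the copula, handling the negated form first."""
    if " n'est " in sentence:
        return sentence.replace(" n'est ", f" ne {new_verb} ", 1)
    return sentence.replace(" est ", f" {new_verb} ", 1)

def swap_header(sentence: str, header: str = "Le terme") -> str:
    """Replace only the leading article with the given header."""
    for article in ("La ", "Le "):
        if sentence.startswith(article):
            return header + " " + sentence[len(article):]
    return sentence  # unknown pattern: leave the sentence untouched

def variants(sentence: str) -> List[str]:
    """Build the four variants listed above, in the same order."""
    verb = swap_verb(sentence, "représente")
    whole = swap_verb(swap_header(sentence), "désigne")
    return [verb + ".", verb + " en français.",
            whole + ".", whole + " en français."]

for base in ("La truite est un <mask>", "La truite n'est pas un <mask>"):
    for variant in variants(base):
        print(variant)
```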

In two of the variants I append `en français` because BERT may work better when both left and right contexts are provided: unlike GPT-3, it was not trained on a traditional left-to-right language modeling task.

As you can see from the dataframe, `<mask>` is also replaced with `<special1>` to match FlauBERT's mask token.

In [2]:
df = xo_load_data("negation_french.csv")
df["masked_cam"] = df["masked"].apply(lambda x:x+".")
df["masked_flau"] = df["masked_cam"].apply(lambda x:x.replace("<mask>","<special1>")) 
df["ch_verb"] = df["masked"].apply(lambda x: change_verb(x,"représente")+".")
df["ch_verb_flau"] = df["masked"].apply(lambda x:change_verb(x,"représente").replace("<mask>","<special1>")+".")
df["ch_verb_add_right"] = df["ch_verb"].apply(lambda x:x.replace(".","")+" en français.")
df["ch_verb_add_right_flau"] = df["ch_verb_flau"].apply(lambda x:x.replace(".","")+" en français.")
df["ch_whole"] = df["masked_cam"].apply(lambda x:change_verb(change_header(x,"Le terme"),"désigne"))
df["ch_whole_flau"] = df["masked_flau"].apply(lambda x:change_verb(change_header(x,"Le terme"),"désigne"))
df["ch_whole_add_right"] = df["ch_whole"].apply(lambda x:x.replace(".","")+" en français.")
df["ch_whole_add_right_flau"] = df["ch_whole_flau"].apply(lambda x:x.replace(".","")+" en français.")
df.head(2)

Unnamed: 0,id,masked,tgt,item,cond,right_answer,options,response1,prob1,response2,...,masked_cam,masked_flau,ch_verb,ch_verb_flau,ch_verb_add_right,ch_verb_add_right_flau,ch_whole,ch_whole_flau,ch_whole_add_right,ch_whole_add_right_flau
0,0,La truite est un <mask>,poisson,0,TA,poisson,poisson outil,0,0,0,...,La truite est un <mask>.,La truite est un <special1>.,La truite représente un <mask>.,La truite représente un <special1>.,La truite représente un <mask> en français.,La truite représente un <special1> en français.,Le terme truite désigne un <mask>.,Le terme truite désigne un <special1>.,Le terme truite désigne un <mask> en français.,Le terme truite désigne un <special1> en franç...
1,1,La truite n'est pas un <mask>,poisson,0,FN,outil,poisson outil,0,0,0,...,La truite n'est pas un <mask>.,La truite n'est pas un <special1>.,La truite ne représente pas un <mask>.,La truite ne représente pas un <special1>.,La truite ne représente pas un <mask> en franç...,La truite ne représente pas un <special1> en f...,Le terme truite ne désigne pas un <mask>.,Le terme truite ne désigne pas un <special1>.,Le terme truite ne désigne pas un <mask> en fr...,Le terme truite ne désigne pas un <special1> e...


In [3]:
# Camembert
pipeline = xo_fillin("camembert-base",1000)
cam_results = xo_produce_answers(pipeline,"masked_cam",df)

# Flaubert
pipeline_flau = xo_fillin("flaubert/flaubert_base_cased",1000)
flau_results = xo_produce_answers(pipeline_flau,"masked_flau",df)

Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Results on the standard version with no variation

For examples like `La truite est un <mask>`, the accuracy is 50%, with equal numbers of right and wrong responses.

Interestingly, both models behaved the same way.

This is somewhat expected because both models were trained on similar corpora using similar methods. Some differences exist, however: CamemBERT was trained using `whole word masking` and FlauBERT using `token masking`. For details, please read the original papers.

If no switch ever happened, the accuracy would be exactly 50%, since each pair would get one right and one wrong answer. On closer inspection, however, one successful switch does take place for CamemBERT.

```
La fourmi est un <mask>. insecte
La fourmi n'est pas un <mask>. légume
```

The accuracy still comes out at exactly 50% because another pair is switched with both answers wrong.

```
Le petit pois est un <mask>.	 bâtiment	
Le petit pois n'est pas un <mask>.	 légume
```
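
The arithmetic behind this can be checked with a toy tally. The sketch below assumes 18 pairs, of which 16 receive the same answer for both sentences, one is switched correctly and one is switched with both answers wrong. The 16/1/1 split is an assumption consistent with the observations above, not a breakdown reported by the notebook, and `count_correct` is a hypothetical helper.

```python
def count_correct(pairs):
    """pairs: list of (affirmative_ok, negative_ok) booleans, one tuple per item."""
    return sum(int(aff) + int(neg) for aff, neg in pairs)

# Same answer for both sentences: exactly one of the two is right.
no_switch = [(True, False)] * 16
# Correct switch, e.g. fourmi: insecte / légume.
good_switch = [(True, True)]
# Switch with both answers wrong, e.g. petit pois: bâtiment / légume.
bad_switch = [(False, False)]

pairs = no_switch + good_switch + bad_switch
correct = count_correct(pairs)
total = 2 * len(pairs)
print(correct, total, 100 * correct / total)
```

The gain from the correct switch is cancelled by the loss from the wrong one, so the headline accuracy alone cannot tell a model that never switches from one that switches at random.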

In [4]:
# 50% in both cases
print(xo_compute_score("right_answer",cam_results))
print(xo_compute_score("right_answer",flau_results))

   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0       18            0            18               36        50.0     50.0   

   reussite  
0      50.0  
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0       18            0            18               36        50.0     50.0   

   reussite  
0      50.0  


In [207]:
cam_results[["masked_cam","options","response1","prob1","response2","prob2"]].to_csv("res_cam1.csv",index=False)
flau_results[["masked_flau","options","response1","prob1","response2","prob2"]].to_csv("res_flau1.csv",index=False)

## Results on the version with être replaced by représente

The accuracy decreases mainly because of the asymmetry between responses for affirmative items and those for negative items. Put simply, in some cases CamemBERT/FlauBERT answered only one sentence of a pair, and that one answer was wrong.

In [208]:
pipeline = xo_fillin("camembert-base",1000)
cam_results_ch_verb = xo_produce_answers(pipeline,"ch_verb",df)
pipeline_flau = xo_fillin("flaubert/flaubert_base_cased",1000)
flau_results_ch_verb = xo_produce_answers(pipeline_flau,"ch_verb_flau",df)
print(xo_compute_score("right_answer",cam_results_ch_verb))
print(xo_compute_score("right_answer",flau_results_ch_verb))

Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0       15            3            18               36       41.67    45.45   

   reussite  
0     45.83  
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0       15            3            18               36       41.67    45.45   

   reussite  
0     45.83  


In [209]:
cam_results_ch_verb[["ch_verb","options","response1","prob1","response2","prob2"]].to_csv("res_cam2.csv",index=False)
flau_results_ch_verb[["ch_verb_flau","options","response1","prob1","response2","prob2"]].to_csv("res_flau2.csv",index=False)

## Results on the version with être replaced by représente and a right context

The accuracy decreases further for the same reason mentioned earlier.

In [210]:
pipeline = xo_fillin("camembert-base",1000)
cam_results_ch_verb_add_right = xo_produce_answers(pipeline,"ch_verb_add_right",df)
pipeline_flau = xo_fillin("flaubert/flaubert_base_cased",1000)
flau_results_ch_verb_add_right_flau = xo_produce_answers(pipeline_flau,"ch_verb_add_right_flau",df)
# exactitude drops to 30.56% in both cases
print(xo_compute_score("right_answer",cam_results_ch_verb_add_right))
print(xo_compute_score("right_answer",flau_results_ch_verb_add_right_flau))

Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0       11           13            12               36       30.56    47.83   

   reussite  
0     48.61  
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0       11           13            12               36       30.56    47.83   

   reussite  
0     48.61  


In [211]:
cam_results_ch_verb_add_right[["ch_verb_add_right","options","response1","prob1","response2","prob2"]].to_csv("res_cam3.csv",index=False)
flau_results_ch_verb_add_right_flau[["ch_verb_add_right_flau","options","response1","prob1","response2","prob2"]].to_csv("res_flau3.csv",index=False)

## Results on the version with être replaced by désigner and a right context

The accuracy decreases even further, for the same reason mentioned earlier. Also note the large number of non-responses. This shows that how the utterance is phrased has an effect on the responses.

Note that in this case, the `qualite` measure considering only the answered sentences is more informative.
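
The three scores printed by `xo_compute_score` can be reproduced from the raw counts. The formulas below are inferred from the numbers printed in this notebook, not taken from the `frenchnlp` source, so treat them as an assumption: `exactitude` is correct answers over all items, `qualite` is correct answers over answered items only, and `reussite` gives half credit to each unanswered item. `scores` is a hypothetical helper name.

```python
def scores(correct: int, no_response: int, bad_response: int) -> dict:
    """Recompute the three percentages from raw counts (inferred formulas)."""
    total = correct + no_response + bad_response
    answered = correct + bad_response
    return {
        "exactitude": round(100 * correct / total, 2),
        # Only answered items count here; guard against an all-silent run.
        "qualite": round(100 * correct / answered, 2) if answered else 0.0,
        # Each unanswered item is worth half a correct answer.
        "reussite": round(100 * (correct + 0.5 * no_response) / total, 2),
    }

# Counts (correct, no_response, bad_response) from the result tables in this notebook.
for counts in [(18, 0, 18), (15, 3, 18), (11, 13, 12), (9, 20, 7)]:
    print(counts, scores(*counts))
```

Under these formulas, `qualite` rises as non-responses increase even while `exactitude` falls, which is why it is the more informative figure when many sentences go unanswered.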

In [212]:
pipeline = xo_fillin("camembert-base",1000)
cam_results_ch_whole_add_right = xo_produce_answers(pipeline,"ch_whole_add_right",df)
pipeline_flau = xo_fillin("flaubert/flaubert_base_cased",1000)
flau_results_ch_whole_add_right_flau = xo_produce_answers(pipeline_flau,"ch_whole_add_right_flau",df)
# exactitude drops to 25% in both cases
print(xo_compute_score("right_answer",cam_results_ch_whole_add_right))
print(xo_compute_score("right_answer",flau_results_ch_whole_add_right_flau))

Some weights of FlaubertWithLMHeadModel were not initialized from the model checkpoint at flaubert/flaubert_base_cased and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0        9           20             7               36        25.0    56.25   

   reussite  
0     52.78  
   correct  no_response  bad_response  total_responses  exactitude  qualite  \
0        9           20             7               36        25.0    56.25   

   reussite  
0     52.78  


In [213]:
cam_results_ch_whole_add_right[["ch_whole_add_right","options","response1","prob1","response2","prob2"]].to_csv("res_cam4.csv",index=False)
flau_results_ch_whole_add_right_flau[["ch_whole_add_right_flau","options","response1","prob1","response2","prob2"]].to_csv("res_flau4.csv",index=False)

## Conclusion

This simple experiment shows the insensitivity of BERT-like language models to negation. The observations reported by Ettinger (2020) also hold on a comparable French corpus.

What improvements can be made?

Adding self-supervised tasks that require more sophisticated linguistic information than simple linear order, such as the syntactic information added by Xu et al. (2020), could be helpful.

## References

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019.

Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.

Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., ... & Schwab, D. (2019). FlauBERT: Unsupervised language model pre-training for French. arXiv preprint arXiv:1912.05372.

Martin, L., Muller, B., Suárez, P. J. O., Dupont, Y., Romary, L., de la Clergerie, É. V., ... & Sagot, B. (2019). CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.

Xu, Z., Guo, D., Tang, D., Su, Q., Shou, L., Gong, M., Zhong, W., Quan, X., Duan, N., & Jiang, D. (2020). Syntax-enhanced pre-trained model. arXiv preprint arXiv:2012.14116.
