# HeBERT vs. mBERT
> "Testing the new model for Hebrew on cPOS-tagging"

- toc:false
- branch: master
- badges: true
- comments: true
- author: Stav Klein
- categories: [fastpages, jupyter]
- image: images/oscar-bert.png

In [None]:
#hide
import pandas as pd

### Data Overview

Avichay Chriqui and Dr. Inbal Yahav Shenberger recently released HeBERT, a new model for Hebrew based on BERT's architecture and trained on Hebrew data that includes the OSCAR corpus. The model is available through the Hugging Face library: the tokenizer is loaded with `AutoTokenizer` (a WordPiece tokenizer, since the model is BERT-based) and the model itself with `AutoModelForMaskedLM`.
The key differences between HeBERT and mBERT are summarized in the table below:


In [None]:
#hide_input
models = {'HeBERT': ['~ 30K', 'Hebrew Wiki, Hebrew OSCAR, User-generated content'],
          'mBERT': ['~ 2K', 'Hebrew Wiki']}

# Build a DataFrame with one column per model and one row per property.
models_df = pd.DataFrame(models, index=['# of word-pieces', 'Training Data'])

models_df

| | HeBERT | mBERT |
|---|---|---|
| # of word-pieces | ~ 30K | ~ 2K |
| Training Data | Hebrew Wiki, Hebrew OSCAR, User-generated content | Hebrew Wiki |


In [None]:
#hide_input
data = {'Size': ['650MB', '9.8GB', '150MB'],
        '# of sentences': ['3.8M', '20.8M', '350K']}

# Build a DataFrame with one row per training corpus.
data_df = pd.DataFrame(data, index=['Hebrew Wiki', 'Hebrew OSCAR', 'UGC'])

data_df

| | Size | # of sentences |
|---|---|---|
| Hebrew Wiki | 650MB | 3.8M |
| Hebrew OSCAR | 9.8GB | 20.8M |
| UGC | 150MB | 350K |


It's clear from both tables that HeBERT is an order of magnitude larger than the Hebrew part of multilingual BERT, both in vocabulary and in training data. I reproduced two experimental settings from this [paper from SIGMORPHON 2020](https://www.aclweb.org/anthology/2020.sigmorphon-1.24/) (see also the corresponding [blog post](https://stavkl.github.io/linguistics-for-nlp/fastpages/jupyter/2020/09/21/getting-the-life.html) for more details) in order to examine how much huge amounts of data contribute to the complex-POS tagging task. Both experiments ran in exactly the same settings; the only differences were the choice of model and tokenizer.<br>
Below are examples of the tokenization differences between the two models:<br>
**mBERT tokenization**<br>
![](https://github.com/stavkl/linguistics-for-nlp/raw/master/images/tokenized-examples/mbert-tokenized.PNG)<br>
**HeBERT tokenization**<br>
![](https://github.com/stavkl/linguistics-for-nlp/raw/master/images/tokenized-examples/hebert-tokenized.PNG)<br>
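
To make the difference concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece tokenizers apply to a single word. The two vocabularies and the transliterated word are hypothetical toy data that I made up for illustration; they are not taken from the real mBERT or HeBERT vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word greedily, longest match first (WordPiece-style)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumption): the larger one also stores the full word.
small_vocab = {"ve", "##ha", "##bayit"}
large_vocab = small_vocab | {"vehabayit"}

word = "vehabayit"  # transliteration of "and the house"
small_split = wordpiece_tokenize(word, small_vocab)
large_split = wordpiece_tokenize(word, large_vocab)
print(small_split)  # ['ve', '##ha', '##bayit']
print(large_split)  # ['vehabayit']
```

With the small vocabulary the word happens to break along morpheme-like boundaries, while the large vocabulary keeps it whole. Real vocabularies give no such guarantee in either direction; the sketch only shows the mechanism.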

### Experiment 1: Word-level Multitag

In [None]:
#hide_input
exp1 = {'HeBERT': ['94.63', '96.13'],
        'mBERT': ['92.45', '94.09']}

# Build a DataFrame with one column per model and one row per metric.
exp1_df = pd.DataFrame(exp1, index=['Exact Match', 'F1'])

exp1_df

| | HeBERT | mBERT |
|---|---|---|
| Exact Match | 94.63 | 92.45 |
| F1 | 96.13 | 94.09 |


This setting shows an improvement in both exact-match accuracy and existence-F1. It indicates that in settings where access to a word's inner structure is not crucial, it might be better to use a larger model. Nevertheless, the results for HeBERT here are still only on par with models like [YAP](https://www.aclweb.org/anthology/Q19-1003.pdf) (and YAP is actually tested on a harder problem, so it's not really a like-for-like comparison).
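
For readers who want to see how two such scores can differ, here is a rough sketch of both metrics. I am assuming here that "existence F1" is a micro-averaged F1 over which tags appear in each word's multitag, ignoring their order; the exact definition used in the experiments is the one in the paper. The tag names and the gold/predicted lists are made up.

```python
def exact_match(gold, pred):
    """Share of words whose predicted multitag equals the gold multitag exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def existence_f1(gold, pred):
    """Micro-averaged F1 over tag presence per word (assumed definition)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp += len(g_set & p_set)  # tags predicted and present in gold
        fp += len(p_set - g_set)  # tags predicted but absent from gold
        fn += len(g_set - p_set)  # gold tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical multitags for three words; the last prediction misses one tag.
gold = [("CONJ", "DEF", "NOUN"), ("VERB",), ("PREP", "NOUN")]
pred = [("CONJ", "DEF", "NOUN"), ("VERB",), ("NOUN",)]

em = exact_match(gold, pred)   # 2 of 3 words match exactly
f1 = existence_f1(gold, pred)  # 5 of 6 gold tags found, no spurious tags
print(round(em, 3), round(f1, 3))
```

Exact match punishes the third word completely, while the F1 gives partial credit for the tag that was found, which is why the F1 row is the higher of the two in the table above.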

### Experiment 2: (Multi)-tag per Wordpiece
Recall that in this setting each wordpiece can receive a possibly different tag or multi-tag. The results show that for a significantly larger model, assigning each wordpiece its own tag ends up hurting performance. This is probably because the procedure relies on heuristics about Hebrew morphology, and because it was designed with many wordpieces per word in mind, whereas in HeBERT many words don't break into wordpieces at all.

In [None]:
#hide_input
exp2 = {'HeBERT': ['84.86', '84.62'],
        'mBERT': ['86.66', '88.71']}

# Build a DataFrame with one column per model and one row per metric.
exp2_df = pd.DataFrame(exp2, index=['Exact Match', 'F1'])

exp2_df

| | HeBERT | mBERT |
|---|---|---|
| Exact Match | 84.86 | 86.66 |
| F1 | 84.62 | 88.71 |
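
One way to check the "many words don't break into wordpieces" explanation would be to measure how often each tokenizer actually splits a word. The sketch below is only an illustration of such a measurement: `split_stats` is a hypothetical helper, and the two tokenizations are made-up transliterated examples, not output from mBERT or HeBERT.

```python
def split_stats(tokenized_words):
    """Average number of pieces per word and share of words left unsplit."""
    total_words = len(tokenized_words)
    total_pieces = sum(len(pieces) for pieces in tokenized_words)
    unsplit = sum(1 for pieces in tokenized_words if len(pieces) == 1)
    return {
        "pieces_per_word": total_pieces / total_words,
        "unsplit_share": unsplit / total_words,
    }


# Hypothetical tokenizations of the same three words (assumption, for illustration only).
small_vocab_output = [["ve", "##ha", "##bayit"], ["gadol"], ["me", "##od"]]
large_vocab_output = [["vehabayit"], ["gadol"], ["me", "##od"]]

small_stats = split_stats(small_vocab_output)
large_stats = split_stats(large_vocab_output)
print(small_stats)
print(large_stats)
```

If the unsplit share is high for a tokenizer, a per-wordpiece tagging scheme has few pieces to distribute tags over, which would be consistent with the drop seen for HeBERT in this setting.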


### My linguistic take on things: Using huge models for Hebrew is like solving a 100-piece puzzle using 30,000 pieces

The results suggest that for tasks that don't require specific knowledge about internal structure, it might be better to use a significantly larger model like HeBERT. However, the improvement is not big, and it seems possible to achieve a similar gain by taking other measures on the original mBERT (and so avoid the huge training cost altogether). Here are some ideas:

1. Change the tokenizer - I think that's the single most meaningful thing that can be done for processing Hebrew (and other Semitic languages). It was already shown that wordpieces are useless for modelling complex morphology.
2. Related to that - I hypothesize that Hebrew models can use a much smaller vocabulary. Hebrew is really more like a 100-piece puzzle, in the sense that many morphemes have a well-defined purpose and connect to other morphemes in very specific ways. Having 30K pieces actually makes the puzzle much harder to solve. Improving the model's "knowledge" of affixation processes can reduce training costs and might lead to better morphological disambiguation, and so to other improvements down the pipeline.
3. Impose some structure on the vocabulary, instead of just going over a huge list looking for the largest wordpiece we can fit. This would require some rule-based intervention in the process, but it might be worth checking.
4. Change the task - language modeling predicts a word from its context (classically the next word in a sentence; in BERT's case, a masked word). This works great for English, since English sentences exhibit a strict word order, and so by learning which word fits a position we also encode syntactic roles. This is not the case for Hebrew, where word order can vary a lot. However, the internal structure of a Hebrew word does NOT vary a lot, so we can try transferring "next-word prediction" to the character level to capture more morphological knowledge (though it would not encode the non-concatenative template; that remains a problem).
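
As a rough illustration of the last idea, this is what the training examples for character-level "next-symbol prediction" could look like for a single word. It is only the data-preparation step under my own assumptions: the boundary markers `<w>` and `</w>` and the function name are hypothetical, the word is transliterated, and no model is trained here.

```python
def next_char_examples(word, bos="<w>", eos="</w>"):
    """Turn one word into (context, next symbol) pairs for character-level prediction."""
    symbols = [bos, *word, eos]
    # Each example pairs all symbols seen so far with the symbol that follows them.
    return [(tuple(symbols[:i]), symbols[i]) for i in range(1, len(symbols))]


examples = next_char_examples("bayit")  # transliteration of "house"
for context, target in examples:
    print(context, "->", target)
```

Every word yields one example per character plus one for the end marker, so the model would see affixes in the fixed positions they occupy inside the word. As noted in the list, a purely left-to-right setup like this would still not capture the non-concatenative template.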



That's all for now,<br>
until next time<br>
Stav