# "Getting the life out of Living"
> "A walkthrough creating, fine-tuning and evaluating the data for the SIGMORPHON 2020 paper"

- toc:false
- branch: master
- badges: true
- comments: true
- author: Stav Klein
- categories: [fastpages, jupyter]
- image: images/draw-bert.png

### Importing the libraries
Note that the 'bclm' library is an internal library developed in the ONLP lab, and is currently unavailable for public use. All the results are obtainable and reproducible with standard parsing of CoNLL-formatted files (all available on the ONLP GitHub page).

In [None]:
import os
import csv
import pandas as pd
import numpy as np
from tqdm import tqdm, trange
import bclm

import torch
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from transformers import BertTokenizer, BertConfig
from transformers import BertForTokenClassification, AdamW

Using TensorFlow backend.


### Manually setting seeds
Results are calculated as the average of 5 independent runs of this code. The seeds are set to minimize the variation between runs. Due to internal randomization in PyTorch, results may still differ from run to run even when using the same seeds (see the [PyTorch documentation](https://pytorch.org/docs/stable/notes/randomness.html)).

In [None]:
torch.manual_seed(3)
np.random.seed(3)
torch.cuda.manual_seed_all(3)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

### Taking a first and important look at the data
Data is split into train, dev and test sets. We fine-tune the model on the train set and evaluate it on the dev set.

In [None]:
train = bclm.read_dataframe('spmrl', subset='train')
train_df = bclm.get_token_df(train, ['upostag'])
train_df['token_str'] = train_df['token_str'].str.replace('”','"')

dev = bclm.read_dataframe('spmrl', subset='dev')
dev_df = bclm.get_token_df(dev, ['upostag'])
dev_df['token_str'] = dev_df['token_str'].str.replace('”','"')
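
Since 'bclm' is not public, here is a rough sketch of how a token table with the same columns could be built by hand. The file layout is an assumption made for illustration only (one token per line as `token_id<TAB>form<TAB>tag`, with a blank line between sentences), and `read_token_rows` is a hypothetical helper, not part of bclm. The real SPMRL files have more columns, so the field indices would need adjusting.

```python
def read_token_rows(conll_text, subset):
    """Parse blank-line-delimited sentences into one dict per token.

    ASSUMED layout (hypothetical, not the real SPMRL format):
    token_id <TAB> form <TAB> tag
    """
    rows = []
    sent_id = 1
    tokens_in_sentence = 0
    for line in conll_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence (if it had any tokens).
            if tokens_in_sentence:
                sent_id += 1
                tokens_in_sentence = 0
            continue
        if line.startswith("#"):
            # Skip comment lines.
            continue
        token_id, form, tag = line.split("\t")[:3]
        rows.append({
            "sent_id": sent_id,
            "token_id": int(token_id),
            "token_str": form,
            "upostag": tag,
            "set": subset,
        })
        tokens_in_sentence += 1
    return rows


# Toy input: two sentences, three tokens in total.
sample = "1\tעשרות\tCDT\n2\tאנשים\tNN\n\n1\tהם\tPRP\n"
rows = read_token_rows(sample, "dev")
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` gives a frame with the same column names used in the rest of this post.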

### Here are the first lines of the dev set
Notice that some words have a multi-tag while other words carry a simple tag - a challenge for Hebrew POS-tagging.

In [None]:
dev_df.head(20)

Unnamed: 0,sent_id,token_id,token_str,upostag,set
0,1,1,עשרות,CDT,dev
1,1,2,אנשים,NN,dev
2,1,3,מגיעים,BN,dev
3,1,4,מתאילנד,PREPOSITION^NNP,dev
4,1,5,לישראל,PREPOSITION^NNP,dev
5,1,6,כשהם,TEMP^PRP,dev
6,1,7,נרשמים,BN,dev
7,1,8,כמתנדבים,PREPOSITION^NN,dev
8,1,9,",",yyCM,dev
9,1,10,אך,CC,dev


### Uniform column names
We evaluated on several datasets, so uniform column names were needed.

In [None]:
train_df.rename(columns = {"token_str": "form"}, inplace = True)
dev_df.rename(columns = {"token_str": "form"}, inplace = True)

### Get lists of sentences and their corresponding labels
An example sentence from the train set is printed at the end. Also, not shown here, we removed the four longest sentences from the dev set, as they caused systematic issues later on in the code.

In [None]:
class sentenceGetter(object):
    def __init__(self, data, max_sent=None):
        self.index = 0
        self.max_sent = max_sent
        self.tokens = data['form']
        self.labels = data['upostag']
        #for evaluating by word-accuracy
        self.correspondingToken = data['token_id']
        self.orig_sent_id = data['sent_id']
    
    def sentences(self):
        sent = []
        counter = 0
        
        for token,label, corres_tok, sent_id in zip(self.tokens, self.labels, self.correspondingToken, self.orig_sent_id):
            sent.append((token, label, corres_tok, sent_id))
            if token.strip() == ".":
                yield sent
                sent = []
                counter += 1
            if self.max_sent is not None and counter >= self.max_sent:
                return

train_getter = sentenceGetter(train_df)
dev_getter = sentenceGetter(dev_df)
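# NOTE: test_df is not built in the cells shown here; it is assumed to be loaded the same way as train_df and dev_df
test_getter = sentenceGetter(test_df)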

train_sentences = [[token for token, label, corres_tok, sent_id in sent] for sent in train_getter.sentences()]
train_labels = [[label for token, label, corres_tok, sent_id in sent] for sent in train_getter.sentences()]

dev_sentences = [[token for token, label, corres_tok, sent_id in sent] for sent in dev_getter.sentences()]
dev_labels = [[label for token, label, corres_tok, sent_id in sent] for sent in dev_getter.sentences()]
dev_corresTokens = [[corres_tok for token, label, corres_tok, sent_id in sent] for sent in dev_getter.sentences()]
dev_sent_ids = [[sent_id for token, label, corres_tok, sent_id in sent] for sent in dev_getter.sentences()]

print(train_sentences[10])
print(train_labels[10])

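print(len(dev_sentences))
# NOTE: test_sentences is not built in the cells shown here; it is assumed to come from test_getter, like dev_sentences
print(len(test_sentences))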

['הם', 'התבקשו', 'לדווח', 'למשטרה', 'על', 'תנועותיהם', '.']
['PRP', 'VB', 'VB', 'PREPOSITION^DEF^NN', 'IN', 'NN', 'yyDOT']
490
712
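
The removal of the four longest dev sentences is not shown in this post. One possible way to do it, assuming "longest" means the number of tokens, is sketched below. `drop_longest` is a hypothetical helper and the data is a toy example; the point is only that the parallel lists (labels, token ids, sentence ids) have to stay aligned with the sentences.

```python
def drop_longest(sentences, *parallel_lists, k=4):
    """Remove the k longest sentences and the matching entries of every parallel list."""
    # Indices of the k longest sentences (ties go to the earlier sentence).
    by_length = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    to_drop = set(by_length[:k])
    keep = [i for i in range(len(sentences)) if i not in to_drop]
    # Apply the same selection to the sentences and to every parallel list.
    return tuple([lst[i] for i in keep] for lst in (sentences, *parallel_lists))


toy_sents = [["a"], ["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"]]
toy_labels = [["X"], ["X", "Y", "Z"], ["X", "Y"], ["X", "Y", "Z", "W"]]

kept_sents, kept_labels = drop_longest(toy_sents, toy_labels, k=2)
print(kept_sents)
print(kept_labels)
```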


### Put everything on CUDA

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.set_device(0)

print("Device: " + str(device))
print("Number of gpus: " + str(n_gpu))
print("Name of gpu: " + torch.cuda.get_device_name(0))

Device: cuda
Number of gpus: 4
Name of gpu: GeForce RTX 2080 Ti


In [None]:
MAX_LEN = 150
bs = 32

### Tokenize the training set (with BERT)
See the example of what the tokenization looks like at the end.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
def tokenize(sentences, orig_labels):
    tokenized_texts = []
    labels = []
    for sent, sent_labels in zip(sentences, orig_labels):
        bert_tokens = []
        bert_labels = []
        for orig_token, orig_label in zip(sent, sent_labels):
            b_tokens = tokenizer.tokenize(orig_token)
            bert_tokens.extend(b_tokens)
            for b_token in b_tokens:
                bert_labels.append(orig_label)
        tokenized_texts.append(bert_tokens)
        labels.append(bert_labels)
        assert len(bert_tokens) == len(bert_labels)
    return tokenized_texts, labels

train_tokenized_texts, train_tokenized_labels = tokenize(train_sentences, train_labels)
print(train_tokenized_texts[10])
print(train_tokenized_labels[10])

['הם', 'ה', '##ת', '##בק', '##שו', 'ל', '##דו', '##וח', 'ל', '##משטרה', 'על', 'ת', '##נוע', '##ות', '##יהם', '.']
['PRP', 'VB', 'VB', 'VB', 'VB', 'VB', 'VB', 'VB', 'PREPOSITION^DEF^NN', 'PREPOSITION^DEF^NN', 'IN', 'NN', 'NN', 'NN', 'NN', 'yyDOT']


### Get a list of all possible POS-tag labels
Note that this list contains both simple and multi POS tags. There are 50 simple POS tags (in isolation), and 315 POS tags in total (simple and multi POS tags combined). Making even more complex tags (for example, joining the 'feats' tag as well) results in an exponentially larger and sparser label space.


In [None]:
data = train_df
tag_vals = list(set(data["upostag"].values))
tags = ['PAD'] + tag_vals
tag2idx = {tag:idx for idx, tag in enumerate(tags)}
idx2tag = {idx:tag for idx, tag in enumerate(tags)}

print(tag2idx)
# print(idx2tag)
print(len(tags))

{'PAD': 0, 'REL^PREPOSITION^CDT': 1, 'DEF^DTT': 2, 'CONJ^VB^AT^PRP': 3, 'CONJ^yyQUOT^NNT': 4, 'PREPOSITION^ADVERB^CD': 5, 'TEMP^PRP': 6, 'CONJ^TEMP^RB': 7, 'AT': 8, 'CONJ^REL^COP': 9, 'PREPOSITION^ADVERB^NCD': 10, 'PREPOSITION^ADVERB^CDT': 11, 'CONJ^DTT': 12, 'TEMP^PREPOSITION^DEF^NN': 13, 'TEMP^NNP': 14, 'IN^IN^NNT': 15, 'DEF^P': 16, 'ADVERB^DTT': 17, 'VB^AT^S_ANP': 18, 'CONJ^DEF^NNP': 19, 'CD': 20, 'PREPOSITION^PREPOSITION^DEF^PRP': 21, 'PREPOSITION^yyQUOT^DEF^NN': 22, 'REL^DTT': 23, 'CONJ^IN': 24, 'POS': 25, 'PREPOSITION^JJ': 26, 'TEMP^NNT': 27, 'REL^yyQUOT^NNP': 28, 'CONJ^NNP': 29, 'IN^NN': 30, 'PREPOSITION^CDT': 31, 'PREPOSITION^NNP': 32, 'CONJ^CC': 33, 'yyLRB': 34, 'DEF^NN': 35, 'BNT': 36, 'REL^DEF^BN': 37, 'REL^yyQUOT^JJ': 38, 'PREPOSITION^DEF^yyQUOT^NNP': 39, 'REL': 40, 'NN': 41, 'IN^NNT': 42, 'TEMP^RB': 43, 'DEF^RB': 44, 'REL^JJ': 45, 'IN': 46, 'JJ': 47, 'PREPOSITION^DEF^CD': 48, 'CONJ^REL^PREPOSITION^NN': 49, 'CONJ^yyQUOT^NN': 50, 'CONJ^PREPOSITION^CDT': 51, 'CONJ^MD': 52, 'I
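
To get a feel for how the label space breaks down, here is a small sketch that counts full labels, multi-tags (labels containing `^`) and the distinct simple parts they are made of. `label_space_stats` is a hypothetical helper and the labels are a toy sample, so the counts below say nothing about the real numbers quoted above.

```python
def label_space_stats(labels):
    """Count distinct full labels, multi-tags, and the simple parts they are built from."""
    full_tags = set(labels)
    multi_tags = {tag for tag in full_tags if "^" in tag}
    simple_parts = {part for tag in full_tags for part in tag.split("^")}
    return {
        "full": len(full_tags),
        "multi": len(multi_tags),
        "simple_parts": len(simple_parts),
    }


# Toy sample of word-level labels (not the real data).
toy_labels = ["NN", "PREPOSITION^NNP", "PREPOSITION^DEF^NN", "NN", "IN", "TEMP^PRP"]
stats = label_space_stats(toy_labels)
print(stats)
```

On the real data this could be called as `label_space_stats(train_df["upostag"])`.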

### All the technical stuff
We pad (or truncate) all sentences and their labels to the same length and convert them to tensors.

In [None]:
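def pad_sentences_and_labels(tokenized_texts, labels):
    # Convert word-pieces to vocabulary ids, then pad/truncate every sentence to MAX_LEN.
    # The pad value is tag2idx['PAD'] == 0, which is also the id of BERT's [PAD] token.
    input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                              maxlen = MAX_LEN, dtype = "float32", truncating = "post", padding = "post", value = tag2idx['PAD'])
    # Map labels to their indices and pad/truncate them the same way.
    tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in labels], 
                         maxlen = MAX_LEN, value = tag2idx['PAD'], padding = "post",
                        dtype = "float32", truncating = "post")
    # Attention mask: 1.0 for real word-pieces, 0.0 for padding.
    attention_masks = [[float(i>0) for i in ii] for ii in input_ids]
    return input_ids, tags, attention_masks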

input_ids, tags, attention_masks = pad_sentences_and_labels(train_tokenized_texts, train_tokenized_labels)
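
If you don't have Keras at hand, the padding step is easy to picture in plain Python. The sketch below mimics "post" padding and "post" truncation on toy ids and then builds the attention mask the same way as above. `pad_post` is a hypothetical stand-in written for this illustration, not the Keras function itself.

```python
def pad_post(sequences, max_len, pad_value=0):
    """Truncate each sequence at the end and pad it at the end up to max_len."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded


# Toy word-piece ids: one short sentence and one that is too long.
toy_ids = [[101, 7, 8], [5, 6, 7, 8, 9, 10]]
padded = pad_post(toy_ids, max_len=4)

# 1 for real word-pieces, 0 for padding (the pad id is 0).
mask = [[int(i > 0) for i in row] for row in padded]

print(padded)
print(mask)
```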

In [None]:
tr_inputs = torch.tensor(input_ids, dtype=torch.long)
tr_tags = torch.tensor(tags, dtype=torch.long)
tr_masks = torch.tensor(attention_masks, dtype=torch.long)

train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler = train_sampler, batch_size=bs)

### Performing the Fine-tuning
As you can see, we didn't perform any hyperparameter search. This is intentional. Hyperparameter optimization can of course improve the model (you are welcome to try it yourself), but our goal was not to build the best model. We wanted to investigate how well contextualized models can represent complex morphology, so all of our models use the same fine-tuning procedure described below. The only thing that changes is the label that each word-piece receives (described in detail in the theoretical post on this paper).

In [None]:
from transformers import get_linear_schedule_with_warmup

model = BertForTokenClassification.from_pretrained('bert-base-multilingual-cased',
                                                   num_labels=len(tag2idx),
                                                   output_attentions = False,
                                                   output_hidden_states = False)
model.cuda()
FULL_FINETUNING = True
if FULL_FINETUNING:
    param_optimizer = list(model.named_parameters())
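    # NOTE: transformers' AdamW reads the 'weight_decay' key of each parameter group.
    # The 'weight_decay_rate' key used below is not read by it, so both groups fall back
    # to the optimizer's default weight decay. It is left unchanged to match the reported runs.
    no_decay = ['bias', 'gamma', 'beta']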
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
else:
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]

optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5, eps=1e-8)

from seqeval.metrics import f1_score

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
#     print (pred_flat, labels_flat)
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

epochs = 15
max_grad_norm = 1.0

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)

## Store the average loss after each epoch so we can plot them.
loss_values, validation_loss_values = [], []
for _ in trange(epochs, desc="Epoch"):
    # TRAIN loop
    model.train()
    total_loss = 0

    for step, batch in enumerate(train_dataloader):
        # add batch to gpu
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        model.zero_grad()
        # forward pass
        outputs = model(b_input_ids, token_type_ids=None,
                     attention_mask=b_input_mask, labels=b_labels)
        # get the loss
        loss = outputs[0]
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # track train loss
        total_loss += loss.item() 
        # Clip the norm of the gradient
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        # update parameters
        optimizer.step()
        # Update the learning rate.
        scheduler.step()
        
    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)
    print("Average train loss: {}".format(avg_train_loss))
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

Epoch:   7%|▋         | 1/15 [00:58<13:45, 58.94s/it]

Average train loss: 2.2720749048810256


Epoch:  13%|█▎        | 2/15 [02:02<13:05, 60.43s/it]

Average train loss: 0.821627008679666


Epoch:  20%|██        | 3/15 [03:12<12:36, 63.06s/it]

Average train loss: 0.5413947211284387


Epoch:  27%|██▋       | 4/15 [04:27<12:16, 66.92s/it]

Average train loss: 0.39871315207136304


Epoch:  33%|███▎      | 5/15 [05:46<11:42, 70.26s/it]

Average train loss: 0.31264687878520864


Epoch:  40%|████      | 6/15 [07:04<10:54, 72.67s/it]

Average train loss: 0.25154486896568223


Epoch:  47%|████▋     | 7/15 [08:23<09:56, 74.55s/it]

Average train loss: 0.20845407497529922


Epoch:  53%|█████▎    | 8/15 [09:42<08:50, 75.82s/it]

Average train loss: 0.17870022454544118


Epoch:  60%|██████    | 9/15 [11:02<07:43, 77.18s/it]

Average train loss: 0.15089197540165564


Epoch:  67%|██████▋   | 10/15 [12:25<06:34, 78.94s/it]

Average train loss: 0.13223850584932065


Epoch:  73%|███████▎  | 11/15 [13:46<05:17, 79.48s/it]

Average train loss: 0.11638171909573047


Epoch:  80%|████████  | 12/15 [15:05<03:58, 79.42s/it]

Average train loss: 0.10619803432277158


Epoch:  87%|████████▋ | 13/15 [16:26<02:39, 79.99s/it]

Average train loss: 0.09557023781694864


Epoch:  93%|█████████▎| 14/15 [17:48<01:20, 80.63s/it]

Average train loss: 0.09014363510926303


Epoch: 100%|██████████| 15/15 [19:06<00:00, 76.45s/it]

Average train loss: 0.08390635945589135
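
With `num_warmup_steps=0`, the scheduler used above simply decays the learning rate linearly from its initial value to zero over `total_steps`. Here is a small sketch of that rule in plain Python; `linear_lr` is a hypothetical helper written for illustration and the step counts are toy values.

```python
def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Linear increase from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decrease from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 3e-5      # same initial learning rate as above
toy_total = 100     # toy number of steps, not the real len(train_dataloader) * epochs

lr_start = linear_lr(0, base_lr, toy_total)
lr_middle = linear_lr(50, base_lr, toy_total)
lr_end = linear_lr(100, base_lr, toy_total)
print(lr_start, lr_middle, lr_end)
```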





### More Hebrew Fun
Remember that the model predicts a POS tag (simple or multi, one of the 315 from above) for each word-piece. But we don't want to evaluate word-pieces; we want to evaluate whole words, both because words can have multi-tags and because we want a fair comparison with our baseline model, which uses the actual morphemes (see the example in the paper). Therefore all of our models are evaluated at the word level (as opposed to the morpheme level, for example).

This aggregation of word-piece (multi-)tags into whole-word tags concatenates the (multi-)tags of all the word-pieces of a word into a single multi-tag. It can easily be adjusted to take the (multi-)tag of the first word-piece as the tag for the whole word.

In [None]:
# Receives a sentence with its labels, the tokenized sentence with its labels, and the predicted tags.
# Note: the three word-piece lists are consumed from the front with pop(0).
def aggr_toks_labels_tags(orig_words, orig_labels, tok_wordps, tok_labels, predicted_tags):
    
    joint_tokens = []
    joint_labels = []
    joint_predicted = []
    
    for word in orig_words:
        aggregated_tokenized = ""
        aggregated_label = ""
        aggregated_predicted = ""
        aggregated_test = ""
        
        while aggregated_tokenized != word:
            tmpTok = tok_wordps.pop(0)
            if tmpTok.startswith("##"):
                tmpTok = tmpTok[2:]
                
            tmpLab = tok_labels.pop(0)
            aggregated_label += '^'
            aggregated_label += tmpLab

            tmpPred = predicted_tags.pop(0)

            aggregated_predicted += '^'
            aggregated_predicted += tmpPred
                
            aggregated_tokenized += tmpTok

            
        joint_tokens.append(aggregated_tokenized)
        joint_labels.append(aggregated_label)
        joint_predicted.append(aggregated_predicted)

        
    assert len(joint_tokens) == len(orig_words)
    assert len(joint_tokens) == len(joint_predicted)
    return joint_tokens, joint_labels, joint_predicted
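
For the alternative mentioned above, where the tag of the first word-piece stands for the whole word, a minimal sketch could look like this. It rebuilds each word from its pieces in the same way as the function above, but reads the lists by index instead of popping from them. `first_piece_tags` is a hypothetical helper, and the English tokens and tags are toy values.

```python
def first_piece_tags(orig_words, word_pieces, piece_tags):
    """Return one tag per word: the tag assigned to the word's first word-piece."""
    word_tags = []
    pos = 0
    for word in orig_words:
        # The first piece of the word decides the tag.
        word_tags.append(piece_tags[pos])
        # Skip over all the pieces that make up this word.
        rebuilt = ""
        while rebuilt != word:
            piece = word_pieces[pos]
            rebuilt += piece[2:] if piece.startswith("##") else piece
            pos += 1
    return word_tags


toy_words = ["unhappy", "cats", "."]
toy_pieces = ["un", "##hap", "##py", "cat", "##s", "."]
toy_piece_tags = ["A", "B", "B", "C", "D", "E"]

toy_word_tags = first_piece_tags(toy_words, toy_pieces, toy_piece_tags)
print(toy_word_tags)
```

Like the original function, this assumes that the word-pieces concatenate back to the original word exactly.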

### Building the evaluation models and evaluating on the dev set

In [None]:
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def delete_pads_from_preds(predicted_tags, test_tags):
    clean_predicted = []
    clean_test = []
    
    for ix in range(0, len(test_tags)):
        if test_tags[ix] != 'PAD':
            clean_predicted.append(predicted_tags[ix])
            clean_test.append(test_tags[ix])
            
    return clean_predicted, clean_test
    
def calculate_accuracy(df):
    numOfCorrectPredictions = 0
    for index in df.index:
        orig_pos = df.at[index, 'orig_label']
        pred_pos = df.at[index, 'predicted_tag']
        if orig_pos == pred_pos:
            numOfCorrectPredictions += 1
    return numOfCorrectPredictions/len(df)
                
def test_model(sentence, labels, tok_sent, tok_labels, corres_tokens, sent_id):
    input_ids, tags, attention_masks = pad_sentences_and_labels([tok_sent], [tok_labels])

    val_inputs = torch.tensor(input_ids, dtype=torch.long)
    val_tags = torch.tensor(tags, dtype=torch.long)
    val_masks = torch.tensor(attention_masks, dtype=torch.long)

    test_data = TensorDataset(val_inputs, val_masks, val_tags)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=bs)

    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions, true_labels = [], []
    counter = 0
    for batch in test_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            outputs = model(b_input_ids, token_type_ids=None,
                                attention_mask=b_input_mask, labels=b_labels)
        logits = outputs[1].detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        predictions.append([list(p) for p in np.argmax(logits, axis=2)])
        
        true_labels.append(label_ids)
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_loss += outputs[0].mean().item()
        eval_accuracy += flat_accuracy(logits, label_ids)

        nb_eval_examples += b_input_ids.size(0)
        nb_eval_steps += 1
    eval_loss = eval_loss / nb_eval_steps
    
    pred_tags = [idx2tag[p_ii] for p in predictions for p_i in p for p_ii in p_i]
    joint_tokenized, joint_labels, preds = aggr_toks_labels_tags(sentence, labels, tok_sent, tok_labels, 
                                                                        pred_tags)
    
    tmp = {'word': sentence, 'orig_label': labels, 'predicted_tag': preds, 
           'corresToken': corres_tokens, 'sent_id': sent_id}
    tmp_df = pd.DataFrame(data=tmp)
    return tmp_df

In [None]:
dev_tokenized_texts, dev_tokenized_labels = tokenize(dev_sentences, dev_labels)
eval_dfs = []
for sent, label, tok_sent, tok_label, corresTokens, sent_id in zip(dev_sentences, dev_labels, dev_tokenized_texts, 
                                                                   dev_tokenized_labels, dev_corresTokens, 
                                                                   dev_sent_ids):
    eval_dfs.append(test_model(sent, label, tok_sent, tok_label, corresTokens, sent_id))
# DataFrame.append was removed in pandas 2.0, so we concatenate the per-sentence frames once
full_dev_df = pd.concat(eval_dfs, ignore_index=True, sort=False)

### What the dev DataFrame looks like after evaluation
Under the 'predicted_tag' column we can see the aggregated tag. Some processing is still required for the predicted tag to be comparable to the 'orig_label' tag - see the next cells.



In [None]:
full_dev_df.head()

Unnamed: 0,word,orig_label,predicted_tag,corresToken,sent_id
0,עשרות,CDT,^CD,1,1
1,אנשים,NN,^NN,2,1
2,מגיעים,BN,^BN^BN,3,1
3,מתאילנד,PREPOSITION^NNP,^PREPOSITION^NNP^PREPOSITION^NNP^PREPOSITION^NNP,4,1
4,לישראל,PREPOSITION^NNP,^PREPOSITION^NNP,5,1


### Splitting the predicted tag
As mentioned in the theoretical post and in the paper, we split the predicted tag so that it includes only unique tags, in the order they were predicted. We make a list of tags from both the original tag and the predicted (uniquely filtered) tag, and those lists are then compared.

In [None]:
from more_itertools import unique_everseen

def unique_vals_to_list(df):
    for index in df.index:
        joint_pred = df.at[index, 'predicted_tag']
        joint_orig = df.at[index, 'orig_label']
        
        predicted_tag_list = joint_pred.split('^')
        predicted_tag_list_no_empty = list(filter(None, predicted_tag_list))
        original_tag_list = joint_orig.split('^')
        original_tag_list_no_empty = list(filter(None, original_tag_list))

        
        df.at[index, 'predicted_tag'] = list(unique_everseen(predicted_tag_list_no_empty))
        df.at[index, 'orig_label'] = list(unique_everseen(original_tag_list_no_empty))
        
        
unique_vals_to_list(full_dev_df)
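
To see what this does to a single value, here is a small standalone sketch of the same split-and-deduplicate step. It uses `dict.fromkeys`, which keeps the first occurrence of each item in order, instead of `unique_everseen`; `split_unique` is a hypothetical helper written for this illustration.

```python
def split_unique(joint_tag):
    """Split a '^'-joined tag string and keep each tag once, in order of first appearance."""
    parts = [part for part in joint_tag.split("^") if part]
    return list(dict.fromkeys(parts))


# Aggregated predictions in the same shape as the 'predicted_tag' column above.
repeated_multi = split_unique("^PREPOSITION^NNP^PREPOSITION^NNP^PREPOSITION^NNP")
repeated_simple = split_unique("^BN^BN")
print(repeated_multi)
print(repeated_simple)
```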

### Comparing predicted vs. original label
Two evaluation metrics are available: exact match and existence.
Exact match is pretty straightforward, an all-or-nothing approach: if the original label is IN^DEF^NN and the model predicted exactly that, great; if it predicted IN^NN, it doesn't count as a success. This is a tough test, sure. What if the model predicted the right tags but not in the right order (suppose DEF^IN^NN), or got just 2 tags out of 3? Doesn't that count for something?
Sure thing. So the second metric is existence: we count as correct the tags that appear in both the predicted and the original tags, calculate the standard precision and recall from those counts, and report the F1 score.

In [None]:
def exact_match_accuracy(df):
    exact_matches = 0
    for index in df.index:
        if df.at[index, 'orig_label'] == df.at[index, 'predicted_tag']:
            exact_matches += 1
            
    return exact_matches

# full_dev_df is the evaluated dev DataFrame built above
print("DEV - Exact Match Accuracy = {0:.2f}%".format(exact_match_accuracy(full_dev_df)/len(full_dev_df) * 100))

In [None]:
def existence_accuracy(df):
    # correct tag = appeared in predicted and in original
    total_orig_num_of_labels = 0
    total_predicted_num_of_labels = 0
    total_num_of_correct_tags = 0
    
    for index in df.index:
        orig_list = df.at[index, 'orig_label']
        predicted_list = df.at[index, 'predicted_tag']
        total_orig_num_of_labels += len(orig_list)
        total_predicted_num_of_labels += len(predicted_list)
        total_num_of_correct_tags += len(set(orig_list).intersection(set(predicted_list)))
        
    precision = total_num_of_correct_tags / total_predicted_num_of_labels * 100
    recall = total_num_of_correct_tags / total_orig_num_of_labels * 100
    f1 = 2*precision*recall/(precision+recall)
    
    print("Precision: {0:.2f}%".format(precision))
    print("Recall: {0:.2f}%".format(recall))
    print("F1: {0:.2f}%".format(f1))
    
print("DEV:")
existence_accuracy(full_dev_df)
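
To make the difference between the two metrics concrete, here is a toy example using the three cases discussed above (an exact prediction, the right tags in the wrong order, and 2 tags out of 3). The helper functions are simplified, hypothetical versions that work on plain lists instead of a DataFrame, and the pairs are made up for illustration.

```python
def exact_match(pairs):
    """Fraction of words whose predicted tag list equals the original tag list."""
    return sum(orig == pred for orig, pred in pairs) / len(pairs)


def existence_prf(pairs):
    """Precision, recall and F1 over tags that appear in both lists, ignoring order."""
    n_orig = sum(len(orig) for orig, _ in pairs)
    n_pred = sum(len(pred) for _, pred in pairs)
    n_correct = sum(len(set(orig) & set(pred)) for orig, pred in pairs)
    precision = n_correct / n_pred
    recall = n_correct / n_orig
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


toy_pairs = [
    (["IN", "DEF", "NN"], ["IN", "DEF", "NN"]),  # exact prediction
    (["IN", "DEF", "NN"], ["DEF", "IN", "NN"]),  # right tags, wrong order
    (["IN", "DEF", "NN"], ["IN", "NN"]),         # 2 tags out of 3
]

em = exact_match(toy_pairs)
precision, recall, f1 = existence_prf(toy_pairs)
print("Exact match: {0:.2f}".format(em))
print("Precision: {0:.2f}, Recall: {1:.2f}, F1: {2:.2f}".format(precision, recall, f1))
```

Only the first word counts for exact match, while existence gives credit for 8 of the 9 original tags.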

### That's it!
Some more things that can be done (and maybe will be done by me in the future...)


1.   Fine-tune on ordered sentences, that is, see if the learning improves if we apply some order to the sentences (for example, arrange the sentences from shortest to longest). The idea behind this is that shorter sentences probably have a simpler syntactic structure, and fine-tuning on them first will help the model learn basic and fundamental properties of Hebrew structure.
2.   Change the tokenizer - I think this is the biggest issue in processing Hebrew in general. It was demonstrated in the paper that BERT's tokenizer is not suitable for languages with complex morphology - we should probably use a different tokenizer for such languages (but that would also require pre-training BERT from scratch).
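
The first idea is cheap to try. A minimal sketch of ordering the training sentences from shortest to longest, keeping the labels aligned, could look like this. `order_by_length` is a hypothetical helper and the data is a toy example; nothing here was part of the experiments in the paper.

```python
def order_by_length(sentences, labels):
    """Sort sentences from shortest to longest and reorder the labels to match."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    return [sentences[i] for i in order], [labels[i] for i in order]


toy_sents = [["a", "b", "c"], ["a"], ["a", "b"]]
toy_labels = [["X", "Y", "Z"], ["X"], ["X", "Y"]]

sorted_sents, sorted_labels = order_by_length(toy_sents, toy_labels)
print(sorted_sents)
print(sorted_labels)
```

Note that the `RandomSampler` used for training would shuffle this order again, so a `SequentialSampler` would be needed to actually feed the sentences from shortest to longest.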

