# Master's thesis: Adapting the BERT model to Estonian

## Background

Input text -> **tokenizer** -> **embedding** -> transformer

In standard BERT, the tokenizer splits each word into pieces (a whole word or a subword n-gram), and each piece gets a three-component embedding (token embedding, segment embedding, and sequence embedding).

* Token embedding - the vector of the token itself
* Segment embedding - a vector selected by a 0/1 segment ID, used to tell the two parts of a paired input apart
* Sequence embedding - a vector that marks the position in the input text (usually called the position embedding)

The three embeddings are summed to give the input embedding for the transformer.  
See:
https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a
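
The summation described above can be sketched in a few lines of PyTorch. This is a toy illustration, not the Hugging Face implementation: the class name `SummedEmbeddings` and all sizes are made up for the example. The final layer normalization mirrors what BERT applies after the sum.

```python
import torch
import torch.nn as nn

class SummedEmbeddings(nn.Module):
    """Toy version of BERT's input embedding: token + segment + position."""

    def __init__(self, vocab_size=100, max_len=16, hidden=8):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)         # segment ID is 0 or 1
        self.position = nn.Embedding(max_len, hidden)  # one vector per position
        self.norm = nn.LayerNorm(hidden)

    def forward(self, input_ids, token_type_ids):
        # Positions 0..seq_len-1, shared by every sequence in the batch
        positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        summed = (self.token(input_ids)
                  + self.segment(token_type_ids)
                  + self.position(positions))
        return self.norm(summed)

emb = SummedEmbeddings()
input_ids = torch.tensor([[2, 57, 31, 3]])        # made-up token IDs
token_type_ids = torch.zeros_like(input_ids)      # single-segment input
out = emb(input_ids, token_type_ids)
print(out.shape)  # torch.Size([1, 4, 8])
```

All three lookups produce vectors of the same size, which is what makes an element-wise sum possible.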


## Task

Pass more information along with the input, using the estnltk tools (the morphology components). This requires:

1) Changing the tokenizer so that it yields a lemma and a form instead of a token  
2) Changing the embeddings so that a lemma embedding and a form embedding are looked up instead of the token embedding

## 1) Tokenizer

In [1]:
# Standard BERT tokenizer

from transformers import BertTokenizer, PreTrainedTokenizer
tokenizer = BertTokenizer.from_pretrained("tartuNLP/EstBERT")

tokenizer("Kuidas kirjutada magistritööd?")

{'input_ids': [2, 572, 3611, 11838, 1709, 49892, 229, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [2]:
# Optional: add the parent directory to the import path
#import sys
#import os
#current = os.path.dirname(os.path.realpath("test.ipynb"))
#parent = os.path.dirname(current)
#sys.path.append(parent)

In [2]:
# New tokenizer

from src.transformers.models.bert.tokenization_bert import BertTokenizer
tokenizer = BertTokenizer(vocab_file = "vocab.txt", vocab_file_form = "vocab_form.txt")
# Shorter alternative example:
# input_tekst = "Maril on paha tuju, aga see on fine."
input_tekst = "Tallinna linn algatab Paldiski maantee ääres Hotell Tallinna kõrval asuva suure vundamendiaugu ja tühermaa detailplaneeringu koostamise, ehitustööde alustamist takistavad aga ala segased omandisuhted."

print(tokenizer.tokenize(input_tekst))
print("")
print(tokenizer(input_tekst))

[('tallinna', ''), ('linn', 'sg n'), ('algatama', 'b'), ('paldiski', ''), ('maantee', 'sg g'), ('ääres', ''), ('hotell', 'sg n'), ('tallinna', ''), ('kõrval', ''), ('asuv', 'sg g'), ('suur', 'sg g'), ('vundamendiauk', 'sg g'), ('ja', ''), ('tühermaa', 'sg g'), ('detailplaneering', 'sg g'), ('koostamine', 'sg g'), (',', ''), ('ehitustöö', 'pl g'), ('alustamine', 'sg p'), ('takistama', 'vad'), ('aga', ''), ('ala', 'sg g'), ('segane', 'pl n'), ('omandisuhe', 'pl n'), ('.', '')]

{'input_ids': [(2, 2), (32476, 1), (2733, 34), (1, 53), (1, 1), (3937, 30), (4462, 1), (12535, 34), (32476, 1), (1773, 1), (8285, 30), (502, 30), (1, 30), (37, 1), (1, 30), (1, 30), (14457, 30), (11, 1), (25063, 45), (35501, 35), (1, 108), (179, 1), (440, 30), (18569, 49), (1, 49), (15, 1), (3, 3)], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [[1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1
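
In the output above, every position carries a pair (lemma ID, form ID), and anything missing from a vocabulary, including an empty form, appears as ID 1. A minimal sketch of such a lookup follows. The helper `encode_pairs` and the tiny vocabularies are hypothetical; the IDs are copied from the output above, and the real logic lives in the modified tokenizer.

```python
# Toy vocabularies: only a handful of entries, with IDs taken from the output above.
lemma_vocab = {"[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "linn": 2733, "maantee": 3937}
form_vocab = {"[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "sg n": 34, "sg g": 30}

def encode_pairs(pairs, lemma_vocab, form_vocab):
    """Map (lemma, form) pairs to (lemma_id, form_id) pairs, wrapped in [CLS] ... [SEP]."""
    unk_lemma = lemma_vocab["[UNK]"]
    unk_form = form_vocab["[UNK]"]
    ids = [(lemma_vocab["[CLS]"], form_vocab["[CLS]"])]
    for lemma, form in pairs:
        # Unknown lemmas and unknown or empty forms fall back to [UNK]
        ids.append((lemma_vocab.get(lemma, unk_lemma), form_vocab.get(form, unk_form)))
    ids.append((lemma_vocab["[SEP]"], form_vocab["[SEP]"]))
    return ids

pairs = [("linn", "sg n"), ("paldiski", ""), ("maantee", "sg g")]
print(encode_pairs(pairs, lemma_vocab, form_vocab))
# [(2, 2), (2733, 34), (1, 1), (3937, 30), (3, 3)]
```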

In [4]:
tokenizer

PreTrainedTokenizer(name_or_path='', vocab_size=50004, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [5]:
import torch
tokenizer_output = tokenizer(input_tekst)
words_tensor = torch.tensor([tokenizer_output["input_ids"]])
segments_tensor = torch.tensor([tokenizer_output["token_type_ids"]])
print(words_tensor)
print(segments_tensor)

tensor([[[    2,     2],
         [32476,     1],
         [ 2733,    34],
         [    1,    53],
         [    1,     1],
         [ 3937,    30],
         [ 4462,     1],
         [12535,    34],
         [32476,     1],
         [ 1773,     1],
         [ 8285,    30],
         [  502,    30],
         [    1,    30],
         [   37,     1],
         [    1,    30],
         [    1,    30],
         [14457,    30],
         [   11,     1],
         [25063,    45],
         [35501,    35],
         [    1,   108],
         [  179,     1],
         [  440,    30],
         [18569,    49],
         [    1,    49],
         [   15,     1],
         [    3,     3]]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]])


## 2) Embeddings

In [6]:
from src.transformers.models.bert.modeling_bert import BertEmbeddings, BertModel
from src.transformers.models.bert.configuration_bert import BertConfig
import torch


config = BertConfig()
embedding = BertEmbeddings(config)
model = BertModel.from_pretrained("tartuNLP/EstBERT")

# model = BertModel(config)
# Raises an index-out-of-range error, because the tokenizer uses the EstBERT vocab.txt file
# If a word has no form, it currently maps to the empty string -> ID 49881
# Standard BERT has a vocabulary of ~30,000 entries



Some weights of the model checkpoint at tartuNLP/EstBERT were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'bert.embeddings.word_embeddings.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at tartuNLP/EstBERT and are newly initialized: ['bert.embeddings.lemm



In [7]:
print(model(words_tensor, segments_tensor))

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.7084, -0.1104, -0.5107,  ..., -0.5266,  0.0329, -0.3554],
         [ 1.1890, -0.2122, -0.4336,  ..., -0.6041,  0.1565,  1.0217],
         [ 0.6978, -0.3808, -0.9276,  ..., -0.6450,  0.0800,  0.5463],
         ...,
         [ 0.1490, -0.2475, -0.2370,  ..., -0.8311,  0.3278,  0.0696],
         [ 0.7233, -0.1713, -0.2095,  ..., -0.7554, -0.0536,  0.2238],
         [ 0.5969, -0.1337, -0.3339,  ..., -1.1777,  0.1442,  0.5726]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.7575, -0.1106, -0.6539, -0.0478,  0.2228,  0.1760,  0.5432,  0.3282,
         -0.3785,  0.6589,  0.5764,  0.3764,  0.1170,  0.5791,  0.0266,  0.6563,
          0.5073, -0.5673,  0.1771,  0.4460,  0.3728, -0.2305,  0.2115,  0.3274,
          0.3929, -0.4446, -0.3809,  0.2946,  0.8287, -0.5622,  0.4769,  0.2044,
         -0.2921, -0.2330,  0.3328,  0.1427,  0.4485, -0.3210, -0.3539,  0.1338,
         -0.1099,  0.4051, -0.72

In [8]:
# concat? let's go with that for now...
# overfit on a small corpus

# position and type embeddings: what does it do with those? the channels can then be built with the same logic
# separate vocabs...

# we can tune things later, e.g. compound words...

# take the corpus from somewhere, e.g. the cl.ut.ee corpora
# estnltk tutorials corpus_processing

# tests into a separate folder
# task: predict the estnltk form
# write form (and lemma) prediction into the training...
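
The notes above settle on concatenation for now. As a rough sketch of what that could look like, the module below looks up a lemma vector and a form vector and concatenates them into one input vector. The class name `LemmaFormEmbeddings`, the split of the hidden size between the two channels, and all sizes are assumptions for illustration; this is not the code in the modified `modeling_bert`, and it leaves out the position and segment embeddings.

```python
import torch
import torch.nn as nn

class LemmaFormEmbeddings(nn.Module):
    """Hypothetical two-channel embedding: lemma and form vectors are concatenated."""

    def __init__(self, lemma_vocab_size, form_vocab_size, hidden_size, form_dim):
        super().__init__()
        # The two parts together fill the hidden size the transformer expects
        self.lemma = nn.Embedding(lemma_vocab_size, hidden_size - form_dim)
        self.form = nn.Embedding(form_vocab_size, form_dim)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len, 2); channel 0 = lemma ID, channel 1 = form ID
        lemma_vec = self.lemma(input_ids[:, :, 0])
        form_vec = self.form(input_ids[:, :, 1])
        return torch.cat([lemma_vec, form_vec], dim=-1)

# Toy sizes, chosen only so that the example IDs fit
emb = LemmaFormEmbeddings(lemma_vocab_size=3000, form_vocab_size=64,
                          hidden_size=32, form_dim=8)
ids = torch.tensor([[[2, 2], [2733, 34], [3, 3]]])
out = emb(ids)
print(out.shape)  # torch.Size([1, 3, 32])
```

Concatenation keeps the two vocabularies and their parameters separate, at the cost of fixing in advance how much of the hidden size each channel gets. Summing two full-size vectors, as standard BERT does for its three embeddings, would be the other obvious option.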

In [9]:
%%time

# Corpus

from estnltk.corpus_processing.parse_enc import parse_enc_file_iterator

input_file = "estonian_nc17.vert"
n = 100 # Number of texts to read into the corpus
korpus = []
for i, text_obj in enumerate(parse_enc_file_iterator(input_file)):
    # Stop after exactly n texts
    if i >= n:
        break
    korpus.append(text_obj.text)

CPU times: total: 2.73 s
Wall time: 1.58 s


In [10]:
from estnltk import Text
tekst = Text(" ".join(korpus)).tag_layer()

In [11]:
laused = []
for span in tekst.sentences:
    laused.append(tekst.text[span.start:span.end])

In [12]:
# Split by the number of sentences (n counts texts, not sentences)
split = int(0.8 * len(laused))
train = laused[:split]
test = laused[split:]

In [13]:
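def mlm(tensor):
    """Mask about 15% of positions in place; tensor has shape (batch, seq_len, 2)."""
    lemma_ids = tensor[:, :, 0]
    rand = torch.rand(lemma_ids.shape)
    # Only lemma IDs above 5 are eligible, so padding and special tokens are never masked
    mask_arr = (rand < 0.15) & (lemma_ids > 5)
    for i in range(lemma_ids.shape[0]):
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        # 4 is assumed to be the [MASK] ID; it is written to both the lemma and form channels
        tensor[i, selection] = 4
    return tensor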
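
To see what this masking does to a (lemma ID, form ID) tensor, here is a small self-contained variant that returns a masked copy together with the boolean mask instead of modifying its argument. The function name `mask_pairs`, the `generator` argument, and the assumption that ID 4 is `[MASK]` are illustrative; the eligibility threshold of 5 is taken from the cell above.

```python
import torch

MASK_ID = 4  # assumed to be the [MASK] ID, as in the cell above

def mask_pairs(ids, prob=0.15, generator=None):
    """Return a masked copy of ids (batch, seq_len, 2) and the boolean mask that was used."""
    lemma_ids = ids[:, :, 0]
    rand = torch.rand(lemma_ids.shape, generator=generator)
    # Padding and special tokens have small IDs and are never selected
    mask = (rand < prob) & (lemma_ids > 5)
    masked = ids.clone()
    # Boolean indexing over (batch, seq_len) overwrites both channels of each selected position
    masked[mask] = MASK_ID
    return masked, mask

g = torch.Generator().manual_seed(0)
ids = torch.tensor([[[2, 2], [32476, 1], [2733, 34], [3937, 30], [3, 3], [0, 0]]])
masked, mask = mask_pairs(ids, prob=0.5, generator=g)
print(mask[0].tolist())
print(masked[0].tolist())
```

Returning the mask makes it easy to compute a loss only at the masked positions later on, and leaving the input untouched means the original IDs can still serve as labels.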

In [42]:
#tokenizer = BertTokenizer.from_pretrained("tartuNLP/EstBERT")
# Modified tokenizer from src.transformers, imported above
tokenizer = BertTokenizer(vocab_file = "vocab.txt", vocab_file_form = "vocab_form.txt")
tokeniseeritud_lause = tokenizer(train[0:2], max_length = 64, padding = "max_length",
                                     truncation = True, return_tensors = "pt")


In [60]:
tokeniseeritud_lause["input_ids"][0][:,0]

tensor([    2, 32476,  2733,     1,     1,  3937,  4462, 12535, 32476,  1773,
         8285,   502,     1,    37,     1,     1, 14457,    11, 25063, 35501,
            1,   179,   440, 18569,     1,    15,     3,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0])

In [56]:
torch.nonzero(tokeniseeritud_lause["input_ids"][0] == 2)

tensor([[0, 0],
        [0, 1]])

In [None]:
[(0,0)] + [(0,0)]

In [None]:
input_ids = []
mask = []
labels = []

for lause in train:
    tokeniseeritud_lause = tokenizer(lause, max_length = 64, padding = "max_length",
                                     truncation = True, return_tensors = "pt")
    # Assumed shape: (1, 64, 2) = (batch, position, lemma/form channel)
    ids = tokeniseeritud_lause.input_ids
    # Labels are the unmasked form IDs (channel 1)
    labels.append(ids[:, :, 1].clone())
    mask.append(tokeniseeritud_lause.attention_mask)
    # mlm() masks in place, so pass it a copy
    input_ids.append(mlm(ids.detach().clone()))

In [None]:
input_ids = torch.cat(input_ids)
mask = torch.cat(mask)
labels = torch.cat(labels)