# Master's thesis: Adapting the BERT model for Estonian

## Background

Input text -> **tokenizer** -> **embedding** -> transformer

In standard BERT, the tokenizer splits each word into pieces (a whole word or a subword piece), and each piece is given a three-component embedding (token embedding, segment embedding and position embedding).

* Token embedding - the vector of the token itself
* Segment embedding - a vector selected by a 0/1 segment ID, used to tell the two parts of a paired input apart
* Position embedding - a vector that marks the token's position in the input text

The three embeddings are summed to form the input embedding for the transformer.  
For details, see:
https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a
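
The summation described above can be sketched in a few lines of PyTorch. This is only an illustration under assumed toy sizes (`vocab_size`, `max_positions` and `hidden_size` are hypothetical values, not EstBERT's), and it leaves out the LayerNorm and dropout that BERT applies after the sum.

```python
import torch
from torch import nn

# Hypothetical toy sizes, chosen only to keep the example small.
vocab_size, max_positions, num_segments, hidden_size = 100, 16, 2, 8

token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(num_segments, hidden_size)
position_emb = nn.Embedding(max_positions, hidden_size)

input_ids = torch.tensor([[2, 57, 31, 3]])                    # (batch, seq_len)
token_type_ids = torch.zeros_like(input_ids)                  # single-segment input
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0 .. seq_len-1

# All three lookups return (batch, seq_len, hidden_size), so they add element-wise.
input_embedding = (
    token_emb(input_ids)
    + segment_emb(token_type_ids)
    + position_emb(position_ids)
)
print(input_embedding.shape)
```

Each token therefore ends up with a single vector of size `hidden_size`, regardless of how many component embeddings contributed to it.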


## Task

Pass more information along with the input by using EstNLTK tools (morphological analysis). This requires:

1) Changing the tokenizer so that it returns a lemma and a form instead of a token  
2) Changing the embeddings so that, instead of the token embedding, a lemma embedding and a form embedding are looked up

## 1) Tokenizer

In [1]:
# Standard BERT tokenizer

from transformers import BertTokenizer, PreTrainedTokenizer
tokenizer = BertTokenizer.from_pretrained("tartuNLP/EstBERT")

tokenizer("Kuidas kirjutada magistritööd?")

{'input_ids': [2, 572, 3611, 11838, 1709, 49892, 229, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [2]:
# New tokenizer

from src.transformers.models.bert.tokenization_bert import BertTokenizer
tokenizer = BertTokenizer("vocab.txt")
input_tekst = "Maril on paha tuju, aga see on fine."

print(tokenizer.tokenize(input_tekst))
print("")
print(tokenizer(input_tekst))

[('mari', 'sg ad'), ('olema', 'b'), ('paha', 'sg n'), ('tuju', 'sg n'), (',', ''), ('aga', ''), ('see', 'sg n'), ('olema', 'b'), ('fine', 'sg n'), ('.', '')]

{'input_ids': [(2, 2), (8316, 1), (788, 411), (2857, 1), (7015, 1), (11, 49881), (179, 49881), (126, 1), (788, 411), (1, 1), (15, 49881), (3, 3)], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [[1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1], [1, 1]]}
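
The output above suggests that lemmas and forms are looked up in the same vocabulary and that anything missing falls back to the unknown-token ID 1. A minimal sketch of such a mapping follows. `encode_pairs` is a hypothetical helper, not the modified tokenizer's actual code, and the toy `vocab` holds only the handful of IDs visible in the output.

```python
# Toy vocabulary with IDs copied from the output above; the real one is vocab.txt.
vocab = {
    "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "mari": 8316, "olema": 788, "b": 411, ".": 15, "": 49881,
}

def encode_pairs(pairs, vocab, unk="[UNK]", cls="[CLS]", sep="[SEP]"):
    """Map (lemma, form) pairs to (lemma_id, form_id) pairs wrapped in [CLS]/[SEP]."""
    unk_id = vocab[unk]
    ids = [(vocab[cls], vocab[cls])]
    for lemma, form in pairs:
        # Lemma and form are looked up independently; unknown entries get unk_id.
        ids.append((vocab.get(lemma, unk_id), vocab.get(form, unk_id)))
    ids.append((vocab[sep], vocab[sep]))
    return ids

pairs = [("mari", "sg ad"), ("olema", "b"), (".", "")]
encoded = encode_pairs(pairs, vocab)
print(encoded)
# [(2, 2), (8316, 1), (788, 411), (15, 49881), (3, 3)]
```

Under this assumption, a multi-word form such as `'sg ad'` maps to ID 1 because the vocabulary has no single entry for it.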


In [21]:
import torch

tokenizer_output = tokenizer(input_tekst)
words_tensor = torch.tensor([tokenizer_output["input_ids"]])
segments_tensor = torch.tensor([tokenizer_output["token_type_ids"]])
print(words_tensor)
print(segments_tensor)

tensor([[[    2,     2],
         [ 8316,     1],
         [  788,   411],
         [ 2857,     1],
         [ 7015,     1],
         [   11, 49881],
         [  179, 49881],
         [  126,     1],
         [  788,   411],
         [    1,     1],
         [   15, 49881],
         [    3,     3]]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
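
Because `words_tensor` has shape `(batch, seq_len, 2)`, an embedding layer can split the last axis into lemma IDs and form IDs and sum two lookups. The sketch below shows one way this could work. `LemmaFormEmbeddings` and its sizes are hypothetical and do not reproduce the modified `BertEmbeddings` class; segment and position embeddings would be added as in standard BERT.

```python
import torch
from torch import nn

class LemmaFormEmbeddings(nn.Module):
    """Hypothetical sketch: one embedding table for lemmas, one for forms."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.lemma_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.form_embeddings = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len, 2); the last axis holds (lemma_id, form_id).
        lemma_ids = input_ids[..., 0]
        form_ids = input_ids[..., 1]
        return self.lemma_embeddings(lemma_ids) + self.form_embeddings(form_ids)

# vocab_size=50000 is an assumed value large enough for ID 49881.
layer = LemmaFormEmbeddings(vocab_size=50000, hidden_size=8)
ids = torch.tensor([[[2, 2], [788, 411], [15, 49881], [3, 3]]])
out = layer(ids)
print(out.shape)
```

Summing keeps the hidden size unchanged, so the rest of the transformer needs no modification; concatenating the two vectors would be an alternative with a different trade-off.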


## 2) Embeddings

In [37]:
from src.transformers.models.bert.modeling_bert import BertEmbeddings, BertModel
from src.transformers.models.bert.configuration_bert import BertConfig
import torch


config = BertConfig()
embedding = BertEmbeddings(config)
# from_pretrained is a classmethod; the configuration is read from the checkpoint
model = BertModel.from_pretrained("tartuNLP/EstBERT")

# model = BertModel(config)
# The line above raises an index-out-of-range error, because the tokenizer uses EstBERT's vocab.txt file
# If a word has no form, the form is currently the empty string -> ID 49881
# Standard BERT has a vocabulary size of ~30,000

Some weights of the model checkpoint at tartuNLP/EstBERT were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'bert.embeddings.word_embeddings.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at tartuNLP/EstBERT and are newly initialized: ['bert.embeddings.form
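
The index-out-of-range problem noted in the cell above can be checked before the model is called. This is a small sketch with a hypothetical helper, `ids_fit_vocab`. It uses 30522, the default `vocab_size` of `BertConfig`, and 50000 as an assumed larger value.

```python
def ids_fit_vocab(input_ids, vocab_size):
    """Return True if every lemma and form ID is a valid row of an embedding table."""
    return all(0 <= i < vocab_size for pair in input_ids for i in pair)

# A few ID pairs copied from the tokenizer output earlier in the notebook.
input_ids = [(2, 2), (8316, 1), (788, 411), (11, 49881), (3, 3)]

fits_default = ids_fit_vocab(input_ids, 30522)   # default BertConfig vocabulary size
fits_larger = ids_fit_vocab(input_ids, 50000)    # assumed larger vocabulary
print(fits_default, fits_larger)
# False True
```

The empty-form ID 49881 is what falls outside the default table, which is consistent with the error described in the comments.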

In [38]:
print(model(words_tensor, segments_tensor))

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.0670, -0.4643, -0.3942,  ..., -0.2698, -0.1209, -0.1286],
         [-0.3832, -0.1994,  1.8395,  ..., -0.7417,  0.2582,  1.0315],
         [-0.1101, -0.4827,  1.5438,  ..., -0.8496, -0.1393,  3.0713],
         ...,
         [ 0.4404, -0.1631,  1.8296,  ..., -0.5746,  0.1799,  2.4584],
         [-0.4306, -0.3358,  1.5688,  ..., -0.4726,  0.5961,  0.3756],
         [-0.2092, -0.1607,  0.0142,  ..., -0.8392,  0.1013, -0.0064]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.2727, -0.5788,  0.7369, -0.5138,  0.6122, -0.0200, -0.6871,  0.3304,
          0.0658,  0.0090,  0.1095,  0.0501,  0.0930,  0.4894,  0.2567, -0.1889,
         -0.7491,  0.0874, -0.1658,  0.3224,  0.7752, -0.1106, -0.4507, -0.0576,
         -0.3789, -0.4304, -0.2049, -0.1836, -0.4341,  0.1879, -0.5059, -0.6251,
         -0.1487,  0.4441, -0.4719,  0.7903, -0.1909, -0.7007, -0.4707, -0.5245,
          0.3003, -0.8979, -0.36