# RoBERTa 

This notebook shows the basic usage of a RoBERTa model trained on Czech texts.

In [1]:
import os
from transformers import RobertaModel, RobertaForMaskedLM
from tokenizers.implementations import ByteLevelBPETokenizer
import torch
from czech_roberta_tokenizer import CzechRobertaTokenizer

In [2]:
model_path = "/home/naplava/data/roberta_pretrain/hf_model"
tokenizer_path = model_path + '/tokenizer'

Initialize the tokenizer and use it to encode a sample text.

In [3]:
tokenizer = CzechRobertaTokenizer(tokenizer_path)

In [4]:
encoded = tokenizer.encode('Jakub jede do Říma,', pad_to_max_length=False)
print(encoded)

{'input_ids': [0, 17157, 5678, 16, 24101, 91, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
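
The encoding is a plain dictionary of Python lists, so passing several texts to the model at once means padding them to a common length first. The sketch below shows one way to do that by hand. `pad_batch` is a hypothetical helper, not part of `CzechRobertaTokenizer`, and the pad id of 1 is an assumption (it is the conventional RoBERTa value, so check it against your vocabulary).

```python
from typing import Dict, List

PAD_ID = 1  # assumption: conventional RoBERTa padding id; verify for your vocabulary


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad token id lists to the same length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [0, 17157, 5678, 16, 24101, 91, 4, 2],    # 'Jakub jede do Říma,' from the output above
    [0, 1016, 12, 599, 509, 314, 635, 4, 2],  # encoding used in the fill-mask section
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The padded lists can then be wrapped in `torch.tensor(...)` and passed to the model together with the `attention_mask` argument.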


Load the RoBERTa model with a language modeling (LM) head.

In [5]:
model = RobertaForMaskedLM.from_pretrained(model_path)
model.eval()

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(51961, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              ...

## Fill mask 

What is the most probable word given the context?

In [6]:
text_to_encode = 'Praha je hlavní město České republiky'
encoded_text = tokenizer.encode(text_to_encode, pad_to_max_length=False)['input_ids']
# Build an integer tensor directly (no float round trip) and add a batch dimension
encoded_text = torch.tensor(encoded_text, dtype=torch.int64).unsqueeze(0)
print(encoded_text)

tensor([[   0, 1016,   12,  599,  509,  314,  635,    4,    2]])


Each word in text_to_encode was mapped to a separate token in the encoded representation.

Let's mask Praha and see which infillings the model considers most probable.

In [7]:
word_index_to_mask = 1
encoded_text[0][word_index_to_mask] = tokenizer.mask_index
with torch.no_grad():  # inference only, so gradients are not needed
    res = model(encoded_text)[0]
print(res.shape)

torch.Size([1, 9, 51961])


In [8]:
logits = res[0]
argsorted_logits = torch.argsort(logits, dim=1, descending=True).numpy()

print('The most probable variants are:')
for i in range(20):
    print('\t' + tokenizer.decode([argsorted_logits[word_index_to_mask][i]]))

The most probable variants are:
	Co
	Jaké
	Kde
	Kdo
	Kolik
	Praha
	Kdy
	Které
	co
	Toto
	Jaký
	Jaká
	Jak
	Město
	Stát
	Kraj
	Česko
	ČR
	Kam
	Brno
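
The ranking above shows the order of the candidates but not how confident the model is in each. A common follow-up is to turn the logits at the masked position into probabilities with a softmax and keep only the top few. The sketch below uses a small random tensor as a stand-in for `res` so that it runs without the model; the shape `(1, 9, 20)` and the seed are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Stand-in for `res = model(encoded_text)[0]`: (batch, sequence length, vocabulary size).
# The vocabulary size of 20 is an illustrative assumption; the real model uses 51961.
res = torch.randn(1, 9, 20)
word_index_to_mask = 1

# Softmax over the vocabulary dimension turns logits into a probability distribution.
probs = torch.softmax(res[0, word_index_to_mask], dim=-1)

# torch.topk returns the k largest values and their indices, sorted in descending order.
top_probs, top_ids = torch.topk(probs, k=5)

for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
    # With the real model, tokenizer.decode([token_id]) would give the token text.
    print(f"{token_id}\t{prob:.4f}")
```

Compared with a full `argsort` over the vocabulary, `topk` only materializes the candidates that are actually printed.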


## Other tasks

For other tasks, such as sequence classification or sequence labelling, load the model into the appropriate class. The most general one is RobertaModel, on top of which any task-specific head can be built.

In [9]:
from transformers import RobertaForSequenceClassification, RobertaForTokenClassification, RobertaModel
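
As a rough illustration of swapping heads, the sketch below builds a tiny, randomly initialized `RobertaForSequenceClassification` from a `RobertaConfig`, so it runs without the pretrained weights. All sizes and `num_labels=3` are illustrative assumptions, not the configuration of the Czech model. With the real checkpoint you would instead call `RobertaForSequenceClassification.from_pretrained(model_path, num_labels=...)`; the classification head then starts from random weights and needs fine-tuning.

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# Tiny configuration so the example runs quickly; every size here is an assumption
# chosen for illustration, not the configuration of the Czech model.
config = RobertaConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
    num_labels=3,
)
classifier = RobertaForSequenceClassification(config)  # random weights, no checkpoint loaded
classifier.eval()

# Hypothetical token ids: 0 and 2 play the role of the start and end tokens.
input_ids = torch.tensor([[0, 5, 6, 7, 2]])
with torch.no_grad():
    logits = classifier(input_ids)[0]  # first output holds one score per label

print(logits.shape)
```

`RobertaForTokenClassification` follows the same pattern, except that it returns one score vector per token rather than one per sequence.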