## Seminar 8: "Modern Models for NLP"

Full name: Токаева Александра Александровна

### In this seminar we walk through [the Transformer code in PyTorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes a noticeable amount of time, so we recommend doing this task on [Google Colab](https://colab.research.google.com/).

In [124]:
import numpy as np
import torch
#!pip install --upgrade transformers
from transformers import MobileBertForMaskedLM, MobileBertTokenizer

In [77]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [113]:
# add_special_tokens=True adds [CLS], [SEP], <s>, ... in the right way for each model.
input_ids = tokenizer.encode("Here is some text to encode .", add_special_tokens=True)
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 1012, 102]


In [79]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [80]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [81]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [82]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [83]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
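
The decoded string above also shows that the model rewrites every position, including `[CLS]` and `[SEP]` (here they came back as `.` and `the`). Below is a minimal sketch of keeping the original tokens and substituting predictions only where the input was masked. It is plain Python with no model call. The helper name `fill_masks`, the `predicted_ids` values, and the mask id 103 are illustrative assumptions; in the notebook they would correspond to `new_ids` and `tokenizer.mask_token_id`.

```python
def fill_masks(input_ids, predicted_ids, mask_id):
    """Return a copy of input_ids where only masked positions take the prediction."""
    if len(input_ids) != len(predicted_ids):
        raise ValueError("input_ids and predicted_ids must have the same length")
    return [pred if tok == mask_id else tok
            for tok, pred in zip(input_ids, predicted_ids)]


MASK_ID = 103  # assumed id of [MASK]; use tokenizer.mask_token_id in the notebook
input_ids = [101, 2182, 2003, 2070, MASK_ID, 2000, 4372, 16044, 1012, 102]
# Hypothetical per-position predictions; the first and last differ from [CLS]/[SEP].
predicted_ids = [1012, 2182, 2003, 2070, 2126, 2000, 4372, 16044, 1012, 1996]

filled = fill_masks(input_ids, predicted_ids, MASK_ID)
print(filled)
```

With this helper the special tokens at both ends survive, and only position 4 takes the model's suggestion.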

In [93]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]
len(GPT_TEXTS)

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (MobileBERT, loaded above as `google/mobilebert-uncased`). Generate the continuations in two ways: by choosing the most probable word at each step, and by sampling. We consider a continuation of 1000 characters sufficient, unless the model ends the text earlier.
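
The difference between the two decoding strategies named in the task can be sketched on a toy distribution. This is a minimal illustration in plain Python, not the model itself: the four-word vocabulary, the probabilities, and the helper names `greedy_choice` and `sample_choice` are all made up for the example.

```python
import random


def greedy_choice(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_choice(probs, rng):
    """Index drawn at random in proportion to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy next-token distribution over a hypothetical 4-word vocabulary.
vocab = ["animals", "creatures", "horses", "beings"]
probs = [0.5, 0.3, 0.15, 0.05]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = [vocab[greedy_choice(probs)] for _ in range(5)]
sampled = [vocab[sample_choice(probs, rng)] for _ in range(5)]
print(greedy)   # the same word every time
print(sampled)  # can differ from draw to draw
```

Greedy decoding is deterministic, which is one reason it tends to fall into repetitive loops. Sampling trades some likelihood for variety.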

# Method 1: simply 50 tokens in a row

In [125]:
# Method 1: generate 50 tokens one at a time. Whenever a period is generated,
# append [SEP] or [CLS] with equal probability, and feed the extended text
# back into the model on the next step.

input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
point_id = tokenizer.encode(".", add_special_tokens=False)[0]
sep_id = tokenizer.sep_token_id  # 102 for this vocabulary
cls_id = tokenizer.cls_token_id  # 101 for this vocabulary
for i in range(50):
    input_ids.append(tokenizer.mask_token_id)

    input_batch = torch.tensor(input_ids).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        res = model(input_batch)[0]

    prob = torch.nn.functional.softmax(res, dim=-1)
    # Most probable token at the last (masked) position, as a plain int.
    new_id = prob[0, -1].argmax().item()
    input_ids[-1] = new_id
    if new_id == point_id:
        # Coin flip between the two special tokens.
        if np.random.rand() < 0.5:
            input_ids.append(cls_id)
        else:
            input_ids.append(sep_id)
print(tokenizer.decode(input_ids))


[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. [SEP] they were also very intelligent and intelligent animals. [SEP] they were very intelligent animals. [CLS] unicorns were very intelligent animals. [CLS] unicorns were very intelligent animals. [SEP] they were very intelligent animals. [CLS] unicorns were very intelligent animals. [CLS] unicorns were very intelligent animals. [SEP] they


In [None]:
# Method 1 works poorly

# Method 2: instead of 50 tokens at once, predict a block of 5 consecutive masks 10 times, filling each block in random order (the order is chosen with a random permutation)

In [138]:
N=5
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)

for m in range(10):
    for i in range(N):
        input_ids.append(tokenizer.mask_token_id)

    for j in np.random.permutation(N):
        input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
        with torch.no_grad():
            res = model(input_batch)[0]

        prob = torch.nn.functional.softmax(res, dim=-1)
        pos = len(input_ids) - N + int(j)  # position of the mask being filled
        new_id = prob[0, pos].argmax().item()  # greedy choice, as a plain int
        input_ids[pos] = new_id

print(tokenizer.decode(input_ids))

[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. [SEP] however, in fact, the researchers could not understand the natural language of the unicorns and the language of the herd of the unicorns. neither was able to understand the other's language and the language of the herd's language. and they


In [140]:
N=5
input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)

for m in range(10):
    for i in range(N):
        input_ids.append(tokenizer.mask_token_id)

    for j in np.random.permutation(N):
        input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
        with torch.no_grad():
            res = model(input_batch)[0]

        prob = torch.nn.functional.softmax(res, dim=-1)
        pos = len(input_ids) - N + int(j)  # position of the mask being filled
        new_id = prob[0, pos].argmax().item()  # greedy choice, as a plain int
        input_ids[pos] = new_id

print(tokenizer.decode(input_ids))

[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. [SEP] the remaining part of the train carriage is buried by a railroad car. the carriage is located in the city limits of cincinnati, ohio. the train carriage is located in the city limits of cincinnati, ohio. the train carriage is buried by a railroad car


In [None]:
# Method 2 already works better!
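
The block-wise scheme of Method 2 can be factored into a small helper, sketched here under stated assumptions. The names `fill_block` and `toy_predict` are hypothetical, the mask id 103 is assumed, and `toy_predict` is only a stand-in for the model call plus argmax, so the example runs without downloading anything.

```python
import random


def fill_block(input_ids, n_masks, mask_id, predict_fn, rng):
    """Append n_masks masks and fill them one at a time in random order.

    predict_fn(ids, pos) must return a token id for position pos; in the
    notebook this would wrap the model call and the argmax over the vocabulary.
    """
    ids = list(input_ids)
    start = len(ids)
    ids.extend([mask_id] * n_masks)
    order = list(range(n_masks))
    rng.shuffle(order)  # random fill order, like np.random.permutation(N)
    for offset in order:
        pos = start + offset
        ids[pos] = predict_fn(ids, pos)
    return ids


def toy_predict(ids, pos):
    """Stand-in for the model: returns 1000 + position, ignoring context."""
    return 1000 + pos


MASK_ID = 103  # assumed [MASK] id
rng = random.Random(0)
ids = [101, 7, 8, 102]
for _ in range(2):  # two blocks of three tokens each
    ids = fill_block(ids, 3, MASK_ID, toy_predict, rng)
print(ids)
```

Because each mask is predicted while the rest of its block is still masked or already filled, tokens within a block can condition on context to both their left and right, which a strictly left-to-right loop never offers to a masked language model.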

# Method 3: keep the random permutation and add top-p, top-k, and sampling instead of taking the maximum

In [162]:
N=5
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)

for m in range(10):
    for i in range(N):
        input_ids.append(tokenizer.mask_token_id)

    for j in np.random.permutation(N):
        input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
        with torch.no_grad():
            res = model(input_batch)[0]

        prob = torch.nn.functional.softmax(res, dim=-1)  # shape: (1, len(input_ids), vocab_size)
        pos = len(input_ids) - N + int(j)  # position of the mask being filled
        # Sort token probabilities at this position in descending order.
        sorted_probs, order = torch.sort(prob[0, pos], descending=True)
        k = 20   # top-k: keep at most k candidates
        p = 0.7  # top-p: stop once the cumulative probability reaches p
        candidates = []    # ids of candidate tokens
        distribution = []  # their probabilities, aligned with candidates
        prob_sum = 0.0
        while len(candidates) < k and prob_sum < p:
            idx = len(candidates)
            candidates.append(order[idx].item())
            distribution.append(sorted_probs[idx].item())
            prob_sum += sorted_probs[idx].item()

        # Categorical renormalizes the truncated distribution before sampling.
        sampler = torch.distributions.Categorical(torch.tensor(distribution))
        new_id = candidates[sampler.sample().item()]

        input_ids[pos] = new_id

print(tokenizer.decode(input_ids))

[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. [SEP] they lived in the region following their father'c expedition, as he became known in argentina and peru. while visiting their families of the various millionaires ’ gold mines and their family ’ descendants ’ wealth, the youngsters were told, they believed that


In [161]:
N=5
input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)

for m in range(10):
    for i in range(N):
        input_ids.append(tokenizer.mask_token_id)

    for j in np.random.permutation(N):
        input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
        with torch.no_grad():
            res = model(input_batch)[0]

        prob = torch.nn.functional.softmax(res, dim=-1)  # shape: (1, len(input_ids), vocab_size)
        pos = len(input_ids) - N + int(j)  # position of the mask being filled
        # Sort token probabilities at this position in descending order.
        sorted_probs, order = torch.sort(prob[0, pos], descending=True)
        k = 20   # top-k: keep at most k candidates
        p = 0.7  # top-p: stop once the cumulative probability reaches p
        candidates = []    # ids of candidate tokens
        distribution = []  # their probabilities, aligned with candidates
        prob_sum = 0.0
        while len(candidates) < k and prob_sum < p:
            idx = len(candidates)
            candidates.append(order[idx].item())
            distribution.append(sorted_probs[idx].item())
            prob_sum += sorted_probs[idx].item()

        # Categorical renormalizes the truncated distribution before sampling.
        sampler = torch.distributions.Categorical(torch.tensor(distribution))
        new_id = candidates[sampler.sample().item()]

        input_ids[pos] = new_id

print(tokenizer.decode(input_ids))

[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. [SEP] another train trailer was recovered in the 1990s on the market and included in a tourist section for pedestrians and vehicles in cars and other vehicles across the road. the vehicles are lost, empty and missing, and still in storage, and some are stolen from them


In [None]:
# Method 3 is already great!
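
The candidate selection in Method 3 stops as soon as either k tokens are kept or their cumulative probability reaches p. The sketch below isolates that rule in plain Python on a toy distribution. The function names `top_k_top_p_candidates` and `sample_token` and the six-token vocabulary are hypothetical, and this is one simple way to combine the two cutoffs rather than a reference implementation.

```python
import random


def top_k_top_p_candidates(probs, k, p):
    """Keep the most probable tokens until k are kept or their mass reaches p.

    Returns (token_ids, renormalized_probs).
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for token_id in order:
        if len(kept) >= k or mass >= p:
            break
        kept.append(token_id)
        mass += probs[token_id]
    return kept, [probs[i] / mass for i in kept]


def sample_token(probs, k, p, rng):
    """Sample one token id from the truncated, renormalized distribution."""
    kept, weights = top_k_top_p_candidates(probs, k, p)
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy distribution over a hypothetical 6-token vocabulary.
probs = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]
kept, weights = top_k_top_p_candidates(probs, k=20, p=0.7)
print(kept)
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_token(probs, 20, 0.7, rng) for _ in range(5)]
print(samples)
```

Here the three most probable tokens carry 0.8 of the mass, so the p cutoff triggers long before k = 20 does. Lowering k to 2 would make the k cutoff the binding one.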

#### Feedback (optional)

Here you can leave a list of typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: