## Семинар 8: "Современные модели для NLP"

ФИО: Склянный Алексей Алексеевич

### На семинаре мы разберем [код трансфомера на pytorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

###  ДЗ [3 балла]

Обратите внимание, что в этой работе вам потребуется скачать модель весом ~150MB, также ее вычисление занимает определенное время, так что рекомендуется считать эту задачу на [google colab](https://colab.research.google.com/).

In [1]:
# !pip install torch
!pip install --upgrade transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[K     |████████████████████████████████| 5.8 MB 5.0 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 48.0 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 61.0 MB/s 
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [2]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import torch
from transformers import *
import tqdm



In [63]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

# MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'sshleifer/tiny-gpt2')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/config.json
Model config MobileBertConfig {
  "_name_or_path": "google/mobilebert-uncased",
  "architectures": [
    "MobileBertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_activation": false,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "intra_bottleneck_size": 128,
  "key_query_shared_bottleneck": true,
  "layer_norm_eps": 1e-12,
  

In [5]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [6]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [7]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [8]:
torch.tensor(input_ids).unsqueeze(0).shape

torch.Size([1, 9])

In [9]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    print(model(input_batch)[0].shape)
    res = model.forward(input_batch)[0]

torch.Size([1, 9, 30522])


In [10]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [11]:
print(prob.shape)
print()
print(f"{input_ids=}")
print(f"{new_ids.reshape(-1).numpy()=}")
print(f"{new_ids.numpy()[0, :].shape=}")


torch.Size([1, 9, 30522])

input_ids=[101, 2182, 2003, 2070, 103, 2000, 4372, 16044, 102]
new_ids.reshape(-1).numpy()=array([ 1012,  2182,  2003,  2070,  2126,  2000,  4372, 16044,  1996])
new_ids.numpy()[0, :].shape=(9,)


In [12]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'

In [13]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Ваша задача - сгенерировать продолжение текстов, на которых демонстрировалась работа GPT-2 с помощью загруженной модели (DistillBERT). Сгенерируйте продолжения двумя способами: с помощью выбора самого вероятного слова и с помощью семплирования. Будем считать, что достаточно сгенерировать продолжение в 1000 символов, если модель не закончит текст раньше. Также можно попробовать сравнить эту генерацию с какой-нибудь легковесной gpt, например, "sshleifer/tiny-gpt2".

### Max probability

In [14]:
from tqdm import tqdm

In [26]:
print(tokenizer.decode([102]))

a = [102]
b = a.pop()
b

[SEP]


102

In [25]:
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)

def gen_txt_continue(input_ids: list, num: int = 10):
  for i in tqdm(range(num)):
    # print("===========")
    # print(f"len : {input_ids[-3:]}")
    sep = input_ids.pop()

    input_ids.append(tokenizer.mask_token_id)
    # print(len(input_ids), input_ids[-3:])

    input_batch = torch.tensor(input_ids).unsqueeze(0)
    # print(f"{input_batch.shape=}")

    with torch.no_grad():
      preds = model.forward(input_batch)[0]

    prob = torch.nn.functional.softmax(preds, dim=-1)
    new_ids = prob.median(-1)[1][0].tolist()
    # print(f"{len(input_ids)=}")
    # print(f"{len(new_ids)=}")
    # print(i)

    input_ids[len(input_ids) - 1] = new_ids[len(input_ids[i:]) - 1]

    input_ids.append(sep)

  return input_ids

generated_txt = gen_txt_continue(input_ids, 100)
print()
print("original : ", GPT_TEXTS[0])
print(f"generated : {tokenizer.decode(generated_txt)}")

100%|██████████| 100/100 [00:45<00:00,  2.18it/s]


original :  In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
generated : [CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. pasture montevideo bearer standalone enablesえ shadowy willed screams swami ireland nfc strands rash sweeney autism editorial fridge adolescence batavia leipzig directing scoring sovereignty terrified genuinely davis monica meters 08 latham casualty sr generate surveyed 325 inverness lydia chimneys testament 1800s asking durga waterloo astronaut descended bets extraction sprung 1817 predictable respects spontaneously unanimously17 reset certification regulation atkinson removing zhang proliferation boog




In [24]:
input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)

def gen_txt_continue(input_ids: list, num: int = 10):
  for i in tqdm(range(num)):
    # print("===========")
    # print(f"len : {input_ids[-3:]}")
    sep = input_ids.pop()

    input_ids.append(tokenizer.mask_token_id)
    # print(len(input_ids), input_ids[-3:])

    input_batch = torch.tensor(input_ids).unsqueeze(0)
    # print(f"{input_batch.shape=}")

    with torch.no_grad():
      preds = model.forward(input_batch)[0]

    prob = torch.nn.functional.softmax(preds, dim=-1)
    new_ids = prob.median(-1)[1][0].tolist()
    # print(f"{len(input_ids)=}")
    # print(f"{len(new_ids)=}")
    # print(i)

    input_ids[len(input_ids) - 1] = new_ids[len(input_ids[i:]) - 1]

    input_ids.append(sep)

  return input_ids

generated_txt = gen_txt_continue(input_ids, 100)
print()
print("original : ", GPT_TEXTS[1])
print(f"generated : {tokenizer.decode(generated_txt)}")

100%|██████████| 100/100 [00:22<00:00,  4.49it/s]


original :  A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
generated : [CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. doorway lives pulitzer refrain nicola dormitory sonar pantry bollywood buses щ batsman bicycles reeve westward 1611 processed plume counted mira commence songwriting tam 1966 situation sabotage legislature seller rf deposition nana superstructure ida ceramic tracks mcdonnell40 salute featuring seismic 男 brunette extracted frankly pornography cassandra remove additionally dolphins parana hilarious rayon drove sewage prayers secretly roughly southport belt entrepreneurship nicaragua wheeler selangor hilarious 1797 1753 1711 refugees ability fiji recommend 世 wheeler demonstrators stables looked organisation usc pace breed bequeathed indicate sighting vocal moors preparing loire genie curses refused mimi quaker completing 960 s




### Sample

In [51]:
a = torch.rand(4, 5)
print(a)
a[:, 3]

tensor([[0.5583, 0.3469, 0.1200, 0.4276, 0.8438],
        [0.1823, 0.6942, 0.6825, 0.0477, 0.9250],
        [0.0188, 0.2826, 0.3867, 0.9736, 0.2765],
        [0.6656, 0.5826, 0.9876, 0.5040, 0.8579]])


tensor([0.4276, 0.0477, 0.9736, 0.5040])

In [72]:
from random import randint

In [75]:
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)

def gen_txt_continue(input_ids: list, num: int = 10):
  for i in tqdm(range(num)):
    # print("===========")
    # print(f"len : {input_ids[-3:]}")
    sep = input_ids.pop()

    input_ids.append(tokenizer.mask_token_id)
    # print(len(input_ids), input_ids[-3:])

    input_batch = torch.tensor(input_ids).unsqueeze(0)
    # print(f"{input_batch.shape=}")

    with torch.no_grad():
      preds = model.forward(input_batch)[0]

    prob = torch.nn.functional.softmax(preds, dim=-1)
    # print(f"{torch.topk(prob, 5)[1][0].shape=}")
    new_ids = torch.topk(prob, 50)[1][0][:, randint(0, 49)].tolist()
    # print(f"{len(input_ids)=}")
    # print(f"{len(new_ids)=}")
    # print(i)

    input_ids[len(input_ids) - 1] = new_ids[len(input_ids[i:]) - 1]

    input_ids.append(sep)

  return input_ids

generated_txt = gen_txt_continue(input_ids, 100)
print()
print("original : ", GPT_TEXTS[0])
print(f"generated : {tokenizer.decode(generated_txt)}")

100%|██████████| 100/100 [00:29<00:00,  3.38it/s]


original :  In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
generated : [CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. the that however one with did because what so as most everything also for however a at yet the he its very of given those of some at to in about however " with not for could : given : with some, just was not is from some for some on not or had each because : : because for when - what at all only are both there when ; - and because at ; that in for also. that their also not which – those and them they to – are there there as when " [SEP]





In [76]:
input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)

def gen_txt_continue(input_ids: list, num: int = 10):
  for i in tqdm(range(num)):
    # print("===========")
    # print(f"len : {input_ids[-3:]}")
    sep = input_ids.pop()

    input_ids.append(tokenizer.mask_token_id)
    # print(len(input_ids), input_ids[-3:])

    input_batch = torch.tensor(input_ids).unsqueeze(0)
    # print(f"{input_batch.shape=}")

    with torch.no_grad():
      preds = model.forward(input_batch)[0]

    prob = torch.nn.functional.softmax(preds, dim=-1)
    # print(f"{torch.topk(prob, 5)[1][0].shape=}")
    new_ids = torch.topk(prob, 50)[1][0][:, randint(0, 49)].tolist()
    # print(f"{len(input_ids)=}")
    # print(f"{len(new_ids)=}")
    # print(i)

    input_ids[len(input_ids) - 1] = new_ids[len(input_ids[i:]) - 1]

    input_ids.append(sep)

  return input_ids

generated_txt = gen_txt_continue(input_ids, 100)
print()
print("original : ", GPT_TEXTS[1])
print(f"generated : {tokenizer.decode(generated_txt)}")

100%|██████████| 100/100 [00:23<00:00,  4.32it/s]


original :  A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
generated : [CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. one nine three city that city nine two town :. four part country nine only none year 1 flag the day eight only none particular primary ones zero'class single none time primary last none five '. once in four - ". - school only summer once type primary with once single kindergarten : zero kindergarten one zero generation company'term. pavement the season " season village : school one zero sole school. time type station school place unit year city partyor mile pavement nothing class with : nothing class primary nothing [SEP]





#### Feedback (опционально)

Здесь вы можете оставить список опечаток из лекции или семинара:

Здесь вы можете оставить комментарии по лекции или семинару: