# How does GPT work?
This tutorial notebook steps through the GPT architecture.
Modified from: https://www.modeldifferently.com/en/2021/12/generaci%C3%B3n-de-fake-news-con-gpt-2/

Edited by: John Tan Chong Min (17 Jan 2022)



In [None]:
import torch, os, re, pandas as pd, json
from sklearn.model_selection import train_test_split
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding, GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, AutoConfig
from datasets import Dataset

# Load the base model and tokenizer

In [None]:
base_model = GPT2LMHeadModel.from_pretrained('gpt2')
base_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
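base_model.num_parameters
# Without parentheses this shows the bound method, whose repr prints the full architecture.
# Call base_model.num_parameters() to get the parameter count instead.
# The base 'gpt2' checkpoint has:
#     (wte): Embedding(50257, 768)   token embeddings
#     (wpe): Embedding(1024, 768)    position embeddings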

<bound method ModuleUtilsMixin.num_parameters of GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0
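
The two embedding tables at the top of the printout account for a large share of the model's weights. As a rough sketch, the arithmetic below uses only the shapes shown in `wte` and `wpe`; the variable names are illustrative and are not read from the model object.

```python
# Shapes copied from the model printout above.
vocab_size, n_positions, d_model = 50257, 1024, 768

wte_params = vocab_size * d_model    # token embedding table
wpe_params = n_positions * d_model   # position embedding table

print(f"wte: {wte_params:,}")
print(f"wpe: {wpe_params:,}")
print(f"embeddings total: {wte_params + wpe_params:,}")
```

Calling `base_model.num_parameters()` (with parentheses) gives the full count to compare against.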

In [None]:
# vocab_size counts BPE tokens (subword units), not whole words
print('Words in vocabulary: ', base_tokenizer.vocab_size)

Words in vocabulary:  50257


In [None]:
vocabulary = base_tokenizer.get_vocab()
vocabulary['Hello']

15496

In [None]:
text = "Hi, GPT is a fun tool to use."
base_tokenizer.tokenize(text)

['Hi', ',', 'ĠG', 'PT', 'Ġis', 'Ġa', 'Ġfun', 'Ġtool', 'Ġto', 'Ġuse', '.']
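
The `Ġ` prefix in the tokens above is how GPT-2's byte-level BPE marks a leading space. The sketch below reverses that for the token list printed above. The helper name `detokenize` is hypothetical, and it handles only the space marker, not the tokenizer's full byte-to-character mapping.

```python
# Tokens copied from the tokenizer output above.
tokens = ['Hi', ',', 'ĠG', 'PT', 'Ġis', 'Ġa', 'Ġfun', 'Ġtool', 'Ġto', 'Ġuse', '.']

def detokenize(tokens):
    """Hypothetical helper: join tokens and turn the 'Ġ' marker back into a space.

    Only the space marker is handled here; the real tokenizer also maps
    other bytes (newlines, non-ASCII characters) to printable stand-ins.
    """
    return "".join(tokens).replace("Ġ", " ")

print(detokenize(tokens))
```

In the notebook itself, `base_tokenizer.convert_tokens_to_string(tokens)` or `base_tokenizer.decode(...)` does this properly.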

In [None]:
text_ids = base_tokenizer.encode(text, return_tensors = 'pt')
print(text_ids)

tensor([[17250,    11,   402, 11571,   318,   257,  1257,  2891,   284,   779,
            13]])


# Let's generate some text
- Visualize the probability distribution

In [None]:
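generated_text_samples = base_model.generate(
    text_ids,
    max_length = 100, # total length in tokens, including the prompt
    top_k = 1, # keep only the most probable token; without do_sample=True decoding is greedy anyway
    output_scores=True, # also return the scores for each generated step
    return_dict_in_generate = True,
    num_return_sequences = 1 # return a single sequence
)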

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
generated_text_samples['sequences']

tensor([[17250,    11,   402, 11571,   318,   257,  1257,  2891,   284,   779,
            13,   632,   338,   257,  1049,   835,   284,   651,  2067,   351,
           534,  1628,   290,   651,  2067,   351,   534,  1628,    13,   198,
           198,    40,  1101,  1016,   284,   923,   416,  2282,   326,   314,
          1101,   407,   257,  1263,  4336,   286,   262,  2891,    13,   314,
          1053,   973,   340,   257,  1256,    11,   475,   314,  1053,  1239,
          1107,   973,   340,    13,   314,  1053,  1239,  1107,   973,   340,
            13,   314,  1053,  1239,  1107,   973,   340,    13,   314,  1053,
          1239,  1107,   973,   340,    13,   314,  1053,  1239,  1107,   973,
           340,    13,   314,  1053,  1239,  1107,   973,   340,    13,   314]])

In [None]:
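# Visualize the scores: convert each step's logits to probabilities
import numpy as np
for score in generated_text_samples['scores']:
  scores = torch.nn.functional.softmax(score, dim=-1)  # softmax over the vocabulary axis
  print(scores, torch.argmax(scores))  # probabilities and the id of the most probable token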


tensor([[2.3129e-05, 5.5025e-06, 8.1286e-06,  ..., 7.2621e-10, 2.1037e-10,
         4.4919e-03]]) tensor(632)
tensor([[1.7727e-05, 4.6587e-06, 8.8757e-07,  ..., 1.4128e-09, 3.5584e-08,
         2.9454e-06]]) tensor(338)
tensor([[2.0969e-05, 2.8716e-06, 2.5344e-07,  ..., 9.7610e-09, 2.8717e-08,
         1.4883e-06]]) tensor(257)
tensor([[2.5479e-05, 1.0630e-05, 9.1050e-07,  ..., 1.8050e-09, 2.3794e-08,
         3.4952e-06]]) tensor(1049)
tensor([[2.9991e-05, 1.2413e-05, 1.4060e-07,  ..., 2.1724e-09, 8.6589e-10,
         1.3379e-06]]) tensor(835)
tensor([[2.8836e-05, 3.7092e-06, 3.5629e-08,  ..., 3.2891e-09, 1.5218e-10,
         4.4855e-07]]) tensor(284)
tensor([[1.4929e-05, 2.5330e-06, 2.2616e-08,  ..., 1.4360e-09, 2.2469e-10,
         6.6996e-07]]) tensor(651)
tensor([[2.8207e-05, 4.6749e-06, 8.3393e-08,  ..., 2.4159e-07, 4.0431e-08,
         7.0419e-06]]) tensor(2067)
tensor([[1.1745e-02, 5.0278e-05, 1.6313e-06,  ..., 8.7674e-09, 8.0207e-10,
         2.4212e-05]]) tensor(351)
tensor([
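
Each entry of `scores` holds one generation step's raw scores over the vocabulary, and softmax turns them into the probabilities printed above. Below is a minimal pure-Python sketch on a made-up four-entry logit vector; the values are illustrative and do not come from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.0]  # illustrative values only
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the list

print([round(p, 3) for p in probs], best)
```

The index with the largest probability is the token that greedy decoding picks at that step.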

In [None]:
for i, beam in enumerate(generated_text_samples['sequences']):
    print(f"Sentence {i+1}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
    print()

Sentence 1: Hi, GPT is a fun tool to use. It's a great way to get started with your project and get started with your project.

I'm going to start by saying that I'm not a big fan of the tool. I've used it a lot, but I've never really used it. I've never really used it. I've never really used it. I've never really used it. I've never really used it. I've never really used it. I
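
The repeated "I've never really used it." is typical of greedy decoding: once the most probable continuation leads back to an earlier state, the same path is taken again. The toy sketch below illustrates this with a hand-written next-token table standing in for the model. The table, its probabilities, and the name `greedy_decode` are all assumptions made for illustration.

```python
# Toy next-token table (illustrative): maps a token to candidate next tokens
# with made-up probabilities. It stands in for the model's softmax output.
toy_model = {
    "I":      {"'ve": 0.6, "am": 0.4},
    "'ve":    {"never": 0.7, "used": 0.3},
    "never":  {"really": 0.8, "used": 0.2},
    "really": {"used": 0.9, "liked": 0.1},
    "used":   {"it": 0.9, "them": 0.1},
    "it":     {".": 0.8, "!": 0.2},
    ".":      {"I": 0.7, "It": 0.3},
}

def greedy_decode(start, max_length):
    """Always pick the most probable next token until max_length is reached."""
    out = [start]
    while len(out) < max_length and out[-1] in toy_model:
        candidates = toy_model[out[-1]]
        out.append(max(candidates, key=candidates.get))
    return out

print(" ".join(greedy_decode("I", 15)))
```

Because the choice at each step is deterministic, the sequence cycles as soon as it revisits a token. Sampling breaks such cycles by occasionally taking a less probable branch.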



# Generate text with top k selection
At each step, the next token is sampled from only the 5 most probable tokens.
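
The top-k idea can be sketched in plain Python: keep the k largest logits, renormalize them, and sample from what remains. This is a simplified illustration and not the library's implementation; the function name `top_k_sample` and the logit values are made up.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalized softmax weights restricted to those k logits.
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [0.5, 3.0, 2.5, -1.0, 2.0, 0.0, 1.5, 2.8]  # illustrative values
samples = [top_k_sample(logits, k=5, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices among the 5 largest logits appear
```

With `k=1` the function always returns the argmax, which is why `top_k = 1` behaves like greedy decoding.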

In [None]:
generated_text_samples = base_model.generate(
    text_ids,
    max_length = 100, # total length in tokens, including the prompt
    top_k = 5, # sample only from the 5 most probable tokens
    do_sample=True, # randomly sample the next token
    num_return_sequences = 5, # return 5 independent samples
    early_stopping = True # only affects beam search; has no effect when sampling
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
for i, beam in enumerate(generated_text_samples):
    print(f"Sentence {i+1}: {base_tokenizer.decode(beam, skip_special_tokens=True)}")
    print()

Sentence 1: Hi, GPT is a fun tool to use.

It's a simple and fast way to create custom templates for a website and you don't have to worry about it all at once. Just click the button below and you're ready to start creating custom templates for your site!

If you'd like to use the tool to create custom templates for your website, you can use the following code snippet from the above codebase to generate the templates:

<script src="http

Sentence 2: Hi, GPT is a fun tool to use. It can be used to find the best value for a given amount, but I would not recommend using it for more than one person. It's a great way for people to compare prices, and I think it's a good way to learn a little more about the business.

I have a couple of questions. First, is it really worth it for a small business, or is it more of a hassle to do a few searches and

Sentence 3: Hi, GPT is a fun tool to use. I've been using it for years and it's a great tool to use to help you get a feel of what your favorite 

# Attention Demo
This is modified from https://morioh.com/p/67e7320b3cef

In [None]:
from bertviz import head_view
from transformers import GPT2Tokenizer, GPT2Model

model_version = 'gpt2'
model = GPT2Model.from_pretrained(model_version, output_attentions=True)
tokenizer = GPT2Tokenizer.from_pretrained(model_version)

text = "The cat sat on the mat. Nothing could move it."
# Calling the tokenizer directly replaces the older encode_plus method
inputs = tokenizer(text, return_tensors='pt')
input_ids = inputs['input_ids']
# With output_attentions=True, the last output is a tuple of per-layer attention tensors
attention = model(input_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

In [None]:
# 12 layers, each with 12 attention heads.
# Each tensor has shape (batch, heads, seq_len, seq_len); this sentence also happens to have 12 tokens.
for layer in attention:
  print(layer.shape)

torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
torch.Size([1, 12, 12, 12])
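
Each of the tensors above holds, for every head, a seq_len x seq_len matrix of attention weights. The sketch below shows how one such matrix could be computed for a single head with scaled dot-product attention and a causal mask. It is pure Python on made-up vectors and leaves out the learned query/key projections the real model applies, so treat it as an illustration of the shape and masking only.

```python
import math

def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask.

    queries, keys: lists of seq_len vectors (one head, no batch dimension).
    Returns a seq_len x seq_len matrix; row i holds how much token i
    attends to tokens 0..i, and later positions are masked to 0.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Score against positions 0..i only: this is the causal mask.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights

# Illustrative 3-token, 2-dimensional example (values are made up).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = causal_attention_weights(q, k)
for row in w:
    print([round(x, 3) for x in row])
```

Every row sums to 1 and the upper triangle is zero, so the first token can only attend to itself.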


In [None]:
from IPython.display import HTML, display
display(HTML('<script src="/static/components/requirejs/require.js"></script>'))
# The two lines above are only needed when running in Colab
head_view(attention, tokens)

<IPython.core.display.Javascript object>