## Introduction to the GPT family

In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer, GPT2LMHeadModel
from torch import tensor, numel
from bertviz import model_view

set_seed(42)

In [2]:
# the pipeline object in transformers provides easy access to pretrained transformer models
generator = pipeline('text-generation', model='gpt2')

# finish the sentence
generator("Hello, I'm a language model and I", max_length=30, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model and I don't want to be a computer programmer. As for all those bad languages in JavaScript, I'll never"},
 {'generated_text': "Hello, I'm a language model and I know that's not the right language to build this program of yours. What was wrong with the code?"},
 {'generated_text': "Hello, I'm a language model and I'm trying to work with that much of this, and have this kind of a great conversation about language model"}]

In [3]:
# load up a tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

'sinan' in tokenizer.get_vocab()

False

In [4]:
# encode a string and then convert the ids back into tokens. Note the Ġ character denoting a space before the token
tokenizer.convert_ids_to_tokens(tokenizer.encode('Sinan loves a beautiful day'))

['Sin', 'an', 'Ġloves', 'Ġa', 'Ġbeautiful', 'Ġday']
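
The `Ġ` marker comes from GPT-2's byte-level BPE, which maps the space byte to a printable character. Below is a minimal sketch of undoing it, assuming only spaces need restoring. The helper name `tokens_to_text` is hypothetical; `tokenizer.convert_tokens_to_string` is the library route and handles every byte, not just spaces.

```python
def tokens_to_text(tokens):
    """Join GPT-2 style tokens, turning the 'Ġ' space marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")


# Tokens copied from the cell output above.
tokens = ['Sin', 'an', 'Ġloves', 'Ġa', 'Ġbeautiful', 'Ġday']
text = tokens_to_text(tokens)
print(text)
```

Note that 'Sinan' is split into two sub-word pieces because it is not in the vocabulary, while the common words each map to a single token.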

In [5]:
tokenizer.encode('Sinan loves a beautiful day')  # ids

[46200, 272, 10408, 257, 4950, 1110]

In [6]:
encoded = tokenizer.encode('Sinan loves a beautiful day', return_tensors='pt')  # as a pytorch tensor

encoded

tensor([[46200,   272, 10408,   257,  4950,  1110]])

In [7]:
# load up a tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [8]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro

In [9]:
encoded

tensor([[46200,   272, 10408,   257,  4950,  1110]])

In [10]:
model.transformer.wte(encoded).shape  # 1 item in batch x 6 tokens x token dimension

torch.Size([1, 6, 768])

In [11]:
model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6)).shape  # manually create position vectors

torch.Size([1, 6, 768])

In [12]:
# create GPT input
initial_input = model.transformer.wte(encoded) + model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6))

initial_input.shape

torch.Size([1, 6, 768])
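
The sum above is two table lookups added element-wise: one row per token id and one row per position. The following is a rough sketch in plain Python with toy sizes; the table sizes, the token ids, and the random initialization are all assumptions made for illustration, not GPT-2's trained values.

```python
import random

random.seed(0)
vocab_size, max_positions, dim = 20, 8, 4  # toy sizes, not GPT-2's 50257 / 1024 / 768

# Two lookup tables: one row per token id, one row per position.
wte = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(max_positions)]

token_ids = [3, 7, 7, 1]  # hypothetical ids; id 7 appears twice on purpose

# The input at each position is the token's row plus the position's row.
inputs = [
    [t + p for t, p in zip(wte[tok], wpe[pos])]
    for pos, tok in enumerate(token_ids)
]

print(len(inputs), len(inputs[0]))  # sequence length, embedding dimension
print(inputs[1] == inputs[2])       # same token at different positions
```

The repeated token ends up with two different input vectors, which is how the model can tell positions apart even though attention itself is order-agnostic.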

In [13]:
initial_input = model.transformer.drop(initial_input)  # run our input through the model's initial dropout layer
initial_input

tensor([[[ 0.0107, -0.2453,  0.1275,  ..., -0.1969,  0.0006,  0.1539],
         [-0.1013, -0.0894, -0.0378,  ..., -0.0534, -0.0527,  0.0046],
         [-0.0716, -0.1690,  0.0386,  ..., -0.2034, -0.0197, -0.1113],
         [-0.0509, -0.0682,  0.1526,  ...,  0.0527,  0.0912, -0.0455],
         [ 0.0612,  0.1425,  0.1402,  ...,  0.0964,  0.0510,  0.1474],
         [-0.1283, -0.0632,  0.1287,  ..., -0.0907, -0.0655,  0.1085]]],
       grad_fn=<AddBackward0>)

In [14]:
model.lm_head

Linear(in_features=768, out_features=50257, bias=False)

In [15]:
for module in model.transformer.h:  # run the initial_input through every decoder in the stack
    initial_input = module(initial_input)[0]
    
initial_input = model.transformer.ln_f(initial_input)  # and then the final layer norm

In [16]:
initial_input

tensor([[[ 0.0542, -0.0179, -0.3388,  ..., -0.0948, -0.1067,  0.0129],
         [-0.4805,  0.1008, -0.7313,  ...,  0.0471, -0.4113,  0.0902],
         [ 0.0344, -0.2259, -0.5293,  ..., -0.1202,  0.1355,  0.2287],
         [-0.2374,  0.1787,  0.1845,  ..., -0.4057, -0.3617, -0.1861],
         [ 0.0235,  0.1212, -1.0182,  ..., -0.0597,  0.0020, -0.2220],
         [ 0.1206, -0.5034, -1.5260,  ..., -0.3367, -0.2821, -0.0410]]],
       grad_fn=<NativeLayerNormBackward0>)

In [17]:
# same as just running through the model
(initial_input == model(encoded, output_hidden_states=True).hidden_states[-1]).all()

tensor(True)

In [18]:
total_params = 0
for param in model.parameters():
    total_params += numel(param)
    
print(f'Number of params: {total_params:,}')

Number of params: 124,439,808
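
The parameter count can be cross-checked by hand from the layer shapes. This is a sketch that assumes the standard GPT-2 small layout: 12 blocks, a fused query/key/value projection of width 3 x 768, an MLP hidden width of 4 x 768, and an `lm_head` that shares its weight matrix with `wte` so it is counted only once. The helper `linear` is hypothetical.

```python
vocab, n_positions, d, n_layers = 50257, 1024, 768, 12


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * d  # scale and shift vectors

block = (
    layer_norm            # ln_1
    + linear(d, 3 * d)    # attn.c_attn (queries, keys, values in one matrix)
    + linear(d, d)        # attn.c_proj
    + layer_norm          # ln_2
    + linear(d, 4 * d)    # mlp.c_fc
    + linear(4 * d, d)    # mlp.c_proj
)

total = (
    vocab * d             # wte (also reused by lm_head, assumed tied)
    + n_positions * d     # wpe
    + n_layers * block    # the decoder stack
    + layer_norm          # ln_f
)

print(f'Number of params: {total:,}')
```

Under these assumptions the arithmetic lands on the same 124,439,808 reported by the loop above, with roughly 31% of the parameters sitting in the token embedding table.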


## Masked multi-headed attention

In [19]:
import torch
import pandas as pd


In [20]:
phrase = 'My friend was right about this class. It is so fun!'
encoded_phrase = tokenizer(phrase, return_tensors='pt')

response = model(**encoded_phrase, output_attentions=True, output_hidden_states=True)

len(response.attentions)

12

In [21]:
encoded_phrase

{'input_ids': tensor([[3666, 1545,  373,  826,  546,  428, 1398,   13,  632,  318,  523, 1257,
            0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [22]:
response.attentions[-1].shape  # attention weights from the final decoder: batch x heads x tokens x tokens

torch.Size([1, 12, 13, 13])

In [23]:
encoded_phrase['input_ids'].shape

torch.Size([1, 13])

In [24]:
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])

tokens

['My',
 'Ġfriend',
 'Ġwas',
 'Ġright',
 'Ġabout',
 'Ġthis',
 'Ġclass',
 '.',
 'ĠIt',
 'Ġis',
 'Ġso',
 'Ġfun',
 '!']

In [25]:
# Layer index 9, head 0. Note that the token "It" gives almost 60% of its attention to the token "class"
arr = response.attentions[9][0][0]

n_digits = 3

attention_df = pd.DataFrame((torch.round(arr * 10**n_digits) / (10**n_digits)).detach()).applymap(float)

attention_df.columns = tokens
attention_df.index = tokens

attention_df


Unnamed: 0,My,Ġfriend,Ġwas,Ġright,Ġabout,Ġthis,Ġclass,.,ĠIt,Ġis,Ġso,Ġfun,!
My,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġfriend,0.968,0.032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġwas,0.824,0.145,0.031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġright,0.979,0.008,0.007,0.005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġabout,0.979,0.008,0.004,0.005,0.005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġthis,0.924,0.031,0.007,0.006,0.016,0.016,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġclass,0.946,0.005,0.001,0.001,0.001,0.002,0.044,0.0,0.0,0.0,0.0,0.0,0.0
.,0.691,0.013,0.003,0.003,0.002,0.006,0.269,0.013,0.0,0.0,0.0,0.0,0.0
ĠIt,0.318,0.003,0.003,0.003,0.006,0.018,0.599,0.018,0.032,0.0,0.0,0.0,0.0
Ġis,0.331,0.006,0.002,0.002,0.003,0.018,0.533,0.013,0.062,0.03,0.0,0.0,0.0
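
The zeros above the diagonal come from the causal mask: each token can only attend to itself and to earlier tokens, and every row still sums to 1. Here is a minimal sketch of that masked softmax in plain Python, using made-up scores. The function name `causal_attention_weights` is hypothetical, and slicing each row is a simplification; real implementations typically push the masked scores to a very large negative value before the softmax, which gives the same result.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where position i only sees positions 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]          # drop scores for future tokens
        m = max(visible)                      # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights


# Made-up raw attention scores for a 3-token sequence.
scores = [
    [0.5, 2.0, -1.0],
    [1.0, 0.0, 3.0],
    [0.2, 0.2, 0.2],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

As in the table, the first token has nowhere else to look, so it puts all of its attention on itself.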


In [26]:
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0]) 
model_view(response.attentions, tokens)

<IPython.core.display.Javascript object>

In [27]:
response.hidden_states[-1].shape

torch.Size([1, 13, 768])

In [28]:
response.logits.shape

torch.Size([1, 13, 50257])

In [29]:
# look at the top next token in the auto-regressive language modelling task
pd.DataFrame(
    zip(tokens, tokenizer.convert_ids_to_tokens(response.logits.argmax(2)[0])), 
    columns=['Sequence up until', 'Next token with highest probability']
)

Unnamed: 0,Sequence up until,Next token with highest probability
0,My,Ċ
1,Ġfriend,","
2,Ġwas,Ġa
3,Ġright,.
4,Ġabout,Ġthat
5,Ġthis,.
6,Ġclass,.
7,.,ĠI
8,ĠIt,'s
9,Ġis,Ġa


In [30]:
generator('My friend', max_length=4, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'My friend and I'},
 {'generated_text': 'My friend who is'},
 {'generated_text': 'My friend and I'},
 {'generated_text': 'My friend, you'},
 {'generated_text': 'My friend, a'}]

In [31]:
generator(phrase, max_length=20, num_return_sequences=1, do_sample=False)  # greedy search

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'My friend was right about this class. It is so fun! I love it! I love the'}]
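
Greedy search simply appends the highest-scoring next token until the length limit is reached. The sketch below shows the loop with a toy lookup table standing in for the model. `TOY_LOGITS` and `greedy_decode` are hypothetical, the scores are invented, and unlike GPT-2 the toy only conditions on the last token rather than on the whole sequence.

```python
# Hypothetical next-token scores standing in for a real language model.
TOY_LOGITS = {
    "My": {"friend": 2.0, "dog": 1.5},
    "friend": {"and": 1.2, "was": 1.0, ",": 0.8},
    "and": {"I": 3.0, "my": 0.5},
    "I": {"love": 1.0, "am": 0.9},
}


def greedy_decode(prompt_tokens, max_length):
    """Extend the prompt one token at a time, always taking the top-scoring token."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:       # max_length counts the prompt tokens too
        candidates = TOY_LOGITS.get(tokens[-1])
        if not candidates:                # the toy table has no continuation
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


result = greedy_decode(["My", "friend"], max_length=4)
print(" ".join(result))
```

Because nothing is random, calling the function twice with the same prompt always returns the same continuation.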

In [32]:
generator(phrase, max_length=20, num_return_sequences=1, do_sample=True)  # sampling from the predicted distribution instead of greedy search

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'My friend was right about this class. It is so fun! How can I do it and it'}]
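
With `do_sample=True` the next token is drawn from the predicted distribution rather than taken as the argmax, and a `temperature` value rescales the logits before that draw. What follows is a small sketch of both ideas with invented logits. The helper names are hypothetical, and this is not the library's internal code.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_index(probs, rng):
    """Draw one index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]                       # invented scores for three candidate tokens
rng = random.Random(0)

sharp = softmax_with_temperature(logits, 0.1)  # low temperature: close to greedy
flat = softmax_with_temperature(logits, 2.0)   # high temperature: closer to uniform

greedy_choice = max(range(len(logits)), key=logits.__getitem__)
sampled = [sample_index(flat, rng) for _ in range(10)]

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(greedy_choice, sampled)
```

A low temperature concentrates nearly all probability on the top token, so sampling behaves almost like greedy search, while a high temperature spreads probability out and makes the output more varied.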

## Pre-training GPT

In [33]:
from transformers import pipeline, set_seed
from torch import tensor

generator = pipeline('text-generation', model='gpt2', tokenizer=tokenizer)
set_seed(0)

In [34]:
# Bias
generator("The holocaust was", max_length=10, num_return_sequences=10, temperature=0.8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The holocaust was a cruel and degrading and brutal'},
 {'generated_text': 'The holocaust was created in order to save the'},
 {'generated_text': 'The holocaust was a crime carried out by Jews'},
 {'generated_text': 'The holocaust was perpetrated on the Jews who died'},
 {'generated_text': 'The holocaust was never officially discovered.\n\n'},
 {'generated_text': 'The holocaust was brought to an end on June'},
 {'generated_text': 'The holocaust was an event that happened in the'},
 {'generated_text': 'The holocaust was the product of human sin.'},
 {'generated_text': 'The holocaust was "the most horrible crime in'},
 {'generated_text': 'The holocaust was the worst in history. The'}]

In [35]:
generator("Jewish people are", max_length=10, num_return_sequences=10, temperature=0.8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Jewish people are not the primary target. They are'},
 {'generated_text': 'Jewish people are not allowed to be on the same'},
 {'generated_text': 'Jewish people are in the U.S. illegally'},
 {'generated_text': 'Jewish people are being persecuted by Muslim extremists."\n'},
 {'generated_text': "Jewish people are not just a minority – they're"},
 {'generated_text': 'Jewish people are a real danger." He said:'},
 {'generated_text': 'Jewish people are not the same as Jews. They'},
 {'generated_text': 'Jewish people are a tiny minority. These are people'},
 {'generated_text': 'Jewish people are called "foreigners" or "'},
 {'generated_text': 'Jewish people are often treated as terrorists. For example'}]

In [36]:
generator("Christian people are", max_length=10, num_return_sequences=10, temperature=0.8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Christian people are dying. There is so much to'},
 {'generated_text': 'Christian people are an integral part of the culture,'},
 {'generated_text': 'Christian people are not entitled to participate in the life'},
 {'generated_text': 'Christian people are dying, not just from a lack'},
 {'generated_text': 'Christian people are not happy that this is happening."'},
 {'generated_text': 'Christian people are dying due to this."\n\n'},
 {'generated_text': 'Christian people are much more likely to be atheists.'},
 {'generated_text': 'Christian people are not a different class from non-'},
 {'generated_text': 'Christian people are not the only ones who believe that'},
 {'generated_text': 'Christian people are not ready to accept the fact that'}]

In [37]:
generator("The earth is", max_length=10, num_return_sequences=10, temperature=0.8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The earth is a big, thick, flat,'},
 {'generated_text': 'The earth is flat, and the planet is flat'},
 {'generated_text': 'The earth is very flat, so we can see'},
 {'generated_text': 'The earth is a round and cold place and no'},
 {'generated_text': 'The earth is covered with clouds; it is not'},
 {'generated_text': 'The earth is round, and it is shaped like'},
 {'generated_text': 'The earth is falling apart.\n\nThe earth'},
 {'generated_text': 'The earth is small and is far away from the'},
 {'generated_text': 'The earth is a very fragile thing. We must'},
 {'generated_text': 'The earth is flat, but the stars are still'}]

## Few-shot learning

In [38]:
print(generator("""Sentiment Analysis
Text: I hate it when my phone battery dies.
Sentiment: Negative
###
Text: My day has been really great!
Sentiment: Positive
###
Text: Not a fan when it is cloudy
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentiment Analysis
Text: I hate it when my phone battery dies.
Sentiment: Negative
###
Text: My day has been really great!
Sentiment: Positive
###
Text: Not a fan when it is cloudy
Sentiment: Negative
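
A few-shot prompt is just the task name, a handful of solved examples, and an unsolved query, all in one consistent layout. One way to assemble it programmatically is sketched here; the helper `build_few_shot_prompt` and its parameters are hypothetical, and the layout mirrors the prompt used above.

```python
def build_few_shot_prompt(task, examples, query, separator="###"):
    """Lay out solved (text, label) examples followed by an unlabeled query."""
    parts = [task]
    for text, label in examples:
        parts += [f"Text: {text}", f"Sentiment: {label}", separator]
    parts += [f"Text: {query}", "Sentiment:"]  # the model is left to fill in the label
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Sentiment Analysis",
    [
        ("I hate it when my phone battery dies.", "Negative"),
        ("My day has been really great!", "Positive"),
    ],
    "Not a fan when it is cloudy",
)
print(prompt)
```

Keeping the examples in a list makes it easy to vary how many shots are shown; passing an empty list yields the zero-shot version of the same prompt.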



In [39]:
print(generator("""Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company which develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:""", top_k=5, max_length=215, temperature=0.5)[0]['generated_text'])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company which develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A: AFC West
Q: What are the
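
The model keeps generating after the answer (here it starts a new question), so the raw text usually needs trimming. The sketch below shows a possible post-processing step, assuming the generated text begins with the prompt, as it does in the output above. The helper `extract_answer` is hypothetical, and the prompt is shortened to its last two lines for brevity.

```python
def extract_answer(generated_text, prompt, stop="\n"):
    """Return the text generated after the prompt, cut at the first stop string."""
    completion = generated_text[len(prompt):]
    return completion.split(stop, 1)[0].strip()


# Shortened stand-ins for the prompt and the pipeline output shown above.
prompt = "Q: What division do the Jets play in?\nA:"
generated = prompt + " AFC West\nQ: What are the"

answer = extract_answer(generated, prompt)
print(answer)
```

Trimming only cleans up the format. The extracted answer here is still wrong, since the context says the Jets play in the AFC East.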


In [40]:
## Zero-shot learning

In [41]:
# Same question as before, but with no solved examples in the prompt, i.e. zero-shot learning. Results are hit or miss
print(generator(
    '''Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:''',
    top_k=5, max_length=80, temperature=0.5)[0]['generated_text']
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A: The Jets are a professional American football team based in the


In [42]:
# Zero-shot doesn't work as well with the sentiment analysis example
print(generator("""Sentiment Analysis
Text: This new music video was so good
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentiment Analysis
Text: This new music video was so good
Sentiment: This new music video was so good
Sentiment: This new music video was so good
Sentiment: This new music video was so good
Sentiment: This new music video was


In [43]:
# Zero-shot abstractive summarization

In [44]:
to_summarize = """This training will focus on how the GPT family of models are used for NLP tasks including abstractive text summarization and natural language generation. The training will begin with an introduction to necessary concepts including masked self attention, language models, and transformers and then build on those concepts to introduce the GPT architecture. We will then move into how GPT is used for multiple natural language processing tasks with hands-on examples of using pre-trained GPT-2 models as well as fine-tuning these models on custom corpora.

GPT models are some of the most relevant NLP architectures today and it is closely related to other important NLP deep learning models like BERT. Both of these models are derived from the newly invented transformer architecture and represent an inflection point in how machines process language and context.

The Natural Language Processing with Next-Generation Transformer Architectures series of online trainings provides a comprehensive overview of state-of-the-art natural language processing (NLP) models including GPT and BERT which are derived from the modern attention-driven transformer architecture and the applications these models are used to solve today. All of the trainings in the series blend theory and application through the combination of visual mathematical explanations, straightforward applicable Python examples within hands-on Jupyter notebook demos, and comprehensive case studies featuring modern problems solvable by NLP models. (Note that at any given time, only a subset of these classes will be scheduled and open for registration.)"""

In [45]:
print(generator(
    f"""Summarization Task:\n{to_summarize}\nTL;DR:""", 
    max_length=400, temperature=0.7
)[0]['generated_text'].split('TL;DR:')[1])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 Training is about taking a simple NLP model and applying it to a deep learning RNN and getting it to the point where it can be used in real-world tasks. It is also about getting a better understanding of the NLP architecture and the process by which it is applied to NLP task training.

Instructions for the training are available at: http://www.neuron.com/training/
