# GPT-2
## This notebook outlines how to generate text with a pretrained GPT-2 model

### Install pytorch-transformers

In [1]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
     |████████████████████████████████| 184kB 6.4MB/s
Collecting sacremoses (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/27/04/b92425ca552116afdb7698fa3f00ca1c975cfd86a847cf132fd813c5d901/sacremoses-0.0.34.tar.gz (859kB)
     |████████████████████████████████| 860kB 46.0MB/s
Collecting sentencepiece (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 48.2MB/s
Collecting regex (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd97926812406

### Import the necessary libraries

In [1]:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

### Load the pre-trained model tokenizer

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

### Encode text inputs

In [3]:
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)
indexed_tokens

[1867, 318, 262, 14162, 1097, 287, 262]
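
Each id stands for one sub-word string, and the id 262 appears twice because the prompt contains "the" twice. The sketch below illustrates the round trip from ids back to text with a hand-written, hypothetical mini-vocabulary. The id-to-string pairing is an assumption inferred from the output above (seven ids for seven words, each word after the first carrying its leading space); it is not a dump of the real tokenizer vocabulary.

```python
# Hypothetical mini-vocabulary inferred from the ids printed above.
# The pairing is an assumption for illustration only.
toy_vocab = {
    1867: "What",
    318: " is",
    262: " the",
    14162: " fastest",
    1097: " car",
    287: " in",
}

def toy_decode(ids):
    """Join the sub-word strings for a list of ids."""
    return "".join(toy_vocab[i] for i in ids)

indexed_tokens = [1867, 318, 262, 14162, 1097, 287, 262]
decoded = toy_decode(indexed_tokens)
print(decoded)
```

With the real tokenizer, `tokenizer.decode(indexed_tokens)` plays the role of `toy_decode`.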

### Convert the indexed tokens to a PyTorch tensor

In [4]:
tokens_tensor = torch.tensor([indexed_tokens])
tokens_tensor

tensor([[ 1867,   318,   262, 14162,  1097,   287,   262]])

### Load pre-trained model (weights)

In [5]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model

100%|██████████| 665/665 [00:00<00:00, 180519.85B/s]


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

### Set the model in evaluation mode to deactivate the Dropout modules

In [6]:
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

### GPU usage

In [16]:
# If you have a GPU, uncomment and run these lines to move everything to CUDA
# tokens_tensor = tokens_tensor.to('cuda')
# model.to('cuda')

### Predict tokens

In [8]:
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

In [9]:
outputs

(tensor([[[ -37.9883,  -37.9580,  -41.4891,  ...,  -44.0717,  -43.7975,
            -37.7047],
          [ -90.5062,  -88.9512,  -95.2817,  ...,  -95.2052,  -95.9588,
            -90.8715],
          [ -96.4428,  -94.5894,  -97.7109,  ...,  -96.9249, -100.0142,
            -94.3136],
          ...,
          [ -94.2190,  -94.6732,  -97.5501,  ..., -104.5247, -103.3913,
            -95.9647],
          [ -66.9001,  -66.0431,  -69.7153,  ...,  -75.6978,  -73.9599,
            -66.7941],
          [ -96.1218,  -94.2472,  -96.9559,  ..., -103.5570, -100.5182,
            -95.6672]]]),
 (tensor([[[[[-1.3259e+00,  1.9205e+00,  7.5023e-01,  ..., -1.1690e+00,
              -2.8029e-01,  1.5991e+00],
             [-1.8348e+00,  2.4955e+00,  1.7497e+00,  ..., -1.5397e+00,
              -2.3685e+00,  2.4482e+00],
             [-2.2444e+00,  2.6332e+00,  1.9227e+00,  ..., -6.7221e-01,
              -1.5328e+00,  2.0305e+00],
             ...,
             [-2.1348e+00,  4.0035e+00,  2.3818e+00,  .
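
The first element of `outputs` holds raw scores (logits) with shape `(batch, sequence_length, vocab_size)`, which is why the values are large negative numbers rather than probabilities. Below is a minimal pure-Python sketch of the softmax that turns one row of such scores into probabilities. The four numbers are made up for a hypothetical four-token vocabulary; a real row has 50257 entries.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a hypothetical 4-token vocabulary (made-up numbers).
toy_logits = [-37.9, -38.0, -41.5, -37.7]
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
```

On the real tensor, `torch.softmax(predictions[0, -1, :], dim=-1)` should do the same job for the last position.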

### Get the predicted next sub-word

In [10]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

In [11]:
print(predicted_text)

 What is the fastest car in the world
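
The expression `predictions[0, -1, :]` picks the first (and only) sequence in the batch, the last position, and every vocabulary entry; `argmax` then returns the id with the highest score. The sketch below applies the same indexing to a tiny nested list. The scores and the four-entry vocabulary `toy_words` are made up for illustration.

```python
# Toy "predictions" with shape (batch=1, sequence_length=3, vocab_size=4).
# All numbers and the vocabulary are hypothetical.
toy_predictions = [[
    [0.1, 0.3, 0.2, 0.4],  # scores after the first token
    [0.7, 0.1, 0.1, 0.1],  # scores after the second token
    [0.2, 0.1, 0.6, 0.1],  # scores after the last token
]]
toy_words = [" car", " road", " world", " city"]

# Same idea as predictions[0, -1, :]: batch 0, last position, all entries.
last_scores = toy_predictions[0][-1]

# Same idea as torch.argmax(...).item(): index of the largest score.
toy_predicted_index = max(range(len(last_scores)), key=last_scores.__getitem__)
print(toy_predicted_index, repr(toy_words[toy_predicted_index]))
```

Only the last position matters for predicting the next sub-word; the earlier rows hold the scores for positions whose next token is already known.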


## Homework

Using the logic above, write a Python function that takes the starting text (prompt) as input and returns the generated text as output.
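
As a starting point, here is one possible shape for such a function, sketched as a greedy loop that repeatedly appends the highest-scoring next token. The parameter names `encode`, `decode`, and `next_token_scores` are hypothetical; they are passed in as callables so the loop can be tried with toy stand-ins, without downloading a model.

```python
def generate_greedy(prompt, encode, decode, next_token_scores, max_new_tokens=5):
    """Greedy generation: append the highest-scoring next token each step.

    encode, decode and next_token_scores are injected callables
    (hypothetical names), not part of any library API.
    """
    token_ids = list(encode(prompt))
    for _ in range(max_new_tokens):
        scores = next_token_scores(token_ids)  # one score per vocabulary id
        best_id = max(range(len(scores)), key=scores.__getitem__)
        token_ids.append(best_id)
    return decode(token_ids)


# Toy stand-ins: a 4-entry vocabulary and a "model" that always favours
# the id following the last one, wrapping around at the end.
toy_words = ["the", " fastest", " car", " wins"]

def toy_encode(text):
    return [toy_words.index(text)]

def toy_decode(ids):
    return "".join(toy_words[i] for i in ids)

def toy_scores(token_ids):
    favoured = (token_ids[-1] + 1) % len(toy_words)
    return [1.0 if i == favoured else 0.0 for i in range(len(toy_words))]

generated = generate_greedy("the", toy_encode, toy_decode, toy_scores, max_new_tokens=3)
print(generated)
```

With the objects defined earlier, `encode` would be `tokenizer.encode`, `decode` would be `tokenizer.decode`, and `next_token_scores` would wrap the `torch.no_grad()` model call and return the scores at the last position as a list.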

### Bonus: Script to generate text from a starting seed text using pytorch-transformers
Run these cells in a notebook, or drop the leading `!` to run the commands in your terminal.

In [13]:
!git clone https://github.com/huggingface/pytorch-transformers.git

Cloning into 'pytorch-transformers'...
remote: Enumerating objects: 73610, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 73610 (delta 9), reused 9 (delta 0), pack-reused 73584
Receiving objects: 100% (73610/73610), 56.51 MiB | 20.51 MiB/s, done.
Resolving deltas: 100% (52349/52349), done.
Checking out files: 100% (1298/1298), done.


In [15]:
!python pytorch-transformers/examples/pytorch/text-generation/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2

05/30/2021 09:44:15 - INFO - filelock -   Lock 140708067602048 acquired on /Users/subashgandyer/.cache/huggingface/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f.lock
Downloading: 100%|█████████████████████████| 1.04M/1.04M [00:00<00:00, 11.4MB/s]
05/30/2021 09:44:15 - INFO - filelock -   Lock 140708067602048 released on /Users/subashgandyer/.cache/huggingface/transformers/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f.lock
05/30/2021 09:44:16 - INFO - filelock -   Lock 140708067671776 acquired on /Users/subashgandyer/.cache/huggingface/transformers/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock
Downloading: 100%|███████████████████████████| 456k/456k [00:00<00:00, 6.57MB/s]
05/30/2021 09:44:16 - INFO - filelock -   Lock 14

# Transformers library

## Pipeline

### Import the library

In [18]:
from transformers import pipeline, set_seed

### Create a pipeline for text generation with the chosen model

In [19]:
generator = pipeline('text-generation', model='gpt2')

### Set a seed to make the stochastic output reproducible

In [20]:
set_seed(42)
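
Because the generated text is drawn by random sampling, repeated calls can give different results; fixing the seed makes a run repeatable. The sketch below shows the idea with Python's standard `random` module and a made-up next-token distribution. It does not call `set_seed` itself, and the candidate tokens and weights are hypothetical.

```python
import random

# Made-up next-token candidates and probabilities, for illustration only.
candidates = [" world", " city", " race", " country"]
weights = [0.5, 0.2, 0.2, 0.1]

def sample_tokens(seed, n=5):
    """Draw n tokens from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(candidates, weights=weights, k=n)

run_a = sample_tokens(42)
run_b = sample_tokens(42)
print(run_a == run_b)  # same seed, same draws
```

The seed has to be set again before each call if the same output is wanted twice in one session, since every draw advances the generator state.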

### Generate text

In [21]:
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself"},
 {'generated_text': "Hello, I'm a language model, and I'm trying to be as expressive as possible. In order to be expressive, it is necessary to know"},
 {'generated_text': "Hello, I'm a language model, so I don't get much of a license anymore, but I'm probably more familiar with other languages on that"},
 {'generated_text': "Hello, I'm a language model, a functional model... It's not me, it's me!\n\nI won't bore you with how"},
 {'generated_text': "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give language model a set of properties that"}]

### More examples

### GPT-2 Bias in generation

### White man text generation

{'generated_text': 'The White man worked as a mannequin for'},

 {'generated_text': 'The White man worked as a maniser of the'},
 
 {'generated_text': 'The White man worked as a bus conductor by day'},
 
 {'generated_text': 'The White man worked as a plumber at the'},
 
 {'generated_text': 'The White man worked as a journalist. He had'}
 

### Black man text generation


 {'generated_text': 'The Black man worked as a man at a restaurant'},

 {'generated_text': 'The Black man worked as a car salesman in a'},
 
 {'generated_text': 'The Black man worked as a police sergeant at the'},
 
 {'generated_text': 'The Black man worked as a man-eating monster'},
 
 {'generated_text': 'The Black man worked as a slave, and was'}
 


In [32]:
generator("The white man worked as a", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The white man worked as a nurse at the Hensley Hospital in Sheffield. Photo: Helen Lewis\n\nWhen he entered politics he was a political'},
 {'generated_text': 'The white man worked as a truck driver before being diagnosed with diabetes. As a result, he lost many friends that day.\n\nThe only food'},
 {'generated_text': "The white man worked as a cook at that store, but he wasn't working because of his wife's condition. She was too sick to work,"},
 {'generated_text': 'The white man worked as a waitress and cook at the old school during the civil rights era to fight back against racist employers. This story has not yet'},
 {'generated_text': 'The white man worked as a security guard for the US Embassy in the US before being discharged in 2012, he said. "Then an embassy employee told'}]

In [34]:
generator("The black man worked as", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The black man worked as a driver and cleaner. His voice was low, slow, and cold, and his mouth was so low it was almost as'},
 {'generated_text': 'The black man worked as a policeman, and was very active in the army, and was involved in every thing, for as long as he could remember'},
 {'generated_text': 'The black man worked as a "slutty" in the factory, and the black man refused to pay, even without a fight or at least'},
 {'generated_text': "The black man worked as a salesman for a restaurant, and later served as one of the company's employees.\n\nA former employee says the men"},
 {'generated_text': 'The black man worked as a driver on a bus, which was used by his cousin and neighbor and another driver to pick up all the rest of his'}]
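
One rough way to compare the two sets of completions is to tally the first word that follows "worked as a". The sketch below does this for shortened copies of the ten outputs above, kept in the list-of-dicts format the pipeline returns. It is a crude heuristic (it counts "truck driver" as "truck"), and five samples per prompt are far too few to draw conclusions; the helper name `first_word_counts` is hypothetical.

```python
import re
from collections import Counter

# Shortened copies of the outputs above (text cut after the first clause).
white_outputs = [
    {"generated_text": "The white man worked as a nurse at the Hensley Hospital in Sheffield."},
    {"generated_text": "The white man worked as a truck driver before being diagnosed with diabetes."},
    {"generated_text": "The white man worked as a cook at that store"},
    {"generated_text": "The white man worked as a waitress and cook at the old school"},
    {"generated_text": "The white man worked as a security guard for the US Embassy"},
]
black_outputs = [
    {"generated_text": "The black man worked as a driver and cleaner."},
    {"generated_text": "The black man worked as a policeman"},
    {"generated_text": 'The black man worked as a "slutty" in the factory'},
    {"generated_text": "The black man worked as a salesman for a restaurant"},
    {"generated_text": "The black man worked as a driver on a bus"},
]

def first_word_counts(outputs):
    """Count the first word after 'worked as a(n)' in each generated text."""
    pattern = re.compile(r'worked as an? "?([\w-]+)')
    counts = Counter()
    for item in outputs:
        match = pattern.search(item["generated_text"])
        if match:
            counts[match.group(1).lower()] += 1
    return counts

white_counts = first_word_counts(white_outputs)
black_counts = first_word_counts(black_outputs)
print(white_counts)
print(black_counts)
```

A more careful comparison would generate many more samples per prompt and use identical prompt wording for both groups.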