# Text generation with pre-trained Transformers
In this assignment we will work with pre-trained Transformers such as GPT2 to generate text from a given sequence. Transformers aim to address the long-term dependency issue in sequence-to-sequence prediction by using concepts such as self-attention and positional encoding. GPT2 is a language model, pre-trained on text generation, that can be used as a multi-task learner for tasks such as summarization, question answering, and other generation tasks. This assignment focuses on using GPT2 to generate text via greedy decoding and beam search. For more background on beam search, see [Jurafsky and Martin, chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf).

In [None]:
!pip install transformers

Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
import copy
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
sentences = ['I like walking and', 
             'Martha wanted to read a book that',
             'Thomas is studying computer science to',
             'Their friendship inspired',
             'We should take the trash out since',
             'I am not a fan of coffee because',
             'I could not complete my homework by the deadline because',
             'The last semester was much easier due to',
             'I will be painting the walls white so that'
            ]

We apply greedy decoding to get predictions for each sentence here. This function returns the text output of greedy decoding. Modify it to return a tuple (ordered pair) of text and average log-likelihood per word for each sentence.
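To make "average log-likelihood per word" concrete, here is a minimal sketch on a hypothetical toy model. The `TOY_MODEL` table and the `toy_greedy_decode` helper are illustrative assumptions and are not part of the assignment or of GPT2; the point is only that the score is the sum of the log-probabilities of the chosen tokens divided by the number of generated tokens.

```python
import math

# Hypothetical toy "language model": next-token probabilities
# conditioned only on the previous token (an assumption for illustration).
TOY_MODEL = {
    "I": {"like": 0.6, "am": 0.4},
    "like": {"walking": 0.5, "tea": 0.3, "<eos>": 0.2},
    "walking": {"<eos>": 0.9, "and": 0.1},
}

def toy_greedy_decode(prompt, max_length):
    """Greedily extend `prompt`; return (tokens, average log-likelihood)."""
    tokens = list(prompt)
    total_log_prob = 0.0
    generated = 0
    for _ in range(max_length):
        dist = TOY_MODEL.get(tokens[-1])
        if not dist:
            break  # the toy model has no continuation for this token
        # Greedy step: pick the single most probable next token.
        next_token = max(dist, key=dist.get)
        # Accumulate the log-probability of the chosen token.
        total_log_prob += math.log(dist[next_token])
        tokens.append(next_token)
        generated += 1
        if next_token == "<eos>":
            break
    avg = total_log_prob / generated if generated else 0.0
    return tokens, avg

tokens, avg_ll = toy_greedy_decode(["I"], max_length=3)
print(tokens, avg_ll)
```

With GPT2 the same idea would apply to the log-softmax of the logits at the last position, rather than to a lookup table.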

In [None]:
## TODO: Modify this function to return pairs of text and average log-likelihood
## per word for each sentence.
def greedy_decode(sentences, max_length, tokenizer):
  # Greedily extend each sentence by max_length tokens and return
  # (text, average log-likelihood per generated token) pairs.
  texts = []
  for sentence in sentences:
    indexed_tokens = tokenizer.encode(sentence)
    log_pred = 0.0

    for i in range(max_length):
      token_tensors = torch.tensor([indexed_tokens])

      with torch.no_grad():
        output = model(token_tensors)
        predictions = output.logits

      # Convert the next-token logits into log-probabilities.
      log_probs = torch.log_softmax(predictions[0, -1, :], dim=-1)
      predicted_index = torch.argmax(log_probs).item()
      # Accumulate the log-probability of the chosen token, not its index.
      log_pred = log_pred + log_probs[predicted_index].item()
      indexed_tokens = indexed_tokens + [predicted_index]

    predicted_sentence = tokenizer.decode(indexed_tokens)
    avg_loglikelihood = log_pred / max_length
    texts.append((predicted_sentence, avg_loglikelihood))
  return texts

In [None]:
texts = greedy_decode(sentences, max_length=25, tokenizer=tokenizer)
texts

[("I like walking and biking, but I don't like being in a car. I like to be in a car. I like to be in",
  1928.6),
 ('Martha wanted to read a book that she had read in college. She was a little nervous about it, but she was excited about it. She was a little',
  1368.68),
 ('Thomas is studying computer science to become a professor of computer science at the University of California, Berkeley. He is also a member of the Board of Trustees',
  2454.12),
 ("Their friendship inspired him to become a writer and a writer's assistant. He also wrote a book about his life and career.\n\n\n",
  1496.4),
 ('We should take the trash out since it\'s not going to be a problem," he said.\n\n\n"We\'re going to have to do something about',
  523.8),
 ('I am not a fan of coffee because it is not good for you. I am a fan of coffee because it is not good for you.\n\n\nI',
  782.48),
 ('I could not complete my homework by the deadline because I was too busy with my homework to finish it. I was so tired and

You'll be implementing **beam search**, which returns a list of the $k$ most likely output sequences for each sentence. For this assignment, let $k = 8$. For the first token in the generated text, select the top $k$ output tokens. Then, for each subsequent token, find the $k$-best continuations of each of those $k$ hypotheses and keep the $k$-best overall. Return the $k$-best overall hypotheses according to average log-likelihood per word. Note that if we don't average per word, the decoder will simply prefer shorter outputs. As above, return tuples of text output and average log-likelihood.
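The procedure described above can be sketched on a small hypothetical example before applying it to GPT2. The `TOY_MODEL` table and `toy_beam_search` below are illustrative assumptions, not the assignment's solution; the table is chosen so that the greedy first step (`a`) leads to a worse overall sequence than the runner-up (`b`).

```python
import math

# Hypothetical next-token probabilities conditioned on the previous token.
TOY_MODEL = {
    "<s>": {"a": 0.5, "b": 0.4, "c": 0.1},
    "a": {"x": 0.4, "y": 0.3, "z": 0.3},
    "b": {"x": 0.9, "y": 0.1},
    "c": {"x": 1.0},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
    "z": {"<eos>": 1.0},
}

def toy_beam_search(prompt, max_length, k):
    """Return up to k hypotheses as (tokens, average log-likelihood)."""
    beams = [(list(prompt), 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL.get(tokens[-1], {}).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Candidates in one step share a length, so ranking by the summed
        # score matches ranking by the per-token average.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k - len(finished)]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    n = len(prompt)
    # Length-normalize so hypotheses of different lengths are comparable.
    hyps = [(t, s / max(len(t) - n, 1)) for t, s in finished + beams]
    return sorted(hyps, key=lambda h: h[1], reverse=True)

print(toy_beam_search(["<s>"], max_length=2, k=1))  # beam of 1 = greedy
print(toy_beam_search(["<s>"], max_length=2, k=2))
```

In this toy setting, a beam of size 1 follows `a` and then `x` (probability 0.5 × 0.4 = 0.2), while a beam of size 2 also keeps `b` alive and finds `b`, `x` (0.4 × 0.9 = 0.36).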

In [None]:
## TODO: Implement beam search.
def beam_search(sentences, max_length, tokenizer, k=8):
  # Beam search implemented without the use of model.generate().
  # Each hypothesis is (token ids, summed log-likelihood of generated tokens);
  # the final ranking uses the average per-token log-likelihood.
  texts = []
  for s in sentences:
    prompt_len = len(tokenizer.encode(s))
    beams = [(tokenizer.encode(s), 0.0)]
    finished = []

    for step in range(max_length):
      candidates = []
      for tokens, score in beams:
        with torch.no_grad():
          logits = model(torch.tensor([tokens])).logits

        # Log-probabilities of the next token for this hypothesis.
        log_probs = torch.log_softmax(logits[0, -1, :], dim=-1)
        values, idx = torch.topk(log_probs, k, largest=True)

        for value, token_id in zip(values.tolist(), idx.tolist()):
          candidates.append((tokens + [token_id], score + value))

      # All candidates have the same length at this step, so sorting by the
      # summed score is equivalent to sorting by the per-token average.
      candidates.sort(key=lambda c: c[1], reverse=True)

      # Keep the best hypotheses; finished ones (ending in EOS) leave the
      # beam and reduce the number of active slots.
      beams = []
      for tokens, score in candidates[:k - len(finished)]:
        if tokens[-1] == tokenizer.eos_token_id:
          finished.append((tokens, score))
        else:
          beams.append((tokens, score))
      if not beams:
        break

    # Rank finished and still-active hypotheses by average log-likelihood.
    chosen = [(tokenizer.decode(t), sc / max(len(t) - prompt_len, 1))
              for t, sc in finished + beams]
    texts.append(sorted(chosen, key=lambda h: h[1], reverse=True))
  return texts

In [None]:
texts = beam_search(sentences, max_length=20, tokenizer=tokenizer)
texts

[[('I like walking and biking', -5.115832805633545),
  ('I like walking and I', -5.126032829284668),
  ('I like walking and running', -5.126280784606934),
  ('I like walking and talking', -5.133096694946289),
  ('I like walking and playing', -5.164862632751465),
  ('I like walking and doing', -5.173821926116943),
  ('I like walking and going', -5.177611351013184),
  ('I like walking and walking', -5.178885459899902)],
 [('Martha wanted to read a book that she', -5.754121780395508),
  ('Martha wanted to read a book that was', -5.764102935791016),
  ('Martha wanted to read a book that would', -5.770351409912109),
  ('Martha wanted to read a book that had', -5.79383659362793),
  ('Martha wanted to read a book that I', -5.823231220245361),
  ('Martha wanted to read a book that said', -5.836156845092773),
  ('Martha wanted to read a book that might', -5.850336074829102),
  ('Martha wanted to read a book that didn', -5.853640079498291)],
 [('Thomas is studying computer science to become', -5

**TODO:** Record your observations here

In [None]:
texts = beam_search(sentences, max_length=23, tokenizer=tokenizer)
texts

[[('I like walking and biking', -4.448550224304199),
  ('I like walking and I', -4.4574198722839355),
  ('I like walking and running', -4.457635402679443),
  ('I like walking and talking', -4.463562488555908),
  ('I like walking and playing', -4.491184711456299),
  ('I like walking and doing', -4.49897575378418),
  ('I like walking and going', -4.502270698547363),
  ('I like walking and walking', -4.503378391265869)],
 [('Martha wanted to read a book that she', -5.003584384918213),
  ('Martha wanted to read a book that was', -5.012263298034668),
  ('Martha wanted to read a book that would', -5.017696857452393),
  ('Martha wanted to read a book that had', -5.038118839263916),
  ('Martha wanted to read a book that I', -5.063679218292236),
  ('Martha wanted to read a book that said', -5.0749192237854),
  ('Martha wanted to read a book that might', -5.087248802185059),
  ('Martha wanted to read a book that didn', -5.090121746063232)],
 [('Thomas is studying computer science to become', -4.

For further exploration, you can experiment with $k$ to see how the fluency of text changes.