<a href="https://colab.research.google.com/github/srinidhig/QA_chatbot/blob/main/Copy_of_BeamSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with pre-trained Transformers
In this assignment we will work with pre-trained Transformers such as GPT2 to generate text from a given sequence. Transformers address the long-term dependency problem in sequence-to-sequence prediction by using self-attention and positional encoding. GPT2 is a language model, pre-trained to predict the next token, that can be used as a multi-task learner for summarization, question answering, and other generation tasks. This assignment focuses on using GPT2 to generate text via greedy decoding and beam search. For more background on beam search, see [Jurafsky and Martin, chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf).

In [4]:
!pip install transformers

Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [5]:
import copy
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [7]:
sentences = ['I like walking and', 
             'Martha wanted to read a book that',
             'Thomas is studying computer science to',
             'Their friendship inspired',
             'We should take the trash out since',
             'I am not a fan of coffee because',
             'I could not complete my homework by the deadline because',
             'The last semester was much easier due to',
             'I will be painting the walls white so that'
            ]

We apply greedy decoding to get predictions for each sentence here. This function returns the text output of greedy decoding. Modify it to return a tuple (ordered pair) of text and average log-likelihood per word for each sentence.
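
One way to write the average log-likelihood per word is sketched below. The notation is an assumption introduced here, not part of the assignment: $x$ is the input sentence, $y_1, \dots, y_T$ are the $T$ generated tokens (each token is counted as one "word"), and $P$ is the model's softmax probability for the next token.

$$
\mathrm{avg\_ll}(y_{1:T} \mid x) = \frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid x, y_{<t})
$$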

In [8]:
## TODO: Modify this function to return pairs of text and average log-likelihood
## per word for each sentence.
def greedy_decode(sentences, max_length, tokenizer):
  # At each step, pick the highest-scoring next token and
  # accumulate its score for the per-sentence average
  texts = []
  ll_list = []
  for sentence in sentences:
    all_per_sentence = 0
    counter = 0
    predicted_sentence = copy.copy(sentence)
    # Predict one token per iteration until max_length tokens are added
    for i in range(max_length):
      indexed_tokens = tokenizer.encode(predicted_sentence)
      token_tensors = torch.tensor([indexed_tokens])

      with torch.no_grad():
        # Logits have shape (batch, sequence length, vocabulary size)
        predictions = model(token_tensors).logits

      # NOTE: predictions holds raw logits, so the maximum below is an
      # unnormalized score rather than a log-probability; a true
      # log-likelihood would need torch.log_softmax first.
      predicted_index = torch.argmax(predictions[0, -1, :]).item()
      all_per_sentence += torch.max(predictions[0, -1, :])
      counter += 1

      predicted_sentence = tokenizer.decode(indexed_tokens + [predicted_index])
    texts.append(predicted_sentence)
    ll_list.append(all_per_sentence/counter)

  # Pair each generated text with its average score
  return list(zip(texts, ll_list))
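
In `greedy_decode`, `predictions[0, -1, :]` holds the model's raw logits for the next token, and a logit only becomes a log-probability after a log-softmax over the vocabulary. The following is a minimal, model-free sketch of that normalization in plain Python. The helper names and the toy logits are hypothetical, chosen only for illustration. In PyTorch the same step would be `torch.log_softmax(predictions[0, -1, :], dim=-1)`.

```python
import math

def log_softmax(logits):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def average_log_likelihood(step_logits, chosen):
    """Average log-probability of the token chosen at each decoding step."""
    total = 0.0
    for logits, idx in zip(step_logits, chosen):
        total += log_softmax(logits)[idx]
    return total / len(chosen)

# Hypothetical logits for two decoding steps over a 3-token vocabulary
step_logits = [[-100.0, -101.0, -102.0], [-90.0, -95.0, -91.0]]
# Greedy choice: index of the largest logit at each step
chosen = [max(range(len(l)), key=l.__getitem__) for l in step_logits]
avg_ll = average_log_likelihood(step_logits, chosen)
```

Although the toy logits are close to -100, the resulting average log-likelihood is a small negative number, because the log-softmax measures each score relative to the other tokens at the same step.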

In [6]:
texts = greedy_decode(sentences, max_length=25, tokenizer=tokenizer)
texts

["I like walking and biking, but I don't like being in a car. I like to be in a car. I like to be in",
 'Martha wanted to read a book that she had read in college. She was a little nervous about it, but she was excited about it. She was a little',
 'Thomas is studying computer science to become a professor of computer science at the University of California, Berkeley. He is also a member of the Board of Trustees',
 "Their friendship inspired him to become a writer and a writer's assistant. He also wrote a book about his life and career.\n\n\n",
 'We should take the trash out since it\'s not going to be a problem," he said.\n\n\n"We\'re going to have to do something about',
 'I am not a fan of coffee because it is not good for you. I am a fan of coffee because it is not good for you.\n\n\nI',
 'I could not complete my homework by the deadline because I was too busy with my homework to finish it. I was so tired and tired of being bored. I was so tired',
 'The last semester was much easie

In [9]:
texts = greedy_decode(sentences, max_length=25, tokenizer=tokenizer)
texts

[("I like walking and biking, but I don't like being in a car. I like to be in a car. I like to be in",
  tensor(-126.9903)),
 ('Martha wanted to read a book that she had read in college. She was a little nervous about it, but she was excited about it. She was a little',
  tensor(-121.1514)),
 ('Thomas is studying computer science to become a professor of computer science at the University of California, Berkeley. He is also a member of the Board of Trustees',
  tensor(-87.4711)),
 ("Their friendship inspired him to become a writer and a writer's assistant. He also wrote a book about his life and career.\n\n\n",
  tensor(-109.5394)),
 ('We should take the trash out since it\'s not going to be a problem," he said.\n\n\n"We\'re going to have to do something about',
  tensor(-102.8561)),
 ('I am not a fan of coffee because it is not good for you. I am a fan of coffee because it is not good for you.\n\n\nI',
  tensor(-91.6273)),
 ('I could not complete my homework by the deadline because I

You'll be implementing **beam search**, which returns a list of the $k$ most likely output sequences for each sentence. For this assignment, let $k = 8$. For the first token in the generated text, select the top $k$ output tokens. Then, for each subsequent token, find the $k$-best continuations for each of the $k$ current hypotheses and keep the $k$-best overall. Return the $k$-best overall hypotheses according to average log-likelihood per word. Note that if we don't average per word, the decoder will simply prefer shorter outputs, because every additional token adds a non-positive log-probability to the total. As above, return tuples of text output and average log-likelihood.
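
The procedure described above can be illustrated without GPT2. The sketch below is a toy, not the assignment solution: `TOY_LM` is a hypothetical table of next-token probabilities that depends only on the previous token, and `beam_search_toy` is a name made up for this example. A real implementation would take the log-probabilities from the language model instead.

```python
import math

# Hypothetical next-token log-probabilities, keyed by the previous token.
TOY_LM = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a":   {"a": math.log(0.55), "b": math.log(0.45)},
    "b":   {"a": math.log(0.9), "b": math.log(0.1)},
}

def beam_search_toy(start, steps, k):
    """Return the k best (tokens, average log-likelihood) pairs."""
    beams = [([start], 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, total in beams:
            # Expand each hypothesis with every possible continuation
            for tok, lp in TOY_LM[tokens[-1]].items():
                candidates.append((tokens + [tok], total + lp))
        # Keep only the k best hypotheses overall
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    # Average over the generated tokens only (the start symbol is excluded)
    return [(toks[1:], total / steps) for toks, total in beams]

greedy = beam_search_toy("<s>", steps=2, k=1)  # k = 1 is greedy decoding
beam = beam_search_toy("<s>", steps=2, k=2)
```

With $k = 1$ the sketch commits to `a` at the first step and ends with `a a` (probability 0.33). With $k = 2$ it also keeps `b` alive and finds `b a` (probability 0.36), which greedy decoding misses.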

In [None]:
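## TODO: Implement beam search.
def beam_search(sentences, max_length, tokenizer, k=8):
  # Beam search is implemented without the use of model.generate().
  # Each hypothesis is scored by its average per-word (per-token)
  # log-likelihood over the generated tokens.
  results = []
  for sentence in sentences:
    prompt_tokens = tokenizer.encode(sentence)
    # Each beam is (token ids, summed log-probability of generated tokens)
    beams = [(prompt_tokens, 0.0)]
    # Extend every hypothesis by one token per iteration until max_length
    for i in range(max_length):
      candidates = []
      for indexed_tokens, total_ll in beams:
        token_tensors = torch.tensor([indexed_tokens])

        with torch.no_grad():
          predictions = model(token_tensors).logits

        # Normalize the last position's logits into log-probabilities
        log_probs = torch.log_softmax(predictions[0, -1, :], dim=-1)
        # k best continuations of this hypothesis
        best_ll, best_ind = torch.topk(log_probs, k)
        for ll, ind in zip(best_ll.tolist(), best_ind.tolist()):
          candidates.append((indexed_tokens + [ind], total_ll + ll))

      # Keep the k best hypotheses overall. All candidates have the same
      # length here, so ranking by the sum matches ranking by the average.
      candidates.sort(key=lambda c: c[1], reverse=True)
      beams = candidates[:k]

    # Pair each decoded hypothesis with its average log-likelihood
    num_generated = max(max_length, 1)
    results.append([(tokenizer.decode(tokens), total_ll / num_generated)
                    for tokens, total_ll in beams])
  return results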

**TODO:** Record your observations here

For further exploration, you can experiment with $k$ to see how the fluency of text changes.