<a href="https://colab.research.google.com/github/rahiakela/transfer-learning-for-natural-language-processing/blob/main/7-deep-transfer-learning-for-nlp/generative_pretrained_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Generative Pretrained Transformer (GPT)

The Generative Pretrained Transformer (GPT) was developed by OpenAI and was
among the earliest models to apply the transformer architecture to the semisupervised learning scenario.

Although GPT became the state of the art for most of the aforementioned tasks, it has generally come to be preferred as a text-generation model. Unlike BERT and its derivatives, which have come to dominate most other tasks, GPT was trained with the causal language modeling (CLM) objective, where the next token is predicted,
as opposed to BERT’s masked language modeling (MLM) objective, a fill-in-the-blanks type
of prediction.
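To make the difference between the two objectives concrete, here is a minimal sketch of the training examples each one would produce from the same sentence. It is only an illustration: whole words stand in for subword tokens, and the helper names and the `[MASK]` placeholder are our own assumptions, not part of any library.

```python
# Toy illustration of the two pretraining objectives. Tokens are plain words
# here, whereas real models operate on subword tokens.
tokens = ["he", "didn't", "want", "to", "talk"]

def clm_examples(tokens):
    """Causal LM: every prefix is a context, and the token that follows is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked LM: hide the chosen positions and predict them using both sides."""
    corrupted = [mask_token if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets

clm_pairs = clm_examples(tokens)
mlm_input, mlm_targets = mlm_example(tokens, masked_positions={2})

for context, target in clm_pairs:
    print(context, "->", target)   # context only ever contains earlier tokens
print(mlm_input, "->", mlm_targets)  # context contains tokens on both sides
```

Note that the CLM contexts never include anything to the right of the target, which is what makes the objective a natural fit for generation.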

BERT is essentially a stacked set of encoders of the original encoder-decoder transformer architecture. GPT is essentially the converse of that, in the sense that it stacks the decoders instead.

<img src='https://github.com/rahiakela/transfer-learning-for-natural-language-processing/blob/main/7-deep-transfer-learning-for-nlp/images/1.png?raw=1' width='800'/>

Besides the encoder-decoder attention, the other distinguishing feature of the transformer decoder is that its self-attention layer is “masked,” that is, future tokens are “masked” when computing attention for any given token.

In the attention calculation, this just means including only the tokens in “he didn’t want to talk about cells” in the calculation and ignoring the rest. The figure below is lightly modified so that you can clearly see the future tokens being masked.


<img src='https://github.com/rahiakela/transfer-learning-for-natural-language-processing/blob/main/7-deep-transfer-learning-for-nlp/images/2.png?raw=1' width='800'/>

This introduces a sense of causality into the system and makes it suitable for text generation, that is, for predicting the next token. Because there is no encoder, the encoder-decoder attention is also dropped.
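The masking itself can be sketched in a few lines. The following is a simplified illustration rather than GPT's actual implementation: it takes a hypothetical square matrix of raw attention scores and applies a row-wise softmax in which position `i` can only see positions `0..i`. The function name and the scores are assumptions made for the example.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i.
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Hypothetical raw scores for a 4-token sequence (all equal, for readability).
scores = [[1.0] * 4 for _ in range(4)]
attn = causal_attention_weights(scores)
for row in attn:
    print([round(w, 2) for w in row])
```

The printed matrix is lower triangular: the first token attends only to itself, while the last token attends to the whole sequence.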

<img src='https://github.com/rahiakela/transfer-learning-for-natural-language-processing/blob/main/7-deep-transfer-learning-for-nlp/images/3.png?raw=1' width='800'/>

Note that the same output can be used for both text prediction/generation and classification for some other task. Indeed, the authors devised an input transformation scheme where multiple tasks could be handled by the same architecture without any architectural changes.
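As a rough sketch of what such an input transformation might look like, the snippet below follows our reading of the scheme described in the original GPT paper: every task is flattened into a single token sequence using start, delimiter, and extract markers. The marker strings and function names here are placeholders chosen for illustration, and real inputs would be tokenized text rather than whole strings.

```python
# Placeholder special tokens; the actual token strings are an assumption.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification(text):
    """Single text: wrap it in start and extract markers."""
    return [START, text, EXTRACT]

def entailment(premise, hypothesis):
    """Text pair with a fixed order: join the two with a delimiter."""
    return [START, premise, DELIM, hypothesis, EXTRACT]

def similarity(text_a, text_b):
    """Text pair without an inherent order: build both orderings."""
    return [entailment(text_a, text_b), entailment(text_b, text_a)]

def multiple_choice(context, answers):
    """One sequence per candidate answer, each scored independently."""
    return [[START, context, DELIM, answer, EXTRACT] for answer in answers]

print(classification("great movie"))
print(entailment("it is raining", "the ground is wet"))
print(len(similarity("a", "b")), len(multiple_choice("context", ["x", "y", "z"])))
```

In every case the model sees an ordinary sequence, so the same stacked decoders can be reused and only the final task-specific layer differs.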

Having briefly introduced the architecture of GPT, let’s use a pretrained version of
it for some fun coding experiments. We first use it to generate some open-ended
text given a prompt. We then also use [DialoGPT](https://arxiv.org/abs/1911.00536), a modification of GPT built at Microsoft, to perform multiturn conversations with a chatbot.

##Setup

In [None]:
!pip -qq install git+https://github.com/huggingface/transformers

In [2]:
from transformers import pipeline
from transformers import AutoModelWithLMHead, AutoTokenizer # you can use these utility classes that automatically load the right classes
from transformers import GPT2LMHeadModel, GPT2Tokenizer # or these more specific classes directly

import torch

##Transformers pipelines for text generation

The first thing we will do in this subsection is generate some open-ended text using GPT. We will also use this opportunity to introduce pipelines—an API exposing the pretrained models in the transformers library for inference.

Let’s start by initializing the transformers pipeline to the GPT-2 model.

In [None]:
gpt = pipeline("text-generation", model="gpt2")

By way of reminder, GPT in its original form is well suited for open-ended text generation, such as creative writing of sections of text to complement previous text. 

Let us see what the model generates when primed with "Somewhere over the rainbow", up to a maximum length of 100 tokens.

In [4]:
gpt("Somewhere over the rainbow", max_length=100)

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Somewhere over the rainbow, the city of Rangoon was filled with people who, with the help of an enormous wind powered cart filled with people, managed to reach the city, with several more people arriving through this route. But here they could see nothing. The city was like a sea of people, but a different and more peaceful one. There was no room to hide. In the end, it was just a mountain over there. Rangoon City's situation had nothing to do"}]

The output is grammatical and reads fluently, even if the message is a bit incoherent. You could imagine a creative writer using this to generate ideas to get around writer’s block!
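Under the hood, generation of this kind proceeds one token at a time. The toy loop below illustrates the idea with a hand-written bigram table standing in for the model; the table, the `<eos>` marker, and the `generate` helper are all hypothetical. It uses greedy decoding and treats `max_length` as a cap on the total sequence, prompt included, which is our understanding of how that argument behaves in the transformers library.

```python
EOS = "<eos>"

# Hypothetical bigram "model": maps the last token to next-token probabilities.
BIGRAMS = {
    "somewhere": {"over": 0.9, "under": 0.1},
    "over": {"the": 1.0},
    "the": {"rainbow": 0.6, "city": 0.4},
    "rainbow": {EOS: 0.7, "the": 0.3},
    "city": {EOS: 1.0},
}

def generate(prompt_tokens, max_length):
    """Greedy decoding: append the most likely next token until EOS or max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        distribution = BIGRAMS.get(tokens[-1], {EOS: 1.0})
        next_token = max(distribution, key=distribution.get)
        if next_token == EOS:
            break  # the model signalled the end of the text
        tokens.append(next_token)
    return tokens

print(generate(["somewhere"], max_length=100))  # stops early at EOS
print(generate(["somewhere"], max_length=3))    # stops at the length cap
```

A real pipeline may sample from the distribution instead of always taking the most likely token, which is one reason repeated calls with the same prompt can return different text.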

Now, let’s see if we can prime the model with something less "creative" and more technical, to see how it will do.

Let’s prime the model with the text "Transfer learning is a field of study".

In [5]:
gpt("Transfer learning is a field of study", max_length=100)

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Transfer learning is a field of study for all students and may only be completed for a limited number of years in the community with a minimal number of extra hours or time (see the section on How to Use Learning at Berkeley with School Year 2006).\n\nThis is a subject that will be discussed separately in the next week's issue of Class of '06.\n\nTo be included in a final report on Class of '06 this year, the following criteria must be met:\n\nYou have"}]

Again, we can see this text is pretty good in terms of semantic coherence, grammatical structure, spelling, punctuation, and so on. Indeed, it is eerily good. However, as it continues, it becomes arguably factually incorrect.



In [6]:
gpt("He didn’t want to talk about cells on the cell phone because he considered it", max_length=100)

Using pad_token, but it is not set yet.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'He didn’t want to talk about cells on the cell phone because he considered it a good idea. If all it would offer were a solution, he said, "then I\'ll have to put it to work."\n\nHe didn\'t want to go further, however. He didn\'t want to talk about how he would pay for the service — just the two of them.\n\nTalks had started to fall apart.\n\nAdvertisement\n\nThen, this week, in February'}]

##Application to chatbots

Intuitively, it seems that one should be able to adapt GPT to
this application without major modification. Luckily for us, the folks at Microsoft already did this via the model DialoGPT,
which was also recently included in the transformers library. Its architecture is
the same as GPT’s, with the addition of special tokens to indicate the end of a participant’s
turn in a conversation. After seeing such a token, we can add the new contribution
of the participant to the priming context text and iteratively repeat the process via
direct application of GPT to generate a response from our chatbot. Naturally, the pretrained
GPT model was fine-tuned on conversational text to make sure the response
would be appropriate.

Let’s go ahead and build a chatbot! We will not use pipelines in this case, because
this model isn’t yet exposed through that API at the time of writing.

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = GPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-medium")

A few things are worth highlighting at this stage. First, note that we are using the GPT-2 model classes.

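Additionally, note that we could have used
the classes AutoModelWithLMHead and AutoTokenizer interchangeably with these
GPT-specific model classes. Be aware that AutoModelWithLMHead is deprecated in recent
releases of the transformers library, which recommend AutoModelForCausalLM for causal language models such as GPT.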

Finally, note that the “LMHead” version of GPT is used here. This means that
the output from the vanilla GPT is passed through one linear layer and one normalization
layer, which together transform it into a vector of probabilities whose dimension
equals the size of the vocabulary. The largest value corresponds to the most
likely next token, provided the model is correctly trained.
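The final step can be sketched with a toy example. This is a simplification under our own assumptions: a single linear projection without bias followed by a softmax, a three-word vocabulary, and made-up weights and hidden state. It is not the library's implementation, only an illustration of how a hidden vector becomes a probability for each vocabulary entry.

```python
import math

def lm_head(hidden, weight):
    """Linear projection: one score (logit) per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def softmax(logits):
    """Normalize logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["hello", "human", "robot"]               # hypothetical vocabulary
hidden = [1.0, 0.0]                               # hypothetical final hidden state
weight = [[0.5, 0.1], [2.0, -1.0], [-1.0, 0.3]]   # one row per vocabulary entry

logits = lm_head(hidden, weight)
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]
print(next_token, [round(p, 3) for p in probs])
```

The probability vector has exactly one entry per vocabulary item, and picking its largest entry gives the most likely next token.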

We first set the maximum number of responses to five. We then encode
the conversational contribution of the user at each turn, append the contribution to
the chat history, and feed that to the loaded pretrained DialoGPT model for generating
the next response.

In [9]:
conversation_length = 5  # chat for five turns
for step in range(conversation_length):
  # encodes the new user input, appends the end-of-sequence token, and returns a PyTorch tensor
  new_user_inputs_ids = tokenizer.encode(input("User: ") + tokenizer.eos_token, return_tensors="pt")
  # appends the new input to the chat history (there is no history yet on the first turn)
  bot_input_ids = torch.cat([chat_history_ids, new_user_inputs_ids], dim=1) if step > 0 else new_user_inputs_ids
  # generates a response; max_length caps the total sequence length, history included
  chat_history_ids = model.generate(bot_input_ids,
                                    max_length=1000,
                                    pad_token_id=tokenizer.eos_token_id)
  # decodes and displays only the newly generated tokens
  print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

User: Hi there
DialoGPT: Hi there
User: How are you today?
DialoGPT: I'm good, how are you?
User: Good! How much money do you have?
DialoGPT: I have about 100k
User: What will you spend it on?
DialoGPT: I'm not sure, I'm not sure what I want to spend it on.
User: Make a decision, life is short.
DialoGPT: I'm going to go with a lot of things


One could easily play with this bot all day! We had a lot of fun asking it various questions and prompting it in various ways.

In [10]:
conversation_length = 5  # chat for five turns
for step in range(conversation_length):
  # encodes the new user input, appends the end-of-sequence token, and returns a PyTorch tensor
  new_user_inputs_ids = tokenizer.encode(input("User: ") + tokenizer.eos_token, return_tensors="pt")
  # appends the new input to the chat history (there is no history yet on the first turn)
  bot_input_ids = torch.cat([chat_history_ids, new_user_inputs_ids], dim=1) if step > 0 else new_user_inputs_ids
  # generates a response; max_length caps the total sequence length, history included
  chat_history_ids = model.generate(bot_input_ids,
                                    max_length=1000,
                                    pad_token_id=tokenizer.eos_token_id)
  # decodes and displays only the newly generated tokens
  print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

User: Hi robot
DialoGPT: Hello human
User: Hum?
DialoGPT: Hello human
User: Huh?
DialoGPT: Hello human
User: OK, what is your name?
DialoGPT: I'm not human
User: All right then.
DialoGPT: I'm not human


It’s quite plausible that the entity at the other end of this short conversation is a human, isn’t it? Does that mean it passes the Turing test?

As you increase the number of allowable conversational turns, you will find the bot
getting stuck in repeated responses that are off-topic. This is analogous to the GPT
open-ended text generation becoming more nonsensical as the length of generated
text increases. One simple way to improve this is to keep a fixed local context size,
where the model is prompted with conversation history only within that context.
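One possible way to implement a fixed local context is sketched below. It is a minimal illustration under our own assumptions: the history is kept as a list of turns, each turn being a list of token IDs that ends with the end-of-sequence ID (50256, as reported in the warnings above), and only the most recent turns are passed on. The helper names and the example token IDs are hypothetical.

```python
EOS_ID = 50256  # end-of-sequence ID reported in the GPT-2 warnings above

def truncate_history(turns, max_turns):
    """Keep only the most recent `max_turns` turns of the conversation."""
    if max_turns <= 0:
        return []
    return turns[-max_turns:]

def flatten(turns):
    """Concatenate the per-turn token IDs into one prompt sequence."""
    return [token_id for turn in turns for token_id in turn]

# Hypothetical token IDs for four turns, each terminated by EOS_ID.
turns = [
    [10, 11, EOS_ID],  # user
    [20, EOS_ID],      # bot
    [30, 31, EOS_ID],  # user
    [40, EOS_ID],      # bot
]

context = flatten(truncate_history(turns, max_turns=2))
print(context)
```

In the chatbot loop, the same idea could be applied by storing each turn's tensor separately and concatenating only the last few with `torch.cat` before calling `model.generate`. Truncating on turn boundaries, rather than at an arbitrary token, keeps every end-of-turn marker intact.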



##Text Generation with EleutherAI GPT-Neo

GPT-Neo from EleutherAI is a recent, smaller, but worthy open source alternative to GPT-3.

It is already available in the transformers library and can be used directly
by setting the model string to one of the model names provided by EleutherAI.

In [None]:
gpt = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

In [None]:
gpt("Transfer learning is a field of study", max_length=100)

In [None]:
gpt("Somewhere over the rainbow", max_length=100)

Upon inspection, you should find its performance better, but
naturally at a significantly higher cost (the weights of the largest model are more than 10 GB in size!).
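A quick back-of-the-envelope calculation shows where a figure like this comes from. The sketch assumes that weights are stored as 32-bit floats (4 bytes each) and takes the nominal parameter counts from the model names, with the 2.7B variant assumed to be the largest; actual checkpoint files may differ somewhat.

```python
def weight_size_gb(num_parameters, bytes_per_parameter=4):
    """Approximate size of the weights in GB, assuming float32 storage by default."""
    return num_parameters * bytes_per_parameter / 1e9

# Nominal parameter counts read off the model names (an approximation).
nominal_parameters = {
    "EleutherAI/gpt-neo-1.3B": 1.3e9,
    "EleutherAI/gpt-neo-2.7B": 2.7e9,
}

sizes = {name: weight_size_gb(count) for name, count in nominal_parameters.items()}
for name, size in sizes.items():
    print(f"{name}: about {size:.1f} GB")
```

Storing the weights in half precision (2 bytes per parameter) would roughly halve these numbers.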