**GPT-2:**

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across many domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
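
As a rough illustration of the next-word objective described above, the sketch below turns a sentence into (context, next word) training pairs. It splits on whitespace purely for readability, which is an assumption: GPT-2 itself works on sub-word tokens, and the function name next_word_pairs is hypothetical, not part of any library.

```python
def next_word_pairs(text):
    """Build (context, target) pairs for next-word prediction.

    Whitespace splitting is a simplification; GPT-2 uses sub-word tokens.
    """
    words = text.split()
    # For every position i, the model sees words[:i] and must predict words[i].
    return [(words[:i], words[i]) for i in range(1, len(words))]


for context, target in next_word_pairs("the cat sat down"):
    print(context, "->", target)
```

Each pair is one prediction the model is trained on, so a single sentence already yields several training signals.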

🤗 (Hugging Face) Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.

Basically, Hugging Face Transformers is a large Python package of pre-trained models, pipelines, and helper functions that we can use for our natural language processing tasks.

The GPT-2 tokenizer and models are also included in 🤗 Transformers.


**CODE**


**STEP 1:**

Import GPT2LMHeadModel for text generation and GPT2Tokenizer for tokenizing the text.

In [3]:
!python -m pip install transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully u

**STEP 2:** 

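We will now load the tokenizer and the model in our notebook. Note that 'gpt2-large' is the 774-million-parameter checkpoint; the full 1.5-billion-parameter model is published as 'gpt2-xl'.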

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
# GPT-2 defines no padding token, so reuse the end-of-sequence token for padding
model = GPT2LMHeadModel.from_pretrained('gpt2-large',
                                        pad_token_id=tokenizer.eos_token_id)


**STEP 3:**

For text generation, we first feed some starting text (a prompt) to the model, and the model then continues from that text.
Before that, we have to preprocess the prompt, so in this step we tokenize it.

The call to tokenizer.encode converts the text to token ids and returns them as a tensor; return_tensors="pt" requests PyTorch tensors.
Each piece of the text is mapped to its integer index in the tokenizer's vocabulary.

In [5]:
sentence = "What is Love?"
input_ids = tokenizer.encode(sentence, return_tensors="pt")

In [6]:
input_ids

tensor([[2061,  318, 5896,   30]])

The decode function converts those ids back into text.

In [7]:
tokenizer.decode(input_ids[0])

'What is Love?'
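
To make the encode/decode round trip above more concrete, here is a toy sketch with a four-entry vocabulary. The ids are the ones printed above for "What is Love?"; how the sentence splits into the pieces "What", " is", " Love", and "?" is an assumption, and the greedy longest-match lookup is only a stand-in for GPT-2's real byte-level BPE algorithm. The names TOY_VOCAB, toy_encode, and toy_decode are hypothetical and not part of 🤗 Transformers.

```python
# Toy vocabulary: the piece boundaries are assumed, the ids come from the output above.
TOY_VOCAB = {"What": 2061, " is": 318, " Love": 5896, "?": 30}
ID_TO_PIECE = {idx: piece for piece, idx in TOY_VOCAB.items()}


def toy_encode(text):
    """Map text to ids by greedy longest match (a stand-in for real BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        # Pick the longest vocabulary piece that starts at the current position.
        match = max((p for p in TOY_VOCAB if text.startswith(p, pos)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    """Map ids back to text by concatenating their pieces."""
    return "".join(ID_TO_PIECE[i] for i in ids)


ids = toy_encode("What is Love?")
print(ids)              # [2061, 318, 5896, 30]
print(toy_decode(ids))  # What is Love?
```

Note that the leading space belongs to the piece itself (" is", " Love"), which is why decoding can rebuild the original spacing by plain concatenation.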

**STEP 4:**

We will generate the text using the generate function of GPT2LMHeadModel.

In [8]:
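# Beam search with 5 beams; no 2-gram may appear twice in the generated text.
# Note: GPT-2 supports at most 1,024 positions, so max_length=10000 relies on generation stopping earlier.
output = model.generate(input_ids, max_length=10000, num_beams=5,
                        no_repeat_ngram_size=2,
                        early_stopping=True)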

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
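
The generate call above passes num_beams and no_repeat_ngram_size. The sketch below shows, on a toy scale, what these two options do: beam search keeps the best few partial sequences at each step, and n-gram blocking forbids any next token that would repeat an n-gram already in the sequence. This is a simplified illustration under stated assumptions, not the library's implementation: TOY_MODEL is a made-up table of next-word probabilities, all function names are hypothetical, and details such as length penalties and separate bookkeeping of finished hypotheses are left out.

```python
import math

# Made-up next-word probabilities that depend only on the last word (assumption).
TOY_MODEL = {
    "love": {"is": 0.9, "<eos>": 0.1},
    "is": {"love": 0.6, "kind": 0.4},
    "kind": {"<eos>": 1.0},
}


def toy_probs(seq):
    return TOY_MODEL[seq[-1]]


def banned_next_tokens(seq, n):
    """Tokens that would complete an n-gram already present in seq."""
    if n <= 0 or len(seq) < n - 1:
        return set()
    prefix = tuple(seq[len(seq) - n + 1:])  # last n-1 tokens
    return {seq[i + n - 1]
            for i in range(len(seq) - n + 1)
            if tuple(seq[i:i + n - 1]) == prefix}


def beam_search(next_probs, prompt, num_beams, max_new_tokens,
                no_repeat_ngram_size=0, eos="<eos>"):
    beams = [(0.0, list(prompt))]  # (log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:  # finished sequences are carried over unchanged
                candidates.append((score, seq))
                continue
            banned = banned_next_tokens(seq, no_repeat_ngram_size)
            for token, p in next_probs(seq).items():
                if token not in banned and p > 0:
                    candidates.append((score + math.log(p), seq + [token]))
        if not candidates:
            break
        # Keep only the num_beams highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]


print(beam_search(toy_probs, ["love"], num_beams=1, max_new_tokens=4))
# ['love', 'is', 'love', 'is', 'love']
print(beam_search(toy_probs, ["love"], num_beams=2, max_new_tokens=4,
                  no_repeat_ngram_size=2))
# ['love', 'is', 'kind', '<eos>']
```

With a single beam and no blocking, the toy model loops between "love" and "is"; with two beams and 2-gram blocking, the repeated "love is" is forbidden and a higher-scoring, non-repeating sequence wins.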


The generate function returns token ids. Decoding those ids gives us the generated text.

In [11]:
tokenizer.decode(output[0])

'What is Love?\n\nLove is a state of mind, a feeling, an emotion, or a mental state. It is the feeling of being in love with someone or something. The word "love" is derived from the Latin word for "to love," "lēgēre," which means to feel, to be attracted to, and to love. Love is an emotional state, not a physical one. A person can feel love for another person, but it is not the same thing as having physical feelings for that person. For example, if a man and a woman are in a relationship, the man may feel a strong attraction to the woman. However, he may not feel any physical attraction toward her. This is because he does not have any feelings of love or attraction for her, nor does he have a desire to have physical relations with her at this point in time. He may, however, feel an intense emotional attachment to her as a result of her being a person of interest to him. In other words, she is someone who he is interested in and he wants to get to know more about her and her life. If h