<a href="https://colab.research.google.com/github/sonukiran3101/Sonu07/blob/master/AI_Text_Generation_using_GPT2_large.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **OBJECTIVE** : 

Generate our own blog posts using a technique called "text generation."

# **OpenAI GPT2**

The OpenAI GPT-2 model was proposed in "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It’s a causal (unidirectional) transformer pretrained with a language modeling objective on a very large corpus of ~40 GB of text data.

1. GPT-2 is a model with absolute position embeddings, so it’s usually advised to pad the inputs on the right rather than the left.

2. GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. This is what allows GPT-2 to generate syntactically coherent text, as can be observed in the run_generation.py example script.

3. The PyTorch models can take the past as input, which is the previously computed key/value attention pairs. Passing this past value prevents the model from re-computing values it has already computed during text generation. See the Transformers documentation on reusing the past in generative models for more information on this argument.
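To make the first note concrete, here is a minimal sketch of right-padding a batch of token ID lists in plain Python. The helper name `pad_right` and the sample IDs are illustrative assumptions and not part of the Transformers API; the padding ID 50256 is GPT-2's end-of-text token, which this notebook reuses as the padding token when loading the model.

```python
from typing import List, Tuple

PAD_ID = 50256  # GPT-2's end-of-text token ID, reused here as the padding ID


def pad_right(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded batch and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Illustrative token IDs of different lengths
batch = [[51, 18642, 291, 15875], [464, 3807]]
padded, mask = pad_right(batch)
print(padded)
print(mask)
```

With right padding, the real tokens keep positions 0, 1, 2, ... in every row, so the absolute position embeddings line up with what the model saw during pretraining.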

### **How we're doing it!!**

1. Install Hugging Face Transformers for NLP
2. Pre-load GPT2-Large for generating text from a string
3. Encode input and decode output from the model to generate a blog post



## Steps to be followed!!



#### *1. Install and import all the necessary dependencies*




In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)

In [2]:
# Now let's import some dependencies from Hugging Face Transformers

from transformers import GPT2LMHeadModel, GPT2Tokenizer

##### **GPT2LMHeadModel:** The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

**Parameters**

*   config (GPT2Config) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.


##### **GPT2 Tokenizer:** This tokenizer has been trained to treat spaces as part of the tokens (a bit like SentencePiece), so a word is encoded differently depending on whether it appears at the beginning of a sentence (without a leading space) or not.
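As a rough illustration of this space handling, the sketch below uses a tiny hand-written vocabulary instead of the real tokenizer. The `TOY_VOCAB` table and the `toy_encode` helper are assumptions made up for this example; the three IDs follow the example given in the Transformers documentation for the GPT-2 tokenizer, and "Ġ" is the character GPT-2's vocabulary uses to mark a leading space.

```python
import re
from typing import List

# Hand-written excerpt of a GPT-2-style vocabulary (an assumption for
# illustration). "Ġ" marks a token that starts with a space.
TOY_VOCAB = {"Hello": 15496, "ĠHello": 18435, "Ġworld": 995}


def toy_encode(text: str) -> List[int]:
    """Split into words, keeping each leading space attached to its word."""
    pieces = re.findall(r" ?\S+", text)
    return [TOY_VOCAB[piece.replace(" ", "Ġ")] for piece in pieces]


print(toy_encode("Hello world"))   # "Hello" has no leading space
print(toy_encode(" Hello world"))  # " Hello" starts with a space
```

The word "Hello" maps to a different ID in each call, while " world" (always preceded by a space) keeps the same ID.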

##### **The Final Generation Flow using both the Tokenizer and Model is:**

1. First Encode a sentence to the tokens using the Tokenizer. For example: 
   "I like ice cream" --> [12, 98, 23, 67]

2. Generate a new sequence of tokens using the GPT2 model. For example: 
   [12, 98, 23, 67] --> [12, 98, 23, 67, 78, 3]

3. Decode the generated sequence to words using the Tokenizer again.
   For example: [12, 98, 23, 67, 78, 3] --> "I like ice cream very much.." 
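The three steps above can be sketched end to end with a toy stand-in for both the tokenizer and the model. Everything here is hypothetical: the word-level vocabulary reuses the made-up IDs from the list above, and the "model" is just a fixed next-token table rather than GPT-2.

```python
from typing import Dict, List

# Toy word-level vocabulary reusing the made-up IDs from the steps above
VOCAB: Dict[str, int] = {"I": 12, "like": 98, "ice": 23, "cream": 67, "very": 78, "much": 3}
ID_TO_WORD: Dict[int, str] = {i: w for w, i in VOCAB.items()}

# Stand-in for the language model: a fixed "most likely next token" table
NEXT_TOKEN: Dict[int, int] = {67: 78, 78: 3}


def encode(sentence: str) -> List[int]:
    """Step 1: turn words into token IDs."""
    return [VOCAB[word] for word in sentence.split()]


def generate(ids: List[int], max_length: int) -> List[int]:
    """Step 2: append one token at a time until max_length is reached
    or the toy model has no continuation for the last token."""
    out = list(ids)
    while len(out) < max_length and out[-1] in NEXT_TOKEN:
        out.append(NEXT_TOKEN[out[-1]])
    return out


def decode(ids: List[int]) -> str:
    """Step 3: turn token IDs back into words."""
    return " ".join(ID_TO_WORD[i] for i in ids)


ids = encode("I like ice cream")
new_ids = generate(ids, max_length=10)
print(ids, "-->", new_ids, "-->", decode(new_ids))
```

The real pipeline has the same shape: the tokenizer handles steps 1 and 3, and the model handles step 2.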

#### *2. Load our model*

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')  # Created our tokenizer
Model = GPT2LMHeadModel.from_pretrained('gpt2-large', pad_token_id = tokenizer.eos_token_id) # Instantiated our model

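##### GPT-2 does not define a padding token of its own, so we pass **pad_token_id** = tokenizer.eos_token_id to reuse the end-of-text token for padding. The rule below comes from the documentation of GPT-2's sequence classification head, not the language modeling head loaded here, but it shows how **pad_token_id** is used: since that model does classification on the last token, it needs to know the position of the last token. If a **pad_token_id** is defined in the configuration, it finds the last token that is not a padding token in each row. If no **pad_token_id** is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when **inputs_embeds** are passed instead of **input_ids**, it does the same (takes the last value in each row of the batch).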
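The "last token that is not a padding token" rule described above can be sketched in plain Python. This is a simplified illustration that assumes right-padded rows; the helper name `last_token_index` is made up for this example and is not how the library implements the lookup.

```python
from typing import List, Optional


def last_token_index(row: List[int], pad_id: Optional[int]) -> int:
    """Return the index of the last non-padding token in `row`.

    If no pad_id is defined, fall back to the last position in the row.
    """
    if pad_id is None:
        return len(row) - 1
    for i in range(len(row) - 1, -1, -1):
        if row[i] != pad_id:
            return i
    return -1  # the row contains only padding


# Two right-padded rows; 50256 is used as the padding ID
batch = [[51, 18642, 291, 15875], [464, 3807, 50256, 50256]]
print([last_token_index(row, 50256) for row in batch])
print([last_token_index(row, None) for row in batch])
```

Without a padding ID, the second row would be read at a padding position, which is why defining **pad_token_id** matters for batches of unequal length.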

In [4]:
tokenizer.eos_token_id  ## This gives the ID of the end-of-sequence token. We can decode and check it as well.
                        ## eos means "end of sequence"

50256

In [5]:
tokenizer.decode(tokenizer.eos_token_id) # Decode the end-of-sequence token. It is "<|endoftext|>"

'<|endoftext|>'

#### *3. Tokenize Sentences*

Tokenization is the process of converting a string into a sequence of numbers, as we discussed above. These token IDs are later passed to the GPT-2 model as input.

In [6]:
Sentence = "Titanic Movie"
input_ids = tokenizer.encode(Sentence, return_tensors='pt')

In [7]:
input_ids # Now we can check the ID of each token (tokens can be pieces of words).

tensor([[   51, 18642,   291, 15875]])

In [8]:
## We can also decode and check each value from the list

tokenizer.decode(input_ids[0][2])

'ic'
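Note that the two-word prompt produced four IDs, because GPT-2 works on subword pieces rather than whole words. The sketch below uses a hand-written lookup table for just these four IDs. Only the piece for 291 appears in the output above; the other three pieces are inferred from the input string, so treat the `PIECES` table as an assumption rather than tokenizer output.

```python
from typing import Dict, List

# Assumed ID -> text piece table for the prompt "Titanic Movie".
# Only 291 -> 'ic' is shown in the notebook output; the rest are inferred.
PIECES: Dict[int, str] = {51: "T", 18642: "itan", 291: "ic", 15875: " Movie"}

input_ids: List[int] = [51, 18642, 291, 15875]

# Show each ID next to its text piece (repr makes the leading space visible)
for token_id in input_ids:
    print(token_id, repr(PIECES[token_id]))

# Joining the pieces in order restores the original string
reconstructed = "".join(PIECES[token_id] for token_id in input_ids)
print(reconstructed)
```

With the real tokenizer, the same check can be done by calling tokenizer.decode on each ID in turn, as in the cell above.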

#### *4. Generate & Decode*

In [9]:
Output = Model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
Output  # We generated the token IDs; this might take a little time.

tensor([[   51, 18642,   291, 15875, 44996,   198,   198,   464, 41184,  3807,
         11968,   318,   530,   286,   262,   749, 14133,  4263,   287,   262,
          2106,   286, 22041,    13,   383, 11968,   373,  2727,   416, 28331,
          6406, 12392, 20993, 45445,   805,    11,   508,   635,  2727,   262,
         11968,   329,   262,  2656,  2907,  6176,  2646,   287, 15589,    13,
           632,   373,  3562,   284,   804,   588,   262,  4074,   355,   340,
           561,   423,  3114,   287, 34463,    11,   618,   262,  2646,   373,
           717,  2716,    13,   554,   262,  3807,    11,   262, 41184,   318,
         18904,   355,   257,  4074,   326,   318, 27141,    11,   351,   262,
          2456,   366,  3237,  1148,  9164,     1,  3194,   319,   262,  1735]])

#### *Parameters we used here:*

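In this particular case we are using beam search, which explores several candidate continuations in parallel and returns the most probable overall sequence instead of committing to the single most likely next token at every step.

*   max_length --> The maximum total length of the output in tokens, including the prompt tokens (not the number of words).
*   num_beams --> The number of beams, that is, how many candidate sequences the search keeps at each step; here we use five.
*   no_repeat_ngram_size --> Prevents any n-gram of this size from appearing twice, which stops our model from repeating the same sequences over and over again. With a value of 2, no pair of consecutive tokens is repeated.
*   early_stopping --> With beam search, generation stops once num_beams complete candidate sequences have been found, instead of continuing to search for better ones.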
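To give a feel for how these parameters interact, here is a minimal beam search sketch over a toy next-token table. It is a simplified illustration under stated assumptions, not the algorithm used by Model.generate: the `MODEL` probabilities are made up, scores are plain sums of log-probabilities with no length penalty, and early stopping is not modeled. The names `beam_search` and `banned_tokens` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

# Toy next-token probabilities keyed by the previous token (made-up numbers)
MODEL: Dict[str, Dict[str, float]] = {
    "the": {"ship": 0.6, "film": 0.4},
    "ship": {"the": 0.7, "<eos>": 0.3},
    "film": {"<eos>": 0.9, "the": 0.1},
}


def banned_tokens(seq: List[str], n: int) -> Set[str]:
    """Tokens that would complete an n-gram already present in `seq`."""
    if n <= 0 or len(seq) < n - 1:
        return set()
    prefix = tuple(seq[len(seq) - (n - 1):])  # last n-1 tokens
    banned = set()
    for i in range(len(seq) - n + 1):
        if tuple(seq[i:i + n - 1]) == prefix:
            banned.add(seq[i + n - 1])
    return banned


def beam_search(prompt: List[str], num_beams: int, max_length: int,
                no_repeat_ngram_size: int = 0) -> List[Tuple[List[str], float]]:
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(list(prompt), 0.0)]  # (sequence, cumulative log-probability)
    finished: List[Tuple[List[str], float]] = []
    while beams:
        candidates = []
        for seq, score in beams:
            banned = banned_tokens(seq, no_repeat_ngram_size)
            for token, prob in MODEL.get(seq[-1], {}).items():
                if token not in banned:
                    candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:num_beams]:
            # A sequence is done when it ends or hits the length limit
            if seq[-1] == "<eos>" or len(seq) >= max_length:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
    return sorted(finished, key=lambda c: c[1], reverse=True)


blocked = beam_search(["the"], num_beams=2, max_length=6, no_repeat_ngram_size=2)
unblocked = beam_search(["the"], num_beams=2, max_length=6, no_repeat_ngram_size=0)
for seq, score in blocked:
    print("no repeats:", " ".join(seq), round(score, 3))
for seq, score in unblocked:
    print("repeats allowed:", " ".join(seq), round(score, 3))
```

With no_repeat_ngram_size=0 the toy model can loop through "the ship the ship ...", while a value of 2 rules out any continuation that would repeat a pair of tokens already in the sequence.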





In [10]:
# Now let's decode the token IDs and check the newly generated content

Text = tokenizer.decode(Output[0], skip_special_tokens = True)

In [11]:
print(Text)

Titanic Movie Poster

The Titanic movie poster is one of the most iconic images in the history of cinema. The poster was created by famed illustrator Ralph Steadman, who also created the poster for the original Star Wars film in 1977. It was designed to look like the ship as it would have looked in 1912, when the film was first released. In the movie, the Titanic is depicted as a ship that is sinking, with the words "All Is Lost" written on the side


In [12]:
!pip install emoji

Collecting emoji
  Downloading https://files.pythonhosted.org/packages/24/fa/b3368f41b95a286f8d300e323449ab4e86b85334c2e0b477e94422b8ed0f/emoji-1.2.0-py3-none-any.whl (131kB)
Installing collected packages: emoji
Successfully installed emoji-1.2.0


In [13]:
# Now let's print a message with an emoji at the end.

import emoji

print(emoji.emojize("Hurray we got the Text!!! :partying_face:"))

Hurray we got the Text!!! 🥳


#### *5. Output Result*

In [14]:
Text = tokenizer.decode(Output[0], skip_special_tokens = True)

In [15]:
# Save the generated text to a file
with open('blogpostTitanic.text', 'w', encoding='utf-8') as f:
  f.write(Text)