<a href="https://colab.research.google.com/github/aicrashcoursewinter24/Suthi-CSC-480-Labs/blob/Transformer-based-Language-Model---GPT2/Transformer_based_Language_Model_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer-based Language Model - GPT2

- This notebook runs on Google Colab.
- Code adapted from [A Comprehensive Guide to Build Your Own Language Model in Python](https://medium.com/analytics-vidhya/a-comprehensive-guide-to-build-your-own-language-model-in-python-5141b3917d6d)
- Use the OpenAI GPT-2 language model (based on Transformers) to:
  - Generate text sequences based on seed texts
  - Convert text sequences into numerical representations

In [None]:
!pip install transformers



In [None]:
# Import required libraries
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode the input text as a list of token ids
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)

# Convert the token ids into a PyTorch tensor with a batch dimension
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

# Print the prompt followed by the predicted sub-word
print(predicted_text)

What is the fastest car in the world
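The indexing step in the cell above, `predictions[0, -1, :]`, picks the scores for the last position of the first sequence, and `argmax` returns the id of the highest-scoring token. Here is a minimal sketch of that selection using plain Python lists. The four-word vocabulary and the scores are made up for illustration and are not GPT-2 values.

```python
# Toy stand-in for `predictions`: shape [batch=1][sequence=3][vocab=4].
# The vocabulary and scores below are hypothetical, not real GPT-2 output.
vocab = ["the", "world", "car", "fastest"]
predictions = [[
    [0.1, 0.2, 2.5, 0.3],  # scores for the token following position 1
    [1.9, 0.4, 0.2, 0.1],  # scores for the token following position 2
    [0.3, 3.1, 0.5, 0.2],  # scores for the token following the last position
]]


def argmax(scores):
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Same idea as predictions[0, -1, :] on a tensor: first batch item, last position.
last_position_scores = predictions[0][-1]
predicted_index = argmax(last_position_scores)
predicted_token = vocab[predicted_index]
print(predicted_index, predicted_token)
```

Only the last row matters for next-token prediction. The earlier rows hold the model's guesses for positions whose actual tokens are already known.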


In [None]:
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 178940, done.
remote: Counting objects: 100% (1103/1103), done.
remote: Compressing objects: 100% (517/517), done.
remote: Total 178940 (delta 702), reused 835 (delta 518), pack-reused 177837
Receiving objects: 100% (178940/178940), 198.92 MiB | 18.33 MiB/s, done.
Resolving deltas: 100% (125077/125077), done.


In [None]:
!ls transformers/examples

flax  legacy  pytorch  README.md  research_projects  run_on_remote.py  tensorflow


## Text Generation Using GPT2

- [Write with Transformer](https://transformer.huggingface.co/)



In [None]:
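# The example scripts are grouped by framework (see the `ls` output above),
# so the PyTorch generation script sits under examples/pytorch/.
# !python transformers/examples/pytorch/text-generation/run_generation.py \
#     --model_type=gpt2 \
#     --model_name_or_path=gpt2 \
#     --length=100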

## Text Generation Using GPT2

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=50, num_return_sequences=5)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, but what I'm really doing is making a human-readable document. There are other languages, but those are the ones I like the most. To do your research, please contact me, this isn't your"},
 {'generated_text': "Hello, I'm a language model, not a syntax model. That's why I like it. I've done a lot of programming projects.\n\nBut my job as a C programmer is to sort through every single line of the script so I"},
 {'generated_text': "Hello, I'm a language model, and I'll do it in no time!\n\nOne of the things we learned from talking to my friend from college a bit earlier, and in the context of the current language model I think it's important"},
 {'generated_text': 'Hello, I\'m a language model, not a command line tool.\n\nIf my code is simple enough:\n\nif (use (string-replace "\\r" ))) {\n\nconsole. log\n\n}\n\nthat\'s'},
 {'generated_text': "Hello, I'm a language model, I've been using Language in all my work. Just a small example, let'
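The five continuations above differ from one another, which suggests the pipeline is sampling tokens rather than always taking the most likely one. Calling `set_seed(42)` fixes the random state so the same samples come back on a rerun. The sketch below shows that idea with a toy distribution and Python's `random` module. The vocabulary, the scores, and the `sample_sequences` helper are hypothetical, and unlike a real language model the toy ignores context.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequences(scores, vocab, num_return_sequences, length, seed):
    """Draw several token sequences from one fixed distribution (toy model)."""
    rng = random.Random(seed)  # a private generator seeded like set_seed(seed)
    probs = softmax(scores)
    return [
        rng.choices(vocab, weights=probs, k=length)
        for _ in range(num_return_sequences)
    ]


# Hypothetical vocabulary and scores, chosen only for illustration.
vocab = ["model", "language", "syntax", "tool"]
scores = [2.0, 1.5, 0.5, 0.1]

first_run = sample_sequences(scores, vocab, num_return_sequences=5, length=4, seed=42)
second_run = sample_sequences(scores, vocab, num_return_sequences=5, length=4, seed=42)
print(first_run == second_run)  # same seed, same samples
```

Changing the seed, or not setting one, would generally give different continuations on each run.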

In [None]:
generator("Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly", max_length=75, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly based on real life and fantasy shows like The Harry Potter movies. In the same way that the Wild West has been a staple of the"},
 {'generated_text': "Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly based on old stories, like those about the two former ice beasts.\n\nAscension:\n\nThe first Apocryphal"},
 {'generated_text': "Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly the result of the events of The

## Transforming Texts into Features

In [None]:
# PyTorch variant (left commented out; the TensorFlow version below is the one that runs)
# from transformers import GPT2Tokenizer, GPT2Model
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model = GPT2Model.from_pretrained('gpt2')
# text = "Replace me by any text you'd like."
# encoded_input = tokenizer(text, return_tensors='pt') # return PyTorch tensors
# output = model(**encoded_input)


from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
text = "Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
print(encoded_input)

All PyTorch model weights were used when initializing TFGPT2Model.

All the weights of TFGPT2Model were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2Model for predictions without further training.


{'input_ids': <tf.Tensor: shape=(1, 48), dtype=int32, numpy=
array([[ 3351, 10366,   318,   257, 19812,  2095,   287,   262,  6663,
         7129,  8663,    13,   679,   318,   257, 17463,   263,    12,
           83,  1025,   704, 33039,   508,   318, 21366,   351, 13157,
          936, 19942,    11,  7558,  5137,   465,  1204,   287,  3514,
          284,  7330,   290,  4404,   606,    13,  1446, 10366,   338,
        44880,   389,  4632]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 48), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1]], dtype=int32)>}
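The `attention_mask` above is all ones because the batch holds a single sequence with no padding. When sequences of different lengths are batched together, the shorter ones are padded and the mask marks the padded positions with 0. The sketch below shows this under a few assumptions: the `pad_batch` helper is hypothetical, padding is added on the right for simplicity, and the pad id is the `eos_token_id` (50256) that the log messages in this notebook report being used as `pad_token_id`.

```python
PAD_ID = 50256  # GPT-2's eos_token_id, reused as the pad id per the notebook's log messages


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids taken from the printed `input_ids` above.
batch = pad_batch([[3351, 10366, 318, 257], [679, 318]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The second sequence gains two pad ids, and its mask ends in two zeros so the model can ignore those positions.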


In [None]:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2')

# Input text
text = "Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly"

# Encode the input text
encoded_input = tokenizer(text, return_tensors='tf')


generated_text = model.generate(encoded_input['input_ids'], max_length=50)
decoded_text = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print(decoded_text)




All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Scrat is a fictional character in the Ice Age franchise. He is a saber-toothed squirrel who is obsessed with collecting acorns, constantly putting his life in danger to obtain and defend them. Scrat's storylines are mostly based on
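The output above adds only "based on" to the prompt. This is consistent with how `generate` treats `max_length`: it bounds the total sequence length, prompt included. The encoded prompt printed earlier has shape `(1, 48)`, so `max_length=50` leaves room for two new tokens. The arithmetic is sketched below. The `new_token_budget` helper is hypothetical and not part of the library.

```python
def new_token_budget(prompt_length, max_length):
    """Number of tokens that can still be generated when max_length includes the prompt."""
    return max(0, max_length - prompt_length)


prompt_length = 48  # from the printed input_ids shape (1, 48)

print(new_token_budget(prompt_length, max_length=50))  # budget for the call above
print(new_token_budget(prompt_length, max_length=75))  # budget for the earlier pipeline call
```

To bound only the generated part regardless of prompt length, `generate` also accepts `max_new_tokens`.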


In [None]:

new_text = "What are the fun features of the GPT-2 model?"

# Encode the new input text
new_encoded_input = tokenizer(new_text, return_tensors='tf')

# Generate text
new_generated_text = model.generate(new_encoded_input['input_ids'], max_length=50)  # Adjust max_length accordingly
new_decoded_text = tokenizer.decode(new_generated_text[0], skip_special_tokens=True)
print(new_decoded_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What are the fun features of the GPT-2 model?

The GPT-2 is a very simple and powerful system. It is designed to be used in conjunction with the GPT-1 and GPT-2. The G
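The continuation above stops mid-word because generation ends once the sequence reaches `max_length`. Assuming the default settings of `generate` (no sampling, a single beam), each step appends the single most likely next token, so rerunning the cell should give the same text. The loop below is a minimal sketch of that procedure. The `NEXT_TOKEN` lookup table is a hypothetical stand-in for the model's prediction.

```python
# Hypothetical next-token table standing in for the model's most likely prediction.
NEXT_TOKEN = {
    "What": "is",
    "is": "the",
    "the": "fastest",
    "fastest": "car",
    "car": "<eos>",
}


def greedy_generate(prompt_tokens, max_length, eos="<eos>"):
    """Append the most likely next token until max_length or end-of-sequence."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:  # max_length counts the prompt tokens too
        next_token = NEXT_TOKEN.get(tokens[-1], eos)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens


cut_short = greedy_generate(["What", "is"], max_length=4)
complete = greedy_generate(["What", "is"], max_length=10)
print(cut_short)
print(complete)
```

The first call is cut off by the length limit, and the second ends on its own at the end-of-sequence marker.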
