# Basic Text Generation

We use GPT-2 because:

*   It is free.
*   It can be loaded through the Hugging Face `transformers` library.

Later sections use the OpenAI API, which is paid.



In [1]:
# Import libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

In [None]:
# Encoding the prompt to get the input ids
prompt = "Dear boss ... "
input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch

# Generate text using the model
outputs = model.generate(input_ids, max_length = 100)
tokenizer.decode(outputs[0], skip_special_tokens=True) # decode output to text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


"Dear boss... \xa0I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able"

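Notice how repetitive the output is. By default `generate` uses greedy decoding, which always picks the most likely next token, so the model falls into a loop. We need to fix this.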
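One common way to reduce this kind of looping, separate from the fine-tuning done later in this notebook, is to constrain decoding. The sketch below is only an illustration: the function name `generate_with_repetition_control` and the values `2` and `1.2` are assumptions, not settings taken from this notebook. It expects the `model` and `tokenizer` objects loaded above.

```python
def generate_with_repetition_control(prompt, model, tokenizer, max_length=100):
    """Generate text while discouraging repeated phrases (illustrative sketch)."""
    # Calling the tokenizer directly returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")

    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        no_repeat_ngram_size=2,               # never emit the same 2-gram twice (assumed value)
        repetition_penalty=1.2,               # down-weight tokens already generated (assumed value)
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

It can be called as `generate_with_repetition_control("Dear boss ... ", model, tokenizer)`. Passing the attention mask and `pad_token_id` explicitly should also silence the warnings printed above.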

In [None]:
# Simplified text generation function
def simple_text_generation(prompt, model, tokenizer, max_length = 100):
  # Encoding the prompt to get the input ids
  input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch

  # Generate text using the model, honoring the caller's max_length
  outputs = model.generate(input_ids, max_length = max_length)

  # Decode the generated output IDs back into text
  return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Test the function
prompt = "Dear boss ... "
text_generated = simple_text_generation(prompt,
                                        model,
                                        tokenizer,
                                        max_length = 100)
print(text_generated)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear boss...  I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able to do this. I'm not going to be able


# Fine Tuning

In [None]:
# Load dataset (scientific research abstracts related to machine learning)
data = [
    "This paper presents a new method for improving the performance of machine learning models by using data augmentation techniques.",
    "We propose a novel approach to natural language processing that leverages the power of transformers and attention mechanisms.",
    "In this study, we investigate the impact of deep learning algorithms on the accuracy of image recognition tasks.",
    "Our research demonstrates the effectiveness of transfer learning in enhancing the capabilities of neural networks.",
    "This work explores the use of reinforcement learning for optimizing decision-making processes in complex environments.",
    "We introduce a framework for unsupervised learning that significantly reduces the need for labeled data.",
    "The results of our experiments show that ensemble methods can substantially boost model performance.",
    "We analyze the scalability of various machine learning algorithms when applied to large datasets.",
    "Our findings suggest that hyperparameter tuning is crucial for achieving optimal results in machine learning applications.",
    "This research highlights the importance of feature engineering in the context of predictive modeling."
]

In [None]:
# Tokenization
# All inputs in a batch must have the same length.
# Shorter inputs are extended with a dummy token at the end; this is called padding.
# GPT-2 has no dedicated padding token, so we reuse the end-of-sequence (EOS) token.

tokenizer.pad_token = tokenizer.eos_token
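To make the idea of padding concrete, here is a minimal pure-Python sketch of what happens to each sequence. The helper `pad_batch` is hypothetical and only mirrors the behavior described above; the real work is done by the tokenizer. The token ids are taken from the first tokens of the sentences in this notebook, and `50256` is GPT-2's EOS id.

```python
def pad_batch(sequences, pad_id, max_length):
    """Pad (or cut) each list of token ids to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]               # cut sequences that are too long
        n_pad = max_length - len(seq)        # number of dummy tokens to append
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[1212, 3348, 10969], [1135]], pad_id=50256, max_length=4)
print(ids)   # [[1212, 3348, 10969, 50256], [1135, 50256, 50256, 50256]]
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

Because the padding id equals the EOS id, the attention mask is the only thing that tells real tokens apart from padding.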

In [None]:
# Tokenize the data
tokenized_data = [tokenizer.encode_plus(
    sentence,                   # Input sentence
    add_special_tokens = True,  # Add special tokens
    return_tensors = "pt",      # Return PyTorch tensors
    padding = "max_length",     # Pad every sentence to max_length tokens
    truncation = True,          # Cut sentences longer than max_length so all tensors match
    max_length = 50)            # Fixed sequence length in tokens
    for sentence in data]       # Iterate over the data

# Preview
tokenized_data[:2]

[{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0]])},
 {'input_ids': tensor([[ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 502

In [None]:
# Isolate the input IDs and the attention masks
inputs_ids = [item['input_ids'].squeeze() for item in tokenized_data]
attention_masks = [item['attention_mask'].squeeze() for item in tokenized_data]
attention_masks[:2]

[tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0])]

In [None]:
# Stack the per-sentence tensors into single 2-D tensors
# of shape (number of sentences, max_length) so they can be batched
inputs_ids = torch.stack(inputs_ids)
attention_masks = torch.stack(attention_masks)

In [None]:
# Pad all input sequences to the same length
# (the tokenizer already padded them to max_length, so this leaves the tensors unchanged)
padded_input_ids = pad_sequence(
    inputs_ids,
    batch_first = True,
    padding_value = tokenizer.eos_token_id) # Use the tokenizer's end-of-sequence token ID as the padding value

# Padding all attention masks to ensure they have the same length
padded_attention_masks = pad_sequence(
    attention_masks,
    batch_first = True,
    padding_value = 0) # Use 0 as the padding value for attention masks

In [None]:
# Create a custom dataset class for handling text data, including input IDs and attention masks
class TextDataset(Dataset):
  def __init__(self, input_ids, attention_masks):
    self.input_ids = input_ids  # Store the input IDs
    self.attention_masks = attention_masks  # Store the attention masks
    self.labels = input_ids.clone()  # For causal language modeling the labels are a copy of the input IDs; the model shifts them internally

  def __len__(self):
    return len(self.input_ids)  # Return the number of samples in the dataset

  def __getitem__(self, idx):
    # Return a dictionary containing the input IDs, attention mask, and labels for a given index
    return {
        'input_ids': self.input_ids[idx],
        'attention_mask': self.attention_masks[idx],
        'labels': self.labels[idx]
    }

# Instantiate the dataset using the padded input IDs and attention masks
dataset = TextDataset(padded_input_ids, padded_attention_masks)
dataset[:2]

{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
         [ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 

# Fine-Tuning GPT-2

In [None]:
# Prepare the data in batches using a DataLoader
# Set the batch size to 2 and shuffle the data for each epoch
dataloader = DataLoader(dataset, batch_size = 2, shuffle = True)
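As a rough picture of what batching with shuffling does, the sketch below splits 10 sample indices into shuffled batches of 2. The helper `iterate_batches` and the `seed` parameter are hypothetical and exist only for this illustration; `DataLoader` handles this internally and also collates the tensors.

```python
import random

def iterate_batches(num_samples, batch_size, seed=None):
    """Yield lists of shuffled sample indices; one full pass corresponds to one epoch."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)     # new order for this pass
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_batches(num_samples=10, batch_size=2, seed=0))
print(len(batches))  # 5 batches of 2 samples each
```

With 10 sentences and a batch size of 2, each epoch performs 5 optimizer steps, so 10 epochs amount to 50 parameter updates.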

In [None]:
# Initialize an optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr = 5e-5)

# Set the model to training mode
model.train()

# Training loop
for epoch in range(10):
  for batch in dataloader:
    # Unpack the input IDs, attention mask, and labels
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['labels']

    # Reset the gradients to zero
    optimizer.zero_grad()

    # Forward pass: when labels are given, the model also returns the loss
    outputs = model(input_ids = input_ids,
                    attention_mask = attention_mask,
                    labels = labels)
    loss = outputs.loss

    # Backward pass: compute the gradients of the loss
    loss.backward()

    # Update the model parameters
    optimizer.step()

  # Print the loss of the last batch in this epoch to monitor the progress
  print(f"Epoch {epoch + 1} - Loss: {loss.item()}")

Epoch 1 - Loss: 1.5911985635757446
Epoch 2 - Loss: 1.1605656147003174
Epoch 3 - Loss: 0.9104509949684143
Epoch 4 - Loss: 0.8469998836517334
Epoch 5 - Loss: 0.5974200963973999
Epoch 6 - Loss: 0.5316460132598877
Epoch 7 - Loss: 0.45725950598716736
Epoch 8 - Loss: 0.5214544534683228
Epoch 9 - Loss: 0.3062812387943268
Epoch 10 - Loss: 0.2686574161052704
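The labels given to the model are a copy of the input IDs. This works because a causal language model shifts them internally: the scores at position t are compared with the token at position t + 1. The pure-Python sketch below illustrates that computation for a single sequence. The function `causal_lm_loss` is a hypothetical helper written for this explanation, not the library's implementation; `-100` is the label value that the loss ignores.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss

def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy for one sequence.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of token ids with the same length as logits
    """
    losses = []
    # Position t predicts the token at position t + 1, hence the shift
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Log of the softmax normalizer, with max subtraction for numerical stability
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        losses.append(log_z - step_logits[target])
    return sum(losses) / len(losses) if losses else float("nan")
```

In this notebook the padded positions keep the EOS id as their label, so they contribute to the loss as well. Setting those labels to `-100` would exclude them.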


In [None]:
# Define a function to generate text given a prompt
def generate_text(prompt, model, tokenizer, max_length=100):
  # Encode the prompt to obtain input IDs and attention mask
  inputs = tokenizer.encode_plus(prompt, return_tensors="pt")
  # Extract input ids and attention mask
  input_ids = inputs['input_ids']
  attention_mask = inputs['attention_mask']

  # Switch to evaluation mode so dropout is disabled during generation
  model.eval()

  # Generate text using the model
  outputs = model.generate(
      input_ids,                       # Provide input IDs to the model
      attention_mask=attention_mask,   # Provide attention mask to the model
      max_length=max_length            # Set the maximum length for the generated text
  )

  # Decode the generated text and return it, skipping special tokens
  return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Test the function
prompt = "In this research, we "
text_generated = generate_text(prompt, model, tokenizer, max_length = 500)
print(text_generated)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this research, we   investigate the impact of machine learning algorithms on the accuracy of decision-making processes in complex environments.


In [None]:
# Test the function
prompt = "Dear Boss ..."
text_generated = generate_text(prompt, model, tokenizer, max_length = 500)
print(text_generated)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Boss...
