
# Basic Text Generation

We'll use GPT-2 because:
* It's free.
* It runs locally through the Hugging Face `transformers` library.

In later sections we'll use the OpenAI API (paid).

In [4]:
# Imports
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [5]:
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:
# Encode the prompt to get the input IDs
prompt = "Dear boss..."
input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch
# Generate text using the model
outputs = model.generate(input_ids, max_length=50, do_sample=True)
tokenizer.decode(outputs[0], skip_special_tokens=True)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


"Dear boss...we are the most hated club outside our own city...you're our worst nightmare...what could you say about us?\n\nA new petition of support for the club, calling for FIFA to change its policy on violence against women has"

In [7]:
# Simplified text generation function
def simple_text_generation(prompt, model, tokenizer, max_length=100):
  # Encode the prompt to get the input IDs
  input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch
  # Generate text, honoring the caller's max_length (prompt + new tokens)
  outputs = model.generate(input_ids, max_length=max_length, do_sample=True)

  return tokenizer.decode(outputs[0], skip_special_tokens=True)



In [24]:
# Test the function
prompt = "Dear boss..."
print(simple_text_generation(prompt,
                             model,
                             tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear boss...it's not right how we treat people you talk about."

Advertisement - Continue Reading Below

The same is true of all the other things about the new movie. It's all just so much more than a mere box office
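
The two runs above start from the same prompt but continue differently because `do_sample=True` makes `generate` draw each next token at random from the model's predicted distribution instead of always taking the most likely one. The sketch below illustrates that difference for a single decoding step in plain Python. The three-word vocabulary and the scores are made up for illustration and do not come from GPT-2.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_step(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sampling_step(logits, rng):
    """Sampling: draw a token index in proportion to its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical vocabulary and scores for the token after "Dear boss..."
vocab = ["I", "we", "please"]
logits = [2.0, 1.5, 0.5]

rng = random.Random(0)  # seeded so the sketch is repeatable
greedy_choice = vocab[greedy_step(logits)]
sampled_choices = [vocab[sampling_step(logits, rng)] for _ in range(20)]

print("greedy :", greedy_choice)
print("sampled:", sampled_choices)
```

Greedy decoding returns the same token every time, while sampling can return any token with non-zero probability, so repeated calls to a sampling `generate` are not expected to match.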


# Fine Tuning

In [9]:
# Load dataset (scientific research abstracts related to machine learning)
data = [
    "This paper presents a new method for improving the performance of machine learning models by using data augmentation techniques.",
    "We propose a novel approach to natural language processing that leverages the power of transformers and attention mechanisms.",
    "In this study, we investigate the impact of deep learning algorithms on the accuracy of image recognition tasks.",
    "Our research demonstrates the effectiveness of transfer learning in enhancing the capabilities of neural networks.",
    "This work explores the use of reinforcement learning for optimizing decision-making processes in complex environments.",
    "We introduce a framework for unsupervised learning that significantly reduces the need for labeled data.",
    "The results of our experiments show that ensemble methods can substantially boost model performance.",
    "We analyze the scalability of various machine learning algorithms when applied to large datasets.",
    "Our findings suggest that hyperparameter tuning is crucial for achieving optimal results in machine learning applications.",
    "This research highlights the importance of feature engineering in the context of predictive modeling."
]



In [10]:
# Tokenization
# All inputs in a batch must have the same length, so shorter ones
# get filler tokens appended at the end. This is called padding.
# GPT-2 has no pad token by default, so we reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

In [11]:
# Tokenize the data, padding (or truncating) every sentence to 50 tokens
tokenized_data = [tokenizer.encode_plus(
    sentence,
    add_special_tokens=True,
    max_length=50,
    padding="max_length",
    truncation=True,
    return_tensors="pt") for sentence in data]
tokenized_data[0]

{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
           286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
            13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])}
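
To make the output above concrete, here is a plain-Python sketch of what padding to a fixed length produces: the token IDs followed by the pad ID, plus an attention mask with 1 for real tokens and 0 for padding. `pad_to_length` is a hypothetical helper written for this illustration and is not how the tokenizer is implemented internally.

```python
def pad_to_length(token_ids, max_length, pad_id):
    """Right-pad token_ids to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(token_ids)
    padded = token_ids + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask

# First three token IDs from the output above; 50256 is GPT-2's end-of-sequence ID
ids = [1212, 3348, 10969]
padded, mask = pad_to_length(ids, max_length=6, pad_id=50256)

print(padded)  # [1212, 3348, 10969, 50256, 50256, 50256]
print(mask)    # [1, 1, 1, 0, 0, 0]
```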

In [12]:
# Isolate the input IDs and the attention masks
input_ids = [item['input_ids'].squeeze() for item in tokenized_data]
attention_masks = [item['attention_mask'].squeeze() for item in tokenized_data]
attention_masks[0]

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0])

In [13]:
# Stack the per-sentence tensors into single 2-D tensors of shape
# (num_sentences, max_length), so they can be batched for fine-tuning
input_ids = torch.stack(input_ids)
attention_masks = torch.stack(attention_masks)

In [14]:
# Pad all sequences to a common length (a no-op here, since the
# tokenizer already padded each one to 50 tokens)
padded_input_ids = pad_sequence(input_ids,
             batch_first = True,
             padding_value = tokenizer.eos_token_id)
padded_attention_masks = pad_sequence(attention_masks,
             batch_first = True,
             padding_value = 0)
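
`pad_sequence` earns its keep when sequences arrive with different lengths, for example when tokenizing without `padding="max_length"`. A minimal sketch with made-up token IDs and a hypothetical `PAD_ID`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sequences of different lengths (made-up IDs)
sequences = [
    torch.tensor([11, 12, 13]),
    torch.tensor([21]),
    torch.tensor([31, 32]),
]

PAD_ID = 0  # hypothetical padding value for this toy example

# batch_first=True gives shape (batch, longest_sequence)
padded = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)

# In this toy example PAD_ID never occurs as a real token,
# so the mask can be derived from the padded tensor.
attention_mask = (padded != PAD_ID).long()

print(padded.shape)
print(padded)
print(attention_mask)
```

Deriving the mask from the padded tensor only works when the pad ID cannot appear as a real token. With GPT-2, where the pad token is the end-of-sequence token, it is safer to keep the attention mask returned by the tokenizer.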

In [17]:
# Create a custom dataset class that also stores the labels
class TextDataset(Dataset):
  def __init__(self, input_ids, attention_masks):
    self.input_ids = input_ids
    self.attention_masks = attention_masks
    self.labels = input_ids.clone()

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return {
        'input_ids': self.input_ids[idx],
        'attention_mask': self.attention_masks[idx],
        'labels': self.labels[idx]
    }

# Apply the class
dataset = TextDataset(padded_input_ids, padded_attention_masks)
dataset[:2]

{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
            286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
         [ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
          17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
             13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
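
`TextDataset` copies the input IDs into `labels` unchanged, so padded positions also count toward the loss. A common alternative, sketched here, is to set the label at every padded position to -100, the default `ignore_index` of PyTorch's cross-entropy loss, which Hugging Face causal language models also skip when computing the loss. `mask_padding_labels` is a hypothetical helper that is not part of this notebook, and adopting it would change the loss values reported in the training run.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids and replace padded positions with IGNORE_INDEX."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = IGNORE_INDEX
    return labels

# Toy batch: two sequences padded with GPT-2's end-of-sequence ID (50256)
input_ids = torch.tensor([[1212, 3348, 50256, 50256],
                          [1135, 18077, 257, 50256]])
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 0]])

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

Because the pad token doubles as the end-of-sequence token here, masking every padded position also hides the end-of-sequence target. Leaving the first padded position unmasked is one possible way around that.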
 

In [18]:
# Prepare data in batches
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [25]:
# Initialize an optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Set the model to training mode
model.train()

# Training loop
for epoch in range(10):
  for batch in dataloader:
    # Unpack the input IDs, attention mask, and labels
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['labels']

    # Reset the gradients to zero
    optimizer.zero_grad()

    # Forward pass: passing labels makes the model return the
    # language-modeling loss alongside the logits
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels)

    loss = outputs.loss

    # Backward pass: compute gradients of the loss
    loss.backward()

    # Update the model parameters
    optimizer.step()

  # Print the loss of the last batch in this epoch to monitor progress
  print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Epoch 1, Loss: 1.68159019947052
Epoch 2, Loss: 1.152671217918396
Epoch 3, Loss: 1.1428251266479492
Epoch 4, Loss: 0.8552444577217102
Epoch 5, Loss: 0.6248494982719421
Epoch 6, Loss: 0.7161487936973572
Epoch 7, Loss: 0.499800443649292
Epoch 8, Loss: 0.5959373712539673
Epoch 9, Loss: 0.27258825302124023
Epoch 10, Loss: 0.19146256148815155
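
The loop above prints the loss of the last batch in each epoch. With shuffled batches of two sentences that single number is noisy, which may be why it rises slightly at epochs 6 and 8. Averaging over all batches in the epoch gives a steadier signal. The sketch below shows the bookkeeping in plain Python. The per-batch values are made up, and `mean_epoch_loss` is a hypothetical helper.

```python
def mean_epoch_loss(batch_losses):
    """Average the per-batch losses collected during one epoch."""
    if not batch_losses:
        raise ValueError("no batches were processed")
    return sum(batch_losses) / len(batch_losses)

# Made-up per-batch losses for two epochs (five batches each, as with
# 10 samples and batch_size=2). In the training loop you would append
# loss.item() to a list after each optimizer step instead.
history = [
    [1.9, 1.7, 1.6, 1.8, 1.5],
    [1.2, 1.4, 1.1, 1.3, 1.0],
]

for epoch, batch_losses in enumerate(history, start=1):
    print(f"Epoch {epoch}, mean loss: {mean_epoch_loss(batch_losses):.4f}")
```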


In [32]:
# Define function to generate text
def generate_text(prompt, model, tokenizer, max_length=100):
  # Encode the prompt to get the input IDs and the attention mask
  inputs = tokenizer.encode_plus(prompt, return_tensors = "pt")
  input_ids = inputs['input_ids']
  attention_mask = inputs['attention_mask']

  # Generate text using the model
  outputs = model.generate(input_ids,
                           attention_mask = attention_mask,
                           max_length=max_length)


  return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [34]:
# Test the function
prompt = "In this research, we "
text_generated = generate_text(prompt, model, tokenizer, max_length=500)
print(text_generated)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In this research, we  provide a framework for modeling the impact of machine learning algorithms on the accuracy of data augmentation tasks.
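
The training cell leaves the model in training mode (`model.train()`), where dropout layers stay active, so it is safer to call `model.eval()` explicitly before generating. The sketch below uses a standalone `torch.nn.Dropout` layer rather than GPT-2 to show the difference between the two modes.

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()          # training mode: random elements are zeroed
train_out = dropout(x)   # survivors are scaled by 1 / (1 - p) = 2.0

dropout.eval()           # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print("train:", train_out)
print("eval :", eval_out)
```

In this notebook that would mean calling `model.eval()` once after the training loop and before `generate_text`.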
