<a href="https://colab.research.google.com/github/seancary62/AddingFilesDemo/blob/main/Starter_File_Basics_of_Generation_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Text Generation

We'll use GPT2 because:
* It's free!
* We can use the transformers library

In later sections we'll use the open API (paid)

In [1]:
# Imports
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [3]:
# Initialize the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [7]:
# Encoding the prompt to get the input ads
prompt = "Dear boss..."
input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch
# Generate text using the model
outputs = model.generate(input_ids, max_length=50, do_sample=True)
tokenizer.decode(outputs[0], skip_special_tokens=True)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Dear boss... why don\'t you ask me to do something. I want you to know I know a lot of people who have tried really hard to do this."\n\n\n"I\'ll show you why," he says. "You don\'t seem'

In [8]:
# Simplified text generation function
def simple_text_generation(prompt, model, tokenizer, max_length = 100):
  # Encode the prompt to get the input IDs
  input_ids = tokenizer.encode(prompt, return_tensors="pt") # pt = pytorch
  # Generate text using the model
  outputs = model.generate(input_ids, max_length=50, do_sample=True)

  return tokenizer.decode(outputs[0], skip_special_tokens=True)



In [10]:
# Test the function
prompt = "Dear boss..."
print(simple_text_generation(prompt,
                             model,
                             tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear boss...you're going to have to take all this time to get this resolved."

I was standing before the door, the door still locked shut. When the front desk began to open...

It was a single, large desk


# Fine Tuning

In [11]:
# Load dataset (scientific research abstracts related to machine learning)
data = [
    "This paper presents a new method for improving the performance of machine learning models by using data augmentation techniques.",
    "We propose a novel approach to natural language processing that leverages the power of transformers and attention mechanisms.",
    "In this study, we investigate the impact of deep learning algorithms on the accuracy of image recognition tasks.",
    "Our research demonstrates the effectiveness of transfer learning in enhancing the capabilities of neural networks.",
    "This work explores the use of reinforcement learning for optimizing decision-making processes in complex environments.",
    "We introduce a framework for unsupervised learning that significantly reduces the need for labeled data.",
    "The results of our experiments show that ensemble methods can substantially boost model performance.",
    "We analyze the scalability of various machine learning algorithms when applied to large datasets.",
    "Our findings suggest that hyperparameter tuning is crucial for achieving optimal results in machine learning applications.",
    "This research highlights the importance of feature engineering in the context of predictive modeling."
]



In [12]:
# Tokenization
# All inputs must have the same length
# Add a dummy token at the end
# Having the same length => This is called padding
tokenizer.pad_token = tokenizer.eos_token

In [19]:
# Tokenize the data
tokenized_data = [tokenizer.encode_plus(
    sentence,
    add_special_tokens=True,
    max_length=50,
    padding="max_length",
    return_tensors="pt") for sentence in data]
tokenized_data[0]

{'input_ids': tensor([[ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
           286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
            13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])}

In [21]:
# Isolate the input IDs and the attention masks
[item['input_ids'].squeeze() for item in tokenized_data]

[tensor([ 1212,  3348, 10969,   257,   649,  2446,   329, 10068,   262,  2854,
           286,  4572,  4673,  4981,   416,  1262,  1366, 16339, 14374,  7605,
            13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]),
 tensor([ 1135, 18077,   257,  5337,  3164,   284,  3288,  3303,  7587,   326,
         17124,  1095,   262,  1176,   286,  6121,   364,   290,  3241, 11701,
            13, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]),
 tensor([  818,   428,  2050,    11,   356,  9161,   262,  2928,   286,  2769,
          4673, 16113,   319,   262,  9922,   286,  2939,  9465,  8861,    13,
         50256, 50256, 50256, 50256, 50256, 5025