# Text Generation
***

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## 1. Introduction


## 2. Device Agnostic Code
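GPU acceleration delivers a significant speed-up over CPU for deep learning tasks, especially for large models and batch sizes. The cell below selects CUDA when it is available, falls back to Apple's MPS backend on macOS, and otherwise uses the CPU.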

In [2]:
device = torch.device(
    device="cuda"  # GPU
    if torch.cuda.is_available()
    else "mps"  # MPS (MacOS)
    if torch.backends.mps.is_available()
    else "cpu"  # No GPU Available
)
device

device(type='mps')

## 3. Loading Pre-Trained Model
### GPT-2

In [3]:
MODEL_NAME = "gpt2-large"
MAX_TOKENS = 50
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")

In [4]:
# GPT-2 has no dedicated padding token, so reuse the end-of-sequence token for padding
model.generation_config.pad_token_id = model.generation_config.eos_token_id

In [5]:
input_text = "Hello! Today I"

model_inputs = tokeniser(input_text, return_tensors="pt").to(device)

In [6]:
model_inputs

{'input_ids': tensor([[15496,     0,  6288,   314]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1]], device='mps:0')}

## 4. Repetition
Repeated phrases in generated text are a common phenomenon in autoregressive language models. If a token or phrase has a high-probability continuation in the training data, the model's predictions tend to loop, generating the same phrase repeatedly.
To avoid or reduce repetition, several techniques can be applied:

- Employ nucleus sampling or top-k sampling (explained in detail below).
- Increase `temperature` above 1.0 to flatten the probability distribution, encouraging more random choices.
- Use `repetition_penalty` (typically between 1.1 and 1.5) to reduce the probability of tokens that have already been generated.
- Apply `no_repeat_ngram_size` to strictly prevent the model from generating the same n-gram repeatedly.
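
To make the last three items concrete, here is a minimal pure-Python sketch of what each knob does to the next-token scores. The helper names (`softmax`, `penalise_repeats`, `banned_tokens`) are hypothetical and written for illustration; they are not the library's implementation, although the penalty follows the commonly used rule of dividing positive logits and multiplying negative ones.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def penalise_repeats(logits, generated_ids, penalty=1.2):
    """Lower the score of every token that has already been generated."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink positive scores
        else:
            out[token_id] *= penalty  # push negative scores further down
    return out


def banned_tokens(generated_ids, ngram_size):
    """Tokens that would complete an n-gram already present in the sequence."""
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated_ids) - ngram_size + 1):
        if tuple(generated_ids[i:i + ngram_size - 1]) == prefix:
            banned.add(generated_ids[i + ngram_size - 1])
    return banned


logits = [2.0, 1.0, -1.0]
print(softmax(logits, temperature=1.0))  # peaked distribution
print(softmax(logits, temperature=2.0))  # flatter distribution
print(penalise_repeats(logits, generated_ids=[0, 2], penalty=2.0))
print(banned_tokens([5, 7, 9, 5], ngram_size=2))  # 7 would repeat the bigram (5, 7)
```

A higher temperature moves probability mass away from the top token, while the penalty and the n-gram ban act directly on tokens that have already appeared.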

## 5. Basic Generation Strategies
### Greedy Search
Greedy search is the default decoding strategy used by `.generate()`. At each step, it selects the token with the highest probability as the next token. This method is simple and fast, which makes it suitable for short outputs. For longer text, however, it tends to produce repetitive and less diverse sequences.

By default, greedy search generates up to 20 new tokens unless a different limit (for example `max_new_tokens`) is passed to `.generate()` or specified in `GenerationConfig`.
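
The selection rule itself is small enough to sketch without a real model. The example below assumes a hypothetical toy "model" (`TOY_LOGITS`) whose next-token scores depend only on the last token; it is an illustration of the argmax loop, not how `.generate()` is implemented.

```python
# Hypothetical toy "model": next-token scores depend only on the last token.
VOCAB = ["<eos>", "the", "cat", "sat"]
TOY_LOGITS = {
    "the": [0.1, 0.0, 2.0, 0.5],
    "cat": [0.2, 0.1, 0.0, 1.5],
    "sat": [3.0, 0.5, 0.1, 0.0],
}


def greedy_decode(prompt, max_new_tokens=20):
    """Repeatedly append the single highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = TOY_LOGITS[tokens[-1]]
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax
        next_token = VOCAB[best]
        if next_token == "<eos>":
            break  # stop as soon as the end-of-sequence token wins
        tokens.append(next_token)
    return tokens


print(greedy_decode(["the"]))  # ['the', 'cat', 'sat']
```

Because the choice is deterministic, the same prompt always yields the same continuation, which is also why loops are hard to escape once they start.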

In [7]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to show you how to make a simple, yet effective, way to get your hands on a few of the most popular and most sought after brands of beer.

The first thing you need to do is to find a beer that you


### Sampling
Sampling selects the next token randomly according to the probability distribution over the model's entire vocabulary. This reduces repetition and can produce more creative, diverse outputs than greedy search.

Sampling is enabled by setting the parameters: `do_sample=True` and `num_beams=1`.
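
As a rough illustration of what "randomly according to the distribution" means, the sketch below draws token indices from a softmax over three made-up logits using only the standard library. The function name `sample_next` and the logits are assumptions for the example.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token index with probability proportional to softmax(logits)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.1]
draws = [sample_next(logits, rng=rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]
print(counts)  # the highest-logit token is drawn most often, but not always
```

Unlike greedy search, lower-probability tokens are still chosen some of the time, which is the source of the extra diversity.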

In [8]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    num_beams=1,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I have the most awaited update for the first installment of my Daedric armor, The Golden Crown. You guys have waited for it since the release and I'm really happy to be able to share it with you.

I've always wanted to


### Beam Search
Beam Search maintains multiple candidate sequences (beams) simultaneously. At each step, it expands each beam by selecting tokens, then retains the top $k$ beams based on cumulative (overall) probability score. This strategy is suited for input-grounded tasks such as image captioning or speech recognition.

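Beam search is enabled by setting `num_beams` to a value greater than 1. It can optionally be combined with `do_sample=True`, in which case candidates are sampled rather than chosen deterministically.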
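
The following sketch shows the expand-and-prune step on a hypothetical two-step toy distribution (`NEXT_PROBS`), chosen so that the greedy path is not the most probable sequence overall. It is a simplified illustration without length penalties or early stopping.

```python
import math

# Hypothetical next-token probabilities, conditioned on the last token only.
NEXT_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {},
    "y": {},
}


def beam_search(start, num_beams, steps):
    """Keep the num_beams best partial sequences by cumulative log-probability."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break  # nothing left to expand
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]  # prune to the top beams
    return beams


greedy_like = beam_search("<s>", num_beams=1, steps=2)[0]
wider = beam_search("<s>", num_beams=2, steps=2)[0]
print(greedy_like[0], round(math.exp(greedy_like[1]), 2))  # ['<s>', 'a', 'x'] 0.3
print(wider[0], round(math.exp(wider[1]), 2))              # ['<s>', 'b', 'x'] 0.36
```

With a single beam the search commits to `a` and ends at probability 0.3, while two beams keep `b` alive long enough to find the 0.36 sequence.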

In [9]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    num_beams=5,
    do_sample=True,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to show you how to make a simple, easy-to-use, and easy to use app that you can use in your web applications.

In this tutorial, we'll be creating a web application that will allow you to search


### Top-k Sampling
At each step of generation, the model predicts a probability distribution over the entire vocabulary for the next token, then selects only the top $k$ most probable tokens, ranked by their predicted probability. The probabilities of these top $k$ tokens are renormalised to sum to 1, and the next token is randomly sampled from this restricted set.

The number $k$ is configured by the parameter `top_k`.
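
The filter-and-renormalise step can be sketched as follows. The helper `top_k_probs` and the example logits are hypothetical; the point is only to show that tokens outside the top $k$ receive zero probability and the rest are rescaled to sum to 1.

```python
import math
import random


def top_k_probs(logits, k):
    """Keep the k highest-scoring tokens and renormalise their probabilities."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_probs(logits, k=2)
print(probs)  # only token ids 0 and 1 remain

# Sample the next token id from the restricted set.
rng = random.Random(0)
next_id = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(next_id)
```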

In [10]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    top_k=10,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm excited to introduce you to my favorite type of game – one that I have played for years but only just recently discovered. The game in question is called "Doom" and has a pretty simple concept: shoot zombies with the bow.




### Top-p (Nucleus) Sampling
Instead of selecting tokens from the entire vocabulary, nucleus sampling samples from the smallest set of tokens whose cumulative probability exceeds the threshold $p$. This introduces controlled randomness, resulting in more diverse and creative text generation.

The probability threshold $p$ is configured by the parameter `top_p`.
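
A minimal sketch of how the nucleus is built is shown below. `top_p_probs` is a hypothetical helper operating on an already-normalised probability list; in this sketch the nucleus is closed as soon as the cumulative mass reaches $p$, which is an assumption about the boundary case.

```python
def top_p_probs(probs, p):
    """Return the smallest high-probability set whose mass reaches p, renormalised."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break  # the nucleus now covers at least p of the probability mass
    return {i: probs[i] / mass for i in nucleus}


probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_probs(probs, p=0.6))  # two tokens are enough to cover 0.6
print(top_p_probs(probs, p=0.9))  # three tokens are needed to cover 0.9
```

Unlike top-k, the size of the candidate set adapts to the shape of the distribution: a confident prediction yields a small nucleus, a flat one yields a large nucleus.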

In [11]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    top_p=0.9,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I would like to share some of the best tips and tricks from my 10 years of coaching, along with a few of my own ideas and tactics I've tried on myself.

In no particular order, I'll share my most helpful tactics, techniques


## 6. Advanced Generation Strategies
### Speculative Decoding
Speculative decoding speeds up autoregressive decoding by using a second, smaller draft model to generate several speculative tokens, which are then verified by the larger target model. For example, within the GPT-2 family:
- Draft model: smaller GPT-2 (e.g., `gpt2` or `distilgpt2`)
- Target model: `gpt2-large`

Speculative decoding is enabled by passing the draft model via the `assistant_model` parameter.
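
The draft-then-verify loop can be illustrated with two toy "models" that map a token sequence to the next token. Everything here (`target_next`, `draft_next`, `num_draft`) is hypothetical, and the sketch covers only the greedy case; a real implementation verifies all draft tokens in a single forward pass of the target model, which is where the speed-up comes from.

```python
def target_next(seq):
    """Toy target model: counts upwards modulo 5."""
    return (seq[-1] + 1) % 5


def draft_next(seq):
    """Toy draft model: agrees with the target except after the token 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 5


def speculative_greedy(prompt, max_new_tokens, num_draft=3):
    tokens = list(prompt)
    goal = len(tokens) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model cheaply proposes a few tokens.
        draft = []
        for _ in range(min(num_draft, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the proposals in order.
        for proposed in draft:
            expected = target_next(tokens)
            tokens.append(expected)  # the target's token is always the one kept
            if expected != proposed:
                break  # first mismatch: discard the remaining draft tokens
    return tokens


print(speculative_greedy([0], max_new_tokens=6))  # [0, 1, 2, 3, 4, 0, 1]
```

Because only tokens the target model agrees with are kept, the output matches what the target would have produced on its own with greedy decoding.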

In [12]:
draft_model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_model=draft_model,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I am talking about creating an interface that uses a "cabal" in some way. I have been working with this interface for an article that was published in RethinkDB, and I want to share some of my thoughts with you guys on this


Or, we can use the same parameter in Pipeline:

In [13]:
pipe = pipeline(
    task="text-generation",
    tokenizer=tokeniser,
    model=model,
    assistant_model=draft_model,
    torch_dtype=torch.bfloat16,
)
pipe_output = pipe(
    text_inputs=input_text,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    temperature=0.8,
)
print(pipe_output[0]["generated_text"])

Device set to use mps


Hello! Today I have a new set of skills that I want you to try out. The first one is a very easy one, which is, to roll over a random enemy for a point. Here's the roll over:

Here's how the skill looks


### Prompt Lookup Decoding
Prompt Lookup Decoding is a variant of speculative decoding that leverages the significant overlap between the input prompt and the generated output. Unlike traditional speculative decoding that uses a smaller draft model to propose next tokens, this method uses substring matching directly on the input prompt to generate candidate tokens more efficiently.

To enable this technique, pass `prompt_lookup_num_tokens` to `generate()`. It sets how many tokens are proposed as candidates at each step.
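
The substring-matching idea can be sketched on plain lists of token ids. The function below is a hypothetical illustration: `ngram_size` is an assumed name for the length of the trailing n-gram being matched, and `num_candidates` plays the role described above for `prompt_lookup_num_tokens`.

```python
def prompt_lookup_candidates(token_ids, ngram_size=2, num_candidates=3):
    """Propose draft tokens by reusing what followed an earlier copy of the tail."""
    if len(token_ids) <= ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Search backwards for an earlier occurrence of the trailing n-gram.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return token_ids[follow:follow + num_candidates]
    return []  # no match: fall back to ordinary decoding for this step


# The sequence ends with (1, 2), which also appears at the start,
# so the tokens that followed it there become the candidates.
print(prompt_lookup_candidates([1, 2, 3, 4, 5, 9, 1, 2]))  # [3, 4, 5]
```

The candidates are then verified by the model just like draft-model proposals, so the approach pays off mainly when the output copies spans from the input, for example in summarisation or code editing.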

In [14]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    prompt_lookup_num_tokens=3,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I will be sharing with you a recipe that I've been loving for a while: Slow Roasted Beef Wellington with Caramelized Onions!

I'm so excited for this dish because I love the idea behind it, the caramelized onions make


### Self-Speculative Decoding
Self-Speculative Decoding accelerates inference in LLMs by using different parts of the same model instead of relying on a separate, smaller model. It uses the early (shallower) layers of the model as the draft model and the deeper layers for verification. During generation, the model exits early to produce draft tokens faster, improving efficiency without losing output quality.

General models like GPT-2, GPT-3, or GPT-J typically do not support self-speculative decoding because they lack architectural support for layer sparsity or early exits. For this section, we will use a Llama-based model that supports self-speculative decoding.

Passing the `assistant_early_exit` parameter to the `generate()` function activates self-speculative decoding. Its value is the layer at which the model exits early during the draft (speculative) stage, so it controls how many early layers are used to produce draft tokens.

In [None]:
from huggingface_hub import login
from dotenv import load_dotenv
import os

load_dotenv()  # Load .env variables

HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

# from huggingface_hub import whoami

# user_info = whoami()
# print(f"Logged in as: {user_info['name']}")

In [None]:
# Self-Speculative Decoding is not available for GPT-2
model = AutoModelForCausalLM.from_pretrained(
    "facebook/layerskip-llama3.2-1B", device_map="auto"
)

model.generation_config.pad_token_id = (
    model.generation_config.eos_token_id
)  # Handle warnings

tokeniser = AutoTokenizer.from_pretrained(
    "facebook/layerskip-llama3.2-1B", padding_side="left"
)
model_inputs = tokeniser(input_text, return_tensors="pt").to(device)
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_early_exit=4,
    no_repeat_ngram_size=2,
    do_sample=False,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Some parameters are on the meta device because they were offloaded to the disk.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

['Hello! Today I am sharing a card I made for the new challenge at the Paper Players. I used the sketch from the challenge and the colors are from my stash. The sentiment is from a stamp set I have had for a long time. It is called "Thank']


### Universal Assisted Decoding
Universal Assisted Decoding (UAD), sometimes called Universal Assisted Generation, is an advanced method designed to speed up language model text generation by using two models (a large target model and a smaller assistant model) even when those models use different tokenisers or come from entirely different model families.

After the assistant model generates tokens, these are converted to text and then re-encoded with the target model's tokeniser so the target model can verify them. This allows for flexible speedups of any large model using an arbitrary smaller model from a different family or tokeniser.

UAD is enabled by passing the assistant model via `assistant_model` together with both tokenisers: the target model's via `tokenizer` and the assistant's via `assistant_tokenizer`.
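
The decode-then-re-encode step described above can be sketched with two made-up vocabularies. The tables and the greedy longest-match tokeniser below are hypothetical stand-ins for real tokenisers; they only show why text is a workable bridge between models whose token ids do not line up.

```python
# Hypothetical vocabularies: the draft model splits text differently from the target.
DRAFT_ID_TO_TEXT = {0: "Hel", 1: "lo", 2: " wor", 3: "ld"}
TARGET_TEXT_TO_ID = {
    "Hello": 10, " world": 11, "Hel": 12, "lo": 13, " wor": 14, "ld": 15,
}


def draft_decode(ids):
    """Convert draft-model token ids back to plain text."""
    return "".join(DRAFT_ID_TO_TEXT[i] for i in ids)


def target_encode(text):
    """Greedy longest-match tokenisation against the target vocabulary."""
    pieces = sorted(TARGET_TEXT_TO_ID, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        # Toy assumption: some vocabulary piece always matches at this position.
        piece = next(p for p in pieces if text.startswith(p, pos))
        ids.append(TARGET_TEXT_TO_ID[piece])
        pos += len(piece)
    return ids


def bridge(draft_ids):
    """Re-express draft tokens in the target model's token ids via text."""
    return target_encode(draft_decode(draft_ids))


print(bridge([0, 1, 2, 3]))  # [10, 11]: four draft tokens become two target tokens
```

Note that the number of tokens can change across the bridge, which is one reason the target model has to verify the re-encoded sequence rather than the draft ids directly.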

In [26]:
MODEL_NAME = "gpt2"
DRAFT_MODEL_NAME = "double7/vicuna-68m"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
model.generation_config.pad_token_id = model.generation_config.eos_token_id

draft_model = AutoModelForCausalLM.from_pretrained(DRAFT_MODEL_NAME)
draft_tokeniser = AutoTokenizer.from_pretrained(DRAFT_MODEL_NAME)
draft_model.generation_config.pad_token_id = draft_model.generation_config.eos_token_id

input_text = "Hello! Today I"

model_inputs = tokeniser(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_model=draft_model,
    tokenizer=tokeniser,
    assistant_tokenizer=draft_tokeniser,
    no_repeat_ngram_size=2,
    do_sample=False,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to talk about the new features of the game.

The new feature is the ability to play the "Doom" mode. This mode is a new way to experience the Doom experience. It's a game mode that allows you to
