# Text Generation
***
## Table of Contents
1. [Introduction](#1-introduction)
1. [Device Agnostic Code](#2-device-agnostic-code)
1. [Loading Pre-Trained Model](#3-loading-pre-trained-model)
    - [GPT-2](#gpt-2)
1. [Repetition](#4-repetition)
1. [Basic Generation Strategies](#5-basic-generation-strategies)
    - [Greedy Search](#greedy-search)
    - [Sampling](#sampling)
    - [Beam Search](#beam-search)
    - [Top-k Sampling](#top-k-sampling)
    - [Top-p (Nucleus) Sampling](#top-p-nucleus-sampling)
1. [Advanced Generation Strategies](#6-advanced-generation-strategies)
    - [Speculative Decoding](#speculative-decoding)
    - [Prompt Lookup Decoding](#prompt-lookup-decoding)
    - [Self-Speculative Decoding](#self-speculative-decoding)
    - [Universal Assisted Decoding](#universal-assisted-decoding)
    - [Contrastive Search](#contrastive-search)
    - [Decoding by Contrasting Layers (DoLa)](#decoding-by-contrasting-layers-dola)
    - [Diverse Beam Search](#diverse-beam-search)
1. [References](#7-references)
***

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## 1. Introduction
Text generation is the task of automatically producing coherent and contextually relevant text using machine learning models. It is a fundamental component of natural language processing (NLP), enabling applications such as conversational agents, content creation, and summarisation. 

The purpose of this project is to explore various text generation strategies, both basic and advanced, using a range of pretrained transformer models (i.e., GPT-2, GPT-2 Large, layerskip-llama3.2-1B, and double7/vicuna-68m) to gain deeper insights into their capabilities and performance across different generation techniques.


## 2. Device Agnostic Code
GPU acceleration delivers significant speed-up over CPU for deep learning tasks, especially for large models and batch sizes.

In [2]:
device = torch.device(
    device="cuda"  # GPU
    if torch.cuda.is_available()
    else "mps"  # MPS (MacOS)
    if torch.backends.mps.is_available()
    else "cpu"  # No GPU Available
)
device

device(type='mps')

## 3. Loading Pre-Trained Model
### GPT-2

In [3]:
MODEL_NAME = "gpt2-large"
MAX_TOKENS = 50
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")

In [4]:
model.generation_config.pad_token_id = model.generation_config.eos_token_id

In [5]:
input_text = "Hello! Today I"

model_inputs = tokeniser(input_text, return_tensors="pt").to(device)

In [6]:
model_inputs

{'input_ids': tensor([[15496,     0,  6288,   314]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1]], device='mps:0')}

## 4. Repetition
Repeated phrases are a common phenomenon in autoregressive language models. If a token or phrase is a high-probability continuation in the training data, the model's predictions tend to loop, generating the same phrase over and over.
To avoid or reduce repetition, several techniques can be applied:

- Employ nucleus sampling or top-k sampling (explained in detail below).
- Increase `temperature` above 1.0 to flatten the probability distribution, encouraging more random choices.
- Use `repetition_penalty` (typically between 1.1 and 1.5) to reduce the probability of tokens that have already been generated.
- Apply `no_repeat_ngram_size` to strictly prevent the model from generating the same n-gram repeatedly.
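
As a rough illustration of what these parameters do to the next-token scores, the sketch below applies them to a hand-written list of logits in plain Python. The helper names (`softmax`, `apply_repetition_penalty`, `banned_tokens`) and the numbers are made up for illustration and are not the library's implementation; the penalty rule (divide positive logits, multiply negative ones) is an assumption about the usual formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make every already-generated token id less likely."""
    out = list(logits)
    for i in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one lowers it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def banned_tokens(generated_ids, n):
    """Token ids that would repeat an n-gram already present in the output."""
    if len(generated_ids) < n:
        return set()
    prefix = generated_ids[len(generated_ids) - (n - 1):]
    return {
        generated_ids[i + n - 1]
        for i in range(len(generated_ids) - n + 1)
        if generated_ids[i:i + n - 1] == prefix
    }


logits = [2.0, 1.0, -1.0]  # hypothetical scores for a three-token vocabulary
print(softmax(logits, temperature=1.0))  # sharper distribution
print(softmax(logits, temperature=2.0))  # flatter distribution
print(apply_repetition_penalty(logits, generated_ids=[0, 2], penalty=2.0))
print(banned_tokens([5, 6, 7, 5, 6], n=3))  # ids that may not follow (5, 6)
```

A higher temperature lowers the peak of the distribution, while the penalty and the n-gram ban act only on tokens that have already appeared.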

## 5. Basic Generation Strategies
### Greedy Search
Greedy Search is the default decoding strategy used by `.generate()`. At each step, it selects the token with the highest probability as the next token. This method is simple and fast, which makes it suitable for generating short text. For longer text, however, it can lead to repetitive and less diverse sequences.

By default, `.generate()` returns up to 20 new tokens unless a different limit (for example `max_new_tokens`) is specified in the call or in `GenerationConfig`.
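
To make the selection rule concrete, here is a minimal sketch of greedy decoding over a hypothetical toy model: a hand-written table of next-token probabilities conditioned only on the previous token. `TOY_MODEL` and `greedy_decode` are illustrative names, not part of `transformers`.

```python
# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "the": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}


def greedy_decode(model, start="<s>", max_new_tokens=20):
    """Always append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = model[tokens[-1]]
        next_token = max(probs, key=probs.get)  # argmax over the vocabulary
        tokens.append(next_token)
        if next_token == "</s>":  # stop at the end-of-sequence token
            break
    return tokens


print(greedy_decode(TOY_MODEL))
print(greedy_decode(TOY_MODEL, max_new_tokens=2))  # truncated by the token limit
```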

In [7]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to show you how to make a simple, yet effective, way to get your hands on a few of the most popular and most sought after brands of beer.

The first thing you need to do is to find a beer that you


### Sampling
Sampling draws the next token at random from the model's probability distribution over the entire vocabulary. This reduces repetition and can generate more creative, diverse outputs compared to the greedy search strategy.

Sampling is enabled by setting the parameters: `do_sample=True` and `num_beams=1`.
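
The difference from greedy search can be seen in a small standard-library sketch. The distribution `next_token_probs` and the helper `sample_next` are hypothetical; a real model would produce the probabilities.

```python
import random
from collections import Counter


def sample_next(probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
next_token_probs = {"the": 0.6, "a": 0.3, "one": 0.1}  # hypothetical distribution

# Greedy search would return "the" every time; sampling also picks the others.
counts = Counter(sample_next(next_token_probs, rng) for _ in range(10_000))
print(counts)
```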

In [8]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    num_beams=1,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm getting a "preview" picture from Microsoft with some of the first shots, some "lookins" and some renders. Some people are saying that this is just pre-visualization work and nothing more than that, but the fact is,


### Beam Search
Beam Search maintains multiple candidate sequences (beams) simultaneously. At each step, it extends every beam with its possible next tokens, then retains the `num_beams` candidates with the highest cumulative (overall) probability. This strategy is suited to input-grounded tasks such as image captioning or speech recognition.

Beam search is enabled by setting `num_beams > 1`. Adding `do_sample=True` switches to beam-search multinomial sampling, in which tokens are sampled rather than chosen deterministically.
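
The following sketch shows the deterministic variant on a hypothetical toy model in which the greedy path is not the most probable sequence overall. `TOY_MODEL` and `beam_search` are illustrative; refinements used by real implementations, such as length penalties, are left out.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "bird": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def beam_search(model, start="<s>", num_beams=2, max_new_tokens=5):
    """Keep the num_beams sequences with the highest cumulative log-probability."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, p in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams


best_tokens, best_score = beam_search(TOY_MODEL, num_beams=2)[0]
print(best_tokens, math.exp(best_score))
print(beam_search(TOY_MODEL, num_beams=1)[0][0])  # one beam behaves like greedy search
```

With two beams the search keeps the initially less likely token "a" alive and ends up with a sequence of probability 0.36, whereas a single beam commits to "the" and reaches only 0.24.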

In [9]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    num_beams=5,
    do_sample=True,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to show you how to make a very simple and simple to use, but very useful, LED strip.

The first thing you need to do is to download and install the Arduino IDE. If you don't have it yet, you


### Top-k Sampling
At each step of generation, the model predicts a probability distribution over the entire vocabulary for the next token, then selects only the top $k$ most probable tokens, ranked by their predicted probability. The probabilities of these top $k$ tokens are renormalised to sum to 1, and the next token is randomly sampled from this restricted set.

The number $k$ is configured by the parameter `top_k`.
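
The truncate-and-renormalise step can be sketched in a few lines of plain Python. `top_k_filter` and the example distribution are hypothetical and only mirror the description above.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}  # hypothetical
filtered = top_k_filter(next_token_probs, k=2)
print(filtered)

# Sample the next token from the restricted set only.
rng = random.Random(0)
picked = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(picked)
```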

In [10]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    top_k=10,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to tell you about a very simple and simple-to-use tool that you can use to make some very powerful and fun sounds in your games.

If you've played a lot of indie games recently, chances are you know what


### Top-p (Nucleus) Sampling
Instead of selecting tokens from the entire vocabulary, nucleus sampling samples from the smallest set of tokens whose cumulative probability exceeds the threshold $p$. This introduces controlled randomness, resulting in more diverse and creative text generation.

The probability threshold $p$ is configured by the parameter `top_p`.
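
A minimal sketch of how the nucleus is built, assuming a hypothetical hand-written distribution; `top_p_filter` is an illustrative helper rather than the library's code. Unlike top-k, the number of tokens kept depends on the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # the nucleus is complete
            break
    # Renormalise so the kept probabilities sum to 1 before sampling.
    return {token: prob / cumulative for token, prob in kept}


next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}  # hypothetical
print(top_p_filter(next_token_probs, p=0.9))  # keeps three tokens
print(top_p_filter(next_token_probs, p=0.4))  # keeps only the top token
```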

In [11]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    top_p=0.9,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'd like to present to you the first release of the official Linux distribution of this game:

Linux 4.6.3: Xubuntu 14.04 LTS with Unity 7
.
- Lubuntu is a lightweight distribution based on Ubuntu


## 6. Advanced Generation Strategies
### Speculative Decoding
Speculative decoding speeds up autoregressive decoding by using a second, smaller draft model to generate several speculative tokens, which the larger target model then verifies. For example, with GPT-2:
- Draft model: smaller GPT-2 (e.g., `gpt2` or `distilgpt2`)
- Target model: `gpt2-large`

Speculative decoding is enabled by passing the draft model to the `assistant_model` parameter.
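
The draft-then-verify loop can be sketched without any real model. The example below covers only the greedy case, where a drafted token is accepted if it equals the target's own choice; the probabilistic acceptance rule used with sampling is not shown. `target_next`, `draft_next`, and `speculative_decode` are hypothetical stand-ins: the "target" simply recites a fixed sentence and the "draft" makes one deliberate mistake.

```python
REFERENCE = "the cat sat on the mat and purred".split()


def target_next(tokens):
    """Hypothetical target model: recites REFERENCE one token at a time."""
    return REFERENCE[len(tokens)] if len(tokens) < len(REFERENCE) else "</s>"


def draft_next(tokens):
    """Hypothetical draft model: agrees with the target except for one word."""
    guess = target_next(tokens)
    return "in" if guess == "on" else guess


def speculative_decode(target, draft, prompt, max_new_tokens, num_draft=3):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes a short continuation, token by token.
        drafted = []
        for _ in range(num_draft):
            drafted.append(draft(tokens + drafted))
        # 2. The target checks all drafted positions; a real model would do
        #    this in one batched forward pass, counted here as one pass.
        target_passes += 1
        accepted = []
        for token in drafted:
            expected = target(tokens + accepted)
            if token != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(token)
        else:
            accepted.append(target(tokens + accepted))  # bonus token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


output, passes = speculative_decode(target_next, draft_next, ["the"], max_new_tokens=7)
print(output, passes)
```

The output is identical to what the target would produce on its own, but it takes 2 verification passes instead of 7 single-token steps.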

In [12]:
draft_model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_model=draft_model,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to talk about a method that was introduced just recently called the "WidowMaker" method. It's a new method introduced to the framework, and it's going allow us to create a full set of objects called "wife", "


Alternatively, the same parameter can be passed to `pipeline`:

In [13]:
pipe = pipeline(
    task="text-generation",
    tokenizer=tokeniser,
    model=model,
    assistant_model=draft_model,
    torch_dtype=torch.bfloat16,
)
pipe_output = pipe(
    text_inputs=input_text,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    temperature=0.8,
)
print(pipe_output[0]["generated_text"])

Device set to use mps


Hello! Today I'm going to give you guys the quick rundown on the new update on the project in the Dev's Corner. There was a lot of content that has been built, so this should be a quick overview of what's been done.

So,


### Prompt Lookup Decoding
Prompt Lookup Decoding is a variant of speculative decoding that leverages the significant overlap between the input prompt and the generated output. Unlike traditional speculative decoding that uses a smaller draft model to propose next tokens, this method uses substring matching directly on the input prompt to generate candidate tokens more efficiently.

To enable this technique, it is sufficient to specify the number of overlapping tokens in the `prompt_lookup_num_tokens` parameter.
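
A minimal sketch of the substring-matching idea, assuming a fixed n-gram size: the last few tokens of the sequence are searched for earlier in the sequence, and the tokens that followed the match become the draft candidates. `prompt_lookup_candidates`, `ngram_size`, and `num_candidates` are hypothetical names, not the library's parameters.

```python
def prompt_lookup_candidates(tokens, ngram_size=2, num_candidates=3):
    """Propose draft tokens by matching the trailing n-gram earlier in the text."""
    if len(tokens) < ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            return tokens[follow:follow + num_candidates]
    return []  # no match: fall back to ordinary decoding


tokens = ["x", "=", "foo", "(", "y", ")", ";", "z", "=", "foo", "("]
print(prompt_lookup_candidates(tokens))
print(prompt_lookup_candidates(["a", "b", "c"]))
```

The candidates are only guesses; the target model still verifies them, so the approach pays off mainly when the output repeats spans of the input, as in summarisation or code editing.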

In [14]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    prompt_lookup_num_tokens=3,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'd like to talk about some things you can do with the Command Line. Not as a replacement for the IDE in your workflow, but as an alternative. I often find that the old IDE does the same job as the command line, and I know


### Self-Speculative Decoding
Self-Speculative Decoding is a method to accelerate inference in LLMs by using different parts of the same model instead of relying on a second, smaller model. It uses the early (shallower) layers of the model as the draft model and the deeper layers for verification. During generation, the model skips the remaining layers to generate draft tokens faster, improving efficiency without losing output quality.

General-purpose models like GPT-2, GPT-3, or GPT-J typically do not support self-speculative decoding because they lack architectural support for layer sparsity or early exits. For this section, we will use a Llama-based model that supports self-speculative decoding.

Passing the `assistant_early_exit` parameter to the `generate()` function activates self-speculative decoding. Its value is the layer at which the model exits early, so it controls how many early layers are used for the draft (speculative) stage during generation.

In [15]:
from huggingface_hub import login
from dotenv import load_dotenv
import os

load_dotenv()  # Load .env variables

HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

# from huggingface_hub import whoami

# user_info = whoami()
# print(f"Logged in as: {user_info['name']}")

In [16]:
# Self-Speculative Decoding is not available for GPT-2
model = AutoModelForCausalLM.from_pretrained(
    "facebook/layerskip-llama3.2-1B", device_map="auto"
)

model.generation_config.pad_token_id = (
    model.generation_config.eos_token_id
)  # Handle warnings

tokeniser = AutoTokenizer.from_pretrained(
    "facebook/layerskip-llama3.2-1B", padding_side="left"
)
model_inputs = tokeniser(input_text, return_tensors="pt").to(device)
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_early_exit=4,
    no_repeat_ngram_size=2,
    do_sample=False,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Some parameters are on the meta device because they were offloaded to the disk.


Hello! Today I am sharing a card I made for the new challenge at the Paper Players. I used the sketch from the challenge and the colors are from my stash. The sentiment is from a stamp set I have had for a long time. It is called "Thank


### Universal Assisted Decoding
Universal Assisted Decoding (UAD), sometimes called Universal Assisted Generation, is an advanced method designed to speed up language model text generation by using two models (a large target model and a smaller assistant model) even when those models use different tokenisers or come from entirely different model families.

After the assistant model generates tokens, these are converted to text and then re-encoded with the target model's tokeniser so the target model can verify them. This allows for flexible speedups of any large model using an arbitrary smaller model from a different family or tokeniser.

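UAD is enabled by passing both tokenisers to `generate()`: the target model's tokeniser via the `tokenizer` parameter and the assistant model's tokeniser via the `assistant_tokenizer` parameter, alongside `assistant_model`.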
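
The re-encoding step is the only part that differs from ordinary speculative decoding, and it can be illustrated with two deliberately mismatched toy tokenisers. Both tokenisers (word-level for the assistant, character-level for the target) and the helper `translate_draft` are hypothetical.

```python
def word_decode(tokens):
    """Hypothetical assistant tokeniser: tokens are whitespace-separated words."""
    return " ".join(tokens)


def char_encode(text):
    """Hypothetical target tokeniser: every character is a token."""
    return list(text)


def translate_draft(draft_tokens, assistant_decode, target_encode):
    """Convert drafted tokens to text, then re-encode them for the target model."""
    return target_encode(assistant_decode(draft_tokens))


draft_tokens = ["going", "to"]  # proposed by the assistant model
target_tokens = translate_draft(draft_tokens, word_decode, char_encode)
print(target_tokens)  # these are the tokens the target model verifies
```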

In [17]:
MODEL_NAME = "gpt2"
DRAFT_MODEL_NAME = "double7/vicuna-68m"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
model.generation_config.pad_token_id = model.generation_config.eos_token_id

draft_model = AutoModelForCausalLM.from_pretrained(DRAFT_MODEL_NAME)
draft_tokeniser = AutoTokenizer.from_pretrained(DRAFT_MODEL_NAME)
draft_model.generation_config.pad_token_id = draft_model.generation_config.eos_token_id

input_text = "Hello! Today I"

model_inputs = tokeniser(input_text, return_tensors="pt").to(device)

outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_model=draft_model,
    tokenizer=tokeniser,
    assistant_tokenizer=draft_tokeniser,
    no_repeat_ngram_size=2,
    do_sample=False,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Hello! Today I'm going to talk about the new features of the game.

The new feature is the ability to play the "Doom" mode. This mode is a new way to experience the Doom experience. It's a game mode that allows you to


### Contrastive Search
Contrastive Search is a decoding method designed to reduce repetition and improve diversity in language model outputs, especially for generating long sequences. It works by comparing the similarity of candidate tokens with previously generated tokens and penalising those that are similar, thereby encouraging more diverse and informative text.

This method is typically controlled by two parameters:

- `penalty_alpha`: A hyperparameter that balances the trade-off between selecting high-probability tokens and penalising similar (redundant) tokens.
- `top_k`: The number of top candidate tokens considered at each decoding step for applying the contrastive scoring.
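
How the two parameters interact can be sketched as follows, assuming the commonly cited scoring rule: each of the `top_k` candidates is scored as `(1 - penalty_alpha) * probability - penalty_alpha * max_similarity`, where the similarity is the cosine between the candidate's hidden state and the hidden states of earlier tokens. The vectors and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, context_states, penalty_alpha):
    """candidates: (token, probability, hidden_state) for the top-k tokens."""
    def score(candidate):
        _, prob, state = candidate
        # Penalise the candidate by its closest match among earlier tokens.
        degeneration = max(cosine(state, h) for h in context_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration
    return max(candidates, key=score)[0]


context_states = [[1.0, 0.0]]  # hidden state of a previously generated token
candidates = [
    ("again", 0.6, [1.0, 0.0]),  # likely, but identical to the context
    ("new", 0.4, [0.0, 1.0]),    # less likely, but dissimilar
]
print(contrastive_pick(candidates, context_states, penalty_alpha=0.0))
print(contrastive_pick(candidates, context_states, penalty_alpha=0.5))
```

With `penalty_alpha=0.0` the rule reduces to greedy search; raising it shifts the choice towards tokens whose representations differ from what has already been generated.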

In [18]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    no_repeat_ngram_size=2,
    do_sample=False,
    penalty_alpha=0.5,
    top_k=4,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to talk about the "Battleship" of the United States.

The Battle of Lexington
 (1803)
. . .
, . The Battle at Lexington was a battle fought between the English and the French in


### Decoding by Contrasting Layers (DoLa)
Decoding by Contrasting Layers (DoLa) is a contrastive decoding strategy that improves factual accuracy and reduces hallucinations in large language models (LLMs) without requiring external knowledge retrieval or additional fine-tuning.

DoLa compares the token predictions (logits) from the final layers with those from earlier layers during decoding, aiming to highlight tokens that represent factual knowledge more effectively. This technique is not recommended for smaller models such as GPT-2.

DoLa is enabled by the following two parameters:

- `dola_layers`: candidate layers to be contrasted with the final layer. Can be set to `'high'` (for short-answer tasks) or `'low'` (for long-answer tasks).
- `repetition_penalty`: reduces repetition; it is recommended to set this to 1.2.
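
The contrast itself can be sketched on two hand-written distributions, assuming a single, already chosen early layer (the selection among candidate layers is not shown). Tokens that are implausible under the final layer are masked out, and the rest are scored by the log-ratio between final-layer and early-layer probabilities. The function `dola_scores`, the threshold `alpha`, and all numbers are hypothetical.

```python
import math


def dola_scores(final_probs, early_probs, alpha=0.1):
    """Score tokens by how much the final layer raises them over an early layer."""
    threshold = alpha * max(final_probs.values())
    scores = {}
    for token, q_final in final_probs.items():
        if q_final >= threshold:
            scores[token] = math.log(q_final) - math.log(early_probs[token])
        else:
            scores[token] = float("-inf")  # implausible under the final layer
    return scores


# Hypothetical next-token distributions from the final and an early layer.
final_probs = {"Seattle": 0.50, "Olympia": 0.45, "the": 0.04, "banana": 0.01}
early_probs = {"Seattle": 0.50, "Olympia": 0.05, "the": 0.40, "banana": 0.05}

scores = dola_scores(final_probs, early_probs)
print(max(final_probs, key=final_probs.get))  # choice without the contrast
print(max(scores, key=scores.get))            # choice after the contrast
```

In this toy case the token whose probability grows most between the early and final layer wins, even though it is not the final layer's top choice.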

In [19]:
MODEL_NAME = "gpt2-large"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")
model.generation_config.pad_token_id = model.generation_config.eos_token_id

model_inputs = tokeniser(input_text, return_tensors="pt").to(device)
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=False,
    dola_layers="high",
    repetition_penalty=1.2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'd like to tell you a bit about our website!


If you like, you can come over and look around for the free games or come and have a look around in our lobby!


### Diverse Beam Search
Diverse Beam Search is a variant of traditional beam search designed to generate more diverse output sequences. It penalises candidates that are too similar to those already chosen and, to keep the computational cost manageable, divides the total number of beams into smaller groups (beam groups).

This method is enabled by setting the following three parameters:

- `num_beams`: Total number of beams across all groups.
- `num_beam_groups`: Number of groups into which beams are divided. Must divide evenly into `num_beams` (i.e., `num_beams % num_beam_groups == 0`).
- `diversity_penalty`: Strength of the penalty applied to a token that another group has already selected at the same step, encouraging diverse outputs across groups.
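
The effect of the penalty can be sketched for a single decoding step, under two simplifying assumptions: every group holds one beam, and all groups see the same hypothetical next-token scores. Groups choose in turn, and each token already picked by an earlier group loses `diversity_penalty` per earlier pick. `diversity_adjusted` and `pick_per_group` are illustrative helpers.

```python
from collections import Counter


def diversity_adjusted(log_probs, chosen_by_earlier_groups, diversity_penalty):
    """Subtract the penalty once for every earlier group that chose the token."""
    counts = Counter(chosen_by_earlier_groups)
    return {t: lp - diversity_penalty * counts[t] for t, lp in log_probs.items()}


def pick_per_group(log_probs, num_beam_groups, diversity_penalty):
    """Let each group choose in turn, penalising earlier groups' choices."""
    chosen = []
    for _ in range(num_beam_groups):
        adjusted = diversity_adjusted(log_probs, chosen, diversity_penalty)
        chosen.append(max(adjusted, key=adjusted.get))
    return chosen


log_probs = {"how": -0.5, "what": -0.9, "why": -1.2}  # hypothetical scores
print(pick_per_group(log_probs, num_beam_groups=3, diversity_penalty=0.0))
print(pick_per_group(log_probs, num_beam_groups=3, diversity_penalty=1.0))
```

Without a penalty all groups pick the same token; with a penalty of 1.0 each group is pushed towards a different one.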

In [20]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    num_beams=6,
    num_beam_groups=3,
    diversity_penalty=1.0,
    do_sample=False,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I want to show you how to create a simple web application using the AngularJS framework. This tutorial will show you how to create a simple web application using the AngularJS framework. This tutorial will show you how to create a simple web application using the Angular


## 7. References
1. Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav. (2024). *Faster Text Generation with Self-Speculative Decoding*. Hugging Face.<br>
https://huggingface.co/blog/layerskip

1. Majd Farah. (2023). *Generating Text with GPT2 in Under 10 Lines of Code*.<br>
https://medium.com/@majd.farah08/generating-text-with-gpt2-in-under-10-lines-of-code-5725a38ea685

1. Hugging Face. (n.d.). *Generation strategies*.<br>
https://huggingface.co/docs/transformers/en/generation_strategies

1. Hugging Face. (n.d.). *Text generation*. <br>
https://huggingface.co/docs/transformers/en/llm_tutorial

1. Daniel Korat, Jonathan Mamou, Nadav Timor, Oren Pereg, Joao Gante, Moshe Wasserblat, Moshe Berchansky, Lewis Tunstall. (2024). *Universal Assisted Generation: Faster Decoding with Any Assistant Model*.<br>
https://huggingface.co/blog/universal_assisted_generation
