# Text Generation

In [34]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

## 1. Introduction


## 2. Device Agnostic Code
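GPU acceleration delivers a significant speed-up over the CPU for deep learning workloads, especially with large models and large batch sizes. The cell below selects CUDA when it is available, then Apple's MPS backend, and falls back to the CPU otherwise.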

In [35]:
device = torch.device(
    device="cuda"  # GPU
    if torch.cuda.is_available()
    else "mps"  # MPS (MacOS)
    if torch.backends.mps.is_available()
    else "cpu"  # No GPU Available
)
device

device(type='mps')

## 3. Loading Pre-Trained Model
### GPT-2

In [36]:
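MODEL_NAME = "gpt2-large"
MAX_TOKENS = 50  # Maximum number of new tokens per generation call
# device_map="auto" lets the library place the weights on the available device
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
# Decoder-only models continue from the last input token, so pad on the left
tokeniser = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")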

In [37]:
# GPT-2 defines no padding token, so reuse the end-of-sequence token for padding
model.generation_config.pad_token_id = model.generation_config.eos_token_id

In [38]:
input_text = "Hello! Today I"

model_inputs = tokeniser(input_text, return_tensors="pt").to(device)

In [39]:
model_inputs

{'input_ids': tensor([[15496,     0,  6288,   314]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1]], device='mps:0')}

## 4. Repetition
Repeated phrases are a common failure mode of autoregressive language models. When a token or phrase has a high-probability continuation in the training data, the model's predictions can fall into a loop and generate the same phrase again and again.
To avoid or reduce repetition, several techniques can be applied:

- Employ nucleus sampling or top-k sampling (explained in detail below).
- Increase `temperature` above 1.0 to flatten the probability distribution, which encourages more random choices.
- Use `repetition_penalty` (typically between 1.1 and 1.5) to reduce the probability of tokens that have already been generated.
- Apply `no_repeat_ngram_size` to strictly prevent the model from generating the same n-gram repeatedly.
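
The last two techniques can be sketched in plain Python. This is a simplified illustration rather than the library's implementation: the helper names `apply_repetition_penalty` and `banned_next_tokens` are hypothetical, and logits are represented as a plain list indexed by token id.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` in which already-generated token ids are less likely."""
    penalised = list(logits)
    for token_id in set(generated_ids):
        score = penalised[token_id]
        # Dividing a positive logit and multiplying a negative one both lower it.
        penalised[token_id] = score / penalty if score > 0 else score * penalty
    return penalised


def banned_next_tokens(generated_ids, ngram_size=2):
    """Return the token ids that would complete an n-gram already in `generated_ids`."""
    if len(generated_ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
# [1.0, -2.0, 0.5]
print(banned_next_tokens([5, 7, 9, 5], ngram_size=2))
# {7}
```

In the second call the sequence ends in token 5, and the bigram (5, 7) has already occurred, so token 7 is excluded from the next step.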

## 5. Basic Generation Strategies
### Greedy Search
Greedy search is the default decoding strategy used by `.generate()`. At each step, it selects the token with the highest probability as the next token. This method is simple and fast, which makes it suitable for generating short text. For longer text, however, it can lead to repetitive and less diverse sequences.
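
The selection rule can be illustrated on a toy example. The sketch below assumes a hypothetical `TOY_MODEL` table in which the next-token probabilities depend only on the previous token; a real language model conditions on the whole sequence.

```python
# Hypothetical toy "model": next-token probabilities given the previous token only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "the": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 0.9, "the": 0.1},
}


def greedy_decode(model, start="<s>", max_new_tokens=10, eos="</s>"):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        distribution = model[tokens[-1]]
        next_token = max(distribution, key=distribution.get)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


print(greedy_decode(TOY_MODEL))
# ['<s>', 'the', 'cat', 'sat', '</s>']
```

Because the choice at every step is deterministic, running the function again always yields the same sequence.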

By default, `.generate()` returns only up to 20 tokens unless a limit such as `max_new_tokens` is set in the call or in `GenerationConfig`.

In [40]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to show you how to make a simple, yet effective, way to get your hands on a few of the most popular and most sought after brands of beer.

The first thing you need to do is to find a beer that you


### Sampling
Sampling selects the next token at random according to the model's probability distribution over the entire vocabulary. This reduces repetition and can produce more creative, diverse output than greedy search.
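
As a rough sketch of a single sampling step, the snippet below converts a list of logits into probabilities and draws one token id from them. The function names and the example logits are illustrative assumptions; the `temperature` argument shows how a value above 1.0 flattens the distribution.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token id at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
rng = random.Random(0)    # seeded generator so that runs are reproducible
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))  # flatter distribution
print([sample_token(logits, rng=rng) for _ in range(10)])
```

Token 0 is still drawn most often, but unlike greedy search the other tokens are picked occasionally as well.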

Sampling is enabled by setting `do_sample=True` and `num_beams=1`.

In [41]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    num_beams=1,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I've got two amazing pieces that aren't pictured that are incredibly simple and easy to make. First, we have a little cute book with a cute little dragon at the front. I have it up on my bookshelf right now and was thinking if


### Beam Search
Beam search maintains multiple candidate sequences (beams) simultaneously. At each step, it expands every beam with candidate next tokens, then retains the $k$ beams with the highest cumulative probability, where $k$ is the number of beams. This strategy is suited for input-grounded tasks such as image captioning or speech recognition.
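
The expand-and-prune loop can be sketched on a toy example. The `TOY_MODEL` table below is hypothetical and chosen so that the greedy choice is not the best overall sequence; scores are cumulative log-probabilities, and length normalisation, which real implementations often apply, is left out.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token only.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def beam_search(model, start="<s>", num_beams=2, max_new_tokens=5, eos="</s>"):
    """Return the surviving beams as (tokens, cumulative log-probability) pairs."""
    beams = [([start], 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams candidates with the highest cumulative score.
        beams = sorted(candidates, key=lambda beam: beam[1], reverse=True)[:num_beams]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams


best_tokens, best_score = beam_search(TOY_MODEL, num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 2))
# ['<s>', 'b', 'z', '</s>'] 0.36
```

With `num_beams=1` the function reduces to greedy search and returns the sequence through `a`, whose total probability is only 0.3.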

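Beam search is enabled by setting `num_beams` to a value greater than 1. Combining it with `do_sample=True`, as in the cell below, switches to beam-search multinomial sampling, in which tokens are sampled rather than chosen deterministically.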

In [42]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    num_beams=5,
    do_sample=True,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to show you how to create a simple web app using Angular 2.

Angular 2 is a new framework for building web applications. It's fast, easy to learn, and supports a wide variety of technologies. In this tutorial,


### Top-k Sampling
At each step of generation, the model predicts a probability distribution over the entire vocabulary for the next token, then selects only the top $k$ most probable tokens, ranked by their predicted probability. The probabilities of these top $k$ tokens are renormalised to sum to 1, and the next token is randomly sampled from this restricted set.
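
The filter-and-renormalise step might look like the following sketch. The helper name `top_k_filter` and the example probabilities are illustrative assumptions, and the distribution is a plain list indexed by token id.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise; all others get probability 0."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top_ids)
    filtered = [0.0] * len(probs)
    for i in top_ids:
        filtered[i] = probs[i] / kept_mass
    return filtered


probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up distribution over five tokens
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])
# [0.714, 0.286, 0.0, 0.0, 0.0]

# The next token is then sampled from the restricted set.
rng = random.Random(0)
print(rng.choices(range(len(filtered)), weights=filtered, k=5))
```

Only token ids 0 and 1 can be drawn after filtering, however many samples are taken.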

The number $k$ is configured by the parameter `top_k`.

In [43]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    top_k=10,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I have a very cool little project I want to talk about. My project is to make an app that will allow you to play around and make your own games. I've already written an application for it. It was pretty simple.

I made


### Top-p (Nucleus) Sampling
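Instead of sampling from the entire vocabulary, nucleus sampling samples from the smallest set of most probable tokens whose cumulative probability reaches or exceeds the threshold $p$. Unlike top-k sampling, the size of this set adapts to the distribution: it is small when the model is confident and larger when the probability mass is spread out. This introduces controlled randomness, resulting in more diverse and creative text generation.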
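
A minimal sketch of how the nucleus could be built is shown below. The helper name `top_p_filter` and the example probabilities are illustrative assumptions, and the distribution is a plain list indexed by token id.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalise the kept probabilities so that they sum to 1.
    filtered = [0.0] * len(probs)
    for i in nucleus:
        filtered[i] = probs[i] / cumulative
    return filtered


probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up distribution over five tokens
print([round(x, 3) for x in top_p_filter(probs, p=0.8)])
# [0.588, 0.235, 0.176, 0.0, 0.0]
print([round(x, 3) for x in top_p_filter(probs, p=0.4)])
# [1.0, 0.0, 0.0, 0.0, 0.0]
```

With a threshold of 0.8 the nucleus holds three tokens, whereas a threshold of 0.4 is already met by the single most probable token.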

The probability threshold $p$ is configured by the parameter `top_p`.

In [44]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    top_p=0.9,
    no_repeat_ngram_size=2,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I would like to talk to you about the most important part of your career: the next level in your development. Today we will see about all of this with your new position as a UX Designer at one of the major game studios.

We all


## 6. Advanced Generation Strategies
### Speculative Decoding
Speculative decoding speeds up autoregressive decoding by using a second, smaller draft model to propose several speculative tokens, which the larger target model then verifies. For example, with GPT-2:
- Draft model: smaller GPT-2 (e.g., `gpt2` or `distilgpt2`)
- Target model: `gpt2-large`
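
The draft-then-verify loop can be sketched with two toy "models". This is a simplified greedy variant under stated assumptions: `target_next` and `draft_next` are hypothetical functions that return the next token for a sequence of integers, and verification is a plain equality check. When sampling is enabled, real implementations use a probabilistic acceptance rule instead.

```python
def target_next(tokens):
    """Hypothetical target model: always counts upwards modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_greedy_decode(prompt, max_new_tokens=8, num_draft_tokens=3):
    tokens = list(prompt)
    verification_rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes a short continuation, one cheap step at a time.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the proposal. A real model would score all
        #    positions in one forward pass, so this counts as a single round.
        verification_rounds += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # the target's own token is always kept
            if expected != proposed:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the target adds one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verification_rounds


print(speculative_greedy_decode([0]))
# ([0, 1, 2, 3, 4, 5, 6, 7, 8], 3)
```

The output matches what the target model alone would produce greedily, but it needs three verification rounds instead of eight sequential target steps.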

Speculative decoding can be enabled by passing a draft model to the `assistant_model` parameter.

In [45]:
# Smaller GPT-2 checkpoint that shares the target model's tokeniser
draft_model = AutoModelForCausalLM.from_pretrained("gpt2")
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    assistant_model=draft_model,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to cover the most common error you'll encounter in using JavaScript in a web application:

You need to use the <script type="text/javascript">…</script> tag as JavaScript is disabled for this page. When you use


Alternatively, the same parameter can be passed to `pipeline`:

In [48]:
pipe = pipeline(
    task="text-generation",
    tokenizer=tokeniser,
    model=model,
    assistant_model=draft_model,
    torch_dtype=torch.bfloat16,
)
pipe_output = pipe(
    text_inputs=input_text,
    max_new_tokens=MAX_TOKENS,
    do_sample=True,
    temperature=0.8,
)
print(pipe_output[0]["generated_text"])

Device set to use mps


Hello! Today I'm going to put together a quick guide to install and create a new MFC app to test the app with (for Windows Phone users, that is) and for Android users, it will also be helpful to get you up and running with your new


### Prompt Lookup Decoding
Prompt lookup decoding is a variant of speculative decoding that exploits the overlap between the input prompt and the generated output, which is common in input-grounded tasks such as summarisation. Instead of using a smaller draft model to propose the next tokens, it matches the most recent n-gram against the input prompt and uses the tokens that followed the match as candidates, which is cheaper than running a draft model.
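
The matching step can be sketched as follows. This is an illustration only: the function name and the parameters `max_ngram_size` and `num_candidate_tokens` are hypothetical, and the library's actual implementation may differ in its details.

```python
def prompt_lookup_candidates(token_ids, max_ngram_size=3, num_candidate_tokens=3):
    """Propose continuation tokens by matching the trailing n-gram against earlier text."""
    # Try the longest n-gram first, since a longer match is a stronger signal.
    for ngram_size in range(max_ngram_size, 0, -1):
        if len(token_ids) <= ngram_size:
            continue
        suffix = token_ids[-ngram_size:]
        # Scan earlier positions, most recent first, skipping the suffix itself.
        for start in range(len(token_ids) - ngram_size - 1, -1, -1):
            if token_ids[start:start + ngram_size] == suffix:
                end = start + ngram_size + num_candidate_tokens
                continuation = token_ids[start + ngram_size:end]
                if continuation:
                    return continuation
    return []  # no match: fall back to ordinary decoding


tokens = "the quick brown fox jumps . the quick".split()
print(prompt_lookup_candidates(tokens))
# ['brown', 'fox', 'jumps']
```

The trailing bigram "the quick" also occurs at the start of the text, so the three tokens that followed it there are proposed as candidates for the target model to verify.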

To enable this technique, set `prompt_lookup_num_tokens`, which specifies the number of tokens to output as candidates.

In [None]:
outputs = model.generate(
    **model_inputs,
    max_new_tokens=MAX_TOKENS,
    prompt_lookup_num_tokens=3,
    no_repeat_ngram_size=2,
    do_sample=True,
)
print(tokeniser.batch_decode(outputs, skip_special_tokens=True)[0])

Hello! Today I'm going to be talking to you about the subject which is my subject, the things that you must never know but that I can guarantee you!

Why Do We Need To Fear All Of This?
 in order to ensure you will never ever


### Self-Speculative Decoding