### Decoder-Only Architecture

Decoder-only models are designed for autoregressive generation: predicting the next token in a sequence given all previous tokens. They consist solely of Transformer decoder layers, each performing masked self-attention so that the model can only “see” tokens to the left.

![Decoder Only](transformer-architectures.png)

#### Architecture Details

1. **Input Representation**
- Token embeddings: Each input token is converted to a dense vector.
- Position embeddings: Added to token embeddings to encode ordering.
- No segment embeddings are needed for single-sequence generation.

2. **Stack of Decoder Layers**

Each layer consists of:

- Masked Multi-Head Self-Attention
    - Queries, keys, and values are all computed from the layer’s input.
    - A causal mask prevents each token from attending to future positions.
- Feed-Forward Network
    - Two linear transformations with a non-linearity (usually GELU) in between.
- Layer Normalization & Residual Connections
    - Applied around both the attention sub-layer and the feed-forward sub-layer.

3. **Language Modeling Head**
- A final linear layer, often weight-tied to the token embeddings, projects the last layer’s hidden states to vocabulary logits.
- Softmax over the logits produces a probability distribution over the next token.
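To make the causal mask concrete, here is a minimal pure-Python sketch of single-head masked self-attention. It is an illustration under stated assumptions, not GPT-2's implementation: the function name `causal_self_attention` and the toy vectors are hypothetical, and the learned projection matrices are omitted, so queries, keys, and values are passed in directly. Restricting position i to keys 0..i has the same effect as setting the scores of future positions to negative infinity before the softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors, one per position.
    Position i attends only to positions 0..i.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Scores against the current and earlier positions only.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return outputs


# Hypothetical 3-token sequence with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(x, x, x)
print(out[0])  # [1.0, 0.0] because the first position can only see itself
```

Changing the last token leaves the outputs for earlier positions untouched, which is exactly the property that lets the model be trained on all positions of a sequence in one pass.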


#### Advantages
- Simplicity
    - Only one stack of layers—easier to scale and optimize.
- Autoregressive Generation
    - Naturally suited for tasks requiring left-to-right decoding (e.g., story and code generation).
- Efficient Inference
    - Can cache past key/value tensors to avoid re-computing attention for earlier tokens.
- Pretraining & Fine-tuning
    - Standard language modeling pretraining easily transfers to many generation tasks.
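The key/value caching idea can be sketched without any library. The `attend` function and `KVCache` class below are hypothetical helpers written for this illustration, and the toy vectors stand in for real projected keys and values. In an actual Transformer the saving comes from not recomputing the key and value projections of earlier tokens in every layer; this sketch only shows that attending over a growing cache gives the same result as rebuilding the prefix at each step.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * value[c] for e, value in zip(exps, values))
        for c in range(len(values[0]))
    ]


class KVCache:
    """Stores keys and values of already-processed tokens (hypothetical helper)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append only the new token's key/value, then attend over the cache.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy token vectors used as query, key, and value at once.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(t, t, t) for t in tokens]

# Without a cache, step i would rebuild keys/values for the whole prefix.
recomputed = [attend(t, tokens[: i + 1], tokens[: i + 1]) for i, t in enumerate(tokens)]
print(cached_outputs == recomputed)  # True
```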


#### Disadvantages
- No Bidirectional Context
    - Each position attends only to tokens on its left, so its representation cannot draw on right-hand context the way encoder-only or encoder-decoder models can.
- Exposure Bias
    - Trained on ground-truth prefixes, but during generation it conditions on its own previous predictions.
- Limited for Discriminative Tasks
    - Less suitable for tasks like classification or extractive QA without adding special heads.


#### Why and When It’s Used
- Text Generation
    - Chatbots, story writing, code completion, poetry, dialogue systems.
- Autoregressive Modeling
    - Any scenario where you predict the next element in a sequence (language, music, DNA).
- Fine-Tuning on Generation Tasks
    - Given a pretrained decoder-only model, you can fine-tune it on specific generation tasks with prompts.


#### Python Code Demo: Text Generation with GPT-2

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# 1) Load pretrained GPT-2 tokenizer & model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# 2) Prepare prompt
prompt_text = (
    "In a distant future, humanity has colonized Mars. "
    "The first team to land discovers"
)
inputs = tokenizer(prompt_text, return_tensors="pt")  # input_ids + attention_mask

# 3) Generate text
# - max_length: total length in tokens, including the prompt
# - do_sample, top_k, top_p, temperature: sampling parameters for more varied output
# - pad_token_id: GPT-2 defines no pad token, so EOS is reused for padding
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)

# 4) Decode and print
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)


In a distant future, humanity has colonized Mars. The first team to land discovers that there is an artificial moon of Jupiter in the vicinity. Marsites are able to identify the moon by looking at a map of its orbit. They then go on to discover that there is an artificial moon nearby, and that it is the first moon in a massive solar system.

In a distant future, humanity has colonized Mars. The first team to land discovers that there is an artificial moon of Jupiter


In [2]:

# 2) Prepare prompt
prompt_text = (
    "Write a dialog for TOny Stark when he encounters Thanos"
)
inputs = tokenizer(prompt_text, return_tensors="pt")  # input_ids + attention_mask

# 3) Generate text
# - max_length: total length in tokens, including the prompt
# - do_sample, top_k, top_p, temperature: sampling parameters for more varied output
# - pad_token_id: GPT-2 defines no pad token, so EOS is reused for padding
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)

# 4) Decode and print
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)


Write a dialog for TOny Stark when he encounters Thanos or Hawkeye

Crossover: Marvel Unlimited

Sora (unseen) is one of the first characters to appear in Marvel Unlimited, and she is the only character to appear in the comics.

Sora has her powers from the comics, but they are not so used in the comics. She does have the ability to speak the same language as the Avengers and the Avengers. She is one of the few characters


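**These snippets show how a decoder-only model takes a text prompt, generates one token at a time under masked self-attention, and uses sampling strategies (top_k, top_p, temperature) to produce varied continuations. Note that the base GPT-2 model is not instruction-tuned: it treats the second prompt as text to continue rather than as a request to follow, which is why that output drifts off topic.**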
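The effect of temperature and top_k can be illustrated on toy logits. The sketch below is a simplified stand-in, not the Hugging Face implementation: the function name `top_k_temperature_probs`, the five-token vocabulary, and the order of operations (temperature first, then top-k) are assumptions made for this example.

```python
import math
import random


def top_k_temperature_probs(logits, top_k, temperature):
    """Return one probability per token after temperature scaling and top-k filtering."""
    scaled = [logit / temperature for logit in logits]
    # Indices of the top_k largest scaled logits; all other tokens get probability 0.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = set(ranked[:top_k])
    peak = max(scaled)
    exps = [math.exp(s - peak) if i in keep else 0.0 for i, s in enumerate(scaled)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = top_k_temperature_probs(logits, top_k=3, temperature=0.8)

rng = random.Random(0)  # fixed seed so the draw is repeatable
next_token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print(probs, next_token)
```

Lowering the temperature concentrates probability on the highest-scoring token, while top_k removes the tail entirely so that very unlikely tokens can never be drawn.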


#### Comparison of Autoregressive Decoding Methods

##### 1. Greedy Search
Greedy search selects the single token with the highest probability at each step, producing a deterministic output. It’s easy to implement and very fast but often leads to repetitive or bland text.

- Pros
    - Very fast and memory-light
    - Deterministic (same prompt → same output)
- Cons
    - Low diversity, can get stuck in loops
    - Often produces generic or repetitive phrases


Here’s a minimal example of greedy decoding with a decoder-only model (GPT-2). At each step, we pick the highest-probability token and append it until we have generated max_new_tokens tokens or hit the EOS token.


In [3]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# 1. Load pretrained GPT-2 and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# 2. Encode prompt
prompt_text = "Once upon a time in a land far away,"
input_ids   = tokenizer.encode(prompt_text, return_tensors="pt")

# 3. Greedy decoding loop
max_new_tokens = 50
generated = input_ids

with torch.no_grad():
    for _ in range(max_new_tokens):
        # Forward pass to get logits for the last token
        outputs = model(generated)
        logits  = outputs.logits   # shape: [1, seq_len, vocab_size]

        # Select the token with highest probability (greedy)
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)

        # Append to sequence
        generated = torch.cat([generated, next_token], dim=-1)

        # Stop if EOS token is generated
        if next_token.item() == tokenizer.eos_token_id:
            break

# 4. Decode and print
decoded = tokenizer.decode(generated[0], skip_special_tokens=True)
print(decoded)

Once upon a time in a land far away, the sun was shining, and the moon was shining. The sun was shining, and the moon was shining. The sun was shining, and the moon was shining. The sun was shining, and the moon was shining. The sun was shining, and


**Explanation of key steps:**

- We encode the prompt into input_ids.
- In each iteration, we pass the entire sequence through the model and extract the logits for the last position. Re-running the full sequence keeps the loop simple but repeats work that a key/value cache would avoid.
- torch.argmax picks the token with the highest logit (greedy choice).
- We append that token and repeat until we reach max_new_tokens or generate the EOS token.

This gives a deterministic, left-to-right generation in which the locally most likely token is chosen at every step.


##### 2. Beam Search
Beam search keeps the top k candidate sequences (beams) at each generation step, expanding each in parallel. It balances exploration and exploitation, yielding higher-quality outputs than greedy search for many tasks.

- Pros
    - Higher overall quality and coherence
    - Can recover from early mistakes by exploring alternatives
- Cons
    - Increased computational and memory cost
    - If beam width is too large, outputs can become overly safe or generic


Here’s how you can use Hugging Face’s generate API to perform beam-search decoding with a GPT-2 (decoder-only) model. This lets you explore multiple high-probability continuations in parallel:


In [10]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# 1. Load pretrained model & tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# 2. Prepare your prompt
prompt = "Space travel is all about human ingenuity to reach vast lengths"
inputs = tokenizer(prompt, return_tensors="pt")  # input_ids + attention_mask

# 3. Beam search generation
#   - num_beams: how many parallel beams to keep
#   - num_return_sequences: how many of the final beams to return
#   - early_stopping=True: stop once num_beams finished candidates exist
#   - pad_token_id: GPT-2 defines no pad token, so EOS is reused for padding
beam_outputs = model.generate(
    **inputs,
    max_length=60,
    num_beams=5,
    num_return_sequences=3,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)

# 4. Decode & display each beam
for i, beam in enumerate(beam_outputs):
    text = tokenizer.decode(beam, skip_special_tokens=True)
    print(f"\n=== Beam {i+1} ===\n{text}")



=== Beam 1 ===
Space travel is all about human ingenuity to reach vast lengths of space.

In the past few years, we've learned a lot about space travel. We've learned a lot about how it works. And we've learned a lot about how it's possible.

We've learned a lot

=== Beam 2 ===
Space travel is all about human ingenuity to reach vast lengths of space.

In the past few years, we've learned a lot about space travel. We've learned a lot about how it works. And we've learned a lot about how it works in the real world.

We've

=== Beam 3 ===
Space travel is all about human ingenuity to reach vast lengths of space.

In the past few years, we've learned a lot about space travel. We've learned a lot about how it works. And we've learned a lot about how it works for us.

We've learned a


##### Key arguments:

- num_beams: number of hypotheses tracked at each step (higher → more exhaustive search, but slower).
- num_return_sequences: how many of the top beams you want back.
- early_stopping=True: stops the search as soon as num_beams finished candidates (sequences ending in an end-of-sequence token) have been found.

Beam search is still deterministic. It sits between greedy decoding and an exhaustive search: it returns higher-probability sequences than greedy decoding usually finds, but it adds little diversity, as the three nearly identical beams above show.
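The mechanics of beam search can be sketched on a toy model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table of next-token probabilities, and `beam_search` is a simplified version that ignores end-of-sequence handling and length penalties, which real implementations include. The table is built so that the greedy first choice leads to a worse overall sequence.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}


def beam_search(model, num_beams, max_steps):
    """Return (sequence, log_prob) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            next_probs = model.get(seq)
            if next_probs is None:  # no continuation defined: keep as finished
                candidates.append((seq, score))
                continue
            for token, prob in next_probs.items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


greedy = beam_search(TOY_MODEL, num_beams=1, max_steps=2)
beam = beam_search(TOY_MODEL, num_beams=2, max_steps=2)
print(greedy[0][0], round(math.exp(greedy[0][1]), 2))  # ('A', 'x') 0.3
print(beam[0][0], round(math.exp(beam[0][1]), 2))      # ('B', 'z') 0.36
```

With a single beam the search commits to "A" and ends at probability 0.30. With two beams it keeps "B" alive for one more step and finds the sequence with probability 0.36.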


##### 3. Nucleus (Top-p) Sampling
Nucleus sampling builds a dynamic shortlist at each step: the smallest set of highest-probability tokens whose cumulative probability mass is ≥ p. It renormalizes that set and samples from it, so only plausible tokens are considered while randomness is preserved.

- Pros
    - High diversity and creative outputs
    - Avoids extremely low-probability words
- Cons
    - Quality can fluctuate—occasional incoherence
    - More unpredictable and harder to reproduce


Here’s a minimal example showing how to do nucleus (top-p) sampling with a decoder-only model (GPT-2). We set do_sample=True, top_p=0.9 (include only the smallest set of tokens whose cumulative prob ≥ 0.9), and top_k=0 (disable top-k filtering).


In [11]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# 1. Load pretrained GPT-2 and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model     = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# 2. Prepare your prompt
prompt = "Space travel is all about human ingenuity to reach vast lengths"
inputs = tokenizer(prompt, return_tensors="pt")  # input_ids + attention_mask

# 3. Nucleus (top-p) sampling generation
sample_outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,         # enable sampling
    top_p=0.9,              # nucleus sampling threshold
    top_k=0,                # disable top-k
    temperature=1.0,        # 1.0 leaves the distribution unchanged
    num_return_sequences=3, # generate 3 independent samples
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
)

# 4. Decode and print each sampled continuation
for i, output in enumerate(sample_outputs, 1):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"=== Sample {i} ===\n{text}\n")


=== Sample 1 ===
Space travel is all about human ingenuity to reach vast lengths of space, to establish infrastructure, to experiment for an underdeveloped and enslaved space environment. But it's also about making the enterprise seem like if nothing else, at least, it's not. This isn't to say that aliens don't play chess with the company we love. Time-lapse pictures of aliens — and popularized for cinematic and cartoon purposes — have been used for many years, despite numerous efforts.

"The difference

=== Sample 2 ===
Space travel is all about human ingenuity to reach vast lengths, rapid seas and waves without air and water. Today, many writers — including Laurie Penny, Tim Burton, Margaret Atwood, Robert Zemeckis and George Clooney — are inspired by people with particular hunchback ability to go to places that are impossible to imagine and keep their sanity. Others represent plants, herbs and minerals that are invisible to reality and which have miraculous uses within our own minds

##### Key arguments:

- top_p=0.9 means at each step you sample from the smallest set of tokens whose total probability mass is ≥ 0.9.
- top_k=0 turns off fixed-size top-k filtering so only top-p is used.
- temperature softens or sharpens the distribution (<1 for less random, >1 for more random).
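The top-p rule itself fits in a few lines. The following is a minimal sketch on a made-up distribution, not the library's implementation: `top_p_filter` and the five-token probabilities are assumptions chosen so the cumulative sums are easy to follow.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p.

    Returns a dict mapping token index to its renormalized probability.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))  # [0, 1, 2, 3]

rng = random.Random(0)  # fixed seed so the draw is repeatable
token = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(token)
```

The first three tokens sum to 0.875, which is below 0.9, so a fourth token is added to reach 0.9375. Unlike top-k, the size of the shortlist adapts to how peaked the distribution is at each step.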


**Trade-offs Summary**

| Method | Quality | Diversity | Speed | Per-token cost |
|--------|---------|-----------|-------|----------------|
| Greedy Search | Low–Medium | Very Low | Very Fast | One forward pass plus an argmax over the vocabulary |
| Beam Search (k beams) | Medium–High | Low–Medium | Slower (roughly k× greedy) | k forward passes (batched) plus a top-k over k × vocab candidates |
| Nucleus Sampling | Medium–High | High | Fast (close to greedy) | One forward pass plus a sort over the vocabulary |



