# 8. Constrained / Structured Decoding

Have you ever wondered how APIs like OpenAI's `response_format={ "type": "json_object" }` can promise syntactically valid JSON instead of merely hoping for it?

Hosted providers do not document every implementation detail, but the standard technique is **Constrained Decoding** (available in open-source form through libraries like `guidance`, `outlines`, or `llama.cpp` grammars).

This is arguably one of the most important concepts in modern agentic AI. If you want an LLM to call a Python function, you cannot tolerate a hallucinated quotation mark or a missing bracket. The syntax must be guaranteed by construction, not by the model's good behavior.

## The Concept

Instead of begging the model in the prompt ("PLEASE output valid JSON, do not include markdown, I will tip you $200"), Constrained Decoding intercepts the raw **Logits** right before they are converted into probabilities via Softmax.

It uses a deterministic parser (like a Regex engine or JSON schema validator) to analyze the string generated so far. It then asks the parser: *"Which tokens in the vocabulary are legally allowed to come next?"*
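
As a rough illustration of that question, the sketch below uses a toy vocabulary and a hypothetical `allowed_next_tokens` helper (neither comes from a real library). The "parser" here is just a prefix check against a fixed set of valid outputs; real engines would typically compile a regex or grammar into an automaton instead.

```python
# Toy sketch: which vocabulary tokens keep the output a valid prefix?
# VALID_OUTPUTS, VOCAB and allowed_next_tokens are illustrative names, not a library API.
VALID_OUTPUTS = {"true", "false", "null"}
VOCAB = ["t", "tr", "ue", "f", "al", "se", "n", "ull", "x", "{", "e"]


def allowed_next_tokens(generated: str) -> list[str]:
    """Return every token that extends `generated` into a prefix of some valid output."""
    allowed = []
    for token in VOCAB:
        candidate = generated + token
        if any(target.startswith(candidate) for target in VALID_OUTPUTS):
            allowed.append(token)
    return allowed


print(allowed_next_tokens(""))     # ['t', 'tr', 'f', 'n']
print(allowed_next_tokens("tr"))   # ['ue']
print(allowed_next_tokens("fal"))  # ['se']
```

Every token outside the returned list is one that would make the output unrecoverable, so those are the tokens to block at this step.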

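Every illegal token has its logit overwritten with `-Infinity`. When Softmax is applied, $e^{-\infty} = 0$, so each of those tokens gets a probability of exactly $0.0\%$ and can never be picked, whether we sample or take the argmax. It becomes mathematically impossible for the model to emit bad syntax. Note that the guarantee covers only the *form* of the output: the model can still put a wrong value inside perfectly valid JSON.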
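
To see the arithmetic without loading a model, here is a minimal pure-Python sketch. The four logits and the `allowed` set are made-up values for illustration, and `softmax` is a small helper written for this example rather than the PyTorch function.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
allowed = {0, 3}                # pretend the parser allows only tokens 0 and 3

# Keep the original logit for allowed tokens, overwrite the rest with -inf
masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
probs = softmax(masked)

print(probs)  # tokens 1 and 2 come out as exactly 0.0
```

The surviving tokens keep their relative ordering, and their probabilities are renormalized to sum to 1. One design consequence: the allowed set must never be empty, because a row of all `-inf` has no valid Softmax.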

In [1]:
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'
model_id = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.eval()

print("Ready for structured generation!")

Ready for structured generation!


### A Simple Example: Forcing a Binary Choice

Let's say we are building a sentiment analysis classifier. We ONLY want the model to output the literal word `"Positive"` or `"Negative"`. Nothing else is allowed.

Instead of hoping the model obeys the prompt, we will brutally enforce it.

In [2]:
# Find the token IDs for the two words we want to allow.
# Note: Tokenization is tricky! The prompt ends with "Sentiment:", and a word that
# follows other text is usually encoded together with its leading space.
# Caveat: [0] keeps only the FIRST token, so this is correct only if each label is a single token.
pos_id = tokenizer.encode(" Positive", add_special_tokens=False)[0]
neg_id = tokenizer.encode(" Negative", add_special_tokens=False)[0]

print(f"Allowed Tokens -> Positive: {pos_id}, Negative: {neg_id}")

prompt = "Review: The food was absolutely disgusting and cold.\nSentiment:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    outputs = model(input_ids)
    
logits = outputs.logits[0, -1, :]

print("\n--- Top 3 Natural Model Choices (No Constraints) ---")
top_values, top_indices = torch.topk(logits, 3)
for i in range(3):
    print(f"{repr(tokenizer.decode(top_indices[i].item()))} (Logit: {top_values[i].item():.2f})")

print("\n--- Applying the Constraint Mask ---")
# Start from a tensor filled with -Infinity (every token banned)
mask = torch.full_like(logits, -float('Inf'))

# Copy the original logits back in for ONLY the two allowed tokens!
mask[pos_id] = logits[pos_id]
mask[neg_id] = logits[neg_id]

top_values_masked, top_indices_masked = torch.topk(mask, 3)
for i in range(3):
    val = top_values_masked[i].item()
    print(f"{repr(tokenizer.decode(top_indices_masked[i].item()))} (Logit: {val})")

# Generate the final forced token
forced_token = torch.argmax(mask).unsqueeze(0).unsqueeze(0)
print(f"\nFinal Output: {prompt}{tokenizer.decode(forced_token[0])}")

Allowed Tokens -> Positive: 43903, Negative: 50857

--- Top 3 Natural Model Choices (No Constraints) ---
' negative' (Logit: 15.69)
' Mixed' (Logit: 15.00)
' Negative' (Logit: 15.00)

--- Applying the Constraint Mask ---
' Negative' (Logit: 15.0)
' Positive' (Logit: 11.875)
'!' (Logit: -inf)

Final Output: Review: The food was absolutely disgusting and cold.
Sentiment: Negative
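
One caveat with the cell above: indexing with `[0]` keeps only the first token of each label, which is correct only when the label encodes to a single token (as it happens to here). A small guard can make that assumption explicit. In this sketch `single_token_id` is a hypothetical helper, and the fake encoder stands in for `tokenizer.encode` so the snippet runs without downloading a model; the ID list for `" Lukewarm"` is invented purely to trigger the error path.

```python
from typing import Callable


def single_token_id(encode: Callable[[str], list[int]], text: str) -> int:
    """Return the token ID for `text`, or raise if it spans several tokens."""
    ids = encode(text)
    if len(ids) != 1:
        raise ValueError(f"{text!r} encodes to {len(ids)} tokens: {ids}")
    return ids[0]


# Stand-in encoder for demonstration. With a real tokenizer you would pass
# lambda s: tokenizer.encode(s, add_special_tokens=False)
fake_vocab = {" Positive": [43903], " Negative": [50857], " Lukewarm": [101, 202, 303]}
fake_encode = fake_vocab.__getitem__

print(single_token_id(fake_encode, " Positive"))  # 43903
try:
    single_token_id(fake_encode, " Lukewarm")
except ValueError as err:
    print("Rejected:", err)
```

For labels that do span several tokens, a single-step mask is not enough; the constraint has to track which token of the label comes next, as the state machine in Experiment 2 does.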


### 🔬 Experimentation Ideas

1. **Force number generation:**
   * *Find the token IDs for the digits `0` through `9`. Mask the logits so the model is forced to write a 10-digit phone number, even if the prompt asks it to write a poem.*
2. **Implement a strict JSON array:**
   * *Write a loop that forces the first token to be `[`, forces the middle tokens to alternate between digits and commas, and forces the final token to be `]`.*
3. **Explore `transformers` LogitsProcessors:**
   * *Look up HuggingFace's `LogitsProcessor` class. This is the hook that `model.generate()` exposes so that your custom Python code runs at exactly the point where we built our `mask` above!*

In [3]:
import torch
from transformers import LogitsProcessor, LogitsProcessorList

print("=== Experiment 1: Forcing a 10-Digit Phone Number ===")
# 1. We must find the exact Token IDs for the digits 0-9 in the model's vocabulary
digit_tokens = [str(i) for i in range(10)]
digit_ids = [tokenizer.encode(d, add_special_tokens=False)[0] for d in digit_tokens]

poem_prompt = "Write a beautiful, emotional poem about the moon and stars:\n"
input_ids = tokenizer(poem_prompt, return_tensors="pt").input_ids.to(device)

print(f"Prompt: {repr(poem_prompt)}")
print("Generating: ", end="")

# 2. We force exactly 10 generation steps
for step in range(10):
    with torch.no_grad():
        outputs = model(input_ids)
    
    logits = outputs.logits[0, -1, :]
    
    # Create an absolute -Infinity mask
    mask = torch.full_like(logits, -float('Inf'))
    
    # UNMASK ONLY the digits
    for d_id in digit_ids:
        mask[d_id] = logits[d_id]
        
    next_token = torch.argmax(mask).unsqueeze(0).unsqueeze(0)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    print(tokenizer.decode(next_token[0]), end="", flush=True)


print("\n\n=== Experiment 2: Strict JSON Array Generation ===")
# 1. Token IDs for structural JSON characters
bracket_open = tokenizer.encode("[", add_special_tokens=False)[0]
bracket_close = tokenizer.encode("]", add_special_tokens=False)[0]
comma = tokenizer.encode(",", add_special_tokens=False)[0]

json_prompt = "The user data formatted as a JSON array:\n"
input_ids = tokenizer(json_prompt, return_tensors="pt").input_ids.to(device)

# 2. We build a simple "State Machine" of arrays of allowed tokens at each specific step
# Forcing the structure: [ digit , digit , digit ]
allowed_sequences = [
    [bracket_open],            # Step 0: Must be [
    digit_ids,                 # Step 1: Must be any digit
    [comma],                   # Step 2: Must be ,
    digit_ids,                 # Step 3: Must be any digit
    [comma],                   # Step 4: Must be ,
    digit_ids,                 # Step 5: Must be any digit
    [bracket_close]            # Step 6: Must be ]
]

print(f"Prompt: {repr(json_prompt)}")
print("Generating: ", end="")

for step in range(len(allowed_sequences)):
    with torch.no_grad():
        outputs = model(input_ids)
    logits = outputs.logits[0, -1, :]
    
    # Apply the mask for this specific step in our state machine
    mask = torch.full_like(logits, -float('Inf'))
    allowed_ids = allowed_sequences[step]
    
    for a_id in allowed_ids:
        mask[a_id] = logits[a_id]
        
    next_token = torch.argmax(mask).unsqueeze(0).unsqueeze(0)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    print(tokenizer.decode(next_token[0]), end="", flush=True)


print("\n\n=== Experiment 3: Using HuggingFace LogitsProcessors ===")
print("Instead of a manual loop, HuggingFace lets us inject our custom rules directly into model.generate()!")

# We create a custom class that inherits from LogitsProcessor
class NoLetterEProcessor(LogitsProcessor):
    """A processor that bans every token containing the letter 'e' or 'E'."""
    def __init__(self, tokenizer):
        # Find all tokens in the entire vocabulary that contain 'e' or 'E'.
        # Caveat: special tokens whose text contains an 'e' (e.g. '<|endoftext|>') are banned too,
        # so generation may only stop once max_new_tokens is reached.
        self.banned_ids = [
            token_id
            for token_str, token_id in tokenizer.get_vocab().items()
            if 'e' in token_str or 'E' in token_str
        ]

    # This method intercepts the logits right before Softmax inside model.generate()
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Ban all listed tokens in one indexing operation instead of a Python loop
        scores[:, self.banned_ids] = -float('Inf')
        return scores

# Initialize our custom processor
processor_list = LogitsProcessorList([NoLetterEProcessor(tokenizer)])

test_prompt = "The secret meaning of life is"
inputs = tokenizer(test_prompt, return_tensors="pt").to(device)

print(f"Prompt: {test_prompt}")
print("Generating (Banning the letter 'e'):")

# Generate normally, but pass our custom logic in!
out = model.generate(
    **inputs, 
    max_new_tokens=20, 
    logits_processor=processor_list, # <--- The magic happens here
    pad_token_id=tokenizer.eos_token_id,
    do_sample=False
)

print(tokenizer.decode(out[0]))


=== Experiment 1: Forcing a 10-Digit Phone Number ===
Prompt: 'Write a beautiful, emotional poem about the moon and stars:\n'
Generating: 1234567890

=== Experiment 2: Strict JSON Array Generation ===
Prompt: 'The user data formatted as a JSON array:\n'
Generating: [1,2,3]

=== Experiment 3: Using HuggingFace LogitsProcessors ===
Instead of a manual loop, HuggingFace lets us inject our custom rules directly into model.generate()!
Prompt: The secret meaning of life is
Generating (Banning the letter 'e'):
The secret meaning of life is to find a way to ________ your own worth.
A. show
B. show off
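
Experiment 2's step table can also be handed to `model.generate()` without writing a `LogitsProcessor` subclass, through its `prefix_allowed_tokens_fn` argument: a callable that receives `(batch_id, input_ids)` and returns the list of token IDs allowed at the next step. The sketch below is one possible way to build such a callable. `make_step_constraint` is a hypothetical helper, and the token IDs in the demo are placeholders rather than real Qwen IDs.

```python
from typing import Callable, Sequence


def make_step_constraint(
    prompt_len: int,
    allowed_per_step: Sequence[Sequence[int]],
    eos_id: int,
) -> Callable[[int, Sequence[int]], list[int]]:
    """Build a callable with the (batch_id, input_ids) -> list[int] shape."""

    def allowed_tokens(batch_id: int, input_ids: Sequence[int]) -> list[int]:
        step = len(input_ids) - prompt_len  # number of tokens generated so far
        if 0 <= step < len(allowed_per_step):
            return list(allowed_per_step[step])
        return [eos_id]  # structure is complete: only end-of-sequence is allowed

    return allowed_tokens


# Placeholder IDs for illustration only
OPEN, CLOSE, COMMA, EOS = 10, 11, 12, 0
DIGITS = [20, 21, 22]
steps = [[OPEN], DIGITS, [COMMA], DIGITS, [CLOSE]]

constraint = make_step_constraint(prompt_len=4, allowed_per_step=steps, eos_id=EOS)
print(constraint(0, [1, 2, 3, 4]))                      # [10]
print(constraint(0, [1, 2, 3, 4, 10]))                  # [20, 21, 22]
print(constraint(0, [1, 2, 3, 4, 10, 20, 12, 21, 11]))  # [0]
```

With the notebook's objects, this would be wired up as `model.generate(**inputs, prefix_allowed_tokens_fn=constraint, max_new_tokens=...)` with `prompt_len=inputs.input_ids.shape[1]`. `generate()` passes the sequence so far as a 1-D tensor, for which `len()` behaves the same as for the plain lists used here. The sketch assumes a single, unpadded prompt.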

