# Debug Training Input: Verify Thinking Tags Are Preserved

This notebook shows the exact text and tokens that go into the model during training.
Use it to verify the fix for the bug in which the chat template strips `<think>` blocks from assistant responses.

In [1]:
import json
import os
from transformers import Qwen3OmniMoeProcessor

In [2]:
# Load processor
processor = Qwen3OmniMoeProcessor.from_pretrained('Qwen/Qwen3-Omni-30B-A3B-Thinking')
print("Processor loaded!")

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. 


Processor loaded!


## 1. The Bug: Original Approach (Thinking Stripped)

In [3]:
# Sample response with thinking
response = "<think>\nI hear a bird chirping. The sound has a melodic, high-pitched quality typical of songbirds.\n</think>\n\nANSWER: Bird"

print("=" * 60)
print("RAW RESPONSE (what we want to train on):")
print("=" * 60)
print(response)
print("=" * 60)

RAW RESPONSE (what we want to train on):
<think>
I hear a bird chirping. The sound has a melodic, high-pitched quality typical of songbirds.
</think>

ANSWER: Bird


In [4]:
# BROKEN: Original approach - include assistant in conversation
conversation_broken = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "text", "text": "What sound is this?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": response}],
    },
]

text_broken = processor.apply_chat_template(
    conversation_broken, tokenize=False, add_generation_prompt=False
)

print("=" * 60)
print("BROKEN: What model was trained on (thinking STRIPPED!):")
print("=" * 60)
print(text_broken)
print("=" * 60)
print("\n>>> PROBLEM: <think> tags are GONE!")

BROKEN: What model was trained on (thinking STRIPPED!):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|audio_start|><|audio_pad|><|audio_end|>What sound is this?<|im_end|>
<|im_start|>assistant
ANSWER: Bird<|im_end|>


>>> PROBLEM: <think> tags are GONE!


## 2. The Fix: New Approach (Thinking Preserved)

In [5]:
# FIXED: Only include system + user, manually append assistant
conversation_without_assistant = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "/path/to/audio.wav"},
            {"type": "text", "text": "What sound is this?"},
        ],
    },
]

# Apply template with add_generation_prompt=True
text_fixed = processor.apply_chat_template(
    conversation_without_assistant, tokenize=False, add_generation_prompt=True
)

# Manually append assistant response
text_fixed = text_fixed + response + "<|im_end|>\n"

print("=" * 60)
print("FIXED: What model will now train on (thinking PRESERVED!):")
print("=" * 60)
print(text_fixed)
print("=" * 60)
print("\n>>> SUCCESS: <think> tags are preserved!")

FIXED: What model will now train on (thinking PRESERVED!):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|audio_start|><|audio_pad|><|audio_end|>What sound is this?<|im_end|>
<|im_start|>assistant
<think>
I hear a bird chirping. The sound has a melodic, high-pitched quality typical of songbirds.
</think>

ANSWER: Bird<|im_end|>


>>> SUCCESS: <think> tags are preserved!
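The append step above can be wrapped in a small helper so the same logic is reused in the training pipeline. This is a minimal sketch, not the project's actual training code: `build_training_text` and `IM_END` are hypothetical names, and the function assumes the prompt was rendered with `add_generation_prompt=True`, so that it ends with the assistant header seen in the output above.

```python
IM_END = "<|im_end|>"  # end-of-turn marker, as it appears in the rendered text above


def build_training_text(prompt_text: str, response: str) -> str:
    """Append the raw assistant response to an already-templated prompt.

    Hypothetical helper: `prompt_text` is assumed to be the output of
    apply_chat_template(..., tokenize=False, add_generation_prompt=True).
    """
    # Refuse prompts that do not end with the assistant header; appending a
    # response to them would produce a malformed turn.
    if not prompt_text.rstrip("\n").endswith("<|im_start|>assistant"):
        raise ValueError("prompt_text must end with the assistant generation prompt")
    # The response is appended verbatim, so any <think> block is kept as-is.
    return prompt_text + response + IM_END + "\n"
```

With the variables from the cell above, `build_training_text(prompt, response)` would replace the manual concatenation, where `prompt` is the string returned by `apply_chat_template`.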


## 3. Side-by-Side Comparison

In [6]:
print("COMPARISON:")
print("=" * 80)
print("\nBROKEN (original):")
print("-" * 40)
# Just show assistant part
broken_assistant = text_broken.split("<|im_start|>assistant")[1] if "<|im_start|>assistant" in text_broken else "N/A"
print(f"<|im_start|>assistant{broken_assistant}")

print("\nFIXED (new):")
print("-" * 40)
fixed_assistant = text_fixed.split("<|im_start|>assistant")[1] if "<|im_start|>assistant" in text_fixed else "N/A"
print(f"<|im_start|>assistant{fixed_assistant}")

print("\n" + "=" * 80)
print("Has <think> tags?")
print(f"  BROKEN: {'<think>' in text_broken}")
print(f"  FIXED:  {'<think>' in text_fixed}")

COMPARISON:

BROKEN (original):
----------------------------------------
<|im_start|>assistant
ANSWER: Bird<|im_end|>


FIXED (new):
----------------------------------------
<|im_start|>assistant
<think>
I hear a bird chirping. The sound has a melodic, high-pitched quality typical of songbirds.
</think>

ANSWER: Bird<|im_end|>


Has <think> tags?
  BROKEN: False
  FIXED:  True
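A substring check for `<think>` is a weak test, since it passes even if the reasoning text itself were altered. A stricter check, sketched below, compares the whole assistant turn against the original response. The helper names `last_assistant_turn` and `response_preserved` are hypothetical, and the header and end markers are assumed to match the rendered text shown above.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
IM_END = "<|im_end|>"


def last_assistant_turn(text: str) -> str:
    """Return the body of the last assistant turn, without header or end marker."""
    _, found, tail = text.rpartition(ASSISTANT_HEADER)
    if not found:
        return ""  # no assistant turn in the text
    return tail.split(IM_END, 1)[0]


def response_preserved(text: str, response: str) -> bool:
    """True only if the assistant turn equals the original response exactly."""
    return last_assistant_turn(text) == response
```

Applied to the two strings compared above, `response_preserved(text_fixed, response)` should hold while `response_preserved(text_broken, response)` should not.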


## 4. Test with Real Training Data

In [7]:
# Load a sample from your actual training data
train_data_path = "../src/training/rest/outputs/rest/filtered/filtered.json"  # Adjust path as needed

if os.path.exists(train_data_path):
    with open(train_data_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"Loaded {len(data)} training samples")
    if data:  # guard against an empty file before indexing
        print("\nFirst sample (truncated to 1000 chars):")
        print(json.dumps(data[0], indent=2)[:1000])
else:
    print(f"Training data not found at: {train_data_path}")
    print("Please update the path to your filtered.json file")

Training data not found at: ../src/training/rest/outputs/rest/filtered/filtered.json
Please update the path to your filtered.json file


In [8]:
# Process a real sample with the FIXED approach
if os.path.exists(train_data_path) and len(data) > 0:
    sample = data[0]
    question = sample.get("question", "").replace("<sound>", "").strip()
    response = sample.get("response", "")
    
    print("Question:", question[:100], "..." if len(question) > 100 else "")
    print("\nResponse (first 500 chars):")
    print(response[:500])
    print("\n" + "=" * 60)
    
    # Check if response has thinking
    has_think = "<think>" in response
    print(f"Response has <think> tags: {has_think}")
    
    if has_think:
        # Simulate the FIXED training approach
        conversation_without_assistant = [
            {
                "role": "system",
                "content": [{"type": "text", "text": "You are a helpful assistant that analyzes audio carefully. Think step by step before giving your final answer."}],
            },
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": "/path/to/audio.wav"},
                    {"type": "text", "text": question},
                ],
            },
        ]
        
        text = processor.apply_chat_template(
            conversation_without_assistant, tokenize=False, add_generation_prompt=True
        )
        text = text + response + "<|im_end|>\n"
        
        print("\nFIXED training text (with thinking preserved):")
        print("=" * 60)
        print(text)
        print("=" * 60)
        print(f"\n<think> preserved in final text: {'<think>' in text}")
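The cell above inspects only the first sample. Before training, it may be worth scanning the whole file for responses whose thinking block is absent or malformed. The sketch below assumes each sample is a dict with a `"response"` field, as in the cell above; `think_tag_report` is a hypothetical helper, not part of the training code.

```python
def think_tag_report(samples):
    """Count responses with a well-formed, missing, or malformed <think> block."""
    report = {"well_formed": 0, "missing": 0, "malformed": 0}
    for sample in samples:
        response = sample.get("response", "")
        opens = response.count("<think>")
        closes = response.count("</think>")
        if opens == 0 and closes == 0:
            report["missing"] += 1
        elif (
            opens == 1
            and closes == 1
            and response.index("<think>") < response.index("</think>")
        ):
            report["well_formed"] += 1
        else:
            # Unclosed, duplicated, or out-of-order tags
            report["malformed"] += 1
    return report
```

Calling `think_tag_report(data)` on the loaded list gives a quick count of how many samples would actually carry reasoning into training.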

## 5. Tokenization Check

In [9]:
# Check that <think> and </think> survive tokenization as single special tokens
response = "<think>\nI hear a bird.\n</think>\n\nANSWER: Bird"

conversation_without_assistant = [
    {"role": "system", "content": [{"type": "text", "text": "You are helpful."}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "/path/to/audio.wav"},
        {"type": "text", "text": "What sound?"},
    ]},
]

text = processor.apply_chat_template(
    conversation_without_assistant, tokenize=False, add_generation_prompt=True
)
text = text + response + "<|im_end|>\n"

# Tokenize
tokens = processor.tokenizer.encode(text, add_special_tokens=False)

print("Token IDs around the thinking section:")
print("=" * 60)

# Find and show tokens around <think>
decoded_tokens = [processor.tokenizer.decode([t]) for t in tokens]
for i, (tok_id, tok_str) in enumerate(zip(tokens, decoded_tokens)):
    if 'think' in tok_str.lower() or 'answer' in tok_str.lower() or tok_str.strip() in ['<', '>', '/', 'ANSWER']:
        # Show context around these tokens
        start = max(0, i - 2)
        end = min(len(tokens), i + 3)
        print(f"\nAround position {i}:")
        for j in range(start, end):
            marker = ">>>" if j == i else "   "
            print(f"{marker} [{j}] {tokens[j]:6d} = {repr(decoded_tokens[j])}")

Token IDs around the thinking section:

Around position 23:
    [21]  77091 = 'assistant'
    [22]    198 = '\n'
>>> [23] 151667 = '<think>'
    [24]    198 = '\n'
    [25]     40 = 'I'

Around position 30:
    [28]  11958 = ' bird'
    [29]    624 = '.\n'
>>> [30] 151668 = '</think>'
    [31]    271 = '\n\n'
    [32]  11692 = 'ANS'
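The visual inspection above can be turned into a programmatic check on the token IDs. The sketch below uses the IDs 151667 and 151668 read from this notebook's output; they are an assumption for any other tokenizer and would need to be looked up again, for example with `processor.tokenizer.convert_tokens_to_ids("<think>")`. The name `think_span` is hypothetical.

```python
THINK_ID = 151667      # '<think>' in the output above
END_THINK_ID = 151668  # '</think>' in the output above


def think_span(token_ids, open_id=THINK_ID, close_id=END_THINK_ID):
    """Return (start, end) indices of the single thinking block.

    Returns None if the block is absent, duplicated, or closed before it opens.
    """
    if token_ids.count(open_id) != 1 or token_ids.count(close_id) != 1:
        return None
    start = token_ids.index(open_id)
    end = token_ids.index(close_id)
    return (start, end) if start < end else None
```

On the `tokens` list from the cell above, a non-`None` result would confirm that both tags were encoded as single special tokens rather than split into pieces such as `<`, `think`, and `>`.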


## 6. Why the Bug Happens (Template Logic)

In [10]:
# The chat template has this logic:
template_logic = '''
The template sets `ns.last_query_index` based on:

    {%- if message.role == "user" and message.content is string ... %}
        {%- set ns.last_query_index = index %}
    {%- endif %}

KEY: "message.content is string"

For multimodal content, we use ARRAY format:
    "content": [{"type": "audio", ...}, {"type": "text", ...}]

This is NOT a string, so the condition fails!

Then later, thinking is only added if:
    {%- if loop.index0 > ns.last_query_index %}

But since ns.last_query_index wasn't updated, this condition is FALSE
for the assistant message, and <think> tags are stripped.
'''

print(template_logic)


The template sets `ns.last_query_index` based on:

    {%- if message.role == "user" and message.content is string ... %}
        {%- set ns.last_query_index = index %}
    {%- endif %}

KEY: "message.content is string"

For multimodal content, we use ARRAY format:
    "content": [{"type": "audio", ...}, {"type": "text", ...}]

This is NOT a string, so the condition fails!

Then later, thinking is only added if:
    {%- if loop.index0 > ns.last_query_index %}

But since ns.last_query_index wasn't updated, this condition is FALSE
for the assistant message, and <think> tags are stripped.



In [11]:
# Demonstrate the string vs array difference
print("STRING content (works):")
print(f"  'What sound?' is string: {isinstance('What sound?', str)}")

print("\nARRAY content (breaks thinking):")
array_content = [{"type": "audio", "audio": "x"}, {"type": "text", "text": "What?"}]
print(f"  [{{'type': 'audio'...}}] is string: {isinstance(array_content, str)}")

print("\nThis is why multimodal breaks the template's thinking logic!")

STRING content (works):
  'What sound?' is string: True

ARRAY content (breaks thinking):
  [{'type': 'audio'...}] is string: False

This is why multimodal breaks the template's thinking logic!
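The template logic described in this section can be paraphrased in plain Python to make the failure reproducible without the processor. This is a simplified sketch, not the real template: it ignores the elided part of the condition, and it assumes that `last_query_index` starts at the index of the last message, which is what makes the later comparison false when no user message updates it. The function name `render_assistant_bodies` is hypothetical.

```python
def render_assistant_bodies(messages):
    """Simplified paraphrase of the template's thinking logic (not the real template)."""
    # ASSUMPTION: the index is initialised to the last message position.
    last_query_index = len(messages) - 1
    # Walk backwards to find the last user message with *string* content.
    for index in range(len(messages) - 1, -1, -1):
        message = messages[index]
        if message["role"] == "user" and isinstance(message["content"], str):
            last_query_index = index
            break

    rendered = []
    for index, message in enumerate(messages):
        if message["role"] != "assistant":
            continue
        content = message["content"]
        if not isinstance(content, str):
            # Join the text parts of array-style content
            content = "".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
        if index > last_query_index:
            rendered.append(content)  # thinking kept
        else:
            # Thinking dropped: keep only what follows </think>
            rendered.append(content.split("</think>")[-1].lstrip("\n"))
    return rendered
```

With a string user message the assistant index exceeds `last_query_index` and the thinking survives. With array content the index is never updated, the comparison fails, and only the final answer is rendered, matching the BROKEN output earlier in the notebook.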


## Summary

### The Bug
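- Qwen3-Omni's chat template strips the whole `<think>...</think>` block from the assistant message when user content is an array (multimodal)
- This happens because the template only updates `ns.last_query_index` for user messages where `message.content is string`, and it renders thinking only for messages after that index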

### The Fix
1. Apply the chat template to the system and user messages only, with `add_generation_prompt=True`
2. Manually append the raw assistant response, including its `<think>` block, followed by `<|im_end|>\n`

### Verification
- BROKEN: `<|im_start|>assistant\nANSWER: Bird<|im_end|>`
- FIXED: `<|im_start|>assistant\n<think>\n...\n</think>\n\nANSWER: Bird<|im_end|>`