Apply Gemma 4 IT chat template in inference.py and C++ runner by mergennachin · Pull Request #19614 · pytorch/executorch

mergennachin · 2026-05-15T13:53:34Z

Gemma 4 31B-IT is instruction-tuned and produces degenerate output
without the chat template wrapping. Auto-wrap --prompt with the IT
template (<|turn>user\n{prompt}<turn|>\n<|turn>model\n
<|channel>thought\n<channel|>) by default; --raw-prompt / --raw_prompt
skips wrapping for pre-formatted input.

pytorch-bot · 2026-05-15T13:53:37Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19614

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Run pull request jobs on OSDC runners in shadow mode

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-05-15T13:54:26Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Applies the Gemma 4 IT chat template to user prompts by default in both the Python inference.py and C++ runner, with an opt-out flag for pre-formatted input, to avoid degenerate output from the instruction-tuned model.

Changes:

- Add apply_chat_template helper and --raw-prompt flag in inference.py.
- Add --raw_prompt flag and template-wrapping logic in main.cpp (BOS prepended separately).
- Document the auto-wrapping and opt-out flags in the README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| examples/models/gemma4_31b/inference.py | Defines chat template, wraps prompt by default, adds `--raw-prompt` flag |
| examples/models/gemma4_31b/main.cpp | Adds `--raw_prompt` flag and wraps prompt text with the IT chat template |
| examples/models/gemma4_31b/README.md | Documents auto-wrapping behavior and skip flags for both runners |

mergennachin · 2026-05-15T14:05:57Z

@claude Review this PR

claude · 2026-05-15T14:06:42Z

Review of PR #19614

Overall this is a clean, well-scoped change. Two issues worth addressing before merge:

1. `str.format()` adds nothing here and is fragile if the template changes, `inference.py:78`

apply_chat_template uses _CHAT_TEMPLATE.format(prompt=prompt). Curly braces inside the prompt value are safe: str.format() parses replacement fields only in the template string, so a prompt such as "explain {dict} in Python" is inserted verbatim. The fragility is on the template side: any literal { or } added to _CHAT_TEMPLATE later must be escaped as {{ or }}, otherwise .format() raises KeyError, IndexError, or ValueError.

# Current:
_CHAT_TEMPLATE = (
    "<bos><|turn>user\n{prompt}<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
)

def apply_chat_template(prompt: str) -> str:
    return _CHAT_TEMPLATE.format(prompt=prompt)

Plain string concatenation avoids format-string parsing entirely:

def apply_chat_template(prompt: str) -> str:
    return "<bos><|turn>user\n" + prompt + "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
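The concatenation form can be exercised on a brace-containing prompt as a quick sanity check. This is a minimal sketch that mirrors the snippet above (including its leading `<bos>`, which the next item discusses); it is not the code merged in the PR.

```python
def apply_chat_template(prompt: str) -> str:
    # Concatenation only: the prompt is never parsed as a format string,
    # so braces and other special characters pass through unchanged.
    return (
        "<bos><|turn>user\n"
        + prompt
        + "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
    )


wrapped = apply_chat_template("explain {dict} in Python")
print(repr(wrapped))
```

The printed string should contain `{dict}` exactly as typed, between the user-turn opener and the `<turn|>` closer.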

2. BOS handling inconsistency between Python and C++ — `inference.py:72` vs `main.cpp:243`

The Python template includes the literal string <bos>:

_CHAT_TEMPLATE = (
    "<bos><|turn>user\n{prompt}<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
)

The C++ template does not:

prompt_text = "<|turn>user\n" + prompt_text +
    "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>";

Both runners separately prepend BOS as a token ID after encoding (Python at inference.py:104, C++ at main.cpp:255-256). Whether this causes a double-BOS in Python depends on how the tokenizers library handles the <bos> string during encode() — if it recognizes <bos> as a special token and emits token id 2, the guard in generate() will skip the extra prepend and things work out; if it tokenizes <bos> as literal subword pieces, the prompt gets garbled.
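The guard described above could look roughly like the following. The actual code in generate() is not shown in this thread, so the function name `ensure_single_bos` is hypothetical, and the BOS id of 2 is taken from the review text rather than verified against the tokenizer.

```python
from typing import List

BOS_ID = 2  # id quoted in the review; an assumption for any other tokenizer


def ensure_single_bos(token_ids: List[int], bos_id: int = BOS_ID) -> List[int]:
    """Return a copy of token_ids that starts with exactly one BOS prepend.

    If the encoder already emitted BOS as the first id, nothing is added;
    otherwise bos_id is prepended.
    """
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)
    return [bos_id] + list(token_ids)


print(ensure_single_bos([2, 10, 11]))
print(ensure_single_bos([10, 11]))
```

A guard like this only covers the case where `<bos>` is encoded as the special token. If the tokenizer splits the literal string into subword pieces, the first id is not BOS, the guard prepends one, and the stray pieces stay in the prompt.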

Either way, the two runners should behave identically. I'd suggest making the Python template match the C++ one (omit <bos> from the template string) since both runners already handle BOS prepending at the token-ID level:

def apply_chat_template(prompt: str) -> str:
    return "<|turn>user\n" + prompt + "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"

Minor observations (no action needed)

- Flag naming: --raw-prompt (Python) vs --raw_prompt (C++). This is fine and idiomatic for each tool (argparse vs gflags), and the README correctly documents both.
- Prompt display (inference.py:229): prints args.prompt (unwrapped), which is good UX because the user sees what they typed. To debug template issues, users can add --raw-prompt and pass the full template manually.
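To illustrate the flag naming on the Python side, here is a minimal argparse sketch. The helper names `build_parser` and `resolve_prompt` are hypothetical and the parser is far smaller than the real inference.py; the point is only that argparse exposes `--raw-prompt` as `args.raw_prompt`, and that the unwrapped `args.prompt` stays available for display.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, reduced CLI; the real script has more options.
    parser = argparse.ArgumentParser(description="Chat-template wrapping sketch")
    parser.add_argument("--prompt", type=str, required=True)
    parser.add_argument(
        "--raw-prompt",
        action="store_true",
        help="Skip chat-template wrapping for pre-formatted input",
    )
    return parser


def resolve_prompt(args: argparse.Namespace) -> str:
    # argparse maps the dash in --raw-prompt to an underscore in the attribute.
    if args.raw_prompt:
        return args.prompt
    return (
        "<|turn>user\n"
        + args.prompt
        + "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
    )


args = build_parser().parse_args(["--prompt", "hi"])
print("Prompt:", args.prompt)          # what the user typed
print(repr(resolve_prompt(args)))      # what gets tokenized
```

Passing `--raw-prompt` on the command line makes `resolve_prompt` return the input untouched, which is the opt-out path for pre-formatted prompts.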

Looks good overall — nice improvement for the IT model's usability. The two issues above are the only things I'd want addressed before merge.

Gemma 4 31B-IT is instruction-tuned and produces degenerate output without the chat template wrapping. Auto-wrap --prompt with the IT template (<bos><|turn>user\n{prompt}<turn|>\n<|turn>model\n<|channel>thought\n<channel|>) by default; --raw-prompt / --raw_prompt skips wrapping for pre-formatted input.

Copilot AI review requested due to automatic review settings May 15, 2026 13:53

mergennachin requested a review from lucylq as a code owner May 15, 2026 13:53

meta-cla Bot added the CLA Signed label May 15, 2026

mergennachin requested a review from Gasoonjia May 15, 2026 13:53

Copilot started reviewing on behalf of mergennachin May 15, 2026 13:54

Copilot AI reviewed May 15, 2026

Gasoonjia approved these changes May 15, 2026

mergennachin force-pushed the gemma4-chat-template branch from 78ee61f to 5d5c26e May 15, 2026 19:05

mergennachin wants to merge 1 commit into main from gemma4-chat-template

3 participants
