
[llm][3/4] Python bindings for JinjaChatFormatter + LlamaRunner integration #19535

Open: seyeong-han wants to merge 3 commits into pytorch:main from seyeong-han:chat-python-bindings

Conversation

@seyeong-han
Contributor

Summary

Part 3 of the chat-template support stack split out of #16987 per @kirklandsign's request.

This PR exposes the JinjaChatFormatter (added in PR-A #19533) to Python via pybind11, and integrates it into the example LlamaRunner Python class.

Stack overview

PR             Subject
1/4 #19533     Library + tests
2/4 #19534     TextLLMRunner echo gating + EOS merge
3/4 (this PR)  Python bindings + Python LlamaRunner integration
4/4            llama_main CLI flags + chat_formatter wrapper + universal Jinja docs

What this PR adds

C++ pybind11 bindings (extension/llm/runner/pybindings.cpp)

  • ChatMessage(role, content)
  • ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  • ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  • JinjaChatFormatter with from_template / from_string / from_file static factories, format(prompt, system_prompt) and format_conversation(ChatConversation) methods, and includes_bos()
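
A minimal usage sketch of the bindings listed above, assuming a build with EXECUTORCH_BUILD_PYBIND=ON. The import path comes from the Test Plan. The exact argument conventions (that from_template takes a ChatTemplateType member and that format takes the prompt and system prompt positionally) are assumptions read off the names in this description, not confirmed signatures. The helper name format_with_bindings is hypothetical.

```python
from typing import Optional


def format_with_bindings(prompt: str, system_prompt: str = "") -> Optional[str]:
    """Format a prompt via the new bindings, or return None if they are absent."""
    try:
        # Import path taken from the Test Plan of this PR.
        from executorch.extension.llm.runner import (
            ChatTemplateType,
            JinjaChatFormatter,
        )
    except ImportError:
        # ExecuTorch is not installed, or was built without the pybind extension.
        return None

    # Assumption: from_template accepts an enum member and format accepts
    # (prompt, system_prompt) positionally, as the PR description suggests.
    formatter = JinjaChatFormatter.from_template(ChatTemplateType.Llama3)
    return formatter.format(prompt, system_prompt)


if __name__ == "__main__":
    print(format_with_bindings("What is ExecuTorch?", "You are a helpful assistant."))
```

The ImportError guard keeps the snippet safe to run on machines without the extension, which is also a quick way to check the third Test Plan item.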

Python package surface

  • extension/llm/runner/__init__.py — re-exports the new bindings via __all__
  • extension/llm/runner/_llm_runner.pyi — type stubs for the new classes (IDE / mypy support)

Python LlamaRunner integration (examples/models/llama/runner/generation.py)

LlamaRunner now accepts chat_format / system_prompt / chat_template_file kwargs and exposes _format_prompt + chat_completion using the JinjaChatFormatter.

Backward compatibility: chat_format defaults to "none", matching llama_main, so existing EagerLlamaRunner / NativeLlamaRunner callers that don't pass chat_format see no behavior change.
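
The pass-through behavior can be illustrated with a small stand-in. This is a sketch, not the code in generation.py: RunnerSketch and render_llama3 are hypothetical names, the "llama3" format string is an assumption, and the rendering follows the publicly documented Llama 3 prompt layout rather than the embedded Jinja template, which may differ in whitespace or BOS handling.

```python
def render_llama3(prompt: str, system_prompt: str = "") -> str:
    """Render a single-turn prompt in the documented Llama 3 chat layout."""
    parts = ["<|begin_of_text|>"]
    if system_prompt:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>"
        )
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>")
    # Generation prompt: leave the assistant turn open for the model to fill.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


class RunnerSketch:
    """Hypothetical stand-in showing only the chat_format default."""

    def __init__(self, chat_format: str = "none", system_prompt: str = "") -> None:
        self.chat_format = chat_format
        self.system_prompt = system_prompt

    def _format_prompt(self, prompt: str) -> str:
        if self.chat_format == "none":
            return prompt  # legacy callers get their prompt back unchanged
        if self.chat_format == "llama3":
            return render_llama3(prompt, self.system_prompt)
        raise ValueError(f"unsupported chat_format: {self.chat_format!r}")


if __name__ == "__main__":
    print(RunnerSketch()._format_prompt("Hello"))
    print(RunnerSketch(chat_format="llama3")._format_prompt("Hello"))
```

With the default, the prompt reaches the model byte-for-byte as the caller wrote it, which is the property existing callers depend on.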

_resolve_template_type maps "llama3.2" / "llama32" / "llama3_2" to ChatTemplateType.Llama32 (consistent with C++ parseChatTemplateType) — addresses the cross-language consistency comment from Copilot review on the original PR.
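
One way such a resolver could look is sketched below. The enum is a local stand-in for the bound ChatTemplateType, and only the three Llama 3.2 spellings and "none" are taken from this description. The "llama3" and "gemma3" keys, the case-insensitive lookup on the Python side, and the ValueError on unknown names are assumptions.

```python
from enum import Enum


class ChatTemplateType(Enum):
    """Local stand-in mirroring the members listed for the bound enum."""

    None_ = 0
    Llama3 = 1
    Llama32 = 2
    Gemma3 = 3
    Custom = 4


_ALIASES = {
    "none": ChatTemplateType.None_,
    "llama3": ChatTemplateType.Llama3,  # assumed spelling
    "llama3.2": ChatTemplateType.Llama32,
    "llama32": ChatTemplateType.Llama32,
    "llama3_2": ChatTemplateType.Llama32,
    "gemma3": ChatTemplateType.Gemma3,  # assumed spelling
}


def resolve_template_type(chat_format: str) -> ChatTemplateType:
    """Map a user-facing chat_format string to a template type."""
    key = chat_format.strip().lower()
    if key not in _ALIASES:
        raise ValueError(f"unknown chat_format: {chat_format!r}")
    return _ALIASES[key]


if __name__ == "__main__":
    for name in ("llama3.2", "llama32", "llama3_2"):
        print(name, "->", resolve_template_type(name).name)
```

Keeping the aliases in one table makes it easy to compare the Python list against the C++ parser when either side gains a new spelling.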

CLI integration (examples/models/llama/runner/eager.py)

Adds --chat_template_file CLI flag for chat mode.
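
A rough sketch of how such a flag could be wired with argparse. Only the flag name comes from this PR. The parser, the Path type, the None default, and the load_template helper are hypothetical and do not reproduce eager.py.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the chat-template flag")
    parser.add_argument(
        "--chat_template_file",
        type=Path,
        default=None,
        help="Path to a Jinja chat template to use in chat mode.",
    )
    return parser


def load_template(argv: Optional[Sequence[str]] = None) -> Optional[str]:
    """Return the template text, or None when the flag was not given."""
    args = build_parser().parse_args(argv)
    if args.chat_template_file is None:
        return None
    return args.chat_template_file.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(load_template([]))  # no flag given, so no custom template
```

Returning None when the flag is absent lets the caller fall back to a built-in template or to no formatting at all.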

Tests (extension/llm/runner/test/test_runner_pybindings.py)

Python tests covering the new bindings end-to-end.

Why this is split out

Python changes are independently testable and reviewers may want different eyes on the Python vs. C++ paths. Also isolates the backward-compat concern around the chat_format default.

Test Plan

  • Build with EXECUTORCH_BUILD_PYBIND=ON
  • Run Python tests: pytest extension/llm/runner/test/test_runner_pybindings.py
  • Verify from executorch.extension.llm.runner import JinjaChatFormatter works
  • Verify LlamaRunner.chat_completion() formats prompts correctly with the built-in Llama3 template
  • Verify LlamaRunner constructor with chat_format="none" (default) is backward-compatible
  • Verify _resolve_template_type maps Llama 3.2 variants to Llama32

Depends on

  • PR-A: #19533 (JinjaChatFormatter library headers/symbols)

Original PR

#16987, which this stack splits into 4 reviewable PRs.

cc @kirklandsign @larryliu0820 @metascroy

@pytorch-bot

pytorch-bot Bot commented May 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19535

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 13, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based
JinjaChatFormatter, supporting chat-types, embedded Llama3/Llama3.2/Gemma3
templates, build glue (CMake/Buck), and a focused C++ unit-test suite.
This PR is reviewable in isolation — it has no behavior change for any
existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

  1/4 (this PR)  Library + tests
  2/4            TextLLMRunner echo-gated special-token filter + EOS merge
  3/4            Python bindings + Python LlamaRunner integration
  4/4            llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt,
  targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the
  ChatTemplateType enum + ModelTokens. The CMake file pulls in Jinja2Cpp
  1.3.2 via FetchContent, with SUPPORT_REGEX_LOOKAHEAD set BEFORE
  FetchContent_MakeAvailable so it propagates correctly, plus header
  staging for nonstd headers that some Jinja2Cpp installations omit.
  Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the
  Universal Jinja chat formatter that supports any HuggingFace / vLLM
  chat template, not just the embedded ones. Loadable via fromTemplate
  (built-in), fromString (any string), or fromFile (any .jinja file).
  formatConversation injects vLLM/HuggingFace-standard params (tools=[],
  tool_choice=None, date_string, chat_template_kwargs) so any template
  that references those variables renders correctly.
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp:
  notably, 'not tools is none' maps to 'tools' (truthy check), preserving
  the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link
  extension_llm_runner against jinja2cpp (PRIVATE) and define
  EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt,
  targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3
  embedded templates, parseChatTemplateType (case-insensitive), and
  three universal-Jinja regression tests:
    - generic HuggingFace-style template (proves it's not Llama-specific)
    - tools-aware template (validates the tools=[] default)
    - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template)
  guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds
  jinja_chat_formatter.cpp to the runner sources.
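
The 'not tools is none' rewrite described above can be modelled in a few lines.
This is a sketch of that single rule only, not the C++ normalizeTemplate: the
regex and the function name normalize_template are assumptions, and the real
implementation may handle more cases.

```python
import re

# Matches "not <name> is none" inside a template expression.
_NOT_IS_NONE = re.compile(r"\bnot\s+([A-Za-z_][A-Za-z0-9_.]*)\s+is\s+none\b")


def normalize_template(source: str) -> str:
    """Rewrite 'not X is none' into a plain truthiness check on X."""
    return _NOT_IS_NONE.sub(r"\1", source)


template = "{% if not tools is none %}TOOLS{% endif %}"
normalized = normalize_template(template)
# normalized == "{% if tools %}TOOLS{% endif %}"

# Why the distinction matters with the injected default tools=[]:
tools = []
is_not_none = tools is not None  # True: an empty list is still "not none"
is_truthy = bool(tools)          # False: an empty list is falsy

if __name__ == "__main__":
    print(normalized, is_not_none, is_truthy)
```

With the tools=[] default, the rewritten check skips the tools branch, whereas
a literal 'is not none' test would enter it with nothing to render.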

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users:
  the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback);
  documentation in the follow-up CLI PR points users to vLLM's examples
  directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987

Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()'
  with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and
  generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for
  Llama-format <|...|> tokens. wrapped_callback now suppresses these
  from the printed stream when GenerationConfig.echo == false. When
  echo == true, raw model output (including chat-template tokens) is
  emitted unchanged - this preserves backward compatibility for users
  who explicitly want to see raw tokens.

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES
  the tokenizer's primary eos_tok() with any additional EOS IDs the
  model metadata exports under kEosIds, instead of clearing the set
  when metadata is present. This is correct for HF-tokenizer models
  (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model
  also wants <|eot_id|> as a stop token. Also logs the primary token
  and only logs metadata IDs that are newly inserted.
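
Both changes can be sketched in Python for illustration. This is a model of
the behavior described above, not the C++ code: the members of
KNOWN_SPECIAL_TOKENS, the regex, and the names wrap_callback and
merge_eos_ids are assumptions. The token IDs in the usage lines are those of
the Llama 3 tokenizer.

```python
import re
from typing import Callable, Iterable, Set

# Illustrative members only; the real kKnownSpecialTokens set may differ.
KNOWN_SPECIAL_TOKENS = frozenset(
    {"<s>", "</s>", "<pad>", "<unk>", "<start_of_turn>", "<end_of_turn>"}
)
# Llama-format tokens such as <|eot_id|> or <|start_header_id|>.
_LLAMA_STYLE = re.compile(r"<\|[^|]*\|>")


def is_special_token(piece: str) -> bool:
    return piece in KNOWN_SPECIAL_TOKENS or _LLAMA_STYLE.fullmatch(piece) is not None


def wrap_callback(callback: Callable[[str], None], echo: bool) -> Callable[[str], None]:
    """Drop special tokens from the stream unless echo is enabled."""

    def wrapped(piece: str) -> None:
        if not echo and is_special_token(piece):
            return
        callback(piece)

    return wrapped


def merge_eos_ids(primary_eos: int, metadata_eos_ids: Iterable[int]) -> Set[int]:
    """Union the tokenizer's EOS with metadata EOS IDs instead of replacing it."""
    eos_ids = {primary_eos}
    eos_ids.update(metadata_eos_ids)
    return eos_ids


if __name__ == "__main__":
    shown = []
    emit = wrap_callback(shown.append, echo=False)
    for piece in ["Hi", "<|eot_id|>", "</s>", "!"]:
        emit(piece)
    print(shown)  # ['Hi', '!']
    # 128001 is <|end_of_text|>, 128009 is <|eot_id|> in the Llama 3 tokenizer.
    print(sorted(merge_eos_ids(128001, [128009])))  # [128001, 128009]
```

The merge keeps the tokenizer's own EOS in the stop set even when metadata
lists others, which is the difference from the earlier clear-then-fill logic.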

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users,
not just the new chat-template path. They deserve focused review for
backward-compat impact (echo gating) and EOS-set semantics (merge vs
clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library) — only for stack ordering; this PR has no
            include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
…ration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file
    static factories, format(prompt, system_prompt) and
    format_conversation(ChatConversation) methods, includes_bos().
* extension/llm/runner/__init__.py: re-exports the new bindings via
  __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new
  classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests
  covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts
  chat_format / system_prompt / chat_template_file kwargs and exposes
  _format_prompt + chat_completion using the JinjaChatFormatter.
  Default chat_format is 'none' (matches llama_main, preserves
  backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner
  callers). _resolve_template_type maps 'llama3.2' / 'llama32' /
  'llama3_2' to ChatTemplateType.Llama32 (consistent with C++
  parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds --chat_template_file CLI
  flag for chat mode.

Why this is split out
---------------------
Python changes are independently testable and reviewers may want
different eyes on the Python vs C++ paths. Also isolates the
backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library headers/symbols).

Original PR (full stack): pytorch#16987
@seyeong-han seyeong-han force-pushed the chat-python-bindings branch from 175f580 to 13af6b1 Compare May 13, 2026 05:04

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
