feat(mtmd): add SMT backend multimodal inference support for llama-server and llama-mtmd-cli #1
Merged
alex-spacemit merged 40 commits on May 18, 2026
Add llama-mtmd-cli-ep, a multimodal CLI that replaces mtmd's CLIP/GGUF vision encoding with Spacemit EP's ONNX vision engine. This bypasses the mtmd chunk mechanism (which requires an mmproj GGUF) and instead performs embedding-level fusion: text tokens are mapped to embeddings via an external embedding table, concatenated with ONNX-encoded image embeddings, and decoded as a single embd batch.

New files:
- ep-vision-wrapper.h/cpp: pimpl wrapper for SpineVisionModelEngine
- mtmd-cli-ep.cpp: multimodal CLI with EP vision support
- CMakeLists.txt: MTMD_BUILD_EP_CLI conditional build target

Change-Id: Idec5fea42a2a166ebf44463b60258e1dedb36e7e
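The embedding-level fusion described in this commit message can be illustrated with a minimal, standalone sketch. It is not the PR's implementation: `fuse_embeddings`, `append_token_embeddings`, the row-major `[n_vocab x n_embd]` table layout, and the prefix/image/suffix ordering are all assumptions made for illustration, and the real code presumably hands the resulting buffer to the decoder as an embd batch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): append the embedding rows for
// `tokens`, read from a row-major table of shape [n_vocab x n_embd], to `out`.
static void append_token_embeddings(const std::vector<int32_t> & tokens,
                                    const std::vector<float> & table,
                                    size_t n_embd,
                                    std::vector<float> & out) {
    const size_t n_vocab = table.size() / n_embd;
    (void) n_vocab; // only used by the assert below
    for (int32_t tok : tokens) {
        assert(tok >= 0 && (size_t) tok < n_vocab);
        const float * row = table.data() + (size_t) tok * n_embd;
        out.insert(out.end(), row, row + n_embd);
    }
}

// Hypothetical helper (not from the PR): build one contiguous buffer of
// [text prefix | image | text suffix] embeddings, each row n_embd floats wide.
// The ordering is an assumption; a real prompt template decides where the
// image embeddings go.
std::vector<float> fuse_embeddings(const std::vector<int32_t> & prefix,
                                   const std::vector<float> & image_embd,
                                   const std::vector<int32_t> & suffix,
                                   const std::vector<float> & table,
                                   size_t n_embd) {
    assert(n_embd > 0 && table.size() % n_embd == 0);
    assert(image_embd.size() % n_embd == 0);

    std::vector<float> out;
    out.reserve((prefix.size() + suffix.size()) * n_embd + image_embd.size());

    append_token_embeddings(prefix, table, n_embd, out);
    out.insert(out.end(), image_embd.begin(), image_embd.end());
    append_token_embeddings(suffix, table, n_embd, out);
    return out;
}
```

The number of positions in the fused sequence is `out.size() / n_embd`, which is what a caller would use when sizing the batch.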
- Add tokenizer stage
- Distinguish between prefill and decode phases

Change-Id: I397d15937b719e46ec4238b15cbbf87c943a2c63
Change-Id: If472027c8883641e8c3e75e0b0417523428a7c7f
Change-Id: Id1b76f4f497bb6804f38f43acc29b27083f043ba
Change-Id: If6bc429163e5122161efb29aef1fe058cc95b73c
Change-Id: Iace35d9a7e8df6e213c8ef127010dd6884454afc
Change-Id: I19f539c9bc986165a0af2e00b90481c093a7db26
Change-Id: Iae70b7f51270d1197eba44130c7b39f11ea1a8b4
Change-Id: I13e7fa811f023d908229033444abb9f94276c810
Change-Id: I465daed1cfa7912c61cf1137431bf9e0e4247c56
Change-Id: I9127495f5d1496b601892de9b120b78432f23741
Change-Id: Id208b773698630e075f4351258174bf747d2dae1
Change-Id: Ic9161f531adcc4ed5aa0324e6ba11e2fdbaf9326
Change-Id: I62d743c7b3c743184dd262a16b2acbb18342e44c
Change-Id: Ic81576f79c1c9ff7d4c6bf8e1c13baa5a0a063cb
Change-Id: I85a46056a0ad88fcce6caba8cadb9ac7c984a180
Change-Id: I9ef0d297140f7e062e79f3d5241fd7959b3c2691
Change-Id: If638f61641e0ef614a8c996e6d087f5f702ee289
Change-Id: I26c5b3f0745c479a7d152846316322a312f00ddb
Change-Id: Ia85eb50b1be8ecc21db75ff2b926ce96e4d75c5e
Change-Id: I057c87c42692fdbfb6e64ff11cb1068a514e30e5
Change-Id: Ide883257dfa4edc82ff6f02f45e6264c2dce1b84
Change-Id: Ieea39b2ac1080ce1f83d22ef119dd9f1868bcac8
Change-Id: I8d90693a80330844b22fb1a20be2abd2e0aea2ed
… onnx warmup Change-Id: If05141e887649946f39eaa96247f3f11691c6194
Change-Id: I9f4c2e8b7a1d6c3e5f0a9b8c7d6e5f4a3b2c1d0e
Change-Id: I26e3b6a525cadc5add2f4cded0330fe238b56e9c
- Remove duplicate pos_next and size_up_to_pos function definitions
- Add missing smt_ctx parameter to process_chunk call
- Remove obsolete get_api_show handler that no longer exists in current codebase
Change-Id: Ie8ff4104814b6d513765da899c729d78651eea58
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds SMT (SpacemiT) multimodal inference support across llama-server and llama-mtmd-cli, including runtime backend selection and profiling hooks.
Changes:
- Introduces SMT vision/audio wrappers and server-side SMT media encode/decode integration.
- Adds CLI SMT backend entrypoint and CMake wiring for SpacemiT ORT bundle builds.
- Adds ggml trace/profiling events and exposes backend metadata via server routes.
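To make the trace begin/end idea concrete, here is a minimal sketch of a scoped emitter. It does not reproduce the API in ggml-profile.h, which is not shown here; the `trace_scope` class is hypothetical, and the output uses the Chrome Trace Event JSON fields (`name`, `ph`, `ts`, `pid`, `tid`) as an assumed format.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical RAII helper (not the ggml-profile.h API): writes a "begin"
// event on construction and an "end" event on destruction, one JSON object
// per line. Event names are assumed to need no JSON escaping.
class trace_scope {
public:
    trace_scope(std::ostream & out, std::string name)
        : out_(out), name_(std::move(name)) {
        assert(!name_.empty());
        emit('B'); // "B" marks the start of a duration event
    }

    ~trace_scope() {
        emit('E'); // "E" marks the end of the same duration event
    }

    trace_scope(const trace_scope &) = delete;
    trace_scope & operator=(const trace_scope &) = delete;

private:
    void emit(char phase) {
        using namespace std::chrono;
        // Timestamps are in microseconds from a monotonic clock.
        const int64_t ts = duration_cast<microseconds>(
            steady_clock::now().time_since_epoch()).count();
        out_ << "{\"name\":\"" << name_ << "\",\"ph\":\"" << phase
             << "\",\"ts\":" << ts << ",\"pid\":0,\"tid\":0}\n";
    }

    std::ostream & out_;
    std::string    name_;
};
```

Wrapping a call in a block such as `{ trace_scope t(stream, "llama_decode"); /* work */ }` would emit the begin event before the work and the end event after it, even if the work returns early.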
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tools/server/server-smt-vision.h | Declares SMT media encoding/decoding API and fallback stubs when disabled. |
| tools/server/server-smt-vision.cpp | Implements SMT vision/audio encode + LLM embedding decode path and token boundary detection. |
| tools/server/server-context.h | Adds media_backend to server metadata. |
| tools/server/server-context.cpp | Adds backend selection (mtmd vs smt), SMT prompt processing, encoder timing stats, and exposes backend in responses. |
| tools/server/server-common.h | Extends server_tokens to carry SMT media chunks and adds SMT prompt API surface. |
| tools/server/CMakeLists.txt | Adds LLAMA_SERVER_SMT_VISION build option wiring and links ORT + SpacemiT sample sources. |
| tools/mtmd/smt-vision-wrapper.h / .cpp | Adds SMT vision engine wrapper (config parsing, warmup, encode). |
| tools/mtmd/smt-vision-preprocess.h / .cpp | Adds SMT image preprocessing from common image formats to float32 NCHW tensor bytes. |
| tools/mtmd/smt-audio-wrapper.h / .cpp | Adds SMT audio split-encoder wrapper with ONNXRuntime + optional SpacemiT EP init. |
| tools/mtmd/mtmd-cli.cpp | Routes to SMT CLI backend based on params. |
| tools/mtmd/mtmd-cli-smt.cpp | New SMT-backed multimodal CLI implementation. |
| tools/mtmd/mtmd-audio.h / .cpp | Exposes a reusable log-mel spectrogram helper for SMT audio path. |
| tools/mtmd/CMakeLists.txt | Adds SMT-related sources/includes/libs to mtmd CLI build when enabled. |
| src/llama-context.cpp | Adds trace begin/end events around llama_decode. |
| ggml/src/ggml-profile.c | Adds JSON trace emitter implementation behind GGML_BUILD_PROFILE. |
| ggml/include/ggml-profile.h | Adds profiling/trace API surface (enabled via GGML_BUILD_PROFILE). |
| docs/build-riscv64-spacemit.md | Updates build/run docs for SMT enablement and usage. |
| convert_hf_to_gguf.py | Adds tokenizer loading fallback logic and Qwen3-ASR text arch handling. |
| common/common.h | Adds SMT backend selection params under LLAMA_SERVER_SMT_VISION. |
| common/arg.cpp | Adds --media-backend/--vision-backend and --smt-config-dir CLI options. |
| CMakeLists.txt | Adds LLAMA_SERVER_SMT_VISION option and compile definition. |
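The "float32 NCHW tensor bytes" produced by the preprocessing step listed in the table can be illustrated with a short layout conversion. This is only a sketch under stated assumptions: `hwc_u8_to_nchw_f32` is a hypothetical name, the input is assumed to be already-decoded interleaved 8-bit RGB, and the scale-to-[0,1] plus per-channel mean/stddev normalization is a common convention rather than something the PR confirms. Resizing and decoding are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): convert interleaved RGB bytes (HWC)
// into planar float32 (CHW) for a batch of one image, i.e. NCHW with N = 1.
// Each value is scaled to [0, 1] and then normalized per channel.
std::vector<float> hwc_u8_to_nchw_f32(const std::vector<uint8_t> & rgb,
                                      size_t width,
                                      size_t height,
                                      const std::array<float, 3> & mean,
                                      const std::array<float, 3> & stddev) {
    assert(rgb.size() == width * height * 3);

    std::vector<float> out(3 * width * height);
    for (size_t c = 0; c < 3; ++c) {
        for (size_t y = 0; y < height; ++y) {
            for (size_t x = 0; x < width; ++x) {
                // Source index: pixels are stored row by row, 3 bytes each.
                const float v = rgb[(y * width + x) * 3 + c] / 255.0f;
                // Destination index: one full plane per channel.
                out[c * height * width + y * width + x] = (v - mean[c]) / stddev[c];
            }
        }
    }
    return out;
}
```

The returned vector holds `3 * width * height` floats; its raw bytes (`out.data()`, `out.size() * sizeof(float)`) are what an inference runtime would typically consume as a `[1, 3, H, W]` tensor.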
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary
This PR adds multimodal large-model inference support on the Spacemit SMT backend, targeting llama-server and llama-mtmd-cli in the spacemit-mtmd branch.

The main goal is to enable SMT-based multimodal inference workflows on Spacemit platforms, including server-side and CLI-side integration, runtime backend selection, model/config compatibility updates, and related usability improvements for image/audio pipelines.

What is included
- llama-server
- llama-mtmd-cli
- vision_history

User-facing impact
This PR enables multimodal inference on the SMT backend for Spacemit targets, including:
- llama-server
- llama-mtmd-cli

Notes
- spacemit-mtmd.