feat(mtmd): add SMT backend multimodal inference support for llama-server and llama-mtmd-cli by co-seven · Pull Request #1 · spacemit-com/llama.cpp

co-seven · 2026-05-15T06:18:34Z

Summary

This PR adds multimodal large-model inference feature support on the Spacemit SMT backend, targeting llama-server and llama-mtmd-cli in the spacemit-mtmd branch.

The main goal is to enable SMT-based multimodal inference workflows on Spacemit platforms, including server-side and CLI-side integration, runtime backend selection, model/config compatibility updates, and related usability improvements for
image/audio pipelines.

What is included

Add SMT backend multimodal integration for llama-server
Add SMT backend multimodal integration for llama-mtmd-cli
Support runtime backend selection and related config handling
Add SMT image/audio processing path and warmup support
Add profiling support for vision/audio encoder stages
Add timing summary for vision/audio encode stages
Add SMT-only multimodal chat controls such as vision_history
Improve backend loading and thread/config handling
Add compatibility updates for related multimodal model paths and configs
Add support updates for:
- Qwen3.5-related model arch handling
- Qwen3-ASR SMT audio path
- PaddleOCR-VL SMT vision pipeline
Add tokenizer/config fallback and encoder input-size compatibility fixes
Update build and usage documentation for SMT / multimodal workflows

User-facing impact

This PR enables multimodal inference feature support on the SMT backend for Spacemit targets, including:

image understanding through SMT vision integration
audio-capable multimodal workflows on supported models
SMT-backed server inference through llama-server
SMT-backed CLI inference through llama-mtmd-cli
improved deployment/configuration usability for multimodal inference on Spacemit platforms

Notes

This PR focuses on SMT backend multimodal feature enablement and related integration work.
The submitted changes are the selected feature/fix commits prepared for upstreaming into spacemit-mtmd.

Add llama-mtmd-cli-ep, a multimodal CLI that replaces mtmd's CLIP/GGUF vision encoding with Spacemit EP's ONNX vision engine. This bypasses the mtmd chunk mechanism (which requires mmproj GGUF) and instead performs embedding-level fusion: text tokens are mapped to embeddings via an external embedding table, concatenated withONNX-encoded image embeddings, and decoded as a single embd batch. New files: - ep-vision-wrapper.h/cpp: pimpl wrapper for SpineVisionModelEngine - mtmd-cli-ep.cpp: multimodal CLI with EP vision support - CMakeLists.txt: MTMD_BUILD_EP_CLI conditional build target Change-Id: Idec5fea42a2a166ebf44463b60258e1dedb36e7e

- Add tokenizer stage - Distinguish between prefill and decode phases Change-Id: I397d15937b719e46ec4238b15cbbf87c943a2c63

Change-Id: If472027c8883641e8c3e75e0b0417523428a7c7f

Change-Id: Id1b76f4f497bb6804f38f43acc29b27083f043ba

Change-Id: If6bc429163e5122161efb29aef1fe058cc95b73c

Change-Id: Iace35d9a7e8df6e213c8ef127010dd6884454afc

Change-Id: I19f539c9bc986165a0af2e00b90481c093a7db26

Change-Id: Iae70b7f51270d1197eba44130c7b39f11ea1a8b4

Change-Id: I13e7fa811f023d908229033444abb9f94276c810

Change-Id: I465daed1cfa7912c61cf1137431bf9e0e4247c56

Change-Id: I9127495f5d1496b601892de9b120b78432f23741

Change-Id: Id208b773698630e075f4351258174bf747d2dae1

Change-Id: Ic9161f531adcc4ed5aa0324e6ba11e2fdbaf9326

Change-Id: I62d743c7b3c743184dd262a16b2acbb18342e44c

Change-Id: Ic81576f79c1c9ff7d4c6bf8e1c13baa5a0a063cb

Change-Id: I85a46056a0ad88fcce6caba8cadb9ac7c984a180

Change-Id: I9ef0d297140f7e062e79f3d5241fd7959b3c2691

Change-Id: If638f61641e0ef614a8c996e6d087f5f702ee289

Change-Id: I26c5b3f0745c479a7d152846316322a312f00ddb

Change-Id: Ia85eb50b1be8ecc21db75ff2b926ce96e4d75c5e

Change-Id: I057c87c42692fdbfb6e64ff11cb1068a514e30e5

Change-Id: Ide883257dfa4edc82ff6f02f45e6264c2dce1b84

Change-Id: Ieea39b2ac1080ce1f83d22ef119dd9f1868bcac8

Change-Id: I8d90693a80330844b22fb1a20be2abd2e0aea2ed

… onnx warmup Change-Id: If05141e887649946f39eaa96247f3f11691c6194

Change-Id: I9f4c2e8b7a1d6c3e5f0a9b8c7d6e5f4a3b2c1d0e

Change-Id: I26e3b6a525cadc5add2f4cded0330fe238b56e9c

- Remove duplicate pos_next and size_up_to_pos function definitions - Add missing smt_ctx parameter to process_chunk call - Remove obsolete get_api_show handler that no longer exists in current codebase

Change-Id: Ie8ff4104814b6d513765da899c729d78651eea58

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds SMT (SpacemiT) multimodal inference support across llama-server and llama-mtmd-cli, including runtime backend selection and profiling hooks.

Changes:

Introduces SMT vision/audio wrappers and server-side SMT media encode/decode integration.
Adds CLI SMT backend entrypoint and CMake wiring for SpacemiT ORT bundle builds.
Adds ggml trace/profiling events and exposes backend metadata via server routes.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
tools/server/server-smt-vision.h	Declares SMT media encoding/decoding API and fallback stubs when disabled.
tools/server/server-smt-vision.cpp	Implements SMT vision/audio encode + LLM embedding decode path and token boundary detection.
tools/server/server-context.h	Adds `media_backend` to server metadata.
tools/server/server-context.cpp	Adds backend selection (mtmd vs smt), SMT prompt processing, encoder timing stats, and exposes backend in responses.
tools/server/server-common.h	Extends `server_tokens` to carry SMT media chunks and adds SMT prompt API surface.
tools/server/CMakeLists.txt	Adds LLAMA_SERVER_SMT_VISION build option wiring and links ORT + SpacemiT sample sources.
tools/mtmd/smt-vision-wrapper.h / .cpp	Adds SMT vision engine wrapper (config parsing, warmup, encode).
tools/mtmd/smt-vision-preprocess.h / .cpp	Adds SMT image preprocessing from common image formats to float32 NCHW tensor bytes.
tools/mtmd/smt-audio-wrapper.h / .cpp	Adds SMT audio split-encoder wrapper with ONNXRuntime + optional SpacemiT EP init.
tools/mtmd/mtmd-cli.cpp	Routes to SMT CLI backend based on params.
tools/mtmd/mtmd-cli-smt.cpp	New SMT-backed multimodal CLI implementation.
tools/mtmd/mtmd-audio.h / .cpp	Exposes a reusable log-mel spectrogram helper for SMT audio path.
tools/mtmd/CMakeLists.txt	Adds SMT-related sources/includes/libs to mtmd CLI build when enabled.
src/llama-context.cpp	Adds trace begin/end events around `llama_decode`.
ggml/src/ggml-profile.c	Adds JSON trace emitter implementation behind `GGML_BUILD_PROFILE`.
ggml/include/ggml-profile.h	Adds profiling/trace API surface (enabled via `GGML_BUILD_PROFILE`).
docs/build-riscv64-spacemit.md	Updates build/run docs for SMT enablement and usage.
convert_hf_to_gguf.py	Adds tokenizer loading fallback logic and Qwen3-ASR text arch handling.
common/common.h	Adds SMT backend selection params under `LLAMA_SERVER_SMT_VISION`.
common/arg.cpp	Adds `--media-backend/--vision-backend` and `--smt-config-dir` CLI options.
CMakeLists.txt	Adds `LLAMA_SERVER_SMT_VISION` option and compile definition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 10 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

co-seven and others added 30 commits May 15, 2026 02:50

feat(ggml-profile): add more stage for profile

fca6bd2

- Add tokenizer stage - Distinguish between prefill and decode phases Change-Id: I397d15937b719e46ec4238b15cbbf87c943a2c63

mtmd-ep: align image boundary policy and simplify debug paths

3ce2522

Change-Id: If472027c8883641e8c3e75e0b0417523428a7c7f

server: add EP vision backend with runtime selection

b6a4608

Change-Id: Id1b76f4f497bb6804f38f43acc29b27083f043ba

mtmd: update EP cli flow and add qwen3vl preprocess docs

ea645ce

Change-Id: If6bc429163e5122161efb29aef1fe058cc95b73c

ep vision: add direct jpg preprocessing for server and mtmd cli

d8c0e65

Change-Id: Iace35d9a7e8df6e213c8ef127010dd6884454afc

docs: add EP multimodal usage and qwen3vl tuning note

b2d780f

Change-Id: I19f539c9bc986165a0af2e00b90481c093a7db26

update doc

b3a0b25

Change-Id: Iae70b7f51270d1197eba44130c7b39f11ea1a8b4

updating build doc for EP accelerator

46e213e

Change-Id: I13e7fa811f023d908229033444abb9f94276c810

smt: simplify spacemit vision integration

9a3f28d

Change-Id: I465daed1cfa7912c61cf1137431bf9e0e4247c56

smt: remove verbose runtime logging

1e66622

Change-Id: I9127495f5d1496b601892de9b120b78432f23741

smt: rename vision integration and tighten gating

515b1ee

Change-Id: Id208b773698630e075f4351258174bf747d2dae1

feat: add audio support(Qwen3ASR) for smt.

4b444fa

Change-Id: Ic9161f531adcc4ed5aa0324e6ba11e2fdbaf9326

fix: adjust backend loader

fd1bc5c

Change-Id: I62d743c7b3c743184dd262a16b2acbb18342e44c

fix: EP thread num set to 4

84d4a0d

Change-Id: Ic81576f79c1c9ff7d4c6bf8e1c13baa5a0a063cb

feat: add profile for audio&vision encoder

fb734cb

Change-Id: I85a46056a0ad88fcce6caba8cadb9ac7c984a180

Add vision/audio encode to summary timing print

59fadd0

Change-Id: I9ef0d297140f7e062e79f3d5241fd7959b3c2691

support qwen3.5 model arch

0c611b4

Change-Id: If638f61641e0ef614a8c996e6d087f5f702ee289

bugfix for qwen3.5 model arch

eefdebe

Change-Id: I26c5b3f0745c479a7d152846316322a312f00ddb

support different inputsize for encoder

38700d6

Change-Id: Ia85eb50b1be8ecc21db75ff2b926ce96e4d75c5e

fix:add tokenizer.json fallback for TokenizersBackend models

0d50190

Change-Id: I057c87c42692fdbfb6e64ff11cb1068a514e30e5

feat: add warmup for smt llama-server

4a5adea

Change-Id: Ide883257dfa4edc82ff6f02f45e6264c2dce1b84

feat: add SMT-only vision_history control for multimodal chat

52fab41

Change-Id: Ieea39b2ac1080ce1f83d22ef119dd9f1868bcac8

fix: support config thread num

ff32442

Change-Id: I8d90693a80330844b22fb1a20be2abd2e0aea2ed

bugfix: fix qwen3vl image embedding hidden size mistake and add audio…

ff0c974

… onnx warmup Change-Id: If05141e887649946f39eaa96247f3f11691c6194

mtmd: support ep_config threading options

73117cd

Change-Id: I9f4c2e8b7a1d6c3e5f0a9b8c7d6e5f4a3b2c1d0e

mtmd: support legacy config and ep_config Spacemit EP options

0acf203

Change-Id: I26e3b6a525cadc5add2f4cded0330fe238b56e9c

fix: resolve merge conflicts in server-common.cpp and server-context.cpp

bac31df

- Remove duplicate pos_next and size_up_to_pos function definitions - Add missing smt_ctx parameter to process_chunk call - Remove obsolete get_api_show handler that no longer exists in current codebase

feat(mtmd): add PaddleOCR-VL support to llama-server SMT vision pipeline

0386720

Change-Id: Ie8ff4104814b6d513765da899c729d78651eea58

fix(server): restore SMT context compile fixes

f51fc67

alex-spacemit requested a review from Copilot May 15, 2026 06:39

Copilot started reviewing on behalf of alex-spacemit May 15, 2026 06:39 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

co-seven added 2 commits May 15, 2026 14:52

docs: restore build-riscv64-spacemit.md to upstream version

1c47eb6

fix: address review updates for profiling and server context

0e07454

github-actions Bot added build examples python server ggml labels May 15, 2026

co-seven added 3 commits May 15, 2026 09:19

fix: update SMT mtmd CLI and guard POSIX-only backend usage

38a9ca6

fix: merge duplicate Qwen3ASR converter definitions

a3675de

fix: add build cross for PR process

3c135a3

github-actions Bot added the devops label May 15, 2026

co-seven added 2 commits May 15, 2026 10:04

fix: yml format

7e81288

fix(server): restore prompt cache reload for cleared idle slots

3b7a695

alex-spacemit requested a review from Copilot May 18, 2026 04:57

Copilot started reviewing on behalf of alex-spacemit May 18, 2026 04:58 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

co-seven and others added 3 commits May 18, 2026 13:37

Potential fix for pull request finding

9bba967

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix: address review feedback for prompt cache, profiling, and SMT paths

44a7170

fix: use get_vocab() for fast tokenizers

ba8c191

alex-spacemit merged commit 700aa8b into spacemit-com:spacemit-mtmd May 18, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mtmd): add SMT backend multimodal inference support for llama-server and llama-mtmd-cli#1

feat(mtmd): add SMT backend multimodal inference support for llama-server and llama-mtmd-cli#1
alex-spacemit merged 40 commits into
spacemit-com:spacemit-mtmdfrom
co-seven:spacemit-mtmd

co-seven commented May 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

co-seven commented May 15, 2026

Summary

What is included

User-facing impact

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants