feat(mtmd): add SMT backend multimodal inference support for llama-server and llama-mtmd-cli #1

Merged
alex-spacemit merged 40 commits into spacemit-com:spacemit-mtmd from co-seven:spacemit-mtmd
May 18, 2026
Conversation

@co-seven
Collaborator

Summary

This PR adds multimodal large-model inference support to the Spacemit SMT backend, targeting llama-server and llama-mtmd-cli in the spacemit-mtmd branch.

The main goal is to enable SMT-based multimodal inference workflows on Spacemit platforms, including server-side and CLI-side integration, runtime backend selection, model/config compatibility updates, and related usability improvements for image/audio pipelines.

What is included

  • Add SMT backend multimodal integration for llama-server
  • Add SMT backend multimodal integration for llama-mtmd-cli
  • Support runtime backend selection and related config handling
  • Add SMT image/audio processing path and warmup support
  • Add profiling support for vision/audio encoder stages
  • Add timing summary for vision/audio encode stages
  • Add SMT-only multimodal chat controls such as vision_history
  • Improve backend loading and thread/config handling
  • Add compatibility updates for related multimodal model paths and configs
  • Add support updates for:
    • Qwen3.5-related model arch handling
    • Qwen3-ASR SMT audio path
    • PaddleOCR-VL SMT vision pipeline
  • Add tokenizer/config fallback and encoder input-size compatibility fixes
  • Update build and usage documentation for SMT / multimodal workflows
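
To make the runtime backend selection item more concrete, here is a minimal sketch of how a backend name could be mapped to an internal value and validated. The backend names `mtmd` and `smt` come from the review's file summary; the enum, the function names, and the rule that the SMT backend needs a non-empty config directory are assumptions for illustration and are not taken from the PR's code.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum: the PR's real type and member names may differ.
enum class media_backend { mtmd, smt };

// Map a user-supplied backend name to an enum value.
// Returns false for unknown names so the caller can print a usage error.
bool parse_media_backend(const std::string & name, media_backend & out) {
    if (name == "mtmd") { out = media_backend::mtmd; return true; }
    if (name == "smt")  { out = media_backend::smt;  return true; }
    return false;
}

// Assumed validation rule: the SMT backend requires a config directory,
// while the default mtmd backend does not.
bool backend_config_ok(media_backend backend, const std::string & smt_config_dir) {
    return backend != media_backend::smt || !smt_config_dir.empty();
}
```

Rejecting unknown names explicitly, rather than silently falling back to a default, keeps a mistyped backend name from running a different pipeline than the user intended.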

User-facing impact

This PR enables multimodal inference on the SMT backend for Spacemit targets, including:

  • image understanding through SMT vision integration
  • audio-capable multimodal workflows on supported models
  • SMT-backed server inference through llama-server
  • SMT-backed CLI inference through llama-mtmd-cli
  • improved deployment/configuration usability for multimodal inference on Spacemit platforms

Notes

  • This PR focuses on SMT backend multimodal feature enablement and related integration work.
  • The submitted changes are the selected feature/fix commits prepared for upstreaming into spacemit-mtmd.

co-seven and others added 30 commits May 15, 2026 02:50
Add llama-mtmd-cli-ep, a multimodal CLI that replaces mtmd's CLIP/GGUF
vision encoding with Spacemit EP's ONNX vision engine. This bypasses
the mtmd chunk mechanism (which requires mmproj GGUF) and instead
performs embedding-level fusion: text tokens are mapped to embeddings
via an external embedding table, concatenated with ONNX-encoded image
embeddings, and decoded as a single embd batch.
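
The embedding-level fusion described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: the `embedding_table` struct, both function names, and the prefix/image/suffix ordering are hypothetical. In llama.cpp, a buffer like the one returned here would typically be handed to the decoder through the `embd` field of `llama_batch` instead of token IDs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical row-major embedding table: n_vocab rows of n_embd floats.
struct embedding_table {
    int32_t n_vocab;
    int32_t n_embd;
    std::vector<float> data; // size n_vocab * n_embd
};

// Look up each token's row and append it to `out`.
void append_text_embeddings(const embedding_table & table,
                            const std::vector<int32_t> & tokens,
                            std::vector<float> & out) {
    for (int32_t tok : tokens) {
        if (tok < 0 || tok >= table.n_vocab) {
            throw std::out_of_range("token id outside embedding table");
        }
        const float * row = table.data.data() + static_cast<size_t>(tok) * table.n_embd;
        out.insert(out.end(), row, row + table.n_embd);
    }
}

// Build one flat [n_positions, n_embd] buffer in the assumed order:
// text before the image, image embeddings, text after the image.
std::vector<float> fuse_embeddings(const embedding_table & table,
                                   const std::vector<int32_t> & prefix,
                                   const std::vector<float> & image_embd,
                                   const std::vector<int32_t> & suffix) {
    if (table.n_embd <= 0 || image_embd.size() % static_cast<size_t>(table.n_embd) != 0) {
        throw std::invalid_argument("image embeddings do not match n_embd");
    }
    std::vector<float> out;
    out.reserve((prefix.size() + suffix.size()) * table.n_embd + image_embd.size());
    append_text_embeddings(table, prefix, out);
    out.insert(out.end(), image_embd.begin(), image_embd.end());
    append_text_embeddings(table, suffix, out);
    return out;
}
```

The size check matters because the image encoder and the language model must agree on the embedding width; a mismatch would otherwise shift every position that follows the image.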

New files:
- ep-vision-wrapper.h/cpp: pimpl wrapper for SpineVisionModelEngine
- mtmd-cli-ep.cpp: multimodal CLI with EP vision support
- CMakeLists.txt: MTMD_BUILD_EP_CLI conditional build target

Change-Id: Idec5fea42a2a166ebf44463b60258e1dedb36e7e
- Add tokenizer stage
- Distinguish between prefill and decode phases

Change-Id: I397d15937b719e46ec4238b15cbbf87c943a2c63
Change-Id: If472027c8883641e8c3e75e0b0417523428a7c7f
Change-Id: Id1b76f4f497bb6804f38f43acc29b27083f043ba
Change-Id: If6bc429163e5122161efb29aef1fe058cc95b73c
Change-Id: Iace35d9a7e8df6e213c8ef127010dd6884454afc
Change-Id: I19f539c9bc986165a0af2e00b90481c093a7db26
Change-Id: Iae70b7f51270d1197eba44130c7b39f11ea1a8b4
Change-Id: I13e7fa811f023d908229033444abb9f94276c810
Change-Id: I465daed1cfa7912c61cf1137431bf9e0e4247c56
Change-Id: I9127495f5d1496b601892de9b120b78432f23741
Change-Id: Id208b773698630e075f4351258174bf747d2dae1
Change-Id: Ic9161f531adcc4ed5aa0324e6ba11e2fdbaf9326
Change-Id: I62d743c7b3c743184dd262a16b2acbb18342e44c
Change-Id: Ic81576f79c1c9ff7d4c6bf8e1c13baa5a0a063cb
Change-Id: I85a46056a0ad88fcce6caba8cadb9ac7c984a180
Change-Id: I9ef0d297140f7e062e79f3d5241fd7959b3c2691
Change-Id: If638f61641e0ef614a8c996e6d087f5f702ee289
Change-Id: I26c5b3f0745c479a7d152846316322a312f00ddb
Change-Id: Ia85eb50b1be8ecc21db75ff2b926ce96e4d75c5e
Change-Id: I057c87c42692fdbfb6e64ff11cb1068a514e30e5
Change-Id: Ide883257dfa4edc82ff6f02f45e6264c2dce1b84
Change-Id: Ieea39b2ac1080ce1f83d22ef119dd9f1868bcac8
Change-Id: I8d90693a80330844b22fb1a20be2abd2e0aea2ed
… onnx warmup

Change-Id: If05141e887649946f39eaa96247f3f11691c6194
Change-Id: I9f4c2e8b7a1d6c3e5f0a9b8c7d6e5f4a3b2c1d0e
Change-Id: I26e3b6a525cadc5add2f4cded0330fe238b56e9c
- Remove duplicate pos_next and size_up_to_pos function definitions
- Add missing smt_ctx parameter to process_chunk call
- Remove obsolete get_api_show handler that no longer exists in current codebase
Change-Id: Ie8ff4104814b6d513765da899c729d78651eea58

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds SMT (SpacemiT) multimodal inference support across llama-server and llama-mtmd-cli, including runtime backend selection and profiling hooks.

Changes:

  • Introduces SMT vision/audio wrappers and server-side SMT media encode/decode integration.
  • Adds CLI SMT backend entrypoint and CMake wiring for SpacemiT ORT bundle builds.
  • Adds ggml trace/profiling events and exposes backend metadata via server routes.
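
As a rough picture of what begin/end trace events around a call such as `llama_decode` could look like, here is a small RAII sketch that writes one JSON line per event. The class name and the event layout (`"ph":"B"` and `"ph":"E"`, borrowed from the Chrome trace event convention) are assumptions; the PR's actual ggml-profile API and output format are not shown on this page.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical RAII helper: emits a "begin" event on construction and an
// "end" event on destruction, so the traced region is the enclosing scope.
class trace_scope {
public:
    trace_scope(std::ostream & out, std::string name)
        : out_(out), name_(std::move(name)) { emit('B'); }
    ~trace_scope() { emit('E'); }

    trace_scope(const trace_scope &) = delete;
    trace_scope & operator=(const trace_scope &) = delete;

private:
    // Writes one JSON object per line. The name is assumed to be a plain
    // identifier, so no JSON escaping is performed here.
    void emit(char phase) {
        using namespace std::chrono;
        const int64_t ts_us =
            duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
        out_ << "{\"name\":\"" << name_ << "\",\"ph\":\"" << phase
             << "\",\"ts\":" << ts_us << "}\n";
    }

    std::ostream & out_;
    std::string    name_;
};
```

A caller would wrap the region of interest in a block, for example `{ trace_scope scope(stream, "llama_decode"); /* work */ }`, where `stream` can be a `std::ostringstream` or a file stream. Tying the end event to the destructor means early returns and exceptions still close the event.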

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tools/server/server-smt-vision.h | Declares SMT media encoding/decoding API and fallback stubs when disabled. |
| tools/server/server-smt-vision.cpp | Implements SMT vision/audio encode + LLM embedding decode path and token boundary detection. |
| tools/server/server-context.h | Adds media_backend to server metadata. |
| tools/server/server-context.cpp | Adds backend selection (mtmd vs smt), SMT prompt processing, encoder timing stats, and exposes backend in responses. |
| tools/server/server-common.h | Extends server_tokens to carry SMT media chunks and adds SMT prompt API surface. |
| tools/server/CMakeLists.txt | Adds LLAMA_SERVER_SMT_VISION build option wiring and links ORT + SpacemiT sample sources. |
| tools/mtmd/smt-vision-wrapper.h / .cpp | Adds SMT vision engine wrapper (config parsing, warmup, encode). |
| tools/mtmd/smt-vision-preprocess.h / .cpp | Adds SMT image preprocessing from common image formats to float32 NCHW tensor bytes. |
| tools/mtmd/smt-audio-wrapper.h / .cpp | Adds SMT audio split-encoder wrapper with ONNXRuntime + optional SpacemiT EP init. |
| tools/mtmd/mtmd-cli.cpp | Routes to SMT CLI backend based on params. |
| tools/mtmd/mtmd-cli-smt.cpp | New SMT-backed multimodal CLI implementation. |
| tools/mtmd/mtmd-audio.h / .cpp | Exposes a reusable log-mel spectrogram helper for SMT audio path. |
| tools/mtmd/CMakeLists.txt | Adds SMT-related sources/includes/libs to mtmd CLI build when enabled. |
| src/llama-context.cpp | Adds trace begin/end events around llama_decode. |
| ggml/src/ggml-profile.c | Adds JSON trace emitter implementation behind GGML_BUILD_PROFILE. |
| ggml/include/ggml-profile.h | Adds profiling/trace API surface (enabled via GGML_BUILD_PROFILE). |
| docs/build-riscv64-spacemit.md | Updates build/run docs for SMT enablement and usage. |
| convert_hf_to_gguf.py | Adds tokenizer loading fallback logic and Qwen3-ASR text arch handling. |
| common/common.h | Adds SMT backend selection params under LLAMA_SERVER_SMT_VISION. |
| common/arg.cpp | Adds --media-backend/--vision-backend and --smt-config-dir CLI options. |
| CMakeLists.txt | Adds LLAMA_SERVER_SMT_VISION option and compile definition. |
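
The file summary mentions preprocessing images into float32 NCHW tensor bytes. The sketch below shows the core layout change, from interleaved 8-bit RGB (HWC) to planar float32 (CHW, which is NCHW for a batch of one). The function name, the scaling to [0, 1], and the caller-supplied mean/std normalization are assumptions; the real preprocessing in smt-vision-preprocess may resize, pad, or normalize differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Convert interleaved RGB bytes (HWC) to planar float32 (CHW).
// Each value is scaled to [0, 1], then normalized per channel as
// (value - mean[c]) / stdev[c]. Both arrays are caller-supplied assumptions.
std::vector<float> rgb_hwc_to_nchw(const std::vector<uint8_t> & rgb,
                                   size_t width, size_t height,
                                   const float mean[3], const float stdev[3]) {
    const size_t plane = width * height;
    if (rgb.size() != plane * 3) {
        throw std::invalid_argument("rgb buffer size does not match width * height * 3");
    }
    std::vector<float> out(plane * 3);
    for (size_t c = 0; c < 3; ++c) {
        for (size_t y = 0; y < height; ++y) {
            for (size_t x = 0; x < width; ++x) {
                const float v = rgb[(y * width + x) * 3 + c] / 255.0f;
                out[c * plane + y * width + x] = (v - mean[c]) / stdev[c];
            }
        }
    }
    return out;
}
```

The returned vector's storage (`out.data()`, `out.size() * sizeof(float)` bytes) can then be treated as the raw tensor bytes for an inference runtime that expects a contiguous NCHW input.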

Comment thread tools/server/server-context.cpp Outdated
Comment thread ggml/src/ggml-profile.c Outdated
Comment thread ggml/src/ggml-profile.c
Comment thread tools/mtmd/mtmd-cli.cpp Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread tools/server/CMakeLists.txt Outdated
Comment thread tools/mtmd/CMakeLists.txt
Comment thread tools/mtmd/smt-audio-wrapper.cpp Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 10 comments.

Comment thread src/llama-context.cpp
Comment thread ggml/src/ggml-profile.c
Comment thread tools/mtmd/smt-vision-wrapper.cpp
Comment thread tools/mtmd/smt-vision-wrapper.cpp
Comment thread tools/mtmd/smt-audio-wrapper.cpp
Comment thread convert_hf_to_gguf.py Outdated
Comment thread convert_hf_to_gguf.py Outdated
Comment thread tools/server/server-smt-vision.h Outdated
Comment thread tools/mtmd/mtmd-cli-smt.cpp
Comment thread tools/mtmd/mtmd-cli-smt.cpp Outdated
@alex-spacemit alex-spacemit merged commit 700aa8b into spacemit-com:spacemit-mtmd May 18, 2026
11 checks passed

3 participants