
feature: backend/vllm-mlx#109

Merged
thushan merged 4 commits into main from backend/vllm-mlx on Feb 20, 2026

Conversation


@thushan thushan commented Feb 20, 2026

This PR enables vllm-mlx natively.

Summary by CodeRabbit

Release Notes

New Features

  • Added vLLM-MLX backend support: an Apple Silicon-optimised inference server with an OpenAI-compatible API
  • Enables MLX model serving with streaming capabilities and comprehensive performance metrics extraction
  • Supports quantisation levels (2–8 bit) and model discovery on Apple Silicon devices (M1–M4)


thushan commented Feb 20, 2026

@CodeRabbit review


coderabbitai bot commented Feb 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Feb 20, 2026

Walkthrough

This pull request introduces comprehensive support for vLLM-MLX, an Apple Silicon-optimized OpenAI-compatible inference server using the MLX framework. It adds a new backend profile configuration, converters, parsers, response models, and constants across the system.

Changes

Cohort / File(s) Summary
Backend Configuration
config/profiles/vllm-mlx.yaml
New profile defining routing, API endpoints (health, models, chat/completions, completions, embeddings), MLX model context patterns, memory/concurrency requirements, quantization levels, and streaming configuration for vLLM-MLX.
Converter Implementation
internal/adapter/converter/vllm_mlx_converter.go, internal/adapter/converter/vllm_mlx_converter_test.go
New converter that transforms unified models into vLLM-MLX format, including model filtering, alias resolution, and ownership extraction from HuggingFace-style model IDs; comprehensive test coverage for alias fallback and owner derivation logic.
Converter Factory
internal/adapter/converter/factory.go, internal/adapter/converter/factory_test.go
Registers the new VLLMMLX converter in the factory; updated tests expect 10 supported formats including "vllm-mlx".
Parser & Response Models
internal/adapter/registry/profile/vllm_mlx.go, internal/adapter/registry/profile/vllm_mlx_parser.go, internal/adapter/registry/profile/vllm_mlx_parser_test.go
Adds VLLMMLXResponse and VLLMMLXModel types for OpenAI-compatible response parsing; implements parser to convert JSON responses into unified domain.ModelInfo objects with publisher, format, and timestamp handling; extensive tests cover field mapping, empty inputs, and error cases.
Parser Registry
internal/adapter/registry/profile/parsers.go
Updates getParserForFormat switch to instantiate vllmMLXParser for the VLLMMLX provider type.
Constants & Version
internal/core/constants/llm.go, internal/core/constants/providers.go, internal/core/domain/profile.go, internal/version/version.go
Adds RecipeMLX constant, provider type/display/prefix constants (ProviderTypeVLLMMLX, ProviderDisplayVLLMMLX, ProviderPrefixVLLMMLX1/2), ProfileVLLMMLX constant, and registers "vllm-mlx" in SupportedBackends.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • chore: Consolidate Converters #58: Introduces BaseConverter and refactors converters to use it; the main PR's VLLMMLX converter relies on BaseConverter APIs (FindProviderAlias).
  • feat: backend/vllm #44: Adds native vLLM provider support; both PRs follow the same pattern of adding profile configurations, parsers, converters, and constants for LLM backends.
  • refactor: Proxy Configurations #59: Centralises BaseConverter utilities; the main PR's VLLMMLX converter embeds and uses BaseConverter, sharing the same infrastructure.

Suggested labels

enhancement, llm-backend

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 22.22%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'feature: backend/vllm-mlx' directly and clearly identifies the main change, adding native vllm-mlx backend support. It is concise, specific, and matches the PR's primary objective.




@thushan thushan merged commit c1e4d2b into main Feb 20, 2026
5 checks passed
@thushan thushan deleted the backend/vllm-mlx branch February 20, 2026 11:55
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
internal/adapter/converter/factory_test.go (1)

59-60: assert.Contains(t, qpErr.Reason, "vllm") doesn't exclusively verify the standalone "vllm" format.

"vllm" is a substring of "vllm-mlx", so this assertion passes even if the "vllm" converter were removed from the factory (as long as "vllm-mlx" remains in the error string). Consider using a word-boundary or format-separator pattern, or checking formatMap similarly to how line 80–81 does it for the supported-formats test.

♻️ Suggested approach
-		assert.Contains(t, qpErr.Reason, "vllm")
+		assert.Contains(t, qpErr.Reason, "vllm,") // comma-separated; or
+		// assert.True(t, strings.Contains(qpErr.Reason, "vllm") && !strings.HasPrefix(qpErr.Reason[strings.Index(qpErr.Reason, "vllm"):], "vllm-mlx"))

Alternatively, update getSupportedFormats to use a stable delimiter and assert on the delimited token.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/converter/factory_test.go` around lines 59 - 60, The
assertion on qpErr.Reason is too loose because "vllm" is a substring of
"vllm-mlx"; update the test in factory_test.go to assert the standalone presence
of the "vllm" converter by either (A) parsing qpErr.Reason into tokens using the
same delimiter used by getSupportedFormats and asserting one token equals
"vllm", or (B) checking the factory's formatMap (or equivalent map used in the
test) directly for the "vllm" key (similar to how the supported-formats
assertions on lines ~80–81 check specific entries); reference qpErr.Reason,
getSupportedFormats, and formatMap to locate and change the assertion.
internal/adapter/converter/vllm_mlx_converter.go (1)

13-15: Misleading comment — these are convenience aliases, not backward-compatibility shims.

Nothing pre-existing is being maintained here. Consider rewording to clarify intent, e.g. "Package-level aliases so converter tests and logic can reference the profile types without a qualifier."

✏️ Suggested wording
-// Type aliases for backward compatibility with tests
+// Type aliases so converter code can reference profile types without a package qualifier.
 type VLLMMLXResponse = profile.VLLMMLXResponse
 type VLLMMLXModel = profile.VLLMMLXModel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/converter/vllm_mlx_converter.go` around lines 13 - 15,
Update the misleading comment above the type aliases VLLMMLXResponse and
VLLMMLXModel: replace the "Type aliases for backward compatibility with tests"
wording with a clearer description that these are package-level convenience
aliases allowing converter code and tests to reference profile.VLLMMLXResponse
and profile.VLLMMLXModel without a qualifier (e.g., "Package-level convenience
aliases so converter logic and tests can reference profile types without a
qualifier"). Ensure the new comment sits immediately above the two alias
declarations VLLMMLXResponse and VLLMMLXModel.
internal/adapter/registry/profile/vllm_mlx_parser.go (1)

46-49: Hardcoded "vllm-mlx" string — use the constant for consistency.

Line 47 compares against the string literal "vllm-mlx" while also referencing constants.ProviderTypeVLLM in the same condition. Since constants.ProviderTypeVLLMMLX exists (value "vllm-mlx"), use it here to avoid silent drift.

♻️ Proposed fix
-	if model.OwnedBy != "" && model.OwnedBy != "vllm-mlx" && model.OwnedBy != constants.ProviderTypeVLLM {
+	if model.OwnedBy != "" && model.OwnedBy != constants.ProviderTypeVLLMMLX && model.OwnedBy != constants.ProviderTypeVLLM {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/registry/profile/vllm_mlx_parser.go` around lines 46 - 49,
The condition in the parser is comparing model.OwnedBy against a hardcoded
"vllm-mlx" string; replace that literal with the existing constant
constants.ProviderTypeVLLMMLX so the check is consistent with the rest of the
code; update the if condition that references model.OwnedBy (and
constants.ProviderTypeVLLM) to use constants.ProviderTypeVLLMMLX instead of the
literal, keeping the rest of the logic intact.
internal/adapter/registry/profile/vllm_mlx_parser_test.go (1)

128-158: Use constants.ProviderTypeVLLM in the test's defaultOwners slice for consistency.

The parser (line 47 of vllm_mlx_parser.go) already uses constants.ProviderTypeVLLM when checking the default owned-by value, so the test should reference the same constant rather than hardcoding "vllm". This ensures the test and parser stay in lockstep if the constant's value ever changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/registry/profile/vllm_mlx_parser_test.go` around lines 128 -
158, Update the test's defaultOwners slice to use the shared constant instead of
the hardcoded string: replace the literal "vllm" with constants.ProviderTypeVLLM
in the t.Run block (the test function "does not set publisher when owned_by is
default") so it matches the parser's check in vllm_mlx_parser.go; ensure you
import the package that defines constants.ProviderTypeVLLM if not already
imported and run the test to verify no regressions.

Comment on lines +20 to +26
# Anthropic Messages API support
# vLLM-MLX inherits the vLLM API surface, which includes native Anthropic Messages
# API support. Token counting is available via /v1/messages/count_tokens.
anthropic_support:
  enabled: true
  messages_path: /v1/messages
  token_count: true

⚠️ Potential issue | 🟡 Minor

Minor comment inaccuracy — vLLM-MLX implements the Anthropic API independently, not by inheriting upstream vLLM.

vLLM-MLX provides its own "native /v1/messages endpoint" rather than re-using upstream vLLM's Anthropic layer. The comment "vLLM-MLX inherits the vLLM API surface" is misleading — the Anthropic support is natively implemented in this fork. The enabled: true configuration itself is correct and justified; only the comment needs updating.

📝 Suggested comment update
-  # Anthropic Messages API support
-  # vLLM-MLX inherits the vLLM API surface, which includes native Anthropic Messages
-  # API support. Token counting is available via /v1/messages/count_tokens.
+  # Anthropic Messages API support
+  # vLLM-MLX provides a native /v1/messages endpoint (independent of upstream vLLM),
+  # enabling Claude Code / OpenCode compatibility. Token counting is available via
+  # /v1/messages/count_tokens.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/profiles/vllm-mlx.yaml` around lines 20 - 26, Update the misleading
comment above the anthropic_support block to say that vLLM-MLX implements the
Anthropic Messages API natively (provides its own /v1/messages endpoint and
token counting via token_count/messages_path) rather than "inheriting the vLLM
API surface"; leave the configuration keys (anthropic_support, enabled,
messages_path, token_count) unchanged and only revise the explanatory comment to
reflect native implementation in vLLM-MLX.

Comment on lines +165 to +174
# MLX quantisation levels map loosely to these memory multipliers.
# Naming convention differs from GGUF (e.g. "4bit" rather than "Q4_K_M"),
# but the effective compression ratios are broadly similar.
quantization:
  multipliers:
    2bit: 0.35
    3bit: 0.45
    4bit: 0.50 # Most common MLX distribution format
    6bit: 0.75
    8bit: 0.875

⚠️ Potential issue | 🟡 Minor

bf16 is missing from quantization.multipliers whilst listed as a supported quantisation level.

features.model_format.quantization_levels (line 233) includes "bf16", but resources.quantization.multipliers has no bf16 entry. Any code path that applies a multiplier lookup for a bf16-quantised model will silently get a zero/missing value, producing incorrect memory footprint estimates (e.g. zero or a default) for those models.

🛠️ Proposed fix
     8bit: 0.875
+    bf16: 1.0     # BFloat16 — uncompressed weights, full memory footprint
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/profiles/vllm-mlx.yaml` around lines 165 - 174, The
quantization.multipliers map is missing an entry for the "bf16" quantization
level listed in features.model_format.quantization_levels, causing lookups to
return missing/zero values; add a "bf16" key to quantization.multipliers (e.g.,
bf16: 0.5) so memory/multiplier calculations in any code paths using that map
(refer to quantization.multipliers and
features.model_format.quantization_levels) produce correct estimates.

Comment on lines +3 to +9
import (
	"fmt"
	"time"

	"github.com/thushan/olla/internal/core/constants"
	"github.com/thushan/olla/internal/core/domain"
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n internal/adapter/registry/profile/vllm_mlx_parser.go

Repository: thushan/olla

🏁 Script executed:

# Check Go version
cat go.mod | head -5

Add missing encoding/json import — file will not compile without it.

Line 21 calls json.Unmarshal to deserialise the response, but encoding/json is not imported. This prevents the file from building.

Proposed fix
 import (
+	"encoding/json"
 	"fmt"
 	"time"
 
 	"github.com/thushan/olla/internal/core/constants"
 	"github.com/thushan/olla/internal/core/domain"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/registry/profile/vllm_mlx_parser.go` around lines 3 - 9, The
file calls json.Unmarshal but does not import the encoding/json package; add
"encoding/json" to the import block in
internal/adapter/registry/profile/vllm_mlx_parser.go so json.Unmarshal resolves
and the file compiles (update the import list where fmt, time, constants, domain
are imported).
