
feature: backend/vllm-mlx#109

Merged
thushan merged 4 commits into main from backend/vllm-mlx on Feb 20, 2026

Conversation


@thushan thushan commented Feb 20, 2026

This PR enables vllm-mlx natively.

Summary by CodeRabbit

Release Notes

New Features

  • Added vLLM-MLX backend support: an Apple Silicon-optimised inference server with an OpenAI-compatible API
  • Enables MLX model serving with streaming capabilities and comprehensive performance metrics extraction
  • Supports quantisation levels (2–8 bit) and model discovery on Apple Silicon devices (M1–M4)


thushan commented Feb 20, 2026

@CodeRabbit review


coderabbitai bot commented Feb 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Feb 20, 2026

Walkthrough

This pull request introduces comprehensive support for vLLM-MLX, an Apple Silicon-optimized OpenAI-compatible inference server using the MLX framework. It adds a new backend profile configuration, converters, parsers, response models, and constants across the system.

Changes

Cohort / File(s) Summary
Backend Configuration
config/profiles/vllm-mlx.yaml
New profile defining routing, API endpoints (health, models, chat/completions, completions, embeddings), MLX model context patterns, memory/concurrency requirements, quantization levels, and streaming configuration for vLLM-MLX.
Converter Implementation
internal/adapter/converter/vllm_mlx_converter.go, internal/adapter/converter/vllm_mlx_converter_test.go
New converter that transforms unified models into vLLM-MLX format, including model filtering, alias resolution, and ownership extraction from HuggingFace-style model IDs; comprehensive test coverage for alias fallback and owner derivation logic.
Converter Factory
internal/adapter/converter/factory.go, internal/adapter/converter/factory_test.go
Registers the new VLLMMLX converter in the factory; updated tests expect 10 supported formats including "vllm-mlx".
Parser & Response Models
internal/adapter/registry/profile/vllm_mlx.go, internal/adapter/registry/profile/vllm_mlx_parser.go, internal/adapter/registry/profile/vllm_mlx_parser_test.go
Adds VLLMMLXResponse and VLLMMLXModel types for OpenAI-compatible response parsing; implements parser to convert JSON responses into unified domain.ModelInfo objects with publisher, format, and timestamp handling; extensive tests cover field mapping, empty inputs, and error cases.
Parser Registry
internal/adapter/registry/profile/parsers.go
Updates getParserForFormat switch to instantiate vllmMLXParser for the VLLMMLX provider type.
Constants & Version
internal/core/constants/llm.go, internal/core/constants/providers.go, internal/core/domain/profile.go, internal/version/version.go
Adds RecipeMLX constant, provider type/display/prefix constants (ProviderTypeVLLMMLX, ProviderDisplayVLLMMLX, ProviderPrefixVLLMMLX1/2), ProfileVLLMMLX constant, and registers "vllm-mlx" in SupportedBackends.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • chore: Consolidate Converters #58: Introduces BaseConverter and refactors converters to use it; the main PR's VLLMMLX converter relies on BaseConverter APIs (FindProviderAlias).
  • feat: backend/vllm #44: Adds native vLLM provider support; both PRs follow the same pattern of adding profile configurations, parsers, converters, and constants for LLM backends.
  • refactor: Proxy Configurations #59: Centralises BaseConverter utilities; the main PR's VLLMMLX converter embeds and uses BaseConverter, sharing the same infrastructure.

Suggested labels

enhancement, llm-backend

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 22.22%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'feature: backend/vllm-mlx' directly and clearly identifies the main change, adding native vllm-mlx backend support. It is concise, specific, and matches the PR's primary objective.




@thushan thushan merged commit c1e4d2b into main Feb 20, 2026
5 checks passed
@thushan thushan deleted the backend/vllm-mlx branch February 20, 2026 11:55
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
internal/adapter/converter/factory_test.go (1)

59-60: assert.Contains(t, qpErr.Reason, "vllm") doesn't exclusively verify the standalone "vllm" format.

"vllm" is a substring of "vllm-mlx", so this assertion passes even if the "vllm" converter were removed from the factory (as long as "vllm-mlx" remains in the error string). Consider using a word-boundary or format-separator pattern, or checking formatMap similarly to how line 80–81 does it for the supported-formats test.

♻️ Suggested approach
-		assert.Contains(t, qpErr.Reason, "vllm")
+		assert.Contains(t, qpErr.Reason, "vllm,") // comma-separated; or
+		// assert.True(t, strings.Contains(qpErr.Reason, "vllm") && !strings.HasPrefix(qpErr.Reason[strings.Index(qpErr.Reason, "vllm"):], "vllm-mlx"))

Alternatively, update getSupportedFormats to use a stable delimiter and assert on the delimited token.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/converter/factory_test.go` around lines 59 - 60, The
assertion on qpErr.Reason is too loose because "vllm" is a substring of
"vllm-mlx"; update the test in factory_test.go to assert the standalone presence
of the "vllm" converter by either (A) parsing qpErr.Reason into tokens using the
same delimiter used by getSupportedFormats and asserting one token equals
"vllm", or (B) checking the factory's formatMap (or equivalent map used in the
test) directly for the "vllm" key (similar to how the supported-formats
assertions on lines ~80–81 check specific entries); reference qpErr.Reason,
getSupportedFormats, and formatMap to locate and change the assertion.
internal/adapter/converter/vllm_mlx_converter.go (1)

13-15: Misleading comment — these are convenience aliases, not backward-compatibility shims.

Nothing pre-existing is being maintained here. Consider rewording to clarify intent, e.g. "Package-level aliases so converter tests and logic can reference the profile types without a qualifier."

✏️ Suggested wording
-// Type aliases for backward compatibility with tests
+// Type aliases so converter code can reference profile types without a package qualifier.
 type VLLMMLXResponse = profile.VLLMMLXResponse
 type VLLMMLXModel = profile.VLLMMLXModel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/converter/vllm_mlx_converter.go` around lines 13 - 15,
Update the misleading comment above the type aliases VLLMMLXResponse and
VLLMMLXModel: replace the "Type aliases for backward compatibility with tests"
wording with a clearer description that these are package-level convenience
aliases allowing converter code and tests to reference profile.VLLMMLXResponse
and profile.VLLMMLXModel without a qualifier (e.g., "Package-level convenience
aliases so converter logic and tests can reference profile types without a
qualifier"). Ensure the new comment sits immediately above the two alias
declarations VLLMMLXResponse and VLLMMLXModel.
internal/adapter/registry/profile/vllm_mlx_parser.go (1)

46-49: Hardcoded "vllm-mlx" string — use the constant for consistency.

Line 47 compares against the string literal "vllm-mlx" while also referencing constants.ProviderTypeVLLM in the same condition. Since constants.ProviderTypeVLLMMLX exists (value "vllm-mlx"), use it here to avoid silent drift.

♻️ Proposed fix
-	if model.OwnedBy != "" && model.OwnedBy != "vllm-mlx" && model.OwnedBy != constants.ProviderTypeVLLM {
+	if model.OwnedBy != "" && model.OwnedBy != constants.ProviderTypeVLLMMLX && model.OwnedBy != constants.ProviderTypeVLLM {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/registry/profile/vllm_mlx_parser.go` around lines 46 - 49,
The condition in the parser is comparing model.OwnedBy against a hardcoded
"vllm-mlx" string; replace that literal with the existing constant
constants.ProviderTypeVLLMMLX so the check is consistent with the rest of the
code; update the if condition that references model.OwnedBy (and
constants.ProviderTypeVLLM) to use constants.ProviderTypeVLLMMLX instead of the
literal, keeping the rest of the logic intact.
internal/adapter/registry/profile/vllm_mlx_parser_test.go (1)

128-158: Use constants.ProviderTypeVLLM in the test's defaultOwners slice for consistency.

The parser (line 47 of vllm_mlx_parser.go) already uses constants.ProviderTypeVLLM when checking the default owned-by value, so the test should reference the same constant rather than hardcoding "vllm". This ensures the test and parser stay in lockstep if the constant's value ever changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/registry/profile/vllm_mlx_parser_test.go` around lines 128 -
158, Update the test's defaultOwners slice to use the shared constant instead of
the hardcoded string: replace the literal "vllm" with constants.ProviderTypeVLLM
in the t.Run block (the test function "does not set publisher when owned_by is
default") so it matches the parser's check in vllm_mlx_parser.go; ensure you
import the package that defines constants.ProviderTypeVLLM if not already
imported and run the test to verify no regressions.

Comment on lines +20 to +26
# Anthropic Messages API support
# vLLM-MLX inherits the vLLM API surface, which includes native Anthropic Messages
# API support. Token counting is available via /v1/messages/count_tokens.
anthropic_support:
  enabled: true
  messages_path: /v1/messages
  token_count: true

⚠️ Potential issue | 🟡 Minor

Minor comment inaccuracy — vLLM-MLX implements the Anthropic API independently, not by inheriting upstream vLLM.

vLLM-MLX provides its own "native /v1/messages endpoint" rather than re-using upstream vLLM's Anthropic layer. The comment "vLLM-MLX inherits the vLLM API surface" is misleading — the Anthropic support is natively implemented in this fork. The enabled: true configuration itself is correct and justified; only the comment needs updating.

📝 Suggested comment update
-  # Anthropic Messages API support
-  # vLLM-MLX inherits the vLLM API surface, which includes native Anthropic Messages
-  # API support. Token counting is available via /v1/messages/count_tokens.
+  # Anthropic Messages API support
+  # vLLM-MLX provides a native /v1/messages endpoint (independent of upstream vLLM),
+  # enabling Claude Code / OpenCode compatibility. Token counting is available via
+  # /v1/messages/count_tokens.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/profiles/vllm-mlx.yaml` around lines 20 - 26, Update the misleading
comment above the anthropic_support block to say that vLLM-MLX implements the
Anthropic Messages API natively (provides its own /v1/messages endpoint and
token counting via token_count/messages_path) rather than "inheriting the vLLM
API surface"; leave the configuration keys (anthropic_support, enabled,
messages_path, token_count) unchanged and only revise the explanatory comment to
reflect native implementation in vLLM-MLX.

Comment on lines +165 to +174
# MLX quantisation levels map loosely to these memory multipliers.
# Naming convention differs from GGUF (e.g. "4bit" rather than "Q4_K_M"),
# but the effective compression ratios are broadly similar.
quantization:
  multipliers:
    2bit: 0.35
    3bit: 0.45
    4bit: 0.50 # Most common MLX distribution format
    6bit: 0.75
    8bit: 0.875

⚠️ Potential issue | 🟡 Minor

bf16 is missing from quantization.multipliers whilst listed as a supported quantisation level.

features.model_format.quantization_levels (line 233) includes "bf16", but resources.quantization.multipliers has no bf16 entry. Any code path that applies a multiplier lookup for a bf16-quantised model will silently get a zero/missing value, producing incorrect memory footprint estimates (e.g. zero or a default) for those models.

🛠️ Proposed fix
     8bit: 0.875
+    bf16: 1.0     # BFloat16 — uncompressed weights, full memory footprint
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/profiles/vllm-mlx.yaml` around lines 165 - 174, The
quantization.multipliers map is missing an entry for the "bf16" quantization
level listed in features.model_format.quantization_levels, causing lookups to
return missing/zero values; add a "bf16" key to quantization.multipliers (e.g.,
bf16: 0.5) so memory/multiplier calculations in any code paths using that map
(refer to quantization.multipliers and
features.model_format.quantization_levels) produce correct estimates.

Comment on lines +3 to +9
import (
	"fmt"
	"time"

	"github.com/thushan/olla/internal/core/constants"
	"github.com/thushan/olla/internal/core/domain"
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n internal/adapter/registry/profile/vllm_mlx_parser.go

Repository: thushan/olla

🏁 Script executed:

# Check Go version
cat go.mod | head -5

Add missing encoding/json import — file will not compile without it.

Line 21 calls json.Unmarshal to deserialise the response, but encoding/json is not imported. This prevents the file from building.

Proposed fix
 import (
+	"encoding/json"
 	"fmt"
 	"time"
 
 	"github.com/thushan/olla/internal/core/constants"
 	"github.com/thushan/olla/internal/core/domain"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/adapter/registry/profile/vllm_mlx_parser.go` around lines 3 - 9, The
file calls json.Unmarshal but does not import the encoding/json package; add
"encoding/json" to the import block in
internal/adapter/registry/profile/vllm_mlx_parser.go so json.Unmarshal resolves
and the file compiles (update the import list where fmt, time, constants, domain
are imported).
