feat: 4-tier integration testing framework by weklund · Pull Request #16 · weklund/mlx-stack

weklund · 2026-04-02T21:53:26Z

Summary

Adds a comprehensive integration testing framework that proves the core user contract: models serve inference, tool calling works, LiteLLM routing works, and coding agents can connect via the OpenAI client.
Adds catalog validation to CI so broken HuggingFace repo URLs (like the qwen3.5-8b 404 in issue #15, "Qwen 3.5 0.8B fails with ArraysCache error on vllm-mlx, fast tier unusable") are caught before merge.
Adds nightly and pre-release CI workflows for smoke tests and full stack integration.

Test Tiers

Tier	Marker	What it proves	When it runs
1. Catalog Validation	`catalog_validation`	All HF repos exist, fields valid, capabilities consistent	Every PR (CI)
2. Model Smoke	`smoke`	Each catalog model loads and serves inference, tool calling, thinking	Nightly
3. Stack Integration	`integration`	Full lifecycle, LiteLLM routing, concurrent requests, clean shutdown	Pre-release
4. Harness Compatibility	`harness`	OpenAI Python client works (chat, streaming, tool calling, multi-turn)	Pre-release

New Make Targets

make test-catalog      # Tier 1 — fast, requires network
make test-smoke        # Tier 2 — slow, requires macOS + vllm-mlx
make test-integration  # Tier 3 — slow, requires macOS + vllm-mlx + litellm
make test-harness      # Tier 4 — slow, requires above + openai package

Key Infrastructure

Dynamic port allocation — no hardcoded ports, no conflicts with running stacks
ServiceManager — context manager with guaranteed cleanup (SIGTERM → SIGKILL → port verification)
Persistent model cache — ~/.mlx-stack-test-cache/models/ avoids re-downloading across runs
Compatibility matrix — JSON report of pass/fail per model per capability
Skip decorators — graceful degradation for non-macOS, insufficient memory, missing deps
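
As a rough illustration of the first two items above, here is a minimal sketch of OS-assigned ports and SIGTERM → SIGKILL → port-verification cleanup. The names `find_free_port`, `port_is_free`, and `managed_service` are hypothetical and are not taken from this PR; the actual ServiceManager may be structured differently.

```python
import socket
import subprocess
from contextlib import contextmanager


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def port_is_free(port: int) -> bool:
    """Return True when nothing accepts connections on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex(("127.0.0.1", port)) != 0


@contextmanager
def managed_service(cmd, port, grace=5.0):
    """Hypothetical stand-in for ServiceManager: start, yield, always clean up."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            proc.terminate()  # SIGTERM first
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate to SIGKILL
                proc.wait()
        # Final step: confirm the port was actually released.
        if not port_is_free(port):
            raise RuntimeError(f"port {port} still in use after shutdown")
```

Binding to port 0 leaves a small window between picking the port and the service claiming it, which is usually acceptable for local test runs.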

Verification

make check passes (lint + typecheck + 1,421 unit tests)
New tests correctly deselected from default make test run (168 deselected)
No changes to existing test behavior
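
One way the default run could exclude the new tiers is a pytest `-m` marker expression. The sketch below only builds the argument list; the helper names are hypothetical, and the PR may wire this up differently (for example through pytest configuration or the Makefile).

```python
from typing import List, Optional

# Marker names come from the Test Tiers table; everything else is illustrative.
TIER_MARKERS = ("catalog_validation", "smoke", "integration", "harness")


def default_marker_expression(markers=TIER_MARKERS) -> str:
    """Build a pytest -m expression that deselects every tier marker."""
    return " and ".join(f"not {marker}" for marker in markers)


def pytest_args_for(tier: Optional[str] = None) -> List[str]:
    """Return pytest arguments for the default run or for a single tier."""
    if tier is None:
        return ["-m", default_marker_expression()]
    if tier not in TIER_MARKERS:
        raise ValueError(f"unknown tier marker: {tier}")
    return ["-m", tier]
```

Under this assumption, the default run would pass `pytest_args_for()` and a tier-specific target such as the smoke run would pass `pytest_args_for("smoke")`.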

Test plan

make check passes on CI (lint, typecheck, unit tests)
make test-catalog validates catalog entries against HuggingFace API
Nightly workflow triggers and runs smoke tests on smallest models
Pre-release workflow triggers on release creation

🤖 Generated with Claude Code

…havior in tests

Introduces ServiceHealth StrEnum for the 5 service health states (healthy, degraded, down, crashed, stopped), replacing raw strings across stack_status.py, cli/status.py, and watchdog.py. This gives type safety: pyright catches invalid status values at check time.

Rewrites 5 tests that were asserting implementation (mock was called) instead of behavior (correct output shown). Each rewritten test was verified by inducing a bug that slips through the old test but is caught by the new one:

- test_pull_force_redownloads: asserts "is ready" in output, not "already exists" (catches force flag being ignored)
- test_pull_with_bench_flag: asserts benchmark output is displayed (catches silent benchmark with no output)
- test_successful_download: asserts correct repo/path passed and completion message printed (catches wrong model downloaded)
- test_table_shows_tier_data: uses mixed statuses per tier, asserts each distinct status appears (catches display showing same status for all tiers)
- test_fresh_install: asserts plist contains correct binary path and label (catches plist written with wrong binary)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds a Makefile with targets matching exactly what CI runs: install, lint, typecheck, test, and check (lint, typecheck, and test together). Updates ci.yml and release-please.yml to use make targets instead of inline commands. Developers, AI agents, and CI now run the same commands; running make check locally guarantees CI will pass.

Also fixes pyright errors in test files that construct ServiceStatus with raw strings instead of ServiceHealth enum members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add comprehensive integration tests that prove the user contract holds: models serve inference, tool calling works, LiteLLM routing works, and coding agents can connect via OpenAI client.

Tier 1 (catalog_validation): Validates all HF repo URLs exist and catalog entries are consistent. Runs in CI on every PR; would have caught issue #15 (qwen3.5-8b 404).

Tier 2 (smoke): Per-model inference, tool calling, and thinking validation parameterized across the full catalog. Runs nightly.

Tier 3 (integration): Full stack lifecycle: init, pull, up, LiteLLM routing, concurrent requests, clean shutdown. Runs pre-release.

Tier 4 (harness): OpenAI Python client compatibility: chat, streaming, model listing, tool calling, multi-turn. Validates what aider/OpenCode/Continue/Claude Code use under the hood. Runs pre-release.

Shared fixtures provide dynamic port allocation, persistent model cache, service lifecycle management with guaranteed cleanup, and skip decorators for platform/memory/dependency checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces the 5-command flow (profile → recommend → init → pull → up) with a single guided experience that walks users through hardware detection, model selection, and stack startup.

Key components:

- discovery.py: Queries mlx-community HuggingFace API for text-generation models and merges with static benchmark_data.json for performance and quality overlay. Falls back to benchmark-only models when offline.
- onboarding.py: Orchestration logic (scoring, memory-budget filtering, default selection, tier assignment, config generation, model download, and stack startup). Uses same intent weights as the existing scoring engine but operates on DiscoveredModel instead of CatalogEntry.
- cli/setup.py: Interactive 6-step CLI with Rich display. Supports --accept-defaults for non-interactive mode, --intent flag, --budget-pct, quant override syntax (e.g. 1:int8,3), and optional LaunchAgent install.
- benchmark_data.json: Static export from mlx_transformers_benchmark with 69 model entries across 3 hardware profiles (M4 Pro 24/64GB, M5 Max 128GB).

Includes 60 behavioral unit tests that assert outcomes (models returned, scores produced, tiers assigned, CLI output), not implementation details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace mock.assert_called/assert_not_called with behavioral assertions across test_cli_pull, test_watchdog, test_launchd, and test_cli_status. Each removed assertion was redundant: the return value or observable output already proved the behavior.

Key changes:

- test_retry_on_first_failure now asserts download succeeds with "Download complete" output instead of checking mock call count
- Remove test_health_check_uses_correct_paths (pure mock.call_args inspection, behavior covered by test_five_distinct_states)
- Simplify litellm port extraction in cli/setup.py
- Fix onboarding.py docstring that incorrectly claimed no Rich dep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
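
To make the distinction concrete, the sketch below contrasts a behavioral assertion with a call-count check on a toy retry function. `pull_model` and its two-attempt policy are hypothetical stand-ins, not the project's pull command. Only the "Download complete" message is taken from the commit text.

```python
from unittest.mock import Mock


def pull_model(name, download, echo, attempts=2):
    """Hypothetical downloader: retry on OSError and report the outcome."""
    for _ in range(attempts):
        try:
            download(name)
        except OSError:
            continue
        echo("Download complete")
        return True
    echo("Download failed")
    return False


def test_retry_on_first_failure():
    # First call raises, second call succeeds.
    download = Mock(side_effect=[OSError("network"), None])
    lines = []

    ok = pull_model("demo-model", download, lines.append)

    # Behavioral assertions: what the user would observe.
    assert ok is True
    assert "Download complete" in lines
    # An implementation-style check such as `download.call_count == 2`
    # would add nothing here, since success after one failure already
    # implies a retry happened.
```

The mock still controls the failure, but the assertions target the return value and printed output, so a regression that calls `download` twice yet prints nothing would be caught.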

The DeepSeek-R1-0528-Qwen3-32B repos on HuggingFace require authentication (401), so the catalog entry needs gated: true for CI catalog validation to pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scanned every non-gated catalog entry against HuggingFace API and found 7 more repos returning 401 (auth required): nemotron-49b, nemotron-8b, qwen3.5-{3b,8b,14b,32b,72b}. All Qwen3.5 and Nemotron models on mlx-community are now gated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
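
The rule implied by these two commits (a 401 is acceptable only when the entry is marked gated) can be sketched as a pure function over the HTTP status. The function name and error strings are hypothetical, and the real Tier 1 tests may check more than this.

```python
from typing import Optional


def validate_repo_status(status_code: int, gated: bool) -> Optional[str]:
    """Return an error message for a catalog entry, or None when it passes.

    Assumed policy: 200 passes, 401 passes only for entries with
    gated: true, and 404 or any other status fails.
    """
    if status_code == 200:
        return None
    if status_code == 401:
        if gated:
            return None
        return "repo requires authentication; mark the entry gated: true"
    if status_code == 404:
        return "repo not found"
    return f"unexpected HTTP status {status_code}"
```

Keeping the decision separate from the network call makes the policy testable offline, with the HTTP request supplying only `status_code`.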

🤖 I have created a release *beep* *boop*

---

## [0.3.4](v0.3.3...v0.3.4) (2026-04-03)

### Features

* 4-tier integration testing framework ([#16](#16)) ([e3dcf9a](e3dcf9a))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

weklund and others added 9 commits April 2, 2026 12:30

chore: add REQUIREMENTS.md to gitignore

229d87c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into fix/test-behavior-assertions

e914168

fix: mark deepseek-r1-32b as gated in catalog

a53f24c


weklund merged commit e3dcf9a into main Apr 3, 2026
2 checks passed

github-actions bot mentioned this pull request Apr 3, 2026

chore(main): release 0.3.4 #19

Merged

weklund mentioned this pull request Apr 3, 2026

Fix/test behavior assertions #23

Closed

weklund merged 9 commits into main from fix/test-behavior-assertions
