Skip to content

feat: 4-tier integration testing framework#16

Merged
weklund merged 9 commits intomainfrom
fix/test-behavior-assertions
Apr 3, 2026
Merged

feat: 4-tier integration testing framework#16
weklund merged 9 commits intomainfrom
fix/test-behavior-assertions

Conversation

@weklund
Copy link
Copy Markdown
Owner

@weklund weklund commented Apr 2, 2026

Summary

  • Adds a comprehensive integration testing framework that proves the core user contract: models serve inference, tool calling works, LiteLLM routing works, and coding agents can connect via the OpenAI client.
  • Adds catalog validation to CI so broken HuggingFace repo URLs (like the qwen3.5-8b 404 in Qwen 3.5 0.8B fails with ArraysCache error on vllm-mlx — fast tier unusable #15) are caught before merge.
  • Adds nightly and pre-release CI workflows for smoke tests and full stack integration.

Test Tiers

Tier Marker What it proves When it runs
1. Catalog Validation catalog_validation All HF repos exist, fields valid, capabilities consistent Every PR (CI)
2. Model Smoke smoke Each catalog model loads and serves inference, tool calling, thinking Nightly
3. Stack Integration integration Full lifecycle, LiteLLM routing, concurrent requests, clean shutdown Pre-release
4. Harness Compatibility harness OpenAI Python client works (chat, streaming, tool calling, multi-turn) Pre-release

New Make Targets

make test-catalog      # Tier 1 — fast, requires network
make test-smoke        # Tier 2 — slow, requires macOS + vllm-mlx
make test-integration  # Tier 3 — slow, requires macOS + vllm-mlx + litellm
make test-harness      # Tier 4 — slow, requires above + openai package

Key Infrastructure

  • Dynamic port allocation — no hardcoded ports, no conflicts with running stacks
  • ServiceManager — context manager with guaranteed cleanup (SIGTERM → SIGKILL → port verification)
  • Persistent model cache~/.mlx-stack-test-cache/models/ avoids re-downloading across runs
  • Compatibility matrix — JSON report of pass/fail per model per capability
  • Skip decorators — graceful degradation for non-macOS, insufficient memory, missing deps

Verification

  • make check passes (lint + typecheck + 1,421 unit tests)
  • New tests correctly deselected from default make test run (168 deselected)
  • No changes to existing test behavior

Test plan

  • make check passes on CI (lint, typecheck, unit tests)
  • make test-catalog validates catalog entries against HuggingFace API
  • Nightly workflow triggers and runs smoke tests on smallest models
  • Pre-release workflow triggers on release creation

🤖 Generated with Claude Code

weklund and others added 9 commits April 2, 2026 12:30
…havior in tests

Introduces ServiceHealth StrEnum for the 5 service health states
(healthy, degraded, down, crashed, stopped), replacing raw strings
across stack_status.py, cli/status.py, and watchdog.py. This gives
type safety — pyright catches invalid status values at check time.

Rewrites 5 tests that were asserting implementation (mock was called)
instead of behavior (correct output shown). Each rewritten test was
verified by inducing a bug that slips through the old test but is
caught by the new one:

- test_pull_force_redownloads: asserts "is ready" in output, not
  "already exists" (catches force flag being ignored)
- test_pull_with_bench_flag: asserts benchmark output is displayed
  (catches silent benchmark with no output)
- test_successful_download: asserts correct repo/path passed and
  completion message printed (catches wrong model downloaded)
- test_table_shows_tier_data: uses mixed statuses per tier, asserts
  each distinct status appears (catches display showing same status
  for all tiers)
- test_fresh_install: asserts plist contains correct binary path and
  label (catches plist written with wrong binary)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a Makefile with targets matching exactly what CI runs: install,
lint, typecheck, test, and check (all three). Updates ci.yml and
release-please.yml to use make targets instead of inline commands.

Developers, AI agents, and CI now run the same commands — running
make check locally guarantees CI will pass. Also fixes pyright errors
in test files that construct ServiceStatus with raw strings instead
of ServiceHealth enum members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive integration tests that prove the user contract holds:
models serve inference, tool calling works, LiteLLM routing works, and
coding agents can connect via OpenAI client.

Tier 1 (catalog_validation): Validates all HF repo URLs exist and
catalog entries are consistent. Runs in CI on every PR — would have
caught issue #15 (qwen3.5-8b 404).

Tier 2 (smoke): Per-model inference, tool calling, and thinking
validation parameterized across the full catalog. Runs nightly.

Tier 3 (integration): Full stack lifecycle — init, pull, up, LiteLLM
routing, concurrent requests, clean shutdown. Runs pre-release.

Tier 4 (harness): OpenAI Python client compatibility — chat, streaming,
model listing, tool calling, multi-turn. Validates what aider/OpenCode/
Continue/Claude Code use under the hood. Runs pre-release.

Shared fixtures provide dynamic port allocation, persistent model cache,
service lifecycle management with guaranteed cleanup, and skip decorators
for platform/memory/dependency checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the 5-command flow (profile → recommend → init → pull → up)
with a single guided experience that walks users through hardware
detection, model selection, and stack startup.

Key components:

- discovery.py: Queries mlx-community HuggingFace API for text-generation
  models and merges with static benchmark_data.json for performance and
  quality overlay. Falls back to benchmark-only models when offline.

- onboarding.py: Orchestration logic — scoring, memory-budget filtering,
  default selection, tier assignment, config generation, model download,
  and stack startup. Uses same intent weights as the existing scoring
  engine but operates on DiscoveredModel instead of CatalogEntry.

- cli/setup.py: Interactive 6-step CLI with Rich display. Supports
  --accept-defaults for non-interactive mode, --intent flag, --budget-pct,
  quant override syntax (e.g. 1:int8,3), and optional LaunchAgent install.

- benchmark_data.json: Static export from mlx_transformers_benchmark with
  69 model entries across 3 hardware profiles (M4 Pro 24/64GB, M5 Max 128GB).

Includes 60 behavioral unit tests that assert outcomes (models returned,
scores produced, tiers assigned, CLI output) not implementation details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace mock.assert_called/assert_not_called with behavioral
assertions across test_cli_pull, test_watchdog, test_launchd, and
test_cli_status. Each removed assertion was redundant — the return
value or observable output already proved the behavior.

Key changes:
- test_retry_on_first_failure now asserts download succeeds with
  "Download complete" output instead of checking mock call count
- Remove test_health_check_uses_correct_paths (pure mock.call_args
  inspection, behavior covered by test_five_distinct_states)
- Simplify litellm port extraction in cli/setup.py
- Fix onboarding.py docstring that incorrectly claimed no Rich dep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The DeepSeek-R1-0528-Qwen3-32B repos on HuggingFace require
authentication (401), so the catalog entry needs gated: true for
CI catalog validation to pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scanned every non-gated catalog entry against HuggingFace API and
found 7 more repos returning 401 (auth required): nemotron-49b,
nemotron-8b, qwen3.5-{3b,8b,14b,32b,72b}. All Qwen3.5 and
Nemotron models on mlx-community are now gated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@weklund weklund merged commit e3dcf9a into main Apr 3, 2026
2 checks passed
weklund pushed a commit that referenced this pull request Apr 3, 2026
🤖 I have created a release *beep* *boop*
---


## [0.3.4](v0.3.3...v0.3.4)
(2026-04-03)


### Features

* 4-tier integration testing framework
([#16](#16))
([e3dcf9a](e3dcf9a))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant