Fix/test behavior assertions by weklund · Pull Request #23 · weklund/mlx-stack

weklund · 2026-04-03T14:26:43Z

No description provided.

…havior in tests

Introduces ServiceHealth StrEnum for the 5 service health states (healthy, degraded, down, crashed, stopped), replacing raw strings across stack_status.py, cli/status.py, and watchdog.py. This gives type safety: pyright catches invalid status values at check time.

Rewrites 5 tests that were asserting implementation (mock was called) instead of behavior (correct output shown). Each rewritten test was verified by inducing a bug that slips through the old test but is caught by the new one:

- test_pull_force_redownloads: asserts "is ready" in output, not "already exists" (catches force flag being ignored)
- test_pull_with_bench_flag: asserts benchmark output is displayed (catches silent benchmark with no output)
- test_successful_download: asserts correct repo/path passed and completion message printed (catches wrong model downloaded)
- test_table_shows_tier_data: uses mixed statuses per tier, asserts each distinct status appears (catches display showing same status for all tiers)
- test_fresh_install: asserts plist contains correct binary path and label (catches plist written with wrong binary)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds a Makefile with targets matching exactly what CI runs: install, lint, typecheck, test, and check (lint, typecheck, and test together). Updates ci.yml and release-please.yml to use make targets instead of inline commands.

Developers, AI agents, and CI now run the same commands, so running make check locally guarantees CI will pass.

Also fixes pyright errors in test files that construct ServiceStatus with raw strings instead of ServiceHealth enum members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add comprehensive integration tests that prove the user contract holds: models serve inference, tool calling works, LiteLLM routing works, and coding agents can connect via OpenAI client.

- Tier 1 (catalog_validation): Validates all HF repo URLs exist and catalog entries are consistent. Runs in CI on every PR; would have caught issue #15 (qwen3.5-8b 404).
- Tier 2 (smoke): Per-model inference, tool calling, and thinking validation parameterized across the full catalog. Runs nightly.
- Tier 3 (integration): Full stack lifecycle: init, pull, up, LiteLLM routing, concurrent requests, clean shutdown. Runs pre-release.
- Tier 4 (harness): OpenAI Python client compatibility: chat, streaming, model listing, tool calling, multi-turn. Validates what aider/OpenCode/Continue/Claude Code use under the hood. Runs pre-release.

Shared fixtures provide dynamic port allocation, persistent model cache, service lifecycle management with guaranteed cleanup, and skip decorators for platform/memory/dependency checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
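As an illustration of the dynamic port allocation mentioned above, one common approach is to ask the OS for an ephemeral port by binding to port 0. The sketch below is an assumption about how such a fixture might work, not the code in this PR; `find_free_port` is a hypothetical name.

```python
import socket
from contextlib import closing


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host`.

    Binding to port 0 lets the kernel pick an ephemeral port. The socket is
    closed before returning, so another process could grab the port in the
    meantime; callers should be prepared to retry on a bind failure.
    """
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

A test fixture could call this once per service and pass the result to the server under test, which avoids collisions between parallel test runs on fixed ports.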

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces the 5-command flow (profile → recommend → init → pull → up) with a single guided experience that walks users through hardware detection, model selection, and stack startup.

Key components:

- discovery.py: Queries mlx-community HuggingFace API for text-generation models and merges with static benchmark_data.json for performance and quality overlay. Falls back to benchmark-only models when offline.
- onboarding.py: Orchestration logic: scoring, memory-budget filtering, default selection, tier assignment, config generation, model download, and stack startup. Uses same intent weights as the existing scoring engine but operates on DiscoveredModel instead of CatalogEntry.
- cli/setup.py: Interactive 6-step CLI with Rich display. Supports --accept-defaults for non-interactive mode, --intent flag, --budget-pct, quant override syntax (e.g. 1:int8,3), and optional LaunchAgent install.
- benchmark_data.json: Static export from mlx_transformers_benchmark with 69 model entries across 3 hardware profiles (M4 Pro 24/64GB, M5 Max 128GB).

Includes 60 behavioral unit tests that assert outcomes (models returned, scores produced, tiers assigned, CLI output), not implementation details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
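The commit message gives only one example of the quant override syntax, `1:int8,3`. The sketch below shows one plausible reading, in which each comma-separated item is a model index optionally followed by `:quant`. Both that interpretation and the function name `parse_quant_overrides` are assumptions, not the behavior of cli/setup.py.

```python
from typing import Dict, Optional


def parse_quant_overrides(spec: str) -> Dict[int, Optional[str]]:
    """Parse a selection string such as "1:int8,3".

    Assumed meaning: "1:int8" selects model 1 with an int8 quant override,
    and a bare "3" selects model 3 with its default quant (None here).
    """
    overrides: Dict[int, Optional[str]] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas and whitespace
        index_text, sep, quant = item.partition(":")
        index = int(index_text)  # raises ValueError on a non-numeric index
        if sep and not quant.strip():
            raise ValueError(f"missing quant after ':' in {item!r}")
        overrides[index] = quant.strip() if sep else None
    return overrides
```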

Replace mock.assert_called/assert_not_called with behavioral assertions across test_cli_pull, test_watchdog, test_launchd, and test_cli_status. Each removed assertion was redundant: the return value or observable output already proved the behavior.

Key changes:

- test_retry_on_first_failure now asserts download succeeds with "Download complete" output instead of checking mock call count
- Remove test_health_check_uses_correct_paths (pure mock.call_args inspection, behavior covered by test_five_distinct_states)
- Simplify litellm port extraction in cli/setup.py
- Fix onboarding.py docstring that incorrectly claimed no Rich dep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
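To make the distinction concrete, here is a small self-contained sketch of a behavioral retry test. `download_with_retry` is a hypothetical stand-in for the code under test, not the function in this repository; only the test name and the "Download complete" message come from the commit message.

```python
import io
from unittest import mock


def download_with_retry(fetch, repo: str, out, attempts: int = 2) -> bool:
    """Hypothetical stand-in: call fetch(repo), retrying on OSError."""
    for _ in range(attempts):
        try:
            fetch(repo)
        except OSError:
            continue  # transient failure, try again
        out.write("Download complete\n")
        return True
    out.write("Download failed\n")
    return False


def test_retry_on_first_failure() -> None:
    # The first call raises, the second succeeds.
    fetch = mock.Mock(side_effect=[OSError("network blip"), None])
    out = io.StringIO()

    ok = download_with_retry(fetch, "org/model", out)

    # Behavioral assertions: check what the user would observe,
    # rather than fetch.call_count or fetch.assert_called().
    assert ok is True
    assert "Download complete" in out.getvalue()
```

A call-count assertion would still pass if the success message were never printed; asserting on the return value and output catches that regression.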

The DeepSeek-R1-0528-Qwen3-32B repos on HuggingFace require authentication (401), so the catalog entry needs gated: true for CI catalog validation to pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scanned every non-gated catalog entry against HuggingFace API and found 7 more repos returning 401 (auth required): nemotron-49b, nemotron-8b, qwen3.5-{3b,8b,14b,32b,72b}. All Qwen3.5 and Nemotron models on mlx-community are now gated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
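The scan described above can be sketched as a pure function over already-collected HTTP status codes. The catalog schema (`id`, `repo`, `gated` keys) and the function name are assumptions for illustration, and the status codes are passed in rather than fetched so the example does not depend on any particular HuggingFace API call.

```python
from typing import Any, Iterable, List, Mapping


def find_missing_gated_flags(
    catalog: Iterable[Mapping[str, Any]],
    status_by_repo: Mapping[str, int],
) -> List[str]:
    """Return ids of entries not marked gated whose repo answered 401.

    `catalog` entries are assumed to look like
    {"id": ..., "repo": ..., "gated": bool}; this schema is hypothetical.
    """
    missing: List[str] = []
    for entry in catalog:
        if entry.get("gated", False):
            continue  # already marked gated, so a 401 is expected
        if status_by_repo.get(entry["repo"]) == 401:
            missing.append(entry["id"])
    return missing
```

Each id returned would then need `gated: true` in the catalog for Tier 1 validation to pass.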

weklund · 2026-04-03T14:30:48Z

Closing: branch is stale (most commits already merged via #13, #16). Cherry-picking the useful last commit (78df7e2) onto a fresh branch.

weklund and others added 10 commits April 2, 2026 12:30

chore: add REQUIREMENTS.md to gitignore

229d87c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' into fix/test-behavior-assertions

e914168

fix: mark deepseek-r1-32b as gated in catalog

a53f24c

fix: failing smoke tests

78df7e2

weklund closed this Apr 3, 2026

weklund mentioned this pull request Apr 3, 2026

fix: disable continuous_batching and harden smoke tests #25

Open

3 tasks

weklund wants to merge 10 commits into main from fix/test-behavior-assertions
