Fix/test behavior assertions #23

Closed
weklund wants to merge 10 commits into main from fix/test-behavior-assertions

Conversation

@weklund
Owner

@weklund weklund commented Apr 3, 2026

No description provided.

weklund and others added 10 commits April 2, 2026 12:30
…havior in tests

Introduces ServiceHealth StrEnum for the 5 service health states
(healthy, degraded, down, crashed, stopped), replacing raw strings
across stack_status.py, cli/status.py, and watchdog.py. This gives
type safety — pyright catches invalid status values at check time.
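A minimal sketch of what such an enum could look like. The five values come from the commit message; the member names, the `is_running` helper, and the fallback for interpreters older than Python 3.11 are assumptions for illustration, not the project's actual code.

```python
try:
    from enum import StrEnum  # available in Python 3.11+
except ImportError:  # assumed fallback for older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        def __str__(self) -> str:
            return str(self.value)


class ServiceHealth(StrEnum):
    # Values are from the commit message; member names are assumed.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"
    CRASHED = "crashed"
    STOPPED = "stopped"


def is_running(health: ServiceHealth) -> bool:
    # Hypothetical helper: which states count as "running" is an assumption.
    return health in (ServiceHealth.HEALTHY, ServiceHealth.DEGRADED)


# Members still compare equal to the raw strings they replace,
# so existing string comparisons keep working during a migration.
assert ServiceHealth.DEGRADED == "degraded"
```

With an annotation like `health: ServiceHealth`, a call such as `is_running("healthy")` is something a type checker can flag, while `ServiceHealth("degarded")` raises `ValueError` at runtime.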

Rewrites 5 tests that were asserting implementation (mock was called)
instead of behavior (correct output shown). Each rewritten test was
verified by inducing a bug that slips through the old test but is
caught by the new one:

- test_pull_force_redownloads: asserts "is ready" in output, not
  "already exists" (catches force flag being ignored)
- test_pull_with_bench_flag: asserts benchmark output is displayed
  (catches silent benchmark with no output)
- test_successful_download: asserts correct repo/path passed and
  completion message printed (catches wrong model downloaded)
- test_table_shows_tier_data: uses mixed statuses per tier, asserts
  each distinct status appears (catches display showing same status
  for all tiers)
- test_fresh_install: asserts plist contains correct binary path and
  label (catches plist written with wrong binary)
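The difference between the two assertion styles can be sketched as follows. `pull`, `buggy_pull`, and their parameters are hypothetical stand-ins, not the project's CLI, and the induced bug here is only an illustration of the verification approach described above.

```python
import io
from unittest.mock import Mock


def pull(model, *, force, exists, download, out):
    """Hypothetical stand-in for a pull command."""
    if exists(model) and not force:
        out.write(f"{model} already exists\n")
        return
    download(model)
    out.write(f"{model} is ready\n")


def buggy_pull(model, *, force, exists, download, out):
    """Induced bug: downloads, but always reports the wrong message."""
    download(model)
    out.write(f"{model} already exists\n")


def implementation_test(pull_fn):
    # Old style: only proves that the mock was invoked.
    download = Mock()
    pull_fn("m", force=True, exists=lambda _: True,
            download=download, out=io.StringIO())
    download.assert_called_once()


def behavior_test(pull_fn):
    # New style: asserts on what the user would actually see.
    out = io.StringIO()
    pull_fn("m", force=True, exists=lambda _: True,
            download=Mock(), out=out)
    assert "is ready" in out.getvalue()
    assert "already exists" not in out.getvalue()
```

Here `implementation_test(buggy_pull)` passes even though the output is wrong, while `behavior_test(buggy_pull)` fails with an `AssertionError`.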

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds a Makefile with targets matching exactly what CI runs: install,
lint, typecheck, test, and check (all three). Updates ci.yml and
release-please.yml to use make targets instead of inline commands.

Developers, AI agents, and CI now run the same commands — running
make check locally guarantees CI will pass. Also fixes pyright errors
in test files that construct ServiceStatus with raw strings instead
of ServiceHealth enum members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add comprehensive integration tests that prove the user contract holds:
models serve inference, tool calling works, LiteLLM routing works, and
coding agents can connect via OpenAI client.

Tier 1 (catalog_validation): Validates all HF repo URLs exist and
catalog entries are consistent. Runs in CI on every PR — would have
caught issue #15 (qwen3.5-8b 404).

Tier 2 (smoke): Per-model inference, tool calling, and thinking
validation parameterized across the full catalog. Runs nightly.

Tier 3 (integration): Full stack lifecycle — init, pull, up, LiteLLM
routing, concurrent requests, clean shutdown. Runs pre-release.

Tier 4 (harness): OpenAI Python client compatibility — chat, streaming,
model listing, tool calling, multi-turn. Validates what aider/OpenCode/
Continue/Claude Code use under the hood. Runs pre-release.

Shared fixtures provide dynamic port allocation, persistent model cache,
service lifecycle management with guaranteed cleanup, and skip decorators
for platform/memory/dependency checks.
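Dynamic port allocation and guaranteed cleanup are commonly done along these lines. This is a sketch under assumptions: `find_free_port` and `running_service` are illustrative names, and the `start`/`stop` callables stand in for whatever the real fixtures use to launch and terminate a service.

```python
import socket
from contextlib import closing, contextmanager


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


@contextmanager
def running_service(start, stop):
    """Start a service on a free port and always stop it afterwards."""
    port = find_free_port()
    handle = start(port)
    try:
        yield port
    finally:
        stop(handle)  # runs even if the test body raises
```

Note that the port is released before the service binds to it, so another process could claim it in between; for test suites this small race is usually acceptable.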

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces the 5-command flow (profile → recommend → init → pull → up)
with a single guided experience that walks users through hardware
detection, model selection, and stack startup.

Key components:

- discovery.py: Queries mlx-community HuggingFace API for text-generation
  models and merges with static benchmark_data.json for performance and
  quality overlay. Falls back to benchmark-only models when offline.

- onboarding.py: Orchestration logic — scoring, memory-budget filtering,
  default selection, tier assignment, config generation, model download,
  and stack startup. Uses same intent weights as the existing scoring
  engine but operates on DiscoveredModel instead of CatalogEntry.

- cli/setup.py: Interactive 6-step CLI with Rich display. Supports
  --accept-defaults for non-interactive mode, --intent flag, --budget-pct,
  quant override syntax (e.g. 1:int8,3), and optional LaunchAgent install.

- benchmark_data.json: Static export from mlx_transformers_benchmark with
  69 model entries across 3 hardware profiles (M4 Pro 24/64GB, M5 Max 128GB).
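The quant override syntax mentioned for cli/setup.py could be parsed roughly as below. The interpretation is an assumption: each comma-separated item is taken to be a 1-based model index with an optional `:quant` suffix, so `1:int8,3` would mean "model 1 at int8, model 3 at its default". `parse_selection` is a hypothetical name, not a function from the PR.

```python
from typing import Dict, Optional


def parse_selection(spec: str) -> Dict[int, Optional[str]]:
    """Parse a spec such as '1:int8,3' into {1: 'int8', 3: None}."""
    result: Dict[int, Optional[str]] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        index_text, sep, quant = part.partition(":")
        index = int(index_text)  # raises ValueError on non-numeric input
        if sep and not quant.strip():
            raise ValueError(f"missing quant after ':' in {part!r}")
        # None means "use the default quantization for this model".
        result[index] = quant.strip() if sep else None
    return result
```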

Includes 60 behavioral unit tests that assert outcomes (models returned,
scores produced, tiers assigned, CLI output) not implementation details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace mock.assert_called/assert_not_called with behavioral
assertions across test_cli_pull, test_watchdog, test_launchd, and
test_cli_status. Each removed assertion was redundant — the return
value or observable output already proved the behavior.

Key changes:
- test_retry_on_first_failure now asserts download succeeds with
  "Download complete" output instead of checking mock call count
- Remove test_health_check_uses_correct_paths (pure mock.call_args
  inspection, behavior covered by test_five_distinct_states)
- Simplify litellm port extraction in cli/setup.py
- Fix onboarding.py docstring that incorrectly claimed no Rich dep
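A retry test in this style might look like the sketch below. Only the test name and the "Download complete" message come from the commit message; `download_with_retry`, `flaky_fetch_factory`, and the use of `OSError` are assumptions made for illustration.

```python
import io


def download_with_retry(fetch, out, attempts=2):
    """Hypothetical retry wrapper: retries `fetch` and reports the outcome."""
    for attempt in range(1, attempts + 1):
        try:
            fetch()
        except OSError as exc:
            if attempt == attempts:
                out.write(f"Download failed: {exc}\n")
                return False
        else:
            out.write("Download complete\n")
            return True
    return False


def flaky_fetch_factory(failures):
    """Return a fetch callable that raises `failures` times, then succeeds."""
    remaining = [failures]

    def fetch():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OSError("connection reset")

    return fetch


def test_retry_on_first_failure():
    out = io.StringIO()
    ok = download_with_retry(flaky_fetch_factory(1), out)
    # Assert on the observable result, not on how many calls were made.
    assert ok
    assert "Download complete" in out.getvalue()
```

The fake fails once and then succeeds, so the test passes only if the retry actually happened, without inspecting any call count.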

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The DeepSeek-R1-0528-Qwen3-32B repos on HuggingFace require
authentication (401), so the catalog entry needs gated: true for
CI catalog validation to pass.
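The rule described here can be sketched as a small check. The `CatalogEntry` fields and the `status_of` callable (which would wrap the real HTTP lookup) are hypothetical; the actual catalog schema and validator may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CatalogEntry:
    # Hypothetical shape; the real catalog schema may differ.
    name: str
    repo: str
    gated: bool = False


def validate_entry(entry: CatalogEntry,
                   status_of: Callable[[str], int]) -> List[str]:
    """Return a list of problems for one entry (empty means valid)."""
    status = status_of(entry.repo)
    if status == 200:
        return []
    if status == 401:
        # Auth-required repos are acceptable only when marked as gated.
        if entry.gated:
            return []
        return [f"{entry.name}: repo requires auth; mark it gated: true"]
    return [f"{entry.name}: repo returned HTTP {status}"]
```

Passing the status lookup in as a callable keeps the rule testable without network access.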

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scanned every non-gated catalog entry against HuggingFace API and
found 7 more repos returning 401 (auth required): nemotron-49b,
nemotron-8b, qwen3.5-{3b,8b,14b,32b,72b}. All Qwen3.5 and
Nemotron models on mlx-community are now gated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@weklund
Owner Author

weklund commented Apr 3, 2026

Closing: branch is stale (most commits already merged via #13, #16). Cherry-picking the useful last commit (78df7e2) onto a fresh branch.
