Closed
…havior in tests

Introduces ServiceHealth StrEnum for the 5 service health states (healthy, degraded, down, crashed, stopped), replacing raw strings across stack_status.py, cli/status.py, and watchdog.py. This gives type safety: pyright catches invalid status values at check time.

Rewrites 5 tests that were asserting implementation (mock was called) instead of behavior (correct output shown). Each rewritten test was verified by inducing a bug that slips through the old test but is caught by the new one:

- test_pull_force_redownloads: asserts "is ready" in output, not "already exists" (catches force flag being ignored)
- test_pull_with_bench_flag: asserts benchmark output is displayed (catches silent benchmark with no output)
- test_successful_download: asserts correct repo/path passed and completion message printed (catches wrong model downloaded)
- test_table_shows_tier_data: uses mixed statuses per tier, asserts each distinct status appears (catches display showing same status for all tiers)
- test_fresh_install: asserts plist contains correct binary path and label (catches plist written with wrong binary)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
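A minimal sketch of what a string-backed health enum like the one described above could look like. The five member values come from the commit message; the member names, the `parse_health` helper, and the fallback for interpreters older than Python 3.11 are assumptions for illustration, not the project's actual code.

```python
try:
    from enum import StrEnum  # available in Python 3.11+
except ImportError:  # assumed fallback for older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        """Minimal stand-in: members compare equal to their string values."""


class ServiceHealth(StrEnum):
    # The five states named in the commit message.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"
    CRASHED = "crashed"
    STOPPED = "stopped"


def parse_health(raw: str) -> ServiceHealth:
    """Hypothetical helper: convert a raw string at a boundary.

    Raises ValueError for anything that is not one of the five states,
    so typos fail loudly instead of flowing through as plain strings.
    """
    return ServiceHealth(raw)
```

Because members are also `str` instances, comparisons such as `ServiceHealth.DOWN == "down"` still hold, which keeps a migration away from raw strings incremental. A type checker can then flag a call site that passes a plain `str` where `ServiceHealth` is annotated.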
Adds a Makefile with targets matching exactly what CI runs: install, lint, typecheck, test, and check (all three). Updates ci.yml and release-please.yml to use make targets instead of inline commands. Developers, AI agents, and CI now run the same commands: running make check locally guarantees CI will pass.

Also fixes pyright errors in test files that construct ServiceStatus with raw strings instead of ServiceHealth enum members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive integration tests that prove the user contract holds: models serve inference, tool calling works, LiteLLM routing works, and coding agents can connect via OpenAI client.

Tier 1 (catalog_validation): Validates all HF repo URLs exist and catalog entries are consistent. Runs in CI on every PR; it would have caught issue #15 (qwen3.5-8b 404).

Tier 2 (smoke): Per-model inference, tool calling, and thinking validation parameterized across the full catalog. Runs nightly.

Tier 3 (integration): Full stack lifecycle: init, pull, up, LiteLLM routing, concurrent requests, clean shutdown. Runs pre-release.

Tier 4 (harness): OpenAI Python client compatibility: chat, streaming, model listing, tool calling, multi-turn. Validates what aider/OpenCode/Continue/Claude Code use under the hood. Runs pre-release.

Shared fixtures provide dynamic port allocation, persistent model cache, service lifecycle management with guaranteed cleanup, and skip decorators for platform/memory/dependency checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
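The dynamic port allocation mentioned for the shared fixtures is commonly done by asking the operating system for an ephemeral port. The sketch below shows that general technique using only the standard library; the function name `find_free_port` is hypothetical and this is not taken from the project's fixtures.

```python
import socket
from contextlib import closing


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host`.

    Binding to port 0 lets the kernel pick an ephemeral port. The socket
    is closed before returning, so another process could in principle
    claim the port before the caller binds it.
    """
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

A test fixture could call this once per service and pass the result to the server under test, which avoids collisions with fixed ports when suites run in parallel. The short window between closing the probe socket and starting the service is a known trade-off of this approach.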
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the 5-command flow (profile → recommend → init → pull → up) with a single guided experience that walks users through hardware detection, model selection, and stack startup.

Key components:

- discovery.py: Queries mlx-community HuggingFace API for text-generation models and merges with static benchmark_data.json for performance and quality overlay. Falls back to benchmark-only models when offline.
- onboarding.py: Orchestration logic: scoring, memory-budget filtering, default selection, tier assignment, config generation, model download, and stack startup. Uses same intent weights as the existing scoring engine but operates on DiscoveredModel instead of CatalogEntry.
- cli/setup.py: Interactive 6-step CLI with Rich display. Supports --accept-defaults for non-interactive mode, --intent flag, --budget-pct, quant override syntax (e.g. 1:int8,3), and optional LaunchAgent install.
- benchmark_data.json: Static export from mlx_transformers_benchmark with 69 model entries across 3 hardware profiles (M4 Pro 24/64GB, M5 Max 128GB).

Includes 60 behavioral unit tests that assert outcomes (models returned, scores produced, tiers assigned, CLI output) not implementation details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
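To illustrate the memory-budget filtering step, here is a rough sketch of how a percentage budget such as `--budget-pct` might be applied before ranking by score. The `Candidate` type, its fields, and `filter_by_budget` are hypothetical stand-ins; the real code operates on DiscoveredModel and may compute memory and scores differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a discovered model."""
    name: str
    memory_gb: float  # assumed estimate of memory needed to serve the model
    score: float      # assumed intent-weighted score, higher is better


def filter_by_budget(
    candidates: List[Candidate], total_memory_gb: float, budget_pct: float
) -> List[Candidate]:
    """Keep candidates that fit within budget_pct of total memory.

    Survivors are returned best score first, so the first element is a
    natural default selection.
    """
    budget_gb = total_memory_gb * budget_pct / 100
    fitting = [c for c in candidates if c.memory_gb <= budget_gb]
    return sorted(fitting, key=lambda c: c.score, reverse=True)
```

With illustrative numbers, a 24 GB machine and a 50% budget leave 12 GB, so an 18 GB candidate is dropped even if it has the highest score, and the remaining candidates are ordered by score.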
Replace mock.assert_called/assert_not_called with behavioral assertions across test_cli_pull, test_watchdog, test_launchd, and test_cli_status. Each removed assertion was redundant: the return value or observable output already proved the behavior.

Key changes:

- test_retry_on_first_failure now asserts download succeeds with "Download complete" output instead of checking mock call count
- Remove test_health_check_uses_correct_paths (pure mock.call_args inspection, behavior covered by test_five_distinct_states)
- Simplify litellm port extraction in cli/setup.py
- Fix onboarding.py docstring that incorrectly claimed no Rich dep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
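The difference between asserting on mock call counts and asserting on behavior can be sketched as follows. Everything here is hypothetical: `download_with_retry`, `make_flaky_fetch`, and `run_and_capture` are illustrative stand-ins, and only the "Download complete" message is taken from the commit message.

```python
import io
from contextlib import redirect_stdout
from typing import Callable, Tuple


def download_with_retry(fetch: Callable[[], None], attempts: int = 2) -> bool:
    """Hypothetical function under test: call fetch, retrying on OSError."""
    for attempt in range(1, attempts + 1):
        try:
            fetch()
        except OSError:
            if attempt == attempts:
                print("Download failed")
                return False
            continue
        print("Download complete")
        return True
    return False


def make_flaky_fetch(failures: int) -> Callable[[], None]:
    """Build a fetch callable that raises OSError `failures` times, then succeeds."""
    remaining = [failures]

    def fetch() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OSError("simulated network error")

    return fetch


def run_and_capture(fetch: Callable[[], None]) -> Tuple[bool, str]:
    """Run the download and return (result, everything printed to stdout)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        ok = download_with_retry(fetch)
    return ok, buffer.getvalue()
```

A behavioral test then checks what the user would see, for example that `run_and_capture(make_flaky_fetch(1))` returns `True` with "Download complete" in the output. A retry loop that is rewritten to call `fetch` a different number of times still passes, while one that swallows the success message fails.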
The DeepSeek-R1-0528-Qwen3-32B repos on HuggingFace require authentication (401), so the catalog entry needs gated: true for CI catalog validation to pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scanned every non-gated catalog entry against HuggingFace API and
found 7 more repos returning 401 (auth required): nemotron-49b,
nemotron-8b, qwen3.5-{3b,8b,14b,32b,72b}. All Qwen3.5 and
Nemotron models on mlx-community are now gated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
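The gating rule these two commits rely on can be sketched as a pure function over an entry's `gated` flag and the HTTP status observed for its repo. The function name, its signature, and the error wording are assumptions; the actual catalog validation may apply different or additional checks.

```python
from typing import Optional


def check_entry(name: str, gated: bool, status: int) -> Optional[str]:
    """Return an error message for a catalog entry, or None if it passes.

    Assumed rule: 200 always passes, 401 passes only when the entry is
    marked gated, and 404 or any other status is reported as an error.
    """
    if status == 200:
        return None
    if status == 401:
        if gated:
            return None
        return f"{name}: repo requires auth (401); mark the entry gated: true"
    if status == 404:
        return f"{name}: repo not found (404)"
    return f"{name}: unexpected HTTP status {status}"
```

Keeping the rule separate from the network call makes it easy to unit test without hitting HuggingFace: the caller performs the request and passes in the status code.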
No description provided.