Commit message:

Authors the complete SKLD-bench v2.1 family for elixir-ecto-sandbox-test per the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Fifth family shipped this morning. The drafting subagent authored 151 challenges, 15 test fixtures, and 13 golden references cleanly before hitting the Max subscription rate limit at the metadata+evaluation layer. This commit completes the family with hand-authored family.json, seed.json, score.py, criteria.json, environment.yml, and _calibration.json (generated post-hoc).

Pool stats:

- 151 total challenges (rich curve target 150; overshot by 1)
- Tier distribution: 37 easy / 44 medium / 43 hard / 27 legendary
- 10 capabilities + 1 foundation = 11 dimensions covered
- 15 test fixtures, 13 golden references
- 28 challenges held out (~18% balanced across tiers)

This was the "ugly" pain point named in BoothIQ's vibe-coded Elixir post-mortem. Every capability maps to a specific Claude failure mode from the research dossier.

Score.py: static analyzer for Ecto sandbox test isolation. Detects:

- Canonical start_owner!/stop_owner lifecycle
- shared: not tags[:async] dispatcher pattern
- Sandbox.allow/3 for spawned processes
- Phoenix.Ecto.SQL.Sandbox plug usage
- use Oban.Testing + perform_job/3 (not drain_queue in sandbox)
- Anti-patterns: Sandbox.mode(_, :auto), Process.sleep workarounds, Oban.drain_queue in sandbox, seeding test DB to force passes, async: false as an ownership-bug workaround

Score.py validation:

- Sanity (golden easy-01 canonical DataCase): 1.0 (above 0.9 target)
- Discrimination (buggy_data_case fixture): 0.4687 (well below 0.7 pass)
- Empty file: 0.087 (well below 0.3 target)

Discrimination headroom: 0.53, which is excellent for fitness comparison.

Tier methodology: heuristic per SEEDING-PLAN.md item 4. Research: 47 citations across 11 capabilities (see research.md).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SKLD-bench v2.1 challenge pool: elixir-ecto-sandbox-test
Fifth of 7 families being shipped this morning (ecto-schema-changeset #9, ecto-query-writer #10, oban-worker #11, security-linter #12).
Partial-state recovery note
The overnight drafting subagent authored 151 challenges + 15 fixtures + 13 goldens cleanly before hitting the Max subscription rate limit at the metadata+evaluation layer. This PR completes the family by adding the six missing files:
- `family.json` — metadata, capability list, held_out_ids (hand-authored)
- `seed.json` — gen 0 SkillGenome with 11 starter variants (hand-authored)
- `evaluation/score.py` — sandbox pattern scanner (hand-authored)
- `evaluation/criteria.json` — per-capability weights (hand-authored)
- `evaluation/environment.yml` — deps (hand-authored)
- `challenges/_calibration.json` — generated post-hoc from the actual challenge pool

Pool stats

- 151 total challenges (rich curve target 150; overshot by 1)
- Tier distribution: 37 easy / 44 medium / 43 hard / 27 legendary
- 10 capabilities + 1 foundation = 11 dimensions covered
- 15 test fixtures, 13 golden references
- 28 challenges held out (~18% balanced across tiers)
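As a minimal sketch of the post-hoc step, `challenges/_calibration.json` could be regenerated by aggregating the pool on disk. This assumes each `challenges/*.json` carries a `tier` field; the real schema and generation script are not shown in this PR:

```python
import json
from collections import Counter
from pathlib import Path

def build_calibration(challenge_dir: str) -> dict:
    """Aggregate per-tier counts from the challenge pool.

    Hypothetical sketch: assumes each challenges/*.json has a "tier"
    field; the actual _calibration.json layout may differ.
    """
    tiers = Counter()
    for path in sorted(Path(challenge_dir).glob("*.json")):
        if path.name.startswith("_"):
            continue  # skip _calibration.json itself and other metadata
        challenge = json.loads(path.read_text())
        tiers[challenge["tier"]] += 1
    return {
        "total": sum(tiers.values()),
        "tier_distribution": dict(tiers),
    }
```

Deriving calibration from the pool rather than hand-writing it keeps the two from drifting when challenges are added or held out.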
Why this family matters
`elixir-ecto-sandbox-test` is the "ugly" pain point named directly in BoothIQ's "150k lines of vibe-coded Elixir" post-mortem. Every capability maps to a specific Claude failure mode documented in the research dossier (47 citations). The `tidewave-dev-vs-test-trap` capability is named after the specific bug where Claude reads from the Tidewave dev DB connection and thinks it's looking at the test DB.

Score.py validation
- Sanity (golden `easy-01` canonical DataCase): 1.0 (above the 0.9 target)
- Discrimination (`buggy_data_case.ex` fixture): 0.4687 (well below the 0.7 pass line)
- Empty file: 0.087 (well below the 0.3 target)

Discrimination headroom: 0.53, excellent for fitness comparison.
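The three gates can be expressed as a small check with the thresholds above hard-coded. This is a hypothetical harness, not the repo's actual validation script; the real entrypoint of `score.py` is not shown here:

```python
def validate_scorer(sanity: float, buggy: float, empty: float) -> dict:
    """Check score.py calibration against the three release gates.

    Hypothetical sketch: takes the three fixture scores as inputs
    rather than invoking score.py directly.
    """
    results = {
        "sanity": sanity >= 0.9,        # golden easy-01 must score high
        "discrimination": buggy < 0.7,  # buggy fixture must fail the gate
        "empty_floor": empty < 0.3,     # empty file must score near zero
        "headroom": round(sanity - buggy, 4),  # golden-vs-buggy gap
    }
    results["pass"] = all(v for k, v in results.items() if k != "headroom")
    return results
```

Feeding in the reported scores (1.0, 0.4687, 0.087) passes all three gates with a headroom of ~0.53.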
Score.py detection surface
Canonical patterns (positive signals):
- `Ecto.Adapters.SQL.Sandbox.start_owner!` + `stop_owner` lifecycle
- `shared: not tags[:async]` dispatcher
- `Sandbox.allow/3` for spawned processes / Tasks
- `Phoenix.Ecto.SQL.Sandbox` plug (LiveView integration)
- `use Oban.Testing, repo:` + `perform_job/3` (not `drain_queue`)
- `{:shared, self()}` tuple when fallback is necessary

Anti-patterns (penalized):
- `Sandbox.mode(Repo, :auto)` in test code
- `async: false` as an ownership-bug workaround
- `Process.sleep` to "wait for" connection transfer
- `Oban.drain_queue` in sandbox context
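A sketch of the kind of static scan Score.py performs over test source. The regexes here are illustrative assumptions; the analyzer's real rules and per-capability weights live in `evaluation/criteria.json`:

```python
import re

# Illustrative subset of the detection surface; patterns are assumptions,
# not the actual rules shipped in evaluation/score.py.
CANONICAL = {
    "start_owner_lifecycle": re.compile(r"Sandbox\.start_owner!"),
    "shared_dispatcher": re.compile(r"shared:\s*not\s+tags\[:async\]"),
    "explicit_allow": re.compile(r"Sandbox\.allow\("),
}
ANTI = {
    "auto_mode_in_tests": re.compile(r"Sandbox\.mode\([^,]+,\s*:auto\)"),
    "sleep_workaround": re.compile(r"Process\.sleep\("),
    "drain_in_sandbox": re.compile(r"Oban\.drain_queue\("),
}

def scan(source: str) -> dict:
    """Report which canonical and anti-patterns appear in Elixir test source."""
    return {
        "canonical": [k for k, rx in CANONICAL.items() if rx.search(source)],
        "anti": [k for k, rx in ANTI.items() if rx.search(source)],
    }
```

A scanner like this stays purely lexical, which is what makes the buggy fixture useful for discrimination: it compiles and passes its own tests but still trips the anti-pattern rules.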
Research provenance

47 citations across 11 capabilities (see `research.md`).
Tier methodology
Heuristic per SEEDING-PLAN.md item 4.
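For illustration, the ~18% tier-balanced hold-out (28 of 151) could be drawn per tier as below. This is a hypothetical sketch; the actual `held_out_ids` in `family.json` may have been selected differently:

```python
import random

def select_held_out(challenges, fraction=0.18, seed=0):
    """Hold out ~fraction of challenge ids, balanced per tier.

    Hypothetical sketch of the balanced hold-out described in the pool
    stats; assumes each challenge dict has "id" and "tier" keys.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_tier = {}
    for c in challenges:
        by_tier.setdefault(c["tier"], []).append(c["id"])
    held = []
    for tier, ids in sorted(by_tier.items()):
        k = max(1, round(len(ids) * fraction))  # at least one per tier
        held.extend(rng.sample(ids, k))
    return sorted(held)
```

With the stated tier distribution (37/44/43/27) and fraction 0.18, the per-tier rounds come out to 7+8+8+5 = 28 held-out challenges, matching the pool stats.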
🤖 Generated with Claude Code