
seed(elixir-ecto-sandbox-test): SKLD-bench v2.1 challenge pool (151 challenges)#13

Merged
ty13r merged 1 commit into main from seed/elixir-ecto-sandbox-test on Apr 11, 2026

Conversation


@ty13r ty13r commented Apr 11, 2026

SKLD-bench v2.1 challenge pool: elixir-ecto-sandbox-test

Fifth of the 7 families shipping this morning; the first four were ecto-schema-changeset (#9), ecto-query-writer (#10), oban-worker (#11), and security-linter (#12).

Partial-state recovery note

The overnight drafting subagent authored 151 challenges + 15 fixtures + 13 goldens cleanly before hitting the Max subscription rate limit at the metadata+evaluation layer. This PR completes the family by adding the six missing files:

  • family.json — metadata, capability list, held_out_ids (hand-authored)
  • seed.json — gen 0 SkillGenome with 11 starter variants (hand-authored)
  • evaluation/score.py — sandbox pattern scanner (hand-authored)
  • evaluation/criteria.json — per-capability weights (hand-authored)
  • evaluation/environment.yml — deps (hand-authored)
  • challenges/_calibration.json — generated post-hoc from actual challenge pool

Pool stats

  • Total challenges: 151 (rich curve target 150 — overshot by 1)
  • Tier distribution: easy 37 / medium 44 / hard 43 / legendary 27
  • Held-out: 28 balanced across tiers
  • Capability coverage: 10 capabilities + 1 foundation = 11 dimensions

Why this family matters

elixir-ecto-sandbox-test is the "ugly" pain point named directly in BoothIQ's "150k lines of vibe-coded Elixir" post-mortem:

"It can't debug concurrent test failures. It doesn't understand that each test runs in an isolated transaction... Claude doesn't understand this."

Every capability maps to a specific Claude failure mode documented in the research dossier (47 citations). The tidewave-dev-vs-test-trap capability is named after the specific bug where Claude reads from the Tidewave dev DB connection and thinks it's looking at the test DB.
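
To make the trap concrete: under the SQL sandbox, each test's writes live inside a transaction that rolls back when the test exits, so inspecting the database over any other connection (the dev Repo, a psql session, or Tidewave's dev connection, which points at a different database entirely) shows none of them. A minimal sketch, assuming placeholder MyApp.Repo and MyApp.Post modules rather than anything in this pool:

```elixir
defmodule MyApp.SandboxVisibilityTest do
  # Placeholder names throughout; this only illustrates why reading the dev DB
  # (as Tidewave's tools do) says nothing about sandboxed test state.
  use ExUnit.Case, async: true

  alias Ecto.Adapters.SQL.Sandbox

  setup do
    pid = Sandbox.start_owner!(MyApp.Repo)
    on_exit(fn -> Sandbox.stop_owner(pid) end)
    :ok
  end

  test "writes stay inside this test's sandbox transaction" do
    MyApp.Repo.insert!(%MyApp.Post{title: "only visible to this test"})

    # Visible here, on the connection that owns the open transaction...
    assert MyApp.Repo.aggregate(MyApp.Post, :count) == 1

    # ...but invisible elsewhere: the transaction is never committed, and the
    # Tidewave dev connection is reading a different database anyway.
  end
end
```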

Score.py validation

| Check | Result | Target | Status |
| --- | --- | --- | --- |
| Sanity (golden easy-01 canonical DataCase) | 1.0 | ≥ 0.9 | ✅ passes |
| Discrimination (buggy_data_case.ex fixture as output) | 0.4687 | < 0.7 (fail) | ✅ correctly fails |
| Empty file | 0.087 | ≤ 0.3 | ✅ passes |

Discrimination headroom (golden score minus buggy-fixture score): 1.0 − 0.4687 ≈ 0.53, excellent for fitness comparison.

Score.py detection surface

Canonical patterns (positive signals), with a sketch of the canonical shape after this list:

  • Ecto.Adapters.SQL.Sandbox.start_owner! + stop_owner lifecycle
  • shared: not tags[:async] dispatcher
  • Sandbox.allow/3 for spawned processes / Tasks
  • Phoenix.Ecto.SQL.Sandbox plug (LiveView integration)
  • use Oban.Testing, repo: + perform_job/3 (not drain_queue)
  • {:shared, self()} tuple when fallback is necessary
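
A minimal sketch of what those signals add up to, modeled on the stock Phoenix-generated DataCase; MyApp.Repo, MyApp.Post, and MyApp.DataCase are placeholder names, not files from this pool:

```elixir
defmodule MyApp.DataCase do
  # Canonical sandbox lifecycle: check a connection out per test, return it on exit.
  use ExUnit.CaseTemplate

  setup tags do
    # `shared: not tags[:async]` keeps async tests isolated while letting
    # non-async tests fall back to shared mode.
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(MyApp.Repo, shared: not tags[:async])
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end
end

defmodule MyApp.AsyncAllowanceTest do
  use MyApp.DataCase, async: true

  test "a spawned task shares this test's sandbox connection" do
    parent = self()

    task =
      Task.async(fn ->
        # Explicitly grant this child process access to the connection
        # owned by the test process.
        Ecto.Adapters.SQL.Sandbox.allow(MyApp.Repo, parent, self())
        MyApp.Repo.insert!(%MyApp.Post{title: "async"})
      end)

    assert %MyApp.Post{} = Task.await(task)
  end
end
```

The Phoenix.Ecto.SQL.Sandbox plug and the {:shared, self()} fallback follow the same ownership model and are omitted here to keep the sketch short.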

Anti-patterns (penalized), with a contrasting sketch after this list:

  • Sandbox.mode(Repo, :auto) in test code
  • async: false as an ownership-bug workaround
  • Process.sleep to "wait for" connection transfer
  • Oban.drain_queue in sandbox context
  • Seeding the test DB to force passing tests (the specific Claude failure from the research)
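
For contrast, a hedged sketch of the shapes the scanner penalizes, next to the canonical Oban call; module names and the worker are illustrative placeholders, and the worker is assumed to return :ok:

```elixir
defmodule MyApp.AntiPatternTest do
  # Illustrative only: each flagged line below is a shape score.py penalizes.
  use ExUnit.Case, async: false   # async: false used to dodge an ownership bug

  setup do
    # Forcing auto mode from test code defeats per-test transaction isolation.
    Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, :auto)
    :ok
  end

  test "job ran" do
    Oban.insert!(MyApp.Worker.new(%{id: 1}))
    # Draining the queue under the sandbox instead of calling perform_job/3.
    Oban.drain_queue(queue: :default)
    # Sleeping to "wait for" a connection transfer instead of Sandbox.allow/3.
    Process.sleep(100)
  end
end

defmodule MyApp.WorkerTest do
  use MyApp.DataCase, async: true
  # Oban.Testing provides perform_job/3, which exercises the worker directly
  # without depending on queue state.
  use Oban.Testing, repo: MyApp.Repo

  test "the canonical worker call" do
    assert :ok = perform_job(MyApp.Worker, %{id: 1})
  end
end
```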

Research provenance

47 citations across 11 capabilities. Key sources:

  • BoothIQ "150k lines of vibe-coded Elixir" post-mortem (the "ugly" clincher)
  • HN discussion of the BoothIQ article
  • Rakshan Shetty: Elixir Testing Patterns Ecto Sandbox
  • Elixir Forum threads on concurrent test failures

Tier methodology

Heuristic per SEEDING-PLAN.md item 4.

🤖 Generated with Claude Code

Commit message

Authors the complete SKLD-bench v2.1 family for elixir-ecto-sandbox-test
per the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Fifth family
shipped this morning. The drafting subagent authored 151 challenges,
15 test fixtures, and 13 golden references cleanly before hitting the
Max subscription rate limit at the metadata+evaluation layer. This commit
completes the family with hand-authored family.json, seed.json, score.py,
criteria.json, environment.yml, and _calibration.json (generated post-hoc).

Pool stats:
- 151 total challenges (rich curve target 150; overshot by 1)
- Tier distribution: 37 easy / 44 medium / 43 hard / 27 legendary
- 10 capabilities + 1 foundation = 11 dimensions covered
- 15 test fixtures, 13 golden references
- 28 challenges held out (~18% balanced across tiers)

This was the "ugly" pain point named in BoothIQ's vibe-coded Elixir
post-mortem. Every capability maps to a specific Claude failure mode
from the research dossier.

Score.py: static analyzer for Ecto sandbox test isolation. Detects:
- Canonical start_owner!/stop_owner lifecycle
- shared: not tags[:async] dispatcher pattern
- Sandbox.allow/3 for spawned processes
- Phoenix.Ecto.SQL.Sandbox plug usage
- use Oban.Testing + perform_job/3 (not drain_queue in sandbox)
- Anti-patterns: Sandbox.mode(_, :auto), Process.sleep workarounds,
  Oban.drain_queue in sandbox, seeding test DB to force passes,
  async: false as an ownership-bug workaround

Score.py validation:
- Sanity (golden easy-01 canonical DataCase): 1.0 (above 0.9 target)
- Discrimination (buggy_data_case fixture): 0.4687 (well below 0.7 pass)
- Empty file: 0.087 (well below 0.3 target)

Discrimination headroom: 0.53, which is excellent for fitness comparison.

Tier methodology: heuristic per SEEDING-PLAN.md item 4.
Research: 47 citations across 11 capabilities (see research.md).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>