
seed(elixir-ecto-sandbox-test): SKLD-bench v2.1 challenge pool (151 challenges)#13

Merged
ty13r merged 1 commit into main from seed/elixir-ecto-sandbox-test on Apr 11, 2026

Conversation


@ty13r ty13r commented Apr 11, 2026

SKLD-bench v2.1 challenge pool: elixir-ecto-sandbox-test

Fifth of the 7 families shipping this morning; the first four were ecto-schema-changeset (#9), ecto-query-writer (#10), oban-worker (#11), and security-linter (#12).

Partial-state recovery note

The overnight drafting subagent authored 151 challenges + 15 fixtures + 13 goldens cleanly before hitting the Max subscription rate limit at the metadata+evaluation layer. This PR completes the family by adding the six missing files:

  • family.json — metadata, capability list, held_out_ids (hand-authored)
  • seed.json — gen 0 SkillGenome with 11 starter variants (hand-authored)
  • evaluation/score.py — sandbox pattern scanner (hand-authored)
  • evaluation/criteria.json — per-capability weights (hand-authored)
  • evaluation/environment.yml — deps (hand-authored)
  • challenges/_calibration.json — generated post-hoc from actual challenge pool

Pool stats

  • Total challenges: 151 (rich curve target 150 — overshot by 1)
  • Tier distribution: easy 37 / medium 44 / hard 43 / legendary 27
  • Held-out: 28 balanced across tiers
  • Capability coverage: 10 capabilities + 1 foundation = 11 dimensions

Why this family matters

elixir-ecto-sandbox-test is the "ugly" pain point named directly in BoothIQ's "150k lines of vibe-coded Elixir" post-mortem:

"It can't debug concurrent test failures. It doesn't understand that each test runs in an isolated transaction... Claude doesn't understand this."

Every capability maps to a specific Claude failure mode documented in the research dossier (47 citations). The tidewave-dev-vs-test-trap capability is named after the specific bug where Claude reads from the Tidewave dev DB connection and thinks it's looking at the test DB.
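
To make the trap concrete: under the SQL sandbox, each test's writes live inside a transaction that rolls back when the test exits, so inspecting the database over any other connection (the dev Repo, a psql session, or Tidewave's dev connection, which points at a different database entirely) shows none of them. A minimal sketch, assuming placeholder MyApp.Repo and MyApp.Post modules rather than anything in this pool:

```elixir
defmodule MyApp.SandboxVisibilityTest do
  # Placeholder names throughout; this only illustrates why reading the dev DB
  # (as Tidewave's tools do) says nothing about sandboxed test state.
  use ExUnit.Case, async: true

  alias Ecto.Adapters.SQL.Sandbox

  setup do
    pid = Sandbox.start_owner!(MyApp.Repo)
    on_exit(fn -> Sandbox.stop_owner(pid) end)
    :ok
  end

  test "writes stay inside this test's sandbox transaction" do
    MyApp.Repo.insert!(%MyApp.Post{title: "only visible to this test"})

    # Visible here, on the connection that owns the open transaction...
    assert MyApp.Repo.aggregate(MyApp.Post, :count) == 1

    # ...but invisible elsewhere: the transaction is never committed, and the
    # Tidewave dev connection is reading a different database anyway.
  end
end
```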

Score.py validation

| Check | Result | Target | Status |
| --- | --- | --- | --- |
| Sanity (golden easy-01 canonical DataCase) | 1.0 | ≥ 0.9 | ✅ passes |
| Discrimination (buggy_data_case.ex fixture as output) | 0.4687 | < 0.7 (fail) | ✅ correctly fails |
| Empty file | 0.087 | ≤ 0.3 | ✅ passes |

Discrimination headroom (golden score minus buggy-fixture score): 1.0 − 0.4687 ≈ 0.53, excellent for fitness comparison.

Score.py detection surface

Canonical patterns (positive signals), with a sketch of the canonical shape after this list:

  • Ecto.Adapters.SQL.Sandbox.start_owner! + stop_owner lifecycle
  • shared: not tags[:async] dispatcher
  • Sandbox.allow/3 for spawned processes / Tasks
  • Phoenix.Ecto.SQL.Sandbox plug (LiveView integration)
  • use Oban.Testing, repo: + perform_job/3 (not drain_queue)
  • {:shared, self()} tuple when fallback is necessary
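
A minimal sketch of what those signals add up to, modeled on the stock Phoenix-generated DataCase; MyApp.Repo, MyApp.Post, and MyApp.DataCase are placeholder names, not files from this pool:

```elixir
defmodule MyApp.DataCase do
  # Canonical sandbox lifecycle: check a connection out per test, return it on exit.
  use ExUnit.CaseTemplate

  setup tags do
    # `shared: not tags[:async]` keeps async tests isolated while letting
    # non-async tests fall back to shared mode.
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(MyApp.Repo, shared: not tags[:async])
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end
end

defmodule MyApp.AsyncAllowanceTest do
  use MyApp.DataCase, async: true

  test "a spawned task shares this test's sandbox connection" do
    parent = self()

    task =
      Task.async(fn ->
        # Explicitly grant this child process access to the connection
        # owned by the test process.
        Ecto.Adapters.SQL.Sandbox.allow(MyApp.Repo, parent, self())
        MyApp.Repo.insert!(%MyApp.Post{title: "async"})
      end)

    assert %MyApp.Post{} = Task.await(task)
  end
end
```

The Phoenix.Ecto.SQL.Sandbox plug and the {:shared, self()} fallback follow the same ownership model and are omitted here to keep the sketch short.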

Anti-patterns (penalized), with a contrasting sketch after this list:

  • Sandbox.mode(Repo, :auto) in test code
  • async: false as an ownership-bug workaround
  • Process.sleep to "wait for" connection transfer
  • Oban.drain_queue in sandbox context
  • Seeding the test DB to force passing tests (the specific Claude failure from the research)
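
For contrast, a hedged sketch of the shapes the scanner penalizes, next to the canonical Oban call; module names and the worker are illustrative placeholders, and the worker is assumed to return :ok:

```elixir
defmodule MyApp.AntiPatternTest do
  # Illustrative only: each flagged line below is a shape score.py penalizes.
  use ExUnit.Case, async: false   # async: false used to dodge an ownership bug

  setup do
    # Forcing auto mode from test code defeats per-test transaction isolation.
    Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, :auto)
    :ok
  end

  test "job ran" do
    Oban.insert!(MyApp.Worker.new(%{id: 1}))
    # Draining the queue under the sandbox instead of calling perform_job/3.
    Oban.drain_queue(queue: :default)
    # Sleeping to "wait for" a connection transfer instead of Sandbox.allow/3.
    Process.sleep(100)
  end
end

defmodule MyApp.WorkerTest do
  use MyApp.DataCase, async: true
  # Oban.Testing provides perform_job/3, which exercises the worker directly
  # without depending on queue state.
  use Oban.Testing, repo: MyApp.Repo

  test "the canonical worker call" do
    assert :ok = perform_job(MyApp.Worker, %{id: 1})
  end
end
```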

Research provenance

47 citations across 11 capabilities. Key sources:

  • BoothIQ "150k lines of vibe-coded Elixir" post-mortem (the "ugly" clincher)
  • HN discussion of the BoothIQ article
  • Rakshan Shetty: Elixir Testing Patterns Ecto Sandbox
  • Elixir Forum threads on concurrent test failures

Tier methodology

Heuristic per SEEDING-PLAN.md item 4.

🤖 Generated with Claude Code

Commit message

Authors the complete SKLD-bench v2.1 family for elixir-ecto-sandbox-test
per the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Fifth family
shipped this morning. The drafting subagent authored 151 challenges,
15 test fixtures, and 13 golden references cleanly before hitting the
Max subscription rate limit at the metadata+evaluation layer. This commit
completes the family with hand-authored family.json, seed.json, score.py,
criteria.json, environment.yml, and _calibration.json (generated post-hoc).

Pool stats:
- 151 total challenges (rich curve target 150; overshot by 1)
- Tier distribution: 37 easy / 44 medium / 43 hard / 27 legendary
- 10 capabilities + 1 foundation = 11 dimensions covered
- 15 test fixtures, 13 golden references
- 28 challenges held out (~18% balanced across tiers)

This was the "ugly" pain point named in BoothIQ's vibe-coded Elixir
post-mortem. Every capability maps to a specific Claude failure mode
from the research dossier.

Score.py: static analyzer for Ecto sandbox test isolation. Detects:
- Canonical start_owner!/stop_owner lifecycle
- shared: not tags[:async] dispatcher pattern
- Sandbox.allow/3 for spawned processes
- Phoenix.Ecto.SQL.Sandbox plug usage
- use Oban.Testing + perform_job/3 (not drain_queue in sandbox)
- Anti-patterns: Sandbox.mode(_, :auto), Process.sleep workarounds,
  Oban.drain_queue in sandbox, seeding test DB to force passes,
  async: false as an ownership-bug workaround

Score.py validation:
- Sanity (golden easy-01 canonical DataCase): 1.0 (above 0.9 target)
- Discrimination (buggy_data_case fixture): 0.4687 (well below 0.7 pass)
- Empty file: 0.087 (well below 0.3 target)

Discrimination headroom: 0.53, which is excellent for fitness comparison.

Tier methodology: heuristic per SEEDING-PLAN.md item 4.
Research: 47 citations across 11 capabilities (see research.md).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>