seed(elixir-oban-worker): SKLD-bench v2.1 challenge pool (100 challenges) by ty13r · Pull Request #11 · ty13r/skillforge

ty13r · 2026-04-11T13:34:59Z

SKLD-bench v2.1 challenge pool: elixir-oban-worker

Third of 7 families being shipped this morning (ecto-schema-changeset PR #9, ecto-query-writer next).

Pool stats

Total challenges: 100 (binary curve target hit exactly)
Tier distribution: easy 35 / medium 35 / hard 22 / legendary 8
Held-out: 20 balanced across tiers
Capability coverage: 11 capabilities + 1 foundation = 12 dimensions
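
One plausible reading of "balanced across tiers" is a proportional split of the 20 held-out slots over the tier sizes. The sketch below shows that reading only; `allocate_held_out` and the largest-remainder rounding are hypothetical choices for illustration and are not taken from SEEDING-PLAN.md.

```python
from math import floor


def allocate_held_out(tier_sizes, held_out_total):
    """Split held_out_total across tiers in proportion to tier size.

    Largest-remainder rounding keeps the per-tier counts summing exactly
    to held_out_total. Hypothetical helper, not part of the PR.
    """
    pool_total = sum(tier_sizes.values())
    quotas = {t: n * held_out_total / pool_total for t, n in tier_sizes.items()}
    counts = {t: floor(q) for t, q in quotas.items()}
    leftover = held_out_total - sum(counts.values())
    # Hand the remaining slots to the tiers with the largest fractional part.
    by_remainder = sorted(quotas, key=lambda t: quotas[t] - counts[t], reverse=True)
    for tier in by_remainder[:leftover]:
        counts[tier] += 1
    return counts


tiers = {"easy": 35, "medium": 35, "hard": 22, "legendary": 8}
print(allocate_held_out(tiers, 20))
```

Under these assumptions the split is 7 easy, 7 medium, 4 hard, and 2 legendary; the actual held-out assignment in the pool may differ.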

Capability coverage

Capability	Primary	Secondary	Easy	Medium	Hard	Legendary
testing-workers	12	—	4	5	2	1
return-values	10	12	5	4	1	0
unique-constraints	10	4	3	4	2	1
args-serialization ⭐	9	25	4	3	1	1
worker-philosophy (F)	9	37	3	2	3	1
cron-scheduling	8	2	3	2	3	0
recurring-jobs-vs-cron	8	5	3	2	2	1
transactional-jobs	8	7	1	3	3	1
queues-and-priority	7	7	2	3	1	1
retry-strategy	7	16	2	2	2	1
perform-callback-basics	6	22	4	1	1	0
telemetry-and-observability	6	1	1	4	1	0

All 12 capabilities hit the ≥5 binary-family minimum. ⭐ args-serialization is the highest-impact safety fix (atom keys in args + struct serialization, both per plugin iron laws).

All three named Oban failure modes covered

Non-idempotent jobs → transactional-jobs (8) + perform-callback-basics (6)
Atom keys instead of strings in args → args-serialization (9)
Storing Elixir structs in args → args-serialization (same cluster)

Post-hoc calibration manifest

Drafting subagent was cut off by the Max subscription rate limit at the final _calibration.json step. All other content authored cleanly. Manifest generated post-hoc by walking the actual challenge files.
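
Generating a manifest by walking the challenge files could look roughly like the sketch below. The on-disk layout is an assumption: one JSON file per challenge with `tier` and `primary_capability` fields, and a leading underscore reserved for metadata files such as `_calibration.json`. The function name `build_manifest` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def build_manifest(pool_dir):
    """Count challenges per tier and per primary capability.

    Assumes one JSON file per challenge with "tier" and
    "primary_capability" keys (hypothetical field names).
    """
    tiers = Counter()
    capabilities = Counter()
    for path in sorted(Path(pool_dir).glob("*.json")):
        if path.name.startswith("_"):
            # Skip metadata files such as a stale _calibration.json.
            continue
        challenge = json.loads(path.read_text())
        tiers[challenge["tier"]] += 1
        capabilities[challenge["primary_capability"]] += 1
    return {
        "total": sum(tiers.values()),
        "tiers": dict(tiers),
        "primary_capabilities": dict(capabilities),
    }
```

Deriving the counts from the files themselves, rather than from the drafting agent's notes, keeps the manifest consistent with what was actually written to disk.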

Score.py

Authored by the drafting subagent. Uses regexes to detect String.to_atom calls in worker bodies, atom keys in args (patterns such as %{user_id: ...}), return-value protocol tags (:ok, {:ok, _}, {:error, _}, {:discard, _}, {:snooze, _}), and the presence of unique: blocks. Not re-validated post-hoc; treat as best-effort.
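
The regex checks listed above might be sketched as follows. The patterns and the `check_worker` function are hypothetical illustrations, not the contents of the real Score.py, which was not re-validated and may differ.

```python
import re

# Hypothetical patterns for illustration; the real Score.py may differ.
STRING_TO_ATOM_RE = re.compile(r"\bString\.to_atom\s*\(")
# Matches a bare map literal whose first key uses atom syntax, e.g. %{user_id: 1}.
ATOM_KEY_MAP_RE = re.compile(r"%\{\s*[a-z_]\w*[?!]?:\s")
UNIQUE_OPT_RE = re.compile(r"\bunique:\s*\[")
TAGGED_RETURN_RE = re.compile(r"\{:(ok|error|discard|snooze)\s*,")
BARE_OK_RE = re.compile(r"^\s*:ok\s*$", re.MULTILINE)


def check_worker(source):
    """Return coarse regex findings for one Elixir worker source string."""
    tags = set(TAGGED_RETURN_RE.findall(source))
    if BARE_OK_RE.search(source):
        tags.add("ok")
    return {
        "string_to_atom": bool(STRING_TO_ATOM_RE.search(source)),
        "atom_key_args": bool(ATOM_KEY_MAP_RE.search(source)),
        "has_unique": bool(UNIQUE_OPT_RE.search(source)),
        "return_tags": sorted(tags),
    }
```

Regex checks like these also match inside comments and strings, which is one reason to treat the scorer as best-effort.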

Research provenance

38 citations across 12 capabilities. Key source: oliver-kriska/claude-elixir-phoenix (three explicit Oban iron laws: idempotency, atom keys, stored structs).

Tier methodology

Heuristic per SEEDING-PLAN.md item 4.

🤖 Generated with Claude Code

Authors the complete SKLD-bench v2.1 family for elixir-oban-worker per the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Third family shipped this morning. The drafting subagent was cut off by the Max rate limit at the final _calibration.json step; the manifest was generated post-hoc by walking the actual challenge pool.

Pool stats:
- 100 total challenges (binary curve target hit exactly)
- Tier distribution: 35 easy / 35 medium / 22 hard / 8 legendary
- 11 capabilities + 1 foundation = 12 dimensions covered
- 13 test fixtures, 12 golden references
- 20 challenges held out (~20% balanced across tiers)

Capability primary-tag counts (target >=5 for binary, all met):
- testing-workers: 12 (highest)
- return-values: 10
- unique-constraints: 10
- args-serialization: 9 (highest-impact safety fix per plugin iron laws)
- worker-philosophy (foundation): 9
- cron-scheduling: 8
- recurring-jobs-vs-cron: 8
- transactional-jobs: 8
- queues-and-priority: 7
- retry-strategy: 7
- perform-callback-basics: 6
- telemetry-and-observability: 6

All three named Oban failure modes are covered:
- Non-idempotent jobs: transactional-jobs + perform-callback-basics
- Atom keys in args: args-serialization (9 challenges)
- Stored structs in args: args-serialization (same cluster)

Score.py: authored by drafting subagent. Uses regex for String.to_atom calls in worker bodies, atom keys in args (%{user_id: patterns), return value protocol (:ok / {:ok, _} / {:error, _} / {:discard, _} / {:snooze, _}), presence of unique: blocks. Not re-validated post-hoc; treat as best-effort.

Tier methodology: heuristic per SEEDING-PLAN.md item 4.

Research: 38 citations across 12 capabilities (see research.md).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…l-test phase

Matt asked "should we try installing the liveview skill now?" after everything else said green (rich run detail page, Gold Standard Checklist all green indicators, zip export validator pass, Package Explorer showing 16 files). The install test revealed three real bugs that had passed every schema-level quality gate. This commit fixes the bugs, promotes the install test to a mandatory pipeline phase, and codifies the learnings into PLAN-V2.1 so the v2.1 engine never ships another broken skill.

## The 3 bugs (found by actually running the package)

**1. validate.sh used declare -A (bash 4+ only)**

macOS ships bash 3.2. Line 49 `declare -A HITS_BY` failed with `declare: -A: invalid option`. The enrichment agent that generated this script tested it on Linux and never verified macOS.

**2. validate.sh piped detectors into report (subshell variable loss)**

Even after fixing the declare bug with eval + ${!var} indirect expansion, the summary showed "all clean" with TOTAL_HITS=0 while the detector output reported real hits. In bash 3.2 pipelines create subshells, so the assignments inside `report` never propagated back. Fix: process substitution `report "key" "fix" < <(detector)` keeps report running in the parent shell. This bug would have bitten on Linux bash 4+ too without `shopt -s lastpipe`.

**3. main_helper.py migrate produced malformed Elixir**

- Left `<%= ... %>` wrappers around `<.link>` components (invalid HEEx)
- Lost trailing `class: "btn"` keyword args instead of absorbing as attrs
- Put `:for` on the outer `<ul>` instead of the inner `<li>` (would duplicate the whole list)
- Skipped `live_redirect user.name, to: ...` because the regex only matched double-quoted text
- Missed `Routes.user_path(socket, :index)` without leading `@` inside `push_navigate` calls

Fixes:

- New `_strip_eex_around_link` post-processing pass that removes `<%= %>` around `.link` components and absorbs trailing keyword args as component attrs via `_absorb_kw_args_as_attrs`
- New `_format_link_text` helper that detects quoted-literal vs Elixir expression text and wraps expressions in HEEx curly syntax `{user.name}`
- Rewrote `_EEX_FOR_BLOCK_RE` / `_EEX_IF_BLOCK_RE` to match the INNER tag inside the block, not any wrapping outer tag
- Widened `_ROUTES_CALL_RE` with optional `@?` before socket
- Excluded `%` from `_LIVE_*_RE` target groups so `%>` doesn't get consumed

**Plus a minor new-live UX wart**: `dashboard_live` produced `MyAppWeb.DashboardLiveLive`. Fix: strip a trailing `_live` from the input before camel case conversion; clearer help text + error message.

## Patch flow

1. Fixed scripts written to /tmp/skld-fixes/scripts/
2. Tested standalone against a fake Phoenix project (32 anti-pattern hits, correct summary, FAIL exit 1)
3. Tested migrate against pre_1_7_user_list.ex: 9 rewrite passes producing valid Phoenix 1.7+ HEEx with :for on <li>, :if on <span>, absorbed class="btn", {user.name} curly interpolation, push_navigate(socket, to: ~p"/users")
4. New `scripts/mock_pipeline/patch_composite_scripts.py` helper patches the seed JSON's composite genome supporting_files in place (replaces the bad validate.sh + main_helper.py values)
5. Nuked local DB, rebooted uvicorn, downloaded zip, extracted, verified all scripts work from the installed location

## End-to-end install verification

- `/tmp/skld-phoenix-demo/`: realistic Phoenix project dir with `mix.exs`, `lib/my_app_web/live/`, and the composite skill dropped into `.claude/skills/elixir-phoenix-liveview-composite/`
- validate.sh: 32 anti-pattern hits across 14 detectors, correct summary, FAIL exit 1
- main_helper.py scan: 35 gcc-style diagnostics
- main_helper.py migrate: valid HEEx output, 9 rewrite passes
- main_helper.py new-live dashboard: scaffolded MyAppWeb.DashboardLive (no DashboardLiveLive)

## Dogfood subagent test

Dispatched an Opus subagent with instructions to read the installed skill and write a `TaskListLive` module for a Tasks feature. The subagent produced a 190-line file that scanned CLEAN on the first try, with zero anti-pattern hits. It used every Phoenix 1.7+ idiom the skill teaches: streams with phx-update="stream", :for on <li>, :if for filtering, <.link> components, ~p verified routes, to_form/2 forms, typed %Action{} funnel into pure handle_action/2 dispatcher. The subagent also identified two real skill gaps (missing "filter a stream via :if" pattern, missing "hoist inline form into assign" tip), valuable follow-up items for the next skill iteration.

## Pipeline: install test is now MANDATORY

**scripts/mock_pipeline/NEXT-SEED-RUN-PLAYBOOK.md §Phase 7.5**: every bridge seed run must run the install test before being marked complete. The playbook includes the exact bash script that downloads the zip, creates a fake project, runs every script, asserts on outputs, and optionally dispatches a subagent dogfood test.

**plans/PLAN-V2.1.md §P1.5 "Final-package installation test (MANDATORY)"**: the v2.1 production engine must include a `skillforge/engine/install_test.py` module called from `run_v21_evolution()` AFTER champion eval but BEFORE save_genome (composite). On failure the run transitions to a new `install_test_failed` status. The zip export endpoint and seed loader reject runs in that state.

**plans/PLAN-V2.1.md §3.5 "Install-test learnings (post-rebrand)"** documents the four bugs as permanent learnings so future engine work doesn't repeat them.

**Success criterion #11** added to the v2.1 shipped gate.

## journal + PROGRESS

- journal/013-phoenix-liveview-install-test.md (session narrative, ~400 lines covering rich run detail rebuild, two-phase rebrand, OG meta injection, install test discoveries, subagent dogfood)
- plans/PROGRESS.md (6 dated entries for today)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
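
The new-live naming fix described in the commit message (strip a trailing `_live` before camel-casing) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in main_helper.py: the function name `live_module_name`, the `web_module` default, and the error message are hypothetical.

```python
def live_module_name(raw_name, web_module="MyAppWeb"):
    """Build a LiveView module name from a snake_case input.

    A trailing "_live" is stripped first so that "dashboard_live" does
    not become "DashboardLiveLive". Hypothetical helper for illustration.
    """
    base = raw_name.strip()
    if base.endswith("_live"):
        base = base[: -len("_live")]
    parts = [part for part in base.split("_") if part]
    if not parts:
        raise ValueError("expected a snake_case name such as 'dashboard'")
    camel = "".join(part.capitalize() for part in parts)
    return f"{web_module}.{camel}Live"


print(live_module_name("dashboard_live"))
print(live_module_name("dashboard"))
```

Both calls yield `MyAppWeb.DashboardLive`, matching the scaffolded module name reported in the install verification.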

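
The mandatory install-test gate described above (run after champion eval, before save_genome, with a failed run moved to `install_test_failed` and rejected by export) might be wired up roughly as follows. Only the status name `install_test_failed` comes from the commit message; the `Run` class, `finalize_run`, `can_export`, the `"complete"` status, and the callable hooks are hypothetical stand-ins for the real engine.

```python
from dataclasses import dataclass

INSTALL_TEST_FAILED = "install_test_failed"  # status name taken from the plan


@dataclass
class Run:
    run_id: str
    status: str = "running"


def finalize_run(run, install_test, save_genome):
    """Gate persistence on the install test.

    install_test: callable returning True when every packaged script works.
    save_genome: callable that persists the composite genome.
    Both are hypothetical hooks, not the real engine API.
    """
    if not install_test(run):
        run.status = INSTALL_TEST_FAILED
        return run  # never reaches save_genome
    save_genome(run)
    run.status = "complete"  # assumed name for the success status
    return run


def can_export(run):
    """Zip export and seed loading refuse runs that failed the install test."""
    return run.status != INSTALL_TEST_FAILED
```

Putting the check before persistence means a broken package is never saved as a champion, so downstream consumers only need to check the run status.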

ty13r merged commit 927ca63 into main Apr 11, 2026

ty13r deleted the seed/elixir-oban-worker branch April 11, 2026 13:35

ty13r mentioned this pull request Apr 12, 2026

fix: composite script bugs from first install test + MANDATORY install-test phase #23

Merged

5 tasks

ty13r merged 1 commit into main from seed/elixir-oban-worker
