seed(elixir-oban-worker): SKLD-bench v2.1 challenge pool (100 challenges)#11
Merged
Authors the complete SKLD-bench v2.1 family for elixir-oban-worker per
the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Third family
shipped this morning. The drafting subagent was cut off by the Max rate
limit at the final _calibration.json step; the manifest was generated
post-hoc by walking the actual challenge pool.
Pool stats:
- 100 total challenges (binary curve target hit exactly)
- Tier distribution: 35 easy / 35 medium / 22 hard / 8 legendary
- 11 capabilities + 1 foundation = 12 dimensions covered
- 13 test fixtures, 12 golden references
- 20 challenges held out (~20% balanced across tiers)
Capability primary-tag counts (target >=5 for binary, all met):
- testing-workers: 12 (highest)
- return-values: 10
- unique-constraints: 10
- args-serialization: 9 (highest-impact safety fix per plugin iron laws)
- worker-philosophy (foundation): 9
- cron-scheduling: 8
- recurring-jobs-vs-cron: 8
- transactional-jobs: 8
- queues-and-priority: 7
- retry-strategy: 7
- perform-callback-basics: 6
- telemetry-and-observability: 6
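The counts above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: `check_pool` is a hypothetical helper, not part of the repository, and it assumes every challenge carries exactly one primary tag.

```python
# Counts copied from the tier and primary-tag lists above.
TIERS = {"easy": 35, "medium": 35, "hard": 22, "legendary": 8}
PRIMARY_TAGS = {
    "testing-workers": 12,
    "return-values": 10,
    "unique-constraints": 10,
    "args-serialization": 9,
    "worker-philosophy": 9,
    "cron-scheduling": 8,
    "recurring-jobs-vs-cron": 8,
    "transactional-jobs": 8,
    "queues-and-priority": 7,
    "retry-strategy": 7,
    "perform-callback-basics": 6,
    "telemetry-and-observability": 6,
}


def check_pool(tiers, tags, total=100, minimum=5):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if sum(tiers.values()) != total:
        problems.append(f"tiers sum to {sum(tiers.values())}, expected {total}")
    # Assumption: each challenge has exactly one primary tag, so tags sum to total.
    if sum(tags.values()) != total:
        problems.append(f"primary tags sum to {sum(tags.values())}, expected {total}")
    for name, count in sorted(tags.items()):
        if count < minimum:
            problems.append(f"{name} has {count} challenges, below {minimum}")
    return problems


print(check_pool(TIERS, PRIMARY_TAGS))  # prints []
```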
All three named Oban failure modes are covered:
- Non-idempotent jobs: transactional-jobs + perform-callback-basics
- Atom keys in args: args-serialization (9 challenges)
- Stored structs in args: args-serialization (same cluster)
Score.py: authored by the drafting subagent. Uses regexes to detect
String.to_atom calls in worker bodies, atom keys in args (%{user_id:
patterns), return-value protocol tags (:ok / {:ok, _} / {:error, _} /
{:discard, _} / {:snooze, _}), and the presence of unique: blocks. It was
not re-validated post-hoc; treat its scores as best-effort.
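For orientation, regex checks of the kind described above might look roughly like the following. This is a sketch only: the pattern names, `scan_worker`, and the sample modules (`MyApp.Workers.Mailer`) are all hypothetical, and the real score.py may use different expressions.

```python
import re

# Hypothetical patterns; the actual score.py was not re-validated and may differ.
STRING_TO_ATOM_RE = re.compile(r"\bString\.to_atom\s*\(")
# A map literal whose first key is written in atom shorthand, e.g. %{user_id: 1}.
ATOM_KEY_ARGS_RE = re.compile(r"%\{\s*[a-z_][A-Za-z0-9_]*[?!]?:\s")
# Bare :ok, {:ok, _}, {:error, _}, {:discard, _} or {:snooze, _}.
RETURN_PROTOCOL_RE = re.compile(r":ok\b|\{:(?:error|discard|snooze)\s*,")
UNIQUE_BLOCK_RE = re.compile(r"\bunique:\s*\[")


def scan_worker(source: str) -> dict:
    """Report which heuristic signals appear anywhere in the source text."""
    return {
        "string_to_atom": bool(STRING_TO_ATOM_RE.search(source)),
        "atom_keys_in_args": bool(ATOM_KEY_ARGS_RE.search(source)),
        "returns_protocol_value": bool(RETURN_PROTOCOL_RE.search(source)),
        "has_unique": bool(UNIQUE_BLOCK_RE.search(source)),
    }


GOOD_WORKER = '''
defmodule MyApp.Workers.Mailer do
  use Oban.Worker, queue: :mailers, unique: [period: 60]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"user_id" => user_id}}) do
    MyApp.Mailer.deliver(user_id)
    :ok
  end
end
'''

BAD_ENQUEUE = (
    "%{user_id: user.id, kind: String.to_atom(kind)} "
    "|> MyApp.Workers.Mailer.new() |> Oban.insert()"
)

print(scan_worker(GOOD_WORKER))
print(scan_worker(BAD_ENQUEUE))
```

Text-level matching like this cannot tell an args map from any other map literal, which is one reason to treat the resulting scores as best-effort.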
Tier methodology: heuristic per SEEDING-PLAN.md item 4.
Research: 38 citations across 12 capabilities (see research.md).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This was referenced Apr 11, 2026
ty13r pushed a commit that referenced this pull request on Apr 12, 2026:
…l-test phase
Matt asked "should we try installing the liveview skill now?" after
everything else said green (rich run detail page, Gold Standard Checklist
all green indicators, zip export validator pass, Package Explorer showing
16 files). The install test revealed three real bugs that had passed
every schema-level quality gate. This commit fixes the bugs, promotes the
install test to a mandatory pipeline phase, and codifies the learnings
into PLAN-V2.1 so the v2.1 engine never ships another broken skill.
## The 3 bugs (found by actually running the package)
**1. validate.sh used declare -A (bash 4+ only)**
macOS ships bash 3.2. Line 49 `declare -A HITS_BY` failed with
`declare: -A: invalid option`. The enrichment agent that generated this
script tested it on Linux and never verified macOS.
**2. validate.sh piped detectors into report (subshell variable loss)**
Even after fixing the declare bug with eval + ${!var} indirect expansion,
the summary showed "all clean" with TOTAL_HITS=0 while the detector output
reported real hits. Bash runs every stage of a pipeline in a subshell, so
the assignments made inside `report` never propagated back to the parent
shell. Fix: process substitution `report "key" "fix" < <(detector)`, which
keeps `report` running in the parent shell. This is not specific to bash
3.2: bash 4+ on Linux behaves the same way unless `shopt -s lastpipe` is
set (bash 4.2+, and only when job control is off).
**3. main_helper.py migrate produced malformed Elixir**
- Left `<%= ... %>` wrappers around `<.link>` components (invalid HEEx)
- Lost trailing `class: "btn"` keyword args instead of absorbing as attrs
- Put `:for` on the outer `<ul>` instead of the inner `<li>` (would
duplicate the whole list)
- Skipped `live_redirect user.name, to: ...` because the regex only
matched double-quoted text
- Missed `Routes.user_path(socket, :index)` without leading `@` inside
`push_navigate` calls
Fixes:
- New `_strip_eex_around_link` post-processing pass that removes `<%= %>`
around `.link` components and absorbs trailing keyword args as
component attrs via `_absorb_kw_args_as_attrs`
- New `_format_link_text` helper that detects quoted-literal vs Elixir
expression text and wraps expressions in HEEx curly syntax `{user.name}`
- Rewrote `_EEX_FOR_BLOCK_RE` / `_EEX_IF_BLOCK_RE` to match the INNER
tag inside the block, not any wrapping outer tag
- Widened `_ROUTES_CALL_RE` with optional `@?` before socket
- Excluded `%` from `_LIVE_*_RE` target groups so `%>` doesn't get
consumed
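Two of the fixes listed above can be illustrated in isolation. The code below is a hedged sketch, not the actual main_helper.py: `format_link_text` and `ROUTES_CALL_RE` are simplified stand-ins for `_format_link_text` and `_ROUTES_CALL_RE`, and the exact patterns are assumptions.

```python
import re

# A complete double-quoted Elixir string with no unescaped inner quotes.
_QUOTED_LITERAL_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Assumed shape of the widened pattern: "@?" accepts both
# Routes.user_path(@socket, :index) and Routes.user_path(socket, :index).
ROUTES_CALL_RE = re.compile(r"Routes\.(\w+)_path\(\s*@?socket\s*,\s*:(\w+)\s*\)")


def format_link_text(text: str) -> str:
    """Emit plain string literals as bare text; wrap anything else in {...}."""
    text = text.strip()
    match = _QUOTED_LITERAL_RE.fullmatch(text)
    # Interpolated strings ("#{...}") are expressions, not plain literals.
    if match and "#{" not in text:
        return match.group(1)
    return "{" + text + "}"


print(format_link_text('"All users"'))   # prints All users
print(format_link_text("user.name"))     # prints {user.name}

call = "push_navigate(socket, to: Routes.user_path(socket, :index))"
print(ROUTES_CALL_RE.search(call).groups())  # prints ('user', 'index')
```

Treating anything that is not a single plain literal as an expression keeps cases such as `"a" <> "b"` from being pasted into the template as raw text.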
**Plus a minor new-live UX wart**: `dashboard_live` produced
`MyAppWeb.DashboardLiveLive`. Fix: strip a trailing `_live` from the
input before camel case conversion; clearer help text + error message.
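The naming fix amounts to normalizing the input before camel-casing it. A minimal sketch follows, assuming a hypothetical `live_module_name` helper and the `MyAppWeb` namespace from the example above; the real new-live command may differ in its details.

```python
def live_module_name(name: str, web_module: str = "MyAppWeb") -> str:
    """Build a LiveView module name, tolerating an input that already ends in _live."""
    base = name.strip().lower()
    # Strip one trailing "_live" so "dashboard_live" does not become DashboardLiveLive.
    if base.endswith("_live"):
        base = base[: -len("_live")]
    camel = "".join(part.capitalize() for part in base.split("_") if part)
    if not camel:
        raise ValueError(f"cannot derive a module name from {name!r}")
    return f"{web_module}.{camel}Live"


print(live_module_name("dashboard_live"))  # prints MyAppWeb.DashboardLive
print(live_module_name("task_list"))       # prints MyAppWeb.TaskListLive
```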
## Patch flow
1. Fixed scripts written to /tmp/skld-fixes/scripts/
2. Tested standalone against a fake Phoenix project (32 anti-pattern
hits, correct summary, FAIL exit 1)
3. Tested migrate against pre_1_7_user_list.ex — 9 rewrite passes
producing valid Phoenix 1.7+ HEEx with :for on <li>, :if on <span>,
absorbed class="btn", {user.name} curly interpolation,
push_navigate(socket, to: ~p"/users")
4. New `scripts/mock_pipeline/patch_composite_scripts.py` helper
patches the seed JSON's composite genome supporting_files in place
(replaces the bad validate.sh + main_helper.py values)
5. Nuked local DB, rebooted uvicorn, downloaded zip, extracted,
verified all scripts work from the installed location
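Step 4 is essentially an in-place JSON rewrite. The sketch below shows one way such a patch could work; the seed layout (`composite_genome` containing a `supporting_files` map keyed by path) and the function name are assumptions for illustration, not the actual format used by `patch_composite_scripts.py`.

```python
import json
import tempfile
from pathlib import Path


def patch_supporting_files(seed_path, replacements, genome_key="composite_genome"):
    """Overwrite existing supporting_files entries in place; return the keys that changed."""
    seed_path = Path(seed_path)
    seed = json.loads(seed_path.read_text(encoding="utf-8"))
    files = seed[genome_key]["supporting_files"]  # assumed layout
    # Refuse to add new entries: a typo in a path should fail loudly.
    missing = [key for key in replacements if key not in files]
    if missing:
        raise KeyError(f"not present in supporting_files: {missing}")
    changed = [key for key, value in replacements.items() if files[key] != value]
    files.update(replacements)
    seed_path.write_text(json.dumps(seed, indent=2) + "\n", encoding="utf-8")
    return changed


# Small demonstration against a throwaway seed file.
with tempfile.TemporaryDirectory() as tmp:
    seed_file = Path(tmp) / "seed.json"
    original = {
        "composite_genome": {
            "supporting_files": {
                "scripts/validate.sh": "old validate",
                "scripts/main_helper.py": "old helper",
            }
        }
    }
    seed_file.write_text(json.dumps(original), encoding="utf-8")
    changed = patch_supporting_files(seed_file, {"scripts/validate.sh": "fixed validate"})
    reloaded = json.loads(seed_file.read_text(encoding="utf-8"))

print(changed)  # prints ['scripts/validate.sh']
```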
## End-to-end install verification
- `/tmp/skld-phoenix-demo/` — realistic Phoenix project dir with
`mix.exs`, `lib/my_app_web/live/`, and the composite skill dropped
into `.claude/skills/elixir-phoenix-liveview-composite/`
- validate.sh: 32 anti-pattern hits across 14 detectors, correct
summary, FAIL exit 1
- main_helper.py scan: 35 gcc-style diagnostics
- main_helper.py migrate: valid HEEx output, 9 rewrite passes
- main_helper.py new-live dashboard: scaffolded MyAppWeb.DashboardLive
(no DashboardLiveLive)
## Dogfood subagent test
Dispatched an Opus subagent with instructions to read the installed
skill and write a `TaskListLive` module for a Tasks feature. The
subagent produced a 190-line file that scanned CLEAN on the first
try — zero anti-pattern hits. It used every Phoenix 1.7+ idiom the
skill teaches: streams with phx-update="stream", :for on <li>, :if
for filtering, <.link> components, ~p verified routes, to_form/2
forms, typed %Action{} funnel into pure handle_action/2 dispatcher.
The subagent also identified two real skill gaps (missing "filter a
stream via :if" pattern, missing "hoist inline form into assign"
tip) — valuable follow-up items for the next skill iteration.
## Pipeline: install test is now MANDATORY
**scripts/mock_pipeline/NEXT-SEED-RUN-PLAYBOOK.md §Phase 7.5** — every
bridge seed run must run the install test before being marked complete.
The playbook includes the exact bash script that downloads the zip,
creates a fake project, runs every script, asserts on outputs, and
optionally dispatches a subagent dogfood test.
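The playbook's script is bash and is not reproduced here. As a rough Python illustration of the same idea (extract the package into a fake project, run its scripts, assert on the results), consider the sketch below. `install_and_run`, the demo zip, and the choice to run only `.py` files are all assumptions made for the example.

```python
import subprocess
import sys
import tempfile
import zipfile
from pathlib import Path


def install_and_run(zip_path, project_dir, skill_name, script_args=()):
    """Extract the package into the project's skill dir and run each .py script."""
    project_dir = Path(project_dir)
    skill_dir = project_dir / ".claude" / "skills" / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(skill_dir)
    results = {}
    for script in sorted(skill_dir.rglob("*.py")):
        proc = subprocess.run(
            [sys.executable, str(script), *script_args],
            cwd=project_dir,
            capture_output=True,
            text=True,
            timeout=60,
        )
        key = script.relative_to(skill_dir).as_posix()
        results[key] = (proc.returncode, proc.stdout)
    return results


# Demonstration with a one-script package built on the fly.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    package = tmp / "skill.zip"
    with zipfile.ZipFile(package, "w") as zf:
        zf.writestr("scripts/hello.py", 'print("installed")\n')
    project = tmp / "demo_project"
    project.mkdir()
    demo_results = install_and_run(package, project, "demo-skill")

# The install test passes only if every script ran from its installed location.
assert demo_results == {"scripts/hello.py": (0, "installed\n")}
```

Running the scripts from the installed location, rather than from the source tree, is what exposes path and interpreter problems that schema-level checks miss.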
**plans/PLAN-V2.1.md §P1.5 "Final-package installation test
(MANDATORY)"** — the v2.1 production engine must include a
`skillforge/engine/install_test.py` module called from
`run_v21_evolution()` AFTER champion eval but BEFORE save_genome
(composite). On failure the run transitions to a new
`install_test_failed` status. The zip export endpoint and seed loader
reject runs in that state.
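The gating described above can be sketched as a small state transition. Everything here other than the `install_test_failed` status name is hypothetical: the `Run` dataclass, the other status strings, and the function names only illustrate where the check sits relative to saving and export.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Run:
    run_id: str
    status: str = "evaluated"  # hypothetical status after champion eval


class ExportRejected(Exception):
    """Raised when a run that failed its install test is exported."""


def finish_run(run: Run,
               install_test: Callable[[Run], bool],
               save_genome: Callable[[Run], None]) -> Run:
    """Run the install test after evaluation and before saving the composite genome."""
    if not install_test(run):
        run.status = "install_test_failed"
        return run  # save_genome is never reached for a broken package
    save_genome(run)
    run.status = "complete"  # hypothetical success status
    return run


def export_zip(run: Run) -> str:
    """Refuse to export runs that failed the install test."""
    if run.status == "install_test_failed":
        raise ExportRejected(f"run {run.run_id} failed its install test")
    return f"{run.run_id}.zip"


saved = []
ok_run = finish_run(Run("run-1"), lambda r: True, lambda r: saved.append(r.run_id))
bad_run = finish_run(Run("run-2"), lambda r: False, lambda r: saved.append(r.run_id))
print(ok_run.status, bad_run.status, saved)  # prints complete install_test_failed ['run-1']
```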
**plans/PLAN-V2.1.md §3.5 "Install-test learnings (post-rebrand)"**
documents all four issues (the three bugs plus the new-live naming wart)
as permanent learnings so future engine work doesn't repeat them.
**Success criterion #11** added to the v2.1 shipped gate.
## journal + PROGRESS
- journal/013-phoenix-liveview-install-test.md (session narrative,
~400 lines covering rich run detail rebuild, two-phase rebrand,
OG meta injection, install test discoveries, subagent dogfood)
- plans/PROGRESS.md (6 dated entries for today)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r added a commit that referenced this pull request on Apr 12, 2026:
…l-test phase (#23)
Co-authored-by: Matt (via Claude Code) <matt@skillforge.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SKLD-bench v2.1 challenge pool: elixir-oban-worker
Third of 7 families being shipped this morning (ecto-schema-changeset PR #9, ecto-query-writer next).
Pool stats
- 100 total challenges
- Tier distribution: 35 easy / 35 medium / 22 hard / 8 legendary
- 13 test fixtures, 12 golden references
- 20 challenges held out
Capability coverage
All 12 capabilities hit the ≥5 binary-family minimum.
⭐ `args-serialization` is the highest-impact safety fix (atom keys in args + struct serialization, both per plugin iron laws).
All three named Oban failure modes covered
- Non-idempotent jobs: `transactional-jobs` (8) + `perform-callback-basics` (6)
- Atom keys in args: `args-serialization` (9)
- Stored structs in args: `args-serialization` (same cluster)
Post-hoc calibration manifest
Drafting subagent was cut off by the Max subscription rate limit at the final `_calibration.json` step. All other content authored cleanly. Manifest generated post-hoc by walking the actual challenge files.
Score.py
Authored by drafting subagent. Uses regex for `String.to_atom` in worker bodies, atom-key detection in args (`%{user_id:` patterns), return-value protocol tags (`:ok`, `{:ok, _}`, `{:error, _}`, `{:discard, _}`, `{:snooze, _}`), presence of `unique:` blocks. Not re-validated post-hoc.
Research provenance
38 citations across 12 capabilities. Key source: `oliver-kriska/claude-elixir-phoenix` (three explicit Oban iron laws: idempotency, atom keys, stored structs).
Tier methodology
Heuristic per SEEDING-PLAN.md item 4.
🤖 Generated with Claude Code