seed(elixir-oban-worker): SKLD-bench v2.1 challenge pool (100 challenges)#11

Merged: ty13r merged 1 commit into main from seed/elixir-oban-worker on Apr 11, 2026

ty13r (Owner) commented Apr 11, 2026

SKLD-bench v2.1 challenge pool: elixir-oban-worker

Third of 7 families being shipped this morning (ecto-schema-changeset PR #9, ecto-query-writer next).

Pool stats

  • Total challenges: 100 (binary curve target hit exactly)
  • Tier distribution: easy 35 / medium 35 / hard 22 / legendary 8
  • Held-out: 20 balanced across tiers
  • Capability coverage: 11 capabilities + 1 foundation = 12 dimensions
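
The held-out split can be sanity-checked arithmetically. The sketch below assumes the 20 held-out challenges are allocated proportionally to tier size with largest-remainder rounding. The PR does not state the actual per-tier held-out counts, and `allocate_holdout` is a hypothetical helper, not part of the repository.

```python
from math import floor


def allocate_holdout(tier_sizes: dict[str, int], holdout_total: int) -> dict[str, int]:
    """Split holdout_total across tiers proportionally (largest-remainder rounding)."""
    pool = sum(tier_sizes.values())
    # Exact (fractional) share of the holdout for each tier.
    exact = {tier: n * holdout_total / pool for tier, n in tier_sizes.items()}
    alloc = {tier: floor(share) for tier, share in exact.items()}
    leftover = holdout_total - sum(alloc.values())
    # Hand the remaining slots to the tiers with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - alloc[t], reverse=True)
    for tier in by_remainder[:leftover]:
        alloc[tier] += 1
    return alloc


# Tier sizes taken from the pool stats above.
tiers = {"easy": 35, "medium": 35, "hard": 22, "legendary": 8}
holdout = allocate_holdout(tiers, 20)
print(holdout)
```

Under this assumption the exact shares are 7.0 / 7.0 / 4.4 / 1.6, so the single leftover slot goes to the legendary tier.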

Capability coverage

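| Capability | Primary | Secondary | Easy | Medium | Hard | Legendary |
|---|---|---|---|---|---|---|
| testing-workers | 12 | | 4 | 5 | 2 | 1 |
| return-values | 10 | 12 | 5 | 4 | 1 | 0 |
| unique-constraints | 10 | 4 | 3 | 4 | 2 | 1 |
| args-serialization ⭐ | 9 | 25 | 4 | 3 | 1 | 1 |
| worker-philosophy (F) | 9 | 37 | 3 | 2 | 3 | 1 |
| cron-scheduling | 8 | 2 | 3 | 2 | 3 | 0 |
| recurring-jobs-vs-cron | 8 | 5 | 3 | 2 | 2 | 1 |
| transactional-jobs | 8 | 7 | 1 | 3 | 3 | 1 |
| queues-and-priority | 7 | 7 | 2 | 3 | 1 | 1 |
| retry-strategy | 7 | 16 | 2 | 2 | 2 | 1 |
| perform-callback-basics | 6 | 22 | 4 | 1 | 1 | 0 |
| telemetry-and-observability | 6 | 1 | 1 | 4 | 1 | 0 |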

All 12 capabilities hit the ≥5 binary-family minimum. ⭐ args-serialization is the highest-impact safety fix (atom keys in args + struct serialization, both per plugin iron laws).

All three named Oban failure modes covered

  1. Non-idempotent jobs: transactional-jobs (8) + perform-callback-basics (6)
  2. Atom keys instead of strings in args: args-serialization (9)
  3. Storing Elixir structs in args: args-serialization (same cluster)

Post-hoc calibration manifest

Drafting subagent was cut off by the Max subscription rate limit at the final _calibration.json step. All other content authored cleanly. Manifest generated post-hoc by walking the actual challenge files.
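
The PR does not show how the manifest was rebuilt. As a rough sketch of "walking the challenge files", the snippet below assumes, hypothetically, one JSON file per challenge with `tier` and `held_out` fields, and that files starting with `_` are metadata. The real file layout and the schema of `_calibration.json` are not given in this PR.

```python
import json
from collections import Counter
from pathlib import Path


def build_manifest(challenge_dir: Path) -> dict:
    """Count challenges per tier and held-out challenges by reading each file."""
    tiers: Counter = Counter()
    held_out = 0
    for path in sorted(challenge_dir.glob("*.json")):
        if path.name.startswith("_"):
            continue  # skip metadata such as _calibration.json (assumed convention)
        challenge = json.loads(path.read_text())
        tiers[challenge["tier"]] += 1           # "tier" is an assumed field name
        held_out += bool(challenge.get("held_out", False))  # assumed field name
    return {"total": sum(tiers.values()), "tiers": dict(tiers), "held_out": held_out}
```

A manifest built this way reflects what is actually on disk, which is why it can double as a check against the pool stats quoted in the description.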

Score.py

Authored by the drafting subagent. Uses regexes to detect `String.to_atom` calls in worker bodies, atom keys in args (`%{user_id:` patterns), return-value protocol tags (`:ok`, `{:ok, _}`, `{:error, _}`, `{:discard, _}`, `{:snooze, _}`), and the presence of `unique:` blocks. Not re-validated post-hoc.
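
To make the kind of check concrete, here is a minimal sketch of regex-based detectors in the spirit of the description above. It is not the actual Score.py: the pattern names, the `check_worker` function, and the exact regexes are assumptions for illustration, and regexes like these will miss or over-match some Elixir code.

```python
import re

# Hypothetical detectors; the real Score.py patterns are not shown in this PR.
ATOM_KEY_ARGS_RE = re.compile(r"%\{\s*[a-z_][A-Za-z0-9_]*:\s")    # %{user_id: ...}
STRING_TO_ATOM_RE = re.compile(r"\bString\.to_atom\(")
UNIQUE_OPT_RE = re.compile(r"\bunique:\s*\[")
RETURN_TAG_RE = re.compile(r":ok\b|\{:(?:ok|error|discard|snooze),")


def check_worker(source: str) -> dict[str, bool]:
    """Report which patterns appear anywhere in an Elixir worker source string."""
    return {
        "atom_key_args": bool(ATOM_KEY_ARGS_RE.search(source)),
        "string_to_atom": bool(STRING_TO_ATOM_RE.search(source)),
        "has_unique": bool(UNIQUE_OPT_RE.search(source)),
        "has_return_tag": bool(RETURN_TAG_RE.search(source)),
    }


BAD_WORKER = """
def perform(%Oban.Job{args: %{user_id: id}}) do
  String.to_atom(id)
end
"""

GOOD_WORKER = """
use Oban.Worker, queue: :default, unique: [period: 60]
def perform(%Oban.Job{args: %{"user_id" => id}}) do
  {:ok, id}
end
"""

bad = check_worker(BAD_WORKER)
good = check_worker(GOOD_WORKER)
```

Note that the atom-key pattern flags any `%{key: ...}` map literal, not only maps bound to `args`, which is one reason a scorer like this should be treated as best-effort.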

Research provenance

38 citations across 12 capabilities. Key source: oliver-kriska/claude-elixir-phoenix (three explicit Oban iron laws: idempotency, atom keys, stored structs).

Tier methodology

Heuristic per SEEDING-PLAN.md item 4.

🤖 Generated with Claude Code

Authors the complete SKLD-bench v2.1 family for elixir-oban-worker per
the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Third family
shipped this morning. The drafting subagent was cut off by the Max rate
limit at the final _calibration.json step; the manifest was generated
post-hoc by walking the actual challenge pool.

Pool stats:
- 100 total challenges (binary curve target hit exactly)
- Tier distribution: 35 easy / 35 medium / 22 hard / 8 legendary
- 11 capabilities + 1 foundation = 12 dimensions covered
- 13 test fixtures, 12 golden references
- 20 challenges held out (~20% balanced across tiers)

Capability primary-tag counts (target >=5 for binary, all met):
- testing-workers: 12 (highest)
- return-values: 10
- unique-constraints: 10
- args-serialization: 9 (highest-impact safety fix per plugin iron laws)
- worker-philosophy (foundation): 9
- cron-scheduling: 8
- recurring-jobs-vs-cron: 8
- transactional-jobs: 8
- queues-and-priority: 7
- retry-strategy: 7
- perform-callback-basics: 6
- telemetry-and-observability: 6

All three named Oban failure modes are covered:
- Non-idempotent jobs: transactional-jobs + perform-callback-basics
- Atom keys in args: args-serialization (9 challenges)
- Stored structs in args: args-serialization (same cluster)

Score.py: authored by drafting subagent. Uses regex for String.to_atom
calls in worker bodies, atom keys in args (%{user_id: patterns), return
value protocol (:ok / {:ok, _} / {:error, _} / {:discard, _} / {:snooze, _}),
presence of unique: blocks. Not re-validated post-hoc; treat as best-effort.

Tier methodology: heuristic per SEEDING-PLAN.md item 4.
Research: 38 citations across 12 capabilities (see research.md).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r merged commit 927ca63 into main on Apr 11, 2026
ty13r deleted the seed/elixir-oban-worker branch on April 11, 2026, 13:35
ty13r pushed a commit that referenced this pull request Apr 12, 2026
…l-test phase

Matt asked "should we try installing the liveview skill now?" after
everything else said green (rich run detail page, Gold Standard Checklist
all green indicators, zip export validator pass, Package Explorer showing
16 files). The install test revealed three real bugs that had passed
every schema-level quality gate. This commit fixes the bugs, promotes the
install test to a mandatory pipeline phase, and codifies the learnings
into PLAN-V2.1 so the v2.1 engine never ships another broken skill.

## The 3 bugs (found by actually running the package)

**1. validate.sh used declare -A (bash 4+ only)**
macOS ships bash 3.2. Line 49 `declare -A HITS_BY` failed with
`declare: -A: invalid option`. The enrichment agent that generated this
script tested it on Linux and never verified macOS.

**2. validate.sh piped detectors into report (subshell variable loss)**
Even after fixing the declare bug with eval + ${!var} indirect expansion,
the summary showed "all clean" with TOTAL_HITS=0 while the detector output
reported real hits. Bash runs every stage of a pipeline in a subshell, so
the assignments made inside `report` never propagated back to the parent
shell. Fix: process substitution `report "key" "fix" < <(detector)` keeps
report running in the parent shell. The bug is not specific to bash 3.2:
it would have bitten on Linux bash 4+ too unless the script enabled
`shopt -s lastpipe`.

**3. main_helper.py migrate produced malformed Elixir**
- Left `<%= ... %>` wrappers around `<.link>` components (invalid HEEx)
- Lost trailing `class: "btn"` keyword args instead of absorbing as attrs
- Put `:for` on the outer `<ul>` instead of the inner `<li>` (would
  duplicate the whole list)
- Skipped `live_redirect user.name, to: ...` because the regex only
  matched double-quoted text
- Missed `Routes.user_path(socket, :index)` without leading `@` inside
  `push_navigate` calls

Fixes:
- New `_strip_eex_around_link` post-processing pass that removes `<%= %>`
  around `.link` components and absorbs trailing keyword args as
  component attrs via `_absorb_kw_args_as_attrs`
- New `_format_link_text` helper that detects quoted-literal vs Elixir
  expression text and wraps expressions in HEEx curly syntax `{user.name}`
- Rewrote `_EEX_FOR_BLOCK_RE` / `_EEX_IF_BLOCK_RE` to match the INNER
  tag inside the block, not any wrapping outer tag
- Widened `_ROUTES_CALL_RE` with optional `@?` before socket
- Excluded `%` from `_LIVE_*_RE` target groups so `%>` doesn't get
  consumed

**Plus a minor new-live UX wart**: `dashboard_live` produced
`MyAppWeb.DashboardLiveLive`. Fix: strip a trailing `_live` from the
input before camel case conversion; clearer help text + error message.
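
The naming fix can be sketched as follows. `live_module_name` is a hypothetical helper written for illustration, and the `MyAppWeb` default is taken from the example above; the real new-live command may differ in details.

```python
def live_module_name(raw: str, web_module: str = "MyAppWeb") -> str:
    """Build a LiveView module name, tolerating an input that already ends in _live."""
    name = raw.strip()
    if name.endswith("_live"):
        name = name[: -len("_live")]   # dashboard_live -> dashboard
    # snake_case -> CamelCase, ignoring empty segments from stray underscores.
    camel = "".join(part.capitalize() for part in name.split("_") if part)
    if not camel:
        raise ValueError(f"cannot derive a module name from {raw!r}")
    return f"{web_module}.{camel}Live"


with_suffix = live_module_name("dashboard_live")
without_suffix = live_module_name("dashboard")
multi_word = live_module_name("user_settings_live")
```

Stripping the suffix before conversion means both spellings of the input map to the same module, so the `Live` suffix is appended exactly once.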

## Patch flow

1. Fixed scripts written to /tmp/skld-fixes/scripts/
2. Tested standalone against a fake Phoenix project (32 anti-pattern
   hits, correct summary, FAIL exit 1)
3. Tested migrate against pre_1_7_user_list.ex — 9 rewrite passes
   producing valid Phoenix 1.7+ HEEx with :for on <li>, :if on <span>,
   absorbed class="btn", {user.name} curly interpolation,
   push_navigate(socket, to: ~p"/users")
4. New `scripts/mock_pipeline/patch_composite_scripts.py` helper
   patches the seed JSON's composite genome supporting_files in place
   (replaces the bad validate.sh + main_helper.py values)
5. Nuked local DB, rebooted uvicorn, downloaded zip, extracted,
   verified all scripts work from the installed location

## End-to-end install verification

- `/tmp/skld-phoenix-demo/` — realistic Phoenix project dir with
  `mix.exs`, `lib/my_app_web/live/`, and the composite skill dropped
  into `.claude/skills/elixir-phoenix-liveview-composite/`
- validate.sh: 32 anti-pattern hits across 14 detectors, correct
  summary, FAIL exit 1
- main_helper.py scan: 35 gcc-style diagnostics
- main_helper.py migrate: valid HEEx output, 9 rewrite passes
- main_helper.py new-live dashboard: scaffolded MyAppWeb.DashboardLive
  (no DashboardLiveLive)

## Dogfood subagent test

Dispatched an Opus subagent with instructions to read the installed
skill and write a `TaskListLive` module for a Tasks feature. The
subagent produced a 190-line file that scanned CLEAN on the first
try — zero anti-pattern hits. It used every Phoenix 1.7+ idiom the
skill teaches: streams with phx-update="stream", :for on <li>, :if
for filtering, <.link> components, ~p verified routes, to_form/2
forms, typed %Action{} funnel into pure handle_action/2 dispatcher.

The subagent also identified two real skill gaps (missing "filter a
stream via :if" pattern, missing "hoist inline form into assign"
tip) — valuable follow-up items for the next skill iteration.

## Pipeline: install test is now MANDATORY

**scripts/mock_pipeline/NEXT-SEED-RUN-PLAYBOOK.md §Phase 7.5** — every
bridge seed run must run the install test before being marked complete.
The playbook includes the exact bash script that downloads the zip,
creates a fake project, runs every script, asserts on outputs, and
optionally dispatches a subagent dogfood test.

**plans/PLAN-V2.1.md §P1.5 "Final-package installation test
(MANDATORY)"** — the v2.1 production engine must include a
`skillforge/engine/install_test.py` module called from
`run_v21_evolution()` AFTER champion eval but BEFORE save_genome
(composite). On failure the run transitions to a new
`install_test_failed` status. The zip export endpoint and seed loader
reject runs in that state.
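
The ordering described above (install test after champion eval, before saving the composite genome) can be sketched as a small gate. Everything here is an assumption made for illustration: the `Run` class, `finish_run`, `can_export`, and the `"complete"` status are hypothetical; only the `install_test_failed` status name comes from the plan.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Run:
    status: str = "running"


def finish_run(
    run: Run,
    install_test: Callable[[], bool],
    save_genome: Callable[[], None],
) -> Run:
    """Run the install test first; persist the genome only if it passes."""
    if not install_test():
        run.status = "install_test_failed"
        return run                  # genome is never saved on failure
    save_genome()
    run.status = "complete"         # hypothetical success status
    return run


def can_export(run: Run) -> bool:
    """Export and seed loading refuse runs whose install test failed."""
    return run.status != "install_test_failed"
```

Placing the gate before the save step means a broken package never reaches storage, so downstream consumers only need to check the run status.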

**plans/PLAN-V2.1.md §3.5 "Install-test learnings (post-rebrand)"**
documents the four bugs as permanent learnings so future engine work
doesn't repeat them.

**Success criterion #11** added to the v2.1 shipped gate.

## journal + PROGRESS

- journal/013-phoenix-liveview-install-test.md (session narrative,
  ~400 lines covering rich run detail rebuild, two-phase rebrand,
  OG meta injection, install test discoveries, subagent dogfood)
- plans/PROGRESS.md (6 dated entries for today)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r added a commit that referenced this pull request Apr 12, 2026
…l-test phase (#23)

Co-authored-by: Matt (via Claude Code) <matt@skillforge.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>