seed(elixir-pattern-match-refactor): SKLD-bench v2.1 challenge pool (120 challenges) #15

Merged
ty13r merged 1 commit into main from seed/elixir-pattern-match-refactor
Apr 11, 2026

ty13r (Owner) commented Apr 11, 2026

SKLD-bench v2.1 challenge pool: elixir-pattern-match-refactor

Seventh and final family shipped this morning, following ecto-schema-changeset #9, ecto-query-writer #10, oban-worker #11, security-linter #12, ecto-sandbox-test #13, and phoenix-liveview #14.

Partial-state recovery note

The overnight drafting subagent completed the easy/medium/hard tiers, 15 fixtures, and 12 goldens, then hit the Max subscription rate limit before starting the legendary tier. This PR completes the family with hand-authored:

  • family.json (with known_gaps declaration)
  • seed.json — gen 0 SkillGenome with 11 starter variants
  • evaluation/score.py — refactor-quality scorer
  • evaluation/criteria.json
  • evaluation/environment.yml
  • challenges/_calibration.json — generated post-hoc

Pool stats

  • Total challenges: 120 (rich curve target 150 — 80% of target)
  • Tier distribution: easy 35 / medium 47 / hard 38 / legendary 0 ⚠️
  • Held-out: 20 balanced across tiers
  • Capability coverage: 10 capabilities + 1 foundation = 11 dimensions

⚠️ Known gap: legendary tier empty

The legendary tier has 0 of 30 target challenges. Drafting was cut off before any legendary challenges were written.

One orphaned golden/elixir-pattern-match-refactor-legendary-01.ex file exists (~75 lines of reference output) but no corresponding challenge JSON. Future augmentation can use it as a starting point.
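
A check like the following could surface orphaned goldens before future augmentation. It is a minimal sketch, not part of this PR: the function name `find_orphaned_goldens` is hypothetical, and it assumes goldens live in `golden/` as `.ex` files and that each challenge JSON under `challenges/` shares its golden's file stem.

```python
from pathlib import Path


def find_orphaned_goldens(family_dir: Path) -> list[str]:
    """Return stems of golden files that have no challenge JSON with the same stem.

    Assumed layout (hypothetical): <family>/golden/<id>.ex and
    <family>/challenges/**/<id>.json.
    """
    challenge_ids = {p.stem for p in (family_dir / "challenges").rglob("*.json")}
    golden_ids = {p.stem for p in (family_dir / "golden").glob("*.ex")}
    return sorted(golden_ids - challenge_ids)
```

Under those assumptions, running it on this family would be expected to report only the legendary-01 golden.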

Impact: 120 challenges across easy/medium/hard still provide substantial coverage across 11 dimensions. Champion fitness curves won't have a meaningful "legendary" anchor until augmented.

Per-capability primary coverage

| Capability | Primary | Notes |
|---|---|---|
| with-expressions ⭐ | 17 | |
| refactor-philosophy (F) | 16 | |
| defensive-nil-checks-elimination ⭐ | 14 | |
| enum-vs-recursion-choice | 13 | |
| pipe-operator-flows | 12 | |
| guard-clauses | 10 | below 12 |
| recursive-functions | 9 | below 12 |
| function-head-pattern-matching | 8 | below 12 |
| binary-pattern-matching-basic | 8 | below 12 |
| map-and-struct-destructuring | 7 | below 12 |
| cond-and-if-reduction | 6 | below 12 |

⭐ = highest-priority capabilities per the research dossier. with-expressions is the bridge between pattern matching and error handling; defensive-nil-checks-elimination is the most-cited single complaint.

6 capabilities are below the 12-per-cap rich target. All are covered across the remaining 3 tiers. Augmentation is a follow-up.
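
To size the follow-up, the shortfall per capability can be computed directly from the primary counts above. This is a small illustrative sketch; the names `PRIMARY_COUNTS`, `RICH_TARGET_MIN`, and `below_target` are hypothetical and not taken from the repository.

```python
RICH_TARGET_MIN = 12  # the 12-per-capability rich target quoted above

# Primary challenge counts, copied from the coverage table.
PRIMARY_COUNTS = {
    "with-expressions": 17,
    "refactor-philosophy": 16,
    "defensive-nil-checks-elimination": 14,
    "enum-vs-recursion-choice": 13,
    "pipe-operator-flows": 12,
    "guard-clauses": 10,
    "recursive-functions": 9,
    "function-head-pattern-matching": 8,
    "binary-pattern-matching-basic": 8,
    "map-and-struct-destructuring": 7,
    "cond-and-if-reduction": 6,
}


def below_target(counts: dict[str, int], minimum: int = RICH_TARGET_MIN) -> dict[str, int]:
    """Map each under-covered capability to the number of challenges it is short."""
    return {cap: minimum - n for cap, n in counts.items() if n < minimum}


gaps = below_target(PRIMARY_COUNTS)
print(len(gaps), sum(gaps.values()))  # 6 capabilities, 24 challenges short in total
```

The counts sum to 120, so each challenge has exactly one primary capability, and 24 additional challenges would bring every capability up to 12.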

Score.py validation

| Check | Result | Target | Status |
|---|---|---|---|
| Sanity (golden easy-01, defensive-nil elimination) | 0.9593 | ≥ 0.9 | met |
| Discrimination (ruby_style_user_service fixture) | 0.2703 | < 0.7 (fail) | met |

Discrimination headroom: 0.69 — excellent.
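
The headroom figure is consistent with the gap between the two validation scores. The arithmetic below assumes that definition (golden score minus anti-pattern fixture score); the variable names are illustrative.

```python
golden_score = 0.9593   # sanity check: golden easy-01
fixture_score = 0.2703  # discrimination check: ruby_style_user_service fixture

# Assumed definition: headroom is the gap between the two scores.
headroom = golden_score - fixture_score
print(round(headroom, 2))  # 0.69
```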

Score.py approach

This family scores refactor quality via structural counting rather than fixed substring matches:

Positive signals (rewarded):

  • Multi-clause function heads (same name, multiple def with different patterns)
  • |> pipe operator usage
  • with expressions
  • when guard clauses
  • Map/struct destructure in function heads (%User{id: id})
  • List head/tail patterns ([h | t])
  • Binary patterns (<<"prefix", rest::binary>>)
  • Enum.map / Enum.reduce / Enum.filter
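
As a rough illustration of structural counting for most of the positive signals listed above, here is a minimal regex-based sketch. It is not the actual score.py: the pattern table, the function name, and the sample Elixir snippet are all assumptions, and multi-clause head counting is left out for brevity. Plain regexes will also match inside strings and comments, which a real scorer may need to handle.

```python
import re

# Hypothetical pattern table; the real scorer's regexes are not shown in this PR.
POSITIVE_PATTERNS = {
    "pipe": re.compile(r"\|>"),
    "with": re.compile(r"\bwith\b"),
    "guard": re.compile(r"\bwhen\b"),
    # %User{...} or %{...} inside a def/defp head on a single line
    "head_destructure": re.compile(r"^\s*defp?\s+\w+[?!]?\([^)\n]*%[\w.]*\{", re.MULTILINE),
    "list_head_tail": re.compile(r"\[\s*\w+\s*\|\s*\w+\s*\]"),
    "binary_pattern": re.compile(r"<<.+?>>"),
    "enum_call": re.compile(r"\bEnum\.(?:map|reduce|filter)\b"),
}


def count_positive_signals(source: str) -> dict[str, int]:
    """Count occurrences of each idiomatic construct in Elixir source text."""
    return {name: len(rx.findall(source)) for name, rx in POSITIVE_PATTERNS.items()}


SAMPLE = '''
def greet(%User{name: name}) when is_binary(name), do: "hi " <> name
def greet(_), do: "hi"

def sum([h | t]), do: h + sum(t)
def sum([]), do: 0

def parse(<<"id:", rest::binary>>), do: rest

def total(items) do
  items
  |> Enum.map(& &1.price)
  |> Enum.reduce(0, &+/2)
end
'''

counts = count_positive_signals(SAMPLE)
```

How such counts are weighted into a final score is not specified here, so the sketch stops at raw counts.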

Anti-patterns (penalized):

  • if / case / cond keyword counts (total ≤2 for refactored output)
  • is_nil() defensive guards
  • x && x.field Ruby-style safe-nav pun
  • Intermediate temp/tmp/result variables breaking pipe flow
  • String.starts_with? / ends_with? instead of binary patterns
  • Complex function calls in guards
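
The anti-pattern side can be sketched the same way. The regexes, names, and sample below are assumptions for illustration rather than the contents of score.py; only the budget of two branch keywords comes from the list above, and the "complex function calls in guards" check is omitted because it is hard to approximate with a single regex.

```python
import re

# Hypothetical anti-pattern table for Elixir source text.
ANTI_PATTERNS = {
    "branch_keyword": re.compile(r"\b(?:if|case|cond)\b"),
    "is_nil_check": re.compile(r"\bis_nil\("),
    # x && x.field: the same expression repeated on both sides of &&
    "safe_nav": re.compile(r"\b([\w.]+)\s*&&\s*\1\.\w+"),
    # assignment to a temp/tmp/result variable at the start of a line
    "temp_var": re.compile(r"^\s*(?:temp|tmp|result)\w*\s*=(?!=)", re.MULTILINE),
    "string_prefix_check": re.compile(r"\bString\.(?:starts_with|ends_with)\?"),
}

BRANCH_BUDGET = 2  # "total <= 2 for refactored output"


def count_anti_patterns(source: str) -> dict[str, int]:
    """Count occurrences of each penalized construct."""
    return {name: len(rx.findall(source)) for name, rx in ANTI_PATTERNS.items()}


def branch_budget_exceeded(counts: dict[str, int], budget: int = BRANCH_BUDGET) -> bool:
    """True when if/case/cond usage is above the allowed total."""
    return counts["branch_keyword"] > budget


RUBY_STYLE = '''
def display_name(user) do
  if is_nil(user) do
    "anonymous"
  else
    tmp = user && user.name
    if String.starts_with?(tmp, "Dr.") do
      tmp
    else
      "Mx. " <> tmp
    end
  end
end
'''

counts = count_anti_patterns(RUBY_STYLE)
```

The sample stays within the branch-keyword budget yet still trips four other checks, which suggests why the scorer combines several signals instead of relying on keyword counts alone.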

Research provenance

38 citations across 11 capabilities in research.md. Key sources:

  • BoothIQ "150k lines of vibe-coded Elixir" post-mortem: "Claude writes Ruby-style Elixir — if/then/else chains, defensive nil-checking, early returns"
  • HN troupo: "writes Java even if it's Elixir"
  • HN dnautics: "case functioncall() do nil -> ... end instead of idiomatic if var = functioncall() do"
  • Elixir Forum Alex66: "Still correcting if/else chains that should be pattern matches"
  • Dashbit, José Valim on idiomatic Elixir

Tier methodology

Heuristic per SEEDING-PLAN.md item 4.

🤖 Generated with Claude Code

Authors the complete SKLD-bench v2.1 family for elixir-pattern-match-refactor
per the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. Seventh and
FINAL family shipped this morning. The drafting subagent authored 120
challenges + 15 test fixtures + 12 golden references, then hit the Max
subscription rate limit before any legendary-tier challenge was written.
This commit completes the family with hand-authored family.json, seed.json,
score.py, criteria.json, environment.yml, and _calibration.json.

Pool stats:
- 120 total challenges (rich curve target 150)
- Tier distribution: 35 easy / 47 medium / 38 hard / 0 legendary
- 10 capabilities + 1 foundation = 11 dimensions covered
- 15 test fixtures, 12 golden references
- 20 challenges held out (~17%, balanced across tiers)

Known gap: legendary tier has 0 challenges (target 30). The drafting
agent completed easy/medium/hard tiers but was cut off before the
legendary tier was started. One orphaned legendary golden reference
file exists but has no corresponding challenge JSON. The family ships
as-is because 120 challenges across easy/medium/hard already provide
substantial evaluation coverage.

Per-capability primary counts (rich target 12-16):
- with-expressions: 17
- refactor-philosophy (foundation): 16
- defensive-nil-checks-elimination: 14
- enum-vs-recursion-choice: 13
- pipe-operator-flows: 12
- guard-clauses: 10 [below 12]
- recursive-functions: 9 [below 12]
- function-head-pattern-matching: 8 [below 12]
- binary-pattern-matching-basic: 8 [below 12]
- map-and-struct-destructuring: 7 [below 12]
- cond-and-if-reduction: 6 [below 12]

Score.py: regex-based structural scorer. Counts function heads per name
(more = better — indicates multi-clause pattern matching), counts
if/case/cond constructs (fewer = better), detects pipe usage, with
expressions, defensive is_nil checks, Ruby-style `x && x.field` puns,
intermediate temp vars breaking pipe flow.
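
A minimal sketch of the "function heads per name" count described above, assuming a simple line-anchored regex. The names here are hypothetical and this is not the committed score.py; in particular, it groups clauses by name only and ignores arity.

```python
import re
from collections import Counter

# Matches "def name" / "defp name" at the start of a line. defmodule and
# defmacro do not match, because whitespace must directly follow def/defp.
HEAD_RE = re.compile(r"^\s*defp?\s+([a-z_]\w*[?!]?)", re.MULTILINE)


def heads_per_name(source: str) -> Counter:
    """Count how many clauses each function name has (arity is ignored)."""
    return Counter(HEAD_RE.findall(source))


def multi_clause_names(source: str) -> list[str]:
    """Names defined with more than one head, i.e. likely pattern-matched."""
    return sorted(name for name, n in heads_per_name(source).items() if n > 1)


SAMPLE = '''
defmodule Shape do
  def area({:circle, r}), do: 3.14 * r * r
  def area({:rect, w, h}), do: w * h
  def area(_), do: 0
  defp valid?(n) when n > 0, do: true
  defp valid?(_), do: false
  def describe(shape), do: inspect(shape)
end
'''

print(multi_clause_names(SAMPLE))  # ['area', 'valid?']
```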

Score.py validation:
- Sanity (golden easy-01, defensive-nil elimination): 0.9593 (above 0.9 target)
- Discrimination (ruby_style fixture): 0.2703 (well below the 0.7 pass threshold)
- Discrimination headroom: 0.69

Ruby/Java-style imperative Elixir is the most-cited Elixir+Claude
complaint (per research). The pool teaches Claude to write idiomatic
Elixir by refactoring that code into pattern-matched function heads,
pipes, and with expressions.

Tier methodology: heuristic per SEEDING-PLAN.md item 4.
Research: 38 citations across 11 capabilities (see research.md).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ty13r ty13r merged commit 95ec4ae into main Apr 11, 2026
@ty13r ty13r deleted the seed/elixir-pattern-match-refactor branch April 11, 2026 13:55
ty13r pushed a commit that referenced this pull request Apr 13, 2026
Covers Phases 0-5 of PLAN-V2.1.3, the $53 API incident, Bible
rewrite, and the 6-workstream frontend sprint (PR #36). Updates
PROGRESS.md with frontend sprint completion entry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>