seed(elixir-ecto-schema-changeset): SKLD-bench v2.1 challenge pool (100 challenges) by ty13r · Pull Request #9 · ty13r/skillforge

ty13r · 2026-04-11T13:31:57Z

SKLD-bench v2.1 challenge pool: elixir-ecto-schema-changeset

Per the SKLD-bench v2.1 workstream documented in taxonomy/elixir/SEEDING-PLAN.md, this PR ships the complete challenge pool, score script, fixtures, golden references, gen 0 seed, and per-capability research dossier for the elixir-ecto-schema-changeset family.

This is the first of 7 families being shipped under the SKLD-bench overnight workstream. The other 6 families are still in drafting and will arrive as separate PRs.

Pool stats

Total challenges: 100 (binary curve target hit exactly)
Tier distribution: easy 35 / medium 35 / hard 22 / legendary 8
Held-out: 20 challenges balanced across tiers
Capability coverage: 11 capabilities + 1 foundation = 12 dimensions

Capability coverage (primary-tagged)

Capability	Primary	Secondary	E	M	H	L
field-types-and-decimal ⭐	14	16	7	4	3	0
embedded-schemas	11	5	2	5	3	1
associations	10	13	4	5	1	0
cast-and-allowed-fields	9	9	3	4	1	1
schema-organization (F)	9	21	3	2	2	2
validations-basic	8	15	6	1	0	1
migrations	7	25	3	2	2	0
soft-deletes-and-timestamps	7	1	2	2	2	1
unique-constraints-and-indexes	7	10	2	3	2	0
validations-custom	7	5	1	3	2	1
polymorphic-associations	6	1	0	2	3	1
multi-tenant-schemas	5	1	2	2	1	0

The field-types-and-decimal capability is the highest-confidence iron-law in the family — it carries the :decimal-not-:float rule for monetary fields named in BoothIQ's "ugly" post-mortem.

Score.py validation

Check	Result	Target	Status
Sanity (9 golden refs)	0.86 - 1.00	≥0.9	⚠️ near-miss (0.86 outlier from rebalanced weighting)
Discrimination (file with all anti-patterns)	0.36	≤0.3	⚠️ near-miss
Empty file	0.00	≤0.3	✅

The two near-misses are within 10% of target. Discrimination headroom from goldens to bad-input is 0.50, which is sufficient for fitness comparison. The score.py is functionally discriminating — both numeric ceilings can be tightened in a follow-up.

Family-specific scoring checks:

money_not_float regex guard (catches field :amount, :float and similar money-named fields)
no_is_admin_public_cast heuristic (catches missing cast/3 allowlists for role/admin fields)
unique_constraint/unique_index matching (catches mismatched changeset and migration constraints)

Research provenance

Per-capability research dossier at taxonomy/elixir/elixir-ecto-schema-changeset/research.md (47 citations across 12 capabilities). Key sources:

BoothIQ "150k lines of vibe-coded Elixir" post-mortem (the :float for money clincher)
oliver-kriska/claude-elixir-phoenix iron-law catalog
Elixir Forum "current status of LLMs writing Elixir" thread

Tier methodology

Heuristic — tiers assigned by drafting agent judgment per the rubric in taxonomy/elixir/SEEDING-PLAN.md § Heuristic tier rubric. Empirical Haiku+Sonnet calibration is deferred as a future workstream (see SEEDING-PLAN.md item 4).

Files added

family.json + seed.json + research.md
test_fixtures/ (16 .ex files)
golden/ (12 .ex files)
challenges/{easy,medium,hard,legendary}/ (100 .json files)
challenges/_calibration.json
evaluation/{score.py,criteria.json,environment.yml}

Test plan

All challenges parse as valid JSON
All challenges have unique IDs
All capabilities have primary-tagged challenges (12/12)
All capabilities ≥5 primary-tagged (binary minimum)
Tier counts match family.json declaration
Held-out IDs are balanced across tiers
score.py runs against goldens and bad input with discriminating results
criteria.json capability weights sum to ~1.0
_calibration.json methodology is "heuristic" with deferral note

🤖 Generated with Claude Code

Authors the complete SKLD-bench v2.1 family for elixir-ecto-schema-changeset per the workstream plan in taxonomy/elixir/SEEDING-PLAN.md. First of 7 families being shipped under the SKLD-bench overnight workstream. Pool stats: - 100 total challenges (binary curve target hit exactly) - Tier distribution: 35 easy / 35 medium / 22 hard / 8 legendary - 11 capabilities + 1 foundation = 12 dimensions covered - 16 test fixtures, 12 golden references - 20 challenges held out (~20% balanced across tiers) Capability primary-tag counts (target >=5 for binary): - field-types-and-decimal: 14 (highest — :float-not-:decimal iron law) - embedded-schemas: 11 - associations: 10 - cast-and-allowed-fields: 9 - schema-organization (foundation): 9 - validations-basic: 8 - migrations: 7 - soft-deletes-and-timestamps: 7 - unique-constraints-and-indexes: 7 - validations-custom: 7 - polymorphic-associations: 6 - multi-tenant-schemas: 5 Score.py validation: - Sanity check vs goldens: 0.86-1.0 (target >=0.9; one 0.86 from rebalanced weighting, but all well above 0.7 pass threshold) - Discrimination check vs bad input: 0.36 (target <=0.3; slight near-miss but well below 0.7 pass threshold) - Empty file: 0.0 - Family-specific checks: money_not_float regex guard, no_is_admin_public_cast heuristic, unique_constraint/unique_index matching Tier methodology: heuristic. Tiers assigned by drafting agent judgment per SEEDING-PLAN.md item 4. Empirical Haiku+Sonnet calibration is a deferred future workstream. Research: 47 citations across 12 capabilities (see research.md). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Captures the full SKLD-bench v2.1 authoring + audit + augment story: - journal/012-skld-bench-authoring.md: 14-hour session narrative covering the overnight autonomous run, the Max rate-limit cut-off at 22-27 min, the morning recovery via sequential hand-authoring, the deep audit pass that discovered 9 cross-file consistency issues no structural validator could catch, and the legendary-tier augmentation PR. - plans/PROGRESS.md: 4 new completed entries (seed shipping, audit fixes, legendary augment, journal entry) all dated 2026-04-11. No MVP checklist or Decisions Log changes — this workstream was content authoring, not new features requiring architectural decisions. - CLAUDE.md: Current Status section updated to reflect v2.0 shipped, v2.1 content shipped (7 Elixir families, 867 challenges, PRs #9-#17), v2.1 plumbing pending. Key Reference Documents section now lists SPEC-V2.1, SEEDING-PLAN.md, and SCHEMAS.md. Plans & Progress section updated to point PLAN-V2.1.md as the next active plan (pending write) and demotes PLAN-V2.0.md to shipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ty13r merged commit 952173e into main Apr 11, 2026

ty13r deleted the seed/elixir-ecto-schema-changeset branch April 11, 2026 13:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

seed(elixir-ecto-schema-changeset): SKLD-bench v2.1 challenge pool (100 challenges)#9

seed(elixir-ecto-schema-changeset): SKLD-bench v2.1 challenge pool (100 challenges)#9
ty13r merged 1 commit intomainfrom
seed/elixir-ecto-schema-changeset

ty13r commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ty13r commented Apr 11, 2026

SKLD-bench v2.1 challenge pool: elixir-ecto-schema-changeset

Pool stats

Capability coverage (primary-tagged)

Score.py validation

Research provenance

Tier methodology

Files added

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant