The goal of this phase of the project is to fill in the two tables below. Note that the goal does not include building a central framework for evals or training, performance tuning, or scaling beyond a reasonable level.
Pure evaluation:
| Model | Benchmark | Performance |
| --- | --- | --- |
| ... | ... | ... |
| ... | ... | ... |
RL training:
| Model | Benchmark | Performance |
| --- | --- | --- |
| ... | ... | ... |
| ... | ... | ... |
The choices are listed (and can be appended to) here:
- List of models
- List of benchmarks
Some other constraints for the first phase:
- Multi-step, single-submission rollouts (a minimal sketch follows this list). See implementation.
- Fixed verification hierarchy: AST syntax check -> linter -> type checker -> benchmark-provided tests (sketched below).
- A simple RL algorithm, e.g. GRPO (group-relative advantages sketched below).
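To make the rollout constraint concrete, here is a minimal sketch of a multi-step, single-submission loop. The `env` and `policy` objects and the `Action` type are hypothetical stand-ins, not part of any existing implementation; the point is only that the agent may take several intermediate steps, but verification runs exactly once, on the final submission.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Action:
    kind: str            # "step" (intermediate edit/tool call) or "submit"
    payload: Any = None  # e.g. a patch or a tool invocation

def rollout(env, policy, max_steps: int = 8) -> float:
    """Multi-step, single-submission rollout: the policy may take several
    intermediate steps, but reward comes from one final verification."""
    obs = env.reset()
    for _ in range(max_steps):
        action: Action = policy(obs)
        if action.kind == "submit":
            return env.verify(action.payload)  # verified exactly once
        obs = env.step(action)
    return env.verify(None)  # out of budget without submitting: failure
```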
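The verification hierarchy can be read as a short-circuiting pipeline that stops at the first failing stage. A sketch, assuming the candidate is a Python source file; the specific tools (ruff, mypy, pytest) are illustrative choices, not prescribed by this plan:

```python
import ast
import subprocess

def verify(path: str) -> dict:
    """Run the fixed hierarchy in order, stopping at the first failure:
    AST syntax check -> linter -> type checker -> benchmark tests."""
    # Stage 1: AST syntax check, done in-process.
    try:
        with open(path) as f:
            ast.parse(f.read())
    except SyntaxError as e:
        return {"stage": "syntax", "passed": False, "detail": str(e)}

    # Stages 2-4: each tool signals failure via a nonzero exit code.
    stages = [
        ("lint", ["ruff", "check", path]),
        ("types", ["mypy", path]),
        ("tests", ["pytest", "--quiet"]),  # benchmark-provided tests
    ]
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return {"stage": name, "passed": False, "detail": result.stdout}
    return {"stage": "tests", "passed": True, "detail": ""}
```

The staged ordering also gives cheap, graded failure signals: a syntax error is caught without ever paying for the test suite.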
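For reference, the core of GRPO is its group-relative advantage: sample a group of rollouts per prompt and normalize each rollout's reward by the group's mean and standard deviation, so no learned critic is needed. A minimal sketch of that computation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used by GRPO.

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    rollout, grouped by prompt. Each reward is normalized against its own
    group; `eps` guards against zero-variance groups.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```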
In the second stage, we'll discuss the results from this phase and explore at least the three branches below. For now, hold off on these:
- RL approaches, e.g. DAgger-like methods, RLIF, PPO vs. GRPO, and potentially custom developments.
- Performance tuning: container startup times (and splitting verifiers across containers), scaling bottlenecks, and custom GPU kernels or hand-optimized backward passes (e.g. Unsloth).
- Richer static analysis (control flow, data flow) and dynamic tracing to augment context.