[Guiding Goal] #9

@TreeinRandomForest

The goal of this phase of the project is to fill in the two tables below. Note that the goal does not include building a central framework for evals or training, performance tuning, or scaling beyond a reasonable amount.

Pure evaluation:

| Model | Benchmark | Performance |
| --- | --- | --- |
| ... | ... | ... |
| ... | ... | ... |

RL training:

| Model | Benchmark | Performance |
| --- | --- | --- |
| ... | ... | ... |
| ... | ... | ... |

where choices are listed (and can be appended to) here:

List of models

List of benchmarks

Some other constraints for the first phase:

  • Multi-step, single-submission rollouts. See implementation.
  • Verification hierarchy is fixed: AST syntax checking -> linter -> type checker -> benchmark-provided tests.
  • A simple RL algorithm, e.g. GRPO.
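
As a rough illustration of the fixed verification order, here is a minimal sketch. It assumes the code under test is Python and that the pipeline stops at the first failing stage; neither is stated in this issue. `run_hierarchy`, `StageResult`, and `not_configured` are hypothetical names, and only the AST stage is implemented (with the standard-library `ast` module). The linter, type checker, and benchmark tests are placeholders to be swapped for real tool invocations.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A stage takes source code and returns (passed, message).
Stage = Callable[[str], Tuple[bool, str]]


@dataclass
class StageResult:
    stage: str
    passed: bool
    message: str = ""


def check_syntax(source: str) -> Tuple[bool, str]:
    """AST syntax check using the standard-library parser."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, ""


def not_configured(source: str) -> Tuple[bool, str]:
    """Placeholder stage (hypothetical): always passes until a real tool is wired in."""
    return True, "skipped (placeholder)"


# Order mirrors the hierarchy in the list above.
DEFAULT_STAGES: List[Tuple[str, Stage]] = [
    ("ast", check_syntax),
    ("linter", not_configured),
    ("type_checker", not_configured),
    ("tests", not_configured),
]


def run_hierarchy(source: str, stages: List[Tuple[str, Stage]]) -> List[StageResult]:
    """Run stages in order; stop at the first failure (an assumption of this sketch)."""
    results: List[StageResult] = []
    for name, stage in stages:
        passed, message = stage(source)
        results.append(StageResult(name, passed, message))
        if not passed:
            break
    return results


if __name__ == "__main__":
    for result in run_hierarchy("def f(:\n    pass\n", DEFAULT_STAGES):
        print(result)
```

Stopping early keeps the cheaper checks in front of the more expensive ones, and the name of the last stage reached could serve as a coarse signal of how far a submission got.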

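For reference, the core of GRPO is a group-relative advantage: several rollouts are sampled for the same prompt, and each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below shows only that normalization step, not a full training loop. The function name, the `eps` term, and the use of population standard deviation are assumptions of this sketch, and implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group of rollouts for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every rollout in the group gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, rewarded 1.0 if the benchmark tests pass.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all rollouts in a group receive the same reward, every advantage is zero, so that group contributes no learning signal.
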
In the second phase, we'll discuss the results from the first phase and explore at least three branches. For now, hold off on these:

  • RL approaches, e.g. DAgger-like approaches, RLIF, PPO vs. GRPO, potentially custom developments.
  • Performance tuning: container startup times (and the splitting of verifiers across containers), scaling bottlenecks, custom GPU kernels or hand-optimized backward passes (unsloth).
  • Richer static analysis (control flow, data flow), dynamic tracing etc. to augment context.
