The goal of this phase of the project is to fill in the two tables below. Note that the goal does not include building a central framework for evals or training, performance tuning, or scaling beyond a reasonable level.
Pure evaluation:
| Model | Benchmark | Performance |
| --- | --- | --- |
| ... | ... | ... |
| ... | ... | ... |
RL training:
| Model | Benchmark | Performance |
| --- | --- | --- |
| ... | ... | ... |
| ... | ... | ... |
The choices are listed (and can be appended to) here:
- List of models
- List of benchmarks
Some other constraints for the first phase:
- Multi-step, single-submission rollouts (a minimal sketch follows this list). See implementation.
- Fixed verification hierarchy: AST syntax check -> linter -> type checker -> benchmark-provided tests (sketched below).
- A simple RL algorithm, e.g. GRPO (group-relative advantages sketched below).
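To make the rollout constraint concrete, here is a minimal sketch of a multi-step, single-submission loop. The `env` and `policy` objects and the `Action` type are hypothetical stand-ins, not part of any existing implementation; the point is only that the agent may take several intermediate steps, but verification runs exactly once, on the final submission.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Action:
    kind: str            # "step" (intermediate edit/tool call) or "submit"
    payload: Any = None  # e.g. a patch or a tool invocation

def rollout(env, policy, max_steps: int = 8) -> float:
    """Multi-step, single-submission rollout: the policy may take several
    intermediate steps, but reward comes from one final verification."""
    obs = env.reset()
    for _ in range(max_steps):
        action: Action = policy(obs)
        if action.kind == "submit":
            return env.verify(action.payload)  # verified exactly once
        obs = env.step(action)
    return env.verify(None)  # out of budget without submitting: failure
```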
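The verification hierarchy can be read as a short-circuiting pipeline that stops at the first failing stage. A sketch, assuming the candidate is a Python source file; the specific tools (ruff, mypy, pytest) are illustrative choices, not prescribed by this plan:

```python
import ast
import subprocess

def verify(path: str) -> dict:
    """Run the fixed hierarchy in order, stopping at the first failure:
    AST syntax check -> linter -> type checker -> benchmark tests."""
    # Stage 1: AST syntax check, done in-process.
    try:
        with open(path) as f:
            ast.parse(f.read())
    except SyntaxError as e:
        return {"stage": "syntax", "passed": False, "detail": str(e)}

    # Stages 2-4: each tool signals failure via a nonzero exit code.
    stages = [
        ("lint", ["ruff", "check", path]),
        ("types", ["mypy", path]),
        ("tests", ["pytest", "--quiet"]),  # benchmark-provided tests
    ]
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return {"stage": name, "passed": False, "detail": result.stdout}
    return {"stage": "tests", "passed": True, "detail": ""}
```

The staged ordering also gives cheap, graded failure signals: a syntax error is caught without ever paying for the test suite.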
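For reference, the core of GRPO is its group-relative advantage: sample a group of rollouts per prompt and normalize each rollout's reward by the group's mean and standard deviation, so no learned critic is needed. A minimal sketch of that computation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as used by GRPO.

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    rollout, grouped by prompt. Each reward is normalized against its own
    group; `eps` guards against zero-variance groups.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```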
In the second stage, we'll discuss the results from this phase and explore at least the three branches below. For now, hold off on these:
- RL approaches, e.g. DAgger-like methods, RLIF, PPO vs. GRPO, and potentially custom developments.
- Performance tuning: container startup times (and splitting verifiers across containers), scaling bottlenecks, and custom GPU kernels or hand-optimized backward passes (e.g. Unsloth).
- Richer static analysis (control flow, data flow) and dynamic tracing to augment context.