LoCoDiff is a novel long-context benchmark for evaluating language models' ability to understand git history and reconstruct code. Developed by the Mentat AI team, this benchmark offers several unique strengths:
- Utilizes naturally interconnected content, not artificially generated or padded context
- No junk context: every part of the context is required for the task
- Tests a real skill critical for coding agents: keeping track of the state of edited files
- Prompt generation and output evaluation are simple and easy to understand (see the sketch after this list)
- Challenges models' capacity to generate long-form outputs
- Surprisingly difficult, even for reasoning models
- Easy to procedurally generate: any file in any git repo can be made into a benchmark case
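The prompt-and-scoring loop is simple enough to sketch end to end. The snippet below is a minimal illustration rather than the project's actual pipeline: it assumes a local git checkout, builds the context from `git log -p --follow` for a single file, and grades a model's reconstruction by exact match against the file at HEAD. The `query_model` call in the usage comment is a hypothetical stand-in for whatever model client you use.

```python
import subprocess


def build_prompt(repo_path: str, file_path: str) -> str:
    """Concatenate the file's full patch history, oldest commit first, into the prompt."""
    history = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--reverse", "--follow", "--", file_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return (
        "Below is the complete git history of one file.\n"
        "Output the exact current contents of that file.\n\n" + history
    )


def expected_output(repo_path: str, file_path: str) -> str:
    """Ground truth is simply the file's state at HEAD."""
    return subprocess.run(
        ["git", "-C", repo_path, "show", f"HEAD:{file_path}"],
        capture_output=True, text=True, check=True,
    ).stdout


def score(model_output: str, expected: str) -> bool:
    """Exact-match evaluation: the reconstruction must match character for character."""
    return model_output == expected


# Usage (query_model is a placeholder for your model API client):
# prompt = build_prompt("path/to/repo", "src/app.py")
# print(score(query_model(prompt), expected_output("path/to/repo", "src/app.py")))
```

This also illustrates why any file in any git repo can become a benchmark case: the context is the file's real history, and the answer key is the file itself.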
 
For results, methodology, and analysis, see the LoCoDiff website.
For instructions on running the benchmark yourself, see the benchmark pipeline README.