ContextGC Barrier is a small research repo about a question:
When a conversation gets too long for the prompt window, which old messages should stay in the prompt?
The name is inspired by ZGC. The idea here is simple: be careful about what stays live when space is tight.
This repo compares three strategies:
- summary80
- barrier
- summary80_barrier
The focus is narrow: keep exact old facts available when the prompt window is tight.
If you only want the short version:
| Strategy | What it does | Current replay result |
|---|---|---|
| summary80 | Compress older context into a rolling summary. | Weak baseline under tight budgets. |
| barrier | Keep the most relevant older raw messages. | Best performer in the current debugging replay benchmark. |
| summary80_barrier | Summarize first, then add back protected raw exceptions. | Better than summary80, but not better than barrier on the current replay. |
| Strategy | Simple description |
|---|---|
| summary80 | When the prompt gets tight, older context is compressed into one rolling summary. |
| barrier | Older raw messages are scored, and the most useful ones stay in the prompt. |
| summary80_barrier | Older context is summarized first, then important raw user and tool messages are added back. |
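The barrier idea above can be sketched as a score-and-keep loop. This is a hypothetical illustration, not the repo's actual scoring function: it scores older messages by word overlap with the latest turn and keeps the best-scoring ones that fit a budget.

```python
def keep_relevant(older, latest, budget):
    """Score older messages by word overlap with the latest turn,
    then keep the highest-scoring ones that fit the word budget."""
    latest_words = set(latest.lower().split())
    scored = sorted(
        older,
        key=lambda m: len(latest_words & set(m.lower().split())),
        reverse=True,
    )
    kept, used = [], 0
    for msg in scored:
        cost = len(msg.split())  # crude stand-in for a token count
        if used + cost <= budget:
            kept.append(msg)
            used += cost
    # restore original conversation order
    return [m for m in older if m in kept]

older = [
    "user: the API returns a 502 after deploy",
    "tool: log shows upstream timeout at 30s",
    "assistant: let's check the proxy config",
]
latest = "user: what was the upstream timeout value?"
print(keep_relevant(older, latest, budget=15))
```

The real implementation will score and count tokens differently; the point is only that raw messages survive or drop based on relevance, not age.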
There are two benchmark styles:
- matrix: live multi-turn tasks for debugging, document QA, coding, and support
- debugging_replay: a scripted debugging replay benchmark where every strategy sees the exact same frozen transcript
The main benchmark to read is debugging_replay.
Every strategy sees the same frozen debugging transcript. The model stays the same. The prompt budget stays the same. Only the retention strategy changes.
That makes it a clean comparison of context handling.
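The controlled setup can be sketched in a few lines. The strategy functions below are invented stand-ins, not the repo's strategies; what matters is that the transcript and budget are fixed and only the retention function varies.

```python
# One frozen transcript, one budget, only the retention strategy changes.
# All names here are illustrative.
FROZEN_TRANSCRIPT = ["msg %d" % i for i in range(50)]
BUDGET = 10  # messages, standing in for a token budget

def keep_recent(msgs, budget):
    """Naive baseline: keep only the newest messages."""
    return msgs[-budget:]

def keep_every_fifth(msgs, budget):
    """Another toy policy: sample older messages at a fixed stride."""
    return [m for i, m in enumerate(msgs) if i % 5 == 0][:budget]

strategies = {"recent": keep_recent, "sampled": keep_every_fifth}
for name, retain in strategies.items():
    prompt = retain(FROZEN_TRANSCRIPT, BUDGET)
    print(name, len(prompt))
```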
The shared runtime is ContextGCBarrier.
Valid strategy ids are:
- summary80
- barrier
- summary80_barrier
barrier is still the default.
For the summary strategies, context_state() also reports:
- summary_active
- summarized_through_index
- summary_tokens
- protected_exception_indexes
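A sketch of what such a context_state() payload might look like. The field names come from the list above; the values (and the strategy key) are invented for illustration:

```python
# Hypothetical context_state() return value for a summary strategy.
# Field names are from the README; every value here is made up.
state = {
    "strategy": "summary80_barrier",
    "summary_active": True,
    "summarized_through_index": 42,          # messages 0..42 folded into the summary
    "summary_tokens": 310,                   # size of the rolling summary
    "protected_exception_indexes": [7, 19],  # raw messages kept past the barrier
}
print(sorted(state))
```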
- contextgc_barrier: runtime selection, scoring, summary logic, local demo
- benchmark: task generation, matrix runner, replay runner, stats, CLI
- tests: unit and integration coverage
- benchmark/results/debugging_replay: current benchmark output
```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[dev]"
```

For live local Qwen runs on Apple Silicon:

```bash
python -m pip install -e ".[mlx]"
```

Optional spaCy model:

```bash
python -m spacy download en_core_web_sm
```

Run the local demo for one strategy:

```bash
./.venv/bin/python -m contextgc_barrier.demo --strategy barrier
```

Run the small proof profile:

```bash
./.venv/bin/python benchmark/run_benchmark.py \
  --profile proof \
  --window-budget 3072
```

Run the scripted debugging replay smoke benchmark:

```bash
./.venv/bin/python benchmark/run_benchmark.py \
  --profile debugging_replay_smoke \
  --output-dir benchmark/results/debugging_replay_smoke
```

Run the full scripted debugging replay benchmark:

```bash
./.venv/bin/python benchmark/run_benchmark.py \
  --profile debugging_replay \
  --output-dir benchmark/results/debugging_replay
```

Run the full task matrix:

```bash
./.venv/bin/python benchmark/run_benchmark.py \
  --profile matrix \
  --strategies summary80,barrier,summary80_barrier
```

The replay benchmark writes:
- transcripts.jsonl: the frozen scripted transcripts
- runs.jsonl: one run per transcript/window/strategy
- aggregate.json and aggregate.csv: grouped metrics
- summary.md: human-readable summary
- audit_queue.jsonl: rows flagged for manual review
- progress.log: corpus generation and evaluation progress
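As a quick sanity check, runs.jsonl can be aggregated by hand. The strategy and score field names below are assumptions about the row schema, not confirmed by the repo:

```python
import json
from collections import defaultdict

def mean_score_by_strategy(jsonl_lines):
    """Group replay rows by strategy and average their scores.
    Assumes each row carries 'strategy' and 'score' fields."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in jsonl_lines:
        row = json.loads(line)
        totals[row["strategy"]][0] += row["score"]
        totals[row["strategy"]][1] += 1
    return {s: t / n for s, (t, n) in totals.items()}

# Tiny synthetic example standing in for real runs.jsonl lines.
rows = [
    '{"strategy": "barrier", "score": 0.8}',
    '{"strategy": "barrier", "score": 0.9}',
    '{"strategy": "summary80", "score": 0.2}',
]
print(mean_score_by_strategy(rows))
```

The shipped aggregate.json and summary.md already contain the grouped metrics; this is only a cross-check.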
The frozen assistant turns can refer back to the earlier incident, but they do not restate the answer facts. That keeps the replay focused on retention.
Latest full run: summary.md
This is the current benchmark to read. It uses 20 frozen debugging transcripts, 3 strategies, and 3 prompt windows.
| Window | summary80 | barrier | summary80_barrier | Honest read |
|---|---|---|---|---|
| 3072 | 0.211 | 0.829 | 0.786 | barrier is clearly best. The hybrid beats summary but still trails raw retention. |
| 4096 | 0.200 | 0.818 | 0.793 | Same result. barrier stays ahead. |
| 16384 | 0.825 | 0.825 | 0.825 | Once the full transcript fits, the strategies tie. |
The paired comparisons in summary.md support this reading:
| Comparison | 3072 | 4096 | What to say publicly |
|---|---|---|---|
| barrier vs summary80 | delta +0.618, wins 20/20, p=0.000 | delta +0.618, wins 20/20, p=0.000 | barrier clearly beats summary80 on this debugging replay benchmark. |
| summary80_barrier vs barrier | delta -0.043, p=0.388 | delta -0.025, p=1.000 | The hybrid does not beat plain barrier on this benchmark. |
The short takeaway is simple: keeping the right raw context beats replacing old context with a lossy rolling summary.