A Negative Result
DeepSeek's Engram modules (N-gram hash lookup tables) reduce perplexity by 22-30% on held-out text but produce zero improvement on downstream benchmarks (HellaSwag, PIQA, ARC-Challenge) across four experiments on models from 0.5B to 4B parameters on a single RTX 4090.
Engram is a conditional memory mechanism from DeepSeek-V3 that injects trainable embedding tables into transformer layers. Each table is indexed by a rolling N-gram hash of recent tokens, giving the model direct access to token-sequence statistics without the attention mechanism having to learn them from scratch.
It works beautifully at 671B parameters. The question we asked: does it help at consumer hardware scale?
| Model | Mode | Base PPL | Engram PPL | PPL Delta | HellaSwag | PIQA | ARC-C |
|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | Frozen, 100K | 20.75 | 15.73 | -24.2% | +0.35 | -1.14 | +0.34 |
| Qwen2.5-1.5B | Frozen, 5K | 14.84 | 11.02 | -25.7% | +0.30 | n/a | -0.61 |
| Qwen3.5-4B | Frozen, 10K | 12.87 | 9.76 | -24.2% | -0.02 | -0.33 | +1.79 |
| Qwen3.5-2B | Full FT, 10K | 16.31 | 11.44 | -29.9% | +0.17 | -0.71 | -0.85 |
All downstream deltas are within noise (±2 pp). No consistent directional trend.
The 2B full fine-tuning experiment is the key result: the model co-adapts with Engram, achieves the largest perplexity reduction of any experiment, and still shows zero downstream improvement. This rules out "frozen model ignores the signal" as an explanation.
We evaluate four hypotheses in the paper. The most compelling explanation: Engram captures local token co-occurrence statistics that are highly effective for next-token prediction but orthogonal to the multi-step reasoning and world knowledge that benchmarks require. A single Engram layer at position 1 captures 86% of the total perplexity gain, consistent with a surface-level statistical correction rather than deep representational learning.
engram_qwen35_integration.py # Engram module (injects into Qwen3.5 models)
train_qwen35.py # Train with frozen base model
train_qwen35_fullft.py # Train with full joint fine-tuning
eval_qwen35.py # Perplexity evaluation (WikiText-103 test split)
bench_qwen35.py # Downstream benchmarks (HellaSwag, PIQA, ARC-C)
results/
consolidated_results.json # All experiment results (machine-readable)
paper.tex # Paper source (LaTeX)
paper.md # Paper in Markdown
paper.pdf # Paper (PDF)
figure1_training_loss.png # Training loss curve (Figure 1)
Requirements:
- Python 3.10+
- PyTorch (with CUDA)
- transformers, datasets, lm-eval
- bitsandbytes
- Single GPU with 24 GB VRAM (tested on RTX 4090)
Quick start:
pip install torch transformers datasets lm-eval bitsandbytesRun an experiment (frozen base + Engram):
python train_qwen35.py --model Qwen/Qwen3.5-4B-Base --layers 1,8,16,23 --steps 10000Evaluate perplexity:
python eval_qwen35.py --model Qwen/Qwen3.5-4B-Base --checkpoint experiments/qwen35_4b_10kRun downstream benchmarks:
python bench_qwen35.py --model Qwen/Qwen3.5-4B-Base --checkpoint experiments/qwen35_4b_10kFull fine-tuning:
python train_qwen35_fullft.py --model Qwen/Qwen3.5-2B-Base --layers 1,6,12,17 --steps 10000Note: You'll need to adjust HF model paths and data directories to match your setup. See script headers for full argument lists.
- Perplexity is not a proxy for downstream performance. A 30% perplexity reduction corresponded to literally zero downstream improvement.
- Frozen vs. full fine-tuning doesn't matter. The largest PPL improvement (full FT, -29.9%) still produced zero benchmark gains.
- Scale matters. Engram may genuinely help at 671B scale (DeepSeek-V3) but the effect does not emerge at 0.5B-4B.
- Local statistics ≠ reasoning. Engram captures token co-occurrence patterns that help fill in blanks but don't enhance multi-step reasoning.
All experiments ran on a single NVIDIA RTX 4090 (24 GB VRAM):
- 4B frozen: fits easily (~10 GB VRAM)
- 2B full fine-tune: 14.08 GB peak VRAM
- 4B full fine-tune: OOM (requires >24 GB)
Engram embedding tables are stored on disk via mmap (~2.5 GB per layer, fp16) and only accessed pages are loaded into RAM.
The full paper is available as paper.pdf, paper.md, or paper.tex.
@misc{engram_consumer_hw_2026,
title={Conditional Memory for Small Language Models: An Empirical Evaluation of N-Gram Hash Lookup Tables},
author={Independent Replication Study},
year={2026}
}This work builds on DeepSeek's Engram mechanism and uses models from the Qwen family.
Apache 2.0 (see LICENSE).