Conditional Memory for Small Language Models: An Empirical Evaluation of N-Gram Hash Lookup Tables

A Negative Result

TL;DR

DeepSeek's Engram modules (N-gram hash lookup tables) reduce perplexity by 22-30% on held-out text but produce zero improvement on downstream benchmarks (HellaSwag, PIQA, ARC-Challenge) across four experiments on models from 0.5B to 4B parameters on a single RTX 4090.

What is Engram?

Engram is a conditional memory mechanism from DeepSeek-V3 that injects trainable embedding tables into transformer layers. Each table is indexed by a rolling N-gram hash of recent tokens, giving the model direct access to token-sequence statistics without the attention mechanism having to learn them from scratch.

It works beautifully at 671B parameters. The question we asked: does it help at consumer hardware scale?

Results

Model	Mode	Base PPL	Engram PPL	PPL Delta	HellaSwag	PIQA	ARC-C
Qwen2.5-0.5B	Frozen, 100K	20.75	15.73	-24.2%	+0.35	-1.14	+0.34
Qwen2.5-1.5B	Frozen, 5K	14.84	11.02	-25.7%	+0.30	n/a	-0.61
Qwen3.5-4B	Frozen, 10K	12.87	9.76	-24.2%	-0.02	-0.33	+1.79
Qwen3.5-2B	Full FT, 10K	16.31	11.44	-29.9%	+0.17	-0.71	-0.85

All downstream deltas are within noise (±2 pp). No consistent directional trend.

The 2B full fine-tuning experiment is the key result: the model co-adapts with Engram, achieves the largest perplexity reduction of any experiment, and still shows zero downstream improvement. This rules out "frozen model ignores the signal" as an explanation.

Why Doesn't Perplexity Transfer?

We evaluate four hypotheses in the paper. The most compelling explanation: Engram captures local token co-occurrence statistics that are highly effective for next-token prediction but orthogonal to the multi-step reasoning and world knowledge that benchmarks require. A single Engram layer at position 1 captures 86% of the total perplexity gain, consistent with a surface-level statistical correction rather than deep representational learning.

Repo Structure

engram_qwen35_integration.py   # Engram module (injects into Qwen3.5 models)
train_qwen35.py                # Train with frozen base model
train_qwen35_fullft.py         # Train with full joint fine-tuning
eval_qwen35.py                 # Perplexity evaluation (WikiText-103 test split)
bench_qwen35.py                # Downstream benchmarks (HellaSwag, PIQA, ARC-C)

results/
  consolidated_results.json    # All experiment results (machine-readable)

paper.tex                      # Paper source (LaTeX)
paper.md                       # Paper in Markdown
paper.pdf                      # Paper (PDF)
figure1_training_loss.png      # Training loss curve (Figure 1)

Reproducing the Results

Requirements:

Python 3.10+
PyTorch (with CUDA)
transformers, datasets, lm-eval
bitsandbytes
Single GPU with 24 GB VRAM (tested on RTX 4090)

Quick start:

pip install torch transformers datasets lm-eval bitsandbytes

Run an experiment (frozen base + Engram):

python train_qwen35.py --model Qwen/Qwen3.5-4B-Base --layers 1,8,16,23 --steps 10000

Evaluate perplexity:

python eval_qwen35.py --model Qwen/Qwen3.5-4B-Base --checkpoint experiments/qwen35_4b_10k

Run downstream benchmarks:

python bench_qwen35.py --model Qwen/Qwen3.5-4B-Base --checkpoint experiments/qwen35_4b_10k

Full fine-tuning:

python train_qwen35_fullft.py --model Qwen/Qwen3.5-2B-Base --layers 1,6,12,17 --steps 10000

Note: You'll need to adjust HF model paths and data directories to match your setup. See script headers for full argument lists.

Key Findings

Perplexity is not a proxy for downstream performance. A 30% perplexity reduction corresponded to literally zero downstream improvement.
Frozen vs. full fine-tuning doesn't matter. The largest PPL improvement (full FT, -29.9%) still produced zero benchmark gains.
Scale matters. Engram may genuinely help at 671B scale (DeepSeek-V3) but the effect does not emerge at 0.5B-4B.
Local statistics ≠ reasoning. Engram captures token co-occurrence patterns that help fill in blanks but don't enhance multi-step reasoning.

Hardware

All experiments ran on a single NVIDIA RTX 4090 (24 GB VRAM):

4B frozen: fits easily (~10 GB VRAM)
2B full fine-tune: 14.08 GB peak VRAM
4B full fine-tune: OOM (requires >24 GB)

Engram embedding tables are stored on disk via mmap (~2.5 GB per layer, fp16) and only accessed pages are loaded into RAM.

Paper

The full paper is available as paper.pdf, paper.md, or paper.tex.

Citation

@misc{engram_consumer_hw_2026,
  title={Conditional Memory for Small Language Models: An Empirical Evaluation of N-Gram Hash Lookup Tables},
  author={Independent Replication Study},
  year={2026}
}

Acknowledgments

This work builds on DeepSeek's Engram mechanism and uses models from the Qwen family.

License

Apache 2.0 (see LICENSE).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Conditional Memory for Small Language Models: An Empirical Evaluation of N-Gram Hash Lookup Tables

TL;DR

What is Engram?

Results

Why Doesn't Perplexity Transfer?

Repo Structure

Reproducing the Results

Key Findings

Hardware

Paper

Citation

Acknowledgments

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
results		results
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
bench_qwen35.py		bench_qwen35.py
engram_qwen35_integration.py		engram_qwen35_integration.py
eval_qwen35.py		eval_qwen35.py
figure1_training_loss.png		figure1_training_loss.png
paper.md		paper.md
paper.pdf		paper.pdf
paper.tex		paper.tex
train_qwen35.py		train_qwen35.py
train_qwen35_fullft.py		train_qwen35_fullft.py

Folders and files

Latest commit

History

Repository files navigation

Conditional Memory for Small Language Models: An Empirical Evaluation of N-Gram Hash Lookup Tables

TL;DR

What is Engram?

Results

Why Doesn't Perplexity Transfer?

Repo Structure

Reproducing the Results

Key Findings

Hardware

Paper

Citation

Acknowledgments

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages