Efficient long-context AI inference on consumer hardware — quantised hybrid SSM + sparse attention + episodic-memory architecture, reproducible end-to-end in ~25 minutes on a MacBook.
- 📄 Paper: arXiv preprint (link coming soon)
- 🤗 Checkpoints: https://huggingface.co/Vineetha00/synapnet-edge
- 🧪 Companion repo (base architecture): SynapNet_Exp
SynapNet-Edge is an original research framework. The base SynapNet architecture is developed in the companion repository SynapNet_Exp.
CAJQ applies a different quantization strategy to each architectural component:
| Component | Method | Bits | Rationale |
|---|---|---|---|
| SSM layers | ParetoQ-style QAT | 2-bit | Near-zero-mean depthwise weights; learned per-channel step size preserves accuracy |
| Sparse attention | SmoothQuant + AWQ | INT4 | Per-channel smoothing migrates activation outliers to weights; group-wise INT4 |
| Episodic memory | Per-entry symmetric | INT8 | Per-slot scale absorbs entry diversity |
| Interface layer | FP16 ScaleBridge | FP16 | Absorbs scale mismatches between mixed-precision pathways |
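To make the "per-entry symmetric" row concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per memory entry. It is an illustration under stated assumptions, not the repository's implementation: the function names `quantize_per_entry_int8` and `dequantize` are hypothetical, and the scale choice (max absolute value mapped to 127) is the common textbook convention.

```python
from typing import List, Tuple


def quantize_per_entry_int8(
    entries: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per entry (row).

    Hypothetical helper: each entry gets its own scale so that a
    small-magnitude entry is not crushed by a large-magnitude one.
    """
    q_entries: List[List[int]] = []
    scales: List[float] = []
    for entry in entries:
        max_abs = max((abs(x) for x in entry), default=0.0)
        # Map the largest magnitude to 127; fall back to 1.0 for all-zero entries.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = [max(-127, min(127, round(x / scale))) for x in entry]
        q_entries.append(q)
        scales.append(scale)
    return q_entries, scales


def dequantize(q_entries: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate floats by multiplying each row by its scale."""
    return [[v * s for v in q] for q, s in zip(q_entries, scales)]


# Two entries whose magnitudes differ by roughly three orders of magnitude.
entries = [[0.01, -0.02, 0.005], [10.0, -5.0, 2.5]]
q, scales = quantize_per_entry_int8(entries)
recon = dequantize(q, scales)
```

With a single shared scale, the first entry would round to almost all zeros; with one scale per entry, each row's reconstruction error stays within half of its own step size.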
Compression (measured on the 8.7M-parameter reference model): 4.4× on the targeted SSM and attention parameters (0.60 MB vs 2.66 MB FP16-equivalent) and 1.13× whole-model storage reduction. FF, embedding, and memory-projection layers remain FP16 in this configuration; extending CAJQ to the FF layers is straightforward future work.
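The two compression figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that every parameter of the 8.7M model is stored at 2 bytes in the FP16 baseline and that 1 MB means 10^6 bytes; both are assumptions made for illustration, not statements from the measurements.

```python
# Figures quoted in the compression summary.
targeted_fp16_mb = 2.66   # SSM + attention parameters, FP16-equivalent
targeted_cajq_mb = 0.60   # the same parameters after CAJQ
n_params = 8.7e6          # reference model size

# Assumption: the whole model is FP16 (2 bytes/param), 1 MB = 1e6 bytes.
whole_fp16_mb = n_params * 2 / 1e6

# Only the targeted parameters shrink; everything else stays FP16.
whole_cajq_mb = whole_fp16_mb - targeted_fp16_mb + targeted_cajq_mb

targeted_ratio = targeted_fp16_mb / targeted_cajq_mb
whole_ratio = whole_fp16_mb / whole_cajq_mb

print(f"targeted {targeted_ratio:.1f}x, whole-model {whole_ratio:.2f}x")
# prints: targeted 4.4x, whole-model 1.13x
```

Under these assumptions the targeted parameters make up only about 15% of the FP16 footprint, which is why a 4.4× reduction on them yields a modest whole-model figure.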
Accuracy (NIAH-single, mean ± std over 3 seeds, evaluated at 1024–4096 tokens): CAJQ-QAT reaches 0.674 ± 0.012 at ctx 1024, 0.590 ± 0.043 at ctx 2048, 0.521 ± 0.055 at ctx 4096, matching or exceeding the FP16 reference at every evaluated context length and reducing seed variance ~2.6×.
BAEE uses a lightweight retention-score classifier (~3.3K parameters) to progressively compress memory entries under RAM constraints:
FP16 (hot) → INT8 (warm) → summary token (cold) → eviction
- RetentionClassifier: 3-layer MLP; validation ROC-AUC = 0.907 ± 0.005 (3 seeds), robust across a 10× learning-rate range and graceful under label noise.
- Budget enforcement: triggers compression stages when total memory exceeds a configurable threshold (default: 256 MB).
- vs FIFO / LRU: BAEE retains semantically important entries regardless of recency. Under 90% forced eviction with the target needle in the early portion of an 8K-token stream, BAEE retains the target 71% ± 8% of the time vs 0% for FIFO/LRU.
- vs learned KV-cache eviction: a head-to-head 432-cell grid against H2O, Scissorhands, SnapKV, PyramidKV, and a Locret-style proxy positions BAEE competitively across budget × position regimes (see paper/figures/v3_fig_kv_policy_grid.pdf).
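The staged pipeline (hot, warm, cold, eviction) and the budget trigger described above can be sketched as follows. This is a simplified illustration, not the `BAEEMemoryManager` implementation: the `Entry` class, the `enforce_budget` function, the per-tier byte sizes, and the lowest-score-first demotion order are all assumptions chosen for the example, and the retention scores are supplied by hand rather than by a trained classifier.

```python
from dataclasses import dataclass
from typing import List

TIERS = ["hot", "warm", "cold", "evicted"]

# Hypothetical per-entry sizes in bytes for dim=192:
# FP16 (2 B/elem), INT8 (1 B/elem), and an assumed size for a summary token.
TIER_BYTES = {"hot": 384, "warm": 192, "cold": 48, "evicted": 0}


@dataclass
class Entry:
    name: str
    retention: float  # higher means more worth keeping
    tier: str = "hot"


def total_bytes(entries: List[Entry]) -> int:
    return sum(TIER_BYTES[e.tier] for e in entries)


def enforce_budget(entries: List[Entry], budget_bytes: int) -> List[Entry]:
    """Demote entries one stage at a time until the store fits the budget.

    Each stage (hot->warm, warm->cold, cold->evicted) visits entries in
    ascending retention order, so low-scoring entries are compressed first
    and the next stage only starts once the previous one is exhausted.
    """
    for src, dst in zip(TIERS, TIERS[1:]):
        for entry in sorted(entries, key=lambda e: e.retention):
            if total_bytes(entries) <= budget_bytes:
                return entries
            if entry.tier == src:
                entry.tier = dst
    return entries


store = [
    Entry("needle", 0.95),
    Entry("c", 0.60),
    Entry("b", 0.20),
    Entry("a", 0.10),
    Entry("z", 0.05),
]
enforce_budget(store, budget_bytes=600)
final_tiers = {e.name: e.tier for e in store}
```

In this toy run the three lowest-scoring entries end up as summary tokens while the two highest-scoring ones stay at INT8, regardless of the order in which they were written. A recency-based policy such as FIFO or LRU would instead decide purely by insertion or access time.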
Three hardware tiers profiled:
- Apple Silicon (MPS) — measured directly
- Multi-thread CPU — measured directly
- Single-thread CPU (Raspberry Pi 5 proxy) — measured directly (real Pi 5 throughput would be ~1.5–2× lower per-core)
Metrics reported: parameter memory, on-disk storage, activation memory, episodic-store growth, runtime RSS, sustained throughput under 45-second thermal stress, and an energy/token estimate from powermetrics (with rated-TDP fallback).
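Two of these metrics reduce to simple formulas: sustained throughput is tokens processed divided by wall-clock time over a fixed window, and energy per token is average power divided by throughput. The sketch below illustrates both with hypothetical helper names (`sustained_throughput`, `energy_per_token_joules`); the wattage values are placeholders for illustration, not measurements, and the dummy workload stands in for a real forward pass.

```python
import time
from typing import Callable, Optional


def sustained_throughput(
    step_fn: Callable[[], object],
    tokens_per_step: int,
    duration_s: float = 45.0,
) -> float:
    """Run step_fn repeatedly for duration_s seconds and return tokens/second."""
    start = time.perf_counter()
    tokens = 0
    while time.perf_counter() - start < duration_s:
        step_fn()
        tokens += tokens_per_step
    elapsed = time.perf_counter() - start
    return tokens / elapsed


def energy_per_token_joules(
    tokens_per_second: float,
    measured_watts: Optional[float],
    rated_tdp_watts: float,
) -> float:
    """Energy per token in joules: power (W) / throughput (tokens/s).

    Falls back to the rated TDP when no measured power reading is available.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    watts = measured_watts if measured_watts is not None else rated_tdp_watts
    return watts / tokens_per_second


# Dummy workload and a short window so the example finishes quickly.
tps = sustained_throughput(lambda: sum(range(1000)), tokens_per_step=512, duration_s=0.05)

# Placeholder power figures, not measured values.
joules_measured = energy_per_token_joules(tps, measured_watts=8.0, rated_tdp_watts=15.0)
joules_fallback = energy_per_token_joules(tps, measured_watts=None, rated_tdp_watts=15.0)
```

Because the TDP fallback is an upper-bound rating rather than a reading, estimates produced through it are best treated as conservative.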
Benchmarks: RULER-style (NIAH, variable tracking, frequency aggregation) + LongBench-style (6 task categories) + NeedleBench-style (SNIA / MKN / RoN / CN / ADN), all with self-contained synthetic data — no external downloads required.
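As an illustration of what self-contained synthetic data can look like, here is a minimal needle-in-a-haystack sample generator using only the standard library. The token layout (a reserved marker id followed by a value id that doubles as the class label) and the function name `make_niah_sample` are assumptions for this sketch; the generators in `synapnet_edge/data/long_context_tasks.py` may be structured differently.

```python
import random
from typing import List, Tuple


def make_niah_sample(
    ctx_len: int,
    vocab_size: int,
    num_classes: int,
    depth_frac: float,
    seed: int,
) -> Tuple[List[int], int, int]:
    """Build one synthetic NIAH-single sample.

    Assumed layout: the last vocabulary id is a needle marker, ids below
    num_classes are value tokens, and filler tokens are drawn from the
    remaining ids so they never collide with either.
    Returns (tokens, label, needle_position).
    """
    rng = random.Random(seed)
    marker = vocab_size - 1
    label = rng.randrange(num_classes)
    # Filler ids lie in [num_classes, marker - 1].
    filler = [rng.randrange(num_classes, marker) for _ in range(ctx_len - 2)]
    pos = int(depth_frac * (ctx_len - 2))
    tokens = filler[:pos] + [marker, label] + filler[pos:]
    return tokens, label, pos


# Needle placed early in a 1024-token context; the seed makes it reproducible.
tokens, label, pos = make_niah_sample(
    ctx_len=1024, vocab_size=4096, num_classes=64, depth_frac=0.1, seed=42
)
```

Varying `depth_frac` moves the needle between early and late positions, and fixing `seed` makes every sample reproducible without downloading anything.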
Input tokens
│
▼
Token + Position Embedding
│
├─ [× depth] SynapBlockWithEpisodic
│ │
│ ├─ SimpleSSM (depthwise conv, 2-bit QAT)
│ │
│ ├─ SparseEventAttention (INT4 AWQ + SmoothQuant)
│ │ └─ salience mask → BAEE scoring
│ │
│ ├─ WriteableMemory (INT8 per-entry)
│ │ ├─ write: top-K salient tokens → slots
│ │ └─ read: cross-attention to slots
│ │
│ └─ ScaleBridge (FP16 — normalises 3 pathway outputs)
│
▼
LayerNorm → Head (classification or LM)
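The write and read paths of the memory block in the diagram can be sketched in a few lines. This is a simplified, pure-Python illustration that assumes single-head dot-product attention without learned projections and hand-supplied salience scores; `write_top_k` and `read_slots` are hypothetical names, not the `WriteableMemory` API.

```python
import math
from typing import List

Vector = List[float]


def write_top_k(tokens: List[Vector], salience: List[float], k: int) -> List[Vector]:
    """Write path: keep the k most salient token vectors as memory slots.

    Slots are returned in their original sequence order.
    """
    top = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def read_slots(query: Vector, slots: List[Vector]) -> Vector:
    """Read path: scaled dot-product attention of one query over the slots."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * s for q, s in zip(query, slot)) for slot in slots]
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    weights = [math.exp(x - m) for x in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * slot[d] for w, slot in zip(weights, slots)) for d in range(len(query))]


token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
salience_scores = [0.1, 0.9, 0.8, 0.2]

slots = write_top_k(token_vectors, salience_scores, k=2)
readout = read_slots([-4.0, 4.0], slots)
```

Here the two highest-salience vectors become slots, and the readout is a convex combination of them weighted toward the slot that best matches the query.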
git clone https://github.com/vineetha00/SynapNet-Edge
cd SynapNet-Edge
pip install -e .
# Optional extras:
pip install -e ".[dev]" # + psutil, pyyaml, scipy, tqdm
pip install -e ".[mlx]" # + Apple MLX for iPhone/Mac deployment
pip install -e ".[full]" # + HuggingFace datasets for real LongBench

python scripts/pretrain_scaled.py \
--dim 192 --depth 6 --heads 6 \
--curriculum 512 1024 \
--device mps

python scripts/exp_cajq_qat_multiseed.py \
--context-lengths 1024 2048 4096 \
--seeds 42 43 44 \
--device mps

python scripts/exp_baee_grid_multiseed.py \
--policies baee_salience fifo lru random h2o scissorhands snapkv pyramidkv locret_proxy \
--budgets 0.10 0.20 0.30 0.50 \
--positions early late \
--seeds 42 43 44 \
--device mps

python scripts/exp_deployment_metrics.py \
--tiers apple_silicon_mps cpu_multi cpu_single \
--variants fp16 int4_uniform cajq

python scripts/generate_paper_figures_v3.py

from synapnet_edge import SynapNetEdge, SynapNetEdgeConfig, apply_cajq, BAEEMemoryManager
from synapnet_edge.quantization.cajq import CAJQConfig
from synapnet_edge.training.calibration import build_calib_loader
# Build the reference 8.7M model
cfg = SynapNetEdgeConfig(
    dim=192, depth=6, vocab_size=4096, max_len=8192,
    num_classes=64, heads=6, episodic_slots=32,
)
model = SynapNetEdge(cfg)
# Apply CAJQ quantization (PTQ; QAT entry point in scripts/exp_cajq_qat_multiseed.py)
calib_loader = build_calib_loader(n_samples=128, seq_len=1024)
model = apply_cajq(model, CAJQConfig(device="mps"),
                   calib_loader=calib_loader, mode="ptq")
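# Streaming inference with BAEE
# Note: input_ids is not defined in this snippet; it is assumed to be a
# tensor of token ids prepared by the caller.
manager = BAEEMemoryManager(dim=192, n_layers=6, budget_mb=256.0)
logits, debug = model.forward_streaming(
    input_ids, chunk_size=512, baee_manager=manager
)

SynapNet-Edge/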
├── synapnet_edge/
│ ├── models/ # SSM / SparseAttn / Episodic / SynapBlock / full model
│ ├── quantization/ # CAJQ + 2-bit / INT4 / INT8 / ScaleBridge
│ ├── memory/
│ │ ├── baee.py # BAEEMemoryManager + RetentionClassifier (~3.3K params)
│ │ └── kv_cache_policies.py # H2O / Scissorhands / SnapKV / PyramidKV / Locret
│ ├── benchmarks/ # RULER / LongBench / hardware / Pareto
│ ├── baselines/ # Mamba-2 / Llama-AWQ / Falcon-H1 / EM-LLM proxies
│ ├── training/ # QAT trainer + calibration
│ ├── data/long_context_tasks.py # NIAH / MultiKey / VarTrack / FA / NeedleBench / MemPressure
│ └── utils/ # profiling + visualisation
├── scripts/ # 18 reproducible experiment scripts
├── configs/ # YAML configs for CAJQ / BAEE / benchmarks
├── results/ # JSON outputs for every experiment
├── paper/figures/ # 20+ publication-quality PDFs + v3 paper summary
├── CITATION.cff # citation metadata
├── LICENSE # MIT
└── README.md
| Model | Architecture | Bits | Role |
|---|---|---|---|
| Mamba-2 (proxy) | GRU-SSM | INT4 uniform | Approximates Mamba-2 selective scan |
| Llama-3.2 (proxy) | Full dense attention | INT4 AWQ | O(T²) cost reference |
| Falcon-H1 / Hymba (proxy) | Hybrid SSM + attention | FP16 | Upper-bound hybrid reference |
| EM-LLM | Transformer + external memory | FP16 | Memory-mechanism ablation |
| H2O / Scissorhands / SnapKV / PyramidKV / Locret-proxy | KV-cache eviction policies | — | Eviction-policy baselines (BAEE comparison) |
- Quantization strategy: FP16 / Uniform INT8 / Uniform INT4 / CAJQ-PTQ / CAJQ-QAT (ours). See scripts/exp_cajq_qat_multiseed.py
- Eviction policy: BAEE (ours) / FIFO / LRU / Random / H2O / Scissorhands / SnapKV / PyramidKV / Locret. See scripts/exp_baee_grid_multiseed.py
- ScaleBridge: post-hoc removal collapses accuracy (load-bearing in the trained model); a from-scratch comparison is inconclusive at our compute budget. See scripts/exp_scale_bridge_ablation.py
- Classifier stability: seed × LR × label-noise grid. See scripts/exp_classifier_stability.py
@article{synapnet_edge_2026,
title={SynapNet-Edge: Component-Aware Quantization and Budget-Aware Eviction for Hybrid Long-Context Models on Consumer Hardware},
author={Vallish Kumar, Vineetha},
year={2026},
}

Released under the MIT License.
Quantization methods inspired by ParetoQ, SmoothQuant, and AWQ. Memory eviction inspired by EM-LLM, MemGPT, and the KV-cache compression line of work (H2O, Scissorhands, SnapKV, PyramidKV).