Local, long-context document Q&A with Gemma 4 E4B — load PDFs, ask anything, get page-cited answers. The model runs on your machine. Your documents never leave it.
DeepRead is the build submission for the dev.to Gemma 4 Challenge. It demonstrates that a high-quality document-intelligence experience can run entirely on consumer hardware — no cloud, no per-query cost, no telemetry — by leaning on Gemma 4 E4B's 128K context window instead of a RAG pipeline.
- One chat, two modes. Document Q&A is the default. Type
/bench showor/bench run --ctx 5000 20000 --needles 3in the same chat to render the context-window stress-test charts inline — no profile switching, no history loss. - Native multimodality. Pages go in as rendered images via Gemma 4's vision path — no OCR pipeline.
- Live context budget. A right-side sidebar shows the working set against the 128K ceiling, color-coded by latency tier (interactive / research / batch).
- Page picker in the sidebar. A custom React element renders the bundled classics directly in the right sidebar; clicking a row fires the same Python action callback that an in-chat button would.
- Page-anchored citations. The model is constrained to cite only the page ids it was given, so footnote references always resolve to real pages.
- Five classic CS papers bundled. Attention, GFS, MapReduce, Raft, Bitcoin — one click to load, zero network calls.
E4B is the only model in the Gemma 4 family — and arguably the only open model at this size today — that combines four properties at once:
- 128K context window wide enough to hold a complete research paper plus supplementary material in a single call.
- Native vision that handles PDF pages rendered at 150 DPI without an OCR step.
- Native audio input (kept reserved for future use; the current UI is text-first).
- ~9.6 GB on-disk footprint that runs comfortably on an 8 GB laptop GPU.
The 26B and 31B variants would push reasoning quality up, but they would kill the laptop story — and the whole point of DeepRead is that nothing leaves the machine. E2B was tempting for portability but loses fidelity on multi-step reasoning across long context. E4B is the precise sweet spot.
- Python 3.11+ managed by
uv - Ollama 0.24+ with
gemma4:e4bpulled (~9.6 GB) - 8 GB GPU (RTX 4060/5050, Apple Silicon) or 16 GB+ system RAM for CPU mode
- Linux, macOS, or Windows + WSL2
git clone https://github.com/yashksaini-coder/DeepRead.git
cd DeepRead
ollama pull gemma4:e4b # one-time, ~9.6 GB
make install # uv sync
make run # launch on http://127.0.0.1:8000Then open http://127.0.0.1:8000, pick Bitcoin · 2008 for the fastest demo, and ask What problem does proof-of-work solve in this paper?. The answer streams back with [^1] footnote markers that resolve to specific pages.
make install Sync dependencies from pyproject.toml + uv.lock
make papers (Re)fetch the bundled classic papers into papers_pdf_download/
make run Launch Chainlit on $(PORT) (default 8000)
make dev Launch Chainlit with hot-reload on file changes
make test Run the full test suite
make smoke Smoke-test text, vision, and PDF pipelines against the local Ollama
make smoketest One-shot end-to-end pipeline check (ingest + budget + Ollama call)
make bench Run the context-window sweep (writes benchmarks/results.json)
make bench-quick Fast sweep — 2 zones, 3 needles (sanity check only)
make plot Render benchmarks/plot.png from results.json
make clean Remove pytest/python caches and Chainlit runtime artifacts
make help Show this help
┌──────────────────────────────────────────────────────────────┐
│ Browser (Chainlit chat UI — one session, two modes) │
│ • Default — paper Q&A with citations + sidebar budget │
│ • /bench show — render the latest sweep results │
│ • /bench run … — kick off a fresh context-window sweep │
└─────────────┬────────────────────────────────────────────────┘
│ WebSocket (Chainlit transport)
▼
┌──────────────────────────────────────────────────────────────┐
│ app.py — Chainlit handlers │
│ on_chat_start · on_message · action_callback · AskFileMsg │
│ cl.user_session: shards + excluded set (per browser tab) │
└─────────────┬────────────────────────────────────────────────┘
│ pure-Python calls
▼
┌──────────────────────────────────────────────────────────────┐
│ deepread/ (UI-framework-independent) │
│ ingest.py PDF → page-images via PyMuPDF (Shard objects) │
│ budget.py ~900 tok/page estimate · tier classification │
│ citations.py [[name#p3]] markers · known-id allow-list │
│ papers.py 5 bundled PDFs in papers_pdf_download/ │
│ llm.py stream_chat() — wraps Ollama with options │
└─────────────┬────────────────────────────────────────────────┘
│ HTTP /api/chat (streaming)
▼
┌──────────────────────────────────────────────────────────────┐
│ Ollama daemon → gemma4:e4b (Q4_K_M, 9.6 GB on disk) │
└──────────────────────────────────────────────────────────────┘
The deepread/ package is intentionally UI-agnostic — the Chainlit frontend is a thin layer over it, and could be swapped without touching the domain code.
DeepRead/
├── app.py Chainlit entry point (ChatProfile, sidebar, actions)
├── Makefile make install / run / test / bench / clean
├── pyproject.toml uv-managed deps (chainlit, plotly, pymupdf, ollama, ...)
├── chainlit.md Chainlit welcome screen
├── public/
│ └── style.css Indigo theme + sidebar typography + Plotly chrome hide
├── .chainlit/config.toml Chainlit UI/audio settings
├── deepread/ Domain package (UI-independent)
│ ├── ingest.py PDF → Shards (PNG + extracted text + cite_id)
│ ├── budget.py Token estimate + tier classification
│ ├── citations.py [[cite_id]] grammar + allow-list parsing
│ ├── llm.py Ollama chat() wrapper (streaming)
│ ├── papers.py 5 bundled classics catalog
│ └── smoketest.py End-to-end pipeline check (one Ollama call)
├── benchmarks/
│ ├── run_context_sweep.py Needle-in-haystack sweep at 5K/20K/60K/100K
│ ├── plot.py Matplotlib 3-panel render → plot.png
│ └── results.json Append-only JSONL of sweep runs
├── scripts/
│ ├── refresh_papers.py Download bundled classics
│ ├── smoke_pdf.py PDF ingest smoke test
│ ├── smoke_text.py Text-only Ollama call
│ └── smoke_vision.py Single-image vision call
├── papers_pdf_download/ 5 bundled classic CS papers (~3.3 MB total)
├── tests/ 48 pytest tests (deepread/* + app.py helpers)
└── .design/chainlit/ QA screenshots (latest = qa-v5-*)
A needle-in-a-haystack sweep at 5K / 20K / 60K / 100K tokens, with 5 unique 4-character codes seeded at fixed positions (5/25/50/75/95%) and asked back in isolation. Results from an RTX 5050 Laptop:
| Context | Pass rate | Tokens/sec | Time to first token |
|---|---|---|---|
| 20K | 5/5 | 8.6 | 15 s |
| 60K | 5/5 | 7.6 | 38 s |
| 100K | 5/5 | 6.8 | 72 s |
Recall held at 100% across the whole sweep. What broke was latency — TTFT grew nearly linearly with context size. Generation throughput stayed flat around 7–9 tok/s; the consumer-GPU tax shows up entirely in the prefill phase.
Reproduce:
make bench # full sweep
make plot # render benchmarks/plot.pngThe Benchmark chat profile renders the same data interactively from benchmarks/results.json.
make test # 48/48 pytest, ~4 s
make smoketest # one-shot end-to-end check (requires Ollama running)MIT — see LICENSE.
