Conversation
- Rewrite README opening for non-experts: what it is, why it matters, who it's for, what it is NOT. Adds a Table of Contents and an FAQ. - Document the new FastGRNN-gated near-linear path with a measured scaling table and runnable example pointer. - Add SEO-friendly keyword block at the bottom (rust llm inference, sparse attention rust, near-linear attention, edge ai rust, raspberry pi llm, gguf rust, mistral / llama / smollm2 / phi-2). - New docs/TUTORIAL.md walks through the full pipeline end-to-end (Cargo.toml → forward → KvCache decode → FP16 KV → FastGRNN gate → cross-compile to Pi). Published as https://gist.github.com/ruvnet/790214c832928d6f2ec7ebe593bb3def Co-Authored-By: claude-flow <ruv@ruv.net>
- repository, documentation, homepage URLs - keywords (llm, attention, transformer, inference, edge) - categories (algorithms, science, mathematics) - expanded description mentioning subquadratic + FastGRNN near-linear - rust-version = 1.77 (matches workspace MSRV) Published v0.1.0 to crates.io: https://crates.io/crates/ruvllm_sparse_attention Co-Authored-By: claude-flow <ruv@ruv.net>
…near scale
Adds a recurrent O(N · D_h²) FastGRNN pass that produces a per-token
salience score, then prunes the sparse-attention candidate set against
that score. Combined cost is O(N · (D_h² + W + G + K_keep + dim)),
linear in seq when the gate budget K_keep is constant.
New module `fastgrnn_gate`:
- FastGrnnGate cell (matches cognitum-agent's sparse_fastgrnn math
so weights round-trip via from_weights / score_sequence)
- score_sequence / score_kv: per-position salience over a sequence
- keep_mask_quantile / keep_mask_top_k: turn salience into a binary
keep-mask the attention candidate selector consumes
- step_with_hidden: streaming variant for online inference
New methods on SubquadraticSparseAttention:
- forward_gated(q, k, v, keep_mask) — drops below-threshold tokens
from the long-range candidate set; window + globals + current
are always retained (causality preservation)
- forward_gated_with_fastgrnn(q, k, v, gate, top_k) — convenience
wrapper that does FastGRNN scoring + top-K masking + gated forward
Tests (5 new + 8 gate tests, all passing alongside 25 baseline):
- all-true mask is bit-identical to plain forward
- all-false mask preserves window + globals + current, output finite
- wrong mask length returns InvalidConfig
- smaller top_k provably reduces total candidate count
- end-to-end FastGRNN-driven path produces finite output
Scaling demo (examples/fastgrnn_gated_scaling.rs):
seq | ungated/N | gated/N | growth ratio
----|-----------|---------|-------------
128 | 0.0021 | 0.0029 |
2048| 0.0029 | 0.0036 |
ungated grows ~1.38× over 16× seq (log-linear);
gated grows ~1.24× over 16× seq (sub-logarithmic, near-linear).
Zero new runtime dependencies (ADR-183 invariant preserved).
Co-Authored-By: claude-flow <ruv@ruv.net>
…ified
ADR-192 implementation. Crate is now no_std + alloc behind a default-on
`std` feature (purely additive — std consumers see zero behavioural change).
Changes:
- lib.rs: #![cfg_attr(not(feature = "std"), no_std)] + extern crate alloc
- F32Ext trait restores .exp/.sqrt/.tanh/.powi method syntax via libm
in no_std mode; std mode uses inherent f32 methods unchanged
- attention.rs / fastgrnn_gate.rs / tensor.rs: replace std:: with
core:: and alloc:: imports; HashSet → BTreeSet (no hashing in no_std)
- Error trait impl gated on std (core::error::Error needs MSRV bump)
- Cargo.toml: std default-on, parallel = ["std", "rayon"], libm always-on
Verified:
- cargo test --lib 38/38 pass
- cargo build --no-default-features clean
- cargo build --no-default-features --features fp16 clean
- cargo +esp build --target xtensa-esp32s3-none-elf 1.02s release,
376 KB rlib
- examples/esp32s3_smoke runs natively all checks passed
Tested against attached hardware: ESP32-S3 v0.2, MAC ac:a7:04:e2:66:24,
16 MB flash, on /dev/ttyACM0 (USB-Serial-JTAG).
Bump version 0.1.0 → 0.1.1 (patch — additive). Adds "no-std" to crates.io
categories. Adds libm 0.2 as always-on dep (~60 KB, pure Rust).
Co-Authored-By: claude-flow <ruv@ruv.net>
…attention Proposes four additive changes to the sparse-attention crate based on production data from the cognitum-agent deployment on cognitum-v0 (Pi Zero 2W, SmolLM2-135M Q4_0, cognitum-one/seed PR #133): 1. decode_step_with_deadline / decode_step_f16_with_deadline / decode_batch_with_deadline — sub-step wall-clock deadline so integrators can bound latency at finer granularity than per-token. Returns AttentionError::DeadlineExceeded { elapsed_ms, checkpoint }. 2. SparseAttentionConfig::pi_zero_2w() — codify the empirically validated window=64, tile=16, FP16 KV preset that cognitum-agent currently records as a Cargo.toml comment. 3. SubquadraticSparseAttention::warm_up() — synthetic 1-token decode to prime caches and shrink the measured 99 s → 56 s cold→warm gap before the first user inference. 4. Stochastic Q4 dequant pass-through for KV cache reload (feature-gated, off by default). Reuses the splitmix64 seeding pattern from cognitum-agent commit 1675c20 — naive `seed | 1` xorshift collapses adjacent seeds 42 and 43 to the same state, an outright bug. Status: proposed. Test plan covers correctness (deadline does not perturb output), unbiasedness (mean within 0.06 of deterministic over 256 trials), and a cluster bench comparing pre/post cold first-decode latency on cognitum-v0. Co-Authored-By: claude-flow <ruv@ruv.net>
Co-Authored-By: claude-flow <ruv@ruv.net>
ruvnet
added a commit
that referenced
this pull request
May 7, 2026
Iteration 1 of /loop "until SOTA and optimized" on PR #428 review feedback. Blocking fixes: 1. **Padding-edges correctness** (build.rs:80-87, graph.rs, search.rs) build.rs previously filled BATCH_SIZE-aligned padding slots with REAL random vertex IDs and their actual codes. During search those padding "neighbours" could score low on Hamming distance and displace real neighbours from the candidate beam (search.rs:206-212), violating the SymphonyQG paper's "no spurious-edge contribution" invariant. Fix: introduce graph::PADDING_SENTINEL = u32::MAX. Initialise the neighbours array to the sentinel and zero-fill nb_codes; the existing `nb >= g.n` check at search.rs:201 already rejects the sentinel (u32::MAX as usize > any practical g.n). Padded code bytes have constant Hamming distance from any query, so the SIMD popcount over them produces a uniform score that the sentinel skip discards before any heap insert. Empirical impact: small-corpus recall@10 measurement at dim=128, n=500, ef=300 went from a 60% floor to 71.4% measured (and the test floor is now 70%). Big-corpus PR-body claim (97.6% at n=5000) needs to be re-measured in a follow-up iteration. 2. **ADR-191 collision → ADR-193** (docs/adr/) Renamed ADR-191-symphonyqg-inline-fastscan-graph.md to ADR-193 to resolve the conflict with ADR-191-sparse-attention-pi-zero-2w- production-hardening.md (merged to main yesterday in PR #429). Updated frontmatter, title, related: chain, and authors typo (ruvenet → ruvnet). 3. **clippy::manual_div_ceil** (graph.rs:91) ((m + BATCH_SIZE - 1) / BATCH_SIZE) * BATCH_SIZE → m.div_ceil(BATCH_SIZE) * BATCH_SIZE 4. **cargo fmt -p ruvector-symphonyqg** — all the small whitespace diffs the workflow's Rustfmt check was failing on. Test floor: symphony_recall_at_10 renamed from above_60pct to above_70pct, with comments documenting the measurement gap between this small-corpus regression test and the PR body's headline number. 7/7 tests pass. Clippy clean (-D warnings). Deferred to next iteration: - Repack neighbours+codes into a single per-vertex packed buffer so the ADR's "inline" claim actually holds at the cache-line level (currently six independent Vecs share zero locality). - Re-run `src/main.rs` end-to-end and update the PR body / ADR with honest post-fix recall + speedup numbers at n=5000. - Investigate the `Tests (core-and-rest)` 3-hour timeout in workflow. - Add edge-case tests for n<BATCH_SIZE and dim non-multiple of 64. Co-Authored-By: claude-flow <ruv@ruv.net>
ruvnet
added a commit
that referenced
this pull request
May 7, 2026
Iteration 1 of /loop "until SOTA and optimized" on PR #428 review feedback. Blocking fixes: 1. **Padding-edges correctness** (build.rs:80-87, graph.rs, search.rs) build.rs previously filled BATCH_SIZE-aligned padding slots with REAL random vertex IDs and their actual codes. During search those padding "neighbours" could score low on Hamming distance and displace real neighbours from the candidate beam (search.rs:206-212), violating the SymphonyQG paper's "no spurious-edge contribution" invariant. Fix: introduce graph::PADDING_SENTINEL = u32::MAX. Initialise the neighbours array to the sentinel and zero-fill nb_codes; the existing `nb >= g.n` check at search.rs:201 already rejects the sentinel (u32::MAX as usize > any practical g.n). Padded code bytes have constant Hamming distance from any query, so the SIMD popcount over them produces a uniform score that the sentinel skip discards before any heap insert. Empirical impact: small-corpus recall@10 measurement at dim=128, n=500, ef=300 went from a 60% floor to 71.4% measured (and the test floor is now 70%). Big-corpus PR-body claim (97.6% at n=5000) needs to be re-measured in a follow-up iteration. 2. **ADR-191 collision → ADR-193** (docs/adr/) Renamed ADR-191-symphonyqg-inline-fastscan-graph.md to ADR-193 to resolve the conflict with ADR-191-sparse-attention-pi-zero-2w- production-hardening.md (merged to main yesterday in PR #429). Updated frontmatter, title, related: chain, and authors typo (ruvenet → ruvnet). 3. **clippy::manual_div_ceil** (graph.rs:91) ((m + BATCH_SIZE - 1) / BATCH_SIZE) * BATCH_SIZE → m.div_ceil(BATCH_SIZE) * BATCH_SIZE 4. **cargo fmt -p ruvector-symphonyqg** — all the small whitespace diffs the workflow's Rustfmt check was failing on. Test floor: symphony_recall_at_10 renamed from above_60pct to above_70pct, with comments documenting the measurement gap between this small-corpus regression test and the PR body's headline number. 7/7 tests pass. Clippy clean (-D warnings). Deferred to next iteration: - Repack neighbours+codes into a single per-vertex packed buffer so the ADR's "inline" claim actually holds at the cache-line level (currently six independent Vecs share zero locality). - Re-run `src/main.rs` end-to-end and update the PR body / ADR with honest post-fix recall + speedup numbers at n=5000. - Investigate the `Tests (core-and-rest)` 3-hour timeout in workflow. - Add edge-case tests for n<BATCH_SIZE and dim non-multiple of 64. Co-Authored-By: claude-flow <ruv@ruv.net>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three accumulated changes for
ruvllm_sparse_attention, ready to ship as v0.1.1.1. FastGRNN salience gate → near-linear attention (new public API)
FastGrnnGatecell withscore_sequence/score_kv/keep_mask_top_k/keep_mask_quantileSubquadraticSparseAttention::forward_gated(q, k, v, &keep_mask)— drops below-threshold tokens from the long-range candidate set; window + globals + current always retained (causality preserved)SubquadraticSparseAttention::forward_gated_with_fastgrnn(q, k, v, &gate, top_k)— convenience wrapper2. no_std + alloc + ESP32-S3 (ADR-192)
`cargo +esp build --release --no-default-features --features fp16 --target xtensa-esp32s3-none-elf -Z build-std=core,alloc` → 1.02 s, 376 KB rlib
3. Documentation + ADRs
Test plan
Versioning
Bumps `0.1.0` → `0.1.1` (patch, purely additive). v0.1.1 is already published to crates.io: https://crates.io/crates/ruvllm_sparse_attention/0.1.1
A GitHub release `ruvllm_sparse_attention-v0.1.1` will be cut against the merge commit once this PR lands.
Migration
Zero action required for existing std consumers — `std` is the default feature. New no_std consumers (ESP32 / Cortex-M / RISC-V MCU) opt out:
```toml
ruvllm_sparse_attention = { version = "0.1.1", default-features = false, features = ["fp16"] }
```
Bring your own allocator + panic handler + `#[entry]` (`embedded-alloc`, `linked_list_allocator`).