SONA learn→inference loop unwired at the JS/WASM boundary: learn_from_feedback is a no-op; MicroLoRA only adapts on multi-step varying-reward trajectories #519

@pacphi

A downstream consumer (ruflo, ruvnet/ruflo#2222) found that SONA's trained adapter never changes a routing/recall decision — empirically Δ=0 after ~200 adapts — and attributed it to "@ruvector/ruvllm MicroLoRA apply() being inert." I cloned ruvnet/ruvector (c2089c4, 2026-05-28) and verified against the actual source. The real picture is more specific, and worth correcting: inference seams exist and work (applyLora, MicroLoRA::forward, LoraAdapter.forward) — this is not "no forward path." The gap is that the learn→adapt loop is not wired through the JS/WASM entry points a consumer actually calls, and even the Rust path only adapts under conditions a typical consumer won't hit. Filing so the seam isn't lost.

Findings (source + reproduction)

1. WasmSonaEngine::learn_from_feedback is a no-op (crates/sona/src/wasm.rs:183-192). It computes a reward and console.logs it, then returns; it never builds a trajectory, accumulates a gradient, or touches a weight:

pub fn learn_from_feedback(&self, success: bool, latency_ms: f32, quality: f32) {
    let reward = if success { quality } else { -quality };
    web_sys::console::log_1(&format!("Feedback: ... reward={}", reward).into());
}

This is the natural JS/WASM "I have an outcome, learn from it" entry point — calling it any number of times trains nothing.

2. The pure-JS SonaCoordinator.processInstantLearning is an empty stub (npm/packages/ruvllm/src/sona.js:449-452). Its body is only the comment "// In full implementation, this updates LoRA weights", with no code. So the JS-package SONA coordinator never adapts LoRA either. (The package's LoraAdapter.forward/backward in lora.ts do work, but SonaCoordinator never instantiates or calls them.)

3. The Rust trajectory path adapts only on multi-step, varying-reward trajectories. LearningSignal::estimate_gradient (crates/sona/src/types.rs:69-101) is REINFORCE with a mean-reward baseline. A single-step trajectory — the common "one outcome per task" shape — has advantage = reward − mean(reward) = 0, giving a zero gradient. Reproduced: 200 single-step adapts → Δ = 0 exactly.

4. MicroLoRA::accumulate_gradient only writes grad_up (crates/sona/src/lora.rs:192-229). The down_proj gradient (grad_down) is allocated, zeroed, and reset, but never updated, so adaptation is up-projection-only (rank-deficient). Reproduced: down_proj Δ = 0, up_proj |w| = 0.566.

5. (Bonus correctness smell) estimate_gradient L2-normalizes before returning, so a uniform-reward trajectory (true advantage ≈ 0) has its f32 baseline residue amplified into a unit-norm gradient: a full-magnitude update from a no-information signal. Reproduced: ||gradient_estimate|| = 1.0, spurious Δ ≈ 1.26e-2.
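
Findings 3 and 5 both come down to two arithmetic steps: subtracting a mean-reward baseline, then L2-normalizing the result. The sketch below is a standalone illustration of that arithmetic, not the code in types.rs. The names `advantages`, `l2_norm`, and `normalize_naive` are made up for this example, and the residue vector is hand-picked rather than taken from a real trajectory.

```rust
/// Per-step advantages with a mean-reward baseline (REINFORCE-style).
/// Illustrative only; this is not the crate's estimate_gradient.
fn advantages(rewards: &[f32]) -> Vec<f32> {
    if rewards.is_empty() {
        return Vec::new();
    }
    let mean = rewards.iter().sum::<f32>() / rewards.len() as f32;
    rewards.iter().map(|r| r - mean).collect()
}

/// Euclidean norm of a vector.
fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Naive L2 normalization: rescales any non-zero vector to unit length,
/// no matter how small it was to begin with.
fn normalize_naive(v: &[f32]) -> Vec<f32> {
    let norm = l2_norm(v);
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

fn main() {
    // Single step: the baseline equals the only reward, so the advantage is 0.
    println!("single-step advantages: {:?}", advantages(&[1.0]));

    // Varying rewards: non-zero advantages, so there is a signal to learn from.
    println!("two-step advantages:    {:?}", advantages(&[0.1, 0.9]));

    // Hand-picked stand-in for f32 rounding residue (hypothetical values).
    let residue = [3.0e-8_f32, -4.0e-8, 0.0, 0.0];
    let scaled = normalize_naive(&residue);
    println!(
        "residue norm = {:e}, norm after normalize = {:e}",
        l2_norm(&residue),
        l2_norm(&scaled)
    );
}
```

The single-step case yields an exactly zero advantage because the mean of one reward is that reward. The residue case shows why normalization without a magnitude check turns a near-zero vector into a unit-length one.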

Reproduction

cargo test -p ruvector-sona --test repro_delta_zero -- --nocapture (drives the real public API — MicroLoRA + LearningSignal::from_trajectory + TrajectoryBuilder, no mocks):

[single-step feedback]   Δ after 200 adapts = 0e0          (inert)
[uniform-reward 3-step]  ||grad||=1.0, spurious Δ = 1.26e-2 (FP residue amplified)
[varying-reward 2-step]  Δ after 200 adapts = 1.71e-2       (works — control)
[freeze check]           down_proj Δ = 0,  up_proj |w| = 0.566

Full reproduction test (drop in crates/sona/tests/repro_delta_zero.rs):

use ruvector_sona::{LearningSignal, MicroLoRA, TrajectoryBuilder};

const DIM: usize = 16;
const RANK: usize = 2;
const LR: f32 = 0.01;

fn forward_probe(lora: &MicroLoRA, input: &[f32]) -> Vec<f32> {
    let mut out = vec![0.0f32; input.len()];
    lora.forward(input, &mut out);
    out
}
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// The natural "task finished, here is its quality" pattern: one step, one final score.
#[test]
fn single_step_feedback_is_inert() {
    let mut lora = MicroLoRA::new(DIM, RANK);
    let probe: Vec<f32> = (0..DIM).map(|i| (i as f32 * 0.1).sin()).collect();
    let baseline = forward_probe(&lora, &probe);
    for n in 0..200u64 {
        let mut b = TrajectoryBuilder::new(n, probe.clone());
        b.add_step(probe.clone(), vec![1.0; DIM], 1.0);
        let signal = LearningSignal::from_trajectory(&b.build(0.95));
        lora.accumulate_gradient(&signal);
        lora.apply_accumulated(LR);
    }
    let delta = l2(&baseline, &forward_probe(&lora, &probe));
    println!("[single-step feedback] Δ after 200 adapts = {delta:e}");
    assert_eq!(delta, 0.0);
}

/// Uniform per-step rewards: true advantage 0, but L2-normalized FP residue → unit-norm gradient.
#[test]
fn uniform_reward_amplifies_fp_residue() {
    let mut lora = MicroLoRA::new(DIM, RANK);
    let probe: Vec<f32> = (0..DIM).map(|i| (i as f32 * 0.2).cos()).collect();
    let baseline = forward_probe(&lora, &probe);
    let mut b0 = TrajectoryBuilder::new(0, probe.clone());
    b0.add_step(probe.clone(), vec![1.0; DIM], 0.9);
    b0.add_step(probe.clone(), vec![1.0; DIM], 0.9);
    b0.add_step(probe.clone(), vec![1.0; DIM], 0.9);
    let sig0 = LearningSignal::from_trajectory(&b0.build(0.9));
    let gnorm: f32 = sig0.gradient_estimate.iter().map(|x| x * x).sum::<f32>().sqrt();
    println!("[uniform-reward] true advantage=0, yet ||gradient_estimate|| = {gnorm:e}");
    for n in 0..200u64 {
        let mut b = TrajectoryBuilder::new(n, probe.clone());
        b.add_step(probe.clone(), vec![1.0; DIM], 0.9);
        b.add_step(probe.clone(), vec![1.0; DIM], 0.9);
        b.add_step(probe.clone(), vec![1.0; DIM], 0.9);
        let signal = LearningSignal::from_trajectory(&b.build(0.9));
        lora.accumulate_gradient(&signal);
        lora.apply_accumulated(LR);
    }
    let delta = l2(&baseline, &forward_probe(&lora, &probe));
    println!("[uniform-reward 3-step] spurious Δ after 200 adapts = {delta:e}");
    assert!(delta > 0.0);
}

/// Control: varying per-step rewards DO adapt — the mechanism works when fed a real gradient.
#[test]
fn varying_reward_multistep_adapts() {
    let mut lora = MicroLoRA::new(DIM, RANK);
    let probe: Vec<f32> = (0..DIM).map(|i| (i as f32 * 0.3).sin() + 0.2).collect();
    let baseline = forward_probe(&lora, &probe);
    for n in 0..200u64 {
        let mut b = TrajectoryBuilder::new(n, probe.clone());
        b.add_step(probe.clone(), vec![1.0; DIM], 0.1);
        b.add_step(probe.clone(), vec![1.0; DIM], 0.9);
        let signal = LearningSignal::from_trajectory(&b.build(0.8));
        lora.accumulate_gradient(&signal);
        lora.apply_accumulated(LR);
    }
    let delta = l2(&baseline, &forward_probe(&lora, &probe));
    println!("[varying-reward 2-step] Δ after 200 adapts = {delta:e}");
    assert!(delta > 0.0);
}

/// down_proj is never adapted regardless of signal — only up_proj moves.
#[test]
fn down_proj_is_frozen() {
    let mut lora = MicroLoRA::new(DIM, RANK);
    let (down_before, _) = { let (d, u) = lora.get_weights(); (d.clone(), u.clone()) };
    for n in 0..50u64 {
        let mut b = TrajectoryBuilder::new(n, vec![0.5; DIM]);
        b.add_step(vec![0.5; DIM], vec![1.0; DIM], 0.1);
        b.add_step(vec![0.5; DIM], vec![1.0; DIM], 0.9);
        lora.accumulate_gradient(&LearningSignal::from_trajectory(&b.build(0.8)));
        lora.apply_accumulated(LR);
    }
    let (down_after, up_after) = lora.get_weights();
    let down_delta = l2(&down_before, down_after);
    let up_delta: f32 = up_after.iter().map(|x| x * x).sum::<f32>().sqrt();
    println!("[freeze check] down_proj Δ = {down_delta:e}  up_proj |w| = {up_delta:e}");
    assert_eq!(down_delta, 0.0);
    assert!(up_delta > 0.0);
}

Net effect for consumers

A JS/WASM consumer driving SONA via the documented learnFromFeedback API — or via single-outcome trajectories — observes zero adaptation; applyLora/applyMicroLora then return the untrained (up_proj = 0) transform. The forward/inference seam exists; the learn→inference loop is just not connected through the bindings consumers reach for.
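
To make the "up_proj = 0" remark concrete, here is a minimal sketch of a generic low-rank adapter, delta = U (D x), with gradients for both factors. It rests on assumptions: the shapes (D is rank x dim, U is dim x rank, both row-major) and the helper name `lora_delta_and_grads` are hypothetical, and this is textbook LoRA arithmetic rather than MicroLoRA's actual layout or update rule.

```rust
/// Generic low-rank adapter: delta = U * (D * x).
/// Assumed layout (not MicroLoRA's): `down` is rank x dim, `up` is dim x rank, row-major.
/// `g` is the upstream gradient with respect to delta.
/// Returns (delta, grad_down, grad_up).
fn lora_delta_and_grads(
    down: &[f32],
    up: &[f32],
    x: &[f32],
    g: &[f32],
    rank: usize,
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let dim = x.len();
    // h = D * x (length: rank)
    let h: Vec<f32> = (0..rank)
        .map(|r| (0..dim).map(|j| down[r * dim + j] * x[j]).sum::<f32>())
        .collect();
    // delta = U * h (length: dim)
    let delta: Vec<f32> = (0..dim)
        .map(|i| (0..rank).map(|r| up[i * rank + r] * h[r]).sum::<f32>())
        .collect();
    // U^T * g (length: rank), needed for the down-projection gradient
    let ut_g: Vec<f32> = (0..rank)
        .map(|r| (0..dim).map(|i| up[i * rank + r] * g[i]).sum::<f32>())
        .collect();
    let mut grad_up = vec![0.0f32; dim * rank];
    let mut grad_down = vec![0.0f32; rank * dim];
    for i in 0..dim {
        for r in 0..rank {
            grad_up[i * rank + r] = g[i] * h[r]; // dL/dU = g * h^T
            grad_down[r * dim + i] = ut_g[r] * x[i]; // dL/dD = (U^T g) * x^T
        }
    }
    (delta, grad_down, grad_up)
}

fn main() {
    let down = [1.0f32, 0.0, 0.0];
    let x = [1.0f32, 2.0, 3.0];
    let g = [1.0f32, 1.0, 1.0];

    // Untrained adapter: up-projection is all zeros.
    let (delta, grad_down, grad_up) = lora_delta_and_grads(&down, &[0.0; 3], &x, &g, 1);
    println!("up = 0:  delta={delta:?} grad_down={grad_down:?} grad_up={grad_up:?}");

    // After the up-projection has moved, the down-projection gradient is non-zero.
    let (delta, grad_down, _) = lora_delta_and_grads(&down, &[0.5, 0.0, 0.0], &x, &g, 1);
    println!("up != 0: delta={delta:?} grad_down={grad_down:?}");
}
```

In this generic form the down-projection gradient is proportional to U^T g, so with a zero-initialized up-projection it is zero on the first step and only becomes non-zero once the up-projection has moved. Whether MicroLoRA would behave the same way if grad_down were wired up depends on its actual forward definition.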

Suggested directions

  • Wire learn_from_feedback to build a single-step-safe LearningSignal (e.g. fall back to a query_embedding-based gradient when steps < 2, or skip baseline subtraction for a single step) and accumulate + flush.
  • Implement processInstantLearning in @ruvector/ruvllm, or mark it explicitly unimplemented so consumers don't assume it adapts.
  • Adapt down_proj too, or document up-only adaptation as intentional.
  • Guard estimate_gradient against amplifying a near-zero gradient (don't normalize when the pre-norm magnitude is ≈ 0).
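
As a rough illustration of the first and last suggestions, the sketch below shows one possible shape for a single-step-safe baseline and a guarded normalization. Everything here is hypothetical: the function names, the `MIN_GRAD_NORM` threshold, and the choice to fall back to the raw reward are assumptions for discussion, not a proposal for the exact patch.

```rust
/// Hypothetical noise floor; a real value would need tuning against the crate's scales.
const MIN_GRAD_NORM: f32 = 1e-6;

/// With fewer than two rewards there is no meaningful mean baseline,
/// so fall back to the raw reward instead of subtracting it from itself.
fn advantages_single_step_safe(rewards: &[f32]) -> Vec<f32> {
    if rewards.len() < 2 {
        return rewards.to_vec();
    }
    let mean = rewards.iter().sum::<f32>() / rewards.len() as f32;
    rewards.iter().map(|r| r - mean).collect()
}

/// Normalize only when the pre-norm magnitude is above the noise floor;
/// otherwise return an all-zero gradient so that no update is applied.
fn normalize_guarded(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if !norm.is_finite() || norm < MIN_GRAD_NORM {
        return vec![0.0; v.len()];
    }
    v.iter().map(|x| x / norm).collect()
}

fn main() {
    // Single outcome: the reward survives as the learning signal.
    println!("{:?}", advantages_single_step_safe(&[1.0]));
    // Residue-sized input: suppressed instead of amplified.
    println!("{:?}", normalize_guarded(&[3.0e-8, -4.0e-8]));
    // Ordinary input: normalized as usual.
    println!("{:?}", normalize_guarded(&[3.0, 4.0]));
}
```

Using the raw reward for a single step trades variance reduction for a non-zero signal. A running baseline kept across trajectories would be another option, at the cost of extra state.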

Related but distinct: #516 (@ruvector/sona published without build output → MODULE_NOT_FOUND) is a packaging issue; this is about the no-op learn path when the module is loaded and running.

Environment: ruvnet/ruvector@c2089c4, Rust 1.95, crate ruvector-sona 0.2.0. Downstream context: ruflo 3.10.10 / Node 26.
