feat(memory): record real LLM cost in sync audit (#3110)#3150
Conversation
…sai#3110) Add ChatProvider::chat_for_text_with_usage returning (String, Option<UsageInfo>) with a default impl reporting None so test doubles and external impls keep compiling. InferenceChatProvider routes through Provider::chat (which parses usage via compatible::extract_usage) instead of chat_with_history, which discarded it.
…nsai#3110) Add input_tokens, output_tokens, charged_amount_usd to SummaryOutput and populate them from the provider response in summarise(). Zero charge is treated as absent (None) so callers fall back to the estimate. fallback_summary reports no usage.
…3110) New Option<f64> field with #[serde(default)] so pre-tinyhumansai#3110 audit lines still deserialize. Adds effective_cost_usd()/cost_is_actual() helpers; keeps estimated_cost_usd + estimate_cost_usd() as the fallback.
…3110) github sync and rebuild now accumulate real provider token counts + charged_amount_usd across summarise() batches, recording them in the audit entry when any batch reported usage and otherwise falling back to the len/4 estimate. RebuildOutcome carries actual_charged_usd through to callers.
…3110) Set actual_charged_usd: None on the zero-cost audit entries written when no LLM call happens, and log the real charge (falling back to the estimate) on rebuild completion.
📝 WalkthroughWalkthroughThis PR threads actual LLM token counts and billing amounts from the inference provider through the memory chat layer into sync audit logs. It introduces a new ChangesProvider usage threading and audit instrumentation
Sequence Diagram(s)sequenceDiagram
participant ChatProvider
participant InferenceChatProvider
participant InferenceProvider
participant summarise
participant SyncAudit
ChatProvider->>InferenceChatProvider: chat_for_text_with_usage()
InferenceChatProvider->>InferenceProvider: chat(ChatRequest)
InferenceProvider-->>InferenceChatProvider: ChatResponse{text, usage}
InferenceChatProvider-->>ChatProvider: (String, Option<UsageInfo>)
ChatProvider->>summarise: with usage data
summarise-->>SyncAudit: SummaryOutput{tokens, charge}
SyncAudit->>SyncAudit: effective_cost_usd()
SyncAudit->>SyncAudit: cost_is_actual()
🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory_sync/sources/github.rs`:
- Around line 254-265: The current aggregation adds per-batch real token/charge
values into real_input_tokens/real_output_tokens/real_charged_usd whenever any
batch reports non-zero, causing mixed batches to produce partial "real" rollups;
change the logic to only use the real_* totals when every batch in the run
provided real usage/charge. Concretely, inside the loop that inspects
output.input_tokens, output.output_tokens and output.charged_amount_usd, set
boolean flags like saw_any_real_token, saw_all_real_token (and analogously for
charge) by initializing all-real flags true and clearing them if you encounter a
fallback/no-usage batch (input_tokens==0 && output_tokens==0 or
charged_amount_usd.is_none()), sum real_* as you do but at the end only
replace/interpret estimated totals with
real_input_tokens/real_output_tokens/real_charged_usd if the corresponding
saw_all_real_* flags remain true; apply the same change for the similar block
referenced around the 305-325 region so rollups are only labeled "real" when all
batches reported real values (use identifiers output.input_tokens,
output.output_tokens, output.charged_amount_usd, real_input_tokens,
real_output_tokens, real_charged_usd, saw_real_charge as anchors).
In `@src/openhuman/memory_sync/sources/rebuild.rs`:
- Around line 219-227: The current logic promotes per-batch provider
usage/charge into run-level "real" totals even when only some batches report
provider values, causing underreporting; change it to only accept
provider-reported usage/charge when coverage is complete across all batches:
introduce counters (e.g., total_batches and batches_with_usage and
batches_with_charge) and update them when
output.input_tokens/output.output_tokens and output.charged_amount_usd are
present, accumulate into real_input_tokens/real_output_tokens/real_charged_usd
as now, but after iterating all batches replace the run-level totals only if
batches_with_usage == total_batches (for tokens) and batches_with_charge ==
total_batches (for charges); apply the same guarded logic for the other
identical block later (the section referencing real_* at lines ~264-283) so
partial coverage never promotes partial provider values to "actual."
In `@src/openhuman/memory/chat.rs`:
- Around line 111-112: The code currently masks missing provider text by using
response.text.unwrap_or_default(); instead, fail fast when response.text is
None: replace the unwrap_or_default usage with an explicit check on
response.text (e.g., match or if let Some) and return/propagate an error (or
trigger the existing fallback) when it's None so callers don’t receive an empty
string; update the handling around the variable names response, text (and keep
usage = response.usage) to ensure the function returns a Result/Error path when
text is missing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 42eb39b6-e740-4869-85ef-62e7ac170426
📒 Files selected for processing (6)
src/openhuman/memory/chat.rssrc/openhuman/memory_sources/sync.rssrc/openhuman/memory_sync/sources/audit.rssrc/openhuman/memory_sync/sources/github.rssrc/openhuman/memory_sync/sources/rebuild.rssrc/openhuman/memory_tree/summarise.rs
graycyrus
left a comment
There was a problem hiding this comment.
@oxoxDev heads up — CI is failing on this PR (Rust Core Coverage, Rust E2E mock backend, and PR Submission Checklist), so I'll hold off on a full approval until those are sorted out. I did spot a couple of things while going through the diff:
-
Missing tests for the multi-batch accumulation logic — the loops in
github.rsandrebuild.rsthat foldSummaryOutputinto thereal_*accumulators have no unit tests. That's the most complex new code in this PR and the exact site where the partial-rollup issue was flagged. Even with the partial-rollup fix applied, those code paths need regression coverage — a test with two batches where one returns usage and one doesn't would catch the undercount. -
Duplicated accumulation logic — the
est_*/real_*/saw_real_chargeaccumulation block is copy-pasted verbatim betweengithub.rsandrebuild.rs. Any fix (including the partial-rollup one) has to land in both files. Worth pulling into a shared struct or helper so they can't drift apart.
Fix the CI and address these two, and I'll come back for a proper pass.
…sai#3110) Add RealCostAccumulator: provider-reported tokens/charges are promoted to the run-level audit figure only when every batch carried that signal. A mixed run (some batches report usage, some fall back) kept a partial real total that undercounts the run versus the estimate. Centralises the accounting so github.rs and rebuild.rs can't drift apart.
…ator (tinyhumansai#3110) Replace the duplicated per-batch accumulation in both source pipelines with the shared accumulator. Fixes the partial-rollup undercount and removes the copy-pasted est_*/real_*/saw_real_charge block so a future fix can't land in one file and miss the other.
…ai#3110) Replace response.text.unwrap_or_default() with an explicit None check that returns an error. An empty summary would otherwise be ingested and counted against the run's real charge as if it were valid output; the caller's fallback_summary path is the correct recovery and only runs on Err.
|
@graycyrus thanks — both points addressed in 139278a:
This also fixes the partial-rollup correctness bug both you and CodeRabbit flagged: provider tokens/charge are only promoted to the run-level figure when every batch reported them. On CI — the failing |
graycyrus
left a comment
There was a problem hiding this comment.
@oxoxDev all prior changes addressed. the two things i flagged — missing multi-batch tests and the duplicated accumulation block — are both resolved cleanly.
the RealCostAccumulator is the right abstraction: all-or-nothing promotion logic is in one place, both source pipelines route through it, and the four unit tests directly exercise the mixed-batch and partial-charge edge cases i was worried about. the fail-fast on empty provider text is also in.
code is clean. holding off on approval until CI is fully green (Rust Core Coverage, Rust E2E, Playwright, and Frontend Coverage are still pending). no new issues found in the follow-up commits.
memory_sources_validation_and_sync_classification_edges asserted
toolkit_from_slug(" MICROSOFT_TEAMS_SEND ") == Some("microsoft"), but the
slug map deliberately yields "microsoft_teams" (tool_scope.rs:98) and the
function's own unit test (tool_scope.rs) already asserts "microsoft_teams".
Align the e2e assertion with documented behavior; pre-existing failure on
main, unrelated to this PR's cost-audit changes.
graycyrus
left a comment
There was a problem hiding this comment.
@oxoxDev the toolkit_from_slug assertion fix in the latest commit looks correct — "microsoft_teams" is the right expected value per the slug map. That should unblock the Rust Core Coverage gate.
At the time of this check, CI is still showing failures on Rust Core Coverage and Playwright lane 1/4 — they may still be in-flight against the new HEAD. Everything else is green. Once the full suite is green I'll approve.
Summary
ChatProvider::chat_for_text_with_usagesurfaces providerUsageInfo(additive; default impl returns(text, None)— no ripple for other providers).SummaryOutputnow carriesinput_tokens/output_tokens/charged_amount_usd; github sync + tree rebuild thread the real charge into the audit entry.SyncAuditEntry.actual_charged_usd: Option<f64>(#[serde(default)]) witheffective_cost_usd()/cost_is_actual()helpers; the existing estimate stays as a fallback.actual_charged_usdkey) still deserialize and render their estimate.Problem
sync_audit.jsonlonly storedestimated_cost_usd, computed from atoken_count / 4heuristic. There was no record of what the provider actually billed for a memory sync, so real spend could not be reconciled against the estimate — and the estimate drifts from reality whenever provider pricing or tokenization differs from the heuristic.Solution
src/openhuman/memory/chat.rs— addChatProvider::chat_for_text_with_usage(default returns(text, None));InferenceChatProviderroutes throughProvider::chatand parses the returnedUsageInfo.src/openhuman/memory_tree/summarise.rs— thread real usage intoSummaryOutput(input_tokens,output_tokens,charged_amount_usd); treat a zero charge as absent.src/openhuman/memory_sync/sources/audit.rs— addactual_charged_usd(#[serde(default)]) +effective_cost_usd()(prefers actual, falls back to estimate) andcost_is_actual().src/openhuman/memory_sync/sources/github.rs+rebuild.rs(+memory_sources/sync.rs) — record the real charge on the github-sync and tree-rebuild paths; logcost_is_actual.Design note:
ingest.rsis intentionally left untouched — cost is read offSummaryOutputbeforeingest_summary, so threading it throughSummaryIngestInput.token_countwould be dead plumbing.Submission Checklist
audit/summarise/chatincl. failure/edge cases (provider reports no usage →Nonefallback, zero-charge treated as absent, pre-feat(memory): surface actual LLM usage/cost from inference API in sync audit #3110 entry deserializes).RealCostAccumulator, the two source-pipeline call sites, and thechat.rsfail-fast path) are exercised by the newaccumulator_*unit tests plus existingaudit/chatsuites;cargo-llvm-covnot run locally (heavy), the CI Coverage Gate enforces the threshold.## Related— N/A: no feature behaviour added or moved.Closes #NNNin the## Relatedsection.Impact
#[serde(default)]makes the new field optional — oldsync_audit.jsonllines load unchanged.Nonebranch is the prior behaviour).None/estimate-fallback branch + back-compat were confirmed live on staging (real audit entry written, deserializes clean, full sync→seal pipeline ran without regression). TheSome/real-charge branch is unit-covered; observing it live requires a billing-enabled summarizer backend (staging returns no charge).Related
06cd47e3): fixed a stale assertion intests/memory_raw_coverage_e2e.rs(toolkit_from_slug(" MICROSOFT_TEAMS_SEND ")expected"microsoft"but the slug map and the function's own unit test yield"microsoft_teams"). This was failing the Rust Core Coverage gate onmainfor every PR, unrelated to this PR's cost-audit changes; included here to unblock CI.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
feat/3110-real-llm-cost-audit447767743014bc1249a1186b0617f627f2b6c654Validation Run
pnpm --filter openhuman-app format:check— N/A: no frontend files changed;cargo fmt --all --checkclean.pnpm typecheck— N/A: Rust-only change, no TS touched.cargo test --lib openhuman::memory_sync::sources::audit | memory_tree::summarise | memory::chat→ 20 passed.cargo check --libclean,cargo clippy --lib --no-deps0 new warnings on changed files.app/src-taurinot touched.Validation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
actual_charged_usdnow present insync_audit.jsonl/ sync-audit panel for billed syncs; estimate-only behaviour unchanged otherwise.Summary by CodeRabbit