Add KVZCH inference read-time hit rate metrics via fb303 ODS counters by hy-NJU · Pull Request #5745 · pytorch/FBGEMM

hy-NJU · 2026-05-07T17:36:19Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2675

Re-land of D101879296 (reverted by D104116880 — see S659921), C++-only this time.

Summary

KVZCH embedding cache on the serving/inference side lacked structured metrics for
read-time (forward pass) hit rate. The C++ backend already tracked per-shard miss
counts internally but never emitted them as ODS counters, and there was no total
read count, making hit rate calculation impossible.

This diff adds:

- Atomic `read_hit_count_` / `read_miss_count_` counters in `DramKVInferenceEmbedding`
- Per-batch fb303 ODS counter emission (`kvzch.inference.read_hit_count`,
  `kvzch.inference.read_miss_count`, `kvzch.inference.read_total_count`) in the
  `get_kv_db_async_impl` aggregation callback
- `get_read_hit_rate_stats()` virtual method on `KVInferenceEmbeddingInterface`,
  implemented in `DramKVInferenceEmbedding`, no-op stub on `SSDKVInferenceEmbedding`
- `DramKVEmbeddingInferenceWrapper::get_read_hit_rate_stats` C++ method + TorchScript
  registration so the predictor binary recognizes any model that exports the method
- `kvzch.inference.*` regex added to `PredictorXUtils.cpp` fb303Collector allowlist
  so ODS scrapes the new counters
- `//fb303:service_data` BUCK dep on `dram_kv_embedding_inference`
- One-time `addStatExportType(SUM)` registration via `folly::call_once` so counters
  surface in ODS as `*.sum.60` time-series

Performance impact is minimal: one stack-local (non-atomic) increment per lookup in
the hit branch, plus two atomic fetch-adds and three `addStatValue` calls per batch.
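
The per-batch accounting described above can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: `ReadHitRateCounters`, `record_batch`, `StatSink`, and `hit_rate` are hypothetical names, the cache is modeled as a plain `std::unordered_set`, and the sink callback only stands in for the fb303 `addStatValue` calls.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the stat sink; the diff emits through fb303 instead.
using StatSink = std::function<void(const std::string& key, int64_t value)>;

class ReadHitRateCounters {
 public:
  explicit ReadHitRateCounters(StatSink sink) : sink_(std::move(sink)) {}

  // Called once per batch. Hits are tallied in a stack-local variable so the
  // per-lookup cost is a plain increment; atomics are touched once per batch.
  void record_batch(const std::vector<int64_t>& ids,
                    const std::unordered_set<int64_t>& cache) {
    int64_t hits = 0;
    for (const int64_t id : ids) {
      if (cache.count(id) != 0) {
        ++hits;  // hit branch: stack-local increment only
      }
    }
    const int64_t total = static_cast<int64_t>(ids.size());
    assert(hits <= total);
    const int64_t misses = total - hits;

    // Two atomic fetch-adds per batch.
    read_hit_count_.fetch_add(hits, std::memory_order_relaxed);
    read_miss_count_.fetch_add(misses, std::memory_order_relaxed);

    // Three stat emissions per batch, using the counter keys from the summary.
    sink_("kvzch.inference.read_hit_count", hits);
    sink_("kvzch.inference.read_miss_count", misses);
    sink_("kvzch.inference.read_total_count", total);
  }

  int64_t hits() const { return read_hit_count_.load(std::memory_order_relaxed); }
  int64_t misses() const { return read_miss_count_.load(std::memory_order_relaxed); }

 private:
  StatSink sink_;
  std::atomic<int64_t> read_hit_count_{0};
  std::atomic<int64_t> read_miss_count_{0};
};

// Hit rate derived from the two counters; defined as 0 when nothing was read.
inline double hit_rate(int64_t hits, int64_t misses) {
  const int64_t total = hits + misses;
  return total == 0 ? 0.0
                    : static_cast<double>(hits) / static_cast<double>(total);
}
```

Emitting a total alongside hits and misses is what lets a dashboard compute the rate as hits divided by total without needing both of the other series to be present.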

Why this is C++-only (vs original D101879296)

Original D101879296 also added a Python torch.jit.export def get_read_hit_rate_stats(...)
on KVEmbeddingInference. Models trained after the diff baked the method into their
TorchScript graph, but stale predictor binaries on vg_worker_byoc_t16_gti_mrs_trunk_health_prod
(and sigrid_predictor_gpu.persistent:prod from Apr 22) lacked the C++ TorchScript
registration. Models loaded against those predictors crashed with:

torch::jit::ErrorReport: torch.torch.classes.fbgemm.DramKVEmbeddingInferenceWrapper object has no attribute or method get_read_hit_rate_stats

Root cause: D101879296 shipped the Python TorchScript-export side and the C++
TorchScript-registration side in a single diff with no coordinated predictor-binary
rollout. Newly-trained models required the new C++ class registration, but no
predictor binary anywhere had it yet — neither :prod (v5529, Apr 22) nor :LATEST
(v5704, Apr 29) of sigrid_predictor_gpu.persistent was built at a revision >= the
landing commit. This caused S659921 (mvai/video_udd_lsr serving_eval blocked on
R5449-R5450).

This re-land drops the Python torch.jit.export side entirely. ODS counter
emission lives in C++ unconditionally — it does NOT depend on any Python caller
invoking get_read_hit_rate_stats(). The fb303 addStatValue calls fire on every
batch automatically. So:

- ODS gets the metrics: same end goal as the original diff
- No model TorchScript graph changes: no stale-predictor crash possible
- No predictor-binary wait gate needed: even predictor binaries built before
  this diff will load every existing and future model normally, because no model
  exports a new method that requires the new C++ registration

If a Python get_read_hit_rate_stats() programmatic API is ever needed, it can
ship in a follow-up diff AFTER predictor binaries on every KVZCH serving tier
are rolled out at a revision >= this diff's landing commit.

Note on counter consistency (per AI reviewer feedback on D101879296)

The two exchange(0) calls on read_hit_count_ and read_miss_count_ in
get_read_hit_rate_stats() are not a combined snapshot — between the two
exchanges, a concurrent batch callback may add to read_miss_count_ before it is
exchanged, so the returned hit and miss values may correspond to slightly different
time windows. This is acceptable for ODS hit-rate aggregation (where individual
sub-second slices don't matter) and is now documented in code. If a strict snapshot
is needed in the future, both exchanges can be wrapped in a small mutex that the
batch callback also holds while it updates the two counters.
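
The difference between the two-exchange read and a strict snapshot can be sketched as follows. This is an illustration under assumptions rather than the code in the diff: `RelaxedCounters`, `StrictCounters`, `HitMiss`, `add`, and `drain` are hypothetical names. Note that in the strict variant the writer path takes the same mutex as the reader; locking only around the two exchanges would not stop a concurrent `fetch_add` from landing between them.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

struct HitMiss {
  int64_t hits;
  int64_t misses;
};

// Relaxed variant: two independent atomics, drained with two exchange(0) calls.
class RelaxedCounters {
 public:
  void add(int64_t hits, int64_t misses) {
    assert(hits >= 0 && misses >= 0);
    hit_.fetch_add(hits, std::memory_order_relaxed);
    miss_.fetch_add(misses, std::memory_order_relaxed);
  }

  HitMiss drain() {
    const int64_t h = hit_.exchange(0, std::memory_order_relaxed);
    // A concurrent add() can run here, so h and m may cover slightly
    // different time windows. No count is lost; it lands in a later drain.
    const int64_t m = miss_.exchange(0, std::memory_order_relaxed);
    return HitMiss{h, m};
  }

 private:
  std::atomic<int64_t> hit_{0};
  std::atomic<int64_t> miss_{0};
};

// Strict variant: writers and the reader share one mutex, so every drain()
// observes hits and misses from exactly the same set of batches.
class StrictCounters {
 public:
  void add(int64_t hits, int64_t misses) {
    assert(hits >= 0 && misses >= 0);
    std::lock_guard<std::mutex> guard(mu_);
    hit_ += hits;
    miss_ += misses;
  }

  HitMiss drain() {
    std::lock_guard<std::mutex> guard(mu_);
    const HitMiss out{hit_, miss_};
    hit_ = 0;
    miss_ = 0;
    return out;
  }

 private:
  std::mutex mu_;
  int64_t hit_ = 0;
  int64_t miss_ = 0;
};
```

The relaxed variant keeps the batch path lock-free at the cost of a small skew between the two values, which is the trade-off the note above accepts for ODS aggregation.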

Build verification:

- buck build fbcode//deeplearning/fbgemm/fbgemm_gpu:dram_kv_embedding_inference
- buck build fbcode//caffe2/caffe2/fb/predictor/embedding_db/kv_embedding_table:SSDKVInferenceEmbedding
- buck build fbcode//fblearner/predictor/model_publishing_service/deployment:predictor_x_utils

S659921 regression check:

After landing, run a Vanguard serving_eval against
vg_worker_byoc_t16_gti_mrs_trunk_health_prod on a freshly-trained model and
confirm no TorchScript crash on the previously-failing test case:
https://www.internalfb.com/vanguard/serving_test_cases/1302024888565107

Related

- Original (reverted): D101879296
- Revert: D104116880
- SEV: S659921 - [mvai/video_udd_lsr] Vanguard serving_eval predictor crash -
  DramKVEmbeddingInferenceWrapper missing get_read_hit_rate_stats on trunk health
  predictor
- Pull Request resolved: #5730
- Pull Request resolved: https://github.com/facebookresearch/FBGEMM/pull/2659

Reviewed By: EddyLXJ, emlin

Differential Revision: D104246537


meta-codesync · 2026-05-07T17:36:27Z

@hy-NJU has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104246537.

meta-codesync · 2026-05-08T02:49:57Z

This pull request has been merged in e21ad44.

meta-cla Bot added the cla signed label May 7, 2026

meta-codesync Bot added fb-exported meta-exported labels May 7, 2026

meta-codesync Bot closed this in e21ad44 May 8, 2026

facebook-github-tools Bot added the Merged label May 8, 2026

gchalump added category:improvement contributor:Meta feature:tbessd labels May 14, 2026

hy-NJU wants to merge 1 commit into pytorch:main from hy-NJU:export-D104246537
