Add DRAM KV cache and L1 hit rate metrics for training (#5633) by EddyLXJ · Pull Request #5633 · pytorch/FBGEMM

EddyLXJ · 2026-04-15T00:58:57Z

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2584

CONTEXT: DRAM KV embedding cache lacked hit/miss rate metrics during training, making it hard to monitor cache effectiveness. The SSD backend had `l2_cache.hit_rate_pct`, but the DRAM backend only tracked `read_missing_load_avg_` (averaged per-shard miss count), which was not accurate for hit rate computation. L1 cache stats had raw counts but never computed a hit rate percentage. Additionally, when multiple TBEs exist on a single trainer, their metrics were mixed together in TensorBoard, making it impossible to tell which table has good/bad cache performance.

WHAT:

- C++ (`dram_kv_embedding_cache.h`): Add atomic `read_hit_count_` and `read_miss_count_` counters in the read path (`get_kv_db_async_impl`). Extend the `get_dram_kv_perf()` vector from 36 to 38 elements to expose raw hit/miss counts (not averaged per shard).
- Python (`training.py`): Parse new perf vector indices, compute `hit_rate_pct = 100 * hits / (hits + misses)`, and report both aggregate and per-TBE metrics via `stats_reporter`. Per-TBE metrics use `tbe_id{N}` in the name and are dynamically registered via `register_stats()`. Also add L1 hit rate computed from existing `num_unique_indices` and `num_unique_misses`.
- Stats reporter (`tbe_stats_reporters.py`): Register aggregate `dram_kv.hit_rate_pct` and `ssd_tbe.prefetch.l1_hit_rate_pct` in the allowlist.
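
The hit-rate arithmetic above can be sketched as follows. This is a minimal illustration, not the code from `training.py`. Only the formula `100 * hits / (hits + misses)` and the input names `num_unique_indices` and `num_unique_misses` come from the summary. The function names, the choice to return `0.0` when there were no reads, and the assumption that L1 hits equal `num_unique_indices - num_unique_misses` are hypothetical.

```python
def hit_rate_pct(hits: int, misses: int) -> float:
    """Hit rate as a percentage in [0, 100], from raw hit and miss counts."""
    total = hits + misses
    if total == 0:
        # Assumption: report 0.0 rather than dividing by zero when idle.
        return 0.0
    return 100.0 * hits / total


def l1_hit_rate_pct(num_unique_indices: int, num_unique_misses: int) -> float:
    """L1 hit rate, assuming hits = unique indices - unique misses."""
    if num_unique_indices <= 0:
        # Assumption: no lookups means there is nothing to report.
        return 0.0
    hits = num_unique_indices - num_unique_misses
    return 100.0 * hits / num_unique_indices
```

Working from raw counts keeps the ratio exact across shards, whereas a per-shard averaged miss count cannot be turned back into a hit rate without also knowing each shard's read volume.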

New TensorBoard/ODS metrics:

- `dram_kv.tbe_id{N}.hit_rate_pct`: per-TBE DRAM cache hit rate (0-100%)
- `dram_kv.perf.get.tbe_id{N}.dram_read_hit_count`: per-TBE raw hit count
- `dram_kv.perf.get.tbe_id{N}.dram_read_miss_count`: per-TBE raw miss count
- `ssd_tbe.prefetch.tbe_id{N}.l1_hit_rate_pct`: per-TBE L1 GPU HBM cache hit rate
- `dram_kv.hit_rate_pct`: aggregate DRAM hit rate
- `ssd_tbe.prefetch.l1_hit_rate_pct`: aggregate L1 hit rate
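
The naming scheme in the list above can be sketched as a small helper that expands `tbe_id{N}` for a given table. The metric strings are taken from the list. The helper name `per_tbe_metric_names`, the dictionary keys, and the `AGGREGATE_METRIC_NAMES` constant are hypothetical and are not part of FBGEMM.

```python
from typing import Dict, Tuple

# Aggregate metric names, as listed in the PR summary.
AGGREGATE_METRIC_NAMES: Tuple[str, ...] = (
    "dram_kv.hit_rate_pct",
    "ssd_tbe.prefetch.l1_hit_rate_pct",
)


def per_tbe_metric_names(tbe_id: int) -> Dict[str, str]:
    """Build the per-TBE metric names for one table (hypothetical helper)."""
    tag = f"tbe_id{tbe_id}"
    return {
        "dram_hit_rate": f"dram_kv.{tag}.hit_rate_pct",
        "dram_hit_count": f"dram_kv.perf.get.{tag}.dram_read_hit_count",
        "dram_miss_count": f"dram_kv.perf.get.{tag}.dram_read_miss_count",
        "l1_hit_rate": f"ssd_tbe.prefetch.{tag}.l1_hit_rate_pct",
    }
```

Because the per-TBE names depend on the table id, they cannot be listed statically, which is consistent with the summary's note that they are registered dynamically while only the two aggregate names go in the allowlist.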

Differential Revision: D100903024

meta-codesync · 2026-04-15T00:59:07Z

@EddyLXJ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100903024.

Pull Request resolved: pytorch#5633

meta-codesync · 2026-04-16T06:38:17Z

This pull request has been merged in 434db14.

meta-cla Bot added the cla signed label Apr 15, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 15, 2026

meta-codesync Bot changed the title ~~Add DRAM KV cache and L1 hit rate metrics for training~~ Add DRAM KV cache and L1 hit rate metrics for training (#5633) Apr 15, 2026

EddyLXJ force-pushed the export-D100903024 branch from 41c21ec to a5bbc85 Compare April 15, 2026 20:04

EddyLXJ force-pushed the export-D100903024 branch from a5bbc85 to 74940a9 Compare April 15, 2026 20:05

EddyLXJ force-pushed the export-D100903024 branch from 74940a9 to 70027e5 Compare April 15, 2026 20:08

EddyLXJ force-pushed the export-D100903024 branch from 70027e5 to a8d4b56 Compare April 15, 2026 20:12

meta-codesync Bot closed this in 434db14 Apr 16, 2026

facebook-github-tools Bot added the Merged label Apr 16, 2026

gchalump added category:new contributor:Meta feature:tbessd labels May 14, 2026
