
Add DRAM KV cache and L1 hit rate metrics for training (#5633)

Closed
EddyLXJ wants to merge 1 commit into pytorch:main from EddyLXJ:export-D100903024


@EddyLXJ EddyLXJ commented Apr 15, 2026

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2584

CONTEXT: The DRAM KV embedding cache lacked hit/miss rate metrics during training, making it hard to monitor cache effectiveness. The SSD backend had l2_cache.hit_rate_pct, but the DRAM backend tracked only read_missing_load_avg_ (a miss count averaged per shard), which is not accurate enough to compute a hit rate. L1 cache stats exposed raw counts but never a hit rate percentage. Additionally, when multiple TBEs exist on a single trainer, their metrics were mixed together in TensorBoard, making it impossible to tell which table has good or bad cache performance.

WHAT:

  • C++ (dram_kv_embedding_cache.h): Add atomic read_hit_count_ and read_miss_count_ counters in the read path (get_kv_db_async_impl). Extend the get_dram_kv_perf() vector from 36 to 38 elements to expose raw hit/miss counts (not averaged per shard).
  • Python (training.py): Parse new perf vector indices, compute hit_rate_pct = 100 * hits / (hits + misses), and report both aggregate and per-TBE metrics via stats_reporter. Per-TBE metrics use tbe_id{N} in the name and are dynamically registered via register_stats(). Also add L1 hit rate computed from existing num_unique_indices and num_unique_misses.
  • Stats reporter (tbe_stats_reporters.py): Register aggregate dram_kv.hit_rate_pct and ssd_tbe.prefetch.l1_hit_rate_pct in the allowlist.
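The hit rate formula in the Python bullet can be sketched as follows. This is a minimal illustration, not the code from training.py: the helper names, the zero-read guard, and the assumption that L1 hits equal num_unique_indices minus num_unique_misses are all hypothetical and are not confirmed by the PR description.

```python
def hit_rate_pct(hits: int, misses: int) -> float:
    """Return 100 * hits / (hits + misses).

    Assumption: report 0.0 when there were no reads at all, to avoid
    dividing by zero. The actual behavior in the PR is not stated.
    """
    total = hits + misses
    if total == 0:
        return 0.0
    return 100.0 * hits / total


def l1_hit_rate_pct(num_unique_indices: int, num_unique_misses: int) -> float:
    """L1 hit rate from the two existing counters.

    Assumption: every unique index that did not miss counts as a hit.
    """
    hits = num_unique_indices - num_unique_misses
    return hit_rate_pct(hits, num_unique_misses)
```

For example, 75 hits and 25 misses give a rate of 75.0, and 200 unique indices with 50 unique misses give the same value under the stated L1 assumption.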

New TensorBoard/ODS metrics:

  • dram_kv.tbe_id{N}.hit_rate_pct — per-TBE DRAM cache hit rate (0-100%)
  • dram_kv.perf.get.tbe_id{N}.dram_read_hit_count — per-TBE raw hit count
  • dram_kv.perf.get.tbe_id{N}.dram_read_miss_count — per-TBE raw miss count
  • ssd_tbe.prefetch.tbe_id{N}.l1_hit_rate_pct — per-TBE L1 GPU HBM cache hit rate
  • dram_kv.hit_rate_pct — aggregate DRAM hit rate
  • ssd_tbe.prefetch.l1_hit_rate_pct — aggregate L1 hit rate
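To make the naming scheme above concrete, the sketch below builds the per-TBE and aggregate DRAM metric names from raw hit/miss counts. The function build_dram_kv_metrics and its input shape are hypothetical. Computing the aggregate from summed raw counts (rather than averaging per-TBE percentages) is an assumption about one reasonable approach, not a description of what the PR does.

```python
from typing import Dict, Tuple


def _pct(hits: int, misses: int) -> float:
    # Assumption: 0.0 when there were no reads.
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def build_dram_kv_metrics(
    counts_by_tbe: Dict[int, Tuple[int, int]],
) -> Dict[str, float]:
    """Map {tbe_id: (hits, misses)} to metric name -> value.

    Hypothetical helper; only the metric names come from the PR description.
    """
    metrics: Dict[str, float] = {}
    total_hits = 0
    total_misses = 0
    for tbe_id, (hits, misses) in sorted(counts_by_tbe.items()):
        prefix = f"dram_kv.perf.get.tbe_id{tbe_id}"
        metrics[f"{prefix}.dram_read_hit_count"] = float(hits)
        metrics[f"{prefix}.dram_read_miss_count"] = float(misses)
        metrics[f"dram_kv.tbe_id{tbe_id}.hit_rate_pct"] = _pct(hits, misses)
        total_hits += hits
        total_misses += misses
    # Aggregate over all TBEs on the trainer, weighted by read volume.
    metrics["dram_kv.hit_rate_pct"] = _pct(total_hits, total_misses)
    return metrics
```

With two TBEs, one with a high hit rate and one with a low hit rate, the per-TBE names keep the two visible separately, while the aggregate alone would hide the difference.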

Differential Revision: D100903024

@meta-cla meta-cla Bot added the cla signed label Apr 15, 2026
meta-codesync Bot commented Apr 15, 2026

@EddyLXJ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100903024.

@meta-codesync meta-codesync Bot changed the title from "Add DRAM KV cache and L1 hit rate metrics for training" to "Add DRAM KV cache and L1 hit rate metrics for training (#5633)" Apr 15, 2026
EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request Apr 15, 2026
@EddyLXJ EddyLXJ force-pushed the export-D100903024 branch from 41c21ec to a5bbc85 Compare April 15, 2026 20:04
EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request Apr 15, 2026
@EddyLXJ EddyLXJ force-pushed the export-D100903024 branch from a5bbc85 to 74940a9 Compare April 15, 2026 20:05
EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request Apr 15, 2026
@EddyLXJ EddyLXJ force-pushed the export-D100903024 branch from 74940a9 to 70027e5 Compare April 15, 2026 20:08
meta-codesync Bot commented Apr 16, 2026

This pull request has been merged in 434db14.
