Add DRAM KV cache and L1 hit rate metrics for training (#5633) #5633
Closed
EddyLXJ wants to merge 1 commit into
Contributor
@EddyLXJ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100903024.
EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request on Apr 15, 2026
Summary:

X-link: facebookresearch/FBGEMM#2584

CONTEXT: DRAM KV embedding cache lacked hit/miss rate metrics during training, making it hard to monitor cache effectiveness. The SSD backend had `l2_cache.hit_rate_pct`, but the DRAM backend only tracked `read_missing_load_avg_` (averaged per-shard miss count), which was not accurate for hit rate computation. L1 cache stats had raw counts but never computed a hit rate percentage. Additionally, when multiple TBEs exist on a single trainer, their metrics were mixed together in TensorBoard, making it impossible to tell which table has good/bad cache performance.

WHAT:
- C++ (`dram_kv_embedding_cache.h`): Add atomic `read_hit_count_` and `read_miss_count_` counters in the read path (`get_kv_db_async_impl`). Extend the `get_dram_kv_perf()` vector from 36 to 38 elements to expose raw hit/miss counts (not averaged per shard).
- Python (`training.py`): Parse new perf vector indices, compute `hit_rate_pct = 100 * hits / (hits + misses)`, and report both aggregate and per-TBE metrics via `stats_reporter`. Per-TBE metrics use `tbe_id{N}` in the name and are dynamically registered via `register_stats()`. Also add L1 hit rate computed from existing `num_unique_indices` and `num_unique_misses`.
- Stats reporter (`tbe_stats_reporters.py`): Register aggregate `dram_kv.hit_rate_pct` and `ssd_tbe.prefetch.l1_hit_rate_pct` in the allowlist.

New TensorBoard/ODS metrics:
- `dram_kv.tbe_id{N}.hit_rate_pct`: per-TBE DRAM cache hit rate (0-100%)
- `dram_kv.perf.get.tbe_id{N}.dram_read_hit_count`: per-TBE raw hit count
- `dram_kv.perf.get.tbe_id{N}.dram_read_miss_count`: per-TBE raw miss count
- `ssd_tbe.prefetch.tbe_id{N}.l1_hit_rate_pct`: per-TBE L1 GPU HBM cache hit rate
- `dram_kv.hit_rate_pct`: aggregate DRAM hit rate
- `ssd_tbe.prefetch.l1_hit_rate_pct`: aggregate L1 hit rate

Differential Revision: D100903024
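For readers who want to sanity-check the numbers in TensorBoard, the hit rate formula and per-TBE metric naming described in the summary can be sketched roughly as follows. This is an illustration only, not the code in `training.py`: the helper names (`hit_rate_pct`, `per_tbe_metrics`, `l1_hit_rate_pct`) are hypothetical, returning 0.0 when there were no reads is an assumption, and the L1 computation assumes that hits equal `num_unique_indices - num_unique_misses`, which the summary does not spell out.

```python
from typing import Dict


def hit_rate_pct(hits: int, misses: int) -> float:
    """Compute 100 * hits / (hits + misses).

    Assumption: report 0.0 when there were no reads at all, to avoid
    dividing by zero. The actual PR may handle this case differently.
    """
    total = hits + misses
    if total == 0:
        return 0.0
    return 100.0 * hits / total


def per_tbe_metrics(tbe_id: int, hits: int, misses: int) -> Dict[str, float]:
    """Build the per-TBE DRAM metric names listed in the summary.

    Hypothetical helper: only the metric name patterns come from the PR.
    """
    prefix = f"dram_kv.perf.get.tbe_id{tbe_id}"
    return {
        f"dram_kv.tbe_id{tbe_id}.hit_rate_pct": hit_rate_pct(hits, misses),
        f"{prefix}.dram_read_hit_count": float(hits),
        f"{prefix}.dram_read_miss_count": float(misses),
    }


def l1_hit_rate_pct(num_unique_indices: int, num_unique_misses: int) -> float:
    """L1 hit rate, assuming hits = unique indices - unique misses."""
    hits = num_unique_indices - num_unique_misses
    return hit_rate_pct(hits, num_unique_misses)
```

Working from raw hit and miss counts, rather than a per-shard averaged miss count, keeps the ratio exact. That is the motivation the summary gives for exposing the two new perf vector elements.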
41c21ec to a5bbc85
a5bbc85 to 74940a9
EddyLXJ added a commit to EddyLXJ/FBGEMM-1 that referenced this pull request on Apr 15, 2026
Summary: Pull Request resolved: pytorch#5633
X-link: https://github.com/facebookresearch/FBGEMM/pull/2584
74940a9 to 70027e5
70027e5 to a8d4b56
Contributor
This pull request has been merged in 434db14.