TBE UVM cache prefetch pipeline #1893

yuguo68 · 2023-07-26T15:04:20Z

Summary:
This diff is to enable cache prefetch pipeline of TBE, so that prefetch of batch_{i+1} can overlap with forward/backward of batch_i. As the cache can be evicted by prefetch and the weights can be updated by the backward, we need to carefully protect a few scenarios that result in cache invalidation.

1. prevent immature cache eviction: cache gets evicted while it is being used by forward pass

Since prefetch can overlap with forward/backward pass, it is possible that prefetch tries to evict cache but the cache is being used by forward/backward pass. The fix is to use the lxu_cache_locking_counter in D46172802/#1883 to check whether a cache slot is in use or not when an eviction is attempted.

2. prevent dirty cache: weight is being updated while it is loading to cache

If the prefetch overlaps with TBE backward pass, the backward may write to uvm (idx not in cache) and at the same time prefetch (idx is inserted to cache) loads the weight from uvm to cache. We sync the streams to avoid TBE backward pass overlapping with prefetch. The backward of the rest of the module can still overlap with prefetch of TBE.

The stream sync looks like:

# backward(batch_i) waits for prefetch(batch_{i+1})
backward pre_hook: cur_stream.wait_stream(prefetch_stream)
# backward(batch_i)
TBE.backward()
# prefetch(batch_{i+2}) waits for backward(batch_i)
backward hook: prefetch_stream.wait_stream(cur_stream)

3. prevent cache inconsistency: weight get updated after it is loaded to cache

With pipeline, in the case that the same index is not inserted into cache in batch_i, but it is inserted in batch_{i+1}, the cache can be invalid in the sense that the cached weight for this index does not have the backward update of batch_i.
Example of the issue is as follows:
idx is in batch_i, batch_{i+1}
prefetch(batch_i)
- failed to insert idx into cache, cache_locations_batch_i of idx is -1 (cache miss)
forward(batch_i)
prefetch(batch_{i+1})
- insert idx into cache, cache is loaded from host memory
backward(batch_i)
- cache_locations_batch_i of idx is -1, the host memory is updated
forward(batch_{i+1})
- OUTPUT IS WRONG. the weight for idx is fetched from cache, but the cache is outdated.

The fix to this cache invalidation is to update the cache_locations_batch_i before backward of batch_i,so that the cache gets updated correctly by the backward pass of TBE.

Reviewed By: sryap

Differential Revision: D47418650

netlify · 2023-07-26T15:04:32Z

✅ Deploy Preview for pytorch-fbgemm-docs canceled.

Name	Link
🔨 Latest commit	`056f1f0`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/64c1b5b3d62964000888b473

facebook-github-bot · 2023-07-26T15:05:40Z

This pull request was exported from Phabricator. Differential Revision: D47418650

Summary: Pull Request resolved: pytorch#1893 This diff is to enable cache prefetch pipeline of TBE, so that prefetch of batch_{i+1} can overlap with forward/backward of batch_i. As the cache can be evicted by prefetch and the weights can be updated by the backward, we need to carefully protect a few scenarios that result in cache invalidation. ## 1. prevent immature cache eviction: cache gets evicted while it is being used by forward pass Since prefetch can overlap with forward/backward pass, it is possible that prefetch tries to evict cache but the cache is being used by forward/backward pass. The fix is to use the `lxu_cache_locking_counter` in D46172802/pytorch#1883 to check whether a cache slot is in use or not when an eviction is attempted. ## 2. prevent dirty cache: weight is being updated while it is loading to cache If the prefetch overlaps with TBE backward pass, the backward may write to uvm (idx not in cache) and at the same time prefetch (idx is inserted to cache) loads the weight from uvm to cache. We sync the streams to avoid TBE backward pass overlapping with prefetch. The backward of the rest of the module can still overlap with prefetch of TBE. The stream sync looks like: ``` # backward(batch_i) waits for prefetch(batch_{i+1}) backward pre_hook: cur_stream.wait_stream(prefetch_stream) # backward(batch_i) TBE.backward() # prefetch(batch_{i+2}) waits for backward(batch_i) backward hook: prefetch_stream.wait_stream(cur_stream) ``` ## 3. prevent cache inconsistency: weight get updated after it is loaded to cache With pipeline, in the case that the same index is not inserted into cache in batch_i, but it is inserted in batch_{i+1}, the cache can be invalid in the sense that the cached weight for this index does not have the backward update of batch_i. Example of the issue is as follows: idx is in batch_i, batch_{i+1} prefetch(batch_i) - failed to insert idx into cache, cache_locations_batch_i of idx is -1 (cache miss) forward(batch_i) prefetch(batch_{i+1}) - insert idx into cache, cache is loaded from host memory backward(batch_i) - cache_locations_batch_i of idx is -1, the host memory is updated forward(batch_{i+1}) - OUTPUT IS WRONG. the weight for idx is fetched from cache, but the cache is outdated. The fix to this cache invalidation is to update the cache_locations_batch_i before backward of batch_i,so that the cache gets updated correctly by the backward pass of TBE. Reviewed By: sryap Differential Revision: D47418650 fbshipit-source-id: 6b01aa4df8bc7c121a7c66893fcf27b73cd8be73

facebook-github-bot · 2023-07-26T17:35:15Z

This pull request was exported from Phabricator. Differential Revision: D47418650

Summary: Pull Request resolved: pytorch#1893 This diff is to enable cache prefetch pipeline of TBE, so that prefetch of batch_{i+1} can overlap with forward/backward of batch_i. As the cache can be evicted by prefetch and the weights can be updated by the backward, we need to carefully protect a few scenarios that result in cache invalidation. ## 1. prevent immature cache eviction: cache gets evicted while it is being used by forward pass Since prefetch can overlap with forward/backward pass, it is possible that prefetch tries to evict cache but the cache is being used by forward/backward pass. The fix is to use the `lxu_cache_locking_counter` in D46172802/pytorch#1883 to check whether a cache slot is in use or not when an eviction is attempted. ## 2. prevent dirty cache: weight is being updated while it is loading to cache If the prefetch overlaps with TBE backward pass, the backward may write to uvm (idx not in cache) and at the same time prefetch (idx is inserted to cache) loads the weight from uvm to cache. We sync the streams to avoid TBE backward pass overlapping with prefetch. The backward of the rest of the module can still overlap with prefetch of TBE. The stream sync looks like: ``` # backward(batch_i) waits for prefetch(batch_{i+1}) backward pre_hook: cur_stream.wait_stream(prefetch_stream) # backward(batch_i) TBE.backward() # prefetch(batch_{i+2}) waits for backward(batch_i) backward hook: prefetch_stream.wait_stream(cur_stream) ``` ## 3. prevent cache inconsistency: weight get updated after it is loaded to cache With pipeline, in the case that the same index is not inserted into cache in batch_i, but it is inserted in batch_{i+1}, the cache can be invalid in the sense that the cached weight for this index does not have the backward update of batch_i. Example of the issue is as follows: idx is in batch_i, batch_{i+1} prefetch(batch_i) - failed to insert idx into cache, cache_locations_batch_i of idx is -1 (cache miss) forward(batch_i) prefetch(batch_{i+1}) - insert idx into cache, cache is loaded from host memory backward(batch_i) - cache_locations_batch_i of idx is -1, the host memory is updated forward(batch_{i+1}) - OUTPUT IS WRONG. the weight for idx is fetched from cache, but the cache is outdated. The fix to this cache invalidation is to update the cache_locations_batch_i before backward of batch_i,so that the cache gets updated correctly by the backward pass of TBE. Reviewed By: sryap Differential Revision: D47418650 fbshipit-source-id: 84811c423ef30fec82282702be181c00310c4e84

facebook-github-bot · 2023-07-26T17:44:42Z

This pull request was exported from Phabricator. Differential Revision: D47418650

Summary: Pull Request resolved: pytorch#1893 This diff is to enable cache prefetch pipeline of TBE, so that prefetch of batch_{i+1} can overlap with forward/backward of batch_i. As the cache can be evicted by prefetch and the weights can be updated by the backward, we need to carefully protect a few scenarios that result in cache invalidation. ## 1. prevent immature cache eviction: cache gets evicted while it is being used by forward pass Since prefetch can overlap with forward/backward pass, it is possible that prefetch tries to evict cache but the cache is being used by forward/backward pass. The fix is to use the `lxu_cache_locking_counter` in D46172802/pytorch#1883 to check whether a cache slot is in use or not when an eviction is attempted. ## 2. prevent dirty cache: weight is being updated while it is loading to cache If the prefetch overlaps with TBE backward pass, the backward may write to uvm (idx not in cache) and at the same time prefetch (idx is inserted to cache) loads the weight from uvm to cache. We sync the streams to avoid TBE backward pass overlapping with prefetch. The backward of the rest of the module can still overlap with prefetch of TBE. The stream sync looks like: ``` # backward(batch_i) waits for prefetch(batch_{i+1}) backward pre_hook: cur_stream.wait_stream(prefetch_stream) # backward(batch_i) TBE.backward() # prefetch(batch_{i+2}) waits for backward(batch_i) backward hook: prefetch_stream.wait_stream(cur_stream) ``` ## 3. prevent cache inconsistency: weight get updated after it is loaded to cache With pipeline, in the case that the same index is not inserted into cache in batch_i, but it is inserted in batch_{i+1}, the cache can be invalid in the sense that the cached weight for this index does not have the backward update of batch_i. Example of the issue is as follows: idx is in batch_i, batch_{i+1} prefetch(batch_i) - failed to insert idx into cache, cache_locations_batch_i of idx is -1 (cache miss) forward(batch_i) prefetch(batch_{i+1}) - insert idx into cache, cache is loaded from host memory backward(batch_i) - cache_locations_batch_i of idx is -1, the host memory is updated forward(batch_{i+1}) - OUTPUT IS WRONG. the weight for idx is fetched from cache, but the cache is outdated. The fix to this cache invalidation is to update the cache_locations_batch_i before backward of batch_i,so that the cache gets updated correctly by the backward pass of TBE. Reviewed By: sryap Differential Revision: D47418650 fbshipit-source-id: 9500c0b902761f24aa5086198574f5b85b141cbd

facebook-github-bot · 2023-07-26T17:50:56Z

This pull request was exported from Phabricator. Differential Revision: D47418650

facebook-github-bot · 2023-07-27T01:10:34Z

This pull request has been merged in 78c60ce.

facebook-github-bot added cla signed fb-exported labels Jul 26, 2023

yuguo68 force-pushed the export-D47418650 branch from 406ef0f to 0c17608 Compare July 26, 2023 17:35

yuguo68 force-pushed the export-D47418650 branch from 0c17608 to d9f861b Compare July 26, 2023 17:44

yuguo68 force-pushed the export-D47418650 branch from d9f861b to 056f1f0 Compare July 26, 2023 17:50

facebook-github-bot closed this in 78c60ce Jul 27, 2023

facebook-github-bot added the Merged label Jul 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TBE UVM cache prefetch pipeline #1893

TBE UVM cache prefetch pipeline #1893

yuguo68 commented Jul 26, 2023

netlify bot commented Jul 26, 2023 •

edited

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 27, 2023

TBE UVM cache prefetch pipeline #1893

TBE UVM cache prefetch pipeline #1893

Conversation

yuguo68 commented Jul 26, 2023

1. prevent immature cache eviction: cache gets evicted while it is being used by forward pass

2. prevent dirty cache: weight is being updated while it is loading to cache

3. prevent cache inconsistency: weight get updated after it is loaded to cache

netlify bot commented Jul 26, 2023 • edited

✅ Deploy Preview for pytorch-fbgemm-docs canceled.

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 26, 2023

facebook-github-bot commented Jul 27, 2023

netlify bot commented Jul 26, 2023 •

edited