Improve YOCO static attention: reusable helper, correct tensor op, runtime guard (#18545)
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18545
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures
As of commit c81fb28 with merge base 6fccd5a:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@viveknayakatmeta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97637849.
This PR needs a `release notes:` label.
Improve YOCO static attention: reusable helper, correct tensor op, runtime guard (pytorch#18545)

Summary:
Pull Request resolved: pytorch#18545

- Replace the inline first_kv_shared index computation in _from_config with a reusable _is_kv_shared_layer() helper that matches llama_transformer.py's pattern and adds a missing first_shared <= 0 edge-case guard.
- Fix torch.cat → torch.stack in _process_normal_kv for SHA kv_to_share construction: per-head K/V tensors are rank-3, so torch.cat(dim=1) incorrectly concatenates along the sequence dimension, whereas torch.stack(dim=1) correctly inserts a new heads dimension.
- Change the forward() K/V skip guard from structural (if self.is_kv_shared_layer) to runtime (if shared_kv is not None), with an added assertion that self.is_kv_shared_layer holds.

Differential Revision: D97637849
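The first bullet can be illustrated with a small sketch. This is not the actual static_attention.py code: only the helper name _is_kv_shared_layer, the first_shared value, and the <= 0 guard come from the summary; the parameter names (n_layers, n_kv_sharing_layers) and the convention that the last n_kv_sharing_layers layers reuse earlier KV (the YOCO pattern) are illustrative assumptions.

```python
# Hypothetical sketch of the reusable helper; the real signature may differ.
def _is_kv_shared_layer(layer_id: int, n_layers: int, n_kv_sharing_layers: int) -> bool:
    """Return True if `layer_id` should reuse KV produced by an earlier layer."""
    first_shared = n_layers - n_kv_sharing_layers
    if first_shared <= 0:
        # Edge-case guard from the summary: if sharing would start at or
        # before layer 0, no layer is treated as a KV-sharing consumer.
        return False
    return layer_id >= first_shared


# Example: with 8 layers and 4 sharing layers, layers 4..7 reuse shared KV.
print([_is_kv_shared_layer(i, n_layers=8, n_kv_sharing_layers=4) for i in range(8)])
# [False, False, False, False, True, True, True, True]
```

Keeping this logic in one helper rather than inlined in _from_config also makes it easier to stay in sync with the equivalent check in llama_transformer.py.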
a041cde to ae10e9e (Compare)
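The second bullet in the summary above is easy to check numerically. The shapes below are illustrative assumptions (the summary only says the per-head K/V tensors are rank-3); the point is how the two ops differ on a list of such tensors:

```python
import torch

# Assume per-head K/V tensors of shape (batch, seq_len, head_dim); the exact
# layout in _process_normal_kv may differ, but the rank-3 argument is the same.
batch, seq_len, head_dim, n_heads = 2, 8, 64, 4
per_head_kv = [torch.randn(batch, seq_len, head_dim) for _ in range(n_heads)]

wrong = torch.cat(per_head_kv, dim=1)    # folds heads into the seq dimension
print(wrong.shape)                       # torch.Size([2, 32, 64])

right = torch.stack(per_head_kv, dim=1)  # inserts a new heads dimension
print(right.shape)                       # torch.Size([2, 4, 8, 64])
```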
ae10e9e to 02e129f (Compare)
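The third bullet, the forward() guard change, can be sketched with a minimal stand-in module. This is not the real attention class: the layer below is hypothetical, and only the guard pattern itself, branching on shared_kv being passed at runtime and asserting the structural is_kv_shared_layer flag, reflects the summary.

```python
import torch
from torch import nn


class TinySharedKVLayer(nn.Module):
    """Minimal stand-in illustrating the runtime K/V skip guard."""

    def __init__(self, dim: int, is_kv_shared_layer: bool):
        super().__init__()
        self.is_kv_shared_layer = is_kv_shared_layer
        self.wk = nn.Linear(dim, dim)
        self.wv = nn.Linear(dim, dim)

    def forward(self, x, shared_kv=None):
        # Runtime guard: skip K/V computation only when shared K/V was
        # actually provided, and assert the structural flag agrees.
        if shared_kv is not None:
            assert self.is_kv_shared_layer
            k, v = shared_kv
        else:
            k, v = self.wk(x), self.wv(x)
        return k, v


x = torch.randn(1, 4, 16)
producer = TinySharedKVLayer(16, is_kv_shared_layer=False)
k, v = producer(x)                           # computes its own K/V
consumer = TinySharedKVLayer(16, is_kv_shared_layer=True)
k2, v2 = consumer(x, shared_kv=(k, v))       # reuses the shared K/V
```

Compared with guarding on self.is_kv_shared_layer alone, this fails loudly if the call site and the layer configuration ever disagree.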
02e129f to c81fb28 (Compare)
Improve YOCO static attention: reusable helper, correct tensor op, runtime guard (pytorch#18545)

Summary:
Reviewed By: billmguo
Differential Revision: D97637849
Pull Request resolved: pytorch#18545