enable bf16 emb #94163
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/94163. Note: links to docs will display an error until the docs builds have completed.

✅ No failures as of commit 178a574.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Sorry, that was a mistake. I didn't need to make any submodule update, but accidental updates are often hard to catch. I'll prioritize working on a lint check that prevents them.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

@pytorchbot revert -m 'Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk. For example, https://hud.pytorch.org/pytorch/pytorch/commit/05397b12505f4fd1bc98af562e103f4162993c1a' -c weird

@pytorchbot successfully started a revert job. Check the current status here.

@zhuhaozhe your PR has been successfully reverted.
This reverts commit f3bf46e. Reverted #94163 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk. For example, https://hud.pytorch.org/pytorch/pytorch/commit/05397b12505f4fd1bc98af562e103f4162993c1a
@malfet pre-commit has a check for that if you need a reference: https://github.com/pre-commit/pre-commit-hooks

We have lintrunner, but a bit of a challenge is connecting it to the PR message (rather than to a commit message title).
Summary: There is a random failure https://hud.pytorch.org/pytorch/pytorch/commit/05397b12505f4fd1bc98af562e103f4162993c1a and pytorch/pytorch#94163 was reverted. The random failure is caused by re-using the `lengths` address (`lengths` is the `offsets` pointer in PyTorch EmbeddingBag). For the FP32->BF16 conversion, we need a vector register filled with 2^15 and a right shift by 16 to do round-to-nearest: https://github.com/pytorch/FBGEMM/blob/main/src/FbgemmBfloat16ConvertAvx2.cc#L18-L21.

The first version I tried was

```cpp
a->mov(scratchReg2_, 1 << 15);
a->vpbroadcastd(ones_vreg, scratchReg2_);
```

but this cannot work on AVX2, since `asmjit` does not support broadcasting from a `GP` register (`scratchReg2_`) to a `VEC` register (`ones_vreg`): https://github.com/asmjit/asmjit/blob/996deae3273073bf75fbd6ddeac038dff5fdb6eb/src/asmjit/x86/x86emitter.h#L2794-L2796. As the `asmjit` headers show, we can broadcast from `mem` to `VEC`. So I re-used `lengths` (the pointer to `offsets` from PyTorch EmbeddingBag): first `mov` the content of `lengths` into `scratchReg2_`, then `mov` `2^15` into the `lengths` address and broadcast it into the vector register, and finally restore the `lengths` content from `scratchReg2_`:

```cpp
// Cannot find a broadcast instruction for int from GP to VEC with
// AVX2. We use the lengths address to perform the broadcast and
// write it back.
auto temp_addr = x86::dword_ptr(lengths, 0);
a->mov(scratchReg2_, temp_addr);
a->mov(temp_addr, 1 << 15);
a->vpbroadcastd(ones_vreg, temp_addr);
a->mov(temp_addr, scratchReg2_);
```

This temporary use of the `lengths` pointer caused the random failure (a memory race under multithreading). For example, with threads `t1` and `t2`: after `t1` writes to this address (`a->mov(temp_addr, 1 << 15)`) and before `t1` restores it (`a->mov(temp_addr, scratchReg2_)`), the address is read by `t2`, which causes a failure. It may also be that `a->mov(temp_addr, 1 << 15);` writes more than 32 bits of the `lengths` buffer (possibly 64 bits), since the failure still reproduces when both `indices` and `offsets` are int64_t.
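The race described above can be sketched in plain Python. This is a simplified illustration, not the actual kernel: a shared `offsets` buffer stands in for the `lengths` address, and `threading.Event` objects force the bad interleaving deterministically (in the real JIT code the interleaving is random, which is why the failure was flaky).

```python
import threading

offsets = [0, 4, 8]          # shared buffer, like EmbeddingBag offsets
wrote = threading.Event()    # signals: t1 has scribbled on offsets[0]
read_done = threading.Event()
seen = []

def t1():
    saved = offsets[0]       # a->mov(scratchReg2_, temp_addr)
    offsets[0] = 1 << 15     # a->mov(temp_addr, 1 << 15)  -- race window opens
    wrote.set()
    read_done.wait()         # force t2 to read inside the window
    offsets[0] = saved       # a->mov(temp_addr, scratchReg2_) -- restore

def t2():
    wrote.wait()
    seen.append(offsets[0])  # t2 reads the scratch value, not a real offset
    read_done.set()

ta, tb = threading.Thread(target=t1), threading.Thread(target=t2)
ta.start(); tb.start(); ta.join(); tb.join()
print(seen)  # [32768] -- t2 observed 2^15 where an offset should be
```

Even though `t1` restores the original value afterwards, any reader that lands inside the window sees 32768 as an offset, which in the real kernel leads to out-of-bounds indexing and the SIGSEGV.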
I found another way to generate a `VEC` register filled with `2^15`. With this approach we no longer do an unsafe read/write through the given memory address, which resolves the random failure:

```cpp
a->mov(scratchReg2_, 1 << 15);
a->vpinsrd(ones_vreg.xmm(), ones_vreg.xmm(), scratchReg2_, 0);
a->vpbroadcastd(ones_vreg, ones_vreg.xmm());
```

Pull Request resolved: #1583
Reviewed By: brad-mengchi, jiecaoyu, jiawenliu64
Differential Revision: D43112022
Pulled By: jianyuh
fbshipit-source-id: 54616eac9fb0277674de98143fde0491d0e78deb
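For reference, the arithmetic the broadcast constant serves (add 2^15 to the raw FP32 bits, then take the upper 16 bits) can be sketched as a scalar Python function. This mirrors the round-to-nearest trick in the FBGEMM AVX2 kernel linked above, but it is a simplified illustration: the helper names are mine, and special cases like NaN are not handled here.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Round-to-nearest FP32 -> BF16: add 2^15 to the raw bits so that
    bit 15 carries into bit 16, then keep the upper 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return ((bits + (1 << 15)) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    """Widen BF16 back to FP32 by zero-filling the low 16 mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# 1.0 survives unchanged; 1.00390625 (halfway point) rounds up to 1.0078125
print(hex(fp32_to_bf16_bits(1.0)))                        # 0x3f80
print(bf16_bits_to_fp32(fp32_to_bf16_bits(1.00390625)))   # 1.0078125
```

Without the `1 << 15` addend the conversion would simply truncate, biasing all results toward zero, which is why the kernel needs that constant broadcast into a vector register in the first place.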
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 mandatory check(s) failed (Rule …). Dig deeper by viewing the failures on hud. Details for Dev Infra team: raised by workflow job.

@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here.
Successfully rebased b4d86cd to 07b76ec.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 4 jobs have failed; the first few are: trunk / macos-12-py3-arm64 / test (default, 1, 2, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 2, macos-m1-12), trunk / macos-12-py3-x86-64 / test (default, 1, 3, macos-12), trunk / macos-12-py3-x86-64 / test (default, 2, 3, macos-12). Details for Dev Infra team: raised by workflow job.

@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here.
ghstack-source-id: b515f4d291ce7ebec194aad7813a2239582c34a3
Pull Request resolved: pytorch#89199

ghstack-source-id: d2e631fb2c04b13ffdf9f432504eef04436b3008
Pull Request resolved: pytorch#91949
Successfully rebased 07b76ec to 178a574.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge #89199 and #91949 into one PR.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10