
enable bf16 emb #94163

Closed

wants to merge 4 commits into from

Conversation

zhuhaozhe (Collaborator) commented Feb 6, 2023

pytorch-bot bot commented Feb 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94163

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 178a574:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zhuhaozhe (Collaborator, Author):

Hi @jianyuh, I do not know why FBGEMM was changed in #93895; I just updated it again in this PR.

jianyuh (Member) commented Feb 6, 2023

Re "why FBGEMM is changed in #93895":

I guess @malfet is fixing some MacOS OpenMP issue. It should be safe to keep the latest version of FBGEMM.

zhuhaozhe added this to the 2.0.0 milestone Feb 6, 2023
zhuhaozhe added the intel and module: cpu labels Feb 6, 2023
malfet (Contributor) commented Feb 6, 2023

Re "why FBGEMM is changed in #93895":

I guess @malfet is fixing some MacOS OpenMP issue. It should be safe to keep the latest version of FBGEMM.

Sorry, that was a mistake. I didn't need to make any submodule update, but they are often hard to catch. I'll prioritize working on a lint check that prevents accidental updates.

zhuhaozhe (Collaborator, Author):

@pytorchbot merge

pytorch-bot added the ciflow/trunk label Feb 6, 2023
pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

huydhn (Contributor) commented Feb 7, 2023

@pytorchbot revert -m 'Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk. For example, https://hud.pytorch.org/pytorch/pytorch/commit/05397b12505f4fd1bc98af562e103f4162993c1a' -c weird

huydhn reopened this Feb 7, 2023
pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot (Collaborator):

@zhuhaozhe your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Feb 7, 2023
This reverts commit f3bf46e.

Reverted #94163 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk.  For example, https://hud.pytorch.org/pytorch/pytorch/commit/05397b12505f4fd1bc98af562e103f4162993c1a
Skylion007 (Collaborator):

Re "why FBGEMM is changed in #93895":
I guess @malfet is fixing some MacOS OpenMP issue. It should be safe to keep the latest version of FBGEMM.

Sorry, that was a mistake. I didn't need to make any submodule update, but they are often hard to catch. I'll prioritize working on lint check that prevents accidental updates

@malfet pre-commit has a check for that if you need a reference: https://github.com/pre-commit/pre-commit-hooks

malfet (Contributor) commented Feb 7, 2023

@malfet pre-commit has a check for that if you need a reference: https://github.com/pre-commit/pre-commit-hooks

We have lintrunner, but I guess a bit of a challenge is connecting it to the PR message (rather than to a commit message title).

facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request Feb 10, 2023
Summary:
There is a random failure (https://hud.pytorch.org/pytorch/pytorch/commit/05397b12505f4fd1bc98af562e103f4162993c1a) and pytorch/pytorch#94163 was reverted.

The random failure is caused by re-using the `lengths` address (`lengths` is the `offset` buffer from the PyTorch EmbeddingBag).

For the FP32->BF16 conversion, we need to add a vector of `2^15` and then right-shift by 16 to round to nearest:
https://github.com/pytorch/FBGEMM/blob/main/src/FbgemmBfloat16ConvertAvx2.cc#L18-L21
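
As a rough scalar illustration of this rounding trick (a sketch for clarity, not the FBGEMM code; the helper name is made up):

```
#include <cstdint>
#include <cstring>

// Truncating an FP32 to its high 16 bits yields BF16; adding 2^15 first
// rounds to the nearest value instead of always truncating the magnitude.
static uint16_t fp32_to_bf16_round_nearest(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));  // reinterpret the float's bits
  return static_cast<uint16_t>((bits + (1u << 15)) >> 16);
}
```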
The first version I used was:
```
          a->mov(scratchReg2_, 1 << 15);
          a->vpbroadcastd(ones_vreg, scratchReg2_);
```
This would not cause the random failure, but it cannot work on AVX2, since `asmjit` does not support broadcasting from a `GP` register (`scratchReg2_`) to a `VEC` register (`ones_vreg`):
https://github.com/asmjit/asmjit/blob/996deae3273073bf75fbd6ddeac038dff5fdb6eb/src/asmjit/x86/x86emitter.h#L2794-L2796

As the `asmjit` headers show, we can broadcast from `mem` to `VEC`. So I re-used `lengths` (the pointer to `offset` from PyTorch EmbeddingBag): first `mov` the content of `lengths` to `scratchReg2_`, then `mov` `2^15` to the `lengths` address and broadcast from it to the `VEC`, and finally restore the `lengths` content from `scratchReg2_`:
```
          // Cannot find a broadcast instruction for int from GP to VEC with
          // AVX2. We use lengths address to perform the broadcast and
          // write it back
          auto temp_addr = x86::dword_ptr(lengths, 0);
          a->mov(scratchReg2_, temp_addr);
          a->mov(temp_addr, 1 << 15);
          a->vpbroadcastd(ones_vreg, temp_addr);
          a->mov(temp_addr, scratchReg2_);
```
This temporary use of the `lengths` pointer caused the random failure (a data race under multithreading). For example, with threads `t1` and `t2`: if `t2` reads this address after `t1` writes to it (`a->mov(temp_addr, 1 << 15)`) but before `t1` restores it (`a->mov(temp_addr, scratchReg2_)`), `t2` sees a corrupted value and fails. The write may even touch more than 32 bits of the `lengths` entry, since the failure occurs randomly even when both `indices` and `offset` are int64_t.
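
As a minimal, hypothetical C++ sketch of this bug class (illustrative only, not the FBGEMM kernel): one thread temporarily overwrites a shared location and restores it, while another thread can observe the clobbered value in between.

```
#include <cstdint>
#include <thread>

// Stands in for the shared `lengths`/`offset` buffer.
static int32_t shared_length = 42;

void t1_kernel() {
  for (int i = 0; i < 1000000; ++i) {
    int32_t saved = shared_length;  // a->mov(scratchReg2_, temp_addr)
    shared_length = 1 << 15;        // a->mov(temp_addr, 1 << 15)
    // ... the vpbroadcastd from this address would happen here ...
    shared_length = saved;          // a->mov(temp_addr, scratchReg2_)
  }
}

void t2_kernel() {
  for (int i = 0; i < 1000000; ++i) {
    // May observe 1 << 15 instead of 42: a data race.
    volatile int32_t v = shared_length;
    (void)v;
  }
}

int main() {
  std::thread t1(t1_kernel), t2(t2_kernel);
  t1.join();
  t2.join();
  return 0;
}
```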

I found another way to fill the `VEC` with `2^15`. This way we no longer perform unsafe reads/writes on the given memory address, which resolves the random failure:
```
          a->mov(scratchReg2_, 1 << 15);
          a->vpinsrd(ones_vreg.xmm(), ones_vreg.xmm(), scratchReg2_, 0);
          a->vpbroadcastd(ones_vreg, ones_vreg.xmm());
```
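
(This works because `vpinsrd` can insert the `GP` value into lane 0 of the `xmm` register, and `vpbroadcastd` from an `xmm` source is supported on AVX2, so no round-trip through memory is needed.)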

Pull Request resolved: #1583

Reviewed By: brad-mengchi, jiecaoyu, jiawenliu64

Differential Revision: D43112022

Pulled By: jianyuh

fbshipit-source-id: 54616eac9fb0277674de98143fde0491d0e78deb
zhuhaozhe (Collaborator, Author):

@pytorchbot merge

pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot (Collaborator):

Merge failed

Reason: 1 mandatory check(s) failed (Rule superuser).

Dig deeper by viewing the failures on hud.

zhuhaozhe (Collaborator, Author):

@pytorchbot rebase

pytorchmergebot (Collaborator):

@pytorchbot successfully started a rebase job. Check the current status here.

pytorchmergebot (Collaborator):

Successfully rebased bf16_emb onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via git checkout bf16_emb && git pull --rebase).

zhuhaozhe (Collaborator, Author):

@pytorchbot merge

pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

zhuhaozhe (Collaborator, Author):

@pytorchbot rebase

pytorchmergebot (Collaborator):

@pytorchbot successfully started a rebase job. Check the current status here.

pytorchmergebot (Collaborator):

Successfully rebased bf16_emb onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via git checkout bf16_emb && git pull --rebase).

zhuhaozhe (Collaborator, Author):

@pytorchbot merge

pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

Labels: ciflow/trunk, intel, Merged, module: cpu, open source, Reverted
Projects: Status: Done
8 participants