Re-apply "[bert/RoBERTa] Optimize LayerNorm with explicit vectorization using Vec256" #31127
Conversation
…on using Vec256"

Original commit changeset: d22448b90843

On Skylake T6:

Single Core: (Note that our benchmark generates batch_size=47 for the first case and batch_size=56 for the second case. In spite of that, the vectorized version is still faster than the original reference C version without vectorization.)

- Before the PR:

  ```
  native_layer_norm  0.81%  5.884ms  0.81%  5.884ms  122.580us  NaN  0.000us  0.000us  48  [[47, 1, 1024], [1024], [1024]]
  ```

- After the PR:

  ```
  native_layer_norm  0.68%  5.053ms  0.68%  5.053ms  105.272us  NaN  0.000us  0.000us  48  [[56, 1, 1024], [1024], [1024]]
  ```

20 Cores:

- Before the PR:

  ```
  native_layer_norm  1.65%  41.682ms  1.65%  41.682ms  868.365us  NaN  0.000us  0.000us  48  [[61, 64, 1024], [1024], [1024]]
  ```

- After the PR:

  ```
  native_layer_norm  1.34%  33.829ms  1.34%  33.829ms  704.771us  NaN  0.000us  0.000us  48  [[61, 64, 1024], [1024], [1024]]
  ```

Differential Revision: [D18936428](https://our.internmc.facebook.com/intern/diff/D18936428/)

[ghstack-poisoned]
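The optimization being re-applied rewrites LayerNorm's reduction and normalization loops with explicit SIMD via ATen's Vec256 wrappers. As a rough, self-contained sketch (not the actual ATen code; the function name, lane width, and structure here are illustrative only), a vectorized mean/variance pass over the last dimension might look like this, with a fixed 8-lane accumulator standing in for `Vec256<float>`:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified sketch of an explicitly vectorized LayerNorm forward pass for
// one row. Real ATen code would use at::vec256::Vec256<float>; here an
// 8-wide accumulator (8 floats = 256 bits) plays that role so the example
// stays self-contained.
constexpr std::size_t kLanes = 8;

void layer_norm_row(const float* x, const float* gamma, const float* beta,
                    std::size_t n, float eps, float* out) {
  // Vectorized accumulation of sum and sum of squares across lanes.
  std::array<float, kLanes> sum{}, sumsq{};
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t l = 0; l < kLanes; ++l) {  // maps to one Vec256 op
      sum[l] += x[i + l];
      sumsq[l] += x[i + l] * x[i + l];
    }
  }
  // Horizontal reduction of the lanes, then a scalar tail loop.
  float s = 0.0f, ss = 0.0f;
  for (std::size_t l = 0; l < kLanes; ++l) { s += sum[l]; ss += sumsq[l]; }
  for (; i < n; ++i) { s += x[i]; ss += x[i] * x[i]; }

  const float mean = s / n;
  const float var = ss / n - mean * mean;
  const float rstd = 1.0f / std::sqrt(var + eps);

  // Normalize and apply the elementwise affine; also vectorizable.
  for (std::size_t j = 0; j < n; ++j)
    out[j] = (x[j] - mean) * rstd * gamma[j] + beta[j];
}
```

The real kernel applies this per row of the `[batch, 1, 1024]` inputs shown in the profiles above. The `E[x^2] - E[x]^2` variance formula used here is one common choice for a single-pass reduction; some implementations prefer Welford-style accumulation for better numerical stability.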
CircleCI build failures summary (as of commit a3ad513):

Detailed failure analysis: the probable reasons each build failed can be explored interactively on the Dr. CI website. 2 upstream failures recognized by patterns: these builds matched patterns, but were probably caused by upstream breakages.

This comment was automatically generated by Dr. CI.
@jamesr66a: could you re-stamp this PR? It was reverted due to some hypothesis issues (a false positive). The original PR is #29104. Thanks!
This pull request has been merged in 066e3ed.
…on using Vec256" (pytorch#31127)

Summary:
Pull Request resolved: pytorch#31127
Original commit changeset: d22448b90843
ghstack-source-id: 95420889

Test Plan:
buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"
buck test mode/dev-nosan //caffe2/test:nn -- "test_LayerNorm_1d_no_elementwise_affine_eval"
python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval

Differential Revision: D18936428

fbshipit-source-id: 8cae33d35fb338b5ac49b1597c2709152612d6e5
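The buck and run_test.py invocations in the Test Plan exercise PyTorch's existing LayerNorm unit tests. The core property they check can be illustrated with a small standalone harness (hypothetical, not code from this PR): a scalar reference implementation that any optimized kernel must match elementwise within floating-point tolerance.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference LayerNorm for one row (no elementwise affine). An
// optimized kernel should match this within floating-point tolerance,
// which is what the unit tests in the Test Plan above effectively verify.
std::vector<float> layer_norm_ref(const std::vector<float>& x, float eps) {
  const std::size_t n = x.size();
  float mean = 0.0f;
  for (float v : x) mean += v;
  mean /= n;
  float var = 0.0f;
  for (float v : x) var += (v - mean) * (v - mean);
  var /= n;
  const float rstd = 1.0f / std::sqrt(var + eps);
  std::vector<float> out(n);
  for (std::size_t j = 0; j < n; ++j) out[j] = (x[j] - mean) * rstd;
  return out;
}

// Elementwise comparison within an absolute tolerance.
bool allclose(const std::vector<float>& a, const std::vector<float>& b,
              float atol = 1e-5f) {
  if (a.size() != b.size()) return false;
  for (std::size_t j = 0; j < a.size(); ++j)
    if (std::fabs(a[j] - b[j]) > atol) return false;
  return true;
}
```

A vectorized path would then be validated by asserting `allclose(fast_out, layer_norm_ref(x, eps))` over a range of shapes, including the no-elementwise-affine case named in the Test Plan.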
Stack from ghstack:
Original commit changeset: d22448b90843
On Skylake T6:
Single Core:
(Note that our benchmark generates batch_size=47 for the first case and batch_size=56 for the second case. In spite of that, the vectorized version is still faster than the original reference C version without vectorization.)
20 Cores:
Differential Revision: D18936428