Migrate _th_std_var to ATen #59258

Closed
wants to merge 4 commits

Conversation

peterbell10
Collaborator

Ref #49421

This migrates `std`/`var`'s special-case all-reduction from TH to ATen. Using the benchmark from gh-43858 that was originally used to justify keeping the TH version, I find this PR has similar (slightly better) single-threaded performance. And unlike the TH version, this is multi-threaded, and so much faster for large tensors.

TH Results:

```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       3.6   |       3.8    |     8.2  |      1.2
      80       |       3.7   |       3.8    |     8.4  |      1.2
      800      |       4.2   |       4.3    |     8.7  |      1.2
      8000     |       9.0   |       9.1    |    11.2  |      1.5
      80000    |      58.3   |      59.0    |    30.6  |      4.2
      800000   |     546.9   |     546.9    |   183.4  |     31.3
      8000000  |    5729.7   |    5701.0    |  6165.4  |    484.1
```

ATen results:

```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       4.0   |       4.0    |     8.7  |      1.2
      80       |       3.6   |       3.8    |     9.0  |      1.2
      800      |       4.1   |       4.3    |     8.9  |      1.2
      8000     |       8.9   |       9.2    |    10.6  |      1.5
      80000    |      57.0   |      57.4    |    28.8  |      4.3
      800000   |     526.9   |     526.9    |   178.3  |     30.2
      8000000  |    5568.1   |    5560.6    |  6042.1  |    453.2

[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
8 threads: ---------------------------------------------------------
      8        |      3.9    |      3.8     |     9.1  |      1.2
      80       |      3.8    |      3.9     |     8.8  |      1.2
      800      |      4.2    |      4.3     |     8.9  |      1.3
      8000     |      9.0    |      9.2     |    10.4  |      1.5
      80000    |     26.0    |     26.8     |    26.4  |      4.4
      800000   |     92.9    |     87.3     |    72.1  |     22.4
      8000000  |    793.5    |    791.8     |  5334.8  |    115.1
```
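
As a rough illustration of why the ATen path scales with threads in the tables above: the TH kernel ran the all-reduce serially, while an ATen CPU kernel can split the reduction across the intra-op thread pool. The sketch below is not this PR's actual kernel; the function name, raw-pointer signature, grain size, and atomic accumulation scheme are all assumptions made for the sketch. Only `at::parallel_for` itself is the real ATen API.

```
// Minimal sketch only: a multi-threaded two-pass variance all-reduction.
// Not the code from this PR; `data`/`n` stand in for a contiguous buffer.
#include <ATen/Parallel.h>
#include <atomic>
#include <cstdint>

double var_all_reduce(const float* data, int64_t n, bool unbiased) {
  // Pass 1: mean. at::parallel_for splits [0, n) across the intra-op pool.
  std::atomic<double> sum{0.0};
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    double partial = 0.0;
    for (int64_t i = begin; i < end; ++i) partial += data[i];
    // Publish the per-chunk partial sum with a CAS loop.
    double prev = sum.load();
    while (!sum.compare_exchange_weak(prev, prev + partial)) {}
  });
  const double mean = sum.load() / n;

  // Pass 2: sum of squared deviations from the mean.
  std::atomic<double> sq_sum{0.0};
  at::parallel_for(0, n, 2048, [&](int64_t begin, int64_t end) {
    double partial = 0.0;
    for (int64_t i = begin; i < end; ++i) {
      const double d = data[i] - mean;
      partial += d * d;
    }
    double prev = sq_sum.load();
    while (!sq_sum.compare_exchange_weak(prev, prev + partial)) {}
  });
  // Bessel's correction when unbiased (assumes n > 1 in that case).
  return sq_sum.load() / (unbiased ? (n - 1) : n);
}
```

Each worker reduces its chunk locally and synchronizes only once per chunk, which is consistent with the sharp drop in the 8-thread numbers for the large sizes.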

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 1, 2021

💊 CI failures summary and remediations

As of commit da58ffa (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI.

@ngimel ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 1, 2021
Collaborator

@ngimel ngimel left a comment

Awesome, @peterbell10

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Jun 2, 2021

I'm going to assume that the ROCm error is unrelated, as this PR doesn't touch anything CUDA- or ROCm-specific.

```
05:21:42   test_masked_scatter_large_tensor_cuda (__main__.TestTorchDeviceTypeCUDA) ... Memory exception on virtual address 0x7f2f55dd4000, node id 2 : Page not present
05:22:04 Address does not belong to a known buffer
```

@peterbell10
Collaborator Author

I took a quick look at the test and it's entirely unrelated. I've fixed the merge conflicts with the nonzero porting PR, so CI will get a chance to rerun anyway.

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 1fb5cf5.

@facebook-github-bot
Contributor

This pull request has been reverted by 8b4784a.

@ngimel
Collaborator

ngimel commented Jun 3, 2021

Weird: two subsequent test runs for the tbb build were fine, so I'll try to reland. @peterbell10, if you have any idea why it could have failed, please comment. The tbb tests were not flaky before, and they broke on std_cpu only on this PR.

@ngimel ngimel reopened this Jun 3, 2021
@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@peterbell10
Collaborator Author

I think the issue might be that `get_thread_num` doesn't work outside of a `tbb::parallel_for` call. Not sure why it would be flaky, though.
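
If that's right, the failure mode would look something like the following sketch (hypothetical function and buffer names, not code from this PR): with ATen's TBB backend, `at::get_thread_num()` only returns a meaningful worker index while executing inside a parallel region.

```
// Hypothetical sketch of the suspected bug pattern; not code from this PR.
#include <ATen/Parallel.h>
#include <cstdint>
#include <vector>

double sum_with_per_thread_buffers(const float* data, int64_t n) {
  std::vector<double> partials(at::get_num_threads(), 0.0);

  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    // Inside the parallel region this is a valid worker index, so each
    // thread accumulates into its own slot without contention.
    double& acc = partials[at::get_thread_num()];
    for (int64_t i = begin; i < end; ++i) acc += data[i];
  });

  // Outside the region, under TBB, at::get_thread_num() need not correspond
  // to any worker slot; using it as an index here would be the bug:
  //   double broken = partials[at::get_thread_num()];
  double total = 0.0;
  for (const double p : partials) total += p;  // safe: reduce every slot
  return total;
}
```

A pattern like that would also be consistent with only the tbb build breaking.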

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Jun 3, 2021

Hm, the tbb build failed again in the same way.

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Pull Request resolved: pytorch#59258

Reviewed By: mruberry

Differential Revision: D28821216

Pulled By: ngimel

fbshipit-source-id: f35992c21f08a0a8878053680dc0ca7a8facd155
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Pull Request resolved: pytorch#59258

Reviewed By: jbschlosser

Differential Revision: D28860353

Pulled By: ngimel

fbshipit-source-id: 80c9fe1db84dbc864eeb1a319076c7aaff0a04e5