mingfeima
Collaborator

This PR moves the CPU implementation of uniform_ from TH to ATen, i.e. uniform_cpu_.
When PyTorch is built with MKL support and the CPU is detected to be an Intel CPU, uniform_cpu_ ends up calling the MKL VSL random generator, which performs best on Intel CPUs.
Otherwise (PyTorch is built without MKL, or the CPU is an AMD CPU), uniform_cpu_ falls back to the sequential random generator from ATen; this path is identical to the old TH implementation.
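
The dispatch rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not the PR's C++ code: the function name, its arguments, and the returned labels are all hypothetical and are not PyTorch identifiers.

```python
def select_uniform_backend(built_with_mkl: bool, cpu_vendor: str) -> str:
    """Pick a backend label following the rule in the PR description.

    Hypothetical helper: the arguments and the returned strings are
    illustrative names only, not part of PyTorch.
    """
    if built_with_mkl and cpu_vendor.strip().lower() == "intel":
        # MKL is available and the CPU is Intel: use the MKL VSL generator.
        return "mkl_vsl"
    # No MKL, or a non-Intel CPU: use the sequential ATen generator,
    # which matches the old TH behaviour.
    return "aten_sequential"


for mkl, vendor in [(True, "Intel"), (True, "AMD"), (False, "Intel")]:
    print(mkl, vendor, "->", select_uniform_backend(mkl, vendor))
```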

Single socket run speedup: 2.4x~142x
Single core run speedup: 3.2x~13x

@mingfeima
Collaborator Author

Performance

The tables below compare the original implementation with this PR across:

  1. a range of input sizes: [1000], [10000], [100000], [1000000]
  2. single-socket and single-core runs

The benchmark code is available in op_bench-py; to reproduce, run ./run.sh uniform.py. The test machine is a Xeon Skylake 6148 with 2*20 cores @ 2.40GHz.

The benchmark metric is throughput (numbers processed per second); higher is better.
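
As a rough illustration of the throughput metric, the sketch below times a fill routine and reports elements per second. It is not the op_bench-py harness: `measure_throughput` and `fill_uniform` are hypothetical helpers, and the Python standard-library generator stands in for the tensor kernel, so the absolute numbers are not comparable with the tables.

```python
import random
import time


def measure_throughput(fill, size, repeats=5):
    """Return the best observed throughput of fill(size), in elements per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fill(size)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a very coarse timer.
    return size / max(best, 1e-12)


def fill_uniform(size):
    """Stand-in workload: draw `size` samples from U(0, 1)."""
    return [random.uniform(0.0, 1.0) for _ in range(size)]


for size in (1000, 10000, 100000):
    print(f"[{size}] {measure_throughput(fill_uniform, size):.3e} numbers/s")
```

Taking the best of several repeats reduces the effect of one-off noise such as scheduler interruptions.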

single socket

| input size | original | this pr | speedup |
|---|---|---|---|
| [1000] | 0.157 | 0.381 | 2.43 |
| [10000] | 0.188 | 2.093 | 11.13 |
| [100000] | 0.191 | 11.545 | 60.45 |
| [1000000] | 0.191 | 27.207 | 142.45 |

single core

| input size | original | this pr | speedup |
|---|---|---|---|
| [1000] | 0.160 | 0.508 | 3.18 |
| [10000] | 0.188 | 1.802 | 9.59 |
| [100000] | 0.191 | 2.436 | 12.75 |
| [1000000] | 0.191 | 2.517 | 13.18 |
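
The speedup column is the ratio of the two throughput columns. A minimal sketch that recomputes it from the values reported above (the dictionary layout and the `speedup` helper are illustrative only):

```python
# (original, this pr) throughput per input size, copied from the tables above.
single_socket = {
    1000: (0.157, 0.381),
    10000: (0.188, 2.093),
    100000: (0.191, 11.545),
    1000000: (0.191, 27.207),
}
single_core = {
    1000: (0.160, 0.508),
    10000: (0.188, 1.802),
    100000: (0.191, 2.436),
    1000000: (0.191, 2.517),
}


def speedup(original, this_pr):
    """Speedup is the new throughput divided by the original throughput."""
    return this_pr / original


for name, table in (("single socket", single_socket), ("single core", single_core)):
    for size, (original, this_pr) in table.items():
        print(f"{name} [{size}]: {speedup(original, this_pr):.2f}x")
```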

Validation

To show that the MKL VSL random generator is functionally equivalent to the ATen implementation, test cases are also provided in uniform.py.

The test cases use a t-test to verify that the mean and variance of the MKL VSL output agree with the theoretical mean and variance of the uniform distribution:

for X~U(a, b)
E(X) = (a+b)/2
VAR(X) = E(X^2) - E(X)^2 = (b-a)^2 / 12

I cross-checked the test cases against both the ATen implementation and the MKL VSL implementation; both passed.
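
For illustration, a moment check of this kind can be sketched with the Python standard library alone. This is not the test in uniform.py: it uses a large-sample z-style bound rather than a t-test, `random.uniform` stands in for the generator under test, and the helper names and the `z_crit` threshold are assumptions.

```python
import math
import random
import statistics


def uniform_moments(a, b):
    """Theoretical mean and variance of X ~ U(a, b)."""
    return (a + b) / 2.0, (b - a) ** 2 / 12.0


def check_uniform_sample(sample, a, b, z_crit=4.0):
    """Return True if the sample mean and variance are within z_crit
    standard errors of the theoretical values for U(a, b).

    Large-sample normal approximation; assumes len(sample) is large.
    """
    n = len(sample)
    mean, var = uniform_moments(a, b)
    sample_mean = statistics.fmean(sample)
    sample_var = statistics.variance(sample)  # unbiased estimator
    # Standard error of the sample mean.
    se_mean = math.sqrt(var / n)
    # Standard error of the sample variance, using the fourth central
    # moment of the uniform distribution, (b - a)^4 / 80.
    mu4 = (b - a) ** 4 / 80.0
    se_var = math.sqrt((mu4 - var ** 2 * (n - 3) / (n - 1)) / n)
    mean_ok = abs(sample_mean - mean) <= z_crit * se_mean
    var_ok = abs(sample_var - var) <= z_crit * se_var
    return mean_ok and var_ok


rng = random.Random(1234)
sample = [rng.uniform(-1.0, 3.0) for _ in range(100_000)]
print("consistent with U(-1, 3):", check_uniform_sample(sample, -1.0, 3.0))
```

The same check can be run against the output of any generator, which is how two implementations can be compared against the same theoretical moments.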

@XiaobingSuper
Collaborator

#24781 Migrate uniform_ from the TH to Aten (CPU)

@kostmo
Member

kostmo commented Dec 23, 2019

💊 CircleCI build failures summary and remediations

As of commit 8097855 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_test (1/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Apr 01 04:50:42 caused by: Connection refused (os error 111)
Apr 01 04:50:42 +++ eval 'extract_trap_cmd ' 
Apr 01 04:50:42 ++++ extract_trap_cmd 
Apr 01 04:50:42 ++++ printf '%s\n' '' 
Apr 01 04:50:42 +++ printf '%s\n' cleanup 
Apr 01 04:50:42 ++ trap -- ' 
Apr 01 04:50:42 cleanup' EXIT 
Apr 01 04:50:42 ++ which sccache 
Apr 01 04:50:42 ++ sccache --stop-server 
Apr 01 04:50:42 Stopping sccache server... 
Apr 01 04:50:42 error: couldn't connect to server 
Apr 01 04:50:42 caused by: Connection refused (os error 111) 
Apr 01 04:50:42 ++ true 
Apr 01 04:50:42 ++ rm /var/lib/jenkins/sccache_error.log 
Apr 01 04:50:42 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Apr 01 04:50:42 ++ SCCACHE_IDLE_TIMEOUT=1200 
Apr 01 04:50:42 ++ RUST_LOG=sccache::server=error 
Apr 01 04:50:42 ++ sccache --start-server 
Apr 01 04:50:42 Starting sccache server... 
Apr 01 04:50:42 ++ sccache --zero-stats 
Apr 01 04:50:42 Compile requests                 0 
Apr 01 04:50:42 Compile requests executed        0 

See CircleCI build pytorch_macos_10_13_py3_test (2/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 31 21:53:17 FAIL [300.105s]: test_barrier_timeout_full_group (__main__.TestDistBackend)
Mar 31 21:53:14   test_scatter_checks (__main__.TestDistBackend) ... ok (0.316s) 
Mar 31 21:53:15   test_scatter_full_group (__main__.TestDistBackend) ... ok (0.310s) 
Mar 31 21:53:15   test_scatter_group (__main__.TestDistBackend) ... ok (0.637s) 
Mar 31 21:53:15   test_send_recv (__main__.TestDistBackend) ... ok (0.274s) 
Mar 31 21:53:16   test_send_recv_any_source (__main__.TestDistBackend) ... ok (0.667s) 
Mar 31 21:53:17   test_send_recv_with_tag (__main__.TestDistBackend) ... ok (0.408s) 
Mar 31 21:53:17   test_sparse_all_reduce_sum (__main__.TestDistBackend) ... ok (0.238s) 
Mar 31 21:53:17   test_sparse_all_reduce_sum_cuda (__main__.TestDistBackend) ... skip (0.212s) 
Mar 31 21:53:17  
Mar 31 21:53:17 ====================================================================== 
Mar 31 21:53:17 FAIL [300.105s]: test_barrier_timeout_full_group (__main__.TestDistBackend) 
Mar 31 21:53:17 ---------------------------------------------------------------------- 
Mar 31 21:53:17 Traceback (most recent call last): 
Mar 31 21:53:17   File "distributed/test_distributed.py", line 2051, in wrapper 
Mar 31 21:53:17     self._join_and_reduce(fn) 
Mar 31 21:53:17   File "distributed/test_distributed.py", line 2144, in _join_and_reduce 
Mar 31 21:53:17     "Timeout waiting for rank %d to terminate" % rank) 
Mar 31 21:53:17 AssertionError: True is not false : Timeout waiting for rank 0 to terminate 
Mar 31 21:53:17  
Mar 31 21:53:17 ---------------------------------------------------------------------- 
Mar 31 21:53:17 Ran 94 tests in 338.178s 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (3/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Apr 01 05:03:28 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save() instead.
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 05:03:28  
Apr 01 05:03:28 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 05:03:28  
Apr 01 05:03:28 At: 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 05:03:28  
Apr 01 05:03:28 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 05:03:28  
Apr 01 05:03:28 At: 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 05:03:28  
Apr 01 05:03:28 ok (1.121s) 
Apr 01 05:03:29   test_unexepected_kwarg_is_specified (__main__.JitRpcTestWithSpawn) ... ok (1.120s) 
Apr 01 05:03:30   test_user_rrefs_confirmed (__main__.JitRpcTestWithSpawn) ... ok (1.120s) 
Apr 01 05:03:31   test_user_rrefs_confirmed_remote (__main__.JitRpcTestWithSpawn) ... ok (1.120s) 

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (4/4)

Step: "Test" (full log | pattern match details) <confirmed not flaky by 2 failures>

Apr 01 07:55:43 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save() instead.
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 07:55:43  
Apr 01 07:55:43 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 07:55:43  
Apr 01 07:55:43 At: 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 07:55:43  
Apr 01 07:55:43 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 07:55:43  
Apr 01 07:55:43 At: 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 07:55:43  
Apr 01 07:55:43 ok (0.912s) 
Apr 01 07:55:44   test_unexepected_kwarg_is_specified (__main__.JitRpcTestWithSpawn) ... ok (0.712s) 
Apr 01 07:55:44   test_user_rrefs_confirmed (__main__.JitRpcTestWithSpawn) ... ok (0.812s) 
Apr 01 07:55:45   test_user_rrefs_confirmed_remote (__main__.JitRpcTestWithSpawn) ... ok (0.812s) 

@ezyang ezyang requested a review from VitalyFedyunin February 3, 2020 16:03
@ezyang ezyang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 3, 2020
Contributor

Do you need two copies of this code compiled (AVX on/off)?

Collaborator Author

Not really, I just aligned uniform_mkl_kernel with bernoulli_mkl_kernel in the same manner.

Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Please use TensorIterator, please rebase.

VitalyFedyunin previously approved these changes Feb 6, 2020
@VitalyFedyunin VitalyFedyunin dismissed their stale review February 6, 2020 18:50

Code looks good, please rebase and get rid of errors.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

cc @pbelevich as it is related to random generators

@VitalyFedyunin VitalyFedyunin added the module: random Related to random number generation in PyTorch (rng generator) label Feb 6, 2020
@VitalyFedyunin
Contributor

I have not had time to check in detail, but the errors look related to the introduced change, so I cannot merge it as is.

@mingfeima
Collaborator Author

> I have not had time to check in detail, but the errors look related to the introduced change, so I cannot merge it as is.

Yes, I am aware that the errors are real. I am having some trouble understanding the test cases. Still WIP.

@mingfeima
Collaborator Author

Removed the VSL path and switched to the plain CPU implementation for now. The VSL part will go into follow-up work.

Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Please rebase

@mingfeima
Collaborator Author

Rebased

pbelevich added a commit that referenced this pull request Mar 28, 2020
…from TH to ATen)"

`uniform_kernel_cpu` is based on #30954




[ghstack-poisoned]
@mingfeima
Collaborator Author

rebased

pbelevich added a commit that referenced this pull request Apr 2, 2020
…igrate uniform_ from TH to ATen)"

`uniform_kernel_cpu` is based on #30954




[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Apr 3, 2020
…ATen) (#35580)

Summary:
Pull Request resolved: #35580

`uniform_kernel_cpu` is based on #30954

Test Plan: Imported from OSS

Differential Revision: D20820221

Pulled By: pbelevich

fbshipit-source-id: 13f9fc8fc75b0e9fb48021f2ac08dcb38212a53f
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
…ATen) (pytorch#35580)

Summary:
Pull Request resolved: pytorch#35580

`uniform_kernel_cpu` is based on pytorch#30954

Test Plan: Imported from OSS

Differential Revision: D20820221

Pulled By: pbelevich

fbshipit-source-id: 13f9fc8fc75b0e9fb48021f2ac08dcb38212a53f
@VitalyFedyunin
Contributor

I think we can close this one, as it is already ported.

@mingfeima mingfeima closed this Apr 30, 2020

Labels

module: random Related to random number generation in PyTorch (rng generator) open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

7 participants