move uniform_ from TH to ATen and optimize with MKL VSL #30954

mingfeima · 2019-12-09T05:43:10Z

This PR moves the uniform_ CPU implementation from TH to ATen, i.e. uniform_cpu_.
When PyTorch is built with MKL support and the CPU is detected to be an Intel CPU, uniform_cpu_ ultimately calls the MKL VSL random generator, which gives the best performance on Intel CPUs.
Otherwise (PyTorch is built without MKL, or the CPU is an AMD CPU), uniform_cpu_ falls back to the sequential random generator from ATen; this path is identical to the old TH implementation.

Single socket run speedup: 2.4x~142x
Single core run speedup: 3.2x~13x

mingfeima · 2019-12-09T06:00:49Z

Performance

The tables below compare the performance of the original implementation and this PR. The benchmark covers:

a range of input sizes: [1000], [10000], [100000], [1000000]
single socket and single core

The benchmark code is available at op_bench-py; to reproduce, run ./run.sh uniform.py. The test machine is a Xeon Skylake 6148 with 2*20 cores @ 2.40GHz.

The benchmark metric is numbers processed per second, i.e. throughput; higher is better.
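
As a rough illustration of the throughput metric described above, the sketch below times repeated fills of a buffer and reports numbers generated per second. It is not the op_bench-py benchmark: Python's standard random.uniform stands in for the operator under test, and the helper name throughput and the repeat count are assumptions made for illustration only.

```python
import random
import time


def throughput(size, repeats=5, low=0.0, high=1.0):
    """Return uniform samples generated per second for a buffer of `size` elements.

    Hypothetical helper: random.uniform is a stand-in for the operator under test.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        buf = [random.uniform(low, high) for _ in range(size)]
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(buf) == size
    # Guard against a zero reading from a very coarse timer.
    return size / max(best, 1e-12)


if __name__ == "__main__":
    for size in (1000, 10000, 100000):
        print(f"[{size}] {throughput(size):.3e} numbers/s")
```

Taking the best of several repeats reduces noise from other processes; absolute numbers from this sketch are not comparable to the tables in this PR.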

single socket

input size	original	this pr	speedup
[1000]	0.157	0.381	2.43
[10000]	0.188	2.093	11.13
[100000]	0.191	11.545	60.45
[1000000]	0.191	27.207	142.45

single core

input size	original	this pr	speedup
[1000]	0.160	0.508	3.18
[10000]	0.188	1.802	9.59
[100000]	0.191	2.436	12.75
[1000000]	0.191	2.517	13.18
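
The speedup column in both tables appears to be the ratio of the "this pr" throughput to the "original" throughput. The short sketch below recomputes that ratio from the figures copied out of the tables and checks it against the reported value to within rounding; the names single_socket, single_core, speedup, and max_deviation are chosen here for illustration.

```python
# Throughput figures copied from the two tables above:
# input size -> (original, this PR, reported speedup)
single_socket = {
    1000: (0.157, 0.381, 2.43),
    10000: (0.188, 2.093, 11.13),
    100000: (0.191, 11.545, 60.45),
    1000000: (0.191, 27.207, 142.45),
}
single_core = {
    1000: (0.160, 0.508, 3.18),
    10000: (0.188, 1.802, 9.59),
    100000: (0.191, 2.436, 12.75),
    1000000: (0.191, 2.517, 13.18),
}


def speedup(original, this_pr):
    """Speedup as the ratio of new throughput to old throughput."""
    return this_pr / original


def max_deviation(table):
    """Largest absolute gap between the recomputed and the reported speedup."""
    return max(abs(speedup(o, p) - reported) for o, p, reported in table.values())


if __name__ == "__main__":
    for name, table in (("single socket", single_socket), ("single core", single_core)):
        for size, (o, p, reported) in table.items():
            print(f"{name} [{size}]: recomputed {speedup(o, p):.2f}, reported {reported}")
```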

Validation

To show that the MKL VSL random generator is functionally equivalent to the ATen implementation, test cases are also provided in uniform.py.

The test case uses a t-test to verify that the mean and variance of the MKL VSL output align with the theoretical mean and variance of a uniform distribution:

for X~U(a, b)
E(X) = (a+b)/2
VAR(X) = E(X^2) - E(X)^2 = (b-a)^2 / 12
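
A minimal sketch of this kind of moment check follows. It is not the test in uniform.py: it uses a plain tolerance comparison instead of a t-test, and Python's standard random.uniform instead of the MKL VSL or ATen generators. The function names, sample size, seed, and tolerances are assumptions chosen for illustration.

```python
import random


def uniform_moments(a, b):
    """Theoretical mean and variance of X ~ U(a, b)."""
    return (a + b) / 2, (b - a) ** 2 / 12


def sample_moments(xs):
    """Sample mean and (population) variance of a sequence of floats."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var


def check_uniform(a, b, n=100_000, mean_tol=0.05, var_tol=0.1, seed=0):
    """Draw n samples and compare their moments with the theoretical values.

    random.uniform is a stand-in generator; the tolerances are loose,
    hand-picked bounds for this sample size, not a formal statistical test.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(a, b) for _ in range(n)]
    mean, var = sample_moments(xs)
    exp_mean, exp_var = uniform_moments(a, b)
    return abs(mean - exp_mean) < mean_tol and abs(var - exp_var) < var_tol


if __name__ == "__main__":
    print(uniform_moments(-2.0, 5.0))
    print(check_uniform(-2.0, 5.0))
```

A proper t-test would additionally scale the deviation of the sample mean by its standard error, which makes the acceptance threshold independent of the chosen sample size.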

I cross-checked the test cases against both the ATen implementation and the MKL VSL implementation; both passed.

XiaobingSuper · 2019-12-11T05:27:24Z

#24781 Migrate uniform_ from the TH to Aten (CPU)

kostmo · 2019-12-23T08:46:39Z

💊 CircleCI build failures summary and remediations

As of commit 8097855 (more details on the Dr. CI page):

4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

pytorch_xla_linux_xenial_py3_6_clang7_test (1/4)

Step: "Test" <confirmed not flaky by 2 failures>

Apr 01 04:50:42 caused by: Connection refused (os error 111)

Apr 01 04:50:42 +++ eval 'extract_trap_cmd ' 
Apr 01 04:50:42 ++++ extract_trap_cmd 
Apr 01 04:50:42 ++++ printf '%s\n' '' 
Apr 01 04:50:42 +++ printf '%s\n' cleanup 
Apr 01 04:50:42 ++ trap -- ' 
Apr 01 04:50:42 cleanup' EXIT 
Apr 01 04:50:42 ++ which sccache 
Apr 01 04:50:42 ++ sccache --stop-server 
Apr 01 04:50:42 Stopping sccache server... 
Apr 01 04:50:42 error: couldn't connect to server 
Apr 01 04:50:42 caused by: Connection refused (os error 111) 
Apr 01 04:50:42 ++ true 
Apr 01 04:50:42 ++ rm /var/lib/jenkins/sccache_error.log 
Apr 01 04:50:42 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Apr 01 04:50:42 ++ SCCACHE_IDLE_TIMEOUT=1200 
Apr 01 04:50:42 ++ RUST_LOG=sccache::server=error 
Apr 01 04:50:42 ++ sccache --start-server 
Apr 01 04:50:42 Starting sccache server... 
Apr 01 04:50:42 ++ sccache --zero-stats 
Apr 01 04:50:42 Compile requests                 0 
Apr 01 04:50:42 Compile requests executed        0

pytorch_macos_10_13_py3_test (2/4)

Step: "Test" <confirmed not flaky by 2 failures>

Mar 31 21:53:17 FAIL [300.105s]: test_barrier_timeout_full_group (__main__.TestDistBackend)

Mar 31 21:53:14   test_scatter_checks (__main__.TestDistBackend) ... ok (0.316s) 
Mar 31 21:53:15   test_scatter_full_group (__main__.TestDistBackend) ... ok (0.310s) 
Mar 31 21:53:15   test_scatter_group (__main__.TestDistBackend) ... ok (0.637s) 
Mar 31 21:53:15   test_send_recv (__main__.TestDistBackend) ... ok (0.274s) 
Mar 31 21:53:16   test_send_recv_any_source (__main__.TestDistBackend) ... ok (0.667s) 
Mar 31 21:53:17   test_send_recv_with_tag (__main__.TestDistBackend) ... ok (0.408s) 
Mar 31 21:53:17   test_sparse_all_reduce_sum (__main__.TestDistBackend) ... ok (0.238s) 
Mar 31 21:53:17   test_sparse_all_reduce_sum_cuda (__main__.TestDistBackend) ... skip (0.212s) 
Mar 31 21:53:17  
Mar 31 21:53:17 ====================================================================== 
Mar 31 21:53:17 FAIL [300.105s]: test_barrier_timeout_full_group (__main__.TestDistBackend) 
Mar 31 21:53:17 ---------------------------------------------------------------------- 
Mar 31 21:53:17 Traceback (most recent call last): 
Mar 31 21:53:17   File "distributed/test_distributed.py", line 2051, in wrapper 
Mar 31 21:53:17     self._join_and_reduce(fn) 
Mar 31 21:53:17   File "distributed/test_distributed.py", line 2144, in _join_and_reduce 
Mar 31 21:53:17     "Timeout waiting for rank %d to terminate" % rank) 
Mar 31 21:53:17 AssertionError: True is not false : Timeout waiting for rank 0 to terminate 
Mar 31 21:53:17  
Mar 31 21:53:17 ---------------------------------------------------------------------- 
Mar 31 21:53:17 Ran 94 tests in 338.178s

pytorch_linux_xenial_py3_6_gcc5_4_test (3/4)

Step: "Test" <confirmed not flaky by 2 failures>

Apr 01 05:03:28 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save() instead.

Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 05:03:28  
Apr 01 05:03:28 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 05:03:28  
Apr 01 05:03:28 At: 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 05:03:28  
Apr 01 05:03:28 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 05:03:28  
Apr 01 05:03:28 At: 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 05:03:28   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 05:03:28  
Apr 01 05:03:28 ok (1.121s) 
Apr 01 05:03:29   test_unexepected_kwarg_is_specified (__main__.JitRpcTestWithSpawn) ... ok (1.120s) 
Apr 01 05:03:30   test_user_rrefs_confirmed (__main__.JitRpcTestWithSpawn) ... ok (1.120s) 
Apr 01 05:03:31   test_user_rrefs_confirmed_remote (__main__.JitRpcTestWithSpawn) ... ok (1.120s)

pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (4/4)

Step: "Test" <confirmed not flaky by 2 failures>

Apr 01 07:55:43 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save() instead.

Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 07:55:43  
Apr 01 07:55:43 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 07:55:43  
Apr 01 07:55:43 At: 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 07:55:43  
Apr 01 07:55:43 [E request_callback_impl.cpp:94] Received error while processing request type 2: PickleError: ScriptModules cannot be deepcopied using copy.deepcopy or saved using torch.save. Mixed serialization of script and non-script modules is not supported. For purely script modules use my_script_module.save(<filename>) instead. 
Apr 01 07:55:43  
Apr 01 07:55:43 At: 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1773): __getstate__ 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(86): serialize 
Apr 01 07:55:43   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(135): serialize 
Apr 01 07:55:43  
Apr 01 07:55:43 ok (0.912s) 
Apr 01 07:55:44   test_unexepected_kwarg_is_specified (__main__.JitRpcTestWithSpawn) ... ok (0.712s) 
Apr 01 07:55:44   test_user_rrefs_confirmed (__main__.JitRpcTestWithSpawn) ... ok (0.812s) 
Apr 01 07:55:45   test_user_rrefs_confirmed_remote (__main__.JitRpcTestWithSpawn) ... ok (0.812s)

This comment was automatically generated by Dr. CI.

This comment has been revised 41 times.

aten/src/ATen/native/Distributions.cpp

VitalyFedyunin · 2020-02-03T16:12:22Z

aten/src/ATen/native/cpu/UnaryOpsKernel.cpp

Do you need two copies of this code compiled (AVX on/off)?

Not really; I just structured uniform_mkl_kernel the same way as bernoulli_mkl_kernel.

VitalyFedyunin

Please use TensorIterator, please rebase.

Code looks good, please rebase and get rid of errors.

facebook-github-bot

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin · 2020-02-06T18:51:40Z

cc @pbelevich as it is related to random generators

VitalyFedyunin · 2020-02-11T18:21:40Z

I have not had time to check in detail, but the errors look related to the introduced change, so I cannot merge it as is.

mingfeima · 2020-02-13T07:53:18Z

I have not had time to check in detail, but the errors look related to the introduced change, so I cannot merge it as is.

Yes, I am aware that the errors are real. I am having some trouble understanding the test cases. Still WIP.

mingfeima · 2020-02-25T02:16:47Z

Removed the VSL path and switched to the plain CPU implementation for now. The VSL part will go into follow-up work.

VitalyFedyunin

Please rebase

mingfeima · 2020-03-03T02:26:59Z

Rebased