caffe2/python/numa_test.py fails intermittently. It seems as if Caffe2 ignores whatever numa_node_id is set in the numa_device_option when running ConstantFill for output_blob_0 and output_blob_1. This test case will also print some warnings about 1 not being a valid NUMA node when run on POWER. Running numactl --hardware on my POWER machine shows that POWER's NUMA nodes are not numbered consecutively: instead of nodes 0, 1, 2, 3, etc., I have nodes 0, 8, 254, etc. The test case doesn't take this into account and assumes that node 1 exists.
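For reference, the set of node IDs the kernel actually exposes can be enumerated from sysfs rather than assumed to be consecutive. A minimal sketch (Linux-only; `online_numa_nodes` is my own helper, not part of Caffe2):

```python
import os
import re

def online_numa_nodes():
    """Return the NUMA node IDs the kernel exposes, in ascending order.

    Reads the node<N> entries under /sys/devices/system/node (Linux);
    falls back to [0] on non-NUMA or non-Linux systems.
    """
    sysfs = "/sys/devices/system/node"
    try:
        entries = os.listdir(sysfs)
    except OSError:
        return [0]  # sysfs unavailable: assume a single node 0
    nodes = []
    for entry in entries:
        m = re.match(r"node(\d+)$", entry)
        if m:
            nodes.append(int(m.group(1)))
    return sorted(nodes) or [0]

print(online_numa_nodes())
```

On the POWER machine above this would report something like [0, 8, 254] rather than [0, 1], which is exactly the assumption the test violates.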
To Reproduce
Steps to reproduce the behavior:
1. Build and install PyTorch, making sure that CMake includes NUMA in the build
2. Run caffe2/python/numa_test.py
On x86:
(pytorch-env) [builder@b81f4478d132 python]$ python numa_test.py
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 7.728e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 40; device id: 0
.
----------------------------------------------------------------------
Ran 1 test in 1.613s
OK
(pytorch-env) [builder@b81f4478d132 python]$ python numa_test.py
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 9.282e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 40; device id: 0
F
======================================================================
FAIL: test_numa (__main__.NUMATest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "numa_test.py", line 49, in test_numa
self.assertEqual(workspace.GetBlobNUMANode("output_blob_0"), 0)
AssertionError: 1 != 0
----------------------------------------------------------------------
Ran 1 test in 1.451s
FAILED (failures=1)
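One way the test could be made portable is to pick the two numa_node_id values from the machine's real topology and skip when fewer than two nodes exist, instead of hard-coding nodes 0 and 1. A sketch (`pick_test_nodes` is a hypothetical helper, not part of numa_test.py):

```python
def pick_test_nodes(available):
    """Pick two NUMA node IDs that actually exist on this machine.

    On POWER the node IDs can be e.g. 0, 8, 254 rather than 0 and 1.
    Returns None when the machine has fewer than two nodes, in which
    case the test should be skipped rather than failed.
    """
    nodes = sorted(available)
    if len(nodes) < 2:
        return None
    return nodes[0], nodes[1]

print(pick_test_nodes([0, 8, 254]))  # -> (0, 8)
print(pick_test_nodes([0]))          # -> None
```

The test would then assert GetBlobNUMANode against whichever two IDs were picked, e.g. via unittest.skipIf when pick_test_nodes returns None.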
Environment
x86:
PyTorch version: 1.0.0a0+bda268a (with some local modifications)
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Red Hat Enterprise Linux Server 7.5 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
Nvidia driver version: 410.79
POWER:
PyTorch version: 1.0.0a0+bda268a (with some local modifications)
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Red Hat Enterprise Linux Server 7.5 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 410.72
cuDNN version: Probably one of the following:
/usr/local/cuda-10.0/targets/ppc64le-linux/lib/libcudnn.so.7.3.1
/usr/local/cuda-10.0/targets/ppc64le-linux/lib/libcudnn_static.a
Looking through the source code, it appears that the numa_node_id specified in numa_test.py propagates down to device_id in poolGetter in net_async_base.cc. I can confirm this by adding some log statements to net_async_base.cc:
$ python numa_test.py
...
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 1
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 0
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 0
libnuma: Warning: node 1 not allowed
...
But when it comes time to actually allocate memory for the blob, in allocator.h, the allocation uses the NUMA node of whatever CPU the current thread is running on (via numa.cpp), so the requested node ID appears to have no effect:
...
[W context_gpu.h:380] CPH: Allocating space at: 0x7ffdc4000c40
[W context_gpu.h:381] CPH: GetNUMANode(data_ptr.get()): 8
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 176; device id: 0
[W context_gpu.h:380] CPH: Allocating space at: 0x7ffd1c000c00
[W context_gpu.h:381] CPH: GetNUMANode(data_ptr.get()): 8
...
In the above, NUMA node 8 was chosen for both blobs. It appears that the thread isn't being run on the NUMA node that was specified with numa_node_id in numa_test.py. I'm afraid I'm not familiar enough with the ThreadPoolRegistry in net_async_base.cc to know how the NUMA node the thread runs on gets chosen.
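The behavior described above (both blobs landing on node 8, the node of whatever CPU the allocating thread happens to run on) can be checked against sysfs, which records which CPUs belong to which node. A sketch (`node_of_cpu` is my own helper; Linux-only):

```python
import os

def node_of_cpu(cpu):
    """Return the NUMA node that owns the given CPU.

    Looks for a node<N>/cpu<cpu> entry under /sys/devices/system/node
    (Linux). Returns 0 when the sysfs layout is unavailable.
    """
    sysfs = "/sys/devices/system/node"
    try:
        for entry in os.listdir(sysfs):
            # Only node<N> directories; skip files like "online"/"possible".
            if entry.startswith("node") and entry[4:].isdigit():
                if os.path.exists(os.path.join(sysfs, entry, "cpu%d" % cpu)):
                    return int(entry[4:])
    except OSError:
        pass
    return 0

# The node a fresh allocation lands on is the node of the CPU the
# allocating thread is currently running on, not the requested one:
print(node_of_cpu(0))
```

Comparing node_of_cpu for the CPUs the pool threads actually run on against the requested numa_node_id would confirm whether the thread pinning, rather than the allocator, is where the requested node gets lost.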
Expected behavior
The test case should pass on both x86 and POWER.