caffe2/python/numa_test.py fails intermittently. It seems as if Caffe2 ignores whatever numa_node_id is set in the numa_device_option when running ConstantFill for output_blob_0 and output_blob_1. This test case will also print some warnings about 1 not being a valid NUMA node when run on POWER. Running numactl --hardware on my POWER machine shows that POWER's NUMA nodes are not numbered consecutively: instead of nodes 0, 1, 2, 3, etc., I have nodes 0, 8, 254, etc. The test case doesn't take this into account and assumes that node 1 exists.
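For reference, the set of node IDs the kernel actually exposes can be enumerated from sysfs rather than assumed to be consecutive. A minimal sketch (Linux-only; `online_numa_nodes` is my own helper, not part of Caffe2):

```python
import os
import re

def online_numa_nodes():
    """Return the NUMA node IDs the kernel exposes, in ascending order.

    Reads the node<N> entries under /sys/devices/system/node (Linux);
    falls back to [0] on non-NUMA or non-Linux systems.
    """
    sysfs = "/sys/devices/system/node"
    try:
        entries = os.listdir(sysfs)
    except OSError:
        return [0]  # sysfs unavailable: assume a single node 0
    nodes = []
    for entry in entries:
        m = re.match(r"node(\d+)$", entry)
        if m:
            nodes.append(int(m.group(1)))
    return sorted(nodes) or [0]

print(online_numa_nodes())
```

On the POWER machine above this would report something like [0, 8, 254] rather than [0, 1], which is exactly the assumption the test violates.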
To Reproduce
Steps to reproduce the behavior:
1. Build and install PyTorch, making sure that CMake includes NUMA in the build
2. Run caffe2/python/numa_test.py
On x86:
(pytorch-env) [builder@b81f4478d132 python]$ python numa_test.py
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 7.728e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 40; device id: 0
.
----------------------------------------------------------------------
Ran 1 test in 1.613s
OK
(pytorch-env) [builder@b81f4478d132 python]$ python numa_test.py
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 9.282e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 40; device id: 0
F
======================================================================
FAIL: test_numa (__main__.NUMATest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "numa_test.py", line 49, in test_numa
self.assertEqual(workspace.GetBlobNUMANode("output_blob_0"), 0)
AssertionError: 1 != 0
----------------------------------------------------------------------
Ran 1 test in 1.451s
FAILED (failures=1)
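One way the test could be made portable is to pick the two numa_node_id values from the machine's real topology and skip when fewer than two nodes exist, instead of hard-coding nodes 0 and 1. A sketch (`pick_test_nodes` is a hypothetical helper, not part of numa_test.py):

```python
def pick_test_nodes(available):
    """Pick two NUMA node IDs that actually exist on this machine.

    On POWER the node IDs can be e.g. 0, 8, 254 rather than 0 and 1.
    Returns None when the machine has fewer than two nodes, in which
    case the test should be skipped rather than failed.
    """
    nodes = sorted(available)
    if len(nodes) < 2:
        return None
    return nodes[0], nodes[1]

print(pick_test_nodes([0, 8, 254]))  # -> (0, 8)
print(pick_test_nodes([0]))          # -> None
```

The test would then assert GetBlobNUMANode against whichever two IDs were picked, e.g. via unittest.skipIf when pick_test_nodes returns None.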
Environment
x86:
PyTorch version: 1.0.0a0+bda268a (with some local modifications)
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Red Hat Enterprise Linux Server 7.5 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
Nvidia driver version: 410.79
POWER:
PyTorch version: 1.0.0a0+bda268a (with some local modifications)
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Red Hat Enterprise Linux Server 7.5 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 410.72
cuDNN version: Probably one of the following:
/usr/local/cuda-10.0/targets/ppc64le-linux/lib/libcudnn.so.7.3.1
/usr/local/cuda-10.0/targets/ppc64le-linux/lib/libcudnn_static.a
Looking through the source code, it appears that the numa_node_id specified in numa_test.py propagates down to device_id in poolGetter in net_async_base.cc. I can confirm this by adding some log statements to net_async_base.cc:
$ python numa_test.py
...
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 1
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 0
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 0
libnuma: Warning: node 1 not allowed
...
But when it comes time to actually allocate memory for the blob, in allocator.h, the allocation uses the NUMA node of whatever CPU the current thread is running on (via numa.cpp), so the requested node ID appears to have no effect:
...
[W context_gpu.h:380] CPH: Allocating space at: 0x7ffdc4000c40
[W context_gpu.h:381] CPH: GetNUMANode(data_ptr.get()): 8
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 176; device id: 0
[W context_gpu.h:380] CPH: Allocating space at: 0x7ffd1c000c00
[W context_gpu.h:381] CPH: GetNUMANode(data_ptr.get()): 8
...
In the above, NUMA node 8 was chosen for both blobs. It appears that the thread isn't being run on the NUMA node that was specified with numa_node_id in numa_test.py. I'm afraid I'm not familiar enough with the ThreadPoolRegistry in net_async_base.cc to know how the NUMA node the thread runs on gets chosen.
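The behavior described above (both blobs landing on node 8, the node of whatever CPU the allocating thread happens to run on) can be checked against sysfs, which records which CPUs belong to which node. A sketch (`node_of_cpu` is my own helper; Linux-only):

```python
import os

def node_of_cpu(cpu):
    """Return the NUMA node that owns the given CPU.

    Looks for a node<N>/cpu<cpu> entry under /sys/devices/system/node
    (Linux). Returns 0 when the sysfs layout is unavailable.
    """
    sysfs = "/sys/devices/system/node"
    try:
        for entry in os.listdir(sysfs):
            # Only node<N> directories; skip files like "online"/"possible".
            if entry.startswith("node") and entry[4:].isdigit():
                if os.path.exists(os.path.join(sysfs, entry, "cpu%d" % cpu)):
                    return int(entry[4:])
    except OSError:
        pass
    return 0

# The node a fresh allocation lands on is the node of the CPU the
# allocating thread is currently running on, not the requested one:
print(node_of_cpu(0))
```

Comparing node_of_cpu for the CPUs the pool threads actually run on against the requested numa_node_id would confirm whether the thread pinning, rather than the allocator, is where the requested node gets lost.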
Expected behavior
The test case should pass on both x86 and POWER.