numa_test.py fails with AssertionError #16666

Open
charleshofer opened this issue Feb 1, 2019 · 1 comment
🐛 Bug

caffe2/python/numa_test.py fails intermittently. Caffe2 appears to ignore whatever numa_node_id is set in numa_device_option when running ConstantFill for output_blob_0 and output_blob_1. The test also prints warnings about 1 not being a valid NUMA node when run on POWER. Running numactl --hardware on my POWER machine shows that POWER's NUMA nodes are not numbered consecutively: instead of nodes 0, 1, 2, 3, etc., I have nodes 0, 8, 254, etc. The test doesn't take this into account and assumes that node 1 exists.
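For reference, here is a minimal sketch (not part of the test, helper name is mine) of how the node IDs that are actually present could be discovered from Linux sysfs instead of assuming consecutive numbering:

```python
import glob
import re


def available_numa_nodes():
    """Return the NUMA node IDs actually present on this Linux machine.

    Reads the sysfs node directories instead of assuming nodes are
    numbered 0, 1, 2, ... (on POWER they can be e.g. 0, 8, 254).
    """
    nodes = []
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        match = re.search(r"node(\d+)$", path)
        if match:
            nodes.append(int(match.group(1)))
    return sorted(nodes)


if __name__ == "__main__":
    print("NUMA nodes present:", available_numa_nodes())
```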

To Reproduce

Steps to reproduce the behavior:

  1. Build and install PyTorch, and make sure that CMake includes NUMA in the build
  2. Run caffe2/python/numa_test.py (a rough sketch of what the failing assertions exercise is shown after this list)
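For context, this is a rough approximation of what the test does, pieced together from the traceback and the Caffe2 Python API; it is not the actual test source, and the real test may also enable NUMA support via GlobalInit flags:

```python
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

net = core.Net("numa_sketch")
for numa_node in (0, 1):
    # Request a specific NUMA node via the operator's device option.
    device_option = caffe2_pb2.DeviceOption()
    device_option.numa_node_id = numa_node
    net.ConstantFill(
        [],
        "output_blob_{}".format(numa_node),
        shape=[1],
        value=1.0,
        device_option=device_option,
    )

workspace.RunNetOnce(net)

# The failing assertions check that each blob landed on the requested node.
assert workspace.GetBlobNUMANode("output_blob_0") == 0
assert workspace.GetBlobNUMANode("output_blob_1") == 1
```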

On x86:

(pytorch-env) [builder@b81f4478d132 python]$ python numa_test.py 
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 7.728e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 40; device id: 0
.
----------------------------------------------------------------------
Ran 1 test in 1.613s

OK
(pytorch-env) [builder@b81f4478d132 python]$ python numa_test.py 
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 9.282e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 40; device id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 40; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 40; device id: 0
F
======================================================================
FAIL: test_numa (__main__.NUMATest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "numa_test.py", line 49, in test_numa
    self.assertEqual(workspace.GetBlobNUMANode("output_blob_0"), 0)
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 1 test in 1.451s

FAILED (failures=1)

On POWER:

(pytorch-env) [builder@34def4d0c457 python]$ python numa_test.py 
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 7.333e-06 secs
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 0
libnuma: Warning: node 1 not allowed
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
numa_sched_setaffinity_v2_int() failed: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
set_mempolicy: Invalid argument
[I net_async_base.h:206] Using estimated CUDA pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 176; device id: 0
F
======================================================================
FAIL: test_numa (__main__.NUMATest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "numa_test.py", line 50, in test_numa
    self.assertEqual(workspace.GetBlobNUMANode("output_blob_1"), 1)
AssertionError: 0 != 1

----------------------------------------------------------------------
Ran 1 test in 6.899s

FAILED (failures=1)

Expected behavior

The test case should pass on both x86 and POWER

Environment

x86:

PyTorch version: 1.0.0a0+bda268a (with some local modifications)
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Red Hat Enterprise Linux Server 7.5 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB

Nvidia driver version: 410.79

POWER:

PyTorch version: 1.0.0a0+bda268a (with some local modifications)
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Red Hat Enterprise Linux Server 7.5 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 410.72
cuDNN version: Probably one of the following:
/usr/local/cuda-10.0/targets/ppc64le-linux/lib/libcudnn.so.7.3.1
/usr/local/cuda-10.0/targets/ppc64le-linux/lib/libcudnn_static.a

charleshofer commented Feb 1, 2019

Looking through the source code, it appears that the numa_node_id specified in numa_test.py propagates down to device_id in poolGetter in net_async_base.cc. I can confirm this by adding some log statements to net_async_base.cc:

$ python numa_test.py 
...
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 1
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 1
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 1
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 0
[I net_async_base.h:206] Using estimated CPU pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CPU pool, size: 176; device id: 0
libnuma: Warning: node 1 not allowed
...

But when it comes time to actually allocate memory for the blob, in allocator.h, we use the NUMA node of whatever CPU the current thread happens to be running on (via numa.cpp). The numa_node_id that was requested doesn't appear to influence where the memory lands:

...
[W context_gpu.h:380] CPH: Allocating space at: 0x7ffdc4000c40
[W context_gpu.h:381] CPH: GetNUMANode(data_ptr.get()): 8
[W net_async_base.cc:155] CPH: Creating new pool with device_id: 0
[I net_async_base.h:206] Using estimated CUDA pool size: 176; device id: 0
[I net_async_base.h:224] Created shared CUDA pool, size: 176; device id: 0
[W context_gpu.h:380] CPH: Allocating space at: 0x7ffd1c000c00
[W context_gpu.h:381] CPH: GetNUMANode(data_ptr.get()): 8
...

In the above, NUMA node 8 was chosen for both blobs. It appears that the thread isn't being run on the NUMA node specified with numa_node_id in numa_test.py. I'm afraid I'm not familiar enough with the ThreadPoolRegistry in net_async_base.cc to know how the NUMA node the thread runs on gets chosen.
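For anyone poking at this, here is a small standalone Linux sketch (not Caffe2 code; it assumes glibc's sched_getcpu and the usual sysfs layout) for checking which NUMA node the calling thread is currently running on:

```python
import ctypes
import glob
import re


def current_numa_node():
    """Return the NUMA node of the CPU the calling thread is running on.

    Calls glibc's sched_getcpu() via ctypes and maps the CPU to a node
    using the cpuN symlinks under /sys/devices/system/node/nodeN/.
    """
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    cpu = libc.sched_getcpu()
    if cpu < 0:
        raise OSError(ctypes.get_errno(), "sched_getcpu failed")
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpu%d" % cpu):
        return int(re.search(r"node(\d+)", path).group(1))
    raise RuntimeError("could not map CPU %d to a NUMA node" % cpu)


if __name__ == "__main__":
    print("current thread is on NUMA node", current_numa_node())
```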

zou3519 added the caffe2 label Feb 4, 2019