Third-party device implementations can't find symbols when including TraceUtils.h #114575

@ChuBoning

🐛 Describe the bug

#114367 moved some functions and classes that are used only with CUDA into TraceUtils.h. As a result, third-party device implementations that include TraceUtils.h fail to compile because these symbols cannot be found:

/opt/_internal/cpython-3.8.18/lib/python3.8/site-packages/torch/include/torch/csrc/distributed/c10d/TraceUtils.h:269:1: error: ‘DebugInfoWriter’ does not name a type
  269 | DebugInfoWriter::DebugInfoWriter(int rank) {
....
/opt/_internal/cpython-3.8.18/lib/python3.8/site-packages/torch/include/torch/csrc/distributed/c10d/TraceUtils.h:328:43: error: ‘CUDAEvent’ is not a member of ‘at::cuda’
  328 |   using EventList = std::vector<at::cuda::CUDAEvent>;
...
/opt/_internal/cpython-3.8.18/lib/python3.8/site-packages/torch/include/torch/csrc/distributed/c10d/TraceUtils.h:412:13: error: ‘struct c10d::NCCLTraceBuffer::Entry’ has no member named ‘start_’; did you mean ‘state_’?
  412 |       if (r.start_ != nullptr) {
      |             ^~~~~~
      |             state_
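
One possible way to keep a shared header usable from non-CUDA backends is to wrap the CUDA-only declarations in a preprocessor guard. The snippet below is only a minimal sketch of that idea, not the actual layout of TraceUtils.h or the fix adopted upstream: the macro `DEMO_HAS_CUDA`, the namespace `demo`, and every type and function in it are hypothetical stand-ins for illustration.

```cpp
#include <string>
#include <vector>

// DEMO_HAS_CUDA is a hypothetical stand-in for a real build flag.
// It is deliberately left undefined here to mimic a build without CUDA.

namespace demo {

// Device-agnostic helper: safe for every backend that includes the header.
inline std::string describe_rank(int rank) {
  return "rank " + std::to_string(rank);
}

#ifdef DEMO_HAS_CUDA
// CUDA-only section: compiled only when the build flag is set, so a
// non-CUDA translation unit never sees these names.
struct CudaEvent {};                       // hypothetical event type
using EventList = std::vector<CudaEvent>;  // hypothetical alias
inline bool has_cuda_trace() { return true; }
#else
// Fallback for backends without CUDA: no CUDA types are referenced.
inline bool has_cuda_trace() { return false; }
#endif

}  // namespace demo
```

With the guard in place, a third-party backend that includes the header compiles against only the device-agnostic part, while a CUDA build that defines the flag still gets the full set of declarations.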

Versions

PyTorch version: 2.2.0.dev20231126
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (AltArch) (aarch64)
GCC version: (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
Clang version: Could not collect
CMake version: version 3.27.7
Libc version: glibc-2.17

Python version: 3.8.18 (default, Nov 13 2023, 04:17:39)  [GCC 10.2.1 20210130 (Red Hat 10.2.1-11)] (64-bit runtime)
Python platform: Linux-4.15.0-112-generic-aarch64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             4
NUMA node(s):          8
Model:                 0
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              24576K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
NUMA node4 CPU(s):     96-119
NUMA node5 CPU(s):     120-143
NUMA node6 CPU(s):     144-167
NUMA node7 CPU(s):     168-191
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.2.0.dev20231126
[pip3] torchvision==0.15.2
[conda] Could not collect

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @malfet @seemethere

Metadata

Labels

module: build (Build system issues)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
