Static Linking C++, Op not available at runtime #111654

Open
daroo-m opened this issue Oct 20, 2023 · 3 comments
Labels
enhancement (Not as big of a feature, but technically not a bug. Should be easy to fix) · has workaround · module: build (Build system issues) · module: cpp (Related to C++ API) · module: vision · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


daroo-m commented Oct 20, 2023

🐛 Describe the bug

When linking with static libtorch and torchvision libraries, I am able to build, but at runtime, I get an error about an Unknown builtin op: aten::mul.

I have found references indicating that including <torchvision/vision.h> should cause the operators to be registered so they are linked in, but that doesn't seem to do the trick.

I've also found references indicating that forcing the linker to link the "whole archive" for libtorch_cpu.a should force it to include all the operators in the linked executable. I have done this, and it does overcome the problem - however, this feels a bit like a workaround, and we aren't able to use that as a long-term solution. When I link in the whole archive, the executable jumps from 87MB to 339MB.

I've also found some references suggesting calling c10::RegisterOps or torch::RegisterOps, neither of which seems to exist. I did find both c10::RegisterOperators and torch::RegisterOperators, but calling them doesn't seem to have any effect. Admittedly, I might be using them incorrectly: all I did was add a call to torch::RegisterOperators();, which didn't cause any build errors, but it did not overcome the runtime "Unknown builtin op: aten::mul" error.
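
For context, my current understanding (which may well be wrong) is that torch::RegisterOperators is the older API for registering new custom operators, not a mechanism for forcing already-compiled aten ops to be linked in, so a bare torch::RegisterOperators(); registers nothing. A minimal sketch of how I believe it is meant to be used, with a made-up operator name:

// Illustrative only - my_ops::add_one is a hypothetical custom op, not part of this report.
#include <torch/script.h>

torch::Tensor add_one(torch::Tensor t) {
  return t + 1;
}

// Static registration object; its constructor runs at startup when this
// translation unit is linked into the executable.
static auto registry = torch::RegisterOperators("my_ops::add_one", &add_one);

If that reading is right, it would explain why the call above has no effect on the missing aten::mul registration.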

I tried to make a minimal example:

// According to: https://github.com/pytorch/vision/#c-api
// and https://github.com/pytorch/vision/issues/2915
//  In order to get the torchvision operators registered with
//  torch (eg. for the JIT), all you need to do is to ensure
//  that you #include <torchvision/vision.h> in your project.
#include <vision.h>
#include <ATen/core/ivalue.h>
#include <fstream>
#include <torch/script.h>
#include <torch/torch.h>
#include <vector>
using namespace std;

int main(int argc, char* argv[])
{
  torch::NoGradGuard noGradGuard;

  // Load a trained model that has been converted to torchscript
  ifstream modelFile("torchscriptModel.pt");
  torch::jit::script::Module model;
  model = torch::jit::load(modelFile);
  modelFile.close();

  // Set model to eval mode
  model.eval();

  // Generate a random inference image
  float* imgPix = new float[100*100];
  // Normally set image pixels here, left uninitialized for minimal example

  // Convert image pixels to format required by forward
  at::Tensor imgTensor = torch::from_blob(imgPix, {100, 100, 1});
  at::Tensor imgTensorPermuted = imgTensor.permute({2, 0, 1});
  imgTensorPermuted.unsqueeze_(0);

  vector< at::Tensor > imageTensorVec;
  imageTensorVec.push_back(imgTensorPermuted);

  vector< torch::jit::IValue > inputToModel;
  inputToModel.push_back(torch::cat(imageTensorVec));

  at::Tensor forwardResult = model.forward(inputToModel).toTensor();

  delete [] imgPix;

  return 0;
}

To build this, I use the following command:

g++ minimalExample.cpp \
    -D_GLIBCXX_USE_CXX11_ABI=1 \
    -I /usr/src/vision/torchvision/csrc/ \
    -I /usr/src/pytorch/build/lib.linux-x86_64-3.8/torch/include/torch/csrc/api/include/ \
    -I /usr/src/pytorch/build/lib.linux-x86_64-3.8/torch/include/ \
    -Wl,--start-group \
       /usr/src/vision/build/libtorchvision.a \
       /usr/src/pytorch/build/lib/libc10.a \
       /usr/src/pytorch/build/lib/libtorch_cpu.a \
    -Wl,--end-group \
    /usr/src/pytorch/build/lib/libprotobuf.a \
    /usr/src/pytorch/build/lib/libfbgemm.a \
    /usr/src/pytorch/build/sleef/lib/libsleef.a \
    /usr/src/pytorch/build/lib/libasmjit.a \
    /usr/src/pytorch/build/lib/libonnx.a \
    /usr/src/pytorch/build/lib/libonnx_proto.a \
    /usr/src/pytorch/build/lib/libcpuinfo.a \
    /usr/src/pytorch/build/lib/libclog.a \
    /usr/src/pytorch/build/lib/libkineto.a \
    /usr/src/pytorch/build/lib/libnnpack.a \
    /usr/src/pytorch/build/lib/libpytorch_qnnpack.a \
    /usr/src/pytorch/build/lib/libXNNPACK.a \
    /usr/src/pytorch/build/lib/libpthreadpool.a \
    -Wl,--start-group \
       /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_tbb_thread.a \
       /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_core.a \
       /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_blacs_openmpi_lp64.a \
       /opt/intel/oneapi/mkl/2023.2.0/lib/intel64/libmkl_intel_lp64.a \
       /usr/src/onetbb_installed/lib64/libtbb.a \
    -Wl,--end-group \
    /usr/local/lib/libompitrace.a \
    /usr/local/lib/libmpi.a \
    /usr/local/lib/libopen-rte.a \
    /usr/local/lib/libopen-pal.a \
    /usr/local/lib/libz.a \
    /usr/lib64/libc_nonshared.a \
    -lrt \
    -ldl \
    -fopenmp \
    -pthread \
    -o minimalExample.exe

As I said, this will build successfully, but it does give a warning when building:

/usr/src/vision/torchvision/csrc/vision.h:10:40: warning: ‘_register_ops’ initialized and declared ‘extern’
 extern "C" VISION_INLINE_VARIABLE auto _register_ops = &cuda_version;
                                        ^~~~~~~~~~~~~
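
As far as I can tell, that _register_ops line is an "anchor symbol" trick (my wording, not torchvision's docs): the header takes the address of a function defined in torchvision's vision.cpp, so any file that includes the header carries a reference that forces the linker to pull that object out of libtorchvision.a, along with the static operator registrations in the same translation unit. A rough sketch of the same pattern, with made-up names:

// anchor.h (illustrative only, not the actual torchvision source)
int64_t mylib_version();                          // defined in anchor.cpp
inline auto _force_link_mylib = &mylib_version;   // taking the address creates a hard reference

// anchor.cpp (illustrative only)
#include "anchor.h"
int64_t mylib_version() { return 1; }
// any static registration objects in this file now get linked in (and run at startup) as well

If that reading is correct, the trick only covers torchvision's own ops; the aten::mul registration the JIT complains about presumably lives in object files inside libtorch_cpu.a that nothing in my example references directly.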

When I run the executable, though, I get the following error:

$ ./minimalExample.exe 
terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():  
Unknown builtin op: aten::mul.
Could not find any similar ops to aten::mul. This op may not exist or may not be currently supported in TorchScript.
:
  File "<string>", line 3

def mul(a : float, b : Tensor) -> Tensor:
  return b * a
         ~~~~~ <--- HERE
def add(a : float, b : Tensor) -> Tensor:
  return b + a
'mul' is being compiled since it was called from 'full_out_0_4'
  File "<string>", line 3

def full_out_0_4(size:List[int], fill_value:number, *, out:Tensor) -> Tensor:
  return torch.full(size, fill_value, out=out)
                                          ~~~ <--- HERE

Abort (core dumped)

The minimal example runs as expected, without error, if I link the libtorch_cpu.a whole archive, by changing the corresponding line in the build command to:

    -Wl,--start-group \
       /usr/src/vision/build/libtorchvision.a \
       /usr/src/pytorch/build/lib/libc10.a \
       -Wl,--whole-archive /usr/src/pytorch/build/lib/libtorch_cpu.a -Wl,--no-whole-archive \
    -Wl,--end-group \

but, as I said, the executable size jumps dramatically, and it seems like overkill.

I wasn't sure if this should be a forum post or an issue report, but given that I thought the include of <vision.h> was supposed to manage this, it felt more like an issue report to me.

Versions

I'm not sure this is especially valuable in this situation. The example is running on an old OS with CPU-only support. The conversion to torchscript was done on a more modern machine with python and pytorch installed, but the machine I am running on is a severely stripped-down machine without python at all.

If I run the minimalExample.exe on the modern machine, it performs the same way (i.e., it errors at runtime without the whole-archive change, but runs successfully with it). So, here's the env for that machine in case it's helpful:

Collecting environment information...
PyTorch version: 1.13.0a0+git7c98e70
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, May 26 2023, 14:05:08)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-1024-fips-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX 6000 Ada Generation
GPU 1: NVIDIA RTX 6000 Ada Generation
GPU 2: NVIDIA RTX 6000 Ada Generation

Nvidia driver version: 535.98
cuDNN version: Probably one of the following:
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          72
On-line CPU(s) list:             0-71
Thread(s) per core:              2
Core(s) per socket:              18
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz
Stepping:                        6
Frequency boost:                 enabled
CPU MHz:                         804.039
CPU max MHz:                     3600.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6000.00
Virtualization:                  VT-x
L1d cache:                       1.7 MiB
L1i cache:                       1.1 MiB
L2 cache:                        45 MiB
L3 cache:                        78 MiB
NUMA node0 CPU(s):               0-17,36-53
NUMA node1 CPU(s):               18-35,54-71
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.13.0a0+git7c98e70
[pip3] torchvision==0.14.0a0+5ce4506
[conda] Could not collect

cc @malfet @seemethere @jbschlosser @datumbox @vfdev-5 @pmeier

malfet added the module: cpp, module: build, triaged, module: vision, has workaround, and enhancement labels on Oct 23, 2023

malfet (Contributor) commented Oct 23, 2023

Thank you for reporting this issue. This behavior without using whole-archive is kind of expected, and at the moment there is no option (short of building from source) that would allow one to achieve the desired outcome (i.e., have a lean binary).


daroo-m (Author) commented Oct 24, 2023

Ok - thanks for the info. Is there a way in my own C++ code that I can just invoke the needed operators manually so that the linker knows to link them in? Even if it's just a dummy operation: if the operator gets directly used by my code, it seems like the linker would link it in even without --whole-archive. I have attempted this a bit and have not yet been successful, but I'm not sure whether my attempts are not using the proper operator, or whether this approach just won't work.
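
To make the question concrete, the kind of thing I have been trying looks roughly like this (illustrative only; the function name is made up):

#include <torch/torch.h>

// Dummy direct use of aten::mul so the linker sees a hard reference to it.
void force_link_aten_mul() {
  torch::Tensor a = torch::ones({1});
  torch::Tensor b = torch::mul(a, a);   // aten::mul.Tensor
  (void)b;
}

My (possibly wrong) guess about why this hasn't helped: a direct call like this makes the eager kernel reachable, but the JIT's builtin-op table seems to be populated by static initializers in other object files inside libtorch_cpu.a, and nothing here references those, so they still get dropped without --whole-archive.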

I'm intrigued by "short of building from source" - what do you mean by that in this context? I've built torch and vision from source already, so I have all the .o files and source for them. If there's a way to achieve what I want by linking with the .o files directly, I am willing to do that. I've tried extracting some specific .o files out of the libtorch_cpu.a archive and then putting those extracted files into their own archive and then linking that new archive with --whole-archive but I still keep getting Unknown builtin op: aten::mul. Admittedly, I don't fully understand all the operator registration stuff and aliasing that is done with the operators, so I may have just missed an .o file or two that are important. This feels like the best approach to me, but I'm not sure what the minimal set of files needed to be included in the "subarchive" should be, and all of my attempts so far have not worked out.

If you mean something different by "building from source" - I'd be interested in more information so I can try it out.


daroo-m (Author) commented Oct 31, 2023

I was experimenting to see if I can force the executable to know about the operator at link time (I realize many more ops will be needed; I'm just trying to make some progress). What I have found is that, if I add this code:

  c10::OperatorName opName = c10::OperatorName("aten::mul", "Tensor");
  std::shared_ptr<torch::jit::Operator> atenMulOpPtr = torch::jit::findOperatorFor(opName);
  //after this I'm trying to call torch::jit::RegisterOperators with a vector containing that Operator, but
  //I'm not showing that because without --whole-archive the atenMulOpPtr is a null pointer

the shared pointer named atenMulOpPtr ends up being a null pointer if I don't link with --whole-archive. If I do link with --whole-archive, the same code results in a non-null pointer. This seems to imply there's an earlier step that I am missing. Any hints as to what that might be?
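
In case it helps narrow things down, here is the diagnostic I have been using to compare the two registries side by side (the Dispatcher call reflects my assumption about the right API, so treat it as a sketch):

#include <ATen/core/dispatch/Dispatcher.h>
#include <torch/csrc/jit/runtime/operator.h>
#include <iostream>

void report_registration_state() {
  c10::OperatorName opName("aten::mul", "Tensor");

  // What the c10 dispatcher (eager kernels) knows about:
  auto dispatcherHandle = c10::Dispatcher::singleton().findSchema(opName);
  std::cout << "dispatcher has aten::mul.Tensor: "
            << (dispatcherHandle.has_value() ? "yes" : "no") << std::endl;

  // What the TorchScript operator registry knows about:
  auto jitOp = torch::jit::findOperatorFor(opName);
  std::cout << "JIT registry has aten::mul.Tensor: "
            << (jitOp != nullptr ? "yes" : "no") << std::endl;
}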
