Don't use RTLD_GLOBAL to load _C. #31162

Closed · wants to merge 26 commits

Changes from 14 commits

Commits (26):
beb3b40
Don't use RTLD_GLOBAL to load _C.
ezyang Dec 12, 2019
a0484e6
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 12, 2019
3231d91
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 12, 2019
a08fa45
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 12, 2019
0d0f466
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
1060202
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
22a565c
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
4fc627b
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
b796cb3
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
417faae
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
28c0ef5
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
25f8d0c
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 13, 2019
95ebee5
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Dec 14, 2019
53461ff
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 2, 2020
edfc4ea
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 6, 2020
a77da74
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 6, 2020
4170672
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 6, 2020
9812a35
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 7, 2020
70dd543
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 7, 2020
993c810
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 7, 2020
20872e7
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 7, 2020
2081c67
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 7, 2020
8948282
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 7, 2020
593b661
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 8, 2020
d79fe49
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 8, 2020
2771d72
Update on "Don't use RTLD_GLOBAL to load _C."
ezyang Jan 8, 2020
41 changes: 29 additions & 12 deletions .jenkins/pytorch/test.sh
@@ -76,22 +76,39 @@ fi
# ASAN test is not working
if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:strict_init_order=true
# We suppress the vptr violation, since we have separate copies of
# libprotobuf in both libtorch.so and libcaffe2.so, and it causes
# the following problem:
# test_cse (__main__.TestJit) ... torch/csrc/jit/export.cpp:622:38:
# runtime error: member call on address ... which does not point
# to an object of type 'google::protobuf::MessageLite'
# ...: note: object is of type 'onnx_torch::ModelProto'
#
# This problem should be solved when libtorch.so and libcaffe2.so are
# merged.
export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp
export UBSAN_OPTIONS=print_stacktrace=1
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
# TODO: Figure out how to avoid hard-coding these paths
export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-5.0/bin/llvm-symbolizer
export LD_PRELOAD=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux/libclang_rt.asan-x86_64.so
# NB: we preload libtorch.so to ensure that subsequent loads of C++
# extension modules consistently reference the type info in
# libtorch.so and its dependencies (most notably, libc++.so).
# If another copy of type info is generated, then
# UBSAN will fail claiming a vptr violation that looks like:
#
# member call on address XXXXXX which does not point to an object of
# type 'std::_Sp_counted_base<__gnu_cxx::_Lock_policy::_S_atomic>'
# XXXXXX note: object is of type
# 'std::_Sp_counted_ptr<torch::nn::LinearImpl*, (__gnu_cxx::_Lock_policy)2>'
#
# (NB: the textual types of the objects here are misleading, because
# they actually line up; it just so happens that there's two copies
# of the type info floating around in the address space, so they
# don't pointer compare equal. See also
# https://github.com/google/sanitizers/issues/1175
#
# This didn't use to be necessary, because historically we loaded
# _C.so (and transitively, libtorch.so) using RTLD_GLOBAL. We
# stopped doing that to promote better hygiene of C++ symbols,
# but that means all weak symbols are going to get duplicated--this
# especially applies to type info, which is almost always weak. This
# has implications for RTTI (which UBSAN is rightly flagging won't
# work), but in our codebase, we don't use RTTI (because it doesn't
# work in mobile). However, UBSAN relies on UBSAN to detect vptr
Review comment (Collaborator):
nit: UBSAN relies on UBSAN ?
# confusion, so at least in this environment, we need our ducks in
# order!
export LD_PRELOAD=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux/libclang_rt.asan-x86_64.so:$PWD/torch/lib/libtorch_python.so
# Increase stack size, because ASAN red zones use more stack
ulimit -s 81920

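When debugging this LD_PRELOAD setup, a minimal Linux-only Python check (not part of the PR; /proc/self/maps is Linux-specific) can confirm that libtorch_python.so is already mapped before torch is imported:

# Linux-only sanity check: verify that libtorch_python.so was mapped into the
# process (e.g. via LD_PRELOAD) before any torch import happens.
with open('/proc/self/maps') as maps:
    preloaded = any('libtorch_python.so' in line for line in maps)
print('libtorch_python.so preloaded:', preloaded)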
23 changes: 23 additions & 0 deletions torch/CMakeLists.txt
@@ -390,3 +390,26 @@ if (NOT TORCH_PYTHON_LINK_FLAGS STREQUAL "")
endif()

install(TARGETS torch_python DESTINATION "${TORCH_INSTALL_LIB_DIR}")

# Note [Global dependencies]
# Some libraries (e.g. OpenMPI) like to dlopen plugins after they're initialized,
# and they assume that all of their symbols will be available in the global namespace.
# On the other hand we try to be good citizens and avoid polluting the symbol
# namespaces, so libtorch is loaded with all its dependencies in a local scope.
# That usually leads to missing symbol errors at run-time, so to avoid a situation like
# this we have to preload those libs in a global namespace.
add_library(torch_global_deps SHARED ${TORCH_SRC_DIR}/csrc/empty.c)
set_target_properties(torch_global_deps PROPERTIES LINKER_LANGUAGE C)
if (USE_MPI)
target_link_libraries(torch_global_deps ${MPI_CXX_LIBRARIES})
endif()
target_link_libraries(torch_global_deps ${MKL_LIBRARIES})
# The CUDA libraries are linked here for a different reason: in some
# cases we load these libraries with ctypes, and if they weren't opened
# with RTLD_GLOBAL, we'll do the "normal" search process again (and
# not find them, because they're usually in non-standard locations)
if (USE_CUDA)
target_link_libraries(torch_global_deps ${Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS})
endif()

install(TARGETS torch_global_deps DESTINATION "${TORCH_INSTALL_LIB_DIR}")
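
For illustration, a minimal ctypes sketch of the visibility difference described in Note [Global dependencies]; libplugin_host.so and its plugin are hypothetical names, and this is not code from the PR:

import ctypes

# Loaded with RTLD_LOCAL (the usual default), the host library's symbols are
# visible only through this handle, so a plugin it later dlopen()s cannot
# resolve them and fails with "undefined symbol" errors at run time.
host_local = ctypes.CDLL('libplugin_host.so', mode=ctypes.RTLD_LOCAL)

# Loaded with RTLD_GLOBAL, the host library's symbols are promoted into the
# global namespace, so any subsequently dlopen()ed plugin can resolve them.
# torch_global_deps plays this role for MPI, MKL and the CUDA libraries.
host_global = ctypes.CDLL('libplugin_host.so', mode=ctypes.RTLD_GLOBAL)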
62 changes: 23 additions & 39 deletions torch/__init__.py
@@ -13,6 +13,7 @@
import os
import sys
import platform
import ctypes
from ._utils import _import_dotted_name
from ._utils_internal import get_file_path, prepare_multiprocessing_environment
from .version import __version__
@@ -33,61 +34,44 @@
# Load the extension module
################################################################################

# Loading the extension with RTLD_GLOBAL option allows to not link extension
# modules against the _C shared object. Their missing THP symbols will be
# automatically filled by the dynamic loader.
import os as _dl_flags

# if we have numpy, it *must* be imported before the call to setdlopenflags()
# or there is risk that later c modules will segfault when importing numpy
try:
import numpy as _np
except ImportError:
pass

if platform.system() == 'Windows':
# first get nvToolsExt PATH
def get_nvToolsExt_path():
NVTOOLEXT_HOME = _dl_flags.getenv('NVTOOLSEXT_PATH', 'C:\\Program Files\\NVIDIA Corporation\\NvToolsExt')
NVTOOLSEXT_PATH = os.getenv('NVTOOLSEXT_PATH', 'C:\\Program Files\\NVIDIA Corporation\\NvToolsExt')

if _dl_flags.path.exists(NVTOOLEXT_HOME):
return _dl_flags.path.join(NVTOOLEXT_HOME, 'bin', 'x64')
else:
return ''
if os.path.exists(NVTOOLSEXT_PATH):
nvtoolsext_lib_path = os.path.join(NVTOOLSEXT_PATH, 'bin', 'x64')
else:
nvtoolsext_lib_path = ''
Review comment (Collaborator):
A tip that's maybe not directly related to this PR: we missed the CUDA path here. To make it compatible with Python 3.8, we have to add that.

Reply (Contributor Author):
I'm happy to fix this, but I'm not exactly sure what the suggestion is here. What's the other CUDA path we missed?

Reply (Collaborator):
Do we have a variable that records the version of CUDA used during the build? If yes, then the answer is quite simple. For example, for CUDA 9.2, just add the following path:

cuda_path = os.path.join(os.environ.get('CUDA_PATH_V9_2', r'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2'), 'bin')

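On the Python 3.8 point: since 3.8, Windows no longer consults PATH when resolving an extension module's dependent DLLs, so each directory would also need to be registered explicitly. A sketch under that assumption; _register_dll_directories is a hypothetical helper, and dll_paths would be the list this diff builds just below:

import os
import sys

def _register_dll_directories(dll_paths):
    # Hypothetical helper: on Python 3.8+ (Windows only), dependent DLLs are
    # no longer found via PATH and must be registered explicitly.
    if sys.platform == 'win32' and hasattr(os, 'add_dll_directory'):
        for path in dll_paths:
            if path and os.path.exists(path):
                os.add_dll_directory(path)
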
py_dll_path = _dl_flags.path.join(sys.exec_prefix, 'Library', 'bin')
th_dll_path = _dl_flags.path.join(_dl_flags.path.dirname(__file__), 'lib')
py_dll_path = os.path.join(sys.exec_prefix, 'Library', 'bin')
th_dll_path = os.path.join(os.path.dirname(__file__), 'lib')

dll_paths = [th_dll_path, py_dll_path, get_nvToolsExt_path(), _dl_flags.environ['PATH']]
dll_paths = [th_dll_path, py_dll_path, nvtoolsext_lib_path, os.environ['PATH']]

# then add the path to env
_dl_flags.environ['PATH'] = ';'.join(dll_paths)
os.environ['PATH'] = ';'.join(dll_paths)

else:
# first check if the os package has the required flags
if not hasattr(_dl_flags, 'RTLD_GLOBAL') or not hasattr(_dl_flags, 'RTLD_LAZY'):
try:
# next try if DLFCN exists
import DLFCN as _dl_flags
except ImportError:
# as a last attempt, use compile-time constants
import torch._dl as _dl_flags

old_flags = sys.getdlopenflags()
sys.setdlopenflags(_dl_flags.RTLD_GLOBAL | _dl_flags.RTLD_LAZY)
# See Note [Global dependencies]
def _load_global_deps():
if platform.system() == 'Windows':
Review comment (Collaborator):
What about defining a global variable IS_WINDOWS since it's used multiple times in this file?

Reply (Contributor Author):
I'll do this in a follow up
return

del _dl_flags
lib_name = 'libtorch_global_deps' + ('.dylib' if platform.system() == 'Darwin' else '.so')
here = os.path.abspath(__file__)
lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)

ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
Review comment (Contributor):
I've been going through this out of curiosity and it got me wondering if this doesn't lead to an eventual dlclose? Don't we have to stash this library handle somewhere?

# See Note [Global dependencies]
_load_global_deps()

from torch._C import *

__all__ += [name for name in dir(_C)
if name[0] != '_' and
not name.endswith('Base')]

if platform.system() != 'Windows':
sys.setdlopenflags(old_flags)
del old_flags

################################################################################
# Define basic utilities
################################################################################
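Regarding the dlclose question in the review above: as far as I can tell, ctypes never dlclose()s a CDLL handle when the Python object is collected, so the PR's code should be safe. A more explicit variant (a sketch, not what the PR does) would stash the handle at module scope:

import ctypes
import os
import platform

# Module-level reference so the handle's lifetime is explicit; ctypes does
# not appear to dlclose() handles on collection, but stashing the object
# documents the intent.
_global_deps_lib = None

def _load_global_deps():
    global _global_deps_lib
    if platform.system() == 'Windows':
        return
    lib_name = 'libtorch_global_deps' + ('.dylib' if platform.system() == 'Darwin' else '.so')
    here = os.path.abspath(__file__)
    lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)
    _global_deps_lib = ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)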
Empty file added torch/csrc/empty.c
3 changes: 0 additions & 3 deletions ubsan.supp

This file was deleted.