Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TF loads/initializes DevicePlugins multiple times if there are symlinked paths in sys.path. #55497

Open
mergian opened this issue Apr 5, 2022 · 0 comments
Assignees
Labels
2.6.0 comp:runtime c++ runtime, performance issues (cpu) stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug

Comments

@mergian
Copy link

mergian commented Apr 5, 2022

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS but others also are affected
  • TensorFlow installed from (source or binary): all >= 2.6
  • TensorFlow version (use command below): v2.6.1-9-gc2363d6d025 2.6.2
  • Python version: >= 3.7

Describe the current behavior
When TF starts up, the __init__.py checks sys.path for paths to site-packages, that will then be used for loading PluggableDevices. If the identical file is accessible through a symlink from a different path, i.e.:

  • .../lib/python3.8/site-packages/tensorflow-plugins/libmydevice.so
  • .../lib64/python3.8/site-packages/tensorflow-plugins/libmydevice.so

with lib64 being a symlink to lib (as it is the case for VENVs) then TensorFlow loads the library twice and dies with

stream_executor::MultiPlatformManager::RegisterPlatform( std::move(cplatform)) status: Internal: platform is already registered with name: "NAME"

In https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/c_api_experimental.cc#L724 a map with std::string is used, which cannot distinguish the symlink and therefore will call the initialization methods within the PluggableDevice library a second time.

Describe the expected behavior
The library does not get initialized twice.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution(if contributing):

I think the easiest solution would be to change TF_LoadPluggableDeviceLibrary to use a set on the library handle instead of map, i.e.:

TF_Library* TF_LoadPluggableDeviceLibrary(const char* library_filename,
                                          TF_Status* status) {
#if defined(IS_MOBILE_PLATFORM) || defined(IS_SLIM_BUILD)
  status->status = tensorflow::errors::Unimplemented(
      "PluggableDevice plugin functionality is not supported on mobile");
  return nullptr;
#else
  TF_Library* lib_handle = new TF_Library;
  static tensorflow::mutex mu(tensorflow::LINKER_INITIALIZED);
  static std::unordered_set<void*>* loaded_libs =
      new std::unordered_set<void*>();
  tensorflow::Env* env = tensorflow::Env::Default();
  {
    tensorflow::mutex_lock lock(mu);
    status->status =
      env->LoadDynamicLibrary(library_filename, &lib_handle->lib_handle);
    if (status->status.ok()) {
      if (loaded_libs.emplace(lib_handle->lib_handle).second) {
        TF_CHECK_OK(tensorflow::RegisterPluggableDevicePlugin(lib_handle->lib_handle));
      } else {
        dlclose(lib_handle->lib_handle);
      }
    } else {
      delete lib_handle;
      return nullptr;
    }
    return lib_handle;
  }
#endif
}

LoadDynamicLibrary uses dlopen and will return ALWAYS the same handle independent of symlinks, see: https://man7.org/linux/man-pages/man3/dlopen.3.html. The call to dlclose is to decrease the reference count in case the lib got opened multiple times.

Standalone code to reproduce the issue
We encountered this error when using a VENV with rh-python38 package, because it puts lib and lib64 into sys.path. But can also be triggered by forging the PYTHONPATH env var:

pip3 install tensorflow
# install any PluggableDevice library you like
ln -s ~/.local/lib ~/.local/lib64
PYTHONPATH=~/.local/lib/python3.8/site-packages:~/.local/lib64/python3.8/site-packages python3 -c "import tensorflow"
@mergian mergian added the type:bug Bug label Apr 5, 2022
@sushreebarsa sushreebarsa added comp:core issues related to core part of tensorflow 2.6.0 comp:runtime c++ runtime, performance issues (cpu) and removed comp:core issues related to core part of tensorflow labels Apr 6, 2022
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.6.0 comp:runtime c++ runtime, performance issues (cpu) stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug
Projects
None yet
Development

No branches or pull requests

4 participants