Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify libtorch and libcaffe2 #17783

Closed
wants to merge 1 commit into from
Closed

Unify libtorch and libcaffe2 #17783

wants to merge 1 commit into from

Conversation

kostmo
Copy link
Member

@kostmo kostmo commented Mar 7, 2019

This PR is an intermediate step toward the ultimate goal of eliminating "caffe2" in favor of "torch". This PR moves all of the files that had constituted "libtorch.so" into the "libcaffe2.so" library, and wraps "libcaffe2.so" with a shell library named "libtorch.so". This means that, for now, caffe2/CMakeLists.txt becomes a lot bigger, and torch/CMakeLists.txt becomes smaller.

The torch Python bindings (torch_python.so) still remain in torch/CMakeLists.txt.

The follow-up to this PR will rename references to caffe2 to torch, and flatten the shell into one library.

Copy link
Collaborator

@dzhulgakov dzhulgakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress, CI passes! :)

One question is about libtorch.so - as far as I understand there's not such file any more. We probably should keep libtorch.so name because external things can depend on it. I'd imagine there are fewer dependencies on libcaffe2.so externally and we can make a symlink maybe.

I think it's getting to a point to get more people's input. @soumith @orionr @yf225 @bddppq - please take a look. Also, we have a release soon. Should we strive to land it before that? (pro: stress testing during binary builds, cons: potentially more churn)

# formerly-libtorch
# ==========================================================

set(TORCH_SRC_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../torch")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably just lift this main folder up to avoid .. but we can do it in a separate PR

set(TORCH_SRCS
${GENERATED_CXX_TORCH}
${GENERATED_H_TORCH}
${TORCH_SRC_DIR}/csrc/autograd/anomaly_mode.cpp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be nice to dedup this list with build_variables.py in the future (not this PR) - I wonder whether we can just read the same file in in both build systems

target_compile_definitions(caffe2 PUBLIC _THP_CORE)


# until they can be unified, keep these lists synced with setup.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably it's worth to look into it in the future (not sure what exact history of this duplication is) - from setup.py it seems we're building only stub.cpp, so it might be outdated comment

if (NOT WIN32)
target_compile_options(caffe2 PRIVATE "-fvisibility=hidden")
# FIXME kostmo
# target_compile_options(caffe2 PRIVATE "-fvisibility=hidden")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after we fixed THPPointer - can we bring it on or are there still a lot of failures?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still failures, e.g.:

ImportError: /scratch/kostmo/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZTIN5torch3jit4NodeE

if (NOT MSVC)
target_compile_options(caffe2 INTERFACE "$<$<COMPILE_LANGUAGE:CXX>:-std=c++11>")
endif()
#if (NOT MSVC)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems redundant with this line:

set_property(TARGET caffe2 PROPERTY CXX_STANDARD 11)

@kostmo
Copy link
Member Author

kostmo commented Apr 12, 2019

Ran into this Windows build failure after rebasing (to resolve a merge conflict):

21:55:55 dataloader.cpp.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) public: virtual __cdecl torch::data::samplers::Sampler<struct TestIndex>::~Sampler<struct TestIndex>(void)" (__imp_??1?$Sampler@UTestIndex@@@samplers@data@torch@@UEAA@XZ) referenced in function "int `class std::unique_ptr<class torch::data::StatelessDataLoader<struct TestIndexDataset,struct TestIndexSampler>,struct std::default_delete<class torch::data::StatelessDataLoader<struct TestIndexDataset,struct TestIndexSampler> > > __cdecl torch::data::make_data_loader<struct TestIndexDataset,struct TestIndexSampler>(struct TestIndexDataset,struct TestIndexSampler,struct data::dtor$7::DataLoaderOptions)'::`1'::dtor$7" (?dtor$7@?0???$make_data_loader@UTestIndexDataset@@UTestIndexSampler@@@data@torch@@YA?AV?$unique_ptr@V?$StatelessDataLoader@UTestIndexDataset@@UTestIndexSampler@@@data@torch@@U?$default_delete@V?$StatelessDataLoader@UTestIndexDataset@@UTestIndexSampler@@@data@torch@@@std@@@std@@UTestIndexDataset@@UTestIndexSampler@@UDataLoaderOptions@01@@Z@4HA)
21:55:55 dataloader.cpp.obj : error LNK2019: unresolved external symbol "__declspec(dllimport) const torch::data::samplers::Sampler<struct TestIndex>::`vftable'" (__imp_??_7?$Sampler@UTestIndex@@@samplers@data@torch@@6B@) referenced in function "class std::unique_ptr<class torch::data::StatelessDataLoader<struct TestIndexDataset,struct TestIndexSampler>,struct std::default_delete<class torch::data::StatelessDataLoader<struct TestIndexDataset,struct TestIndexSampler> > > __cdecl torch::data::make_data_loader<struct TestIndexDataset,struct TestIndexSampler>(struct TestIndexDataset,struct TestIndexSampler,struct torch::data::DataLoaderOptions)" (??$make_data_loader@UTestIndexDataset@@UTestIndexSampler@@@data@torch@@YA?AV?$unique_ptr@V?$StatelessDataLoader@UTestIndexDataset@@UTestIndexSampler@@@data@torch@@U?$default_delete@V?$StatelessDataLoader@UTestIndexDataset@@UTestIndexSampler@@@data@torch@@@std@@@std@@UTestIndexDataset@@UTestIndexSampler@@UDataLoaderOptions@01@@Z)
21:55:55 bin\test_api.exe : fatal error LNK1120: 2 unresolved externals

@dzhulgakov
Copy link
Collaborator

It's getting in a good shape - all of the following with visibility=hidden enabled:

However, before we can land this we need to decide naming problem:

  1. libtorch.so is a public name, so we should either name the entire thing libtorch (and break libcaffe2 backward compat which is more ok) or figure out some way to create a proxy library (or symlink?)
  2. gpu pieces are now linked in the "add-on" library caffe2_gpu.so. If we decide not to make a proxy libtorch which depends on both in (1), we'd need to fold the gpu bits directly in the main library.

@soumith @gchanan - do you have insights on above?

Also asking for review from more folks: @orionr @bddppq @yf225 @pjh5

list(APPEND TORCH_PYTHON_SRCS
${TORCH_SRC_DIR}/csrc/cuda/Module.cpp
${TORCH_SRC_DIR}/csrc/cuda/Storage.cpp
${TORCH_SRC_DIR}/csrc/cuda/Stream.cpp
${TORCH_SRC_DIR}/csrc/cuda/Event.cpp
${TORCH_SRC_DIR}/csrc/cuda/utils.cpp
${TORCH_SRC_DIR}/csrc/cuda/comm.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this is the reason causing the undefined symbol in rocm build? (although I don't understand why cuda build doesn't see the same issue).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I saw it has been moved to caffe2/CMakeLists.txt. But from the generated build.ninja file, comm.cpp is not included in the rocm build.

endif ()

if (USE_CUDA)
list(APPEND Caffe2_GPU_SRCS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to be added to Caffe2_HIP_SRCS if USE_ROCM

@dzhulgakov
Copy link
Collaborator

So apparently windows CI (on jenkins) doesn't start if the merge doesn't go smoothly. Brace for rebase I guess.

# Only necessary on Windows?
# Contains "caffe2" and "caffe2_gpu".
file(WRITE ${CMAKE_BINARY_DIR}/empty.cpp "")
add_library(torch STATIC ${CMAKE_BINARY_DIR}/empty.cpp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this solve on windows? It looks like torch is just a wrapper around caffe2 and caffe2_gpu; what do we need this target for?

@kostmo kostmo changed the title Unify libtorch and libcaffe2 [wip] Unify libtorch and libcaffe2 May 1, 2019
file(WRITE ${DUMMY_EMPTY_FILE} "")

if (MSVC)
set(DUMMY_FILE_CONTENT "__declspec(dllexport) int ignore_this_library_placeholder(){return 0;}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs escaping before ;, that is

set(DUMMY_FILE_CONTENT "__declspec(dllexport) int ignore_this_library_placeholder(){return 0\\;}")

Copy link
Collaborator

@peterjc123 peterjc123 May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By exporting this function, torch.lib will be generated, but it will not reference any other libs because it is not calling any funcs here, so it will only contain the placeholder func in the symbol table, which is quite useless.
I don't know why the empty stub is needed here. Maybe it was used to keep the backward compatibility? But it was not true, we are already appending caffe2.lib and c10.lib when we do find_package(Torch REQUIRED) here when BUILD_SHARED_LIB is ON. Or maybe we just to pretend torch.dll still exists?
Anyway, if you still want to keep the dependency link between torch.lib and caffe2.lib / caffe_gpu.lib, then you have to call at least one func from these libs in empty.cpp and switch off the optimizing flags. Here is my PR for your reference: https://github.com/pytorch/pytorch/pull/16071/files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why the empty stub is needed here. Maybe it was used to keep the backward compatibility?

Relaying an answer from Dima:

primarily to keep the shared library file name to be torch, not caffe2 to make sure that people builds and extensions don’t break
Alternatively we could just generate only libtorch instead of libcaffe2 but it’d make the PR bigger and also would still have the same issue for C2 users ( but the latter part might be fine)

@peterjc123
Copy link
Collaborator

As for the build timeout issue for caffe2 win builds, I think that you can try the following things:

  1. Remove the argument /maxcpucount in tools\build_pytorch_libs.py, to confirm whether it was or was not caused by a deadlock in msbuild
  2. Try different generators (in/ not in msbuild)
  3. Updating VS (in / not in the MSVC toolchain)

@peterjc123
Copy link
Collaborator

@pytorchbot retest this please

@kostmo kostmo changed the title [wip] Unify libtorch and libcaffe2 Unify libtorch and libcaffe2 May 9, 2019
Copy link
Contributor

@orionr orionr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woohoo!

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kostmo is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kostmo merged this pull request in 4ba28de.

facebook-github-bot pushed a commit that referenced this pull request May 29, 2019
Summary:
#17783 has made ninja and makefile builds to print out build commands unconditionally, this has made the build log very verbose, e.g. ROCm CI build log becomes >13mb. Large build log make searching for real error hard.
#20508 has reverted the ninja change, and this one reverts the makefile change.
Pull Request resolved: #21053

Differential Revision: D15533412

Pulled By: bddppq

fbshipit-source-id: ad89b617d06acc670d75d4cf25111a4081e9c95e
@kostmo kostmo mentioned this pull request May 31, 2019
facebook-github-bot pushed a commit that referenced this pull request Jun 13, 2019
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants).  Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.

The shell `libtorch` library that had been introduced as a stopgap in #17783 is again flattened in this PR.
Pull Request resolved: #20774

Differential Revision: D15769965

Pulled By: kostmo

fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 13, 2019
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants).  Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.

The shell `libtorch` library that had been introduced as a stopgap in pytorch/pytorch#17783 is again flattened in this PR.
Pull Request resolved: pytorch/pytorch#20774

Differential Revision: D15769965

Pulled By: kostmo

fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants