Conversation

peterbell10 (Collaborator) commented Oct 17, 2021

Stack from ghstack:

This guts `THCState` to simply be an empty struct, as well as:

  • moving `THCState_getPeerToPeerAccess` and its cache into ATen (see the sketch below)
  • cleaning up dead code in `THCGeneral.cpp`
  • moving `THCudaInit` and `THCMagma_init` into `CUDAHooks::initCUDA`

Differential Revision: D31721648
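
For context, here is a minimal sketch of what a peer-to-peer access cache living in ATen might look like; the names, file placement, and error-checking macros are illustrative assumptions, not the actual code from this PR:

```cpp
// Hypothetical sketch only: a device-pair cache for peer-to-peer access queries,
// kept in ATen rather than in THCState. Not the exact code from this PR.
#include <c10/cuda/CUDAException.h>   // C10_CUDA_CHECK
#include <c10/util/Exception.h>       // TORCH_CHECK
#include <cuda_runtime.h>
#include <vector>

namespace at { namespace cuda { namespace detail {

static int64_t num_devices_ = -1;
static std::vector<int8_t> p2p_access_cache_;  // -1 = unknown, 0 = no, 1 = yes

void init_p2p_access_cache(int64_t num_devices) {
  num_devices_ = num_devices;
  p2p_access_cache_.assign(num_devices * num_devices, -1);
}

bool get_p2p_access(int dev, int dev_to_access) {
  TORCH_CHECK(dev >= 0 && dev < num_devices_, "invalid device: ", dev);
  TORCH_CHECK(dev_to_access >= 0 && dev_to_access < num_devices_,
              "invalid peer device: ", dev_to_access);
  auto& cached = p2p_access_cache_[dev * num_devices_ + dev_to_access];
  if (cached == -1) {
    // Query the driver once per device pair and memoize the answer.
    int can_access = 0;
    C10_CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access, dev, dev_to_access));
    cached = can_access ? 1 : 0;
  }
  return cached != 0;
}

}}}  // namespace at::cuda::detail
```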

pytorch-probot bot commented Oct 17, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/919050f67b85579ea8d33ec16a9d539b363fe8fa/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default,ciflow/cuda

| Workflow | Labels | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | ✅ triggered |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | ✅ triggered |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | ✅ triggered |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-vulkan-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers | ✅ triggered |
| linux-xenial-py3.6-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | ✅ triggered |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| puretorch-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

facebook-github-bot (Contributor) commented Oct 17, 2021

💊 CI failures summary and remediations

As of commit 919050f (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-cuda10.2-py3.6-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu) (1/1)

Step: "Test"

2021-10-18T13:36:39.2683136Z AssertionError: RuntimeError not raised
2021-10-18T13:36:39.2673856Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 2898, in wrapper
2021-10-18T13:36:39.2674697Z     return func(*args, **kwargs)
2021-10-18T13:36:39.2675768Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 112, in wrapper
2021-10-18T13:36:39.2676657Z     return func(*args, **kwargs)
2021-10-18T13:36:39.2677620Z   File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 2397, in test_nccl_timeout
2021-10-18T13:36:39.2678834Z     process_group.allreduce(torch.rand(10).cuda(self.rank)).wait(timeout=timedelta(seconds=1))
2021-10-18T13:36:39.2679784Z   File "/opt/conda/lib/python3.6/unittest/case.py", line 203, in __exit__
2021-10-18T13:36:39.2680543Z     self._raiseFailure("{} not raised".format(exc_name))
2021-10-18T13:36:39.2681492Z   File "/opt/conda/lib/python3.6/unittest/case.py", line 135, in _raiseFailure
2021-10-18T13:36:39.2682335Z     raise self.test_case.failureException(msg)
2021-10-18T13:36:39.2683136Z AssertionError: RuntimeError not raised
2021-10-18T13:36:39.2684669Z ✅ 181 Passed
2021-10-18T13:36:39.2685191Z 💨 11 Skipped
2021-10-18T13:36:39.2685723Z 🚨 1 Failed
2021-10-18T13:36:39.2901628Z ##[group]Run # Remove any previous test reports if they exist
2021-10-18T13:36:39.2903483Z rm -f test-reports-*.zip
2021-10-18T13:36:39.2904162Z zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml'


peterbell10 (Collaborator, Author) commented:

@pytorchbot ciflow rerun -l ciflow/cuda

ngimel (Collaborator) commented Oct 17, 2021

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ngimel (Collaborator) commented Oct 17, 2021

I'm getting the same error in internal builds:

c10::Error: detail::magma_init_fnINTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp":75, please report a bug to PyTorch. Cannot initilaize magma, init routine not set
Exception raised from initCUDA at caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp:75 (most recent call first):
# 0  c10::get_backtrace[abi:cxx11](unsigned long, unsigned long, bool)
# 1  std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), c10::(anonymous namespace)::GetFetchStackTrace()::$_0>::_M_invoke(std::_Any_data const&)
# 2  c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
# 3  c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 4  c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*)
# 5  at::cuda::detail::CUDAHooks::initCUDA() const
# 6  at::Context::lazyInitCUDA()::{lambda()#1}::operator()() const
# 7  __pthread_once_slow
# 8  void std::call_once<at::Context::lazyInitCUDA()::{lambda()#1}>(std::once_flag&, at::Context::lazyInitCUDA()::{lambda()#1}&&)
# 9  at::Context::lazyInitCUDA()
# 10 at::(anonymous namespace)::(anonymous namespace)::wrapper__empty_strided(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)

Any advice on why that would happen?

peterbell10 (Collaborator, Author) commented:

That means `AT_MAGMA_ENABLED()` is true but `magma_init_fn` hasn't been set. That must mean the initializer in `cuda/BatchLinearAlgebra.cpp` hasn't run yet. I'm not sure how that could happen, though.

The old code just ignored it if this happened, so it's possible this was happening silently before.
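
For reference, here is a rough sketch of the registration pattern being described; the names and file layout are assumptions for illustration, not verbatim PyTorch code. `BatchLinearAlgebra.cpp` assigns the init routine during static initialization, and `CUDAHooks::initCUDA` later expects it to be set:

```cpp
// Rough sketch of the registration pattern under discussion; illustrative only.
#include <functional>
#include <stdexcept>

namespace at { namespace cuda { namespace detail {
// In the real tree this would live in a header/source pair; defined here so the
// sketch is self-contained.
std::function<void()> magma_init_fn;
}}}

// cuda/BatchLinearAlgebra.cpp would register the routine at static-initialization time:
namespace {
struct MagmaInitializer {
  MagmaInitializer() {
    // In PyTorch this would wrap magma_init(); stubbed out here.
    at::cuda::detail::magma_init_fn = []() { /* magma_init(); */ };
  }
} magma_initializer;
}  // namespace

// CUDAHooks::initCUDA() then runs lazily (via std::call_once in lazyInitCUDA) and
// expects the registration to have already happened:
void initCUDA_sketch() {
  if (at::cuda::detail::magma_init_fn == nullptr) {
    throw std::runtime_error("Cannot initialize magma, init routine not set");
  }
  at::cuda::detail::magma_init_fn();
}
```

If `magma_init_fn` lives in a different translation unit, the order in which its own dynamic initialization runs relative to `MagmaInitializer` is unspecified, which is the static-initialization-order hazard discussed below.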

ngimel (Collaborator) commented Oct 18, 2021

I've added prints to verify that the initializer in `cuda/BatchLinearAlgebra.cpp` has run before the initialization is performed in `CUDAHooks`, and that the function is correctly set in the initializer and can be called, yet in `CUDAHooks` it somehow appears to be unset again. I didn't check whether it was properly set before this PR.
Edit: checked that before this PR, `magma_init` in THC was successfully called.

peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Oct 18, 2021
This guts `THCState` to simply be an empty struct, as well as:
- moving `THCState_getPeerToPeerAccess` and its cache into `ATen`.
- cleaning up dead code in `THCGeneral.cpp`
- moving `THCudaInit` and `THCMagma_init` into `CUDAHooks::initCUDA`

ghstack-source-id: e3a38ee
Pull Request resolved: pytorch#66765

Differential Revision: [D31721648](https://our.internmc.facebook.com/intern/diff/D31721648)

[ghstack-poisoned]
peterbell10 (Collaborator, Author) commented:

My next best guess would be static initialization order issues. I've changed the variable from a `std::function` to a static function pointer so it doesn't need a constructor.
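
A minimal illustration of that change (again a sketch under assumptions, not the exact diff): a namespace-scope function pointer is constant-initialized, so no dynamic initializer in another translation unit can later overwrite a registration that has already happened.

```cpp
// Illustrative sketch of the fix, not the exact PR diff.

// Before: a std::function has a constructor, so it is *dynamically* initialized.
// If that dynamic initialization happens to run after BatchLinearAlgebra.cpp's
// registration (the cross-TU order is unspecified), it silently resets the
// registration back to an empty function.
//   std::function<void()> magma_init_fn;

// After: a plain function pointer needs no constructor. It is constant-initialized
// to nullptr before any dynamic initialization runs, so regardless of which TU is
// dynamically initialized first, the registering assignment is never wiped out.
void (*magma_init_fn)() = nullptr;
```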

ngimel (Collaborator) commented Oct 18, 2021

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ngimel (Collaborator) commented Oct 18, 2021

This seems to be working on a local repro. I'll wait for all the tests to run and then land.

facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 8637556.

wconstab pushed a commit that referenced this pull request Oct 20, 2021
Summary:
Pull Request resolved: #66765

This guts `THCState` to simply be an empty struct, as well as:
- moving `THCState_getPeerToPeerAccess` and its cache into `ATen`.
- cleaning up dead code in `THCGeneral.cpp`
- moving `THCudaInit` and `THCMagma_init` into `CUDAHooks::initCUDA`

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D31721648

Pulled By: ngimel

fbshipit-source-id: 772b24787656a95f9e3fcb287d912b1c3400f32d
facebook-github-bot deleted the gh/peterbell10/176/head branch October 22, 2021 14:17
yihuajack added a commit to yihuajack/OpenPCDet that referenced this pull request Jul 24, 2022
sshaoshuai pushed a commit to open-mmlab/OpenPCDet that referenced this pull request Aug 13, 2022
* feat: support torch>=1.11

Fix #900.
Support PyTorch version >= 1.11. Referring to pytorch/pytorch#66765 and https://github.com/pytorch/pytorch/wiki/TH-to-ATen-porting-guide.

* fix: Remove preproc torch version check macros
FeedOnMilkT pushed a commit to FeedOnMilkT/OpenPCDet-AttentionEnhanced that referenced this pull request Mar 17, 2025
* feat: support torch>=1.11

Fix #900.
Support PyTorch version >= 1.11. Referring to pytorch/pytorch#66765 and https://github.com/pytorch/pytorch/wiki/TH-to-ATen-porting-guide.

* fix: Remove preproc torch version check macros