Skip to content

Conversation

dolpm
Copy link
Contributor

@dolpm dolpm commented Sep 7, 2025

Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425

Copy link

pytorch-bot bot commented Sep 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162353

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5d1f70b with merge base 43d9b5e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

@dolpm dolpm added the topic: not user facing topic category label Sep 7, 2025
@dolpm dolpm requested a review from yiming0416 September 7, 2025 22:27
dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 7, 2025
Summary:

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 7, 2025
Summary:

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 7, 2025
Summary:
Pull Request resolved: pytorch#162353

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 7, 2025
Summary:
Pull Request resolved: pytorch#162353

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

@dolpm dolpm force-pushed the export-D81731425 branch 2 times, most recently from 1cb77ea to a823045 Compare September 8, 2025 03:47
dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 8, 2025
Summary:

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 8, 2025
Summary:

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 8, 2025
Summary:
Pull Request resolved: pytorch#162353

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

dolpm added a commit to dolpm/pytorch that referenced this pull request Sep 8, 2025
Summary:
Pull Request resolved: pytorch#162353

att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

Summary:
Pull Request resolved: pytorch#162353

att

Test Plan:
ci

Rollback Plan:

Reviewed By: yiming0416

Differential Revision: D81731425
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D81731425

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@jeffdaily
Copy link
Collaborator

I've taken a quick look. There are some changes that seem necessary for ROCm but they don't avoid the seg fault.

diff --git a/test/cpp/nativert/CMakeLists.txt b/test/cpp/nativert/CMakeLists.txt
index 30c68283282..77636a2d109 100644
--- a/test/cpp/nativert/CMakeLists.txt
+++ b/test/cpp/nativert/CMakeLists.txt
@@ -48,7 +48,7 @@ set(NATIVERT_TEST_SRCS
   ${TORCH_ROOT}/torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp
 )

-if(USE_CUDA)
+if(USE_CUDA OR USE_ROCM)
   list(APPEND NATIVERT_TEST_SRCS ${TORCH_ROOT}/torch/nativert/executor/triton/CudaTritonKernelManager.cpp)
   list(APPEND NATIVERT_TEST_SRCS ${TORCH_ROOT}/torch/nativert/executor/AOTInductorModelContainerCudaShim.cpp)
 endif()
diff --git a/test/cpp/nativert/test_aoti_model_container_registration.cpp b/test/cpp/nativert/test_aoti_model_container_registration.cpp
index 94b19bf8e75..9a5d67a9c3c 100644
--- a/test/cpp/nativert/test_aoti_model_container_registration.cpp
+++ b/test/cpp/nativert/test_aoti_model_container_registration.cpp
@@ -8,7 +8,7 @@ using namespace torch::nativert;
 TEST(AOTIModelContainerRegistrationTests, TestRegister) {
   EXPECT_TRUE(AOTIModelContainerRunnerRegistry()->Has(at::kCPU));

-#ifdef USE_CUDA
+#if defined(USE_CUDA) || defined(USE_ROCM)
   EXPECT_TRUE(AOTIModelContainerRunnerRegistry()->Has(at::kCUDA));
 #else
   EXPECT_FALSE(AOTIModelContainerRunnerRegistry()->Has(at::kCUDA));

I'm doing a build with BUILD_LIBTORCH_CPU_WITH_DEBUG:BOOL=ON to get a backtrace for the seg fault.

@dolpm
Copy link
Contributor Author

dolpm commented Sep 12, 2025

I've taken a quick look. There are some changes that seem necessary for ROCm but they don't avoid the seg fault.

diff --git a/test/cpp/nativert/CMakeLists.txt b/test/cpp/nativert/CMakeLists.txt
index 30c68283282..77636a2d109 100644
--- a/test/cpp/nativert/CMakeLists.txt
+++ b/test/cpp/nativert/CMakeLists.txt
@@ -48,7 +48,7 @@ set(NATIVERT_TEST_SRCS
   ${TORCH_ROOT}/torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp
 )

-if(USE_CUDA)
+if(USE_CUDA OR USE_ROCM)
   list(APPEND NATIVERT_TEST_SRCS ${TORCH_ROOT}/torch/nativert/executor/triton/CudaTritonKernelManager.cpp)
   list(APPEND NATIVERT_TEST_SRCS ${TORCH_ROOT}/torch/nativert/executor/AOTInductorModelContainerCudaShim.cpp)
 endif()
diff --git a/test/cpp/nativert/test_aoti_model_container_registration.cpp b/test/cpp/nativert/test_aoti_model_container_registration.cpp
index 94b19bf8e75..9a5d67a9c3c 100644
--- a/test/cpp/nativert/test_aoti_model_container_registration.cpp
+++ b/test/cpp/nativert/test_aoti_model_container_registration.cpp
@@ -8,7 +8,7 @@ using namespace torch::nativert;
 TEST(AOTIModelContainerRegistrationTests, TestRegister) {
   EXPECT_TRUE(AOTIModelContainerRunnerRegistry()->Has(at::kCPU));

-#ifdef USE_CUDA
+#if defined(USE_CUDA) || defined(USE_ROCM)
   EXPECT_TRUE(AOTIModelContainerRunnerRegistry()->Has(at::kCUDA));
 #else
   EXPECT_FALSE(AOTIModelContainerRunnerRegistry()->Has(at::kCUDA));

I'm doing a build with BUILD_LIBTORCH_CPU_WITH_DEBUG:BOOL=ON to get a backtrace for the seg fault.

it will take me a few mins to repro this locally on amd machine, but i didn't expect these tests to work on rocm -- we can just disable them for now if this is the case? the reason this isn't changing anything is USE_ROCM and USE_CUDA are set when it's an amd gpu build so this isn't doing anything.

@jeffdaily
Copy link
Collaborator

Found one more diff I'm trying out. The earlier diffs were for the test binary only but the sources weren't being included for rocm builds into libtorch.

diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt
index b5d47bb4b5d..bcf8fffa4ac 100644
--- a/caffe2/CMakeLists.txt
+++ b/caffe2/CMakeLists.txt
@@ -550,8 +550,7 @@ if(USE_CUDA OR USE_ROCM)
   append_filelist("libtorch_cuda_core_sources" Caffe2_GPU_HIP_JIT_FUSERS_SRCS)
 endif()

-if(USE_CUDA)
-  # eventually do rocm
+if(USE_CUDA OR USE_ROCM)
   append_filelist("libtorch_nativert_cuda_sources" Caffe2_GPU_SRCS)
 endif()

@dolpm
Copy link
Contributor Author

dolpm commented Sep 12, 2025

Found one more diff I'm trying out. The earlier diffs were for the test binary only but the sources weren't being included for rocm builds into libtorch.

diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt
index b5d47bb4b5d..bcf8fffa4ac 100644
--- a/caffe2/CMakeLists.txt
+++ b/caffe2/CMakeLists.txt
@@ -550,8 +550,7 @@ if(USE_CUDA OR USE_ROCM)
   append_filelist("libtorch_cuda_core_sources" Caffe2_GPU_HIP_JIT_FUSERS_SRCS)
 endif()

-if(USE_CUDA)
-  # eventually do rocm
+if(USE_CUDA OR USE_ROCM)
   append_filelist("libtorch_nativert_cuda_sources" Caffe2_GPU_SRCS)
 endif()

can try 186290f as well.

@jeffdaily
Copy link
Collaborator

Besides skipping the tests, is there some technical reason why you don't think these tests should work on ROCm?

@dolpm
Copy link
Contributor Author

dolpm commented Sep 12, 2025

Besides skipping the tests, is there some technical reason why you don't think these tests should work on ROCm?

it will get the registry by the device type of the input, so we need to change the registry code as well based on USE_ROCM -- i think i know how to fix the issue you are seeing. give me 5 mins.

for some other reason i can't get my rocm build to work -- mind trying 52cf1c7 ? assuming i understand the problem, this should fix it.

@jeffdaily
Copy link
Collaborator

I figured it out finally. See my PR here. Tests were passing for me.
#162827

@dolpm
Copy link
Contributor Author

dolpm commented Sep 12, 2025

I figured it out finally. See my PR here. Tests were passing for me. #162827

left a comment

pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2025
Forward fix for tests added by #162353.  Enables aoti tests on rocm.

Pull Request resolved: #162827
Approved by: https://github.com/dolpm, https://github.com/huydhn
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425

Pull Request resolved: pytorch#162353
Approved by: https://github.com/yiming0416
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Forward fix for tests added by pytorch#162353.  Enables aoti tests on rocm.

Pull Request resolved: pytorch#162827
Approved by: https://github.com/dolpm, https://github.com/huydhn
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425

Pull Request resolved: pytorch#162353
Approved by: https://github.com/yiming0416
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Forward fix for tests added by pytorch#162353.  Enables aoti tests on rocm.

Pull Request resolved: pytorch#162827
Approved by: https://github.com/dolpm, https://github.com/huydhn
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425

Pull Request resolved: pytorch#162353
Approved by: https://github.com/yiming0416
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
Forward fix for tests added by pytorch#162353.  Enables aoti tests on rocm.

Pull Request resolved: pytorch#162827
Approved by: https://github.com/dolpm, https://github.com/huydhn
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D81731425

Pull Request resolved: pytorch#162353
Approved by: https://github.com/yiming0416
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Forward fix for tests added by pytorch#162353.  Enables aoti tests on rocm.

Pull Request resolved: pytorch#162827
Approved by: https://github.com/dolpm, https://github.com/huydhn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants