Integrate hipsolver batched linalg drivers #103203
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103203
Note: Links to docs will display an error until the docs builds have been completed.
✅ 3 Unrelated Failures (as of commit 7f864fb):
BROKEN TRUNK - The following jobs failed but were present on the merge base 04da0c7. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Left a number of comments and questions.
You should probably also activate the relevant OpInfo tests
test/test_linalg.py (Outdated)
@precisionOverride({torch.float32: 1e-3, torch.complex64: 1e-3,
                    torch.float64: 1e-8, torch.complex128: 1e-8})
def test_lu_solve_batched(self, device, dtype):
    torch.backends.cuda.preferred_linalg_library('cusolver')
This shouldn't be needed.
test/test_linalg.py (Outdated)
@dtypesIfCUDA(*floating_types_and(
    *[torch.cfloat] if not TEST_WITH_ROCM else [],
    *[torch.cdouble] if not TEST_WITH_ROCM else []))
Suggested change:
- @dtypesIfCUDA(*floating_types_and(
-     *[torch.cfloat] if not TEST_WITH_ROCM else [],
-     *[torch.cdouble] if not TEST_WITH_ROCM else []))
+ @dtypesIfCUDA(*floating_types_and(
+     *[torch.cfloat, torch.cdouble] if not TEST_WITH_ROCM else []))
same everywhere else.
TORCH_CHECK(false, "torch.linalg.lstsq: Batched version is supported only with cuBLAS backend.")
#else
#ifdef ROCM_VERSION
#if defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
Should we throw a better error if the version is lower?
Done. Added an #elif to check for the lower version; if so, it falls back to the previous value of rocblas_operation_none.
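For readers following along, a rough sketch of the shape of that guard, assuming the value being selected is the transpose-op constant passed to the batched driver (the enum constants are real cuBLAS/hipBLAS/rocBLAS names, but the exact variable and placement are illustrative, not the PR diff):

```cpp
// Sketch only: gate the new behaviour on ROCm >= 5.4 and keep the old default otherwise.
#if defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
  auto trans = HIPBLAS_OP_N;             // batched hipBLAS/hipSOLVER drivers are available
#elif defined(ROCM_VERSION)
  auto trans = rocblas_operation_none;   // older ROCm: fall back to the previous value
#else
  auto trans = CUBLAS_OP_N;              // CUDA builds keep the cuBLAS constant
#endif
```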
#endif // ifdef USE_LINALG_SOLVER && !USE_ROCM
#else // No cublas or cusolver
#else
#ifdef CUDART_VERSION
This change is not sound. The path below should only be taken when there is no cublas or cusolver. Now, perhaps it can be removed completely? wdyt @IvanYashchuk?
// Particular case when multiplying A^{-1}B where B is square
// In this case doing two triangular solves is almost always fastest
if (n == k) {
#ifdef CUDART_VERSION
Does this mean we always have access to cublas/cusolver?
Oops. I deleted this during testing to exercise the path on ROCm. Instead of deleting it, this should be a check against USE_LINALG_SOLVER.
void geqrf_kernel(const Tensor& input, const Tensor& tau) {
#ifdef CUDART_VERSION
#if defined(CUDART_VERSION) || defined(USE_ROCM)
Should we just delete this #if, as you did in the other changes below?
I believe this #ifdef is needed because, in the event that we are not using CUDA or ROCm, the logic just defaults to taking the MAGMA route on line 1874. At least that is my understanding of the code. Please correct me if I am wrong.
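A condensed sketch of that control flow, with the backend entry points reduced to placeholders (geqrf_backend_solver / geqrf_backend_magma are illustrative names, not the real ATen functions; the real dispatch is more involved):

```cpp
#include <ATen/core/Tensor.h>
using at::Tensor;

// Placeholder backends; names and bodies are illustrative only.
static void geqrf_backend_solver(const Tensor& /*input*/, const Tensor& /*tau*/) { /* cuSOLVER / hipSOLVER */ }
static void geqrf_backend_magma(const Tensor& /*input*/, const Tensor& /*tau*/)  { /* MAGMA fallback */ }

void geqrf_kernel(const Tensor& input, const Tensor& tau) {
#if defined(CUDART_VERSION) || defined(USE_ROCM)
  // cuSOLVER on CUDA builds, hipSOLVER on ROCm builds.
  geqrf_backend_solver(input, tau);
#else
  // Neither solver library is present: fall through to the MAGMA route.
  geqrf_backend_magma(input, tau);
#endif
}
```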
constexpr bool looped_correct = CUSOLVER_VERSION >= 11100;
if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) {
#else
bool looped_correct = false;
Why not set this to true?
aten/src/ATen/cuda/CUDABlas.h (Outdated)
#else

#ifdef CUDART_VERSION
Or simply `#elif defined(CUDART_VERSION)`.
* Skip test_qr_batched; ROCm doesn't support QR decomp for complex dtype
* Skip complex types, hipsolver does not support
* Skip complex types in other batched tests as well
@pytorchbot label ciflow/trunk
Hi @nikitaved, @lezcano, this is ready for final review. It is failing the following 3 unrelated and unstable test cases.
This is looking quite good, but I'd like to make sure we're not breaking some non-standard build.
There are many changes that are not semantics preserving. I pointed out a few.
Now, there are many cases where we do things like if ROCM ... elif cusolver ... else. Is it possible to build without ROCM or cusolver support? cc @malfet
If we always have either cusolver or hipsolver, then a fair amount of the code could be simplified. In particular, defined(CUDART_VERSION) || defined(ROCM_VERSION) would always hold, and quite a bit of the branching could go away.
If there are cases where we don't have cusolver or hipsolver, then we should try to build in those cases with this patch, as we may be breaking them.
Once we know the answer to the question above, we should write a note for future developers somewhere.
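To make that concrete, this is the kind of branching in question and how it would collapse under the assumption that one of the two solver backends is always present (purely illustrative, not a proposed diff; the error message is made up):

```cpp
// Today: a three-way guard, with a "no solver at all" tail that may or may not be reachable.
#if defined(CUDART_VERSION)
  // cuSOLVER / cuBLAS batched implementation
#elif defined(ROCM_VERSION)
  // hipSOLVER / hipBLAS batched implementation
#else
  TORCH_CHECK(false, "This operation requires cuSOLVER or hipSOLVER.");
#endif

// If defined(CUDART_VERSION) || defined(ROCM_VERSION) always held, the #else branch
// above would be dead code and the guard could disappear entirely.
```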
}

// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda
#ifdef CUDART_VERSION
Is cusolver always installed when USE_ROCM is False? I think this is not true.
I believe you are correct. There could be instances where neither solver is installed and it instead uses magma or some other LAPACK library. Did I add logic that assumes cusolver is installed when USE_ROCM is False?
Let's see if @malfet's approach is possible, which would heavily reduce the amount of code needed. If not, it'd be good to put together a build without rocm or cusolver (if such a build exists, I'm not sure) and then try to build and run the tests, see if it's correct.
cc @IvanYashchuk who surely knows the answer as to whether we can build without cusolver and hipsolver.
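One possible framing, sketched under the assumption that a single feature macro can be derived at build time: call sites would then test one name instead of reasoning about cuSOLVER vs hipSOLVER vs "neither". HAS_BATCHED_LINALG_DRIVERS is a hypothetical name and the exact conditions are illustrative:

```cpp
// Illustrative only: derive one feature macro from the build configuration.
#if defined(CUDART_VERSION) && defined(CUSOLVER_VERSION)
  #define HAS_BATCHED_LINALG_DRIVERS 1   // CUDA build with cuSOLVER
#elif defined(USE_ROCM) && defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
  #define HAS_BATCHED_LINALG_DRIVERS 1   // ROCm build with hipSOLVER batched support
#else
  #define HAS_BATCHED_LINALG_DRIVERS 0   // e.g. MAGMA-only or CPU-only builds
#endif
```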
template <>
void getrfBatched<double>(
    int n, double** dA_array, int ldda, int* ipiv_array, int* info_array, int batchsize) {
    CUDABLAS_GETRF_BATCHED_ARGTYPES(double)) {
Thank you!
aten/src/ATen/cuda/CUDABlas.cpp (Outdated)
#else

#ifdef CUDART_VERSION
nit
Suggested change:
- #else
- #ifdef CUDART_VERSION
+ #elif defined(CUDART_VERSION)
Same in the other occurrences.
aten/src/ATen/cuda/CUDABlas.h (Outdated)
TORCH_CUDA_CU_API void getrsBatched<c10::complex<double>>(HIPBLAS_GETRS_ARGTYPES(c10::complex<double>));

#else
Ditto. Grep for all of these and write `#elif defined(...)`, as it's easier to follow.
Whoops, ignore the previous comment.
template<class Dtype>
void getrfBatched(CUDABLAS_GETRF_ARGTYPES(Dtype)) {
void getrfBatched(CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype)) {
Nit: same for CUDABLAS_GETRS_ARGTYPES, but yeah.
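For context, the naming pattern under discussion looks roughly like this; the argument list is taken from the getrfBatched<double> snippet above, while the fallback body and message are illustrative:

```cpp
// The *_BATCHED_ARGTYPES macro spells out the argument list once, so the generic
// template and every explicit specialization stay in sync.
#define CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype) \
  int n, Dtype** dA_array, int ldda, int* ipiv_array, int* info_array, int batchsize

template <class Dtype>
void getrfBatched(CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype)) {
  // Generic fallback; the real specializations forward to cublas<T>getrfBatched
  // (or the hipBLAS equivalent on ROCm).
  TORCH_CHECK(false, "at::cuda::blas::getrfBatched: not implemented for this dtype");
}
```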
aten/src/ATen/cuda/CUDABlas.h (Outdated)
template <>
TORCH_CUDA_CU_API void geqrfBatched<c10::complex<float>>(
    HIPBLAS_GEQRF_BATCHED_ARGTYPES(c10::complex<float>));
#else
This used to be within an #ifdef CUDART_VERSION.
}
};

#ifdef CUDART_VERSION
This doesn't seem correct.
Oops, yeah, this should definitely have been changed to USE_LINALG_SOLVER instead of being removed.
// AMD ROCm backend is implemented via rewriting all CUDA calls to HIP
// rocBLAS does not implement BLAS-like extensions of cuBLAS, they're in rocSOLVER
// rocSOLVER is currently not used in ATen, therefore we raise an error in this case
#ifndef CUDART_VERSION
Shouldn't all of these now be #ifndef USE_LINALG_SOLVER?
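For concreteness, the swap being asked about would look something like this; only the macro in the guard is the point, and the error message is illustrative:

```cpp
// Before: the "raise an error" path was keyed off the absence of cuSOLVER only,
// which also disabled ROCm builds that do have hipSOLVER.
// After (as asked above): key it off the absence of any linalg solver backend.
#ifndef USE_LINALG_SOLVER
  TORCH_CHECK(false,
      "BLAS-like batched extensions require cuSOLVER or hipSOLVER, "
      "but this build has neither.");
#else
  // batched cuBLAS / hipBLAS implementation
#endif
```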
Large chunks of code seem to be copy-and-pasted in this PR instead of modifying the hipifier to take care of them.
Please either elaborate in the PR description on why the HIPifier cannot be used here, or, better, adjust it to take care of the duplications. Also, if you are adding/removing constraints that are not specific to ROCm, please mention in the PR description why they should no longer apply.
| "The QR decomposition is not differentiable when mode='complete' and nrows > ncols"): | ||
| b.backward() | ||
|
|
||
| @skipCUDAIfNoCusolver |
Why are you removing this guard?
Whoops, that was leftover from internal testing.
auto trans = CUBLAS_OP_N;
#endif

#if defined(CUDART_VERSION) || (defined(ROCM_VERSION) && (ROCM_VERSION >= 50400))
Why introduce a CUDART_VERSION check there, when this code should only be compiled for either CUDA or ROCm?
Suggested change:
- #if defined(CUDART_VERSION) || (defined(ROCM_VERSION) && (ROCM_VERSION >= 50400))
+ #if !defined(ROCM_VERSION) || (ROCM_VERSION >= 50400)
#ifdef USE_ROCM
#define TORCH_HIPBLAS_CHECK(EXPR)                  \
  do {                                             \
    hipblasStatus_t __err = EXPR;                  \
    TORCH_CHECK(__err == HIPBLAS_STATUS_SUCCESS,   \
                "CUDA error: ",                    \
                " when calling `" #EXPR "`");      \
  } while (0)
#endif
Why is this needed? Shouldn't the hipifier just replace TORCH_CUDABLAS_CHECK with TORCH_HIPBLAS_CHECK?
template <>
void trsm<float>(HIPBLAS_TRSM_ARGTYPES(float)) {
  TORCH_HIPBLAS_CHECK(cublasStrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}

template <>
void trsm<double>(HIPBLAS_TRSM_ARGTYPES(double)) {
  TORCH_HIPBLAS_CHECK(cublasDtrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
This looks like a verbatim copy of pytorch/aten/src/ATen/cuda/CUDABlas.cpp, lines 985 to 995 in 854fe47:
template <>
void trsm<float>(CUDABLAS_TRSM_ARGTYPES(float)) {
  TORCH_CUDABLAS_CHECK(cublasStrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
template <>
void trsm<double>(CUDABLAS_TRSM_ARGTYPES(double)) {
  TORCH_CUDABLAS_CHECK(cublasDtrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
which makes me wonder why the hipifier cannot take care of that one?
@malfet Thank you for your comments! I am in the midst of addressing them now and seeing if we can get by with hipification. I think the original motivation for not using hipification was that some hipblas and rocblas types were not easily interchangeable, a problem that may have been addressed in later ROCm versions. I am looking into that now. Also, any instance of this PR removing constraints that are NOT ROCm-specific is an oversight on my part. I have addressed the locations you have pointed out in my current working version. Please continue to scrutinize every bit of this PR, as I believe we all want this to be as high-quality as possible. Thanks again for your time in reviewing this :) A new patch set will come in soon!
@malfet @lezcano I have investigated the hipification request. Unfortunately it uncovered a known limitation. Currently, …
Given the size of this PR, it may be simpler to wait until those fixes land, and then rebase this one on top of that one and heavily simplify it.
Understood. I'm going to push up the things I fixed in terms of the other comments in order to save my place.
Here is the work in question: #105881
This is a follow-up to #105881 and replaces #103203. The batched linalg drivers from 103203 were brought in as part of the first PR. This change enables the ROCm unit tests that were enabled as a result of that change, along with a fix to prioritize hipsolver over magma when the preferred linalg backend is set to `default`.

The following 16 unit tests will be enabled for ROCm in this change:
- test_inverse_many_batches_cuda*
- test_inverse_errors_large_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_lu_solve_batched_many_batches_cuda*

Pull Request resolved: #106620
Approved by: https://github.com/lezcano
Enables the following tests for ROCm along with support for various batched linalg drivers:
- test_inverse_errors_large_cuda*
- test_qr_batched_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_ormqr_cuda_complex*
- test_householder_product_cuda_complex*
- test_geqrf_cuda_complex*