Integrate hipsolver batched linalg drivers #103203
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103203
Note: Links to docs will display an error until the docs builds have been completed.
✅ 3 Unrelated Failures (as of commit 7f864fb):
BROKEN TRUNK - The following jobs failed but were present on the merge base 04da0c7. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Left a number of comments and questions.
You should probably also activate the relevant OpInfo tests
test/test_linalg.py (Outdated)
@precisionOverride({torch.float32: 1e-3, torch.complex64: 1e-3,
                    torch.float64: 1e-8, torch.complex128: 1e-8})
def test_lu_solve_batched(self, device, dtype):
    torch.backends.cuda.preferred_linalg_library('cusolver')
This shouldn't be needed.
test/test_linalg.py (Outdated)
@dtypesIfCUDA(*floating_types_and(
    *[torch.cfloat] if not TEST_WITH_ROCM else [],
    *[torch.cdouble] if not TEST_WITH_ROCM else []))
Suggested change:
- @dtypesIfCUDA(*floating_types_and(
-     *[torch.cfloat] if not TEST_WITH_ROCM else [],
-     *[torch.cdouble] if not TEST_WITH_ROCM else []))
+ @dtypesIfCUDA(*floating_types_and(
+     *[torch.cfloat, torch.cdouble] if not TEST_WITH_ROCM else []))
same everywhere else.
TORCH_CHECK(false, "torch.linalg.lstsq: Batched version is supported only with cuBLAS backend.")
#else
#ifdef ROCM_VERSION
#if defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
Should we throw a better error if the version is lower?
Done. Added an #elif to check for the lower version; if so, it falls back to the previous value of rocblas_operation_none.
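For readers following along, a rough sketch of the shape of that guard, assuming the value being selected is the transpose-op constant passed to the batched driver (the enum constants are real cuBLAS/hipBLAS/rocBLAS names, but the exact variable and placement are illustrative, not the PR diff):

```cpp
// Sketch only: gate the new behaviour on ROCm >= 5.4 and keep the old default otherwise.
#if defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
  auto trans = HIPBLAS_OP_N;             // batched hipBLAS/hipSOLVER drivers are available
#elif defined(ROCM_VERSION)
  auto trans = rocblas_operation_none;   // older ROCm: fall back to the previous value
#else
  auto trans = CUBLAS_OP_N;              // CUDA builds keep the cuBLAS constant
#endif
```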
#endif // ifdef USE_LINALG_SOLVER && !USE_ROCM
#else // No cublas or cusolver
#else
#ifdef CUDART_VERSION
This change is not sound. The path below should only be taken when there is no cublas or cusolver. Now, perhaps it can be removed completely? wdyt @IvanYashchuk?
// Particular case when multiplying A^{-1}B where B is square
// In this case doing two triangular solves is almost always fastest
if (n == k) {
#ifdef CUDART_VERSION
Does this mean we always have access to cublas/cusolver?
Oops. I deleted this during testing to exercise the path on ROCm. Instead of deleting it, this should be a check against USE_LINALG_SOLVER.
void geqrf_kernel(const Tensor& input, const Tensor& tau) {
#ifdef CUDART_VERSION
#if defined(CUDART_VERSION) || defined(USE_ROCM)
Should we just delete this #if, as you did in the other changes below?
I believe this #ifdef is needed because, in the event that we are not using CUDA or ROCm, the logic just defaults to taking the MAGMA route on line 1874. At least that is my understanding of the code. Please correct me if I am wrong.
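A condensed sketch of that control flow, with the backend entry points reduced to placeholders (geqrf_backend_solver / geqrf_backend_magma are illustrative names, not the real ATen functions; the real dispatch is more involved):

```cpp
#include <ATen/core/Tensor.h>
using at::Tensor;

// Placeholder backends; names and bodies are illustrative only.
static void geqrf_backend_solver(const Tensor& /*input*/, const Tensor& /*tau*/) { /* cuSOLVER / hipSOLVER */ }
static void geqrf_backend_magma(const Tensor& /*input*/, const Tensor& /*tau*/)  { /* MAGMA fallback */ }

void geqrf_kernel(const Tensor& input, const Tensor& tau) {
#if defined(CUDART_VERSION) || defined(USE_ROCM)
  // cuSOLVER on CUDA builds, hipSOLVER on ROCm builds.
  geqrf_backend_solver(input, tau);
#else
  // Neither solver library is present: fall through to the MAGMA route.
  geqrf_backend_magma(input, tau);
#endif
}
```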
constexpr bool looped_correct = CUSOLVER_VERSION >= 11100;
if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) {
#else
bool looped_correct = false;
Why not set this to true?
aten/src/ATen/cuda/CUDABlas.h (Outdated)
#else

#ifdef CUDART_VERSION
Or simply `#elif defined(CUDART_VERSION)`.
* Skip test_qr_batched; ROCm doesn't support QR decomp for complex dtype
* Skip complex types, hipsolver does not support
* Skip complex types in other batched tests as well
@pytorchbot label ciflow/trunk
Hi @nikitaved, @lezcano, this is ready for final review. It is failing the following 3 unrelated and unstable test cases.
This is looking quite good, but I'd like to make sure we're not breaking some non-standard build.
There are many changes that are not semantics preserving. I pointed out a few.
Now, there are many cases where we do things like if ROCM ... elif cusolver ... else. Is it possible to build without ROCM or cusolver support? cc @malfet
If we always have either cusolver or hipsolver, then a fair amount of the code could be simplified. In particular, defined(CUDART_VERSION) || defined(ROCM_VERSION) would always hold, and quite a bit of the branching could go away.
If there are cases where we don't have cusolver or hipsolver, then we should try to build in those cases with this patch, as we may be breaking them.
Once we know the answer to the question above, we should write a note for future developers somewhere.
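To make that concrete, this is the kind of branching in question and how it would collapse under the assumption that one of the two solver backends is always present (purely illustrative, not a proposed diff; the error message is made up):

```cpp
// Today: a three-way guard, with a "no solver at all" tail that may or may not be reachable.
#if defined(CUDART_VERSION)
  // cuSOLVER / cuBLAS batched implementation
#elif defined(ROCM_VERSION)
  // hipSOLVER / hipBLAS batched implementation
#else
  TORCH_CHECK(false, "This operation requires cuSOLVER or hipSOLVER.");
#endif

// If defined(CUDART_VERSION) || defined(ROCM_VERSION) always held, the #else branch
// above would be dead code and the guard could disappear entirely.
```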
}

// This guards blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda
#ifdef CUDART_VERSION
Is cusolver always installed when USE_ROCM is False? I think this is not true.
I believe you are correct. There could be instances where neither solver is installed and it instead uses magma or some other LAPACK library. Did I add logic that assumes cusolver is installed when USE_ROCM is False?
Let's see if @malfet's approach is possible, which would heavily reduce the amount of code needed. If not, it'd be good to put together a build without rocm or cusolver (if such a build exists, I'm not sure) and then try to build and run the tests, see if it's correct.
cc @IvanYashchuk who surely knows the answer as to whether we can build without cusolver and hipsolver.
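One possible framing, sketched under the assumption that a single feature macro can be derived at build time: call sites would then test one name instead of reasoning about cuSOLVER vs hipSOLVER vs "neither". HAS_BATCHED_LINALG_DRIVERS is a hypothetical name and the exact conditions are illustrative:

```cpp
// Illustrative only: derive one feature macro from the build configuration.
#if defined(CUDART_VERSION) && defined(CUSOLVER_VERSION)
  #define HAS_BATCHED_LINALG_DRIVERS 1   // CUDA build with cuSOLVER
#elif defined(USE_ROCM) && defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
  #define HAS_BATCHED_LINALG_DRIVERS 1   // ROCm build with hipSOLVER batched support
#else
  #define HAS_BATCHED_LINALG_DRIVERS 0   // e.g. MAGMA-only or CPU-only builds
#endif
```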
template <>
void getrfBatched<double>(
    int n, double** dA_array, int ldda, int* ipiv_array, int* info_array, int batchsize) {
    CUDABLAS_GETRF_BATCHED_ARGTYPES(double)) {
Thank you!
aten/src/ATen/cuda/CUDABlas.cpp (Outdated)
#else

#ifdef CUDART_VERSION
nit
Suggested change:
- #else
- #ifdef CUDART_VERSION
+ #elif defined(CUDART_VERSION)
Same in the other occurrences.
aten/src/ATen/cuda/CUDABlas.h (Outdated)
TORCH_CUDA_CU_API void getrsBatched<c10::complex<double>>(HIPBLAS_GETRS_ARGTYPES(c10::complex<double>));

#else
Ditto. Grep for all of these and write `#elif defined(...)`, as it's easier to follow.
Whoops, ignore the previous comment.
template<class Dtype>
void getrfBatched(CUDABLAS_GETRF_ARGTYPES(Dtype)) {
void getrfBatched(CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype)) {
Nit: same for CUDABLAS_GETRS_ARGTYPES, but yeah.
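For context, the naming pattern under discussion looks roughly like this; the argument list is taken from the getrfBatched<double> snippet above, while the fallback body and message are illustrative:

```cpp
// The *_BATCHED_ARGTYPES macro spells out the argument list once, so the generic
// template and every explicit specialization stay in sync.
#define CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype) \
  int n, Dtype** dA_array, int ldda, int* ipiv_array, int* info_array, int batchsize

template <class Dtype>
void getrfBatched(CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype)) {
  // Generic fallback; the real specializations forward to cublas<T>getrfBatched
  // (or the hipBLAS equivalent on ROCm).
  TORCH_CHECK(false, "at::cuda::blas::getrfBatched: not implemented for this dtype");
}
```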
aten/src/ATen/cuda/CUDABlas.h (Outdated)
template <>
TORCH_CUDA_CU_API void geqrfBatched<c10::complex<float>>(
    HIPBLAS_GEQRF_BATCHED_ARGTYPES(c10::complex<float>));
#else
This used to be within an #ifdef CUDART_VERSION.
}
};

#ifdef CUDART_VERSION
This doesn't seem correct.
Oops, yeah, this should definitely have been changed to USE_LINALG_SOLVER instead of being removed.
// AMD ROCm backend is implemented via rewriting all CUDA calls to HIP
// rocBLAS does not implement BLAS-like extensions of cuBLAS, they're in rocSOLVER
// rocSOLVER is currently not used in ATen, therefore we raise an error in this case
#ifndef CUDART_VERSION
Shouldn't all of these now be #ifndef USE_LINALG_SOLVER?
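For concreteness, the swap being asked about would look something like this; only the macro in the guard is the point, and the error message is illustrative:

```cpp
// Before: the "raise an error" path was keyed off the absence of cuSOLVER only,
// which also disabled ROCm builds that do have hipSOLVER.
// After (as asked above): key it off the absence of any linalg solver backend.
#ifndef USE_LINALG_SOLVER
  TORCH_CHECK(false,
      "BLAS-like batched extensions require cuSOLVER or hipSOLVER, "
      "but this build has neither.");
#else
  // batched cuBLAS / hipBLAS implementation
#endif
```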
Large chunks of code seem to be copy-and-pasted in this PR instead of modifying the hipifier to take care of them.
Please either elaborate in the PR description on why the HIPifier cannot be used here, or, better, adjust it to take care of the duplications. Also, if you are adding/removing constraints that are not specific to ROCm, please mention in the PR description why they should no longer apply.
| "The QR decomposition is not differentiable when mode='complete' and nrows > ncols"): | ||
| b.backward() | ||
|
|
||
| @skipCUDAIfNoCusolver |
Why are you removing this guard?
Whoops, that was leftover from internal testing.
auto trans = CUBLAS_OP_N;
#endif

#if defined(CUDART_VERSION) || (defined(ROCM_VERSION) && (ROCM_VERSION >= 50400))
Why introduce a CUDART_VERSION check there, when this code should only be compiled for either CUDA or ROCm?
Suggested change:
- #if defined(CUDART_VERSION) || (defined(ROCM_VERSION) && (ROCM_VERSION >= 50400))
+ #if !defined(ROCM_VERSION) || (ROCM_VERSION >= 50400)
#ifdef USE_ROCM
#define TORCH_HIPBLAS_CHECK(EXPR)                  \
  do {                                             \
    hipblasStatus_t __err = EXPR;                  \
    TORCH_CHECK(__err == HIPBLAS_STATUS_SUCCESS,   \
                "CUDA error: ",                    \
                " when calling `" #EXPR "`");      \
  } while (0)
#endif
Why is this needed? Shouldn't the hipifier just replace TORCH_CUDABLAS_CHECK with TORCH_HIPBLAS_CHECK?
template <>
void trsm<float>(HIPBLAS_TRSM_ARGTYPES(float)) {
  TORCH_HIPBLAS_CHECK(cublasStrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}

template <>
void trsm<double>(HIPBLAS_TRSM_ARGTYPES(double)) {
  TORCH_HIPBLAS_CHECK(cublasDtrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
This looks like a verbatim copy of pytorch/aten/src/ATen/cuda/CUDABlas.cpp, lines 985 to 995 in 854fe47:
template <>
void trsm<float>(CUDABLAS_TRSM_ARGTYPES(float)) {
  TORCH_CUDABLAS_CHECK(cublasStrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
template <>
void trsm<double>(CUDABLAS_TRSM_ARGTYPES(double)) {
  TORCH_CUDABLAS_CHECK(cublasDtrsm(
      handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
which makes me wonder why the hipifier cannot take care of that one?
@malfet Thank you for your comments! I am in the midst of addressing them now and seeing if we can get by with hipification. I think the original motivation for not using hipification was that some hipblas and rocblas types were not easily interchangeable, a problem that may have been addressed in later ROCm versions. I am looking into that now. Also, any instance of this PR removing constraints that are NOT ROCm-specific is an oversight on my part. I have addressed the locations you have pointed out in my current working version. Please continue to scrutinize every bit of this PR, as I believe we all want this to be as high-quality as possible. Thanks again for your time in reviewing this :) A new patch set will come in soon!
@malfet @lezcano I have investigated the hipification request. Unfortunately it uncovered a known limitation. Currently, …
Given the size of this PR, it may be simpler to wait until those fixes land, and then rebase this one on top of that one and heavily simplify it.
Understood. I'm going to push up the things I fixed in terms of the other comments in order to save my place.
Here is the work in question: #105881
This is a follow-up to #105881 and replaces #103203. The batched linalg drivers from 103203 were brought in as part of the first PR. This change enables the ROCm unit tests that were enabled as a result of that change, along with a fix to prioritize hipsolver over magma when the preferred linalg backend is set to `default`.

The following 16 unit tests will be enabled for ROCm in this change:
- test_inverse_many_batches_cuda*
- test_inverse_errors_large_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_lu_solve_batched_many_batches_cuda*

Pull Request resolved: #106620
Approved by: https://github.com/lezcano
Enables the following tests for ROCm along with support for various batched linalg drivers:
- test_inverse_errors_large_cuda*
- test_qr_batched_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_ormqr_cuda_complex*
- test_householder_product_cuda_complex*
- test_geqrf_cuda_complex*