
Conversation

@alugorey
Contributor

@alugorey alugorey commented Jun 7, 2023

Enables the following tests for ROCm along with support for various batched linalg drivers:
test_inverse_errors_large_cuda*
test_qr_batched_cuda*
test_linalg_solve_triangular_large_cuda*
test_ormqr_cuda_complex*
test_householder_product_cuda_complex*
test_geqrf_cuda_complex*

@pytorch-bot

pytorch-bot bot commented Jun 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103203

Note: Links to docs will display an error until the docs builds have been completed.

✅ 3 Unrelated Failures

As of commit 7f864fb:

BROKEN TRUNK - The following jobs failed but were present on the merge base 04da0c7:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend release notes category label Jun 7, 2023
@alugorey alugorey marked this pull request as draft June 7, 2023 21:33
Collaborator

@lezcano lezcano left a comment


Left a number of comments and questions.

You should probably also activate the relevant OpInfo tests.

@precisionOverride({torch.float32: 1e-3, torch.complex64: 1e-3,
torch.float64: 1e-8, torch.complex128: 1e-8})
def test_lu_solve_batched(self, device, dtype):
torch.backends.cuda.preferred_linalg_library('cusolver')
Collaborator

This shouldn't be needed.

Comment on lines 4753 to 4755
@dtypesIfCUDA(*floating_types_and(
*[torch.cfloat] if not TEST_WITH_ROCM else [],
*[torch.cdouble] if not TEST_WITH_ROCM else []))
Collaborator

Suggested change
@dtypesIfCUDA(*floating_types_and(
*[torch.cfloat] if not TEST_WITH_ROCM else [],
*[torch.cdouble] if not TEST_WITH_ROCM else []))
@dtypesIfCUDA(*floating_types_and(
*[torch.cfloat, torch.cdouble] if not TEST_WITH_ROCM else []))

Same everywhere else.

TORCH_CHECK(false, "torch.linalg.lstsq: Batched version is supported only with cuBLAS backend.")
#else
#ifdef ROCM_VERSION
#if defined(ROCM_VERSION) && (ROCM_VERSION >= 50400)
Collaborator

Should we throw a better error if the version is lower?

Contributor Author

Done. Added an #elif to check for the lower version and, if so, set it to the previous value of rocblas_operation_none.

#endif // ifdef USE_LINALG_SOLVER && !USE_ROCM
#else // No cublas or cusolver
#else
#ifdef CUDART_VERSION
Collaborator

This change is not sound. The path below should just be taken when there is no cublas or cusolver. Now, perhaps it can be removed completely? wdyt @IvanYashchuk?

// Particular case when multiplying A^{-1}B where B is square
// In this case doing two triangular solves is almost always fastest
if (n == k) {
#ifdef CUDART_VERSION
Collaborator

Does this mean we always have access to cublas/cusolver?

Contributor Author

Oops. I deleted this during testing to exercise the path on ROCm. Instead of deleting it, this should be a check against USE_LINALG_SOLVER.


void geqrf_kernel(const Tensor& input, const Tensor& tau) {
#ifdef CUDART_VERSION
#if defined(CUDART_VERSION) || defined(USE_ROCM)
Collaborator

Should we just delete this if, as you did in the other changes below?

Contributor Author

I believe this #ifdef is needed because, in the event that we are not using CUDA or ROCm, the logic just defaults to the magma route on line 1874. At least, that is my understanding of the code. Please correct me if I am wrong.

constexpr bool looped_correct = CUSOLVER_VERSION >= 11100;
if (m != n || (looped_correct && (batch_size == 1 || m >= 512))) {
#else
bool looped_correct = false;
Collaborator

Why not set this to true?

Comment on lines 431 to 433
#else

#ifdef CUDART_VERSION
Collaborator

Or simply `#elif defined(CUDART_VERSION)`.

@alugorey alugorey force-pushed the upstream_batched_hipsolver branch from 9e033a0 to 1cea7db Compare June 12, 2023 21:13
@IvanYashchuk IvanYashchuk removed their request for review June 13, 2023 07:34
alugorey and others added 4 commits July 21, 2023 15:13
* Skip test_qr_batched; ROCM doesn't support QR decomp for complex dtype

* Skip complex types, hipsolver does not support

* Skip complex types in other batched tests as well
@alugorey alugorey force-pushed the upstream_batched_hipsolver branch from b8ed817 to 85894aa Compare July 21, 2023 19:28
@alugorey
Contributor Author

@pytorchbot label ciflow/trunk

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 21, 2023
@alugorey alugorey force-pushed the upstream_batched_hipsolver branch from 85894aa to ab462bf Compare July 21, 2023 19:54
@alugorey alugorey force-pushed the upstream_batched_hipsolver branch from ab462bf to 73b597a Compare July 21, 2023 19:55
@alugorey
Contributor Author

Hi @nikitaved, @lezcano, this is ready for final review. It is failing the following 3 unrelated, unstable test cases:
- cudnn_extension error: trunk / linux-focal-rocm5.6-py3.8 / test (default, 2, 3, linux.rocm.gpu, unstable) (push)
- torch dynamo error: pull / linux-focal-py3.8-gcc7 / test (distributed, 2, 2, linux.2xlarge, unstable) (pull_request)
- torch dynamo error: pull / linux-bionic-cuda11.8-py3.10-gcc9 / test (distributed, 3, 3, linux.8xlarge.nvidia.gpu, unstable) (pull_request), failing after 11m

@alugorey alugorey marked this pull request as ready for review July 25, 2023 14:54
Collaborator

@lezcano lezcano left a comment


This is looking quite good, but I'd like to make sure we're not breaking some non-standard build.

There are many changes that are not semantics preserving. I pointed out a few.

Now, there are many cases where we do things like if ROCM ... elif cusolver ... else. Is it possible to build without ROCM or cusolver support? cc @malfet

If we always have either cusolver or hipsolver, then defined(CUDART_VERSION) || defined(ROCM_VERSION) would always hold, and a fair amount of the code could be simplified.

If there are cases where we don't have cusolver or hipsolver, then we should try to build in those cases with this patch, as we may be breaking them.

Once we know the answer to the question above, we should write a note for future developers somewhere.

}

// This guard blocks use of getrsBatched, geqrfBatched, getrfBatched on platforms other than cuda
#ifdef CUDART_VERSION
Collaborator

Is cusolver always installed when USE_ROCM is False? I think this is not true.

Contributor Author

I believe you are correct. There could be instances where neither solver is installed and it instead uses magma or some other LAPACK library. Did I add logic that assumes cusolver is installed when USE_ROCM is False?

Collaborator

Let's see if @malfet's approach is possible, which would heavily reduce the amount of code needed. If not, it'd be good to put together a build without rocm or cusolver (if such a build exists, I'm not sure) and then try to build and run the tests, see if it's correct.

cc @IvanYashchuk who surely knows the answer as to whether we can build without cusolver and hipsolver.

template <>
void getrfBatched<double>(
int n, double** dA_array, int ldda, int* ipiv_array, int* info_array, int batchsize) {
CUDABLAS_GETRF_BATCHED_ARGTYPES(double)) {
Collaborator

Thank you!

Comment on lines 1905 to 1907
#else

#ifdef CUDART_VERSION
Collaborator

nit

Suggested change
#else
#ifdef CUDART_VERSION
#elif defined(CUDART_VERSION)

Same in the other occurrences.

TORCH_CUDA_CU_API void getrsBatched<c10::complex<double>>(HIPBLAS_GETRS_ARGTYPES(c10::complex<double>));


#else
Collaborator

Ditto. Grep for all of these and write `#elif defined` as it's easier to follow.

Contributor Author

Whoops, ignore the previous comment.


template<class Dtype>
void getrfBatched(CUDABLAS_GETRF_ARGTYPES(Dtype)) {
void getrfBatched(CUDABLAS_GETRF_BATCHED_ARGTYPES(Dtype)) {
Collaborator

Nit. Same for CUDABLAS_GETRS_ARGTYPES, but yeah.

template <>
TORCH_CUDA_CU_API void geqrfBatched<c10::complex<float>>(
HIPBLAS_GEQRF_BATCHED_ARGTYPES(c10::complex<float>));
#else
Collaborator

This used to be within an #ifdef CUDART_VERSION.

}
};

#ifdef CUDART_VERSION
Collaborator

This doesn't seem correct.

Contributor Author

Oops, yeah, this should definitely have been changed to USE_LINALG_SOLVER instead of being removed.

// AMD ROCm backend is implemented via rewriting all CUDA calls to HIP
// rocBLAS does not implement BLAS-like extensions of cuBLAS, they're in rocSOLVER
// rocSOLVER is currently not used in ATen, therefore we raise an error in this case
#ifndef CUDART_VERSION
Collaborator

Shouldn't all of these now be #ifndef USE_LINALG_SOLVER?

Contributor

@malfet malfet left a comment


Large chunks of code seem to have been copy-pasted in this PR instead of modifying the hipifier to take care of them.
Please either elaborate in the PR description on why the HIPifier cannot be used here, or, better, adjust it to take care of the duplications. Also, if you are adding or removing constraints that are not specific to ROCm, please mention in the PR description why they should no longer apply.

"The QR decomposition is not differentiable when mode='complete' and nrows > ncols"):
b.backward()

@skipCUDAIfNoCusolver
Contributor

Why are you removing this guard?

Contributor Author

Whoops, that was left over from internal testing.

auto trans = CUBLAS_OP_N;
#endif

#if defined(CUDART_VERSION) || (defined(ROCM_VERSION) && (ROCM_VERSION >= 50400))
Contributor

Why introduce a CUDART_VERSION check here, as this code should only be compiled for either CUDA or ROCm?

Suggested change
#if defined(CUDART_VERSION) || (defined(ROCM_VERSION) && (ROCM_VERSION >= 50400))
#if !defined(ROCM_VERSION) || (ROCM_VERSION >= 50400)

Comment on lines +59 to +67
#ifdef USE_ROCM
#define TORCH_HIPBLAS_CHECK(EXPR) \
do { \
hipblasStatus_t __err = EXPR; \
TORCH_CHECK(__err == HIPBLAS_STATUS_SUCCESS, \
"CUDA error: ", \
" when calling `" #EXPR "`"); \
} while (0)
#endif
Contributor

Why is this needed? Shouldn't the hipifier just replace TORCH_CUDABLAS_CHECK with TORCH_HIPBLAS_CHECK?

Comment on lines +989 to +999
template <>
void trsm<float>(HIPBLAS_TRSM_ARGTYPES(float)) {
TORCH_HIPBLAS_CHECK(cublasStrsm(
handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}

template <>
void trsm<double>(HIPBLAS_TRSM_ARGTYPES(double)) {
TORCH_HIPBLAS_CHECK(cublasDtrsm(
handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
Contributor

This looks like a verbatim copy of

template <>
void trsm<float>(CUDABLAS_TRSM_ARGTYPES(float)) {
TORCH_CUDABLAS_CHECK(cublasStrsm(
handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}
template <>
void trsm<double>(CUDABLAS_TRSM_ARGTYPES(double)) {
TORCH_CUDABLAS_CHECK(cublasDtrsm(
handle, side, uplo, trans, diag, m, n, alpha, A, lda, B, ldb));
}

which makes me wonder why the hipifier cannot take care of that one?

@bdhirsh bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 27, 2023
@alugorey
Contributor Author

@malfet Thank you for your comments! I am in the midst of addressing them now and seeing if we can get by with hipification. I think the original motivation for not using hipification was that some hipblas and rocblas types were not easily interchangeable, a problem that may have been addressed in later ROCm versions. I am looking into that now.

Also, any instance of this PR removing constraints that are NOT ROCm-specific is an oversight on my part. I have addressed the locations you pointed out in my current working version. Please continue to scrutinize every bit of this PR, as I believe we all want this to be as high-quality as possible. Thanks again for your time in reviewing this :) A new patch set will come in soon!

@alugorey
Contributor Author

@malfet @lezcano I have investigated the hipification request. Unfortunately, it uncovered a known limitation. Currently, cuBLAS types and APIs are hipified to rocBLAS. However, the APIs introduced here require cuBLAS to be converted to hipBLAS. The issue is that there are many BLAS calls unrelated to the changes in this PR that still require rocBLAS, so if I were to blindly hipify all instances of cuBLAS to hipBLAS, I would break a whole lot of code unrelated to this PR. This is a known issue and is currently being worked on, with no hard ETA at the moment. The duplication of code in this PR that could seemingly otherwise have been hipified is due to this limitation. I am aware of the pieces of work required to remove this limitation and will submit a new PR once that code gets merged. Thoughts?

@lezcano
Collaborator

lezcano commented Jul 27, 2023

Given the size of this PR, it may be simpler to wait until those fixes land, and then rebase this one on top of that one and heavily simplify it.

@alugorey
Contributor Author

Understood. I'm going to push up the fixes for the other comments in order to save my place.

@alugorey
Contributor Author

Here is the work in question: #105881

@alugorey
Contributor Author

alugorey commented Aug 4, 2023

The contents of this PR were brought in as part of #105881. The lingering functionality not encompassed in that PR will be brought in here: #106620
Closing this PR. @malfet @lezcano

@alugorey alugorey closed this Aug 4, 2023
pytorchmergebot pushed a commit that referenced this pull request Aug 15, 2023
This is a follow up to #105881 and replaces #103203

The batched linalg drivers from 103203 were brought in as part of the first PR. This change enables the ROCm unit tests made possible by that work, along with a fix to prioritize hipsolver over magma when the preferred linalg backend is set to `default`.
The following 16 unit tests will be enabled for ROCm in this change:
- test_inverse_many_batches_cuda*
- test_inverse_errors_large_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_lu_solve_batched_many_batches_cuda*

Pull Request resolved: #106620
Approved by: https://github.com/lezcano
summerdo pushed a commit to summerdo/pytorch that referenced this pull request Aug 17, 2023