
Fix matrixVectorOp to verify promoted pointer type is still aligned to vectorized load boundary #325

Merged · 3 commits merged into rapidsai:branch-21.10 on Sep 24, 2021

Conversation

viclafargue
Contributor

Fix for rapidsai/cuml#3965
The function did not check for the memory alignment of the pointer provided.

@viclafargue viclafargue requested review from a team as code owners September 9, 2021 13:48
@github-actions github-actions bot added the cpp label Sep 9, 2021
@viclafargue viclafargue added 3 - Ready for Review bug Something isn't working non-breaking Non-breaking change and removed cpp labels Sep 9, 2021
@github-actions github-actions bot added the cpp label Sep 9, 2021
@viclafargue viclafargue changed the title to "Fix matrixVectorOp" Sep 9, 2021
@cjnolet
Member

cjnolet commented Sep 9, 2021

@viclafargue, RMM should be aligning allocations by default, which is why this has not been an issue for cuml algorithms. Before RMM, we did manually align allocations, but since RMM does this for us, we shouldn't have to explicitly check this in the algorithms (since the number of places this might need to be done could be exhaustive).

Hopefully the solution proposed here solves the problem for the user.

@viclafargue
Contributor Author

Thanks for finding a solution. That would solve the problem on the Python side. Maybe these settings should even be applied under the hood by default when cuML is imported? However, do we have a guarantee that pointers provided to the native API are always memory-aligned?

@cjnolet
Member

cjnolet commented Sep 15, 2021

However, do we have the guarantee that pointers provided to the native API are always memory aligned?

The assumption here is that users will be using RAPIDS and so we should be able to expect that the RAPIDS memory manager is being used to allocate any pointers that are handed to these prims.
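A minimal host-side sketch of this assumption (the helper name and the 16-byte threshold are illustrative; 16 bytes is the widest CUDA vectorized load, and allocators such as RMM typically align allocations far more coarsely, e.g. to 256 bytes, which is an assumption here rather than something this PR verifies):

```cpp
#include <cstdint>

// Returns true when `p` meets the 16-byte boundary required by the widest
// vectorized load. Pointers from an aligning allocator (e.g. RMM) would
// satisfy this by construction; arbitrary native-API pointers may not.
inline bool satisfies_vectorized_load(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```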

@viclafargue
Contributor Author

This PR is not necessary anymore.

@cjnolet cjnolet changed the title from "Fix matrixVectorOp" to "Fix matrixVectorOp to verify pointer is aligned to L1 load boundary" Sep 23, 2021
@cjnolet cjnolet reopened this Sep 23, 2021
@cjnolet
Member

cjnolet commented Sep 23, 2021

I've reopened this pull request after chatting w/ @viclafargue and now that I have a better understanding of how this could be causing the issue in rapidsai/cuml#4199.

Specifically, the matrixVectorOp attempts to make the application of vector elements to the matrix more efficient by loading multiple vector elements into a single CUDA thread. This alone won't cause an issue, but the kernel further optimizes the reads by loading the vector elements through the read-only L1 cache and, in so doing, might promote the datatype to minimize the number of reads (e.g. when 2x 8-byte elements need to be read, it will use 1x 16-byte datatype so that it can perform a single read). It's this promotion of the datatype that assumes the pointer is aligned to the promoted type. Thus, the pointer alignment needs to be explicitly checked when this promotion occurs. This is an internal assumption, specific to matrixVectorOp.
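The selection logic described above can be sketched on the host side as follows (a hypothetical helper, not the raft code: `choose_veclen`, `kMaxLoadBytes`, and the halving loop are illustrative assumptions about how a widest safe load could be picked):

```cpp
#include <cstddef>
#include <cstdint>

// Pick the widest vectorized load width (in elements of Type) such that
// both the row stride in bytes and the input pointers stay aligned to the
// promoted load type. Falls back toward scalar loads when alignment fails.
template <typename Type>
std::size_t choose_veclen(std::size_t stride_bytes, const void* matrix, const void* vec) {
  constexpr std::size_t kMaxLoadBytes = 16;  // widest CUDA vector load (e.g. float4)
  auto aligned = [](const void* p, std::size_t a) {
    return reinterpret_cast<std::uintptr_t>(p) % a == 0;
  };
  std::size_t load_bytes = kMaxLoadBytes;
  // Halve the load width until stride and both pointers are aligned to it.
  while (load_bytes > sizeof(Type) &&
         (stride_bytes % load_bytes != 0 || !aligned(matrix, load_bytes) ||
          !aligned(vec, load_bytes))) {
    load_bytes /= 2;
  }
  return load_bytes / sizeof(Type);
}
```

With a 16-byte-aligned pointer and a stride divisible by 16, four floats can be read per load; a pointer aligned only to 8 bytes drops the width to two.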

@@ -93,17 +93,23 @@ void matrixVectorOp(Type *out, const Type *matrix, const Type *vec, IdxType D,
                     IdxType N, bool rowMajor, bool bcastAlongRows, Lambda op,
                     cudaStream_t stream) {
   IdxType stride = rowMajor ? D : N;
-  size_t bytes = stride * sizeof(Type);
-  if (16 / sizeof(Type) && bytes % 16 == 0) {
+  size_t stride_bytes = stride * sizeof(Type);
Member


I think this PR looks good to merge for 21.10, but we should add a comment here that summarizes these low-level details and justifies the need for the additional check on pointer alignment. Before learning that the primitive was promoting the type to optimize the reads, I was under the impression the alignment issue was coming from the inputs. It would also be helpful to add some documentation to the TxN_t struct. It currently states: "Obviously, it's caller's responsibility to take care of pointer alignment!" but it would help to add a sentence or two justifying why (the description of the vectorized op itself isn't bad, but it still doesn't make particularly clear why the alignment is sometimes needed).
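A host-side analogy of why the caller must guarantee alignment for a TxN_t-style promotion (a hedged sketch; `Float4` and `can_promote_to_float4` are hypothetical names, and the real hazard occurs in device code where misaligned vector loads are undefined):

```cpp
#include <cstdint>

// Promotion treats 4 floats as one 16-byte chunk. Reading through a pointer
// reinterpreted to a wider type is only valid when the address is aligned
// to that wider type; this guard is the caller's responsibility.
struct alignas(16) Float4 { float x, y, z, w; };

inline bool can_promote_to_float4(const float* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(Float4) == 0;
}
```

An offset of even one float (4 bytes) from an aligned base breaks the 16-byte boundary, which is exactly the case the new check in this PR catches.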

@cjnolet cjnolet changed the title from "Fix matrixVectorOp to verify pointer is aligned to L1 load boundary" to "Fix matrixVectorOp to verify promoted pointer type is still aligned to vectorized load boundary" Sep 23, 2021
Member

@cjnolet cjnolet left a comment


LGTM

@cjnolet
Member

cjnolet commented Sep 24, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6d7f897 into rapidsai:branch-21.10 Sep 24, 2021