-
Notifications
You must be signed in to change notification settings - Fork 196
Workspace reuse #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workspace reuse #339
Conversation
…over a dense workspace
… workspace_reuse
…ts loop to zero every element in a temporary when it is hoisted before the producer is called. Changes the codegens to keep pointer names constant
|
The code that gets generated for SpGEMM (
Additionally, it might make sense now to enable some of the tests that were added (but kept disabled) in #325. |
|
I also noticed that for SpGEMM, the generated code still contains a loop that zeros out the entire workspace for every iteration of the for (int32_t i = 0; i < B1_dimension; i++) {
for (int32_t pw = 0; pw < 42; pw++) {
w[pw] = 0.0;
}
for (int32_t kB = B2_pos[i]; kB < B2_pos[(i + 1)]; kB++) {
int32_t k = B2_crd[kB];
for (int32_t jC = C2_pos[k]; jC < C2_pos[(k + 1)]; jC++) {
int32_t j = C2_crd[jC];
if (!w_already_set[j]) {
w[j] = B_vals[kB] * C_vals[jC];
w_index_list[w_index_list_size] = j;
w_already_set[j] = 1;
w_index_list_size++;
}
... |
The optimization of removing the zeroInit Loop for temporary workspaces wasn't done on this branch. However, I did fix this last week in my spatial branch and can add those changes to this branch before we approve the merge/pull request |
Ah this is actually something I removed and re-enabled by accident in one of my last commits. I think you have to zero init the workspace for any loops where we use the workspace multiple times since we don't have clean up code. For now, I just always emit the init loop. However, in this case, the cleanup code exists do I just need to check if we are accelerating a dense workspace and omit the loop in that case. |
Yea, I'll fix those and check reenabling the tests you mentioned. I will fix the hard coding of sizes if its not too much implementation work. Thanks Stephen! |
I think that if there is no cleanup code, you still don't need a zeroInit loop in the case where there is no reduction operator (+=). For example, if you always say t[it] = b[ib] * c[ic] for a dense workspace then you won't need to zero-initialize because even if you are re-using the workspace multiple times, it will always be set to a new, computed value. I just tried SpGEMM on the original master branch of TACO and it just seems the zeroInit loop moved into the w = (double*)malloc(sizeof(double) * 42);
for (int32_t pw = 0; pw < 42; pw++) {
w[pw] = 0.0;
}
for (int32_t i = 0; i < B1_dimension; i++) {
for (int32_t kB = B2_pos[i]; kB < B2_pos[(i + 1)]; kB++) {
int32_t k = B2_crd[kB];
for (int32_t jC = C2_pos[k]; jC < C2_pos[(k + 1)]; jC++) {
int32_t j = C2_crd[jC];
w[j] = w[j] + B_vals[kB] * C_vals[jC];
}
} |
|
Actually, I also just realized that the generated code for SpGEMM sets the size of the workspace arrays to be |
|
I don't think I touched the code sets the temporary size. I could have accidentally broke something though. You're right that it should be C2_dimension. I'll check that as I fix the other issues. |
…nse workspace. This should make the transition to multithreading easier and fixes a bug in the original code
|
@stephenchouca When I run this command:
The generated code sets the workspace size to 120 which seems correct. What example did you try when you noticed the incorrect behavior? |
It turned out I was actually setting the dimensions incorrectly; feel free to ignore my earlier comment. Sorry for the confusion. |
…ce for the workspace based on the size of the sizes of the input tensors
|
I relaxed the requirements for the SPMM transformation and some more tests from #325 pass. The idea is to keep the transform off for algorithms/formats that don't do/allow linear combination of rows. |
|
I changed the transform again so that if the LHS is CSR then it will include a workspace as long as:
I think this is ok since the i -> k -> j iteration order means we can iterate over each operand in order. I'm not sure if there's something I'm missing though, In any case, most of the tests from #325 pass. The ones that don't seem to require transposes. Let me know your feedback on this. Tests from #325:
|
This PR is joint work by myself and @weiya711.
List of new features:
Addresses the following bugs:
All tests passed with the GPU and CPU backends. Tests were run on a machine with the following configuration:
OS: Ubuntu 20.04.1 LTS
gcc: 7.5.0
g++ 7.5.0
CUDA: 10.2.89
GPU cmake command: cmake -DCMAKE_BUILD_TYPE=Release -DCUDA=ON ..
CPU cmake command: cmake -DCMAKE_BUILD_TYPE=Release ..