@rawnhenry
Collaborator

@rawnhenry rawnhenry commented Dec 6, 2020

This PR is joint work by myself and @weiya711.

List of new features:

  1. Support for hoisting allocation of workspace memory outside of the workspace loop. (Thanks to @weiya711)
  2. Allows TACO to 'accelerate' a dense workspace with sparse iteration. That is, if a workspace consumer is dense and is being stored into a sparse result, TACO will automatically track a list of indices and append only the relevant indices to the result tensor. This is useful for SpGEMM. For now, this is only enabled for serial CPU code.
  3. Changes the CUDA and C code generators so that unique names are not generated for pointer variables.
  4. Adds some tests for workspaces which exposed the bug below. (Thanks to @weiya711).
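
To make feature 2 concrete, here is a minimal hand-written sketch of the idea, not the code TACO emits. It accumulates updates for one result row into a dense workspace while recording which coordinates were touched, then appends only those coordinates to a sparse output. The function name `accumulate_row` and its parameters are made up for illustration, and sorting the index list before appending is an assumption about how an ordered compressed level would be filled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over int32_t coordinates. */
static int cmp_int32(const void *a, const void *b) {
  int32_t x = *(const int32_t *)a, y = *(const int32_t *)b;
  return (x > y) - (x < y);
}

/* Hypothetical helper: accumulates n (coordinate, value) updates into a
 * dense workspace of `dim` entries, then appends only the touched
 * coordinates to the sparse output (out_crd, out_vals) in sorted order.
 * The output arrays must have room for min(n, dim) entries.
 * Returns the number of stored entries, or -1 on allocation failure. */
int32_t accumulate_row(int32_t dim, int32_t n,
                       const int32_t *crd, const double *vals,
                       int32_t *out_crd, double *out_vals) {
  assert(dim > 0);
  /* w is left uninitialized on purpose: w_already_set guards every read. */
  double  *w = malloc((size_t)dim * sizeof(double));
  char    *w_already_set = calloc((size_t)dim, sizeof(char));
  int32_t *w_index_list = malloc((size_t)dim * sizeof(int32_t));
  int32_t  w_index_list_size = 0;
  if (!w || !w_already_set || !w_index_list) {
    free(w); free(w_already_set); free(w_index_list);
    return -1;
  }

  for (int32_t p = 0; p < n; p++) {
    int32_t j = crd[p];
    assert(j >= 0 && j < dim);
    if (!w_already_set[j]) {
      /* First touch of column j: assign and remember the coordinate. */
      w[j] = vals[p];
      w_index_list[w_index_list_size++] = j;
      w_already_set[j] = 1;
    } else {
      /* Later touches reduce into the existing value. */
      w[j] += vals[p];
    }
  }

  /* Only the tracked coordinates are visited, not all `dim` entries. */
  qsort(w_index_list, (size_t)w_index_list_size, sizeof(int32_t), cmp_int32);
  for (int32_t q = 0; q < w_index_list_size; q++) {
    int32_t j = w_index_list[q];
    out_crd[q] = j;
    out_vals[q] = w[j];
  }

  free(w); free(w_already_set); free(w_index_list);
  return w_index_list_size;
}
```

The point of the index list is that the append step costs time proportional to the number of touched entries (plus the sort) rather than to the full dimension of the workspace.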

Addresses the following bugs:

  1. Fixes a bug that caused derived variables on the RHS of an assignment to be incorrectly identified as reduction variables. This caused the check for concrete index notation to fail on valid input.

All tests passed with the GPU and CPU backends. Tests were run on a machine with the following configuration:

OS: Ubuntu 20.04.1 LTS
gcc: 7.5.0
g++: 7.5.0
CUDA: 10.2.89

GPU cmake command: cmake -DCMAKE_BUILD_TYPE=Release -DCUDA=ON ..
CPU cmake command: cmake -DCMAKE_BUILD_TYPE=Release ..

@stephenchouca
Contributor

The code that gets generated for SpGEMM (taco "A(i,j) = B(i,k) * C(k,j)" -f=A:ds -f=B:ds -f=C:ds) has a number of issues:

  • w_index_list_size needs to be reset after every iteration of the i loop but isn't in the generated code. (Alternatively, w_index_list_size could be local to the i loop.)
  • w_already_set[w_index_locator] = 0; needs to be w_already_set[j] = 0;.
  • The generated code hard-codes the size of the workspace (i.e., to 42), whereas ideally it would be set to whatever variable is used to represent the j dimension (e.g., C1_dimension). (Not sure how much work it'd be to implement this though.)
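
For reference, this is roughly what the first two bullets would look like once fixed. It is a hand-written sketch in the style of the generated code, not actual TACO output: the function name `spgemm_csr`, its signature, the `qsort` step, and the assumption that the caller has already sized `A2_crd` and `A_vals` are all illustrative. The workspace is sized by the number of columns of `C`, passed in here as `C2_dimension`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over int32_t coordinates. */
static int cmp_int32(const void *a, const void *b) {
  int32_t x = *(const int32_t *)a, y = *(const int32_t *)b;
  return (x > y) - (x < y);
}

/* Sketch of A = B * C with all three matrices in CSR.
 * A2_pos must hold B1_dimension + 1 entries; A2_crd and A_vals are assumed
 * to be large enough. Returns nnz(A), or -1 on allocation failure. */
int32_t spgemm_csr(int32_t B1_dimension, int32_t C2_dimension,
                   const int32_t *B2_pos, const int32_t *B2_crd, const double *B_vals,
                   const int32_t *C2_pos, const int32_t *C2_crd, const double *C_vals,
                   int32_t *A2_pos, int32_t *A2_crd, double *A_vals) {
  assert(C2_dimension > 0);
  double  *w = malloc((size_t)C2_dimension * sizeof(double));
  char    *w_already_set = calloc((size_t)C2_dimension, sizeof(char));
  int32_t *w_index_list = malloc((size_t)C2_dimension * sizeof(int32_t));
  if (!w || !w_already_set || !w_index_list) {
    free(w); free(w_already_set); free(w_index_list);
    return -1;
  }

  int32_t pA = 0;
  A2_pos[0] = 0;
  for (int32_t i = 0; i < B1_dimension; i++) {
    /* Local to the i loop, so it starts from zero for every row. */
    int32_t w_index_list_size = 0;

    for (int32_t kB = B2_pos[i]; kB < B2_pos[i + 1]; kB++) {
      int32_t k = B2_crd[kB];
      for (int32_t jC = C2_pos[k]; jC < C2_pos[k + 1]; jC++) {
        int32_t j = C2_crd[jC];
        if (!w_already_set[j]) {
          w[j] = B_vals[kB] * C_vals[jC];
          w_index_list[w_index_list_size++] = j;
          w_already_set[j] = 1;
        } else {
          w[j] += B_vals[kB] * C_vals[jC];
        }
      }
    }

    qsort(w_index_list, (size_t)w_index_list_size, sizeof(int32_t), cmp_int32);
    for (int32_t w_index_locator = 0; w_index_locator < w_index_list_size; w_index_locator++) {
      int32_t j = w_index_list[w_index_locator];
      A2_crd[pA] = j;
      A_vals[pA] = w[j];
      pA++;
      /* Clear the flag at the coordinate j, not at the locator position. */
      w_already_set[j] = 0;
    }
    A2_pos[i + 1] = pA;
  }

  free(w); free(w_already_set); free(w_index_list);
  return pA;
}
```

Because the flags are cleared only for coordinates in the index list, the per-row cleanup costs time proportional to the entries produced for that row rather than to the width of the workspace.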

Additionally, it might make sense now to enable some of the tests that were added (but kept disabled) in #325.

@stephenchouca
Contributor

I also noticed that for SpGEMM, the generated code still contains a loop that zeros out the entire workspace for every iteration of the i loop, which increases the asymptotic complexity of the code to O(N^2):

  for (int32_t i = 0; i < B1_dimension; i++) {
    for (int32_t pw = 0; pw < 42; pw++) {
      w[pw] = 0.0;
    }
    for (int32_t kB = B2_pos[i]; kB < B2_pos[(i + 1)]; kB++) {
      int32_t k = B2_crd[kB];
      for (int32_t jC = C2_pos[k]; jC < C2_pos[(k + 1)]; jC++) {
        int32_t j = C2_crd[jC];
        if (!w_already_set[j]) {
          w[j] = B_vals[kB] * C_vals[jC];
          w_index_list[w_index_list_size] = j;
          w_already_set[j] = 1;
          w_index_list_size++;
        }
        ...

@weiya711
Contributor

The optimization that removes the zeroInit loop for temporary workspaces wasn't done on this branch. However, I fixed this last week in my spatial branch and can add those changes to this branch before we approve the pull request.

@rawnhenry
Collaborator Author

Ah, this is actually something I removed and then re-enabled by accident in one of my last commits. I think you have to zero-initialize the workspace for any loops where we use the workspace multiple times, since we don't have cleanup code. For now, I just always emit the init loop. However, in this case the cleanup code exists, so I just need to check whether we are accelerating a dense workspace and omit the loop in that case.

@rawnhenry
Collaborator Author

rawnhenry commented Dec 16, 2020

Yeah, I'll fix the issues Stephen listed for the generated SpGEMM code and look into re-enabling the tests from #325 that he mentioned. I will fix the hard-coding of sizes if it's not too much implementation work. Thanks Stephen!

@weiya711
Contributor

weiya711 commented Dec 16, 2020

I think that if there is no cleanup code, you still don't need a zeroInit loop in the case where there is no reduction operator (+=). For example, if you always say t[it] = b[ib] * c[ic] for a dense workspace then you won't need to zero-initialize because even if you are re-using the workspace multiple times, it will always be set to a new, computed value.
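
A small sketch of that argument, with made-up names (`rowwise_mul_via_workspace`, `t`): the dense workspace is reused for every row and never zeroed, which is safe here because each entry is written with a plain `=` before the consumer reads it. This relies on every entry that is read having been written in the same outer iteration; with a reduction such as `t[j] += ...`, stale values from the previous row would leak into the result.

```c
#include <assert.h>
#include <stdint.h>

/* Computes a(i,j) = b(i,j) * c(i,j) for dense row-major matrices, routing
 * each row through a reused dense workspace t of `cols` entries.
 * t is never zero-initialized. */
void rowwise_mul_via_workspace(int32_t rows, int32_t cols,
                               const double *b, const double *c,
                               double *a, double *t) {
  assert(rows >= 0 && cols >= 0);
  for (int32_t i = 0; i < rows; i++) {
    /* Producer: plain assignment overwrites whatever row i - 1 left behind. */
    for (int32_t j = 0; j < cols; j++) {
      t[j] = b[i * cols + j] * c[i * cols + j];
    }
    /* Consumer: reads only entries that were written just above. */
    for (int32_t j = 0; j < cols; j++) {
      a[i * cols + j] = t[j];
    }
  }
}
```

Calling this with a workspace that already contains arbitrary values still gives the correct product, which is why the init loop can be dropped in the non-reduction case.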

I just tried SpGEMM on the original master branch of TACO, and it seems that on this branch the zeroInit loop simply moved inside the for(i = ...) {} loop, when it is supposed to sit before it, as on master:

  w = (double*)malloc(sizeof(double) * 42);
  for (int32_t pw = 0; pw < 42; pw++) {
    w[pw] = 0.0;
  }

  for (int32_t i = 0; i < B1_dimension; i++) {
    for (int32_t kB = B2_pos[i]; kB < B2_pos[(i + 1)]; kB++) {
      int32_t k = B2_crd[kB];
      for (int32_t jC = C2_pos[k]; jC < C2_pos[(k + 1)]; jC++) {
        int32_t j = C2_crd[jC];
        w[j] = w[j] + B_vals[kB] * C_vals[jC];
      }
    }

@stephenchouca
Contributor

Actually, I also just realized that the generated code for SpGEMM sets the size of the workspace arrays to be C1_dimension, but shouldn't it actually be C2_dimension (i.e., the number of columns)?

@rawnhenry
Collaborator Author

I don't think I touched the code that sets the temporary size. I could have accidentally broken something, though.

You're right that it should be C2_dimension. I'll check that as I fix the other issues.

@rawnhenry
Collaborator Author

@stephenchouca When I run this command:

./bin/taco "A(i,j) = B(i,k) * C(k,j)" -f=A:ds -f=B:ds -f=C:ds -d=i:4 -d=k:12 -d=j:120

The generated code sets the workspace size to 120, which seems correct. What example did you try when you noticed the incorrect behavior?

@stephenchouca
Contributor

It turned out I was actually setting the dimensions incorrectly; feel free to ignore my earlier comment. Sorry for the confusion.

@rawnhenry
Collaborator Author

rawnhenry commented Dec 25, 2020

I relaxed the requirements for the SPMM transformation, and some more tests from #325 now pass. The idea is to keep the transform off for algorithms/formats that don't perform or allow linear combinations of rows.

@rawnhenry
Collaborator Author

rawnhenry commented Dec 25, 2020

I changed the transform again so that if the LHS is CSR then it will include a workspace as long as:

  1. There are no permutations in the format
  2. All the level formats of the operands are ordered

I think this is OK since the i -> k -> j iteration order means we can iterate over each operand in order. I'm not sure if there's something I'm missing, though. In any case, most of the tests from #325 pass. The ones that fail seem to require transposes. Let me know your feedback on this.
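
As a rough illustration of the condition above, here is a standalone predicate. This is not TACO's implementation: `FormatDesc`, `workspace_allowed`, and the helper functions are hypothetical names, and I am reading "no permutations in the format" as "no operand stores its modes in a permuted order". The sketch covers only the CSR-result case and returns false for anything else as a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified description of a tensor format. */
typedef struct {
  int         order;          /* number of levels */
  const int  *mode_ordering;  /* mode_ordering[l] = mode stored at level l */
  const bool *level_ordered;  /* level_ordered[l] = level l is ordered */
  const bool *level_dense;    /* level_dense[l] = level l is dense */
} FormatDesc;

/* True if any level stores a mode other than its own position. */
static bool is_permuted(const FormatDesc *f) {
  for (int l = 0; l < f->order; l++) {
    if (f->mode_ordering[l] != l) return true;
  }
  return false;
}

static bool all_levels_ordered(const FormatDesc *f) {
  for (int l = 0; l < f->order; l++) {
    if (!f->level_ordered[l]) return false;
  }
  return true;
}

/* Simplification: any non-dense second level is treated as compressed. */
static bool is_csr(const FormatDesc *f) {
  return f->order == 2 && !is_permuted(f) &&
         f->level_dense[0] && !f->level_dense[1];
}

/* Condition 1: no operand format is permuted.
 * Condition 2: every level of every operand is ordered. */
bool workspace_allowed(const FormatDesc *lhs,
                       const FormatDesc *const *operands, size_t n) {
  assert(lhs != NULL);
  if (!is_csr(lhs)) return false;  /* placeholder for the non-CSR cases */
  for (size_t p = 0; p < n; p++) {
    if (is_permuted(operands[p]) || !all_levels_ordered(operands[p])) {
      return false;
    }
  }
  return true;
}
```

Under this reading, a CSC operand (mode ordering {1, 0}) counts as permuted, so the workspace would be left off for it.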

Tests from #325:

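| Result | A | B | C |
|--------|-------|-------|-------|
| pass | Dense | Dense | Dense |
| pass | Dense | Dense | CSR |
| pass | Dense | Dense | CSC |
| pass | Dense | Dense | COO |
| pass | Dense | CSR | Dense |
| pass | Dense | CSR | CSR |
| pass | Dense | CSR | CSC |
| pass | Dense | CSR | COO |
| pass | Dense | CSC | Dense |
| pass | Dense | CSC | CSR |
| pass | Dense | CSC | CSC |
| pass | Dense | CSC | COO |
| pass | Dense | COO | Dense |
| pass | Dense | COO | CSR |
| pass | Dense | COO | CSC |
| pass | Dense | COO | COO |
| pass | CSR | Dense | Dense |
| pass | CSR | Dense | CSR |
| pass | CSR | Dense | CSC |
| pass | CSR | Dense | COO |
| pass | CSR | CSR | Dense |
| pass | CSR | CSR | CSR |
| pass | CSR | CSR | CSC |
| pass | CSR | CSR | COO |
| FAIL | CSR | CSC | Dense |
| FAIL | CSR | CSC | CSR |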

@stephenchouca stephenchouca merged commit cb4731d into tensor-compiler:master Jan 13, 2021