Workspace reuse #339

rawnhenry · 2020-12-06T23:19:18Z

This PR is joint work by myself and @weiya711.

List of new features:

Support for hoisting allocation of workspace memory outside of the workspace loop. (Thanks to @weiya711)
Allows TACO to 'accelerate' a dense workspace with sparse iteration. That is, if a workspace consumer is dense and its values are being stored in a sparse result, TACO will automatically track a list of the indices that were written and append only those indices to the result tensor. This is useful for SpGEMM. For now, this is only enabled for serial CPU code.
Changes the CUDA and C code generators so that unique names are not generated for pointer variables.
Adds some tests for workspaces which exposed the bug below. (Thanks to @weiya711).
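
As a rough illustration of the dense-workspace acceleration described in the list above, here is a minimal hand-written sketch for one row of SpGEMM with CSR operands. It is not the code TACO emits: the function name `spgemm_row` and its parameter layout are assumptions made for this example, and only the names `w`, `w_already_set`, `w_index_list`, and `w_index_list_size` are borrowed from the generated code discussed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not TACO output): computes row i of A = B * C, with B
 * and C in CSR, using a dense workspace w plus a list of touched columns.
 * Appends the row's entries to A_crd/A_vals starting at A_nnz and returns the
 * new entry count. Columns are appended in first-touch order; this sketch
 * does not sort them. */
int32_t spgemm_row(int32_t i,
                   const int32_t *B_pos, const int32_t *B_crd, const double *B_vals,
                   const int32_t *C_pos, const int32_t *C_crd, const double *C_vals,
                   double *w, bool *w_already_set, int32_t *w_index_list,
                   int32_t *A_crd, double *A_vals, int32_t A_nnz) {
  /* Local to the row, so the list is empty at the start of every row. */
  int32_t w_index_list_size = 0;

  /* Producer: scatter partial products into the dense workspace. */
  for (int32_t kB = B_pos[i]; kB < B_pos[i + 1]; kB++) {
    int32_t k = B_crd[kB];
    for (int32_t jC = C_pos[k]; jC < C_pos[k + 1]; jC++) {
      int32_t j = C_crd[jC];
      double prod = B_vals[kB] * C_vals[jC];
      if (!w_already_set[j]) {
        w[j] = prod;                          /* first touch: plain assignment */
        w_index_list[w_index_list_size++] = j;
        w_already_set[j] = true;
      } else {
        w[j] += prod;                         /* later touches: accumulate */
      }
    }
  }

  /* Consumer: visit only the touched columns, then clear their flags so the
   * workspace can be reused for the next row without a full reset. */
  for (int32_t p = 0; p < w_index_list_size; p++) {
    int32_t j = w_index_list[p];
    A_crd[A_nnz] = j;
    A_vals[A_nnz] = w[j];
    A_nnz++;
    w_already_set[j] = false;
  }
  return A_nnz;
}
```

The caller is assumed to allocate `w`, `w_already_set` (all false), and `w_index_list` once, sized to the number of columns of C, and to call the helper once per row.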

Addresses the following bugs:

Fixes bug causing derived variables on the RHS of an assignment to be incorrectly identified as reduction variables. This caused the check for concrete index notation to fail on valid input.

All tests passed with the GPU and CPU backends. Tests were run on a machine with the following configuration:

OS: Ubuntu 20.04.1 LTS
gcc: 7.5.0
g++: 7.5.0
CUDA: 10.2.89

GPU cmake command: cmake -DCMAKE_BUILD_TYPE=Release -DCUDA=ON ..
CPU cmake command: cmake -DCMAKE_BUILD_TYPE=Release ..

stephenchouca · 2020-12-15T23:11:23Z

The code that gets generated for SpGEMM (taco "A(i,j) = B(i,k) * C(k,j)" -f=A:ds -f=B:ds -f=C:ds) has a number of issues:

w_index_list_size needs to be reset after every iteration of the i loop but isn't in the generated code. (Alternatively, w_index_list_size could be local to the i loop.)
w_already_set[w_index_locator] = 0; needs to be w_already_set[j] = 0;.
The generated code hard-codes the size of the workspace (i.e., to 42), whereas ideally it would be set to whatever variable is used to represent the j dimension (e.g., C1_dimension). (Not sure how much work it'd be to implement this though.)

Additionally, it might make sense now to enable some of the tests that were added (but kept disabled) in #325.

stephenchouca · 2020-12-16T02:22:05Z

I also noticed that for SpGEMM, the generated code still contains a loop that zeros out the entire workspace for every iteration of the i loop, which increases the asymptotic complexity of the code to O(N^2):

  for (int32_t i = 0; i < B1_dimension; i++) {
    for (int32_t pw = 0; pw < 42; pw++) {
      w[pw] = 0.0;
    }
    for (int32_t kB = B2_pos[i]; kB < B2_pos[(i + 1)]; kB++) {
      int32_t k = B2_crd[kB];
      for (int32_t jC = C2_pos[k]; jC < C2_pos[(k + 1)]; jC++) {
        int32_t j = C2_crd[jC];
        if (!w_already_set[j]) {
          w[j] = B_vals[kB] * C_vals[jC];
          w_index_list[w_index_list_size] = j;
          w_already_set[j] = 1;
          w_index_list_size++;
        }
        ...
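
To make the cost difference concrete, the sketch below contrasts the two reset strategies. It is illustrative only: `reset_full` and `reset_touched` are hypothetical names, and each returns the number of writes it performs so the per-row cost can be compared directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: zero every workspace entry, as the per-row init loop
 * does. The cost is proportional to the workspace size, regardless of how
 * many entries the row actually touched. */
int64_t reset_full(double *w, int32_t w_size) {
  for (int32_t pw = 0; pw < w_size; pw++) {
    w[pw] = 0.0;
  }
  return w_size;
}

/* Illustrative only: clear just the flags of entries recorded in the index
 * list. The cost is proportional to the number of touched entries. */
int64_t reset_touched(bool *w_already_set, const int32_t *w_index_list,
                      int32_t w_index_list_size) {
  for (int32_t p = 0; p < w_index_list_size; p++) {
    w_already_set[w_index_list[p]] = false;
  }
  return w_index_list_size;
}
```

With a workspace of 8 entries and a row that touched 2 of them, the first function performs 8 writes and the second performs 2. Repeated over every row, the first strategy costs rows times columns even for very sparse inputs.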

weiya711 · 2020-12-16T02:25:05Z

The optimization of removing the zeroInit loop for temporary workspaces wasn't done on this branch. However, I did fix this last week in my spatial branch and can add those changes to this branch before we approve the merge/pull request.

rawnhenry · 2020-12-16T04:48:06Z

Ah, this is actually something I removed and then re-enabled by accident in one of my last commits. I think you have to zero-init the workspace for any loops where we use the workspace multiple times, since we don't have cleanup code. For now, I just always emit the init loop. However, in this case the cleanup code exists, so I just need to check whether we are accelerating a dense workspace and omit the loop in that case.

rawnhenry · 2020-12-16T04:48:44Z

Yeah, I'll fix those and look into re-enabling the tests you mentioned. I will fix the hard-coding of sizes if it's not too much implementation work. Thanks Stephen!

weiya711 · 2020-12-16T21:07:45Z

I think that if there is no cleanup code, you still don't need a zeroInit loop in the case where there is no reduction operator (+=). For example, if you always say t[it] = b[ib] * c[ic] for a dense workspace then you won't need to zero-initialize because even if you are re-using the workspace multiple times, it will always be set to a new, computed value.
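
A minimal sketch of that argument, under the assumption that the producer writes every workspace entry the consumer later reads: the kernel below stages an element-wise product through a reused workspace using plain assignment, so stale contents never leak into the result. The function name and the dense row-major layout are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical kernel: A(i,j) = B(i,j) * C(i,j) on dense row-major arrays of
 * shape n_rows x n_cols, staged through a workspace w of length n_cols that
 * is reused for every row and never zero-initialized. */
void elementwise_via_workspace(int32_t n_rows, int32_t n_cols,
                               const double *B, const double *C,
                               double *A, double *w) {
  for (int32_t i = 0; i < n_rows; i++) {
    /* Producer: plain assignment (no +=) overwrites every entry of w, so
     * whatever the previous row left behind is discarded. */
    for (int32_t j = 0; j < n_cols; j++) {
      w[j] = B[i * n_cols + j] * C[i * n_cols + j];
    }
    /* Consumer: reads only entries the producer just wrote. */
    for (int32_t j = 0; j < n_cols; j++) {
      A[i * n_cols + j] = w[j];
    }
  }
}
```

If the producer used `+=` instead, or wrote only some of the entries that the consumer reads, the stale values would matter and a reset would be needed again.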

I just tried SpGEMM on the original master branch of TACO, and it seems that on this branch the zeroInit loop has simply moved into the for (i = ...) {} loop, when it is supposed to sit before it:

  w = (double*)malloc(sizeof(double) * 42);
  for (int32_t pw = 0; pw < 42; pw++) {
    w[pw] = 0.0;
  }

  for (int32_t i = 0; i < B1_dimension; i++) {
    for (int32_t kB = B2_pos[i]; kB < B2_pos[(i + 1)]; kB++) {
      int32_t k = B2_crd[kB];
      for (int32_t jC = C2_pos[k]; jC < C2_pos[(k + 1)]; jC++) {
        int32_t j = C2_crd[jC];
        w[j] = w[j] + B_vals[kB] * C_vals[jC];
      }
    }

stephenchouca · 2020-12-21T06:16:00Z

Actually, I also just realized that the generated code for SpGEMM sets the size of the workspace arrays to be C1_dimension, but shouldn't it actually be C2_dimension (i.e., the number of columns)?

rawnhenry · 2020-12-21T23:50:38Z

I don't think I touched the code that sets the temporary size. I could have accidentally broken something, though.

You're right that it should be C2_dimension. I'll check that as I fix the other issues.

rawnhenry · 2020-12-24T05:16:35Z

@stephenchouca When I run this command:

./bin/taco "A(i,j) = B(i,k) * C(k,j)" -f=A:ds -f=B:ds -f=C:ds -d=i:4 -d=k:12 -d=j:120

The generated code sets the workspace size to 120 which seems correct. What example did you try when you noticed the incorrect behavior?

stephenchouca · 2020-12-24T06:57:00Z

It turned out I was actually setting the dimensions incorrectly; feel free to ignore my earlier comment. Sorry for the confusion.

rawnhenry · 2020-12-25T00:34:29Z

I relaxed the requirements for the SPMM transformation and some more tests from #325 pass. The idea is to keep the transform off for algorithms/formats that don't do/allow linear combination of rows.

rawnhenry · 2020-12-25T01:05:00Z

I changed the transform again so that if the LHS is CSR then it will include a workspace as long as:

There are no permutations in the format
All the level formats of the operands are ordered

I think this is OK, since the i -> k -> j iteration order means we can iterate over each operand in order. I'm not sure if there's something I'm missing, though. In any case, most of the tests from #325 pass. The ones that don't seem to require transposes. Let me know your feedback on this.
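
The two conditions listed above could be expressed as a small predicate along these lines. This is only a sketch: `MatrixFormat` is a hypothetical stand-in for TACO's format description, not its actual `Format` class, and applying both checks to every operand is an assumption about what "the format" refers to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of a 2-D tensor format. */
typedef struct {
  int32_t mode_ordering[2];  /* {0, 1} means no permutation; {1, 0} is permuted */
  bool level_ordered[2];     /* whether each level stores coordinates in order */
} MatrixFormat;

/* True if the format has no mode permutation and all levels are ordered. */
static bool is_unpermuted_and_ordered(const MatrixFormat *f) {
  return f->mode_ordering[0] == 0 && f->mode_ordering[1] == 1 &&
         f->level_ordered[0] && f->level_ordered[1];
}

/* Sketch of the check: every operand must pass both conditions. */
bool can_apply_workspace_transform(const MatrixFormat *operands, int32_t n) {
  for (int32_t t = 0; t < n; t++) {
    if (!is_unpermuted_and_ordered(&operands[t])) {
      return false;
    }
  }
  return true;
}
```

Under this model a CSC operand has mode ordering {1, 0} and is rejected, which is consistent with the CSC cases that fail in the results that follow.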

Tests from #325:

Result	A =	B *	C
pass	Dense	Dense	Dense
pass	Dense	Dense	CSR
pass	Dense	Dense	CSC
pass	Dense	Dense	COO
pass	Dense	CSR	Dense
pass	Dense	CSR	CSR
pass	Dense	CSR	CSC
pass	Dense	CSR	COO
pass	Dense	CSC	Dense
pass	Dense	CSC	CSR
pass	Dense	CSC	CSC
pass	Dense	CSC	COO
pass	Dense	COO	Dense
pass	Dense	COO	CSR
pass	Dense	COO	CSC
pass	Dense	COO	COO
pass	CSR	Dense	Dense
pass	CSR	Dense	CSR
pass	CSR	Dense	CSC
pass	CSR	Dense	COO
pass	CSR	CSR	Dense
pass	CSR	CSR	CSR
pass	CSR	CSR	CSC
pass	CSR	CSR	COO
FAIL	CSR	CSC	Dense
FAIL	CSR	CSC	CSR

weiya711 and others added 8 commits October 14, 2020 14:51

Add in hoisted workspace reuse and remove guard for divisible bound and split

bd36277

Fix some workspaces tests

e649a67

Prototypes automatically generating code to have sparse iteration over a dense workspace

01166af

Merge branch 'master' of https://github.com/tensor-compiler/taco into workspace_reuse

3e03992

Fixes bugs in check for accelerating workspace

12b51f0

Fixes bug in concreteNotation check. All workspace tests pass.

c8972c0

Removes print statements

4895917

Only hoists out malloc + free from where statement when possible. Emits loop to zero every element in a temporary when it is hoisted before the producer is called. Changes the codegens to keep pointer names constant

1c37ebd

Adds negation to pytaco tensor interface

e48e4ec

rawnhenry added 4 commits December 23, 2020 14:45

Removes initialization loop from before producer when accelerating a dense workspace

9b1450c

Merge branch 'master' into workspace_reuse

e96e4b8

Places index list size above the producer loop when accelerating a dense workspace. This should make the transition to multithreading easier and fixes a bug in the original code

dd795fc

Fixes workspace reset

ff84784

rawnhenry added 2 commits December 24, 2020 00:12

If underived variables are used to index a workspace, we allocate space for the workspace based on the sizes of the input tensors

d5721d7

Relaxes requirements for spmm transformation

2eb298e

rawnhenry added 2 commits December 24, 2020 16:39

Checks if first mode of last tensor has locate for spmm transform

46aed13

Changes SPMM transform requirement. Unsure about this

8471869

stephenchouca merged commit cb4731d into tensor-compiler:master Jan 13, 2021
