[NNC] Replace ComputeAt #55933

Open
navahgar opened this issue Apr 13, 2021 · 18 comments

@navahgar
Contributor

Given the input IR:

for (int i = 0; i < 100; ++i) {
  for (int j = 0; j < 200; ++j) {
    A[i,j] = sin(i*j)
  }
}
for (int i = 0; i < 100; ++i) {
  for (int j = 0; j < 199; ++j) {
    B[i,j] = A[i,j] + A[i, j+1]
  }
}

we need a transformation that gets to:

for (int i2 = 0; i2 < 100; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    for (int t = 0; t < 2; t++) {
      A[0, t] = sin(i2*(j2+t)) // dims: [1, 2]
    }
    B[i2,j2] = A[0,0] + A[0, 1]
  }
}

This should then be able to replace ComputeAt.
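
For concreteness, here is a plain C++ sanity check (illustrative only, not NNC code; the array names mirror the IR above) that the transformed loop nest computes the same B as the input IR:

#include <cassert>
#include <cmath>

int main() {
  static double A[100][200], B1[100][199], B2[100][199];

  // Input IR: materialize all of A, then consume it.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 200; ++j)
      A[i][j] = std::sin((double)i * j);
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 199; ++j)
      B1[i][j] = A[i][j] + A[i][j + 1];

  // Transformed IR: a [1, 2] temporary recomputed for every (i2, j2).
  for (int i2 = 0; i2 < 100; ++i2)
    for (int j2 = 0; j2 < 199; ++j2) {
      double a[1][2];
      for (int t = 0; t < 2; ++t)
        a[0][t] = std::sin((double)i2 * (j2 + t));
      B2[i2][j2] = a[0][0] + a[0][1];
    }

  // Same expression per element, in the same order, so the results match exactly.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 199; ++j)
      assert(B1[i][j] == B2[i][j]);
  return 0;
}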

@navahgar navahgar self-assigned this Apr 13, 2021
@navahgar navahgar added the NNC label Apr 13, 2021
@navahgar navahgar added this to Needs triage in NNC via automation Apr 13, 2021
@navahgar navahgar added this to To Do in NNC Convolutions via automation Apr 13, 2021
@navahgar navahgar moved this from Needs triage to High priority in NNC Apr 13, 2021
@navahgar
Contributor Author

@ZolotukhinM @bertmaher

@navahgar
Contributor Author

Some discussions related to this: #55853 (comment)

@navahgar
Contributor Author

Why not have the following instead of the desired form above? @ZolotukhinM

for (int i2 = 0; i2 < 100; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = sin(i2*j2) + sin(i2*(j2+1))
  }
}

@ZolotukhinM

Why not have the following instead of the desired form above?

The option you brought up is what compute_inline would do. That's one end of the spectrum. The other end is to keep the computation as it was originally. On one end we have redundant computations (when we compute-inline); on the other end we need extra memory (when we keep the computation intact), so there is a trade-off. What compute_at allows us to do is land on some point in the middle of this spectrum and make the desired trade-off.

This example just shows that something might be impossible to achieve without compute_at. To make it more realistic, we might change the computation to A[i,j] + A[i, j+1] + A[i+1,j] + A[i+1, j+1] and apply compute_at at the i2 loop; this way we'd be saying that we only need to compute two rows of A every time we compute a row of B.

@navahgar
Contributor Author

navahgar commented Apr 13, 2021

I get the trade-off that compute_at offers. So, IIUC, compute_at at the innermost loop is the same as compute_inline?

A more representative example would be:

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 100; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1]
  }
}

compute_at(A, i2) would result in:

for (int i2 = 0; i2 < 100; ++i2) {
  for (int t1 = 0; t1 < 1; ++t1) {
    for (int t2 = 0; t2 < 200; ++t2) {
      A[t1,t2] = sin(i2*t2);
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[0,j2] + A[0, j2+1]
  }
}

This is exactly what we have after fusion and buffer compression. Is there any example where compute_at achieves something that is not possible with fusion and buffer compression, or with inlining?

@ZolotukhinM

ZolotukhinM commented Apr 13, 2021

compute_at at the innermost loop is the same as compute_inline?

It is very similar in effect, but an important difference is that compute_at introduces a new buffer (or reuses the old one but makes it compressible), while compute_inline always eliminates the original buffer. For the innermost loops, though, it is mostly equivalent, as you noted.
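
A minimal plain-C++ sketch of that contrast on the 1x2 stencil above (illustrative only, not the NNC API): compute_inline needs no buffer at all, while compute_at at the innermost loop keeps a 2-element temporary; both do the same recomputation.

#include <cmath>
#include <cstdio>

int main() {
  const int I = 100, J = 199;
  double sum_inline = 0.0, sum_at = 0.0; // checksums to confirm the two forms agree

  // compute_inline: A is eliminated; sin() is evaluated twice per B element.
  for (int i = 0; i < I; ++i)
    for (int j = 0; j < J; ++j)
      sum_inline += std::sin((double)i * j) + std::sin((double)i * (j + 1));

  // compute_at at the innermost loop: same recomputation, but through a
  // 2-element temporary instead of a full 100x200 buffer A.
  for (int i = 0; i < I; ++i)
    for (int j = 0; j < J; ++j) {
      double t[2];
      for (int k = 0; k < 2; ++k)
        t[k] = std::sin((double)i * (j + k));
      sum_at += t[0] + t[1];
    }

  std::printf("checksums: %f vs %f\n", sum_inline, sum_at);
  return 0;
}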

Is there any example where compute_at achieves something that is not possible with fusion and buffer compression, or with inlining?

Yes, let's look at an example when we need two rows:

Original:

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After compute_at(A, i2):

for (int i2 = 0; i2 < 99; ++i2) {
  for (int t = 0; t < 2; t++) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[t, j1] = sin((i2+t)*j1) // dims [2, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[0,j2] + A[0, j2+1] + A[1,j2] + A[1, j2+1]
  }
}

In the original example we used two columns and one row; that's why, to demonstrate the effect of compute_at, I needed to apply it to the innermost loop (where, as you noted, it was similar to compute_inline, which was a red herring).

@navahgar
Contributor Author

Right, that is the perfect example here. The key challenge is that we need a transformation that introduces redundancy.

Here is the source of that redundancy: consider the element A[1,99], for example. It is computed twice, once when i2 == 0 and again when i2 == 1, whereas in the input code there is no such redundant computation.

@navahgar
Contributor Author

navahgar commented Apr 13, 2021

The first challenge in transforming the code is to fuse loops i1 and i2. Just peeling the first iteration of i1 and normalizing it wouldn't help, because we still couldn't fuse due to a dependency.

So, we need to transform the input code to the following:

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+t,j1] = sin((i1+t)*j1)
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

This changes the iteration space of i1 from [0..100] to [0..99] x [0..2]. This introduces redundant computations of A as well.

After this we can fuse the two outer loops and perform buffer compression to get to the desired result.

@navahgar
Contributor Author

To clarify the transformation described above further: it can be seen as a split with overlapping iteration spaces, where the amount of overlap determines how much redundancy is introduced.
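
A minimal plain-C++ sketch of this overlapped split (the counting harness and names are illustrative): splitting for (i = 0; i < N; ++i) body(i) with overlap k yields for (i = 0; i <= N-k; ++i) for (t = 0; t < k; ++t) body(i+t), and counting executions makes the redundancy visible.

#include <cstdio>
#include <vector>

int main() {
  const int N = 10, k = 2; // original extent and overlap
  std::vector<int> count(N, 0);

  // Overlapped split of "for (i = 0; i < N; ++i) body(i)".
  for (int i = 0; i < N - k + 1; ++i)
    for (int t = 0; t < k; ++t)
      ++count[i + t]; // stands in for "A[i+t, :] = ..."

  // Every index is still covered, but interior indices now run k times:
  // for k == 2, indices 1..N-2 are computed twice, the endpoints once.
  for (int x = 0; x < N; ++x)
    std::printf("index %d computed %d time(s)\n", x, count[x]);
  return 0;
}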

@ZolotukhinM

One question I have is whether making this a separate transformation brings any value in the end. It feels that, whatever we do, it will only be useful in compute_at; i.e., we don't generally want to introduce this redundancy, we only want to do it for the elements used in one iteration of the consumer stmt (consumer stmt: B[i2,j2]=...; elements used: A[i2,j2], A[i2, j2+1], ..., A[i2+1, j2+1]).

While I generally 100% support making transformations as "atomic" as possible, in this case I'm not convinced that separating compute_at into these steps buys us that much. Maybe we should just refine the semantics of compute_at instead?

@navahgar
Contributor Author

This particular transformation will likely be used only in the context of computeAt. But the other two transformations that we need to replace computeAt, namely loop fusion and buffer compression, should be useful in more general contexts.

I have not looked at the computeAt code myself, so I am not sure how complicated it is to fix the issues in it. But by splitting it into three simpler transformations, I believe it will be easier to implement and to test.

@ZolotukhinM

Yeah, loop fusion and buffer compression make sense to me as separate transformations. What I suggest is calling the transformation that introduces this redundancy "compute_at": it will not try to fuse, and it will not try to compress buffers, but it will be responsible for inserting the proper loops at the proper place. E.g.

Original:

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After compute_at(A, i2):

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int i3 = 0; i3 < 2; ++i3) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[i2+i3, j3] = sin((i2+i3)*j3) // dims: [100, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[i2,j2] + T[i2, j2+1] + T[i2+1,j2] + T[i2+1, j2+1]
  }
}

After eliminate_dead_stores:

for (int i2 = 0; i2 < 99; ++i2) {
  for (int i3 = 0; i3 < 2; ++i3) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[i2+i3, j3] = sin((i2+i3)*j3) // dims: [100, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[i2,j2] + T[i2, j2+1] + T[i2+1,j2] + T[i2+1, j2+1]
  }
}

After compress_buffers:

for (int i2 = 0; i2 < 99; ++i2) {
  for (int i3 = 0; i3 < 2; ++i3) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[i3, j3] = sin((i2+i3)*j3) // dims: [2, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[0,j2] + T[0, j2+1] + T[1,j2] + T[1, j2+1]
  }
}

After reorder(i3, j3) and unroll(i3):

for (int i2 = 0; i2 < 99; ++i2) {
  for (int j3 = 0; j3 < 200; ++j3) {
    T[0, j3] = sin(i2*j3)
    T[1, j3] = sin((i2+1)*j3)
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[0,j2] + T[0, j2+1] + T[1,j2] + T[1, j2+1]
  }
}
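
As a sanity check (plain C++ mirroring the IR above, not NNC calls), the final reordered/unrolled form produces the same B as the original two-loop-nest program:

#include <cassert>
#include <cmath>

int main() {
  static double A[100][200], B1[99][199], B2[99][199];
  double T[2][200];

  // Original: full A, then the 2x2 stencil.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 200; ++j)
      A[i][j] = std::sin((double)i * j);
  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      B1[i][j] = A[i][j] + A[i][j + 1] + A[i + 1][j] + A[i + 1][j + 1];

  // After compute_at + eliminate_dead_stores + compress_buffers + reorder/unroll:
  // two rows of T are recomputed on every iteration of i2.
  for (int i2 = 0; i2 < 99; ++i2) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[0][j3] = std::sin((double)i2 * j3);
      T[1][j3] = std::sin((double)(i2 + 1) * j3);
    }
    for (int j2 = 0; j2 < 199; ++j2)
      B2[i2][j2] = T[0][j2] + T[0][j2 + 1] + T[1][j2] + T[1][j2 + 1];
  }

  // Identical expressions added in the same order, so the match is exact.
  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      assert(B1[i][j] == B2[i][j]);
  return 0;
}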

@ZolotukhinM

Btw, that reminded me of more transformations that we might want to add: transformations to change buffer layouts. E.g., in my last example it might be useful to swap the i and j dimensions in T (not sure about this particular case, but it can help in some situations). We probably can't do it for input/output buffers, but we should be able to for intermediate ones.

@navahgar
Contributor Author

navahgar commented Apr 15, 2021

Yeah, loop fusion and buffer compression make sense to me as separate transformations. What I suggest is calling the transformation that introduces this redundancy "compute_at": it will not try to fuse, and it will not try to compress buffers, but it will be responsible for inserting the proper loops at the proper place.

While I don't mind this approach, here is what I had in mind:

Input

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After splitting the i1 iteration space into 2D with an overlap of 2 (splitWithOverlap?):

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+t,j1] = sin((i1+t)*j1)
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After fusing i1 and i2:

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+t,j1] = sin((i1+t)*j1)
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i1,j2] = A[i1,j2] + A[i1, j2+1] + A[i1+1,j2] + A[i1+1, j2+1]
  }
}

After buffer compression:

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[t,j1] = sin((i1+t)*j1)
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i1,j2] = A[0,j2] + A[0, j2+1] + A[1,j2] + A[1, j2+1]
  }
}

@navahgar
Contributor Author

@ZolotukhinM Here are the steps to perform computeAt at the innermost loop.

Input

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

ComputeAt(A, j2) will include the following steps:

  1. splitWithOverlap(i1, 2)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int ti = 0; ti < 2; ++ti) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+ti,j1] = sin((i1+ti)*j1)     // A is [100, 200]
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}
  2. splitWithOverlap(j1, 2)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int ti = 0; ti < 2; ++ti) {
    for (int j1 = 0; j1 < 199; ++j1) {
      for (int tj = 0; tj < 2; ++tj) {
        A[i1+ti,j1+tj] = sin((i1+ti)*(j1+tj))     // A is [100, 200]
      }
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}
  3. reorder(ti, j1)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int j1 = 0; j1 < 199; ++j1) {
    for (int ti = 0; ti < 2; ++ti) {
      for (int tj = 0; tj < 2; ++tj) {
        A[i1+ti,j1+tj] = sin((i1+ti)*(j1+tj))     // A is [100, 200]
      }
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}
  4. fuse(i1, i2), fuse(j1, j2)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int j1 = 0; j1 < 199; ++j1) {
    for (int ti = 0; ti < 2; ++ti) {
      for (int tj = 0; tj < 2; ++tj) {
        A[i1+ti,j1+tj] = sin((i1+ti)*(j1+tj))     // A is [100, 200]
      }
    }
    B[i1,j1] = A[i1,j1] + A[i1, j1+1] + A[i1+1,j1] + A[i1+1, j1+1]
  }
}
  5. compressBuffer(A)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int j1 = 0; j1 < 199; ++j1) {
    for (int ti = 0; ti < 2; ++ti) {
      for (int tj = 0; tj < 2; ++tj) {
        A[ti,tj] = sin((i1+ti)*(j1+tj))     // A is [2, 2]
      }
    }
    B[i1,j1] = A[0,0] + A[0,1] + A[1,0] + A[1,1]
  }
}
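
The same kind of plain-C++ check for this innermost-loop schedule (illustrative, not NNC code): the [2, 2] temporary version matches the original stencil exactly, since the additions happen in the same order.

#include <cassert>
#include <cmath>

int main() {
  static double A[100][200], B1[99][199], B2[99][199];

  // Original program.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 200; ++j)
      A[i][j] = std::sin((double)i * j);
  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      B1[i][j] = A[i][j] + A[i][j + 1] + A[i + 1][j] + A[i + 1][j + 1];

  // Result of ComputeAt(A, j2): a [2, 2] temporary recomputed per (i1, j1).
  for (int i1 = 0; i1 < 99; ++i1)
    for (int j1 = 0; j1 < 199; ++j1) {
      double a[2][2];
      for (int ti = 0; ti < 2; ++ti)
        for (int tj = 0; tj < 2; ++tj)
          a[ti][tj] = std::sin((double)(i1 + ti) * (j1 + tj));
      B2[i1][j1] = a[0][0] + a[0][1] + a[1][0] + a[1][1];
    }

  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      assert(B1[i][j] == B2[i][j]);
  return 0;
}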

@ZolotukhinM

ZolotukhinM commented Apr 16, 2021

Yes, it will work for this case. However, I still don't think it's the right decomposition, and the reason is that the first transformation really needs to be "compute whatever part of the producer is used in the consumer statement". In this example it is A[i:i+1, j:j+1], but it can be anything. For instance, the consumer can be:

  • B[i,j] = A[i/2, j*2]
  • B[i,j] = A[i+1, 0] + A[i/2, 5] + A[i*2, j]
  • B[i,j] = A[i+1, j-1] + A[i-1, j+1]

splitWithOverlap covers only one case, and I believe this is overfitting. That first transform should really be a part of compute_at and be driven by it, and I'm not sure it can be separated. It is more like "compute this region of indices for this buffer" and probably needs to rely on bounds inference to determine the ranges. I don't think it can always be expressed as splitWithOverlap.
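
A hypothetical sketch of that bounds-inference view (rowFootprint and Access are illustrative names, not NNC functions, and this is restricted to producer row indices of the form a*i + b; accesses like A[i/2, 5] would need real bounds inference with integer division): the rows needed at a fixed i form an interval, and only when its extent is constant does the region reduce to a fixed-overlap split.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Access { int a, b; }; // producer row index = a*i + b

// Interval of producer rows touched at consumer iteration i.
std::pair<int, int> rowFootprint(const std::vector<Access>& acc, int i) {
  int lo = acc[0].a * i + acc[0].b;
  int hi = lo;
  for (const Access& x : acc) {
    lo = std::min(lo, x.a * i + x.b);
    hi = std::max(hi, x.a * i + x.b);
  }
  return {lo, hi};
}

int main() {
  // 2x2 stencil: B[i,j] = A[i,j] + ... + A[i+1,j+1] touches rows i and i+1.
  std::vector<Access> stencil = {{1, 0}, {1, 1}};
  // Strided consumer: B[i,j] = A[i*2, j] + A[i+1, 0] touches rows i*2 and i+1.
  std::vector<Access> strided = {{2, 0}, {1, 1}};

  for (int i = 1; i <= 8; i *= 2) {
    std::pair<int, int> s = rowFootprint(stencil, i);
    std::pair<int, int> d = rowFootprint(strided, i);
    std::printf("i=%d: stencil rows [%d,%d] (extent %d), strided rows [%d,%d] (extent %d)\n",
                i, s.first, s.second, s.second - s.first + 1,
                d.first, d.second, d.second - d.first + 1);
  }
  // The stencil's extent is always 2, so splitWithOverlap(i, 2) captures it;
  // the strided extent grows with i, so no fixed overlap describes it.
  return 0;
}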

@navahgar
Contributor Author

Actually, the splitWithOverlap approach also requires bounds inference. For example, in my code above, we need bounds inference to figure out which loops to split and how they have to be split. I still feel we can handle all the cases using this approach.

However, I am okay with going along with the approach you mentioned earlier ("computing at" without fusing or buffer compression).

@ZolotukhinM

It is possible, then, that we are just calling the same thing by different names :)
