[NNC] Replace ComputeAt #55933

Open
navahgar opened this issue Apr 13, 2021 · 18 comments

@navahgar
Contributor

Given the input IR:

for (int i = 0; i < 100; ++i) {
  for (int j = 0; j < 200; ++j) {
    A[i,j] = sin(i*j)
  }
}
for (int i = 0; i < 100; ++i) {
  for (int j = 0; j < 199; ++j) {
    B[i,j] = A[i,j] + A[i, j+1]
  }
}

we need a transformation that gets to:

for (int i2 = 0; i2 < 100; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    for (int t = 0; t < 2; t++) {
      A[0, t] = sin(i2*(j2+t)) // dims: [1, 2]
    }
    B[i2,j2] = A[0,0] + A[0, 1]
  }
}

This should then be able to replace ComputeAt.
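
For concreteness, here is a plain C++ sanity check (illustrative only, not NNC code; the array names mirror the IR above) that the transformed loop nest computes the same B as the input IR:

#include <cassert>
#include <cmath>

int main() {
  static double A[100][200], B1[100][199], B2[100][199];

  // Input IR: materialize all of A, then consume it.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 200; ++j)
      A[i][j] = std::sin((double)i * j);
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 199; ++j)
      B1[i][j] = A[i][j] + A[i][j + 1];

  // Transformed IR: a [1, 2] temporary recomputed for every (i2, j2).
  for (int i2 = 0; i2 < 100; ++i2)
    for (int j2 = 0; j2 < 199; ++j2) {
      double a[1][2];
      for (int t = 0; t < 2; ++t)
        a[0][t] = std::sin((double)i2 * (j2 + t));
      B2[i2][j2] = a[0][0] + a[0][1];
    }

  // Same expression per element, in the same order, so the results match exactly.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 199; ++j)
      assert(B1[i][j] == B2[i][j]);
  return 0;
}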

@navahgar navahgar self-assigned this Apr 13, 2021
@navahgar navahgar added the NNC label Apr 13, 2021
@navahgar navahgar added this to Needs triage in NNC via automation Apr 13, 2021
@navahgar navahgar added this to To Do in NNC Convolutions via automation Apr 13, 2021
@navahgar navahgar moved this from Needs triage to High priority in NNC Apr 13, 2021
@navahgar
Contributor Author

@ZolotukhinM @bertmaher

@navahgar
Contributor Author

Some discussions related to this: #55853 (comment)

@navahgar
Contributor Author

Why not have the following instead of the desired form above? @ZolotukhinM

for (int i2 = 0; i2 < 100; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = sin(i2*j2) + sin(i2*(j2+1))
  }
}

@ZolotukhinM

Why not have the following instead of the desired form above?

The option you brought up is what compute_inline would do. That's one end of the spectrum. The other end is to keep the computation as it was originally. On one end we have redundant computations (when we compute-inline); on the other end we need extra memory (when we keep the computation intact), so there is a trade-off. What compute_at allows us to do is land on some point in the middle of this spectrum and make the desired trade-off.

This example just shows that something might be impossible to achieve without compute_at. To make it more realistic, we might change the computation to A[i,j] + A[i, j+1] + A[i+1,j] + A[i+1, j+1] and apply compute_at at the i2 loop; this way we'd be saying that we only need to compute two rows of A every time we compute a row of B.

@navahgar
Contributor Author

navahgar commented Apr 13, 2021

I get the trade-off that compute_at offers. So, IIUC, compute_at at the innermost loop is the same as compute_inline?

A more representative example would be:

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 100; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1]
  }
}

compute_at(A, i2) would result in:

for (int i2 = 0; i2 < 100; ++i2) {
  for (int t1 = 0; t1 < 1; ++t1) {
    for (int t2 = 0; t2 < 200; ++t2) {
      A[t1,t2] = sin(i2*t2);
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[0,j2] + A[0, j2+1]
  }
}

This is exactly what we have after fusion and buffer compression. Is there any example where compute_at achieves something that is not possible with fusion and buffer compression, or with inlining?

@ZolotukhinM

ZolotukhinM commented Apr 13, 2021

compute_at at the innermost loop is the same as compute_inline?

It is very similar in effect, but an important difference is that compute_at introduces a new buffer (or reuses the old one but makes it compressible), while compute_inline always eliminates the original buffer. For the innermost loops, though, it is mostly equivalent, as you noted.
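
A minimal plain-C++ sketch of that contrast on the 1x2 stencil above (illustrative only, not the NNC API): compute_inline needs no buffer at all, while compute_at at the innermost loop keeps a 2-element temporary; both do the same recomputation.

#include <cmath>
#include <cstdio>

int main() {
  const int I = 100, J = 199;
  double sum_inline = 0.0, sum_at = 0.0; // checksums to confirm the two forms agree

  // compute_inline: A is eliminated; sin() is evaluated twice per B element.
  for (int i = 0; i < I; ++i)
    for (int j = 0; j < J; ++j)
      sum_inline += std::sin((double)i * j) + std::sin((double)i * (j + 1));

  // compute_at at the innermost loop: same recomputation, but through a
  // 2-element temporary instead of a full 100x200 buffer A.
  for (int i = 0; i < I; ++i)
    for (int j = 0; j < J; ++j) {
      double t[2];
      for (int k = 0; k < 2; ++k)
        t[k] = std::sin((double)i * (j + k));
      sum_at += t[0] + t[1];
    }

  std::printf("checksums: %f vs %f\n", sum_inline, sum_at);
  return 0;
}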

Is there any example where compute_at achieves something that is not possible with fusion and buffer compression, or with inlining?

Yes, let's look at an example when we need two rows:

Original:

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After compute_at(A, i2):

for (int i2 = 0; i2 < 99; ++i2) {
  for (int t = 0; t < 2; t++) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[t, j1] = sin((i2+t)*j1) // dims [2, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[0,j2] + A[0, j2+1] + A[1,j2] + A[1, j2+1]
  }
}

In the original example we used two columns and one row; that's why, to demonstrate the effect of compute_at, I needed to apply it to the innermost loop (where, as you noted, it was similar to compute_inline, which was a red herring).

@navahgar
Contributor Author

Right, that is the perfect example here. The key challenge is that we need a transformation that introduces redundancy.

Here is the source of that redundancy: consider the element A[1,99], for example. It is computed twice, once when i2 == 0 and again when i2 == 1, whereas in the input code there is no such redundant computation.

@navahgar
Contributor Author

navahgar commented Apr 13, 2021

The first challenge in transforming the code is to fuse loops i1 and i2. Just peeling the first iteration of i1 and normalizing it wouldn't help, because we still couldn't fuse due to a dependency.

So, we need to transform the input code to the following:

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+t,j1] = sin((i1+t)*j1)
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

This changes the iteration space of i1 from [0..100] to [0..99] x [0..2]. This introduces redundant computations of A as well.

After this we can fuse the two outer loops and perform buffer compression to get to the desired result.

@navahgar
Contributor Author

To clarify the transformation described above further: it can be seen as a split with overlapping iteration spaces, where the amount of overlap determines how much redundancy is introduced.
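
A minimal plain-C++ sketch of this overlapped split (the counting harness and names are illustrative): splitting for (i = 0; i < N; ++i) body(i) with overlap k yields for (i = 0; i <= N-k; ++i) for (t = 0; t < k; ++t) body(i+t), and counting executions makes the redundancy visible.

#include <cstdio>
#include <vector>

int main() {
  const int N = 10, k = 2; // original extent and overlap
  std::vector<int> count(N, 0);

  // Overlapped split of "for (i = 0; i < N; ++i) body(i)".
  for (int i = 0; i < N - k + 1; ++i)
    for (int t = 0; t < k; ++t)
      ++count[i + t]; // stands in for "A[i+t, :] = ..."

  // Every index is still covered, but interior indices now run k times:
  // for k == 2, indices 1..N-2 are computed twice, the endpoints once.
  for (int x = 0; x < N; ++x)
    std::printf("index %d computed %d time(s)\n", x, count[x]);
  return 0;
}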

@ZolotukhinM

One question I have is whether making this a separate transformation brings any value in the end. It feels that, whatever we do, it will only be useful in compute_at; i.e., we don't generally want to introduce this redundancy, we only want to do it for the elements used in one iteration of the consumer stmt (consumer stmt: B[i2,j2]=...; elements used: A[i2,j2], A[i2, j2+1], ..., A[i2+1, j2+1]).

While I generally 100% support making transformations as "atomic" as possible, in this case I'm not convinced that separating compute_at into these steps buys us that much. Maybe we should just refine the semantics of compute_at instead?

@navahgar
Contributor Author

This particular transformation will likely be used only in the context of computeAt. But the other two transformations that we need to replace computeAt, namely loop fusion and buffer compression, should be useful in more general contexts.

I have not looked at the computeAt code myself, so I am not sure how complicated it is to fix the issues in it. But by splitting it into three simpler transformations, I believe it will be easier to implement and to test.

@ZolotukhinM

Yeah, loop fusion and buffer compression make sense to me as separate transformations. What I suggest is calling the transformation that introduces this redundancy "compute_at": it will not try to fuse, and it will not try to compress buffers, but it will be responsible for inserting the proper loops at the proper place. E.g.

Original:

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After compute_at(A, i2):

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int i3 = 0; i3 < 2; ++i3) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[i2+i3, j3] = sin((i2+i3)*j3) // dims: [100, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[i2,j2] + T[i2, j2+1] + T[i2+1,j2] + T[i2+1, j2+1]
  }
}

After eliminate_dead_stores:

for (int i2 = 0; i2 < 99; ++i2) {
  for (int i3 = 0; i3 < 2; ++i3) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[i2+i3, j3] = sin((i2+i3)*j3) // dims: [100, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[i2,j2] + T[i2, j2+1] + T[i2+1,j2] + T[i2+1, j2+1]
  }
}

After compress_buffers:

for (int i2 = 0; i2 < 99; ++i2) {
  for (int i3 = 0; i3 < 2; ++i3) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[i3, j3] = sin((i2+i3)*j3) // dims: [2, 200]
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[0,j2] + T[0, j2+1] + T[1,j2] + T[1, j2+1]
  }
}

After reorder(i3, j3) and unroll(i3):

for (int i2 = 0; i2 < 99; ++i2) {
  for (int j3 = 0; j3 < 200; ++j3) {
    T[0, j3] = sin(i2*j3)
    T[1, j3] = sin((i2+1)*j3)
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = T[0,j2] + T[0, j2+1] + T[1,j2] + T[1, j2+1]
  }
}
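
As a sanity check (plain C++ mirroring the IR above, not NNC calls), the final reordered/unrolled form produces the same B as the original two-loop-nest program:

#include <cassert>
#include <cmath>

int main() {
  static double A[100][200], B1[99][199], B2[99][199];
  double T[2][200];

  // Original: full A, then the 2x2 stencil.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 200; ++j)
      A[i][j] = std::sin((double)i * j);
  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      B1[i][j] = A[i][j] + A[i][j + 1] + A[i + 1][j] + A[i + 1][j + 1];

  // After compute_at + eliminate_dead_stores + compress_buffers + reorder/unroll:
  // two rows of T are recomputed on every iteration of i2.
  for (int i2 = 0; i2 < 99; ++i2) {
    for (int j3 = 0; j3 < 200; ++j3) {
      T[0][j3] = std::sin((double)i2 * j3);
      T[1][j3] = std::sin((double)(i2 + 1) * j3);
    }
    for (int j2 = 0; j2 < 199; ++j2)
      B2[i2][j2] = T[0][j2] + T[0][j2 + 1] + T[1][j2] + T[1][j2 + 1];
  }

  // Identical expressions added in the same order, so the match is exact.
  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      assert(B1[i][j] == B2[i][j]);
  return 0;
}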

@ZolotukhinM

Btw, that reminded me of more transformations that we might want to add: transformations to change buffer layouts. E.g., in my last example it might be useful to swap the i and j dimensions in T (not sure about this particular case, but it can help in some situations). We probably can't do it for input/output buffers, but we should be able to for intermediate ones.

@navahgar
Contributor Author

navahgar commented Apr 15, 2021

Yeah, loop fusion and buffer compression make sense to me as separate transformations. What I suggest is calling the transformation that introduces this redundancy "compute_at": it will not try to fuse, and it will not try to compress buffers, but it will be responsible for inserting the proper loops at the proper place.

While I don't mind this approach, here is what I had in mind:

Input

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After splitting the i1 iteration space into 2D with an overlap of 2 (splitWithOverlap?):

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+t,j1] = sin((i1+t)*j1)
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

After fusing i1 and i2:

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+t,j1] = sin((i1+t)*j1)
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i1,j2] = A[i1,j2] + A[i1, j2+1] + A[i1+1,j2] + A[i1+1, j2+1]
  }
}

After buffer compression:

for (int i1 = 0; i1 < 99; ++i1) {
  for (int t = 0; t < 2; ++t) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[t,j1] = sin((i1+t)*j1)
    }
  }
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i1,j2] = A[0,j2] + A[0, j2+1] + A[1,j2] + A[1, j2+1]
  }
}

@navahgar
Contributor Author

@ZolotukhinM Here are the steps to perform computeAt at the innermost loop.

Input

for (int i1 = 0; i1 < 100; ++i1) {
  for (int j1 = 0; j1 < 200; ++j1) {
    A[i1,j1] = sin(i1*j1)
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}

ComputeAt(A, j2) will include the following steps:

  1. splitWithOverlap(i1, 2)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int ti = 0; ti < 2; ++ti) {
    for (int j1 = 0; j1 < 200; ++j1) {
      A[i1+ti,j1] = sin((i1+ti)*j1)     // A is [100, 200]
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}
  2. splitWithOverlap(j1, 2)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int ti = 0; ti < 2; ++ti) {
    for (int j1 = 0; j1 < 199; ++j1) {
      for (int tj = 0; tj < 2; ++tj) {
        A[i1+ti,j1+tj] = sin((i1+ti)*(j1+tj))     // A is [100, 200]
      }
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}
  3. reorder(ti, j1)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int j1 = 0; j1 < 199; ++j1) {
    for (int ti = 0; ti < 2; ++ti) {
      for (int tj = 0; tj < 2; ++tj) {
        A[i1+ti,j1+tj] = sin((i1+ti)*(j1+tj))     // A is [100, 200]
      }
    }
  }
}
for (int i2 = 0; i2 < 99; ++i2) {
  for (int j2 = 0; j2 < 199; ++j2) {
    B[i2,j2] = A[i2,j2] + A[i2, j2+1] + A[i2+1,j2] + A[i2+1, j2+1]
  }
}
  4. fuse(i1, i2), fuse(j1, j2)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int j1 = 0; j1 < 199; ++j1) {
    for (int ti = 0; ti < 2; ++ti) {
      for (int tj = 0; tj < 2; ++tj) {
        A[i1+ti,j1+tj] = sin((i1+ti)*(j1+tj))     // A is [100, 200]
      }
    }
    B[i1,j1] = A[i1,j1] + A[i1, j1+1] + A[i1+1,j1] + A[i1+1, j1+1]
  }
}
  5. compressBuffer(A)
for (int i1 = 0; i1 < 99; ++i1) {
  for (int j1 = 0; j1 < 199; ++j1) {
    for (int ti = 0; ti < 2; ++ti) {
      for (int tj = 0; tj < 2; ++tj) {
        A[ti,tj] = sin((i1+ti)*(j1+tj))     // A is [2, 2]
      }
    }
    B[i1,j1] = A[0,0] + A[0,1] + A[1,0] + A[1,1]
  }
}
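
The same kind of plain-C++ check for this innermost-loop schedule (illustrative, not NNC code): the [2, 2] temporary version matches the original stencil exactly, since the additions happen in the same order.

#include <cassert>
#include <cmath>

int main() {
  static double A[100][200], B1[99][199], B2[99][199];

  // Original program.
  for (int i = 0; i < 100; ++i)
    for (int j = 0; j < 200; ++j)
      A[i][j] = std::sin((double)i * j);
  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      B1[i][j] = A[i][j] + A[i][j + 1] + A[i + 1][j] + A[i + 1][j + 1];

  // Result of ComputeAt(A, j2): a [2, 2] temporary recomputed per (i1, j1).
  for (int i1 = 0; i1 < 99; ++i1)
    for (int j1 = 0; j1 < 199; ++j1) {
      double a[2][2];
      for (int ti = 0; ti < 2; ++ti)
        for (int tj = 0; tj < 2; ++tj)
          a[ti][tj] = std::sin((double)(i1 + ti) * (j1 + tj));
      B2[i1][j1] = a[0][0] + a[0][1] + a[1][0] + a[1][1];
    }

  for (int i = 0; i < 99; ++i)
    for (int j = 0; j < 199; ++j)
      assert(B1[i][j] == B2[i][j]);
  return 0;
}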

@ZolotukhinM

ZolotukhinM commented Apr 16, 2021

Yes, it will work for this case. However, I still don't think it's the right decomposition, and the reason is that the first transformation really needs to be "compute whatever part of the producer is used in the consumer statement". In this example it is A[i:i+1, j:j+1], but it can be anything. For instance, the consumer can be:

  • B[i,j] = A[i/2, j*2]
  • B[i,j] = A[i+1, 0] + A[i/2, 5] + A[i*2, j]
  • B[i,j] = A[i+1, j-1] + A[i-1, j+1]

splitWithOverlap covers only one case, and I believe this is overfitting. That first transform should really be a part of compute_at and be driven by it, and I'm not sure it can be separated. It is more like "compute this region of indices for this buffer" and probably needs to rely on bounds inference to determine the ranges. I don't think it can always be expressed as splitWithOverlap.
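
A hypothetical sketch of that bounds-inference view (rowFootprint and Access are illustrative names, not NNC functions, and this is restricted to producer row indices of the form a*i + b; accesses like A[i/2, 5] would need real bounds inference with integer division): the rows needed at a fixed i form an interval, and only when its extent is constant does the region reduce to a fixed-overlap split.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Access { int a, b; }; // producer row index = a*i + b

// Interval of producer rows touched at consumer iteration i.
std::pair<int, int> rowFootprint(const std::vector<Access>& acc, int i) {
  int lo = acc[0].a * i + acc[0].b;
  int hi = lo;
  for (const Access& x : acc) {
    lo = std::min(lo, x.a * i + x.b);
    hi = std::max(hi, x.a * i + x.b);
  }
  return {lo, hi};
}

int main() {
  // 2x2 stencil: B[i,j] = A[i,j] + ... + A[i+1,j+1] touches rows i and i+1.
  std::vector<Access> stencil = {{1, 0}, {1, 1}};
  // Strided consumer: B[i,j] = A[i*2, j] + A[i+1, 0] touches rows i*2 and i+1.
  std::vector<Access> strided = {{2, 0}, {1, 1}};

  for (int i = 1; i <= 8; i *= 2) {
    std::pair<int, int> s = rowFootprint(stencil, i);
    std::pair<int, int> d = rowFootprint(strided, i);
    std::printf("i=%d: stencil rows [%d,%d] (extent %d), strided rows [%d,%d] (extent %d)\n",
                i, s.first, s.second, s.second - s.first + 1,
                d.first, d.second, d.second - d.first + 1);
  }
  // The stencil's extent is always 2, so splitWithOverlap(i, 2) captures it;
  // the strided extent grows with i, so no fixed overlap describes it.
  return 0;
}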

@navahgar
Contributor Author

Actually, the splitWithOverlap approach also requires bounds inference. For example, in my code above, we need bounds inference to figure out which loops to split and how they have to be split. I still feel we can handle all the cases using this approach.

However, I am okay with going along with the approach you mentioned earlier ("computing at" without fusing or buffer compression).

@ZolotukhinM

It is possible, then, that we are just calling the same thing by different names :)
