Currently the implementation splits the input sequence into n blocks, e.g. with 4 GPUs the split is:
b0 | b1 | b2 | b3
However, this results in uneven computation: because of the causal attention mask, the GPU holding b3 does roughly 4 times more work than the GPU holding b0.
If we instead split the input sequence into 2n blocks, e.g. with 4 GPUs:
b0,b7 | b1,b6 | b2,b5 | b3,b4
then all GPUs do the same amount of computation, and theoretically the latency should be roughly halved.
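A minimal sketch of the imbalance (hypothetical helper names, not from the implementation): count one unit of work per (query, key) pair attended to under the causal mask, with one token per block for simplicity, and compare the contiguous split against the b0,b7 | b1,b6 | ... pairing.

```python
# Sketch, assuming work per block i is proportional to the i+1 causal
# positions it attends to. Names here are illustrative, not from the repo.

def causal_work(block_ids):
    # block i attends to blocks 0..i under the causal mask
    return sum(i + 1 for i in block_ids)

n_gpus = 4
seq = 2 * n_gpus  # 8 blocks: b0..b7

# naive contiguous split: b0,b1 | b2,b3 | b4,b5 | b6,b7
naive = [[2 * g, 2 * g + 1] for g in range(n_gpus)]
# zigzag split: b0,b7 | b1,b6 | b2,b5 | b3,b4
zigzag = [[g, seq - 1 - g] for g in range(n_gpus)]

print([causal_work(b) for b in naive])   # [3, 7, 11, 15] -> very uneven
print([causal_work(b) for b in zigzag])  # [9, 9, 9, 9]   -> balanced
```

With the contiguous split the slowest GPU carries 15 of the 36 work units; with the zigzag pairing every GPU carries 9, which is where the roughly-halved latency estimate comes from.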
Do you mean Striped Attention when you say zigzag blocking? Or is it something simpler that still gives you much better utilization (lower latency) of the GPUs?
@andreaskoepf Oh... I hadn't read Striped Attention before... (from the name I thought it was some sparse-attention-mask variant of ring attention, like window attention, my bad...)
But from a quick look, it seems Striped Attention does something like:
0,4,8,12 | 1,5,9,13 | 2,6,10,14 | 3,7,11,15
I was thinking about doing something like:
0,1,14,15| 2,3,12,13 | 4,5,10,11 | 6,7,8,9
which may be able to fold the causal mask from:
x
xx
xxx
xxxx
xxxxx
xxxxxx
xxxxxxx
xxxxxxxx
into
x xxxxxxxx
xx xxxxxxx
xxx xxxxxx
xxxx xxxxx
I'm not sure which could give better performance...
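The two layouts can be compared on the same work metric (a sketch with illustrative names; "work" again counts causal (query, key) pairs per block, 16 blocks on 4 GPUs):

```python
# Sketch comparing the striped layout (0,4,8,12 | 1,5,9,13 | ...) with the
# paired layout (0,1,14,15 | 2,3,12,13 | ...) from this thread.

SEQ = 16
N_GPUS = 4

def work(block_ids):
    # block i attends to blocks 0..i under the causal mask
    return sum(i + 1 for i in block_ids)

striped = [list(range(g, SEQ, N_GPUS)) for g in range(N_GPUS)]
# -> [0,4,8,12] | [1,5,9,13] | [2,6,10,14] | [3,7,11,15]
paired = [[2 * g, 2 * g + 1, SEQ - 2 - 2 * g, SEQ - 1 - 2 * g]
          for g in range(N_GPUS)]
# -> [0,1,14,15] | [2,3,12,13] | [4,5,10,11] | [6,7,8,9]

print([work(b) for b in striped])  # [28, 32, 36, 40] -> balanced only approximately
print([work(b) for b in paired])   # [34, 34, 34, 34] -> exactly balanced
```

On this count, the striped layout is off by a constant per GPU (negligible for long sequences), while pairing block i with block 2n-1-i balances the causal work exactly, matching the folded-mask picture above.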