
How to increase performance when there is a constraint with MM_TRANSPOSE_A=OFF #31

Closed
charliechou1001 opened this issue Nov 4, 2022 · 9 comments


@charliechou1001

Hi,

I'm using a GEMM with tile size 512 for xtot and ytot. But when I use C = A*B mode (MM_TRANSPOSE_A=OFF), since A is column-major and needs to be transposed on chip, there is a constraint that "the number of inner tiles must be greater than or equal to the outer tile size in N". This limits the number of PEs and the parallelism of the B input, which in turn limits GEMM performance. I also thought of computing C = A^T * B instead, but that would change the computation mathematically, so it doesn't work for me.

Is there any way to relax the constraint, e.g. by increasing the bandwidth of A? And how is the constraint formula derived, and what does it mean? Thank you!

@definelicht
Contributor

The ideal solution is to increase your overall tile size, but you could also try shifting some of your tile size from N to M. This is not ideal for communication volume, but it keeps the same number of inner tiles while decreasing the number of columns that must be read from A per tile. For example, you could try -DMM_MEMORY_TILE_SIZE_N=256 -DMM_MEMORY_TILE_SIZE_M=1024.

@charliechou1001
Author

Thanks. The problem for me is that there are three GEMM computation tasks, the conventional GEMM accelerator size is 512, and the three GEMMs have different transpose formats.

My understanding of the transposed-A implementation is that it uses parallel aSplit buffers to read columns of A, and then ConvertWidthA sequentially feeds the elements of the outer tile of A. The data feed time for an outer tile of A is MM_MEMORY_TILE_SIZE_N, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN * kInnerTileM). So the data feed time should be no more than the computation time. Is that understanding right?

If so, would it work if I modified the transpose function to increase the parallelism of the data feed for tile A? If that cut the feed time in half, could I then double the number of PEs or the parallelism factor of B?

@definelicht
Contributor

What sizes do your matrices have? Even if you chain multiple multiplications of different sizes, it should still be possible for them to have different tile sizes. The problem, of course, is that if your matrix is only 512x512, you cannot make the tile bigger without losing computational efficiency.

The data feed time for an outer tile of A is MM_MEMORY_TILE_SIZE_N, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN * kInnerTileM). So the data feed time should be no more than the computation time. Is that understanding right?

Yes, correct

If so, would it work if I modified the transpose function to increase the parallelism of the data feed for tile A? If that cut the feed time in half, could I then double the number of PEs or the parallelism factor of B?

Yes, this could work. It's not super simple, since you would need to vectorize the buffering of A between and inside all the PEs. But it's definitely doable (the initial version actually had a variant of this feature, but I removed it for simplicity since I never used it).

You could give it a try :-)

@charliechou1001
Author

charliechou1001 commented Nov 5, 2022

...since you would need to vectorize the buffering of A between and inside all the PEs....

Would you mind explaining a bit more why we also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

Or is there a simpler way? For example, could I modify only the data feeder of A to double the feeding speed of the outer tile of A? In that case there would be no extra modification to the PEs, and the constraint would become kInnerTileN * kInnerTileM >= MM_MEMORY_TILE_SIZE_N / 2.

@definelicht
Contributor

Would you mind explaining a bit more why we also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

If you want to speed up the buffering of A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs are the width of a single element, but they would need to carry vectors instead.

Or is there any simpler way, for example, I only modify the data feeder of A to double the outer tile of A feeding speed?

I'm not sure how you could do this without widening the data path that transports elements of A?

@charliechou1001
Author

If you want to speed up the buffering of A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs are the width of a single element, but they would need to carry vectors instead.

I got it. That sounds like setting the granularity of the PEs to more than one, which would double performance. Would you mind sharing that code?

I'm not sure how you could do this without widening the data path that transports elements of A?

Since I want to keep that constraint satisfied, if I double the number of PEs or double the parallelism of B, then kInnerTileN * kInnerTileM is halved. To satisfy the constraint, I would also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism for B, but it uses more resources than the method you proposed.

@definelicht
Contributor

The code doesn't exist anymore. The idea was that this would be supported when I wrote the transpose code, but in the end I decided I didn't need it :-)

@charliechou1001
Author

Thanks. And what do you think of the other method? Does it work?

Since I want to keep that constraint satisfied, if I double the number of PEs or double the parallelism of B, then kInnerTileN * kInnerTileM is halved. To satisfy the constraint, I would also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism for B, but it uses more resources than the method you proposed.

@definelicht
Contributor

It doesn't matter whether your parallelism is in A or B; it only matters how many columns of A must be loaded for every outer tile. You either need to increase the tile size or allow multiple elements of A to be buffered per cycle.
