
How to increase performance when there is a constraint with MM_TRANSPOSE_A=OFF #31

Closed
charliechou1001 opened this issue Nov 4, 2022 · 9 comments


@charliechou1001

Hi,

I'm using a GEMM with tile size 512 for xtot and ytot. But when I use C = A*B mode (MM_TRANSPOSE_A=OFF), since A is column-major and needs to be transposed on chip, there is a constraint that "the number of inner tiles must be greater than or equal to the outer tile size in N". This limits the number of PEs and the parallelism of the B input, which in turn limits GEMM performance. I also thought of computing C = A^T * B instead, but that would change the computation mathematically, so it doesn't work for me.

Is there any way to relax the constraint, e.g. by increasing the bandwidth of A? And how is the constraint formula derived, and what does it mean? Thank you!

@definelicht
Contributor

The ideal solution is to increase your overall tile size, but you could also try shifting some of your tile size from N to M. This is not ideal for communication volume, but it keeps the same number of inner tiles while decreasing the number of columns that must be read from A per tile. For example, you could try -DMM_MEMORY_TILE_SIZE_N=256 -DMM_MEMORY_TILE_SIZE_M=1024.

@charliechou1001
Author

Thanks. The problem for me is that there are three GEMM computation tasks, the conventional GEMM accelerator size is 512, and the three GEMMs have different transpose formats.

My understanding of the transposed-A implementation is that it uses parallel aSplit buffers to read columns of A, and then ConvertWidthA sequentially feeds the elements of the outer tile of A. The data feed time for an outer tile of A is MM_MEMORY_TILE_SIZE_N, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN * kInnerTileM). So the data feed time should be no more than the computation time. Is that understanding right?

If so, would it work if I modified the transpose function to increase the parallelism of the data feed for tile A? If that cut the feed time in half, could I then double the number of PEs or the parallelism factor of B?

@definelicht
Contributor

What sizes do your matrices have? Even if you chain multiple multiplications of different sizes, it should still be possible for them to have different tile sizes. The problem, of course, is that if your matrix is only 512x512, you cannot make the tile bigger without losing computational efficiency.

The data feed time for an outer tile of A is MM_MEMORY_TILE_SIZE_N, and the computation time of the PE array is the inner tile multiplication time (kInnerTileN * kInnerTileM). So the data feed time should be no more than the computation time. Is that understanding right?

Yes, correct

If so, would it work if I modified the transpose function to increase the parallelism of the data feed for tile A? If that cut the feed time in half, could I then double the number of PEs or the parallelism factor of B?

Yes, this could work. It's not super simple, since you would need to vectorize the buffering of A between and inside all the PEs. But it's definitely doable (the initial version actually had a variant of this feature, but I removed it for simplicity since I never used it).

You could give it a try :-)

@charliechou1001
Author

charliechou1001 commented Nov 5, 2022

...since you would need to vectorize the buffering of A between and inside all the PEs....

Would you mind explaining a bit more why we also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

Or is there a simpler way? For example, could I modify only the data feeder of A to double the feeding speed of the outer tile of A? In that case there would be no extra modification to the PEs, and the constraint would become kInnerTileN * kInnerTileM >= MM_MEMORY_TILE_SIZE_N / 2.

@definelicht
Contributor

Would you mind explaining a bit more why we also need to modify the PEs to vectorize the buffering of A? I'm a bit lost.

If you want to speed up the buffering of A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs are the width of a single element, but they would need to carry vectors instead.

Or is there any simpler way, for example, I only modify the data feeder of A to double the outer tile of A feeding speed?

I'm not sure how you could do this without widening the data path that transports elements of A?

@charliechou1001
Author

If you want to speed up the buffering of A, you need to send multiple elements of A per cycle through the systolic array of processing elements. Currently these FIFOs are the width of a single element, but they would need to carry vectors instead.

I got it. That sounds like setting the granularity of the PEs to more than one, which would double performance. Would you mind sharing that code?

I'm not sure how you could do this without widening the data path that transports elements of A?

Since I want to keep that constraint satisfied, if I double the number of PEs or double the parallelism of B, then kInnerTileN * kInnerTileM is halved. To satisfy the constraint, I would also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism for B, but it uses more resources than the method you proposed.

@definelicht
Contributor

The code doesn't exist anymore. The idea was that this would be supported when I wrote the transpose code, but in the end I decided I didn't need it :-)

@charliechou1001
Author

Thanks. And what do you think of the other method? Does it work?

Since I want to keep that constraint satisfied, if I double the number of PEs or double the parallelism of B, then kInnerTileN * kInnerTileM is halved. To satisfy the constraint, I would also need to halve the latency of loading and transposing the outer tile of A. The method I had in mind increases performance by adding more PEs and more parallelism for B, but it uses more resources than the method you proposed.

@definelicht
Contributor

It doesn't matter whether your parallelism is in A or B; it only matters how many columns of A must be loaded for every outer tile. You either need to increase the tile size or allow multiple elements of A to be buffered per cycle.
