Free Running Kernel #8

definelicht · 2021-12-22T14:12:04Z

In this PR:

Made the core computation of matrix multiplication a free-running kernel by moving the control flow on the FIFOs out into separate feeder and drainer functions.
Introduced more cycles of latency to the adds in Karatsuba to help place and route.
Fixed an issue with initializing floats in Random.cpp.
Added some more diagnostics to the kernel test function.
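For readers unfamiliar with the pattern, the feeder / compute / drainer split in the first item can be modelled in plain C++ roughly as follows. This is a minimal software sketch, not the code in this PR: `Fifo`, `Feed`, `Compute`, `Drain` and `RunPipeline` are hypothetical names, `int` stands in for the packed floating-point type, and the loop bound in `Compute` exists only so the model terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical software stand-in for a hardware FIFO (not the stream type
// used in this PR).
template <typename T>
class Fifo {
  public:
    void Push(const T &value) { queue_.push(value); }
    T Pop() {
        assert(!queue_.empty());  // Real hardware would stall here instead.
        T value = queue_.front();
        queue_.pop();
        return value;
    }

  private:
    std::queue<T> queue_;
};

// Feeder: all decisions about what enters the pipeline live here.
void Feed(const std::vector<int> &a, const std::vector<int> &b,
          Fifo<int> &a_in, Fifo<int> &b_in) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a_in.Push(a[i]);
        b_in.Push(b[i]);
    }
}

// Compute: exactly one Pop per input and one Push per iteration, with no
// condition guarding any stream access.
void Compute(std::size_t iterations, Fifo<int> &a_in, Fifo<int> &b_in,
             Fifo<int> &c_out) {
    for (std::size_t i = 0; i < iterations; ++i) {
        const int a = a_in.Pop();
        const int b = b_in.Pop();
        c_out.Push(a * b);
    }
}

// Drainer: owns the control flow for writing results back.
std::vector<int> Drain(std::size_t count, Fifo<int> &c_out) {
    std::vector<int> result;
    result.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        result.push_back(c_out.Pop());
    }
    return result;
}

// Runs the three stages back to back; in hardware they would run concurrently.
std::vector<int> RunPipeline(const std::vector<int> &a,
                             const std::vector<int> &b) {
    Fifo<int> a_in, b_in, c_out;
    Feed(a, b, a_in, b_in);
    Compute(a.size(), a_in, b_in, c_out);
    return Drain(a.size(), c_out);
}
```

Calling `RunPipeline({1, 2, 3}, {4, 5, 6})` multiplies the inputs element-wise. The point of the split is that only `Feed` and `Drain` contain data-dependent decisions, while `Compute` touches its streams unconditionally.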

ChrisPattison · 2021-12-22T21:08:21Z

device/MatrixMultiplication.cpp

+                        const PackedFloat b = (n1 == 0) ? b_read : b_buffer[m1];
+                        const PackedFloat c = (k == 0) ? c_read : c_buffer[n1 * kTileSizeM + m1];
+                        a_buffer = a;
+                        b_buffer[m1] = b;
                        // Ignore contributions from out-of-bound indices
                        const bool in_bounds = (n0 * kTileSizeN + n1 < size_n) && (m0 * kTileSizeM + m1 < size_m);
                        // Meat of the computation
                        const auto res = in_bounds ? MultiplyAccumulate(a, b, c) : c;


Somewhat of a micro-optimization, but it would probably be better to pass 0 into the MAC instead of bypassing the MAC completely. Something like:
MultiplyAccumulate(in_bounds ? a : 0, in_bounds ? b : 0, c);
That way we don't need a bunch of registers that have to bypass the multiplier.
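The two variants under discussion can be compared in isolation. The sketch below uses a hypothetical integer `MultiplyAccumulate` as a stand-in for the PR's function on `PackedFloat`; `MacWithBypass`, `MacWithZeroedOperands` and `CheckVariantsAgree` are illustrative names. For the real floating-point type, the equivalence additionally depends on `0 * x + c` returning `c` exactly, which would need to be checked against the actual arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical integer stand-in for MultiplyAccumulate on PackedFloat.
std::int64_t MultiplyAccumulate(std::int64_t a, std::int64_t b,
                                std::int64_t c) {
    return a * b + c;
}

// Variant in the diff: choose between the MAC result and c after the MAC,
// so c has to be carried around the multiplier.
std::int64_t MacWithBypass(bool in_bounds, std::int64_t a, std::int64_t b,
                           std::int64_t c) {
    return in_bounds ? MultiplyAccumulate(a, b, c) : c;
}

// Variant suggested in the review: zero the operands before the MAC, so
// every value takes the same path through the multiplier.
std::int64_t MacWithZeroedOperands(bool in_bounds, std::int64_t a,
                                   std::int64_t b, std::int64_t c) {
    return MultiplyAccumulate(in_bounds ? a : 0, in_bounds ? b : 0, c);
}

// Self-check that both variants agree in and out of bounds.
void CheckVariantsAgree() {
    assert(MacWithBypass(true, 3, 4, 5) == MacWithZeroedOperands(true, 3, 4, 5));
    assert(MacWithBypass(false, 3, 4, 5) == MacWithZeroedOperands(false, 3, 4, 5));
}
```

In this integer model, both variants return `c` unchanged for out-of-bounds indices; the difference is only where the selection happens relative to the multiplier.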

ChrisPattison

I left a few comments. Let me know what you think.

ChrisPattison · 2021-12-22T21:09:42Z

device/MatrixMultiplication.cpp

-                        }
+                        const PackedFloat a_read = a_in.Pop();
+                        const PackedFloat b_read = b_in.Pop();
+                        const PackedFloat c_read = c_in.Pop();


Does this work if you're on an out-of-bounds index? It seems like it might desynchronize where you are in the matrix multiply vs. where you're pulling data from.
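The concern can be illustrated with a small software model. It makes no claim about how the feeder in this PR actually behaves: `FeedPadded`, `FeedValidOnly` and `ConsumeUnconditionally` are hypothetical names, and `std::queue<int>` stands in for the hardware FIFO. The sketch only shows that an unconditional `Pop` per iteration stays in sync when the producer pushes one value per padded index, and runs dry when the producer skips out-of-bounds indices.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Feeder A (assumption): pushes one value for every index of the padded
// range, substituting zero for out-of-bounds indices.
std::queue<int> FeedPadded(const std::vector<int> &data,
                           std::size_t padded_size) {
    assert(padded_size >= data.size());
    std::queue<int> fifo;
    for (std::size_t i = 0; i < padded_size; ++i) {
        fifo.push(i < data.size() ? data[i] : 0);
    }
    return fifo;
}

// Feeder B (assumption): pushes only the in-bounds values.
std::queue<int> FeedValidOnly(const std::vector<int> &data) {
    std::queue<int> fifo;
    for (const int value : data) {
        fifo.push(value);
    }
    return fifo;
}

// Consumer that pops once per padded index, with no bounds check on the pop.
// Returns true only if the FIFO held exactly padded_size values; a dry FIFO
// would stall real hardware, which is modelled here as returning false.
bool ConsumeUnconditionally(std::queue<int> fifo, std::size_t padded_size,
                            int &sum) {
    sum = 0;
    for (std::size_t i = 0; i < padded_size; ++i) {
        if (fifo.empty()) {
            return false;
        }
        sum += fifo.front();
        fifo.pop();
    }
    return fifo.empty();
}
```

With three valid values and a padded size of four, the padded feeder keeps the consumer in step and the zero padding leaves the sum unaffected, whereas the valid-only feeder leaves the consumer one element short.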

definelicht added 8 commits December 13, 2021 16:12

d9d9a1d Remove control flow on stream accesses from compute kernel to make it free running
de6a4ce Add latency to adders, add loop labels
e8dff8e Don't make noise for unused labels
a8bf0d7 Fix performance estimate
f008c10 Print specifications when running the kernel
72c96cf Allow enabling profiling. Bottom out at 18 bits, which is the native size of Xilinx DSP multiplications
28f7cf0 Fix critical fuckup in passing matrix sizes
987136d Fix comparison to MPFR stemming from wrong precision used in reference implementation

definelicht requested a review from ChrisPattison December 22, 2021 14:14

ChrisPattison reviewed Dec 22, 2021

ChrisPattison reviewed Dec 22, 2021

ChrisPattison requested changes Dec 22, 2021

ChrisPattison merged commit 58a3e80 into main Dec 29, 2021

ChrisPattison mentioned this pull request Dec 30, 2021

TestHardware failing #9

Closed