Free Running Kernel #8
…size of Xilinx DSP multiplications
const PackedFloat b = (n1 == 0) ? b_read : b_buffer[m1];
const PackedFloat c = (k == 0) ? c_read : c_buffer[n1 * kTileSizeM + m1];
a_buffer = a;
b_buffer[m1] = b;
// Ignore contributions from out-of-bound indices
const bool in_bounds = (n0 * kTileSizeN + n1 < size_n) && (m0 * kTileSizeM + m1 < size_m);
// Meat of the computation
const auto res = in_bounds ? MultiplyAccumulate(a, b, c) : c;
This is somewhat of a micro-optimization, but it would probably be better to pass 0 into the MAC instead of bypassing the MAC completely. Something like:
const auto res = MultiplyAccumulate(in_bounds ? a : 0, in_bounds ? b : 0, c);
That way we don't need a bunch of registers that have to bypass the multiplier.
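To make the suggestion concrete, here is a minimal sketch comparing the two variants. It assumes a scalar `float` stand-in for `PackedFloat` and a hypothetical `MultiplyAccumulate` that computes `a * b + c`; the real one in this PR operates on packed vectors and may differ. `MacBypass` and `MacMasked` are illustrative names, not functions from the codebase.

```cpp
// Hypothetical scalar stand-in for the PR's MultiplyAccumulate (assumption:
// it computes a * b + c; the real one works on PackedFloat vectors).
inline float MultiplyAccumulate(float a, float b, float c) { return a * b + c; }

// Variant in the diff: a select after the MAC forwards c unchanged when the
// index is out of bounds, so c needs a path around the multiplier.
inline float MacBypass(bool in_bounds, float a, float b, float c) {
  return in_bounds ? MultiplyAccumulate(a, b, c) : c;
}

// Suggested variant: zero both operands before the MAC, so out-of-bounds
// indices contribute 0 * 0 + c and no separate bypass path is required.
inline float MacMasked(bool in_bounds, float a, float b, float c) {
  return MultiplyAccumulate(in_bounds ? a : 0.0f, in_bounds ? b : 0.0f, c);
}
```

Under these assumptions the two variants return equal values for any `c`. Zeroing both operands, rather than only one, also avoids `0 * inf` producing a NaN when the out-of-bounds operand holds garbage. Whether this actually saves registers depends on how the HLS tool schedules the design, so it is worth checking the synthesis report.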
I left a few comments. Let me know what you think.
}
const PackedFloat a_read = a_in.Pop();
const PackedFloat b_read = b_in.Pop();
const PackedFloat c_read = c_in.Pop();
Does this work if you're on an out-of-bounds index? It seems like it might desynchronize where you are in the matrix multiply from where you're pulling data.
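The concern can be illustrated with a small software model. This is only a sketch under an assumption the PR does not confirm: that the upstream producer pushes in-bounds elements only, with no padding. `Consume` and `MakeStream` are hypothetical helpers, and `std::queue` stands in for the hardware stream.

```cpp
#include <queue>
#include <vector>

// Hypothetical producer: pushes `count` in-bounds values 0..count-1, no padding.
std::queue<int> MakeStream(int count) {
  std::queue<int> q;
  for (int i = 0; i < count; ++i) q.push(i);
  return q;
}

// Hypothetical consumer: walks `tiles` tiles of `tile_size` slots, of which
// only the first `valid` slots per tile are in bounds. With gate_pop == false
// it pops on every slot, like the unconditional Pop() calls in the diff.
std::vector<int> Consume(std::queue<int> stream, int tiles, int tile_size,
                         int valid, bool gate_pop) {
  std::vector<int> seen;  // value observed at each in-bounds slot
  for (int t = 0; t < tiles; ++t) {
    for (int i = 0; i < tile_size; ++i) {
      const bool in_bounds = i < valid;
      int value = 0;
      if ((in_bounds || !gate_pop) && !stream.empty()) {
        value = stream.front();
        stream.pop();
      }
      if (in_bounds) seen.push_back(value);
    }
  }
  return seen;
}
```

With two tiles of four slots and three valid slots each, the gated consumer sees the six values in order, while the ungated one discards a valid element on the padded slot and every later read is shifted. If the producers in this PR pad out-of-bounds slots themselves, the unconditional pops stay aligned and the concern does not apply.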
In this PR:
Random.cpp