FD2 underperforms w/ multiple kernel binaries #8857

Closed
pgkeller opened this issue May 27, 2024 · 2 comments

@pgkeller
Contributor

FD2 doesn't pack kernel binaries.

The plan is for FD2.2 to use a ring buffer. W/ the ring buffer, the binaries can be packed in DRAM and written w/ one linear write.

However, w/ multiple kernel groups, we'd still pay the DRAM latency for each kernel group, which will dominate for typical kernel sizes.
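To make the latency argument concrete, here is a rough cost model. It is only a sketch: the function names, the parameters, and the assumption that a packed read overlaps its DRAM round trips so the latency is paid roughly once are all hypothetical, not measurements from FD2.

```cpp
// Hypothetical cost model, not measured data.
// latency_ns:   fixed round-trip cost of one DRAM read (assumed)
// bytes_per_ns: sustained DRAM bandwidth (assumed)

// One read per kernel group, each waiting on the previous one:
// the fixed latency is paid once per group.
double serial_read_ns(int groups, double bytes_per_group,
                      double latency_ns, double bytes_per_ns) {
    return groups * (latency_ns + bytes_per_group / bytes_per_ns);
}

// All reads issued back to back by one packed command:
// the fixed latency is (assumed to be) paid roughly once.
double packed_read_ns(int groups, double bytes_per_group,
                      double latency_ns, double bytes_per_ns) {
    return latency_ns + groups * bytes_per_group / bytes_per_ns;
}
```

W/ small binaries the `latency_ns` term dominates the serial case, which is why the number of kernel groups matters more than the total byte count.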

To address both now and post ring buffer, we could:

  1. Create a packed read command which reads from multiple DRAM locations
  2. Either modify the existing packed write command to handle larger amounts of data (currently limited to 1 page per write) or create a packed_write_large command for larger transfers
  3. Today, do a packed read followed by a packed_write for all binaries. Post ring buffer, we could pack the binaries for a kernel group in DRAM; single kernel groups would then do a read and linear write, while multiple kernel groups would do a packed read and packed write (w/ fewer total transfers)
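
A host-side sketch of the proposal above, to show the data flow only. Everything here is hypothetical: the struct names, the fields, and the use of plain vectors as stand-ins for DRAM and per-core memory are assumptions for illustration and do not reflect the actual dispatch command format.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sub-command: one source range in (simulated) DRAM.
struct PackedReadSubCmd {
    std::size_t dram_offset;  // byte offset of one kernel binary
    std::size_t length;       // size of that binary in bytes
};

// Hypothetical sub-command: one destination range in a (simulated) core.
struct PackedWriteSubCmd {
    std::size_t dst_core;     // index into the simulated core memories
    std::size_t dst_offset;   // byte offset inside that core's memory
    std::size_t length;       // bytes to write; not capped at one page
};

// Gather every source range into one contiguous staging buffer, in order.
std::vector<std::uint8_t> packed_read(const std::vector<std::uint8_t>& dram,
                                      const std::vector<PackedReadSubCmd>& subs) {
    std::size_t total = 0;
    for (const auto& s : subs) total += s.length;

    std::vector<std::uint8_t> staging;
    staging.reserve(total);
    for (const auto& s : subs) {
        if (s.dram_offset > dram.size() || s.length > dram.size() - s.dram_offset)
            throw std::out_of_range("packed_read: range outside DRAM");
        auto first = dram.begin() + static_cast<std::ptrdiff_t>(s.dram_offset);
        staging.insert(staging.end(), first,
                       first + static_cast<std::ptrdiff_t>(s.length));
    }
    return staging;
}

// Scatter the staging buffer to the destinations, consuming it front to back.
void packed_write_large(const std::vector<std::uint8_t>& staging,
                        const std::vector<PackedWriteSubCmd>& subs,
                        std::vector<std::vector<std::uint8_t>>& cores) {
    std::size_t cursor = 0;
    for (const auto& s : subs) {
        if (s.length > staging.size() - cursor)
            throw std::out_of_range("packed_write_large: staging exhausted");
        auto& mem = cores.at(s.dst_core);
        if (s.dst_offset > mem.size() || s.length > mem.size() - s.dst_offset)
            throw std::out_of_range("packed_write_large: range outside core");
        std::copy_n(staging.begin() + static_cast<std::ptrdiff_t>(cursor), s.length,
                    mem.begin() + static_cast<std::ptrdiff_t>(s.dst_offset));
        cursor += s.length;
    }
}
```

In this sketch one `packed_read` followed by one `packed_write_large` replaces a read/write pair per binary, which is the "fewer total transfers" point in item 3.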
@pgkeller
Contributor Author

Note that adding "force_inline" to a few prefetcher routines nets a ~5% perf gain w/ 5 binaries.
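
For readers unfamiliar w/ forced inlining, a minimal sketch of the idea. The macro name `SKETCH_FORCE_INLINE` and the helper `round_up_to_page` are hypothetical and are not the prefetcher routines referred to above; `always_inline` is the GCC/Clang function attribute.

```cpp
#include <cstdint>

// Hypothetical macro: ask GCC/Clang to inline even when the optimizer's
// heuristics would not; fall back to plain inline elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#define SKETCH_FORCE_INLINE inline
#endif

// Hypothetical small helper of the kind that sits on a hot path:
// round a byte count up to a whole number of pages.
SKETCH_FORCE_INLINE std::uint32_t round_up_to_page(std::uint32_t bytes,
                                                   std::uint32_t page_size) {
    return ((bytes + page_size - 1) / page_size) * page_size;
}
```

Forcing the inline removes call overhead for each invocation, which adds up when the routine runs once per binary.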

@pgkeller
Contributor Author

pgkeller commented Jul 3, 2024

Done, perf gains of >5x

@pgkeller pgkeller closed this as completed Jul 3, 2024