Optimizing parallel generates past elaboration #2550
Comments
Yes, icache is the main performance bottleneck at present. Generates are currently expanded because dotted references need to be able to scope into them, and because all internal optimizations assume a flattened design. For example, there may be clock buffers inside a generate, and it's critical to flatten these out for performance. Perhaps we keep the arraying only when the current module inliner doesn't inline. Once the duplicate code is made, there is an attempt to find the remaining parallelism again (V3Combine); however, this is very naive at present and doesn't take advantage of arraying (as it doesn't exist). If done well, it could also recognize other similar code even when not arrayed (e.g. instantiating cpu0 and cpu1 with nearly duplicate interiors). Anyhow, it would be wonderful to get improvements in this area. If you have a good design, I'd suggest manually editing the output of Verilator to see what performance you might get (hacking so that even if results are slightly wrong you can still measure performance). Then once we have a good strategy we can discuss how to achieve that output.
Hey Wilson. Sorry for the delay on this, here are some thoughts:
Overview
After looking into this, I think the best return for performance per engineer hour would come from taking another stab at V3Combine. A decent slice of poor icache perf seems to come from unique but near-equivalent functions being created by verilator for each copy of a module that is generated.
Benchmark
I made a micro-benchmark at https://github.com/1024bees/verilator_example. This stamps down a bunch of tiny verilog modules in a nested for loop. The work at the core of this benchmark looks something like:
By default, verilator creates a unique copy of this
Default performance of this benchmark on my machine (Ryzen5 2600 based):
Performance after converting all of the generated unique functions into the
TL;DR we get roughly a 2x increase in performance in this micro-benchmark by mapping all variants of the
Solutions
From what you've described, V3Combine looks like it could be extended (re-written?) to try to collapse functions created from modules stamped down via generates in a similar way that the
That being said, the path of extending V3Combine seems to be treating a symptom (repeated C code) rather than the cause (redundant work verilating common blocks within a design). IMO, the ideal fix would look something like
but this idea comes from having little insight into how Verilator scheduling works :) What do you think?
I have little doubt that there is a significant speedup from this; 2x seems quite plausible, and #2631 has additional examples of this. I would not change scheduling to optimize this; rather, I would rewrite to improve and merge V3Combine and V3Reloop. I believe that will give most of the gain and is much easier to build and get correct. Also, multithreading has counter-goals that might want to keep the code expanded, so it should be performed before this. As an idea consider this
One idea to discuss is that the new version would convert all statements into variable and constant "argumented" statements, e.g.
This step would also make a graph indicating which statements need the same position relative to others like V3Split does. Then create a hash of the generic functions, put all of the generic functions with same signature together (as allowed by the reordering legal graph), and order the constants low-to-high.
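The "argumenting" and signature-grouping steps described above can be sketched roughly as follows. This is only an illustration in Python over statement text, not Verilator code: the statement strings, the `$v`/`$c` placeholder syntax, and the names `argument` and `group_by_signature` are all hypothetical, and the reordering-legality graph is left out (the sketch assumes every reordering is legal).

```python
import re
from collections import defaultdict

# Identifiers and integer literals are the only things that get "argumented".
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+")

def argument(stmt):
    """Turn one statement into (generic signature, variables, constants)."""
    variables, constants = [], []

    def repl(match):
        tok = match.group(0)
        if tok[0].isdigit():
            constants.append(int(tok))
            return f"$c{len(constants) - 1}"
        if tok not in variables:
            variables.append(tok)
        return f"$v{variables.index(tok)}"

    signature = TOKEN.sub(repl, stmt)
    return signature, tuple(variables), tuple(constants)

def group_by_signature(stmts):
    """Bucket statements by signature and order each bucket's constants low-to-high.

    Assumption: any reordering is legal; a real pass would consult the
    legality graph before moving statements next to each other.
    """
    groups = defaultdict(list)
    for stmt in stmts:
        signature, variables, constants = argument(stmt)
        groups[signature].append((constants, variables))
    return {sig: sorted(calls) for sig, calls in groups.items()}

# Hypothetical flattened statements, deliberately out of order.
STATEMENTS = [
    "a[2] = b[2] + 1",
    "x = y & 255",
    "a[0] = b[0] + 1",
    "a[1] = b[1] + 1",
]
GROUPS = group_by_signature(STATEMENTS)
```

Here the three `a[...] = b[...] + 1` statements land in one bucket keyed by `$v0[$c0] = $v1[$c1] + $c2`, sorted by their constants, which is the shape a later looping step would want to see.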
Now look for same signatures and make a loop (as allowed by reordering legal graph)
Any arguments that the loop/subroutines don't change are left in place
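A minimal sketch of the loop-forming step, under some loud assumptions: statements sharing a signature have already been collected and sorted by their constants, only unit-stride runs are recognized, and the legality graph is ignored. The `$cN` placeholder syntax, the function name `make_loop`, and the emitted C-like text are hypothetical and only meant to show the idea.

```python
import re

def make_loop(signature, calls):
    """Try to turn same-signature statements into one loop.

    signature: statement text with constants replaced by $c0, $c1, ...
    calls: the constant tuples for each statement, sorted low-to-high.
    Returns C-like loop text, or None if the constants are not a simple
    unit-stride run (assumption: that is the only pattern handled here).
    """
    if len(calls) < 2:
        return None
    replacements = []
    for pos, base in enumerate(calls[0]):
        column = [call[pos] for call in calls]
        if all(value == base for value in column):
            # The loop does not change this argument: leave it in place.
            replacements.append(str(base))
        elif column == list(range(base, base + len(calls))):
            # The argument counts up by one: index it with the loop variable.
            replacements.append("i" if base == 0 else f"i + {base}")
        else:
            return None
    body = re.sub(r"\$c(\d+)", lambda m: replacements[int(m.group(1))], signature)
    return f"for (int i = 0; i < {len(calls)}; ++i) {{ {body}; }}"

# Hypothetical input: three copies of the same generic statement.
LOOP = make_loop("a[$c0] = b[$c1] + $c2", [(0, 0, 1), (1, 1, 1), (2, 2, 1)])
```

With this input the two index arguments become the loop variable while the unchanged `+ 1` stays in place, giving `for (int i = 0; i < 3; ++i) { a[i] = b[i] + 1; }`.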
This algorithm can also be used to make small functions that implement the common code, but this might slow down the runtime if over-used, so we'd probably only want to do that in cases where the function to generate has a substantial number of statements (say 100 instructions or something). The same algorithm could then be repeated at a larger level than statements, e.g. whole functions. You wouldn't need the graph to say what's legal at that level. We would then do the normal V3Combine looking for common functions. (Perhaps V3Combine as-is remains as a later stage.) Note that what V3Combine does is a more specific function-level version of this, and what V3Reloop does is a very specific signature version of looping, which is why I suggest this replaces those ideas.
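Repeating the idea at the whole-function level might look something like the sketch below. It is an assumption-heavy illustration, not how V3Combine works: function bodies are modeled as lists of statement strings, only integer literals are argumented, and `min_statements` stands in for the "say 100 instructions" threshold (it is set to 2 here purely so the toy example triggers). All names are hypothetical.

```python
import re
from collections import defaultdict

CONST = re.compile(r"\b\d+\b")

def generic_body(body):
    """Replace every integer literal in a function body with a placeholder."""
    constants = []

    def repl(match):
        constants.append(int(match.group(0)))
        return f"$c{len(constants) - 1}"

    generic = tuple(CONST.sub(repl, stmt) for stmt in body)
    return generic, tuple(constants)

def combine_functions(functions, min_statements=2):
    """Find functions whose bodies differ only in their constants.

    Returns {generic body: [(function name, constants), ...]} for every
    generic body shared by at least two functions. Bodies shorter than
    min_statements are skipped, since a call may cost more than it saves.
    """
    groups = defaultdict(list)
    for name, body in functions.items():
        if len(body) < min_statements:
            continue
        generic, constants = generic_body(body)
        groups[generic].append((name, constants))
    return {g: users for g, users in groups.items() if len(users) > 1}

# Hypothetical near-duplicate functions, e.g. from cpu0 and cpu1.
FUNCTIONS = {
    "eval_cpu0": ["out[0] = inp[0] + 1", "flag[0] = 1"],
    "eval_cpu1": ["out[1] = inp[1] + 1", "flag[1] = 1"],
    "eval_misc": ["x = y"],
}
SHARED = combine_functions(FUNCTIONS)
```

Each entry in the result is a candidate for one shared function taking the listed constants as arguments; `eval_misc` is ignored because it falls below the size threshold.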
This sounds like a good plan; after which pass do you think the new V3Combine + V3Reloop (I'll call it V3Collapse) should occur?
I think I'd do it where V3Reloop is now. I don't know why V3Combine is so early, but it might be good to leave that where it is at least until the new process is understood.
Hey folks,
About a year ago I noticed i-cache being a bottleneck for my simulator performance (#2042).
Now that I have some time, one optimization that I think could help with icache perf is preserving parallel generates past elaboration. Right now I'm imagining something like
verilating into something like
Does this seem remotely feasible? FWIW, I'd be willing to spend a bit of time on this