Output more friendly to incremental compilation #2140
I have been using icecream + ccache to speed up builds of the C++ output by Verilator. This works to some extent when running with -Oi to disable module inlining, in order to give ccache a chance. The caching however becomes useless after moderate changes, or if the properties (name, size) of any named port/state/signal in any module change, which causes a full rebuild. Reading the output of Verilator, I see at least 2 reasons why the compiler cache has a hard time:
I wonder if it would be reasonable to attempt to remedy this (under a new Verilator switch), by:
The theory is then that the only C++ that would need to be recompiled is the modules that changed, their dependencies (which are limited to everything below them in the hierarchy unless there are interface changes or hierarchical references), and the top-level _eval and friends, but not all the other modules. I am sure I am missing a lot of detail; can you think of anything that would make this whole idea a non-starter? I would also be interested to know if you think this would not be as useful as I think it would, for any reason.
I'm curious what sort of speed-up you see with inlining off; I wouldn't have expected much improvement, since, as you noted, any change to the symbol table (inlining or not) forces a recompile.
A caution: there are likely bugs hiding, as -Oi isn't tested much.
It's an interesting debate as to whether there should be any uses of "this". The reason that vlSymsp wasn't used for everything (e.g. removing this and having only vlSymsp) is an attempt to let V3Combine look for code that is used in several instantiations and change it to be relative to the instance. I think there are great opportunities for better combining to help the Icache, which also helps compile time, as there is less to compile.
But anyhow, an alternative to "this" would certainly be to build static functions that take two arguments (Syms vlSymsp, module_class __Vthis). If you are willing, I'd suggest this as the first experiment, as it will likely help compile time and is likely runtime-performance neutral, so it wouldn't need a switch.
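A minimal sketch of what such explicit-argument functions might look like (all type and function names here are invented for illustration; real Verilator-generated code differs):

```cpp
// Invented stand-ins for generated types; the point is only the calling
// convention: pass the symbol table and the instance explicitly rather
// than relying on a hidden `this`.
struct Vtop_module {
    int counter = 0;  // stand-in for per-instance state
};

struct Vtop__Syms {
    Vtop_module mod;  // instance storage lives in the symbol table
};

// A free static function taking (Syms, instance) explicitly; identical
// instances of the same module class could then share this one body.
static void Vtop_module__sequent(Vtop__Syms* vlSymsp, Vtop_module* __Vthis) {
    (void)vlSymsp;          // other scopes would be reached via vlSymsp
    __Vthis->counter += 1;  // per-instance state via the explicit pointer
}
```

Since the body no longer depends on a hidden `this`, sharing the function across instances is just a matter of passing a different `__Vthis`.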
Once you do that, I think a second-order change we may need would be to improve the csplit and function numbering. As it is now, if an 'early' function gets larger or is inserted, this can cause not just the file with its implementation to recompile, but all of the 'later' functions to be numbered differently and/or land in different files, basically recompiling everything. I think a good algorithm for this would be to hash all function contents and name the functions based on that. Then determine the total size, use csplit to determine the number of files, then put each function into those N files based on its hashed name. Now if a function's internals change, you have only one or two recompilations: the "old" file it was in and the "new" file it goes into.
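As a rough illustration of the hashing scheme described above (the hash choice and helper names are assumptions, not Verilator's actual naming):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers: name each emitted function by a hash of its
// contents, then bucket it into one of N output files by that hash, so a
// change to one function perturbs at most two files.
static std::uint64_t fnv1aHash(const std::string& contents) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : contents) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
    return h;
}

// Which of the nFiles output files a function with these contents lands
// in. The mapping is stable: unchanged contents always map to one file.
static std::size_t outputFileOf(const std::string& fnContents,
                                std::size_t nFiles) {
    return nFiles ? static_cast<std::size_t>(fnv1aHash(fnContents) % nFiles)
                  : 0;
}
```

Editing a function then changes only its old bucket and its new one; every other output file keeps byte-identical contents and hits ccache.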
Verilator once had something like you suggest, using pimpl (pointer-to-implementation) pointers in __Syms.h. The problem was that every scope reference required a pointer indirection, which was bad for runtime performance.
It always seems broken that C/C++ has such a horrible need to have variables in the header, with only pimpl as a workaround. A hack we could experiment with is to have Verilator determine the size of each symbol table section and use a union to make only certain sections visible when compiling. E.g. 100 variables would be a union of those variables and, say, 100 bytes; we assert the full set of variables fits in the 100 bytes. If a C file needs to see inside this structure/module, it uses the full structure variable names. If a C file doesn't need to look inside, it uses ifdefs to see only the 100-byte part of the union. If Verilator always rounds the space needed by the variables up, so e.g. 128 bytes are used, a submodule that does not refer to the inner variables will not need to recompile when variables change, as long as they don't blow past 128 bytes. The great thing about this is it will be as fast as the existing code, as there's still a single vlSymsp, but incremental compiles are much better; in theory better than pimpl, as we can make the exposure as fine as every unique output file, not just module boundaries.
The ugly part is Verilator needs to be very good at calculating the hidden size, and this needs to include predicting padding. The good thing is this is fairly easy to make optional: to turn hiding off, we just turn on all the EXPOSED ifdefs, and none of the consumer code changes at all.
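A tiny sketch of how the union/ifdef exposure might look in a generated header (the names, the macro, and the 128-byte reservation are all hypothetical):

```cpp
// Files that need the members define VL_EXPOSE_Vsub; everyone else sees
// only an opaque, fixed-size blob, so member changes that still fit in the
// reservation don't force the non-exposed files to recompile.
#define VL_EXPOSE_Vsub 1  // this file opts in to the full view

struct Vsub_vars {  // the real members, visible only when exposed
    int state;
    int count;
};

struct Vsub {
    union {
#if VL_EXPOSE_Vsub
        Vsub_vars v;                // full view for files that need it
#endif
        unsigned char opaque[128];  // fixed-size view for everyone else
    };
};

// The emitting tool must check that the real members fit in the
// reservation, and that the outside-visible size never changes:
static_assert(sizeof(Vsub_vars) <= 128, "members overflow reserved space");
static_assert(sizeof(Vsub) == 128, "opaque view pins the struct size");
```

An exposed file accesses `obj.v.state` directly; a non-exposed file only ever holds or points at the 128-byte blob, so its object layout is stable.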
Also note issue #1572, which, if we got it working well, would allow you to manually partition large sections of the design for completely independent compilation. BTW, for some very large designs this has been done manually, e.g. only a single CPU is Verilated, then multiple CPUs are manually inserted into an ASIC under SystemC or a manual C++ wrapper.
I once did a small experiment to reduce Verilating time and C++ compile time for a larger design which includes multiple instances of big modules.
When I disabled the per-instance (per-scope) optimization in V3Gate, all instances of a module shared the same C++ function.
I am wondering if the following approach works.
Verilating time and memory consumption are expected to be greatly reduced for a repeated design such as a multi-core processor. C++ compile time will also be reduced.
Even for a design without repetition, ccache is expected to work well, because an untouched module is expected to produce the same C++ code.
(BTW, I have seen the Verilated C++ code differ from time to time for the same Verilog, though the simulation result is identical.)
Disabling inlining is my workaround for what you described as csplit moving things around and hence making ccache useless. ccache then helps during development iterations and brings the build time from 40 minutes on a bad day to 3 minutes if I only change logic.
Agreed, that is what I was trying to get at. I might not get to this in the next few weeks due to other commitments but I will make an attempt when I have a bit of bandwidth.
The idea of using conditional unions to expose only the required bits of the symbol table is very interesting. Regarding the problem of Verilator having to predict padding, an alternative would be to invoke the compiler on a stub file from the Verilated makefile as a first compilation step, which can yield the exact structure sizes used by that particular compiler, and then feed these back into the actual compilation phase. It is a bit of a gross hack, but it would mean that Verilator need not be aware of the padding habits of various compilers.
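A sketch of what such a probe translation unit might contain (the struct and function here are stand-ins; a real probe would include the generated headers instead of defining its own types):

```cpp
#include <cstdio>

// Stand-in for a generated per-module variable block.
struct Vsub_vars {
    char flag;
    double accum;  // any padding between members is compiler-chosen
    int count;
};

// First-pass probe: report each structure's exact compiler-computed size.
// A makefile rule could run this, capture the "name size" output in a
// sizes file, and feed it back into the second (real) compilation phase,
// so the generator never has to predict padding itself.
static void reportSizes(std::FILE* out) {
    std::fprintf(out, "Vsub_vars %zu\n", sizeof(Vsub_vars));
}
```

The sizes are then guaranteed to match whatever compiler and flags the user's makefile selects, at the cost of one extra compile-and-run step.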
Regarding manual partitioning and protect-lib, that is certainly an option for homogeneous architectures or for chip level 'loosely coupled' blocks, but sadly what I am fighting with at the moment is a highly heterogeneous accelerator where there isn't much re-use of functionality, and the whole thing is tightly interconnected so while partitioning it manually would be possible, it would certainly not be a trivial exercise.
I have noticed that too; eyeballing it, it was mostly the clock sensitivity lists being ordered differently, but it seems to only happen when a rebuild would be needed anyway due to symbol table changes.