
Output more friendly to incremental compilation #2140

Open
gezalore opened this issue Jan 26, 2020 · 4 comments

Comments

Contributor

@gezalore gezalore commented Jan 26, 2020

I have been using icecream + ccache to speed up builds of the C++ output by Verilator. This works to some extent when running with -Oi to disable module inlining, in order to give ccache a chance. The caching however becomes useless after moderate changes, or if the properties (name, size) of any named port/state/signal in any module change, which causes a full re-build. Reading the output of Verilator, I see at least 2 reasons why the compiler cache has a hard time:

  1. There are internal implementation functions declared in the .h files of modules, which change as the internals change.
  2. The fully enumerated design hierarchy in the *__Syms symbol table holds all instances as members, and its definition is itself included in the sources of every module.

I wonder if it would be reasonable to attempt to remedy this (under a new Verilator switch), by:

  1. Emitting all implementation functions as static (passing explicit instances instead of this as needed), in companion *__Internals.{h,cpp} files.
  2. Holding the instances in *__Syms.h via forward declared pointers.
  3. In implementation files, include only the header files of the required modules, rather than everything (which we do today via *__Syms.h).

The theory is then that the only C++ that would need to be re-compiled is the modules that changed, their dependencies (which are limited to everything below them in the hierarchy, unless there are interface changes or hierarchical references), and the top-level _eval and friends, but not all the other modules. I am sure I am missing a lot of detail; can you think of anything that would make this whole idea a non-starter? I would also be interested to know if you think this would not be as useful as I think it would, for any reason.
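To make point 2 concrete, here is a minimal sketch of what holding instances via forward-declared pointers could look like; the Vdesign_sub and Vdesign__Syms names are hypothetical illustrations, not actual Verilator output:

```cpp
#include <cstddef>

// Vdesign__Syms.h (sketch): a forward declaration is all that is needed
// to hold the instance by pointer, so files that include this header do
// not depend on Vdesign_sub's member layout.
class Vdesign_sub;  // full definition would live in Vdesign_sub.h only

struct Vdesign__Syms {
    Vdesign_sub* TOP__sub;  // pointer member; size is layout-independent
};

// The symbol table's size no longer changes when Vdesign_sub's internals
// change, so dependents of this header need not recompile.
constexpr std::size_t symsSize() { return sizeof(Vdesign__Syms); }
```

The trade-off, discussed below, is that every scope reference now goes through a pointer indirection.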

@gezalore gezalore added the new label Jan 26, 2020
@gezalore gezalore changed the title Output mode friendly to incremental compilation Output more friendly to incremental compilation Jan 26, 2020
@wsnyder

Member

@wsnyder wsnyder commented Jan 26, 2020

I have been using icecream + ccache to speed up builds of the C++ output
by Verilator. This works to some extent when running with -Oi to disable
module inlining in order to give ccache a chance.

I'm curious what sort of speed-up you see with inlining off; I wouldn't have expected much improvement, since, as you noted, any change to the symbol table (inlining or not) recompiles everything.

A caution: there are likely bugs hiding, as -Oi isn't tested much.

  1. There are internal implementation functions declared in the .h files of modules,
    which change as the internals change.
  2. The fully enumerated design hierarchy in the *__Syms symbol table holds all
    instances as members, and its definition is itself included in the sources of every
    module.

I wonder if it would be reasonable to attempt to remedy this (under a new Verilator
switch), by:

  1. Emitting all implementation functions as static (passing explicit instances
    instead of this as needed), in companion *__Internals.{h,cpp} files.
  2. Holding the instances in *__Syms.h via forward declared pointers.
  3. In implementation files, include only the header files of the required modules,
    rather than everything (which we do today via *__Syms.h).

It's an interesting debate as to whether there should be any uses of "this". The reason that vlSymsp wasn't used for everything (i.e. removing "this" and having only vlSymsp) is an attempt to let V3Combine look for code that is used in several instantiations and change it to be relative to the instance. I think there are great opportunities for better combining to help the icache, which also helps compiling, as there is less to compile.

But anyhow, an alternative to "this" would certainly be to build static functions that take two arguments (Syms vlSymsp, module_class __Vthis). If you are willing, I'd suggest this as the first experiment, as it will likely help compile time and is likely runtime-performance neutral, so wouldn't need a switch.
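A minimal sketch of that experiment (names are hypothetical; Verilator's real emitted names and signatures differ):

```cpp
struct Vmod__Syms;  // forward declaration suffices for the signature

// Stand-in for a generated module class.
struct Vmod {
    int counter = 0;
};

// What was a member function using "this" becomes a static function
// taking the symbol table and the instance explicitly:
static void Vmod__sequent__TOP__1(Vmod__Syms* vlSymsp, Vmod* __Vthis) {
    (void)vlSymsp;       // unused in this tiny sketch
    ++__Vthis->counter;  // every member access goes through __Vthis
}
```

Because the functions are static, their declarations can move out of the module's .h file, so callers no longer recompile when internals change.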

Once you do that, I think a second-order change we may need is to improve the csplit and function numbering. As it is now, if an 'early' function gets larger or a new one is inserted, this can cause not just the file with its implementation, but all of the 'later' functions, to be numbered differently and/or land in different files, basically recompiling everything. I think a good algorithm would be to hash each function's contents and name it based on that hash. Then determine the total size, use csplit to determine the number of files, and put each function into those N files based on its hashed name. Now if a function's internals change, you have only one or two recompilations: the "old" file it was in, and the "new" file it goes into.
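A sketch of that hashing scheme (using std::hash for brevity; a real implementation would want a hash that is stable across compiler versions and runs, such as SHA-1):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Name a function after a hash of its emitted body, so an unchanged
// function keeps the same name regardless of what is inserted before it.
std::string hashedName(const std::string& body) {
    return "__Vfn_" + std::to_string(std::hash<std::string>{}(body));
}

// Deterministically assign a hashed name to one of nFiles output files,
// so an unchanged function also stays in the same file.
std::size_t fileBucket(const std::string& name, std::size_t nFiles) {
    return std::hash<std::string>{}(name) % nFiles;
}
```

One caveat of bucketing purely by hash is uneven file sizes; a real splitter would likely still balance buckets by function size.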

Verilator once had something like you suggest, using pimpl (pointer-to-implementation) pointers in __Syms.h. The problem was that every scope reference required a pointer indirection, which was bad for performance. The compilers could de-alias many of them, but the indirection still required a lot of extra instructions, and all of the additional registers tended to cause more register spills; once spilled, the compiler reloaded the base pointers. As it is now, there are "this" and vlSymsp, those two sit in registers, and the code looks good.

It always seems broken that C/C++ has such a horrible need to have variables in the header, with only pimpl as a workaround. A hack we could experiment with is to have Verilator determine the size of each symbol table and use a union to make only certain sections visible when compiling. E.g. 100 variables would be a union of those variables and, say, 100 bytes; we assert that the full set of variables fits in the 100 bytes. If a C file needs to see inside this structure/module, it uses the full structure variable names. If a C file doesn't need to look inside, it uses ifdefs to see only the 100-byte part of the union. If Verilator always rounds the space needed by the variables up, so that e.g. 128 bytes are used, a submodule that does not refer to the inner variables will not need to recompile when variables change, as long as they don't blow past 128 bytes. The great thing about this is that it will be as fast as the existing code, as there's still a single vlSymsp, but incremental compiles are much better; in theory better than pimpl, as we can make the exposure as fine-grained as every unique output file, not just module boundaries.

    VM_MODULE(foo) {
       union {
           uint64_t hidden[2];
#ifdef foo_EXPOSED
           struct {
              int var_a;
              int var_b;
           } exposed;
#endif
        } vars;
    }

    in constructor:  assert(sizeof(vars.exposed) <= sizeof(vars.hidden))
       (also assert it's within maybe 20% of optimal so we don't have bugs
       where hidden is way too big)

The ugly part is that Verilator needs to be very good at calculating the hidden size, and this needs to include predicting padding. The good thing is that this is fairly easy to make optional: we use "hidden[0]", turn on all the EXPOSED ifdefs, and none of the consumer code changes at all.
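A compilable variant of the sketch above, assuming exposure is controlled by a per-module define (the FOO_EXPOSED macro and Vfoo struct are hypothetical names):

```cpp
#include <cstdint>

#define FOO_EXPOSED 1  // this translation unit is allowed to look inside

// Stand-in for a generated module's variable section.
struct Vfoo {
    union {
        uint64_t hidden[2];  // reserved space: 16 bytes, layout-stable
#ifdef FOO_EXPOSED
        struct {
            int var_a;
            int var_b;
        } exposed;           // real members, visible only when exposed
#endif
    } vars;
};

// The constructor-time check from the sketch: the exposed members must
// fit within the reserved hidden space.
bool fitsReservation() {
    Vfoo v{};
    return sizeof(v.vars.exposed) <= sizeof(v.vars.hidden);
}
```

Translation units compiled without FOO_EXPOSED see only the fixed-size hidden array, so they need not recompile when var_a/var_b change, provided the members still fit in the reservation.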

Also note issue #1572, which, if we got it working well, would allow you to manually partition large sections of the design for completely independent compilation. BTW, for some very large designs this has been done manually, e.g. only a single CPU is Verilated, then multiple CPUs are manually inserted into an ASIC under SystemC or a manual C++ wrapper.

@wsnyder

Member

@wsnyder wsnyder commented Jan 26, 2020

@toddstrader who might be interested.

@yTakatsukasa

Contributor

@yTakatsukasa yTakatsukasa commented Jan 27, 2020

I once did a small experiment to reduce Verilating time and C++ compile time for a larger design which includes multiple instances of big modules.

When I disabled per-instance (per-scope) optimization in V3Gate, all instances of a module shared the same C++ functions. Because the experiment was preliminary, there may be some potential issues:

  • Instance-specific data (e.g. $display("%m"))
  • Ordering issues may happen if the optimization is simply skipped

I am wondering if the following approach works.

  • V3Scope creates just a single scope for a user-specified module, and creates black-box scopes for the rest of that module's instances. A black-box scope has just the ports and is otherwise almost empty.
    • For small modules, per-scope optimization and inlining should be applied as they are today.
  • V3Gate optimizes just that single scope, but does not optimize across the scope boundary, so the single scope remains applicable to the other instances.

Verilating time and memory consumption are expected to be greatly reduced for a repeated design such as a multi-core processor. C++ compile time will also be reduced.

Even for a design without repetition, ccache is expected to work well, because an untouched module is expected to produce the same C++ code.

(BTW, I saw a phenomenon where the Verilated C++ code differs from run to run for the same Verilog, though the simulation result is identical. I suspect the difference is caused by some map/set which uses a pointer as its key.)
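For anyone chasing that nondeterminism, a sketch of the difference (the Node type is hypothetical): containers ordered by raw pointer iterate in allocation-address order, which can vary between runs, while keying by a stable attribute is reproducible:

```cpp
#include <map>
#include <string>
#include <vector>

struct Node { std::string name; };  // stand-in for an internal AST node

// Deterministic: the iteration order depends only on the stable name,
// never on where each Node happened to be allocated.
std::vector<std::string> orderedNames(const std::vector<Node*>& nodes) {
    std::map<std::string, Node*> byName;
    for (Node* n : nodes) byName[n->name] = n;
    std::vector<std::string> out;
    for (const auto& kv : byName) out.push_back(kv.first);
    return out;
}
```

Replacing pointer-keyed ordering with something like this would also help ccache, since identical Verilog would then emit byte-identical C++.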

@gezalore

Contributor Author

@gezalore gezalore commented Jan 27, 2020

I'm curious what sort of speed up you see with inlining off, I wouldn't have expected much improvement, since as you noted any change to the symbol table (inlining or not) recompiles.

Disabling inlining is my workaround for what you described as csplit moving things around and hence making ccache useless. ccache then helps during development iterations and brings the build time from 40 minutes on a bad day down to 3 minutes if I only change logic. ccache -s tells me my long-term hit rate is about 50%, so it's not too shabby. The model is only about 25% faster with inlining, but the dev tests take far less time to run than the compilation, so I'm happy with that. As for -Oi being untested, fair enough, but I haven't noticed any issues.

But anyhow, an alternative to "this" would certainly be to build static functions that take two arguments (Syms vlSymsp, module_class __Vthis). If you are willing I'd suggest this as the first experiment, as it will likely help compile time and is likely runtime performance neutral so wouldn't need a switch.

Agreed, that is what I was trying to get at. I might not get to this in the next few weeks due to other commitments but I will make an attempt when I have a bit of bandwidth.

The idea of using conditional unions to expose only the required bits of the symbol table is very interesting. Regarding the problem of Verilator having to predict padding, an alternative would be to invoke the compiler on a stub file from the Verilated makefile as a first compilation step, which can yield the exact structure sizes used by that particular compiler, and then feed these back into the actual compilation phase. It is a bit of a gross hack, but would mean that Verilator need not be aware of the padding habits of various compilers.
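A sketch of such a stub, assuming the makefile compiles and runs it first and then parses the printed sizes (the Probe struct is an arbitrary stand-in for a generated symbol-table section):

```cpp
#include <cstddef>
#include <cstdio>

// Stand-in for a generated structure whose padded size we want to learn
// from the actual compiler rather than predict in Verilator.
struct Probe {
    char   flag;   // alignment of 'wide' typically forces padding here
    double wide;
    int    count;
};

// The compiler's actual padded size, including any alignment padding.
std::size_t probeSize() { return sizeof(Probe); }

// The stub would print one line per structure in a machine-readable
// form for the build system to parse and feed back into Verilator.
void printProbeSizes() {
    std::printf("Probe %zu\n", probeSize());
}
```

The exact size is ABI-dependent, which is precisely why asking the compiler beats predicting it.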

Regarding manual partitioning and protect-lib, that is certainly an option for homogeneous architectures or for chip level 'loosely coupled' blocks, but sadly what I am fighting with at the moment is a highly heterogeneous accelerator where there isn't much re-use of functionality, and the whole thing is tightly interconnected so while partitioning it manually would be possible, it would certainly not be a trivial exercise.

(BTW, I saw a phenomenon where the Verilated C++ code differs from run to run for the same Verilog, though the simulation result is identical. I suspect the difference is caused by some map/set which uses a pointer as its key.)

I have noticed that too; eyeballing it, it was mostly the clock sensitivity lists being ordered differently, but it seems to only happen when a rebuild would be needed anyway due to symbol table changes.
