Improve Verilating speed for large designs with repetition #2182

Open
yTakatsukasa opened this issue Mar 1, 2020 · 5 comments
Contributor

@yTakatsukasa yTakatsukasa commented Mar 1, 2020

This is related to #2140 (comment).
I'd like to collect information before touching code. Any advice is appreciated.

Goal

I'd like to improve verilating speed and C++ compilation time for large designs, especially those with repetition such as many-core SoCs.

Approach

Skip per-scope optimization for large modules that are instantiated repeatedly.

  • modules can be marked by a pragma
  • the marked modules will not be optimized per scope (per instance)
    • the modules will be scoped just once and optimized internally. (A)
    • no constant propagation across the module boundary in V3Gate, so that the module stays compatible across scopes
    • other instantiations of the module get a dummy AstScope
    • after descope, the optimized result of (A) will be shared by all scopes
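
As a rough illustration of the sharing idea in the list above, here is a minimal C++ sketch. It is not what Verilator emits; `CoreState`, `coreEval`, and `ManyCoreTop` are hypothetical names invented for this example. The point is only that each instance keeps its own state while one implementation, optimized once, serves all of them.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-instance state of a marked module (illustrative only).
struct CoreState {
    uint32_t pc = 0;
};

// Single shared implementation: optimized once inside the module, with no
// constants propagated in from any particular parent instance.
inline void coreEval(CoreState& s, bool reset) {
    s.pc = reset ? 0u : s.pc + 4u;
}

// Hypothetical top level: N instances carry separate state but run the same code.
struct ManyCoreTop {
    std::vector<CoreState> cores;
    explicit ManyCoreTop(int n) : cores(static_cast<std::size_t>(n)) {}
    void eval(bool reset) {
        for (CoreState& c : cores) coreEval(c, reset);
    }
};
```

With 128 instances the generated code for `coreEval` exists once, so code size no longer grows with the instance count; the cost is that nothing specific to one instance can be folded into it.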

Background

I found that the passes after scoping consume a lot of CPU time and memory.
It's quite natural that per-instance (per-scope) optimization needs such resources.

On the other hand, per-instance optimization of a large block such as a processor is expensive.
Optimizing inside the processor module is surely worthwhile, but uniquifying the processor module for each instance seems less beneficial.

Disabling per-instance optimization for such large modules will

  • Reduce verilation time and memory consumption (so larger designs can be handled)
  • Produce smaller C++ code, since all instances of such a module share one C++ implementation
  • Reduce compilation time thanks to the smaller C++ code
  • Improve simulation speed thanks to the smaller code (better instruction cache hit ratio)

Concerns

  • Ordering issues
  • How should instance-specific operations such as $display("%m") be treated?
    • maybe leave such modules unsupported
  • pitfalls I overlooked

Statistics

Here are the statistics for my test design (128 instances of the processor core https://github.com/lowRISC/ibex)

Performance Statistics:

  Stage, Elapsed time (sec), 001_cells              0.000000
  Stage, Elapsed time (sec), 002_linkparse          0.004940
  Stage, Elapsed time (sec), 003_linkdot            0.009097
  Stage, Elapsed time (sec), 004_linkresolve        0.001068
  Stage, Elapsed time (sec), 005_linklvalue         0.000379
  Stage, Elapsed time (sec), 006_link               0.000421
  Stage, Elapsed time (sec), 007_param              0.071113
  Stage, Elapsed time (sec), 008_paramlink          0.196825
  Stage, Elapsed time (sec), 009_deadModules        0.031841
  Stage, Elapsed time (sec), 010_width              0.051896
  Stage, Elapsed time (sec), 011_widthcommit        0.012004
  Stage, Elapsed time (sec), 012_const              0.009935
  Stage, Elapsed time (sec), 013_assertpre          0.015303
  Stage, Elapsed time (sec), 014_assert             0.005705
  Stage, Elapsed time (sec), 015_wraptop            0.000046
  Stage, Elapsed time (sec), 016_const              0.007756
  Stage, Elapsed time (sec), 017_split_var          0.004342
  Stage, Elapsed time (sec), 018_split_var          0.000204
  Stage, Elapsed time (sec), 019_dearray            0.004603
  Stage, Elapsed time (sec), 020_linkdot            0.315843
  Stage, Elapsed time (sec), 021_begin              0.063780
  Stage, Elapsed time (sec), 022_tristate           0.029940
  Stage, Elapsed time (sec), 023_unknown            0.018268
  Stage, Elapsed time (sec), 024_inline             0.134317
  Stage, Elapsed time (sec), 025_linkdot            0.290331
  Stage, Elapsed time (sec), 026_const              0.012163
  Stage, Elapsed time (sec), 027_deadDtypes         0.009144
  Stage, Elapsed time (sec), 028_inst               0.010735
  Stage, Elapsed time (sec), 029_const              0.006795
  Stage, Elapsed time (sec), 030_scope              1.033281  <== most passes before scope take less than 100ms
  Stage, Elapsed time (sec), 031_linkdot            0.469761
  Stage, Elapsed time (sec), 032_const              0.084470
  Stage, Elapsed time (sec), 033_deadDtypesScoped   0.201601
  Stage, Elapsed time (sec), 034_case               8.617434
  Stage, Elapsed time (sec), 035_task               0.859307
  Stage, Elapsed time (sec), 036_name               0.321294
  Stage, Elapsed time (sec), 037_unroll             0.642761
  Stage, Elapsed time (sec), 038_slice              0.258600
  Stage, Elapsed time (sec), 039_const              0.560202
  Stage, Elapsed time (sec), 040_life               0.371504
  Stage, Elapsed time (sec), 041_table              0.522432
  Stage, Elapsed time (sec), 042_const              0.251064
  Stage, Elapsed time (sec), 043_deadDtypesScoped   0.464217
  Stage, Elapsed time (sec), 044_active             0.162216
  Stage, Elapsed time (sec), 045_split              3.231433
  Stage, Elapsed time (sec), 046_splitas            0.233790
  Stage, Elapsed time (sec), 047_gate               2.716224
  Stage, Elapsed time (sec), 048_const              0.301392
  Stage, Elapsed time (sec), 049_deadAllScoped      0.583631
  Stage, Elapsed time (sec), 050_reorder            0.930066
  Stage, Elapsed time (sec), 051_delayed            0.262375
  Stage, Elapsed time (sec), 052_activetop          1.464380
  Stage, Elapsed time (sec), 053_order              2.999547
  Stage, Elapsed time (sec), 054_genclk             0.361603
  Stage, Elapsed time (sec), 055_clock              0.380473
  Stage, Elapsed time (sec), 056_const              0.555295
  Stage, Elapsed time (sec), 057_life               0.803892
  Stage, Elapsed time (sec), 058_life_post          1.450731
  Stage, Elapsed time (sec), 059_const              0.562564
  Stage, Elapsed time (sec), 060_deadAllScoped      0.801386
  Stage, Elapsed time (sec), 061_changed            0.213276
  Stage, Elapsed time (sec), 062_descope            0.966051
  Stage, Elapsed time (sec), 063_localize           0.712902
  Stage, Elapsed time (sec), 064_combine            1.124508
  Stage, Elapsed time (sec), 065_const              0.543583
  Stage, Elapsed time (sec), 066_deadAll            0.480752
  Stage, Elapsed time (sec), 067_clean              1.065068
  Stage, Elapsed time (sec), 068_premit             0.593945
  Stage, Elapsed time (sec), 069_expand             1.096621
  Stage, Elapsed time (sec), 070_const_cpp          1.388702
  Stage, Elapsed time (sec), 071_subst              0.967372
  Stage, Elapsed time (sec), 072_const_cpp          0.645099
  Stage, Elapsed time (sec), 073_deadAll            0.551125
  Stage, Elapsed time (sec), 074_reloop             0.178762
  Stage, Elapsed time (sec), 075_depth              0.396143
  Stage, Elapsed time (sec), 076_cast               1.020172
  Stage, Elapsed time (sec), 077_cuse               0.015112
    
  Stage, Memory (MB), 001_cells                    18.867188
  Stage, Memory (MB), 002_linkparse                18.867188
  Stage, Memory (MB), 003_linkdot                  20.769531
  Stage, Memory (MB), 004_linkresolve              20.769531
  Stage, Memory (MB), 005_linklvalue               20.769531
  Stage, Memory (MB), 006_link                     20.769531
  Stage, Memory (MB), 007_param                    73.175781
  Stage, Memory (MB), 008_paramlink                109.164062
  Stage, Memory (MB), 009_deadModules              109.164062
  Stage, Memory (MB), 010_width                    109.164062
  Stage, Memory (MB), 011_widthcommit              109.164062
  Stage, Memory (MB), 012_const                    109.164062
  Stage, Memory (MB), 013_assertpre                109.164062
  Stage, Memory (MB), 014_assert                   109.164062
  Stage, Memory (MB), 015_wraptop                  109.164062
  Stage, Memory (MB), 016_const                    109.164062
  Stage, Memory (MB), 017_split_var                109.164062
  Stage, Memory (MB), 018_split_var                109.164062
  Stage, Memory (MB), 019_dearray                  109.164062
  Stage, Memory (MB), 020_linkdot                  136.750000
  Stage, Memory (MB), 021_begin                    136.750000
  Stage, Memory (MB), 022_tristate                 136.750000
  Stage, Memory (MB), 023_unknown                  136.750000
  Stage, Memory (MB), 024_inline                   150.285156
  Stage, Memory (MB), 025_linkdot                  150.285156
  Stage, Memory (MB), 026_const                    150.285156
  Stage, Memory (MB), 027_deadDtypes               150.285156
  Stage, Memory (MB), 028_inst                     150.285156
  Stage, Memory (MB), 029_const                    150.285156
  Stage, Memory (MB), 030_scope                    760.695312   <== memory consumption increased
  Stage, Memory (MB), 031_linkdot                  763.949219
  Stage, Memory (MB), 032_const                    763.949219
  Stage, Memory (MB), 033_deadDtypesScoped         763.949219
  Stage, Memory (MB), 034_case                     932.300781
  Stage, Memory (MB), 035_task                     1814.582031
  Stage, Memory (MB), 036_name                     1814.847656
  Stage, Memory (MB), 037_unroll                   2046.562500
  Stage, Memory (MB), 038_slice                    2046.562500
  Stage, Memory (MB), 039_const                    2046.562500
  Stage, Memory (MB), 040_life                     2047.046875
  Stage, Memory (MB), 041_table                    2047.046875
  Stage, Memory (MB), 042_const                    2047.046875
  Stage, Memory (MB), 043_deadDtypesScoped         2047.046875
  Stage, Memory (MB), 044_active                   2047.046875
  Stage, Memory (MB), 045_split                    2794.187500
  Stage, Memory (MB), 046_splitas                  2794.187500
  Stage, Memory (MB), 047_gate                     2798.250000
  Stage, Memory (MB), 048_const                    2798.250000
  Stage, Memory (MB), 049_deadAllScoped            2798.441406
  Stage, Memory (MB), 050_reorder                  2798.441406
  Stage, Memory (MB), 051_delayed                  2798.441406
  Stage, Memory (MB), 052_activetop                2983.550781
  Stage, Memory (MB), 053_order                    3224.660156
  Stage, Memory (MB), 054_genclk                   3224.660156
  Stage, Memory (MB), 055_clock                    3224.660156
  Stage, Memory (MB), 056_const                    3224.660156
  Stage, Memory (MB), 057_life                     3224.660156
  Stage, Memory (MB), 058_life_post                3224.660156
  Stage, Memory (MB), 059_const                    3224.660156
  Stage, Memory (MB), 060_deadAllScoped            3224.660156
  Stage, Memory (MB), 061_changed                  3224.660156
  Stage, Memory (MB), 062_descope                  3224.660156
  Stage, Memory (MB), 063_localize                 3224.660156
  Stage, Memory (MB), 064_combine                  3224.660156
  Stage, Memory (MB), 065_const                    3224.660156
  Stage, Memory (MB), 066_deadAll                  3224.660156
  Stage, Memory (MB), 067_clean                    3358.722656
  Stage, Memory (MB), 068_premit                   3363.492188
  Stage, Memory (MB), 069_expand                   3364.136719
  Stage, Memory (MB), 070_const_cpp                3364.136719
  Stage, Memory (MB), 071_subst                    3364.136719
  Stage, Memory (MB), 072_const_cpp                3364.136719
  Stage, Memory (MB), 073_deadAll                  3364.636719
  Stage, Memory (MB), 074_reloop                   3364.636719
  Stage, Memory (MB), 075_depth                    3364.636719
  Stage, Memory (MB), 076_cast                     3364.636719
  Stage, Memory (MB), 077_cuse                     3364.636719

Member

@wsnyder wsnyder commented Mar 2, 2020

Improving large design handling would be very valuable. A lot of the potential gain may come from reducing the gcc time (versus Verilation time). BTW 22 seconds is pretty fast in the grand context, but I presume that was just a test case to show the point.

I'm not sure of the tradeoffs of what you suggest versus say making "protect-like" modules. This idea should be much faster in runtime, but likely significant effort. Maybe the ideas can be combined, that is you mark pragmas that are the repeated module boundaries, and Verilator understands how to make the repeated sub-modules at those boundaries with separate sub-runs?

One problem I see is items like JTAG and similar chains that wind through the modules. The order of evaluation will need to be correct across the little pieces, and called in a different order per repetition, or it will run too slow. I think you'll need to order the "meta-module" considering all submodule repetitions at once.
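
The ordering concern can be made concrete with a toy model. The sketch below is not Verilator code; `Link` and `settle` are hypothetical names. Each `Link` stands for one repeated module with a combinational pass-through, like one hop of a scan chain, and `settle` re-evaluates the chain until nothing changes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical repeated module: a combinational pass-through (tdo follows tdi).
struct Link {
    bool tdi = false;
    bool tdo = false;
    void eval() { tdo = tdi; }
};

// Evaluates the links in the given order until no output changes.
// Returns the number of passes over the chain that were needed.
inline int settle(std::vector<Link>& chain, const std::vector<std::size_t>& order,
                  bool input) {
    int passes = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        ++passes;
        for (std::size_t i : order) {
            Link& link = chain[i];
            // Link 0 is driven by the chain input, the others by their predecessor.
            link.tdi = (i == 0) ? input : chain[i - 1].tdo;
            const bool before = link.tdo;
            link.eval();
            changed = changed || (link.tdo != before);
        }
    }
    return passes;
}
```

For a three-link chain, evaluating in dependency order {0, 1, 2} settles in 2 passes (one to propagate, one to confirm), while the reverse order {2, 1, 0} needs 4. If every repetition is ordered in isolation, the value crossing the instances may need such extra passes, which is the slowdown described above.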

%m ties in with VPI/DPI, which must know the full design scope. I would think you can maintain an AstScope as currently for the whole design hierarchy. A given scope would then know it's a repetition and point at another module. Currently every class that composes a module already can have multiple instances with different names.
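
A minimal sketch of that last point, with one class serving several instances that differ only in their name. `SubModule` and `displayHello` are hypothetical and do not reflect the classes Verilator actually generates.

```cpp
#include <string>
#include <utility>

// Hypothetical module class shared by every instance; only the name differs.
class SubModule {
    std::string m_scopeName;  // full hierarchical path, i.e. what %m would print
public:
    explicit SubModule(std::string scopeName) : m_scopeName(std::move(scopeName)) {}
    // Stand-in for $display("%m: hello"): formats using this instance's scope.
    std::string displayHello() const { return m_scopeName + ": hello"; }
};
```

Two objects built as `SubModule("top.i_core0")` and `SubModule("top.i_core1")` run the same code but report different scopes, so instance-specific output does not by itself force per-instance code.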

I'd suggest maybe handcrafting what the output C++ code would look like for a trivial design and we can discuss. Then propose the steps and algorithms.

Member

@toddstrader toddstrader commented Mar 2, 2020

I'm not sure of the tradeoffs of what you suggest versus say making "protect-like" modules. This idea should be much faster in runtime, but likely significant effort. Maybe the ideas can be combined, that is you mark pragmas that are the repeated module boundaries, and Verilator understands how to make the repeated sub-modules at those boundaries with separate sub-runs?

Related to this, is your biggest pain point the initial time it takes you to compile the entire design? Or is it when you are iterating on a file in the edit-compile-test loop? I'd also be glad to see improvements on either front, but the latter would impact my daily workflow more and I would expect it to have a lower floor on execution time since there's less (new) work to be done.

Contributor Author

@yTakatsukasa yTakatsukasa commented Mar 3, 2020

Thanks for the comments.

BTW 22 seconds is pretty fast in the grand context, but I presume that was just a test case to show the point.

Right. This is just an example that can be shown in public.
I have seen a much bigger design that could not be verilated due to huge memory consumption, but I don't remember the detailed numbers.
I am using protect-lib to handle that design. (Thank you, Todd!)

The JTAG example is a good point. It seems I underestimated the problem.
I realized I need to learn more about the ordering code.
I'm quite new to the backend of Verilator.
Once I have studied it, I will write out the C++ code that would be generated.

Related to this, is your biggest pain point the initial time it takes you to compile the entire design? Or is it when you are iterating on a file in the edit-compile-test loop?

Well, I hadn't separated them in my mind, but it is important to ask myself that and check my use case.
At the moment I always run verilator and g++ from scratch, and I don't use ccache.
I am also interested in #1520.

Contributor Author

@yTakatsukasa yTakatsukasa commented Mar 9, 2020

After exploring the ordering code, I think a protect-lib based approach is better, because that feature is already working great and enhancing it may benefit more users.

I think the following items will help users with large designs (including me).

Usability for the verilating-speed use case

  • hierarchical verilation based on protect-lib can be done just by adding a pragma to modules
    • simpler command line

C++ compile time

  • deterministic C++ code generation (more ccache friendly)
  • skip emitting code if the AST is unchanged (checking more than just the timestamp)
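
One simple way to get a similar effect is sketched below. Note that it swaps in a different check from the one proposed above: it compares the text about to be emitted with the file already on disk, rather than comparing ASTs, and `writeIfChanged` is a hypothetical helper, not an existing Verilator function. A file that is not rewritten keeps its timestamp, so make does not recompile it.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Writes `content` to `path` only when the file is missing or differs.
// Returns true if the file was (re)written, false if it was left untouched.
inline bool writeIfChanged(const std::string& path, const std::string& content) {
    {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            const std::string existing((std::istreambuf_iterator<char>(in)),
                                       std::istreambuf_iterator<char>());
            if (existing == content) return false;  // identical: keep the old timestamp
        }
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << content;
    return true;
}
```

This only pays off if code generation is deterministic, since any spurious difference in the emitted text forces a rewrite and a recompile.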

Simulation Performance

Contributor Author

@yTakatsukasa yTakatsukasa commented Mar 13, 2020

I'd like to start with usability: simple hierarchical verilation based on the protect-lib feature.

Here is what I am thinking.

Assume the design looks like this. Note that sub_a, sub_b, and sub_c are marked hier_block.

module sub_a() /*verilator hier_block*/;
endmodule

module sub_b() /*verilator hier_block*/;
   sub_a i_sub_a();
endmodule

module sub_c() /*verilator hier_block*/;
endmodule

module sub_d();
endmodule

module top();
   sub_b i_sub_b();
   sub_c i_sub_c();
   sub_d i_sub_d();
endmodule

Running verilator and make as below would generate an executable, with nothing special required of the user.

verilator --cc --exe --top-module top sub_a.sv sub_b.sv sub_c.sv sub_d.sv top.sv
make -f Vtop.mk -C obj_dir

Internally, verilator would create a protect-lib for each of sub_a, sub_b, and sub_c to shorten verilation time.
The following files would be generated.

obj_dir/Vtop.mk
obj_dir/Vtop_*.h
obj_dir/Vtop_*.cpp
obj_dir/Vsub_a/Vsub_a.mk
obj_dir/Vsub_a/Vsub_a_*.h
obj_dir/Vsub_a/Vsub_a_*.cpp
obj_dir/Vsub_b/Vsub_b.mk
obj_dir/Vsub_b/Vsub_b_*.h
obj_dir/Vsub_b/Vsub_b_*.cpp
obj_dir/Vsub_c/Vsub_c.mk
obj_dir/Vsub_c/Vsub_c_*.h
obj_dir/Vsub_c/Vsub_c_*.cpp

The same thing can be done with the current verilator, but the user has to call verilator once for each protect-lib and pass the correct file list and library flags each time.

Any comments are appreciated, including on the metacomment name :)
