
Optimizing parallel generates past elaboration #2550

Open · 1024bees opened this issue Sep 15, 2020 · 5 comments

Labels: area: performance (Issue involves performance issues) · effort: weeks (Expect this issue to require weeks or more of invested effort to resolve)

@1024bees commented Sep 15, 2020

Hey folks,

About a year ago I noticed i-cache being a bottleneck for my simulator performance (#2042).

Now that I have some time, one optimization that I think could help with icache perf is preserving parallel generates past elaboration. Right now I'm imagining something like

genvar rr, cc;
generate
  for (rr=0; rr<16; rr=rr+1) begin : gen1
    for (cc=0; cc<4; cc=cc+1) begin : gen2
      example_module example(
        .i_clk      (i_clk),
        .i_reset_n  (i_reset_n),
        .a          (row_val[rr]),
        .b          (col_val[cc]),
        .o_sum      (sum_out[rr * 16 + cc]));
    end
  end
endgenerate

verilating into something like

//Vtop__Syms.h
Vtop__example_module top__DOT__gen1__BRA__rr__KET__DOT__gen2__BRA__cc__KET__DOT__example[16][4];
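
To make the payoff concrete, a hypothetical eval loop over such a preserved array might look like the sketch below (the _work function and the overall call shape are invented for illustration; Verilator does not emit anything like this today):

void Vtop::_eval(Vtop__Syms* vlSymsp) {
    for (int rr = 0; rr < 16; ++rr) {
        for (int cc = 0; cc < 4; ++cc) {
            // One shared body, indexed by (rr, cc), instead of 64 unique
            // per-instance functions; _work is an invented name.
            vlSymsp->top__DOT__gen1__BRA__rr__KET__DOT__gen2__BRA__cc__KET__DOT__example[rr][cc]
                ._work(vlSymsp, rr, cc);
        }
    }
}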

Does this seem remotely feasible? FWIW, I'd be willing to spend a bit of time on this.

1024bees added the "new" (New issue not seen by maintainers) label on Sep 15, 2020
wsnyder changed the title from "Preserving parallel generates past elaboration" to "Optimizing parallel generates past elaboration" on Sep 15, 2020
wsnyder added the "area: performance" and "effort: weeks" labels and removed the "new" label on Sep 15, 2020

@wsnyder (Member) commented Sep 15, 2020

Yes, icache is the main performance bottleneck at present.

Generates are currently expanded because dotted references need to be able to scope into them, and because all internal optimizations assume a flattened design. For example there may be clock buffers inside a generate, and it's critical to flatten these out for performance. Perhaps we keep the arraying only when the current module inliner doesn't inline.

There's an attempt, once the duplicate code has been made, to find the remaining parallelism again (V3Combine); however, this is very naive at present and doesn't take advantage of arraying (as the arraying no longer exists by then). Also, if done well, it could recognize other similar code even when not arrayed (e.g. instantiating cpu0 and cpu1 with nearly duplicate interiors).

Anyhow, it would be wonderful to get improvements in this area. If you have a good design, I'd suggest manually editing the output of Verilator to see what performance you might get (hacking it so that even if the results are slightly wrong, you can still measure performance). Then, once you have a good strategy, we can discuss how to achieve that output.

@1024bees (Author) commented Nov 10, 2020

Hey Wilson. Sorry for the delay on this, here are some thoughts:

Overview

After looking into this, I think the best return in performance per engineer-hour would come from taking another stab at V3Combine. A decent slice of the poor icache performance seems to come from the unique but near-equivalent functions Verilator creates for each copy of a module that is generated.

Benchmark

I made a micro-benchmark at https://github.com/1024bees/verilator_example. It stamps down a bunch of tiny Verilog modules in a nested for loop. The work at the core of this benchmark looks something like:

      Vtop* const __restrict vlTOPp VL_ATTR_UNUSED = vlSymsp->TOPp;
      // Body
      this->__PVT__flops_next[0U] = (vlTOPp->i_row_val[i] 
                                     + vlTOPp->i_col_val[j]);
      this->__PVT__flops_next[1U] = (vlTOPp->i_row_val[i] 
                                     * vlTOPp->i_col_val[j]);
      this->__PVT__flops_next[2U] = (this->__PVT__flops[0U] 
                                     ^ this->__PVT__flops[1U]);
      this->__PVT__flops_next[3U] = (this->__PVT__flops[3U] 
                                     + this->__PVT__flops[2U]);

By default, Verilator creates a unique copy of this generic_work function for each generated module.

Default performance of this benchmark on my machine (Ryzen 5 2600-based):

[screenshot: original performance]

Performance after converting all of the generated unique functions into the generic_work function mentioned above:

[screenshot: performance after modification]

TL;DR: we get roughly a 2x performance increase in this micro-benchmark by mapping all the variants of the generic_work function created by Verilator into one common generic_work function. I suspect this micro-benchmark could see further improvements from "rolling" the function calls in eval() into a for-loop.
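
For concreteness, the hand-edit measured above amounts to something like the sketch below; the names and signatures are illustrative, not what Verilator actually emits:

// One parameterized body replaces the 64 near-identical per-instance
// copies, with the instance indices i and j passed in rather than baked in.
void Vtop_example_module::generic_work(Vtop__Syms* vlSymsp, int i, int j) {
    Vtop* const __restrict vlTOPp = vlSymsp->TOPp;
    this->__PVT__flops_next[0U] = vlTOPp->i_row_val[i] + vlTOPp->i_col_val[j];
    this->__PVT__flops_next[1U] = vlTOPp->i_row_val[i] * vlTOPp->i_col_val[j];
    this->__PVT__flops_next[2U] = this->__PVT__flops[0U] ^ this->__PVT__flops[1U];
    this->__PVT__flops_next[3U] = this->__PVT__flops[3U] + this->__PVT__flops[2U];
}

// The 64 call sites in eval() could then roll into a loop; "insts" is an
// invented name for the array of instances.
for (int i = 0; i < 16; ++i)
    for (int j = 0; j < 4; ++j)
        vlSymsp->insts[i][j].generic_work(vlSymsp, i, j);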

Solutions

From what you've described, V3Combine looks like it could be extended (rewritten?) to collapse functions created from modules stamped down via generates, in a similar way to how the generic_work function above was created, but I'd have to check out AstCFunc to get a good feel for that.

That being said, extending V3Combine seems to treat a symptom (repeated C code) rather than the cause (redundant work verilating common blocks within a design). IMO, the ideal fix would look something like:

  1. Record any modules with equivalent interiors (after V3Const / V3Width?)
  2. For a set of N modules with equivalent interiors, schedule/optimize one copy of that module with symbolic inputs and outputs, and substitute the scheduled, optimized module back into the flattened design

but this idea comes from having little insight into how Verilator scheduling works :)

What do you think?

@wsnyder (Member) commented Nov 10, 2020

I have little doubt that there is a significant speedup from this; 2x seems quite plausible, and #2631 has additional examples of this.

I would not change scheduling to optimize this; rather, I would rewrite to improve and merge V3Combine and V3Reloop. I believe that will give most of the gain and is much easier to build and get correct. Also, multithreading has counter-goals that might want to keep code expanded, so it should be performed before this pass.

As an idea, consider this:

ASSIGN(foo[1], EQ(bar, 1))
other_statement
ASSIGN(foo[0], EQ(bar, 0))

One idea to discuss is that the new version would convert all statements into variable and constant "argumented" statements, e.g.:

var1=foo, const2=1, var3=bar, const4=1     ASSIGN(var1[const2], EQ(var3, const4))
... other_statement
var1=foo, const2=0, var3=bar, const4=0     ASSIGN(var1[const2], EQ(var3, const4))

This step would also make a graph indicating which statements need to keep the same position relative to others, like V3Split does.

Then create a hash of the generic statements, put all of those with the same signature together (as allowed by the reordering-legality graph), and order the constants low-to-high:

var1=foo, const2=0, var3=bar, const4=0     ASSIGN(var1[const2], EQ(var3, const4))
var1=foo, const2=1, var3=bar, const4=1     ASSIGN(var1[const2], EQ(var3, const4))
... other_statement

Now look for matching signatures and make a loop (as allowed by the reordering-legality graph):

for i 0..1
    var1=foo, const2=i, var3=bar, const4=i     ASSIGN(var1[const2], EQ(var3, const4))
... other_statement

Any arguments that the loop/subroutines don't change are left in place:

for i 0..1
    const2=i, const4=i     ASSIGN(foo[const2], EQ(bar, const4))
... other_statement

This algorithm can also be used to make small functions that implement the common code, but that might slow down the runtime if over-used, so we'd probably only want to do it where the function to be generated contains a substantial number of statements (say 100 instructions or so).

The same algorithm could then be repeated at a larger level than statements, e.g. whole functions. You wouldn't need the graph to say what's legal at that level.

We would then do the normal V3Combine looking for common functions. (Perhaps V3Combine as-is remains as a later stage).

Note that what V3Combine does is a more specific, function-level version of this, and what V3Reloop does is a very specific signature version of looping, which is why I suggest this replace those passes.
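
To make the grouping and rolling steps concrete, here is a toy model that operates on strings rather than Verilator's AST; every name in it is invented, and it is only meant to show the shape of the pass:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A statement reduced to its constant-free "skeleton" plus extracted constants.
struct ArgStmt {
    std::string skeleton;            // e.g. "ASSIGN(var1[const2], EQ(var3, const4))"
    std::vector<int64_t> constants;  // e.g. {1, 1} for ASSIGN(foo[1], EQ(bar, 1))
};

int main() {
    std::vector<ArgStmt> stmts = {
        {"ASSIGN(var1[const2], EQ(var3, const4))", {1, 1}},
        {"ASSIGN(var1[const2], EQ(var3, const4))", {0, 0}},
    };

    // Group by skeleton; a real pass would hash the skeleton and consult the
    // reordering-legality graph before moving statements next to each other.
    std::map<std::string, std::vector<ArgStmt>> groups;
    for (const auto& s : stmts) groups[s.skeleton].push_back(s);

    for (auto& [skeleton, group] : groups) {
        // Order the constants low-to-high so adjacent iterations line up.
        std::sort(group.begin(), group.end(),
                  [](const ArgStmt& a, const ArgStmt& b) { return a.constants < b.constants; });
        // Rollable if each statement's constants are the previous ones plus 1.
        bool rollable = group.size() > 1;
        for (size_t n = 1; rollable && n < group.size(); ++n)
            for (size_t k = 0; k < group[n].constants.size(); ++k)
                if (group[n].constants[k] != group[n - 1].constants[k] + 1) rollable = false;
        if (rollable)
            std::cout << "for i 0.." << group.size() - 1
                      << ":  " << skeleton << "   (all constants = i)\n";
    }
    return 0;
}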

@1024bees (Author) commented:

This sounds like a good plan; after which pass do you think the new V3Combine + V3Reloop (I'll call it V3Collapse) should occur?

@wsnyder (Member) commented Nov 10, 2020

I think I'd do it where V3Reloop is now. I don't know why V3Combine is so early, but it might be good to leave that where it is at least until the new process is understood.
