
Improve V3Localize #3027

Merged: 3 commits merged into verilator:master from improve-localize on Jun 18, 2021

Conversation

@gezalore (Member) commented Jun 17, 2021

Got a bit sidetracked from the ccache efficiency work when I found this optimization opportunity. It makes the OpenTitan model 8.5% faster when compiled without --trace (measured on a single-threaded model), as V3Localize now finds an additional ~21000 variables that it can convert into locals, which no longer need to live in (and be stored to) the model state. (Sadly, when compiled with tracing these opportunities go away, as the variables are read in the tracing routines.)

There are 3 patches, and they should be kept separate as they are largely orthogonal. The first moves V3Descope after V3Localize, and teaches V3Localize how to localize variables of another module. Patch 2 is infrastructure, used initially by patch 3 but generic. Patch 3 teaches V3Localize how to localize any variable that is always assigned before it is read within a function (previously it only did this if the variable was used in a single function, but a lot of code was duplicated into the settle loop, so there was a lot of lost opportunity).
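To illustrate what localization buys, here is a minimal hypothetical sketch (illustrative names, not actual Verilator-emitted code): a temporary that every function fully writes before reading it can move out of the model state and become a function local, which the compiler can then keep in a register instead of storing to and loading from memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model state, loosely mirroring what Verilator emits;
// the names and fields are illustrative only.
struct ModelState {
    uint32_t tmp;  // persists in the model state: every access is a memory load/store
    uint32_t out;
};

// Before: "tmp" lives in the model state even though it is always
// written in whole before being read.
void eval_unlocalized(ModelState& s, uint32_t a, uint32_t b) {
    s.tmp = a + b;      // store to model state
    s.out = s.tmp * 2;  // load back from model state
}

// After V3Localize: since "tmp" is always assigned before it is read,
// it can become a plain function local.
void eval_localized(ModelState& s, uint32_t a, uint32_t b) {
    uint32_t tmp = a + b;  // no model-state traffic
    s.out = tmp * 2;
}
```

Fewer loads and stores to the model state is what drives the cycle and instruction-count reductions reported below.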

Some numbers for OpenTitan (single threaded, without --trace):

Old Localize: (0a28fc8c)

      68,728.68 msec task-clock                #    1.000 CPUs utilized            ( +-  0.05% )
             75      context-switches          #    0.001 K/sec                    ( +- 31.50% )
              0      cpu-migrations            #    0.000 K/sec
          1,549      page-faults               #    0.023 K/sec                    ( +-  0.02% )
240,412,856,507      cycles                    #    3.498 GHz                      ( +-  0.05% )
232,031,502,038      instructions              #    0.97  insn per cycle           ( +-  0.00% )
 11,966,905,530      branches                  #  174.118 M/sec                    ( +-  0.00% )
    261,356,240      branch-misses             #    2.18% of all branches          ( +-  0.47% )

        68.7321 +- 0.0323 seconds time elapsed  ( +-  0.05% )


New Localize: (21917abf)

      63,407.49 msec task-clock                #    1.000 CPUs utilized            ( +-  0.04% )
             39      context-switches          #    0.001 K/sec                    ( +- 56.85% )
              0      cpu-migrations            #    0.000 K/sec
          1,434      page-faults               #    0.023 K/sec                    ( +-  0.01% )
221,799,502,259      cycles                    #    3.498 GHz                      ( +-  0.04% )
211,109,106,971      instructions              #    0.95  insn per cycle           ( +-  0.00% )
 11,898,040,540      branches                  #  187.644 M/sec                    ( +-  0.00% )
    264,866,360      branch-misses             #    2.23% of all branches          ( +-  0.78% )

        63.4091 +- 0.0265 seconds time elapsed  ( +-  0.04% )

@gezalore gezalore requested a review from wsnyder June 17, 2021 20:19
@gezalore gezalore force-pushed the improve-localize branch 2 times, most recently from a71aa7b to 21917ab on June 17, 2021 20:32
@gezalore (Member Author)

Actually, it turns out even with tracing there are 19000 more locals in OpenTitan with this patch!

@gezalore (Member Author) commented Jun 17, 2021

To clarify, on OpenTitan:

  • Without this patch, V3Localize localizes about 1500 variables.
  • With this patch, with --trace, it's about 20500.
  • With this patch, without --trace, it's about 22000.

Let's hope that stays that way after I fix the sanitizer failures.

@wsnyder (Member) commented Jun 17, 2021

Sounds like great results; wish I had spotted that earlier! Please also check SweRV performance.

  1. Merging first change seems ok.
  2. Like the concept, but perhaps we should apply it more globally to existing code, and as a separate pull?
  3. Also seems reasonable.

@gezalore (Member Author)

The sanitizer error seems to be a pre-existing, genuine read past the end of a VlWide (trying to do a bit extract, I think). It shows up now because the VlWide has been localized, while previously it just read past the end of it in the module struct. Will need to figure out how to fix this.

  1. Like the concept, but perhaps we should apply it more globally to existing code, and as a separate pull?

Agree it should be applied more globally where possible, but the last patch depends on this so I will keep it here for now, and revisit where the allocators can be used as a later patch/PR.
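The failure mode described in that comment can be sketched with a hypothetical example (illustrative code, not Verilator's actual VlWide implementation): a bit extract that unconditionally reads two adjacent 32-bit words runs one word past the end of the array when the selected range lies entirely in the top word. As a struct member the over-read lands in adjacent fields and goes unnoticed; as a stack local, the sanitizer flags it. A guarded version might look like:

```cpp
#include <cassert>
#include <cstdint>

// Extract 'width' (<= 32) bits starting at bit 'lsb' from an array of
// 32-bit words. An unguarded variant would always read words[i + 1],
// reading one word past the end whenever the range fits in the top
// word; the bounds check below avoids that out-of-bounds access.
static uint32_t extractBitsFixed(const uint32_t* words, unsigned nwords,
                                 unsigned lsb, unsigned width) {
    const unsigned i = lsb / 32;
    const unsigned shift = lsb % 32;
    uint64_t v = words[i];
    if (shift != 0 && i + 1 < nwords) {
        v |= uint64_t(words[i + 1]) << 32;  // only read the high word if it exists
    }
    const uint64_t mask = (1ULL << width) - 1;
    return uint32_t((v >> shift) & mask);
}
```

This is only a sketch of the class of bug; the actual fix in Verilator may differ.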

@gezalore (Member Author) commented Jun 17, 2021

Please also check SweRV performance.

More modest gain on EH1:

Old Localize: (0a28fc8c)

      18,299.03 msec task-clock                #    0.998 CPUs utilized            ( +-  0.03% )
             15      context-switches          #    0.001 K/sec                    ( +- 13.40% )
              0      cpu-migrations            #    0.000 K/sec
          1,398      page-faults               #    0.076 K/sec
 64,006,751,514      cycles                    #    3.498 GHz                      ( +-  0.03% )
 46,452,448,956      instructions              #    0.73  insn per cycle           ( +-  0.01% )
  4,547,912,648      branches                  #  248.533 M/sec                    ( +-  0.03% )
     58,756,429      branch-misses             #    1.29% of all branches          ( +-  0.39% )

       18.33155 +- 0.00519 seconds time elapsed  ( +-  0.03% )


New Localize: (21917abf)

      18,138.79 msec task-clock                #    0.998 CPUs utilized            ( +-  0.04% )
             18      context-switches          #    0.001 K/sec                    ( +-  6.57% )
              0      cpu-migrations            #    0.000 K/sec
          1,398      page-faults               #    0.077 K/sec
 63,446,485,184      cycles                    #    3.498 GHz                      ( +-  0.04% )
 45,207,556,286      instructions              #    0.71  insn per cycle           ( +-  0.01% )
  4,427,565,512      branches                  #  244.094 M/sec                    ( +-  0.01% )
     58,920,165      branch-misses             #    1.33% of all branches          ( +-  0.36% )

       18.17047 +- 0.00724 seconds time elapsed  ( +-  0.04% )

On the other hand, OpenTitan builds about 20% faster on GCC with this patch (with no ccache, 2191 CPU seconds vs 2651), something that @rswarbrick might be happy to hear.

@gezalore (Member Author)

  1. Like the concept, but perhaps we should apply it more globally to existing code, and as a separate pull?

BTW I intend to rebase this and keep the 3 commits separate instead of squashing them. Sadly GitHub doesn't support dependent PRs.

V3Localize can now localize variable references that reference variables located in scopes different from the referencing function. This also means V3Descope has now moved after V3Localize.

These utility classes can be used to hang advanced data structures off AstNode user*u() pointers, and they take care of memory management for the client. Use via the call operator().

Teach V3Localize how to localize variables that are used in multiple functions, if in all functions where they are used, they are always written in whole before being consumed. This allows many more variables to be localized (+20k variables on OpenTitan when building without --trace), and can yield a significant performance improvement (OpenTitan simulates 8.5% faster when built single threaded and without --trace).
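The user-pointer allocator described in the second commit message might be sketched roughly as follows. This is a minimal illustration of the idea (owning heap objects hung off per-node user pointers, accessed via the call operator), with hypothetical names that do not necessarily match Verilator's actual classes:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Illustrative AST node with a single spare user-data slot, loosely
// modeled on AstNode's user*u() pointers; not Verilator's real class.
struct Node {
    void* userp = nullptr;
};

// Sketch of an allocator that hangs a T off each node's user pointer
// and owns the memory, so the client never frees anything itself.
template <typename T>
class UserAllocator {
    std::vector<std::unique_ptr<T>> m_owned;  // memory managed for the client
public:
    // Call operator: lazily create the T for this node, then return it.
    T& operator()(Node* nodep) {
        if (!nodep->userp) {
            m_owned.emplace_back(new T{});
            nodep->userp = m_owned.back().get();
        }
        return *static_cast<T*>(nodep->userp);
    }
};
```

This lets a pass attach an arbitrary per-node structure (say, a vector of references) without manual new/delete bookkeeping, which is what patch 3 needs for its per-variable analysis.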
@gezalore gezalore merged commit e5e5bc0 into verilator:master Jun 18, 2021
@gezalore gezalore deleted the improve-localize branch June 18, 2021 15:22