Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

Closed
veripoolbot opened this issue Dec 17, 2019 · 13 comments
Closed

Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

veripoolbot opened this issue Dec 17, 2019 · 13 comments

Comments

@veripoolbot
Copy link

@veripoolbot veripoolbot commented Dec 17, 2019


Author Name: Julien Margetts
Original Redmine Issue: 1644 from https://www.veripool.org


For my design I am finding a very large runtime delta between Verilator 4.020 and either 4.022 or 4.024

I am seeing a 30x increase in bytes of C++ emitted, a 10x increase in compiled object size, and a 5x increase in verilator run time

Not ruling out user error, and I am trying to narrow down the area of the design which is blowing up, but in the meantime, are there any recent changes you can think of which may cause such an issue?

Thanks

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 17, 2019


Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-17T18:52:35Z


Possibly something to do with casez expansion?

function [3:0] LowestBitSet(input [15:0] v);
begin // Return the index of the lowest set bit in a 16-bit vector
     casez (v)
     16'b???????????????1 : LowestBitSet = 4'd0;
     16'b??????????????10 : LowestBitSet = 4'd1;
     16'b?????????????100 : LowestBitSet = 4'd2;
     16'b????????????1000 : LowestBitSet = 4'd3;
     16'b???????????10000 : LowestBitSet = 4'd4;
     16'b??????????100000 : LowestBitSet = 4'd5;
     16'b?????????1000000 : LowestBitSet = 4'd6;
     16'b????????10000000 : LowestBitSet = 4'd7;
     16'b???????100000000 : LowestBitSet = 4'd8;
     16'b??????1000000000 : LowestBitSet = 4'd9;
     16'b?????10000000000 : LowestBitSet = 4'd10;
     16'b????100000000000 : LowestBitSet = 4'd11;
     16'b???1000000000000 : LowestBitSet = 4'd12;
     16'b??10000000000000 : LowestBitSet = 4'd13;
     16'b?100000000000000 : LowestBitSet = 4'd14;
     default              : LowestBitSet = 4'd15;
     endcase
end
endfunction

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 17, 2019


Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-17T23:22:52Z


Yes, this is it. Re-coding the above as a for loop works-round the issue.
Is this a side-effect of the workaround for #�, which went in at 4.022?

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 17, 2019


Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-17T23:28:38Z


The above function emits an impressive 11MB of C++ on 4.022 compared to 14KB with 4.020 :)

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 18, 2019


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T00:06:35Z


Wow, that's an unanticipated consequence! Can you try reverting the define in #�? If that is it, I suspect it would still have done stupid code in earlier versions if you have a 10-bit encoder instead, you just happened to cross the 16-bit line.

Assuming that's it, it should know the instruction count is being increased and abort.

A longer term ideal fix would be to recognize all the common "diagonal" 0/1 cases and convert to just a few instructions.

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 18, 2019


Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-18T13:05:00Z


To your first point: Yes, simply reverting CASE_OVERLAP_WIDTH to 12 and we're back down to 14K
To your second point: Yes, a 12-bit search also emits lots of code regardless of version.

If you are interested, the attached patch reverts CASE_OVERLAP_WIDTH to 12, and performs a new overlap check earlier (up to 64-bit input width regardless of CASE_OVERLAP_WIDTH). This probably renders some of the existing code which follows redundant, but I have left it all in place.

I also think isCaseTreeFast would benefit from a test similar to the following to prevent the code explosion:

#define CASE_SPARSENESS 2
if ((m_caseItems == m_caseWidth) || m_caseItems < ((1UL<<m_caseWidth)/CASE_SPARSENESS)
     return false;

(Bit search cases will typically have m_caseItems == m_caseWidth)

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 18, 2019


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T17:30:10Z


Thanks for the patch, I'll check & merge it tonight.

Can you also try out the CASE_SPARSENESS test to tune your performance? I think the comparison should be caseItems == width, or items == width+1 (as there might be a default). But maybe only enable that part of the expression if the width is say 5 or larger as small statements/widths will do better optimizing.

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 18, 2019


Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-18T19:15:16Z


Sounds good. Will do.

If I get time I might look at a patch to recognise CLZ, CTZ type casez constructs, e.g.

  1. width == items
  2. each items const value has only a single bit set
  3. each items const is prev items const >> or << 1 bit (i.e. spot the diagonal)
  4. each items 'action' is an ASSIGN ( CONST, VARREF ) and the VARREF is the same for each item
  5. the constant assigned to VARREF is equal to CTZ/CLZ(item_const)

Then I guess you could convert the whole CASE into something like ASSIGN ( CLZ( VARREF ), VARREF ) where CLZ is a new Math Op that emits a call to __builtin_clz or similar?

Do you think this would be worth it? Might be a fun self contained exercise either way!

Could then possibly recognise a for loop based implementation too, and expand to include popcount … :)

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 18, 2019


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T22:57:30Z


Short/Medium term the patch you mention seems good.

The really ideal long term improvement IMO would be to use a BDD or otherwise to make a function signature for N inputs, then map common signatures to fast implementations, this is one of many. E.g. it could even recognize low-level gates forming an adder. V3Table is an example of part of this concept, finding a series of inputs and making a truth table, it just doesn't make a signature and reduce it yet, which would replace making the table for known signatures.

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 18, 2019


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T23:01:18Z


As to the patch, it's O(n^2). I've seen real tables with thousands of entries, so this will effectively hang. It at least needs a size bound.

@veripoolbot

This comment has been minimized.

Copy link
Author

@veripoolbot veripoolbot commented Dec 19, 2019


Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-19T10:41:49Z


Yes, I did wonder about that, but thought it would be OK since the existing overlap check (although bounded) is also O(N^2) (actually O(N*M) where M=1<<Width, which must be >= N)

Anyhow, it's easy to make it O(N) so I will do that (as it stands the patch takes 6s to check a 65,536 entry case)

@margej

This comment has been minimized.

Copy link

@margej margej commented Dec 23, 2019

Been wrestling with this for a while, and after writing an O(N) overlap check which is 10,000x faster than my original lame effort, I came to the conclusion its not really worth including, as the work all has to be done again to populate m_value anyway, so its just adding effort to increase the width of the overlap check, which doesn't seem much of a value add.

I've come to the conclusion that my immediate issue is best resolved by this very simple change (as discussed above) but doing this check at the end at least allows CASE_OVERLAP_WIDTH to remain at 16

Changing this:
if (m_caseItems <= 3) return false; // Not worth simplifying
To this:
if (m_caseItems <= 3 || (m_caseWidth >= 8 && (m_caseItems <= (m_caseWidth+1)))) return false; // Not worth simplifying

@margej

This comment has been minimized.

Copy link

@margej margej commented Dec 23, 2019

For reference, here is the patch for the wider overlap check.
Can't drag and drop ".patch" or ".diff" files BTW

V3Case_patch.txt

@wsnyder

This comment has been minimized.

Copy link
Member

@wsnyder wsnyder commented Dec 23, 2019

Thanks for your work, smaller patch pushed.

@wsnyder wsnyder changed the title Performance delta between 4.020 and 4.022 Performance delta between 4.020 and 4.022 (V3Case case handling) Dec 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.