Validating the HOTCOUNT table #56
Incidentally, on Snabb we have occasionally seen ridiculous situations where somebody renames a function and performance measurably suffers. Conceivably this is due to shifting the addresses of bytecodes? If so, that would be another reason to introduce some entropy, so that performance tests could measure and account for the non-determinism instead of falsely attributing it to one software version or another.
**Once more with entropy**

I reran the tests with some more entropy in the bytecode addresses. This is intended to make the hash collisions between runs unpredictable. I did this with a super hacky patch (lukego/raptorjit@c4fd4cd) to randomize the low bits of the bytecode addresses.

On the one hand this should make the hash value of each bytecode unpredictable (simulating the real world, where modules can be loaded in different orders, etc.), but on the other hand the relative hash values within a prototype will stay the same (failing to simulate the real world, where functions are updated and the offset between a loop and a function head can change and affect whether they share a hash bucket).

I also note that the smaller hash sizes (8 and 16) may be more representative of real-world performance than the default hash size (64), because real applications contain much more code than these small benchmarks and so will tend to have more collisions.

So, let us take a look at the results, bearing in mind that at least one more experiment is needed. (Have to think about a suitable way to add the missing entropy.) Note: I changed the ordering to be from smallest to largest hash size, and this has remapped the colours compared with the previous graph.

[jitter plot]

and the ECDF:

[ECDF plot]

I need to sit down and look closely at this data to draw any real conclusions. Meanwhile, here is a braindump of a few ideas that pop out right away:
I suppose that one useful way to answer these questions would be to have the CI save the […].

Just to recap, the point of this whole exercise is to make RaptorJIT performance consistently fast (#11), and this means rooting out and eliminating the sources of non-determinism (and making sure that the heuristics, once made predictable, usually lead to the best results, i.e. that it really is best to trace the innermost loop first, etc.). The difficulty of obtaining consistent and predictable performance is IMHO the one major problem with LuaJIT, and it is a practical obstacle to Snabb hackers being confident about deploying their own inventions in live networks. Gonna be awesome to have it fixed!
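For illustration, here is a minimal C sketch of the kind of entropy injection described in the comment above: padding each bytecode allocation by a random offset, so hash values change between runs while relative collisions within a prototype stay fixed. This is a sketch under assumptions, not the actual c4fd4cd patch, and `bc_alloc_with_entropy` is a made-up name.

```c
#include <stdlib.h>

/* Hypothetical sketch (not the actual patch): pad each bytecode
   allocation by a random number of 4-byte instruction slots.  Every pc
   in the prototype shifts by the same amount, so HOTCOUNT hash values
   change from run to run, but the relative collisions *within* a
   prototype stay fixed -- exactly the limitation noted above. */
static void *bc_alloc_with_entropy(size_t nbytes)
{
  size_t pad = ((size_t)rand() & 15) * 4;  /* 0..15 random slots */
  char *p = malloc(nbytes + pad);
  /* A real allocator would remember `p` so the block can be freed. */
  return p ? p + pad : NULL;
}
```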
One idea (just to add some more noise): the only case I can imagine where a smaller table gives faster results than a bigger one is when a hash collision makes an important part of the code get hot sooner. If this is a real concern, then significantly reducing the initial hotcount values (by changing the -Ohotloop value) should help.
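A minimal sketch of what that would mean, assuming the counters are initialized as in LuaJIT's `lj_dispatch_init_hotcount()` (the 77/32 scaling is LuaJIT's; RaptorJIT may differ):

```c
#include <stdint.h>

#define HOTCOUNT_SIZE 64
typedef uint16_t HotCount;

/* Fill all shared counters from the -Ohotloop parameter.  The 77/32
   scaling matches LuaJIT's lj_dispatch_init_hotcount(); RaptorJIT may
   differ.  A tiny hotloop value makes every bytecode hot almost
   immediately, which should swamp any collision effects. */
static void init_hotcount(HotCount *tab, int hotloop)
{
  HotCount start = (HotCount)((hotloop * 77) / 32);
  for (int i = 0; i < HOTCOUNT_SIZE; i++)
    tab[i] = start;
}
```

With `-Ohotloop=1`, for example, every counter starts at 2, so a collision can advance hotness by at most a step or two.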
@javierguerragiraldez Cool idea! @fsfod also mentioned on Slack that he has written some […]
I am running a new test with this quick change, intended to randomize hash collisions of bytecodes within the same prototype: lukego@972f724.
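For concreteness, one hypothetical way to get that effect — this is not necessarily what the linked commit does — is to replace the take-the-low-bits bucket function with a multiplicative hash using a per-run random odd salt, which scrambles which pairs of bytecodes collide even inside a single prototype:

```c
#include <stdint.h>
#include <stdlib.h>

#define HOTCOUNT_SIZE 64

/* Per-run random odd multiplier, picked once at VM startup. */
static uintptr_t hash_salt;

static void hotcount_hash_init(void)
{
  hash_salt = ((uintptr_t)rand() << 1) | 1;   /* odd => invertible */
}

/* Multiplying by a random odd constant before selecting bits scrambles
   which pairs of bytecodes collide, even within a single prototype,
   instead of merely shifting all buckets by the same amount. */
static int hotcount_index(const void *pc)
{
  uintptr_t h = ((uintptr_t)pc >> 2) * hash_salt;
  return (int)((h >> 8) & (HOTCOUNT_SIZE - 1));
}
```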
Interesting. I have a large Snabb benchmark that does exhibit unwanted variation in performance, and I ran it with a 2048-entry HOTCOUNT table. I was expecting to see an impact on the results, since there must be a lot of hash collisions going on. However, I don't see any effect at all. So it is possible that the existing mechanism is actually fine, but we would really need to understand why that is the case, well enough that our troubleshooting is not complicated by doubts about hash collisions. A couple of possibilities just passing through my mind:
Question is... how do we find out the truth?
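One possible way, sketched here under assumptions: instrument the VM to record which distinct bytecode addresses actually share a HOTCOUNT bucket during a benchmark run, and dump the shared buckets at exit. Nothing like this exists in the tree; the names below are made up.

```c
#include <stdint.h>
#include <stdio.h>

#define HOTCOUNT_SIZE 64
#define MAX_PCS 8                 /* distinct pcs remembered per bucket */

static const void *seen[HOTCOUNT_SIZE][MAX_PCS];

/* Call this wherever the hot counter is decremented. */
static void record_hotcount_hit(const void *pc)
{
  int b = (int)(((uintptr_t)pc >> 2) & (HOTCOUNT_SIZE - 1));
  for (int i = 0; i < MAX_PCS; i++) {
    if (seen[b][i] == pc) return;               /* already known */
    if (seen[b][i] == NULL) { seen[b][i] = pc; return; }
  }
}

/* Dump buckets that are actually shared by more than one bytecode. */
static void report_collisions(void)
{
  for (int b = 0; b < HOTCOUNT_SIZE; b++) {
    int n = 0;
    while (n < MAX_PCS && seen[b][n] != NULL) n++;
    if (n > 1) printf("bucket %2d: %d distinct hot bytecodes\n", b, n);
  }
}
```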
Here is an experiment to test an idea from @fsfod: that the outliers we see in performance reports could be due to hash collisions in the HOTCOUNT table.
The JIT only starts tracing a bytecode when it becomes "hot." Hotness is not counted separately for each bytecode: the bytecode address is hashed into a bucket that is shared by all bytecodes with the same hash value. The standard hashtable size is only 64 entries.
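To make the sharing concrete, here is a minimal C sketch of the counting scheme as it appears in LuaJIT's lj_dispatch.h (simplified; RaptorJIT's details may differ):

```c
#include <stdint.h>

#define HOTCOUNT_SIZE 64               /* number of shared buckets */

typedef uint16_t HotCount;
static HotCount hotcount[HOTCOUNT_SIZE];

/* Bucket index comes from the bytecode address: any two bytecodes whose
   word-aligned addresses differ by a multiple of HOTCOUNT_SIZE * 4
   bytes share the same counter. */
static int hotcount_index(const void *pc)
{
  return (int)(((uintptr_t)pc >> 2) & (HOTCOUNT_SIZE - 1));
}

/* Decremented on loop iterations and function entries; returns 1 when
   the shared counter expires and the JIT should start recording. */
static int hotcount_tick(const void *pc)
{
  return --hotcount[hotcount_index(pc)] == 0;
}
```

So a collision means two unrelated bytecodes jointly drain one counter, and each reaches hotness in fewer of its own executions.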
It is possible that we see non-deterministic performance because bytecode addresses have entropy and on different runs we see different hash collisions. This could cause the JIT to make different decisions, for example whether to start a trace at the start of a function / on an outer loop / on an inner loop. (Ordinarily a trace would start on the innermost loop but hash collisions could disturb this.)
I ran the simplest experiment that I could think of to test this idea: I made five branches each with a different hashtable size (8, 16, 64, 256, 1024) and I ran a large benchmark suite (300 tests per benchmark per branch.)
Here are the results in a jitter plot:

[jitter plot]
How to make sense of this? I am not sure. There are definitely patterns there: for example, `fannkuch` runs much faster with a 16-element table, and `nbody` runs much slower with 8- and 16-element tables. It is less clear whether the table size influences the distribution of results, i.e. makes them more predictable or not. (Could it be that hashtable collisions do impact performance, but that it is fairly deterministic on these simple benchmarks due to lack of entropy in the addresses being hashed?)

Have to think about this one some more. Meanwhile, here is one more graph that I spun to try and see if the pattern of outliers has changed. It's an ECDF with the Y-axis on a log scale to emphasise the slow outliers.

[ECDF plot]
A possible next step would be to make sure there is entropy in the bytecode addresses from one benchmark run to the next, just so that the same branch is not always hashing the same instruction to the same bucket (in a way that would be unrealistic for real applications, which load code in more complex ways and are also incrementally modified during development).
See also #11.