Validating the HOTCOUNT table #56

Open
lukego opened this issue Mar 22, 2017 · 6 comments

lukego commented Mar 22, 2017

Here is an experiment to test an idea from @fsfod: that the outliers we see in performance reports could be due to hash collisions in the HOTCOUNT table.

The JIT only starts tracing a bytecode when it becomes "hot." Hotness is not counted separately for each bytecode: the bytecode address is hashed into a bucket that is shared by all bytecodes with the same hash value. The standard hashtable size is only 64 entries.
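
To make that concrete, here is a toy Lua model of the mechanism. The real table and hash live in C inside the VM, so the details below (bucket index = bytecode address masked down to the table size, counters starting at the default hotloop threshold of 56) are simplifying assumptions, not the actual implementation.

```lua
-- Toy model of the HOTCOUNT mechanism (simplified; see the caveats above).
local bit = require("bit")

local HOTCOUNT_SIZE = 64   -- the standard table size discussed above
local HOTLOOP = 56         -- default -Ohotloop threshold

local hotcount = {}
for i = 0, HOTCOUNT_SIZE - 1 do hotcount[i] = HOTLOOP end

local function bucket(addr)
  -- bytecode instructions are 4 bytes wide, so drop the low two bits
  return bit.band(bit.rshift(addr, 2), HOTCOUNT_SIZE - 1)
end

-- Conceptually called every time a loop/call bytecode executes.  The
-- counter is shared by every bytecode whose address maps to the same
-- bucket -- that sharing is the collision this issue is about.
local function tick(addr)
  local b = bucket(addr)
  hotcount[b] = hotcount[b] - 1
  if hotcount[b] <= 0 then
    hotcount[b] = HOTLOOP   -- reset for the next round
    return true             -- this bytecode just became "hot": start a trace
  end
  return false
end
```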

It is possible that we see non-deterministic performance because bytecode addresses have entropy and on different runs we see different hash collisions. This could cause the JIT to make different decisions, for example whether to start a trace at the start of a function / on an outer loop / on an inner loop. (Ordinarily a trace would start on the innermost loop but hash collisions could disturb this.)

I ran the simplest experiment that I could think of to test this idea: I made five branches, each with a different hashtable size (8, 16, 64, 256, 1024), and ran a large benchmark suite (300 tests per benchmark per branch).
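
Whether two particular bytecodes collide at all depends on the table size, which is exactly the knob these branches vary. A throwaway check with two made-up addresses and the same toy bucket function as above:

```lua
-- Two made-up bytecode addresses, 0x100 bytes apart: whether they share a
-- HOTCOUNT bucket depends on the table size.
local bit = require("bit")

local function bucket(addr, size)
  return bit.band(bit.rshift(addr, 2), size - 1)
end

local a, b = 0x10008, 0x10108   -- hypothetical addresses
for _, size in ipairs({8, 16, 64, 256, 1024}) do
  print(size, bucket(a, size) == bucket(b, size) and "collide" or "separate")
end
-- For this pair: 8, 16 and 64 collide; 256 and 1024 keep them separate.
```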

Here are the results in a jitter plot:

[figure: hotcount-jitter]

How to make sense of this? I am not sure. There are definitely patterns there: for example, fannkuch runs much faster with the 16-element table, and nbody runs much slower with the 8- and 16-element tables. It is less clear whether the table size influences the distribution of results, i.e. makes them more or less predictable. (Could it be that hashtable collisions do impact performance but that the effect is fairly deterministic on these simple benchmarks due to lack of entropy in the addresses being hashed?)

Have to think about this one some more. Meanwhile, here is one more graph that I spun up to try to see whether the pattern of outliers has changed. It's an ECDF with the Y-axis on a log scale to emphasise the slow outliers.

[figure: hotcount-ecdf]

A possible next step would be to make sure there is entropy in the bytecode addresses from one benchmark run to the next, so that the same branch is not always hashing the same instruction to the same bucket (which would be unrealistic for real applications that load code in more complex ways and are also being incrementally modified during development).

See also #11.

lukego added the report label Mar 22, 2017

lukego commented Mar 22, 2017

Incidentally, on Snabb we have occasionally seen ridiculous situations where somebody renames a function and performance suffers measurably. Could this be due to shifting the addresses of bytecodes? If so, that would be another reason to introduce some entropy, so that performance tests could measure and account for the non-determinism instead of falsely attributing it to one software version or another.


lukego commented Mar 24, 2017

Once more with Entropy

I reran the tests with some more entropy in the bytecode addresses. This is intended to make the hash collisions between runs unpredictable. I did this with a super hacky patch (lukego/raptorjit@c4fd4cd) to randomize the low bits of the GCproto object address.

On the one hand this should make the hash value of each bytecode unpredictable (simulating the real world, where modules can be loaded in a different order, etc.), but on the other hand the relative hash values within a prototype will stay the same (failing to simulate the real world, where functions are updated and the offset between a loop and a function head can change, affecting whether they share a hash bucket).
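
In terms of the toy model from the earlier comment (addresses, offsets and pad all made up), the effect looks like this:

```lua
-- Randomizing a prototype's base address: every bytecode inside it shifts
-- together, so collisions *between* prototypes change from run to run,
-- but collisions *within* the prototype do not.
local bit = require("bit")
local SIZE = 64

local function bucket(addr)
  return bit.band(bit.rshift(addr, 2), SIZE - 1)
end

-- Byte offsets of three bytecodes inside one hypothetical prototype.
local offsets = { 0x00, 0x40, 0x100 }

math.randomseed(os.time())
for run = 1, 3 do
  local base = 0x20000 + 4 * math.random(0, SIZE - 1)   -- randomized low bits
  print(("run %d: buckets %d %d %d"):format(run,
        bucket(base + offsets[1]),
        bucket(base + offsets[2]),
        bucket(base + offsets[3])))
end
-- The buckets move around between runs, but the bytecodes at offsets 0x00
-- and 0x100 (exactly 64 slots apart) share a bucket every single time:
-- within-prototype collisions are untouched.
```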

I also note that the smaller hash sizes (8 and 16) may be more representative of real-world performance than the default hash size (64) because real applications contain much more code than these small benchmarks and so they will tend to have more collisions.

So, let us take a look at the results, bearing in mind that at least one more experiment is needed. (Have to think about a suitable way to add the missing entropy.)

Note: I changed the ordering to be from smallest to largest hash size and this has remapped the colours compared with the previous graph.

[figure: hotcount-entropy-jitter]

and the ECDF:

[figure: hotcount-entropy-ecdf]

I need to sit down and look closely at this data to draw any real conclusions. Meanwhile here is a braindump of a few ideas that pop out right away:

  • fannkuch is consistent from run to run but inconsistent from branch to branch. I suspect that it is not sensitive to hash collisions between different functions (which are now randomized between runs) but that it is sensitive to hash collisions within the same function (which are not randomized, but differ between branches due to the larger/smaller hashtable size). I predict that once we add entropy to the relative hash values of instructions within a prototype, the results will become much fuzzier.
  • nbody becomes more consistently fast with larger hash sizes. It could be that the default tracing behaviour of starting with the innermost loops and working outwards is the best for this benchmark, and that hash collisions between prototypes can derail this.
  • roulette is similar, except for table size 256, which is distinctly bimodal. Could this also be due to a within-prototype hash collision at certain table sizes, e.g. between a loop and a function head?

I suppose that one useful way to answer these questions would be to have the CI save the jit.v output for each run and then inspect it afterwards to see, e.g., whether the difference is related to which bytecodes are traced first. We could even hash the jit.v output so that the visualization could split the results into groups that traced in the same way. However, for the moment I am happy to continue with the more blunt-instrument statistical approach. (Well, and I have removed jit.v from RaptorJIT without yet replacing it.)
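
For reference, on stock LuaJIT the capture side would be roughly the following (this assumes the jit.v module's start/off entry points, so RaptorJIT would first need its replacement, and the filename scheme is just made up):

```lua
-- Capture the verbose trace log for this benchmark run so the CI can
-- archive it and later hash/compare logs to group runs that traced the
-- same way.  Stock LuaJIT only; RaptorJIT has removed jit.v for now.
local v = require("jit.v")
v.start(("jit-v-%d.txt"):format(os.time()))   -- hypothetical per-run filename

-- ... run the benchmark workload here ...

v.off()
```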

Just to recap, the point of this whole exercise is to make RaptorJIT performance consistently fast (#11), and this means rooting out and eliminating the sources of non-determinism (and making sure that the heuristics, once made predictable, usually lead to the best results, i.e. that it really is best to trace the innermost loop first, etc.). The difficulty of obtaining consistent and predictable performance is IMHO the one major problem with LuaJIT and is a practical obstacle for Snabb hackers to be confident about deploying their own inventions in live networks. Gonna be awesome to have it fixed!

javierguerragiraldez commented:

One idea (just to add some more noise): the only case I can imagine where a smaller table gives faster results than a bigger one is when a hash collision makes an important part of the code get hot sooner. If this is a real concern, then significantly reducing the initial hotcount values (by changing the -Ohotloop value) should help.
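
For reference, that is the hotloop parameter (default 56); a minimal way to lower it, from Lua or on the command line (16 is just an example value):

```lua
-- Make code hot sooner so that it does not need a lucky collision to get
-- there; equivalent to passing -Ohotloop=16 on the command line.
jit.opt.start("hotloop=16")
```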


lukego commented Mar 24, 2017

@javierguerragiraldez Cool idea! @fsfod also mentioned on Slack that he has written some jit.util extensions for explicitly marking different exits, etc., as hot. Sounds useful for writing tests! https://github.com/fsfod/LuaJIT/commits/setfunchot


lukego commented Mar 24, 2017

I am running a new test with this quick change intended to randomize hash collisions of bytecodes within the same prototype: lukego@972f724.
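
I have not reproduced the change here, but for intuition in the toy model: adding a constant offset to every address cannot change within-prototype collisions, because every bytecode moves by the same number of buckets. The bucket function itself has to mix in a per-run key. One conceivable shape for that, not necessarily what the commit above does:

```lua
-- Key the bucket function with a per-run multiplier and take a window of
-- higher bits of the product, so that *which* addresses share a bucket
-- depends on the key (a plain mask or constant offset cannot do this).
local bit = require("bit")
local SIZE = 64

math.randomseed(os.time())
local key = bit.bor(math.random(0, 0xffffff), 1)   -- random odd per-run key

local function bucket(addr)
  return bit.band(bit.rshift(addr * key, 16), SIZE - 1)
end

-- The two offsets that always collided under the plain mask (0x00 and
-- 0x100, i.e. exactly 64 slots apart) now collide only for some keys:
local base = 0x20000
print(bucket(base + 0x00) == bucket(base + 0x100))
```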


lukego commented Mar 28, 2017

Interesting. I have a large Snabb benchmark that does exhibit unwanted variation in performance, and I ran that with a 2048-entry HOTCOUNT table. I was expecting to see an impact on the results, since there must be a lot of hash collisions going on. However, I don't see any effect at all.

So it is possible that the existing mechanism is actually fine, but we would really need to understand why, well enough that our troubleshooting is not complicated by doubts about hash collisions.

Couple of possibilities just passing through my mind:

  • Could be that inner loops almost always become hot first, regardless of the value of the hot counter, simply because they execute so much more often than function heads and outer loops.
  • Could be that it doesn't matter especially which bytecodes become hot first in practice, and the JIT tends to end up with much the same traces.
  • Could be that some other larger problem dwarfs the issues caused by hash collisions, e.g. that the main practical issue is root-trace vs side-trace selection and this is not affected by the HOTCOUNT table.

Question is... how do we find out the truth?
