NFV test matrix preview #976
Comments
Looks awesome overall. The main thing that worries me is the lack of any indication of expected error/noise - when I see two overlapping graphs that diverge a bit, I have no idea how likely it is to be significant or random.
@kbara Yes, I see what you mean about the difficulty of knowing whether a variation is significant or just random noise. It will be interesting to see how the graphs look when we compare code that actually does have different performance. Could also be that we can improve the plots to convey more information about significance; there are some examples in Plotting distributions (ggplot2). If anybody feels inspired to muck around with the graphs, here is a quick start:
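For anyone who wants to follow along, a minimal sketch of such a quick start. The file name `bench.csv` and the column names `benchmark`, `branch`, and `score` are assumptions for illustration, not the actual report inputs:

```r
# Sketch only: file name and column names are assumed for illustration.
library(dplyr)
library(ggplot2)

results <- read.csv("bench.csv")           # raw benchmark results from Hydra

# Overlaid density of scores per branch, one panel per benchmark.
ggplot(filter(results, !is.na(score)), aes(x = score, fill = branch)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ benchmark, scales = "free")
```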
Couple more observations about the report...

l2fwd by QEMU and DPDK

Wow! Looking more closely at the sliced l2fwd results there are a couple of really significant things: QEMU 2.4.1 works better than the others and DPDK 1.8 doesn't work at all. Here is how the data looks when we plot the success rate of the l2fwd benchmark separately for each QEMU and DPDK version combination:

```r
library(dplyr)    # group_by, summarize
library(ggplot2)

success = summarize(group_by(l2fwd, qemu, dpdk), success = mean(!is.na(score)))
ggplot(success, aes(qemu, dpdk, fill = success)) +
  geom_tile(aes(fill = success)) +
  scale_fill_gradient(low = "red", high = "white") +
  geom_text(aes(label = scales::percent(success)))
```

and this suggests a lot of interesting ideas.
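(For context, the `l2fwd` data frame above is presumably just the l2fwd subset of the full results; a sketch of how it could be built, reusing the assumed `bench.csv` and column names from the quick start above:)

```r
# Sketch only: assumes the same illustrative CSV and column names as above.
library(dplyr)

results <- read.csv("bench.csv")
l2fwd   <- filter(results, benchmark == "l2fwd")
```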
This is awesome! Go Hydra!

iperf duplicate rows

The iperf success and failure counts are all multiples of 5. The CSV seems to contain duplicate rows. @domenkozar Could this be something in the nix expression? For example, could it be that the matrix is including a row for each different DPDK version (there are 5) and then reusing the result, since that test does not use the software?

Stats and modelling

Just a thought: R has some really fancy features for fitting data to models and e.g. telling you which factor explains the most variation in results. So if we write some nice expressions then R should be able to automatically tell us "QEMU version and DPDK version each have a huge impact on l2fwd results."
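On the duplicate-rows hunch, a quick sanity check is to count exact duplicates in the CSV; a sketch against the assumed `results` data frame from above:

```r
# Sketch: how many rows are exact copies of an earlier row?
sum(duplicated(results))

# Peek at a few of the duplicated rows.
head(results[duplicated(results), ])
```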
Hypothesis for what is going on with QEMU here: The original QEMU vhost-user feature that we contributed to QEMU 2.1 did not support resetting the Virtio-net device. This is simply something that we missed and lacked a test for. The consequence is that if the guest shuts down its Virtio-net device and reuses the descriptor memory for other things then it will lead to garbage DMA requests, e.g. causing an error in the vswitch or overwriting memory in the guest.

We found this problem later when we tested rebooting VMs. Reboots failed, presumably due to memory corruption during boot. We fixed this by extending QEMU to send a notification when the guest device shuts down so that the vswitch knows to stop processing DMA requests. This fix went upstream but was later reverted for reasons that I do not fully understand.

So my hypothesis is that QEMU 2.4.1 includes the fix, but none of the others do, and the error is triggered by the hand-over inside the VM from the kernel driver to the DPDK driver. The reason we did not see this with SnabbBot, even with other QEMU versions, is that we set up those VMs to dedicate the Virtio-net device to DPDK, i.e. prevented the kernel from initializing it at boot. The new VMs bootstrapped with nix are first letting the kernel initialize the device before handing it over to DPDK.

I'll get in touch with QEMU upstream and see what they reckon. I would certainly like to include device reset in our test suite and have that working reliably, so I am glad that the nix images are exercising this scenario and showing the problem.
Hard to put this stuff down :-). Next question: How come we still have 1% failures with QEMU == 2.4.1 and DPDK != 1.8.0?

First we could take a peek at some basic statistics about these rows:

Now we have really sliced-and-diced the data: only 17 rows left out of the 37,800 results in the CSV file. Quick observations:

Question is, how confident can we be about this? I think that the R aov (Analysis of Variance) feature can answer this for us. I believe this means that we can be 99.9% confident that this is a real effect rather than random noise.

So now we know of one bug somewhere in the test setup - failures when virtio-net options are suppressed - and we know what further testing to do for context: something like 10,000 more tests with QEMU 2.4.1 and DPDK ~= 1.8 so that we can be sure whether software versions are significant (would be handy to know for debugging purposes).

We could also browse Hydra to find the logs for these 17 failing tests and review them. First step could be dumping the 17 CSV rows in order to identify the relevant test cases.
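For reference, a minimal sketch of the kind of `aov` call meant here. It is not the exact command used; the column names `config` and `pktsize`, and the string encoding of the versions, are assumptions built on the illustrative `results` data frame from earlier:

```r
# Sketch: which factors explain the remaining failures on QEMU 2.4.1?
subset24 <- subset(results, qemu == "2.4.1" & dpdk != "1.8.0")
subset24$failed <- as.numeric(is.na(subset24$score))

fit <- aov(failed ~ dpdk + config + pktsize, data = subset24)
summary(fit)   # small Pr(>F) values suggest a factor really matters
```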
Looks to me like the reason for the 1% failures is mostly cases that run slowly and bump a 160-second timeout that we have in the Nix expression for the benchmark. Could be an idea to increase that timeout somewhat so that we can better differentiate between slow and failed cases. However, I will hold off on that for the moment because changing the benchmark definition would force Hydra to do a lot of work (rerunning tests for every branch based on the new definitions).

I dumped the list of failed cases, then manually transcribed these into job names and searched Hydra for their build links, and what I see in most (but not all) of the logs I checked is low speeds (~1 Mpps or less).
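The dump was presumably something along these lines (a sketch with assumed column names, not the exact command used):

```r
# Sketch: dump the failing QEMU 2.4.1 / DPDK != 1.8.0 cases for inspection.
failed <- subset(results, qemu == "2.4.1" & dpdk != "1.8.0" & is.na(score))
write.csv(failed, "failed-cases.csv", row.names = FALSE)
```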
@kbara I pushed a new version of the report that includes overlay histograms now. The Y-axis also now gives the exact number of results per bin. Better?

https://hydra.snabb.co/build/203152/download/2/report.html

(I also added the red-and-white "success rate" tables. I have a fix in the pipeline that will make the l2fwd picture look much better soon.)
Possibly better, but I honestly don't understand what is going on in those graphs without digging deeper. What is 'matrix'? How does 'matrix-next' differ from plain old next, since they're both being measured? What are score and count? What are their units, if applicable? IE, if I look under 'by benchmark', l2fwd goes up to nearly 12000, but iperf up to about 1000. I can make (uncertain) guesses as to what that means, but it seems unnecessarily opaque. The success/failure data looks very readable.
@kbara Score is Gbps (benchmark=iperf) or Mpps (benchmark=l2fwd), i.e. it is a scalar value saying how fast a test went (higher is better). We kind of get away with mixing the units because they are vaguely similar in magnitude. Count is the number of test results that fell within a range of scores. For example in the first graph, on the left, we can see that every branch had around 6,000 failed tests (histogram bucket at 0). Does that help?
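To make "count" concrete: it is just the height of a histogram bin, i.e. something along these lines (a sketch with the same assumed column names as earlier, with failures shown as the bucket at zero):

```r
# Sketch: number of results per score bin, one colour per branch.
library(ggplot2)

results$score[is.na(results$score)] <- 0   # plot failures as the zero bucket
ggplot(results, aes(x = score, fill = branch)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5)
```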
Cooooool! Here is a fresh new NFV test matrix report that is much more exciting. @kbara I would be really interested to hear your take on what the report says! (Have only glanced at it so far myself but itching to take a closer look :-))
I still really wish there were units on the axes - I have no idea on initially looking at the first graph if '15' is Gbps (on two cards? bidirectional?) or Mpps (your comment above about iperf vs l2fwd, plus the graph below it, makes me think Gbps, but it would be so, so much easier to just read that. Some of the graphs include words like iperf - they could just say 'iperf - Gbps', though a label on the axis would match my preferences from school and reading research papers).

How does it choose how many times to run a test? IE, on the overall summary, it looks like there are probably a lot more runs of master than nfv-test (certainly a lot more zeros, then the other results seem to look similar) - are there? If so, why? Also, doesn't that make the two smooth graphs impossible to compare without scaling? [Edit: actually, it looks like a density graph, so disregard the last question.]

Why do the nfv-test scores seem to be so much better than the master scores?

I continue to really like how clear this format makes failure data. I think it needs a bit of tweaking on conveying the rest of the data, but it's improving!
@kbara Thanks for taking the time to read all of this and give feedback. I am working this out as I go along so it is very helpful to be able to discuss it.

Zooming out for one sec, I see two distinct problems to solve with these benchmarks. On the one hand, for deployment planning we need to really care about what we are measuring and how to interpret the results in real-world terms. On the other hand, for software optimization it may be reasonable to treat the benchmark results more abstractly, e.g. simply as "scores" for which higher values are linearly better. (Like e.g. SPECint: I can't remember what workload that represents but I know the GCC people would be thrilled to get a patch that increases the score by 2%.)

Just now I am mostly wearing the "software optimization" hat and that is why I am being loose with units. The benchmark scores are numbers and I want to make them higher (move ink to the right) and more consistent (move ink upwards).

Coming back to your comments: I think we are seeing the limitations of the cookie-cutter approach of generating canned graphs for what is, for now at least, more of an exploration than a quantification. We will have to see how the situation evolves with experience, i.e. how much we can effectively summarize with predefined graphs and statistics (as few as possible) and when we need to roll up our sleeves for some data analysis. Relatedly, here is my R bookshelf at the moment:

New case study

Just at braindump quality but maybe interesting anyway: a rough draft analysis of some tests that I ran over the weekend to evaluate the impact of a QEMU patch that never went upstream.
I want to de-lurk briefly... I am doing some work on automated benchmarking of systems to track performance changes/regressions and I came across a paper a couple of weeks ago, "Rigorous Benchmarking in Reasonable Time". Whilst I won't claim to understand it all yet, it might be of some use. A snippet from the abstract: 'we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. ... We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment necessary and sufficient to obtain a given level of precision.'

Right, now back to lurking until I have the time to make a proper contribution :-)
@tobyriddell Thanks for the link! Please feel welcome to braindump about performance testing work here :). I have skimmed the paper. Looks fancy :) I don't immediately grasp the details but I think that I understand the problem they are working on, i.e. optimizing the use of test resources (e.g. hardware) by running exactly the number of tests needed to reach an appropriate confidence interval.

I am cautious about this fancy approach for two reasons.

First, are they optimizing for answering predefined questions at the expense of post-hoc data exploration? I want to use the datasets both for hypothesis testing and also for hypothesis generation, i.e. poking around interactively in the dataset to look for interesting patterns ("I wonder if CPU microarchitecture explains some of the variation in IPsec scores?" etc). So there is at least some value for me in generating a large and regular dataset.

Second, I really like the idea of having a clean division of labor between different skill sets:

If these activities are well separated then you only need one skill set to contribute to the project. For example, if R and statistics are your thing then it is nice to be able to simply work on the CSV files without having to operate the CI infrastructure.

End braindump :).
@kbara Braindump... putting on the "characterizing system performance in real world terms" hat for a moment. Quoth R for data science:
I submit that describing the performance of an application is a statistical modeling problem. The goal is to define a model that accurately predicts real-world performance. The model is simply a function whose inputs are relevant information about the environment (hardware, configuration, workload, etc) and whose output is a performance estimate. The goodness of the model depends on how easy it is to understand and how accurately it estimates real-world performance.

So how would this look in practice? Let's take the simplest possible model and then make some refinements.

Model A: Magic number

This simple model predicts that the application always performs at "line rate" 10G.

```lua
function A ()
   return 10.0 -- Gbps
end
```

This model is very simple but there is no evidence in the formulation to suggest that it is accurate.

Model B: Symbolic constant

This model predicts that the application performance is always the value of a single constant `k.Gbps`.

```lua
function B (k)
   return k.Gbps -- 'Gbps' value in table of constants k
end
```

The constant `k.Gbps` would be estimated from the benchmark results.

Model C: Linear with processor speed

This model takes hardware differences into account by estimating that performance scales linearly with processor clock speed.

```lua
function C (k, e)
   return e.GHz * k.bitsPerCycle
end
```

Here the end-user supplies the value `e.GHz` for their own hardware. The performance curve is modeled as a straight line that increases with GHz. The constant `k.bitsPerCycle` is estimated from the benchmark results.

Model D: Many factors

This model is more elaborate, taking a step towards practicality.

```lua
function D (k, e)
   local per_core = e.GHz * bitsPerCycle(k, e)
   return per_core * math.pow(e.cores, k.multicoreScaling)
end

-- Return the number of bits processed per cycle.
--
-- Performance depends on whether IPsec is enabled and if so then also
-- on the CPU microarchitecture (because Haswell AES-NI is twice as
-- fast as Sandy Bridge).
function bitsPerCycle (k, e)
   if e.ipsec and e.haswell then return k.fastIpsecBitsPerCycle
   elseif e.ipsec then return k.slowIpsecBitsPerCycle
   else return k.baseBitsPerCycle end
end
```

Here the user supplies this information:

- `e.GHz`: processor clock speed
- `e.cores`: number of cores used
- `e.ipsec`: whether IPsec is enabled
- `e.haswell`: whether the CPU is Haswell (vs. Sandy Bridge)

And the test suite calculates these constants:

- `k.baseBitsPerCycle`, `k.fastIpsecBitsPerCycle`, `k.slowIpsecBitsPerCycle`
- `k.multicoreScaling`
This would be cool, huh? You could really understand a lot about the performance of a given release just by looking at those constants, e.g. how close is the scalability to linear, how expensive are the features, how important is the choice of CPU, etc.

I am not really sure if the R side of this would be easy, hard, or impossible. I suspect it would be straightforward: port that function to R and feed the benchmark CSV file into some suitable off-the-shelf "nonlinear regression" algorithm (see the sketch below), but maybe I am being hopelessly naive. If we were really working with R at this level of sophistication then we could start to refine the model iteratively. We would compare actual results with predicted results and check for patterns in the "residuals" that indicate important details that our model has missed. This may lead us to refine the model to make better predictions.

Wrapup

Just now I like the dream of using statistical modeling to understand application performance. The model would then represent our actual understanding of how an application behaves in different situations. This could then be communicated to end-users in whatever is the most appropriate way, e.g. a table summarizing the most important workloads, an Excel macro that calculates the performance estimate for a given software release, or ... etc.

End brain dump!
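On the "port the model to R and hand it to an off-the-shelf nonlinear regression" idea, here is a rough sketch of the shape that could take with `nls`. The column names, their encodings, and the starting values are all assumptions, and whether a fit like this converges depends entirely on the data:

```r
# Sketch: estimate Model D's constants from the benchmark CSV with nls().
# Assumed columns: score (Gbps), GHz, cores, ipsec (0/1), haswell (0/1).
bitsPerCycle <- function(ipsec, haswell, base, fastIpsec, slowIpsec) {
  ifelse(ipsec & haswell, fastIpsec,
         ifelse(ipsec, slowIpsec, base))
}

fit <- nls(score ~ GHz * bitsPerCycle(ipsec, haswell, base, fastIpsec, slowIpsec) *
                   cores ^ multicoreScaling,
           data  = subset(results, !is.na(score)),
           start = list(base = 1, fastIpsec = 0.5, slowIpsec = 0.25,
                        multicoreScaling = 0.9))
summary(fit)   # fitted constants and their standard errors
```

If that converged, comparing `predict(fit)` against the actual scores would be the "check the residuals" step described above.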
Interesting, I've been seeing it in an almost entirely different set of ways. :-) (The following are musings on performance; I see correctness and being able to eliminate test failures as even more important, but that's another post entirely - and the current tooling is a big step in that direction already.)

To me, 'line rate' is a first class node in my mental model: "on reasonable hardware X and card Y, does app network Z achieve line rate with packet size/distribution alpha"? This leads me to think of comparing two branches or configurations in a handful of ways (I was hashing a bit of this out with someone with a deep stats background a couple weeks ago): a) Raw averages: which one is better? The graphs so far are good at showing this. Questions I'd concretely like to answer include:

A bit more pie-in-the-sky-ily, I could imagine using tooling like this to optimize things like the number of packets per breath to particular hardware, automatically, if that seemed like the kind of parameter that performance was actually sensitive to across different hardware. I see this kind of tooling as being able to give us really rich information about what matters, and whether it always matters in the same way. It's easy to overlook how much tuning constants can matter - and we can take a lot of the guesswork out of it, and empirically see how it works on different hardware.

Check out this new NFV test matrix report!
This is a fully automated workflow where Hydra detects changes on branches under test, executes many end-to-end benchmarks (>10,000) with VMs in different setups, and produces a report that shows benchmark results both in broad overview and sliced-and-diced by different factors. The tests run on a 10-machine cluster and take around one day to complete. Awesome work, @domenkozar!!!
We should be able to hook this up to upstream branches including `master` and `next` once we land #969, which allows Snabb NFV to run without a physical hardware NIC. Just for now it is connected to a couple of branches that do include this PR.

Observations
This is a preview in the sense that we have not been able to thoroughly sanity-check the results yet and some of them are different than we have seen in the past. So take all of this with a grain of salt.
On the Overall graph we can see that the two Snabb branches being tested, `matrix` and `matrix-next`, look practically identical.

On the iperf graph we can see results clustered around three places: 0 Gbps (failure), 6 Gbps, and 10-15 Gbps. On the iperf configuration graph we can see the explanation: the filter tests are failing (likely our packet filter is blocking iperf); the ipsec tests are delivering ~6 Gbps; and the baseline and L2TPv3 tests are delivering ~10-15 Gbps.
On the l2fwd graph we can see that most results are failing. Question is, why? I am not sure. The l2fwd success and failure stats may point to QEMU version and packet size being important factors? (These tests are much more reliable with SnabbBot, so what is the difference here? Could be related to the guest kernel.)
Onward
So! Now we can automatically see whether changes are good for Snabb NFV. We want to merge changes that make the humps higher (more consistency) and that move the area under the curves to the right (faster average). Then we also want harsher test cases that reveal problems by spreading the curves out and moving them to the left :).
If this works well in practice then we could gradually add coverage for all the other Snabb applications too.
Feedback?
Questions, suggestions, comments?