Add documentation about sorting param #1419

Closed

polarathene opened this issue Jun 12, 2019 · 5 comments

@polarathene

Latency Top 10:

| Language (Runtime) | Framework (Middleware) | Average | 50th percentile | 90th percentile | 99th percentile | 99.9th percentile | Standard deviation |
|---|---|---|---|---|---|---|---|
| rust (1.35) | nickel (0.11) | 0.07 ms | 0.06 ms | 0.10 ms | 0.12 ms | 0.99 ms | 24.33 |
| ruby (2.6) | roda (3.2) | 3.50 ms | 0.17 ms | 13.67 ms | 32.96 ms | 82.73 ms | 7372.33 |
| ruby (2.6) | rack-routing (0.0) | 4.66 ms | 0.22 ms | 18.59 ms | 41.10 ms | 107.99 ms | 9556.00 |
| rust (1.35) | iron (0.6) | 0.36 ms | 0.36 ms | 0.60 ms | 0.87 ms | 11.31 ms | 205.00 |
| php (7.3) | symfony (4.3) | 93.23 ms | 0.37 ms | 216.72 ms | 1956.34 ms | 6787.76 ms | 368518.33 |
| php (7.3) | laravel (5.8) | 154.07 ms | 0.39 ms | 330.01 ms | 3101.21 ms | 6937.61 ms | 549852.00 |
| php (7.3) | slim (3.12) | 127.39 ms | 0.39 ms | 266.58 ms | 2402.54 ms | 6879.72 ms | 458913.00 |
| php (7.3) | zend-expressive (3.2) | 170.55 ms | 0.39 ms | 304.31 ms | 3452.25 ms | 6962.37 ms | 626142.00 |
| ruby (2.6) | flame (4.18) | 7.22 ms | 0.39 ms | 25.05 ms | 52.38 ms | 138.96 ms | 12378.00 |
| php (7.3) | lumen (5.8) | 141.59 ms | 0.39 ms | 279.21 ms | 3045.51 ms | 6945.63 ms | 525800.67 |

The only column that seems to define the rank is the 50th percentile. Could the README mention that this is how the ranking is done?

iron definitely seems ahead of rack-routing. flame looks like it should be ahead of the 3 PHP results above it: they all tie at a 0.39 ms 50th percentile, but flame's standard deviation should give it the upper hand here. There are other cases like this further down the chart, such as agoo-c vs rocket:

| Language (Runtime) | Framework (Middleware) | Average | 50th percentile | 90th percentile | 99th percentile | 99.9th percentile | Standard deviation |
|---|---|---|---|---|---|---|---|
| rust (nightly) | rocket (0.4) | 106.24 ms | 1.58 ms | 71.40 ms | 2257.22 ms | 4945.40 ms | 441283.00 |
| c (11) | agoo-c (0.5) | 2.96 ms | 1.90 ms | 6.70 ms | 14.38 ms | 105.26 ms | 3145.33 |

Here rocket has a slight lead on the 50th percentile, but agoo-c seems better overall?

Would it be OK to use a second value with some weighting to get a better ranking? I'm not stats savvy, but looking over the table, if you weighted the 50th percentile at 90% against 10% of the Average, you'd get an order that seems more representative of the performance, rather than rewarding entries that lose consistency and skew quite poorly past the halfway mark. To make the idea concrete, here is a minimal sketch of that weighting (see the snippet below).
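
The sketch below is illustrative only, not code from this repo; it just applies the proposed 90/10 blend to the rocket and agoo-c rows from the table above:

```python
# Illustrative sketch of the proposed 90/10 weighting, not code from this repo.
# Inputs are the "50th percentile" (median) and "Average" (mean) columns, in ms.
def weighted_score(median_ms, mean_ms, median_weight=0.9):
    """Blend the median latency with the mean so inconsistent tails are penalised."""
    return round(median_weight * median_ms + (1 - median_weight) * mean_ms, 2)

# Numbers taken from the rocket vs agoo-c table above.
print(weighted_score(1.58, 106.24))  # rocket (0.4) -> 12.05
print(weighted_score(1.90, 2.96))    # agoo-c (0.5) -> 2.01, now ranks ahead of rocket
```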

@waghanza
Collaborator

Hi @polarathene,

You're right.

The rank (if we can say so) is computed from the 50th percentile.

The main idea was to get as close as we can to helping people decide which language/framework to use.

As my background is aws / eu-west-1, the 50th percentile figure seems to reflect real-world performance ;-)

BTW, any idea / recommendation is ❤️

PS: I will edit the title to reflect the main idea -> the sorting param (50th percentile) SHOULD be documented in the README

@waghanza changed the title from "Rank position seems off" to "Add documentation about sorting param" on Jun 12, 2019
@polarathene
Author

> As my background is aws / eu-west-1, the 50th percentile figure seems to reflect real-world performance ;-)

I was just referring to the table results listed, not in relation to performance/experiences elsewhere.

As noted above with iron, flame and agoo-c, ranking purely on the median value (the 50th percentile column) does not seem to ideally represent how they should be ranked/sorted regarding performance?

It's true that, on that metric alone, the top 50% of responses have better latency, but the second half of the results tells a very different story. I think the consistency/stability of low latency throughout should carry some weight in the scoring. The standard deviation shows how some of the current positions get away with having their slower half of responses ignored.


I have asked a statistics community for their input on a proper way to improve the scoring/ranking, and I will let you know if they make any suggestions. For now, assigning a small amount of weight to the Average (mean) value seems to cause no negative impact, but leads to a much more representative ranking of the results.

| Framework (Middleware) | Average | 50th percentile | Weighted Score |
|---|---|---|---|
| roda (3.2) | 3.50 ms | 0.17 ms | 0.50 ms |
| rack-routing (0.0) | 4.66 ms | 0.22 ms | 0.66 ms |
| iron (0.6) | 0.36 ms | 0.36 ms | 0.36 ms |
| rocket (0.4) | 106.24 ms | 1.58 ms | 12.05 ms |
| agoo-c (0.5) | 2.96 ms | 1.90 ms | 2.01 ms |

Where Weighted Score is 90% of the median (50th percentile) + 10% of the mean (Average), rounded to the nearest hundredth of a millisecond (2 decimal places). And now, if we rank by the Weighted Score value instead:

| Framework (Middleware) | Average | 50th percentile | Weighted Score |
|---|---|---|---|
| iron (0.6) | 0.36 ms | 0.36 ms | 0.36 ms |
| roda (3.2) | 3.50 ms | 0.17 ms | 0.50 ms |
| rack-routing (0.0) | 4.66 ms | 0.22 ms | 0.66 ms |
| agoo-c (0.5) | 2.96 ms | 1.90 ms | 2.01 ms |
| rocket (0.4) | 106.24 ms | 1.58 ms | 12.05 ms |

The ranking seems to better represent performance by giving a small amount of weight to the slower half of the latency results.

I did not include flame vs the PHP frameworks as that one should be self-explanatory: if only ranking by the median, you should also have a secondary sorting factor for when there are ties (a rough sketch follows below).
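
As a rough illustration of the tie-break idea (nothing from this repo's codebase; the field names are just mine), a sort keyed on the median with the standard deviation as a secondary key would already put flame ahead of the PHP rows that tie at 0.39 ms:

```python
# Illustrative only: primary key = median (50th percentile), secondary key =
# standard deviation, to break ties. Numbers are the 0.39 ms rows from the
# first table; the dict keys are assumed names, not this repo's schema.
rows = [
    {"framework": "laravel (5.8)",         "median_ms": 0.39, "std_dev": 549852.00},
    {"framework": "slim (3.12)",           "median_ms": 0.39, "std_dev": 458913.00},
    {"framework": "zend-expressive (3.2)", "median_ms": 0.39, "std_dev": 626142.00},
    {"framework": "flame (4.18)",          "median_ms": 0.39, "std_dev": 12378.00},
    {"framework": "lumen (5.8)",           "median_ms": 0.39, "std_dev": 525800.67},
]

# All five tie on the median, so the standard deviation decides the order,
# which puts flame first, then slim, lumen, laravel, zend-expressive.
ranked = sorted(rows, key=lambda r: (r["median_ms"], r["std_dev"]))
print([r["framework"] for r in ranked])
```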

@waghanza
Collaborator

> I have asked a statistics community for their input on a proper way to improve the scoring/ranking, and I will let you know if they make any suggestions

🎉 Thanks for this

However, be aware that the results are not very accurate. I mean, all of this is actually running on local docker, and docker messes up the results. After some documentation PRs and #1011, I will work on #632 so that the results are not messed up anymore.

@polarathene
Author

> However, be aware that the results are not very accurate. I mean, all of this is actually running on local docker, and docker messes up the results.

Yes, I understand; there is a clear warning at the top of the README pointing that out. But that would not change anything regarding how the results are sorted.

I do understand that the actual results themselves are not stable at present, as is evident from the test result history varying widely across past commits. I am just interested in more accurately representing how well a framework has performed based on the given data.

The weighted score suggestion above seems to work well?


Off-topic to the issue:

> running on local docker, and docker messes up the results

I don't quite follow how Docker messes up results here. Is it because of the different base images? Docker, if anything, should be a useful tool for getting consistency. On bare metal you're dealing with the distro environment and its own package manager; not all distros are the same, and there are many other factors that can impact results.

Users' systems likewise aren't likely to be at parity with where you run the tests. But the results provide some insight, and users can then verify on their own systems whether the results are similar (easier to do using the same Docker images, then adapting to their own needs/environment after confirmation).

On bare metal, you can do some things to better ensure consistency, such as pinning CPU cores to the processes involved (Docker would again be useful here, afaik), and you can also isolate those CPU cores so that nothing else on the system is permitted to use them for processing.

Once you involve an external network, that's a different variable that you might not have much control over and that lacks consistency. It's useful information to include and can still be achieved with Docker; the quality of the network is going to vary for users though, just like other parts of the environment, so local tests are still useful imo. You can also configure a network that has the characteristics of what you'd get from an external network.

@waghanza
Collaborator

Feel free to suggest any idea about how to rank/sort ❤️

I have taken ideas from #670 and #223, but I have no preference myself 😛


> running on local docker, and docker messes up the results

I mean that the metrics are computed in a way that prevents any framework from being pushed to its performance limits:

- because the docker engine is there (adding flexibility but decreasing raw performance)
- because of the local network
- because of parallelization: the sieger (wrk) targets multiple hosts at once, instead of one by one

My bad, this is NOT only about docker BUT more about local docker
