Tempesta health statistics bugs #2036
The maximum response time is the real collected time, but the values shown in the percentiles are not: they are scaled for optimization, so they may be aligned upward and end up larger than the real value (see line 737 in b39ce95; a toy illustration of this rounding follows the logs patch below).
For example, if you have two responses with rtt=1960ms, the stats will be:

$ cat /proc/tempesta/perfstat
Minimal response time : 1960ms
Average response time : 1960ms
Median response time : 2029ms
Maximum response time : 1960ms
Percentiles
50%: 2029ms
75%: 2029ms
90%: 2029ms
95%: 2029ms
99%: 2029ms

Related logs were collected with the following debugging patch:

@@ -783,6 +783,7 @@ tfw_apm_prnctl_calc(TfwApmRBuf *rbuf, TfwApmRBCtl *rbctl, TfwPrcntlStats *pstats
while (p < T_PSZ) {
int v_min = USHRT_MAX;
for (i = 0; i < rbuf->rbufsz; i++) {
+ printk("p=%d, v=%u,i=%u,r=%u,b=%u\n", p,st[i].v, st[i].i,st[i].r,st[i].b);
if (st[i].v < v_min)
v_min = st[i].v;
}
@@ -794,8 +795,10 @@ tfw_apm_prnctl_calc(TfwApmRBuf *rbuf, TfwApmRBCtl *rbctl, TfwPrcntlStats *pstats
cnt += pcntrng->cnt[st[i].r][st[i].b];
tfw_apm_state_next(pcntrng, &st[i]);
}
- for ( ; p < T_PSZ && pval[p] <= cnt; ++p)
+ for ( ; p < T_PSZ && pval[p] <= cnt; ++p) {
+ printk("p=%d, v_min=%u\n", p,v_min);
pstats->val[p] = v_min;
+ }
@@ -1074,6 +1077,7 @@ __tfw_apm_update(TfwApmData *data, unsigned long jtstamp, unsigned int rtt)
int centry = (jtstamp / tfw_apm_jtmintrvl) % data->rbuf.rbufsz;
unsigned long jtmistart = jtstamp - (jtstamp % tfw_apm_jtmintrvl);
TfwApmUBEnt rtt_data = { .centry = centry, .rtt = rtt };
+ printk("rtt=%d, centry=%d\n", rtt, centry);
tfw_apm_rbent_checkreset(&data->rbuf.rbent[centry], jtmistart);
WRITE_ONCE(ubent[jtstamp % ubuf->ubufsz].data, rtt_data.data);
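A minimal userspace sketch of this rounding effect, assuming power-of-two bucket widths; the order value and the 1984ms result are illustrative only (Tempesta's real ranges produce 2029ms in the example above), not Tempesta's actual code:

#include <stdio.h>

int main(void)
{
	unsigned int rtt = 1960;	/* real sample, ms */
	unsigned int order = 6;		/* hypothetical range order: 64ms buckets */
	unsigned int bucket = rtt >> order;		/* scaled on update */
	unsigned int approx = (bucket + 1) << order;	/* boundary reported on readout */

	/* Prints: rtt=1960ms -> bucket=30 -> reported=1984ms */
	printf("rtt=%ums -> bucket=%u -> reported=%ums\n", rtt, bucket, approx);
	return 0;
}

The reported value exceeds the real sample because the readout returns a bucket boundary rather than the stored RTT.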
It uses the median value (see line 325 and line 106 in 8da75fd).
The "Socket buffers in flight" is system-wide. So it'll change whenever some programs have network activities. In fact, it's non-zero even if tempesta has not started. The counter can be viewed by any simple kernel module: https://gist.github.com/kingluo/9c7feb606a1348cfb8c942ecb35bddf6#file-hello-c And, it seems that the counter maintenance is wrong, e.g. it does not decrease the counter in We can check the counter maintenance by a simple systemtap script. While that script is running, we trigger some random network activities, like ping a host or curl some website, then we check if the counter (delta) is 0. https://gist.github.com/kingluo/9c7feb606a1348cfb8c942ecb35bddf6#file-skb_stats-stap |
Part of tempesta-tech/tempesta#2036: `napi_skb_free_stolen_head()` should decrease the skb counter.
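A hedged sketch of the accounting pattern in question; the counter name and hook functions are hypothetical, illustrating the bug class rather than Tempesta's actual code:

#include <linux/atomic.h>
#include <linux/skbuff.h>

static atomic64_t skb_in_flight;	/* hypothetical counter name */

static inline void skb_acct_alloc(struct sk_buff *skb)
{
	atomic64_inc(&skb_in_flight);
}

static inline void skb_acct_free(struct sk_buff *skb)
{
	atomic64_dec(&skb_in_flight);
}

/*
 * Every allocation path must be paired with a decrement on every free
 * path. If skb_acct_free() is hooked into __kfree_skb() but not into
 * napi_skb_free_stolen_head(), skbs freed through the latter are never
 * subtracted and the "in flight" value drifts upward forever.
 */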
Probably the percentiles bug should be fixed with a probabilistic algorithm like a count-min sketch, and maybe this should be done in #712.
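For reference, a minimal count-min sketch in userspace C; the sizes and hash constants are arbitrary, not from Tempesta or #712. The structure alone answers approximate frequency queries with one-sided error; a percentile estimator would layer e.g. dyadic range counting on top of it:

#include <stdint.h>
#include <stdio.h>

#define CMS_D 4			/* hash rows */
#define CMS_W 1024		/* counters per row */

typedef struct { uint32_t cnt[CMS_D][CMS_W]; } cms_t;

/* splitmix64-style mixer; constants are arbitrary. */
static uint64_t cms_hash(uint64_t x, uint64_t seed)
{
	x += seed * 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

static void cms_add(cms_t *s, uint64_t key)
{
	for (int d = 0; d < CMS_D; d++)
		s->cnt[d][cms_hash(key, d + 1) % CMS_W]++;
}

/* Returns an estimate that never undercounts (one-sided error). */
static uint32_t cms_query(const cms_t *s, uint64_t key)
{
	uint32_t min = UINT32_MAX;
	for (int d = 0; d < CMS_D; d++) {
		uint32_t c = s->cnt[d][cms_hash(key, d + 1) % CMS_W];
		if (c < min)
			min = c;
	}
	return min;
}

int main(void)
{
	static cms_t s;		/* zero-initialized */

	cms_add(&s, 1960);	/* two samples with rtt=1960ms */
	cms_add(&s, 1960);
	printf("count(1960ms) >= %u\n", cms_query(&s, 1960));
	return 0;
}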
The current design principles are: there are 4 ranges to collect RTT samples, each range has 16 buckets, and the actual RTT values are mapped to these buckets.
When calculating a histogram, each percentile boundary is determined from the cumulative count, scanning from left to right. Separately, the time window and scaling value in the configuration file are used to decide whether the histogram needs to be updated, to avoid overly frequent updates. Now we know why the histogram may not match the maximum value: the bounds are a logical approximation, which may be larger than the real maximum. So the solution is to use max as the histogram value whenever the approximation is greater than max; a sketch follows.
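A minimal userspace sketch of the proposed clamp, reproducing the two-sample case above; the bucket bounds, counts, and names are hypothetical, not Tempesta's actual code:

#include <stdio.h>

#define NBUCKETS 8

int main(void)
{
	/* Hypothetical bucket upper bounds (ms) and per-bucket counts:
	 * two samples of 1960ms land in the last bucket. */
	unsigned int bound[NBUCKETS] = { 16, 32, 64, 128, 256, 512, 1024, 2048 };
	unsigned int cnt[NBUCKETS]   = {  0,  0,  0,   0,   0,   0,    0,    2 };
	unsigned int total = 2, max_rtt = 1960;	/* real observed maximum */
	unsigned int pct[] = { 50, 75, 90, 95, 99 };

	for (int p = 0; p < 5; p++) {
		unsigned int need = (total * pct[p] + 99) / 100;
		unsigned int acc = 0, val = 0;

		/* Cumulative scan from left to right, as described above. */
		for (int b = 0; b < NBUCKETS; b++) {
			acc += cnt[b];
			if (acc >= need) {
				val = bound[b];	/* approximated value: 2048 */
				break;
			}
		}
		if (val > max_rtt)	/* the proposed fix: clamp to max */
			val = max_rtt;
		printf("%u%%: %ums\n", pct[p], val);	/* 1960ms, not 2048ms */
	}
	return 0;
}

With the clamp in place, every percentile reports 1960ms instead of the approximated 2048ms, so the percentiles can no longer exceed the maximum.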
Inherited from #1454
Negative values in statistics
At the current master, as of the date of this issue, I observe negative values in the statistics when only a couple of requests were processed by Tempesta (observed for the first line only):
One more backend server statistics issue, taken from a production server:
Obviously, the maximum response time cannot be smaller than the 99th or 95th percentile.
Hung socket buffers
The same number of socket buffers may appear in the site statistics for a relatively long time, which looks fishy (our web site statistics):