Summary
VHS will hang while in HTTP metrics mode when under load. It seems that 120-150 requests/sec is sufficient to reliably elicit this behavior on my system.
Expected behavior
When running VHS in HTTP metrics mode with the following command:
./vhs --input "tcp|http" --capture-response --address 0.0.0.0:80 --prometheus-address 0.0.0.0:8888
The expected behavior is that VHS will provide continuously-updated metrics on a Prometheus endpoint at <VHS IP address>:8888/metrics. See the following example for an excerpted sample of metrics output provided by VHS:
# HELP vhs_http_duration_seconds Duration of http exchanges seen by VHS.
# TYPE vhs_http_duration_seconds summary
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.5"} 0.0003203
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.75"} 0.0005124
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9"} 0.0006848
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.95"} 0.0007811
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.99"} 0.0009974
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.999"} 0.0011647
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9999"} 0.0011647
vhs_http_duration_seconds_sum{code="200",method="GET",path="/"} 0.1551513
vhs_http_duration_seconds_count{code="200",method="GET",path="/"} 412
# HELP vhs_http_requests_total Total count of http requests captured by VHS.
# TYPE vhs_http_requests_total counter
vhs_http_requests_total{code="200",method="GET",path="/"} 412
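(For reference, the endpoint serves the standard Prometheus text exposition format, so the values can be checked programmatically. Below is a minimal Go sketch, not part of VHS itself, that fetches the endpoint and prints the request counter. It assumes the default endpoint from the command above and uses the prometheus/common text parser.)
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Assumed endpoint: the --prometheus-address from the command above, reached from the host.
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Print each label set and value of the request counter.
	if mf, ok := families["vhs_http_requests_total"]; ok {
		for _, m := range mf.Metric {
			for _, lp := range m.Label {
				fmt.Printf("%s=%q ", lp.GetName(), lp.GetValue())
			}
			fmt.Printf("=> %v\n", m.Counter.GetValue())
		}
	}
}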
Actual (buggy) behavior
Under low load (<100 req/s), VHS seems to be stable and provides updated metrics for as long as I've cared to test (multiple hours).
Under higher load (>120 req/s), VHS will operate normally for up to ~30 minutes but will eventually hang. When the hang occurs, the metrics endpoint is still available, but the HTTP metrics are no longer updated: all reported metric values remain static across scrapes from that point forward. The metrics output at this point may also show an unusually large number of HTTP timeouts (the series with an empty code label below), as in the following sample from a hung session:
# HELP vhs_http_duration_seconds Duration of http exchanges seen by VHS.
# TYPE vhs_http_duration_seconds summary
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.5"} 0.0002597
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.75"} 0.0004211
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9"} 0.000598
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.95"} 0.000716
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.99"} 0.0011365
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.999"} 0.0019557
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9999"} 0.004606
vhs_http_duration_seconds_sum{code="200",method="GET",path="/"} 54.90745450000044
vhs_http_duration_seconds_count{code="200",method="GET",path="/"} 168625
# HELP vhs_http_requests_total Total count of http requests captured by VHS.
# TYPE vhs_http_requests_total counter
vhs_http_requests_total{code="",method="GET",path="/"} 10079
vhs_http_requests_total{code="200",method="GET",path="/"} 168625
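(Because the endpoint stays up while the counters freeze, the simplest way to confirm the hang is to scrape twice and compare. The following Go sketch, again assuming the localhost:8888 endpoint, polls the code="200" request counter every ten seconds with plain string matching and flags it when it stops moving.)
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

// scrapeCounter fetches /metrics and returns the raw line for the
// code="200" request counter, e.g. `vhs_http_requests_total{code="200",...} 168625`.
func scrapeCounter(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "vhs_http_requests_total") && strings.Contains(line, `code="200"`) {
			return line, nil
		}
	}
	return "", sc.Err()
}

func main() {
	const url = "http://localhost:8888/metrics" // assumed endpoint from the repro setup
	prev := ""
	for {
		cur, err := scrapeCounter(url)
		if err != nil {
			log.Fatal(err)
		}
		if cur != "" && cur == prev {
			fmt.Println("counter unchanged since last scrape -- VHS looks hung:", cur)
		}
		prev = cur
		time.Sleep(10 * time.Second)
	}
}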
Additionally, VHS may produce error output on the console. The errors usually pertain to failures in processing HTTP requests or request bodies. I will update this post with a sample when I gather one.
I have also seen VHS fail to start at least once when my load generation script was already running before VHS started. That is, there were a significant number of HTTP requests and responses on the wire when VHS started up. In that case, it hung immediately and produced error output similar to the failures described above.
Steps to recreate
I have created a docker-compose environment that reliably elicits this behavior on my machine. It can be found in the debug/ directory of the metrics-fix branch on my fork of the VHS repository. The following steps assume that you have a working installation of docker and docker-compose.
1. Clone the branch (or use whatever your preferred method is):
git clone -b metrics-fix https://github.com/ztreinhart/vhs.git
2. Bring up the docker-compose environment:
cd debug && docker-compose up
3. Start VHS:
docker exec -it debug_vhs_1 bash
go build ./cmd/vhs
./vhs --input "tcp|http" --capture-response --address 0.0.0.0:80 --prometheus-address 0.0.0.0:8888
4. Start the load generator (a rough Go equivalent is sketched after these steps):
docker exec -it debug_vhs_1 bash
cd debug && ./maxcurl.bash
5. Watch and wait: check the metrics endpoint at http://localhost:8888/metrics to make sure values are changing over time, and keep an eye on the VHS console for errors.
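(The contents of maxcurl.bash are not reproduced here. As a rough, hypothetical stand-in, the following Go sketch drives a fixed request rate of ~150 req/s, the level that reliably triggers the hang for me; the target URL and rate are assumptions about the compose environment, not taken from the script itself.)
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical stand-in for maxcurl.bash: issue GETs at a fixed rate.
	const target = "http://localhost:80/" // assumed: the service VHS is capturing
	const rate = 150                      // requests per second; ~120-150 elicits the hang

	ticker := time.NewTicker(time.Second / rate)
	defer ticker.Stop()

	for range ticker.C {
		go func() {
			resp, err := http.Get(target)
			if err != nil {
				log.Println("request error:", err)
				return
			}
			resp.Body.Close()
		}()
	}
}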
Notes
Running VHS with the --debug flag added to the command line above seems to change the behavior slightly. VHS will still hang, but it seems to be more likely to produce large numbers of false HTTP timeouts, and I have never actually seen it produce error output on the console in debug mode. No smoking gun in the debug messages as far as I can tell.