Summary
VHS will hang while in HTTP metrics mode when under load. It seems that 120-150 requests/sec is sufficient to reliably elicit this behavior on my system.
Expected behavior
When running VHS in HTTP metrics mode with the following command:
./vhs --input "tcp|http" --capture-response --address 0.0.0.0:80 --prometheus-address 0.0.0.0:8888
The expected behavior is that VHS will provide continuously-updated metrics on a Prometheus endpoint at <VHS IP address>:8888/metrics. See the following example for an excerpted sample of metrics output provided by VHS:
# HELP vhs_http_duration_seconds Duration of http exchanges seen by VHS.
# TYPE vhs_http_duration_seconds summary
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.5"} 0.0003203
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.75"} 0.0005124
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9"} 0.0006848
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.95"} 0.0007811
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.99"} 0.0009974
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.999"} 0.0011647
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9999"} 0.0011647
vhs_http_duration_seconds_sum{code="200",method="GET",path="/"} 0.1551513
vhs_http_duration_seconds_count{code="200",method="GET",path="/"} 412
# HELP vhs_http_requests_total Total count of http requests captured by VHS.
# TYPE vhs_http_requests_total counter
vhs_http_requests_total{code="200",method="GET",path="/"} 412
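(For reference, the endpoint serves the standard Prometheus text exposition format, so the values can be checked programmatically. Below is a minimal Go sketch, not part of VHS itself, that fetches the endpoint and prints the request counter. It assumes the default endpoint from the command above and uses the prometheus/common text parser.)
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Assumed endpoint: the --prometheus-address from the command above, reached from the host.
	resp, err := http.Get("http://localhost:8888/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Print each label set and value of the request counter.
	if mf, ok := families["vhs_http_requests_total"]; ok {
		for _, m := range mf.Metric {
			for _, lp := range m.Label {
				fmt.Printf("%s=%q ", lp.GetName(), lp.GetValue())
			}
			fmt.Printf("=> %v\n", m.Counter.GetValue())
		}
	}
}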
Actual (buggy) behavior
Under low load (<100 req/s), VHS seems to be stable and provides updated metrics for as long as I've cared to test (multiple hours).
Under higher load (>120 req/s), VHS will operate normally for up to ~30 minutes but will eventually hang. When the hang occurs, the metrics endpoint is still available, but the HTTP metrics are no longer updated: all reported metric values remain static across scrapes from that point forward. The metrics output at this point may also show an unusually large number of HTTP timeouts (the series with an empty code label below), as in the following sample from a hung session:
# HELP vhs_http_duration_seconds Duration of http exchanges seen by VHS.
# TYPE vhs_http_duration_seconds summary
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.5"} 0.0002597
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.75"} 0.0004211
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9"} 0.000598
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.95"} 0.000716
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.99"} 0.0011365
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.999"} 0.0019557
vhs_http_duration_seconds{code="200",method="GET",path="/",quantile="0.9999"} 0.004606
vhs_http_duration_seconds_sum{code="200",method="GET",path="/"} 54.90745450000044
vhs_http_duration_seconds_count{code="200",method="GET",path="/"} 168625
# HELP vhs_http_requests_total Total count of http requests captured by VHS.
# TYPE vhs_http_requests_total counter
vhs_http_requests_total{code="",method="GET",path="/"} 10079
vhs_http_requests_total{code="200",method="GET",path="/"} 168625
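(Because the endpoint stays up while the counters freeze, the simplest way to confirm the hang is to scrape twice and compare. The following Go sketch, again assuming the localhost:8888 endpoint, polls the code="200" request counter every ten seconds with plain string matching and flags it when it stops moving.)
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

// scrapeCounter fetches /metrics and returns the raw line for the
// code="200" request counter, e.g. `vhs_http_requests_total{code="200",...} 168625`.
func scrapeCounter(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "vhs_http_requests_total") && strings.Contains(line, `code="200"`) {
			return line, nil
		}
	}
	return "", sc.Err()
}

func main() {
	const url = "http://localhost:8888/metrics" // assumed endpoint from the repro setup
	prev := ""
	for {
		cur, err := scrapeCounter(url)
		if err != nil {
			log.Fatal(err)
		}
		if cur != "" && cur == prev {
			fmt.Println("counter unchanged since last scrape -- VHS looks hung:", cur)
		}
		prev = cur
		time.Sleep(10 * time.Second)
	}
}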
Additionally, VHS may produce error output on the console. The errors usually pertain to failures in processing HTTP requests or request bodies. I will update this post with a sample when I gather one.
I have also seen VHS fail to start at least once when my load generation script was already running before VHS started. That is, there were a significant number of HTTP requests and responses on the wire when VHS started up. In that case, it hung immediately and produced error output similar to the failures described above.
Steps to recreate
I have created a docker-compose environment that reliably elicits this behavior on my machine. It can be found in the debug/ directory of the metrics-fix branch on my fork of the VHS repository. The following steps assume that you have a working installation of docker and docker-compose.
1. Clone the branch (or use whatever your preferred method is):
git clone -b metrics-fix https://github.com/ztreinhart/vhs.git
2. Bring up the docker-compose environment:
cd debug && docker-compose up
3. Start VHS:
docker exec -it debug_vhs_1 bash
go build ./cmd/vhs
./vhs --input "tcp|http" --capture-response --address 0.0.0.0:80 --prometheus-address 0.0.0.0:8888
4. Start the load generator (a rough Go equivalent is sketched after these steps):
docker exec -it debug_vhs_1 bash
cd debug && ./maxcurl.bash
5. Watch and wait: check the metrics endpoint at http://localhost:8888/metrics to make sure values are changing over time, and keep an eye on the VHS console for errors.
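(The contents of maxcurl.bash are not reproduced here. As a rough, hypothetical stand-in, the following Go sketch drives a fixed request rate of ~150 req/s, the level that reliably triggers the hang for me; the target URL and rate are assumptions about the compose environment, not taken from the script itself.)
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical stand-in for maxcurl.bash: issue GETs at a fixed rate.
	const target = "http://localhost:80/" // assumed: the service VHS is capturing
	const rate = 150                      // requests per second; ~120-150 elicits the hang

	ticker := time.NewTicker(time.Second / rate)
	defer ticker.Stop()

	for range ticker.C {
		go func() {
			resp, err := http.Get(target)
			if err != nil {
				log.Println("request error:", err)
				return
			}
			resp.Body.Close()
		}()
	}
}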
Notes
Running VHS with the --debug flag added to the command line above seems to change the behavior slightly. VHS will still hang, but it seems to be more likely to produce large numbers of false HTTP timeouts, and I have never actually seen it produce error output on the console in debug mode. No smoking gun in the debug messages as far as I can tell.