diff --git a/README.md b/README.md index 08fada23..2967d3b8 100644 --- a/README.md +++ b/README.md @@ -519,131 +519,6 @@ For example, without any run-time options the following load test would automati Ok(()) ``` -## Coordinated Omission Mitigation - -By default, Goose attempts to mitigate the loss of metrics data (Coordinated Omission) caused by an abnormally lengthy response to a request. - -### Definition - -To understand Coordinated Omission and how Goose attempts to mitigate it, it's necessary to understand how Goose is scheduling requests. Goose launches one thread per `GooseUser`. Each `GooseUser` is assigned a single `GooseTaskSet`. Each of these `GooseUser` threads then loop repeatedly through all of the `GooseTasks` defined in the assigned `GooseTaskSet`, each of which can involve any number of individual requests. However, at any given time, each `GooseUser` is only making a single request and then asynchronously waiting for the response. - -If something causes the response to a request to take abnormally long, raw Goose metrics only see this slowdown as affecting a specific request (or one request per `GooseUser`). The slowdown can be caused by a variety of issues, such as a resource bottleneck (on the Goose client or the web server), garbage collection, a cache stampede, or even a network issue. A real user loading the same web page would see a much larger effect, as all requests to the affected server would stall. Even static assets such as images and scripts hosted on a reliable and fast CDN can be affected, as the web browser won't know to load them until it first loads the HTML from the affected web server. Because Goose is only making one request at a time per `GooseUser`, it may only see one or very few slow requests and then all other requests resume at normal speed. This results in a bias in the metrics to "ignore" or "hide" the true effect of a slowdown, commonly referred to as Coordinated Omission. - -### Mitigation - -Goose attempts to mitigate Coordinated Omission by back-filling the metrics with the statistically expected requests. To do this, it tracks the normal "cadence" of each `GooseUser`, timing how long it takes to loop through all `GooseTasks` in the assigned `GooseTaskSet`. By default, Goose will trigger Coordinated Omission Mitigation if the time to loop through a `GooseTaskSet` takes more than twice as long as the average time of all previous loops. In this case, on the next loop through the `GooseTaskSet` when tracking the actual metrics for each subsequent request in all `GooseTasks` it will also add in statistically generated "requests" with a `response_time` starting at the unexpectedly long request time, then again with that `response_time` minus the normal "cadence", continuing to generate a metric then subtract the normal "cadence" until arriving at the expected `response_time`. In this way, Goose is able to estimate the actual effect of a slowdown. - -When Coordinated Omission Mitigation detects an abnormally slow request, Goose will generate an INFO level message (which will be visible if Goose was started with the `-v` run time flag, or written to the log if started with the `-g` run time flag and `--goose-log` is configured). For example: - -``` -10:10:02 [INFO] coordinated omission alert 6.957s into goose attack: "GET http://apache/node/8848" [200] took abnormally long (2932 ms), task name: "(Anon) node page" -10:10:02 [INFO] coordinated omission alert 7.019s into goose attack: "GET http://apache/node/1960" [200] took abnormally long (2873 ms), task name: "(Anon) node page" -10:10:02 [INFO] coordinated omission alert 7.314s into goose attack: "GET http://apache/node/1297" [200] took abnormally long (2578 ms), task name: "(Anon) node page" -``` - -If the `--request-log` is enabled, you can get more details, in this case by looking for elapsed times matching the above messages, specifically 6957, 7019, and 7314 respectively: - -``` -{"coordinated_omission_cadence":1651,"coordinated_omission_elapsed":0,"elapsed":6957,"error":"","final_url":"http://apache/node/8848","method":"Get","name":"(Anon) node page","redirected":false,"response_time":2932,"status_code":200,"success":true,"update":false,"url":"http://apache/node/8848","user":2} -{"coordinated_omission_cadence":1439,"coordinated_omission_elapsed":0,"elapsed":7019,"error":"","final_url":"http://apache/node/1960","method":"Get","name":"(Anon) node page","redirected":false,"response_time":2873,"status_code":200,"success":true,"update":false,"url":"http://apache/node/1960","user":0} -{"coordinated_omission_cadence":1812,"coordinated_omission_elapsed":0,"elapsed":7314,"error":"","final_url":"http://apache/node/1297","method":"Get","name":"(Anon) node page","redirected":false,"response_time":2578,"status_code":200,"success":true,"update":false,"url":"http://apache/node/1297","user":3} -``` - -In the requests file, you can see that three different user threads triggered Coordinated Omission Mitigation, specifically threads 2, 0, and 3. All `GooseUser` threads were loading the same `GooseTask` as due to task weighting this is the task loaded the most frequently. Each `GooseUser` thread loops through all `GooseTasks` in a similar amount of time: thread 2 takes on average 1.651 seconds, thread 0 takes on average 1.439 seconds, and thread 3 takes on average 1.812 seconds. - -Also if the `--request-log` is enabled, requests back-filled by Coordinated Omission Mitigation show up in the generated log file, even though they were not actually sent to the server. In the following example, Coordinated Omission Mitigation was triggered when the server took 11,965 milliseconds to loop through all requests, instead of the average cadence of 3,162 milliseconds. This causes it to backfill a block of requests that statistically should have happened, with a `response_time` decreasing by the expected request cadence. - -```json -{"coordinated_omission_cadence":3161,"coordinated_omission_elapsed":11965,"elapsed":185835,"error":"","final_url":"http://example.com/misc/jquery.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":11965,"status_code":200,"success":true,"update":false,"url":"http://example.com/misc/jquery.js?v=1.4.4","user":2} -{"coordinated_omission_cadence":3161,"coordinated_omission_elapsed":11965,"elapsed":185835,"error":"","final_url":"http://example.com/misc/jquery.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":8804,"status_code":200,"success":true,"update":false,"url":"http://example.com/misc/jquery.js?v=1.4.4","user":2} -{"coordinated_omission_cadence":3161,"coordinated_omission_elapsed":11965,"elapsed":185835,"error":"","final_url":"http://example.com/misc/jquery.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":5643,"status_code":200,"success":true,"update":false,"url":"http://example.com/misc/jquery.js?v=1.4.4","user":2} -{"coordinated_omission_cadence":3161,"coordinated_omission_elapsed":11965,"elapsed":185835,"error":"","final_url":"http://example.com/misc/jquery.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":2482,"status_code":200,"success":true,"update":false,"url":"http://example.com/misc/jquery.js?v=1.4.4","user":2} -``` - -Normal requests not generated by Coordinated Omission Mitigation have a `coordinated_omission_elapsed` of 0. - -Coordinated Omission Mitigation can be disabled by setting `--co-mitigation disabled` when starting Goose. By default it uses the average cadence when backfilling, but it can also be configured to use the `minimum` or `maximum` cadence to allow for different server configuration and testing plans operating on different assumptions. - -### Metrics - -When Coordinated Omission Mitigation kicks in, Goose tracks both the "raw" metrics and the "adjusted" metrics. It shows both together when displaying metrics, first the "raw" (actually seen) metrics, followed by the "adjusted" metrics. As the minimum response time is never changed by Coordinated Omission Mitigation, this column is replacd with the "standard deviation" between the average "raw" response time, and the average "adjusted" response time. - -The following example was "contrived". The `drupal_loadtest` example was run for 15 seconds, and after 10 seconds the upstream Apache server was manually "paused" for 3 seconds, forcing some abnormally slow queries. (More specifically, the apache web server was started by running `. /etc/apache2/envvars && /usr/sbin/apache2 -DFOREGROUND`, it was "paused" by pressing `ctrl-z`, and it was resumed three seconds later by typing `fg`.) In the "PER REQUEST METRICS" Goose shows first the "raw" metrics", followed by the "adjusted" metrics: - -``` - ------------------------------------------------------------------------------ - Name | Avg (ms) | Min | Max | Median - ------------------------------------------------------------------------------ - GET (Anon) front page | 11.73 | 3 | 81 | 12 - GET (Anon) node page | 81.76 | 5 | 3,390 | 37 - GET (Anon) user page | 27.53 | 16 | 94 | 26 - GET (Auth) comment form | 35.27 | 24 | 50 | 35 - GET (Auth) front page | 30.68 | 20 | 111 | 26 - GET (Auth) node page | 97.79 | 23 | 3,326 | 35 - GET (Auth) user page | 25.20 | 21 | 30 | 25 - GET static asset | 9.27 | 2 | 98 | 6 - POST (Auth) comment form | 52.47 | 43 | 59 | 52 - -------------------------+-------------+------------+-------------+----------- - Aggregated | 17.04 | 2 | 3,390 | 8 - ------------------------------------------------------------------------------ - Adjusted for Coordinated Omission: - ------------------------------------------------------------------------------ - Name | Avg (ms) | Std Dev | Max | Median - ------------------------------------------------------------------------------ - GET (Anon) front page | 419.82 | 288.56 | 3,153 | 14 - GET (Anon) node page | 464.72 | 270.80 | 3,390 | 40 - GET (Anon) user page | 420.48 | 277.86 | 3,133 | 27 - GET (Auth) comment form | 503.38 | 331.01 | 2,951 | 37 - GET (Auth) front page | 489.99 | 324.78 | 2,960 | 33 - GET (Auth) node page | 530.29 | 305.82 | 3,326 | 37 - GET (Auth) user page | 500.67 | 336.21 | 2,959 | 27 - GET static asset | 427.70 | 295.87 | 3,154 | 9 - POST (Auth) comment form | 512.14 | 325.04 | 2,932 | 55 - -------------------------+-------------+------------+-------------+----------- - Aggregated | 432.98 | 294.11 | 3,390 | 14 - ``` - -From these two tables, it is clear that there was a statistically significant event affecting the load testing metrics. In particular, note that the standard deviation between the "raw" average and the "adjusted" average is considerably larger than the "raw" average, calling into questing whether or not your load test was "valid". (The answer to that question depends very much on your specific goals and load test.) - -Goose also shows multiple percentile graphs, again showing first the "raw" metrics followed by the "adjusted" metrics. The "raw" graph would suggest that less than 1% of the requests for the `GET (Anon) node page` were slow, and less than 0.1% of the requests for the `GET (Auth) node page` were slow. However, through Coordinated Omission Mitigation we can see that statistically this would have actually affected all requests, and for authenticated users the impact is visible on >25% of the requests. - -``` - ------------------------------------------------------------------------------ - Slowest page load within specified percentile of requests (in ms): - ------------------------------------------------------------------------------ - Name | 50% | 75% | 98% | 99% | 99.9% | 99.99% - ------------------------------------------------------------------------------ - GET (Anon) front page | 12 | 15 | 25 | 27 | 81 | 81 - GET (Anon) node page | 37 | 43 | 60 | 3,000 | 3,000 | 3,000 - GET (Anon) user page | 26 | 28 | 34 | 93 | 94 | 94 - GET (Auth) comment form | 35 | 37 | 50 | 50 | 50 | 50 - GET (Auth) front page | 26 | 34 | 45 | 88 | 110 | 110 - GET (Auth) node page | 35 | 38 | 58 | 58 | 3,000 | 3,000 - GET (Auth) user page | 25 | 27 | 30 | 30 | 30 | 30 - GET static asset | 6 | 14 | 21 | 22 | 81 | 98 - POST (Auth) comment form | 52 | 55 | 59 | 59 | 59 | 59 - -------------------------+--------+--------+--------+--------+--------+------- - Aggregated | 8 | 16 | 47 | 53 | 3,000 | 3,000 - ------------------------------------------------------------------------------ - Adjusted for Coordinated Omission: - ------------------------------------------------------------------------------ - Name | 50% | 75% | 98% | 99% | 99.9% | 99.99% - ------------------------------------------------------------------------------ - GET (Anon) front page | 14 | 21 | 3,000 | 3,000 | 3,000 | 3,000 - GET (Anon) node page | 40 | 55 | 3,000 | 3,000 | 3,000 | 3,000 - GET (Anon) user page | 27 | 32 | 3,000 | 3,000 | 3,000 | 3,000 - GET (Auth) comment form | 37 | 400 | 2,951 | 2,951 | 2,951 | 2,951 - GET (Auth) front page | 33 | 410 | 2,960 | 2,960 | 2,960 | 2,960 - GET (Auth) node page | 37 | 410 | 3,000 | 3,000 | 3,000 | 3,000 - GET (Auth) user page | 27 | 420 | 2,959 | 2,959 | 2,959 | 2,959 - GET static asset | 9 | 20 | 3,000 | 3,000 | 3,000 | 3,000 - POST (Auth) comment form | 55 | 390 | 2,932 | 2,932 | 2,932 | 2,932 - -------------------------+--------+--------+--------+--------+--------+------- - Aggregated | 14 | 42 | 3,000 | 3,000 | 3,000 | 3,000 - ``` - - The Coordinated Omission metrics will also show up in the HTML report generated when Goose is started with the `--report-file` run-time option. If Coordinated Omission mitigation kicked in, the HTML report will include both the "raw" metrics and the "adjusted" metrics. - ## Controlling Running Goose Load Test By default, Goose will launch a telnet Controller thread that listens on `0.0.0.0:5116`, and a WebSocket Controller thread that listens on `0.0.0.0:5117`. The running Goose load test can be controlled through these Controllers. Goose can optionally be started with the `--no-autostart` run time option to prevent the load test from automatically starting, requiring instead that it be started with a Controller command. When Goose is started this way, a host is not required and can instead be configured via the Controller. @@ -770,10 +645,9 @@ When operating in Gaggle-mode, the `--request-log` option can only be enabled on By default, logs are written in JSON Lines format. For example: ```json -{"coordinated_omission_cadence":3361,"coordinated_omission_elapsed":0,"elapsed":24172,"error":"","final_url":"http://local.dev/misc/feed.png","method":"Get","name":"static asset","redirected":false,"response_time":4,"status_code":200,"success":true,"update":false,"url":"http://local.dev/misc/feed.png","user":7} -{"coordinated_omission_cadence":2183,"coordinated_omission_elapsed":0,"elapsed":24149,"error":"","final_url":"http://local.dev/user/4816","method":"Get","name":"(Anon) user page","redirected":false,"response_time":28,"status_code":200,"success":true,"update":false,"url":"http://local.dev/user/4816","user":2} -{"coordinated_omission_cadence":2738,"coordinated_omission_elapsed":0,"elapsed":24168,"error":"","final_url":"http://local.dev/themes/bartik/logo.png","method":"Get","name":"static asset","redirected":false,"response_time":14,"status_code":200,"success":true,"update":false,"url":"http://local.dev/themes/bartik/logo.png","user":1} -{"coordinated_omission_cadence":2514,"coordinated_omission_elapsed":0,"elapsed":24171,"error":"","final_url":"http://local.dev/themes/bartik/logo.png","method":"Get","name":"static asset","redirected":false,"response_time":11,"status_code":200,"success":true,"update":false,"url":"http://local.dev/themes/bartik/logo.png","user":4} +{"coordinated_omission_elapsed":0,"elapsed":23189,"error":"","final_url":"http://apache/misc/drupal.js?q9apdy","method":"Get","name":"static asset","redirected":false,"response_time":8,"status_code":200,"success":true,"update":false,"url":"http://apache/misc/drupal.js?q9apdy","user":5,"user_cadence":0} +{"coordinated_omission_elapsed":0,"elapsed":23192,"error":"","final_url":"http://apache/misc/jquery.once.js?v=1.2","method":"Get","name":"static asset","redirected":false,"response_time":6,"status_code":200,"success":true,"update":false,"url":"http://apache/misc/jquery.once.js?v=1.2","user":6,"user_cadence":0} +{"coordinated_omission_elapsed":0,"elapsed":23181,"error":"","final_url":"http://apache/misc/jquery-extend-3.4.0.js?v=1.4.4","method":"Get","name":"static asset","redirected":false,"response_time":16,"status_code":200,"success":true,"update":false,"url":"http://apache/misc/jquery-extend-3.4.0.js?v=1.4.4","user":1,"user_cadence":0} ``` Logs include the entire [`GooseRequestMetric`] object as defined in `src/goose.rs`, which are created on all requests. @@ -784,12 +658,11 @@ By default Goose logs requests in JSON Lines format. The `--request-format` opti For example, `csv` output of similar requests as those logged above would like like: ```csv -elapsed,method,name,url,final_url,redirected,response_time,status_code,success,update,user,error,coordinated_omission_elapsed,coordinated_omission_cadence -22116,GET,"(Auth) node page","http://apache/node/3891","http://apache/node/3891",false,45,200,true,false,6,,0,3106 -22158,GET,"static asset","http://apache/misc/feed.png","http://apache/misc/feed.png",false,3,200,true,false,1,,0,2477 -22146,GET,"static asset","http://apache/misc/drupal.js?q9apdy","http://apache/misc/drupal.js?q9apdy",false,15,200,true,false,0,,0,1751 -22160,GET,"static asset","http://apache/misc/jquery.js?v=1.4.4","http://apache/misc/jquery.js?v=1.4.4",false,5,200,true,false,5,,0,2293 -22141,GET,"(Anon) node page","http://apache/node/9581","http://apache/node/9581",false,28,200,true,false,3,,0,2072 +elapsed,method,name,url,final_url,redirected,response_time,status_code,success,update,user,error,coordinated_omission_elapsed,user_cadence +22143,GET,"(Anon) user page","http://apache/user/4","http://apache/user/4",false,25,200,true,false,3,,0,0 +22153,GET,"static asset","http://apache/misc/jquery-extend-3.4.0.js?v=1.4.4","http://apache/misc/jquery-extend-3.4.0.js?v=1.4.4",false,16,200,true,false,6,,0,0 +22165,GET,"static asset","http://apache/misc/jquery.js?v=1.4.4","http://apache/misc/jquery.js?v=1.4.4",false,3,200,true,false,0,,0,0 +22165,GET,"static asset","http://apache/misc/feed.png","http://apache/misc/feed.png",false,4,200,true,false,1,,0,0 ``` ## Logging Load Test Tasks @@ -844,6 +717,122 @@ If `--debug-log=foo` is not specified at run time, nothing will be logged and th By default Goose writes debug logs in JSON Lines format. The `--debug-format` option can be used to log in `json` or `raw` format. The `raw` format is Rust's debug output of the `GooseDebug` object. +## Coordinated Omission Mitigation + +THIS IS AN EXPERIMENTAL FEATURE THAT IS DISABLED BY DEFAULT. The following documentation is a work in progress, and may currently be misleading. + +When enabled, Goose attempts to mitigate the loss of metrics data (Coordinated Omission) caused by an abnormally lengthy response to a request. + +### Definition + +To understand Coordinated Omission and how Goose attempts to mitigate it, it's necessary to understand how Goose is scheduling requests. Goose launches one thread per `GooseUser`. Each `GooseUser` is assigned a single `GooseTaskSet`. Each of these `GooseUser` threads then loop repeatedly through all of the `GooseTasks` defined in the assigned `GooseTaskSet`, each of which can involve any number of individual requests. However, at any given time, each `GooseUser` is only making a single request and then asynchronously waiting for the response. + +If something causes the response to a request to take abnormally long, raw Goose metrics only see this slowdown as affecting a specific request (or one request per `GooseUser`). The slowdown can be caused by a variety of issues, such as a resource bottleneck (on the Goose client or the web server), garbage collection, a cache stampede, or even a network issue. A real user loading the same web page would see a much larger effect, as all requests to the affected server would stall. Even static assets such as images and scripts hosted on a reliable and fast CDN can be affected, as the web browser won't know to load them until it first loads the HTML from the affected web server. Because Goose is only making one request at a time per `GooseUser`, it may only see one or very few slow requests and then all other requests resume at normal speed. This results in a bias in the metrics to "ignore" or "hide" the true effect of a slowdown, commonly referred to as Coordinated Omission. + +### Mitigation + +Goose attempts to mitigate Coordinated Omission by back-filling the metrics with the statistically expected requests. To do this, it tracks the normal "cadence" of each `GooseUser`, timing how long it takes to loop through all `GooseTasks` in the assigned `GooseTaskSet`. By default, Goose will trigger Coordinated Omission Mitigation if the time to loop through a `GooseTaskSet` takes more than twice as long as the average time of all previous loops. In this case, on the next loop through the `GooseTaskSet` when tracking the actual metrics for each subsequent request in all `GooseTasks` it will also add in statistically generated "requests" with a `response_time` starting at the unexpectedly long request time, then again with that `response_time` minus the normal "cadence", continuing to generate a metric then subtract the normal "cadence" until arriving at the expected `response_time`. In this way, Goose is able to estimate the actual effect of a slowdown. + +When Goose detects an abnormally slow request (one in which the individual request takes longer than the normal `user_cadence`), it will generate an INFO level message (which will be visible if Goose was started with the `-v` run time flag, or written to the log if started with the `-g` run time flag and `--goose-log` is configured). For example: + +``` +13:10:30 [INFO] 11.401s into goose attack: "GET http://apache/node/1557" [200] took abnormally long (1814 ms), task name: "(Anon) node page" +13:10:30 [INFO] 11.450s into goose attack: "GET http://apache/node/5016" [200] took abnormally long (1769 ms), task name: "(Anon) node page" +``` + +If the `--request-log` is enabled, you can get more details, in this case by looking for elapsed times matching the above messages, specifically 1814 and 1769 respectively: + +``` +{"coordinated_omission_elapsed":0,"elapsed":11401,"error":"","final_url":"http://apache/node/1557","method":"Get","name":"(Anon) node page","redirected":false,"response_time":1814,"status_code":200,"success":true,"update":false,"url":"http://apache/node/1557","user":2,"user_cadence":1727} +{"coordinated_omission_elapsed":0,"elapsed":11450,"error":"","final_url":"http://apache/node/5016","method":"Get","name":"(Anon) node page","redirected":false,"response_time":1769,"status_code":200,"success":true,"update":false,"url":"http://apache/node/5016","user":0,"user_cadence":1422} +``` + +In the requests file, you can see that two different user threads triggered Coordinated Omission Mitigation, specifically threads 2 and 0. Both `GooseUser` threads were loading the same `GooseTask` as due to task weighting this is the task loaded the most frequently. Both `GooseUser` threads loop through all `GooseTasks` in a similar amount of time: thread 2 takes on average 1.727 seconds, thread 0 takes on average 1.422 seconds. + +Also if the `--request-log` is enabled, requests back-filled by Coordinated Omission Mitigation show up in the generated log file, even though they were not actually sent to the server. Normal requests not generated by Coordinated Omission Mitigation have a `coordinated_omission_elapsed` of 0. + +Coordinated Omission Mitigation is disabled by default. This experimental feature can be enabled by enabling the `--co-mitigation` run time option when starting Goose. It can be configured to use the `average`, `minimum`, or `maximum` `GoouseUser` cadence when backfilling statistics. + +### Metrics + +When Coordinated Omission Mitigation kicks in, Goose tracks both the "raw" metrics and the "adjusted" metrics. It shows both together when displaying metrics, first the "raw" (actually seen) metrics, followed by the "adjusted" metrics. As the minimum response time is never changed by Coordinated Omission Mitigation, this column is replacd with the "standard deviation" between the average "raw" response time, and the average "adjusted" response time. + +The following example was "contrived". The `drupal_loadtest` example was run for 15 seconds, and after 10 seconds the upstream Apache server was manually "paused" for 3 seconds, forcing some abnormally slow queries. (More specifically, the apache web server was started by running `. /etc/apache2/envvars && /usr/sbin/apache2 -DFOREGROUND`, it was "paused" by pressing `ctrl-z`, and it was resumed three seconds later by typing `fg`.) In the "PER REQUEST METRICS" Goose shows first the "raw" metrics", followed by the "adjusted" metrics: + +``` + ------------------------------------------------------------------------------ + Name | Avg (ms) | Min | Max | Median + ------------------------------------------------------------------------------ + GET (Anon) front page | 11.73 | 3 | 81 | 12 + GET (Anon) node page | 81.76 | 5 | 3,390 | 37 + GET (Anon) user page | 27.53 | 16 | 94 | 26 + GET (Auth) comment form | 35.27 | 24 | 50 | 35 + GET (Auth) front page | 30.68 | 20 | 111 | 26 + GET (Auth) node page | 97.79 | 23 | 3,326 | 35 + GET (Auth) user page | 25.20 | 21 | 30 | 25 + GET static asset | 9.27 | 2 | 98 | 6 + POST (Auth) comment form | 52.47 | 43 | 59 | 52 + -------------------------+-------------+------------+-------------+----------- + Aggregated | 17.04 | 2 | 3,390 | 8 + ------------------------------------------------------------------------------ + Adjusted for Coordinated Omission: + ------------------------------------------------------------------------------ + Name | Avg (ms) | Std Dev | Max | Median + ------------------------------------------------------------------------------ + GET (Anon) front page | 419.82 | 288.56 | 3,153 | 14 + GET (Anon) node page | 464.72 | 270.80 | 3,390 | 40 + GET (Anon) user page | 420.48 | 277.86 | 3,133 | 27 + GET (Auth) comment form | 503.38 | 331.01 | 2,951 | 37 + GET (Auth) front page | 489.99 | 324.78 | 2,960 | 33 + GET (Auth) node page | 530.29 | 305.82 | 3,326 | 37 + GET (Auth) user page | 500.67 | 336.21 | 2,959 | 27 + GET static asset | 427.70 | 295.87 | 3,154 | 9 + POST (Auth) comment form | 512.14 | 325.04 | 2,932 | 55 + -------------------------+-------------+------------+-------------+----------- + Aggregated | 432.98 | 294.11 | 3,390 | 14 + ``` + +From these two tables, it is clear that there was a statistically significant event affecting the load testing metrics. In particular, note that the standard deviation between the "raw" average and the "adjusted" average is considerably larger than the "raw" average, calling into questing whether or not your load test was "valid". (The answer to that question depends very much on your specific goals and load test.) + +Goose also shows multiple percentile graphs, again showing first the "raw" metrics followed by the "adjusted" metrics. The "raw" graph would suggest that less than 1% of the requests for the `GET (Anon) node page` were slow, and less than 0.1% of the requests for the `GET (Auth) node page` were slow. However, through Coordinated Omission Mitigation we can see that statistically this would have actually affected all requests, and for authenticated users the impact is visible on >25% of the requests. + +``` + ------------------------------------------------------------------------------ + Slowest page load within specified percentile of requests (in ms): + ------------------------------------------------------------------------------ + Name | 50% | 75% | 98% | 99% | 99.9% | 99.99% + ------------------------------------------------------------------------------ + GET (Anon) front page | 12 | 15 | 25 | 27 | 81 | 81 + GET (Anon) node page | 37 | 43 | 60 | 3,000 | 3,000 | 3,000 + GET (Anon) user page | 26 | 28 | 34 | 93 | 94 | 94 + GET (Auth) comment form | 35 | 37 | 50 | 50 | 50 | 50 + GET (Auth) front page | 26 | 34 | 45 | 88 | 110 | 110 + GET (Auth) node page | 35 | 38 | 58 | 58 | 3,000 | 3,000 + GET (Auth) user page | 25 | 27 | 30 | 30 | 30 | 30 + GET static asset | 6 | 14 | 21 | 22 | 81 | 98 + POST (Auth) comment form | 52 | 55 | 59 | 59 | 59 | 59 + -------------------------+--------+--------+--------+--------+--------+------- + Aggregated | 8 | 16 | 47 | 53 | 3,000 | 3,000 + ------------------------------------------------------------------------------ + Adjusted for Coordinated Omission: + ------------------------------------------------------------------------------ + Name | 50% | 75% | 98% | 99% | 99.9% | 99.99% + ------------------------------------------------------------------------------ + GET (Anon) front page | 14 | 21 | 3,000 | 3,000 | 3,000 | 3,000 + GET (Anon) node page | 40 | 55 | 3,000 | 3,000 | 3,000 | 3,000 + GET (Anon) user page | 27 | 32 | 3,000 | 3,000 | 3,000 | 3,000 + GET (Auth) comment form | 37 | 400 | 2,951 | 2,951 | 2,951 | 2,951 + GET (Auth) front page | 33 | 410 | 2,960 | 2,960 | 2,960 | 2,960 + GET (Auth) node page | 37 | 410 | 3,000 | 3,000 | 3,000 | 3,000 + GET (Auth) user page | 27 | 420 | 2,959 | 2,959 | 2,959 | 2,959 + GET static asset | 9 | 20 | 3,000 | 3,000 | 3,000 | 3,000 + POST (Auth) comment form | 55 | 390 | 2,932 | 2,932 | 2,932 | 2,932 + -------------------------+--------+--------+--------+--------+--------+------- + Aggregated | 14 | 42 | 3,000 | 3,000 | 3,000 | 3,000 + ``` + + The Coordinated Omission metrics will also show up in the HTML report generated when Goose is started with the `--report-file` run-time option. If Coordinated Omission mitigation kicked in, the HTML report will include both the "raw" metrics and the "adjusted" metrics. + ## Gaggle: Distributed Load Test Goose also supports distributed load testing. A Gaggle is one Goose process running in Manager mode, and 1 or more Goose processes running in Worker mode. The Manager coordinates starting and stopping the Workers, and collects aggregated metrics. Gaggle support is a cargo feature that must be enabled at compile-time as documented below. To launch a Gaggle, you must copy your load test application to all servers from which you wish to generate load. diff --git a/src/goose.rs b/src/goose.rs index 6fde52ea..8c328d18 100644 --- a/src/goose.rs +++ b/src/goose.rs @@ -786,7 +786,7 @@ struct GooseRequestCadence { /// Omission Mitigation in milliseconds. coordinated_omission_mitigation: u64, /// The expected cadence to loop through all GooseTasks. - coordinated_omission_cadence: u64, + user_cadence: u64, /// If -1 coordinated_omission_mitigation was never enabled. Otherwise is a counter of how /// many times the mitigation triggered. coordinated_omission_counter: isize, @@ -803,7 +803,7 @@ impl GooseRequestCadence { average_cadence: 0, total_elapsed: 0, coordinated_omission_mitigation: 0, - coordinated_omission_cadence: 0, + user_cadence: 0, coordinated_omission_counter: -1, } } @@ -1471,7 +1471,7 @@ impl GooseUser { // this GooseUser is running. If requests are blocked by the upstream server, this // allows Goose to backfill the requests that should have been made based on // cadence statistics. - request_metric.coordinated_omission_cadence = self + request_metric.user_cadence = self .coordinated_omission_mitigation(&request_metric) .await?; @@ -1557,7 +1557,7 @@ impl GooseUser { request_cadence.coordinated_omission_mitigation = 0; } // Always track the expected cadence. - request_cadence.coordinated_omission_cadence = cadence; + request_cadence.user_cadence = cadence; } } else { // Coordinated Omission Mitigation defaults to average. @@ -1587,7 +1587,7 @@ impl GooseUser { // cadence, as that means this specific request will likely trigger Coordinated // Omission Mitigation. if request_cadence.counter > 3 - && request_metric.response_time > request_cadence.coordinated_omission_cadence + && request_metric.response_time > request_cadence.user_cadence { let task_name = if !self.weighted_tasks.is_empty() { let position = self.position.load(Ordering::SeqCst); @@ -1600,7 +1600,7 @@ impl GooseUser { "".to_string() }; info!( - "coordinated omission alert {:.3}s into goose attack: \"{} {}\" [{}] took abnormally long ({} ms){}", + "{:.3}s into goose attack: \"{} {}\" [{}] took abnormally long ({} ms){}", request_metric.elapsed as f64 / 1_000.0, request_metric.method, request_metric.url, @@ -1619,12 +1619,11 @@ impl GooseUser { coordinated_omission_request_metric.coordinated_omission_elapsed = request_cadence.coordinated_omission_mitigation; // Record data points specific to coordinated_omission. - coordinated_omission_request_metric.coordinated_omission_cadence = - request_cadence.coordinated_omission_cadence; + coordinated_omission_request_metric.user_cadence = request_cadence.user_cadence; // Send the coordinated omission mitigation generated metrics to the parent. self.send_request_metric_to_parent(coordinated_omission_request_metric)?; } - Ok(request_cadence.coordinated_omission_cadence) + Ok(request_cadence.user_cadence) } else { // A setting for coordinated omission mitigation is required, defaults to Average. unreachable!(); diff --git a/src/lib.rs b/src/lib.rs index 5a78b9f3..2b87d640 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1833,7 +1833,7 @@ impl GooseAttack { fn set_coordinated_omission(&mut self) -> Result<(), GooseError> { // Track how value gets set so we can return a meaningful error if necessary. let mut key = "configuration.coordinated_omission"; - let mut value = Some(GooseCoordinatedOmissionMitigation::Average); + let mut value = Some(GooseCoordinatedOmissionMitigation::Disabled); if self.configuration.co_mitigation.is_some() { key = "--co-mitigation"; diff --git a/src/logger.rs b/src/logger.rs index 982facec..b7bb8baa 100644 --- a/src/logger.rs +++ b/src/logger.rs @@ -234,7 +234,7 @@ fn requests_csv_header() -> String { "user", "error", "coordinated_omission_elapsed", - "coordinated_omission_cadence", + "user_cadence", ) } @@ -364,7 +364,7 @@ impl GooseLogger for GooseConfiguration { request.user, request.error, request.coordinated_omission_elapsed, - request.coordinated_omission_cadence, + request.user_cadence, ) } } diff --git a/src/metrics.rs b/src/metrics.rs index b6f5698d..a9c8efde 100644 --- a/src/metrics.rs +++ b/src/metrics.rs @@ -281,7 +281,7 @@ pub struct GooseRequestMetric { /// the upstream server, blocking requests from being made. pub coordinated_omission_elapsed: u64, /// If non-zero, the expected cadence of looping through all GooseTasks by this GooseUser. - pub coordinated_omission_cadence: u64, + pub user_cadence: u64, } impl GooseRequestMetric { pub(crate) fn new( @@ -305,7 +305,7 @@ impl GooseRequestMetric { user, error: "".to_string(), coordinated_omission_elapsed: 0, - coordinated_omission_cadence: 0, + user_cadence: 0, } } @@ -2277,7 +2277,7 @@ impl GooseAttack { // generated "request" to mitigate coordinated omission, loop to backfill // with statistically generated metrics. if request_metric.coordinated_omission_elapsed > 0 - && request_metric.coordinated_omission_cadence > 0 + && request_metric.user_cadence > 0 { // Build a statistically generated coordinated_omissiom metric starting // with the metric that was sent by the affected GooseUser. @@ -2285,7 +2285,7 @@ impl GooseAttack { // Use a signed integer as this value can drop below zero. let mut response_time = request_metric.coordinated_omission_elapsed as i64 - - request_metric.coordinated_omission_cadence as i64 + - request_metric.user_cadence as i64 - request_metric.response_time as i64; loop { @@ -2293,7 +2293,7 @@ impl GooseAttack { if response_time > request_metric.response_time as i64 { co_metric.response_time = response_time as u64; self.record_request_metric(&co_metric).await; - response_time -= request_metric.coordinated_omission_cadence as i64; + response_time -= request_metric.user_cadence as i64; } else { break; }