Doc: Understanding the impact of CO-events #304

Open
wants to merge 2 commits into base: main

Conversation

bbros-dev
Contributor

Close issue 303.

What?
Correct the statistical inference statement.

Why?
A user applying the suggested logic, with a different sample,
would very likely draw an incorrect inference.
The statement makes sense only because we know a CO-event
took place. Generally the issue is more subtle.

How?
Provide a note indicating testing for changes in
sample distributions is out of scope for Goose.
However, we provide some guidance for the novice user
that is not an unreasonable starting point.

Signed-off-by: Begley Brothers Inc <begleybrothers@gmail.com>

@jeremyandrews
Member

Can you re-roll without the unrelated whitespace cleanups? They make it difficult to review the actual intended changes. (Ideally, submit the whitespace cleanup as its own PR.)

Member

@jeremyandrews left a comment


I don't agree with the changes to the documentation, per my comments above. Happy to discuss more.

README.md: three review threads (outdated, resolved)

@@ -599,9 +599,22 @@ The following example was "contrived". The `drupal_loadtest` example was run for
Aggregated | 432.98 | 294.11 | 3,390 | 14
```

From these two tables, it is clear that there was a statistically significant event affecting the load testing metrics. In particular, note that the standard deviation between the "raw" average and the "adjusted" average is considerably larger than the "raw" average, calling into question whether or not your load test was "valid". (The answer to that question depends very much on your specific goals and load test.)
Note: It is beyond the scope of Goose to test for statistically significant changes in the right-tail, or other locations, of the distribution of response times. Goose produces the raw data you need to conduct these tests.
Member

Goose is able to do this to some degree. Goose tracks how long it takes each GooseUser to run through all their GooseTasks, and after each pass updates the average time. If a given pass takes more than 2x the average, it issues a warning that there may have been a Coordinated Omission event.

While it did originally track a cadence for every single request, this was far too much overhead at scale, hence the simplified version that landed.
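
For readers following along, a minimal sketch of that detection idea (illustrative only, not the actual Goose source; the warm-up of a few passes and the 2x threshold mirror values discussed in this thread):

```rust
use std::time::{Duration, Instant};

/// Illustrative cadence tracker: record how long each full pass through a
/// GooseUser's tasks takes, keep a running average, and warn when a pass
/// exceeds twice that average (a possible Coordinated Omission event).
struct CadenceTracker {
    passes: u32,
    average: Duration,
    last_pass_start: Instant,
}

impl CadenceTracker {
    fn new() -> Self {
        Self {
            passes: 0,
            average: Duration::ZERO,
            last_pass_start: Instant::now(),
        }
    }

    /// Call once at the end of each full pass through the task list.
    fn record_pass(&mut self) {
        let elapsed = self.last_pass_start.elapsed();
        self.last_pass_start = Instant::now();

        // Only warn once a baseline has been established (a few warm-up passes).
        if self.passes > 3 && elapsed > self.average * 2 {
            eprintln!(
                "possible Coordinated Omission event: pass took {:?}, average is {:?}",
                elapsed, self.average
            );
        }

        // Update the running average pass time.
        self.passes += 1;
        self.average = (self.average * (self.passes - 1) + elapsed) / self.passes;
    }
}

fn main() {
    let mut tracker = CadenceTracker::new();
    // Simulate four normal passes and one stalled pass by faking the start time.
    for ms in [100u64, 110, 105, 102, 400] {
        tracker.last_pass_start = Instant::now() - Duration::from_millis(ms);
        tracker.record_pass();
    }
}
```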

Contributor Author

Goose is able to do this to some degree.

Hmm, it's a nest of thorns, especially when you move away from well-known/well-trodden paths such as testing for equality between distributions. I don't believe I've ever seen anyone address what the distribution of the maxima is for their (sampled) distribution, and then establish whether the observed maximum is within the expected range or not.

Gil Tene seemed to have great trouble just persuading people that the one metric you care about is the maximum.

Best to steer clear of making these sorts of claims - in our opinion.

While it did originally track a cadence for every single request, this was far too much overhead at scale, hence the simplified version that landed.

What is in place is, in our opinion, a defensible workaround until the proper implementation is arrived at.

Contributor Author
@bbros-dev Jul 8, 2021

If a given pass takes more than 2x the average

If you leave that in place, I suggest offering a reason why those two random numbers, i.e. 2 and the average value, were chosen.

After rejecting our conjecture that the minimum is the correct value to use for the default 'cadence', we took the time to give an explanation for using the minimum as the default cadence in the absence of a user providing a particular value. That seems to have been ignored, which is unfortunate, but causes us no harm.

However, we must note that it is not obvious to us that the confidence intervals around the order statistics are symmetric - but we are guessing that is what you're trying to do. So we're struggling to work out what statistical theory you are trying to apply and why.
Unless a statistician emerges as a maintainer, we'd suggest that moving Goose away from these topics is the prudent course of action.

Member

The values Goose is using were chosen through empirical experimentation and observations.

While in a fully controlled environment it would be ideal to strive for the minimum response time, it has never been my experience that this is the expectation for websites on the internet. It is my expectation that most people running load tests are doing so to observe and measure performance (frequently with the aim of further optimization).

Currently each GooseUser thread simply watches performance for a few cycles, tracks the average cadence, and uses this as its baseline. Yes, there are outliers that return quicker, and outliers that return slower, but on average we tend to observe reasonable consistency. So my intent is that through empirical observation Goose determines the "normal" cadence for each GooseUser, and then alerts when some event causes things to take longer. Goose is unable to determine what caused the event, but aids the tester by making them aware of it. This is a hint to the tester to review all available tools to better understand what happened: was there a bottleneck, a cache stampede, resource starvation, etc.? And does the event impact the goals of the load test (entirely something for the tester to decide)?

What is your use-case that you assume the minimum response time is the expected cadence?

As for the magic 2, this was borrowed from the implementation in HdrHistogram:
https://github.com/HdrHistogram/HdrHistogram_rust/blob/9c09314ac91848fd696b699892414cb337d9abce/src/lib.rs#L916

Contributor Author

While in a fully controlled environment it would be ideal to strive for the minimum response time, it has never been my experience that this is the expectation for websites on the internet.

This has nothing to do with response times or expectations about the behavior of the system under test. I cannot stress that enough - you're focused on the wrong end of the problem.

Perhaps I misunderstand:
Ack we are dealing with a workaround and some of this becomes moot when the CO guard lands.
Nonetheless, CO-mitigation reflects nothing about your expectations of response times. It is simply there to ensure that when a CO event occurs you see it reflected in the empirical distribution. If you plot the empirical cumulative distribution function (CDF), you should ideally see the CDF start to climb with the CO mitigation in place. With the minimum it will start to climb very soon after the CO event occurs; using a larger time interval will hide the evidence of the CO event. Experiments will show this. You'll also notice that, as you run the different experiments, your "expectation for websites on the internet" is unchanged. All you're observing is a signal, or a signal being hidden.
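
For concreteness, a minimal sketch of building the empirical CDF from recorded response times (illustrative only; the sample values are made up, and in practice the input would come from Goose's recorded request metrics):

```rust
/// Illustrative only: build the empirical CDF from a set of recorded
/// response times so the right tail (and the effect of CO mitigation on it)
/// can be plotted or inspected.
fn empirical_cdf(mut samples: Vec<f64>) -> Vec<(f64, f64)> {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = samples.len() as f64;
    samples
        .iter()
        .enumerate()
        .map(|(i, &x)| (x, (i as f64 + 1.0) / n))
        .collect()
}

fn main() {
    // Hypothetical CO-adjusted response times in milliseconds; in practice
    // these would come from Goose's recorded request metrics.
    let samples = vec![12.0, 13.0, 12.5, 14.0, 625.0, 1250.0, 1875.0, 2500.0];
    for (x, p) in empirical_cdf(samples) {
        println!("{x:>8.1} ms  ->  F(x) = {p:.3}");
    }
}
```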

So my intent is that through empirical observation Goose determines the "normal" cadence for each GooseUser, and then alerts when some event causes things to take longer

I give up. But please don't impose your preferences for this. If you insist it must be the default, can it easily be turned off?

What is your use-case that you assume the minimum response time is the expected cadence?

In the workaround at hand we are talking about a device to alert us to CO events that we know with 100% certainty would, without this workaround, render the upper order statistics near meaningless. It has nothing to do with my expectations about anything; it's a workaround.

When the CO fix lands I will be able to configure Goose to make requests with a time interval between them that I expect to reflect the behavior of my users. When those "users" are backend microservices I can make them behave in a very specific manner. But again, the interval I pass to the CO-fixed Goose will reflect nothing about my expectations about response time - it should reflect my expectations about request arrival times. Request arrival times have nothing to do with response times (in a CO-fixed system); without the fix, CO response times do impact request times. That's the defect being worked around and then fixed.

No?

As an aside:

on average we tend to observe reasonable consistency

Try not to care about these statistical issues. No offence intended but I'm now reasonably confident they are not your strength, just as Rust is not my strength.
Allow for the possibility of infinitely more diversity in the world than what you have experienced, and likewise with respect to the views and expectations you expect other people to hold.

The role of Goose is to make requests and record responses in a way that isn't misleading. Nothing more.
The CO issue unfortunately impacts the distribution you observe, but it really has nothing to do with statistics.

To recognize this, notice that Gil Tene's neat definition of CO is this:

what each tester thread is actually doing as a result of waiting for a long response before sending the next one is the equivalent of asking the system "is it ok for me to test your latency right now? Because if it's not, I'm fine waiting and testing it only when you are ready..."

The fix for that has nothing to do with statistics. Likewise, making a request rather than waiting for the system being tested to say "OK I'm ready now, you can test me again" reveals nothing about my expectations about anything.

There are a set of use cases where people try to mimic human behavior. While important and fascinating, that's a small fraction of network traffic.

Member

This has nothing to do with response times or expectations about the behavior of the system under test. I cannot stress that enough - you're focused on the wrong end of the problem.

Ok, so you're saying the goal is more that we enforce the request cadence?

Ack we are dealing with a workaround and some of this becomes moot when the CO guard lands.

Most articles and papers I've found seem to point to HdrHistogram as the best workaround, and that is what Goose has copied.

Nonetheless, CO-mitigation reflects nothing about your expectations of response times. It is simply there to ensure that when a CO event occurs you see it reflected in the empirical distribution. If you plot the empirical cumulative distribution function (CDF), you should ideally see the CDF start to climb with the CO mitigation in place. With the minimum it will start to climb very soon after the CO event occurs; using a larger time interval will hide the evidence of the CO event. Experiments will show this. You'll also notice that, as you run the different experiments, your "expectation for websites on the internet" is unchanged. All you're observing is a signal, or a signal being hidden.

My concern with this is that the mitigation (taken from HdrHistogram) works by back-filling metrics for requests that never actually happened. It seems undesirable to me for a significant portion of our metrics to be "estimated". (Fortunately we store both: the raw and the back-filled.)
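
A rough sketch of that back-fill idea, in the spirit of HdrHistogram's corrected recording (illustrative only, not the actual HdrHistogram or Goose code; the values are made up):

```rust
/// Illustrative back-fill: when a response took much longer than the expected
/// interval between requests, also record synthetic samples for the requests
/// that were never sent while we waited.
fn record_corrected(samples: &mut Vec<u64>, value_ms: u64, expected_interval_ms: u64) {
    samples.push(value_ms); // the real, observed response time
    if expected_interval_ms == 0 {
        return;
    }
    let mut missing = value_ms.saturating_sub(expected_interval_ms);
    while missing >= expected_interval_ms {
        samples.push(missing); // a back-filled sample that never actually happened
        missing -= expected_interval_ms;
    }
}

fn main() {
    let mut samples = Vec::new();
    // Expected cadence of 100 ms; one request stalled for 450 ms.
    record_corrected(&mut samples, 450, 100);
    assert_eq!(samples, vec![450, 350, 250, 150]);
    println!("{samples:?}");
}
```

This is exactly why the back-filled samples can come to outnumber the real ones: a single long stall generates several synthetic records.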

I give up. But please don't impose your preferences for this. If you insist it must be the default, can it easily be turned off?

That's not very constructive to solving this, but okay.

Please see the Coordinated Omission Mitigation section in the README, specifically the --co-mitigation flag: this is how it's enabled/configured.

When the CO fix lands I will be able to configure Goose to make requests with a time interval between them that I expect to reflect the behavior of my users.

On the roadmap (but apparently not documented anywhere, oddly) is enhancing set_wait_time() (likely with a differently named function) so it can instead be configured to impose a maximum time for each task. This would make it so you can enforce that each task runs within a certain duration. This will require setting a timeout on each request (or a way to cancel requests when the timeout is reached). In this way you'd be able to configure and enforce a set cadence for each GooseUser. Is this more what you're looking for? (The timeout is necessary as otherwise we end up with cascading requests...)
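
A hypothetical sketch of that enforced-cadence idea (not Goose's API; a tokio-based illustration with a placeholder send_request function and an assumed 100 ms cadence): requests are issued on a fixed ticker and each is bounded by a timeout, so a slow response cannot push back the next request.

```rust
// Hypothetical sketch only; requires the `tokio` crate (e.g. features = ["full"]).
use std::time::Duration;
use tokio::time::{interval, sleep, timeout};

/// Placeholder for issuing one request and recording its response time.
async fn send_request() {
    // e.g. make an HTTP request here and record the observed latency
}

#[tokio::main]
async fn main() {
    // Assumed target cadence: one request every 100 ms.
    let cadence = Duration::from_millis(100);
    let mut ticker = interval(cadence);

    for _ in 0..100 {
        ticker.tick().await;
        // Spawn so a slow response never delays the next tick; the timeout
        // caps how long we wait for any single request.
        tokio::spawn(async move {
            if timeout(Duration::from_millis(100), send_request()).await.is_err() {
                eprintln!("request exceeded its window; recorded as a timeout");
            }
        });
    }

    // Give any in-flight requests a moment to finish before exiting.
    sleep(cadence).await;
}
```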

Try not to care about these statistical issues. No offense intended but I'm now reasonably confident they are not your strength, just as Rust is not my strength.

My goal is to be sure I understand what I'm committing: I'm not willing to commit documentation or code changes that don't make sense to me.

Contributor Author

Ok, so you're saying the goal is more that we enforce the request cadence?

Correct. When it is on it is on. When it is off it is off.

That's not very constructive to solving this, but okay.

Nothing to do with being constructive or not. Time is finite. There are multiple things to do with each minute.

impose a maximum time for each task

In our opinion, that works against preventing CO from distorting the response times.

I'm not willing to commit documentation or code changes that don't make sense to me.

Sensible.

Nonetheless, for users interested in establishing whether there was an event (or events) affecting the shape of the distribution of load-test metrics by a statistically significant amount, the following procedure is a reasonable starting point (see the sketch after this list):

1. Run a test in circumstances where you believe the test can serve as a baseline sample from the 'healthy' state; record the raw response data (record the CO-adjusted data if using the minimum 'cadence' to adjust for CO).
2. Run a test in circumstances where you want to compare the distribution of response times against your baseline sample; record the CO-adjusted response data.
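
The natural next step, not shown in the excerpt above, is to compare the two recorded samples. A minimal sketch of a two-sample Kolmogorov–Smirnov comparison (illustrative only; the sample values are made up):

```rust
/// Illustrative two-sample Kolmogorov–Smirnov statistic: the largest vertical
/// distance between the empirical CDFs of the baseline sample and the
/// comparison sample. Larger values suggest the distributions differ.
fn ks_statistic(mut baseline: Vec<f64>, mut comparison: Vec<f64>) -> f64 {
    baseline.sort_by(|a, b| a.partial_cmp(b).unwrap());
    comparison.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let (n, m) = (baseline.len() as f64, comparison.len() as f64);
    let (mut i, mut j, mut d) = (0usize, 0usize, 0.0f64);
    while i < baseline.len() && j < comparison.len() {
        // Step past all values equal to the smaller current value, then
        // compare the two empirical CDFs at that point.
        let x = baseline[i].min(comparison[j]);
        while i < baseline.len() && baseline[i] <= x {
            i += 1;
        }
        while j < comparison.len() && comparison[j] <= x {
            j += 1;
        }
        d = d.max((i as f64 / n - j as f64 / m).abs());
    }
    d
}

fn main() {
    // Hypothetical response-time samples (ms) from a baseline run and a later run.
    let baseline = vec![12.0, 13.0, 12.5, 14.0, 13.5, 12.8];
    let comparison = vec![12.1, 13.2, 15.0, 40.0, 95.0, 400.0];
    println!("KS statistic: {:.3}", ks_statistic(baseline, comparison));
    // To turn this into a test, compare the statistic against the critical
    // value c(alpha) * sqrt((n + m) / (n * m)); c(0.05) is roughly 1.36.
}
```

Any standard two-sample test could be substituted; the point is only that the comparison happens outside Goose, on the data it records.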
Member

Goose will monitor and tell you if it believes there was a CO-event, and will back-fill the metrics accordingly.

While the person running the load test remains the responsible party that oversees and reviews everything that's going on, Goose should simplify this considerably.

Contributor Author
@bbros-dev Jul 8, 2021

Goose will monitor and tell you if it believes there was a CO-event

Rather avoid this type of pseudo-intelligence. It clutters the code and gives a false sense of security.

If our understanding is correct: When you set the cadence as 'minimum' you'll get CO-mitigation right after the warm-up period.
Correct?

We're still of the view the minimum is the correct setting, so that is the only behavior that concerns us right now ;)

When the 'proper' solution arrives users will define their sample interval or rate/sec. In that case it would be convenient to have a flag that uses the minimum interval (maximum req/sec) obtained during some warmup period - and that might even come to be accepted as the default in the community. But we're not there yet.

Member

Rather avoid this type of pseudo-intelligence. It clutters the code and gives a false sense of security.

The intent is quite the opposite: bring the tester's attention to abnormally slow requests together with timestamps that aid in tracking down the cause.

If --co-mitigation minimum is enabled, the user_cadence will indeed be the fastest time each GooseUser thread has completed all GooseTasks:

// Update `minimum_cadence` if this was the fastest seen.

If our understanding is correct: When you set the cadence as 'minimum' you'll get CO-mitigation right after the warm-up period.

In the current implementation, the "warm up period" is per-GooseUser thread, and is 3 complete iterations of all their GooseTasks:

if request_cadence.counter > 3 {
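
A simplified sketch of how those two pieces fit together (illustrative only, not the actual Goose source; the struct and method shown here are assumptions):

```rust
use std::time::Duration;

/// Illustrative per-GooseUser cadence state: warm up for a few passes, then
/// treat the fastest observed pass as the expected cadence.
struct RequestCadence {
    counter: u32,
    minimum_cadence: Duration,
}

impl RequestCadence {
    /// Record one full pass through the task list; returns the expected
    /// cadence once the warm-up has completed.
    fn record_pass(&mut self, elapsed: Duration) -> Option<Duration> {
        self.counter += 1;
        // Update `minimum_cadence` if this was the fastest seen.
        if self.minimum_cadence.is_zero() || elapsed < self.minimum_cadence {
            self.minimum_cadence = elapsed;
        }
        // Only report a cadence after the warm-up period (3 complete passes).
        if self.counter > 3 {
            Some(self.minimum_cadence)
        } else {
            None
        }
    }
}

fn main() {
    let mut cadence = RequestCadence { counter: 0, minimum_cadence: Duration::ZERO };
    for ms in [120u64, 100, 110, 105] {
        if let Some(expected) = cadence.record_pass(Duration::from_millis(ms)) {
            println!("expected cadence: {expected:?}");
        }
    }
}
```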

We're still of the view the minimum is the correct setting, so that is the only behavior that concerns us right now ;)

What is your use-case where this makes sense? Are you load testing in a fully controlled environment?

Contributor Author
@bbros-dev Jul 9, 2021

The intent is quite the opposite: bring the tester's attention to abnormally slow requests together with timestamps that aid in tracking down the cause.

Understood. But try an experiment where you freeze the server for some time period smaller than the average and do that often - you'll discover that frequently you are hiding the CO event.

No?

Member

It makes sense, but this would be true for any event that is less than the period you've selected.

Member

...which is why you'd use the minimum. Of course.

The remaining concern for me is that this could result in the metrics that Goose returns containing as many or more "back-filled metrics" (that never actually happened) as "real metrics" (that actually happened).
