Circuit breaker spike march16 added windows #1

kristianhald · 2016-04-12T16:12:56Z

I think that doing a merge with your fork is probably a good idea, if your review doesn't uncover any big issues.

Any future changes, I can just create a new branch and another pull request.

…s ensuring that the windows are being used correctly by the circuit breaker.

…mplement the interface. This allows for changing the implementation of the health metrics without changing the code in the controller. Implemented a new strategy of health metrics which does not have rolling windows, but instead a single window. At the moment the decision on which 'strategy' to use is made in the controller, but this can be moved out of the controller and be handled in the syntax along with checks on the parameters.

…ice duration value.

reisenberger · 2016-04-13T08:12:59Z

@kristianhald Agree. Aiming to complete a read of this later today. No big issues so far 👍

reisenberger · 2016-04-13T18:17:34Z

src/Polly.Shared/CircuitBreaker/RollingHealthMetrics.cs

+                // therefore not be removed from the queue.
+                if (now - _windows.Peek().StartedAt < _timesliceDuration)
+                    break;
+


@kristianhald Are you happy if, around here, we simply do the following:

while (_windows.Count > 0 && (now - _windows.Peek().StartedAt >= _timesliceDuration)) _windows.Dequeue();

It keeps the logic of this and code above similar, and seems straightforwardly enough comprehensible: "while there are out-of-date slices at the front of the queue, remove them".

Either you can commit this (becomes part of the PR) or I can just do it after merging.

reisenberger · 2016-04-13T18:19:08Z

@kristianhald Thanks again for this great contribution! Having read over the code, my only detail comment is the one above around the Dequeue loop. Nice renamings and refactoring from the earlier version.

Reading over the tests next...

kristianhald · 2016-04-13T18:40:33Z

@reisenberger Thank you. Regarding the while loop, 👍.
I have added a commit with the change.

reisenberger · 2016-04-13T18:42:15Z

Hey @kristianhald. Here is a conceptual issue. I observe that, from our implementation, the windows we create within the timeslice are not necessarily adjacent. For instance, imagining windows of size 1, if the first action records at t = 0, and the second action at t = 1.5, we will be running two windows from 0->1, and 1.5->2.5, with a 0.5 gap in between them. Reasoning suggests this has no effect on the statistics, as the gaps only represent 'empty' times with no results.

I considered an outside case: imagining windows of size 1 in slice size 10, we could have a window at t = 0 and another at t = 9.9, this latter in theory extending +1 to t = 10.9 ... Again, reasoning suggests that if any statistic comes in at t >= 10, the t = 0 window will be dropped, so at no point will we be considering statistics for a period >10. However, it might be nice (for regression value, in case anyone refactors...) to have a test proving this.

It looks like this could be done by duplicating the test Should_not_open_circuit_if_failure_threshold_exceeded_but_throughput_threshold_not_met_before_timeslice_expires__even_if_timeslice_expires_only_exactly()
and adding SystemClock.UtcNow = () => time.Add(timesliceDuration.Subtract(TimeSpan.FromTicks(1))); just before the third exception? Would this cover it? It should cause it to create a new bucket which overlaps into the following timeslice ... but the t = 0 window should be dropped when the fourth exception comes in?

Really liking the look of the tests.

In general: Do you agree that the non-adjacent windows issue has no effect on statistics, or am I missing something?

Perhaps controversially, my first instinct is not to code to make windows adjacent before merging. It adds complication and doesn't add anything to solving the problem at hand. Following the mantra 'Keep the code as simple as possible to deliver the job needed', we don't need to correct for this, and the current creation/disposal of window code reads intuitively. However, there would be future scenarios (eg emitting regular statistics, as Hystrix does) where it would become important. And interested in your thoughts...

Note: I will document this also in the main thread on App-vNext/Polly, since it's important enough to have in the history for future maintenance.

Let me have your thoughts.

Thanks!

reisenberger · 2016-04-13T18:47:47Z

... unless you already have such a test ... (not quite read all; called away from kbd now)

kristianhald · 2016-04-13T19:30:06Z

In general: Do you agree that the non-adjacent windows issue has no effect on statistics, or am I missing something?

I am under the impression, like you state, that as long as we remove buckets/windows that are older than Now() - timesliceDuration then it will not have an impact. Also I felt that the implementation would be smaller and easier to read.

Did a little thinking and it might actually be quite easy to add the logic for adjacent buckets/windows.
If we need the buckets to align adjacent to each other, then I think that the easiest solution would be to decide on a 'start time' (maybe construction time of the class or just zero) and when we create a new bucket, we move the 'StartedAt' timestamp backwards in time to where the bucket should have been created.

Lets take the following scenario:
Start time: Could be decided at constructor time or just be zero (Beginning of the universe 😸). Lets say it is 23 seconds (from when the universe began).
Bucket/Window duration: 0.7 seconds

At the current moment in time, DateTime.Now() reads 26.457 seconds and an error occurs.
From 23 seconds and forward until 26.457 the buckets StartedAt reads:
23, 23.7, 24.4, 25.1, 25.8, 26.5

As the 26.5 bucket is in the future, then this cannot be created. Therefore the bucket we are in is 25.8. The question is then, how do we get from 26.457 to 25.8 without having to create all buckets from 23 seconds. The answer (I think 😄. Just throwing a quick idea here) is the following calculation:
StartedAt = Now() - ((Now() - StartTime) MOD BucketDuration)

If we take the above numbers we should get:
StartedAt = 26.457 - ((26.457 - 23) MOD 0.7)
= 26.457 - (3.457 MOD 0.7)
= 26.457 - 0.657
= 25.8

It will still have empty buckets as holes, but it should align the buckets to each other.

Question: Would aligning the buckets actually solve the statistics issue?

eg emitting regular statistics, as Hystrix does

When you say 'regular' do you mean bucket/window duration regular, every 1 second regular or a delegate call, when an event occurs (Success or Error)?

kristianhald · 2016-04-13T19:31:19Z

It looks like this could be done by duplicating the test Should_not_open_circuit_if_failure_threshold_exceeded_but_throughput_threshold_not_met_before_timeslice_expires__even_if_timeslice_expires_only_exactly()
and adding SystemClock.UtcNow = () => time.Add(timesliceDuration.Subtract(TimeSpan.FromTicks(1))); just before the third exception? Would this cover it? It should cause it to create a new bucket which overlaps into the following timeslice ... but the t = 0 window should be dropped when the fourth exception comes in?

@reisenberger Don't have such a test and I believe that you are correct that we should have one to ensure that it works as expected. I think that the test would ensure the outer case is valid.
I will add it immediately.

kristianhald · 2016-04-13T19:38:20Z

Question: Would aligning the buckets actually solve the statistics issue?

If statistics should be provided, what is the time interval given? If we have a window duration of 0.7 seconds will that match the time interval of the statistics Polly gives or should Polly give statistics for every second or every five seconds?

It is an entirely different discussion, so do not believe we should try to solve it here. I just wanted to point out that if the window durations do not align with the statistical interval that is being emitted, then it will not be possible for Polly to give the correct statistics based on the data from the windows, as the statistical interval can begin or end in the middle of a window, thereby not knowing which of the events from the window belongs in the statistical interval and which are outside.

I think?

…her, then having a bucket that starts just before the timeslice duration ends, does not result in the calculation of the HealthCount being different, than if the windows were aligned.

reisenberger · 2016-04-13T19:53:58Z

When you say 'regular' do you mean bucket/window duration regular,
every 1 second regular or a delegate call, when an event occurs (Success or Error)?

At regular intervals (every call = too frequent). But the frequency of emitting these statistics would absolutely be aligned to either timesliceDuration or windowDuration. Definitely don't want some different system of timing to interact with ;~) Vague memory, Hystrix emits stats at one of these intervals I think.

Options for adding this (later) to Polly:

emit a HealthMetric when dequeueing it from front of the queue;

OR (more timely, fractionally more code)

emit a metric when it ceases being the _current

[but this is an] entirely different discussion

Absolutely. See the (ambitious but exciting!) roadmap (do add feedback under the placeholder issue if you have thoughts!). My long-term goal is to allow variety of Polly resilience strategies to be chained together in a kind of wrap / pipeline syntax (see the roadmap), but with a more flexible model than Hystrix. So that users can choose which elements they want to combine (rather than have to learn/consider the whole lot) - either choose just one strategy, or make a more complicated wrap, as they choose.

Rather than emitting statistics now, my (current) vision is of adding emitting more broadly-encompassing statistics to Polly after more of the pipeline approach/more of the resilience strategies are in place ... not yet got around to adding this to the roadmap ...

reisenberger · 2016-04-13T20:12:21Z

Re:

StartedAt = Now() - ((Now() - StartTime) MOD BucketDuration)

Nice! Building on your suggestion, maybe:

StartedAt = Now() - ((Now() - StartTimeOfPreviousWindow) MOD BucketDuration)

Heh, we even have now - _currentWindow.StartedAt calculated in the line before, could store in variable and re-use ...

I marginally prefer this (smaller subtraction, modulus; easier comprehension), but it leaves an issue [another code branch needed] when the circuit is reset and there isn't a previous window ... if aligning across circuit resets matters ...

reisenberger · 2016-04-13T20:22:07Z

Although re:

Rather than emitting statistics now, my (current) vision is of adding emitting more broadly-
encompassing statistics to Polly after more of the pipeline approach/more of the resilience
strategies are in place ... not yet got around to adding this to the roadmap ...

if users have complete flexibility, eg of setting circuit-breaker and retry intervals entirely unrelated, then wrapping those policies together and emitting time-regulated statistics from the combined [but different time-intervalled] wrap could (would) be problematic. Not got a final view on this ... but currently seeing stats in context of bigger vision (rather than, eg, adding stats only for breaker now).

kristianhald · 2016-04-13T20:22:58Z

when the circuit is reset and there isn't a previous window ... if aligning across circuit resets matters ...

Weeeeeeeell, we could be a little lazy and define a reset as the 'StartTime' change:

        public void Reset_NeedsLock()
        {
            _currentWindow = null;
            _windows.Clear();
        }

to

        public void Reset_NeedsLock()
        {
            _currentWindow = null;
            _windows.Clear();

            long now = SystemClock.UtcNow().Ticks;
            _currentWindow = new HealthCount { StartedAt = now };
            _windows.Enqueue(_currentWindow);
        }

That would change the start time of the entire timeslice, but allow the logic to still use the current windows StartedAt.

If the system should still have windows aligned acrossed resets, then something like this would probably solve it (Haven't thought through all iterations):

        public void Reset_NeedsLock()
        {
            ActualiseCurrentMetric_NeedsLock();
            _windows.Clear();

            _currentWindow = new HealthCount { StartedAt = _currentWindow.StartedAt };
            _windows.Enqueue(_currentWindow);
        }

kristianhald · 2016-04-13T20:26:02Z

If the system should still have windows aligned acrossed resets, then something like this would probably solve it (Haven't thought through all iterations)

This code assumes that the ActualiseCurrentMetric aligns the windows next to each other.

reisenberger · 2016-04-13T20:35:27Z

Hey @kristianhald Briefly to say, don't add this (aligning) to current PR yet. It's great stuff, just want to step back reflect whether we're adding it or not ... (tho the more concise we punch it down, the better candidate it gets ...)

kristianhald · 2016-04-13T20:42:29Z

No problem. I believe that both would work in ensuring alignment of windows even across resets, but it would only ensure alignment of windows and not provide any more or less exactness of the breaker.

In regards to overhead of the calculation, I do not believe it so much, since it would only be done every window duration. But it should be performance tested and we should have a baseline to compare against. So if we believe the current implementation solves the feature, then I suggest that should be the baseline for any additional implementation.

I also would like to have the feature packaged, so I can begin using it 😃.

reisenberger · 2016-04-13T20:54:16Z

Ok, back to possible solutions: how about this? (building again on yours)

        private void ActualiseCurrentMetric_NeedsLock(long alignTo = 0) // or a separate overload rather than optional parameter, probably clearer
        {
            if (alignTo == 0) alignTo = _currentWindow.StartedAt;

            long now = SystemClock.UtcNow().Ticks;
            if (_currentWindow == null || now - _currentWindow.StartedAt >= _windowDuration)
            {
                _currentWindow = new HealthCount { StartedAt = now - ((now - alignTo) % windowDuration) };
                _windows.Enqueue(_currentWindow);
            }

          while (_windows.Count > 0 && (now - _windows.Peek().StartedAt >= _timesliceDuration))
                _windows.Dequeue();
        }

with:

        public void Reset_NeedsLock()
        {
            long alignWindowAfterReset = _currentWindow != null ? _currentWindow.StartedAt : SystemClock.UtcNow();
            _windows.Clear();
            _currentWindow = null;
            ActualiseCurrentMetric_NeedsLock(alignWindowAfterReset);
        }

(I slightly prefer this, as expresses clearly what the alignment about, and avoids enqueuing, clearing, then enqueuing ... easier later to understand why ... but all building on your suggestion)

Happy for views. Will have a think, your original 'start of universe' had simplicity too

reisenberger · 2016-04-13T21:00:30Z

could be I'm making it too complicated. Could just maintain a separate private long alignTo on the class, and update it periodically (if necessary, even).

Switching back to getting merge-down in hand without this, for now

kristianhald · 2016-04-13T21:26:13Z

I actually also prefer the 'start of universe' setting. It would mean that we might start the circuit breaker a bit unaligned (When the circuit breaker is created first time, then the first window might be cut of). However, since it normally takes a little time for an application to start up, then I do not think this would be noticed.

Having the 'start of universe' setting would mean that windows are always aligned even when reset, since it does not change.

What would be the benefit of setting alignTo different from zero?

I am not certain, but I would believe it takes the same amount of computation to calculate (Now() - alignTo) MOD windowDuration as doing Now() MOD windowDuration. Actually the first might be a bit heavier as it needs to do the subtraction first (I am guessing very low amount of extra calculation).

Should reset change the alignment of the windows or just continue as if the windows have been cleared? The latter would signify that how the timeslice is started is not affected, only the contents of the timeslice is affected, when doing a reset.

EDIT: Its late. Going to bed 😄 Also this might be a talk on the more public forum.

reisenberger · 2016-04-13T21:53:07Z

Excellent tests. Merged to my fork.

reisenberger · 2016-04-14T07:02:33Z

@kristianhald

this might be a talk on the more public forum

Agree. Do you have any objection if I copy important parts of this conversation (correctly attributed to each person) over to the main issue 41 on Polly?

kristianhald · 2016-04-14T08:23:08Z

Agree. Do you have any objection if I copy important parts of this conversation (correctly attributed to each person) over to the main issue 41 on Polly?

Nope, I think it will be a very good idea to do it, so that we can get input from other people 😃

kristianhald added 4 commits April 11, 2016 15:49

Implemented rolling windows for the advanced circuit breaker and spec…

294c842

…s ensuring that the windows are being used correctly by the circuit breaker.

Forgot some 'Fact' attributes on a few of the async tests.

5eccf0d

Added the lower limit of the circuit resolution to the allowed timesl…

8f59221

…ice duration value.

reisenberger reviewed Apr 13, 2016
View reviewed changes

Moved the inner if of the dequeuing while loop into the while statement.

cf7e43a

Added test to ensure that even if the windows do not align to each ot…

f06b854

…her, then having a bucket that starts just before the timeslice duration ends, does not result in the calculation of the HealthCount being different, than if the windows were aligned.

reisenberger merged commit db8adb3 into reisenberger:CircuitBreakerSpikeMarch16 Apr 13, 2016

reisenberger mentioned this pull request Apr 15, 2016

Will the circuit breaker trip if enough minor transient errors happen? Should frequency of exceptions or interval between exceptions be a factor? App-vNext/Polly#41

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Circuit breaker spike march16 added windows #1

Circuit breaker spike march16 added windows #1

kristianhald commented Apr 12, 2016

reisenberger commented Apr 13, 2016

reisenberger Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

kristianhald commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 14, 2016

kristianhald commented Apr 14, 2016

Circuit breaker spike march16 added windows #1

Circuit breaker spike march16 added windows #1

Conversation

kristianhald commented Apr 12, 2016

reisenberger commented Apr 13, 2016

reisenberger Apr 13, 2016

Choose a reason for hiding this comment

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

kristianhald commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 13, 2016

kristianhald commented Apr 13, 2016

reisenberger commented Apr 13, 2016

reisenberger commented Apr 14, 2016

kristianhald commented Apr 14, 2016