Infinite retries even on CRITICAL severity #830

bhh1988 · 2017-05-28T08:03:22Z

Have you read the FAQ and checked for duplicate issues:
Yes, I'm filing a more general issue, which is the root cause of #762.

What version of Shaka Player are you using:
This issue exists in both 2.0.x and 2.1.x, and not before.

Can you reproduce the issue with our latest release version:
Yes

Can you reproduce the issue with the latest code from master:
Yes

Are you using the demo app or your own custom app:
Yes

If custom app, can you reproduce the issue using our demo app:
Yes

What browser and OS are you using:
Chrome, Mac (but should be reproducible in others).

What are the manifest and license server URIs:
(you can send the URIs to shaka-player-issues@google.com instead, but please use GitHub and the template for the rest)
Irrelevant. Only thing that matters is that segments return non 2XX http code. Even 401 or 403 reproduces.

What did you do?
Host segments on my local server, program server to return 401/403, or anything non 2XX for segments (manifest still served fine).

What did you expect to happen?
For 401/403, these are CRITICAL severity since #620 so I expect no retries to occur. For other non-2XX codes, I expect only up to "maxAttempts" retries to occur.

What actually happened?
Retries forever. This is counter-intuitive since it basically doesn't respect the maxAttempts in the retryParameters. Not to mention it can lead to unintentional DDOS'ing of the host...

I've debugged through the code, and I think I understand the issue and possible solutions. #390 was fixed by introducing another layer of retries in StreamingEngine (see 0d77ddf). However, the logic introduced there, by calling handleNetworkError_(), effectively leads to infinite retries. This double-layering of retries (both StreamingEngine and NetworkingEngine) is confusing and seems to defeat the point of having a retry-policy with the RetryParameters. I could be wrong, but it seems to me that #390 would have been better addressed differently, by telling the user to configuring shaka's RetryParameters appropriately to be robust to network failures, and closing the issue as "by-design". Or at the minimum, it seems that 0d77ddf is catching too many cases in the if-statement by including BAD_HTTP_STATUS, when #390 was really intended for bad network conditions (which would be HTTP_ERROR or TIMEOUT).

Here are the solutions I see for this in order of preference:

Remove the logic for calling handleNetworkError_(). This would eliminate the double-layering of retries. remove the handleNetworkError_() function altogether, after lifting out some logic for handling text-stream failures.
Add a config variable like "doNotRetryInfinitely", and make the above logic conditional on this config variable.
Continue allowing infinite retries but only for HTTP_ERROR or TIMEOUT.

Let me know if I'm overlooking something...

The text was updated successfully, but these errors were encountered:

TheModMaker · 2017-05-30T17:32:54Z

The reason we have handleNetworkError_ is because of live content. If there are clock-sync problems or if the server lags behind, we may request a segment before the server has it available. But the segment may appear later, so we will retry after a slight delay. This ensures that a delay in the server doesn't cause a full stop of live video. With the default settings, we will stop retrying after 1 second, so if we request a segment only 1 second early, then a live stream will stall.

It should be noted that we will pause at least 4 seconds between these retries and we only make one request per stream (for a max of three for audio/video/text); so it seems unlikely to cause DDOS problems.

Since this should only happen for live streams, we could change the logic to only use handleNetworkError_ for live streams. We could also add a configuration to not retry for live streams; but I am hesitant to do that since the alternative is live stream stalling for slight delays.

bhh1988 · 2017-05-30T18:13:08Z

Is it not a violation of abstraction for StreamingEngine to know whether the manifest is live or VOD?

For this live content lag scenario you mention, is there a reason why we can't expect the user to feed in appropriate retry parameters to guard against that? Perhaps the fear is that there are too many people expecting the default retry parameters to just work? In that case, how about if the retry-parameters ARE supplied, we can honor only those and not call handleNetworkError_()?

TheModMaker · 2017-05-31T17:28:20Z

I don't think an abstraction violation is a big problem here. This is only for handling errors, not for normal streaming.

One concern is that most apps will use the default values. Detecting whether they are provided or not can be hard especially since there are more than one; so changing one of the parameters may not be enough. Also we don't know if the app is considering this case when they provide these values.

But there are a lot of reasons we may not be able to download segments now but can later. Clock sync is one that the app doesn't know about and can't set retry-parameters for. Also, if the manifest uses <SegmentTimeline>, we only have an explicit list of segments; meaning we will need to wait until the next manifest update to get the next segments. The app may not be thinking of all these cases when they set the retry-parameters.

I think the best case would be to only use handleNetworkError_ for live streams and to add another config to optionally not do this. Something like .streaming.retryForLiveStreams (default to true).

Do not infinitely retry on top of the retry-policy already defined by RetryParameters, especially for VOD. For live video the previous retry logic is still useful to be robust against clock-sync or asset availability issues. This commit adds a boolean configuration parameter, infiniteRetriesForLiveStreams, to allow the user to turn this behavior off even for live video. Fixes shaka-project#830 and shaka-project#762

bhh1988 · 2017-06-02T16:35:13Z

I created a pull-request for this. For the conf value, I chose the name "infiniteRetriesForLiveStreams". Not sure if you guys agree with the name, but the suggested "retryForLiveStreams" seemed misleading to me because even with it false, there will still be retries for live streams according to the passed in RetryParameters. I'm open to suggestions for alternatives.

Do not infinitely retry on top of the retry-policy already defined by RetryParameters, especially for VOD. For live video the previous retry logic is still useful to be robust against clock-sync or asset availability issues. This commit adds a boolean configuration parameter, infiniteRetriesForLiveStreams, to allow the user to turn this behavior off even for live video. Fixes #830 and #762

joeyparrish · 2017-06-05T21:51:37Z

Cherry-picked to v2.1.3.

bhh1988 mentioned this issue Jun 2, 2017

Stop infinitely retrying on network errors for VOD #842

Merged

joeyparrish added the type: bug Something isn't working correctly label Jun 5, 2017

joeyparrish added this to the v2.2.0 milestone Jun 5, 2017

joeyparrish closed this as completed in #842 Jun 5, 2017

joeyparrish mentioned this issue Jul 25, 2017

SHAKA and Offline Events #937

Closed

joeyparrish mentioned this issue Aug 9, 2017

Provide more application control over streaming retry behavior #960

Closed

shaka-bot added the archived label Mar 22, 2018

shaka-project locked and limited conversation to collaborators Mar 22, 2018

shaka-bot added the status: archived Archived and locked; will not be updated label Apr 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Infinite retries even on CRITICAL severity #830

Infinite retries even on CRITICAL severity #830

bhh1988 commented May 28, 2017 •

edited

Loading

TheModMaker commented May 30, 2017

bhh1988 commented May 30, 2017

TheModMaker commented May 31, 2017

bhh1988 commented Jun 2, 2017

joeyparrish commented Jun 5, 2017

Infinite retries even on CRITICAL severity #830

Infinite retries even on CRITICAL severity #830

Comments

bhh1988 commented May 28, 2017 • edited Loading

TheModMaker commented May 30, 2017

bhh1988 commented May 30, 2017

TheModMaker commented May 31, 2017

bhh1988 commented Jun 2, 2017

joeyparrish commented Jun 5, 2017

bhh1988 commented May 28, 2017 •

edited

Loading