Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite retries even on CRITICAL severity #830

Closed
bhh1988 opened this issue May 28, 2017 · 5 comments
Closed

Infinite retries even on CRITICAL severity #830

bhh1988 opened this issue May 28, 2017 · 5 comments
Labels
status: archived Archived and locked; will not be updated type: bug Something isn't working correctly
Milestone

Comments

@bhh1988
Copy link
Contributor

bhh1988 commented May 28, 2017

Have you read the FAQ and checked for duplicate issues:
Yes, I'm filing a more general issue, which is the root cause of #762.

What version of Shaka Player are you using:
This issue exists in both 2.0.x and 2.1.x, and not before.

Can you reproduce the issue with our latest release version:
Yes

Can you reproduce the issue with the latest code from master:
Yes

Are you using the demo app or your own custom app:
Yes

If custom app, can you reproduce the issue using our demo app:
Yes

What browser and OS are you using:
Chrome, Mac (but should be reproducible in others).

What are the manifest and license server URIs:
(you can send the URIs to shaka-player-issues@google.com instead, but please use GitHub and the template for the rest)
Irrelevant. Only thing that matters is that segments return non 2XX http code. Even 401 or 403 reproduces.

What did you do?
Host segments on my local server, program server to return 401/403, or anything non 2XX for segments (manifest still served fine).

What did you expect to happen?
For 401/403, these are CRITICAL severity since #620 so I expect no retries to occur. For other non-2XX codes, I expect only up to "maxAttempts" retries to occur.

What actually happened?
Retries forever. This is counter-intuitive since it basically doesn't respect the maxAttempts in the retryParameters. Not to mention it can lead to unintentional DDOS'ing of the host...

I've debugged through the code, and I think I understand the issue and possible solutions. #390 was fixed by introducing another layer of retries in StreamingEngine (see 0d77ddf). However, the logic introduced there, by calling handleNetworkError_(), effectively leads to infinite retries. This double-layering of retries (both StreamingEngine and NetworkingEngine) is confusing and seems to defeat the point of having a retry-policy with the RetryParameters. I could be wrong, but it seems to me that #390 would have been better addressed differently, by telling the user to configuring shaka's RetryParameters appropriately to be robust to network failures, and closing the issue as "by-design". Or at the minimum, it seems that 0d77ddf is catching too many cases in the if-statement by including BAD_HTTP_STATUS, when #390 was really intended for bad network conditions (which would be HTTP_ERROR or TIMEOUT).

Here are the solutions I see for this in order of preference:

  1. Remove the logic for calling handleNetworkError_(). This would eliminate the double-layering of retries. remove the handleNetworkError_() function altogether, after lifting out some logic for handling text-stream failures.
  2. Add a config variable like "doNotRetryInfinitely", and make the above logic conditional on this config variable.
  3. Continue allowing infinite retries but only for HTTP_ERROR or TIMEOUT.

Let me know if I'm overlooking something...

@TheModMaker
Copy link
Contributor

The reason we have handleNetworkError_ is because of live content. If there are clock-sync problems or if the server lags behind, we may request a segment before the server has it available. But the segment may appear later, so we will retry after a slight delay. This ensures that a delay in the server doesn't cause a full stop of live video. With the default settings, we will stop retrying after 1 second, so if we request a segment only 1 second early, then a live stream will stall.

It should be noted that we will pause at least 4 seconds between these retries and we only make one request per stream (for a max of three for audio/video/text); so it seems unlikely to cause DDOS problems.

Since this should only happen for live streams, we could change the logic to only use handleNetworkError_ for live streams. We could also add a configuration to not retry for live streams; but I am hesitant to do that since the alternative is live stream stalling for slight delays.

@bhh1988
Copy link
Contributor Author

bhh1988 commented May 30, 2017

Is it not a violation of abstraction for StreamingEngine to know whether the manifest is live or VOD?

For this live content lag scenario you mention, is there a reason why we can't expect the user to feed in appropriate retry parameters to guard against that? Perhaps the fear is that there are too many people expecting the default retry parameters to just work? In that case, how about if the retry-parameters ARE supplied, we can honor only those and not call handleNetworkError_()?

@TheModMaker
Copy link
Contributor

I don't think an abstraction violation is a big problem here. This is only for handling errors, not for normal streaming.

One concern is that most apps will use the default values. Detecting whether they are provided or not can be hard especially since there are more than one; so changing one of the parameters may not be enough. Also we don't know if the app is considering this case when they provide these values.

But there are a lot of reasons we may not be able to download segments now but can later. Clock sync is one that the app doesn't know about and can't set retry-parameters for. Also, if the manifest uses <SegmentTimeline>, we only have an explicit list of segments; meaning we will need to wait until the next manifest update to get the next segments. The app may not be thinking of all these cases when they set the retry-parameters.

I think the best case would be to only use handleNetworkError_ for live streams and to add another config to optionally not do this. Something like .streaming.retryForLiveStreams (default to true).

bhh1988 pushed a commit to bhh1988/shaka-player that referenced this issue Jun 2, 2017
Do not infinitely retry on top of the retry-policy already defined by
RetryParameters, especially for VOD. For live video the previous retry
logic is still useful to be robust against clock-sync or asset
availability issues. This commit adds a boolean configuration parameter,
infiniteRetriesForLiveStreams, to allow the user to turn this behavior
off even for live video.

Fixes shaka-project#830 and shaka-project#762
bhh1988 pushed a commit to bhh1988/shaka-player that referenced this issue Jun 2, 2017
Do not infinitely retry on top of the retry-policy already defined by
RetryParameters, especially for VOD. For live video the previous retry
logic is still useful to be robust against clock-sync or asset
availability issues. This commit adds a boolean configuration parameter,
infiniteRetriesForLiveStreams, to allow the user to turn this behavior
off even for live video.

Fixes shaka-project#830 and shaka-project#762
@bhh1988
Copy link
Contributor Author

bhh1988 commented Jun 2, 2017

I created a pull-request for this. For the conf value, I chose the name "infiniteRetriesForLiveStreams". Not sure if you guys agree with the name, but the suggested "retryForLiveStreams" seemed misleading to me because even with it false, there will still be retries for live streams according to the passed in RetryParameters. I'm open to suggestions for alternatives.

@joeyparrish joeyparrish added the type: bug Something isn't working correctly label Jun 5, 2017
@joeyparrish joeyparrish added this to the v2.2.0 milestone Jun 5, 2017
joeyparrish pushed a commit that referenced this issue Jun 5, 2017
Do not infinitely retry on top of the retry-policy already defined by
RetryParameters, especially for VOD. For live video the previous retry
logic is still useful to be robust against clock-sync or asset
availability issues. This commit adds a boolean configuration parameter,
infiniteRetriesForLiveStreams, to allow the user to turn this behavior
off even for live video.

Fixes #830 and #762
joeyparrish pushed a commit that referenced this issue Jun 5, 2017
Do not infinitely retry on top of the retry-policy already defined by
RetryParameters, especially for VOD. For live video the previous retry
logic is still useful to be robust against clock-sync or asset
availability issues. This commit adds a boolean configuration parameter,
infiniteRetriesForLiveStreams, to allow the user to turn this behavior
off even for live video.

Fixes #830 and #762
@joeyparrish
Copy link
Member

Cherry-picked to v2.1.3.

@shaka-project shaka-project locked and limited conversation to collaborators Mar 22, 2018
@shaka-bot shaka-bot added the status: archived Archived and locked; will not be updated label Apr 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
status: archived Archived and locked; will not be updated type: bug Something isn't working correctly
Projects
None yet
Development

No branches or pull requests

4 participants