
Fix bandwidth sampling for small transfers and fast connections #3595

Closed · wants to merge 3 commits

Conversation

@kanongil (Contributor) commented Mar 9, 2021

This PR will...

Fix bandwidth sampling weight for small transfers and/or fast connections.

Why is this Pull Request needed?

The current logic is broken for fast transfers, whether due to small transfer sizes or a fast connection. See #3563 for details.

Are there any points in the code the reviewer needs to double check?

This changes the exact meaning of the abrEwma<X> options, which could be considered a breaking change.

I have also had to significantly lower the abrEwmaFastLive default to accommodate LL-HLS level downswitches. The old value meant that it took too long to discover a newly lowered bitrate, and could cause multiple successive emergency level switches and stalls. Note that the new value could make a temporary connection issue more likely to cause temporary downswitches. It might make sense to only use this low value for LL-HLS content, but that is outside the scope of this patch.
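For context, these options can already be tuned per player instance through the Hls.js config. A minimal sketch (the values shown are illustrative, not the defaults proposed by this PR):

```ts
import Hls from 'hls.js';

// Illustrative only: a lower fast half-life makes the estimator react more
// quickly to bandwidth drops on low-latency streams, at the cost of being
// more sensitive to short-lived dips.
const hls = new Hls({
  abrEwmaFastLive: 2, // half-life for the "fast" live estimate (example value)
  abrEwmaSlowLive: 9, // half-life for the "slow" live estimate (current default)
});
```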

Resolves issues:

Partly fixes #3563. With this PR, the ABR algorithm is more likely to switch up from a low level (still only when there is sufficient bandwidth). Low-latency content on high-latency links is still unable to measure a suitable bitrate.

Checklist

  • changes have been done against master branch, and PR does not conflict
  • new unit / functional tests have been added (whenever applicable)
  • API or design changes are documented in API.md

@robwalch (Collaborator) left a comment

Hi @kanongil,

Thanks for the change suggestion. I have to think about this one a bit more. I can't accept it based on how it would impact all streams. For now, a custom abrController is probably the best solution for your use-case until we can address the issue holistically.
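For reference, the ABR controller can be swapped out through the Hls.js config. A minimal sketch of that wiring (the `CustomAbrController` class is hypothetical and would need to implement the same interface as the built-in AbrController):

```ts
import Hls from 'hls.js';
// Hypothetical class with custom bandwidth sampling; it must expose the same
// interface as hls.js's built-in AbrController (e.g. the nextAutoLevel logic).
import { CustomAbrController } from './custom-abr-controller';

const hls = new Hls({
  abrController: CustomAbrController, // passed as a class, not an instance
});
```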

@kanongil (Contributor, Author) commented Mar 18, 2021

> I can't accept it based on how it would impact all streams.

That is the point of this PR: to fix the most egregious issue of #3563.

Did you see the detailed note in the commit, which gives an example of just how broken the current estimator is?

With 5 s segments of 100,000 bytes:
Start: 10x 5 s transfers => 160 Kbps (reference)

Rate change (old):
1x 0.5 s transfer (1.6 Mbps) => 216 Kbps
vs
1x 0.05 s transfer (16 Mbps) => 222 Kbps (10x faster is only 4% more!!)

Rate change (new):
1x 0.5 s transfer (1.6 Mbps) => 627 Kbps
vs
1x 0.05 s transfer (16 Mbps) => 5.3 Mbps

I.e. a sample with a 10x bandwidth increase can mean just a 1.35x increase in the estimate over a 5-second interval, while a 100x bandwidth increase only makes it 1.39x!

An estimation rework is essential to ever get LL-HLS to work with ABR switching. This PR fixes the fundamental issue, as well as the high-bandwidth estimation issues of the current implementation.

I really hope you will prioritise a fix before 1.0.0.

Note that the new default abrEwmaFastLive value is not essential to the fix, and could be omitted from the PR. Hls.js will need a mechanism to lower it for smooth near-edge LL-HLS ABR playback, though. Maybe the abrEwmaFastLive value could be capped at 1/2 the current time (in seconds) until buffer exhaustion?
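A minimal sketch of that capping idea (hypothetical; the inputs are assumptions, not existing hls.js APIs):

```ts
// Hypothetical illustration of the suggestion above: derive the fast
// half-life from the time left before the forward buffer runs dry.
// `bufferAheadSec` and `configuredFastHalfLife` are assumed inputs.
function fastHalfLifeFor(bufferAheadSec: number, configuredFastHalfLife: number): number {
  const timeToExhaustionSec = bufferAheadSec; // at 1x playback rate
  return Math.min(configuredFastHalfLife, timeToExhaustionSec / 2);
}
```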

The note from the referenced commit:

The estimator factored the transfer duration into the weight. This means that
very fast transfers will have negligible influence on the measured average.

The minDelayMs_ was used to cap this effect, but it also capped the bitrate.
This meant that a 1000 byte transfer, transferred in 10ms, would register at a
rate of 160Kbps instead of 800Kbps (on top of the low weight).

The fix is to use the duration of the transferred fragment / part for the weight.
This preserves the effect of samples that match the line rate, while
significantly increasing the effect of fast transfers (due to small size, or
from a fast connection).

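To make the weighting change concrete, here is a simplified sketch of the two sampling strategies, loosely modelled on the hls.js EWMA bandwidth estimator. It is an illustration of the idea only; the class and parameter names are simplified, and the exact figures quoted above also depend on the configured half-lives.

```ts
// Simplified EWMA: the estimate decays toward each new sample, with the decay
// rate controlled by a half-life expressed in the same units as the weight.
class Ewma {
  private estimate = 0;
  private totalWeight = 0;
  constructor(private halfLife: number) {}

  sample(weight: number, value: number): void {
    const alpha = Math.pow(Math.exp(Math.log(0.5) / this.halfLife), weight);
    this.estimate = value * (1 - alpha) + alpha * this.estimate;
    this.totalWeight += weight;
  }

  getEstimate(): number {
    // Correct for the bias toward the initial zero estimate.
    const alphaTotal = Math.pow(Math.exp(Math.log(0.5) / this.halfLife), this.totalWeight);
    return alphaTotal < 1 ? this.estimate / (1 - alphaTotal) : this.estimate;
  }
}

// Old scheme (sketch): the weight is the transfer duration, and the duration
// is also clamped to a minimum delay, which caps the measurable bitrate.
function sampleOld(ewma: Ewma, transferMs: number, bytes: number, minDelayMs = 50): void {
  const clampedMs = Math.max(transferMs, minDelayMs);
  const bps = (8 * bytes) / (clampedMs / 1000);
  ewma.sample(clampedMs / 1000, bps); // a 10 ms transfer weighs almost nothing
}

// New scheme (this PR's first approach, sketched): the weight is the media
// duration of the fragment/part, and the bitrate uses the real transfer time.
function sampleNew(ewma: Ewma, transferMs: number, mediaDurationSec: number, bytes: number): void {
  const bps = (8 * bytes) / (transferMs / 1000);
  ewma.sample(mediaDurationSec, bps); // fast transfers now carry full weight
}
```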
@kanongil (Contributor, Author) commented:

I had another look at the estimation, and found that it can be simplified and made to work better if the weight is always a fixed value for each sample.

So my initial patch tried to use the fragment / part duration for the weight, which makes some sense, and certainly works a lot better than the current logic. However, it meant that the halfLife needed to be quite different for LL-HLS vs normal content.

I came to realise that there are essentially 2 modes when estimating, part loading vs. fragment loading, and both need to adjust for bandwidth changes in sample time. I.e. when close to the live edge, both modes have roughly 2 parts'/fragments' worth of time to react to a bandwidth change.

Based on this realisation, I changed the abrEwmaFast/Slow values to just represent samples. This means that the same value will work quite nicely for both part and fragment loading, and the implementation can be simplified. I converted from the current fast=3 & slow=9 values using a normalised 6 second fragment duration. Besides improving the estimation responsiveness of part loading, I expect it will also work better for fragment loading on playlists with unusually high or low fragment durations, without tweaking abrEwmaFast/Slow.

As part of this revision I also removed the Live/VOD distinction of the config values. I did this since the default values were already the same, and because adjusting the values based on the playlist type is a very simplistic approach. It makes much more sense to adjust the values based on the current buffer level. Both live and VOD playback can have low and high buffer levels due to bandwidth conditions. The current values are tuned to a low buffer level, so it would be prudent to detect a high buffer level and raise the values dynamically, to avoid quality dips from a temporary bandwidth blip. This is probably outside of the scope of the current patch, though.
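Continuing the sketch from the commit note above, the revised idea amounts to giving every sample the same weight, so the half-life counts samples rather than seconds. The wiring and the conversion shown are illustrative assumptions, not the PR's exact code:

```ts
// Fixed per-sample weight: reuse the Ewma sketch above, passing weight = 1 for
// every sample, so a half-life of 1.5 means "the newest ~1.5 loads dominate",
// whether each load is a short LL-HLS part or a full 6 s fragment.
function sampleFixedWeight(ewma: Ewma, transferMs: number, bytes: number): void {
  const bps = (8 * bytes) / (transferMs / 1000);
  ewma.sample(1, bps);
}

// One plausible reading of the conversion mentioned above, normalising the old
// time-based half-lives to a 6 s fragment duration (assumption, not the PR's numbers):
const fast = new Ewma(3 / 6); // ≈ 0.5 samples
const slow = new Ewma(9 / 6); // ≈ 1.5 samples
```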

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the Stale label Apr 16, 2022
stale bot commented Apr 19, 2022

This issue has been automatically closed because it has not had recent activity. If this issue is still valid, please ping a maintainer and ask them to label it accordingly.

@stale stale bot closed this Apr 19, 2022
@robwalch robwalch reopened this Apr 19, 2022
@robwalch robwalch removed the Stale label Apr 19, 2022
@robwalch robwalch added this to In progress in Low-Latency HLS (LL-HLS) via automation Apr 19, 2022
@robwalch robwalch moved this from In progress to Awaiting Triage in Low-Latency HLS (LL-HLS) Jul 14, 2022
@robwalch (Collaborator) left a comment

Not a Contribution

What needs to happen is that we create separate estimates for round-trip time (RTT, or time to first byte, depending on browser JS API availability) and bandwidth. RTT will remain consistent between small and large segments from the same origin, and will inversely impact the estimate of time to load/append/play created by the ABR controller and used for variant switching. Splitting out these estimates based on available loader data and using them in our current stall/switch timing calculations should improve selection for streams with shorter (partial) segments.
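A sketch of how split estimates could feed the switch-timing math (a hypothetical illustration of the idea, not the implementation that later landed in #4825):

```ts
// Hypothetical illustration: predict the fetch time for a candidate level by
// combining a per-origin TTFB estimate with a throughput estimate, instead of
// folding round-trip time into a single blended bandwidth figure.
function estimatedFetchTimeSec(
  segmentBytes: number,
  ttfbEstimateSec: number, // roughly constant per origin, independent of segment size
  bandwidthEstimateBps: number // throughput measured after the first byte
): number {
  return ttfbEstimateSec + (8 * segmentBytes) / bandwidthEstimateBps;
}

// For short LL-HLS parts the TTFB term dominates, so a single blended
// "bandwidth" number would understate what the connection can sustain.
```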

robwalch added a commit that referenced this pull request Aug 4, 2022

Improve bandwidth estimation and adaptive switching with smaller segments and higher TTFB

Fixes #3578 (special thanks to @Oleksandr0xB for submitting #4283)
Fixes #3563 and Closes #3595 (special thanks to @kanongil)
@robwalch robwalch moved this from Awaiting Triage to In progress in Low-Latency HLS (LL-HLS) Aug 4, 2022
robwalch added a commit that referenced this pull request Aug 5, 2022 (same commit message as above)
@robwalch (Collaborator) commented Nov 4, 2022

This PR has been replaced by #4825

@robwalch robwalch closed this Nov 4, 2022
@robwalch robwalch removed this from In progress in Low-Latency HLS (LL-HLS) Nov 4, 2022
robwalch added a commit that referenced this pull request Nov 28, 2022 (same commit message as above)
silltho pushed a commit to silltho/hls.js that referenced this pull request Dec 2, 2022 (same commit message as above)
robwalch added a commit that referenced this pull request Jan 9, 2023 (same commit message as above)
robwalch added a commit that referenced this pull request Jan 12, 2023 (same commit message as above)
robwalch added a commit that referenced this pull request Jan 18, 2023
* Improve bandwidth estimation and adaptive switching with smaller segments and higher TTFB
Fixes #3578 (special thanks to @Oleksandr0xB for submitting #4283)
Fixes #3563 and Closes #3595 (special thanks to @kanongil)

* Load rate and loaded delay calculation fixes

* Convert ttfbEstimate to seconds

* Include main variant init segments in TTFB sampling

* Use ttfb estimate in abandon rules down-switch timing
robwalch added a commit that referenced this pull request Jan 26, 2023 (same commit message as above)
Successfully merging this pull request may close the linked issue: "Rate estimation is fundamentally broken" (#3563)