
Fix issue 31-media segment must have frame for all AV tracks #43

Closed · wants to merge 1 commit

Conversation

wolenetz (Member) commented Jan 5, 2016

Happy New Year!

@jdsmith3000: please take a look. While I think this closes the gap I noted in the original w3c bug 29188, I didn't include any "roughly equal duration" or "roughly same starting [decode?] timestamp" language in the non-normative note, since those should be taken care of by the explicit, normative logic in the coded frame processing algorithm's discontinuity detection. If you know of any remaining gaps in that logic related to this, please let me know. Otherwise, I prefer keeping this fix narrow and not introducing potentially confusing or conflicting duration/timestamp language in the new note.

@acolwell: please take a look at the WebM bytestream format portion of this change. The intent is just to make it clear that the requirement that at least one coded frame exist in each media segment for every A/V track is now common to all bytestream formats (it is no longer specific to the WebM bytestream format).

I also included a quick reference in the main spec's changelog to plh@'s recent heartbeat editorial fixes.

wolenetz (Member, Author) commented Jan 7, 2016

Per a separate email from @jdsmith3000, this PR is now being reviewed.

jdsmith3000 (Contributor) commented

@wolenetz: I would like to confirm the behavior with and without this change. With it, muxed streams that lack frame data in a given audio or video track in a segment will trigger an error and signal end of stream, for what could be a problem in a single segment. Our current implementation plays through this condition with some artifacts and continues parsing subsequent segments. The difference in parsing between zero and one coded frames per track doesn't seem obvious to us. Can you elaborate a bit on why you believe it has benefit? I recall the discussion about having objective criteria, but we still don't see the benefit of adding this error for missing track data.

From our view, it would be most beneficial to encourage equal-duration data across tracks, though even there I'm not sure that signaling end of stream on unequal durations would be desirable.

wolenetz (Member, Author) commented Jan 7, 2016

Without a change like the one proposed here, edge cases are left open that may not interoperate well.

Some examples (all with a muxed A/V SourceBuffer, with 1 audio track and 1 video track):

Example a)

Append a video-only media segment that has valid coded frame groups in time range [2000,2200).
--> Should a seek to time 2000 play without stalling? No. There is not yet any corresponding audio data for that time range. However, I'm not sure API implementations agree on this.
Next, append an audio-only media segment that has valid coded frame groups in time range [0,2000).
--> Should a seek to time 0 play without stalling? No. There is not yet any corresponding video data at the seek target. Again, I'm not sure API implementations agree on this. For instance, Chrome allows up to a 1000 ms jagged start from time 0 across the audio and video streams without stalling, since many media streams don't align the first coded frames of both their audio and video tracks at time 0. (A sketch of this append sequence follows below.)
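To make the sequence concrete, here is a minimal sketch of example a) in TypeScript. The segment file names and codec strings are hypothetical, and init-segment appends and error handling are omitted:

```ts
// Hypothetical illustration of example a): a video-only append, then an
// audio-only append, into a single muxed A/V SourceBuffer.
async function exampleA(videoEl: HTMLVideoElement): Promise<void> {
  const mediaSource = new MediaSource();
  videoEl.src = URL.createObjectURL(mediaSource);
  await new Promise(r => mediaSource.addEventListener('sourceopen', r, { once: true }));

  // Codec string is illustrative; a real page would match its content.
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');
  const append = async (url: string) => {
    sb.appendBuffer(await (await fetch(url)).arrayBuffer());
    await new Promise(r => sb.addEventListener('updateend', r, { once: true }));
  };

  await append('video_2000_2200.mp4'); // video-only, covers [2000, 2200)
  // Expected per the argument above: this seek stalls, since no audio is
  // buffered in [2000, 2200) yet.
  videoEl.currentTime = 2000;

  await append('audio_0_2000.mp4');    // audio-only, covers [0, 2000)
  // Expected: this seek also stalls, since no video exists at time 0
  // (modulo leniency such as Chrome's ~1000 ms jagged-start allowance).
  videoEl.currentTime = 0;
}
```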

Example b)
Append a muxed A/V media segment that has valid coded frame groups for:
audio in time range [0,2000), and video in time range [2000,2200).
--> Should a seek to time 0 play without stalling? Yes. Chrome, at least, uses the media segment start time in a muxed SourceBuffer to imply the beginning of the group (and the coded frame processing algorithm's group start timestamp, group end timestamp, etc. appear to reinforce this). SourceBuffer.buffered should return a TimeRanges containing one range, [0,2000): the intersection of the audio and video buffered ranges, taking the media segment start time into account.
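And a matching sketch of example b), under the same assumptions (hypothetical file name, illustrative codec string, init segment and error handling omitted):

```ts
// Hypothetical illustration of example b): one muxed media segment whose
// audio covers [0, 2000) and whose video covers [2000, 2200).
async function exampleB(mediaSource: MediaSource): Promise<void> {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');
  sb.appendBuffer(await (await fetch('muxed_av.mp4')).arrayBuffer());
  await new Promise(r => sb.addEventListener('updateend', r, { once: true }));

  // Per the behavior described above, this should log one range, [0, 2000):
  // the media segment start time implies the start of the group, so the
  // video track is not treated as beginning only at 2000.
  for (let i = 0; i < sb.buffered.length; i++) {
    console.log(`buffered[${i}] = [${sb.buffered.start(i)}, ${sb.buffered.end(i)})`);
  }
}
```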

Notably, the same coded frames were appended for each track in both examples, yet example a) stalls on a seek-and-play from time 0, and even that stall may not be interoperably implemented.

I'm in favor of making it clear in the spec what should play interoperably, with or without stalling, and what SourceBuffer.buffered should return in each of these example scenarios. I believe the change I've proposed simplifies these edge cases, but if it introduces playback failure for common MSE API users, I would of course be interested in alternative clarifications. Based on our preliminary chats, we think such playback failure would not be common.

jdsmith3000 (Contributor) commented

@wolenetz: We see this more as a quality-of-implementation issue than one that should be formalized in MSE. For both your example a) and example b) (one muxed and one not), our MSE implementation should play. If no data is available across all tracks, then we stop and wait for data to be appended.

As you note, issue #31 feels like a problem that would not commonly be encountered, though we do think it occurs in real-world content, sometimes intentionally. Low-frame-rate videos (e.g., a slideshow) can miss entire segments, and we believe it's relatively common at the conclusion of movies for audio or video to end first. We'd prefer not to make MSE changes that could break these scenarios, and advocate instead for playing content whenever possible, even if it's audio- or video-only through gaps.

wolenetz (Member, Author) commented Jan 8, 2016

@jdsmith3000, web authors want interoperability. If the spec is unclear about, for example, what the buffered ranges are in example a), authors risk playback quality either way (perhaps they do want to stall until media is available for all A/V streams; perhaps they don't). Ideally, authors shouldn't need to detect the browser vendor to condition their expectations for scenarios like this.
IMHO, it isn't really a quality-of-implementation matter if the spec isn't clarified to at least guide authors on what to do and what to expect in situations like this.
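As a hedged illustration of that author-side cost: absent interoperable buffered-range semantics, pages may resort to defensive checks like the hypothetical helper below (or worse, user-agent sniffing) before seeking:

```ts
// Hypothetical defensive helper an author might write today: only seek
// when the target time falls inside a reported buffered range.
function canSeekTo(el: HTMLMediaElement, t: number): boolean {
  const b = el.buffered;
  for (let i = 0; i < b.length; i++) {
    if (t >= b.start(i) && t < b.end(i)) return true;
  }
  return false;
}
```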
Furthermore, the existing MSE WebM bytestream spec already disallows this scenario; #31 was meant to make that pre-existing WebM restriction, against media segments missing coded frames for any A/V track in a muxed SourceBuffer, common across all bytestream formats.

I'm not against rolling back the pre-existing WebM restriction, but the spec would still need some clarification around what is expected (e.g., what would group start timestamp and group end timestamp look like across the AppendMode transition that might occur in the middle of single-stream media segment appends to a muxed SourceBuffer?). Do you have a suggestion for how to word this in the spec so that it both improves interoperability expectations for web authors and avoids regressing the scenarios you describe (low-frame-rate videos, jagged-ended A/V, etc.)?

jdsmith3000 (Contributor) commented

@wolenetz: My read of your change is that it primarily requires at least one frame of data in each SourceBuffer track for playback to continue. We believe playback should continue if a single track has at least one frame. We expect the majority of content to be well-formed, and so would prefer to play unless all tracks are missing data. This is what we strive to do in Edge today.

If there are other aspects of the change that I've not been discussing, please highlight them for me.

wolenetz (Member, Author) commented

@jdsmith3000, for well-formed muxed content, this change makes no difference. The change is meant to give interoperable predictability to "not well-formed" muxed content. If it's too strict, I would be interested in alternate text.

Somewhat related: how does Edge behave if there is one SourceBuffer each for an audio track and a video track, and one of the SourceBuffers has a discontinuity (per the coded frame processing algorithm) at time X? My understanding of HTML5 and MSE is that this should cause a playback stall at time X. Does Edge play through the discontinuity without stalling?
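For concreteness, a minimal sketch of that configuration (the codec strings are illustrative, and the appends that would create the gap are elided):

```ts
// Hypothetical setup: separate audio and video SourceBuffers on one
// MediaSource. If videoSb is left with a gap (a discontinuity per the
// coded frame processing algorithm) at time X, the question above is
// whether playback stalls at X or plays through it.
const ms = new MediaSource();
ms.addEventListener('sourceopen', () => {
  const audioSb = ms.addSourceBuffer('audio/mp4; codecs="mp4a.40.2"');
  const videoSb = ms.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  // ...append init and media segments to audioSb and videoSb here,
  // leaving videoSb with no coded frames spanning time X...
});
```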

wolenetz (Member, Author) commented

I'm going to close this PR (without merging it), since we don't want to regress the lenient behavior some user agents (Edge, and Chrome soon) have for less-than-well-formed muxed A/V streams that may not have coded frames for every audio and video track in each media segment. One ad-hoc example where this leniency might be required is a low-latency muxed live stream where the video frame rate might be low but audio needs to be rendered at low latency. For at least the ISO BMFF bytestream format, this implies using very small moof boxes, not all of which might contain video.

Thanks for engaging in this discussion, Jerry, and helping prevent regressing at least this example scenario that could be important for some MSE API users :)

wolenetz closed this on Feb 23, 2016
wolenetz deleted the fix_issue_31_media_segment_definition branch on Apr 11, 2016