
Fix issue 31-media segment must have frame for all AV tracks #43

Closed · wants to merge 1 commit

Conversation

wolenetz (Member) commented Jan 5, 2016

Happy New Year!

@jdsmith3000: please take a look. While I think this closes the gap I noted in the original w3c bug 29188, I didn't include any "roughly equal duration" or "roughly same starting [decode?] timestamp" language in the non-normative note, since those should be taken care of by the explicit, normative logic in the coded frame processing algorithm's discontinuity detection. If you know of any remaining gaps in that logic related to this, please let me know. Otherwise, I prefer keeping this fix narrow and not introducing potentially confusing or conflicting duration/timestamp language in the new note.

@acolwell: please take a look at the WebM bytestream format portion of this change. The intent is just to make it clear that the requirement that at least one coded frame exist in each media segment for every A/V track is now common to all bytestream formats (it is no longer specific to the WebM bytestream format).

I also included a quick reference in the main spec's changelog to plh@'s recent heartbeat editorial fixes.

wolenetz (Member, Author) commented Jan 7, 2016

Per a separate email from @jdsmith3000, this PR is now being reviewed.

jdsmith3000 (Contributor) commented

@wolenetz: I would like to confirm the behavior with and without this change. With it, muxed streams that lack frame data in a given audio or video track in a segment will trigger an error and signal end of stream, for what could be a problem in a single segment. Our current implementation plays through this condition with some artifacts and continues parsing subsequent segments. The difference in parsing between zero and one coded frames per track doesn't seem obvious to us. Can you elaborate a bit on why you believe it has benefit? I recall the discussion about having objective criteria, but we still don't see the benefit of adding this error for missing track data.

From our view, it would be most beneficial to encourage equal-duration data across tracks, though even there I'm not sure that signaling end of stream on unequal durations would be desirable.

wolenetz (Member, Author) commented Jan 7, 2016

Without a change like the one proposed here, edge cases are left open that may not interoperate well.

Some examples (all with a muxed A/V SourceBuffer, with 1 audio track and 1 video track):

Example a)

Append a video-only media segment that has valid coded frame groups in time range [2000,2200).
--> Should a seek to time 2000 play without stalling? No. There is not yet any corresponding audio data for that time range. However, I'm not sure API implementations agree on this.
Next, append an audio-only media segment that has valid coded frame groups in time range [0,2000).
--> Should a seek to time 0 play without stalling? No. There is not yet any corresponding video data at the seek target. Again, I'm not sure API implementations agree on this. For instance, Chrome allows up to a 1000 ms jagged start from time 0 across the audio and video streams without stalling, since many media streams don't align the first coded frames of both their audio and video tracks at time 0. (A sketch of this append sequence follows below.)
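To make the sequence concrete, here is a minimal sketch of example a) in TypeScript. The segment file names and codec strings are hypothetical, and init-segment appends and error handling are omitted:

```ts
// Hypothetical illustration of example a): a video-only append, then an
// audio-only append, into a single muxed A/V SourceBuffer.
async function exampleA(videoEl: HTMLVideoElement): Promise<void> {
  const mediaSource = new MediaSource();
  videoEl.src = URL.createObjectURL(mediaSource);
  await new Promise(r => mediaSource.addEventListener('sourceopen', r, { once: true }));

  // Codec string is illustrative; a real page would match its content.
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');
  const append = async (url: string) => {
    sb.appendBuffer(await (await fetch(url)).arrayBuffer());
    await new Promise(r => sb.addEventListener('updateend', r, { once: true }));
  };

  await append('video_2000_2200.mp4'); // video-only, covers [2000, 2200)
  // Expected per the argument above: this seek stalls, since no audio is
  // buffered in [2000, 2200) yet.
  videoEl.currentTime = 2000;

  await append('audio_0_2000.mp4');    // audio-only, covers [0, 2000)
  // Expected: this seek also stalls, since no video exists at time 0
  // (modulo leniency such as Chrome's ~1000 ms jagged-start allowance).
  videoEl.currentTime = 0;
}
```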

Example b)
Append a muxed A/V media segment that has valid coded frame groups for:
audio in time range [0,2000), and video in time range [2000,2200).
--> Should a seek to time 0 play without stalling? Yes. Chrome, at least, uses the media segment start time in a muxed SourceBuffer to imply the beginning of the group (and the coded frame processing algorithm's group start timestamp, group end timestamp, etc. appear to reinforce this). SourceBuffer.buffered should return a TimeRanges containing one range, [0,2000): the intersection of the audio and video buffered ranges, taking the media segment start time into account.
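And a matching sketch of example b), under the same assumptions (hypothetical file name, illustrative codec string, init segment and error handling omitted):

```ts
// Hypothetical illustration of example b): one muxed media segment whose
// audio covers [0, 2000) and whose video covers [2000, 2200).
async function exampleB(mediaSource: MediaSource): Promise<void> {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E, mp4a.40.2"');
  sb.appendBuffer(await (await fetch('muxed_av.mp4')).arrayBuffer());
  await new Promise(r => sb.addEventListener('updateend', r, { once: true }));

  // Per the behavior described above, this should log one range, [0, 2000):
  // the media segment start time implies the start of the group, so the
  // video track is not treated as beginning only at 2000.
  for (let i = 0; i < sb.buffered.length; i++) {
    console.log(`buffered[${i}] = [${sb.buffered.start(i)}, ${sb.buffered.end(i)})`);
  }
}
```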

Notably, the same coded frames were appended for each track in both examples, yet example a) stalls on a seek-and-play from time 0, and even that stall may not be interoperably implemented.

I'm in favor of making it clear in the spec what should play interoperably, with or without stalling, and what SourceBuffer.buffered should return in each of these example scenarios. I believe the change I've proposed simplifies these edge cases, but if it introduces playback failure for common MSE API users, I would of course be interested in alternative clarifications. Based on our preliminary chats, we think such playback failure would not be common.

jdsmith3000 (Contributor) commented

@wolenetz: We see this more as a quality-of-implementation issue than one that should be formalized in MSE. For both your example a) and example b) (one muxed and one not), our MSE implementation should play. If no data is available across all tracks, then we stop and wait for data to be appended.

As you note, issue #31 feels like a problem that would not commonly be encountered, though we do think it occurs in real-world content, sometimes intentionally. Low-frame-rate videos (e.g., a slideshow) can miss entire segments, and we believe it's relatively common at the conclusion of movies for audio or video to end first. We'd prefer not to make MSE changes that could break these scenarios, and advocate instead for playing content whenever possible, even if it's audio- or video-only through gaps.

wolenetz (Member, Author) commented Jan 8, 2016

@jdsmith3000, web authors want interoperability. If the spec is unclear about, for example, what the buffered ranges are in example a), authors risk playback quality either way (perhaps they do want to stall until media is available for all A/V streams; perhaps they don't). Ideally, authors shouldn't need to detect the browser vendor to condition their expectations for scenarios like this.
IMHO, it isn't really a quality-of-implementation matter if the spec isn't clarified to at least guide authors on what to do and what to expect in situations like this.
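As a hedged illustration of that author-side cost: absent interoperable buffered-range semantics, pages may resort to defensive checks like the hypothetical helper below (or worse, user-agent sniffing) before seeking:

```ts
// Hypothetical defensive helper an author might write today: only seek
// when the target time falls inside a reported buffered range.
function canSeekTo(el: HTMLMediaElement, t: number): boolean {
  const b = el.buffered;
  for (let i = 0; i < b.length; i++) {
    if (t >= b.start(i) && t < b.end(i)) return true;
  }
  return false;
}
```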
Furthermore, the existing MSE WebM bytestream spec already disallows this scenario; #31 was meant to make that pre-existing WebM restriction, against media segments missing coded frames for any A/V track in a muxed SourceBuffer, common across all bytestream formats.

I'm not against rolling back the pre-existing WebM restriction, but the spec would still need some clarification around what is expected (e.g., what would group start timestamp and group end timestamp look like across the AppendMode transition that might occur in the middle of single-stream media segment appends to a muxed SourceBuffer?). Do you have a suggestion for how to word this in the spec so that it both improves interoperability expectations for web authors and avoids regressing the scenarios you describe (low-frame-rate videos, jagged-ended A/V, etc.)?

jdsmith3000 (Contributor) commented

@wolenetz: My read of your change is that it primarily requires at least one frame of data in each SourceBuffer track for playback to continue. We believe playback should continue if a single track has at least one frame. We expect the majority of content to be well-formed, and so would prefer to play unless all tracks are missing data. This is what we strive to do in Edge today.

If there are other aspects of the change that I've not been discussing, please highlight them for me.

wolenetz (Member, Author) commented

@jdsmith3000, for well-formed muxed content, this change makes no difference. The change is meant to give interoperable predictability to "not well-formed" muxed content. If it's too strict, I would be interested in alternate text.

Somewhat related: how does Edge behave if there is one SourceBuffer each for an audio track and a video track, and one of the SourceBuffers has a discontinuity (per the coded frame processing algorithm) at time X? My understanding of HTML5 and MSE is that this should cause a playback stall at time X. Does Edge play through the discontinuity without stalling?
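For concreteness, a minimal sketch of that configuration (the codec strings are illustrative, and the appends that would create the gap are elided):

```ts
// Hypothetical setup: separate audio and video SourceBuffers on one
// MediaSource. If videoSb is left with a gap (a discontinuity per the
// coded frame processing algorithm) at time X, the question above is
// whether playback stalls at X or plays through it.
const ms = new MediaSource();
ms.addEventListener('sourceopen', () => {
  const audioSb = ms.addSourceBuffer('audio/mp4; codecs="mp4a.40.2"');
  const videoSb = ms.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  // ...append init and media segments to audioSb and videoSb here,
  // leaving videoSb with no coded frames spanning time X...
});
```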

wolenetz (Member, Author) commented

I'm going to close this PR (without merging it), since we don't want to regress the lenient behavior some user agents (Edge, and Chrome soon) have for less-than-well-formed muxed A/V streams that may not have coded frames for every audio and video track in each media segment. One ad-hoc example where this leniency might be required is a low-latency muxed live stream where the video frame rate might be low but audio needs to be rendered at low latency. For at least the ISO BMFF bytestream format, this implies using very small moof boxes, not all of which might contain video.

Thanks for engaging in this discussion, Jerry, and helping prevent regressing at least this example scenario that could be important for some MSE API users :)

wolenetz closed this on Feb 23, 2016
wolenetz deleted the fix_issue_31_media_segment_definition branch on Apr 11, 2016