Add use case for improving sync accuracy
Also:

- Restructured gap analysis section
- Mention use of VTTCue for out of band caption rendering
chrisn committed Jan 31, 2019
1 parent adfbeb4 commit af22812
Showing 1 changed file with 145 additions and 138 deletions.
index.html
@@ -126,7 +126,7 @@
requirements. The goal is to extend the existing support in HTML for
text track cue events to add support for dynamic content replacement
cues and generic metadata events that drive synchronized interactive
media experiences, and improve synchronization timing accuracy.
</p>
</section>
<section id="sotd">
@@ -138,8 +138,8 @@ <h2>Introduction</h2>
events synchronized to audio or video media, specifically for both
<a>out-of-band</a> event streams and <a>in-band</a> discrete events
(for example, MPD and <code>emsg</code> events in MPEG-DASH).
These <em>media timed events</em> can be used to support use cases
such as dynamic content replacement, ad insertion, or presentation of
supplemental content alongside the audio or video, or more generally,
making changes to a web page, or executing application code triggered
from JavaScript events, at specific points on the <a>media timeline</a>
@@ -245,6 +245,18 @@ <h3>MPEG-DASH manifest expiry notifications</h3>
against the [[WEB-MEDIA-GUIDELINES]]. TODO: Add detail here.
</p>
</section>
<section>
<h3>Subtitle and caption rendering synchronization</h3>
<p>
A subtitle or caption author wants to ensure that subtitle changes
are aligned as closely as possible to shot changes in the video.
The BBC Subtitle Guidelines [[BBC-SUBTITLES]] describe authoring
best practices. In particular, in section 6.1 authors are advised
"it is likely to be less tiring for the viewer if shot changes
and subtitle changes occur at the same time. Many subtitles therefore
start on the first frame of the shot and end on the last frame."
</p>
</section>
<section>
<h3>Synchronized map animations</h3>
<p>
@@ -437,31 +449,6 @@ <h3>DASH Industry Forum APIs for Interactivity</h3>
<a href="https://www.w3.org/2018/08/20-me-minutes.html">Minutes</a>.
</p>
</section>
<section>
<h3>SCTE-35</h3>
<p>
@@ -527,6 +514,11 @@ <h3>WebVTT</h3>
event data to a string format (JSON, for example) when creating the
cue, and deserializing the data when the cue is triggered.
</p>
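<p>
For example, a web application could use this approach as follows.
This is an illustrative sketch; the event payload shown is
hypothetical:
</p>
<pre class="example">
const video = document.querySelector('video');
const track = video.addTextTrack('metadata');

// Serialize the event data to JSON when creating the cue.
const data = { action: 'show-overlay', url: 'https://example.com/overlay' };
const cue = new VTTCue(30.0, 40.0, JSON.stringify(data));

// Deserialize the data when the cue is triggered.
cue.onenter = () => {
  const payload = JSON.parse(cue.text);
  // Update the page at this point on the media timeline.
};

track.addCue(cue);
</pre>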
<p>
Web applications can also use <code>VTTCue</code> to trigger
rendering of timed text cues delivered <a>out-of-band</a>, such as
TTML or IMSC format captions.
</p>
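<p>
A minimal sketch of this approach follows, assuming the application
parses the TTML or IMSC document itself. <code>parsedCaptions</code>,
<code>renderCaption</code>, and <code>clearCaption</code> are
hypothetical application-defined code:
</p>
<pre class="example">
const track = video.addTextTrack('metadata');

// Create one cue per caption, timed from the parsed document, and
// use the cue enter/exit events to drive the application's renderer.
for (const caption of parsedCaptions) {
  const cue = new VTTCue(caption.begin, caption.end, '');
  cue.onenter = () => renderCaption(caption);
  cue.onexit = () => clearCaption(caption);
  track.addCue(cue);
}
</pre>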
</section>
</section>
<section>
@@ -539,118 +531,133 @@ <h2>Gap analysis</h2>
associated limitations.
</p>
<section>
<h3>Synchronized event triggering</h3>
<h4>MPEG-DASH and ISO BMFF emsg events</h4>
<p>
The <code>DataCue</code> API has been previously discussed as a means to
deliver <a>in-band</a> event data to web applications, but this is not implemented
in all of the main browser engines. It is <a href="https://www.w3.org/TR/2018/WD-html53-20181018/semantics-embedded-content.html#text-tracks-exposing-inband-metadata">included</a>
in the 18 October 2018 HTML 5.3 draft [[HTML53-20181018]], but is
<a href="https://html.spec.whatwg.org/multipage/media.html#timed-text-tracks">not included</a>
in [[HTML]]. See discussion <a href="https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/U06zrT2N-Xk">here</a>
and notes on implementation status <a href="https://lists.w3.org/Archives/Public/public-html/2016Apr/0005.html">here</a>.
</p>
<p>
WebKit <a href="https://discourse.wicg.io/t/media-timed-events-api-for-mpeg-dash-mpd-and-emsg-events/3096/2">supports</a>
a <code>DataCue</code> interface that extends HTML5 <code>DataCue</code>
with two attributes to support non-text metadata, <code>type</code> and
<code>value</code>.
</p>
<pre class="example">
interface DataCue : TextTrackCue {
attribute ArrayBuffer data; // Always empty

// Proposed extensions.
attribute any value;
readonly attribute DOMString type;
};
</pre>
<p>
<code>type</code> is a string identifying the type of metadata:
</p>
<table class="simple">
<thead>
<tr>
<th colspan="2">WebKit <code>DataCue</code> metadata types</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>"com.apple.quicktime.udta"</code></td>
<td>QuickTime User Data</td>
</tr>
<tr>
<td><code>"com.apple.quicktime.mdta"</code></td>
<td>QuickTime Metadata</td>
</tr>
<tr>
<td><code>"com.apple.itunes"</code></td>
<td>iTunes metadata</td>
</tr>
<tr>
<td><code>"org.mp4ra"</code></td>
<td>MPEG-4 metadata</td>
</tr>
<tr>
<td><code>"org.id3"</code></td>
<td>ID3 metadata</td>
</tr>
</tbody>
</table>
<p>
and <code>value</code> is an object with the metadata item key, data, and optionally a locale:
</p>
<pre class="example">
value = {
key: String
data: String | Number | Array | ArrayBuffer | Object
locale: String
}
</pre>
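<p>
A web application could consume these cues from a metadata text
track, for example. This is a sketch, assuming a user agent that
exposes <a>in-band</a> metadata using WebKit's proposed
<code>DataCue</code> extensions:
</p>
<pre class="example">
video.textTracks.addEventListener('addtrack', (event) => {
  const track = event.track;
  if (track.kind !== 'metadata') return;
  track.mode = 'hidden'; // receive cue events without native rendering
  track.oncuechange = () => {
    for (const cue of Array.from(track.activeCues)) {
      // type identifies the metadata scheme, e.g., "org.id3".
      console.log(cue.type, cue.value.key, cue.value.data);
    }
  };
});
</pre>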
<p>
Neither [[MSE-BYTE-STREAM-FORMAT-ISOBMFF]] nor [[INBANDTRACKS]] describe
handling of <code>emsg</code> boxes.
</p>
<p>
On resource constrained devices such as smart TVs and streaming sticks,
parsing media segments to extract event information leads to a significant
performance penalty, which can have an impact on UI rendering updates if
this is done on the UI thread. There can also be an impact on the battery
life of mobile devices. Given that the media segments will be parsed anyway
by the user agent, parsing in JavaScript is an expensive overhead that
could be avoided.
</p>
<p>
[[HBBTV]] section 9.3.2 describes a mapping between the <code>emsg</code>
fields described <a href="#mpeg-dash">above</a>
and the <a href="https://html.spec.whatwg.org/multipage/media.html#texttrack"><code>TextTrack</code></a>
and <a href="https://www.w3.org/TR/2018/WD-html53-20180426/semantics-embedded-content.html#datacue"><code>DataCue</code></a>
APIs. A <code>TextTrack</code> instance is created for each event
stream signalled in the MPD document (as identified by the
<code>schemeIdUri</code> and <code>value</code>), and the
<a href="https://html.spec.whatwg.org/multipage/media.html#dom-texttrack-inbandmetadatatrackdispatchtype"><code>inBandMetadataTrackDispatchType</code></a>
<code>TextTrack</code> attribute contains the <code>scheme_id_uri</code>
and <code>value</code> values. Because HbbTV devices include a native
DASH client, parsing of the MPD document and creation of the
<code>TextTrack</code>s is done by the user agent, rather than by
application JavaScript code.
</p>
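<p>
Under this mapping, application code can select the event stream it
is interested in by inspecting each track's dispatch type, for
example. This is a sketch; the dispatch type string shown is
illustrative, and <code>handleEventStreamCues</code> is a
hypothetical application-defined handler:
</p>
<pre class="example">
for (const track of Array.from(video.textTracks)) {
  // Matches the scheme_id_uri and value signalled in the MPD document.
  if (track.inBandMetadataTrackDispatchType === 'urn:example:scheme 1') {
    track.mode = 'hidden';
    track.oncuechange = handleEventStreamCues;
  }
}
</pre>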
<p class="ednote">
To support DASH clients implemented in web applications, there is
therefore either a need for an API that allows applications to tell
the UA which schemes they want to receive, or the UA should simply
expose all event streams to applications. Which of these is preferred?
</p>
</section>
<section>
<h3>Synchronization of text track cue rendering</h3>
<p>
Subtitles for video are typically authored against video at
a nominal frame rate, e.g., 25 frames per second, which corresponds to
40 milliseconds per frame. The actual video frame rate may be adjusted
dynamically according to the video encoding, but the subtitle timing
must remain the same ([[EBU-TT-D]], Annex E).
</p>
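<p>
To illustrate the arithmetic, cue times can be aligned to the nominal
frame grid. <code>alignToFrame</code> is a hypothetical helper, not
part of any specification:
</p>
<pre class="example">
const FRAME_RATE = 25;                  // nominal frames per second
const FRAME_DURATION = 1 / FRAME_RATE;  // 40 milliseconds per frame

// Align a media timeline position (in seconds) to the frame grid.
function alignToFrame(time) {
  return Math.floor(time * FRAME_RATE) * FRAME_DURATION;
}

alignToFrame(12.345); // 12.32, the start of frame 308
</pre>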
<p>
Where captions are rendered by application JavaScript code, in
response to <code>VTTCue</code> or <code>TextTrackCue</code> events,
this places a requirement on user agents for timely delivery of these
events, so that application code can respond and render the cues.
</p>
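<p>
The synchronization accuracy actually achieved can be estimated by
application code, for example by comparing the media playback
position against the cue's start time when the cue's
<code>enter</code> event fires. This is a sketch; results will vary
between user agents:
</p>
<pre class="example">
cue.onenter = () => {
  // A positive value indicates late delivery of the cue event.
  const lateness = video.currentTime - cue.startTime;
  console.log(`Cue event delivered ${(lateness * 1000).toFixed(1)} ms late`);
};
</pre>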
<p>
Reference: M&amp;E IG, Media Timed Events Task Force call 17 Dec 2018:
<a href="https://www.w3.org/2018/12/17-me-minutes.html#item06">Minutes</a>.
</p>
<p class="ednote">
TODO: The timing guarantees provided in [[HTML]] regarding the triggering of
<code>TextTrackCue</code> events may not be enough to avoid
<a href="https://lists.w3.org/Archives/Public/public-inbandtracks/2013Dec/0004.html">events being missed</a>.
Explain further.
</p>
</section>
<section>
<h3>Synchronized rendering of web resources</h3>
