Add use case for improving sync accuracy
Also:

- Restructured gap analysis section
- Mention use of VTTCue for out of band caption rendering
chrisn committed Jan 31, 2019
1 parent adfbeb4 commit af22812
Showing 1 changed file with 145 additions and 138 deletions.
index.html
@@ -126,7 +126,7 @@
requirements. The goal is to extend the existing support in HTML for
text track cue events to add support for dynamic content replacement
cues and generic metadata events that drive synchronized interactive
media experiences, and improve synchronization timing accuracy.
</p>
</section>
<section id="sotd">
@@ -138,8 +138,8 @@ <h2>Introduction</h2>
events synchronized to audio or video media, specifically for both
<a>out-of-band</a> event streams and <a>in-band</a> discrete events
(for example, MPD and <code>emsg</code> events in MPEG-DASH).
These <em>media timed events</em> can be used to support use cases
such as dynamic content replacement, ad insertion, or presentation of
supplemental content alongside the audio or video, or more generally,
making changes to a web page, or executing application code triggered
from JavaScript events, at specific points on the <a>media timeline</a>
@@ -245,6 +245,18 @@ <h3>MPEG-DASH manifest expiry notifications</h3>
against the [[WEB-MEDIA-GUIDELINES]]. TODO: Add detail here.
</p>
</section>
<section>
<h3>Subtitle and caption rendering synchronization</h3>
<p>
A subtitle or caption author wants to ensure that subtitle changes
are aligned as closely as possible to shot changes in the video.
The BBC Subtitle Guidelines [[BBC-SUBTITLES]] describe authoring
best practices. In particular, in section 6.1 authors are advised
"it is likely to be less tiring for the viewer if shot changes
and subtitle changes occur at the same time. Many subtitles therefore
start on the first frame of the shot and end on the last frame."
</p>
</section>
<section>
<h3>Synchronized map animations</h3>
<p>
@@ -437,31 +449,6 @@ <h3>DASH Industry Forum APIs for Interactivity</h3>
<a href="https://www.w3.org/2018/08/20-me-minutes.html">Minutes</a>.
</p>
</section>
<section>
<h3>SCTE-35</h3>
<p>
@@ -527,6 +514,11 @@ <h3>WebVTT</h3>
event data to a string format (JSON, for example) when creating the
cue, and deserializing the data when the cue is triggered.
</p>
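<p>
For example, a web application could use this approach as follows.
This is an illustrative sketch; the event payload shown is
hypothetical:
</p>
<pre class="example">
const video = document.querySelector('video');
const track = video.addTextTrack('metadata');

// Serialize the event data to JSON when creating the cue.
const data = { action: 'show-overlay', url: 'https://example.com/overlay' };
const cue = new VTTCue(30.0, 40.0, JSON.stringify(data));

// Deserialize the data when the cue is triggered.
cue.onenter = () => {
  const payload = JSON.parse(cue.text);
  // Update the page at this point on the media timeline.
};

track.addCue(cue);
</pre>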
<p>
Web applications can also use <code>VTTCue</code> to trigger
rendering of timed text cues delivered <a>out-of-band</a>, such as
TTML or IMSC format captions.
</p>
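<p>
A minimal sketch of this approach follows, assuming the application
parses the TTML or IMSC document itself. <code>parsedCaptions</code>,
<code>renderCaption</code>, and <code>clearCaption</code> are
hypothetical application-defined code:
</p>
<pre class="example">
const track = video.addTextTrack('metadata');

// Create one cue per caption, timed from the parsed document, and
// use the cue enter/exit events to drive the application's renderer.
for (const caption of parsedCaptions) {
  const cue = new VTTCue(caption.begin, caption.end, '');
  cue.onenter = () => renderCaption(caption);
  cue.onexit = () => clearCaption(caption);
  track.addCue(cue);
}
</pre>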
</section>
</section>
<section>
@@ -539,118 +531,133 @@ <h2>Gap analysis</h2>
associated limitations.
</p>
<section>
<h3>Synchronized event triggering</h3>
<h4>MPEG-DASH and ISO BMFF emsg events</h4>
<p>
The <code>DataCue</code> API has been previously discussed as a means to
deliver <a>in-band</a> event data to web applications, but this is not implemented
in all of the main browser engines. It is <a href="https://www.w3.org/TR/2018/WD-html53-20181018/semantics-embedded-content.html#text-tracks-exposing-inband-metadata">included</a>
in the 18 October 2018 HTML 5.3 draft [[HTML53-20181018]], but is
<a href="https://html.spec.whatwg.org/multipage/media.html#timed-text-tracks">not included</a>
in [[HTML]]. See discussion <a href="https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/U06zrT2N-Xk">here</a>
and notes on implementation status <a href="https://lists.w3.org/Archives/Public/public-html/2016Apr/0005.html">here</a>.
</p>
<p>
WebKit <a href="https://discourse.wicg.io/t/media-timed-events-api-for-mpeg-dash-mpd-and-emsg-events/3096/2">supports</a>
a <code>DataCue</code> interface that extends HTML5 <code>DataCue</code>
with two attributes to support non-text metadata, <code>type</code> and
<code>value</code>.
</p>
<pre class="example">
interface DataCue : TextTrackCue {
attribute ArrayBuffer data; // Always empty

// Proposed extensions.
attribute any value;
readonly attribute DOMString type;
};
</pre>
<p>
<code>type</code> is a string identifying the type of metadata:
</p>
<table class="simple">
<thead>
<tr>
<th colspan="2">WebKit <code>DataCue</code> metadata types</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>"com.apple.quicktime.udta"</code></td>
<td>QuickTime User Data</td>
</tr>
<tr>
<td><code>"com.apple.quicktime.mdta"</code></td>
<td>QuickTime Metadata</td>
</tr>
<tr>
<td><code>"com.apple.itunes"</code></td>
<td>iTunes metadata</td>
</tr>
<tr>
<td><code>"org.mp4ra"</code></td>
<td>MPEG-4 metadata</td>
</tr>
<tr>
<td><code>"org.id3"</code></td>
<td>ID3 metadata</td>
</tr>
</tbody>
</table>
<p>
and <code>value</code> is an object with the metadata item key, data, and optionally a locale:
</p>
<pre class="example">
value = {
key: String
data: String | Number | Array | ArrayBuffer | Object
locale: String
}
</pre>
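<p>
A web application could consume these cues from a metadata text
track, for example. This is a sketch, assuming a user agent that
exposes <a>in-band</a> metadata using WebKit's proposed
<code>DataCue</code> extensions:
</p>
<pre class="example">
video.textTracks.addEventListener('addtrack', (event) => {
  const track = event.track;
  if (track.kind !== 'metadata') return;
  track.mode = 'hidden'; // receive cue events without native rendering
  track.oncuechange = () => {
    for (const cue of Array.from(track.activeCues)) {
      // type identifies the metadata scheme, e.g., "org.id3".
      console.log(cue.type, cue.value.key, cue.value.data);
    }
  };
});
</pre>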
<p>
Neither [[MSE-BYTE-STREAM-FORMAT-ISOBMFF]] nor [[INBANDTRACKS]] describe
handling of <code>emsg</code> boxes.
</p>
<p>
On resource constrained devices such as smart TVs and streaming sticks,
parsing media segments to extract event information leads to a significant
performance penalty, which can have an impact on UI rendering updates if
this is done on the UI thread. There can also be an impact on the battery
life of mobile devices. Given that the media segments will be parsed anyway
by the user agent, parsing in JavaScript is an expensive overhead that
could be avoided.
</p>
<p>
[[HBBTV]] section 9.3.2 describes a mapping between the <code>emsg</code>
fields described <a href="#mpeg-dash">above</a>
and the <a href="https://html.spec.whatwg.org/multipage/media.html#texttrack"><code>TextTrack</code></a>
and <a href="https://www.w3.org/TR/2018/WD-html53-20180426/semantics-embedded-content.html#datacue"><code>DataCue</code></a>
APIs. A <code>TextTrack</code> instance is created for each event
stream signalled in the MPD document (as identified by the
<code>schemeIdUri</code> and <code>value</code>), and the
<a href="https://html.spec.whatwg.org/multipage/media.html#dom-texttrack-inbandmetadatatrackdispatchtype"><code>inBandMetadataTrackDispatchType</code></a>
<code>TextTrack</code> attribute contains the <code>scheme_id_uri</code>
and <code>value</code> values. Because HbbTV devices include a native
DASH client, parsing of the MPD document and creation of the
<code>TextTrack</code>s is done by the user agent, rather than by
application JavaScript code.
</p>
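<p>
Under this mapping, application code can select the event stream it
is interested in by inspecting each track's dispatch type, for
example. This is a sketch; the dispatch type string shown is
illustrative, and <code>handleEventStreamCues</code> is a
hypothetical application-defined handler:
</p>
<pre class="example">
for (const track of Array.from(video.textTracks)) {
  // Matches the scheme_id_uri and value signalled in the MPD document.
  if (track.inBandMetadataTrackDispatchType === 'urn:example:scheme 1') {
    track.mode = 'hidden';
    track.oncuechange = handleEventStreamCues;
  }
}
</pre>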
<p class="ednote">
To support DASH clients implemented in web applications, there is
therefore either a need for an API that allows applications to tell
the UA which schemes they want to receive, or the UA should simply
expose all event streams to applications. Which of these is preferred?
</p>
</section>
<section>
<h3>Synchronization of text track cue rendering</h3>
<p>
Subtitles for video are typically authored against video at
a nominal frame rate, e.g., 25 frames per second, which corresponds to
40 milliseconds per frame. The actual video frame rate may be adjusted
dynamically according to the video encoding, but the subtitle timing
must remain the same ([[EBU-TT-D]], Annex E).
</p>
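<p>
To illustrate the arithmetic, cue times can be aligned to the nominal
frame grid. <code>alignToFrame</code> is a hypothetical helper, not
part of any specification:
</p>
<pre class="example">
const FRAME_RATE = 25;                  // nominal frames per second
const FRAME_DURATION = 1 / FRAME_RATE;  // 40 milliseconds per frame

// Align a media timeline position (in seconds) to the frame grid.
function alignToFrame(time) {
  return Math.floor(time * FRAME_RATE) * FRAME_DURATION;
}

alignToFrame(12.345); // 12.32, the start of frame 308
</pre>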
<p>
Where captions are rendered by application JavaScript code, in
response to <code>VTTCue</code> or <code>TextTrackCue</code> events,
this places a requirement on user agents for timely delivery of these
events, so that application code can respond and render the cues.
</p>
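<p>
The synchronization accuracy actually achieved can be estimated by
application code, for example by comparing the media playback
position against the cue's start time when the cue's
<code>enter</code> event fires. This is a sketch; results will vary
between user agents:
</p>
<pre class="example">
cue.onenter = () => {
  // A positive value indicates late delivery of the cue event.
  const lateness = video.currentTime - cue.startTime;
  console.log(`Cue event delivered ${(lateness * 1000).toFixed(1)} ms late`);
};
</pre>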
<p>
Reference: M&amp;E IG, Media Timed Events Task Force call 17 Dec 2018:
<a href="https://www.w3.org/2018/12/17-me-minutes.html#item06">Minutes</a>.
</p>
<p class="ednote">
TODO: The timing guarantees provided in [[HTML]] regarding the triggering of
<code>TextTrackCue</code> events may not be enough to avoid
<a href="https://lists.w3.org/Archives/Public/public-inbandtracks/2013Dec/0004.html">events being missed</a>.
Explain further.
</p>
</section>
<section>
<h3>Synchronized rendering of web resources</h3>
