<!DOCTYPE html>
<html>
<head>
<title>Media Timed Events</title>
<meta charset="utf-8">
<script src="https://www.w3.org/Tools/respec/respec-w3c-common" async class="remove"></script>
<script class="remove">
var respecConfig = {
specStatus: "IG-NOTE",
edDraftURI: "https://w3c.github.io/me-media-timed-events/",
shortName: "media-timed-events",
editors: [
{
name: "Chris Needham",
mailto: "chris.needham@bbc.co.uk",
company: "British Broadcasting Corporation",
companyURL: "https://www.bbc.co.uk"
},
],
formerEditors: [
{
name: "Giridhar Mandyam",
company: "Qualcomm",
note: "until December 2018"
}
],
wg: "Media & Entertainment Interest Group",
wgURI: "https://www.w3.org/2011/webtv/",
charterDisclosureURI: "https://www.w3.org/2017/03/webtv-charter.html",
github: {
repoURL: "https://github.com/w3c/me-media-timed-events/",
branch: "master"
},
localBiblio: {
"WEB-ISOBMFF": {
title: "ISO/IEC JTC1/SC29/WG11 N16944 Working Draft on Carriage of Web Resources in ISOBMFF",
href: "https://mpeg.chiariglione.org/standards/mpeg-4/timed-text-and-other-visual-overlays-iso-base-media-file-format/wd-carriage-web",
authors: [
"Thomas Stockhammer",
"Cyril Concolato"
],
publisher: "MPEG",
date: "July 2017",
},
"DASH-EVENTING": {
title: "DASH Eventing and HTML5",
href: "https://www.w3.org/2011/webtv/wiki/images/a/a5/DASH_Eventing_and_HTML5.pdf",
authors: [
"Giridhar Mandyam"
],
date: "February 2018"
}
}
};
</script>
</head>
<body>
<section id="abstract">
<p>
This document collects use cases and requirements for improved support
for timed events related to audio or video media on the web, where
synchronization to a playing audio or video media stream is needed,
and makes recommendations for new or changed web APIs to realize these
requirements. The goal is to extend the existing support in HTML for
text track cue events to add support for dynamic content replacement
cues and generic metadata events that drive synchronized interactive
media experiences, and improve the timing accuracy of rendering of web
content intended to be synchronized with audio or video media playback.
</p>
</section>
<section id="sotd">
<p>
The Media & Entertainment Interest Group may update these
use cases and requirements over time. Development of new web APIs based
on the requirements described here, for example, <code>DataCue</code>,
will proceed in the <a href="https://wicg.io/">Web Platform
Incubator Community Group (WICG)</a>, with the goal of eventual
standardization within a W3C Working Group. Contributors to this
document are encouraged to participate in the WICG. Where the
requirements described here affect the HTML specification, contributors
will follow up with <a href="https://whatwg.org/">WHATWG</a>. The Interest
Group will continue to track these developments and provide input and
review feedback on how any proposed API meets these requirements.
</p>
</section>
<section>
<h2>Introduction</h2>
<p>
There is a need in the media industry for an API to support metadata
events synchronized to audio or video media, specifically for both
<a>out-of-band</a> event streams and <a>in-band</a> discrete events
(for example, MPD and <code>emsg</code> events in MPEG-DASH).
These <a>media timed events</a> can be used to support use cases
such as dynamic content replacement, ad insertion, or presentation of
supplemental content alongside the audio or video, or more generally,
making changes to a web page, or executing application code triggered
from JavaScript events, at specific points on the <a>media timeline</a>
of an audio or video media stream.
</p>
</section>
<section>
<h2>Terminology</h2>
<p>
The following terms are used in this document:
</p>
<ul>
<li>
<dfn data-lt="media timed event">media timed events</dfn> —
metadata events synchronized to the <a>media timeline</a> of a
<a>media resource</a>.
</li>
<li>
<dfn>in-band</dfn> — timed event information that is delivered
within the audio or video media container or multiplexed with the
media stream.
</li>
<li>
<dfn>out-of-band</dfn> — timed event information that is
delivered over some other mechanism external to the media container
or media stream.
</li>
</ul>
<p>
The following terms are defined in [[HTML]]:
</p>
<ul>
<li>
<dfn data-cite="HTML/media.html#media-element">media element</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#media-timeline">media timeline</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#media-resource">media resource</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#time-marches-on">time marches on</dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#dom-texttrack-activecues"><code>activeCues</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#dom-media-currenttime"><code>currentTime</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#event-media-enter"><code>enter</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#event-media-exit"><code>exit</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#handler-texttrack-oncuechange"><code>oncuechange</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#handler-texttrackcue-onenter"><code>onenter</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#handler-texttrackcue-onexit"><code>onexit</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#texttrack"><code>TextTrack</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#texttrackcue"><code>TextTrackCue</code></dfn>
</li>
<li>
<dfn data-cite="HTML/media.html#event-media-timeupdate"><code>timeupdate</code></dfn>
</li>
<li>
<dfn data-cite="HTML/timers-and-user-prompts.html#dom-settimeout"><code>setTimeout()</code></dfn>
</li>
<li>
<dfn data-cite="HTML/timers-and-user-prompts.html#dom-setinterval"><code>setInterval()</code></dfn>
</li>
<li>
<dfn data-cite="HTML/imagebitmap-and-animations.html#dom-animationframeprovider-requestanimationframe"><code>requestAnimationFrame()</code></dfn>
</li>
</ul>
<p>
The following term is defined in [[HR-TIME]]:
</p>
<ul>
<li>
<dfn data-cite="HR-TIME#dom-performance-now"><code>Performance.now()</code></dfn>
</li>
</ul>
<p>
The following term is defined in [[WEBVTT]]:
</p>
<ul>
<li>
<dfn data-cite="WEBVTT#vttcue"><code>VTTCue</code></dfn>
</li>
</ul>
</section>
<section>
<h2>Use cases</h2>
<p>
<a>Media timed events</a> carry metadata that is related to points in time,
or regions of time on the <a>media timeline</a>, which can be used to
trigger retrieval and/or rendering of web resources synchronized with
media playback. Such resources can be used to enhance user experience
in the context of media that is being rendered. Some examples include
display of social media feeds corresponding to a live video stream such
as a sporting event, banner advertisements for sponsored content,
accessibility-related assets such as large print rendering of
captions, and display of track titles or images alongside an audio
stream.
</p>
<p>
The following sections describe a few use cases in more detail.
</p>
<section id="dynamic-content-insertion">
<h3>Dynamic content insertion</h3>
<p>
A media content provider wants to allow insertion of content,
such as personalised video, local news, or advertisements,
into a video media stream that contains the main program content.
To achieve this, <a>media timed events</a> can be used to describe the points
on the <a>media timeline</a>, known as splice points, where switching
playback to inserted content is possible.
</p>
<p>
The Society of Cable Telecommunications Engineers (SCTE) specification
"Digital Program Insertion Cueing for Cable" [[SCTE35]] defines a data
cue format for describing such insertion points. Use of these cues in
MPEG-DASH and HLS streams is described in [[SCTE35]], sections 12.1
and 12.2.
</p>
</section>
<section>
<h3>Audio stream with titles and images</h3>
<p>
A media content provider wants to provide visual information alongside
an audio stream, such as an image of the artist and title of the
current playing track, to give users live information about the
content they are listening to.
</p>
<p>
HLS timed metadata [[HLS-TIMED-METADATA]] uses
<a>in-band</a> ID3 metadata to carry the artist and title information,
and image content. RadioVIS in DVB ([[DVB-DASH]], section 9.1.7)
defines <a>in-band</a> event messages that contain image URLs and text
messages to be displayed, with information about when the content
should be displayed in relation to the <a>media timeline</a>.
</p>
</section>
<section>
<h3>Control messages for media streaming clients</h3>
<p>
A media streaming server uses <a>media timed events</a> to send control
messages to a media client library, such as <a href="https://github.com/Dash-Industry-Forum/dash.js">dash.js</a>.
Typically, segmented streaming protocols such as HLS and MPEG-DASH make
use of a manifest document that informs the client of the available
encodings of a media stream, e.g., the Media Presentation Description
(MPD) document in [[MPEGDASH]].
</p>
<p>
Should any of the content in the manifest document need to change, the
client should refresh it by requesting an updated copy from the
server. Section 5.10.4 of [[MPEGDASH]] describes an MPEG-DASH specific
event that is used to notify a client application. An <a>in-band</a>
<code>emsg</code> event is used as an alternative to setting a cache
duration in the response to the HTTP request for the manifest, so the
client can refresh the MPD when it actually changes, as opposed to
waiting for a cache duration expiry period to elapse. This also has
the benefit of reducing the load on HTTP servers caused by frequent
server requests.
</p>
<p>
Reference: M&E IG call 1 Feb 2018:
<a href="https://www.w3.org/2018/02/01-me-minutes.html">Minutes</a>,
[[DASH-EVENTING]].
</p>
</section>
<section>
<h3>Subtitle and caption rendering synchronization</h3>
<p>
A subtitle or caption author wants to ensure that subtitle changes are
aligned as closely as possible to shot changes in the video.
The BBC Subtitle Guidelines [[BBC-SUBTITLE]] describe authoring
best practices. In particular, in section 6.1 authors are advised
"it is likely to be less tiring for the viewer if shot changes
and subtitle changes occur at the same time. Many subtitles therefore
start on the first frame of the shot and end on the last frame."
</p>
</section>
<section>
<h3>Synchronized map animations</h3>
<p>
A user records footage with metadata, including geolocation, on a
mobile video device, e.g., drone or dashcam, to share on the web
alongside a map, e.g., OpenStreetMap.
</p>
<p>
[[WEBVMT]] is an open format for metadata cues, synchronized with a
timed media file, that can be used to drive an online map rendered in
a separate HTML element alongside the <a>media element</a> on the web page.
The media playhead position controls presentation and animation of the
map, e.g., pan and zoom, and allows annotations to be added and
removed, e.g., markers, at specified times during media playback.
Control can also be overridden by the user with the usual interactive
features of the map at any time, e.g., zoom. Concrete examples are
provided by the <a href="http://webvmt.org/demos">tech demos</a> at
the WebVMT website.
</p>
</section>
<section>
<h3>Media analysis visualization</h3>
<p>
A video image analysis system processes a media stream to detect and
recognize objects shown in the video. This system generates metadata
describing the objects, including timestamps that describe when
the objects are visible, together with position information (e.g.,
bounding boxes). A web application then uses this timed metadata to
overlay labels and annotations on the video using HTML and CSS.
</p>
</section>
<section>
<h3>Presentation of auxiliary content in live media</h3>
<p>
During a live media presentation, dynamic and unpredictable events
may occur which cause temporary suspension of the media presentation.
During that suspension interval, auxiliary content, such as the presentation
of UI controls and media files, may be unavailable. Depending on whether
and when the user engages with the UI controls, specific web resources may be
rendered at defined times in a synchronized manner. For example,
an A/V clip and its subtitles, corresponding to an advertisement and
previously downloaded and cached by the user agent, are played out.
</p>
</section>
</section>
<section>
<h2>Related industry specifications</h2>
<p>
This section describes existing media industry specifications and
standards that specify carriage of <a>media timed events</a>, or otherwise
provide requirements for web APIs related to the triggering of media
timed events.
</p>
<section>
<h3>MPEG Common Media Application Format (CMAF)</h3>
<p>
MPEG Common Media Application Format (CMAF) [[MPEGCMAF]] is a media
container format optimized for large scale delivery of a single
encrypted, adaptable multimedia presentation to a wide range of
devices and adaptive streaming methods, including HTTP Live Streaming
[[RFC8216]] and MPEG-DASH [[MPEGDASH]]. It is based on ISO BMFF
[[ISOBMFF]] and supports the AVC, AAC, and HEVC codecs, Common Encryption
(CENC), and subtitles using IMSC1 and WebVTT. The goal is to reduce
media storage and delivery costs by using a single common media format
across different client devices.
</p>
<p>
CMAF media may contain <a>in-band</a> events in the form of
Event Message (<code>emsg</code>) boxes in ISO BMFF files.
<code>emsg</code> is specified in [[MPEGDASH]], section 5.10.3.3,
and described in more detail in the following section of this
document.
</p>
</section>
<section>
<h3>MPEG-DASH</h3>
<p>
MPEG-DASH is an adaptive bitrate streaming technique in which the
audio and video media is partitioned into segments. The Media
Presentation Description (MPD) is an XML document that contains
metadata required by a DASH client to access the media segments and to
provide the streaming service to the user. The media segments can use
any codec, typically within a fragmented MP4 (ISO BMFF) container or
MPEG-2 transport stream.
</p>
<p>
In MPEG-DASH, <a>media timed events</a> may be delivered either
<a>in-band</a> or <a>out-of-band</a>:
</p>
<ul>
<li>
<a>In-band</a> events are <code>emsg</code> boxes in ISO BMFF files.
The presence of <code>emsg</code> events in the media container for
given event schemes is signaled in the MPD document using an
<code>EventStream</code> XML element ([[MPEGDASH]], section 5.10.2).
</li>
<li>
<a>Out-of-band</a> events are represented by <code>Event</code>
XML elements contained within an <code>EventStream</code>
element in the MPD.
</li>
</ul>
<p>
An <code>emsg</code> event contains the following information,
as specified in [[MPEGDASH]], section 5.10.3.3:
</p>
<ul>
<li><code>scheme_id_uri</code> — A URI that identifies
the message scheme</li>
<li><code>value</code> — The event value (string)</li>
<li><code>timescale</code> — Timescale units, in ticks
per second</li>
<li><code>presentation_time_delta</code> — Presentation
time delta (with respect to the media segment),
in <code>timescale</code> units</li>
<li><code>event_duration</code> — Event duration,
in <code>timescale</code> units</li>
<li><code>id</code> — Event message identifier</li>
<li><code>message_data</code> — Message body (may be empty)</li>
</ul>
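<p>
As a rough illustration of the fields listed above, the following sketch
reads a version 0 <code>emsg</code> box from an <code>ArrayBuffer</code>.
It assumes the buffer holds exactly one complete box starting at byte 0,
and that the fields follow the version 0 layout (two null-terminated
strings followed by four 32-bit unsigned integers). The function name
<code>parseEmsgV0</code> and the camel-case property names are
hypothetical and are not defined by [[MPEGDASH]].
</p>

```javascript
// Sketch: parse a version 0 `emsg` box into the fields listed above.
// Assumes `buffer` is an ArrayBuffer that holds exactly one complete box.
function parseEmsgV0(buffer) {
  const view = new DataView(buffer);
  const bytes = new Uint8Array(buffer);
  const decoder = new TextDecoder("utf-8");

  const size = view.getUint32(0); // box size in bytes (big-endian)
  const type = decoder.decode(bytes.subarray(4, 8)); // four-character box type
  const version = view.getUint8(8); // bytes 9 to 11 hold the FullBox flags
  if (type !== "emsg" || version !== 0) {
    throw new Error("Expected a version 0 emsg box");
  }

  let offset = 12; // first byte after the box header

  // Read a null-terminated UTF-8 string and advance past the terminator.
  function readString() {
    const end = bytes.indexOf(0, offset);
    if (end === -1) throw new Error("Unterminated string in emsg box");
    const text = decoder.decode(bytes.subarray(offset, end));
    offset = end + 1;
    return text;
  }

  // Read a big-endian 32-bit unsigned integer and advance by four bytes.
  function readUint32() {
    const value = view.getUint32(offset);
    offset += 4;
    return value;
  }

  const schemeIdUri = readString();
  const value = readString();
  const timescale = readUint32();
  const presentationTimeDelta = readUint32();
  const eventDuration = readUint32();
  const id = readUint32();
  const messageData = bytes.slice(offset, size); // may be empty

  return {
    schemeIdUri, value, timescale, presentationTimeDelta,
    eventDuration, id, messageData,
  };
}
```

<p>
Dividing <code>presentationTimeDelta</code> or <code>eventDuration</code>
by <code>timescale</code> gives the corresponding value in seconds.
</p>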
</section>
<section>
<h3>HbbTV</h3>
<p>
HbbTV is an interactive TV application standard that supports both
broadcast (DVB) media delivery, and internet streaming using
MPEG-DASH. The HbbTV application environment is based on HTML and
JavaScript. MPEG-DASH streaming is implemented natively by the user
agent, rather than through a JavaScript web application using Media
Source Extensions.
</p>
<p>
HbbTV includes support for <code>emsg</code> events ([[DVB-DASH]],
section 9.1) and requires these to be mapped to HTML5 <code>DataCue</code>
([[HBBTV]], section 9.3.2). The revision of HTML5 referenced
by [[HBBTV]] is [[html51-20151008]]. This feature is included in user
agents shipping in connected TVs across Europe from 2017.
</p>
<p>
The <a href="https://www.hbbtv.org/wp-content/uploads/2018/03/HbbTV-testcases-2018-1.pdf">HbbTV
device test suite</a> includes test pages and streams that
cover <code>emsg</code> support. HbbTV has a
<a href="https://github.com/HbbTV-Association/ReferenceApplication">reference application</a>
and content for DASH+DRM which includes <code>emsg</code> support.
</p>
</section>
<section>
<h3>DASH Industry Forum APIs for Interactivity</h3>
<p>
The DASH-IF InterOp Working Group has an ongoing work item,
<em>DAInty</em>, "DASH APIs for Interactivity", which aims to
specify a set of APIs between the DASH client/player and interactivity-capable
applications, for both web and native applications [[DASHIFIOP]]. The origin of this
work is a related
<a href="http://www.3gpp.org/ftp/tsg_sa/TSG_SA/TSGS_77/Docs/SP-170796.zip">3GPP
work item</a> on Service Interactivity [[3GPP-INTERACTIVITY]].
The objective is to provide service enablers for user engagement with
auxiliary content and UIs on mobile devices during live or time-shifted
viewing of streaming content delivered over 3GPP broadcast or unicast
bearers, and the measurement and reporting of such interactive consumption.
</p>
<p>
Two APIs are being developed that are relevant to the scope of the present
document:
</p>
<ul>
<li>
Application subscription/DASH client dispatch of DASH event stream
messages containing interactivity information. Events can be delivered
<a>in-band</a> (<code>emsg</code>) and/or as MPD events.
</li>
<li>
Application subscription/DASH client dispatch of ISO BMFF Timed
Metadata tracks providing similar functionality to DASH event streams.
</li>
</ul>
<p>
Two modes for dispatching events
<a href="https://www.w3.org/2018/08/20-me-minutes.html#item05">are
defined</a>. In Mode 1, events are
dispatched at the time the event arrives, and in Mode 2, events are
dispatched at the given time on the <a>media timeline</a>. The
"arrival" of events from the DASH client perspective may be either
static or pre-provisioned, in the case of MPD events, or dynamic in the
case of <a>in-band</a> events carried in <code>emsg</code> boxes. The
application can register with the DASH client which mode to use.
</p>
</section>
<section>
<h3>SCTE-35</h3>
<p>
The Society of Cable Telecommunications Engineers (SCTE) has produced the
SCTE-35 specification "Digital Program Insertion Cueing for Cable"
[[SCTE35]], which defines a data cue format for describing
insertion points, to support the
<a href="#dynamic-content-insertion">dynamic content insertion</a> use
case.
</p>
<p>
[[SCTE214-1]] section 6.7 describes the carriage of SCTE-35 events
in an MPEG-DASH MPD document, as <a>out-of-band</a> events.
[[SCTE214-2]] section 9 and [[SCTE214-3]] section 7.3 describe
the carriage of SCTE-35 events as <a>in-band</a> events in MPEG-DASH
using MPEG2-TS and ISO BMFF respectively, using <code>emsg</code>.
</p>
<p>
[[SCTE35]] section 9.1 describes the requirements for content
splicing: "In order to give advance warning of the impending splice
(a pre-roll function), the splice_insert() command could be sent
multiple times before the splice point. For example, the
splice_insert() command could be sent at 8, 5, 4 and 2 seconds prior
to the packet containing the related splice point. In order to meet
other splicing deadlines in the system, any message received with less
than 4 seconds of advance notice may not create the desired result."
</p>
<p>
This places an implicit requirement on the user agent in handling of
<a>media timed events</a> related to insertion cues. The content originator
may provide the cue as little as 2 seconds in advance of the
insertion time. Therefore, the time taken by the user agent to propagate
the event data associated with the insertion cue to the application should be
considerably less than 2 seconds.
</p>
</section>
<section>
<h3>MPEG Working Draft on Carriage of Web Resources in ISO BMFF</h3>
<p>
The MPEG Working Draft on Carriage of Web Resources in ISO BMFF
[[WEB-ISOBMFF]] is a draft document that specifies the use of the
ISO BMFF container format for the storage and delivery of web
content. The goal is to allow web resources (HTML, JavaScript, etc.)
to be parsed from the storage and processed by a user agent at
specific presentation times on the <a>media timeline</a>, and so be
synchronized with other tracks within the container, such as audio,
video, and subtitles.
</p>
<p>
The Media & Entertainment Interest Group is actively tracking
this work and is open to discussing specific requirements for media
timed events as development progresses.
</p>
</section>
<section>
<h3>WebVTT</h3>
<p>
[[WEBVTT]] is a W3C specification that provides a format for web video
text tracks. A <a>VTTCue</a> is a text track cue, and may have
attributes that affect rendering of the cue text on a web page.
WebVTT metadata cues are text that is aligned to the
<a>media timeline</a>. Web applications can use <a>VTTCue</a>
to schedule <a>out-of-band</a> metadata events by serializing the
event data to a string format (JSON, for example) when creating the
cue, and deserializing the data when the cue is triggered.
</p>
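<p>
The serialization approach described above might look like the following
sketch. The helper name <code>scheduleMetadataEvent</code> and its
parameters are hypothetical. The track and cue constructor are passed in
as arguments, on the assumption that they behave like <a>TextTrack</a>
and <a>VTTCue</a>, so that the helper does not depend on browser globals.
</p>

```javascript
// Sketch: schedule an out-of-band metadata event as a cue whose text is JSON.
// `track` is expected to behave like a TextTrack and `CueClass` like VTTCue.
function scheduleMetadataEvent(track, CueClass, startTime, endTime, data, onEvent) {
  // Serialize the event data when creating the cue.
  const cue = new CueClass(startTime, endTime, JSON.stringify(data));
  // Deserialize the data when the cue is triggered.
  cue.onenter = () => onEvent(JSON.parse(cue.text), cue);
  track.addCue(cue);
  return cue;
}

// Possible browser usage (not executed here):
//   const track = video.addTextTrack("metadata");
//   scheduleMetadataEvent(track, VTTCue, 10, 12, { title: "Track 1" }, console.log);
```

<p>
Because the payload travels as cue text, only data that survives a JSON
round trip can be carried this way. Binary payloads would need an
additional encoding step.
</p>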
<p>
Web applications can also use <a>VTTCue</a> to trigger
rendering of <a>out-of-band</a> delivered timed text cues, such as
TTML or IMSC format captions.
</p>
</section>
</section>
<section>
<h2>Gap analysis</h2>
<p>
This section describes gaps in existing web platform
capabilities needed to support the use cases and requirements described
in this document. Where applicable, this section also describes how
existing web platform features can be used as workarounds, and any
associated limitations.
</p>
<section>
<h3>MPEG-DASH and ISO BMFF emsg events</h3>
<p>
The <code>DataCue</code> API has been previously discussed as a means
to deliver <a>in-band</a> event data to web applications, but this is
not implemented in all of the main browser engines. It is
<a href="https://www.w3.org/TR/2018/WD-html53-20181018/semantics-embedded-content.html#text-tracks-exposing-inband-metadata">included</a>
in the 18 October 2018 HTML 5.3 draft [[HTML53-20181018]], but is
<a href="https://html.spec.whatwg.org/multipage/media.html#timed-text-tracks">not included</a>
in [[HTML]]. See discussion <a href="https://groups.google.com/a/chromium.org/forum/#!topic/blink-dev/U06zrT2N-Xk">here</a>
and notes on implementation status <a href="https://lists.w3.org/Archives/Public/public-html/2016Apr/0005.html">here</a>.
</p>
<p>
WebKit <a href="https://discourse.wicg.io/t/media-timed-events-api-for-mpeg-dash-mpd-and-emsg-events/3096/2">supports</a>
a <code>DataCue</code> interface that extends HTML5 <code>DataCue</code>
with two attributes to support non-text metadata, <code>type</code> and
<code>value</code>.
</p>
<pre class="example">
interface DataCue : TextTrackCue {
attribute ArrayBuffer data; // Always empty
// Proposed extensions.
attribute any value;
readonly attribute DOMString type;
};
</pre>
<p>
<code>type</code> is a string identifying the type of metadata:
</p>
<table class="simple">
<thead>
<tr>
<th colspan="2">WebKit <code>DataCue</code> metadata types</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>"com.apple.quicktime.udta"</code></td>
<td>QuickTime User Data</td>
</tr>
<tr>
<td><code>"com.apple.quicktime.mdta"</code></td>
<td>QuickTime Metadata</td>
</tr>
<tr>
<td><code>"com.apple.itunes"</code></td>
<td>iTunes metadata</td>
</tr>
<tr>
<td><code>"org.mp4ra"</code></td>
<td>MPEG-4 metadata</td>
</tr>
<tr>
<td><code>"org.id3"</code></td>
<td>ID3 metadata</td>
</tr>
</tbody>
</table>
<p>
and <code>value</code> is an object with the metadata item key, data, and optionally a locale:
</p>
<pre class="example">
value = {
key: String
data: String | Number | Array | ArrayBuffer | Object
locale: String
}
</pre>
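<p>
As a rough illustration of how an application might read these attributes,
the following sketch filters cues by <code>type</code> and collects the
contents of <code>value</code>. The function name
<code>collectMetadataItems</code> is hypothetical, and the cues are assumed
to have the shape described above. Cues without a <code>value</code> are
skipped.
</p>

```javascript
// Sketch: collect metadata items from cues shaped like the WebKit DataCue
// described above. Cues with a different `type`, or without a `value`
// (for example, on engines that lack this extension), are skipped.
function collectMetadataItems(cues, type) {
  const items = [];
  for (const cue of Array.from(cues)) {
    if (cue.type !== type || cue.value == null) continue;
    const { key, data, locale } = cue.value;
    items.push({ startTime: cue.startTime, key, data, locale });
  }
  return items;
}

// Possible browser usage (not executed here):
//   collectMetadataItems(track.cues, "org.id3");
```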
<p>
Neither [[MSE-BYTE-STREAM-FORMAT-ISOBMFF]] nor [[INBANDTRACKS]] describes
handling of <code>emsg</code> boxes.
</p>
<p>
Without native support, a web application must parse media segments in
JavaScript to extract event information. On resource-constrained devices
such as smart TVs and streaming sticks, this leads to a significant
performance penalty, which can affect UI rendering updates if the parsing
is done on the UI thread. There can also be an impact on the battery
life of mobile devices. Given that the media segments will be parsed anyway
by the user agent, parsing in JavaScript is an expensive overhead that
could be avoided.
<p>
[[HBBTV]] section 9.3.2 describes a mapping between the <code>emsg</code>
fields described <a href="#mpeg-dash">above</a>
and the <a>TextTrack</a>
and <a href="https://www.w3.org/TR/2018/WD-html53-20180426/semantics-embedded-content.html#datacue"><code>DataCue</code></a>
APIs. A <a>TextTrack</a> instance is created for each event
stream signalled in the MPD document (as identified by the
<code>schemeIdUri</code> and <code>value</code>), and the
<a href="https://html.spec.whatwg.org/multipage/media.html#dom-texttrack-inbandmetadatatrackdispatchtype"><code>inBandMetadataTrackDispatchType</code></a>
<a>TextTrack</a> attribute contains the <code>scheme_id_uri</code>
and <code>value</code> values. Because HbbTV devices include a native
DASH client, parsing of the MPD document and creation of the
<a>TextTrack</a>s is done by the user agent, rather than by
application JavaScript code.
</p>
</section>
<section>
<h3>Synchronized rendering of web resources</h3>
<p>
In browsers, non-media web rendering is handled through repaint
operations at a rate that generally matches the display refresh rate
(e.g., 60 times per second), following the user's wall clock. A web
application can schedule actions and render web content at specific
points on the user's wall clock, notably through
<a>Performance.now()</a>, <a>setTimeout()</a>, <a>setInterval()</a>,
and <a>requestAnimationFrame()</a>.
</p>
<p>
In most cases, media rendering follows a different path, be it because
it gets handled by a dedicated background process or by dedicated
hardware circuitry. As a result, progress along the <a>media
timeline</a> may follow a
<a data-cite="HTML/media.html#offsets-into-the-media-resource:media-timeline-8">
clock</a> different from the user's wall clock. [[HTML]] recommends
that the media clock approximate the user's wall clock but does not
require it to match the user's wall clock.
</p>
<p>
To synchronize rendering of web content to a video with frame
accuracy, a web application needs:
</p>
<ul>
<li>
A way to track progress along the <a>media timeline</a> with
<em>sufficient precision</em>. The actual precision required depends
on the use case. Subtitles for video are typically authored against
video at the nominal video frame rate, e.g., 25 frames per second,
which corresponds to 40 milliseconds per frame, even when the actual
video frame rate gets adjusted dynamically ([[EBU-TT-D]], Annex E).
This suggests a precision of 20 milliseconds, or half of the duration
of a typical video frame, to render subtitles with frame accuracy.
</li>
<li>
In cases where synchronization needs to occur at frame boundaries, a
way to tie the rendering of non-media content, typically done at the
display refresh rate, with the rendering of a video frame. This need
does not replace the former one: a web application that needs to
render web content at media frame boundaries may also need to
perform actions at specific points on the <a>media timeline</a>
regardless of when the next frame gets rendered.
</li>
</ul>
<p>
The following sub-sections discuss mechanisms currently available to
web applications to track progress on the <a>media timeline</a> and
render content at frame boundaries.
</p>
<section>
<h4>Using cues to track progress on the media timeline</h4>
<p>
Cues (e.g., <a>TextTrackCue</a>, or <a>VTTCue</a>) are
units of time-sensitive data on a <a>media timeline</a> [[HTML]]. The
<a>time marches on</a> steps in [[HTML]] control the firing of cue
events during media playback. <a>Time marches on</a> requires a
<a>timeupdate</a> event to be fired at the <a>media element</a>
between 15 and 250 milliseconds since the last such event, and this
requirement therefore specifies the rate at which <a>time marches
on</a> is executed during playback. In practice it
<a href="https://www.w3.org/2018/12/17-me-minutes.html#item06">has
been found</a> that the timing varies between browser
implementations.
</p>
<p>
There are two methods a web application can use to handle cues:
</p>
<ul>
<li>
Add an <a>oncuechange</a> handler function to the <a>TextTrack</a>
and inspect the track's <a>activeCues</a> list. Because
<a>activeCues</a> contains the list of cues that are active at the
time that <a>time marches on</a> is run, it is possible for cues
to be missed by a web application using this method, where cues
appear on the <a>media timeline</a> between successive executions
of <a>time marches on</a> during media playback. This may occur
if the cues have a short duration, or if an event handler function
runs for a long time.
</li>
<li>
Add <a>onenter</a> and <a>onexit</a> handler functions
to each cue. The <a>time marches on</a> steps guarantee that
<a>enter</a> and <a>exit</a> events will be fired for all cues,
including those that appear on the <a>media timeline</a> between
successive executions of <a>time marches on</a> during media
playback. This method is only possible for cues created by the web
application, i.e., <a>VTTCue</a> objects, and not cue
objects created by the user agent.
</li>
</ul>
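<p>
The two methods can be sketched as follows. This is a minimal
illustration, not a recommended implementation: the helper names
<code>watchActiveCues</code> and <code>addCueWithHandlers</code> are
hypothetical, and the track and cue objects are passed in so that the
sketch does not depend on a particular page.
</p>

```javascript
// Method 1: inspect the track's activeCues list on each cuechange event.
// A cue that starts and ends between two runs of "time marches on" never
// appears in activeCues, so this handler can miss it.
// (watchActiveCues is a hypothetical helper name.)
function watchActiveCues(track, onActive) {
  track.oncuechange = () => {
    for (const cue of Array.from(track.activeCues)) {
      onActive(cue);
    }
  };
}

// Method 2: attach enter and exit handlers to a cue created by the
// application, then add the cue to the track.
// (addCueWithHandlers is a hypothetical helper name.)
function addCueWithHandlers(track, cue, onEnter, onExit) {
  cue.onenter = () => onEnter(cue);
  cue.onexit = () => onExit(cue);
  track.addCue(cue);
  return cue;
}

// Possible usage in a browser, assuming a <video> element named `video`:
//   const track = video.addTextTrack("metadata");
//   addCueWithHandlers(track, new VTTCue(5, 10, "payload"),
//     (cue) => console.log("enter", cue.text),
//     (cue) => console.log("exit", cue.text));
```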
<p>
An issue with handling of text track and data cue events in HbbTV
<a href="https://lists.w3.org/Archives/Public/public-inbandtracks/2013Dec/0004.html">was
reported</a> in 2013. HbbTV requires the user agent to implement an
MPEG-DASH client, so cue objects are created by the user agent rather
than by the application. Applications must therefore use the first of
the above methods for cue handling, which means that they can miss cues
as described above.
</p>
</section>
<section>
<h4>Using <code>timeupdate</code> events from the media element</h4>
<p>
Another approach to synchronizing rendering of web content to media
playback is to use the <a>timeupdate</a> event, and for the
web application to manage the <a>media timed events</a> to be
triggered, rather than use the text track cue APIs in [[HTML]].
This approach has the same synchronization
limitations as described above, because <a>time marches on</a> allows
up to 250 milliseconds between <a>timeupdate</a> events, and so is
<a data-cite="HTML/media.html#best-practices-for-metadata-text-tracks:event-media-timeupdate">explicitly
discouraged</a> in [[HTML]]. In addition, the timing variability of
<a>timeupdate</a> events between browser engines makes them
unreliable for the purpose of synchronized rendering of web content.
</p>
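<p>
As an illustration of this approach, the following sketch keeps an
application-managed list of events and fires each one the first time the
reported playback position reaches its start time. The function name
<code>createTimedEventDispatcher</code> and the event object shape are
assumptions made for this example, and the sketch only handles linear
playback (seeking is not covered). The second argument passed to the
callback shows how late the event was triggered.
</p>

```javascript
// Returns a function to call with the media element's currentTime on each
// timeupdate event. Every pending event whose startTime is at or before
// the reported position is fired once, in start time order.
// (Hypothetical helper; events are plain objects with a startTime in seconds.)
function createTimedEventDispatcher(events, onStart) {
  const pending = [...events].sort((a, b) => a.startTime - b.startTime);
  let next = 0;
  return function handleTimeUpdate(currentTime) {
    while (next < pending.length && pending[next].startTime <= currentTime) {
      const latenessSeconds = currentTime - pending[next].startTime;
      onStart(pending[next], latenessSeconds);
      next += 1;
    }
  };
}

// Possible usage in a browser, assuming a media element named `video`:
//   const handle = createTimedEventDispatcher(myEvents, (event, late) => {
//     console.log(event.name, "fired", late, "seconds late");
//   });
//   video.addEventListener("timeupdate", () => handle(video.currentTime));
```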
</section>
<section>
<h4>Polling the current position on the media timeline</h4>
<p>
Synchronization accuracy can be improved by polling the media
element's <a>currentTime</a> property from a <a>setInterval()</a>
callback, or by using <a>requestAnimationFrame()</a> for greater
accuracy. This technique can be useful where content should be
animated smoothly in sync with the media, for example,
rendering a playhead position marker in an audio waveform
visualization, or displaying web content at specific points on the
<a>media timeline</a>. However, the use of <a>setInterval()</a> or
<a>requestAnimationFrame()</a> for media-synchronized rendering
is CPU intensive, because the callback runs repeatedly whether or not
anything needs to be updated.
</p>
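<p>
A polling loop of this kind might look like the sketch below. The
function name <code>pollMediaTime</code> is hypothetical, and the
scheduling function is passed as a parameter (defaulting to
<code>requestAnimationFrame</code> where it exists) so that the loop can
be exercised outside a browser.
</p>

```javascript
// Calls onTick with the media element's currentTime once per scheduled
// callback, and returns a function that stops the loop.
// (pollMediaTime is a hypothetical helper name.)
function pollMediaTime(media, onTick, schedule = globalThis.requestAnimationFrame) {
  let running = true;
  function tick() {
    if (!running) {
      return;
    }
    onTick(media.currentTime);
    // Re-arm the loop; in a browser this runs again before the next repaint.
    schedule(tick);
  }
  schedule(tick);
  return function stop() {
    running = false;
  };
}

// Possible usage in a browser, assuming a media element named `audio` and
// an application function drawPlayhead(seconds):
//   const stop = pollMediaTime(audio, drawPlayhead);
//   audio.addEventListener("ended", stop);
```

<p>
Note that the callback keeps running on every display refresh until
<code>stop()</code> is called, including while the media is paused,
which is one reason this technique is costly.
</p>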
</section>
<section>
<h4>Detecting when the next media frame will be rendered</h4>
<p>
[[HTML]] does not expose any precise mechanism to assess the time,
from a user's wall clock perspective, at which a particular media
frame is going to be rendered. A web application can only read the
<a>media element</a>'s <a>currentTime</a> property and, from it,
infer the frame being rendered
and the time at which the user will see the next frame. This has
several limitations:
</p>
<ul>
<li>
<a>currentTime</a> is represented as a <code>double</code>
value, which does not allow individual frames to be identified
reliably due to rounding errors. This is a
<a href="https://github.com/whatwg/html/issues/609">known
issue</a>.
</li>
<li>
<a>currentTime</a> is updated at a user-agent defined rate
(typically the rate at which <a>time marches on</a> runs), and is
kept stable while scripts are running. When a web application
reads <a>currentTime</a>, it cannot tell when this property
was last updated, and thus cannot reliably assess whether this
property still represents the frame currently being rendered.
</li>
</ul>
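<p>
The inference described above typically amounts to a computation like
the one sketched below. The function name
<code>naiveFrameIndex</code> is hypothetical, and the sketch assumes a
constant frame rate that the application knows by other means, since
[[HTML]] does not expose it. Because the multiplication is carried out
in binary floating point, the product can fall just below a frame
boundary, in which case the previous frame is reported.
</p>

```javascript
// Estimates the index of the frame being rendered from a playback
// position in seconds and an assumed constant frame rate.
// (naiveFrameIndex is a hypothetical helper name.)
function naiveFrameIndex(currentTime, frameRate) {
  // The product is rounded to the nearest double before flooring, so a
  // position that should map exactly to frame N may be reported as N - 1.
  return Math.floor(currentTime * frameRate);
}

// Possible usage in a browser, assuming a media element named `video`
// known to play at 25 frames per second:
//   const frame = naiveFrameIndex(video.currentTime, 25);
```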
</section>
</section>
</section>
<section>
<h2>Recommendations</h2>
<p>
This section describes recommendations from the Media & Entertainment
Interest Group for the development of a generic <a>media timed event</a> API,
and associated synchronization considerations.
</p>
<section>
<h3>Subscribing to event streams</h3>
<p>
The API should allow web applications to subscribe to receive specific
event streams by event type. For example, to support MPEG-DASH
<code>emsg</code> and MPD events, the API should allow subscription by
<code>id</code> and (optional) <code>value</code>. This is to make
receiving events opt-in from the application point of view. The user
agent should deliver only those events to a web application for which
the application has subscribed. The API should also allow web
applications to unsubscribe from specific event streams by event type.
</p>
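<p>
The opt-in behavior can be modelled with a small sketch. This is not a
proposed API: the class name <code>EventStreamSubscriptions</code>, its
methods, and the example identifiers are all hypothetical, and the
sketch assumes that a subscription made without a <code>value</code>
matches every event with the given <code>id</code>.
</p>

```javascript
// Builds a lookup key from an event stream id and an optional value.
function subscriptionKey(id, value = null) {
  return JSON.stringify([id, value]);
}

// Hypothetical model of per-application subscriptions; not an existing
// or proposed web platform interface.
class EventStreamSubscriptions {
  constructor() {
    this.keys = new Set();
  }

  subscribe(id, value) {
    this.keys.add(subscriptionKey(id, value));
  }

  unsubscribe(id, value) {
    this.keys.delete(subscriptionKey(id, value));
  }

  // An event is delivered if the application subscribed to its exact
  // id and value, or to its id alone (assumed to match any value).
  shouldDeliver(event) {
    return (
      this.keys.has(subscriptionKey(event.id, event.value)) ||
      this.keys.has(subscriptionKey(event.id))
    );
  }
}

const subscriptions = new EventStreamSubscriptions();
subscriptions.subscribe("urn:example:scheme", "1");
console.log(subscriptions.shouldDeliver({ id: "urn:example:scheme", value: "1" })); // true
console.log(subscriptions.shouldDeliver({ id: "urn:example:scheme", value: "2" })); // false
```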
</section>
<section>
<h3>Out-of-band events</h3>
<p>
To be able to handle out-of-band events, including MPEG-DASH MPD
events, the API should allow web applications to create events to be
added to the <a>media timeline</a>, to be triggered by the user agent.
The API should allow the web application to provide all necessary
parameters to define the event, including start and end times, event
type, and data payload. The payload should be allowed to have any data
type (e.g., the
set of types supported by the WebKit <code>DataCue</code>). For
MPEG-DASH MPD events, the event type is defined by the <code>id</code>
and (optional) <code>value</code> fields.
</p>
</section>
<section>
<h3>Event triggering</h3>
<p>
For those events that the application has subscribed to receive,
the API should:
</p>
<ul>
<li>
Generate a JavaScript event when an <a>in-band</a> <a>media timed event</a>
is parsed from the media container or media stream (DAInty Mode 1).
</li>
<li>
Generate JavaScript events when the current media playback
position reaches the start time and the end time of a media timed
event during playback (DAInty Mode 2). This applies equally to
<a>in-band</a> events that the user agent has extracted from the
media container, and <a>out-of-band</a> events added by the web
application.
</li>
</ul>
<p>
The API should provide guarantees that no events can be missed during
linear playback of the media.
</p>
</section>
<section>
<h3>In-band event processing</h3>
<p>
We recommend updating [[INBANDTRACKS]] to describe handling of
<a>in-band</a> <a>media timed events</a> supported on the web platform,
following a registry approach with one specification per media format
that describes the event details for that format.
</p>
</section>
<section>
<h3>MPEG-DASH events</h3>
<p>
We recommend that browser engines support MPEG-DASH <code>emsg</code>
<a>in-band</a> events and MPD <a>out-of-band</a> events, as part of
their support for the MPEG Common Media Application Format (CMAF)
[[MPEGCMAF]].
</p>
</section>
<section>
<h3>Synchronization</h3>
<p>
In order to achieve greater synchronization accuracy between media
playback and web content rendered by an application, the <a>time
marches on</a> steps in [[HTML]] should be modified to allow delivery
of <a>media timed event</a> start time and end time notifications within 20
milliseconds of their positions on the <a>media timeline</a>.
</p>
<p>
Additionally, to allow such synchronization to happen at frame
boundaries, we recommend introducing a mechanism that would allow a
web application to accurately predict, using the user's wall clock,
when the next frame will be rendered (e.g., as done in the
<a href="https://webaudio.github.io/web-audio-api/#dom-audiocontext-getoutputtimestamp">Web
Audio API</a>). The same outcome could perhaps be achieved
through a mechanism similar to <a>requestAnimationFrame()</a> that
would couple the rendering of non-media web content with the rendering
of the next media frame.
</p>
</section>
</section>
<section>
<h2>Acknowledgments</h2>
<p>
Thanks to François Daoust, Charles Lo, Nigel Megitt, Jon Piesing, Rob Smith, and
Mark Vickers for their contributions and feedback on this document.
</p>
</section>
</body>
</html>