
Add support for Audio Description requirements #195

Closed
nigelmegitt opened this issue Oct 6, 2016 · 37 comments

@nigelmegitt (Contributor) commented Oct 6, 2016

Add support for Audio Description. See also the AD Requirements wiki page.

  • Define an audio processing framework, to allow audio level and pan to be modified, and additional audio resources to be mixed in, according to the TTML content element hierarchy.
  • Define a gain style attribute
  • Define a pan style attribute
  • Check that the appropriate continuous animation interpolation calculation algorithms for audio are present (this may need the addition of an exponential interpolation calculation mode)
  • Define an attribute that controls the playback "in-time" within a remote referenced audio, to support requirement ADR6
  • Define feature designators
  • Update schema to include new attributes
  • Check and if necessary update TTML requirements document
@nigelmegitt (Contributor, Author) commented Oct 7, 2016

Pre-editing notes:

The audio processing framework is that any audio provided by the parent element, or by the document processing context at the root level, is modified by the computed values of the audio styling attributes tts:gain and tts:pan, according to their semantics, at the times when those attributes are applicable.

This creates an audio graph: an element that sets an audio style attribute creates an anonymous AudioNode whose input is the output of the parent element's AudioNode (or of the nearest ancestor's, if the parent has none), whose output is modified according to the audio style attribute, and whose lifetime is the active time of the element.

This generates a set of (possibly anonymous) spans, each of which may have an implied output AudioNode; only those spans that have audio style attributes set on them create AudioNodes.

When two or more spans with implied AudioNodes are simultaneously active, their outputs are mixed additively. When no active spans have implied AudioNodes, the audio output is the unmodified input.

For example:

<div>
<p begin="00:01:00" end="00:01:10" tts:gain="0.5"/>
</div>

plays back the provided input, multiplying by 0.5 during the period from 1 minute to 1 minute 10 seconds.

<div>
<p begin="00:01:00" end="00:01:10" tts:gain="0.5"/>
<p begin="00:01:05" end="00:01:15" tts:gain="0.5"/>
</div>

plays back the provided input, multiplying as follows:

0 -> 60s: Multiply by 1
60s -> 65s: Multiply by 0.5
65s -> 70s: Multiply by 1 (each of the two overlapping spans multiplies the input by 0.5, and their outputs are summed)
70s -> 75s: Multiply by 0.5
75s -> end: Multiply by 1.

however:

<div>
<p begin="00:01:00" end="00:01:10" tts:gain="0.5"/>
<p begin="00:01:05" end="00:01:15"/>
</div>

plays back the provided input, multiplying as follows:

0 -> 60s: Multiply by 1
60s -> 70s: Multiply by 0.5 (the second p has no audio styling attributes and so generates no AudioNode)
70s -> end: Multiply by 1.
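
A nested case, as I read the framework above (timings illustrative):

<div tts:gain="0.5">
<p begin="00:01:00" end="00:01:10" tts:gain="0.5"/>
</div>

plays back the provided input, multiplying as follows:

0 -> 60s: Multiply by 0.5 (the div's AudioNode only)
60s -> 70s: Multiply by 0.25 (the div's AudioNode multiplies by 0.5 and its output feeds the p's AudioNode, which multiplies by 0.5 again)
70s -> end: Multiply by 0.5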

The tts:gain style attribute is a signed floating point number with a nominal range (-∞, ∞). The input audio is multiplied by this value.

The tts:pan style attribute is a signed floating point number in the range [-1, 1] that defines the position of the input in the output's stereo image. -1 represents full left, +1 represents full right. The default is 0. The equal-power panning algorithm shall be used.
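
For concreteness, and assuming the same equal-power law that the Web Audio API applies to a mono input (an assumption on my part; the spec text should state the formula explicitly), the left and right gains for a computed pan value $p \in [-1, 1]$ would be:

$$x = \frac{p + 1}{2}, \qquad g_L = \cos\!\left(\frac{\pi}{2} x\right), \qquad g_R = \sin\!\left(\frac{\pi}{2} x\right)$$

so that $p = 0$ gives $g_L = g_R = \tfrac{1}{\sqrt{2}}$, and the total power $g_L^2 + g_R^2$ is constant across the pan range.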

@nigelmegitt (Contributor, Author) commented Oct 7, 2016

An audio styling attribute (tts:gain or tts:pan) applied to an audio element applies to the referenced audio prior to mixing.

The timing attributes begin, end and dur are permitted on an audio element and specify a sub-section of the audio resource to play. For the purpose of resolving those times, the syncbase of the audio resource is the beginning of the referenced resource. (Note: it would be acceptable to me to define alternate attribute names if this is considered a confusing overloading of the existing timing attributes.)

Checking the Web Audio spec, it does not appear necessary to introduce an exponential ramp interpolation calculation for this application. The calculation methods currently available on the animate element in TTML2 are probably sufficient.
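
To illustrate the first point above (resource name hypothetical):

<p begin="00:10:00" end="00:10:20">
<audio src="#describedScene" tts:gain="0.8" tts:pan="-0.25"/>
</p>

i.e. the referenced audio plays during the p's active interval with its level multiplied by 0.8 and panned slightly left, prior to being mixed with any other audio active at the same time.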

@nigelmegitt (Contributor, Author) commented Oct 7, 2016

It would be elegant from an authoring perspective to allow a synchronisation setting that, on a per-audio-element basis, sets the playback begin point within an audio resource to the current media timebase time of the TTML document instance, rather than requiring the author to calculate the in-time and set it explicitly.

SMIL 3 defines the attributes syncBehavior and syncBehaviorDefault, which can take the values locked and independent; however, my current reading is that using them for this purpose would at best stretch those terms too far, and quite possibly misuse them.

Perhaps a better way to achieve this is to permit a keyword documentTime to be used in the begin attribute, so assuming ttp:timeBase="media":

<p begin="00:30:00" end="00:31:00">
<audio src="#adtrack" begin="documentTime">
</p>

would at 30 minutes into the presentation begin playing back the audio track #adtrack from 30 minutes into that resource, for 1 minute, whereas:

<p begin="00:30:00" end="00:31:00">
<audio src="#adtrackfrag3">
</p>

would at 30 minutes into the presentation begin playing back the audio track #adtrackfrag3 from the beginning of that resource, for 1 minute.

And more generally:

<p begin="00:30:00" end="00:31:00">
<audio src="#adtrackfrag4" begin="35s" dur="20s">
</p>

would at 30 minutes into the presentation begin playing #adtrackfrag4 from 35 seconds into that resource, for 20 seconds, and then stop. We should specify that, for our purposes, the SMIL restart attribute has the value "never".

@nigelmegitt (Contributor, Author) commented Oct 7, 2016

I think that concludes my pre-editing notes. A diagram would be helpful in the spec to explain how the element hierarchy maps to an audio graph.

The next step is to turn these notes into spec text - @skynavga I can have a go at this, or would also be happy if you want to take it on and have time to do so.

@nigelmegitt (Contributor, Author) commented Oct 10, 2016

I realised I omitted one point: there needs to be a way to signal that the presentation intent of an element is to synthesise it into audio so that it generates an implicit audio node, even if the details of how to do the synthesis are wholly captured in other namespace data. The document author just needs to be able to indicate: "convert this text to speech" vs "do not convert this text to speech".

@nigelmegitt (Contributor, Author) commented Oct 21, 2016

Just a reminder to self that @skynavga and I agreed that @skynavga would do the spec work on this following these notes.

skynavga modified the milestone: TTML2WR Feb 23, 2017

@skynavga (Collaborator) commented Mar 1, 2017

As a preview, here's what I am currently doing:

  • add an audio style namespace, with prefix tta
  • add the following attributes in the tta namespace, primarily applied to tt:span, but a couple to tt:audio as well (some of these are based on SSML, some on the Web Audio API):
    • tta:contour
    • tta:duration
    • tta:emphasis
    • tta:gain
    • tta:pan
    • tta:pitch
    • tta:range
    • tta:speak
    • tta:voice
  • add the following local attributes to tt:audio
    • clipBegin
    • clipEnd

Should have a draft committed Wed afternoon.
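
For illustration only, a hypothetical fragment using the names above (the value syntax is not defined in this comment, so the attribute values shown are guesses):

<p begin="00:30:00" end="00:30:20">
<span tta:speak="normal" tta:gain="0.8" tta:pan="-0.5">The detective opens the door.</span>
<audio src="#adtrack" clipBegin="35s" clipEnd="55s"/>
</p>

i.e. the span would be synthesised to speech, attenuated and panned left of centre, and the audio element would play a 20 second clip of the referenced resource starting 35 seconds into it.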

@nigelmegitt (Contributor, Author) commented Mar 1, 2017

Thanks for the update @skynavga much appreciated. I had wondered about integrating SSML but thought the easier thing might just be to be open about mixing in other XML vocabularies. I'm not familiar enough with SSML specifically to know if it needs both elements and attributes, but this could be related to the recent conversation we have been having about foreign namespace element inclusion in elements other than metadata.

Someone mentioned a new requirement to me this morning at CSUN, which is to have a description audio clip run longer than its allotted time in the main media and cause the main media to pause as a result. I am not sure we need to support that at this stage, though, so I would recommend omitting it for the time being.

@nigelmegitt (Contributor, Author) commented Mar 1, 2017

To be precise, they didn't just mention it to me but as part of a session. Audio Description is a big theme this year...

@palemieux (Contributor) commented Mar 1, 2017

@nigelmegitt Is this Audio Description feature intended for authoring and/or distribution?

@skynavga (Collaborator) commented Mar 1, 2017 (comment minimized)

skynavga added a commit that referenced this issue Mar 2, 2017

@skynavga (Collaborator) commented Mar 2, 2017

A preliminary draft for AD additions can be found at [1]. There is still work to do, but there is a draft here for @clip{Begin,End} and tta:{gain,pan,speak}. I need to confer with @nigelmegitt about a number of details.

[1] https://rawgit.com/w3c/ttml2/3811a7e840299a314b81485ddb7e66f4025b108c/spec/ttml2.html#embedded-content-vocabulary-audio

@nigelmegitt (Contributor, Author) commented Mar 3, 2017

Thanks @skynavga, I'll contact you offline about this.

skynavga added a commit that referenced this issue Mar 27, 2017

@dronca commented Apr 14, 2017

This is a very significant feature, and it is very late to the TTML2 project. We are concerned that this issue will unnecessarily delay TTML2. This feature should be deferred to the next revision of TTML or to a separate specification.

@mikedo commented Apr 17, 2017

I'm struggling with using TTML as a packaging framework for multiple-media presentations. It's timed text. Image support was added to support pre-rendered text glyphs; there was no "slideshow" use case. TTML documents should be integrated with other media tracks (audio, images, video, etc.) using existing frameworks: HTML5, ISOBMFF, etc.

@nigelmegitt (Contributor, Author) commented Apr 18, 2017

@mikedo You seem to be assuming that visual renderings of text somehow have a higher priority than audio renderings. I do not agree with this.

The point that TTML documents should be integrated with other media tracks using existing frameworks is orthogonal to the AD feature. In the case of both captions/subtitles and AD, a presentation processor takes TTML, uses it to create some output media blended with the input media and presents that output. This is identical for both visual and audio presentation. I do not understand the distinction you make between the two. The other frameworks you mention are equally relevant for both.

@nigelmegitt (Contributor, Author) commented Apr 18, 2017

@dronca wrote:

We are concerned that this issue will unnecessarily delay TTML2. This feature should be deferred to the next revision of TTML or to a separate specification.

Since the drafting work for this is almost complete at https://github.com/w3c/ttml2/tree/issue-0195-audio-descriptions we should continue with it and, if there are unlikely to be enough implementors, consider the feature as being at risk in CR rather than removing it from the scope of a WD. Also, we cannot get review of the feature unless we publish it. The sense I have from industry and users is that there is very strong demand for an open standard for AD; understanding that the delta for supporting it in TTML2 is very low, I would like to include it if possible.

If as a Group we want to prioritise features in order to publish a working draft for wide review within a specific timescale, then we should do that across the whole set of issues we have rather than picking off individual issues in a piecemeal way. This isn't the place to have that discussion though.

@mikedo commented Apr 18, 2017

@nigelmegitt This is a pretty fundamental architectural disconnect. TTML is not (intended) to be a generic media wrapper. So, it is not (intended) to be anything like HTML or ISOBMFF. I really don't like the architectural boundary that this proposal crosses. And, how do you reconcile adding this with the various scope statements found in the Abstract and Introduction, such as: "TTML is expressly designed to meet only a limited set of requirements established by [TTAF1-REQ], and summarized in M Requirements."

@skynavga (Collaborator) commented Apr 18, 2017 (comment minimized)

@mikedo commented Apr 18, 2017

I didn't like them then either. It was not an accident that those requirements were not satisfied, marked "N" and not added in over a decade. So I guess it depends on how you read that table in Appendix M.

If you read the Abstract and Introduction, it is pretty clear that TTML1 (and also TTML2 unless you change it) is only about textual information, e.g.: "The Timed Text Markup Language is a content type that represents timed text media for the purpose of interchange among authoring systems. Timed text is textual information that is intrinsically or extrinsically associated with timing information." Both sections are only about text.

How does one construct an ISD?

How does it interact with related media objects, especially audio?

@skynavga (Collaborator) commented Apr 18, 2017

How does one construct an ISD?

The same way one does with images.

How does it interact with related media objects, especially audio?

If you mean user control of audio playback, that will be part of the document processing context and out of scope for TTML2. Basically, TTML2 will say that if audio is supported and enabled, then these are the implied semantics for playback. Two types of audio source will be supported: audio from resources and audio from a built-in speech synthesizer (if supported and enabled). Audio from both sources may be mixed. As audio comes in and out of temporal presentation scope, it will be added to or removed from the mix.

The document processing context will produce one audio output which can then be consumed by the larger, encapsulating application as it sees fit.
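
A rough illustration of that behaviour, using the attribute names from the current draft (markup hypothetical; attribute values are guesses):

<div>
<p begin="10s" end="20s"><span tta:speak="normal">A hillside at dawn.</span></p>
<p begin="15s" end="30s"><audio src="#birdsong"/></p>
</div>

From 10s to 15s only the synthesized speech is in scope; from 15s to 20s both sources are in scope and their outputs are mixed; from 20s to 30s only the referenced audio resource remains in the mix.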

@mikedo commented Apr 18, 2017

Audio is not at all like images, which are static and have no intrinsic timeline of their own. Audio is perhaps closer to a continuous animation. At the very least, I believe that a second note is needed in Appendix J.

If one accepts that audio is added to TTML (I do not), then yes, it is certainly reasonable to define the TTML processor models to have both visual and aural outputs. I just don't know how to connect the audio output to today's overall decoder models. Text overlays are pretty well understood. Audio "overlays" are not yet supported in consumer devices, for example. Or have I missed that these audio features are constrained to a production workflow profile for the BBC use case?

@skynavga (Collaborator) commented Apr 18, 2017

If you think of an image as a sequence of constant frames, then an image can be considered to have a timeline. Or consider an audio segment as a sequence of constant samples. So they can be compared to one another.

Perhaps you refer to the point that an audio sample sequence may have a different intrinsic duration than the text with which it is associated? Yes, this needs to be addressed; for example, audio whose duration is less than the text duration falls silent once the audio ends, and audio whose duration is greater than the text duration is clipped. So there are a few details to specify.
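
For instance (resource name hypothetical, and reading "text duration" above as the element's active interval), given

<p begin="10s" end="15s">
<audio src="#clip30s"/>
</p>

if the referenced resource is 30 seconds long, playback would be clipped when the p's 5 second interval ends; a 3 second resource would instead fall silent for the final 2 seconds of the interval.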

@mikedo commented Apr 18, 2017

I don't agree with your analysis of the analogy, but the point is that a mention in Annex J is needed.

I don't understand "text with which it is associated". In the BBC use case there is by definition no text in the same temporal interval as the audio. In any event the audio is associated with a temporal interval, not necessarily text.

Yes, more work is needed on boundary conditions. Why not require

mikedo closed this Apr 18, 2017

mikedo reopened this Apr 18, 2017

@mikedo commented Apr 18, 2017

[sorry slip of the mouse] ...that the interval align with the duration of the clip? Else, you also have to consider a clip start offset and the details around a clip that spans multiple intervals.

skynavga self-assigned this Apr 20, 2017

@LJWatson commented May 9, 2017

The inclusion of AD in TTML2 is both logical and important. Without AD it's impossible for blind and partially sighted people to fully understand and enjoy video content, so the more we can do to make the provision of AD as efficient as possible the better.

AD can be provided using synthetic speech, based on time markers and text, which strongly suggests that TTML is a good way to do it. With commercial services offering AD on this basis, the use case is much wider than just the BBC.

The AD requirements cover what is needed, and with draft spec text being close to complete, it would be a shame for AD not to be included in TTML2 IMO.

@palemieux (Contributor) commented May 9, 2017

@LJWatson The issue in my mind is not with AD, but with how it is integrated into the overall A/V pipeline. In particular, it would be very unusual for a timed text document to control audio mixing and routing, which is typically done at the playlist level. That may just be a gap in my understanding, though. I do not recall this complex proposal being presented in detail within the TTWG. I therefore suggest scheduling time in an upcoming meeting to review the requirements and the proposed solution, hopefully with yourself present.

@LJWatson commented May 9, 2017

Thanks @palemieux - if I can provide a useful perspective as an accessibility professional (and consumer of AD), I'd be happy to help.

@skynavga (Collaborator) commented May 9, 2017

I view the problem in simpler terms. A TTML presentation engine produces timed raster images already. Extending this to producing a composite, timed audio stream is no great conceptual leap. It is no more complex than the "mixing and routing" of text content into regions for rasterization. The net output is one stream of images and one stream of audio. It is then up to the containing application to decide what to do with these outputs.

In any case, I support inclusion of the Audio + AD features provided that this doesn't materially delay the process of making it to CR. It will then depend on whether there are sufficient implementations to claim it is not at risk when we propose PR.

@nigelmegitt (Contributor, Author) commented May 9, 2017

@palemieux a reminder: these requirements were circulated for discussion and raised on day 2 of our meeting at TPAC 2016 in Lisbon. We also went through them in more detail during the joint TTWG/Web and TV IG meeting.

In this case the completed TTML document would be used as source data to control audio mixing of the AD track, including timed pan and fade settings. Since this information is needed as part of the AD authoring process, it is relevant to include it in the TTML document itself. It could, if necessary (and possible), be transferred into a playlist-level format for onward distribution, noting that there are use cases for client-side mixing as well as content-provider-side mixing. Typically, client devices do not process playlists.
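
Purely as a sketch of what "timed pan and fade settings" might look like in the document itself (hypothetical markup; this assumes tta:gain ends up being animatable via the existing TTML2 animate element, and exactly which elements these attributes may appear on is for the spec text to settle):

<p begin="00:05:00" end="00:05:20">
<animate begin="0s" dur="1s" tta:gain="0;1" fill="freeze"/>
<audio src="#description12" tta:pan="-0.3"/>
</p>

i.e. the AD clip would sit slightly left of centre in the stereo image, and the audio contributed at this point in the mix would be faded in over the first second of the interval.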

@mikedo commented May 9, 2017

Unfortunately, this technical discussion has devolved into the geo-politics of support for AD/VD. This is not about whether AD/VD is important. Of course it is. The question is its relevance within Timed Text. I believe adding this feature will be more complex to do correctly than currently envisioned.

I concede only that, as with all applications of TTML1/TTML2, a profile will eventually need to be defined to achieve useful interoperability. I look forward to that profile.

@nigelmegitt (Contributor, Author) commented May 9, 2017

@mikedo

this technical discussion has devolved into the geo-politics of support for AD/VD

Sorry, I don't think that's a fair representation of the discussion. All the comments have clearly been concerned with support for AD in TTML2, not the merits of AD per se.

I believe adding this feature will be more complex to do correctly than currently envisioned.

What technical reason is the basis for your belief? If you are planning to implement this functionality then I'd be delighted to learn that!

@palemieux (Contributor) commented May 9, 2017

raised on day 2 of our meeting in TPAC 2016 in Lisbon.

It was not discussed.

We also went through them in more detail during the joint TTWG/Web and TV IG meeting.

This is not the TTWG.

I do not think it is unreasonable to request a walkthrough of this complex feature in the WG.

In this case the completed TTML document would be used as source data to control audio mixing of the AD track, including timed pan and fade settings.

That sounds like a promising avenue for the purposes of authoring AD essence, but not for distribution, where the routing and mixing of tracks should not be handled within a given track.

@dronca commented May 16, 2017

We are concerned that using TTML for AD will make TTML more complicated, and will also place a burden on AD implementations, which would have to start from the very large and complex TTML specification, most of which is not relevant to AD (for example, what do regions, spans, and referential styles mean for AD?).

We could envision a workable compromise: an IMSC-equivalent standard for AD (IMDA?) that profiles away all non-relevant TTML features to make the AD model digestible. Likewise, IMSC2 would profile away AD to remove AD complexity from subtitles.

Were there agreement on this, then the last concern we have is with schedule. TTML2 is already late. AD work should not delay WR/CR entry/exit, or PR. If the feature and implementations are completed in time, then the feature would be part of the TTML2 PR. If not, then it would be deferred to TTML.next.

@nigelmegitt (Contributor, Author) commented May 19, 2017

@dronca I would likewise envisage and support an AD profile; it is too early to consider that, though: we cannot profile something that does not exist, so the primary goal now for TTML2 is to include the vocabulary so that such a profile can be created later. I would also agree that we should profile AD features out of IMSC2, since they are not appropriate there.

We cannot have AD implementations to support CR exit == PR entry unless the required features are in the CR, and they cannot be there unless they are in the WR. Therefore I do want them to be present, noting that, compared to what we already have in TTML2, the incremental difference is small. I've already stated the BBC's commitment to making an implementation.

By the way, you ask what regions, spans and referential styles mean for AD: I believe those questions have already been addressed in the notes at the head of this issue. Remember that use of regions is already optional in TTML.

@palemieux (Contributor) commented May 19, 2017

@nigelmegitt When will the meeting be scheduled to review this proposal in depth?

skynavga added the pr open label May 30, 2017

skynavga added pr merged and removed pr open labels Jun 18, 2017

skynavga added a commit that referenced this issue Jun 18, 2017

skynavga added a commit that referenced this issue Jun 18, 2017

skynavga added the pr open label Jun 18, 2017

skynavga added a commit that referenced this issue Jun 29, 2017

skynavga removed the pr open label Jun 29, 2017

@skynavga (Collaborator) commented Jun 29, 2017

Closing this with the merge of the audio-related feature designators. Regarding the interpolation calculation mode, please open a new issue if new functionality is required.

skynavga closed this Jun 29, 2017

skynavga removed their assignment Dec 25, 2017
