ruoxiran edited this page Sep 2, 2024 · 18 revisions

Welcome to the Spoken Presentation Task Force wiki!

The objective of the Spoken Presentation Task Force is to develop normative specifications and best-practices guidance, collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text-to-speech (TTS) synthesis.

SSML required for Educational Assessments

Accurate, consistent pronunciation and presentation of content is an essential requirement in education, especially in assessment. It is currently a major challenge area, and assessment vendors are looking for a standards-based solution. SSML (as well as CSS3 and PLS) has been identified as an effective solution. However, support in content and assistive technology is lacking.

We have identified the following SSML features as being critical for implementation:

  • say-as
  • phoneme
  • sub
  • emphasis
  • break

One approach that appears to be emerging is an attribute model for incorporating SSML into HTML. This approach is already used in EPUB3 [https://idpf.github.io/a11y-guidelines/content/tts/ssml.html]. Some vendors are exploring a data-attribute model as a near-term solution for custom, built-in AT in assessment delivery platforms.

For each element, the specification, existing usage, and proposed solutions are described below.

say-as

Specification

Element: https://www.w3.org/TR/speech-synthesis/#S3.1.8

Attributes: https://www.w3.org/TR/ssml-sayas/

SSML usage

There are <say-as interpret-as="cardinal">10235</say-as> people in zip code <say-as interpret-as="characters">90274</say-as>

HTML Example

In the year <span say_as="date_year">1876</span> the telephone was invented.

The zip code is <span say_as="character">63105</span>.

Alternative approaches

  • The CSS3 Speech 'speak-as' property, which is not as complete as SSML say-as
  • aria-label, which is seen in the wild but introduces unacceptable braille issues

phoneme

Specification

Element: https://www.w3.org/TR/speech-synthesis/#S3.1.9

SSML usage

<phoneme alphabet="ipa" ph="təˈmeɪ toʊ">tomato</phoneme>

HTML Example

The <span phoneme="təˈmeɪ toʊ">tomato</span> is red.

Alternative approaches

  • aria-label is used by some, but the pronunciation text string is sent to both TTS and refreshable braille, which is unacceptable

  • Create custom dictionary entries for each AT

  • Use the PLS specification (requires TTS support, and does not address all contextual issues)

  • The SSML phoneme attributes are currently in the EPUB3 Specification. Uptake is limited (production tools, some usage in reading systems in Japan), although they are heavily used by the biggest textbook publisher in Japan (Tokyo Shoseki).

    The guitarist was playing a <span ssml:ph="beIs">bass</span> that was shaped like a <span ssml:ph="b&s">bass</span>.

sub

Specification

Element: https://www.w3.org/TR/speech-synthesis/#S3.1.10

SSML usage

<sub alias="World Wide Web Consortium">W3C</sub>

HTML Example

Common table salt is really <span substitution="Sodium Chloride">NaCl</span>.

Alternative approaches

  • aria-label is used by some, but the pronunciation text string is sent to both TTS and refreshable braille, which is unacceptable
  • Use the PLS specification (requires TTS support, and does not address all contextual issues)

emphasis

Specification

Element: https://www.w3.org/TR/speech-synthesis/#S3.2.2

SSML usage

That is a <emphasis level="strong"> huge </emphasis> bank account!

HTML Example

That is a really <span emphasis="strong">huge</span> car.

Alternative approaches

  • Screen readers and read-aloud tools could reliably and consistently change speech characteristics for emphasised text.

break

Specification

Element: https://www.w3.org/TR/speech-synthesis/#S3.2.3

SSML usage

Take a deep <break time="3s"/>breath.

HTML Example

Take a deep <span breakTime="3">breath</span> then continue.
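
As a rough illustration of the attribute model used in the HTML examples above, the following Python sketch rewrites hinted span elements as SSML. The attribute names (say_as, phoneme, substitution, emphasis, breakTime) are taken from the examples on this page; the mapping table, the class and function names, and the reading of a bare breakTime number as seconds placed before the span's text are assumptions made for illustration, not part of any specification.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mapping from the attribute names used in the HTML examples
# on this page to (SSML element, SSML attribute). Not a standard.
ATTRIBUTE_MODEL = {
    "say_as": ("say-as", "interpret-as"),
    "phoneme": ("phoneme", "ph"),
    "substitution": ("sub", "alias"),
    "emphasis": ("emphasis", "level"),
}


class AttributeModelToSsml(HTMLParser):
    """Rewrite hinted <span> elements as SSML and drop all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # closing SSML tag ("" if none) for each open span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        closing = ""
        for name, value in attrs:
            if name == "breaktime":  # html.parser lowercases attribute names
                # Assumption: a bare number means seconds, and the pause
                # precedes the span's text, as in the SSML usage example.
                self.out.append(f"<break time={quoteattr(value + 's')}/>")
            elif name in ATTRIBUTE_MODEL:
                element, attribute = ATTRIBUTE_MODEL[name]
                self.out.append(f"<{element} {attribute}={quoteattr(value)}>")
                closing = f"</{element}>"
        self.stack.append(closing)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.out.append(self.stack.pop())

    def handle_data(self, data):
        self.out.append(escape(data))


def to_ssml(html_fragment):
    """Return an SSML document for a small, well-formed HTML fragment."""
    parser = AttributeModelToSsml()
    parser.feed(html_fragment)
    parser.close()  # flushes any text still buffered by the parser
    return "<speak>" + "".join(parser.out) + "</speak>"


print(to_ssml('Take a deep <span breakTime="3">breath</span> then continue.'))
```

The sketch only shows that the attribute model carries enough information to reconstruct the SSML; it says nothing about how an AT would obtain these attributes through an accessibility API.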

Alternative approaches

Some further background

We'd like to see all three standards supported:

  • SSML
  • CSS3 Speech
  • PLS

All three have strengths:

  • precise, contextual author control (SSML)
  • standardized spoken presentation styles, without altering content (CSS3 Speech)
  • standardized pronunciation cues, without altering content (PLS)

In the near term, precise author control is a critical requirement. In education, the subject matter expert author understands the context and spoken requirement; the AT doesn't and shouldn't make assumptions.

Questions

So whose problem is this anyway? It breaks down, in my view, to the following:

  1. Content - a valid method (that doesn't break rendering) to encode SSML in HTML
  2. Assistive Technology - AT must be able to consume the SSML from the content, and...
  3. Text to Speech Engines - must consume and utilize SSML in rendering speech

Let's ignore (3) for the present, as we will assume that an SSML enabled TTS will be on the delivery platform.

For (1), let's assume we can come up with an attribute model for authoring the content with SSML.

That leaves the really hard problem of (2): how will the AT consume the SSML markup? Is there a mechanism in the accessibility API that allows the SSML cues to be consumed and kept separate from the visually rendered text (in the span, for example), so that the unhinted text is sent to the braille display and the text wrapped in SSML markup is sent to the synthesizer?

The same problem would have to be solved for CSS3 Speech support.
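
To make the desired separation concrete, here is a minimal Python sketch that produces both outputs from one HTML fragment: the unhinted text for a braille display and the SSML-wrapped text for a synthesizer. It assumes the hints are carried in a JSON-valued attribute such as the data-ssml attribute described later on this page, and that the fragment is well formed with no void elements. The class and function names are hypothetical, and the sketch does not model any accessibility API.

```python
import json
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class DualRenderer(HTMLParser):
    """Collect plain text for braille and SSML for speech from one fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.braille = []        # visible text only, no hints
        self.speech = []         # same text wrapped in SSML elements
        self.open_elements = []  # SSML end tag ("" if none) per open HTML tag

    def handle_starttag(self, tag, attrs):
        hint = dict(attrs).get("data-ssml")
        end_tag = ""
        if hint:
            # Assumption: the value is a JSON object with one SSML function.
            function, properties = next(iter(json.loads(hint).items()))
            rendered = "".join(
                f" {name}={quoteattr(value)}" for name, value in properties.items()
            )
            if function == "break":
                self.speech.append(f"<break{rendered}/>")
            else:
                self.speech.append(f"<{function}{rendered}>")
                end_tag = f"</{function}>"
        self.open_elements.append(end_tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input, so start and end tags pair up in order.
        if self.open_elements:
            self.speech.append(self.open_elements.pop())

    def handle_data(self, data):
        self.braille.append(data)
        self.speech.append(escape(data))


def render(html_fragment):
    """Return (braille_text, ssml) for a small HTML fragment."""
    renderer = DualRenderer()
    renderer.feed(html_fragment)
    renderer.close()  # flushes any text still buffered by the parser
    braille_text = "".join(renderer.braille)
    ssml = "<speak>" + "".join(renderer.speech) + "</speak>"
    return braille_text, ssml


braille_text, ssml = render(
    "1 <span data-ssml='{\"sub\": {\"alias\":\"pico meter\"}}'>pm</span> "
    "is equal to one trillionth of a meter."
)
print(braille_text)
print(ssml)
```

The point of the sketch is that the hint never enters the braille string, which is the property that aria-label-based workarounds fail to provide.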

What about the rest of SSML? For example, prosody and voice? While these two elements are not immediate, critical needs in the assessment context, there is probably no reason not to include them for general usage.

How much of this is just about getting browser and AT vendors to "support existing standards" and how much is identifying new features in WAI-ARIA or defining a standard mechanism for including SSML via an attribute model?

Proposal: Including SSML in HTML via WAI-ARIA

Currently, there is no standards-based mechanism for incorporating SSML into HTML to provide pronunciation or presentational hinting to assistive technologies which render text using text to speech synthesis (such as screen readers and read aloud tools). This issue has been previously shared with the WAI-ARIA group at TPAC 2016 [1].

The need for accurate pronunciation or presentation of spoken content is important in educational content, and critical in educational assessment. Across assessment vendors, a variety of approaches have been used to solve this problem, ranging from improper use of the WAI-ARIA standard, to creation of custom attributes or data-attributes. There is no consistent approach. Further, some of these approaches are problematic for braille users when hinted text intended for TTS spoken presentation is also rendered on the refreshable braille display.

We propose for your consideration a new WAI-ARIA attribute, tentatively named aria-SSML, which uses JSON to encapsulate SSML functions, attributes, and values in a manner that allows for easy consumption by assistive technologies.

aria-SSML

We have been experimenting with different approaches over the past several years and settled on the JSON approach, tested using a data attribute, data-ssml. The data-ssml attribute can be applied to HTML elements containing textual content. The attribute value is a JSON structure which contains the SSML function (e.g., “say-as”) and any required property-value pairs needed to fully specify the function. While we could propose simply standardizing on a data-attribute, we believe the importance of seeking AT support argues for moving this forward as an ARIA attribute.
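
As a sketch of how a consumer might read such an attribute value, the following Python function parses the JSON and returns the SSML function with its property-value pairs. The function name, the restriction to the five SSML functions listed earlier on this page, and the rule of exactly one function per attribute are assumptions made for illustration; the proposal itself does not state them.

```python
import json

# The five SSML functions identified on this page as critical.
SUPPORTED_FUNCTIONS = {"say-as", "phoneme", "sub", "emphasis", "break"}


def parse_ssml_hint(attribute_value):
    """Return (function, properties) from a JSON-encoded hint, or raise ValueError."""
    try:
        hint = json.loads(attribute_value)
    except json.JSONDecodeError as error:
        raise ValueError(f"hint is not valid JSON: {error}") from error
    # Assumption: one SSML function per attribute, as in the examples.
    if not isinstance(hint, dict) or len(hint) != 1:
        raise ValueError("hint must be a JSON object with exactly one SSML function")
    function, properties = next(iter(hint.items()))
    if function not in SUPPORTED_FUNCTIONS:
        raise ValueError(f"unsupported SSML function: {function}")
    if not isinstance(properties, dict):
        raise ValueError("SSML properties must be a JSON object")
    return function, properties


function, properties = parse_ssml_hint('{"say-as" : {"interpret-as":"characters"}}')
print(function, properties)
```

Rejecting malformed or unknown hints up front lets a consumer fall back to the element's plain text rather than passing unexpected markup to the synthesizer.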

say-as

The angle <span aria-ssml='{"say-as" : {"interpret-as":"characters"}}'>CAB</span> is 30 degrees.

phoneme

The words <span aria-ssml='{"phoneme": {"ph":"ˈkɔɹdəˌneɪt"}}'>coordinate</span> and 
<span aria-ssml='{"phoneme": {"ph":"ˈkɔɹdənɪt"}}'>coordinate</span> have different meanings.

break

The point <span aria-ssml='{"break":{"time":"250ms"}}'></span>
<span aria-ssml='{"say-as" : {"interpret-as":"characters"}}'>x,y</span> is on the coordinate plane.

sub

1 <span aria-ssml='{"sub": {"alias":"pico meter"}}'>pm</span> is equal to one trillionth of a meter.

emphasis

You <strong><span aria-ssml='{"emphasis": {"level":"strong"}}'>must</span></strong> answer 
all questions in order to continue.

Note that in this last example, the aria-ssml attribute could instead have been placed directly on the strong element.

SSML Tool

SSMLTool is a demonstrator for examining data-ssml support using the W3C Web Speech Synthesis API. The tool demonstrates the basic process of consuming JSON-encoded SSML contained in the value of the data-ssml attribute.

This code is made available "as is" for demonstration purposes, and not intended as a specific proposed method of implementing SSML support in HTML.

A live version is available at [http://www.ets-research.org/ia11ylab/ssmltool/]

Note that you will need either an SSML-aware synthesizer on Windows, or macOS, where we map common SSML features to their equivalents in the macOS TTS.

Tested on Windows with Chrome and Firefox using Ivona TTS, and on Windows 10 with Edge using Microsoft TTS. Also tested on macOS with Chrome and Safari using the native macOS TTS engine, with a custom mapping of SSML to the native Apple TTS commands.
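
The mapping used by SSMLTool is not reproduced on this page. Purely as an illustration of what such a mapping could look like, the Python sketch below approximates a few SSML functions with Apple's embedded speech commands (slnc for silence in milliseconds, emph for emphasis on the next word, char LTRL and char NORM for spelling out characters). The function names and the choice of commands are assumptions and should not be read as the tool's actual behaviour.

```python
def to_milliseconds(duration):
    """Convert an SSML time such as '250ms' or '3s' to whole milliseconds."""
    if duration.endswith("ms"):
        return round(float(duration[:-2]))
    if duration.endswith("s"):
        return round(float(duration[:-1]) * 1000)
    raise ValueError(f"unrecognized SSML duration: {duration}")


def to_apple_embedded(function, properties, text):
    """Approximate one SSML function with Apple embedded speech commands."""
    if function == "break":
        return f"[[slnc {to_milliseconds(properties['time'])}]]{text}"
    if function == "sub":
        return properties["alias"]
    if function == "emphasis":
        level = properties.get("level", "moderate")
        if level in ("strong", "moderate"):
            return f"[[emph +]]{text}"
        if level == "reduced":
            return f"[[emph -]]{text}"
        return text
    if function == "say-as" and properties.get("interpret-as") == "characters":
        return f"[[char LTRL]]{text}[[char NORM]]"
    # Anything else (for example phoneme with IPA) falls back to plain text,
    # since this sketch has no direct embedded-command equivalent for it.
    return text


print(to_apple_embedded("say-as", {"interpret-as": "characters"}, "CAB"))
print(to_apple_embedded("break", {"time": "250ms"}, ""))
```

A fallback to plain text keeps the content speakable when a function has no obvious native equivalent, at the cost of losing the author's hint.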

Key actions

Questions? Please write to Mark Hakkinen (mhakkinen@ets.org) or Irfan Ali (irfan.ali@blackRock.com).