Irfan Ali edited this page May 24, 2021 · 4 revisions

Extended Abstract

The W3C Speech Synthesis Markup Language (SSML) is seeing growing use in consumer-oriented products such as Amazon Alexa and Google Home, and in speech-based services such as Microsoft Cortana. A key benefit of SSML is that it can improve the quality of spoken presentation of digital content. SSML lets content authors control speech characteristics such as prosody and rate, pronunciation via phonetic text strings, pausing, the handling of numeric values, and other features. The W3C Pronunciation Task Force has identified the importance of SSML for ensuring pronunciation accuracy in educational assessment and learning materials, and is now proposing that the assistive technology community examine approaches for implementing SSML to enhance the quality of content spoken by text-to-speech (TTS) synthesizers. TTS has long been used by screen readers and other assistive technologies for people with disabilities, and it is now also widely used in popular applications such as voice assistants. However, there is currently no way for content creators to mark up HTML content so that it is spoken correctly and consistently by all commonly used TTS engines and operating environments. To address this gap, the Pronunciation Task Force has published a First Public Working Draft, Specification for Spoken Presentation in HTML, following several years of work on gap analysis and use cases, user scenarios, and specification development. The working draft presents two implementation approaches and solicits feedback so that one of them can move forward and become a standard.

Background

This document is part of W3C work to provide normative specifications and best-practice guidance so that text-to-speech (TTS) synthesis can properly pronounce HTML content. TTS has long been used by screen readers and other assistive technologies for people with disabilities, and it is now also widely used in popular applications such as voice assistants. Yet today there is no way for content creators to mark up HTML content so that it is spoken correctly and consistently by all commonly used TTS engines and operating environments. A specification for using SSML in HTML is intended to fill this critical gap.

In order to address the gap, the First Public Working Draft specification describes two possible technical approaches for author-controlled pronunciation of HTML content, using Speech Synthesis Markup Language (SSML). It includes an analysis of each approach:

  • multi-attribute approach: uses one or more element attributes with string values to convey each SSML function and property
  • single-attribute approach: uses a single element attribute with a JavaScript Object Notation (JSON) string to convey all SSML functions and properties
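
To make the difference between the two approaches concrete, the following Python sketch emits the same SSML substitution (`sub` with an `alias` property) in both forms. The attribute naming used here (`data-ssml-<function>-<property>` for the multi-attribute form, a single `data-ssml` attribute for the JSON form) and the helper names are assumptions for illustration only; the working draft is the authority on the actual attribute names and JSON structure.

```python
import json
from html import escape


def multi_attribute(function: str, props: dict) -> dict:
    """Multi-attribute approach: one HTML attribute per SSML property."""
    # Hypothetical naming scheme: data-ssml-<function>-<property>
    return {f"data-ssml-{function}-{name}": value for name, value in props.items()}


def single_attribute(function: str, props: dict) -> dict:
    """Single-attribute approach: one attribute holding a JSON string."""
    # Hypothetical attribute name "data-ssml"; compact separators keep it short.
    payload = json.dumps({function: props}, ensure_ascii=False, separators=(",", ":"))
    return {"data-ssml": payload}


def render_span(text: str, attrs: dict) -> str:
    """Serialize a <span> element, escaping attribute values and text."""
    attr_text = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<span {attr_text}>{escape(text)}</span>"


if __name__ == "__main__":
    props = {"alias": "World Wide Web Consortium"}
    print(render_span("W3C", multi_attribute("sub", props)))
    print(render_span("W3C", single_attribute("sub", props)))
```

In this sketch the multi-attribute form keeps every value a plain string that can be read without a JSON parser, while the single-attribute form keeps all SSML data in one place at the cost of escaping the JSON quotes inside the HTML attribute. These are the kinds of trade-offs on which the task force is requesting feedback.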

Both approaches satisfy the requirements and provide consistent results. The W3C Pronunciation Task Force seeks feedback from authors and implementors on which approach would be most implementable across all applications of spoken presentation. Also, the task force would appreciate answers to the following questions:

  • Are there additional approaches that are not described in this document?
  • Are there aspects of these approaches that are incorrect or insufficiently defined?
  • Have we overlooked some aspect in our analysis that should be addressed?

This feedback will help determine which approach will become the final standard.