Phonetic Markup Proposal #5

Open · pipfrosch opened this issue Jul 24, 2020 · 4 comments
The Problem

Accessibility is really important to me, but I will probably never have the
funds to provide audio versions of what Pipfrosch Press is publishing. Some of
my planned publications will have frequent content updates rather than just
being static publications. For example, my planned field guide to Contra Costa
County will likely never be finished, with new species accounts added every
year and existing species accounts modified with some frequency.

For print-disabled users, Text To Speech (TTS) synthesis will be how they
access the content.

ePub currently has two different mechanisms for providing pronunciation hints to
TTS synthesizers: the Pronunciation Lexicon Specification (PLS) and the Speech
Synthesis Markup Language (SSML).

When there is only one way to pronounce a grapheme, PLS is the better option, as
it allows a single document that can be updated as needed, either by the ePub
publisher or by a school or library. PLS also supports multiple phonetic
alphabets at the same time.

Where there are multiple ways to pronounce a grapheme, SSML is better because it
allows the pronunciation to be specified for each particular use of the
grapheme. However, SSML only allows a single phonetic alphabet to be specified.

Unfortunately neither solution allows for regional pronunciation variations.

Even though both PLS and SSML have been in the ePub standard for some time, they
are not implemented by the vast majority of ePub viewers. I have heard of one
custom viewer used by a Japanese school district that implements them, but I was
not able to confirm it.

I recommend a new, single solution that covers both use cases, allows for
region-specific pronunciations, and supports as many different phonetic
alphabets as the ePub publisher knows about.

This solution does not have to be restricted to ePub but could work with any
digital publishing format, including websites and PDF (though perhaps not as an
embedded solution within PDF; I do not know).

This probably should only become part of the ePub standard if Apple, Google, and
EDRLab are on board and are committed to implementing it in their software. How
to get them on board, I have no clue. I have social anxiety and as a result do
not often project confidence when proposing solutions, even were I to find a way
to get their ear, and unfortunately, when something is proposed without an
appearance of confidence, those with the power to implement it often cannot see
past the presentation to the value of what is being presented.

This solution probably needs to be adjusted by those with far more experience in
the issues related to TTS synthesis than I have, but it should be fairly easy to
extend as is.

It probably needs to be yet another W3C project for experts in the field to
refine. It is my hope that someone who knows how to work the system to make
things happen sees the value in this and runs with it. I do not need any credit
if that happens; I just want a solution that works well as I publish my ePubs, a
solution that brings print-disabled users enjoyment rather than frustration.

JSON Pronunciation Library

Example JSON file attached.

The format for the JSON Pronunciation Library shall be JSON. JSON was chosen for
the ease with which valid JSON files may be generated from database queries in a
number of programming languages, including Python and PHP. I am personally a big
fan of XML, but this, I think, should be JSON.

The character set for the JSON pronunciation library will be UTF-8.

The first definition in the JSON pronunciation library shall be lang and shall
be assigned either a string value of a BCP-47 language code or a list of BCP-47
language codes.

Examples:

"lang": "en"
"lang": "en-US", "en-GB"

In most cases, the generic language is to be preferred over a localized
language.

The text to speech synthesizer will only use a JSON Pronunciation Library that
matches the currently specified language within the (X)HTML document. For
example, if the current document is specified as "en-US", then a JSON
Pronunciation Library with lang="es" would not be used for pronunciations,
except for a string within a node labeled with the XML attribute lang="es".
This avoids collisions where languages that share the same alphabet have words
with an identical grapheme that are pronounced quite differently, and it allows
the text to speech synthesizer to fall back on its own pronunciation algorithms
in the event that an entry exists for one language but does not exist for the
language specified for the string being read.
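For example, in the following fragment (a hypothetical illustration, relying on
standard (X)HTML lang semantics), a library with "lang": "es" would be consulted
only for the contents of the inner span:

<p lang="en-US">In Spanish, the frog is called <span lang="es">la rana</span>.</p>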

Pronunciation Context Dictionary

The JSON Pronunciation Library will have at least one context dictionary named
default but may have additional context dictionaries. In the example
JSON Pronunciation Library, additional context dictionaries named taxonomy
(for taxonomy names) and proper (for proper names) are provided.

The default context dictionary is to be used by TTS synthesizers either when
a context is not specified or when the grapheme is not found in the specified
context dictionary.

Each context dictionary will have a list named entries.
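Putting the pieces together, a minimal sketch of the overall file layout might
look like the following. The exact top-level nesting and the phonemes shown are
illustrative assumptions drawn from the descriptions in this proposal; see the
attached example file for the full details:

{
  "lang": "en",
  "default": {
    "entries": [
      {
        "grapheme": "job",
        "ipa": "dʒɑb",
        "x-sampa": "dZAb"
      }
    ]
  },
  "taxonomy": {
    "entries": []
  },
  "proper": {
    "entries": []
  }
}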

grapheme entry

Each context dictionary entry list item must have a grapheme definition that
specifies either a string or a list of strings. Examples:

"grapheme": "job"

"grapheme": ["estivate", "aestivate", "æstivate"]

The specified grapheme should not be interpreted as case sensitive.

In cases where only one pronunciation for that grapheme is provided, one or more
phonetic alphabets with the corresponding phoneme can be specified. An example
that provides a phoneme for both ipa and x-sampa:

{
  "grapheme": ["estivate", "aestivate", "æstivate"],
  "ipa": "ˈɛstɪˌveɪt",
  "x-sampa": "EstI%veIt"
}

The text to speech synthesizer can then pick the alphabet it has the best
support for and use that phoneme to pronounce the grapheme.

speechpart

In some languages, the same grapheme may have a different pronunciation
depending upon the part of speech it is used as. For example, the grapheme
wind in English is pronounced differently, and has a different meaning,
depending upon whether it is a noun (or adjective) or a verb.

In those cases, a speechpart can be defined and the (X)HTML author should
specify the speech part with a span element. The speechpart will then hold
either the phoneme or regional variation phoneme. An example:

{
  "grapheme": "wind",
  "speechpart": {
    "noun" : {
      "ipa": "wɪnd",
      "x-sampa": "wInd"  
    },
    "verb" : {
      "ipa": "waɪnd",
      "x-sampa": "waInd"
    }
  }
}

When the speechpart is not specified by the (X)HTML, the text to speech
synthesizer may attempt to detect the part of speech based upon a grammatical
parsing of the sentence, as some seem to do already, but best practice
should be for the (X)HTML author to specify the speechpart as an XML
attribute on a span element around the grapheme.

When the speechpart is not determined or does not match a specified
speechpart, the first speechpart should be used. In the above example, with
the sentence "That is a beautiful wind turbine", wind is an adjective, but
since a pronunciation for the grapheme wind as an adjective is not specified,
the noun phoneme for wind would be used, since it is the first defined
speechpart.

Regional Pronunciation

Within the same language, sometimes a grapheme has a different pronunciation
depending upon political borders or cultural grouping.

An example of this is the grapheme vase. It seems to be pronounced differently
in America, in Great Britain, and in Australia, though I am not positive about
the last one.

In those cases, a list of phonemes for the grapheme may be provided. For example:

{
  "grapheme": "vase",
  "languages" : [
    {
      "lang": "en-US",
      "ipa": "veɪs",
      "x-sampa": "veIs"
    },
    {
      "lang": ["en-GB", "en-IE"]
      "ipa": "vɑz",
      "x-sampa": "vAz"
    },
    {
      "lang": "en-AU",
      "ipa" : "vɐːz",
      "x-sampa": "v6:z"
    }
  ]
}

In these cases, the lang specifies the pronunciation language rather than
the document language. A British user reading an ePub that specifies en-US
will probably prefer that words be pronounced the British way and in fact may
have lower comprehension if they are pronounced the American way.

However, there are cases, such as poetry where rhymes and near-rhymes are
important, in which the (X)HTML author should be able to specify that a
particular regional variation of the language be used.

Context Dictionary Use Cases

In some cases, such as taxonomy names and proper names, the correct way to
pronounce a word may differ from the way the same grapheme is ordinarily
pronounced.

The (X)HTML author should be able to define context dictionaries for these
special cases and use an attribute in a span or other element around the string
that alerts the text to speech synthesizer to look in the specified context
dictionary for the pronunciation before looking in the default context
dictionary. What these dictionaries are named should be up to the author.
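For example, an entry in a proper context dictionary might give the name Job
its traditional pronunciation, distinct from the common noun job found in the
default dictionary (the phonemes here are illustrative):

{
  "grapheme": "job",
  "ipa": "dʒoʊb",
  "x-sampa": "dZoUb"
}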

(X)HTML Attributes

The written language should be detected from the language specified in the
ePub OPF file's <dc:language></dc:language> element, while allowing that
language to be overridden within an XHTML document with the lang attribute,
such as one might have for a bibliography entry for a work that is written in
a different language.

At least in English and most languages I am familiar with, words are delimited
by white-space. How to specify that a sub-string including a space is a
grapheme the TTS synthesizer should look up in the library, I have not yet
considered, but it would be an attribute on a parent span (or whatever) node,
probably a boolean attribute (the kind represented without a value in HTML, but
with any value in XML, to indicate true). I understand that in Australia, they
call a root beer float a spider. Things like that could, at the discretion of
the (X)HTML author, be accommodated by specifying root beer float as a
grapheme. That is probably a very poor example, but there are other examples
where those of us who are not print-disabled see a string but read it in our
minds as words other than what is printed, especially strings that involve an
abbreviation.
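As a sketch, using a hypothetical boolean attribute (the attribute name is
illustrative, not something this proposal has settled on):

<p>On a hot day, nothing beats a <span speech-grapheme>root beer float</span>.</p>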

For the other attributes...

If it were up to me, I would create speech- attributes the TTS synthesizer
could trigger off of.

For specifying the speechpart, something like

<p>Remember to <span speech-part="verb">wind</span> your watch once a week.</p>

For specifying the spoken language to be used when it is critical that a
particular regional pronunciation be used:

<p>Tim reverted to his British roots when he started rhyming about the cool vase
he found in the woods, the perfect mother’s day gift he otherwise could not afford:</p>
<p speech-region="en-GB">“The vase, so boss, was buried in the moss.”</p>

Note that the speech-region attribute should trigger the text to speech
synthesizer to use an algorithm for the specified region, even for a grapheme
that is not specified in the JSON Pronunciation Library. In that example, even
if the speech synthesizer only had an algorithm for American English, it would
still read the grapheme vase correctly in the rhyme, because the British
pronunciation is provided in the library.

For specifying the context dictionary, something like:

<p>According to <abbr>Dr.</abbr> <span speech-context="proper">Job</span> Walters...</p>

Please, let's make this happen. For the present, even though no one is
implementing them, I will use PLS and SSML, but those systems have limitations
that could easily be solved by this kind of pronunciation library.

Thank you for your time.
Attachment: pronunciation.json.txt

mattgarrish (Member) commented Jul 24, 2020

Where you say:

Unfortunately neither solution allows for regional pronunciation variations.

This isn't strictly true for PLS; you just can't do it in a single file. Regional variations can be provided by designating the language of the lexicon on the link declaration:

<link rel="pronunciation" type="application/pls+xml" hreflang="en-us" href="en-us.pls"/>
<link rel="pronunciation" type="application/pls+xml" hreflang="en-gb" href="en-gb.pls"/>

But the obvious fact remains that there hasn't been any appreciable uptake of these technologies, and maintaining multiple files is arguably cumbersome, so it's kind of a moot point. :)

Have you had a look at the WAI Pronunciation work, though? They're working at a solution for web-based content, so may be a more appropriate place to take your proposal.

pipfrosch (Author) commented Jul 25, 2020

I'll take it to that group, but using hreflang doesn't allow users to get the pronunciation for the region they prefer in cases where a specific pronunciation doesn't matter to rhythm or rhyme, which can be important to comprehension.

EDIT: I am going to try to write the proposal a little more clearly and repost it at their GitHub; that definitely looks like the right place.

llemeurfr commented
That's an interesting proposal, @pipfrosch, thanks for that. Because Readium reading toolkits use the TTS features offered by the OS via a browser API (Chromium on PC/Mac/Linux, Chrome on Android, WebKit on iOS), there is nothing Readium can do if a feature isn't available on the underlying OS and browser.

I didn't develop the (quite simple) TTS feature available in the Readium Mobile Android toolkit, but I just had a quick look at the [Chrome TTS API](https://developer.chrome.com/apps/tts) and the [Android TTS engine](https://developer.android.com/reference/android/speech/tts/TextToSpeech) to get an overview of what is available on Android. It is pretty limited so far; SSML is supported by the API, but I didn't see anything related to lexicons. I encourage you to look at such APIs at the time you make a proposal at the WAI level, so that the discussion with TTS API and engine developers can be fruitful.

murata2makoto (Contributor) commented Dec 15, 2020

SSML as specified in EPUB 3 is used in Japan. Lentrance Reader supports it. Tokyo Shoseki (the biggest textbook publisher in Japan) uses it. There was a government project for the promotion of SSML. Here is one of its reports (in Japanese). I am sure that I can find more.
