What happens when SSML alphabet not specified #1706

Closed
mattgarrish opened this issue Jun 17, 2021 · 6 comments · Fixed by #1723
Labels
Spec-TTS The issue affects the EPUB 3 Text-to-Speech Enhancements 1.0 WG Note

Comments

@mattgarrish
Member

Another of the holes in our TTS definitions is that we don't define what to make of an ssml:ph attribute that doesn't have an ssml:alphabet defined for it. Do we default to... IPA? To whatever the actual TTS engine supports by default?

It's not even required that there be an in-scope alphabet definition.

Also, what happens if the alphabet is defined but not supported? Default to IPA again?

We need to figure out which of these are really reading system requirements and which are just information passed to a TTS engine.

@mattgarrish mattgarrish added the Spec-TTS The issue affects the EPUB 3 Text-to-Speech Enhancements 1.0 WG Note label Jun 17, 2021
@murata2makoto
Contributor

Lentrance supports SSML and is a member of the Japanese DAISY Consortium. I will ask.

@mattgarrish
Member Author

mattgarrish commented Jun 17, 2021

I'm assuming that if a reading system is voicing the HTML, it's sending the text content to the engine, so if there isn't an alphabet or it knows the engine doesn't support the grammar, the reading system should just send the actual text instead of the value of the ssml:ph attribute.

But it would definitely be good to know what someone who has actually tried to implement this has made of our instructions.

Another oddity is that we don't even say to use the ssml:ph value in place of the text when having it voiced. There seems to be an assumption that the text and markup are sent to the TTS engine and it works out what to do with them.

That wasn't my experience in the past generating TTS from HTML, as we had to inject the phonemes into the text content of the files prior to voicing. The question I keep having when I look at these is: are we trying to work with known TTS engines, or is this trying to model a new type of TTS engine? If we're trying to create the latter, we need a lot more detail.
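The pre-processing approach mentioned above (injecting phonemes before voicing) could look roughly like the following. This is a minimal sketch, not what any particular reading system does: `to_ssml` is a hypothetical helper, the sample word and its X-SAMPA transcription are illustrative, and the sketch assumes the nearest `ssml:alphabet` declaration on an ancestor is the one in scope. It rewrites an XHTML fragment into SSML `phoneme` markup and falls back to the plain text when no alphabet is in scope.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace used for the ssml:ph and ssml:alphabet attributes.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
PH = f"{{{SSML_NS}}}ph"
ALPHABET = f"{{{SSML_NS}}}alphabet"


def to_ssml(element, alphabet=None):
    """Flatten an XHTML element into text plus SSML <phoneme> markup.

    Hypothetical helper: `alphabet` carries the in-scope alphabet down
    from ancestors; the nearest declaration wins.
    """
    alphabet = element.get(ALPHABET, alphabet)
    ph = element.get(PH)
    if ph is not None and alphabet is not None:
        # Replace the element's content with a phoneme wrapper.
        text = "".join(element.itertext())
        out = (
            f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
            f"{escape(text)}</phoneme>"
        )
    else:
        # No usable pronunciation: keep the text and recurse into children.
        out = escape(element.text or "")
        for child in element:
            out += to_ssml(child, alphabet)
    return out + escape(element.tail or "")


xhtml = (
    '<p xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:ssml="http://www.w3.org/2001/10/synthesis" '
    'ssml:alphabet="x-sampa">'
    'I say <span ssml:ph="t@meItoU">tomato</span>.</p>'
)
print(to_ssml(ET.fromstring(xhtml)))
```

Whether the output is wrapped in a `speak` element, or the phonemes are handed to the engine some other way, depends on the engine and is left out of the sketch.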

@murata2makoto
Contributor

Interactions between TTS engines and browsers/reading systems are an implementation-dependent gray area. Different guys appear to do different things. JDC is studying this topic (especially for ruby) and should continue to do so. Having said that, I do not think the first edition of the planned note can entirely solve this hard problem. Hopefully, we can at least say what implementations do about ssml:alphabet.

@mattgarrish
Member Author

> Different guys appear to do different things.

Right, this is what makes our requirements confusing.

> I do not think the first edition of the planned note can entirely solve this hard problem.

Agreed. I'd just like to see the requirements allow adaptation to different approaches.

Here, probably all we need to say is that the reading system should use the text content when an alphabet isn't specified, but is allowed to supply a default alphabet. Similarly, it should use the supplied phonemes in place of the text content when an alphabet is specified. We don't need to force a single solution or get bogged down in minutiae.
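The rule proposed here can be sketched in a few lines. This is only an illustration under stated assumptions: `content_to_voice` and its `default_alphabet` parameter are hypothetical names, not anything defined by the EPUB or SSML specifications, and the sample transcription is made up for the example.

```python
from typing import Optional, Tuple


def content_to_voice(
    text: str,
    ph: Optional[str],
    alphabet: Optional[str],
    default_alphabet: Optional[str] = None,
) -> Tuple[str, Optional[str], str]:
    """Return (kind, alphabet, value) for one element.

    `default_alphabet` is a hypothetical reading-system setting. Leaving it
    as None means "no default", so the text content is voiced instead.
    """
    effective = alphabet if alphabet is not None else default_alphabet
    if ph is not None and effective is not None:
        # An alphabet is known: voice the phonemes in place of the text.
        return ("phoneme", effective, ph)
    # No pronunciation or no alphabet: voice the text content.
    return ("text", None, text)


# No alphabet and no default: the text is used.
print(content_to_voice("tomato", "t@meItoU", None))
# No alphabet, but the reading system supplies a default.
print(content_to_voice("tomato", "t@meItoU", None, default_alphabet="x-sampa"))
```

Keeping the default as an optional setting leaves room for both models discussed in the thread: a reading system that always falls back to the text, and one that assumes a particular alphabet.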

An intro that clarifies that there are different models wouldn't hurt, either.

@murata2makoto
Contributor

murata2makoto commented Jun 22, 2021

@okayama247 Do you know what happens when the alphabet is not specified? I guess that implementations in Japan always assume x-jeita.

@mattgarrish
Member Author

I was looking at the SSML definition today, and it defines processing behaviours, including leaving it to processors to decide what to do when the alphabet is not specified:

> It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor. The default behavior when the alphabet attribute is left unspecified is processor-specific.

https://www.w3.org/TR/speech-synthesis11/#g9

To avoid inconsistencies with SSML, we should probably also adopt all the processing behaviours for the two elements, not just inherit their semantics as we currently do.
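For comparison, the two SSML behaviours quoted above might be modelled like this. It is a sketch only: `resolve_alphabet`, `UnsupportedAlphabetError`, and the `processor_default` parameter are hypothetical names, and raising an exception is just one way a processor could treat the error case.

```python
from typing import Optional, Set


class UnsupportedAlphabetError(ValueError):
    """Hypothetical error for an alphabet the processor cannot apply."""


def resolve_alphabet(
    alphabet: Optional[str],
    supported: Set[str],
    processor_default: str,
) -> str:
    """Pick the alphabet to use for a pronunciation, SSML-style."""
    if alphabet is None:
        # Unspecified: SSML leaves the default to the processor.
        return processor_default
    if alphabet not in supported:
        # Specified but not known or not applicable: SSML calls this an error.
        raise UnsupportedAlphabetError(alphabet)
    return alphabet


print(resolve_alphabet(None, {"ipa"}, "ipa"))
```

A reading system adopting this behaviour could catch the error and voice the text content instead, which would line up with the fallback discussed earlier in the thread.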
