Add vocabulary to indicate which sections of a document are particularly 'speakable' #1389

Open
danbri opened this Issue Oct 5, 2016 · 15 comments

@danbri
Contributor

danbri commented Oct 5, 2016

Usecase:

"With use of text-to-speech on the rise in mainstream use-case scenarios such as smart speakers (Amazon Echo, Google Home), multimodal interaction on smart phones and in-car systems, there is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page."

A vocab draft:

@chaals

chaals Mar 3, 2017

Contributor

It seems like you're identifying the "key bits of the page", presumably as an initial view of it, a bit like <meta name="description" content="This is the most important page about speaking things"> but more directly oriented to consuming the content or interacting with it than to choosing between two or more pages.

I think that kind of summary has a fair bit of application beyond reading it out on a speech system. I like the model of being able to gather a few different pieces of the content together, but I'm wary of trying to tie it tightly to text-to-speech usage.

On the other hand, I am still thinking about this.

(The examples also seem to be a bit broken)

@LJWatson ping?

@LJWatson

LJWatson Mar 3, 2017

This seems like a useful property. When using a voice UI the interaction needs to be clutter free, or it becomes fairly horrible.

The only other use case for something like it is those tools that strip out the visual clutter of pages for better readability. I don't know whether the desirable content would be the same for both use cases though...

@danbri

danbri May 23, 2017

Contributor

@chaals @LJWatson - I've just posted a brief proposal to the JSON-LD group, who are working on improvements to JSON-LD. The idea would be for the cross-domain parts of this to be specified as something a JSON-LD parser might do, i.e. as @chaals says, not "tie it tightly to text-to-speech usage". Within the purely schema.org world, at least the 'xpath' and 'cssSelector' properties have nothing binding them to text-to-speech; other definitions and usecases could easily reuse them.

(edit - here's the issue I mentioned) - json-ld/json-ld.org#498
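
As a rough illustration of the point that nothing binds 'xpath' and 'cssSelector' to text-to-speech, here is a minimal sketch of a reader that collects both kinds of selector from whatever property holds them. The sample markup, its selector values, and the helper names `as_list` and `page_selectors` are assumptions made up for this example; only the terms `speakable`, `SpeakableSpecification`, `cssSelector` and `xpath` come from the draft under discussion.

```python
import json

# Hypothetical page markup. The property names come from the draft discussed
# in this thread; the selector values are invented for illustration.
PAGE_JSONLD = """
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Jane Doe's homepage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary"],
    "xpath": "/html/head/title"
  }
}
"""


def as_list(value):
    """JSON-LD allows a single value or an array; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def page_selectors(node, holder="speakable"):
    """Return (kind, selector) pairs found under node[holder].

    `holder` is a parameter so the same reader works for any property that
    points at a node carrying cssSelector/xpath, not only 'speakable'.
    """
    pairs = []
    for spec in as_list(node.get(holder)):
        if not isinstance(spec, dict):
            continue  # skip plain strings or other non-node values
        for kind in ("cssSelector", "xpath"):
            pairs.extend((kind, sel) for sel in as_list(spec.get(kind)))
    return pairs


selectors = page_selectors(json.loads(PAGE_JSONLD))
print(selectors)
```

The reader only reports which selectors were declared; evaluating them against the page's DOM is left to whatever consumer uses them, speech-related or not.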

@nicolastorzec

nicolastorzec Jun 22, 2017

Contributor

I'm with Chaals regarding clarifying the goal. Is it about:

  1. Annotating the portions of a page that would be particularly appropriate for reading out loud because the publisher thinks they could be accurately rendered via TTS? This option mostly makes sense when publishers are TTS experts...
  2. Annotating which portions of a page would be worth reading out loud because the publisher thinks they are the most important information on the page? This option is more about marking up prominent information than speakable information...
  3. Providing an alternate, speakable, version of the most prominent information on the page?

Also, we should look into SSML if we want to go beyond annotating the speakable portions of a page:

  • The Speech Synthesis Markup Language is "designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications."
  • SSML is supported by Amazon Alexa, Microsoft Cortana, and the Google Assistant.
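
To make the SSML suggestion concrete, the sketch below wraps already-extracted text passages in a minimal SSML fragment. It is an illustration under assumptions, not part of the proposal: the function name `to_ssml`, the `pause_ms` parameter and the sample passages are invented here. `<speak>`, `<p>` and `<break>` are standard SSML elements; a conforming standalone SSML document would additionally carry version and namespace attributes on `<speak>`, which are omitted for brevity.

```python
from xml.sax.saxutils import escape


def to_ssml(passages, pause_ms=400):
    """Wrap plain-text passages in a minimal SSML fragment.

    Each passage becomes a <p>; consecutive passages are separated by a
    <break/>. Text is XML-escaped so '&' and '<' cannot break the markup.
    """
    pause = '<break time="%dms"/>' % pause_ms
    body = pause.join("<p>%s</p>" % escape(text) for text in passages)
    return "<speak>%s</speak>" % body


ssml = to_ssml(["Jane Doe's homepage", "Recipes & notes"])
print(ssml)
```

This only covers the third option in the list above (an alternate, speakable rendering); it says nothing about how the passages were chosen.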
@gkellogg

gkellogg Jul 1, 2017

Contributor

In json-ld/json-ld.org#498 (comment) I suggested that it may simply be better to combine RDFa and JSON-LD on the page to address this, as RDFa allows HTML content to be referenced/extracted from the page using rdf:XMLLiteral or rdf:HTML. Adding some new kind of HTML selector as a value in JSON-LD seems like mixing domain metaphors.

I don't think any existing examples contain both JSON-LD and RDFa, but this is feasible and well-supported by existing processors.

danbri added a commit that referenced this issue Aug 1, 2017

Amended xpath and cssSelector properties to use new dedicated datatypes.
For #1389 #1672

Intent is that these types be applicable to usecases beyond SpeakableSpecification.

They are named with "*Type" to avoid the types having same spelling as the property.
@danbri

danbri Sep 11, 2017

Contributor

Some implementor feedback from Google: the "cssSelector" (and "xpath") property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector / xpath.

Note that this isn't "element" in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector.

I suggest adding WebPageElement as a type that these 2 properties are expected on.
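
A minimal sketch of what that could look like in markup, built here in Python so the output can be inspected. The helper name `page_element`, the use of `hasPart` to attach the element to the page, and the `.byline` selector are assumptions chosen for illustration; the thread itself only proposes allowing `cssSelector` and `xpath` on `WebPageElement`.

```python
import json


def page_element(css_selector=None, xpath=None, name=None):
    """Build a WebPageElement node that points into a page.

    Either selector may be omitted. A class selector such as '.byline' can
    match several HTML elements, so the node describes a part of the page
    rather than exactly one element.
    """
    node = {"@type": "WebPageElement"}
    if name is not None:
        node["name"] = name
    if css_selector is not None:
        node["cssSelector"] = css_selector
    if xpath is not None:
        node["xpath"] = xpath
    return node


page = {
    "@context": "http://schema.org/",
    "@type": "WebPage",
    "url": "http://www.janedoe.com",
    # Assumption: hasPart is used to hang the element off the page.
    "hasPart": [page_element(css_selector=".byline", name="Byline")],
}
print(json.dumps(page, indent=2))
```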

@danbri

Contributor

danbri commented Sep 12, 2017

@vholland

vholland Sep 12, 2017

Contributor

+1

@danbri

danbri Sep 14, 2017

Contributor

Proceeding on the basis that this is a commonsense combination of two terms with related semantics, I'm making an edit now to cssSelector, xpath, and the expected type associations of both. There might be some nuance in the details but it doesn't make sense having a type for parts of a page, a property for pointing into parts of a page, and failing to say how they relate!

danbri added a commit that referenced this issue Sep 14, 2017

Updated to explain how 'xpath', 'cssSelector' relate to 'WebPageElement' /cc #1389

Allowed both properties to be expected on that type.
@jvandriel

jvandriel Nov 27, 2017

Finally got a moment to respond to this...

Having read the discussion, I'm still wondering what exactly this proposal is trying to resolve.

In general, the part of a web page that should be 'speakable/pronounceable' is the main content of the page, which most of the time is something like an Article, BlogPosting, Product, Service, Recipe, etc. IMHO these all have plenty of properties (as does WebPage itself) for devices to be able to 'speak/pronounce' the textual content that matters.

At the same time, I can't help feeling that this proposal tries to bypass the WCAG accessibility guidelines, which IMO should suffice for devices (I can't imagine that things like speakers need more specific Types and attributes than screen readers, and visually impaired people, do).

Am I overlooking reasons why WCAG guidelines don't suffice here?

@anschluss80

anschluss80 Jan 20, 2018

Summary: I guess the WCAG guidelines and this 'speakable' proposal have different use cases and target audiences.

WCAG is about making the whole (main) content of a webpage accessible. The use case here is to serve the whole content to anyone who deliberately visits a specific website.

Voice assistants, on the other hand, should keep their answer to a specific question brief; a short summary of the page topic could fit quite well. A typical use case is a search engine query, where users won't visit the website but instead get an excerpt on the topic. What's more, users often do not know where the excerpt originates from (see example below).

An advisory from Amazon for Alexa responses: "Be brief"
https://developer.amazon.com/designing-for-voice/what-alexa-says/

And for Google: "Recommended: Less than 300 characters for each dialog turn."
https://developers.google.com/actions/assistant/responses

If you, for example, ask Amazon Alexa "Alexa, who is Chuck Norris", it will read the first sentence of the Wikipedia article on Chuck Norris, without mentioning the origin. At the time of writing, the answer is "Carlos Ray 'Chuck' Norris (born March 10, 1940) is an American martial artist, actor, film producer and screenwriter." (English Wikipedia). It's not the whole article, which you would expect when using a screen reader.

Just my two cents ;-)
Alex.

@jvandriel

jvandriel Jan 21, 2018

OK, I see sense in the point that WCAG's use case is different from that of this proposal. But for use cases like a 'title', there is the name property which could be used by speaking devices (or headline for creative works). And as for summaries, can't description be used for that? (In most cases it is less than 300 characters.)

What about 'speakable' parts of an Article: do we actually expect publishers to mark up individual <section>, <p>, <div> and <span> elements (when there are also properties like articleBody and articleSection)? That will be real fun if the rest of the markup gets published in JSON-LD; I can't wait to see how WYSIWYG text editors will cope with this (educated guess: they won't).

What about something like a short Answer (or Question, Review or any other form of user-generated content, for that matter)? Is it expected that both the text property and speakable > SpeakableSpecification be provided (which will more than likely contain exactly the same content)?

And lastly, what about something like a Product? A <meta name="description"> can now be ±300 chars in Google's organic search, which coincidentally is more or less the same number of chars a marketer would provide Amazon for a product's description. I therefore expect that in a product's case the <meta name="description">, description and speakable > SpeakableSpecification will all contain exactly the same content (as will the descriptions provided, in other formats, to marketplaces and price comparison sites).

Really, I get the intention of the proposal but I don't expect much more to come of it than publishers duplicating the same content multiple times to be able to populate multiple properties (in different formats for different parties).

Now I've worked with some very large publishers in the past and can tell you that all hell breaks loose when authors have to start providing new/multiple titles and descriptions for the same article (or product) because different media require different character counts - simply because this costs time (=money) they don't have and therefore this isn't a trivial matter for them!

Meaning, I'm pretty sure authors (as well as business owners/stakeholders) won't be happy at all if they have to start providing 'speakable' descriptions (of a certain length) as well - especially if this also involves doing this for multiple sections of an article or web page.

And from a CMS perspective I don't expect many positives either, as this will probably lead to authors having to fill out (many) more input fields in the CMS form of an article or in a product's PIM system (or even worse, force authors to start adding and managing CSS classes of elements for the cssSelectors, a fun job for sites with (hundreds of) thousands of articles or products).

Apologies if I sound negative (especially because I do like the idea of being able to easily serve speaking devices) but I just don't see publishers handling this proposal very well mostly due to technical/resource constraints (which will lead to the duplication of content), as well as time constraints for (professional) authors (as they already have so many things to fill out).

Try looking at it from a business perspective: what's there to be gained by website owners after they've spent a ton of time and resources to make this happen? I understand the ROI for companies that produce speaking devices, but what's the ROI for those implementing this proposal on their sites? Will this lead to users reading more articles or buying more products? If not, why would publishers bother to accommodate speaking devices? These are typical questions businesses need answers to, and guess what happens if the answers aren't in their favor? Absolutely nothing, as they'll see it as a waste of precious resources.

Can't this instead be resolved by having speaking devices simply use properties that already exist (and are being used by publishers)?

@akuckartz

Can HTML annotations be used?
https://www.w3.org/TR/annotation-html/

@postphotos

postphotos May 4, 2018

"Meaning, I'm pretty sure authors (as well as business owners/stakeholders) won't be happy at all if they have to start providing 'speakable' descriptions (of a certain length) as well - especially if this also involves doing this for multiple sections of an article or web page."

Thanks @jvandriel for describing these larger concerns; I hear you. A few thoughts I'd like to offer:

  • Many publishers already leverage an excerpt for social media tailoring and RSS feeds. I could see that being pretty close to a speakable selection from a CMS side, allowing an X,XXX-long article to make sense in <300 words... I'm thinking of Google's Featured Snippets or the Wolfram Alpha powered knowledge answers in iOS's Siri.

  • I would be happy if I could better define what answer these platforms see, especially if I'm using this spec on a site with high domain authority and lots of relevant content for a structured data consumer (like the two already mentioned).

  • I could see the speakable speakableSelection spec here allowing a user to better define, in limited terms, what the whole thing is about, with less detail and more summary. It could provide better context (just as meta descriptions or RSS excerpts already do) with a better focus on speakable-friendly vocabulary.

Thoughts?

@BigBlueHat

BigBlueHat Jun 4, 2018

FWIW, the Web Annotation selectors encoding might provide more future-proofing, flexibility, and potential re-use throughout Schema.org (i.e. new selection systems can be added without new properties), though at the cost of being a bit more verbose.

So example 1 for SpeakableSpecification might support CSS Selectors, XPaths and fragment identifiers together to become:

{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Jane Doe's homepage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "selector": [
      {"@type": "CssSelector", "@value": ".headline"},
      {"@type": "XPathSelector", "@value": "//summary"},
      {"@type": "FragmentSelector", "@value": "#speakable"}
    ]
  },
  "url": "http://www.janedoe.com"
}

Essentially, the WebPage is the targeted resource and the speakable property would create the ResourceSelection as a SpeakableSpecification (in this usage).

The Selectors and States note focuses on this part of the Web Annotation Data Model.

I'd be happy to help with some mappings between the two, if there's interest.
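
One possible shape for such a mapping, sketched under assumptions: the table `SELECTOR_TO_PROPERTY` and the function `to_schema_org` are hypothetical names, and the input uses the `@type`/`@value` layout of the example above rather than any normative Web Annotation serialization. Selector types with no counterpart among the properties discussed in this thread (here, `FragmentSelector`) are returned separately instead of being dropped.

```python
# Assumed mapping from the selector types in the example above to the
# schema.org properties discussed earlier in this thread.
SELECTOR_TO_PROPERTY = {
    "CssSelector": "cssSelector",
    "XPathSelector": "xpath",
}


def to_schema_org(spec):
    """Convert a 'selector'-list spec into cssSelector/xpath properties.

    Returns (converted_node, unmapped_selectors).
    """
    result = {"@type": spec.get("@type", "SpeakableSpecification")}
    unmapped = []
    for sel in spec.get("selector", []):
        prop = SELECTOR_TO_PROPERTY.get(sel.get("@type"))
        if prop is None:
            unmapped.append(sel)  # no equivalent property; keep for the caller
        else:
            result.setdefault(prop, []).append(sel["@value"])
    return result, unmapped


spec = {
    "@type": "SpeakableSpecification",
    "selector": [
        {"@type": "CssSelector", "@value": ".headline"},
        {"@type": "XPathSelector", "@value": "//summary"},
        {"@type": "FragmentSelector", "@value": "#speakable"},
    ],
}
converted, leftover = to_schema_org(spec)
print(converted)
print(leftover)
```

The leftover list makes the trade-off visible: the selector-list form can carry selection systems that the two dedicated properties cannot express.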
