Introduce datatypes for CssSelector and XPath #1672

Open
danbri opened this issue Jun 19, 2017 · 22 comments

@danbri
Contributor

danbri commented Jun 19, 2017

Suggestion building on (experience implementing) the xpath/css mechanism from SpeakableSpecification, i.e. #1389.

Context: http://pending.schema.org/SpeakableSpecification was added in v3.2 (including 'xpath' and 'cssSelector' properties which expect 'Text' values), in "pending review" area of schema.org.

Proposal:

  • XPathType (new datatype, subtype of Text), "Text that encodes a W3C XPath".
  • CssSelectorType (new datatype, subtype of Text), "Text that encodes a CSS selector".

Motivation: to allow applications to offer more accurate validation, error checking, and automated coercion to other representations of these datatypes. Also to help decouple the generic aspects of the SpeakableSpecification proposal from its Text-to-Speech specifics.

Currently http://pending.schema.org/xpath expects a value of type Text; if these datatypes went through, it could expect XPathType instead (and the schemas would declare this to be a specialization of Text). Similarly for cssSelector.
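
As a rough illustration of the "automated coercion" idea, here is a minimal Python sketch. It assumes a hypothetical lookup table, `EXPECTED_DATATYPE`, recording which of the proposed datatypes each property would expect; the table and the helper name `coerce_literal` are made up for this example and are not part of schema.org or any library.

```python
# Hypothetical table: the datatype each property would expect if the
# proposal were adopted. The type names come from the proposal; the table
# itself is an assumption made for illustration.
EXPECTED_DATATYPE = {
    "xpath": "XPathType",
    "cssSelector": "CssSelectorType",
}


def coerce_literal(prop, value):
    """Wrap a plain string in a JSON-LD value object typed for `prop`.

    Properties without an entry in EXPECTED_DATATYPE, and values that are
    not plain strings, are returned unchanged.
    """
    datatype = EXPECTED_DATATYPE.get(prop)
    if datatype is None or not isinstance(value, str):
        return value
    return {"@value": value, "@type": datatype}


print(coerce_literal("xpath", "/html/body/h1"))
print(coerce_literal("name", "Rescue Add On"))
```

Because both types specialize Text, a consumer that ignores the datatype can still treat the value as an ordinary string.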

Potentially there could be subtypes tied to versions of XPath and CSS. My understanding is that XPath is more explicitly versioned than CSS, so perhaps explicit XPath versions would be more necessary? e.g. https://www.w3.org/TR/xpath-30/#nt-bnf

/cc @chaals

@danbri danbri self-assigned this Jun 19, 2017
@chaals
Contributor

chaals commented Jun 19, 2017

At a technical level this makes sound sense, but to make it work we need to have really good documentation, examples, and probably a place where people can test and get a sense of what it does by playing - and a big warning if they are doing it wrong.

@danbri
Contributor Author

danbri commented Jun 19, 2017

@chaals thanks. any sense for the versioning aspect? in terms of making a viable validator/checker, there's the "does this look like the right kind of formatted string" aspect, ... but then there's the "which bits of some HTML document does it match" side too. I am not an expert but I guess there are xpath expressions that would match different bits of doc depending on the assumed xpath version?
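
The two aspects of a checker mentioned above can be sketched separately. The following Python example is only an approximation: it assumes the expression is a simple absolute path of the form `/html/body/p[1]` (far narrower than the XPath 1.0 grammar), and it relies on the limited XPath subset in Python's `xml.etree.ElementTree`, which needs well-formed markup. The names `looks_like_simple_path` and `select_elements` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only simple absolute paths such as /html/body/p[1] are handled.
SIMPLE_PATH = re.compile(r"(?:/[A-Za-z][A-Za-z0-9]*(?:\[[1-9][0-9]*\])?)+")


def looks_like_simple_path(expr):
    """Aspect 1: does the string have the right shape?"""
    return SIMPLE_PATH.fullmatch(expr) is not None


def select_elements(expr, root):
    """Aspect 2: which elements of the parsed document does it select?

    ElementTree rejects absolute paths, so the first step (which must name
    the root element) is checked by hand and the rest is evaluated
    relative to the root.
    """
    if not looks_like_simple_path(expr):
        raise ValueError("unsupported expression: " + expr)
    first, _, rest = expr[1:].partition("/")
    name, _, predicate = first.partition("[")
    if name != root.tag or predicate not in ("", "1]"):
        return []
    if not rest:
        return [root]
    return root.findall("./" + rest)


doc = ET.fromstring("<html><body><p>first</p><p>second</p></body></html>")
print(looks_like_simple_path("/html/body/p[2]"))
print([e.text for e in select_elements("/html/body/p[2]", doc)])
print(select_elements("/html/body/footer[1]", doc))
```

A string can pass the first check and still select nothing in a given document, which is why the two checks are kept apart.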

@chaals
Contributor

chaals commented Jun 19, 2017

I'm not an xpath expert either... I'll ask one if I find one. But my rough sense is that we should leave out the version thing unless there is a screaming need for it. As far as I understand, the versions are generally not going to result in a particular xpath pointing to a different part of the same document just because it uses a different version.

@darobin
Contributor

darobin commented Jun 19, 2017

I can't think of XPath constructs that would match different things depending on the XPath version in use. If you use constructs from an unsupported version, most likely you will just get an error. This is also the behaviour you are likely to get from CSS Selectors.

I don't think specifying the version is really useful. For the use case at hand, v1 should be more than enough. It's also the only version you're likely to find in a browser or in JS. There are some considerations applying to usage of XPath in HTML that might be good to link to.

@danbri
Contributor Author

danbri commented Jun 19, 2017

Thanks @chaals @darobin - yeah I was leaning towards implying latest/v3 but not creating types for all 3. But point taken re v1 and JS.

Ping @tmarshbing @scor @nicolastorzec @rvguha @vholland @tilid

Any objection to my going ahead and sketching this out within the context of pending.schema.org?

I feel it could give us a useful primitive for making stronger links between schema.org data and the browser environment / non-schema.org web content.

@chaals
Contributor

chaals commented Jun 19, 2017

sketch away :)

@darobin
Contributor

darobin commented Jun 20, 2017

I would suggest implying (or even specifying) v1. Switching to a higher version later if needed will be painless, which is not true of the reverse.

@nicolastorzec
Contributor

Sketch away.

@danbri
Contributor Author

danbri commented Jun 27, 2017

Ok, I'll make a pass at this. cheers...

danbri added a commit that referenced this issue Aug 1, 2017
For #1389 #1672

Intent is that these types be applicable to usecases beyond SpeakableSpecification.

They are named with "*Type" to avoid the types having the same spelling as the property.
danbri added a commit that referenced this issue Aug 10, 2017
@danbri
Contributor Author

danbri commented Jul 15, 2020

This is still in Pending. The nature of the (fairly well adopted) Speakable specification is such that these terms are only used to define the vocabulary deployed for SpeakableSpecification, and won't themselves be appearing in actual markup. Since speakable stuff went into the core, I think these ought to accompany it. Any objections?

/cc @RichardWallis

@RichardWallis
Contributor

@danbri
See no reason not to move to core.
However, if you then view the list of sub-datatypes of Text, htmlText seems to be an obvious absentee.

@github-actions

This issue is being tagged as Stale due to inactivity.

@github-actions github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Sep 14, 2020
@KellySEO461

I've tried using this datatype for the xpath however it's showing errors in the validator which weren't there before, e.g.

xpath | /html/body/div[1]/section[4]/div[1]/p[1] (No matches found for expression /html/body/div[1]/section[4]/div[1]/p[1].)

This is the type of structured data I'm using for example:

"mainContentOfPage": [
  {
    "@type": "Table",
    "name": "Rescue Add On",
    "xpath": "/html/body/div[1]/section[4]/div[1]/p[1]",
    "sameAs": "https://www.wikidata.org/wiki/Q337810"
  },

I've also tried adding XPathType like this:

"mainContentOfPage": [
  {
    "@type": "Table",
    "name": "Rescue Add On",
    "xpath": {
      "@type": "XPathType",
      "xpath": "/html/body/div[1]/section[4]/div[1]/p[1]"
    },
    "sameAs": "https://www.wikidata.org/wiki/Q337810"
  },

and I get this error - /html/body/div[1]/section[4]/div[1]/p[1] (The property xpath is not recognised by the schema (e.g. schema.org) for an object of type XPathType.)

I've also tried adding it like this:

"xpath": {
  "XPathType": "/html/body/div[1]/section[4]/div[1]/p[1]"
},

I've also tried using "text" rather than "xpath"; however, that doesn't work either. What am I doing wrong here, please? Thanks

@KalleOlaviNiemitalo

For JSON-LD, perhaps a value object makes sense:

{
  "@context": "https://schema.org/",
  "@type": "WebPage",
  "mainContentOfPage": [
    {
      "@type": "Table",
      "name": "Rescue Add On",
      "xpath": {
        "@value": "/html/body/div[1]/section[4]/div[1]/p[1]",
        "@type": "XPathType"
      },
      "sameAs": "https://www.wikidata.org/wiki/Q337810"
    }
  ]
}

But I don't know whether this is at all easier for software to consume than plain "xpath": "/html/body/div[1]/section[4]/div[1]/p[1]", as any software that does something specific with the https://schema.org/xpath property would also know the expected type of its value.
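
For what it's worth, a consumer can accept both spellings with very little code. The sketch below is a hedged illustration in Python; `xpath_literal` is a hypothetical helper name, and it only distinguishes a plain string from a JSON-LD value object (a dict carrying `@value`).

```python
def xpath_literal(value):
    """Return the XPath string from a plain string or a JSON-LD value object."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict) and isinstance(value.get("@value"), str):
        return value["@value"]
    # A dict without "@value" is a node object, not a literal.
    raise TypeError("expected a string or a value object with '@value'")


plain = "/html/body/div[1]/section[4]/div[1]/p[1]"
typed = {"@value": plain, "@type": "XPathType"}

print(xpath_literal(plain) == xpath_literal(typed))
```

The earlier attempt, `{"@type": "XPathType", "xpath": "..."}`, has no `@value` key, so it describes a nested node of type XPathType rather than a typed string. That is consistent with the validator complaining about an `xpath` property on XPathType.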

@KellySEO461

Thank you very much @KalleOlaviNiemitalo. I've tested that and it's still throwing an error on the schema.org validator. Using a value does make sense from a logical perspective, though.

I've also tested that the xpath exists on the page (which it does), so that is not the problem.

@KalleOlaviNiemitalo

I tried copying the following to https://validator.schema.org/:

<html>
 <head>
  <title>Demo page</title>
  <script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "WebPage",
  "mainContentOfPage": [
    {
      "@type": "Table",
      "name": "Rescue Add On",
      "xpath": {
        "@value": "/html/body/div[1]/section[4]/div[1]/p[1]",
        "@type": "XPathType"
      },
      "sameAs": "https://www.wikidata.org/wiki/Q337810"
    }
  ]
}
  </script>
 </head>
 <body>
  <div>
   <section></section>
   <section></section>
   <section>
    <div>
     <p>Not referenced in JSON-LD.</p>
    </div>
   </section>
   <section>
    <div>
     <p>The XPath refers to this.</p>
    </div>
   </section>
   <section>
    <div>
     <p>This is not referenced, either.</p>
    </div>
   </section>
  </div>
 </body>
</html>

The validator did not report any errors or warnings. It displayed these results:

  • @type = WebPage
  • mainContentOfPage =
    • @type = Table
    • name = Rescue Add On
    • xpath = /html/body/div[1]/section[4]/div[1]/p[1]
    • value = The XPath refers to this.
    • sameAs = https://www.wikidata.org/wiki/Q337810

It is strange to me that @type is Table when the xpath references a p element rather than a table element. The validator did not warn about that, though.
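
A check for that kind of mismatch could look roughly like the Python sketch below. It is only a sketch under stated assumptions: the page is the demo body above with the JSON-LD script left out, the `EXPECTED_TAG` table (Table should point at a `table` element) is a guess made for illustration, and `describe_target` is a hypothetical name. It uses the standard library's `xml.etree.ElementTree`, which only handles well-formed markup and relative paths.

```python
import xml.etree.ElementTree as ET

# The demo page above, with the JSON-LD script left out.
PAGE = """<html><head><title>Demo page</title></head><body><div>
<section></section>
<section></section>
<section><div><p>Not referenced in JSON-LD.</p></div></section>
<section><div><p>The XPath refers to this.</p></div></section>
<section><div><p>This is not referenced, either.</p></div></section>
</div></body></html>"""

# Assumption: a schema.org Table is expected to point at an HTML table element.
EXPECTED_TAG = {"Table": "table"}


def describe_target(root, schema_type, xpath):
    """Resolve an absolute /html/... path and report a type/element mismatch."""
    # ElementTree only accepts relative paths, so drop the leading root step.
    prefix = "/" + root.tag
    if not xpath.startswith(prefix + "/"):
        raise ValueError("expected a path starting with " + prefix + "/")
    element = root.find("." + xpath[len(prefix):])
    if element is None:
        return "no match"
    expected = EXPECTED_TAG.get(schema_type)
    if expected is not None and element.tag != expected:
        return f"{schema_type} points at <{element.tag}>: {element.text!r}"
    return f"ok: <{element.tag}>"


root = ET.fromstring(PAGE)
print(describe_target(root, "Table", "/html/body/div[1]/section[4]/div[1]/p[1]"))
# prints: Table points at <p>: 'The XPath refers to this.'
```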

@KellySEO461

I can see that it works @KalleOlaviNiemitalo, which is great; however, it doesn't seem to work on live sites when you check the URL. I've triple-checked that the XPaths exist. (I realised after posting that I had copied the wrong xpath in there, sorry about that; I was working with several broken ones at the same time. It should have been the table one, as below.)

{
  "@type": "Table",
  "name": "Rescue Add On",
  "xpath": {
    "@value": "/html/body/div[1]/section[6]/div[2]/table",
    "@type": "XPathType"
  },
  "sameAs": "https://www.wikidata.org/wiki/Q337810"
},

However this still throws an error in Schema.org validator for both speakable and table (I've removed customer data for privacy reasons but this was the live site):
[Screenshot 2023-09-07 at 12 08 47]

Is there another way to do this, as prior to XPathType being added as a new type, this worked fine? I really appreciate your help.

@KalleOlaviNiemitalo

https://validator.schema.org/ apparently doesn't like it if the xpath property refers to an element that does not have any immediate text content, even if there is text in a child element.

<html>
 <head>
  <title>Demo page</title>
  <script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "WebPage",
  "hasPart": [
    {
      "@type": "WebPageElement",
      "xpath": "/html/body/footer[1]",
      "description": "That xpath deliberately does not match anything."
    },
    {
      "@type": "WPAdBlock",
      "xpath": "/html/body/p[1]",
      "description": "Validator is happy with this."
    },
    {
      "@type": "WPAdBlock",
      "xpath": "/html/body/p[2]",
      "description": "Validator complains 'No matches found' even though the element is there."
    },
    {
      "@type": "WPAdBlock",
      "xpath": "/html/body/table[1]",
      "description": "Validator complains 'No matches found' even though the element is there."
    },
    {
      "@type": "WPAdBlock",
      "xpath": "/html/body/table[1]//*",
      "description": "The validator considers each text node a separate value."
    }
  ]
}
  </script>
 </head>
 <body>
  <p>Buy our product!</p>
  <p><strong>Now you can have a second one for free.</strong></p>
  <table>
   <tbody>
    <tr><td>Would you like a bulk discount?</td></tr>
    <tr><td>Call our representative for more information.</td></tr>
   </tbody>
  </table>
 </body>
</html>
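
The distinction between "immediate" text and text in child elements can be reproduced outside the validator. The Python sketch below models the behaviour described here; it is not the validator's actual logic, which is not known from this thread. It parses the body of the page above with `xml.etree.ElementTree`; the helper names `immediate_text` and `descendant_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

BODY = """<body>
  <p>Buy our product!</p>
  <p><strong>Now you can have a second one for free.</strong></p>
  <table>
   <tbody>
    <tr><td>Would you like a bulk discount?</td></tr>
    <tr><td>Call our representative for more information.</td></tr>
   </tbody>
  </table>
</body>"""


def immediate_text(element):
    """Text held directly by the element, ignoring text inside its children."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


def descendant_text(element):
    """All text inside the element, including text in child elements."""
    return " ".join(t.strip() for t in element.itertext() if t.strip())


body = ET.fromstring(BODY)
for path in ("p[1]", "p[2]", "table[1]"):
    element = body.find(path)
    print(path, repr(immediate_text(element)), repr(descendant_text(element)))
```

Under this model, `p[2]` and `table[1]` have an empty immediate text even though their descendants contain text, which lines up with the two 'No matches found' cases.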

@KellySEO461

Thank you so much @KalleOlaviNiemitalo , that's fixed it!

It removes the error if you point the xpath at the first table row, e.g. "/html/body/div[1]/div[4]/div/div/div[2]/div/div[20]/div/table/thead/tr/td[1]", which actually contains text. Whether or not this will work for Google to understand the full table is yet to be seen, but hopefully it can.

I really appreciate your help, you are fantastic!

@KellySEO461

Interestingly, it appears to be a different story for "speakable", in that you need to reference the <p> tag which contains the text rather than the text itself, e.g. /html/body/div[1]/div[4]/div/div/div[2]/div/p[2]
rather than this one, which actually is the text: /html/body/div[1]/div[4]/div/div/div[2]/div/p[2]/text()

@KalleOlaviNiemitalo

For the example above, I had originally tried "xpath": "/html/body/table[1]//text()" but that didn't satisfy the validator.

@github-actions github-actions bot removed the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Oct 20, 2024
@DaveyJake

When viewing the cssSelector examples, the first example either has an error with the selectors or needs to include relevant microdata as the line "cssSelector": ["headline", "summary"] has no frame of reference.

The second example has an error regarding the xpath property. The microdata example contains an h1 but the meta tag shows the xpath value as /html/body/h3.

8 participants