We should document conventions for sites to share Schema.org-based "feeds" #2891

Open
danbri opened this issue Jun 1, 2021 · 23 comments
Labels
no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!).

Comments

@danbri
Contributor

danbri commented Jun 1, 2021

Over the years there have been various uses of Schema.org beyond its original founding "mark up the structure implicit in ordinary web pages" approach. It is time to explore and document some other approaches that can be used for publishing Schema.org data on the web.

Specifically, given a site, how might schema.org data for it be found, beyond extracting Microdata + RDFa + JSON-LD from each of its pages?

@danbri danbri self-assigned this Jun 1, 2021
@jvandriel

jvandriel commented Jun 1, 2021

Would it be an idea to include an HTTP response header example, such as link: <https://example.com/?schema>; rel="alternate"; type="application/ld+json", where the link refers to a JSON-LD document?

@jdevalk has a working implementation on his personal blog:

(and of course, theoretically a sitemap XML that refers to the json-ld documents could be created to help discovery)
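
As a rough illustration of the header suggested above, here is a minimal sketch of how a consumer might pick JSON-LD alternates out of a Link header value. The function name find_jsonld_alternates is hypothetical, and the parsing is deliberately simplified (it is not a complete implementation of the Link header grammar).

```python
import re


def find_jsonld_alternates(link_header):
    """Return target URLs of Link entries with rel="alternate" and a JSON-LD type.

    Simplified parsing: assumes each entry looks like <url>; name="value"; ...
    """
    targets = []
    # Split on commas that are followed by "<", i.e. the start of the next entry.
    for part in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", part, re.S)
        if not match:
            continue
        url, param_text = match.groups()
        params = {}
        # Collect parameters, accepting both quoted and bare values.
        for name, quoted, bare in re.findall(
            r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))', param_text
        ):
            params[name.lower()] = quoted or bare
        rels = params.get("rel", "").lower().split()
        media_type = params.get("type", "").lower()
        if "alternate" in rels and media_type == "application/ld+json":
            targets.append(url)
    return targets


header = '<https://example.com/?schema>; rel="alternate"; type="application/ld+json"'
print(find_jsonld_alternates(header))  # ['https://example.com/?schema']
```

A consumer would then fetch each returned URL and parse the response body as JSON-LD.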

@gkellogg
Contributor

gkellogg commented Jun 1, 2021

Rel=alternate would be fine for any given page, but not for the whole site. How about revisiting Semantic Site Maps?

@RichardWallis
Contributor

Building on @jvandriel's suggestion of linking to a json-ld document...

The document could effectively be a file (or api/download representation of one) that contains the json-ld representation for a page or a site or something in between.

How an external site would find them could be a dual pronged approach:

  • rel="alternate" for individual pages
  • Building on sitemaps, or Semantic Sitemaps

I think both would be needed to support a broad community of consumers and publishers.
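
For the sitemap side of this, a minimal sketch of what a consumer could do is shown below. It assumes, hypothetically, that a site lists its JSON-LD documents as ordinary <loc> entries in a sitemaps.org-style file; the example URLs and the function name sitemap_locations are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap that points at JSON-LD documents instead of HTML pages.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/schema/site.jsonld</loc></url>
  <url><loc>https://example.com/schema/products.jsonld</loc></url>
</urlset>"""


def sitemap_locations(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in sitemap_locations(EXAMPLE_SITEMAP):
    print(url)
```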

@jvandriel

"Rel=alternate would be fine for any given page, but not for the whole site. How about revisiting Semantic Site Maps?"

I hadn't heard of that until now, Gregg. Do you happen to know if there's a copy of its spec hidden somewhere? (I would love to read it, but alas the page can't be reached.)

@danbri
Contributor Author

danbri commented Jun 1, 2021

Initially I want to focus on larger granularity - i.e. addressing the issue of descriptions being smeared across 100s and 1000s of pages, with the same entity being repeatedly re-described. Will post a proposal shortly!

The idea of moving per-page content out into separate URLs also deserves consideration.

The question of how this relates to the sitemaps.org heritage, or the old semantic site maps proposals, is also complex. There may be potential to do something useful there. And I'm surprised nobody has mentioned RSS/Atom yet too.

@jonoalderson

jonoalderson commented Jun 1, 2021

I love the idea of splitting site-level data out. "This is my organization's description and opening hours" doesn't need to be on every single page of the website, for sure. Or if "This is the organization which published this blog post" could be referenced centrally, without repetition, that'd be rather lovely.
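
One way to picture "referenced centrally, without repetition" is a page that only carries an @id pointer to the organization, with the full description living in a single site-level document. The sketch below assumes that arrangement; the URLs, the example data, and the helper name resolve_references are all hypothetical.

```python
def resolve_references(node, site_index):
    """Replace bare {"@id": ...} references with the full node from site_index."""
    if isinstance(node, list):
        return [resolve_references(item, site_index) for item in node]
    if isinstance(node, dict):
        # A dict whose only key is "@id" is treated as a reference.
        if set(node) == {"@id"} and node["@id"] in site_index:
            return site_index[node["@id"]]
        return {key: resolve_references(value, site_index) for key, value in node.items()}
    return node


# Site-level data, published once (hypothetical content).
site_graph = [
    {
        "@id": "https://example.com/#organization",
        "@type": "Organization",
        "name": "Example Org",
        "openingHours": "Mo-Fr 09:00-17:00",
    }
]
site_index = {node["@id"]: node for node in site_graph if "@id" in node}

# Page-level data only points at the organization.
page = {
    "@type": "BlogPosting",
    "headline": "Hello",
    "publisher": {"@id": "https://example.com/#organization"},
}

resolved = resolve_references(page, site_index)
print(resolved["publisher"]["name"])  # Example Org
```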

@jdevalk
Contributor

jdevalk commented Jun 1, 2021

If we could do site level data at the homepage only, that'd be simple to implement. Then you also use the method @jvandriel linked above, specifically for the homepage, to get that data, so https://example.com/?schema.

@gkellogg
Contributor

gkellogg commented Jun 1, 2021

"Rel=alternate would be fine for any given page, but not for the whole site. How about revisiting Semantic Site Maps?"

"I hadn't heard of that until now, Gregg. Do you happen to know if there's a copy of its spec hidden somewhere? (I would love to read it, but alas the page can't be reached.)"

I've never implemented Semantic Site Maps, and the fact that the spec is no longer reachable doesn't speak well for using it as a pattern. You can still get the page from the Internet Archive, however.

Using a link relationship other than "alternate" might be a good practice, however. From the list of link relationship types, relationships such as "contents", "enclosure", "index", or "related" might be appropriate.

@vberkel

vberkel commented Jun 3, 2021

On website feeds or bulk exports: although we have also implemented webhooks and SPARQL endpoints with customers, our Schema Bulk Export method is worth mentioning.

We implement the bulk export with the Hydra Core Vocabulary, specifically its Collections, to access website data. While the link rel=contents/enclosure/index/related could point to that API, the Hydra draft spec also includes a section, 4.3 "Discovering a Hydra-powered Web API".

Yet even if you have a common entry point and data feed, the schema data should be published with an identity and a URI (@id) minting strategy; otherwise you're still going to have a deduplication challenge. Curious to see @danbri's proposal ...
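
To make the deduplication point concrete, here is a minimal sketch of merging descriptions that share an @id. It is only an illustration of why stable identifiers help, not the Schema Bulk Export method itself; the function name merge_by_id and the example data are assumptions.

```python
def _as_list(value):
    return value if isinstance(value, list) else [value]


def merge_by_id(nodes):
    """Merge JSON-LD-style dicts that share an @id; pass through nodes without one."""
    merged = {}
    anonymous = []
    for node in nodes:
        node_id = node.get("@id")
        if node_id is None:
            # Without an identifier there is nothing reliable to merge on.
            anonymous.append(node)
            continue
        target = merged.setdefault(node_id, {"@id": node_id})
        for key, value in node.items():
            if key == "@id":
                continue
            combined = list(_as_list(target.get(key, [])))
            for item in _as_list(value):
                if item not in combined:
                    combined.append(item)
            # Keep single values unwrapped, conflicting or multiple values as a list.
            target[key] = combined[0] if len(combined) == 1 else combined
    return list(merged.values()) + anonymous


# The same organization described on two different pages (hypothetical data).
harvested = [
    {"@id": "https://example.com/#org", "@type": "Organization", "name": "Example Org"},
    {"@id": "https://example.com/#org", "@type": "Organization", "telephone": "+1-555-0100"},
    {"@type": "Organization", "name": "Example Org"},  # no @id: cannot be merged safely
]

for entity in merge_by_id(harvested):
    print(entity)
```

Note how the third description, which lacks an @id, stays a separate entity even though it describes the same organization.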

@danbri
Contributor Author

danbri commented Jun 8, 2021

Related notes (chatting with my colleague @alex-jansen). Alex notes that for Shopping, these things often crop up and are (very roughly) at a site-level / merchant-level: Shipping, return level policies, taxes, general business information, contact information, business logo, ...

@WeaverStever

WeaverStever commented Jun 9, 2021

I looked into doing some sort of off-site JSON-LD declarations (without an embed widget). As I recall, there are some cross-site rules that would have to be negotiated.

Also, it would be handy to require the @id field, so that webmasters can append properties to scripts that originate from offsite.

There are a couple of Event services (embed) that have outdated / incomplete scripts -- it would be cool just to provide the missing information to (append to) those scripts instead of entirely re-creating the wheel. Alas, they do not provide @id fields on their scripts and don't seem to have any interest in updating.

@jvandriel

jvandriel commented Jul 9, 2021

Since it's mentioned in the latest release notes I thought it'd be of use for this discussion to link to the draft @danbri and @rvguha put together: https://schema.org/docs/feeds.html

@jvandriel

@danbri / @rvguha, I'm wondering why you suggest using /DataFeed, wouldn't an @graph with an array of entities suffice (or even an html document with multiple top level entities)?
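
To compare the two shapes being asked about, the sketch below builds the same entities once as a plain @graph and once wrapped in a DataFeed. It is one possible reading, not taken from the feeds draft; the entity data, dates, and feed name are made up. One visible difference is that the DataFeed wrapper has a place for feed-level metadata and per-item dates (DataFeedItem defines dateCreated, dateModified and dateDeleted), which a bare @graph does not.

```python
import json

# Hypothetical entities a site might want to share.
entities = [
    {"@id": "https://example.com/#org", "@type": "Organization", "name": "Example Org"},
    {"@id": "https://example.com/p/1", "@type": "Product", "name": "Widget"},
]

# Option 1: a plain @graph with an array of entities.
graph_doc = {"@context": "https://schema.org", "@graph": entities}

# Option 2: the same entities wrapped in a DataFeed.
feed_doc = {
    "@context": "https://schema.org",
    "@type": "DataFeed",
    "name": "Example site feed",
    "dateModified": "2021-07-09",
    "dataFeedElement": [
        {"@type": "DataFeedItem", "dateModified": "2021-07-09", "item": entity}
        for entity in entities
    ],
}


def feed_items(doc):
    """Unwrap the entities carried by a DataFeed shaped like feed_doc."""
    return [element["item"] for element in doc["dataFeedElement"]]


print(json.dumps(feed_doc, indent=2))
```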

@jdevalk
Contributor

jdevalk commented Jul 9, 2021

Note that I'm more than happy to experiment with the implementation I have on joost.blog, and if we need a bigger data set, I could easily push this feature into Yoast SEO and have it available on 2-3 million sites within a matter of weeks.

@github-actions

github-actions bot commented Oct 8, 2021

This issue is being nudged due to inactivity.

@github-actions github-actions bot added the no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!). label Oct 8, 2021
@Seirdy

Seirdy commented Feb 20, 2022

Another option could be a standard for "translating" other forms of markup into schema.org vocabularies. Microformats and Microformats2 have analogues in schema.org, for instance. Some RDFa vocabularies have parallels too: FOAF could map to schema.org/Person and DOAP could map to schema.org/SoftwareApplication and schema.org/SoftwareSourceCode.

I think that a standard for Microformats interop could prove quite useful, given how common Microformats are for certain types of content (e.g. blogs, the Fediverse).
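
As a sketch of what such a translation could look like, the snippet below converts a parsed Microformats2 h-card (in the canonical mf2 JSON shape, with "type" and "properties" arrays) into a schema.org Person. The property mapping in H_CARD_TO_SCHEMA is an illustrative assumption, not an agreed standard, and the function name is hypothetical.

```python
# Illustrative mapping only; a real standard would need to agree on these pairs.
H_CARD_TO_SCHEMA = {
    "name": "name",
    "url": "url",
    "email": "email",
    "photo": "image",
    "note": "description",
}


def h_card_to_person(item):
    """Translate one parsed mf2 h-card item into a schema.org Person dict."""
    if "h-card" not in item.get("type", []):
        raise ValueError("item is not an h-card")
    person = {"@context": "https://schema.org", "@type": "Person"}
    properties = item.get("properties", {})
    for mf_prop, schema_prop in H_CARD_TO_SCHEMA.items():
        values = properties.get(mf_prop, [])
        if values:
            # mf2 property values are always arrays; unwrap single values.
            person[schema_prop] = values[0] if len(values) == 1 else values
    return person


# Hypothetical parser output for a blog author's h-card.
h_card = {
    "type": ["h-card"],
    "properties": {
        "name": ["Jane Doe"],
        "url": ["https://jane.example/", "https://social.example/@jane"],
    },
}

print(h_card_to_person(h_card))
```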

@AlasdairGray
Contributor

Within the Bioschemas community we are finding that harvesting data from each individual page is too resource-intensive and time-consuming, particularly with the rise of single-page applications, which require the client to render the JavaScript.

This idea of data dumps (I find Feed does not provide the right mental picture for me) is something we would like to actively explore. We could do this within a limited scope of say Intrinsically Disordered Proteins.

@sneumann

Hi, @AlasdairGray pointed me here. I posted the following to public-schemaorg@w3.org this morning, but I think this could be a better place:

Currently, we are including schema.org markup in our web pages for a mass spectrometry database (Yes, that's Abbey's Major massspec from Navy CIS :-) ) embedded as <script type="application/ld+json">...</script>. See https://massbank.eu/MassBank/RecordDisplay?id=PB000123 for an example, and the bioschemas LiveDeploy for Links to the sitemap and SMV at https://bioschemas.org/liveDeploys

Instead of scraping, we are looking to (also) provide the same schema.org markup through an established API. In the DataCite and library world, people have been using https://www.openarchives.org/pmh/ as API.

The OAI-PMH spec defines six verbs (Identify, ListIdentifiers, ListRecords, GetRecord, ListSets, ListMetadataFormats) used for discovery and sharing of metadata.

My question, whether there is a popular solution / API to serve a collection of items beyond web scraping, has been answered by this issue, and I'd like to contribute to the discussion of potential avenues.

Yours,
Steffen
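
For readers unfamiliar with OAI-PMH, requests are plain HTTP requests with a verb parameter, so a harvester can be sketched in a few lines. The example below only builds request URLs, using the verb names as spelled in the OAI-PMH 2.0 specification; the base URL is hypothetical, and oai_dc is used as the metadata prefix because every repository must support it. A prefix for schema.org JSON-LD would be a repository-specific choice.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_PMH_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository endpoint.
BASE_URL = "https://example.org/oai"

print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

A harvester would typically start with Identify and ListMetadataFormats, then page through ListRecords.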

@AlasdairGray
Contributor

@danbri is there a reason why DataFeed was chosen as the type rather than DataDownload?

@danbri
Contributor Author

danbri commented Nov 8, 2022 via email

@AlasdairGray
Contributor

What we are generating doesn't really fit with the concept of a feed; we see these files as a distribution of the Dataset. They seem to fit better with the idea of a DataDownload.

@ivanmicetic

Two approaches to getting site-level dumps were explored and described in the BioHackathon2022 project #23 repo:

The Bioschemas community prefers the DataDownload approach within the Dataset type, since it is much simpler and most resources already have a working implementation of the Dataset type.
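
A minimal sketch of the DataDownload approach, assuming a resource that already publishes a Dataset description and simply adds a distribution pointing at the site-level dump. The dataset name, URLs, and the helper dataset_with_dump are hypothetical and not taken from the BioHackathon repo.

```python
import json


def dataset_with_dump(name, landing_page, dump_url, encoding_format="application/ld+json"):
    """Describe a Dataset whose distribution is a downloadable markup dump."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": landing_page,
        "name": name,
        "url": landing_page,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": dump_url,
            "encodingFormat": encoding_format,
        },
    }


# Hypothetical resource and dump location.
dataset = dataset_with_dump(
    name="Example protein annotations",
    landing_page="https://example.org/dataset",
    dump_url="https://example.org/dumps/markup.jsonld",
)

print(json.dumps(dataset, indent=2))
```

A consumer that finds this Dataset on the landing page can follow contentUrl to fetch all the markup in one request instead of crawling every page.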

@danbri
Contributor Author

danbri commented Nov 9, 2022 via email
