JSONLD context doesn't have https variant #2853

Open
VladimirAlexiev opened this issue Feb 26, 2021 · 10 comments
Labels
no-issue-activity Discuss has gone quiet. Auto-tagging to encourage people to re-engage with the issue (or close it!).

Comments

@VladimirAlexiev

https://schema.org/docs/developers.html offers http and https variants of the ontology (though in #2852 I question whether that's a great idea).

However, requesting just the JSONLD context (see #2851) doesn't make that distinction.
Both of these

curl -I -L -Haccept:application/ld+json http://schema.org/
curl -I -L -Haccept:application/ld+json https://schema.org/

return the same link:

link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
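
A minimal sketch of why both requests end up at the same document: the Link target is a relative reference, so a client resolves it against whichever base URL it requested, and only the scheme of the resulting URL differs. The helper name `parse_link_header` is hypothetical, and the parser only handles a single-entry header like the one shown here, not the full Link header grammar.

```python
from urllib.parse import urljoin


def parse_link_header(value):
    """Parse a single-entry HTTP Link header into (target, params)."""
    target_part, _, param_part = value.partition(";")
    # The target is wrapped in angle brackets: </docs/jsonldcontext.jsonld>
    target = target_part.strip().lstrip("<").rstrip(">")
    params = {}
    for piece in param_part.split(";"):
        if "=" in piece:
            key, _, val = piece.partition("=")
            params[key.strip()] = val.strip().strip('"')
    return target, params


# Header value copied from the curl output in this comment.
header = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
target, params = parse_link_header(header)

# Resolve the relative target against each base URL that was requested.
for base in ("http://schema.org/", "https://schema.org/"):
    print(base, "->", urljoin(base, target))
```

Nothing in the header itself distinguishes an http flavour of the context from an https flavour.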

Whether you access it over http or https, it returns the same file, which defines ontology terms as http:

        "schema": "http://schema.org/",
@RichardWallis
Contributor

Firstly, although subtle, there is a difference between the purpose of the vocabulary definition download files and the JSON-LD context. The download files contain the RDF triples that define the vocabulary, in various serialisations. The context provides shortcut terms to use in JSON-LD code.

Because of this, and because there is no standardised, accepted way to request different versions of a context, the only version returned is the one that reflects the underlying coding of the data that defines Schema.org in the repository. Currently (as of version 11.0) that is http based.

It is worth noting, however, that as of version 12.0 (due for release soon and visible in draft form on webschemas.org), the underlying coding is moving to be https based.

@datadavev

@RichardWallis - can you clarify the meaning of "underlying coding is moving to be https based"?

Is the intent for the schema.org vocabulary to use https://schema.org/ as the IRI prefix for all schema.org terms defined in the context document? Specifically, will the context define:

{
  "@vocab": "https://schema.org/",
  "schema": "https://schema.org/",
  
  ...

or will those definitions continue to use http://schema.org/?

Discussion here and issues #2814 and #2852 imply that the switch to https was to occur in release 12.0.

(the following is copied from issue #2814 since that issue is not open)

It appears v12.0 was to use https://schema.org/ as the namespace for schema.org, switching from http://schema.org/ as indicated in the original v12.0 pre-release, 836cae7. However that change was reverted in 1856ba6. v12.0 release reports http://schema.org/ for the vocabulary URIs.

curl "https://schema.org/docs/jsonldcontext.jsonld"
{
  "@context": {
        "type": "@type",
        "id": "@id",
        "HTML": { "@id": "rdf:HTML" },

        "@vocab": "http://schema.org/",
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "schema": "http://schema.org/",
...

What is the intended vocabulary namespace for schema.org moving forward? https://schema.org/ or http://schema.org/?

The difference does impact downstream processing and recommendations for our community of implementors.
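
For anyone who wants to check which namespace a given copy of the context declares, here is a small sketch. It works on an excerpt of the context quoted above rather than fetching the live file, and the function name `schema_org_schemes` is hypothetical.

```python
import json
from urllib.parse import urlsplit

# Excerpt of the context quoted above; the real document is much longer.
CONTEXT_EXCERPT = """
{
  "@context": {
    "type": "@type",
    "id": "@id",
    "@vocab": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "http://schema.org/"
  }
}
"""


def schema_org_schemes(context_doc):
    """Map each context key that points at schema.org to its URL scheme."""
    result = {}
    for key, value in context_doc["@context"].items():
        if isinstance(value, str):
            parts = urlsplit(value)
            if parts.netloc == "schema.org":
                result[key] = parts.scheme
    return result


schemes = schema_org_schemes(json.loads(CONTEXT_EXCERPT))
print(schemes)
```

For the excerpt above, both "@vocab" and "schema" report the http scheme.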

@danbri
Contributor

danbri commented Mar 30, 2021

An update on this.

Firstly, I appreciate and share the desire for migration towards https everywhere.

During the v12 launch we did initially switch the entire context to use a vocab declaration of 'https://schema.org/' for all cases. This was part of @RichardWallis's efforts in #2814.

I regret that I had not realized this change was part of #2814 - it was problematic immediately (e.g. JSON-LD tests started failing in Apache Jena), because it changed the output triples of all parsers that use contexts in realtime. The change was immediately reverted because of this.

The idea of the URL for a http:-based context giving 'http:' triples, and https:-based giving a context designed to generate triples using 'https:', is one approach. While it would be difficult on our current (100% static served appengine) infrastructure, we could explore that further. However, that change would not address the larger problem - that of having a mix of 'http' and 'https' schema.org triples out there.

Background

Trends in web technology have made it clear that sites are going to rapidly move towards https:, and it was increasingly untenable for us to avoid redirecting e.g. http://schema.org/Event to https://schema.org/Event

At that point, we had a usability problem: the main URL for the documentation of Event was https, but most markup used http (whether rdfa, microdata, json-ld). That was how we ended up agreeing to say in the FAQ that both variations were fine, and that consumers would have to figure out the equivalences (https://schema.org/docs/faq.html#19).

The most recent changes to this codebase move us into an environment in which Schema.org's internal definitions use 'https' on-disk. Anyone working with schema.org in an RDF setting will need to decide whether to canonicalize to the http: or the https: form, and since both forms are very much "out there" in the wild, this is unavoidable. Consequently we publish a version of the definitions in both flavours, and expect this is likely to be needed for a while.

Switching the content of the JSON-LD context to generate https: triples is a very special situation. Unlike RDFa and Microdata, changes to that definition can alter the behaviour of software processes at a distance. If we do go there, I think it's the kind of change we ought to publicize at least a year in advance, with significant supporting documentation.

@jaygray0919

However, the https transition does fall into the category: it's inevitable. Separately we are working with folks to upgrade SPARQLer (Apache Jena) to support https IRIs in javascript programs. While a different domain, the https issue is a gating issue there too. But endpoints in SPARQL programs increasingly are problematic. Whether consuming or generating content, it seems to us that a fast transition to https is in our collective best interest.

@danbri
Contributor

danbri commented Mar 30, 2021

@jaygray0919 et al., can I suggest a different framing of the situation?

Schema.org has long expressed a few things that made sense in 2011, when optimizing for ease of adoption from webmasters/publishers who knew little about these technologies, were working solely in Microdata, and had relatively modest incentives to adopt. There was less expertise, tooling, documentation and advice to draw upon.

Hence, in the datamodel doc:

We expect schema.org properties to be used with new types, both from schema.org and from external extensions. We also expect that often, where we expect a property value of type Person, Place, Organization or some other subClassOf Thing, we will get a text string, even if our schemas don't formally document that expectation. In the spirit of "some data is better than none", search engines will often accept this markup and do the best we can. Similarly, some types such as Role and URL can be used with all properties, and we encourage this kind of experimentation amongst data consumers.

At Google, for example, we use some heuristics to normalize string-based shortcuts into thing-based structure. This isn't always easy, and involves determining a plausible type where possible. For example, "alumniOf": "Westergate Comprehensive" might get expanded into "alumniOf": { "@type": "Organization", "name": "..." }.
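
As an illustration only, the string-to-thing expansion described above could be sketched as follows. This is not Google's heuristic: the lookup table `ASSUMED_TYPES`, the function name, and the sample record are all hypothetical, and a real consumer would need far more care in picking a plausible type.

```python
# Hypothetical table: property name -> type to assume when the value is a bare string.
ASSUMED_TYPES = {
    "alumniOf": "Organization",
    "worksFor": "Organization",
}


def expand_string_shortcuts(node, assumed_types=ASSUMED_TYPES):
    """Return a copy of node with bare-string values wrapped as typed nodes."""
    expanded = {}
    for prop, value in node.items():
        expected = assumed_types.get(prop)
        if expected and isinstance(value, str):
            # Wrap the string as the name of a node of the assumed type.
            expanded[prop] = {"@type": expected, "name": value}
        else:
            expanded[prop] = value
    return expanded


# Sample record invented for illustration.
person = {
    "@type": "Person",
    "name": "Alice",
    "alumniOf": "Westergate Comprehensive",
}
expanded_person = expand_string_shortcuts(person)
print(expanded_person["alumniOf"])
```

Properties that are not in the table, such as "name", are passed through unchanged.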

It might be useful for consuming applications to work towards more shared canonicalization / normalization steps. Of these, mapping http: triples into https: would be amongst the easiest, since it is lossless, simple to implement, etc. If mappings exist between e.g. Schema.org and Dublin Core, Wikidata, FOAF, SKOS etc., we know that we can relatively easily create an https: version of such mappings.
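
A minimal sketch of that http: to https: mapping, assuming triples are held as plain tuples with IRIs as strings and literals wrapped in a small `Literal` class. The representation, the class, and the function names are assumptions made for this example, not part of any particular RDF library.

```python
from dataclasses import dataclass

HTTP_NS = "http://schema.org/"
HTTPS_NS = "https://schema.org/"


@dataclass(frozen=True)
class Literal:
    """Wrapper so literal text is never mistaken for an IRI."""
    value: str


def canonicalize_iri(term):
    """Rewrite an http schema.org IRI to https; leave everything else alone."""
    if isinstance(term, str) and term.startswith(HTTP_NS):
        return HTTPS_NS + term[len(HTTP_NS):]
    return term


def canonicalize_triples(triples):
    """Apply canonicalize_iri to every term of every triple."""
    return [tuple(canonicalize_iri(term) for term in triple) for triple in triples]


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
triples = [
    ("http://example.org/alice", RDF_TYPE, "http://schema.org/Person"),
    ("http://example.org/alice", "http://schema.org/name",
     Literal("http://schema.org/ appears in this text")),
]
canonical = canonicalize_triples(triples)
for triple in canonical:
    print(triple)
```

Keeping literals in a distinct type means that text which merely mentions http://schema.org/ is not rewritten; only IRIs in the schema.org namespace change.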

My view is that this kind of pre-processing will become increasingly important, and that we'll find more useful things to collaborate on in that space - e.g. shacl, shex etc.

Anyone who has looked at any kind of structured data from the wider Web knows that you can't just load it up into an application environment and use it without various kinds of cleanup, quality check, canonicalization, heuristics etc. This was true of Dublin Core, FOAF, Open Graph markup, and it remains true of Schema.org too. Data is inherently messy. It is unfortunate that we have this http vs https issue in the Schema.org ecosystem, but in terms of making data from the Web usable for applications it is a relatively simple problem.

@jaygray0919

Liking the idea ... On our side, we want to be fast followers, and defer to a consensus design where folks (who are smarter than we are on this issue) do the design or formulate the pre-processing 'linter'. As a user, we need a solution that doesn't get flagged/rejected by other subsystems (like a browser) or another processor that enforces system-wide rules. We face that problem today with some @context statements and nearly all browsers that immediately flag, and sometimes block, an http resource. And mixing http and https in some situations is like volunteering to live on death row. This problem amplifies with each new browser release, and each new network sensor that is trying to protect us from intrusion.

@datadavev

One approach that may be helpful for consumer canonicalization for the https namespace variant is to provide a separate schema.org context document that uses the https variant published in parallel with the existing http variant. Content creators can continue to reference the remote context document using a construct like:

{
  "@context": "https://schema.org/",
  "@type": "Thing",
  ...
}

A content consumer can adjust their mechanism for retrieving the remote context document by intercepting requests for the http variant schema.org context document and replacing it with the URL for the https variant. This is the approach being taken in DataONE and has a benefit of placing all accumulated schema.org content in a consistent namespace, simplifying subsequent processing.

Depending on the library being used for processing, it can be very straightforward. Here's a worked example using the pyld python library:

https://gist.github.com/datadavev/3ba3b12390c859b2f780ad7b78ebd739

It's not a perfect solution since there may be references explicitly to expanded schema.org URIs (e.g. http://schema.org/Thing) which would require additional processing (either hacking the triples or a compaction + expansion process).
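
Independently of the gist, the interception idea can be sketched with the standard library alone. The wrapper below assumes a pyld-style loader, that is, a callable taking `(url, options)` and returning a dict describing the loaded document. The replacement URL is a hypothetical placeholder, since no https-namespace context file is confirmed to exist, and `fake_loader` merely stands in for a real network loader.

```python
# Hypothetical location of a context that uses the https namespace.
HTTPS_VARIANT_CONTEXT = "https://example.org/contexts/schemaorg-https.jsonld"

# URLs (without trailing slash) treated as references to the schema.org context.
SCHEMA_ORG_CONTEXT_URLS = {
    "http://schema.org",
    "https://schema.org",
    "http://schema.org/docs/jsonldcontext.jsonld",
    "https://schema.org/docs/jsonldcontext.jsonld",
}


def redirecting_loader(inner_loader, replacement_url=HTTPS_VARIANT_CONTEXT):
    """Wrap a document loader so schema.org context requests are swapped."""
    def loader(url, options=None):
        if url.rstrip("/") in SCHEMA_ORG_CONTEXT_URLS:
            url = replacement_url
        return inner_loader(url, options or {})
    return loader


# Stand-in for a real loader: records the URL it was asked to fetch.
requested = []


def fake_loader(url, options):
    requested.append(url)
    return {"contextUrl": None, "documentUrl": url, "document": {"@context": {}}}


loader = redirecting_loader(fake_loader)
loader("https://schema.org/")
loader("https://example.org/other.jsonld")
print(requested)
```

With pyld installed, the wrapper could presumably be registered through `jsonld.set_document_loader(redirecting_loader(jsonld.requests_document_loader()))`. As the caveat above says, this only affects terms expanded through the context, not IRIs that are already written out in full.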

@gkellogg
Contributor

The SDL uses separate vocabularies for the http and https varieties, and as sub-classes, domains and ranges are separate, it will flag most attempts to intermix the two.

Note that the RDFa initial context relates the “schema” prefix to the http version automatically. My tools favor the “schemas” prefix for schema.org. Any change to defaults, as with the JSON-LD context, must be well advertised and coordinated.

@github-actions

This issue is being tagged as Stale due to inactivity.

github-actions bot added the no-issue-activity label May 31, 2021

@kaefer3000

kaefer3000 commented Jan 11, 2022

Is there a timeline/strategy for the transition to https also for the JSON-LD context?

Beyond the practical question of how large-scale consumers of data from the wider Web could address the issue, as discussed in this thread, a consistent namespace would mean fewer footnotes when teaching semantic web technologies with practical, working examples. Students who have just learnt about the standards for RDF term equality and the different RDF serialisations could write triples with URIs copied and pasted from the browser address bar at schema.org (or from somewhere in the rendered HTML), combine them with N-Quads from the json-ld.org playground (or some other JSON-LD/RDFa deployment), and things would just work together.

In the meantime, maybe it would be useful to have an extension to what is being shown when clicking [more...] at e.g. https://schema.org/Person, which says:

Canonical URL: https://schema.org/Person
Equivalent Class: http://xmlns.com/foaf/0.1/Person 

And the extension could be:

Deprecated URL (currently still used in JSON-LD and RDFa): http://schema.org/Person

Thus making the transition more transparent. Maybe with a pointer to FAQ item 19 that could be updated with a bit more of the technical background from this thread.

Edit: I just found #2886, which is not linked in this thread yet.


7 participants