"Canonical URL" should provide the (HTTPS) URL that is the value of rel="canonical" #2018

Closed
Aaranged opened this issue Jul 23, 2018 · 9 comments


@Aaranged

Aaranged commented Jul 23, 2018

A current type or property page on schema.org lists at top a "Canonical URL", such as this statement for Book.

Canonical URL: http://schema.org/Book

In the code of the page we also find a formal canonical URL declaration. This is for the https:// version of the page, as the http:// version now 301 redirects to the https:// version.

<link rel="canonical" href="https://schema.org/Book" />

Accordingly the top-of-page statement should provide the same value as the href attribute of <link rel="canonical">. Without this normalization there is a persistent mismatch between the canonical URL the page describes (i.e. to humans) and the canonical URL provided to machine data consumers.
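The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming the page HTML is already available as a string; the helper names `find_canonical_href` and `labels_agree` are hypothetical and not part of the schema.org codebase.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical_href = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> are also routed here.
        if tag != "link" or self.canonical_href is not None:
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical_href = attr_map.get("href")


def find_canonical_href(page_html):
    """Return the rel="canonical" href declared in the page, or None."""
    parser = CanonicalLinkParser()
    parser.feed(page_html)
    return parser.canonical_href


def labels_agree(page_html, displayed_canonical_url):
    """True when the human-readable label equals the machine-readable href."""
    return find_canonical_href(page_html) == displayed_canonical_url


page = '<html><head><link rel="canonical" href="https://schema.org/Book" /></head></html>'
print(labels_agree(page, "http://schema.org/Book"))
```

With the values quoted in this issue, the displayed http:// label and the https:// href do not agree, which is the inconsistency being reported.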

For those working in the search engine space canonical has a very precise and well-documented meaning. If what's being described by the on-page "Canonical URL" is something other than this URL, the wording should be changed to reflect what this value means: "canonical" shouldn't have one meaning for humans and one for data consumers in the context of the very same page.

The original intent here was probably to inform users what URL they should use for a term defined in an extension, such as abridged (which now points to the correct href value for <link rel="canonical">, here https://schema.org/abridged rather than https://bib.schema.org/abridged). However, as documented above, the "canonical URL" provided on the page now conflicts with the href value because of the HTTP protocol in the former.

@thadguidry
Contributor

@Aaranged yeah, noticed that also just this past weekend. Thanks for writing this issue up.

@RichardWallis
Contributor

There are two meanings/uses of canonical that are 'apparently' in conflict here, which is a consequence of moving the site to https, whilst the vocabulary (that the site describes) remains http based.

  1. <link rel="canonical"> is an instruction to those that may find the same page referenced by more than one URL, as to which is the master page. In previous releases (pre v3.4) this was mapped to http://schema.org/Book for both http & https served pages. In the current (v3.4) release, the site only serves https pages (301 redirecting any calls to http), so this link is not strictly necessary.

  2. The Canonical URL as displayed on the term definition pages. This is the unique identifier for the term itself, independent of the page its details are displayed upon. It is this value that should be used in mark up - eg. <div itemscope="" itemtype="http://schema.org/Book"> or "@context": "http://schema.org".

Several years of usage will have resulted in millions of http based term URLs being harvested into knowledge graphs and data stores. To such systems, without special interventions, the two term URLs http://schema.org/Book & https://schema.org/Book are totally different identifiers. The larger web organisations of the world can probably cope with a special rule for Schema.org based identifiers, to equate the two. Those who build their systems, for harvesting or publishing data, on generally available software will not cope quite so easily.
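A "special rule" of the kind mentioned above could look like the following minimal sketch. It assumes a consumer decides to fold both protocols onto one scheme before comparing identifiers; the function name `normalize_schema_org_iri` and the choice of http as the default target are illustrative assumptions, not anything schema.org prescribes.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_schema_org_iri(iri, scheme="http"):
    """Rewrite http/https schema.org term IRIs to a single scheme.

    Only the bare schema.org host is handled; extension subdomains and
    unrelated hosts are returned unchanged.
    """
    parts = urlsplit(iri)
    if parts.scheme in ("http", "https") and parts.netloc.lower() == "schema.org":
        return urlunsplit((scheme, "schema.org", parts.path, parts.query, parts.fragment))
    return iri


# Compared as plain strings, the two identifiers differ.
print("http://schema.org/Book" == "https://schema.org/Book")
# After normalization they compare as equal.
print(normalize_schema_org_iri("http://schema.org/Book")
      == normalize_schema_org_iri("https://schema.org/Book"))
```

Systems built on generally available software have no such rule by default, which is why the plain string comparison is the behaviour to expect.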

There is therefore a major backwards compatibility issue, if/when we move the scheme for term identifying URLs from http to https. For some, it would be as big a deal as if we were moving from http://schema.org to https://new-schema.org.

I am not saying that we shouldn't move to https, for canonical term URLs, just that it is not as simple as you might at first assume.

An initial step is already in place to ease that transfer. If you take a look at the RDF definitions of the terms (on-page RDFa, JSON-LD & RDF/XML, dumps etc.), you will find that http://schema.org/Book is defined as sameAs https://schema.org/Book. This will enable those with inferencing capability within their data stores, to infer that the two URL identifiers are identifying things that are the same.
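To illustrate what such inferencing amounts to, here is a minimal sketch that treats each sameAs statement as a symmetric equivalence and groups identifiers accordingly. It does not use a real RDF store or reasoner; the helper `build_alias_resolver` and the plain tuple representation of sameAs statements are assumptions made for the example.

```python
def build_alias_resolver(same_as_pairs):
    """Return a function mapping an identifier to one representative of its sameAs group."""
    parent = {}

    def find(node):
        # Unknown identifiers form their own single-member group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups short
            node = parent[node]
        return node

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    return find


resolve = build_alias_resolver([
    ("http://schema.org/Book", "https://schema.org/Book"),
])

# Both spellings resolve to the same representative identifier.
print(resolve("http://schema.org/Book") == resolve("https://schema.org/Book"))
```

A data store without this kind of capability would keep treating the two URLs as unrelated identifiers.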

Following the above description, what you have spotted, however, is a defect with the <link rel="canonical"> definition in an extension page such as the one for abridged. In this case the link should be <link rel="canonical" href="https://bib.schema.org/abridged" />
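The page-level rule described here (core pages point at schema.org, extension pages at their own subdomain) can be sketched as follows. This is an illustration only; `canonical_link_tag` is a hypothetical helper and not how the site's templates are actually written.

```python
from html import escape


def canonical_link_tag(term, extension=None):
    """Build the page-level <link rel="canonical"> for a term page.

    `extension` is the extension subdomain label (for example "bib"),
    or None for a term hosted in the core vocabulary.
    """
    host = f"{extension}.schema.org" if extension else "schema.org"
    href = f"https://{host}/{term}"
    return f'<link rel="canonical" href="{escape(href, quote=True)}" />'


print(canonical_link_tag("Book"))
print(canonical_link_tag("abridged", extension="bib"))
```

Note that this concerns only the page URL; the term identifier to use in markup is a separate value, as explained in point 2 above.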

@Aaranged
Author

An admirable summary of the current situation @RichardWallis - a couple of comments on what you've documented.

It is this value that should be used in mark up...

While comprehensible to some developers, this will be lost on most web publishers, as the direction to "use a URL you can't actually resolve in your browser" is counter-intuitive, and likely to be ignored. And the presence or absence of this on-page "Canonical URL" value won't do anything to stem the flow of markup that employs https://schema.org that originates from site users copying and pasting a schema.org (now always-HTTPS) URL from their browser's address bar. In other words, insofar as this "Canonical URL" statement is designed to instruct publishers which protocol to use it has no value, as absent specific requirements about protocol use from specific data consumers there's no mechanism to enforce "correct" protocol encoding (e.g. the Google Structured Data Testing Tool considers the @context value https://schema.org every bit as valid as http://schema.org).

Following the above description, what you have spotted however is a defect with the <link rel="canonical"> definition in an extension page such as for abridged...

Indeed. And while this makes sense, it sets up an even more confusing mismatch for those familiar with <link rel="canonical">: when referring to this type or property in markup, don't use the protocol you see in your browser address bar and don't use the subdomain you see in your browser address bar.

As I said initially, if this requires changing the statement from "Canonical URL" to something else, like "URL to use in markup", then that's IMO much better than requiring publishers to know and appreciate the difference between two definitions of "canonical URL". FWIW what's far and away the most commonly understood by "canonical URL" is its manifestation in rel="canonical", excepting some older usage which still means, more broadly, "the URL a publisher would hope to have preferentially indexed by a search engine."

There is therefore a major backwards compatibility issue, if/when we move the scheme for term identifying URLs from http to https....

Understood, but from a practical perspective (that is, "practical" in terms of what protocol web publishers are now using in their markup) it's a moot point. Publishers are using https://schema.org broadly, and there's no reason to believe that they won't continue to do so.

@RichardWallis
Contributor

@Aaranged I agree with your views about not being able to hold back the tide of https term identifier usage much longer. A tide that will inevitably increase now that the web interface has moved to be exclusively https.

My opinion, overriding my natural conservatism about changing fundamental things in widely shared vocabularies, is that we should soon pragmatically move forward to make the vocabulary https based.

Whenever we do this there will be pain for some, mostly data consumers. I believe doing it sooner rather than later will reduce confusion for data producers.

This I believe would mean:

  • Making the canonical URL for all terms https based.
  • Updating the on-page Canonical URL value to reflect this
  • In machine-readable definitions, reverse the current sameAs definition to become https://schema.org/{term} sameAs http://schema.org/{term}. This should help some with backwards compatibility inferencing.
  • For terms in the core vocabulary maintain the current <link rel="canonical" href="https://schema.org/{term}" /> value. Although there will only be one valid url for pages within the schema.org site, this could prevent confusion if crawlers picked up pages on test sites such as webschemas.org.
  • Fix the identified bug with the rel canonical values for terms in extensions. eg. <link rel="canonical" href="https://bib.schema.org/abridged" />

Those are the somewhat easy to implement steps - meaning they will probably be down to me for coding.

However we should also consider the following:

  • An easily found post/help document that simply describes (if that is possible!) the issues around http/https plus the difference between the page URL for terms in extensions and their canonical URL that should be used in mark up.
  • Trawling through and updating the example code on the site to reflect this change:
    • In RDFa vocab="https://schema.org/"
    • In Microdata itemtype="https://schema.org/Book"
    • In JSON-LD "@context": "https://schema.org"
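The trawl through example code could be partly automated. Below is a minimal sketch, assuming the examples are available as text and that a plain protocol swap is all that is wanted; the function name `upgrade_schema_org_scheme` and the regular expression are illustrative assumptions, not the script actually used for the site.

```python
import re

# Match http://schema.org only when it is not followed by a character that
# would make it part of a longer host name (e.g. http://schema.org.example).
_HTTP_SCHEMA_ORG = re.compile(r"http://schema\.org(?![\w.-])")


def upgrade_schema_org_scheme(markup):
    """Rewrite http://schema.org references in example markup to https."""
    return _HTTP_SCHEMA_ORG.sub("https://schema.org", markup)


examples = [
    'vocab="http://schema.org/"',            # RDFa
    'itemtype="http://schema.org/Book"',     # Microdata
    '"@context": "http://schema.org"',       # JSON-LD
]
for example in examples:
    print(upgrade_schema_org_scheme(example))
```

Extension subdomains are deliberately left alone here, since the identifier to use for extension terms is a separate question discussed earlier in the thread.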

@vberkel

vberkel commented Jan 8, 2019

Google's schema.org recommendations have switched to https. While I recognize this is not a tool-specific community or repository, the change provides a further incentive to switch the vocabulary to HTTPS. Could this be included in scope for v3.5? Happy to assist if there's an appetite for this.

@jeannieh

jeannieh commented Jan 8, 2019 via email

@danbri
Contributor

danbri commented Jan 9, 2019

Let's not rush it. There are a lot of subtle things that can get busted, e.g. mappings, SPARQL queries etc. I will look into getting some measures of http vs https data from the Web.

@Bcadej

Bcadej commented Apr 1, 2020

When creating a NEW website, should we use https://schema.org or http://schema.org?

@RichardWallis
Contributor

Implemented
