What is 'base' for an embedded json-ld? #23
Comments
My view is that, unless base is explicitly set in the JSON-LD content, the base URL of the HTML containing the |
There are a couple of possibilities:
I find the HTML and DOM descriptions somewhat confusing, but basically it's a matter of whether it uses the base URL in the scope of the script element, or that of the document itself, either from its location, or from |
The (WhatWG) DOM document says:
The HTML standard says:
I do not find this particularly confusing... Whether |
What I found confusing was how In any case, it would be fine for us to ignore this usage and rely only on the original document base URI. |
in some sense, this is not our problem:-) If what we say is that the There is a similar question, b.t.w., which did come up lately at the Publication WG: what about the possible |
The underlying question seems to be about whether the surrounding HTML affects the contained JSON-LD (i.e.
Right now, it's my assumption that the markup does not need to be consulted when extracting or using the JSON-LD, but that the request URL and response headers used to convey the HTML (with the embedded JSON-LD) would retain their meaning to the JSON-LD. For example (riffing off this example in the spec):
GET /markus
Host: http://example.com/
Accept: text/html

HTTP/1.1 200 OK
Content-Type: text/html
Link: <https://json-ld.org/contexts/person.jsonld>;
rel="http://www.w3.org/ns/json-ld#context";
type="application/ld+json"

<html>
<head>
<script type="application/ld+json">
{
"@id": "",
"name": "Markus Lanthaler",
"homepage": "http://www.markus-lanthaler.com/",
"image": "http://twitter.com/account/profile_image/markuslanthaler"
}
</script>
</head>
<body>
<img src="http://twitter.com/account/profile_image/markuslanthaler" />
<a href="http://www.markus-lanthaler.com/">Markus Lanthaler</a>
</body>
</html>
Given that the document was requested from
<http://example.com/markus> <http://xmlns.com/foaf/0.1/homepage> <http://www.markus-lanthaler.com/> .
<http://example.com/markus> <http://xmlns.com/foaf/0.1/img> <http://twitter.com/account/profile_image/markuslanthaler> .
<http://example.com/markus> <http://xmlns.com/foaf/0.1/name> "Markus Lanthaler" .
Essentially...
This would avoid the complication(s) of consulting the surrounding format's many similar value expressions (lang, base, etc.), while still benefiting from the HTTP message headers we're already leaning on for the "plain JSON" use case. If this is on the right track, I'm happy to contribute text to make the Embedding JSON-LD in HTML Documents section normative (following the pattern set in the "plain JSON" section). Just let me know. 😃 |
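As a rough illustration of the reading in the comment above, the empty "@id" in the example resolves to the request URL when that URL is taken as the base. This is only a sketch using Python's standard `urllib.parse.urljoin`; the helper name `resolve_id` is hypothetical and the URL is the one from the example.

```python
from urllib.parse import urljoin


def resolve_id(document_url: str, node_id: str) -> str:
    """Resolve a relative (or empty) @id against the URL the HTML was fetched from."""
    return urljoin(document_url, node_id)


# The embedded JSON-LD uses "@id": "", i.e. "this document".
subject = resolve_id("http://example.com/markus", "")
print(subject)
```

With this interpretation nothing in the surrounding markup is consulted; only the request URL matters.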
FWIW, this is how the JSON-LD Playground experience feels now (at least as regards the Base URI). Any |
That's always been my interpretation. Other things to consider are the following:
<html>
<head xml:base="http://example.org/alt-base/" lang="EN-alt">
<script type="application/ld+json">
{
"@id": "",
"name": "Markus Lanthaler",
"homepage": "http://www.markus-lanthaler.com/",
"image": "http://twitter.com/account/profile_image/markuslanthaler"
}
</script>
</head>
<body>
<img src="http://twitter.com/account/profile_image/markuslanthaler" />
<a href="http://www.markus-lanthaler.com/">Markus Lanthaler</a>
</body>
</html>
We need to be careful about putting too many restrictions on implementations; by making HTML processing normative, all compliant processors will need to include an HTML parser, whereas now they only need to process JSON. My library depends on the RDFa processor to look for other embedded encodings in an HTML document, including Microdata, RDF/XML, and anything in a |
The problem I have with the strict HTTP response based approach is my usual one: for most of the users/authors out there (think of all the schema.org users!) setting the HTTP response headers is not an option: they have neither the knowledge nor the access to do that. I believe any setting must rely on what they can really control, namely the HTML content itself. I also believe we should not shy away from relying on the usage of an HTML parser. After all, if a JSON-LD processor does implement the embedded JSON-LD extraction, then the most natural way of doing so is to parse the HTML file with a parser (and they are available everywhere for all types of languages) and perform something like:
We should encourage using this pattern rather than using some sort of a text processing trick imho. That being said: I think we have three generic behaviors that we need to specify. Base: To use the code snippet above we should say that the base for the embedded content is the value of Default language: I think if the language is set explicitly in the HTML source, this should be honored. A question that is not 100% clear to me is whether any inherited language should be considered or only an explicit setting. Ie, I believe that the following case:
must be equivalent to a The slightly more unclear case is whether
has the same effect. I think the answer should be 'yes', too. One reason is that if I use that mechanism for any other HTML element, like a Default text direction: (Ie, the I do not see any attributes for the script element that would/could affect the JSON-LD processing. In other words, I believe all other attributes can be ignored as far as JSON-LD is concerned. Which means that spec-ing all this does not look like an overcomplicated issue... |
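To make the "inherited language" option discussed in this comment concrete, here is a minimal sketch of how a processor could find the nearest lang value in scope for each JSON-LD script element. It illustrates the proposal only, not settled behavior. It uses Python's standard `html.parser` (an event-based parser, not a full DOM); the class name `InheritedLangFinder` and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class InheritedLangFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._open = []   # stack of (tag, effective lang) for open elements
        self.langs = []   # effective lang for each JSON-LD script element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        inherited = self._open[-1][1] if self._open else None
        lang = a.get("lang") or inherited  # explicit value wins over inherited
        is_json_ld = (a.get("type") or "").strip().lower() == "application/ld+json"
        if tag == "script" and is_json_ld:
            self.langs.append(lang)
        if tag not in VOID:
            self._open.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break


SAMPLE = """<html lang="en">
<head><script type="application/ld+json">{"name": "a"}</script></head>
<body><div lang="fr"><img src="x.png">
<script type="application/ld+json">{"name": "b"}</script></div></body>
</html>"""

finder = InheritedLangFinder()
finder.feed(SAMPLE)
finder.close()
print(finder.langs)
```

The first block picks up the language set on the html element, the second the one set on the enclosing div.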
Relates to #57. |
I'm concerned that making the exterior "packaging" influence the "contents" that heavily introduces far too much overhead and confusion--especially when considering CMS and SEO usage where the same snippet of JSON-LD might be injected into multiple pages or even sites. Also, with the exception of text direction, base and language already have their equivalences in JSON-LD, so the "can't edit HTTP headers" scenario doesn't apply for those. |
I am not sure I understand the problem. Obviously, if the JSON-LD content has, e.g., a language setting, that overrules everything else. The issue is when this is not the case. If an author includes JSON-LD via a script tag, I would think that she would expect, e.g., an explicit language setting to be valid for the metadata expressed in JSON-LD, too (unless explicitly stated otherwise in the metadata). |
I share @BigBlueHat's concerns about the scope of inclusions into the graph from the surrounding HTML as defaults, rather than by explicit reference. You would need to traverse the DOM all the way up, looking for Can we split the effects of surrounding document structure off into a separate issue, and keep this one as the base URI only? |
The HTML5 spec calls these "data-blocks" and says the following:
Consequently, I'd suggest we avoid the intermixing of DOM parsing and unique-to-us data-block handling. |
HTML5 pretty much shuts the door on using the DOM to interpret script elements (data blocks). |
@BigBlueHat on #23 (comment) : that is not how I interpret this text. What it says is that the HTML spec is completely silent (as it should be) on what happens in that block. That does not mean that the specification for that specific mime type cannot specify what it wants. In particular, the |
@azaroth42 per #23 (comment) : it is indeed actually strange to me that the DOM is specified in a way that, while the base URI (which is also to be calculated) is per definition available in the DOM element, this is not true for the language. I have the impression that this could be considered as a bug, but I would leave that to @r12a and his friends to decide... While I do not consider the resulting extra calculation to be complex (the processor handling embedded JSON has, presumably, access to the DOM, because it has to find the script element; if so the rest is a simple recursion via the |
I have just read the resolution on the last WG call:
and I must admit I would have voted -1, had I been at the call. I have already put my arguments into #23 (comment) and I do not want to repeat them. Any JSON-LD processor that understands embedded JSON-LD has to do some level of HTML DOM parsing, and the DOM parser will provide the value of Although I admit it is rarely used, I would expect that the author of a document with embedded JSON-LD would be surprised that a |
WG resolution on call of 2018-09-21 was (indeed) to use the document base URL only and ignore all surrounding data such as xml:base. The rationales included the data-block definition seeming to at the very least imply a clean separation, the additional requirements on processing regardless of how complex, and the perceived surprise of mixing data and presentational content. |
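A small sketch of what "document base URL only" could mean for a processor: take the first base element that has an href, resolved against the document's location, and otherwise fall back to the location itself, ignoring attributes such as xml:base. It relies on Python's standard `html.parser` and `urllib.parse`; the names `_BaseFinder` and `document_base_url` and the sample pages are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseFinder(HTMLParser):
    """Records the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.href is None:
            href = dict(attrs).get("href")
            if href is not None:
                self.href = href


def document_base_url(html: str, document_url: str) -> str:
    finder = _BaseFinder()
    finder.feed(html)
    finder.close()
    if finder.href is None:
        return document_url                      # fall back to the location
    return urljoin(document_url, finder.href)    # <base href> may be relative


# xml:base on <head> is deliberately not consulted.
page_without_base = '<html><head xml:base="http://example.org/alt-base/"></head></html>'
page_with_base = '<html><head><base href="/people/"></head></html>'

print(document_base_url(page_without_base, "http://example.com/markus"))
print(document_base_url(page_with_base, "http://example.com/markus"))
```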
And now not as chair ... Outside of the browser environment, parsing HTML into a DOM is a pain. A regular expression can find script tags in the HTML tag soup and extract those that have the right media type ... but then would not have the surrounding information beyond the tag and its contents. So it could get an attribute on the script tag, but it would be harder to find other elements, and very hard to process the tree to find the closest wrapping element with a particular attribute ... that would need a full DOM based stack. The extent to which Benjamin's reading of the data-block specification is correct or not ... I think we could verify at TPAC. "None of the script attributes (except I agree with Gregg that the principle of least surprise should be given due attention, and that's a very subjective issue. What is surprising to one person is intuitive to someone from a different background. In this situation I would defer to developers who want to create JSON-LD embedded in HTML ... for whom @danbri seems like the best proxy. Not to try and discuss everything at TPAC, but that seems like something else we could determine in person. |
@azaroth42 yes, this is really something to be discussed at TPAC, ie, we should definitely not close the issue here. Not as a staff contact:-) I believe using a regular expression to extract the script tag is even more of a pain... having the right expression that avoids such pitfalls as having a
may be tricky. I think that any decent programming environment these days has an HTML parser, ie, I would certainly not even dream about doing this in any other way than getting to the DOM (I did implement a Web Publication Manifest extractor and I just used a library to get this done. It was a breeze.) Let us add this to the TPAC agenda... |
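For comparison with the regular-expression approach, parser-based extraction of the data blocks might look roughly like this. It is a sketch only, built on Python's standard `html.parser` (event-based, so lighter than a full DOM); the names `JsonLdBlockExtractor` and `extract_json_ld` are made up for the example, and the page is a shortened version of the one earlier in this thread.

```python
import json
from html.parser import HTMLParser


class JsonLdBlockExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._chunks = None   # text of the JSON-LD script currently open, if any
        self.blocks = []      # parsed JSON-LD documents, in document order

    def handle_starttag(self, tag, attrs):
        type_ = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and type_ == "application/ld+json":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._chunks = None


def extract_json_ld(html: str) -> list:
    parser = JsonLdBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


PAGE = """<html>
<head>
<script type="application/ld+json">
{"@id": "", "name": "Markus Lanthaler"}
</script>
</head>
<body><a href="http://www.markus-lanthaler.com/">Markus Lanthaler</a></body>
</html>"""

blocks = extract_json_ld(PAGE)
print(blocks)
```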
…e about prospect of dynamically changing base. This reflects discussion from #23 (comment) and resulting TAG advice w3ctag/design-reviews#312 (comment).
So far we've mostly focused on In other words |
I don't see how that follows. The document space is different from the vocabulary space. Base affects the interpretation of the document space, but not the vocabulary space. |
In light of #72, the vocabulary space is the document space (at least it's relative to it). But I believe that in both cases it would be relative to any existing |
Sure, if via @vocab you set the default vocab space to the document space, then @base / will affect the vocabulary. Perhaps I was misreading @BigBlueHat, but I interpreted the comment that it would /always/ affect the vocab space. |
I think one of the questions of @BigBlueHat is what would happen with the URL of a context file. Ie, what is the outcome of the following context if the url of the JSON-LD file is
Will the linked context file be As an aside: looking through the syntax document I have found only one example that uses an array of contexts like above (ex. 103) and I did not find any statement whether the context can be an absolute URI only or not. Actually, I have not found, through a search on |
@iherman from 3.2.1 of the Context Processing Algorithm:
So, the linked context file would be This does have implications for setting base from HTML envelope, but that already exists for setting If we were to address this, it would require keeping separate notions of "base", and passing them into the context processing algorithm. |
Slight aside, but do we all agree that URLs inside HTML inside JSON-LD inside HTML script sections are never going to get expanded with base URIs from the outer HTML? i.e. no variations on this will expand 'here/' to a full URI:
|
@danbri unless that |
@danbri In this case, the content is just a string, and I don't know anywhere that strings are processed as HTML, even if an rdf:HTML literal; the processing happens in RDF/XML or RDFa to perform the exclusive canonicalization when generating the literal; JSON-LD has no such rules. Any other use as @BigBlueHat suggests is out of scope. Of course, if someone takes on the task of imbuing more intelligence for embedded JSON-LD (as suggested by @hadleybeeman in the TAG response), that might be another issue. In that case, it would likely be similar to any other bit of JavaScript injecting HTML, which I presume would cause the relative URL to be evaluated against the document base. |
This issue was discussed in a meeting.
3. What is ‘base’ for embedded json-ld?
Ivan Herman: link: #23
Ivan Herman: Link of the PR: #93
Benjamin Young: #68
Gregg Kellogg: w3c/json-ld-api#50
Benjamin Young: we discussed that one at tpac … we sent it in for TAG review, and they basically widened the scope
Gregg Kellogg: there are 2 open PRs … 1) basic support for json-ld in html … 2) PR-93 adds text to specifically add text to add html as base … in the API spec, it’s PR-50
Adam Soroka: quick question, what are we expected to do with their comments? … shall we respond?
Ivan Herman: what they propose is interesting but beyond our charter … this would elevate json-ld … but yeah beyond our charter … I would say this is something the CG has to pick up … and we can cross the bridge at some time, but if this is realistic from a manpower perspective I don’t know
Benjamin Young: #68 (comment)
Ivan Herman: regarding the PR-93, there is some stuff about having XML
Benjamin Young: the thing I just linked shows how script tags affect html parsing … it’s syntactically correct json-ld
Gregg Kellogg: what I did in the PR-68 I call out specifics on how to handle those blocks if the media type is application/json … I think I’ve taken in the specifics on how content of script tags has to be handled and adjusted in for our needs … we asked specific questions to TAG, got an answer but they kinda got a bit over enthusiastic … out of this needs to come something that improves web platform
Benjamin Young: the HTML comments stuff has really bothered me since I’ve read it … but it seems to primarily affect only HTML parsing … question is how much of this we need to have in the spec … json-ld in script tags vs “raw” json-ld … both have totally different escaping rules and what not … and none of that has something to do with html base
Ivan Herman: for the comment storing, the whole section is a normative thing … I have the impression this is an HTML problem … which we should certainly mention, but maybe not as part of a normative section
Benjamin Young: HTML5 spec level text about parsing <script> tags https://www.w3.org/TR/html5/semantics-scripting.html#restrictions-for-contents-of-script-elements
Ivan Herman: we should officially answer to the TAG and will officially add to the standard what they said about base … and that we try to get the CG involved
Adam Soroka: +1
Gregg Kellogg: comments in html and escaping.. it depends on the encoding … we don’t need to give guidance on how to handle this
Ivan Herman: https://pr-preview.s3.amazonaws.com/w3c/json-ld-syntax/pull/68.html#ex-103-embedding-json-ld-in-html-with-comments
Ivan Herman: it has to be valid json-ld … that’s invalid json
Gregg Kellogg: that’s something you see quite often
Benjamin Young: ivan: because https://www.w3.org/TR/html5/semantics-scripting.html#restrictions-for-contents-of-script-elements
Gregg Kellogg: comments are often used just to make sure there are no other issues embedded in the script elements that would cause any issues
Benjamin Young: I did quite some digging on that issue … the DOM parsing is only concerned with what’s inside the tags … it’s just treated as raw string … if there’s something inside the json-ld an html parser would choke on … the json-ld would need to be treated in such a way such that an html parser wouldn’t choke on it
Pierre-Antoine Champin: one crazy idea by looking at the json-ld embedded in html comments: you could add a js comment in front of the html comment, making it valid javascript … it could become technically correct
Benjamin Young: sadly it wouldn’t … it would continue being parsed … we could make it our own parsing space … the question is how far the parser gets before it finds the ending script tag
Gregg Kellogg: the json-ld would not be allowed to contain anything that could be interpreted as html and/or html comments … not really feasible and also not helping our mission … I’ve outlined multiple approaches to tackle this … there are only a few cases where json-ld would contain things that resemble comments
Harold Solbrig: why is this a json-ld issue but not a javascript issue?
Gregg Kellogg: [explains why it isn’t]
Gregg Kellogg: it did some test cases for this, exploring corner cases we know of … I don’t know how to move forward unless addressing at least some of the stuff from the PR
Gregg Kellogg: it describes script tags and data blocks are a subset
Benjamin Young: what’s breaking it, is the potential of one to too early close the script tag … so this would need somehow being taken care of … the risk is, the json-ld could contain content that jacks up the html it’s contained in
Ivan Herman: is it so horrible to say, if I put json-ld in a script tag I’m supposed to escape anything that html would need to have escaped … thus a json-ld parser would have to do the unescaping … but you are in HTML regardless.. so
Gregg Kellogg: for someone who’s actually looking at the source, those entities become rather annoying
Ivan Herman: realistically, I don’t know how often this would happen
Benjamin Young: the escaping issue is very similar of putting json-ld inside a text env.
Ivan Herman: I think it’s perfectly reasonable to accept both PRs, close the issue … and open a new issue on the specific problem
Gregg Kellogg: it’s an editor’s draft not a working draft
Ivan Herman: we would open an issue right away
Benjamin Young: I would only +1 this, if we add a big red AT RISK disclaimer
Ivan Herman: a lot of very important things are pending for now … I think it’s an edge case
Adam Soroka: I don’t think we should use a phrase like “AT RISK” but more something along the lines of “will be part of the final spec but might undergo some changes”
Ivan Herman: we cannot commit ourselves to having always consistent editor’s drafts
Benjamin Young: I’m not sure we have reached consensus on all the things contained
Gregg Kellogg: I cannot work on other open issues
Pierre-Antoine Champin: what about a parameter on the media type hinting at having to do unescaping? (like application/ld+json;escaped=html)
Proposed resolution: merge open HTML related PRs #93 and #68 and #50 after adding “At Risk” (or similar terminology) to present that things are not finalized (Benjamin Young)
Benjamin Young: +1
Adam Soroka: +1
Ivan Herman: what does “that” mean?
Gregg Kellogg: +1
Benjamin Young: I don’t want to have stuff merged without reaching consensus … but I could live with having it marked as being “at risk” or similar
Ivan Herman: putting things that are already done “at risk” would be going backwards … opening a new issue that highlights things that are still being discussed would be ok … having the feeling that ~90% are done
Adam Soroka: I have to generally agree with ivan
Ivan Herman: +1
Adam Soroka: it seems for me very unlikely that we would stop talking about it
David Newbury: +1
Harold Solbrig: +1
Benjamin Young: I’m fine with merging those
Simon Steyskal: +1
Resolution #2: merge open HTML related PRs #93 and #68 and #50 after adding “At Risk” (or similar terminology) to present that things are not finalized
Pierre-Antoine Champin: +1 |
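The early-closing risk raised in the minutes can be sidestepped on the producing side without HTML entities, by escaping at the JSON level instead. This is a different technique from the entity escaping and the media type parameter floated in the call, shown here only as a sketch: every "<" is written as the JSON escape \u003c, which any JSON parser turns back into "<". The function name `to_script_safe_json` and the sample data are assumptions for the example.

```python
import json


def to_script_safe_json(data) -> str:
    """Serialize JSON-LD so the text cannot contain '</script' or '<!--'.

    In valid JSON a '<' can only occur inside a string, where the escape
    \\u003c is equivalent, so the replacement does not change the data.
    """
    return json.dumps(data).replace("<", "\\u003c")


data = {"@id": "", "description": "a </script> tag and a <!-- comment"}
embedded = to_script_safe_json(data)
print(embedded)
```

A consumer needs no special unescaping step here, since parsing the block as JSON already restores the original strings.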
The definition of the embedded JSON-LD does not specify what the 'base' for that JSON-LD must be. It should specify it.
(cf #22)