Implementation guidance on ETags #60

Closed
acoburn opened this issue Sep 20, 2019 · 21 comments

Comments

@acoburn
Member

acoburn commented Sep 20, 2019

The LDP specification requires the use of ETag headers for GET and HEAD responses. There is some subtlety to how ETags work in the context of RDF, and some implementation guidance (e.g. in a non-normative section) might be useful.

For example, RFC 7232, section 2.3 is clear that different ETags should be produced for different representations of a resource:

An entity-tag is an opaque validator for differentiating between multiple representations of the same resource, regardless of whether those multiple representations are due to resource state changes over time, content negotiation resulting in multiple representations being valid at the same time, or both.

Furthermore, RFC 7232, section 2.1 describes the difference between weak and strong ETags. For RDF serializations where RDF semantics may be more important than byte-for-byte consistency, is it legitimate to generate a strong ETag for an LDP-RS even if the server does not guarantee the order of the triples?

Consider also the case where a server can produce text/turtle, application/ld+json and application/n-triples, each of which generates a different ETag value. Then consider also that the server supports Prefer headers as well as Content-Encoding negotiation, each of which could produce an additional dimension for ETag generation. In that case, suppose a client wishes to send a PATCH request along with an If-Match header as part of a conditional request. What value should be included in the If-Match header? And given the various permutations of ETags that could be generated, does a server need to check all such permutations before accepting the conditional request? (Does If-Match make sense in the context of PATCH?)

This question is simpler in the context of PUT, but what if a client retrieves an LDP-RS as JSON-LD using a custom profile (Accept: application/ld+json; profile="https://...."), modifies the RDF graph and then, via PUT replaces the resource as JSON-LD. What value should be used with If-Match? Does this change if the resource is retrieved as JSON-LD and replaced as Turtle?

There are clearly some nuances here, and it may be helpful to provide some guidance to implementers.

@kjetilk
Member

kjetilk commented Sep 23, 2019

Interesting topic!

@acoburn :

For RDF serializations where RDF semantics may be more important than byte-for-byte consistency, is it legitimate to generate a strong ETag for an LDP-RS even if the server does not guarantee the order of the triples?

It does seem to me that it would be legitimate to do so, from the spec:

However, two simultaneous representations might share the same strong validator if they differ only in the representation metadata, such as when two different media types are available for the same representation data.

So, it seems we can stretch it not only to cases where the order of triples changes, but also to different media types. However, it is clearly not the case for all RDF representations, as an RDFa media type may contain content that is not represented in e.g. Turtle, even if the graphs expressed are the same.

This could resolve some of the issues you mention, but certainly not all. Also, I don't think weak validators are sufficient for our purpose (e.g. conditional writes). Therefore, I speculate that perhaps we should undertake defining a new type of validator whose semantics are better suited?

@pmcb55

pmcb55 commented Sep 24, 2019

Yeah, I agree with Kjetil - for simplicity I would say it is legitimate to generate a strong ETag for an LDP-RS based on the underlying graph, regardless of serialization and triple order.

One consequence would be that PATCHing a comment on a Turtle resource wouldn't update the ETag, but I'm OK with that, as I'd strip comments anyway (i.e. I'd just persist the graph). If the client wants comments preserved, then POST it as a NonRS! I'm not sure how to treat RDFa (in general :) !), but my feeling is it'll just have to be special-cased too (and in LDP terms, also treated as a NonRS, but with internal awareness of its 'partial' RDF-ness).

I'm sure I'm missing loads of nuance here, but I'm just trying to keep the server dumb!

I'm also fine with @kjetilk's suggestion to begin undertaking a new validator, but that's a separate issue, for possible inclusion in v2.0 of the spec!

@Mitzi-Laszlo Mitzi-Laszlo added this to the Candidate Recommendation milestone Sep 26, 2019
@csarven
Member

csarven commented Oct 4, 2019

re:

For RDF serializations where RDF semantics may be more important than byte-for-byte consistency, is it legitimate to generate a strong ETag for an LDP-RS even if the server does not guarantee the order of the triples?

Given https://www.w3.org/TR/rdf11-concepts/#dfn-rdf-source :

A snapshot of the state can be expressed as an RDF graph.

Only the underlying RDF graph is semantically significant. Snapshots of the same state have isomorphic graphs. All other information in the serialization is out of scope.

What's semantically significant is mentioned in https://tools.ietf.org/html/rfc7232#section-2.1 with a recommended exception:

A strong validator might change for reasons other than a change to
the representation data, such as when a semantically significant part
of the representation metadata is changed (e.g., Content-Type), but
it is in the best interests of the origin server to only change the
value when it is necessary to invalidate the stored responses held by
remote caches and authoring tools.

Hence, it is legitimate to generate a strong ETag. It is also legitimate to use the same ETag value for different representations of the same RDF source.
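To make the idea of a graph-based strong ETag concrete, here is a minimal sketch in Python. It is an illustration only, not something proposed in the thread: the function name `graph_etag` is hypothetical, triples are assumed to be passed as already-serialized N-Triples terms, and the graph is assumed to contain no blank nodes (blank nodes would need a proper canonicalization step, which this sketch does not attempt).

```python
import hashlib


def graph_etag(triples):
    """Return a strong ETag that depends only on the set of triples.

    `triples` is an iterable of (subject, predicate, object) tuples whose
    terms are already N-Triples strings. ASSUMPTION: no blank nodes, so
    sorting the serialized lines yields a canonical form.
    """
    # De-duplicate and sort so that triple order never affects the hash.
    lines = sorted({f"{s} {p} {o} ." for s, p, o in triples})
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    # An entity-tag is an opaque quoted string; no W/ prefix means strong.
    return f'"{digest[:32]}"'


title = ("<https://example.org/r>", "<http://purl.org/dc/terms/title>", '"The Trouble with Bob"')
created = ("<https://example.org/r>", "<http://purl.org/dc/terms/created>", '"2011-09-10"')

# The same graph in a different order yields the same tag.
same = graph_etag([title, created]) == graph_etag([created, title])
# A different graph yields a different tag.
different = graph_etag([title]) != graph_etag([title, created])
```

Because the tag is computed from the parsed graph rather than from the response bytes, it would be identical whether the triples were read from Turtle, JSON-LD or N-Triples.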

@csarven
Member

csarven commented Oct 4, 2019

However, it is clearly not the case for all RDF representations, as an RDFa media type may contain content that is not represented in e.g. Turtle, even if the graphs expressed are the same.

"may contain content" is not part of the same graph comparison as per https://www.w3.org/TR/rdf11-concepts/#graph-isomorphism . Only the information that can be stated as an RDF graph is compared.

@acoburn
Member Author

acoburn commented Oct 4, 2019

@csarven I really appreciate this guidance. It is incredibly helpful. I would, however, like to bring up another very practical consideration:

It is also legitimate to use the same ETag value for different representations of the same RDF source.

Suppose a browser fetches a resource as, say, Turtle, and the response includes an ETag (strong or weak doesn't matter here):

< GET /resource
< Accept: text/turtle

> HTTP/1.1 200
> Content-Type: text/turtle; charset=UTF-8
> ETag: "opaque-etag-value"

Then, in a subsequent request by that browser for the same resource but as, e.g. JSON-LD or with a different Prefer header, the request sends the ETag value in an If-None-Match header. If that ETag matches, the server would presumably respond with a 304 Not Modified:

< GET /resource
< Accept: application/ld+json
< If-None-Match: "opaque-etag-value"

> HTTP/1.1 304

If the client here is expecting JSON-LD (instead of Turtle), the graph parsing would likely fail.

Is this too much of an edge case? I.e. a work-around would be for the client to explicitly not send the If-None-Match header on the second request. Or should this be part of the consideration of the above? The same issue applies to various representations of a resource via the Prefer header mechanism.

Again, any guidance would be appreciated.
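The scenario above can be reproduced with a small simulation. This is a sketch under stated assumptions, not any real server's code: `handle_get` and its parameters are hypothetical, `If-None-Match` is assumed to carry a single entity-tag, and tags are compared by simple string equality.

```python
def handle_get(accept, if_none_match, graph_tag, vary_by_media_type):
    """Return (status, etag) for a conditional GET.

    `graph_tag` stands in for whatever the server derives from the
    resource state. When `vary_by_media_type` is true, the negotiated
    media type is folded into the tag; otherwise every serialization
    shares one tag.
    """
    if vary_by_media_type:
        etag = f'"{graph_tag}-{accept}"'
    else:
        etag = f'"{graph_tag}"'
    if if_none_match is not None and if_none_match == etag:
        return 304, etag  # the client is told to reuse its cached copy
    return 200, etag


# Shared tag: the Turtle tag revalidates the JSON-LD request.
_, turtle_tag = handle_get("text/turtle", None, "g1", vary_by_media_type=False)
shared = handle_get("application/ld+json", turtle_tag, "g1", vary_by_media_type=False)

# Per-media-type tag: the JSON-LD request gets a full response.
_, turtle_tag_v = handle_get("text/turtle", None, "g1", vary_by_media_type=True)
varied = handle_get("application/ld+json", turtle_tag_v, "g1", vary_by_media_type=True)
```

With a shared tag the second request yields a 304, so a cache that keys only on the URL would hand back the Turtle body; with a per-media-type tag it yields a 200 carrying JSON-LD.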

@dmitrizagidulin
Member

@acoburn Good point. Would adding Content-type to the Vary: header help address that?

@acoburn
Member Author

acoburn commented Oct 4, 2019

@dmitrizagidulin I have already done that (though Accept is what appears in the Vary response header). Perhaps this is just a limitation of browsers and their internal cache mechanism. As far as I can tell, if a server generates the same ETag for different content-negotiated representations, then subsequent requests (with varying accept values) just return 304 Not Modified and the browser uses the representation from its own cache even though it would be expecting a different serialization (so in the case of RDF parsing, one may easily encounter parse exceptions).

The easy way to resolve this would be to have different representations producing different ETag values, but that also leads to the issue described above w/r/t which value to use with If-Match (i.e. for conditional writes). Either way, I keep going around in circles on this issue.

The best browser-based work-around that I have found is to use the cache: "no-store" parameter when using the Fetch API.
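One way to picture the trade-off is a server that issues a distinct ETag per representation and, on a conditional write, accepts any tag belonging to the current resource state. The sketch below is purely illustrative and is not something the thread settles on: `VARIANTS`, `variant_etag` and `if_match_accepts` are hypothetical names, only the media type dimension is modelled (not `Prefer` or `Content-Encoding`), and `If-Match` is assumed to carry a single entity-tag.

```python
# ASSUMPTION: the full set of media types this hypothetical server offers.
VARIANTS = ("text/turtle", "application/ld+json", "application/n-triples")


def variant_etag(graph_tag, media_type):
    """Build a strong ETag that is unique per state and per media type."""
    return f'"{graph_tag};{media_type}"'


def if_match_accepts(if_match, graph_tag, variants=VARIANTS):
    """Return True if `if_match` is a tag of any current representation."""
    current = {variant_etag(graph_tag, m) for m in variants}
    return if_match.strip() in current


# A tag obtained from a Turtle GET is accepted for a write to state "g1"...
fresh = if_match_accepts(variant_etag("g1", "text/turtle"), "g1")
# ...but rejected once the resource has moved on to state "g2".
stale = if_match_accepts(variant_etag("g1", "text/turtle"), "g2")
```

The cost is visible in `current`: every additional negotiation dimension multiplies the number of tags the server has to enumerate or otherwise recognize.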

@kjetilk
Member

kjetilk commented Oct 7, 2019

Very important insight, @acoburn , and while I agree with @csarven that strong validators are legitimate from the wording of the specifications, I think such issues are going to cause problems. The use of conditional requests will be very important for performance optimizations in Solid, and is therefore something that has to Just Work.

I assume that the parsing is done by Solid-near code, so perhaps we could work around it. Just braindumping here: My general sense is that the use of a cache is so important, and parsers are so generally available, that having a different serialization in the cache than the one requested by an app shouldn't be an impediment to using the cache. Generic client-side libraries should be able to get stuff from cache and transform it to whatever the app wants.

So, even if the browser cache contains a Turtle serialization, and the strong validator makes the browser return that Turtle, perhaps the client-side libraries should be capable of transforming the RDF, so that the app gets the requested media type?

@kjetilk
Member

kjetilk commented Oct 7, 2019

However, it is clearly not the case for all RDF representations, as an RDFa media type may contain content that is not represented in e.g. Turtle, even if the graphs expressed are the same.

"may contain content" is not part of the same graph comparison as per https://www.w3.org/TR/rdf11-concepts/#graph-isomorphism . Only the information that can be stated as an RDF graph is compared.

Yes, but then, I would argue that it is not the same representation. For example, I would say it is a stretch to say that:

  <h2 property="dct:title">The Trouble with Bob</h2>
  <p>Date: <span property="dct:created">2011-09-10</span></p>

is fully represented by

<> dct:title "The Trouble with Bob" ;
   dct:created "2011-09-10" .

even though they are isomorphic graphs. In the latter, the header and paragraph semantics are lost.

It is definitely breaking it to say that

  <h2>The Trouble with Bob</h2>
  <p>Date: <span property="dct:created">2011-09-10</span></p>

is fully represented by

<> dct:created "2011-09-10" .

even though they are again isomorphic. Graph isomorphism isn't sufficient to decide whether two representations are equivalent for all RDF serializations, even though it is for most.

@csarven
Member

csarven commented Oct 7, 2019

Yes, but then, I would argue that it is not the same representation.

Citation needed.

The equivalence of representations is in the context of RDF Sources, on which LDP-RS is based.

Markup languages like HTML, SVG, MathML are host languages for RDFa. What's relevant in context of an RDF Source is that RDFa happens to be a way to materialise the RDF graph. All other information eg. h2, p, span, have no meaning (in the plain ol' HTML sense) to an RDFa parser. The purpose of interpreting an RDFa embedded document is to extract the "underlying abstract representation is RDF" ( https://www.w3.org/TR/rdfa-core/ ).

@RubenVerborgh
Contributor

Citation needed.

It seems the argument was provided below? 🙂

@csarven
Member

csarven commented Oct 7, 2019

It seems the argument was provided below? 🙂

I was requesting to see material from the world of specs that would dismiss what I've argued (and cited) and supports his argument.

I could "argue" that infinitely different Turtle, JSON-LD.. serializations but with the same underlying graph are different representations, all while being interpreted in the context of LDP-RS/RDF Source. But that would make no sense (to me).

😐

@RubenVerborgh
Contributor

I could "argue" that infinitely different Turtle, JSON-LD.. serializations but with the same underlying graph

If there are no named graphs; yes. Otherwise, triple-based formats are not lossless compared to quad-based.

@kjetilk
Member

kjetilk commented Oct 7, 2019

Right, we're citing exactly the same specs here, that is not the issue. My point is that an RDFa representation cannot simply be considered an LDP-RS, because that would discard semantics of the host language. I suppose we have to go to the Webarch definition of representation. The host language and RDF combined can represent the resource state fully, in a way that neither of them does alone.

So, if we do not take the academic argument here, but the pragmatic one, just imagine if the user requests an RDFa document with rich host language markup and content, and we give them just a few triples that represent a tiny part of the original document, because a strong validator has told us that the two are equivalent. I think users would be very upset.

@kjetilk
Member

kjetilk commented Oct 7, 2019

Hmmm, OK, rereading the thread, I think I understand your argument better, @csarven , because if we assume that it is known that the resource is an LDP-RS, then indeed, data that are not represented by RDF are of no relevance. OK, I can go with that.

However, my point is that in the general case, we'd have a situation with RDFa where it is not clear if we have an LDP-RS, and then the host language semantics and content matters.

@csarven
Member

csarven commented Oct 7, 2019

If a content publisher deems that only the information that's encapsulated in RDFa is intended to persist, e.g. through other RDF representations, that's their call. This is why content publishers wanting to persist whatever is of relevance for a graph should take a lossless approach as much as possible. There is nothing stopping them from describing the complete structure and the content such that the resulting graph contains all information.

It is sensible to treat a resource as an LDP-RS given RDF Source, which says "any web document that has an RDF-bearing representation may be considered an RDF source." This holds true for any syntax used to convey information with RDF.

If the host language's semantics and the content that's not encapsulated in RDFa are important for the publisher, i.e. at least more important than having it emit an RDF graph, and need to persist, then they should instead consider treating the resource as an LDP-NR. It is their call as to how they wish to handle their resources.

@kjetilk
Member

kjetilk commented Oct 7, 2019

Right, I can see your point.

I realize it is essentially an argument from ignorance, but I tend to think about the mistakes that I might make and try to design robustness around them. I'm not confident that I would know to make an RDFa document an LDP-NR, nor am I sure I should have to.

So, the implication of all this is that a strong validator can be used for an LDP-RS with different serializations with the practical caveats around how browsers treat caches.

We could be in a situation where the difference between an LDP-RS and an LDP-NR lies only in the markup (e.g. h2, p elements as above), not in the contents. I haven't thought the consequences of that fully through.

@csarven
Member

csarven commented Oct 8, 2019

Good point about mistakes and reducing their chances from happening.

I suppose this is where authoring/sharing tools (aka Solid applications) get to decide a bit on behalf of the user. If the source document is intended to be graph-like, then that's all there is to it. We could try to dissect the rationale further but I think it would suffice to treat that as an axiom. Ultimately only an application and its user would know whether something is intended to be an LDP-RS or -NR. That is also why LDP servers are instructed to honour client's interaction model in the request.

@csarven
Member

csarven commented Oct 8, 2019

@acoburn ,

In that case, suppose a client wishes to send a PATCH request along with an If-Match header as part of a conditional request. What value should be included in the If-Match header?

Assuming that the server responded to an LDP-RS request with a strong ETag, If-Match is possible on a subsequent request by the client. If no ETag, or only a weak ETag, was available from the earlier request, then the server will reject the conditional request, per https://tools.ietf.org/html/rfc7232#section-3.1 :

An origin server MUST use the strong comparison function when
comparing entity-tags for If-Match (Section 2.3.2), since the client
intends this precondition to prevent the method from being applied if
there have been any changes to the representation data.

An origin server MUST NOT perform the requested method if a received
If-Match condition evaluates to false; instead, the origin server
MUST respond with either a) the 412 (Precondition Failed) status code

So:

Does If-Match make sense in the context of PATCH?

Especially so:

If-Match is most often used with state-changing methods (e.g., POST,
PUT, DELETE) to prevent accidental overwrites

Am I missing something?
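The strong comparison rule quoted above can be sketched as follows. This is a simplified illustration rather than a complete RFC 7232 implementation: the function names are hypothetical, and the `If-Match` header is split naively on commas, which assumes that no entity-tag contains a comma.

```python
def parse_etag(value):
    """Split an entity-tag into (is_weak, opaque_tag)."""
    value = value.strip()
    weak = value.startswith("W/")
    return weak, (value[2:] if weak else value)


def strong_match(a, b):
    """Strong comparison: neither tag is weak and the opaque tags are equal."""
    weak_a, tag_a = parse_etag(a)
    weak_b, tag_b = parse_etag(b)
    return not weak_a and not weak_b and tag_a == tag_b


def weak_match(a, b):
    """Weak comparison: the opaque tags are equal, weakness is ignored."""
    return parse_etag(a)[1] == parse_etag(b)[1]


def check_if_match(if_match, current_etag):
    """Return None if the write may proceed, or 412 if the precondition fails.

    `current_etag` is the tag of the current representation, or None if
    the resource has no current representation.
    """
    if if_match.strip() == "*":
        return None if current_etag is not None else 412
    # ASSUMPTION: naive comma split, see the note above.
    candidates = [tag.strip() for tag in if_match.split(",")]
    if current_etag is not None and any(strong_match(c, current_etag) for c in candidates):
        return None
    return 412
```

Under these rules a client holding only a weak tag such as `W/"abc"` can still revalidate a cached GET via `If-None-Match`, which uses weak comparison, but its `If-Match` on a PATCH or PUT will always fail with 412.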

given the various permutations of ETags that could be generated, does a server need to check all such permutations before accepting the conditional request?

Excellent question. I don't think so. It is an implementation detail re https://tools.ietf.org/html/rfc7232#section-2.3.1 : if a representation changes in a way that "can be reasonably and consistently determined", there will be a new ETag value. It is an open-ended criterion, so any set of dimensions that is deemed to uniquely identify a representation can be used.

This question is simpler in the context of PUT, but what if a client retrieves an LDP-RS as JSON-LD using a custom profile (Accept: application/ld+json; profile="https://...."), modifies the RDF graph and then, via PUT replaces the resource as JSON-LD. What value should be used with If-Match?

Strong ETag from the original request.

Does this change if the resource is retrieved as JSON-LD and replaced as Turtle?

No, it shouldn't because one of the key identifiers ie. Content-Type, for the representation didn't change. Contrast this with Entity-Tags Varying on Content-Negotiated Resources https://tools.ietf.org/html/rfc7232#section-2.3.3 where a distinct dimension (Content-Encoding) helps to signal the difference.

If the client here is expecting JSON-LD (instead of Turtle), the graph parsing would likely fail.

I'm not sure how that works in implementations. Do they combine the information from the cached entry with the new request and factor them into the process, e.g. graph parsing? The intention of using If-None-Match in GET is to decide whether to bust the cache or not, so if a 304 comes back, that tells the recipient that they should use the cached item, i.e. the Turtle instead of JSON-LD. The request with If-None-Match helped to verify whether the JSON-LD is equivalent to the Turtle and whether the cache needs to be updated or not. I think the recipient should work with the cached entry as a whole.

This is a bit fuzzy for me at the moment but it is probably worthwhile to specify that a server should be capable of generating both weak and strong entity-tags so that however it decides to come up with one for an ETag, it can potentially be used for If-Match and If-None-Match.

The best browser-based work-around that I have found is to use the cache: "no-store" parameter when using the Fetch API.

Wouldn't the server perhaps include Cache-Control for the example from earlier (1st request: Turtle, 2nd request: JSON-LD)? I'm deriving that from https://tools.ietf.org/html/rfc7232#section-4.1 :

The server generating a 304 response MUST generate any of the
following header fields that would have been sent in a 200 (OK)
response to the same request: Cache-Control, Content-Location, Date,
ETag, Expires, and Vary.
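As a rough sketch of the requirement just quoted, a server could derive the 304 header set from the headers it would have sent with a 200. The function name and the dictionary-based header model are assumptions made for illustration only.

```python
# Header fields that the quoted passage says a 304 must carry over
# from the equivalent 200 (OK) response.
REQUIRED_304_FIELDS = ("Cache-Control", "Content-Location", "Date", "ETag", "Expires", "Vary")


def headers_for_304(ok_headers):
    """Select the fields of a would-be 200 response that belong on a 304.

    `ok_headers` maps header names to values; names are matched
    case-insensitively and returned with their original spelling.
    """
    by_lower = {name.lower(): (name, value) for name, value in ok_headers.items()}
    selected = {}
    for field in REQUIRED_304_FIELDS:
        if field.lower() in by_lower:
            name, value = by_lower[field.lower()]
            selected[name] = value
    return selected


ok = {
    "Content-Type": "text/turtle; charset=UTF-8",
    "ETag": '"opaque-etag-value"',
    "Vary": "Accept",
    "cache-control": "no-cache",
}
not_modified = headers_for_304(ok)
```

Note that `Content-Type` is not in the list, so a 304 built this way gives the client no direct hint about which serialization the validated cache entry holds.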

I did not have good experience in reusing cache after 304. That may be the shortcoming of the communication between the application's libraries and the Web browser. So, I came to the same conclusion that in order to work around inadequate access/reuse of the cache, client sends Cache-Control: no-store (related to browsers caching Origin for CORS in the same browser session and not letting go: https://lists.w3.org/Archives/Public/www-archive/2017Aug/0003.html ). It is a dirty hack that's best left as an implementation decision.

@kjetilk
Member

kjetilk commented Oct 8, 2019

I did not have good experience in reusing cache after 304. That may be the shortcoming of the communication between the application's libraries and the Web browser. So, I came to the same conclusion that in order to work around inadequate access/reuse of the cache, client sends Cache-Control: no-store (related to browsers caching Origin for CORS in the same browser session and not letting go: https://lists.w3.org/Archives/Public/www-archive/2017Aug/0003.html ). It is a dirty hack that's best left as an implementation decision.

My concern is that if we are to realize the use cases we are promoting, where the app gathers data from many different sources, then we have very little control over the optimizations at the origin (as opposed to e.g. Facebook). We will have to rely on every single tool in the shed to realize the performance goals, and I'm pretty sure that caching based on conditional requests, whether on the client side or in proxies, will be very important. I'm therefore wary that the easy way out of using "no-store" will become a problem not easily corrected down the road.

@csarven
Member

csarven commented Oct 8, 2019

@kjetilk I agree. Given the current state of user-agents and the degree of discrepancies with the specifications, perhaps TSE can have an informative text using a should- or may-like language to note the potholes that applications may encounter (and maybe ways to get around them.)


7 participants