Link parsing only works for basic cases #2099

viliam-durina · 2024-03-08T11:47:39Z

We've decided to use your library to parse Link headers, hoping that you'll correctly implement the specification intricacies, because doing parsing correctly is tricky. However, it's implemented completely naively, working only in the basic cases. It uses regex for parsing, even though it can't be used for non-context-free grammars.

Here are a bunch of examples:

// should have failed
Link.valueOf("foo bar <url>;rel=\"next\"");

// greedy capture until the last `>`
System.out.println(Link.valueOf("<url>;customparam=foo>;rel=\"next\"").getHref());

// multiple rels not supported, should have been parsed to a collection
System.out.println(Link.valueOf("<url>;rel=\"next last\"").getRel().value());

// incorrect unquoting of values - missing end quote must be ignored, but currently "nex" is returned (missing last "t")
System.out.println(Link.valueOf("<url>;rel=\"next").getRel().value());

// param value not unescaped
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo\\\"bar\"").getTitle());

// incorrect semicolon handling - a semicolon within a quoted value is not treated literally
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo;bar\"").getRel().value());

This is the ABNF for the Link header:

  Link       = #link-value
  link-value = "<" URI-Reference ">" *( OWS ";" OWS link-param )
  link-param = token BWS [ "=" BWS ( token / quoted-string ) ]

And here for the content of the rel param:

  relation-type *( 1*SP relation-type )
where:

  relation-type  = reg-rel-type / ext-rel-type
  reg-rel-type   = LOALPHA *( LOALPHA / DIGIT / "." / "-" )
  ext-rel-type   = URI ; Section 3 of [RFC3986]

See https://httpwg.org/specs/rfc8288.html

So there can be multiple values in the rel param, even URIs and quoted URIs.

Some of the errors can't be fixed without breaking b-w compatibility, e.g. the decoding of href, unescaping of params and multiple rel values. Also, I'm not sure why custom parameters aren't supported - I didn't read the whole spec, but I don't think it prohibits custom params (we don't use them, just wondering).

The text was updated successfully, but these errors were encountered:

viliam-durina · 2024-03-08T14:39:25Z

Looking at the RFC more deeply, looks that I'm missing a lot. For example, missing > or missing final quote " MUST be ignored. There's nothing about URL-decoding the href, at least I can't find it. Seems that the particular service I'm working with now got this incorrectly since they are double-escaping their URL, contrary to the standard. There also can be multiple links within a single header, separated by ,, if I understand it correctly.

But many of the points above still apply. The RFC contains precise algorithm for parsing, I would expect a high quality library to strictly follow it, including the edge cases.

odrotbohm · 2024-03-13T11:05:48Z

Would you mind listing the cases that really do break and are valid link headers according to the specification?

viliam-durina · 2024-03-13T11:27:07Z

@odrotbohm I edited the original post of this issue.
The list isn't complete, to implement correctly, one should follow the specification very closely. There's exact algorithm for the parsing an Appendix B, from which handling of all (most?) edge cases follows (e.g. the ignoring of the missing final quote, if I understand it correctly).

odrotbohm · 2024-03-13T11:32:07Z

Would you mind turning them into test cases and submitting a PR? I'll see what I can do then. If you fancy taking a stab at a better parsing implementation, feel free to submit that, too.

The previous implementation used naive parsing suffering from many issues, especially when special characters were part of values. It also didn't escape the special characters when serializing a link to string. I marked the `Link.valueOf` method as deprecated because it doesn't correctly parse multiple links, nor does it handle multiple values for the `rel` param. One should use `Links.parse`. There are no other incompatible API changes. I had to change one current test: we no longer ignore missing final `>` after the URL - such link is invalid anyway because it didn't have the rel parameter. Fixes spring-projects#2099

odrotbohm self-assigned this Mar 13, 2024

odrotbohm added the process: waiting for feedback label Mar 13, 2024

viliam-durina linked a pull request Mar 20, 2024 that will close this issue

Fix link parsing for non-trivial cases #2128

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Link parsing only works for basic cases #2099

Link parsing only works for basic cases #2099

viliam-durina commented Mar 8, 2024 •

edited

viliam-durina commented Mar 8, 2024

odrotbohm commented Mar 13, 2024

viliam-durina commented Mar 13, 2024

odrotbohm commented Mar 13, 2024

Link parsing only works for basic cases #2099

Link parsing only works for basic cases #2099

Comments

viliam-durina commented Mar 8, 2024 • edited

viliam-durina commented Mar 8, 2024

odrotbohm commented Mar 13, 2024

viliam-durina commented Mar 13, 2024

odrotbohm commented Mar 13, 2024

viliam-durina commented Mar 8, 2024 •

edited