Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Link parsing only works for basic cases #2099

Open
viliam-durina opened this issue Mar 8, 2024 · 4 comments · May be fixed by #2128
Open

Link parsing only works for basic cases #2099

viliam-durina opened this issue Mar 8, 2024 · 4 comments · May be fixed by #2128

Comments

@viliam-durina
Copy link

viliam-durina commented Mar 8, 2024

We've decided to use your library to parse Link headers, hoping that you'll correctly implement the specification intricacies, because doing parsing correctly is tricky. However, it's implemented completely naively, working only in the basic cases. It uses regex for parsing, even though it can't be used for non-context-free grammars.

Here are a bunch of examples:

// should have failed
Link.valueOf("foo bar <url>;rel=\"next\"");

// greedy capture until the last `>`
System.out.println(Link.valueOf("<url>;customparam=foo>;rel=\"next\"").getHref());

// multiple rels not supported, should have been parsed to a collection
System.out.println(Link.valueOf("<url>;rel=\"next last\"").getRel().value());

// incorrect unquoting of values - missing end quote must be ignored, but currently "nex" is returned (missing last "t")
System.out.println(Link.valueOf("<url>;rel=\"next").getRel().value());

// param value not unescaped
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo\\\"bar\"").getTitle());

// incorrect semicolon handling - a semicolon within a quoted value is not treated literally
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo;bar\"").getRel().value());

This is the ABNF for the Link header:

  Link       = #link-value
  link-value = "<" URI-Reference ">" *( OWS ";" OWS link-param )
  link-param = token BWS [ "=" BWS ( token / quoted-string ) ]

And here for the content of the rel param:

  relation-type *( 1*SP relation-type )
where:

  relation-type  = reg-rel-type / ext-rel-type
  reg-rel-type   = LOALPHA *( LOALPHA / DIGIT / "." / "-" )
  ext-rel-type   = URI ; Section 3 of [RFC3986]

See https://httpwg.org/specs/rfc8288.html

So there can be multiple values in the rel param, even URIs and quoted URIs.

Some of the errors can't be fixed without breaking b-w compatibility, e.g. the decoding of href, unescaping of params and multiple rel values. Also, I'm not sure why custom parameters aren't supported - I didn't read the whole spec, but I don't think it prohibits custom params (we don't use them, just wondering).

@viliam-durina
Copy link
Author

Looking at the RFC more deeply, looks that I'm missing a lot. For example, missing > or missing final quote " MUST be ignored. There's nothing about URL-decoding the href, at least I can't find it. Seems that the particular service I'm working with now got this incorrectly since they are double-escaping their URL, contrary to the standard. There also can be multiple links within a single header, separated by ,, if I understand it correctly.

But many of the points above still apply. The RFC contains precise algorithm for parsing, I would expect a high quality library to strictly follow it, including the edge cases.

@odrotbohm
Copy link
Member

Would you mind listing the cases that really do break and are valid link headers according to the specification?

@viliam-durina
Copy link
Author

@odrotbohm I edited the original post of this issue.
The list isn't complete, to implement correctly, one should follow the specification very closely. There's exact algorithm for the parsing an Appendix B, from which handling of all (most?) edge cases follows (e.g. the ignoring of the missing final quote, if I understand it correctly).

@odrotbohm
Copy link
Member

Would you mind turning them into test cases and submitting a PR? I'll see what I can do then. If you fancy taking a stab at a better parsing implementation, feel free to submit that, too.

viliam-durina pushed a commit to viliam-durina/spring-hateoas that referenced this issue Mar 20, 2024
The previous implementation used naive parsing suffering from many issues, especially when special characters were part of values. It also didn't escape the special characters when serializing a link to string.

I marked the `Link.valueOf` method as deprecated because it doesn't correctly parse multiple links, nor does it handle multiple values for the `rel` param. One should use `Links.parse`. There are no other incompatible API changes.

I had to change one current test: we no longer ignore missing final `>` after the URL - such link is invalid anyway because it didn't have the rel parameter.

Fixes spring-projects#2099
@viliam-durina viliam-durina linked a pull request Mar 20, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants