Parsing of fragments through ParseResult.fromString can be cut short by a newline character #99

Closed
JohnJamesUtley opened this issue Apr 21, 2023 · 0 comments · Fixed by #100

Background

According to RFC 3986, line termination characters such as \n are not allowed in any part of a URI, although their percent-encoded forms (such as %0A) are. For other components of the URI, such as the path and the query, ParseResult.from_string normalizes \n into its percent-encoded form and accepts it. The fragment component is not handled this way.

The Bug

Inserting a newline character (\n) into the fragment section of the URL causes parsing of the fragment to stop at that character, so everything after it is silently dropped.

Minimal Reproducible Example

import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://user@host.com:80/path?query#Fragment\nThatIsIllusive', 'utf-8', True, False)
print("Fragment: " + parsed_url.fragment)

This prints Fragment: Fragment, dropping everything after the newline.

In contrast, Furl, Hyperlink, Urllib, and Yarl all return Fragment: Fragment%0AThatIsIllusive

Cause

These are the regular expressions used to parse the different parts of a URI:

SCHEME_RE = "[a-zA-Z][a-zA-Z0-9+.-]*"
_AUTHORITY_RE = "[^\\\\/?#]*"
_PATH_RE = "[^?#]*"
_QUERY_RE = "[^#]*"
_FRAGMENT_RE = ".*"

The bug results from the use of .* in the fragment regex. In Python's re module, . matches every character except a newline (\n) unless the re.DOTALL flag is set, so the match for the fragment ends at the first \n.
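
The behavior can be reproduced with the standard re module alone. The sketch below is only an illustration: the two patterns and the extract_fragment helper are hypothetical stand-ins, not the library's actual URI regex, and compiling with re.DOTALL is shown as one possible remedy rather than the change that was necessarily adopted.

```python
import re

# Hypothetical, simplified patterns: capture everything after the first '#'.
# These are NOT the library's full URI regex, only the fragment portion.
FRAGMENT_DEFAULT = re.compile(r"#(?P<fragment>.*)")
FRAGMENT_DOTALL = re.compile(r"#(?P<fragment>.*)", re.DOTALL)


def extract_fragment(pattern, uri):
    """Return the text captured after the first '#', or None if there is none."""
    match = pattern.search(uri)
    return match.group("fragment") if match else None


uri = "scheme://user@host.com:80/path?query#Fragment\nThatIsIllusive"

# Without DOTALL, '.' refuses to match '\n', so the capture stops there.
truncated = extract_fragment(FRAGMENT_DEFAULT, uri)
# With DOTALL, '.' also matches '\n', so the whole fragment is captured.
complete = extract_fragment(FRAGMENT_DOTALL, uri)

print(repr(truncated))  # 'Fragment'
print(repr(complete))   # 'Fragment\nThatIsIllusive'
```

Capturing the full fragment is only the first step; the \n would still need to be percent-encoded to %0A during normalization, as is already done for the path and query.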
