Description
Background
According to RFC 3986, line termination characters such as \n are not allowed in any part of a URI, but their percent-encoded forms, such as %0A, are allowed.
For other components of the URL, such as the query and the path, ParseResult.from_string normalizes \n characters into their percent-encoded form and accepts them. This is not true for the fragment component of the URI.
The Bug
Inserting any line termination character into the fragment section of the URL cuts the parsed fragment short at that character.
Minimal Reproducible Example
import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://user@host.com:80/path?query#Fragment\nThatIsIllusive', 'utf-8', True, False)
print("Fragment: " + parsed_url.fragment)
This prints Fragment: Fragment, so everything after the newline is lost.
In contrast, Furl, Hyperlink, Urllib, and Yarl all return Fragment: Fragment%0AThatIsIllusive
Cause
These are the regular expressions used to parse the different parts of a URI:
SCHEME_RE = "[a-zA-Z][a-zA-Z0-9+.-]*"
_AUTHORITY_RE = "[^\\\\/?#]*"
_PATH_RE = "[^?#]*"
_QUERY_RE = "[^#]*"
_FRAGMENT_RE = ".*"
This bug is a result of the use of .* in the fragment regex. By default (without the re.DOTALL flag), the . symbol in a Python regex matches every character except a newline, so the fragment match stops at the first \n.
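The behaviour of . can be checked in isolation with the standard re module. The sketch below is not the library's actual parsing code: the two compiled patterns are simplified, hypothetical stand-ins that reuse only the .* fragment sub-pattern from the listing above, and re.DOTALL is shown as one possible way to let the match continue past a newline, not necessarily the fix the maintainers would choose.

```python
import re

# Hypothetical, simplified stand-ins for the fragment part of a URI pattern.
# Only the ".*" sub-pattern is taken from the regex listing above.
FRAGMENT_DEFAULT = re.compile(r"#(?P<fragment>.*)")
FRAGMENT_DOTALL = re.compile(r"#(?P<fragment>.*)", re.DOTALL)

url = "scheme://user@host.com:80/path?query#Fragment\nThatIsIllusive"

# Without re.DOTALL, "." refuses to match "\n", so the capture stops there.
truncated = FRAGMENT_DEFAULT.search(url).group("fragment")

# With re.DOTALL, "." also matches "\n", so the whole remainder is captured.
full = FRAGMENT_DOTALL.search(url).group("fragment")

print(repr(truncated))  # 'Fragment'
print(repr(full))       # 'Fragment\nThatIsIllusive'
```

Capturing the full remainder would still leave a raw newline in the fragment, so it would need to be percent-encoded (or rejected) afterwards to match how the other components are described as being handled.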