Parsing of fragments through ParseResult.from_string can be cut short by a newline character #99

Closed
@JohnJamesUtley

Background

According to RFC 3986, line termination characters such as \n are not allowed in any part of a URI, but their percent-encoded forms (for example %0A) are allowed. For other sections of the URI, such as the query and the path, ParseResult.from_string normalizes \n characters into their percent-encoded forms and accepts them. This is not true for the fragment section of the URI.
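As a rough illustration of the percent-encoding described above, here is a minimal sketch using only the standard library's urllib.parse.quote. It is not rfc3986's own normalization code, and the variable names are made up for this example.

```python
from urllib.parse import quote

# Hypothetical fragment value containing a raw newline.
raw_fragment = "Fragment\nThatIsIllusive"

# Percent-encode every character that is not an unreserved character.
# safe="" is chosen here so that nothing beyond the defaults is left as-is.
encoded_fragment = quote(raw_fragment, safe="")

print(encoded_fragment)  # Fragment%0AThatIsIllusive
```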

The Bug

Inserting any line termination character into the fragment section of the URL will result in the parsing of the fragment section being cut short.

Minimal Reproducible Example

import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://user@host.com:80/path?query#Fragment\nThatIsIllusive', 'utf-8', True, False)
print("Fragment: " + parsed_url.fragment)

This will print Fragment: Fragment

In contrast, Furl, Hyperlink, Urllib, and Yarl all return Fragment: Fragment%0AThatIsIllusive

Cause

These are the regular expressions used to parse the different parts of a URI:

SCHEME_RE = "[a-zA-Z][a-zA-Z0-9+.-]*"
_AUTHORITY_RE = "[^\\\\/?#]*"
_PATH_RE = "[^?#]*"
_QUERY_RE = "[^#]*"
_FRAGMENT_RE = ".*"

This bug is a result of the use of .* in the fragment regex. By default, the . symbol in Python's re module matches every character except a newline (\n), so the match stops at the first newline and the rest of the fragment is dropped.
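The behaviour can be reproduced with the re module alone. The sketch below uses a simplified stand-in for the fragment pattern, not the library's full URI regex, and its pattern names are hypothetical. It compares the default behaviour of . with re.DOTALL, which is one possible way to let the pattern consume newlines and is shown as an assumption rather than the project's actual fix.

```python
import re

# Simplified stand-ins for the fragment pattern; these names are hypothetical
# and are not taken from rfc3986.
FRAGMENT_DEFAULT = re.compile(r"#(?P<fragment>.*)")
FRAGMENT_DOTALL = re.compile(r"#(?P<fragment>.*)", re.DOTALL)

text = "#Fragment\nThatIsIllusive"

# Without DOTALL, "." refuses to match "\n", so the match ends there.
truncated = FRAGMENT_DEFAULT.match(text).group("fragment")

# With DOTALL, "." also matches "\n" and the whole fragment is captured.
full = FRAGMENT_DOTALL.match(text).group("fragment")

print(repr(truncated))  # 'Fragment'
print(repr(full))       # 'Fragment\nThatIsIllusive'
```

Because re.match does not require the pattern to consume the whole input, the truncated match still succeeds, which is why the fragment is cut short instead of raising an error.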
