ParseResult.from_string accepts ports that aren't preceded by a ':' character #102

Closed
@JohnJamesUtley

Background

RFC 3986 defines the authority component as follows:

authority = [ userinfo "@" ] host [ ":" port ]

The WHATWG URL Standard says:

An opaque-host-and-port string must be either the empty string or: a valid opaque-host string, optionally followed by U+003A (:) and a URL-port string.
A scheme-relative-special-URL string must be "//", followed by a valid host string, optionally followed by U+003A (:) and a URL-port string, optionally followed by a path-absolute-URL string.

The Bug

ParseResult.from_string will parse a port number which is not preceded by a colon. This conflicts with both specifications.

Minimal Reproducible Example

import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://[v1.ip]8000/path', 'utf-8', True, False)
print("Host: " + str(parsed_url.host)) # prints 'Host: [v1.ip]'
print("Port: " + str(parsed_url.port)) # prints 'Port: 8000'

Cause

This is the regex that misc.py uses to parse the authority component:

SUBAUTHORITY_MATCHER = re.compile(
    (
        "^(?:(?P<userinfo>{})@)?"  # userinfo
        "(?P<host>{})"  # host
        ":?(?P<port>{})?$"  # port
    ).format(
        abnf_regexp.USERINFO_RE, abnf_regexp.HOST_PATTERN, abnf_regexp.PORT_RE
    )
)

The bug comes from the first '?' in the port portion of the regex, :?(?P<port>{})?$. That '?' makes the colon optional independently of the optional port number, so a port can match with no colon in front of it. According to both specifications, a port number must always be preceded by a colon.
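
One possible fix is to move the colon and the port into a single optional group, so the port can only match when the colon is present. The sketch below compares the current pattern with that variant. The three sub-patterns are simplified stand-ins I made up for abnf_regexp.USERINFO_RE, HOST_PATTERN and PORT_RE, not the library's real definitions, and the names CURRENT and PAIRED are hypothetical.

```python
import re

# Simplified stand-ins for the abnf_regexp patterns (assumptions for
# illustration only; the library's real patterns are more detailed).
USERINFO_RE = r"[A-Za-z0-9._~!$&'()*+,;=:%-]+"
HOST_PATTERN = r"\[[^\]]+\]|[A-Za-z0-9._~%-]*"
PORT_RE = r"[0-9]{1,5}"

# The pattern from misc.py: colon and port are optional independently.
CURRENT = re.compile(
    (
        "^(?:(?P<userinfo>{})@)?"  # userinfo
        "(?P<host>{})"  # host
        ":?(?P<port>{})?$"  # port
    ).format(USERINFO_RE, HOST_PATTERN, PORT_RE)
)

# Candidate fix: the port may only appear inside the group that starts
# with ':'. The inner '?' still accepts a bare trailing colon, which
# RFC 3986 permits because port = *DIGIT may be empty.
PAIRED = re.compile(
    (
        "^(?:(?P<userinfo>{})@)?"  # userinfo
        "(?P<host>{})"  # host
        "(?::(?P<port>{})?)?$"  # ':' followed by an optional port
    ).format(USERINFO_RE, HOST_PATTERN, PORT_RE)
)


def port_of(matcher, authority):
    """Return the captured port, or 'no match' if the authority is rejected."""
    match = matcher.match(authority)
    return match.group("port") if match else "no match"


for authority in ("[v1.ip]8000", "[v1.ip]:8000", "example.com"):
    print(authority, "|", port_of(CURRENT, authority), "|", port_of(PAIRED, authority))
```

With these stand-in patterns, the paired variant rejects '[v1.ip]8000' outright while still accepting '[v1.ip]:8000' and authorities with no port. Whether the library should reject such input or report it another way is a design choice for the maintainers.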
