Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect pattern in the re module docs for conditional regex #55492

Closed
wesleychun mannequin opened this issue Feb 22, 2011 · 7 comments
Closed

incorrect pattern in the re module docs for conditional regex #55492

wesleychun mannequin opened this issue Feb 22, 2011 · 7 comments
Labels
docs Documentation in the Doc dir topic-regex

Comments

@wesleychun
Copy link
Mannequin

wesleychun mannequin commented Feb 22, 2011

BPO 11283
Nosy @orsenthil, @pitrou, @ezio-melotti
Files
  • re.rst: patched re.rst file
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-03-12.03:46:46.692>
    created_at = <Date 2011-02-22.08:48:19.404>
    labels = ['expert-regex', 'docs']
    title = 'incorrect pattern in the re module docs for conditional regex'
    updated_at = <Date 2011-03-12.03:46:46.691>
    user = 'https://bugs.python.org/wesleychun'

    bugs.python.org fields:

    activity = <Date 2011-03-12.03:46:46.691>
    actor = 'orsenthil'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2011-03-12.03:46:46.692>
    closer = 'orsenthil'
    components = ['Documentation', 'Regular Expressions']
    creation = <Date 2011-02-22.08:48:19.404>
    creator = 'wesley.chun'
    dependencies = []
    files = ['20833']
    hgrepos = []
    issue_num = 11283
    keywords = []
    message_count = 7.0
    messages = ['129041', '129473', '129680', '129686', '130654', '130657', '130658']
    nosy_count = 6.0
    nosy_names = ['orsenthil', 'pitrou', 'ezio.melotti', 'docs@python', 'wesley.chun', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue11283'
    versions = ['Python 3.1', 'Python 2.7', 'Python 3.2', 'Python 3.3']

    @wesleychun
    Copy link
    Mannequin Author

    wesleychun mannequin commented Feb 22, 2011

    In the re docs, it states the following for the conditional regular expression syntax:

    (?(id/name)yes-pattern|no-pattern)
    Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com'.

    this regex is incomplete as it allows for 'user@host.com>':

    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', '<user@host.com>'))
    True
    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', 'user@host.com'))
    True
    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', '<user@host.com'))
    False
    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)', 'user@host.com>'))
    True

    This error has existed since this feature was added in 2.4...
    http://docs.python.org/release/2.4.4/lib/re-syntax.html

    ... through the 3.3. docs...
    http://docs.python.org/dev/py3k/library/re.html#regular-expression-syntax

    The fix is to add the end char '$' to the regex to get all 4 working:

    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<user@host.com>'))
    True
    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', 'user@host.com'))
    True
    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<user@host.com'))
    False
    >>> bool(re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', 'user@host.com>'))
    False

    If accepted, I propose this patch (also attached):

    $ svn diff re.rst
    Index: re.rst

    ===================================================================

    --- re.rst      (revision 88499)
    +++ re.rst      (working copy)
    @@ -297,9 +297,9 @@
     ``(?(id/name)yes-pattern|no-pattern)``
        Will try to match with ``yes-pattern`` if the group with given *id* or *name*
        exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
    -   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
    +   can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email
        matching pattern, which will match with ``'<user@host.com>'`` as well as
    -   ``'user@host.com'``, but not with ``'<user@host.com'``.
    +   ``'user@host.com'``, but not with ``'<user@host.com'`` nor ``'user@host.com>'`` .

    @wesleychun wesleychun mannequin assigned docspython Feb 22, 2011
    @wesleychun wesleychun mannequin added docs Documentation in the Doc dir topic-regex labels Feb 22, 2011
    @wesleychun
    Copy link
    Mannequin Author

    wesleychun mannequin commented Feb 25, 2011

    i wanted to add one additional comment that it would be nice to have a
    regex that works with search() (in addition to match()) because such
    an email address may appear in the middle of a line, say a From: or
    To: email header.

    the fix of using a '$' prevents this from happening, so i'm not 100%
    satisfied with the patch although it does fix the regex to get it
    working with match().

    @orsenthil
    Copy link
    Member

    On Tue, Feb 22, 2011 at 08:48:20AM +0000, wesley chun wrote:

    The fix is to add the end char '$' to the regex to get all 4 working:

    Better would be a regex for white-space '\s' which would achieve the
    same purpose plus it would satisfy the other requirement for using it
    with search and can do the search in-middle of the document too.

    @orsenthil
    Copy link
    Member

    Thinking about the regex pattern again. The example given is not really wrong. It does what it claims to match, that is '<user@example.com>' and 'user@example.com' and reject <user@example.com' kind of string. Nothing is said about 'user@example.com>' kind of string.

    Also, this is not an example of validating an email address or finding an email address pattern in text data. A good regex for that purposes would be more complex[1][2].

    Having said that, if example of conditional regex has been given - the current one is sufficient (in which case no change is required) or a simpler one can be presented, which may not like matching a email address and thus devoid of any expectations of valid patterns.

    Also, if we 'really' think that rejecting 'user@example>' is good idea in the example documentation, then having '$' in no-pattern of regex is good enough. No need to think for regex search cases for the explanation given about.

    1: http://www.regular-expressions.info/email.html
    2: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 12, 2011

    New changeset 06cca90ff105 by orsenthil in branch 'default':
    Fix bpo-11283 - Clarifying a re pattern in the re module docs for conditional regex
    http://hg.python.org/cpython/rev/06cca90ff105

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 12, 2011

    New changeset d676601fee6f by Senthil Kumaran in branch '3.1':
    Fix bpo-11283 - Clarifying a re pattern in the re module docs for conditional regex
    http://hg.python.org/cpython/rev/d676601fee6f

    @orsenthil
    Copy link
    Member

    Okay, fixed in all relevant branches.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    docs Documentation in the Doc dir topic-regex
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant