Misleading documentations and comments in regular expression HOWTO #62979

Closed
vajrasky mannequin opened this issue Aug 19, 2013 · 5 comments
Labels
docs Documentation in the Doc dir topic-regex

Comments

@vajrasky
Mannequin

vajrasky mannequin commented Aug 19, 2013

BPO 18779
Nosy @akuchling, @pitrou, @ezio-melotti, @bitdancer, @vajrasky
Files
  • fix_alphanumeric_and_underscore_doc_in_regex.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-02-15.02:46:23.850>
    created_at = <Date 2013-08-19.07:45:16.799>
    labels = ['expert-regex', 'docs']
    title = 'Misleading documentations and comments in regular expression HOWTO'
    updated_at = <Date 2017-02-15.02:46:23.836>
    user = 'https://github.com/vajrasky'

    bugs.python.org fields:

    activity = <Date 2017-02-15.02:46:23.836>
    actor = 'akuchling'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2017-02-15.02:46:23.850>
    closer = 'akuchling'
    components = ['Documentation', 'Regular Expressions']
    creation = <Date 2013-08-19.07:45:16.799>
    creator = 'vajrasky'
    dependencies = []
    files = ['31371']
    hgrepos = []
    issue_num = 18779
    keywords = ['patch']
    message_count = 5.0
    messages = ['195611', '195617', '195618', '195627', '287809']
    nosy_count = 7.0
    nosy_names = ['akuchling', 'pitrou', 'ezio.melotti', 'mrabarnett', 'r.david.murray', 'docs@python', 'vajrasky']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue18779'
    versions = ['Python 3.3', 'Python 3.4']

    @vajrasky
    Mannequin Author

    vajrasky mannequin commented Aug 19, 2013

    According to:

    http://oald8.oxfordlearnersdictionaries.com/dictionary/alphanumeric
    http://en.wikipedia.org/wiki/Alphanumeric

    Alphanumeric is defined as [A-Za-z0-9]. The underscore (_) is not included. One document in the Python docs (Doc/tutorial/stdlib2.rst) distinguishes the two very clearly:

    "The format uses placeholder names formed by $ with valid Python identifiers
    (alphanumeric characters and underscores). Surrounding the placeholder with
    braces allows it to be followed by more alphanumeric letters with no intervening
    spaces. Writing $$ creates a single escaped $::"

    Yet in the documentation, as well as in the comments in the regex code, we implicitly assume that the underscore belongs to the alphanumerics.

    Explicit is better than implicit!

    I have attached a patch that differentiates alphanumerics and underscores in the regex documentation and comments.

    This is important in case someone is confused by this code:
    >>> import re
    >>> re.split(r'\W', 'haha$hihi*huhu_hehe hoho')
    ['haha', 'hihi', 'huhu_hehe', 'hoho']
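To make the contrast concrete, here is a small sketch (not part of the attached patch) that splits the same string twice: once on `\W` alone, and once on a character class that also lists the underscore explicitly. The variable names are only illustrative.

```python
import re

text = 'haha$hihi*huhu_hehe hoho'

# \W excludes the underscore, so 'huhu_hehe' stays in one piece.
by_nonword = re.split(r'\W', text)

# Adding '_' to the class splits on the underscore as well.
by_nonword_or_underscore = re.split(r'[\W_]', text)

print(by_nonword)
print(by_nonword_or_underscore)
```

A reader who expects "alphanumeric" to mean [A-Za-z0-9] would probably expect the second result from the first call.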

    As a side note:
    In the Python code base, we sometimes write "alphanumerics" and "underscores", and sometimes "alphanumeric characters" and "underscore characters". Which one is preferred?

    @vajrasky vajrasky mannequin assigned docspython Aug 19, 2013
    @vajrasky vajrasky mannequin added the docs Documentation in the Doc dir label Aug 19, 2013
    @pitrou
    Member

    pitrou commented Aug 19, 2013

    I was wondering which doc you were alluding to, before I noticed your patch is against the regex HOWTO.
    The HOWTO seems quite outdated wrt. Python 3. For example, "\w" is no longer equivalent to "[a-zA-Z0-9_]", except with the ASCII flag.
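A minimal sketch of the difference described above, assuming Python 3 `str` patterns. The sample string is made up for illustration and is written with escapes so the non-ASCII characters are unambiguous.

```python
import re

# "naïve_café" followed by the Arabic-Indic digits one, two, three.
text = "na\u00efve_caf\u00e9 \u0661\u0662\u0663"

# Default for str patterns: \w follows Unicode, so accented letters
# and non-ASCII digits count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w behaves like the explicit class [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
explicit_words = re.findall(r"[a-zA-Z0-9_]+", text)

print(unicode_words)
print(ascii_words == explicit_words)
```

In both modes the underscore is matched; only the set of letters and digits changes.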

    @pitrou pitrou changed the title Misleading documentations and comments in regular expression about alphanumerics and underscore Misleading documentations and comments in regular expression HOWTO Aug 19, 2013
    @vajrasky
    Mannequin Author

    vajrasky mannequin commented Aug 19, 2013

    In Lib/re.py, starting from line 77 (Python 3.4):

    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    

    The prelude is "Matches any alphanumeric character;".

    Yet, in any case (bytes, string patterns with ascii flag, string patterns without the ascii flag, strings with locale), the underscore is always included.

    Then why don't we change the prelude to "Matches any alphanumeric character and the underscore character;"? The description already explains that, depending on whether the pattern is Unicode or not, "alphanumeric" can mean [A-Za-z0-9] or something wider.

    The description is already okay but the prelude is misleading readers.
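The claim that the underscore is included in every mode can be checked with a short sketch like the one below. The case labels are illustrative, and the `[^\W_]` class at the end is one common way to exclude the underscore; it is not something the patch proposes.

```python
import re

# Each entry: (label, pattern, subject, flags).
cases = [
    ("str, default (Unicode)", r"\w", "_", 0),
    ("str, ASCII flag", r"\w", "_", re.ASCII),
    ("bytes", rb"\w", b"_", 0),
    ("bytes, LOCALE flag", rb"\w", b"_", re.LOCALE),
]

# True for a label if \w matches a lone underscore in that mode.
underscore_matches = {
    label: re.fullmatch(pattern, subject, flags) is not None
    for label, pattern, subject, flags in cases
}

# "Word character, but not underscore": alphanumerics only.
alnum_only = re.compile(r"[^\W_]")

print(underscore_matches)
print(alnum_only.fullmatch("_"))
```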

    @bitdancer
    Member

    The answer to the question about "alphanumerics" versus "alphanumeric characters" is that it is most likely context-dependent, so I'd have to see particular examples to say which I thought read better. So, there is no One True Answer for this question, I think.

    @akuchling
    Member

    Unfortunately, making the sentences pedantically correct also makes them ungainly, and I think people generally assume that underscores are treated as letters.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022