Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganize the re module sources #91308

Closed
serhiy-storchaka opened this issue Mar 29, 2022 · 31 comments
Closed

Reorganize the re module sources #91308

serhiy-storchaka opened this issue Mar 29, 2022 · 31 comments
Labels
3.11 expert-regex stdlib Python modules in the Lib dir

Comments

@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Mar 29, 2022

BPO 47152
Nosy @gvanrossum, @vstinner, @ezio-melotti, @serhiy-storchaka, @animalize, @asottile
PRs
  • bpo-47152: Convert the re module into a package #32177
  • [WIP] bpo-23689: re module, fix memory leak when a match is terminated by a signal #32188
  • bpo-47152: Move sources of the _sre module into a subdirectory #32290
  • bpo-47152: Remove unused import in re #32298
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-03-29.15:54:32.145>
    labels = ['expert-regex', 'library', '3.11']
    title = 'Reorganize the re module sources'
    updated_at = <Date 2022-04-05.08:10:39.512>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2022-04-05.08:10:39.512>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2022-03-29.15:54:32.145>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 47152
    keywords = ['patch']
    message_count = 26.0
    messages = ['416268', '416294', '416320', '416328', '416471', '416497', '416502', '416523', '416543', '416545', '416547', '416548', '416551', '416557', '416563', '416591', '416593', '416595', '416615', '416657', '416659', '416667', '416676', '416693', '416747', '416761']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'vstinner', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'malin', 'Anthony Sottile', 'dom1310df']
    pr_nums = ['32177', '32188', '32290', '32298']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue47152'
    versions = ['Python 3.11']

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Mar 29, 2022

    I proposed it several years ago on the Python-Dev mailing list and that change was approved in general. The reorganization was deferred because there were several known bugs in the RE engine (fixes for which could potentially be backported) and there were not merged patches waiting for review. Now the patch for atomic groups was merged and bugs was fixed (thanks to Ma Lin).

    Both the C code and the Python code for the re module are distributed on few files, which lie down in directories Modules and Lib. It makes difficult to work with all related files because they are intermixed with source files of different modules.

    The following changes are planned:

    1. Convert the re module into a package. Make sre_* modules its submodules.
    2. Move C sources for the _sre module into a separate directory.
    3. Extract the code for generating definitions of C constants from definitions of Python constants into a separate script and add it in the Tools/scripts directory (there are precedences: generate_token.py, etc).

    @serhiy-storchaka serhiy-storchaka added 3.11 stdlib Python modules in the Lib dir expert-regex labels Mar 29, 2022
    @dom1310df
    Copy link
    Mannequin

    dom1310df mannequin commented Mar 29, 2022

    Could the sre_parse and sre_constants modules be kept with public names (i.e. without the leading underscore) but within the re namespace? I use them to tokenize and then syntax highlight regular expressions.

    I did a quick search and found a few other users of the modules:

    • pydoctor uses them for regex syntax highlighting[1], although it has its own copy of the sre_parse source rather than importing from stdlib.
    • lark uses sre_parse to find minimum and maximum length of matching strings[2]
    • sre_yield uses them to determine all strings that will match a regex[3]

    The whole modules don't necessarily need exposing, but certainly sre_parse.parse, sre_parse.parse_template, and the opcodes from sre_constants would be the most useful.

    [1] https://github.com/twisted/pydoctor/blob/c86273dffade5455890570142c8b7b068f5dffd1/pydoctor/epydoc/markup/_pyval_repr.py#L776
    [2] https://github.com/lark-parser/lark/blob/85ea92ebf4e983e9997f9953a9c1463bb3d1c6cc/lark/utils.py#L120
    [3] https://github.com/google/sre_yield/blob/3af063a0054c4646608b43b941fbfcbe4e01214a/sre_yield/__init__.py

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 30, 2022

    Please don't merge too close to the 3.11 beta1 release date, I'll submit PRs after this merged.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Mar 30, 2022

    It turns out that pip uses sre_constants in its copy of pyparsing. The problem is already fixed in the upstream of pyparsing and soon should be fixed in pip. We still need to keep sre_constants and maybe other sre_* modules, but deprecate them.

    Could the sre_parse and sre_constants modules be kept with public names (i.e. without the leading underscore) but within the re namespace?

    It is a good idea which will allow to minimize breakage in short term. You can write "from re import sre_parse", and it would work in old and new versions because sre_parse and sre_compile were imported in the re module. This trick does not work with sre_constants, you still need try/except.

    But the code that depends on these modules is fragile and can be broken by other ways.

    Please don't merge too close to the 3.11 beta1 release date, I'll submit PRs after this merged.

    I am going to implement step 2 only after merging your changes for bpo-23689.

    @vstinner
    Copy link
    Member

    vstinner commented Apr 1, 2022

    sre_constants, sre_compile and sre_parse are not tested and are not documented. I don't consider them as public API currently.

    If someone has good reason to use them, IMO we must clearly define which exact API is needed, properly document and test it.

    If we expose something, I don't think that the API would be exposed as re.sre_xxx.xxx, but as re.xxx.

    I suggest to hide sre_xxx submodules by adding an underscore to their name. Moreover, the "sre_" prefix is now redundant. I suggest renaming:

    • sre_constants => re._constants
    • sre_compile => re._compile
    • sre_parse => re._parse

    @gvanrossum
    Copy link
    Member

    gvanrossum commented Apr 1, 2022

    I don't mind reorganizing this, but I would insist that we keep code using old undocumented things (like the sre_* modules) working for several releases, using the standard deprecation approach.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 1, 2022

    Modules with old names are kept (deprecated). The questions are:

    1. Should we keep the sre_ prefix in new submodules? Should we prefix them with underscores?
    2. Should we keep only non-underscored names in the sre_* modules or undescored names too?

    @gvanrossum
    Copy link
    Member

    gvanrossum commented Apr 1, 2022

    1. If we're reorganizing anyway, I see no reason to keep the old names.
    2. For maximum backwards compatibility, I'd say keep as much as you can, as long as keeping it won't interfere with the reorganization.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 2, 2022

    New changeset 1be3260 by Serhiy Storchaka in branch 'main':
    bpo-47152: Convert the re module into a package (GH-32177)
    1be3260

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    $ ls Lib/re/
    _compiler.py  _constants.py  __init__.py  _parser.py

    Thanks, that's a nice enhancement!

    Serhiy: Would you mind to explicitly document the 3 deprecated modules in What's New in Python 3.11?
    https://docs.python.org/dev/whatsnew/3.11.html#deprecated

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    Is the "import _locale" still used in re/init.py? It cannot see any reference to it in the code and test_re still if it's removed.

    The last reference to the _locale module has been removed in 2017 by the commit 898ff03.

    diff --git a/Lib/re/__init__.py b/Lib/re/__init__.py
    index c47a2650e3..b887722bbb 100644
    --- a/Lib/re/__init__.py
    +++ b/Lib/re/__init__.py
    @@ -124,10 +124,6 @@
     import enum
     from . import _compiler, _parser
     import functools
    -try:
    -    import _locale
    -except ImportError:
    -    _locale = None
     
     
     # public symbols

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    It's funny to still see mentions of "experimental stuff" in Python 3.11 (2022), whereas these "experimental stuff" are there for 20 years.

    *Maybe* it's time to consider that re.template() and re.Scanner are no longer experimental? Maybe change their status to alpha or beta? :-D

    commit 770617b
    Author: Fredrik Lundh <fredrik@pythonware.com>
    Date: Sun Jan 14 15:06:11 2001 +0000

    SRE fixes for 2.1 alpha:
    

    +# sre extensions (experimental, don't rely on these)
    +T = TEMPLATE = sre_compile.SRE_FLAG_TEMPLATE # disable backtracking

    commit 7cafe4d
    Author: Fredrik Lundh <fredrik@pythonware.com>
    Date: Sun Jul 2 17:33:27 2000 +0000

    - actually enabled charset anchors in the engine (still not
      used by the code generator)
    
    - changed max repeat value in engine (to match earlier array fix)
    
    - added experimental "which part matched?" mechanism to sre; see
      http://hem.passagen.se/eff/2000_07_01_bot-archive.htm#416954
      or python-dev for details.
    

    +# experimental stuff (see python-dev discussions for details)
    +
    +class Scanner:
    (...)

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Apr 2, 2022

    In Modules folder, there are _sre.c/sre.h/sre_constants.h/sre_lib.h files. Will them be put into a folder?

    @asottile
    Copy link
    Mannequin

    asottile mannequin commented Apr 2, 2022

    would it be possible to expose parse_template -- or at least some way to validate that a regex replacement string is correct prior to executing the replacement?

    I'm currently using that for my text editor: https://github.com/asottile/babi/blob/d37d7d698d560aef7c6a0d1ec0668672e039bd9a/babi/screen.py#L501

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 2, 2022

    Is the "import _locale" still used in re/init.py? It cannot see any reference to it in the code and test_re still if it's removed.

    It is true.

    *Maybe* it's time to consider that re.template() and re.Scanner are no longer experimental? Maybe change their status to alpha or beta? :-D

    First we need to find original discussions for these features (it may be not easy) and decide whether we want to finish them or remove.

    In Modules folder, there are _sre.c/sre.h/sre_constants.h/sre_lib.h files. Will them be put into a folder?

    It is step 2.

    would it be possible to expose parse_template -- or at least some way to validate that a regex replacement string is correct prior to executing the replacement?

    Maybe, in some form. Currently you can precompile a pattern, but for a replacement string you rely on a LRU cache. It is slower, and limited by the fixed size of the cache. I think it would be worth to add a function for compiling a replacement string. sub() etc should accept both string and a precompiled template object. It is a separate issue.

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    See also bpo-40259: "re.Scanner groups".

    @vstinner
    Copy link
    Member

    vstinner commented Apr 2, 2022

    The re.template() function and the re.TEMPLATE functions are not documented and not tested.

    The re.Scanner class is not documented but has a test_scanner() test in test_re.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 3, 2022

    There are two very different classes with similar names: _sre.SRE_Scanner and re.Scanner. The former is used to implement the Pattern.finditer() method, but it could be used in other cases. The latter is an experimental implementation of generalized lexer using the former class. Both are undocumented. It is difficult to document Pattern.scanner() and _sre.SRE_Scanner because the class name contains implementation-specific prefix, and without it it would conflict with re.Scanner.

    But let leave it all to a separate issue.

    The original discussion about TEMPLATE was lost. Initially it only affected repetition operators, but now using them with TEMPLATE is error.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 4, 2022

    New changeset 1578f06 by Serhiy Storchaka in branch 'main':
    bpo-47152: Move sources of the _sre module into a subdirectory (GH-32290)
    1578f06

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 4, 2022

    New changeset ff2cf1d by Serhiy Storchaka in branch 'main':
    bpo-47152: Remove unused import in re (GH-32298)
    ff2cf1d

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Apr 4, 2022

    Match.regs is an undocumented attribute, it seems it has existed since 1991.
    Can it be removed?

    {"regs", (getter)match_regs_get, (setter)NULL},

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Apr 4, 2022

    For reference, I also implemented .regs in the regex module for compatibility, but I've never used it myself. I had to do some investigating to find out what it did!

    It returns a tuple of the spans of the groups.

    Perhaps I might have used it if it didn't have such a cryptic name and/or was documented.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 4, 2022

    Match.regs is an undocumented attribute, it seems it has existed since 1991.
    Can it be removed?

    It was kept for compatibility with the pre-SRE implementation of the re module. It was an implementation detail in the original Python code, but I am sure that somebody still uses it. I am sure some code still use it. If we are going to remove it, it needs to be deprecated first.

    @serhiy-storchaka
    Copy link
    Member Author

    serhiy-storchaka commented Apr 23, 2022

    All planned goals in this issue have been achieved.

    The-Compiler added a commit to The-Compiler/lark that referenced this issue Apr 26, 2022
    sre_parse and sre_constants are deprecated in 3.11, see python/cpython#91308
    The-Compiler added a commit to The-Compiler/hypothesis that referenced this issue Apr 26, 2022
    Fixes HypothesisWorks#3309
    See python/cpython#91308
    
    I don't see a way to avoid relying on Python 3.11 internals here, so
    this instead just adjusts the imports to avoid deprecation warnings.
    The-Compiler added a commit to The-Compiler/hypothesis that referenced this issue Apr 27, 2022
    Fixes HypothesisWorks#3309
    See python/cpython#91308
    
    I don't see a way to avoid relying on Python 3.11 internals here, so
    this instead just adjusts the imports to avoid deprecation warnings.
    jwilk added a commit to jwilk/pydiatra that referenced this issue May 11, 2022
    Since Python 3.11, sre_parse is a deprecated wrapper around re._parser:
    python/cpython#91308
    
    We must monkey-patch the latter.
    halstead pushed a commit to openembedded/bitbake that referenced this issue Jun 8, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    splitice pushed a commit to HalleyAssist/poky that referenced this issue Jun 8, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: 09888bd738d50e4e275c0d505493f1be2f95e324)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    halstead pushed a commit to openembedded/bitbake that referenced this issue Jun 10, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    splitice pushed a commit to HalleyAssist/poky that referenced this issue Jun 10, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: 4414cbd5199ab690e5a1c04196efc8118106ecb2)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    splitice pushed a commit to HalleyAssist/poky that referenced this issue Jun 10, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: 4414cbd5199ab690e5a1c04196efc8118106ecb2)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    splitice pushed a commit to HalleyAssist/poky that referenced this issue Jun 10, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: 4414cbd5199ab690e5a1c04196efc8118106ecb2)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    halstead pushed a commit to openembedded/bitbake that referenced this issue Jun 11, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    github-actions bot pushed a commit to homalozoa/poky_fork that referenced this issue Jun 11, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: 3c0cd401472ffee06d5a93bdba566cb033851fcf)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    halstead pushed a commit to openembedded/bitbake that referenced this issue Sep 16, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    (cherry picked from commit 3c0cd40)
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    armcc pushed a commit to lgirdk/poky that referenced this issue Sep 16, 2022
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: c98007217b8e40f1abfdcba709185dc5ddbcd0c2)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    (cherry picked from commit 3c0cd401472ffee06d5a93bdba566cb033851fcf)
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    jpuhlman pushed a commit to MontaVista-OpenSourceTechnology/poky that referenced this issue Sep 19, 2022
    Source: poky
    MR: 121051
    Type: Integration
    Disposition: Merged from poky
    ChangeID: a18a014
    Description:
    
    As reported by Martin Jansa <Martin.Jansa@gmail.com>:
    
    bitbake/lib/bb/cooker.py:16: DeprecationWarning: module 'sre_constants' is deprecated
      import sre_constants
    
    it's deprecated since 3.11 with:
    
      python/cpython#91308
    
    The correct replacement for our usage is re.error so use that instead.
    
    (Bitbake rev: c98007217b8e40f1abfdcba709185dc5ddbcd0c2)
    
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    (cherry picked from commit 3c0cd401472ffee06d5a93bdba566cb033851fcf)
    Signed-off-by: Steve Sakoman <steve@sakoman.com>
    Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
    Signed-off-by: Jeremy A. Puhlman <jpuhlman@mvista.com>
    slyon added a commit to slyon/rstr that referenced this issue Nov 8, 2022
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    slyon added a commit to slyon/rstr that referenced this issue Nov 8, 2022
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    slyon added a commit to slyon/rstr that referenced this issue Nov 8, 2022
    The undocumented sre_parse module got deprecated in Python 3.11 which leads
    to build/test failures on Linux distributions making use of this version, e.g.
    Ubuntu 23.04.
    
    See bpo-47152: python/cpython#91308
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 expert-regex stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants