Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add plugin to infer more precise regex match types #7803

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

Michael0x2a
Copy link
Collaborator

This pull request adds a plugin to make mypy infer more precise types when grabbing regex groups: the plugin will when possible analyze original regex to deduce whether a given group is required or not.

from typing_extensions import Final, Literal
import re

pattern: Final = re.compile("(a)(b)*")
match: Final = pattern.match("")
if match:
    reveal_type(match.groups())  # Revealed type is Tuple[str, Optional[str]]
    reveal_type(match.group(0))  # Revealed type is str
    reveal_type(match.group(1))  # Revealed type is str
    reveal_type(match.group(2))  # Revealed type is Optional[str]

    index: int
    reveal_type(match.group(index))  # Revealed type is Optional[str]

    # Error: Regex has 3 total groups, given group number 5 is too big
    match.group(5)

To track this information, I added in an optional 'metadata' dict field to the Instance class, similar to the metadata dict for plugins in TypeInfos. We skip serializing this dict if it does not contain any data.

A limitation of this plugin is that both the pattern and the match variables must be declared to be final. Otherwise, we just default to using whatever types are defined in typeshed.

This is because we set and erase the metadata field in exactly the same way we set and erase the last_known_value field in Instances: both kinds of info are "transient" and are unsafe to keep around if the variable reference is mutable.

This limitation does end up limiting the usefulness of this plugin to some degree: it won't support common patterns like the below, since variables aren't allowed to be declared final inside loops:

for line in file:
    match = pattern.match(line)
    if match:
        ...

Possibly we can remove this limitation by making mypy less aggressive about removing this transient info by tracking the "lifetime" of this sort of data in some way?

This pull request should mostly address #7363, though it's unclear if it really fully resolves it: we might want to do something about the limitation described above and re-tune typeshed first.

The other mostly unrelated change this PR makes is to refactor some of the helper functions in checker.py into typeops.py so I could use them more cleanly in the plugin.

This pull request adds a plugin to make mypy infer more
precise types when grabbing regex groups: the plugin will
when possible analyze original regex to deduce whether a
given group is required or not.

```
from typing_extensions import Final, Literal
import re

pattern: Final = re.compile("(a)(b)*")
match: Final = pattern.match("")
if match:
    reveal_type(match.groups())  # Revealed type is Tuple[str, Optional[str]]
    reveal_type(match.group(0))  # Revealed type is str
    reveal_type(match.group(1))  # Revealed type is str
    reveal_type(match.group(2))  # Revealed type is Optional[str]

    index: int
    reveal_type(match.group(index))  # Revealed type is Optional[str]

    # Error: Regex has 3 total groups, given group number 5 is too big
    match.group(5)
```

To track this information, I added in an optional 'metadata' dict
field to the Instance class, similar to the metadata dict for
plugins in TypeInfos. We skip serializing this dict if it does not
contain any data.

A limitation of this plugin is that both the pattern and the match
variables must be declared to be final. Otherwise, we just default
to using whatever types are defined in typeshed.

This is because we set and erase the metadata field in exactly the
same way we set and erase the `last_known_value` field in Instances:
both kinds of info are "transient" and are unsafe to keep around if
the variable reference is mutable.

This limitation *does* end up limiting the usefulness of this plugin
to some degree: it won't support common patterns like the below, since
variables aren't allowed to be declared final inside loops:

```
for line in file:
    match = pattern.match(line)
    if match:
        ...
```

Possibly we can remove this limitation by making mypy less aggressive
about removing this transient info by tracking the "lifetime" of this
sort of data in some way?

This pull request should mostly address
python#7363, though it's unclear if it
really fully resolves it: we might want to do something about the
limitation described above and re-tune typeshed first.

The other mostly unrelated change this PR makes is to refactor some of
the helper functions in checker.py into typeops.py so I could use them
more cleanly in the plugin.
@Michael0x2a
Copy link
Collaborator Author

Michael0x2a commented Oct 27, 2019

One other note: it's possible that this plugin is a bit too intrusive to include in mypy itself. While the actual regex-related code is pretty lightweight, it does require us to now have to worry about interfacing with the sre_parse module and the (undocumented?) Python regex internals.

For example, this PR is currently blocked by python/typeshed#3412.

So maybe this is better off being spun out into a separate plugin? Not sure.

@ilevkivskyi
Copy link
Member

So maybe this is better off being spun out into a separate plugin? Not sure.

FWIW I think everything related to stdlib (like builtins, ctypes, contextlib, and dataclasses) should be bundled with mypy and all plugins for third party libs (like django or sqlalchemy) should be separate (attrs is the only historic exception form this pattern).

@JukkaL
Copy link
Collaborator

JukkaL commented Oct 28, 2019

I think that the requirement to use Final unfortunately limits the usefulness of this, and may make it a bit too marginal to be added to mypy proper. The main problem I see is that this doesn't seem very discoverable -- users would need to learn that for return values from these specific functions, they should use Final. I wonder if it would be better to detect locals that are only ever assigned once, and effectively make them implicitly final. This way the new feature would work for most existing code without changes, making it much more generally useful.

A few other random notes:

  • Instead of adding a new attribute to Instance, what about merging the attribute with last_known_value to save memory? Dictionary might be too big in terms of memory use for this, however. Something like a linked list of "extra attribute objects" would be pretty memory-efficient. Not sure if the latter optimization is worth it though.
  • The PR could be made smaller by isolating the refactoring which moves code to typeops into a separate PR.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@97littleleaf11 97littleleaf11 added this to Review in progress in mypy Jan 5, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

Diff from mypy_primer, showing the effect of this PR on open source code:

psycopg (https://github.com/psycopg/psycopg)
+ psycopg/psycopg/types/range.py:467: error: Invalid regex: unbalanced parenthesis  [misc]

zulip (https://github.com/zulip/zulip)
+ zerver/openapi/markdown_extension.py:176: error: Statement is unreachable  [unreachable]
+ zerver/lib/markdown/__init__.py:890: error: Statement is unreachable  [unreachable]
+ zerver/lib/compatibility.py:53: error: Statement is unreachable  [unreachable]

sympy (https://github.com/sympy/sympy)
+ sympy/parsing/mathematica.py:93: error: Invalid regex: unbalanced parenthesis
+ sympy/parsing/mathematica.py:102: error: Invalid regex: missing ), unterminated subpattern
+ sympy/parsing/mathematica.py:122: error: Invalid regex: unterminated character set

porcupine (https://github.com/Akuli/porcupine)
+ porcupine/plugins/geometry.py:14: error: Statement is unreachable  [unreachable]
+ porcupine/plugins/editorconfig.py:150: error: Statement is unreachable  [unreachable]
+ porcupine/plugins/editorconfig.py:176: error: Statement is unreachable  [unreachable]

dragonchain (https://github.com/dragonchain/dragonchain)
- dragonchain/broadcast_processor/broadcast_functions.py:95:32: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
- dragonchain/lib/authorization.py:315:19: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
- dragonchain/lib/authorization.py:317:25: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
- dragonchain/lib/authorization.py:340:31: error: Item "None" of "Optional[Match[str]]" has no attribute "group"

aiohttp (https://github.com/aio-libs/aiohttp)
+ aiohttp/multipart.py:812: error: Invalid regex: bad character range \\-.  [misc]

streamlit (https://github.com/streamlit/streamlit)
+ lib/streamlit/git_util.py: note: In member "get_repo_info" of class "GitRepo":
+ lib/streamlit/git_util.py:163:13: error: Statement is unreachable  [unreachable]

@97littleleaf11
Copy link
Collaborator

( or ) inside comments of multiline regex causes false positive of mypy_primer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
mypy
Review in progress
Development

Successfully merging this pull request may close these issues.

None yet

4 participants