change regex to use re.search or re.fullmatch #1631

Closed
yurikhan opened this issue Jun 14, 2020 · 11 comments
Labels
Change (suggested alteration to pydantic, not a new feature nor a bug) · documentation · Feedback Wanted


@yurikhan
Contributor

Bug

Output of python -c "import pydantic.utils; print(pydantic.utils.version_info())":

             pydantic version: 1.5.1
            pydantic compiled: False
                 install path: /home/khan/.local/lib/python3.6/site-packages/pydantic
               python version: 3.6.9 (default, Nov  7 2019, 10:44:02)  [GCC 8.3.0]
                     platform: Linux-4.15.0-88-generic-x86_64-with-Ubuntu-18.04-bionic
     optional deps. installed: ['typing-extensions']

Related: #1396

import jsonschema
import pydantic

class Foo(pydantic.BaseModel):
    bar: str = pydantic.Field(..., regex='baz')

try:
    Foo(bar='bar baz quux')
except pydantic.ValidationError as e:
    print(e)
    # ValidationError: 1 validation error for Foo
    # bar
    #   string does not match regex "baz" (type=value_error.str.regex; pattern=baz)
else:
    print('Valid')

try:
    jsonschema.validate({'bar': 'bar baz quux'}, Foo.schema())
except jsonschema.ValidationError as e:
    print(e)
else:
    print('Valid')
    # Valid

In JSON Schema, all versions so far, the regular expression in the pattern keyword is treated as unanchored at both ends, i.e. re.search behavior rather than re.match or re.fullmatch.

Pydantic uses re.match to validate strings when a regex argument is passed to Field, as explained in #1396.
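
To make the distinction concrete, here is a minimal standard-library demonstration on the example string from above:

import re

s = 'bar baz quux'

print(bool(re.search('baz', s)))     # True  - unanchored, the JSON Schema pattern behavior
print(bool(re.match('baz', s)))      # False - implicitly anchored at the start, what pydantic currently does
print(bool(re.fullmatch('baz', s)))  # False - implicitly anchored at both ends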

When constructing a JSON Schema from a model, Pydantic generates a pattern keyword from a Field regex argument, without any regex postprocessing. Thus, the resulting JSON Schema validates differently from the original Pydantic model.

I strongly feel Pydantic should follow suit and use re.search to validate fields with regex argument, and explicitly call this out in the documentation. (Anchors in example usage are not sufficient.)

Alternatively, if you are concerned with backward compatibility, I propose the following long-term solution:

  1. Add a new Field (and constr) argument, perhaps named pattern after the JSON Schema keyword, mutually exclusive with the existing regex argument. This is an API extension and is thus backward compatible.
  2. Implement validation for pattern using re.search.
  3. Change the behavior of BaseModel.schema to copy the pattern Field argument to the pattern schema keyword as is if present. This way, new users of the pattern argument get correct schema generation.
  4. One or more of:
    • Fix the behavior of BaseModel.schema so that a regex argument with value REGEX produces a schema pattern keyword with value ^(REGEX); see the sketch after this list. (The grouping is necessary because the original regex may contain alternatives.) This way, users of the regex argument get schemas that behave consistently with the original Pydantic model.
    • Deprecate regex, suggesting pattern with explicit anchoring.
    • Emit a warning if a model containing fields with the regex argument is used to generate a JSON Schema. This way, users of the regex argument who are likely to be adversely affected by the inconsistency get a heads-up.
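
To make point 4 concrete, a rough sketch of the rewriting described in its first bullet; the helper name is hypothetical and pydantic does not currently do anything like this:

import re

def anchor_for_json_schema(regex: str) -> str:
    # Hypothetical helper: wrap a regex that pydantic checks with re.match
    # (implicitly anchored at the start) so that an unanchored JSON Schema
    # pattern check accepts the same strings. The group is needed because
    # the regex may contain top-level alternatives such as 'a|b'.
    return f'^({regex})'

print(anchor_for_json_schema('baz'))                          # ^(baz)
print(bool(re.search(anchor_for_json_schema('a|b'), 'xb')))   # False - same as re.match('a|b', 'xb')
print(bool(re.search(anchor_for_json_schema('a|b'), 'ab')))   # True  - same as re.match('a|b', 'ab')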
@yurikhan added the bug V1 (Bug related to Pydantic V1.X) label on Jun 14, 2020
@samuelcolvin added the Change (Suggested alteration to pydantic, not a new feature nor a bug) and Feedback Wanted labels and removed the bug V1 (Bug related to Pydantic V1.X) label on Jun 16, 2020
@samuelcolvin
Member

This is not a bug, but a suggested change.

Happy to change this in v2, personally I think re.fullmatch behaviour would be preferable to re.search, but I'm open to input from others.

However, it looks like JSON Schema is closer to re.search. Is that correct or are there any nuances?

As described in #1478, pydantic tries to follow 2019-09.

@samuelcolvin added this to the Version 2 milestone on Jun 16, 2020
@samuelcolvin changed the title from "Field regex validation is inconsistent with JSON Schema pattern validation" to "change regex to use re.search or re.fullmatch" on Jun 16, 2020
@yurikhan
Contributor Author

There are multiple aspects here.

  1. There exist many ways to test a string against a regular expression. In Python, they are re.search (no implicit anchoring), re.match (implicit anchor to the beginning of input), and re.fullmatch (implicit anchors to the beginning and end of input). This distinction is useful, and we, users of regular expressions, must be aware of it.

  2. JSON Schema unambiguously calls for unanchored regular expression testing; see Core § 6.4 paragraph 3 for an authoritative Draft 2019-09 reference. And yes, this is exactly re.search; the jsonschema library directly uses re.search in its implementation of pattern validation.

    Arguably, JSON Schema should have chosen implicit anchoring on both ends, but, unfortunately, the verbiage about anchoring first appeared only in Draft 04. This suggests that early implementations of Drafts 01–03 arbitrarily picked unanchored test (because that’s what Perl’s =~ /foo/ does), and Draft 04 had to codify the existing practice.

  3. The choice of anchoring semantics in Pydantic is your prerogative as the library author and maintainer. No bug here but two mutually opposite possibilities for change.

  4. That choice should be explicitly and accurately documented. Currently, the documentation says:

    regex: for string values, this adds a Regular Expression validation generated from the passed string and an annotation of pattern to the JSON Schema

    Because the exact method is not mentioned and the JSON Schema pattern keyword is, the reader is led to believe that Pydantic treats regex the same way JSON Schema treats pattern, that is, unanchored. This is a documentation bug. It could be fixed by adding a note to the effect of “Pydantic uses re.match to validate strings”.

  5. My expectation, and likely your intent, was that a generated schema accepts exactly the same strings as the model it was generated from. Because the model and the schema use the same regular expression but different implicit anchoring semantics, that equivalence does not hold. I consider it a bug that the actual behavior does not match expectations. It could be fixed:

    • by changing Pydantic validation to use the same anchoring semantics, i.e. re.search

      • con: breaking change
      • pro: users who deal with both Pydantic and JSON Schema have an easier time remembering that regexen are not implicitly anchored
    • or by exporting a different, explicitly anchored regex into the schema

      • pro: mostly backward-compatible
      • con: users have to remember each tool’s anchoring semantics
      • con: the generated schema becomes less readable, e.g. a regex='^foo$' becomes "pattern": "^(^foo$)"
        • This could be further solved by clever regex rewriting, but it’s a nontrivial exercise.
    • or by adjusting expectations, such as documenting Pydantic’s use of re.match, calling out the difference in behavior with JSON Schema validators, and suggesting that users who need schema interoperability always make sure their regexen are either explicitly anchored with ^ or explicitly unanchored with .* (see the sketch after this list).

      • pro: fully backward-compatible
      • con(?): users have to take care composing regular expressions
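
As a sketch of that last option, the original example with an explicitly anchored regex behaves the same under pydantic v1 (re.match) and the jsonschema library (re.search):

import jsonschema
import pydantic

class Foo(pydantic.BaseModel):
    # The leading '^' makes the start anchor explicit, so re.match (implicitly
    # start-anchored) and re.search (unanchored) accept exactly the same strings.
    bar: str = pydantic.Field(..., regex='^baz')

for value in ('baz quux', 'bar baz quux'):
    try:
        Foo(bar=value)
        model_ok = True
    except pydantic.ValidationError:
        model_ok = False
    try:
        jsonschema.validate({'bar': value}, Foo.schema())
        schema_ok = True
    except jsonschema.ValidationError:
        schema_ok = False
    print(value, model_ok, schema_ok)

# baz quux True True
# bar baz quux False False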

@samuelcolvin
Member

I agree with all of this. Happy to accept a PR to correct the documentation.

In v2 we have the chance to make breaking changes and get things right. I personally think using re.match() was definitely a mistake - we should either use unanchored regexes or regexes anchored at beginning and end - re.match() is the most confusing choice.

We could switch to either re.search or re.fullmatch; I'm flexible about which. Like you said: JSON Schema is most like re.search, but re.fullmatch is probably the least confusing in most scenarios.

I will accept whichever seems most popular here. @yurikhan you might be the only person other than me who cares enough to think about this; in that case I'll accept your preference.

@yurikhan
Contributor Author

Doc patch will follow. Not immediately but possibly on the weekend.

+1 to re.match being confusing — of the three, it is simplest to implement and hardest to use correctly.

As for behavior in v2, my vote is for re.search, because it achieves model/schema equivalence with minimum hassle.

yurikhan added a commit to yurikhan/pydantic that referenced this issue Jun 21, 2020
yurikhan added a commit to yurikhan/pydantic that referenced this issue Jun 21, 2020
yurikhan added a commit to yurikhan/pydantic that referenced this issue Jul 1, 2020
yurikhan added a commit to yurikhan/pydantic that referenced this issue Jul 1, 2020
samuelcolvin pushed a commit that referenced this issue Jul 1, 2020
@steve-mavens

steve-mavens commented May 12, 2022

"by adjusting expectations, such as documenting Pydantic’s use of re.match, calling out the difference in behavior with JSON Schema validators and suggesting that users who need schema interoperability always make sure their regexen are either explicitly anchored with ^ or explicitly unanchored with .*"

It's actually a bit worse than that. I'm not sure it's possible to get schema interoperability. Because of the use of re.match, "foo\n" (with a linebreak at the end) matches the regex constraint ^foo$ (or ^[a-z]+$, which is what you might naively write to mean "lowercase letters only", thinking that you are anchoring both ends). "foo\n" doesn't match ^foo\Z, but that's not portable, because \Z in a JSON Schema won't be interpreted to mean that by non-Python regex engines. re.search has the same problem as re.match here.
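
A quick standard-library demonstration of the trailing-newline behaviour described above:

import re

print(bool(re.match(r'^foo$', 'foo\n')))    # True  - $ also matches just before a trailing newline
print(bool(re.fullmatch(r'foo', 'foo\n')))  # False - fullmatch requires the entire string to match
print(bool(re.match(r'^foo\Z', 'foo\n')))   # False - \Z matches only at the very end of the string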

Java regexes have \z to mean the same thing as Python's \Z, whereas Java's \Z means the same as $: it matches before a terminating newline, not after. JavaScript doesn't have either as far as I can tell, but JavaScript's documentation says that $ matches the end of a string; I haven't tested whether it really means it. Yuck. Ultimately it might be that regex constraints in JSON Schema are inherently limited in this way, and to get schema interoperability you (not pydantic in particular: any library in any language) need to use an ECMA-compatible regex engine to validate.

@samuelcolvin
Member

Luckily with pydantic v2 we don't have much choice without harming performance - we have to use the regex crate's is_match, which is roughly the same as re.search.

@steve-mavens

steve-mavens commented May 12, 2022

That seems to have \z but not \Z, so I guess the regex parameter of pydantic fields will then have different syntax from Python regexes? I guess this is going to start happening to a lot of libraries as Rust use increases? Not that I blame anyone in particular, this is just what happens when people start embedding useful but non-standardised DSLs like regex in their programming languages ;-)

@samuelcolvin
Member

If enough people really cared, we could use python's regex library, or even make it configurable, but I'm -1 based on faff and performance.

pydantic/pydantic-core#3

@steve-mavens

steve-mavens commented May 12, 2022

I can certainly live with Rust's regexes. I'm generating JSON schemas from some of my models for validation in other languages, so Python-specific syntax doesn't help me much anyway.

If it was going to be configurable, then maybe it'd be nice if the regex parameter accepted the result of re.compile (or indeed anything else with a search method), on the understanding Pydantic would then refuse to generate a JSON Schema since what you're doing is inherently local, not serialisable. But fundamentally that's just a validator, and you already have those with a different syntax, so probably not worth any faff.

@samuelcolvin
Member

Well you can do that easily using validator functions, and even customize the schema as you wish.
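
For reference, a minimal sketch of that approach using the pydantic v1 validator API, applying a pre-compiled, unanchored pattern via .search (the pattern and field names are only illustrative):

import re
import pydantic

BAR_PATTERN = re.compile('baz')  # illustrative; any pre-compiled pattern (or other object with .search) works

class Foo(pydantic.BaseModel):
    bar: str

    @pydantic.validator('bar')
    def bar_matches_pattern(cls, v):
        # Unanchored check, mirroring JSON Schema pattern semantics.
        if not BAR_PATTERN.search(v):
            raise ValueError(f'string does not match pattern {BAR_PATTERN.pattern!r}')
        return v

As noted above, the generated JSON Schema will not advertise this constraint as a pattern keyword; the check stays local to the Python model.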

@adriangb
Member

I think the original issue is resolved. In the current version of V2 the example given is "Valid" for both validation libraries.
