Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenAPI schema contains invalid unicode characters if project uses a Django URL field #175

Closed
fyhertz opened this issue Oct 22, 2020 · 4 comments

Comments

@fyhertz
Copy link

fyhertz commented Oct 22, 2020

Describe the bug

Not really a bug of drf-spectacular per se, but probably worth mentioning:

When using django.db.model.URLField, the OpenAPI spec ends up containing the same regex pattern that Django uses to validates URLs:

"url": {
  "type": "string",
  "format": "uri",
  "nullable": true,
  "maxLength": 200,
  "pattern": "^(?:[a-z0-9.+-]*)://(?:[^\\s:@/]+(?::[^\\s:@/]*)?@)?(?:(?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}|\\[[0-9a-f:.]+\\]|([a-z¡-�0-9](?:[a-z¡-�0-9-]{0,61}[a-z¡-�0-9])?(?:\\.(?!-)[a-z¡-�0-9-]{1,63}(?<!-))*\\.(?!-)(?:[a-z¡-�-]{2,63}|xn--[a-z0-9]{1,59})(?<!-)\\.?|localhost))(?::\\d{2,5})?(?:[/?#][^\\s]*)?\\Z"
}

Issues arise when the JSON or YAML parser used by a consumer of OpenAPI specs does not allow such characters. For instance pyyaml will raise an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/work/venv/lib/python3.8/site-packages/yaml/__init__.py", line 162, in safe_load
    return load(stream, SafeLoader)
  File "/home/simon/work/venv/lib/python3.8/site-packages/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "/home/simon/work/venv/lib/python3.8/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/home/simon/work/venv/lib/python3.8/site-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
  File "/home/simon/work/venv/lib/python3.8/site-packages/yaml/reader.py", line 143, in check_printable
    raise ReaderError(self.name, position, ord(character),
yaml.reader.ReaderError: unacceptable character #xffff: special characters are not allowed
  in "<unicode string>", position 101019

Expected behavior

What do you think is the best approach here? Should OpenAPI specs produced by drf-spectacular avoid invalid unicode characters?

@fyhertz fyhertz changed the title OpenAPI schema contains non utf-8 characters if project uses a Django URL field OpenAPI schema contains invalid unicode characters if project uses a Django URL field Oct 22, 2020
@KimSoungRyoul
Copy link
Contributor

oh... i think it's difficult issue

yaml/pyyaml#250
https://yaml.org/spec/1.2/spec.html#id2770814

YAML Standard Spec does not allow these things

therefore
from pyaml's point of view , it's not a bug

but regex should be allow to include unicode like a #xffff
especially for URLRegexValidation

and
https://swagger.io/specification/
OAS Standard Document does not mention about this


drf-spectacular should be trade off

  1. support safety Yaml load to drf-spectacular.OpenApiYamlRenderer
    and get Damaged Regex

or

  1. support original Regex
    and get incomplete Yaml as it is now

in my opinion, The second seems more appropriate.
even if someone who want to use yaml.load() should be customize yaml Loader


I haven't checked it out yet. but
I think it's helpful pydantic schema how to resolve these issue
https://pydantic-docs.helpmanual.io/usage/schema/
and
tell me
"pydantic resolve these issue blabla..."
then i will support these issue in imitation of their ways

P.S. Oh and , I'm not a core contributor so it's just only my opinion

@tfranzel
Copy link
Owner

@KimSoungRyoul thanks for lookup that up. that is a tricky one! ok so lets get down to brass tacks.

  • with PyYAML==5.3.1 i can read AND write this regex to file. unicode point ¡ is written as literal and \uFFFF is apparently escaped by pyyaml. i cannot speak to other versions.
  • it looks like because \uFFFF is escaped by the lib, the whole string is quoted which in turn breaks JSON schema validation. simply unquoting it by hand at least solves the JSON schema violation. i have a hard time understanding why that is exactly.

these are the assumptions i currently hold:

with that said, these are the problems:

  • regular expressions in python may be PCRE (ECMA is a "subset" of PCRE)
  • transcoding PCRE to ECMA is infeasable
  • the URL regex uses functionality not covered by ECMA-262 5.1 (negative lookbehind)

so i would simply propose removing only the Django URL regex and adding format: uri because it is kind of broken here anyway.
for all other regex patterns, i would do ''.encode('unicode_escape') and hope for the best. this also gets rid of the quoting. i think that is a reasonable tradeoff.

@tfranzel
Copy link
Owner

@fyhertz so this is the current state. please check this out and close the issue if it works for you.

  • i removed the URL pattern because it is not ECMA compliant (usage of negative lookbehind).
  • normalizing all other pattern occurrences for unicode and simple anchors. this is a best effort and people may still break compliance there.

the normalizing even worked surprisingly well for the URL pattern. i tested it with https://editor.swagger.io/ (in chrome 86) and it properly validated all URLs i tried. But for compliance sake i opted not to include it anymore.

      parameters:
      - in: header
        name: url_test
        required: True
        schema:
          type: string
          format: uri
          pattern: ^(?:[a-z0-9.+-]*)://(?:[^\s:@/]+(?::[^\s:@/]*)?@)?(?:(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}|\[[0-9a-f:.]+\]|([a-z\u00a1-\uffff0-9](?:[a-z\u00a1-\uffff0-9-]{0,61}[a-z\u00a1-\uffff0-9])?(?:\.(?!-)[a-z\u00a1-\uffff0-9-]{1,63}(?<!-))*\.(?!-)(?:[a-z\u00a1-\uffff-]{2,63}|xn--[a-z0-9]{1,59})(?<!-)\.?|localhost))(?::\d{2,5})?(?:[/?#][^\s]*)?$

@fyhertz
Copy link
Author

fyhertz commented Oct 26, 2020

Great ! Thanks @tfranzel, I'm closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants