schemas: Unicode support for pattern and patternProperties keywords #554

Closed
jgraettinger opened this issue Oct 13, 2021 · 0 comments

Is your enhancement related to a problem? Please describe.

Currently YLS compiles regular expressions without Unicode support.

This means that patterns using Unicode character groups (property escapes such as \p{Letter}) don't match as expected. For example:

$ node
Welcome to Node.js v12.21.0.
Type ".help" for more information.
> var pattern = '^[\\p{Letter}]+$';
undefined
> var rb = new RegExp(pattern);
undefined
> var ru = new RegExp(pattern, 'u');
undefined
> rb.test('test');
false
> ru.test('test');
true

Historically, the standard left Unicode handling in regular expressions undefined (implementations were free to choose), but draft 2020-12 of the standard clarifies this: regexes should support Unicode.

Technically, YLS targets draft-07 of JSON Schema and is thus conformant. However, users reasonably expect to be able to throw more recent schema versions at it and have it largely work. And it does, because the standard has remained broadly compatible.

However, the specific experience of using schemas with Unicode character groups is poor, because they fail to validate as expected.

Describe the solution you would like

YLS should build pattern and patternProperties keywords with Unicode support, using the 'u' flag, as in new RegExp(pattern, 'u').

This is allowed by the targeted draft-07 standard.
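As a rough illustration of the proposal, here is a minimal sketch of compiling a schema pattern with the 'u' flag. The helper name compilePattern is hypothetical and is not taken from the YLS code base. The fallback to a non-Unicode regex is also an assumption added for illustration, not something this issue proposes; it only shows one way to tolerate patterns that are not valid in Unicode mode.

```javascript
// Hypothetical helper (not YLS's actual code): compile a JSON Schema
// `pattern` / `patternProperties` source string with the 'u' flag so that
// \p{...} property escapes are interpreted.
function compilePattern(source) {
  try {
    return new RegExp(source, 'u');
  } catch (err) {
    // Assumption for illustration: fall back to legacy mode for patterns
    // that only parse without 'u' (for example an unnecessary escape such
    // as "\-" outside a character class).
    return new RegExp(source);
  }
}

const letters = compilePattern('^[\\p{Letter}]+$');
console.log(letters.test('test'));  // true
console.log(letters.test('test1')); // false

const legacy = compilePattern('^a\\-b$');
console.log(legacy.unicode);        // false: the fallback path was taken
console.log(legacy.test('a-b'));    // true
```

Whether a silent fallback is desirable is a design choice. Surfacing the syntax error to the schema author may be the better behaviour.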

Describe alternatives you have considered

Unfortunately there aren't any, aside from "don't use Unicode groups in patterns", which is pretty silly in 2021.

The downside risk is that this breaks pre-existing patterns which explicitly expect to handle Unicode as UTF-8 encoded bytes in a binary string. However:

  • Today, such patterns are in turn broken by other JSON schema implementations which do support Unicode.
  • They're also broken by string encodings other than UTF-8.
  • They're going to get more broken as implementations shift to Unicode as per the updated standard.
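To make the compatibility risk concrete, the sketch below shows how the same pattern can behave differently with and without the 'u' flag. It uses JavaScript's UTF-16 strings rather than UTF-8 bytes, so it is an analogous case rather than the exact scenario described above. The variable names are illustrative.

```javascript
// U+1F600 is one code point but two UTF-16 code units in a JS string.
const emoji = '\u{1F600}';
console.log(emoji.length); // 2

// Without 'u', "." matches a single code unit, so the pair does not fit.
const dotLegacy = new RegExp('^.$').test(emoji);
// With 'u', "." matches a whole code point.
const dotUnicode = new RegExp('^.$', 'u').test(emoji);
console.log(dotLegacy, dotUnicode); // false true

// A pattern written against the individual surrogate halves matches in
// legacy mode but no longer matches once the 'u' flag is applied.
const halves = '^[\\uD83D][\\uDE00]$';
const halvesLegacy = new RegExp(halves).test(emoji);
const halvesUnicode = new RegExp(halves, 'u').test(emoji);
console.log(halvesLegacy, halvesUnicode); // true false
```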

Additional context

I have a PR forthcoming.

jgraettinger added a commit to jgraettinger/yaml-language-server that referenced this issue Oct 19, 2021
This was under-specified in previous JSON schema drafts.

As of 2020-12, patterns are defined as Unicode:
https://json-schema.org/draft/2020-12/json-schema-core.html#regex

Issue redhat-developer#554
evidolob pushed a commit that referenced this issue Oct 20, 2021
This was under-specified in previous JSON schema drafts.

As of 2020-12, patterns are defined as Unicode:
https://json-schema.org/draft/2020-12/json-schema-core.html#regex

Issue #554