Support Unicode categories and properties #60

Open
mcclure opened this issue Nov 26, 2015 · 0 comments

Hello, Parsley looks really cool, and I have an application I am interested in switching from pyparsing to Parsley. However, I am blocked because my application needs to parse Unicode text (Python 3 source code, actually). The built-in rules are not Unicode-aware, so I cannot match, for example, "a Unicode alphanumeric character"; I would have to inline every possible alphanumeric character in my grammar (which might not be possible anyway, see issue #1). Pyparsing avoids this problem for my purposes by allowing the use of standard Python regex, which has three simple Unicode "classes".

The ideal here would be for Parsley to allow matching on arbitrary Unicode categories and properties. Every code point in Unicode has exactly one category, which is a two-letter code (like "Ll" for Letter, Lowercase), and an arbitrary number of "properties", which are key-value pairs (like "Script=Cyrillic"). Properties are very important because if you are using Parsley you are probably parsing something like a programming language, and the current best practice, as I understand it, is for programming languages to follow the rules of UAX #31. This defines two Unicode properties, XID_Start and XID_Continue, which when set to true match the Unicode Consortium's recommendations for what constitutes an identifier (like a variable name). Supporting properties unfortunately might be kind of difficult, as there is nothing in the standard library for this. So to support properties, Parsley would either have to embed some form of the Unicode Character Database, which is physically large, or introduce a module dependency (the third-party "Regex" module can do this using \p{}).

A good next-best option would be for Parsley to support just the Unicode categories. This is easier because the standard library has the unicodedata module, which lets you query a character's category.
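To illustrate the category lookup, here is a minimal sketch using only the standard library's unicodedata.category(). The helper name in_category is made up for this example and is not part of Parsley or unicodedata.

```python
import unicodedata

# unicodedata.category() returns the two-letter general category of a character.
print(unicodedata.category("a"))   # Letter, Lowercase
print(unicodedata.category("Я"))   # Letter, Uppercase (Cyrillic)
print(unicodedata.category("5"))   # Number, Decimal Digit


def in_category(ch, *prefixes):
    """Hypothetical helper: True if ch's category starts with any given prefix.

    A one-letter prefix such as "L" matches a whole major class (Lu, Ll, Lt, ...),
    while a two-letter prefix such as "Nd" matches exactly one category.
    """
    return unicodedata.category(ch).startswith(prefixes)
```

A grammar rule for "any letter" could then be expressed as a predicate like in_category(ch, "L"), assuming the parser offers some way to call a Python predicate on a single character.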

I feel like the minimum version of this feature would be to at least have feature parity with Python's re module, which has coarse \s, \w, and \d (whitespace, alphanumeric, numeric) classes. It might be hard to literally match the same strings as the re module because the re documentation is quite vague as to how \s, \w, and \d are defined. But as far as I know you could get a reasonable approximation of these classes by using the categories in unicodedata.
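As a rough sketch of that approximation, the three classes could be built on top of unicodedata categories as below. The function names and the exact category choices are my assumptions, and the results will not agree with re on every code point.

```python
import unicodedata

# ASCII control characters that count as whitespace but have category Cc.
_ASCII_SPACE_CONTROLS = "\t\n\r\f\v"


def is_word_char(ch):
    """Approximate \\w: any letter (L*), any number (N*), or underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


def is_digit_char(ch):
    """Approximate \\d: decimal digits only (category Nd)."""
    return unicodedata.category(ch) == "Nd"


def is_space_char(ch):
    """Approximate \\s: separators (Zs, Zl, Zp) plus the usual ASCII controls."""
    return ch in _ASCII_SPACE_CONTROLS or unicodedata.category(ch).startswith("Z")
```

These predicates treat the major classes L, N, and Z as the building blocks. A stricter version could list individual two-letter categories instead, at the cost of a longer table.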
