Support Unicode categories and properties #60

mcclure · 2015-11-26T17:28:20Z

Hello, Parsley looks really cool, and I have an application I would like to switch from pyparsing to Parsley. However, I am blocked because my application needs to parse Unicode text (Python 3 source code, actually). The built-in rules are not Unicode-aware, so I cannot match, for example, "a Unicode alphanumeric character"; I would have to inline every possible alphanumeric character in my grammar (which might not be possible anyway, see issue #1). Pyparsing avoids this problem for my purposes by allowing standard Python regular expressions, which have three simple Unicode "classes".

The ideal would be for Parsley to allow matching on arbitrary Unicode categories and properties. Every code point in Unicode has exactly one general category, which is a two-letter code (like "Ll" for Letter, Lowercase), and an arbitrary number of "properties", which are key-value pairs (like "Script=Cyrillic"). Properties are very important because if you are using Parsley you are probably parsing something like a programming language, and current best practice, as I understand it, is for programming languages to follow the rules of UAX #31. That annex defines two Unicode properties, XID_Start and XID_Continue, which, when true, mark the characters the Unicode Consortium recommends allowing at the start of and within an identifier (such as a variable name). Supporting properties might unfortunately be difficult, as there is nothing in the standard library for this. So to support properties, Parsley would either have to embed some form of the Unicode Character Database, which is large, or introduce a module dependency (the third-party "regex" module can do this using \p{}).

A good next-best option would be for Parsley to support just the Unicode categories. This is easier because the standard library's unicodedata module lets you query a character's category.
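To illustrate what a category lookup looks like, here is a minimal sketch using only unicodedata. The helper name in_category is hypothetical and is not part of Parsley; it only shows the kind of check a category-aware rule could perform.

```python
import unicodedata


def in_category(ch, *categories):
    """Return True if the general category of ch starts with any given code.

    Hypothetical helper, not a Parsley API. A one-letter code such as "L"
    matches every letter category; a two-letter code such as "Ll" matches
    only that exact category.
    """
    return unicodedata.category(ch).startswith(categories)


# unicodedata.category returns the two-letter general category code.
print(unicodedata.category("a"))   # lowercase Latin letter
print(unicodedata.category("Я"))   # uppercase Cyrillic letter
print(in_category("я", "L"))       # any letter
print(in_category("5", "Nd"))      # decimal digit
```

Matching on the first letter of the code gives the coarse groups (letters, numbers, punctuation, and so on), while the full two-letter code gives the fine-grained ones.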

I feel the minimum version of this feature would be parity with Python's standard re module, which has coarse \s, \w, and \d (whitespace, alphanumeric, numeric) classes. It might be hard to match exactly the same strings as re does, because the re documentation is quite vague about how \s, \w, and \d are defined. But as far as I know, you could get a reasonable approximation of these classes from the categories in unicodedata.
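A rough sketch of that approximation, assuming letters and numbers (plus underscore) for \w, the Nd category for \d, and the separator categories plus the ASCII control whitespace characters for \s. The function names are hypothetical, and the results are not guaranteed to agree with re on every code point.

```python
import unicodedata

# Control characters that are commonly treated as whitespace but have
# category "Cc" rather than a separator category.
ASCII_CONTROL_WHITESPACE = "\t\n\r\f\v"


def is_word(ch):
    """Approximate \\w: any letter (L*) or number (N*), or an underscore."""
    return unicodedata.category(ch)[0] in "LN" or ch == "_"


def is_digit(ch):
    """Approximate \\d: decimal digits (category Nd)."""
    return unicodedata.category(ch) == "Nd"


def is_space(ch):
    """Approximate \\s: separators (Z*) plus ASCII control whitespace."""
    if unicodedata.category(ch).startswith("Z"):
        return True
    return ch in ASCII_CONTROL_WHITESPACE
```

Each function expects a single character, which is how a parser rule that consumes one input item at a time would call it.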

mcclure mentioned this issue Nov 26, 2015

Support custom rule functions #61

Open
