support for equivalence classes #404

Closed
BurntSushi opened this issue Oct 7, 2017 · 8 comments

@BurntSushi
Member

See "character equivalents" here: https://www.regular-expressions.info/posixbrackets.html#eq

The major change this seems to require (aside from actually determining character equivalents) is introducing the concept of a locale into the regex engine. I'm not sure that's a good idea, but this issue should at least be investigated at some point.

@albfan

albfan commented Nov 5, 2017

I researched grep's implementation. It uses the libc regex engine for this.

Basically, it looks up the equivalence table defined by the locale:

link to code on glibc

link to doc

order_start  forward;backward
UNDEFINED    IGNORE;IGNORE
<LOW>
<space>      <LOW>;<space>
...          <LOW>;...
<a>          <a>;<a>
<a-acute>    <a>;<a-acute>
<a-grave>    <a>;<a-grave>
<A>          <a>;<A>
<A-acute>    <a>;<A-acute>
<A-grave>    <a>;<A-grave>
<ch>         <ch>;<ch>
<Ch>         <ch>;<Ch>
<s>          <s>;<s>
<eszet>      "<s><s>";"<eszet><eszet>"
order_end

The locale knows which characters are equivalent to which others; here, the equivalents of a.
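To make the table's semantics concrete, here is a small Rust sketch (the weights are transcribed from the locale excerpt above, and the function names are hypothetical):

use std::collections::HashMap;

// Primary collation weights transcribed from the locale excerpt above.
// Characters that share a primary weight form one equivalence class.
fn primary_weights() -> HashMap<char, char> {
    [
        ('a', 'a'),
        ('á', 'a'), // <a-acute>
        ('à', 'a'), // <a-grave>
        ('A', 'a'),
        ('Á', 'a'), // <A-acute>
        ('À', 'a'), // <A-grave>
    ]
    .into_iter()
    .collect()
}

// `[[=a=]]` would then match any character whose primary weight equals
// the primary weight of 'a'.
fn equivalent(a: char, b: char) -> bool {
    let weights = primary_weights();
    weights.contains_key(&a) && weights.get(&a) == weights.get(&b)
}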

I designed a simple test case:

$ cat >testfile <<EOF
bar
camión
bazz
foo
EOF
$ grep '[[=o=]]' testfile

which should result in:

camión
foo

Trying to debug with:

$ gdb --args grep '[[=o=]]' testfile

I can't manage it: libc has to be compiled with optimizations, so I can't inspect a single variable.

If there's a way to retrieve the result of that findidx call from Rust, the remaining work would just be detecting "[=" and "=]" in bracket expressions, as in the sketch below. I will try to find where the Rust regex engine parses bracket expressions and propose a WIP for this.
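A rough sketch of that detection in Rust (a hypothetical helper; the real parser lives in regex-syntax and is structured differently):

// Scan the inside of a bracket expression and extract the `[=x=]` items.
fn extract_equiv_classes(class_body: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = class_body;
    while let Some(start) = rest.find("[=") {
        match rest[start + 2..].find("=]") {
            Some(end) => {
                out.push(rest[start + 2..start + 2 + end].to_string());
                rest = &rest[start + 2 + end + 2..];
            }
            None => break,
        }
    }
    out
}

// extract_equiv_classes("[=o=]abc[=a=]") yields ["o", "a"].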

@albfan

albfan commented Nov 5, 2017

I'm a little lost on Rust's locale approach, so rust-locale seems a decent choice to me (sorry if I'm wrong).

I think all the pieces are in place, but I don't have the skills to write bindings that call findidx from libc, or to understand what libc does with that info (here, the info for es_ES):

_NL_COLLATE_TABLEMB = 2207821300
_NL_COLLATE_WEIGHTMB = 2207822324
_NL_COLLATE_EXTRAMB = 2208009096
_NL_COLLATE_INDIRECTMB = 2208041452

to retrieve the equivalent characters for a given character:

findidx (table, indirect, extra, &cp, -1);

So for 'o' in Spanish it would start with:

findidx (2207821300, 2208041452, 2208009096, &'o', -1);
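For what it's worth, findidx is an internal glibc helper (inlined from locale/weight.h), so it can't be called directly over FFI. What is reachable from Rust via the libc crate is locale-aware collation through setlocale and strcoll, which exposes collation order but not the equivalence table itself. A sketch, assuming es_ES.UTF-8 is installed on the system:

use std::ffi::CString;

fn main() {
    unsafe {
        // Switch collation rules to Spanish; setlocale returns null if
        // the requested locale is not installed.
        let es = CString::new("es_ES.UTF-8").unwrap();
        assert!(!libc::setlocale(libc::LC_COLLATE, es.as_ptr()).is_null());

        // strcoll compares strings using the locale's collation weights,
        // so 'o' and 'ó' sort adjacently, unlike a plain byte compare.
        let o = CString::new("o").unwrap();
        let o_acute = CString::new("ó").unwrap();
        println!("strcoll = {}", libc::strcoll(o.as_ptr(), o_acute.as_ptr()));
    }
}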

@BurntSushi
Member Author

@albfan Thanks for doing this research! I don't have time to dig in myself at the moment, but I'd like to just say a couple things:

  1. I'm hopeful that we can come up with an implementation plan before doing the work in a PR.
  2. How do other regex engines handle this?
  3. I'd prefer not to depend on any new third party crates with respect to locales. Rust doesn't really have a locale story yet, so we shouldn't try to pick a winner at this point.

@albfan

albfan commented Nov 5, 2017

libc's regex relies on localedata. Many other regex engines don't have such a concept.

I think the suggestion in rust-locale is correct:

rust-locale/rust-locale#26 (comment)

Relying on Unicode, there's the concept of normalization forms for characters.

When searching for a character, all combining-character sequences in the searched text should be treated as equivalent. That's a good design because it doesn't rely on predefined metadata, but I guess it could be a slow implementation for rg.

@BurntSushi
Member Author

Yeah the regex engine isn't getting support for normalization, sorry. In particular, the regex engine is normalization agnostic. If you care about it, then you need to normalize the regex and the text searched consistently.

That means that if ripgrep wants to care about normalization, then it has to normalize the regex and every file before it searches.
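A minimal sketch of that, using the unicode-normalization crate (the crate choice and function name are mine, not part of the regex crate):

use unicode_normalization::UnicodeNormalization;

// Put both the query and the haystack into NFD so that "é" (U+00E9)
// and "e" + U+0301 compare equal during the search.
fn normalized_contains(haystack: &str, needle: &str) -> bool {
    let h: String = haystack.nfd().collect();
    let n: String = needle.nfd().collect();
    h.contains(&n)
}

// normalized_contains("re\u{301}sume\u{301}", "résumé") == true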

@albfan

albfan commented Nov 5, 2017

Yes. I think I can fork ripgrep, provide an implementation that does that, and see how much it slows the searches (with no plans for a PR).

About regex, again I think forking for my special needs can be a decent solution: preprocessing all the characters for a(2) e(2) i(2) o(2) u(3) n(2) A(2) E(2) I(2) O(2) U(3) = 24 equivalences is now worthwhile, as in the sketch below.
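A sketch of that preprocessing with a hardcoded table (it follows the per-case counts listed above; note that glibc's table quoted earlier actually folds case into one class):

// Expand `[[=x=]]` classes into plain alternations using a hardcoded
// Spanish table, before handing the pattern to the regex crate.
fn expand_spanish_equivalences(pattern: &str) -> String {
    let table = [
        ("[[=a=]]", "[aá]"), ("[[=e=]]", "[eé]"), ("[[=i=]]", "[ií]"),
        ("[[=o=]]", "[oó]"), ("[[=u=]]", "[uúü]"), ("[[=n=]]", "[nñ]"),
        ("[[=A=]]", "[AÁ]"), ("[[=E=]]", "[EÉ]"), ("[[=I=]]", "[IÍ]"),
        ("[[=O=]]", "[OÓ]"), ("[[=U=]]", "[UÚÜ]"), ("[[=N=]]", "[NÑ]"),
    ];
    let mut out = pattern.to_string();
    for (class, expansion) in table {
        out = out.replace(class, expansion);
    }
    out
}

// expand_spanish_equivalences("[[=o=]]") == "[oó]"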

I see all this as refining my own toolset, but not as a generally helpful contribution.

I suggest closing this issue.

@tmccombs
Contributor

> That means that if ripgrep wants to care about normalization, then it has to normalize the regex and every file before it searches.

Normalizing the file is (relatively) straightforward. However, normalizing the regex, in general, can be very complicated.

My main concern is with canonical equivalence. In many cases, a single code point is equivalent to a series of two or more code points (generally as part of a single grapheme). Because of the semantics of regex, a simple normalization of the string representation is insufficient, because depending on the position of a character composing or decomposing it would significantly change the meaning of the regex.

As a concrete example, consider the regex "[a\u00e0-\u00e5]" which is in NFC form (composed). If this needs to be normalized to NFD (decomposed), the ideal representation would be something like: a[\u0300-\u0303\u0308\u030a]?. And if the original regex were instead [a\u00e0-\u00e6], then it would need to be normalized to (?:a[\u0300-\u0303\u0308\u030a]?|\u00e6).
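For what it's worth, that decomposed form already works against NFD text with today's regex crate; a sketch (the unicode-normalization crate here is only used to produce the NFD haystack):

use regex::Regex;
use unicode_normalization::UnicodeNormalization;

fn main() {
    // The NFD version of [a\u00e0-\u00e5] from above: an 'a' optionally
    // followed by one of the combining marks that compose to à..å.
    let re = Regex::new(r"a[\u0300-\u0303\u0308\u030A]?").unwrap();

    // Decompose "à" (U+00E0) into "a" + U+0300 before matching.
    let haystack: String = "voilà".nfd().collect();
    assert!(re.is_match(&haystack));
}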

A more naive implementation could, of course, iterate over every code point in the range, and generate an alternation of all the normalized results. But that would perform poorly, both for normalizing the regex, and the search.

Normalizing to "NFC" is even more problematic. For example, consider something like (a|u)\u0308?. That could roughly be converted to ([aäuü]), and maybe that is what the user wants, but maybe it isn't.

Maybe it doesn't make sense for normalization or equivalence to be part of the regex crate, and there should be a separate crate to handle this. However, normalizing the regex and the input is not a trivial workaround.

@BurntSushi
Member Author

@tmccombs Indeed! The "Canonical Equivalents" section of UTS#18 elaborates on this (and even brings up some of the same issues that you do). Their suggested work-around is pretty straightforward though:

  1. Putting the text to be matched into a defined normalization form (NFD or NFKD).
  2. Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
  3. Applying the matching algorithm on a code point by code point basis, as usual.

In other words, instead of trying to devise a general algorithm for turning a regex into a particular normalization form, you have to ask the end user to do it themselves. That is, you ask the user to be aware of normalization forms and craft their regex appropriately. Yes, I agree it isn't an ideal user experience, but I think it's the best that can probably be done in this space until someone comes up with a better idea. (And I wouldn't hold my breath on that front.)

There is also a middle ground where you might be able to fix 80% of the cases. That is, you put all literal strings that occur in a regex into NFD form. We don't always need to focus on 100% correctness. For example, this approach would seemingly fix sharkdp/fd#1535.
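A very rough sketch of that middle ground, under the strong assumption that the query is a plain literal (a real implementation would walk the parsed pattern from regex-syntax and rewrite only the literal nodes):

use regex::Regex;
use unicode_normalization::UnicodeNormalization;

// NFD-normalize the user's literal query, escape it, and search an
// NFD-normalized haystack. Escapes like \u00e9 inside a real pattern
// would be corrupted by a blind rewrite, hence the literal-only caveat.
fn search_nfd_literal(query: &str, file_contents: &str) -> bool {
    let q: String = query.nfd().collect();
    let re = Regex::new(&regex::escape(&q)).unwrap();
    let hay: String = file_contents.nfd().collect();
    re.is_match(&hay)
}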
