Brackets regex of symbol e.g. [ɔ] also captures symbol with diacritic [ɔ̃] #10286
Comments
I would think this is not correct. "[abc]" should work the same as "(a|b|c)". |
Hmm, github dropped the backslashes, should be |
However this is not the issue, I think. What is striking is that `[ɔ]` captures both `ɔ` and `ɔ̃`. And vice versa (`[ɔ̃]` captures also `ɔ` when it shouldn't?). I tried with `[ɛ]` as well, same thing, it captures both `ɛ` and `ɛ̃` (and vice versa).
I just tried with `[e]` and with `[é]` and everything behaves as expected (`[e]` captures only `e` and `[é]` only `é`). `[à]` also behaves normally.
The difference is that ɔ̃ is the base character ɔ with a composing
character, while é and à are single characters. That the behavior
differs is exactly why I think this is wrong.
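
The difference between a base character plus composing character and a single precomposed character can be checked outside Vim. The following is a minimal Python sketch (the helper name `describe` is made up for illustration) that lists the code points behind ɔ̃ and é using the standard `unicodedata` module.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


open_o_tilde = "\u0254\u0303"  # ɔ̃: base letter followed by a combining tilde
e_acute = "\u00e9"             # é: one precomposed character

print(describe(open_o_tilde))
print(describe(e_acute))

# NFC cannot merge ɔ + tilde, because Unicode has no precomposed form for it.
print(len(unicodedata.normalize("NFC", open_o_tilde)))
# NFD splits é into e + combining acute accent.
print(len(unicodedata.normalize("NFD", e_acute)))
```

Both lengths come out as 2, which is the asymmetry described above: ɔ̃ is always two code points, while é is one unless it is explicitly decomposed.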
…--
/// Bram Moolenaar -- ***@***.*** -- http://www.Moolenaar.net \\\
I'm not sure what you mean by wrong, but sure. It still would then be a bit impractical (in the same way as sorting alphabetically does not handle
It remains somewhat unintuitive to me, the fact that if I do
Btw that leads me to discover this truly odd edge case (to my intuition at least): if I search
Yeah, as Bram said, most "IPA Extensions" (U+0250 to U+02AF) don't exist with a precomposed diacritic, so you have to add a "combining diacritical mark" (U+0300 to U+036F) to place the diacritic on them. But they are still two characters, e.g. ɔ̃ is U+0254 LATIN SMALL LETTER OPEN O followed by U+0303 COMBINING TILDE. When you search for U+0254 ɔ, Vim finds it as part of the combined sequence (spacing IPA character + combining character), which is expected.
OTOH, there exists a precomposed é U+00E9 LATIN SMALL LETTER E WITH ACUTE, and if your text contains that rather than U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, Vim won't find e when searching for é, nor vice versa.
If your ɔ̃ had been followed by something in your search pattern, e.g. if you had been searching for ɔ̃] then Vim wouldn't have found ɔ] (nor vice versa), because ɔ̃] is U+0254 LATIN SMALL LETTER OPEN O followed by U+0303 COMBINING TILDE followed by U+005D RIGHT SQUARE BRACKET, while ɔ] is only U+0254 LATIN SMALL LETTER OPEN O followed by U+005D RIGHT SQUARE BRACKET. Vim won't find the one when looking for the other unless you put \Z in your search pattern (see ":help /\Z").
So finding ɔ̃ but not ɔ should be straightforward, but for the opposite you should construct a pattern specifying that you want ɔ "not followed by U+0303", which is more complicated.
Searching for /[ɑ̃ɔ̃] will find any of the following four characters:
The fact that /(ɔ|ɔ̃) captures only ɔ and not ɔ̃ is (to my mind, at least) unexpected. Best regards, |
Oops, I meant, "the fact that Best regards, |
In a sub-pattern with two or more branches the first matching branch is preferred if more than one branch can match, i.e., So if someone wants to use Regards, |
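
The "first matching branch is preferred" rule is not specific to Vim. As a hedged illustration, Python's `re` module (a different engine, used here only because it follows the same first-alternative-wins rule) shows the effect of branch order on ɔ̃:

```python
import re

text = "\u0254\u0303"  # ɔ̃ = U+0254 followed by U+0303 COMBINING TILDE

# Shorter branch first: it matches at once, so the tilde is left out.
short_first = re.search("\u0254|\u0254\u0303", text)
# Longer branch first: the base letter plus its combining tilde is matched.
long_first = re.search("\u0254\u0303|\u0254", text)

print(len(short_first.group()))
print(len(long_first.group()))
```

The first search matches one code point and the second matches two, so putting the longer alternative first is what makes the diacritic part of the match.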
Thanks so much @brammool @tonymec & @jottkaerr for the detailed responses! I knew about the composition of character + diacritic, I just did not expect the regex in bracket to match the way it did. I was actually doing a filter based on IPA representations, where I have a vocab table and only wanted to retain words with certain sounds, and remove the rest, using the
Thanks for the Strangely, I inserted an orphan U+0303 COMBINING TILDE somewhere in my text and the brackets pattern does not match it... (I thought, oh maybe And to continue with the strangeness, if you insert U+0303 in the text, and then search for said character |
When searching for |
Hm,
diff --git a/src/regexp_bt.c b/src/regexp_bt.c
index 698ff043e..49c53d97a 100644
--- a/src/regexp_bt.c
+++ b/src/regexp_bt.c
@@ -3720,13 +3720,37 @@ regmatch(
           case ANYOF:
           case ANYBUT:
-            if (c == NUL)
-                status = RA_NOMATCH;
-            else if ((cstrchr(OPERAND(scan), c) == NULL) == (op == ANYOF))
-                status = RA_NOMATCH;
-            else
-                ADVANCE_REGINPUT();
-            break;
+            {
+                char_u *q = OPERAND(scan);
+
+                if (c == NUL)
+                    status = RA_NOMATCH;
+                else if ((cstrchr(q, c) == NULL) == (op == ANYOF))
+                    status = RA_NOMATCH;
+                else
+                {
+                    // Check following combining characters
+                    int len, i;
+
+                    if (enc_utf8)
+                        len = utfc_ptr2len(q) - utf_ptr2len(q);
+
+                    MB_CPTR_ADV(rex.input);
+                    MB_CPTR_ADV(q);
+
+                    if (!enc_utf8 || len == 0)
+                        break;
+
+                    for (i = 0; i < len; ++i)
+                        if (q[i] != rex.input[i])
+                        {
+                            status = RA_NOMATCH;
+                            break;
+                        }
+                    rex.input += len;
+                }
+                break;
+            }
           case MULTIBYTECODE:
             if (has_mbyte)
But for the NFA, I don't know how to make it work; I am not sure how to ensure that a composing character must follow inside a collection. I tried to embed
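
For readers who do not follow the C, the behaviour this patch is aiming for can be sketched in a few lines of Python. This is not Vim's implementation and not a translation of the diff; `clusters` and `find_in_collection` are hypothetical helpers that treat a base character together with its trailing combining marks as one item, both in the collection and in the text.

```python
import unicodedata


def clusters(text):
    """Split text into base characters, each with its trailing combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch   # attach the combining mark to the previous base
        else:
            out.append(ch)  # start a new item (also covers an orphan mark at the start)
    return out


def find_in_collection(collection, text):
    """Return the clusters of text that are exactly an item of the collection."""
    items = set(clusters(collection))
    return [c for c in clusters(text) if c in items]


sample = "\u0254 \u0254\u0303"  # "ɔ ɔ̃"
print(find_in_collection("\u0254", sample))        # collection [ɔ]
print(find_in_collection("\u0254\u0303", sample))  # collection [ɔ̃]
```

With this comparison, a collection containing only ɔ picks out just the bare ɔ, and a collection containing ɔ̃ picks out just the nasalised form.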
It seems to me that some of the points made earlier can be understood more precisely by clearly differentiating between a Unicode string and its encoded byte sequence. I can only give limited information, but I hope it helps with the issues presented here.
As far as I learned from Python, two basic operations matter here: encoding a Unicode string and decoding a byte sequence. Encoding creates a byte sequence. Decoding produces a Unicode string.
To my knowledge, some Unicode characters with diacritical marks, e.g. "ä" or "ö", can be encoded in different ways: for example as a UTF-8 byte sequence (UTF-8 is a Unicode encoding) or as, say, an ASCII single-byte character in a certain code page (each code page can only encode a limited set of Unicode characters). I am not sure if ASCII code pages are a true but limited "Unicode coding". All of the respective byte sequences can be decoded to the Unicode characters "ä" or "ö" in this example.
I do not know how to handle characters with diacritical marks that have not received a single Unicode character. Furthermore, I cannot tell how the Unicode standard deals with a sequence of a base character ("a" or "o" in this example) followed by a diacritical mark (here a diaeresis: the two dots above the base character) which is (in printout) identical to the single Unicode character: Are these sequences treated as a "combined character", s.t.
I think that even Unicode strings need to use some coding scheme internally, but I don't know details on how this is solved in Python 3. As a side remark, strings in Python 3 are always Unicode strings, separate from byte sequences.
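
The encode/decode distinction and the question about base character plus diacritic can be tried directly in Python 3. The sketch below uses "ä" as in the comment. It shows that the precomposed and decomposed spellings are different strings, that Unicode normalization (NFC/NFD) converts between them, and that Latin-1, a legacy 8-bit encoding rather than a Unicode encoding form, can only represent the precomposed one.

```python
import unicodedata

precomposed = "\u00e4"  # ä as one code point
decomposed = "a\u0308"  # a followed by U+0308 COMBINING DIAERESIS

# The two spellings render the same but are different code point sequences.
print(precomposed == decomposed)

# Normalization maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# Encoding turns a string into bytes; the result depends on the encoding.
utf8_precomposed = precomposed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")
latin1_precomposed = precomposed.encode("latin-1")
print(utf8_precomposed, utf8_decomposed, latin1_precomposed)

# Decoding turns bytes back into a string.
print(latin1_precomposed.decode("latin-1") == precomposed)
```

So the two forms are equivalent only after normalization. A tool that compares code points one by one, as a regex engine typically does, sees them as different text unless it normalizes or handles combining characters explicitly.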
fixes vim#10286 Also, while at it, make debug mode work again.
Problem: regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the NFA and BT Regex Engine
Also, while at it, make debug mode work again.
fixes vim/vim#10286
closes: vim/vim#12871
vim/vim@d2cc51f
Co-authored-by: Christian Brabandt <cb@256bit.org>
…#26992) Problem: regexp cannot match combining chars in collection Solution: Check for combining characters in regex collections for the NFA and BT Regex Engine Also, while at it, make debug mode work again. fixes vim/vim#10286 closes: vim/vim#12871 vim/vim@d2cc51f Co-authored-by: Christian Brabandt <cb@256bit.org>
…neovim#26992) Problem: regexp cannot match combining chars in collection Solution: Check for combining characters in regex collections for the NFA and BT Regex Engine Also, while at it, make debug mode work again. fixes vim/vim#10286 closes: vim/vim#12871 vim/vim@d2cc51f Co-authored-by: Christian Brabandt <cb@256bit.org>
Steps to reproduce
Tested in vim/gvim 8.2 on Ubuntu 18.04
/ɔ
and
/ɔ̃
produce the expected result, each finding only its own version of the character. However,
/[ɔ]
and
/[ɔ̃]
both capture ɔ and ɔ̃. Is this intended behaviour?
Thanks!
Expected behaviour
I was expecting a distinction between ɔ and ɔ̃ even when included in brackets. (I had been using a list of IPA characters in brackets to eliminate lines using :v/, and found myself still having the sound described by the second IPA symbol (with tilde) even though I had only included the first in my list.)
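
As a hedged illustration of the expected filtering, here is a small Python sketch that keeps only words containing a bare symbol from a list, by requiring that no combining mark (U+0300 to U+036F) follows it. The word list and the helper name `bare_symbol_pattern` are made up for the example, and each symbol passed in is assumed to be a single code point.

```python
import re

# A base letter only counts if no combining mark (U+0300..U+036F) follows it.
COMBINING = "[\u0300-\u036f]"


def bare_symbol_pattern(symbols):
    """Build a regex matching any of the given base symbols without a diacritic."""
    escaped = "".join(re.escape(s) for s in symbols)
    return re.compile(f"[{escaped}](?!{COMBINING})")


# Hypothetical vocabulary: two words with plain ɔ/ɛ, two with the nasalised forms.
words = ["b\u0254n", "b\u0254\u0303", "p\u025bt", "p\u025b\u0303"]

pattern = bare_symbol_pattern("\u0254\u025b")  # plain ɔ and ɛ only
kept = [w for w in words if pattern.search(w)]
print(kept)
```

Only the two words with the plain vowels are kept. The words whose vowel carries the combining tilde are rejected by the negative lookahead.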
Version of Vim
8.2 (2019 Dec 12, compiled Oct 12 2020 17:33:22)
Environment
Ubuntu 18.04
Gnome Terminal 3.28.2
$TERM: xterm-256color
zsh 5.4.2 (x86_64-ubuntu-linux-gnu)
Logs and stack traces
No response