Brackets regex of symbol e.g. [ɔ] also captures symbol with diacritic [ɔ̃] #10286

jchwenger · 2022-04-25T18:53:57Z

Steps to reproduce

Tested in vim/gvim 8.2 on Ubuntu 18.04

Text to search:

/abɔ̃kɔ̃t/	à bon compte
/abupɔʁtɑ̃/	à bout portant

/ɔ
and
/ɔ̃
produce the expected result, only finding the respective versions of the character, however
/[ɔ]
and
/[ɔ̃]
both capture ɔ and ɔ̃, is this intended behaviour?

Thanks!

Expected behaviour

I was expecting a distinction between ɔ and ɔ̃ even when included in brackets. (I had been using a list of IPA characters in brackets to eliminate lines using :v/, and found I still had the sound described by the second IPA symbol (with tilde) even though I had only included the first in my list.)

Version of Vim

8.2 (2019 Dec 12, compiled Oct 12 2020 17:33:22)

Environment

Ubuntu 18.04
Gnome Terminal 3.28.2
$TERM: xterm-256color
zsh 5.4.2 (x86_64-ubuntu-linux-gnu)

Logs and stack traces

No response


brammool · 2022-04-25T20:22:41Z

I would think this is not correct. "[abc]" should work the same as "(a|b|c)".

brammool · 2022-04-25T20:30:52Z

Hmm, github dropped the backslashes, should be \(a\|b\|c\)

jchwenger · 2022-04-25T20:36:12Z

However this is not the issue, I think. What is striking is that [ɔ] captures both ɔ and ɔ̃. And vice versa ([ɔ̃] captures also ɔ when it shouldn't?). I tried with [ɛ] as well, same thing, it captures both ɛ and ɛ̃ (and vice versa).

I just tried with [e] and with [é] and everything behaves as expected ([e] captures only e and [é] only é). [à] also behaves normally.

brammool · 2022-04-25T21:20:53Z

> However this is not the issue, I think. What is striking is that `[ɔ]` captures both `ɔ` and `ɔ̃`. And vice versa (`[ɔ̃]` captures also `ɔ` when it shouldn't?). I tried with `[ɛ]` as well, same thing, it captures both `ɛ` and `ɛ̃` (and vice versa). I just tried with `[e]` and with `[é]` and everything behaves as expected (`[e]` captures only `e` and `[é]` only `é`). `[à]` also behaves normally.

The difference is that ɔ̃ is the base character ɔ with a composing character, while é and à are single characters. That the behavior differs is exactly why I think this is wrong.


jchwenger · 2022-04-25T21:38:29Z

I'm not sure what you mean by wrong, but sure. It still would then be a bit impractical (in the same way as sorting alphabetically does not handle a and à as the same character...), I'm afraid.

It remains somewhat unintuitive to me, the fact that if I do /[ɑ̃ɔ̃] I will also capture all the ɔ and the ɑ, and so I should use /\(ɑ̃\|ɔ̃\) instead, more verbose, especially if there's more than a few characters. Certainly unexpected.

Btw that leads me to discover this truly odd edge case (to my intuition at least): if I search
/\(ɔ\|ɔ̃\)
that only captures ɔ and not ɔ̃ !

tonymec · 2022-04-26T04:57:29Z

Yeah, as Bram said, most "IPA Extensions" (U+0250 to U+02AF) don't exist with a precomposed diacritic, so you have to add a "combining diacritical mark" (U+0300 to U+036F) to place the diacritic on them. But they are still two characters, e.g. ɔ̃ is U+0254 LATIN SMALL LETTER OPEN O followed by U+0303 COMBINING TILDE. When you search for U+0254 ɔ Vim finds it as part of the combined (spacing IPA character + combining character) which is expected. OTOH, there exists a precomposed é U+00E9 LATIN SMALL LETTER E WITH ACUTE and if your text contains that rather than U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, Vim won't find e when searching for é nor vice versa.
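The distinction between a precomposed character and a base letter plus combining mark can be checked outside Vim. The following is a minimal Python sketch, used here only for illustration (it says nothing about how Vim inspects text internally); the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

open_o_tilde = "\u0254\u0303"   # ɔ̃ : base letter followed by a combining tilde
e_acute = "\u00e9"              # é : one precomposed code point

print(describe(open_o_tilde))
print(describe(e_acute))

# NFC normalization composes a sequence only when a precomposed form exists.
print(len(unicodedata.normalize("NFC", open_o_tilde)))   # 2: nothing to compose to
print(len(unicodedata.normalize("NFC", "e\u0301")))      # 1: composes to U+00E9
```

So normalizing the text before searching would not turn ɔ̃ into a single character; it stays a two-code-point sequence.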

If your ɔ̃ had been followed by something in your search pattern, e.g. if you had been searching for ɔ̃] then Vim wouldn't have found ɔ] (nor vice-versa) because ɔ̃] is U+0254 LATIN SMALL LETTER OPEN O followed by U+0303 COMBINING TILDE followed by U+005D RIGHT SQUARE BRACKET while ɔ] is only U+0254 LATIN SMALL LETTER OPEN O followed by U+005D RIGHT SQUARE BRACKET and Vim won't find the one when looking for the other unless you put \Z in your search pattern (see ":help /\Z"). So finding ɔ̃ but not ɔ should be straightforward but for the opposite you should construct a pattern specifying that you want ɔ "not followed by U+0303" which is more complicated.
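The "ɔ not followed by U+0303" idea can be sketched with a negative lookahead. The example below uses Python's `re` module purely as an illustration of the technique, under the assumption that the text is stored as base letter plus combining mark. Vim has a comparable item, `\@!` (see ":help /\@!"), but whether it behaves as hoped with composing characters on the affected versions is not verified here.

```python
import re

COMBINING_TILDE = "\u0303"

# ɔ (U+0254) only when the next code point is not a combining tilde.
plain_open_o = re.compile("\u0254(?!" + COMBINING_TILDE + ")")

text = "ab\u0254\u0303k\u0254t"   # contains one ɔ̃ and one plain ɔ
matches = [m.start() for m in plain_open_o.finditer(text)]
print(matches)   # [5]: only the plain ɔ, at code point index 5
```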

Searching for /[ɑ̃ɔ̃] will find any of the following four characters:

U+0251 LATIN SMALL LETTER ALPHA
U+0303 COMBINING TILDE
U+0254 LATIN SMALL LETTER OPEN O
U+0303 COMBINING TILDE
two of which are the same, so you should even find ɛ̃ (which also consists of a character followed by a combining tilde) but not ɛ (which includes none of the characters I listed above).
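The "collection as a set of code points" reading described above happens to be how Python's `re` module treats a bracket expression, so it can serve as a small illustration. This is only a sketch of that per-code-point model; it is not a claim about what Vim's engines actually do (later comments in this thread report that Vim did not match an orphan combining tilde).

```python
import re

collection = "[\u0251\u0303\u0254\u0303]"   # [ɑ̃ɔ̃] written out as code points

# The set of distinct code points between the brackets.
members = sorted({f"U+{ord(ch):04X}" for ch in collection[1:-1]})
print(members)   # three entries, since U+0303 appears twice

pattern = re.compile(collection)
print(bool(pattern.search("\u025b\u0303")))   # ɛ̃ : True, via its combining tilde
print(bool(pattern.search("\u025b")))         # ɛ : False, shares no code point
```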

The fact that /(ɔ|ɔ̃) captures only ɔ and not ɔ̃ is (to my mind, at least) unexpected.

Best regards,
Tony.

tonymec · 2022-04-26T05:05:40Z

Oops, I meant, "the fact that /\(ɔ\|ɔ̃\) captures only…", it was transformed because I typed it without backticks.

Best regards,
Tony.

jottkaerr · 2022-04-26T06:55:35Z

In a sub-pattern with two or more branches the first matching branch is preferred if more than one branch can match, i.e., \(a\|an\) will match the "a" in "branch" while \(an\|a\) will match the "an". Of course, if the pattern continues after the branches, that might force a different branch to match, i.e., both \(a\|an\)c and \(an\|a\)c will match the "anc" in "branch".

So if someone wants to use \(...\|...\|...\) as an alternative for [...] to search for a number of characters with possibly combining characters, they should put the characters with combining characters at the start of the branches.
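The branch-order rule can be tried out in any regex engine that prefers the first matching alternative. Below is a minimal sketch using Python's `re` module, where `(?:a|an)` plays the role of Vim's `\(a\|an\)`; it illustrates the rule, not Vim's implementation.

```python
import re

print(re.search(r"a|an", "branch").group())        # 'a'
print(re.search(r"an|a", "branch").group())        # 'an'
print(re.search(r"(?:a|an)c", "branch").group())   # 'anc'
print(re.search(r"(?:an|a)c", "branch").group())   # 'anc'

# Same idea with a combining character: put the longer alternative first.
text = "k\u0254\u0303t"   # kɔ̃t
base_first = re.search("\u0254|\u0254\u0303", text).group()
combined_first = re.search("\u0254\u0303|\u0254", text).group()
print(len(base_first))       # 1: the base letter only
print(len(combined_first))   # 2: the letter plus its tilde
```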

Regards,
Jürgen

jchwenger · 2022-04-26T07:42:17Z

Thanks so much @brammool @tonymec & @jottkaerr for the detailed responses! I knew about the composition of character + diacritic, I just did not expect the regex in bracket to match the way it did.

I was actually doing a filter based on IPA representations, where I have a vocab table and only wanted to retain words with certain sounds, and remove the rest, using the :v / :g commands. So I would do e.g. :v/\/[abilmnpstuɛʁ]\+\//d_, which was just simple and elegant. Thanks to the additional information provided here I know now I'll have to watch out for compositions like ɛ̃ as well, and perhaps do a second pass with another filter.
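For comparison, the same filter can be sketched outside Vim in a way that makes combining marks explicit. The Python snippet below is only an illustration under stated assumptions: the transcription is taken to be the text between the first pair of slashes, the helper names `transcription` and `keep` are made up, and the third sample line is a hypothetical entry added so that something passes the filter.

```python
ALLOWED = set("abilmnpstu\u025b\u0281")   # the base letters from the :v pattern

def transcription(line):
    """Return the text between the first pair of slashes, or None."""
    parts = line.split("/")
    return parts[1] if len(parts) >= 3 else None

def keep(line):
    """True if every code point of the transcription is in ALLOWED.

    A combining mark is a separate code point, so a letter carrying one
    is rejected unless the mark itself is added to ALLOWED.
    """
    ipa = transcription(line)
    return bool(ipa) and all(ch in ALLOWED for ch in ipa)

lines = [
    "/ab\u0254\u0303k\u0254\u0303t/\tà bon compte",      # contains ɔ̃: dropped
    "/abup\u0254\u0281t\u0251\u0303/\tà bout portant",   # contains ɔ and ɑ̃: dropped
    "/ami/\tami",                                        # hypothetical entry: kept
]
kept = [line for line in lines if keep(line)]
print(kept)
```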

> Vim won't find the one when looking for the other unless you put \Z in your search pattern (see ":help /\Z"). So finding ɔ̃ but not ɔ should be straightforward but for the opposite you should construct a pattern specifying that you want ɔ "not followed by U+0303" which is more complicated.

Thanks for the \Z ref! By the way, I repeat, and that was counterintuitive to me, but if you do /[ɔ] vim will match ɔ and ɔ̃, as said, but also the reverse, /[ɔ̃] will match ɔ and ɔ̃. (This does not happen if you just have a /ɔ or /ɔ̃, as expected.)

Strangely, I inserted an orphan U+0303 COMBINING TILDE somewhere in my text and the brackets pattern does not match it... (I thought, oh, maybe [ɔ̃] just splits the character and the diacritic and treats them as two separate entities.) It's as if the brackets pattern simply ignored diacritic marks altogether, both in its components and in the text...

And to continue with the strangeness, if you insert U+0303 in the text, and then search for said character /U+0303, it matches characters with said mark, e.g. ɔ̃, but not the orphan mark itself...

brammool · 2022-04-26T10:29:31Z

When searching for /[ɔ̃] it should not match "ɔ", that is a bug. It looks like the part in NFA that turns the sequence into items separates base and composing characters, that is not right.

chrisbra · 2022-05-20T17:19:49Z

Hm,
The problem in the bt engine can be fixed like this:

diff --git a/src/regexp_bt.c b/src/regexp_bt.c
index 698ff043e..49c53d97a 100644
--- a/src/regexp_bt.c
+++ b/src/regexp_bt.c
@@ -3720,13 +3720,37 @@ regmatch(

          case ANYOF:
          case ANYBUT:
-           if (c == NUL)
-               status = RA_NOMATCH;
-           else if ((cstrchr(OPERAND(scan), c) == NULL) == (op == ANYOF))
-               status = RA_NOMATCH;
-           else
-               ADVANCE_REGINPUT();
-           break;
+           {
+               char_u  *q = OPERAND(scan);
+
+               if (c == NUL)
+                   status = RA_NOMATCH;
+               else if ((cstrchr(q, c) == NULL) == (op == ANYOF))
+                   status = RA_NOMATCH;
+               else
+               {
+                   // Check following combining characters
+                   int len, i;
+
+                   if (enc_utf8)
+                       len = utfc_ptr2len(q) - utf_ptr2len(q);
+
+                   MB_CPTR_ADV(rex.input);
+                   MB_CPTR_ADV(q);
+
+                   if (!enc_utf8 || len == 0)
+                       break;
+
+                   for (i = 0; i < len; ++i)
+                       if (q[i] != rex.input[i])
+                       {
+                           status = RA_NOMATCH;
+                           break;
+                       }
+                   rex.input += len;
+               }
+               break;
+           }

          case MULTIBYTECODE:
            if (has_mbyte)

But for the NFA engine, I don't know how to make it work: I am not sure how to require that a composing character follows inside a collation. I tried to embed NFA_COMPOSING inside NFA_COLLATION, but then it complained about that character class.
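The behaviour the patch is aiming for, where a collection item only matches when its composing characters match too, can be modelled in a few lines. The Python sketch below is not a translation of the C code above and makes no claim about either Vim engine; it only illustrates the intended result, and the helper names `clusters` and `find_in_collection` are hypothetical.

```python
import unicodedata

def clusters(text):
    """Split text into units of one base character plus its combining marks."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            units[-1] += ch          # attach the mark to the preceding base
        else:
            units.append(ch)         # a new base (or an orphan mark at the start)
    return units

def find_in_collection(collection, text):
    """Return the units of text that equal some whole unit of the collection."""
    members = set(clusters(collection))
    return [unit for unit in clusters(text) if unit in members]

sample = "ab\u0254\u0303k\u0254t"                          # abɔ̃kɔt
print(find_in_collection("\u0254", sample))                # only the plain ɔ
print(find_in_collection("\u0254\u0303", sample))          # only ɔ̃
```

With this model, `[ɔ]` and `[ɔ̃]` no longer match each other's text, which is the distinction the issue asks for.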


friedrichromstedt · 2022-05-21T16:12:45Z

It seems to me that some of the points made earlier maybe can be understood more precisely by clearly differentiating between a Unicode string and its encoded bytes sequence. I can only give limited information, but I hope it might help solving the issues presented here.

As far as I learned from Python, there are two basic operations which matter here: Encoding a Unicode string and Decoding a Bytes Sequence. Encoding creates a bytes sequence. Decoding produces a Unicode string.

To my knowledge, some Unicode characters including diacritical marks, e.g. "ä" or "ö", can be encoded in different ways: For example as an UTF-8 bytes sequence (UTF-8 is a Unicode coding) or as, say, an ASCII single byte character in a certain code page (each code page can only code a limited set of Unicode characters). I am not sure if ASCII code pages are a true but limited "Unicode coding". All of the resp. bytes sequences can be decoded to the Unicode characters "ä" or "ö" in this example. I do not know how to handle characters with diacritical marks which have not received a single Unicode character. Furthermore, I cannot tell how the Unicode standard deals with a sequence of a base character ("a" or "o" in this example) followed by a diacritical mark (here a diaeresis: the two dots above the base character) which is (in printout) identical to the single Unicode character: Are these sequences treated as a "combined character", s.t. "a" + <diaeresis> is identical to ä (the named Unicode character)? Thus, are the sequences of base character + diacritic mark which do not have a single Unicode name treated as one or two Unicode entities? Maybe someone more knowledgeable than me can clarify these points.

I think that even Unicode strings need to use some coding scheme internally, but I don't know details on how this is solved in Python 3. As a side remark, strings in Python 3 are always Unicode strings; separate from Bytes sequences.
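Some of the questions above can be answered by experiment. The following minimal Python sketch shows encoding and decoding, and how Unicode normalization relates a precomposed character to its base-plus-mark sequence. One detail it relies on: ASCII itself has no "ä", so the single-byte example uses Latin-1 instead.

```python
import unicodedata

precomposed = "\u00e4"    # ä as a single code point
decomposed = "a\u0308"    # a followed by U+0308 COMBINING DIAERESIS

# Encoding turns a string into bytes; decoding reverses it.
print(precomposed.encode("utf-8"))     # b'\xc3\xa4'
print(decomposed.encode("utf-8"))      # b'a\xcc\x88'
print(precomposed.encode("latin-1"))   # b'\xe4'
print(b"\xc3\xa4".decode("utf-8") == precomposed)   # True

# The two strings are different code point sequences ...
print(precomposed == decomposed)       # False
# ... but they are canonically equivalent, which normalization makes visible.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```

A sequence that has no precomposed form, such as ɔ followed by a combining tilde, simply stays as two code points under both normalization forms.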


Problem: regex: combining chars in collections not handled
Solution: Check for following combining characters for NFA and BT engine

closes: vim#10459
closes: vim#10286

Signed-off-by: Christian Brabandt <cb@256bit.org>


Problem: regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the NFA and BT Regex Engine

Also, while at it, make debug mode work again.

fixes vim/vim#10286
closes: vim/vim#12871

vim/vim@d2cc51f

Co-authored-by: Christian Brabandt <cb@256bit.org>


jchwenger closed this as completed Apr 26, 2022

brammool reopened this Apr 26, 2022

chrisbra added a commit to chrisbra/vim that referenced this issue May 20, 2022

regexp: make combining chars in collections work

571bced

fixes vim#10286

chrisbra mentioned this issue May 20, 2022

regexp: correctly handle combining chars in collations #10459

Closed

chrisbra closed this as completed in ca22fc3 Aug 20, 2023

chrisbra added a commit to chrisbra/vim that referenced this issue Aug 20, 2023

regexp: make combining chars in collections work

b2fecfc

fixes vim#10286

chrisbra reopened this Aug 20, 2023

chrisbra mentioned this issue Aug 20, 2023

regexp: make combining chars in collections work #12871

Closed

chrisbra added a commit to chrisbra/vim that referenced this issue Aug 20, 2023

regexp: make combining chars in collections work

54b8cfa

fixes vim#10286

chrisbra added a commit to chrisbra/vim that referenced this issue Aug 21, 2023

regexp: make combining chars in collections work

f05165a

fixes vim#10286 Also, while at it, make debug mode work again.

chrisbra added a commit to chrisbra/vim that referenced this issue Aug 27, 2023

regexp: make combining chars in collections work

2c291b6

fixes vim#10286 Also, while at it, make debug mode work again.

chrisbra added a commit to chrisbra/vim that referenced this issue Aug 30, 2023

regexp: make combining chars in collections work

0c86e35

fixes vim#10286 Also, while at it, make debug mode work again.

chrisbra added a commit to chrisbra/vim that referenced this issue Jan 4, 2024

regexp: make combining chars in collections work

35fbb59

fixes vim#10286 Also, while at it, make debug mode work again.

chrisbra closed this as completed in d2cc51f Jan 4, 2024

zeertzjq mentioned this issue Jan 12, 2024

vim-patch:9.1.0011: regexp cannot match combining chars in collection neovim/neovim#26992

Merged
