Brackets regex of symbol e.g. [ɔ] also captures symbol with diacritic [ɔ̃] #10286

Closed
jchwenger opened this issue Apr 25, 2022 · 12 comments

@jchwenger

Steps to reproduce

Tested in vim/gvim 8.2 on Ubuntu 18.04

  1. Text to search:
/abɔ̃kɔ̃t/	à bon compte
/abupɔʁtɑ̃/	à bout portant


  2. Searching for
/ɔ
and
/ɔ̃
produce the expected result, only finding the respective versions of the character, however
/[ɔ]
and
/[ɔ̃]
both capture ɔ and ɔ̃. Is this intended behaviour?

Thanks!

Expected behaviour

I was expecting a distinction between ɔ and ɔ̃ even when included in brackets. (I had been using a list of IPA characters in brackets to eliminate lines using :v/, and found myself still having the sound described by the second IPA symbol (with tilde) even if I had only included the first in my list.)

Version of Vim

8.2 (2019 Dec 12, compiled Oct 12 2020 17:33:22)

Environment

Ubuntu 18.04
Gnome Terminal 3.28.2
$TERM: xterm-256color
zsh 5.4.2 (x86_64-ubuntu-linux-gnu)

Logs and stack traces

No response

@brammool
Contributor

I would think this is not correct. "[abc]" should work the same as "(a|b|c)".

@brammool
Contributor

Hmm, github dropped the backslashes, should be \(a\|b\|c\)

@jchwenger
Author

However this is not the issue, I think. What is striking is that [ɔ] captures both ɔ and ɔ̃. And vice versa ([ɔ̃] captures also ɔ when it shouldn't?). I tried with [ɛ] as well, same thing, it captures both ɛ and ɛ̃ (and vice versa).

I just tried with [e] and with [é] and everything behaves as expected ([e] captures only e and [é] only é). [à] also behaves normally.

@brammool
Contributor

brammool commented Apr 25, 2022 via email

@jchwenger
Author

jchwenger commented Apr 25, 2022

I'm not sure what you mean by wrong, but sure. It still would then be a bit impractical (in the same way as sorting alphabetically does not handle a and à as the same character...), I'm afraid.

It remains somewhat unintuitive to me, the fact that if I do /[ɑ̃ɔ̃] I will also capture all the ɔ and the ɑ, and so I should use /\(ɑ̃\|ɔ̃\) instead, more verbose, especially if there's more than a few characters. Certainly unexpected.

Btw that leads me to discover this truly odd edge case (to my intuition at least): if I search
/\(ɔ\|ɔ̃\)
that only captures ɔ and not ɔ̃ !

@tonymec

tonymec commented Apr 26, 2022

Yeah, as Bram said, most "IPA Extensions" (U+0250 to U+02AF) don't exist with a precomposed diacritic, so you have to add a "combining diacritical mark" (U+0300 to U+036F) to place the diacritic on them. But they are still two characters, e.g. ɔ̃ is U+0254 LATIN SMALL LETTER OPEN O followed by U+0303 COMBINING TILDE. When you search for U+0254 ɔ Vim finds it as part of the combined sequence (spacing IPA character + combining character), which is expected. OTOH, there exists a precomposed é U+00E9 LATIN SMALL LETTER E WITH ACUTE and if your text contains that rather than U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, Vim won't find e when searching for é nor vice versa.
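
The code point breakdown described above can be checked outside Vim. The following is a minimal Python sketch (the helper name `describe` is made up for this illustration) that lists the code points in ɔ̃ and é, and shows that NFC normalization only merges a base and a mark when a precomposed code point exists:

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

open_o_tilde = "\u0254\u0303"   # ɔ̃ : base letter followed by a combining mark
e_acute = "\u00e9"              # é : one precomposed code point

open_o_parts = describe(open_o_tilde)
e_acute_parts = describe(e_acute)

# NFC composes base + mark only when a precomposed code point exists.
nfc_open_o_len = len(unicodedata.normalize("NFC", open_o_tilde))  # stays 2
nfc_e_len = len(unicodedata.normalize("NFC", "e\u0301"))          # becomes 1

print(open_o_parts)
print(e_acute_parts)
print(nfc_open_o_len, nfc_e_len)
```

This only illustrates the Unicode side of the explanation; it says nothing about how Vim's regex engines treat the sequence.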

If your ɔ̃ had been followed by something in your search pattern, e.g. if you had been searching for ɔ̃] then Vim wouldn't have found ɔ] (nor vice-versa) because ɔ̃] is U+0254 LATIN SMALL LETTER OPEN O followed by U+0303 COMBINING TILDE followed by U+005D RIGHT SQUARE BRACKET while ɔ] is only U+0254 LATIN SMALL LETTER OPEN O followed by U+005D RIGHT SQUARE BRACKET and Vim won't find the one when looking for the other unless you put \Z in your search pattern (see ":help /\Z"). So finding ɔ̃ but not ɔ should be straightforward but for the opposite you should construct a pattern specifying that you want ɔ "not followed by U+0303" which is more complicated.
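
The idea of matching ɔ "not followed by U+0303" can be sketched with a negative lookahead. This example uses Python's `re` module rather than Vim's pattern syntax, so it is an analogy under that assumption, and it only excludes the tilde, not every possible combining mark:

```python
import re

# The two sample entries from the report, written with explicit escapes.
text = "/ab\u0254\u0303k\u0254\u0303t/ /abup\u0254\u0281t\u0251\u0303/"

# Open o that is NOT immediately followed by U+0303 COMBINING TILDE.
plain_open_o = re.compile("\u0254(?!\u0303)")
# Open o that IS followed by the combining tilde.
nasal_open_o = re.compile("\u0254\u0303")

plain_positions = [m.start() for m in plain_open_o.finditer(text)]
nasal_positions = [m.start() for m in nasal_open_o.finditer(text)]

print(plain_positions)  # index of the bare ɔ in the second entry
print(nasal_positions)  # indices of the two ɔ̃ in the first entry
```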

Searching for /[ɑ̃ɔ̃] will find any of the following four characters:

  • U+0251 LATIN SMALL LETTER ALPHA
  • U+0303 COMBINING TILDE
  • U+0254 LATIN SMALL LETTER OPEN O
  • U+0303 COMBINING TILDE
    two of which are the same, so you should even find ɛ̃ (which also consists of a character followed by a combining tilde) but not ɛ (which includes none of the characters I listed above).
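
The per-code-point reading of a bracket expression described in this list can be reproduced in a regex engine that treats a character class as a plain set of code points. The sketch below uses Python's `re`, which is an assumption chosen for illustration and not a model of Vim's engines (later comments in this thread show Vim behaving differently for an orphan tilde):

```python
import re

# The text between the brackets in /[ɑ̃ɔ̃], as individual code points.
collection = "\u0251\u0303\u0254\u0303"
distinct = sorted({f"U+{ord(ch):04X}" for ch in collection})

pattern = re.compile("[" + collection + "]")

matches_plain_o = pattern.search("\u0254") is not None            # bare ɔ
matches_eps_tilde = pattern.search("\u025b\u0303") is not None    # ɛ̃, via the tilde only
matches_plain_eps = pattern.search("\u025b") is not None          # bare ɛ

print(distinct)
print(matches_plain_o, matches_eps_tilde, matches_plain_eps)
```

The tilde appears twice in the class, so only three distinct code points remain.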

The fact that /(ɔ|ɔ̃) captures only ɔ and not ɔ̃ is (to my mind, at least) unexpected.

Best regards,
Tony.

@tonymec

tonymec commented Apr 26, 2022

Oops, I meant, "the fact that /\(ɔ\|ɔ̃\) captures only…", it was transformed because I typed it without backticks.

Best regards,
Tony.

@jottkaerr

In a sub-pattern with two or more branches the first matching branch is preferred if more than one branch can match, i.e., \(a\|an\) will match the a in the word "branch" while \(an\|a\) will match the an. Of course, if the pattern continues after the branches, that might force a different branch to match, i.e., both \(a\|an\)c and \(an\|a\)c will match the anc in "branch".

So if someone wants to use \(...\|...\|...\) as an alternative for [...] to search for a number of characters with possibly combining characters, they should put the characters with combining characters at the start of the branches.
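
The branch-ordering rule can be tried in any backtracking engine with ordered alternation. The following sketch uses Python's `re` (so the syntax is `(a|an)` rather than Vim's `\(a\|an\)`), on the assumption that its leftmost-first alternation mirrors the behaviour described here:

```python
import re

word = "branch"
first_a = re.search(r"(a|an)", word).group()    # first branch wins
first_an = re.search(r"(an|a)", word).group()   # longer branch listed first
forced = re.search(r"(a|an)c", word).group()    # trailing c forces the second branch

text = "ab\u0254\u0303k"   # contains ɔ followed by a combining tilde
short_first = re.search("(\u0254|\u0254\u0303)", text).group()
long_first = re.search("(\u0254\u0303|\u0254)", text).group()

print(first_a, first_an, forced)
print(len(short_first), len(long_first))
```

Listing the branch with the combining character first is what lets the match include the tilde.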

Regards,
Jürgen

@jchwenger
Author

Thanks so much @brammool @tonymec & @jottkaerr for the detailed responses! I knew about the composition of character + diacritic, I just did not expect the regex in brackets to match the way it did.

I was actually doing a filter based on IPA representations, where I have a vocab table and only wanted to retain words with certain sounds, and remove the rest, using the :v / :g commands. So I would do e.g. :v/\/[abilmnpstuɛʁ]\+\//d_, which was just simple and elegant. Thanks to the additional information provided here I know now I'll have to watch out for compositions like ɛ̃ as well, and perhaps do a second pass with another filter.
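
One way to make this kind of filter aware of compositions is to compare whole base-plus-mark clusters instead of single code points. The sketch below is in Python rather than Vim script, the helper names `clusters` and `only_allowed` and the sample rows are invented for illustration, and the clustering is a simplification based on `unicodedata.combining`, not full grapheme segmentation:

```python
import unicodedata

def clusters(ipa):
    """Split a string into base characters with their trailing combining marks."""
    out = []
    for ch in ipa:
        if unicodedata.combining(ch) and out:
            out[-1] += ch      # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out

def only_allowed(ipa, allowed):
    """True if every cluster of ipa is in the allowed set."""
    return all(c in allowed for c in clusters(ipa))

# Same symbols as in the bracket list above: a b i l m n p s t u ɛ ʁ
allowed = set(clusters("abilmnpstu\u025b\u0281"))

# Invented sample IPA fields (without the surrounding slashes).
rows = ["abil", "b\u025bl", "b\u025b\u0303", "ab\u0254\u0303k\u0254\u0303t"]
kept = [r for r in rows if only_allowed(r, allowed)]

print(kept)
```

Because ɛ̃ is compared as one cluster, it is rejected even though plain ɛ is in the allowed set.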

> Vim won't find the one when looking for the other unless you put \Z in your search pattern (see ":help /\Z"). So finding ɔ̃ but not ɔ should be straightforward but for the opposite you should construct a pattern specifying that you want ɔ "not followed by U+0303" which is more complicated.

Thanks for the \Z ref! By the way, I repeat, and that was counterintuitive to me, but if you do /[ɔ] vim will match ɔ and ɔ̃, as said, but also the reverse, /[ɔ̃] will match ɔ and ɔ̃. (This does not happen if you just have /ɔ or /ɔ̃, as expected.)

Strangely, I inserted an orphan U+0303 COMBINING TILDE somewhere in my text and the brackets pattern does not match it... (I thought, oh, maybe [ɔ̃] just divides the character and the diacritic and treats them as two separate entities.) It's as if the brackets pattern simply ignored diacritic marks altogether, both in its components and in the text...

And to continue with the strangeness, if you insert U+0303 in the text, and then search for said character /U+0303, it matches characters with said mark, e.g. ɔ̃, but not the orphan mark itself...

@brammool
Contributor

When searching for /[ɔ̃] it should not match "ɔ", that is a bug. It looks like the part in NFA that turns the sequence into items separates base and composing characters, that is not right.

brammool reopened this Apr 26, 2022
@chrisbra
Member

Hm,
The problem in the bt engine can be fixed like this:

diff --git a/src/regexp_bt.c b/src/regexp_bt.c
index 698ff043e..49c53d97a 100644
--- a/src/regexp_bt.c
+++ b/src/regexp_bt.c
@@ -3720,13 +3720,37 @@ regmatch(

          case ANYOF:
          case ANYBUT:
-           if (c == NUL)
-               status = RA_NOMATCH;
-           else if ((cstrchr(OPERAND(scan), c) == NULL) == (op == ANYOF))
-               status = RA_NOMATCH;
-           else
-               ADVANCE_REGINPUT();
-           break;
+           {
+               char_u  *q = OPERAND(scan);
+
+               if (c == NUL)
+                   status = RA_NOMATCH;
+               else if ((cstrchr(q, c) == NULL) == (op == ANYOF))
+                   status = RA_NOMATCH;
+               else
+               {
+                   // Check following combining characters
+                   int len, i;
+
+                   if (enc_utf8)
+                       len = utfc_ptr2len(q) - utf_ptr2len(q);
+
+                   MB_CPTR_ADV(rex.input);
+                   MB_CPTR_ADV(q);
+
+                   if (!enc_utf8 || len == 0)
+                       break;
+
+                   for (i = 0; i < len; ++i)
+                       if (q[i] != rex.input[i])
+                       {
+                           status = RA_NOMATCH;
+                           break;
+                       }
+                   rex.input += len;
+               }
+               break;
+           }

          case MULTIBYTECODE:
            if (has_mbyte)

But for the NFA engine, I don't know how to make it work: I'm not sure how to require that a composing character follows inside a collation. I tried to embed NFA_COMPOSING inside NFA_COLLATION, but then it complained about that character class.

@friedrichromstedt

It seems to me that some of the points made earlier can be understood more precisely by clearly differentiating between a Unicode string and its encoded byte sequence. I can only give limited information, but I hope it might help solve the issues presented here.

As far as I learned from Python, there are two basic operations which matter here: Encoding a Unicode string and Decoding a Bytes Sequence. Encoding creates a bytes sequence. Decoding produces a Unicode string.

To my knowledge, some Unicode characters that include diacritical marks, e.g. "ä" or "ö", can be encoded in different ways: for example as a UTF-8 byte sequence (UTF-8 is a Unicode encoding) or as a single byte in a certain 8-bit code page (each code page can only encode a limited set of Unicode characters). I am not sure whether such code pages count as a true but limited "Unicode encoding". All of the respective byte sequences can be decoded to the Unicode characters "ä" or "ö" in this example. I do not know how to handle characters with diacritical marks which have not received a single Unicode character. Furthermore, I cannot tell how the Unicode standard deals with a sequence of a base character ("a" or "o" in this example) followed by a diacritical mark (here a diaeresis: the two dots above the base character) which is (in printout) identical to the single Unicode character: Are these sequences treated as a "combined character", such that "a" + <diaeresis> is identical to ä (the named Unicode character)? Thus, are the sequences of base character + diacritical mark which do not have a single Unicode name treated as one or two Unicode entities? Maybe someone more knowledgeable than me can clarify these points.

I think that even Unicode strings need to use some coding scheme internally, but I don't know details on how this is solved in Python 3. As a side remark, strings in Python 3 are always Unicode strings; separate from Bytes sequences.
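
The distinction between code points and encoded bytes raised here can be explored with Python's standard library. In this minimal sketch the variable names are made up for illustration; it shows that a precomposed ä and the sequence a + U+0308 are different code point sequences that become equal under Unicode normalization, and that the same character encodes to different bytes in UTF-8 and Latin-1:

```python
import unicodedata

precomposed = "\u00e4"    # ä as a single code point
decomposed = "a\u0308"    # a followed by U+0308 COMBINING DIAERESIS

# As raw code point sequences the two strings differ.
same_code_points = precomposed == decomposed

# Normalization maps between the two canonically equivalent forms.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

# Encoding turns code points into bytes; the bytes depend on the encoding.
utf8_bytes = precomposed.encode("utf-8")
latin1_bytes = precomposed.encode("latin-1")
round_trip = utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1")

print(same_code_points, same_after_nfc, same_after_nfd)
print(utf8_bytes, latin1_bytes, round_trip)
```

A sequence such as ɔ + U+0303, which has no precomposed form, stays as two code points under either normalization.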

chrisbra added a commit to chrisbra/vim that referenced this issue Aug 20, 2023
chrisbra reopened this Aug 20, 2023
chrisbra added a commit to chrisbra/vim that referenced this issue Aug 21, 2023
fixes vim#10286

Also, while at it, make debug mode work again.
Konfekt pushed a commit to Konfekt/vim that referenced this issue Aug 30, 2023
Problem:  regex: combining chars in collections not handled
Solution: Check for following combining characters for NFA and BT engine

closes: vim#10459
closes: vim#10286

Signed-off-by: Christian Brabandt <cb@256bit.org>
zeertzjq added a commit to zeertzjq/neovim that referenced this issue Jan 12, 2024
Problem:  regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the
          NFA and BT Regex Engine

Also, while at it, make debug mode work again.

fixes vim/vim#10286
closes: vim/vim#12871

vim/vim@d2cc51f

Co-authored-by: Christian Brabandt <cb@256bit.org>
zeertzjq added a commit to neovim/neovim that referenced this issue Jan 12, 2024
…#26992)

Problem:  regexp cannot match combining chars in collection
Solution: Check for combining characters in regex collections for the
          NFA and BT Regex Engine

Also, while at it, make debug mode work again.

fixes vim/vim#10286
closes: vim/vim#12871

vim/vim@d2cc51f

Co-authored-by: Christian Brabandt <cb@256bit.org>