Regex :ignorecase behaviour is wrong with characters in regex that case-fold to more than one character #4263

Open
dpk opened this issue Mar 24, 2021 · 4 comments

dpk commented Mar 24, 2021

I reported this bug in 2017, but I think I did it wrong at the time since I’m fairly certain I never heard anything back … apologies if it’s actually a duplicate.

The Problem

If a regex with :ignorecase enabled contains a character whose case folding expands to two (or presumably more) characters, only the first character of the expansion is matched. It also seems to erroneously end the match one character too early for the remainder of the match.
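For readers unfamiliar with multi-character case folding, here is a minimal sketch in Python (not Raku) of the expansion being described. It relies only on the built-in `str.casefold`, which applies full case folding; the helper name `fold_expansion` is made up for this illustration.

```python
def fold_expansion(ch: str) -> str:
    """Return the full case folding of a single character."""
    return ch.casefold()


# U+00DF (sharp s) and U+FB01 (fi ligature) each fold to two characters;
# ordinary letters such as "K" fold to exactly one.
for ch in ["\u00df", "\ufb01", "K"]:
    folded = fold_expansion(ch)
    print(f"U+{ord(ch):04X} folds to {folded!r} ({len(folded)} character(s))")
```

A caseless matcher that assumes one input character always corresponds to one folded character would plausibly go wrong on exactly these inputs.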

Expected Behavior

> say "ss" ~~ /:ignorecase ß/ 
ss
> say "sst" ~~ /:ignorecase ßt/ 
sst
> say "fiddle" ~~ /:ignorecase ﬁ/
fi
> say "fiddle" ~~ /:ignorecase ﬁd/
fid
> say "fiddle" ~~ /:ignorecase ﬁdd/
fidd

Actual Behavior

> say "ss" ~~ /:ignorecase ß/ 
s
> say "sst" ~~ /:ignorecase ßt/ 
ss
> say "fiddle" ~~ /:ignorecase ﬁ/
f
> say "fiddle" ~~ /:ignorecase ﬁd/
fi
> say "fiddle" ~~ /:ignorecase ﬁdd/
fid

(The last three have ASCII f and i in the string to match, but the Unicode ﬁ ligature, U+FB01, in the regex.)

Environment

Mac OS 11.2

Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2021.02.1.
Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
Built on MoarVM version 2021.02.

JJ commented Mar 24, 2021

I found this related issue in the "old" issue tracker, as well as this one. Thanks for the report.

coke commented Mar 24, 2021

Any tests should also cover the reverse case, with (e.g.) the ligature in the string instead of the regex:

Expected:

> say "ﬁdd" ~~ /:ignorecase fid/
「ﬁd」

Actual:

> say "ﬁdd" ~~ /:ignorecase fid/
「ﬁdd」
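As a rough reference for what such tests would expect in both directions, here is a minimal sketch in Python rather than Raku. It is not how Rakudo or MoarVM implement :ignorecase: the function `caseless_match_at_start` is hypothetical, it only tries a match anchored at the start of the string, and it compares full case foldings via `str.casefold`.

```python
def caseless_match_at_start(subject: str, pattern: str):
    """Return the prefix of `subject` whose case folding equals the case
    folding of `pattern`, or None if no prefix folds to it.

    Simplification: a pattern that would end in the middle of a
    multi-character expansion (e.g. "f" against the fi ligature) is
    treated as no match.
    """
    target = pattern.casefold()
    if not target:
        return ""
    folded = ""
    for end, ch in enumerate(subject, start=1):
        folded += ch.casefold()
        if folded == target:
            return subject[:end]
        if not target.startswith(folded):
            return None
    return None


# Expansion in the pattern (the cases from the original report).
print(caseless_match_at_start("ss", "\u00df"))
print(caseless_match_at_start("fiddle", "\ufb01d"))
# Expansion in the subject (the case from this comment).
print(caseless_match_at_start("\ufb01dd", "fid"))
```

Under these assumptions the subject-side call returns the two-character prefix made of the ligature plus one `d`, which is the "Expected" result above.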

vrurg commented Mar 26, 2021

The problem is, perhaps, a little bit deeper than it seems, though I may be missing something in Unicode handling:

say "ﬁdd".chars; # 3

Looking into the match result:

say ("ﬁdd" ~~ /:ignorecase fid/).raku; # Match.new(:orig("ﬁdd"), :from(0), :pos(3))

It does the matching right but fails to report the result correctly because:

say "ﬁdd".substr(0,3); # ﬁdd
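One possible reading of these numbers, offered as an assumption rather than a statement about MoarVM internals, is that the end position 3 counts characters of the case-folded text ("fidd") while being applied to the three-character original string. The Python sketch below shows that mismatch and one way to translate a folded offset back to an original offset; `folded_to_original_offsets` is a hypothetical helper.

```python
def folded_to_original_offsets(s: str) -> list:
    """offsets[i] is the index in `s` of the character that produced
    position i of s.casefold(); a final entry equal to len(s) marks the end."""
    offsets = []
    for i, ch in enumerate(s):
        offsets.extend([i] * len(ch.casefold()))
    offsets.append(len(s))
    return offsets


subject = "\ufb01dd"          # ligature + "dd": 3 characters
pattern = "fid"
end_in_folded = len(pattern)  # 3: where "fid" ends inside "fidd"

# Using the folded offset directly on the original string overshoots.
naive = subject[:end_in_folded]

# Translating the offset first gives the two-character match.
offsets = folded_to_original_offsets(subject)
mapped = subject[:offsets[end_in_folded]]

print(len(subject), len(subject.casefold()))
print(len(naive), len(mapped))
```

Here the naive slice covers the whole string, mirroring the substr(0,3) result above, while the mapped slice stops after the first `d`.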
