Emoji group matches digits #704

Closed
Gegy opened this issue Aug 9, 2020 · 1 comment
Gegy commented Aug 9, 2020

Using regex 1.3.9, matching with the \p{Emoji} group yields matches on digits as well as emoji. As far as I could tell, this isn't intentional.

Example to reproduce issue:

use regex::Regex;

fn main() {
    let match_emoji = Regex::new(r#"\p{Emoji}"#).unwrap();

    let matches = match_emoji.find_iter("hello! 😀 😁 1234")
        .map(|m| m.as_str())
        .collect::<Vec<_>>();

    println!("{:?}", matches);
}

This prints ["😀", "😁", "1", "2", "3", "4"], where I'd expect it to print just ["😀", "😁"].

Not totally sure if I've misinterpreted the intended use of the Emoji group. For now, using the pattern [\p{Emoji}--\p{Digit}] is a functioning workaround. 🙂

@BurntSushi
Member

> Using regex 1.3.9, matching with the \p{Emoji} group yields matches on digits as well as emoji. As far as I could tell, this isn't intentional.

It is correct. There is a difference between what an emoji is, what Unicode defines as an emoji, and what \p{Emoji} is. An emoji is the thing that a human would recognize as an emoticon in text. Unicode defines emoji via UAX#51 using a set of rules that is assuredly imperfect, but does well in practice. Finally, \p{Emoji} is merely a character class that matches precisely one codepoint among a set of them, where that set is defined by Unicode via the UCD. The purpose of \p{Emoji} is as a convenient tool for implementing the more precise definition described in UAX#51. In that light, digits are indeed defined to be a part of this set, since they can combine with other codepoints to form emoji. This should reveal that simply using \p{Emoji} to recognize emoji is quite problematic because it doesn't implement UAX#51 all on its own. Trivially, it will only match a single codepoint, and many UAX#51 emoji consist of several codepoints.
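To make the point about multi-codepoint emoji concrete, here is a small sketch using only the Rust standard library (no regex involved). The constant and function names are illustrative and do not come from the regex crate. It prints the codepoints behind a keycap emoji, which starts with an ordinary ASCII digit, and behind a family emoji joined with zero-width joiners.

```rust
// Keycap "1": DIGIT ONE, then VARIATION SELECTOR-16, then COMBINING ENCLOSING KEYCAP.
// (Illustrative constant name, chosen for this example.)
const KEYCAP_ONE: &str = "1\u{FE0F}\u{20E3}";

// Family emoji: MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

/// Returns every codepoint of `s` formatted as `U+XXXX`.
fn codepoints(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // One emoji as a human sees it, but several codepoints in the text.
    println!("{} -> {:?}", KEYCAP_ONE, codepoints(KEYCAP_ONE));
    println!("{} -> {:?}", FAMILY, codepoints(FAMILY));
}
```

The first codepoint of the keycap sequence is the plain digit U+0031, which is consistent with digits being in the \p{Emoji} set. A pattern that matches one codepoint at a time can see only a piece of either sequence, never the emoji as a whole.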

I wrote more about this here, when someone asked a similar question about ripgrep, which uses the regex crate: BurntSushi/ripgrep#1623 (comment)
