Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grapheme_strlen shows different length of emoji ZWJ Sequence when compared to native #370

Closed
Luc45 opened this issue Sep 10, 2021 · 7 comments

Comments

@Luc45
Copy link
Contributor

Luc45 commented Sep 10, 2021

Take the following emoji for instance: 👩‍👩‍👦‍👦

This emoji consists of four different emojis glued together by Zero Width Joiner characters, as seen on https://emojipedia.org/family-woman-woman-boy-boy/.

When checking the length with grapheme_strlen(), it returns 1, while this library returns 4.

This is possibly due to a bug on the GRAPHEME_CLUSTER_RX regex.

This bug should only happen on PCRE_VERSION < 8.32, however, when combined with the bug #369 , it applies to all PCRE_VERSION that contains a date timestamp, which seems to be the default format.

Therefore, the grapheme_strlen function in this polyfill is likely to provide incorrect results, such as in this example:

Expected result grapheme_strlen('👩‍👩‍👦‍👦'):

The test is being conducted using the regex: \X

int(1)
int(1)
int(1)
int(1)

Actual result with the custom cluster grapheme_strlen('👩‍👩‍👦‍👦'):

The test is being conducted using the regex: (?:\r\n|(?:[ -~\x{200C}\x{200D}]|[ᆨ-ᇹ]+|[ᄀ-ᅟ]*(?:[가개갸걔거게겨계고과괘괴교구궈궤귀규그긔기까깨꺄꺠꺼께껴꼐꼬꽈꽤꾀꾜꾸꿔꿰뀌뀨끄끠끼나내냐냬너네녀녜노놔놰뇌뇨누눠눼뉘뉴느늬니다대댜댸더데뎌뎨도돠돼되됴두둬뒈뒤듀드듸디따때땨떄떠떼뗘뗴또똬뙈뙤뚀뚜뚸뛔뛰뜌뜨띄띠라래랴럐러레려례로롸뢔뢰료루뤄뤠뤼류르릐리마매먀먜머메며몌모뫄뫠뫼묘무뭐뭬뮈뮤므믜미바배뱌뱨버베벼볘보봐봬뵈뵤부붜붸뷔뷰브븨비빠빼뺘뺴뻐뻬뼈뼤뽀뽜뽸뾔뾰뿌뿨쀄쀠쀼쁘쁴삐사새샤섀서세셔셰소솨쇄쇠쇼수숴쉐쉬슈스싀시싸쌔쌰썌써쎄쎠쎼쏘쏴쐐쐬쑈쑤쒀쒜쒸쓔쓰씌씨아애야얘어에여예오와왜외요우워웨위유으의이자재쟈쟤저제져졔조좌좨죄죠주줘줴쥐쥬즈즤지짜째쨔쨰쩌쩨쪄쪠쪼쫘쫴쬐쬬쭈쭤쮀쮜쮸쯔쯰찌차채챠챼처체쳐쳬초촤쵀최쵸추춰췌취츄츠츼치카캐캬컈커케켜켸코콰쾌쾨쿄쿠쿼퀘퀴큐크킈키타태탸턔터테텨톄토톼퇘퇴툐투퉈퉤튀튜트틔티파패퍄퍠퍼페펴폐포퐈퐤푀표푸풔풰퓌퓨프픠피하해햐햬허헤혀혜호화홰회효후훠훼휘휴흐희히]?[ᅠ-ᆢ]+|[가-힣])[ᆨ-ᇹ]*|[ᄀ-ᅟ]+|[^\p{Cc}\p{Cf}\p{Zl}\p{Zp}])[\p{Mn}\p{Me}\x{09BE}\x{09D7}\x{0B3E}\x{0B57}\x{0BBE}\x{0BD7}\x{0CC2}\x{0CD5}\x{0CD6}\x{0D3E}\x{0D57}\x{0DCF}\x{0DDF}\x{200C}\x{200D}\x{1D165}\x{1D16E}-\x{1D172}]*|[\p{Cc}\p{Cf}\p{Zl}\p{Zp}])

int(1)
int(4)
int(1)
int(4)
@Luc45
Copy link
Contributor Author

Luc45 commented Sep 13, 2021

I forgot to share the code snippet used on the results above: https://3v4l.org/OPBFq#v8.0.10

@nicolas-grekas
Copy link
Member

Would you agree with considering that once #369 is merged, this issue can be closed? Aka we don't provide the most recent regexp to ppl that use older PCRE versions?

Alternatively, would you mind looking at improving this regexp? I'm sure I generated it but I don't remember how. There might be a script somewhere in this repo or mayne in https://github.com/tchwork/utf8

@Luc45
Copy link
Contributor Author

Luc45 commented Sep 13, 2021

Thanks for asking my input.

This package requires PHP 7.1, which seems to use PCRE 8.38 according to 3v4l.org: https://3v4l.org/S1bPl

On the PHP versions made available by 3v4l, 8.32 is used on PHP versions bellow 5.5.9, but I'm not sure if this will always be the case.

Is it possible for PHP 7.1+ to be running PCRE 8.32..?

@Luc45
Copy link
Contributor Author

Luc45 commented Sep 13, 2021

It seems PCRE 8.32 made it's way into PHP core in 2013: php/php-src@357ab3c

And has been replaced with 8.35 in 2014: php/php-src@dd0e96c

I guess it's fine to drop support for the old PCRE_VERSION. It would be ideal if this could be enforced in composer.json through ext-pcre, but given the non-standard version number of PCRE, it can be challenging to enforce the versions.

https://jubianchi.github.io/semver-check/#/^10%20||%20^8.34/8.34%202013-12-15

Or "ext-pcre": "> 8.32":

https://jubianchi.github.io/semver-check/#/%3E%208.32/8.34%202013-12-15

@Luc45
Copy link
Contributor Author

Luc45 commented Sep 13, 2021

It seems that PHP 7.1.0 requires PCRE > 6.6 to compile

Only on PHP 7.3 the version restriction was increased to PCRE > 10.30

These restrictions refer only to compilling PHP with an external PCRE.

@Luc45 Luc45 closed this as completed Sep 13, 2021
@Luc45 Luc45 reopened this Sep 13, 2021
@Luc45
Copy link
Contributor Author

Luc45 commented Sep 13, 2021

Actually, only PCRE2 (10+) is able to handle the initial grapheme_strlen example correctly: https://3v4l.org/grqP9

@nicolas-grekas
Copy link
Member

I'm going to close here because nobody worked on this. Ppl should upgrade to PCRE 10+ (or contribute a fix here ;) )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants