Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Intl] Add EmojiTransliterator to translate emoji to many locales #46755

Merged
merged 2 commits into from
Jul 27, 2022

Conversation

lyrixx
Copy link
Member

@lyrixx lyrixx commented Jun 23, 2022

Q A
Branch? 6.2
Bug fix? no
New feature? yes
Deprecations? no
Tickets
License MIT
Doc PR

Today, I used the AsciiSlugger and I was really surprised that it replaced
my emoji with nothing (a blank string). I dig a bit, and I think the best way
to solve that is to add a EmojiTransliterator, and then wire it (another PR)

Current API:

$tr = EmojiTransliterator::getInstance('en');
$tr->transliterate('a 😺, 🐈‍⬛, and a 🦁 go to 🏞️... 😍 🎉 💛');
// a grinning cat, black cat, and a lion go to national park️... smiling face with heart-eyes party popper yellow heart

To do that, I built some transliterator rules, and I committed them to the repository.
The sources come from the official data at https://github.com/unicode-org/cldr

@carsonbot carsonbot added this to the 6.2 milestone Jun 23, 2022
@lyrixx lyrixx changed the title [Intl] Add EmojiTransliteratorTest to translate emoji to english form. [Intl] Add EmojiTransliterator to translate emoji to english form. Jun 23, 2022
@nicolas-grekas
Copy link
Member

Did you want to also provide a map for short names that could work both ways?
Typically mapping 😓 to :sweat:?

@lyrixx lyrixx force-pushed the intl-trans-emoji branch 2 times, most recently from c6c5bb2 to 7cd2a11 Compare June 29, 2022 17:13
@lyrixx
Copy link
Member Author

lyrixx commented Jun 29, 2022

@alamirault you review the PR a bit too soon :) It was a WIP (I pushed in order to backup my work).

@nicolas-grekas

Typically mapping 😓 to :sweat: ?

That's already the case, or I missed something ?

    public function testTransliterate()
    {
        $tr = EmojiTransliterator::getInstance('en');

        $this->assertSame(
            'a grinning cat, black cat, and a lion go to national park️... smiling face with heart-eyes party popper yellow heart',
            $tr->transliterate('a 😺, 🐈‍⬛, and a 🦁 go to 🏞️... 😍 🎉 💛')
        );
    }

@lyrixx
Copy link
Member Author

lyrixx commented Jun 30, 2022

@stof Thanks for the review, I've addressed your comments.

I think the PR is now ready.

@lyrixx lyrixx force-pushed the intl-trans-emoji branch 6 times, most recently from ddc8dd7 to 066e257 Compare June 30, 2022 13:34
Copy link
Member

@nicolas-grekas nicolas-grekas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice :)
I just have minor comments.

In a follow up, it'd be great to add two more locales:

@lyrixx
Copy link
Member Author

lyrixx commented Jul 6, 2022

BTW, looking at the maps, I'm wondering: shouldn't eg en be applied also for en_ca (after it likely)? en_ca looks very short compared to en.

I missed that point!

What do you recommend? we can merge en into en_ca?

@nicolas-grekas
Copy link
Member

What do you recommend? we can merge en into en_ca?

that might increase the size of maps, better run the str_replace twice when needed (with the php-based implem)

@nicolas-grekas
Copy link
Member

we can merge en into en_ca

I went this way actually, not much to merge. The PR on your fork is up to date.

@lyrixx
Copy link
Member Author

lyrixx commented Jul 7, 2022

@nicolas-grekas Feel free to merge this PR, and then you can open a new PR to improve the situation if it needs to

@fabpot
Copy link
Member

fabpot commented Jul 7, 2022

@lyrixx Let's finish this PR before merging it. I don't see why we would merge something that we know is not finished yet.

@nicolas-grekas
Copy link
Member

Especially when this means bloating the git history :)

@lyrixx
Copy link
Member Author

lyrixx commented Jul 7, 2022

So I think the best way It to merge the PR of @nicolas-grekas (in my PR) (I'll do that in few minutes)

So we can discuss about it. But Unfortunately, I don't understand the new code, and I miss some time to dig into it.

@nicolas-grekas nicolas-grekas force-pushed the intl-trans-emoji branch 5 times, most recently from e44e73d to 205587f Compare July 7, 2022 16:32
@lyrixx
Copy link
Member Author

lyrixx commented Jul 7, 2022

I talked a bit with @nicolas-grekas on slack, and we decided to share our thought on the subject here.

Nikos rewrite all the code to be able to rely on Opcache preloading, in order to increase the performance.

To be honest, I don't like this new implementation (that explains my previous^2 comment). So the rest of this comment might be subjective. But I let you judge :)

The previous code was easier to understand, maintain and work with (see exception message for instance) 😄 :

About the performance, I wrote this Unit Test in both branch

    public function testBench()
    {
        $t = microtime(true);

        for ($i=0; $i < 1000; $i++) {
            $tr = EmojiTransliterator::create('en'); // I also tried with this line outside the loop. Same result ($tr is cached internally) 
            $tr->transliterate('un 😺, 🐈‍⬛, et a 🦁 vont au 🏞️');
        }

        dump((microtime(true) - $t));
    }

Here are the results:

Previous code: 0.0145s
New code: 0.088s

It looks like the previous code is 6 times faster than the new code when using only one transliterator instance. It's legit to me, since the new code does not use the translitator anymore but strtr.

But what about many transliterator? To know that, I used the following command that run for all locales

./phpunit src/Symfony/Component/Intl/Tests/Transliterator/EmojiTransliteratorTest.php --filter testAllTransliterator --repeat 100

old code 00:01.746
new code: 00:02.592

So here the old code is also faster


php -v
PHP 8.1.7 (cli) (built: Jun 25 2022 08:13:46) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.1.7, Copyright (c) Zend Technologies
    with Zend OPcache v8.1.7, Copyright (c), by Zend Technologies
    with blackfire v1.77.0~linux-x64-non_zts81, https://blackfire.io, by Blackfire
php -i | grep opca
/etc/php/8.1/cli/conf.d/10-opcache.ini,
opcache.blacklist_filename => no value => no value
opcache.consistency_checks => 0 => 0
opcache.dups_fix => Off => Off
opcache.enable => On => On
opcache.enable_cli => On => On
opcache.enable_file_override => Off => Off
opcache.error_log => no value => no value
opcache.file_cache => no value => no value
opcache.file_cache_consistency_checks => On => On
opcache.file_cache_only => Off => Off
opcache.file_update_protection => 2 => 2
opcache.force_restart_timeout => 180 => 180
opcache.huge_code_pages => Off => Off
opcache.interned_strings_buffer => 8 => 8
opcache.jit => tracing => tracing
opcache.jit_bisect_limit => 0 => 0
opcache.jit_blacklist_root_trace => 16 => 16
opcache.jit_blacklist_side_trace => 8 => 8
opcache.jit_buffer_size => 0 => 0
opcache.jit_debug => 0 => 0
opcache.jit_hot_func => 127 => 127
opcache.jit_hot_loop => 64 => 64
opcache.jit_hot_return => 8 => 8
opcache.jit_hot_side_exit => 8 => 8
opcache.jit_max_exit_counters => 8192 => 8192
opcache.jit_max_loop_unrolls => 8 => 8
opcache.jit_max_polymorphic_calls => 2 => 2
opcache.jit_max_recursive_calls => 2 => 2
opcache.jit_max_recursive_returns => 2 => 2
opcache.jit_max_root_traces => 1024 => 1024
opcache.jit_max_side_traces => 128 => 128
opcache.jit_prof_threshold => 0.005 => 0.005
opcache.lockfile_path => /tmp => /tmp
opcache.log_verbosity_level => 1 => 1
opcache.max_accelerated_files => 15000 => 15000
opcache.max_file_size => 0 => 0
opcache.max_wasted_percentage => 5 => 5
opcache.memory_consumption => 128 => 128
opcache.opt_debug_level => 0 => 0
opcache.optimization_level => 0x7FFEBFFF => 0x7FFEBFFF
opcache.preferred_memory_model => no value => no value
opcache.preload => no value => no value
opcache.preload_user => no value => no value
opcache.protect_memory => Off => Off
opcache.record_warnings => Off => Off
opcache.restrict_api => no value => no value
opcache.revalidate_freq => 2 => 2
opcache.revalidate_path => Off => Off
opcache.save_comments => On => On
opcache.use_cwd => On => On
opcache.validate_permission => Off => Off
opcache.validate_root => Off => Off
opcache.validate_timestamps => On => On

@nicolas-grekas
Copy link
Member

Thanks for challenging my proposal @lyrixx
I figured out that using strtr() was way faster than using str_replace() for the use case.
I also figured out that we can skip creating a slicing transliterator most of the time.
The PHP version is now faster again :)

@nicolas-grekas
Copy link
Member

nicolas-grekas commented Jul 9, 2022

My bench: ab -n 100 http://localhost:8080/ with the script below.

Results with 100 transliterations:

  • before: 75 req/s
  • after: 390 req/s

Results with 1000 transliterations:

  • before: 40 req/s
  • after: 70 req/s
<?php
  
use Symfony\Component\Intl\Transliterator\EmojiTransliterator;

require __DIR__.'/vendor/autoload.php';

$tr = EmojiTransliterator::create('en');
//$tr = EmojiTransliterator::getInstance('en');

for ($i=0; $i < 100; $i++) {
    $tr->transliterate('un 😺, 🐈<200d>⬛, et a 🦁 vont au 🏞️ ');
}

echo $tr->transliterate('un 😺, 🐈<200d>⬛, et a 🦁 vont au 🏞️ ');

@ro0NL
Copy link
Contributor

ro0NL commented Jul 11, 2022

how would we link AsciiSlugger + EmojiTransliterator? is it desired?

@nicolas-grekas
Copy link
Member

how would we link AsciiSlugger + EmojiTransliterator? is it desired?

In a follow up PR, and yes, I think it's desired.

@OskarStark OskarStark changed the title [Intl] Add EmojiTransliterator to translate emoji to many locales [Intl] Add EmojiTransliterator to translate emoji to many locales Jul 14, 2022
@nicolas-grekas
Copy link
Member

Thank you @lyrixx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet