Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differences in transliteration compared to behat/transliterator #35061

Closed
kdambekalns opened this issue Dec 20, 2019 · 12 comments · Fixed by #38198
Closed

Differences in transliteration compared to behat/transliterator #35061

kdambekalns opened this issue Dec 20, 2019 · 12 comments · Fixed by #38198
Labels
Projects

Comments

@kdambekalns
Copy link

kdambekalns commented Dec 20, 2019

Symfony version(s) affected: 5.0.2

Description

We used behat/transliterator until now, but because of issues with PHP 7.4 look into replacing it. I switched our codebase to use symfony/string and run into some differences with the result of transliteration to ASCII.

Input behat/transliterator symfony/strings
汉语 yi-yu han-yu
ភាសាខ្មែរ bhaasaakhmaer
ภาษาไทย phaasaaaithy phas-a-thiy
العَرَبِية l-arabiy al-arabit
한국어 hangugeo hangug-eo
မြန်မာဘာသာ m-n-maabhaasaa
हिन्दी hindii hindi

Notes:

How to reproduce

class TransliterationTest extends UnitTestCase
{
    public function sourcesAndResults(): array
    {
        return [
            ['Überlandstraßen; adé', 'uberlandstrassen-ade'],
            ['TEST DRIVE: INFINITI Q50S 3.7', 'test-drive-infiniti-q50s-3-7'],
            ['汉语', 'yi-yu'],
            ['日本語', 'ri-ben-yu'],
            ['Việt', 'viet'],
            ['ភាសាខ្មែរ', 'bhaasaakhmaer'],
            ['ภาษาไทย', 'phaasaaaithy'],
            ['العَرَبِية', 'l-arabiy'],
            ['עברית', 'bryt'],
            ['한국어', 'hangugeo'],
            ['ελληνικά', 'ellenika'],
            ['မြန်မာဘာသာ', 'm-n-maabhaasaa'],
            [' हिन्दी', 'hindii'],
            ['Иван Иванович', 'ivan-ivanovic'],
            ['Löic & René', 'loic-rene'],
            ['Châteauneuf du Pape', 'chateauneuf-du-pape'],
            ['Žluťoučký kůň', 'zlutoucky-kun'],
            [' x- ', 'x'],
        ];
    }

    /**
     * @test
     * @dataProvider sourcesAndResults
     */
    public function transliterationWorksAsExpected($source, $result): void
    {
        $result = strtolower((new AsciiSlugger())->slug($source)->toString());
        self::assertEquals($result, $result);
    }
}
@nicolas-grekas
Copy link
Member

nicolas-grekas commented Dec 20, 2019

Thanks for the report and the links that's interesting. A core design principle of the Symfony slugger is that we want to reuse standard data as a source: we're not linguists and we have no authority to maintain translit maps.

The source we use is from Unicode, the CLDR sub-project maintains maps here:
https://github.com/unicode-org/cldr/tree/master/common/transforms

E.g. for ភាសាខ្មែរ, there are no maps for Khmer (km), that's why it's not translated correctly.

This raises two questions to me:

  • is the empty string the best possible return value for the situation? (shouldn't we throw an exception instead when this happens? How do we detect the situation?)
  • what's the authority of https://github.com/dzcpy/transliteration?

Maybe the missing maps should be reported to the CLDR project?

@javiereguiluz
Copy link
Member

What Nicolas said is true. Symfony's rules should be correct because they aren't our rules but CLDR rules. Also, in some cases (but not in all of them) Google also agrees with Symfony/CLDR. Example:

google

About the "return empty string vs throw exception" I think it'd be better to throw the exception. An empty slug will probably break the app anyway ... so better warn the developer.

@bwaidelich
Copy link

bwaidelich commented Dec 20, 2019

is the empty string the best possible return value for the situation?

That's a tricky one..
I'd say exception would definitely be better than an empty string. And I suppose it's an exceptional case if either the input or the output string is empty (trimmed).

That way the invoking side would be forced to deal with the situation (and generate a random slug for example)

@stof
Copy link
Member

stof commented Dec 20, 2019

Well, I would say that if the input is the empty string, we should still return the empty string as output. That's the sane transliteration for it. The exceptional case would be a non-empty input producing an empty output.

@bwaidelich
Copy link

But trying to create a slug for an empty string.. Does that make sense?

@nicolas-grekas
Copy link
Member

The issue is how do we handle Khmer chars found in the middle of non-Khmer ones.
A fallback character is another option, turning unmapped chars to eg. dots.

@bwaidelich
Copy link

Just a thought on future backwards compatibility:
Being defensive and throwing an exception for strings/characters that can't be converted would probably make it easier to extend the functionality without breaking possibly expected behavior. It would then just accept additional inputs rather than changing the resulting output.

@garak
Copy link
Contributor

garak commented Feb 16, 2020

I would add another important difference: Symfony is not lowercasing string.

@nicolas-grekas nicolas-grekas added this to Nicolas' in Help Wanted Feb 26, 2020
@fabpot fabpot closed this as completed Sep 16, 2020
fabpot added a commit that referenced this issue Sep 16, 2020
…nicolas-grekas)

This PR was merged into the 5.2-dev branch.

Discussion
----------

[String] allow translit rules to be given as closure

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Tickets       | Fix #35061
| License       | MIT
| Doc PR        | -

Instead of trying to fix #35061 at our level, I propose to add a hook so that ppl can use a transliterator that fits their need (eg voku/portable-ascii or behat/transliterator) while still relying on the `SluggerInterface`.

Commits
-------

0bb48df [String] allow translit rules to be given as closure
@stof
Copy link
Member

stof commented Sep 18, 2020

@garak we transliterate to ASCII. ASCII has uppercase letters.

@garak
Copy link
Contributor

garak commented Sep 18, 2020

@garak we transliterate to ASCII. ASCII has uppercase letters.

Well, it also has a lot of other characters that I would not put in a slug.

@nicolas-grekas
Copy link
Member

slug($foo)->toLower() and done

@garak
Copy link
Contributor

garak commented Sep 18, 2020

slug($foo)->toLower() and done

I guess you meant ->lower(), but anyway it's fine. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Help Wanted
Nicolas'
Development

Successfully merging a pull request may close this issue.

7 participants