Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Utf8] New component with Bytes, CodePoints and Graphemes implementations of string objects #22184

Closed
wants to merge 5 commits into from

Conversation

@nicolas-grekas
Copy link
Member

commented Mar 27, 2017

Q A
Branch? master
Bug fix? no
New feature? yes
BC breaks? no
Deprecations? no
Tests pass? yes
Fixed tickets -
License MIT
Doc PR -

[edit: continued in #33553]

This is a port of tchwork/utf8 to Symfony.
tchwork/utf8 has 7M downloads on packagist, and I'd be really happy to maintain it under the Symfony umbrella.

It provides 3 classes that wrap PHP strings into objects, and deal with the 3 usual unit spaces of strings: bytes, utf8 chars and grapheme clusters.

All 3 classes implement the GenericStringInterface, so that one can type hint any of them, then potentially select which appropriate unit system one want to deal with (see above) with "converter" methods. GenericStringInterface is annotated @final to tag it as not-implementable by userland - thus allow us to change it and add more methods later on if we want, without being blocked by our BC promise.

In order to help the implementation, the component has a PHP 7.0 requirement. It'd be nice if this could be accepted as such - this helps a lot to make the code clean.

Test coverage is at 100%.

A big thank to @hhamon who did the port.

(for cross ref, here is a related package: https://packagist.org/packages/danielstjules/stringy)

@nicolas-grekas nicolas-grekas added this to the 3.x milestone Mar 27, 2017

@hhamon hhamon force-pushed the hhamon:utf8 branch 4 times, most recently from b807183 to 9edd61a Mar 27, 2017

@Fleshgrinder
Copy link

left a comment

The inheritance does not really make sense. Programming against the GenericStringInterface does not provide me with any confidence over the kind of string I will receive, in other words, I could program against literal PHP strings as well.

What actually would make sense is that the UTF-8 variations extend Bytes, while implementing a common UTF8String interface.

This approach would also be extensible for the future, e.g. to add an ASCIIString that extends Bytes and implements the UTF8String interface as well. Since any valid ASCII string is valid UTF-8.

To put it differently: anyone capable of processing bytes is capable of processing UTF-8, anyone capable of processing UTF-8 is capable of processing ASCII, … you may continue this chain until you reach a pure Latin Alphabet (e.g. [a-z]).

src/Symfony/Component/Utf8/Bytes.php Show resolved Hide resolved
src/Symfony/Component/Utf8/Bytes.php Outdated Show resolved Hide resolved
src/Symfony/Component/Utf8/Bytes.php Show resolved Hide resolved
src/Symfony/Component/Utf8/Bytes.php Show resolved Hide resolved
src/Symfony/Component/Utf8/Bytes.php Show resolved Hide resolved
src/Symfony/Component/Utf8/Utf8Trait.php Show resolved Hide resolved
src/Symfony/Component/Utf8/Utf8Trait.php Outdated Show resolved Hide resolved
}
} else {
throw new InvalidArgumentException('Pattern replacement must be a valid string or array of strings.');
}

This comment has been minimized.

Copy link
@Fleshgrinder

Fleshgrinder Mar 27, 2017

Same problem as before, second argument can be an array only if first argument is an array.

This comment has been minimized.

Copy link
@nicolas-grekas

nicolas-grekas Mar 27, 2017

Author Member

nope, see above

This comment has been minimized.

Copy link
@Fleshgrinder
@nicolas-grekas

This comment has been minimized.

Copy link
Member Author

commented Mar 27, 2017

Programming against the GenericStringInterface does not provide me with any confidence over the kind of string I will receive, in other words, I could program against literal PHP strings as well.

PHP doesn't provide generic programming (as e.g. C++ "templates"), so this is the way we found to simulate generic programming.
If one wants contractual type safety, one should just type hint the proper class (eg Bytes). Classes are final for this purpose. If one wants to code some generic algo that doesn't care about the specific unit system, then that's when the interface should be used. And when one type-hints against the interface but still wants control over the unit system, then the toCodePoints, toBytes and toGraphemes are provided for this purpose.

What actually would make sense is that the UTF-8 variations extend Bytes, while implementing a common UTF8String interface.

Certainly not: an utf8 string is not an instanceof bytes, for the purpose of this component. The return value of eg the length methods of each corresponding classes do not adhere to the same contract: one return a "bytes" unit - the other a "code point" unit. Even if the interfaces look like the same, they are not.

To put it differently: anyone capable of processing bytes is capable of processing UTF-8, anyone capable of processing UTF-8 is capable of processing ASCII, … you may continue this chain until you reach a pure Latin Alphabet (e.g. [a-z]).

That is generic programming : ignoring the type of things do to similar operations on objects. GenericStringInterface is provided exactly for this purpose. But as far as the type system is considered, the three kinds of strings provided here are not and should not be "instanceof" each others.

@Fleshgrinder

This comment has been minimized.

Copy link

commented Mar 27, 2017

I can treat any and every UTF-8 string as a series of bytes. The current implementation of Bytes assumes an ASCII encoded string, this is most certainly not the case, it can have any and all encodings. PHP’s string type is already generic, wrapping it is only exchanging a well known API against a new one. I thought that this is actually meant to provide more control over a string’s content, as well as confidence that that content is of a certain character set (I count a byte stream as a kind of characters set, it just provides the least confidence over what I am dealing with; which is actually not really interesting in the first place and my inheritance chain would actually start with UTF-8). Seems like I am wrong here, and this is just an OO flavored utility implementation.

If this is considered to be useful, than it’s fine with me.

I probably have to add, that I truly like the initiative and effort. String handling is very complicated, and I thought very often about creating a similar thing. I mean, I would not take the time to review this and give constructive feedback if I would consider this being crap. So, please feel encouraged and not discouraged by all my comments. 🐱

@hhamon hhamon force-pushed the hhamon:utf8 branch 5 times, most recently from 624ee76 to 8f52667 Mar 28, 2017

src/Symfony/Component/Utf8/Bytes.php Outdated Show resolved Hide resolved
src/Symfony/Component/Utf8/Bytes.php Outdated Show resolved Hide resolved
src/Symfony/Component/Utf8/README.md Show resolved Hide resolved
@ro0NL
Copy link
Contributor

left a comment

👍 cool stuff.

in general there's a lot of the same-ish code.. could it be further simplified with a common mb_ trait using either 8bit or UTF-8?

@hhamon hhamon force-pushed the hhamon:utf8 branch 10 times, most recently from a4fe716 to 551d790 Mar 30, 2017

@hhamon

This comment has been minimized.

Copy link
Contributor

commented Mar 31, 2017

Hi @GeoffreyHervet and thank you very much for participating to the discussion.

I don't think it makes sense to add an Immutable suffix as the design of the classes makes it already obvious. It's similar to adding a phpdoc block on a method to document the explicit type hinted arguments in the method prototype.

However, we may document in each class description block at the top that these objects are immutable.

public function toLowerCase(): self
{
$result = clone $this;
$result->string = mb_strtolower($this->string, 'UTF-8');

This comment has been minimized.

Copy link
@Fleshgrinder

Fleshgrinder Mar 31, 2017

This will not yield the desired result in some languages. Classical example for this is Turkish:

assert(mb_strtolower('DİNÇ') === 'dinç');
assert(mb_strtolower('DINÇ') === 'dınç');

The latter will not be converted correctly.

assert(transliterator_transliterate('tr-Lower', 'DİNÇ') === 'dinç');
assert(transliterator_transliterate('tr-Lower', 'DINÇ') === 'dınç');

This works as expected. Of course the same is true for toUpperCase.

@robfrawley

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2017

I have a few (less developed) of my own implementations handling similar functionality. I'd love to instead use a common component, such as this. I haven't taken the time to traverse the code just yet (I will soon), but as far as the generic stated purpose of this component: 👍

@hhamon hhamon force-pushed the hhamon:utf8 branch from 9a12070 to 679106f Apr 3, 2017

@nicolas-grekas nicolas-grekas self-assigned this Apr 10, 2017

@soullivaneuh

This comment has been minimized.

Copy link
Contributor

commented May 4, 2017

So if I well understood the component, we have to instantiate an object containing the string to do utf-8 safe manipulation and comparison?

Why not having static methods like voku/portable-utf8 package does?

BTW, maybe this was already discussed and I'll be glad to have a thread link in this case, but why not requiring and using a library like voku/portable-utf8 or yours instead of creating a new component?

Maybe I'm too curious, but I like to have some elucidation. 👼 😉

@stof

This comment has been minimized.

Copy link
Member

commented May 4, 2017

@soullivaneuh this is an API upgrade of an existing package from @nicolas-grekas, meant to bring this package in the symfony ecosystem (so that it is maintained by the core team rather than @nicolas-grekas alone)

@hhamon

This comment has been minimized.

Copy link
Contributor

commented May 4, 2017

@soullivaneuh because package from voku doesn't allow to manipulate strings as bytes, codepoints or graphemes units. Depending on your domain specific context, you'll choose one of the 3 implementations. Bytes for simple string and fast string manipulations. CodePoints for simple UTF8 strings and Graphemes when you need to deal with advanced chars map where you have combined characters.

@soullivaneuh

This comment has been minimized.

Copy link
Contributor

commented May 5, 2017

@stof Seems indeed legit, I was thinking like you after posting this question. 😉

voku doesn't allow to manipulate strings as bytes, codepoints or graphemes units.

Indeed, this is maybe the package class name is UTF8. :trollface:

But still. Why class instantiation for string manipulation? It look a little bit too heavy if I just want a lenght of a string and do a sub string.

Maybe it will be clearer for me when the related documentation will come. 😉

@Fleshgrinder

This comment has been minimized.

Copy link

commented May 5, 2017

The reason for using the type system is usually to be able to use the type system. Util classes do not give you any kind of security. If I need a valid UTF-8 string I should be able to type hint that to you. Using string basically tells you, well, nothing.

That being said, I still don't like the implementation, sorry.

@nicolas-grekas nicolas-grekas modified the milestones: 3.4, 4.1 Sep 28, 2017

@hhamon hhamon force-pushed the hhamon:utf8 branch from 679106f to 8be629d Dec 12, 2017

hhamon added 3 commits Dec 12, 2017
@fabpot

This comment has been minimized.

Copy link
Member

commented Mar 24, 2019

@nicolas-grekas What about this one? Does it make sense to finish it and merge it?

@nicolas-grekas

This comment has been minimized.

Copy link
Member Author

commented Apr 6, 2019

I really want to finish it :)

@nicolas-grekas nicolas-grekas changed the base branch from master to 4.4 Jun 2, 2019

@nicolas-grekas

This comment has been minimized.

Copy link
Member Author

commented Sep 4, 2019

I'm closing here so we can keep the discussion relevant to the attached patch.
I'm going to open a new PR soon, I'll post the ref here so that everyone interested can join it.

@nicolas-grekas

This comment has been minimized.

Copy link
Member Author

commented Sep 11, 2019

Continued in #33553

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.