Type Independent String Manipulation #912
Comments
I think it makes sense to have string manipulation functions for
This is the fundamental premise of this proposal, but I am not convinced that it is sound. The goal of cross platform abstraction is to encapsulate the platform specific differences and then share as much higher level code - and assumptions - as possible. You're proposing to make the default string encoding, instead of UTF-8, generic depending on the target. This is certainly more convoluted from an API usage perspective.
This is hard to answer because I guess we disagree about the same point, and this is kinda the crux of my viewpoint. Instead of:
var readBytes : []const u16 = win.readBytes();
var winBytes : []const u8 = Win16.ConvertToUtf8(readBytes);
var it = mem.split(winBytes, '/');
while (it.next()) |next| {
    var backTo16 : []const u16 = Win16.ConvertToUtf16(next);
    win.writeBytes(backTo16);
    // Whatever
}
Then just:
var readBytes : []const u16 = win.readBytes();
var it = string.utf16Split(readBytes, "/");
while (it.nextBytes()) |next| {
    win.writeBytes(next);
    // Whatever
}
Like idk, you could still convert and use the utf8 split if you wanted, but now you have a CHOICE.
No, I would say I'm keeping UTF-8 as the default string encoding, BUT letting others do the same operations on UTF-16 and so on :).
Keep in mind you still call it like
The user will rarely ever use Views, UNLESS they add their own new type, let's say Utf27, in which case they would add a view and an iterator, and probably add a call to each of the std.string functions that includes the required types.
Wrapup
I guess it comes down to this: this proposal is all about taking away the annoyance of dealing with a single string type in a program that has to handle many. You still have your CORE standard, which is Utf8, since if you really want to do things like writing to files you'll probably want Utf8. But if you don't, and you just want to read from a Windows u16 buffer, do some splitting, and put the result somewhere else still in Utf16, then you can do so without that pushback. I guess I fail to see where this would be viewed as 'negative' for the standard library, as it doesn't perpetuate anything like conflating the idea of a string.
Closing as PR decision was made
Why
Zig focuses on keeping out of the programmer's way and on caring about edge cases. So where most languages define a SINGLE string standard (Go uses Utf8, C# uses Utf16 for example), I think Zig should stay independent on the matter. There are quite a few reasons for this, detailed throughout the proposal, but they come down to a single core reason: "people need different string types dependent on what they are doing".
Examples
Currently in std/os/path.zig there are numerous places that need a unicode format for Unix that is different to the one for Windows, since Windows requires a very specific Utf16 format. We could maintain everything in Utf8 and convert as needed, which is okay, but that is not the issue. The issue is that anyone else who wants to interface with Windows using Utf16 would also have to go through these manipulations: take in a Utf16 array from Windows, convert it to Utf8, do the manipulations, and convert back. To me that seems convoluted, and it is a key example of why this kind of system is beneficial.
Next, let's say you are building an embedded system, or perhaps you're interfacing with C code, and you have to handle char* strings. The nice thing about this system is that it can work on char* strings just like utf8 arrays or ascii arrays, and do split/trim/whatever you want on them, meaning that you don't have to reinvent the wheel just because you are writing cross-language code.
Finally, another example, and the reason why I propose that ascii be kept separate from unicode: Ascii is much simpler than unicode for operations such as slicing and obtaining code points, because unicode has to iterate through the array to find the code points for code point slicing. So if you are building an application that purely uses Ascii, for efficiency or compatibility or whatever other reason, why should you suffer the slowdown of Unicode in these cases?
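As a rough illustration of the char* case, here is a minimal sketch in present-day Zig. It assumes the C string is available as a sentinel-terminated pointer; `cStringSlice` is a hypothetical helper written for this example and is not part of the proposal. Once the C string is viewed as a slice, the ordinary slice-based `std.mem.trim` works on it without copying.

```zig
const std = @import("std");

/// Hypothetical helper (not from the proposal): view a NUL-terminated
/// C string as a Zig slice by scanning for the terminator. No copy is made.
fn cStringSlice(ptr: [*:0]const u8) []const u8 {
    var len: usize = 0;
    while (ptr[len] != 0) len += 1;
    return ptr[0..len];
}

test "trim a C string in place" {
    // A string literal coerces to a sentinel-terminated pointer,
    // standing in for a char* received from C code.
    const from_c: [*:0]const u8 = "  /usr/bin  ";
    const trimmed = std.mem.trim(u8, cStringSlice(from_c), " ");
    try std.testing.expectEqualStrings("/usr/bin", trimmed);
}
```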
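The cost difference can be sketched as follows. This is only an illustration, assuming valid UTF-8 input; `asciiAt` and `utf8OffsetOf` are hypothetical names made up for this example, not functions from the proposal or the standard library. The ASCII lookup is a single index, while the UTF-8 lookup has to walk the bytes from the start.

```zig
const std = @import("std");

/// ASCII: the n-th character is simply the n-th byte, so lookup is O(1).
fn asciiAt(s: []const u8, n: usize) ?u8 {
    if (n >= s.len) return null;
    return s[n];
}

/// UTF-8 (assumed valid): returns the byte offset where the n-th code
/// point starts. This needs a linear scan, because code points are
/// 1 to 4 bytes long.
fn utf8OffsetOf(s: []const u8, n: usize) ?usize {
    var i: usize = 0;
    var seen: usize = 0;
    while (i < s.len) : (i += 1) {
        // Continuation bytes have the form 0b10xxxxxx; any other byte
        // starts a new code point.
        if ((s[i] & 0xC0) != 0x80) {
            if (seen == n) return i;
            seen += 1;
        }
    }
    return null;
}

test "ascii index versus utf8 scan" {
    try std.testing.expectEqual(@as(?u8, 'l'), asciiAt("hello", 2));
    // "h\xc3\xa9llo" is "héllo": the two-byte 'é' pushes the third
    // code point to byte offset 3.
    try std.testing.expectEqual(@as(?usize, 3), utf8OffsetOf("h\xc3\xa9llo", 2));
}
```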
Implementation
Basically it comes down to this: the code for using type independent strings isn't complex. Everything you pass into the functions has to be both a view and an iterator, and your views and iterators HAVE to make the following available for use in functions;
That is for the view, the iterator requires the following;
So an example definition for the new split is;
How to use the new functions in comparison to the old
YOU USED TO CALL IT LIKE;
YOU NOW CALL IT LIKE;
It is implemented as;
This means that overall you DON'T even need to realise that it requires all this extra information. It also means it is easy to build your own views and iterators and hook them up for your own goals, as stated previously. Maybe for an API you need Utf27 or some other weird format; being able to use the string functions on it nicely, without converting, is definitely something we should aim for.
I guess it all comes down to this: the 'logic' for a unicode split/trim/whatever is the SAME regardless of the encoding, so why shouldn't we allow for multiple different versions? It doesn't actually result in any less efficient code (unless merely creating a struct has a cost in zig, though my understanding is that it is the same as in C and is simply offsets, in which case we are talking about a trivial difference).
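To make the "same logic" point concrete, here is a minimal sketch in present-day Zig. It is not the view/iterator design from the proposal: `SplitIterator` and `split` are hypothetical names, and the sketch parameterises only on the code unit type `T`. Splitting on a single code unit is safe here because '/' can never appear inside a UTF-8 multi-byte sequence or a UTF-16 surrogate pair.

```zig
const std = @import("std");

/// Hypothetical generic split iterator (an assumption for illustration).
/// T is the code unit type: u8 for ASCII/UTF-8, u16 for UTF-16.
fn SplitIterator(comptime T: type) type {
    return struct {
        buffer: []const T,
        delimiter: T,
        index: usize = 0,

        const Self = @This();

        /// Returns the next non-empty piece, or null when the input is exhausted.
        pub fn next(self: *Self) ?[]const T {
            // Skip any run of delimiters before the next piece.
            while (self.index < self.buffer.len and self.buffer[self.index] == self.delimiter) {
                self.index += 1;
            }
            if (self.index == self.buffer.len) return null;
            const start = self.index;
            while (self.index < self.buffer.len and self.buffer[self.index] != self.delimiter) {
                self.index += 1;
            }
            return self.buffer[start..self.index];
        }
    };
}

fn split(comptime T: type, buffer: []const T, delimiter: T) SplitIterator(T) {
    return .{ .buffer = buffer, .delimiter = delimiter };
}

test "one split implementation, two code unit types" {
    var it8 = split(u8, "usr/bin", '/');
    try std.testing.expectEqualSlices(u8, "usr", it8.next().?);
    try std.testing.expectEqualSlices(u8, "bin", it8.next().?);
    try std.testing.expect(it8.next() == null);

    // The same code works on u16 code units with no conversion step.
    const wide = [_]u16{ 'a', '/', 'b' };
    var it16 = split(u16, &wide, '/');
    try std.testing.expectEqualSlices(u16, &[_]u16{'a'}, it16.next().?);
    try std.testing.expectEqualSlices(u16, &[_]u16{'b'}, it16.next().?);
    try std.testing.expect(it16.next() == null);
}
```

The iterator is a small struct holding a slice, a delimiter, and an index, so creating it involves no allocation.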
Note:
This DOES NOT include Locale; this is an independent system :). Locale will be a DIFFERENT proposal. Currently I have conflated the two, and I apologise for that confusion. I'm doing this one first as it is much more complete and I have a deeper understanding of it in terms of good solutions. I don't want to take down the PR (and put up one that is just this proposal) due to the amount of good discussion there, so in the next day or two I'll remove locale from the PR (and put it into a separate PR :D).
Hopefully this quite clearly and succinctly indicates why I think it is a good idea :).