
Type Independent String Manipulation #912

Closed
BraedonWooding opened this issue Apr 11, 2018 · 3 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
BraedonWooding (Contributor) commented Apr 11, 2018

That is, string manipulation functions should not care about the encoding of the string passed to them.

Why

Zig focuses on staying out of the programmer's way and on caring about edge cases. Where most languages define a SINGLE string standard (Go uses UTF-8, C# uses UTF-16, for example), I think Zig should be independent of the encoding. There are quite a few reasons for this, detailed throughout the proposal, but they come down to a single core reason: "people need different string types depending on what they are doing".

Examples

Currently in std/os/path.zig there are numerous places that need a Unicode format for Unix that is different from the one for Windows; Windows requires a very specific UTF-16 format. We could maintain everything in UTF-8 and convert as needed, which is okay, but that is not the issue. The issue is: if someone else wanted to interface with Windows using UTF-16, would they also have to go through these manipulations, taking a UTF-16 array from Windows, converting it to UTF-8, doing the manipulations, and converting back? To me that seems convoluted and is a key example of why this kind of system is beneficial.

Next, let's say you are building an embedded system, or perhaps you're interfacing with C code, and you have to handle char* strings. The nice thing about this system is that it can work with char* strings, and just like UTF-8 or ASCII arrays you can split/trim/do whatever you want on them, meaning you don't have to reinvent the wheel just because you are writing cross-language code.

Finally, for another example and the reason why I propose ASCII to be different from Unicode: ASCII is much simpler than Unicode for operations such as slicing and obtaining codepoints, since Unicode has to iterate through the array to find the codepoints for codepoint slicing. So if you are building an application that purely uses ASCII, for efficiency, compatibility, or whatever other reason, why should you suffer the slowdown of Unicode in those cases?
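The cost difference described above can be sketched in Zig (modern syntax, which differs from the 2018-era syntax used elsewhere in this issue; `std.unicode.Utf8View` is real, the helper function is illustrative):

```zig
const std = @import("std");

// UTF-8 codepoint indexing must walk the bytes from the start,
// because codepoints are variable-width; ASCII indexing is just s[n].
fn utf8CodepointAt(s: []const u8, n: usize) !u21 {
    var view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    var i: usize = 0;
    while (it.nextCodepoint()) |cp| : (i += 1) {
        if (i == n) return cp; // O(n) scan to reach index n
    }
    return error.OutOfBounds;
}

test "utf8 indexing walks bytes, ascii does not" {
    const s = "h\u{e9}llo"; // 'é' is two bytes, so byte and codepoint indices diverge
    try std.testing.expectEqual(@as(u21, 0xe9), try utf8CodepointAt(s, 1));
    try std.testing.expectEqual(@as(u8, 'l'), s[3]); // byte index 3 is codepoint index 2
}
```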

Implementation

Basically it comes down to this: the code for type-independent strings isn't complex. Everything you pass into the functions has to be both a view and an iterator, and your views and iterators HAVE to expose the following to be usable in the functions:

pub fn init(s: []const u8) !View;
pub fn initUnchecked(s: []const u8) View;
// Compare if two views are equal
pub fn eql(self: &const View, other: &const View) bool;
pub fn sliceBytes(self: &const View, start: usize, end: usize) []const u8;
// Slice from start to end of raw data
pub fn sliceBytesToEndFrom(self: &const View, start: usize) []const u8;
// Get all raw data
pub fn getBytes(self: &const View) []const u8;
pub fn byteLen(self: &const View) usize;
// Slices codepoints across a range not raw data.
pub fn sliceCodepoint(self: &const View, start: usize, end: usize) ![]const u8;
// Slices codepoints from start to the end of the data, not raw bytes.
pub fn sliceCodepointToEndFrom(self: &const View, start: usize) ![]const u8;
// Returns RAW byte at index
pub fn byteAt(self: &const View, index: usize) u8;
// Obtains a byte from the end i.e. [len - 1 - index]
pub fn byteFromEndAt(self: &const View, index: usize) u8;
// Returns unicode code point not the raw data at the given index
pub fn codePointAt(self: &const View, index: usize) !u32;
pub fn codePointFromEndAt(self: &const View, index: usize) !u32;
pub fn initComptime(comptime s: []const u8) View;
pub fn iterator(s: &const View) Iterator;

That is for the view, the iterator requires the following;

raw: []const u8,
index: usize,

// Reset iterator back to 0
pub fn reset(it: &Iterator);
// Get the next code point rather than raw data
pub fn nextCodepoint(it: &Iterator) ?u32;
// Get the next byte rather than raw data
pub fn nextBytes(it: &Iterator) ?[]const u8;
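A minimal ASCII implementation of this iterator contract might look like the following sketch (modern Zig pointer syntax rather than the 2018 `&Iterator` style; the struct body is illustrative, not the actual PR code):

```zig
const std = @import("std");

pub const AsciiIterator = struct {
    raw: []const u8,
    index: usize,

    // Reset iterator back to 0.
    pub fn reset(it: *AsciiIterator) void {
        it.index = 0;
    }

    // In ASCII every byte is one codepoint, so decoding is a direct read.
    pub fn nextCodepoint(it: *AsciiIterator) ?u32 {
        if (it.index >= it.raw.len) return null;
        defer it.index += 1;
        return it.raw[it.index];
    }

    // The next codepoint as a one-byte slice of the raw data.
    pub fn nextBytes(it: *AsciiIterator) ?[]const u8 {
        if (it.index >= it.raw.len) return null;
        defer it.index += 1;
        return it.raw[it.index .. it.index + 1];
    }
};

test "ascii iterator" {
    var it = AsciiIterator{ .raw = "ab", .index = 0 };
    try std.testing.expectEqual(@as(?u32, 'a'), it.nextCodepoint());
    try std.testing.expectEqualSlices(u8, "b", it.nextBytes().?);
    try std.testing.expectEqual(@as(?u32, null), it.nextCodepoint());
}
```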

KEEP IN MIND: u8 is just one 'type' that can be used; you can interchange it as you wish :). You could change it to char* or u16, since the string functions take the 'baseType' as well. Note: they also often take a codepoint type, as that is important for comparison in some cases.

So an example definition for the new split is;

pub fn t_SplitIt(comptime viewType: type, comptime iteratorType: type, comptime baseType: type, comptime codepointType: type) type;
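How such a comptime-generic split could hang together is sketched below (modern Zig; only the iterator parameter is shown, the split-on-any-byte semantics of the old `mem.split` are assumed, and both `t_SplitIt`'s body and the `ByteIterator` helper are illustrative, not the PR's code):

```zig
const std = @import("std");

// Generic over any iterator type with `raw`/`index` fields and a
// bytewise nextBytes(); splits on any byte appearing in split_bytes.
pub fn t_SplitIt(comptime IteratorType: type) type {
    return struct {
        const Self = @This();
        it: IteratorType,
        split_bytes: []const u8,

        pub fn init(raw: []const u8, split_bytes: []const u8) Self {
            return .{
                .it = IteratorType{ .raw = raw, .index = 0 },
                .split_bytes = split_bytes,
            };
        }

        // Returns the next field, or null once the input is exhausted.
        pub fn next(self: *Self) ?[]const u8 {
            const raw = self.it.raw;
            if (self.it.index > raw.len) return null;
            const start = self.it.index;
            var end = start;
            while (self.it.nextBytes()) |b| {
                if (std.mem.indexOfScalar(u8, self.split_bytes, b[0]) != null)
                    return raw[start..end]; // delimiter hit: emit the field
                end += b.len;
            }
            self.it.index = raw.len + 1; // mark exhausted
            return raw[start..end]; // final field
        }
    };
}

// A tiny iterator satisfying the contract, for demonstration only.
const ByteIterator = struct {
    raw: []const u8,
    index: usize,
    pub fn nextBytes(it: *ByteIterator) ?[]const u8 {
        if (it.index >= it.raw.len) return null;
        defer it.index += 1;
        return it.raw[it.index .. it.index + 1];
    }
};

test "generic split" {
    var it = t_SplitIt(ByteIterator).init("A,B,C", ",");
    try std.testing.expectEqualSlices(u8, "A", it.next().?);
    try std.testing.expectEqualSlices(u8, "B", it.next().?);
    try std.testing.expectEqualSlices(u8, "C", it.next().?);
    try std.testing.expectEqual(@as(?[]const u8, null), it.next());
}
```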

How to use the new functions in comparison to the old

You used to call it like:

mem.split("A, B, C", ",");

You now call it like:

string.utf8Split("A, B, C", ",");
// OR, ascii
string.asciiSplit("A, B, C", ",");
// And so on

It is implemented as;

pub const t_AsciiSplitIt = t_SplitIt(ascii.View, ascii.Iterator, []const u8, u8);

pub fn asciiSplit(a: []const u8, splitBytes: []const u8) !t_AsciiSplitIt {
    return try t_AsciiSplitIt.init(a, splitBytes);
}

This means that overall you DON'T even need to realise that it requires all this extra information. It also means it is easy to build your own views and iterators and hook them up for your own goals, as stated previously. Maybe for an API you need Utf27 or some other weird format; being able to use the string functions on it without converting is definitely a good thing we should aim for.

I guess it all comes down to this: the 'logic' for a split/trim/whatever is the SAME regardless of the encoding, so why shouldn't we allow multiple different versions? It doesn't actually result in any less efficient code (unless merely instantiating a struct requires effort in Zig, though I believe it works like C, with simple offsets, in which case we are talking trivial amounts of difference).

Note:

This does NOT include locale; that is an independent system :). Locale will be a DIFFERENT proposal. Currently I have conflated the two, and I apologise for the confusion; I'm doing this one first as it is much more complete and I have a deeper understanding of it in terms of good solutions. I don't want to take down the PR (and put up one that is just this proposal) due to the amount of good discussion there, so in the next day or two I'll remove locale from the PR (and put it into a separate PR :D).

Hopefully this quite clearly and succinctly indicates why I think it is a good idea :).

@andrewrk andrewrk added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Apr 12, 2018
@andrewrk andrewrk added this to the 0.4.0 milestone Apr 12, 2018
andrewrk (Member) commented Apr 13, 2018

I think it makes sense to have string manipulation functions for []u21 - fully decoded unicode strings. And it makes sense to have byte manipulation functions for []u8. Bytes, not ASCII. I'm not convinced that there needs to be this View abstraction.
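What the []u21 alternative looks like in practice can be sketched as follows (modern Zig; decode once up front, after which every operation is plain slice manipulation):

```zig
const std = @import("std");

test "decode utf-8 into []u21 once, then index freely" {
    const s = "h\u{e9}llo";
    var buf: [16]u21 = undefined;
    var n: usize = 0;
    var it = (try std.unicode.Utf8View.init(s)).iterator();
    while (it.nextCodepoint()) |cp| : (n += 1) buf[n] = cp;
    const decoded: []const u21 = buf[0..n];

    // After the one-time decode, codepoint indexing and slicing are O(1).
    try std.testing.expectEqual(@as(usize, 5), decoded.len);
    try std.testing.expectEqual(@as(u21, 0xe9), decoded[1]);
}
```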

the issue is: if someone else wanted to interface with Windows using UTF-16, would they also have to go through these manipulations, taking a UTF-16 array from Windows, converting it to UTF-8, doing the manipulations, and converting back? To me that seems convoluted and is a key example of why this kind of system is beneficial.

This is the fundamental premise of this proposal, but I am not convinced that it is sound.

The goal of cross platform abstraction is to encapsulate the platform specific differences and then share as much higher level code - and assumptions - as possible. You're proposing to make the default string encoding, instead of UTF-8, generic depending on the target. This is certainly more convoluted from an API usage perspective.

BraedonWooding (Contributor, Author) commented:

This is hard to answer because I guess we disagree about this exact point, and it is kind of the crux of my viewpoint: if it helps even a small percentage of users but doesn't require any more effort from the others, then why not, i.e. the whole "edge cases matter" thing. For example, whenever you interface with C/C++ you will most likely use something like char* or wchar_t* (std::string is also common, but most C++ APIs include a 'C' way to do things as well). So currently it is quite annoying to interface with them, get their input, and then manipulate it; a key example would be ICU, where you probably would want many of these utilities. I see the following as more ugly:

var readBytes: []const u16 = win.readBytes();
var winBytes: []const u8 = Win16.ConvertToUtf8(readBytes);
var it = mem.split(winBytes, "/");
while (it.next()) |next| {
    var backTo16: []const u16 = Win16.ConvertToUtf16(next);
    win.writeBytes(backTo16);
    // Whatever
}

than just:

var readBytes: []const u16 = win.readBytes();
var it = string.utf16Split(readBytes, "/");
while (it.nextBytes()) |next| {
    win.writeBytes(next);
    // Whatever
}

You could still use the UTF-8 split if you wanted and convert, but now you have a CHOICE.

You're proposing to make the default string encoding, instead of UTF-8, generic depending on the target

No, I would say I'm making the default string encoding UTF-8, BUT letting others do the same operations on UTF-16 and so on :).

This is certainly more convoluted from an API usage perspective.

Keep in mind you still call it like string.utf8Join(MyAllocator, ", ", "a", "b", "c"), so instead of saying mem.join you now say string.utf8Join; I fail to see how this is more convoluted :).

I'm not convinced that there needs to be this View abstraction.

The user will rarely ever use Views, UNLESS they add their own new type, such as, let's say, Utf27, in which case they would add a view and an iterator and probably add a call to each of the std.string functions with the required types.

Wrapup

I guess it comes down to this: this proposal is all about taking away the annoyance of dealing with a single string type in a program that handles many. You still have your CORE standard, which is UTF-8, since if you really want to do things like writing to files you'll probably want UTF-8. But if you don't, and you just want to read from a Windows u16 buffer, do some splitting, then put the result somewhere else still as UTF-16, then you can, without that pushback.

I guess I fail to see where this would be viewed as 'negative' for the standard library as it doesn't perpetuate anything like conflating the idea of a string.

BraedonWooding (Contributor, Author) commented:

Closing, as the PR decision was made.

@andrewrk andrewrk modified the milestones: 0.4.0, 0.3.0 Sep 28, 2018