Utf8 Encoding from Codepoint to Bytes #954

Merged
thejoshwolfe merged 2 commits into ziglang:master on Apr 29, 2018

3 participants
@BraedonWooding
Contributor

BraedonWooding commented Apr 25, 2018

Just the reverse of decode :). It follows the standard way of doing it, and lengths are u3 to match the style of decode. I used a caller-supplied buffer rather than an allocator because, if this is used, I expect it to be called repeatedly, once for each codepoint in a 'sentence' or 'input'. In that case a buffer is preferable: you can either allocate one large buffer and do something like:

var buffer: [MAX_BUF_SIZE]u8 = undefined;
var i: usize = 0;
for (codepoints) |c| {
    // utf8Encode returns the number of bytes written, or an error
    i += try utf8Encode(c, buffer[i..]);
}

Or you can just overwrite the buffer with each character if you don't care about the previous ones. In either case an allocator is the 'worst' way to go (far too many allocations for any serious workload), and you can always allocate a buffer whose size is the result of utf8CodepointSequenceLength if you really want to.
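The buffer pattern described above can be sketched as a self-contained function. This is only an illustration: it assumes a recent Zig compiler (the cast and declaration syntax has changed since this thread), it calls `std.unicode.utf8Encode` as it exists in the current standard library, and `encodeAll` is a hypothetical helper name that is not part of this PR.

```zig
const std = @import("std");

// Hypothetical helper (not part of the PR): encodes each code point into
// `buffer`, back to back, and returns the filled prefix of the buffer.
// The caller must supply enough room: at most 4 bytes per code point.
fn encodeAll(codepoints: []const u21, buffer: []u8) ![]u8 {
    var i: usize = 0;
    for (codepoints) |c| {
        // utf8Encode returns how many bytes it wrote (1 to 4).
        i += try std.unicode.utf8Encode(c, buffer[i..]);
    }
    return buffer[0..i];
}
```

No allocation happens anywhere in the loop; the only storage is the buffer the caller owns, which is the point being made about avoiding an allocator.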

Also, I generally 'overcomment', showing how I arrive at certain math formulas, including derivations and intermediate steps (in this case I show that a constant is 64^2, then show its value). If this commenting style is too much for the std, I can remove some of the extraneous comments, though I do think they hold value in that you don't have to think about where the values came from.

1 => out[0] = u8(c), // Can just do 0 + codepoint for initial range
2 => {
// 64 to convert the codepoint into its segments
out[0] = u8(0b11000000 + c / 64);

@tiehuis

tiehuis Apr 25, 2018

Member

I think this would be a bit clearer if you used bitwise shifts and masks here instead.
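As a rough illustration of this suggestion: because 64 is 2^6, dividing by 64 is the same as shifting right by 6, and taking the remainder modulo 64 is the same as masking the low six bits. The sketch below writes the two-byte case both ways. The function names are hypothetical, the syntax assumes a recent Zig compiler, and neither function is the code from the PR.

```zig
const std = @import("std");

// Two-byte case (U+0080..U+07FF) written with division and remainder.
fn twoByteArithmetic(c: u21) [2]u8 {
    var out: [2]u8 = undefined;
    out[0] = @intCast(0b11000000 + c / 64);
    out[1] = @intCast(0b10000000 + c % 64);
    return out;
}

// The same two bytes written with a shift and a mask.
fn twoByteBitwise(c: u21) [2]u8 {
    var out: [2]u8 = undefined;
    out[0] = @intCast(0b11000000 | (c >> 6));
    out[1] = @intCast(0b10000000 | (c & 0b111111));
    return out;
}
```

Both functions are only valid for code points in the two-byte range; outside it the casts to `u8` can overflow.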

@BraedonWooding

BraedonWooding Apr 25, 2018

Contributor

Sure, I'll update it to use shifts and masks; I had a feeling it would look worse but we'll see!

@BraedonWooding

BraedonWooding Apr 25, 2018

Contributor

I still like the long binary literals at the beginning of each 'out =', so I won't shorten them to hexadecimal; they just make it easier to see what the code is doing. I'll replace the arithmetic with shifts and bitwise operations, though.
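To show how that combination might read, here is a minimal sketch of an encoder that keeps the binary lead-byte prefixes and uses shifts and masks for the payload bits. It is not the merged implementation: `encodeSketch` is a hypothetical name, the syntax assumes a recent Zig compiler, the caller is assumed to pass a buffer of at least 4 bytes, and surrogate code points are deliberately not rejected.

```zig
const std = @import("std");

// Hypothetical sketch, not the merged code. Returns the number of bytes
// written. Assumes out.len >= 4 and does not reject surrogates.
fn encodeSketch(c: u21, out: []u8) error{CodepointTooLarge}!u3 {
    if (c < 0x80) {
        out[0] = @intCast(c); // 0xxxxxxx
        return 1;
    }
    if (c < 0x800) {
        out[0] = @intCast(0b11000000 | (c >> 6)); // 110xxxxx
        out[1] = @intCast(0b10000000 | (c & 0b111111)); // 10xxxxxx
        return 2;
    }
    if (c < 0x10000) {
        out[0] = @intCast(0b11100000 | (c >> 12)); // 1110xxxx
        out[1] = @intCast(0b10000000 | ((c >> 6) & 0b111111));
        out[2] = @intCast(0b10000000 | (c & 0b111111));
        return 3;
    }
    if (c < 0x110000) {
        out[0] = @intCast(0b11110000 | (c >> 18)); // 11110xxx
        out[1] = @intCast(0b10000000 | ((c >> 12) & 0b111111));
        out[2] = @intCast(0b10000000 | ((c >> 6) & 0b111111));
        out[3] = @intCast(0b10000000 | (c & 0b111111));
        return 4;
    }
    // u21 can hold values above U+10FFFF, which UTF-8 cannot encode.
    return error.CodepointTooLarge;
}
```

With the prefixes in binary, the `110`, `1110` and `11110` markers line up visually with the bit patterns in the comments, which is the readability argument made above.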

@thejoshwolfe thejoshwolfe merged commit 8c567d8 into ziglang:master Apr 29, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed