
std.ascii: general improvements #11629

Closed
wants to merge 6 commits into from

Conversation

wooster0
Contributor

@wooster0 wooster0 commented May 10, 2022

This PR:

  • Improves existing documentation and adds more.
  • Adds more tests.
  • Shows off Zig's comptime feature by generating all the tables at compile time. This is much easier to read and maintain: the hard-coded tables looked cool, but it was unclear what they contained or how they worked because there wasn't any documentation. Now anyone should be able to add new entries with far less effort. There are now naive functions that do the same thing as the exposed public equivalents; they are used to generate the tables, and they also serve as a fallback when the table isn't available. I thought: why not use those naive functions in place of the table implementation in ReleaseSmall mode? Doing so saves some 256 bytes.
  • Gives isAlNum its own dedicated table entry and removes isGraph's table entry. I found this to be faster by one or a few instructions.

@ifreund ifreund added breaking Implementing this issue could cause existing code to no longer compile or have different behavior. standard library This issue involves writing Zig code for the standard library. labels May 10, 2022
@wooster0
Contributor Author

wooster0 commented May 11, 2022

I really wish we could use u7 for this. It's the actual range that std.ascii works with and it'd fit perfectly. But currently u8 is used everywhere.
There is this comment in the original:

// I could have taken only a u7 to make this clear, but it would be slower
// (and 128 bytes is not much to pay).
// It is my opinion that encodings other than UTF-8 should not be supported.
//
// Also does not handle Unicode character classes.

Why would it be slower? I think we can use the type system to make this no slower.
I wanted to do some testing on this, but sadly you can't even pass a string like "hello" as a []const u7 function parameter, so language support is missing here.
Wouldn't it be great if the following worked and we had automatic coercion?

pub fn lowerString(output: []u7, ascii_string: []const u7) []u7 {
    std.debug.assert(output.len >= ascii_string.len);
    for (ascii_string) |c, i| {
        output[i] = toLower(c);
    }
    return output[0..ascii_string.len];
}

pub fn main() void {
    var buf: [1024]u7 = undefined;
    _ = lowerString(&buf, "HELLO WORLD"); // The string would coerce because all characters are < 128
}

Any opinions? One advantage of using u7 would be that the combined table could always be 128 bytes smaller than it is now: the upper 128 zero bytes are not optimized out. This could also be solved by introducing a char < 128 check in CombinedTable.contains, but that might actually be slower. In that case u7 feels like the better way.

@ifreund
Member

ifreund commented May 11, 2022

The main problem with using u7 is that in the current stage1 Zig compiler it is undefined behavior to cast between []const u8 and []const u7 and then read the memory as a type differing from the one used for the last write. This is because we use LLVM's arbitrary-bit-width integers and therefore inherit their semantics. In particular, see the documentation of the load and store LLVM IR instructions here: https://llvm.org/docs/LangRef.html

It is an open question whether/how we will change these semantics for the zig language spec, possibly enabling usage of u7 here. For now I think we have no choice but to stick to u8, at least where slices are concerned.

Note that there's a similar issue regarding the standard library's usage of u21 for unicode codepoints. It works fine as long as one is only concerned with a single codepoint but if UTF-32 text is required to be handled then a slice of u32 must be used instead.

@wooster0
Contributor Author

I see, yes, then it is probably better to leave it as-is for now. Maybe it can be changed at some point.

@jayschwa
Contributor

jayschwa commented May 13, 2022

I think you should split this into separate pull requests. One for API changes and one for internal lookup table improvements. While I am in favor of expanding abbreviated names, I think removing functions deserves a more focused pull request. The existing ones are related to C's character classification, and removing them should not be taken lightly.

@jayschwa
Contributor

It is an open question whether/how we will change these semantics for the zig language spec, possibly enabling usage of u7 here. For now I think we have no choice but to stick to u8, at least where slices are concerned.

Note that there's a similar issue regarding the standard library's usage of u21 for unicode codepoints. It works fine as long as one is only concerned with a single codepoint but if UTF-32 text is required to be handled then a slice of u32 must be used instead.

Would it be reasonable to do pub const Char = u8; and use that for the API's parameter types? Then it might be easier to test or implement a hypothetical switch to u7 in the future.

@wooster0
Contributor Author

wooster0 commented May 13, 2022

I think you should split this into separate pull requests.

I would like to get a word from a team member on whether that's a requirement. I personally don't agree, as this PR isn't necessarily what the final API will look like anyway; people can always follow up with PRs and issues to discuss and improve it further.
As for clarity on what's changing, I've outlined all changes in the PR description, and you can also see them at the end of the ascii.zig file in the PR itself (where the deprecations happen).

Would it be reasonable to do pub const Char = u8; and use that for the API's parameter types?

If you want, you can follow up with a PR that does that. Generally, running s/u8/u7/ over the file works very well.

@jayschwa
Contributor

I personally don't agree as this PR isn't necessarily what the final API will look like anyway. People can always follow up with PRs and issues to discuss and improve it further.

It's true the API is not set in stone yet, but C character class support was a deliberate addition, and I think more consideration should be given before starting to undo that. Meanwhile, the comptime table generation seems less controversial and could probably merge sooner if it were separate.

@ifreund
Member

ifreund commented May 13, 2022

I think you should split this into separate pull requests.

I would like to get a word from a team member here if that's a requirement.

Breaking up changes into smaller patches makes them far easier to review and merge. In particular, keeping implementation-detail improvements separate from API bikeshedding avoids the former never being merged due to disagreement about the API or hesitation over the API breakage.

I personally don't agree as this PR isn't necessarily what the final API will look like anyway.

Ideally, we should always shoot for an API that we would be OK with maintaining indefinitely. Why would we merge these breaking changes now if we'll need to break everything again down the line?

@wooster0
Contributor Author

I already spent a lot of time on this PR. How about I reorganize the commits and split the table changes, the API changes, and whatever else into separate commits? I agree that commit 05d8422 isn't very organized. Would that be fine?

@wooster0
Contributor Author

Or should I instead open an issue about the API changes first? Then we can discuss everything there, and once we've agreed on everything we can move on with this.

@marler8997
Contributor

Apologies for bikeshedding, but this one feels important to me. I much prefer IgnoreCase over Insensitive, as it makes the meaning clear without having to look through documentation for clarification.

@wooster0
Contributor Author

wooster0 commented May 13, 2022

I don't have a strong opinion on that, but I think I might also prefer IgnoreCase. I was simply doing what this TODO told me to do: https://github.com/r00ster91/zig/blob/master/lib/std/ascii.zig#L419

Anyway, this is out of my hands now because apparently this PR was unreviewable so I removed all API changes. Could you take a look again?

@Vexu
Member

Vexu commented May 13, 2022

I was simply doing what this TODO told me to do: https://github.com/r00ster91/zig/blob/master/lib/std/ascii.zig#L419

See 490654c#r47643278

@wooster0
Contributor Author

wooster0 commented May 17, 2022

I think people could easily be misled by the outdated description and the "breaking" label on this PR.
I've updated the description; this PR does not make any breaking changes.
Could a team member kindly remove the "breaking" label and/or review this PR?

@jayschwa jayschwa removed the breaking Implementing this issue could cause existing code to no longer compile or have different behavior. label May 17, 2022
@wooster0 wooster0 changed the title std.ascii: overhaul and simplifications std.ascii: general improvements May 17, 2022
@ifreund
Member

ifreund commented May 25, 2022

  • It makes some functions like isDigit no longer use a table - char >= '0' and char <= '9' is actually faster, unless I'm doing my research wrong: https://zig.godbolt.org/z/sWYYr5WcK

How are you determining that it's faster? Counting the instructions generated by the compiler?

The version using a table is branchless while the other version uses a branch. I would expect the table to be faster, or at least more consistent, in benchmarks.

@wooster0
Contributor Author

@ifreund

const std = @import("std");
const debug = std.debug;
const time = std.time;

const ascii = @import("ascii.zig");

pub fn main() !void {
    var before = time.nanoTimestamp();

    const file = try std.fs.openFileAbsolute("path_to_file_with_byte", .{});

    const c = try file.reader().readByte();

    var index: usize = 0;
    while (index < 10000) : (index += 1)
        std.mem.doNotOptimizeAway(ascii.isDigit(c));

    debug.print("{}\n", .{time.nanoTimestamp() - before});

    before = time.nanoTimestamp();

    try file.seekTo(0);
    const c2 = try file.reader().readByte();

    index = 0;
    while (index < 10000) : (index += 1)
        std.mem.doNotOptimizeAway(ascii.isDigitTable(c2));

    debug.print("{}\n", .{time.nanoTimestamp() - before});
}
$ zig run -O ReleaseFast x.zig
24861
7054

Wow, I think you're right! Good catch.

@wooster0
Contributor Author

wooster0 commented May 25, 2022

@ifreund Now, while trying to fix this, I encountered this issue:

    const Index = enum(u3) {
        alphabetic,
        hexadecimal,
        space,
        digit,
        lower,
        upper,
        punct,

        control,
        alphanumeric,

We have too many variants. With alphanumeric we have one variant too much. Do you know what to do about this? Should we actually add a second table? It can't become u4 right?
For now I will remove the variants added in this PR and leave them as-is.

@wooster0
Contributor Author

We might even want to allow the user to choose what to include in the table, so we would provide some kind of customization for their specific usage.

@jayschwa
Contributor

jayschwa commented May 25, 2022

We might even want to allow the user to choose what to include in the table, so we would provide some kind of customization for their specific usage.

I may be mistaken, but having separate tables per function might have that desired effect. Unused functions and tables would be eliminated from the final executable, and no need for explicit configuration.

Example:

const fooTable = [256]u1{ ... };
pub fn isFoo(c: u8) bool { ... }

const barTable = [256]u1{ ... };
pub fn isBar(c: u8) bool { ... }

const bazTable = [256]u1{ ... };
pub fn isBaz(c: u8) bool { ... }

If user code only calls isBar, then fooTable and bazTable will be eliminated.

This is moving back towards what is already merged into the master branch, but the tables could still be built at comptime from the "naive" functions. That still seems like a clear improvement over the current hard-coded tables in master.

@wooster0
Contributor Author

I may be mistaken, but having separate tables per function might have that desired effect. Unused functions and tables would be eliminated from the final executable, and no need for explicit configuration.

You're right. Maybe we can just do something like that and not have to tightly pack it like this. It would allow any function to use a fast lookup table. I would like to see how well that works, but I won't do it in this PR.

@ifreund does it look mergeable?

@jayschwa
Contributor

jayschwa commented May 25, 2022

I may be mistaken, but having separate tables per function might have that desired effect. Unused functions and tables would be eliminated from the final executable, and no need for explicit configuration.

One downside of [256]u1 is that it takes up 256 bytes when it could theoretically be packed into 32 bytes. I'm not sure if future Zig will be capable of doing that. PackedIntArray appears to be the current way of achieving that outcome.

@wooster0
Contributor Author

wooster0 commented May 29, 2022

I just remembered #8419 and that we wanted to remove isPunct anyway, so we could put something more important in its place and add that to the table.
But before we continue with that, I'd like to wait until this PR is merged.

@wooster0 wooster0 mentioned this pull request Aug 12, 2022
@topolarity
Contributor

The version using a table is branchless while the other version uses a branch. I would expect the table to be faster, or at least more consistent, in benchmarks.

Small drive-by correction: both versions are actually branchless. The table-free version generates a cmp but doesn't actually branch on it. The benchmark above is also misleading because it includes the file opening/reading in the measurements.

Given that the lookup table is dependent on cache performance, I think the table-free version is preferable.

@wooster0
Contributor Author

wooster0 commented Aug 15, 2022

After some discussion, @topolarity and I came to the conclusion that it's probably better to remove the LUT (look-up table) altogether: relying on the cache like that isn't great, some of the benchmarks above were definitely wrong, and the naive version is actually faster. For now I've started with #12448 to supersede this PR and get the renamings and deprecations in first; removing the LUT can come later.

@wooster0
Contributor Author

wooster0 commented Sep 1, 2022

To avoid further confusion, I will close this PR before its superseder #12448 is merged. Everything this PR does is done by #12448, minus the LUT changes, because the LUT should probably be removed in a future PR.

@wooster0 wooster0 closed this Sep 1, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Oct 16, 2022
This makes it so that we no longer use a LUT (look-up table):
* The code is much simpler and easier to understand now.
* Using a LUT means we rely on a warm cache.
  Relying on cache performance results in inconsistent performance, and in many cases codegen will be worse.
  Also, as @topolarity once pointed out, in some cases where it seems like the code may branch, it actually doesn't:
  ziglang#11629 (comment)
* Other languages' standard libraries don't do this either.
  Just for fun, I wanted to see what other languages' codegen looks like compared to ours now:
  https://rust.godbolt.org/z/Te4ax9Edf, https://zig.godbolt.org/z/nTbYedWKv
  So we are pretty much on par with or better than other languages now.
wooster0 added a commit to wooster0/zig that referenced this pull request Oct 16, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Oct 29, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Oct 29, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Oct 31, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Nov 1, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Nov 5, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Nov 18, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Nov 18, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Nov 19, 2022
wooster0 added a commit to wooster0/zig that referenced this pull request Dec 9, 2022
kcbanner pushed a commit to kcbanner/zig that referenced this pull request Dec 10, 2022
Labels
standard library This issue involves writing Zig code for the standard library.
9 participants