Proposal: Number literal separators #504

Closed
tiehuis opened this issue Sep 29, 2017 · 20 comments
Labels
accepted This proposal is planned. contributor friendly This issue is limited in scope and/or knowledge of Zig internals. proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.

@tiehuis
Member

tiehuis commented Sep 29, 2017

This is found in many other languages, aimed at making longer literals easier to read at a glance by grouping together logical units within numbers. This is especially useful for the longer 128-bit and beyond literals that are available in zig.

I propose allowing a _ separator anywhere in a number literal, as this is the simplest rule to understand. Numeric literals are parsed into values as if the separators were not present.

Examples:

const a = 0x1234_2839_1083_1928;
const b = 0x123_190.109_038_018p102;
const c = 0_x0123; // Not allowed, cannot insert separator on radix prefix
const d = _1238; // This is parsed as an identifier
const e = 9_________123123; // Multiple separators are allowed in sequence.

A more in-depth reference of other implementations can be found in the javascript proposal.

@tiehuis tiehuis added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Sep 29, 2017
tiehuis added a commit that referenced this issue Sep 29, 2017
@tiehuis
Member Author

tiehuis commented Sep 29, 2017

Pushed an initial implementation to a new branch here.

@PavelVozenilek

A note: it may be useful to allow other common numeral separators - only per project, not universally.

For example, Indian numbers use comma:

3,00,00,000

https://en.wikipedia.org/wiki/Indian_numbering_system


Trailing underscores could be allowed, for situations like:

arr[0] = 73__;
arr[1] = 8655;
arr[2] = 1___;
arr[3] = 0___;
arr[4] = 12__;
arr[5] = 987_;

which gives a hint about expected max range for a group of numbers.

Or for alignment:

arr[8_] = 1;
arr[9_] = 2;
arr[10] = 1;
arr[11] = 2;

@tiehuis
Member Author

tiehuis commented Sep 29, 2017

Not sure if there would be much benefit for customisable separators. These are only really for visual grouping and not so much for accurate numeric localization.

The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.

@PavelVozenilek

PavelVozenilek commented Sep 29, 2017

@tiehuis: a program with a lot of hardcoded numbers (e.g. ballistic tables) may have better readability and a higher chance of catching typos, thanks to the familiar style.

But this is not a feature for everyone, and if ever implemented, it should allow ad-hoc project customisation. I imagine wild things like the ability to avoid repeated numbers:

...
    80482.23
    ....3.23
    ....4.93
    80493.22
...

I have a hope that Zig's metaprogramming will enable these "tricks".


The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.

Great.

@andrewrk
Member

andrewrk commented Sep 29, 2017

I think this has pros and cons. Reading 1_000_000 is a little better than reading 1000000. The only reason I hesitate to merge this right away is that 1______0__00__0___0________0 is much worse than 1000000 and now it becomes possible to have working, compiling code that looks like that. Even though 1000000 could look a little better, it's at least reasonable, and the only way you can write that number.

@andrewrk
Member

Another thing to think about is that a bare number literal is a math construct. 1000000 means the same thing in every region. However once you start introducing separators, regional differences creep into code. Some people might want 100_00_00. Maybe it's better to avoid that whole class of problems.

@thejoshwolfe
Sponsor Contributor

thejoshwolfe commented Sep 29, 2017

Here's my alternative proposal that uses status quo:

To compete with this Java:

long hexBytes = 0xFF_EC_DE_5E;
long hexWords = 0xCAFE_F00D;
long maxLong = 0x7fff_ffff_ffff_ffffL;
byte nybbles = 0b0010_0101;
long bytes = 0b11010010_01101001_10010100_10010010;

Here's the Zig:

//                  ++--++--
const hex_in_8s = 0xFFECDE5E;

//                   ++++----
const hex_in_16s = 0xCAFEF00D;

//                /--\/--\/--\/--\
const max_s64 = 0x7fffffffffffffff;

//                hhhhllll
const nybbles = 0b00100101;

//              33333333222222221111111100000000
const bytes = 0b11010010011010011001010010010010;

Takes up extra space, but the free-form nature of comments means you can write whatever you want there, which is arguably more powerful than only being able to group digits.

I admit the _ grouping looks nicer in some cases. But I'm not sure that's a compelling reason to make the language more complicated.

I think my biggest objection to this proposal is that it introduces language complexity in the form of syntactic sugar without encouraging any different semantics. This proposal enables the subjective concept of making long literals easier to read by grouping digits in some way.

Consider that there are even more ways to write a literal in Zig that allow even more expression of intent than this proposal. For example:

const max_s64 = (1 << 63) - 1;
const bytes =
    (0b11010010 << 24) |
    (0b01101001 << 16) |
    (0b10010100 << 8) |
    (0b10010010 << 0);
// see https://github.com/gcc-mirror/gcc/blob/61eae75c6230c7df9fa3e935b2efadda61667c5f/libiberty/crc32.c#L70
const crc32_table: [256]u32 = comptime generate_crc32_table();

@PavelVozenilek

1______0__00__0___0________0

I have yet to see people doing things like this outside IOCCC.

OTOH they go to great lengths to create helpful visual artifacts in the code, like column alignment. Technical authors do the same with numeric tables or math heavy texts.

Why not make it a per-project option? If someone fears it, they can switch it off.

@andrewrk andrewrk added this to the 0.2.0 milestone Sep 29, 2017
@tiehuis
Member Author

tiehuis commented Sep 29, 2017

@andrewrk

Instead of allowing a separator everywhere, we could be more restrictive and only allow single separators between digits. This actually seems to be pretty normal in other languages. Ada, C++, Ruby and Julia use this method.

Regarding different region details. If there are implicit semantics behind the meaning of a number literal, separators actually may help convey to a reader that there are some implied extra details. Of course if they are just separating something without any specific meaning then that is a valid concern.

@thejoshwolfe

Valid alternative.

The main draws I see over just comments are two. First, a standardized way of doing this is a bonus and means we don't get different competing styles to represent the same thing (only a minor point). Second, because literal separators are much easier to insert (one character vs. annotating an entire line), they are probably more likely to be used than a comment-based approach, helping code readability. It also allows one to leave comments for more important details like why the particular value may have been chosen, for example.

@andrewrk
Member

andrewrk commented Dec 8, 2017

@tiehuis I really appreciate the writeup, and especially the fact that you went off and coded it. Your arguments are reasonable. But I'm going to have to go with keeping the language small and only 1 way to do things.

@momumi
Contributor

momumi commented Dec 26, 2019

But I'm going to have to go with keeping the language small and only 1 way to do things.

If you use that reasoning, I'd personally drop the 0o and 0b prefixes as well.

The only use case I've ever seen for octal is in unix file permissions which is something you could easily handle using constants.

Anything you can represent in binary, you can just represent in hexadecimal. For example compare 0xff == 0b11111111, or 0x8000 == 0b1000000000000000. Personally, binary literals are very hard to read without the _ separator, and even C rejected binary literals due to lack of precedent and insufficient utility. cf. 6.4.4.1

@scottjmaddox

Just an FYI: this can be implemented with pretty tiny changes to the syntax and lexer. However, it would require an extra sentence to explain it in the documentation, and it does mean there are more ways to write equivalent integer literals than the already existing decimal, hex, octal, and binary literals. Personally, I think it would be worth it.

The grammar changes from this:

FLOAT
    <- "0x" hex+   "." hex+   ([pP] [-+]? hex+)?   skip
     /      [0-9]+ "." [0-9]+ ([eE] [-+]? [0-9]+)? skip
     / "0x" hex+   "."? [pP] [-+]? hex+   skip
     /      [0-9]+ "."? [eE] [-+]? [0-9]+ skip
INTEGER
    <- "0b" [01]+  skip
     / "0o" [0-7]+ skip
     / "0x" hex+   skip
     /      [0-9]+ skip

to this:

hex_ <- [0-9a-fA-F_]
FLOAT
    <- "0x" hex_+   "." hex_+   ([pP] [-+]? hex_+)?   skip
     /      [0-9] [0-9_]* "." [0-9_]+ ([eE] [-+]? [0-9_]+)? skip
     / "0x" hex_+   "."? [pP] [-+]? hex_+   skip
     /      [0-9] [0-9_]* "."? [eE] [-+]? [0-9_]+ skip
INTEGER
    <- "0b" [01_]+  skip
     / "0o" [0-7_]+ skip
     / "0x" hex_+   skip
     /      [0-9] [0-9_]* skip

And the lexer (or parser, depending on implementation) just needs to skip the underscores when evaluating the number.
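To make the "skip the underscores when evaluating" step concrete, here is a minimal sketch for integer literals only. The function name `evalIntegerLiteral` and the error name `InvalidLiteral` are assumptions made for illustration; this is not the Zig tokenizer's code. It mirrors the permissive grammar above, so a prefixed literal made only of underscores (for example `0x_`) evaluates to 0 here.

```zig
const std = @import("std");

/// Hypothetical sketch: pick the base from the prefix, then fold in each
/// digit and skip every `_`.
fn evalIntegerLiteral(token: []const u8) !u64 {
    const base: u64 = if (token.len >= 2 and token[0] == '0') switch (token[1]) {
        'b' => 2,
        'o' => 8,
        'x' => 16,
        else => 10,
    } else 10;
    const digits = if (base == 10) token else token[2..];

    if (digits.len == 0) return error.InvalidLiteral;
    // The decimal alternative starts with a plain digit, so a leading `_`
    // means the token is not a number.
    if (base == 10 and digits[0] == '_') return error.InvalidLiteral;

    var value: u64 = 0;
    for (digits) |c| {
        if (c == '_') continue; // separators carry no value
        const d: u64 = switch (c) {
            '0'...'9' => c - '0',
            'a'...'f' => c - 'a' + 10,
            'A'...'F' => c - 'A' + 10,
            else => return error.InvalidLiteral,
        };
        if (d >= base) return error.InvalidLiteral;
        // Checked arithmetic so an oversized literal reports an error.
        value = try std.math.add(u64, try std.math.mul(u64, value, base), d);
    }
    return value;
}

test "underscores are skipped during evaluation" {
    try std.testing.expectEqual(@as(u64, 1000000), try evalIntegerLiteral("1_000_000"));
    try std.testing.expectEqual(@as(u64, 0x12345678), try evalIntegerLiteral("0x1234_5678"));
    try std.testing.expectError(error.InvalidLiteral, evalIntegerLiteral("_1238"));
}
```

Float literals would need the same skipping applied to the fraction and exponent parts, which this sketch leaves out.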

@momumi
Contributor

momumi commented Jan 19, 2020

@andrewrk could you consider reopening this?

After this issue was closed in 2017, almost every mainstream language has come to support this feature. If zig's goal is to replace C and become the new lingua franca, it makes sense to adopt the syntax that other languages are using.

I've compiled an extensive list of languages that support _ as a digit separator:

@pixelherodev
Contributor

pixelherodev commented Jan 19, 2020

I'd just like to note that C isn't on that list.

One of the things that differentiates C from most languages is, IMO, its simplicity. While this proposal is, itself, not complex, I find that languages aren't generally brought down by a few major changes, but by many minor ones.

Imagine if a dozen similar changes to the grammar were made. Each one, on its own, is relatively benign; together, they remove everything that makes Zig what it is. If Zig were to adopt every minor change that "every mainstream" language supports, there wouldn't really be a point to Zig at all.

@momumi
Contributor

momumi commented Jan 19, 2020

@pixelherodev Also note that C doesn't have binary literals 0b. Most of these languages added binary literals and _ separators at the same time. People have longer response times when counting more than 4 objects and binary literals have a large number of elements, so they are hard for people to parse without a visual separator. It's hard to tell the difference between 0b11111111 and 0b111111111. Using visual separators makes the code much easier to read: 0b1111_1111 and 0b1_1111_1111.
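As a small check of the literals in this comment, the grouped binary forms can be compared against their hex spellings. This sketch assumes a Zig version in which the separator syntax from this proposal is available.

```zig
const std = @import("std");

test "grouped binary literals match their hex spellings" {
    // Eight ones and nine ones, grouped in fours from the right.
    try std.testing.expect(0b1111_1111 == 0xFF);
    try std.testing.expect(0b1_1111_1111 == 0x1FF);
    // The ungrouped forms differ by a single digit that is easy to miss.
    try std.testing.expect(0b11111111 != 0b111111111);
}
```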

@pixelherodev
Contributor

That's true, but it's also still possible to write out, say, 0xFF or 0x1FF, is it not? If you're writing out large binary strings manually, maybe you should switch to hex.

Or, if numerical separators are important, here's an alternate proposal: a comptime function in zag (which, in case you haven't come across me using that term elsewhere, is what I've started calling the Zig standard library) which takes a string literal such as "393_219_293_192", removes the separators, and parses the result as an integer. This also has the advantage of allowing a single function that supports every base by simply passing the base on to parseInt in std.fmt.

Usage:

const a = std.fmt.parseSeparatedInt("1f3a_3904_a9ca_299c", 16);

This leaves the grammar as is, provides most (albeit not all) of the advantages of implementing it as a language feature, and slightly reduces how large a Zig compiler needs to be to compete with the current stage1.
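A rough sketch of what such a helper could look like follows. `parseSeparatedInt` is the hypothetical name from the comment above and is not part of `std.fmt`; the extra type parameter and the fixed 128-byte buffer are assumptions made to keep the example self-contained. It is written as an ordinary function, and whether it evaluates in a comptime context is not verified here. Recent versions of `std.fmt.parseInt` may already skip `_` themselves, but this sketch does not rely on that.

```zig
const std = @import("std");

/// Hypothetical helper: copies the digits without `_` into a scratch
/// buffer, then hands them to std.fmt.parseInt with the caller's base.
fn parseSeparatedInt(comptime T: type, text: []const u8, base: u8) !T {
    var buf: [128]u8 = undefined; // room for a 128-bit binary literal
    var len: usize = 0;
    for (text) |c| {
        if (c == '_') continue; // drop separators before parsing
        if (len == buf.len) return error.Overflow;
        buf[len] = c;
        len += 1;
    }
    return std.fmt.parseInt(T, buf[0..len], base);
}

test "separated hex string parses to the expected value" {
    const a = try parseSeparatedInt(u64, "1f3a_3904_a9ca_299c", 16);
    try std.testing.expectEqual(@as(u64, 0x1f3a3904a9ca299c), a);
}
```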

@momumi
Contributor

momumi commented Jan 19, 2020

@pixelherodev I think encouraging parsing functions for something so elementary is the wrong way to go. Using something like std.fmt.parseSeparatedInt is cumbersome, so people will be encouraged to take short cuts. However, different people are going to take different shortcuts:

Person A might do this:

const p = std.fmt.parseSeparatedInt;

// ...

const y = switch (x) {
    p("0010_1111", 2) ... p("0011_1111", 2) => symbol_1(x),
    p("1111_0000", 2) ... p("1111_1111", 2) => symbol_2(x),
    // ...
};

Person B might do this:

fn b(comptime str: []const u8) comptime_int {
    return std.fmt.parseSeparatedInt(str, 2);
}

// ...

const y = switch (x) {
    b("0010_1111") ... b("0011_1111") => symbol_1(x),
    b("1111_0000") ... b("1111_1111") => symbol_2(x),
    // ...
};

However, as the reader of this code, how am I supposed to know what p and b do? I can't guarantee what it does, so I have to check the definition. Having builtin syntax introduces much less cognitive overhead:

const y = switch (x) {
    0b0010_1111 ... 0b0011_1111 => symbol_1(x),
    0b1111_0000 ... 0b1111_1111 => symbol_2(x),
    // ...
};

@thejoshwolfe
Sponsor Contributor

The list of languages that @momumi compiled that support this feature is the most compelling argument i've seen.

I don't think it's fair to support 0b literals and not separators; it seems like an arbitrary design decision to me. Both have their niche uses; both give more than one obvious way to do things; both are unsupported in C; both are supported by most major modern languages (i think?).

@andrewrk andrewrk reopened this Jan 19, 2020
@andrewrk andrewrk modified the milestones: 0.2.0, 0.6.0 Jan 19, 2020
@andrewrk andrewrk added accepted This proposal is planned. contributor friendly This issue is limited in scope and/or knowledge of Zig internals. labels Feb 10, 2020
@andrewrk andrewrk modified the milestones: 0.6.0, 0.7.0 Feb 10, 2020
@momumi
Contributor

momumi commented Mar 15, 2020

Had a go at implementing this in #4741

This implementation is similar to the javascript version where _ may only be placed between two digits.

So these are valid:

  • 1_000_000
  • 1_0_0_0_0_0_0
  • 0x1234_5678
  • 0x12_34_56_78
  • 1_000.000_001e1_000

These are invalid:

  • 1__0
  • 10_
  • 0_b10
  • 0b_10
  • 1_.0
  • 1._0
  • 1.0_e1
  • 1.0e_1
  • 1.0e1_
  • 1.0e+_1
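The placement rule in these two lists can be sketched as a standalone check: every `_` needs a digit of the literal's base directly before and after it. The names `isDigitInBase` and `separatorsWellPlaced` are hypothetical, the caller is assumed to supply the base, and this is an illustration of the rule only, not the tokenizer code from #4741. It does not validate anything else about the literal.

```zig
const std = @import("std");

/// Hypothetical helper: true if `c` is a digit in the given base (up to 16).
fn isDigitInBase(c: u8, base: u8) bool {
    const d: u8 = switch (c) {
        '0'...'9' => c - '0',
        'a'...'f' => c - 'a' + 10,
        'A'...'F' => c - 'A' + 10,
        else => return false,
    };
    return d < base;
}

/// Hypothetical check: each `_` must sit between two digits.
fn separatorsWellPlaced(token: []const u8, base: u8) bool {
    var i: usize = 0;
    while (i < token.len) : (i += 1) {
        if (token[i] != '_') continue;
        // A separator at either end has no digit on one side.
        if (i == 0 or i + 1 == token.len) return false;
        if (!isDigitInBase(token[i - 1], base)) return false;
        if (!isDigitInBase(token[i + 1], base)) return false;
    }
    return true;
}

test "placement rule matches the examples listed above" {
    try std.testing.expect(separatorsWellPlaced("1_000.000_001e1_000", 10));
    try std.testing.expect(separatorsWellPlaced("0x12_34_56_78", 16));
    try std.testing.expect(!separatorsWellPlaced("1__0", 10));
    try std.testing.expect(!separatorsWellPlaced("0b_10", 2));
    try std.testing.expect(!separatorsWellPlaced("1.0e+_1", 10));
}
```

Because the prefix letters, `.`, the exponent marker and the sign are not digits in the relevant base, the same adjacency test rejects a separator next to any of them.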

@andrewrk
Member

Implemented by @momumi in #4741, landed in 13d04f9.
