Proposal: Number literal separators #504

tiehuis · 2017-09-29T00:05:21Z

This is found in many other languages, aimed at making longer literals easier to read at a glance by grouping together logical units within numbers. This is especially useful for the longer 128-bit and beyond literals that are available in zig.

I propose allowing a _ separator anywhere in a number literal, since that is the simplest rule to understand. Numeric literals are parsed into values as if the separators were not present.

Examples:

const a = 0x1234_2839_1083_1928;
const b = 0x123_190.109_038_018p102;
const c = 0_x0123; // Not allowed, cannot insert separator on radix prefix
const d = _1238; // This is parsed as an identifier
const e = 9_________123123; // Multiple separators are allowed in sequence.
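As a rough illustration of the rule above, here is a minimal Python sketch covering integer literals only. It is not the code from the implementation branch. The name `parse_loose_int` and the requirement that the first character after the radix prefix be a digit are assumptions made for the example; the proposal itself only says the prefix cannot contain a separator and that a leading `_` makes an identifier.

```python
import re

# (radix prefix, base, digit alphabet written as a regex character range)
_FORMS = [
    ("0x", 16, "0-9a-fA-F"),
    ("0o", 8, "0-7"),
    ("0b", 2, "01"),
    ("", 10, "0-9"),
]


def parse_loose_int(token):
    """Return the value of an integer literal, or None if `token` is not one.

    Separators may appear anywhere after the first digit, including in runs
    and at the end. They are dropped before the digits are evaluated.
    """
    for prefix, base, digits in _FORMS:
        if not token.startswith(prefix):
            continue
        body = token[len(prefix):]
        # Assumption: one real digit first, then any mix of digits and "_".
        if re.fullmatch(f"[{digits}][{digits}_]*", body):
            return int(body.replace("_", ""), base)
    return None
```

With this sketch, `0x1234_2839_1083_1928` and `9_________123123` evaluate to the same values as their unseparated forms, while `0_x0123` and `_1238` are not recognised as literals at all.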

A more in-depth reference of other implementations can be found in the javascript proposal.



tiehuis · 2017-09-29T04:54:36Z

Pushed an initial implementation to a new branch here.

PavelVozenilek · 2017-09-29T10:08:57Z

A note: it may be useful to allow other common numeral separators - only per project, not universally.

For example, Indian numbers use comma:

3,00,00,000

https://en.wikipedia.org/wiki/Indian_numbering_system

Trailing underscores could be allowed, for situations like:

arr[0] = 73__;
arr[1] = 8655;
arr[2] = 1___;
arr[3] = 0___;
arr[4] = 12__;
arr[5] = 987_;

which gives a hint about expected max range for a group of numbers.

Or for alignment:

arr[8_] = 1;
arr[9_] = 2;
arr[10] = 1;
arr[11] = 2;

tiehuis · 2017-09-29T10:15:12Z

Not sure if there would be much benefit for customisable separators. These are only really for visual grouping and not so much for accurate numeric localization.

The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.

PavelVozenilek · 2017-09-29T11:03:53Z

@tiehuis: a program with a lot of hardcoded numbers (e.g. ballistic tables) may have better readability and a higher chance of catching typos, due to the familiar style.

But this is not a feature for everyone, and if ever implemented, it should allow ad-hoc project customisation. I imagine wild things like the ability to avoid repeated numbers:

...
    80482.23
    ....3.23
    ....4.93
    80493.22
...

I have a hope that Zig's metaprogramming will enable these "tricks".

The current implementation does allow trailing underscores as part of the literal. Those examples of yours should work.

Great.

andrewrk · 2017-09-29T14:57:26Z

I think this has pros and cons. Reading 1_000_000 is a little better than reading 1000000. The only reason I hesitate to merge this right away is that 1______0__00__0___0________0 is much worse than 1000000 and now it becomes possible to have working, compiling code that looks like that. Even though 1000000 could look a little better, it's at least reasonable, and the only way you can write that number.

andrewrk · 2017-09-29T15:00:52Z

Another thing to think about is that a bare number literal is a math construct. 1000000 means the same thing in every region. However once you start introducing separators, regional differences creep into code. Some people might want 100_00_00. Maybe it's better to avoid that whole class of problems.

thejoshwolfe · 2017-09-29T15:27:03Z

Here's my alternative proposal that uses status quo:

To compete with this Java:

long hexBytes = 0xFF_EC_DE_5E;
long hexWords = 0xCAFE_F00D;
long maxLong = 0x7fff_ffff_ffff_ffffL;
byte nybbles = 0b0010_0101;
long bytes = 0b11010010_01101001_10010100_10010010;

Here's the Zig:

//                  ++--++--
const hex_in_8s = 0xFFECDE5E;

//                   ++++----
const hex_in_16s = 0xCAFEF00D;

//                /--\/--\/--\/--\
const max_s64 = 0x7fffffffffffffff;

//                hhhhllll
const nybbles = 0b00100101;

//              33333333222222221111111100000000
const bytes = 0b11010010011010011001010010010010;

Takes up extra space, but the free-form nature of comments means you can write whatever you want there, which is arguably more powerful than only being able to group digits.

I admit the _ grouping looks nicer in some cases. But I'm not sure that's a compelling reason to make the language more complicated.

I think my biggest objection to this proposal is that it introduces language complexity in the form of syntactic sugar without encouraging any different semantics. This proposal enables the subjective concept of making long literals easier to read by grouping digits in some way.

Consider that there are even more ways to write a literal in Zig that allow even more expression of intent than this proposal. For example:

const max_s64 = (1 << 63) - 1;
const bytes =
    (0b11010010 << 24) |
    (0b01101001 << 16) |
    (0b10010100 << 8) |
    (0b10010010 << 0);
// see https://github.com/gcc-mirror/gcc/blob/61eae75c6230c7df9fa3e935b2efadda61667c5f/libiberty/crc32.c#L70
const crc32_table: [256]u32 = comptime generate_crc32_table();
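To make the "computed instead of typed" idea concrete, here is a small Python sketch of the same three constants. The table generator uses the common reflected CRC-32 polynomial 0xEDB88320 (the zlib variant), which is an assumption for illustration and is not necessarily the variant used in the linked libiberty file; the function and constant names are also made up for the example.

```python
# Constants built from smaller, self-explaining pieces.
MAX_S64 = (1 << 63) - 1
BYTES = (
    (0b11010010 << 24)
    | (0b01101001 << 16)
    | (0b10010100 << 8)
    | (0b10010010 << 0)
)


def generate_crc32_table():
    """Build the 256-entry lookup table for reflected CRC-32 (0xEDB88320)."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            # Shift right one bit; fold in the polynomial if a 1 fell off.
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


CRC32_TABLE = generate_crc32_table()


def crc32(data):
    """Table-driven CRC-32 of a bytes object."""
    c = 0xFFFFFFFF
    for byte in data:
        c = CRC32_TABLE[(c ^ byte) & 0xFF] ^ (c >> 8)
    return c ^ 0xFFFFFFFF
```

None of the 256 table entries is ever written out by hand, so there is no long literal to group in the first place.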

PavelVozenilek · 2017-09-29T18:02:43Z

1______0__00__0___0________0

I have yet to see people doing things like this outside IOCCC.

OTOH they go to great lengths to create helpful visual artifacts in the code, like column alignment. Technical authors do the same with numeric tables or math heavy texts.

Why not make it a per-project option? If someone fears it, they can switch it off.

tiehuis · 2017-09-29T22:40:48Z

@andrewrk

Instead of allowing a separator everywhere, we could be more restrictive and only allow single separators between digits. This actually seems to be pretty normal in other languages. Ada, C++, Ruby and Julia use this method.

Regarding different region details. If there are implicit semantics behind the meaning of a number literal, separators actually may help convey to a reader that there are some implied extra details. Of course if they are just separating something without any specific meaning then that is a valid concern.

@thejoshwolfe

Valid alternative.

I see two main draws over just comments. A standardized way of doing this is a bonus and means we don't get different competing styles to represent the same thing (only a minor point). The other is that because literal separators are much easier to insert (one character vs. annotating an entire line), they are probably more likely to be used than a comment-based approach, helping code readability. It also allows one to leave comments for more important details, like why the particular value was chosen.

andrewrk · 2017-12-08T05:08:01Z

@tiehuis I really appreciate the writeup, and especially the fact that you went off and coded it. Your arguments are reasonable. But I'm going to have to go with keeping the language small and only 1 way to do things.

momumi · 2019-12-26T11:30:06Z

But I'm going to have to go with keeping the language small and only 1 way to do things.

If you use that reasoning, I'd personally drop the 0o and 0b prefixes as well.

The only use case I've ever seen for octal is in unix file permissions which is something you could easily handle using constants.

Anything you can represent in binary, you can just represent in hexadecimal. For example compare 0xff == 0b11111111, or 0x8000 == 0b1000000000000000. Personally, binary literals are very hard to read without the _ separator, and even C rejected binary literals due to lack of precedent and insufficient utility. cf. 6.4.4.1

scottjmaddox · 2020-01-08T05:26:56Z

Just an FYI: this can be implemented with pretty tiny changes to the syntax and lexer. However, it would require an extra sentence to explain it in the documentation, and it does mean there are more ways to write equivalent integer literals than the already existing decimal, hex, octal, and binary literals. Personally, I think it would be worth it.

The grammar changes from this:

FLOAT
    <- "0x" hex+   "." hex+   ([pP] [-+]? hex+)?   skip
     /      [0-9]+ "." [0-9]+ ([eE] [-+]? [0-9]+)? skip
     / "0x" hex+   "."? [pP] [-+]? hex+   skip
     /      [0-9]+ "."? [eE] [-+]? [0-9]+ skip
INTEGER
    <- "0b" [01]+  skip
     / "0o" [0-7]+ skip
     / "0x" hex+   skip
     /      [0-9]+ skip

to this:

hex_ <- [0-9a-fA-F_]
FLOAT
    <- "0x" hex_+   "." hex_+   ([pP] [-+]? hex_+)?   skip
     /      [0-9] [0-9_]* "." [0-9_]+ ([eE] [-+]? [0-9_]+)? skip
     / "0x" hex_+   "."? [pP] [-+]? hex_+   skip
     /      [0-9] [0-9_]* "."? [eE] [-+]? [0-9_]+ skip
INTEGER
    <- "0b" [01_]+  skip
     / "0o" [0-7_]+ skip
     / "0x" hex_+   skip
     /      [0-9] [0-9_]* skip

And the lexer (or parser, depending on implementation) just needs to skip the underscores when evaluating the number.
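A minimal Python sketch of that "skip the underscores when evaluating" step, written as the kind of digit-accumulation loop a hand-written lexer might use. It assumes the token has already been matched by the INTEGER rule, so it does no validation of its own, and `eval_integer_token` is a hypothetical name rather than anything in the Zig compiler.

```python
def eval_integer_token(token):
    """Evaluate a token that already matched the INTEGER grammar rule."""
    bases = {"0b": 2, "0o": 8, "0x": 16}
    base = bases.get(token[:2], 10)
    digits = token[2:] if base != 10 else token

    value = 0
    for ch in digits:
        if ch == "_":
            continue  # separators contribute nothing to the value
        # int(ch, 16) maps any single digit character 0-9, a-f, A-F to 0..15.
        value = value * base + int(ch, 16)
    return value
```

The only change relative to a separator-free lexer is the single `continue` branch.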

momumi · 2020-01-19T04:38:39Z

@andrewrk could you consider reopening this?

After this issue was closed in 2017, almost every mainstream language has come to support this feature. If Zig's goal is to replace C and become the new lingua franca, it makes sense to adopt the syntax that other languages are using.

I've compiled an extensive list of languages that support _ as a digit separator:

java (SE 7)
javascript (planned, already implemented in browsers and nodejs)
python (3.6)
C# (7.0)
C++ (C++14 actually uses ' as the separator, but it would have used _ if it didn't conflict with the grammar)
php (7.4)
go (1.13)
rust (1.0)
ruby (1.0)
Visual Basic (Visual Basic 2017)
perl (2.0)
D (1.0)
swift (1.0)
kotlin (1.1)
haskell (8.6.1)
F# (4.1)
assembly (nasm 0.99.06, fasm 1.71.56)
Verilog (95)
VHDL (1993 cf. §13.4)
julia (1.0)
erlang (eep 51 accepted)
octave (4.2)
typescript (2.7)
elixir (1.1)
ocaml (3.07)
scheme (srfi-169)
eiffel (5.6)
ada (1983)

pixelherodev · 2020-01-19T04:53:50Z

I'd just like to note that C isn't on that list.

One of the things that differentiates C from most languages is, IMO, its simplicity. While this proposal is, itself, not complex, I find that languages aren't generally brought down by a few major changes, but by many minor ones.

Imagine if a dozen similar changes to the grammar were made. Each one, on its own, is relatively benign; together, they remove everything that makes Zig what it is. If Zig were to adopt every minor change that "every mainstream" language supports, there wouldn't really be a point to Zig at all.

momumi · 2020-01-19T05:28:22Z

@pixelherodev Also note that C doesn't have binary literals 0b. Most of these languages added binary literals and _ separators at the same time. People have longer response times when counting more than 4 objects and binary literals have a large number of elements, so they are hard for people to parse without a visual separator. It's hard to tell the difference between 0b11111111 and 0b111111111. Using visual separators makes the code much easier to read: 0b1111_1111 and 0b1_1111_1111.
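For a quick feel of the difference, Python's format mini-language can insert the separators automatically: for binary output, the `_` option groups digits in fours and `#` adds the `0b` prefix. This is only an illustration in another language, and `grouped_binary` is a made-up helper name.

```python
def grouped_binary(value):
    """Render an integer as a 0b-prefixed literal with digits grouped in fours."""
    return format(value, "#_b")


eight_ones = 0b11111111
nine_ones = 0b111111111

print(grouped_binary(eight_ones))  # 0b1111_1111
print(grouped_binary(nine_ones))   # 0b1_1111_1111
```

The extra digit shows up as a lone leading group instead of having to be counted.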

pixelherodev · 2020-01-19T05:36:30Z

That's true, but it's also still possible to write out, say, 0xFF or 0x1FF, is it not? If you're writing out large binary strings manually, maybe you should switch to hex.

Or, if numerical separators are important, here's an alternate proposal: a comptime function in zag (which, in case you haven't come across me using that term elsewhere, is what I've started calling the Zig standard library) which takes a string literal such as "393_219_293_192", strips the separators, and parses the result as an integer. This also has the advantage of allowing a single function that supports every base by simply passing the base on to parseInt in std.fmt.

Usage:

const a = std.fmt.parseSeparatedInt("1f3a_3904_a9ca_299c", 16);

This leaves the grammar as is, provides most (albeit not all) of the advantages of implementing it as a language feature, and slightly reduces how large a Zig compiler needs to be to compete with the current stage1.

momumi · 2020-01-19T07:51:00Z

@pixelherodev I think encouraging parsing functions for something so elementary is the wrong way to go. Using something like std.fmt.parseSeparatedInt is cumbersome, so people will be encouraged to take short cuts. However, different people are going to take different shortcuts:

Person A might do this:

const p = std.fmt.parseSeparatedInt;

// ...

const y = switch (x) {
    p("0010_1111", 2) ... p("0011_1111", 2) => symbol_1(x),
    p("1111_0000", 2) ... p("1111_1111", 2) => symbol_2(x),
    // ...
};

Person B might do this:

fn b(comptime str: []const u8) comptime_int {
    return std.fmt.parseSeparatedInt(str, 2);
}

// ...

const y = switch (x) {
    b("0010_1111") ... b("0011_1111") => symbol_1(x),
    b("1111_0000") ... b("1111_1111") => symbol_2(x),
    // ...
};

However, as the reader of this code, how am I supposed to know what p and b do? I can't guarantee what it does, so I have to check the definition. Having builtin syntax introduces much less cognitive overhead:

const y = switch (x) {
    0b0010_1111 ... 0b0011_1111 => symbol_1(x),
    0b1111_0000 ... 0b1111_1111 => symbol_2(x),
    // ...
};

thejoshwolfe · 2020-01-19T14:13:29Z

The list of languages that @momumi compiled that support this feature is the most compelling argument i've seen.

I don't think it's fair to support 0b literals and not separators; it seems like an arbitrary design decision to me. Both have their niche uses; both give more than one obvious way to do things; both are unsupported in C; both are supported by most major modern languages (i think?).

momumi · 2020-03-15T03:10:41Z

Had a go at implementing this in #4741.

This implementation is similar to the javascript version where _ may only be placed between two digits.

So these are valid:

1_000_000
1_0_0_0_0_0_0
0x1234_5678
0x12_34_56_78
1_000.000_001e1_000

These are invalid:

1__0
10_
0_b10
0b_10
1_.0
1._0
1.0_e1
1.0e_1
1.0e1_
1.0e+_1
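The placement rule behind those two lists (a `_` is only allowed directly between two digits) can be sketched in a few lines of Python. This checks separator placement only and assumes the token is otherwise a well-formed literal; it is not the code from #4741, and `separators_ok` is a hypothetical name.

```python
def separators_ok(token):
    """Return True if every "_" in `token` sits directly between two digits."""
    alphabets = {
        "0b": set("01"),
        "0o": set("01234567"),
        "0x": set("0123456789abcdefABCDEF"),
    }
    # No recognised radix prefix means decimal digits.
    digits = alphabets.get(token[:2], set("0123456789"))

    for i, ch in enumerate(token):
        if ch != "_":
            continue
        left = token[i - 1] if i > 0 else None
        right = token[i + 1] if i + 1 < len(token) else None
        # A neighbouring "_", ".", "e", sign, prefix letter, or the end of
        # the token is not a digit, so each of those placements is rejected.
        if left not in digits or right not in digits:
            return False
    return True


valid = ["1_000_000", "1_0_0_0_0_0_0", "0x1234_5678",
         "0x12_34_56_78", "1_000.000_001e1_000"]
invalid = ["1__0", "10_", "0_b10", "0b_10", "1_.0",
           "1._0", "1.0_e1", "1.0e_1", "1.0e1_", "1.0e+_1"]
```

Every entry in `valid` passes the check and every entry in `invalid` fails it.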

andrewrk · 2020-03-23T04:55:42Z

Implemented by @momumi in #4741, landed in 13d04f9.

tiehuis added the proposal label Sep 29, 2017

tiehuis added a commit that referenced this issue Sep 29, 2017

Allow _ separator within numeric literals

9652854

Closes #504.

andrewrk added this to the 0.2.0 milestone Sep 29, 2017

andrewrk closed this as completed Dec 8, 2017

andrewrk added the rejected label Dec 8, 2017

bnoordhuis mentioned this issue Apr 19, 2018

Std.os.time #933

Merged

Hejsil mentioned this issue Oct 27, 2018

[WIP] New Zig formal grammar #1685

Merged

emekoi mentioned this issue Dec 26, 2019

Allow underscores _ as digit separates in float and integer literals. #3983

Closed

daurnimator mentioned this issue Jan 8, 2020

Allow underscores in numeric literals for improved readability? #4108

Closed

andrewrk reopened this Jan 19, 2020

andrewrk modified the milestones: 0.2.0, 0.6.0 Jan 19, 2020

andrewrk added the accepted and contributor friendly labels Feb 10, 2020

andrewrk modified the milestones: 0.6.0, 0.7.0 Feb 10, 2020

alexnask mentioned this issue Mar 5, 2020

Add support for sized integer literals #4644

Closed

momumi mentioned this issue Mar 15, 2020

allow _ separators in number literals (stage 1) #4741

Merged

andrewrk closed this as completed Mar 23, 2020

andrewrk modified the milestones: 0.7.0, 0.6.0 Mar 23, 2020

mikdusan mentioned this issue Apr 13, 2020

fix some typos in 0.6.0 release notes ziglang/www.ziglang.org#60

Merged
