Skip to content

Support conversion of hexidecimal values#173

Closed
marcpaterno wants to merge 7 commits intor-lib:mainfrom
marcpaterno:main
Closed

Support conversion of hexidecimal values#173
marcpaterno wants to merge 7 commits intor-lib:mainfrom
marcpaterno:main

Conversation

@marcpaterno
Copy link
Copy Markdown

This small addition allows as.integer64.character to handle hexidecimal printed representation of large integers, in the style of as.integer in base R.

@MichaelChirico
Copy link
Copy Markdown
Collaborator

thanks for the PR! the idea seems great!

would you mind

  1. filing an issue that motivates the change, including the current failure mode / (optional) known workarounds
  2. including a test that actually touches the 64-bit range, i.e. that wouldn't fit in 32-bit integers (or I guess that's covered already?, I never remember what the exact boundary is... maybe make it at least 2^33 for clarity)
  3. include a few more tests, e.g. of negative #s, hex #s that don't fit in 64 bits, any edge cases you can think of

@MichaelChirico
Copy link
Copy Markdown
Collaborator

WDYT about just letting strtoll() handle this by setting base=0 instead?

https://en.cppreference.com/w/c/string/byte/strtol

Comment thread src/integer64.c
Comment thread src/integer64.c
@MichaelChirico
Copy link
Copy Markdown
Collaborator

I think the issue with just doing base=0 is it will not be back-compatible (c.f. as.integer64('011'), which will be 11LL on main but 9LL with base=0).

We can easily imagine there's ppl working with long-string identifiers relying on the old behavior.

That said, I think we can do this?

-ret[i] = strtoll(str, &endpointer, 10);
+int negative=str[0] == '-';
+if (!strncmp(str+negative, "0x", 2) || !strncmp(str+negative,"0X", 2)) {
+  ret[i] = strtoll(str, &endpointer, 16);
+} else {
+  ret[i] = strtoll(str, &endpointer, 10);
+}

@marcpaterno
Copy link
Copy Markdown
Author

I think the issue with just doing base=0 is it will not be back-compatible (c.f. as.integer64('011'), which will be 11LL on main but 9LL with base=0).

We can easily imagine there's ppl working with long-string identifiers relying on the old behavior.

That said, I think we can do this?

-ret[i] = strtoll(str, &endpointer, 10);
+int negative=str[0] == '-';
+if (!strncmp(str+negative, "0x", 2) || !strncmp(str+negative,"0X", 2)) {
+  ret[i] = strtoll(str, &endpointer, 16);
+} else {
+  ret[i] = strtoll(str, &endpointer, 10);
+}

I have tested this code and it yields different behavior in one edge case. I thought I had picked the appropriate behavior, but you can of course correct me.

In the case of handling the string "-0x8000000000000000", my original code yielded -9223372036854775807, the first value in the result of lim.integer64(). Modified as you propose above, the result is "NA".

I will add the new testing (that includes the handling of input "0") to the PR, and await your decision on the desired behavior for handling "-0x8000000000000000".

@MichaelChirico
Copy link
Copy Markdown
Collaborator

Let's pick whatever strtoll("-0x8000000000000000", &endpointer, 16) does, I think it's best to be consistent with & as close to the well-tested underlying C library routine.

@MichaelChirico
Copy link
Copy Markdown
Collaborator

please add a NEWS entry btw!

Comment thread tests/testthat/test-integer64.R Outdated
Comment thread tests/testthat/test-integer64.R
@marcpaterno
Copy link
Copy Markdown
Author

I do not think it is possible to match the behavior of strtoll with the current representation of integer64.
strtoll will read "-9223372036854775808" and represent it exactly. This value is 1 smaller than what lim.integer64() indicates can be represented by integer64. strtoll will read "-9223372036854775809" and yield the value -9223372036854775808.

How would you like to proceed?

@MichaelChirico
Copy link
Copy Markdown
Collaborator

Oh, great point. That's because -922...808 is how NA_integer64_ is represented (just as NA_integer_ is represented as -214...647):

#define NA_INTEGER64 LLONG_MIN

In fact you've found what I think is kind of buggy behavior in as.integer64.character for decimal string input, too: #175.

Part of the issue is, there's no 128-bit double to case into with full fidelity to work with these out-of-range values before casting back down to 64 bits; that's what's done in the 32-bit case by R:

https://github.com/r-devel/r-svn/blob/c8ab68478c5d2322956e19764c20c092916133cd/src/main/util.c#L2112

Anyway, for now, I think we should aim to be consistent between hex & decimal strings, and to bitstrings & return NA_integer64_ for -9223372036854775808

# as.bitstring(lim.integer64())
as.integer64(structure("1000000000000000000000000000000000000000000000000000000000000000", class='bitstring'))
# integer64
# [1] <NA>

@MichaelChirico
Copy link
Copy Markdown
Collaborator

Superseded by #228, thanks for the work & discussion which shaped the result!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants