
Can POW10_SPLIT_2 be smaller? #105

Closed
StephanTLavavej opened this issue Mar 9, 2019 · 2 comments

Comments

@StephanTLavavej (Contributor)

POW10_SPLIT_2 is large, 8 * 4445 * 3 = 106,680 bytes:

static const uint64_t POW10_SPLIT_2[4445][3] = {

Interestingly, it contains 1177 rows that are entirely zero. For example, lines 1583 and 1591 have nonzero elements, but lines 1584-1590 are entirely zero:

ryu/ryu/d2fixed_full_table.h, lines 1583 to 1591 at 9976c57:

{ 0u, 0u, 64000000000u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 175162308u, 0u, 0u },

These entirely zero rows are consuming 26.5% of the table, or 8 * 1177 * 3 = 28,248 bytes. Looking at the usage:

ryu/ryu/d2fixed.c, lines 448 to 459 at 9976c57:

const uint32_t p = POW10_OFFSET_2[idx] + i;
if (p >= POW10_OFFSET_2[idx + 1]) {
// If the remaining digits are all 0, then we might as well use memset.
// No rounding required in this case.
const int32_t fill = precision - 9 * i;
memset(result + index, '0', fill);
index += fill;
break;
}
// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

ryu/ryu/d2fixed.c, lines 650 to 653 at 9976c57:

const uint32_t p = POW10_OFFSET_2[idx] + i;
// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
digits = (p >= POW10_OFFSET_2[idx + 1]) ? 0 : mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

This makes me think that reorganizing POW10_OFFSET_2 and POW10_SPLIT_2 might allow these entirely zero rows to be eliminated for free, i.e. with no additional lookups or branches. It might even improve performance if it results in skipping mulShift_mod1e9 calls, but this is currently at the outer limit of my understanding of the algorithm.

@StephanTLavavej (Contributor) commented Mar 9, 2019

In each group denoted by POW10_OFFSET_2, there are all-zero rows at both the beginning and the end. I can easily trim the end at zero cost (though with no perf improvement). I can also shift the multipliers by 8 (no observable perf improvement). However, I haven't figured out how to trim the beginning except by adding another small lookup table, which seems to degrade perf very slightly (while saving the 27.6 KB), and I suspect I've damaged correctness by missing branches. There may be a better way to trim the beginning.

@ulfjack (Owner) commented Mar 13, 2019

POW10_SPLIT_2 is a concatenation of arrays. For each sub-array, we need to know its offset and length, and we also need to know which block the first sub-array element corresponds to.

We already have MIN_BLOCK_2 to skip the all-0 entries at the beginning; we only need to also use it to compute the correct index into the POW10_SPLIT_2 table, while making sure we don't access elements for the next larger index.

      const uint32_t p = i - MIN_BLOCK_2[idx] + POW10_OFFSET_2[idx];
      if (p >= POW10_OFFSET_2[idx + 1]) {
        ...
      }
      uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

At the same time, we need to remove MIN_BLOCK_2[idx] rows from the beginning of each sub-array. I think this transformation shouldn't cost any performance: we already load all of MIN_BLOCK_2[idx], POW10_OFFSET_2[idx], and POW10_OFFSET_2[idx + 1], and the term - MIN_BLOCK_2[idx] + POW10_OFFSET_2[idx] is loop-invariant, so it can be hoisted out of the loop.

We could additionally interleave MIN_BLOCK_2 and POW10_OFFSET_2 to increase the chance that both entries land on the same cache line.

I think that should work.
