
Can POW10_SPLIT_2 be smaller? #105

Closed
StephanTLavavej opened this issue Mar 9, 2019 · 2 comments

Comments

@StephanTLavavej (Contributor)

POW10_SPLIT_2 is large, 8 * 4445 * 3 = 106,680 bytes:

static const uint64_t POW10_SPLIT_2[4445][3] = {

Interestingly, it contains 1177 rows that are entirely zero. For example, lines 1583 and 1591 have nonzero elements, but lines 1584-1590 are entirely zero:

ryu/ryu/d2fixed_full_table.h, lines 1583 to 1591 at 9976c57:

{ 0u, 0u, 64000000000u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 0u, 0u, 0u },
{ 175162308u, 0u, 0u },

These entirely zero rows are consuming 26.5% of the table, or 8 * 1177 * 3 = 28,248 bytes. Looking at the usage:

ryu/ryu/d2fixed.c, lines 448 to 459 at 9976c57:

const uint32_t p = POW10_OFFSET_2[idx] + i;
if (p >= POW10_OFFSET_2[idx + 1]) {
// If the remaining digits are all 0, then we might as well use memset.
// No rounding required in this case.
const int32_t fill = precision - 9 * i;
memset(result + index, '0', fill);
index += fill;
break;
}
// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

ryu/ryu/d2fixed.c, lines 650 to 653 at 9976c57:

const uint32_t p = POW10_OFFSET_2[idx] + i;
// Temporary: j is usually around 128, and by shifting a bit, we push it to 128 or above, which is
// a slightly faster code path in mulShift_mod1e9. Instead, we can just increase the multipliers.
digits = (p >= POW10_OFFSET_2[idx + 1]) ? 0 : mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

This makes me think that reorganizing POW10_OFFSET_2 and POW10_SPLIT_2 might allow these entirely zero rows to be eliminated for free, i.e. with no additional lookups or branches. It might even improve performance if it results in skipping mulShift_mod1e9 calls, but this is currently at the outer limit of my understanding of the algorithm.

@StephanTLavavej (Contributor) commented Mar 9, 2019

In each group denoted by POW10_OFFSET_2, there are all-zero rows at both the beginning and the end. I can easily trim the end at zero cost (though with no perf improvement). I can also shift the multipliers by 8 (no observable perf improvement). However, I haven't figured out how to trim the beginning except by adding another small lookup table, which seems to degrade perf very slightly (while saving the 27.6 KB), and I suspect I've damaged correctness by missing branches. There may be a better way to trim the beginning.

@ulfjack (Owner) commented Mar 13, 2019

POW10_SPLIT_2 is a concatenation of arrays. For each sub-array, we need to know its offset and length, and we also need to know which block the first sub-array element corresponds to.

We already have MIN_BLOCK_2 to skip the all-0 entries at the beginning; we only need to also use it to compute the correct index into the POW10_SPLIT_2 table, while making sure we don't access elements for the next larger index.

      const uint32_t p = i - MIN_BLOCK_2[idx] + POW10_OFFSET_2[idx];
      if (p >= POW10_OFFSET_2[idx + 1]) {
        ...
      }
      uint32_t digits = mulShift_mod1e9(m2 << 8, POW10_SPLIT_2[p], j + 8);

At the same time, we need to remove MIN_BLOCK_2[idx] rows from the beginning of each sub-array. I think this transformation shouldn't cost any performance: we already load all of MIN_BLOCK_2[idx], POW10_OFFSET_2[idx], and POW10_OFFSET_2[idx + 1], and the term - MIN_BLOCK_2[idx] + POW10_OFFSET_2[idx] is loop-invariant, so it can be hoisted out of the loop.

We could additionally interleave MIN_BLOCK_2 and POW10_OFFSET_2 to increase the chance that both entries land on the same cache line.

I think that should work.
