
Refactor SIMD code and add ARM Neon nibble2base() implementation #1795

Merged: 8 commits into samtools:develop, Jul 4, 2024

Conversation

@jmarshall (Member) commented Jun 24, 2024

This PR moves nibble2base_ssse3() from an internal header file to a new source file, simd.c, and adds an ARM implementation nibble2base_neon() alongside. Because these functions are always called via a pointer, they are never inlined so there is nothing gained (other than build complexity!) by having them in a header.

I initially translated the SSSE3 algorithm directly to Neon, during which I noticed some tweaks that could be usefully applied to the SSSE3 version — see the “Minor improvements” commit below. I then realised that the Neon implementation could be simplified by using Neon's rather excellent interleaved load/store instructions, which I then replaced by an in-register zip as it is (sadly) faster (on aarch64 at least). (It may be possible to apply a similar approach for Intel via the unpack instruction.)
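The zip-based approach described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the function and table names are made up, and only the full-vector inner step is shown, without the surrounding loop or the scalar tail handling. A scalar stand-in is included so the sketch compiles on non-ARM targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 4-bit-code lookup table, in the style of htslib's seq_nt16_str */
static const char nib2char[16] = "=ACMGRSVTWYHKDBN";

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
/* Convert 16 encoded bytes (32 nucleotides) via table lookups and a zip */
static void nibble2base_zip_sketch(uint8_t *out, const uint8_t *in)
{
    uint8x16_t lookup = vld1q_u8((const uint8_t *)nib2char);
    uint8x16_t enc = vld1q_u8(in);
    uint8x16_t hi = vshrq_n_u8(enc, 4);            /* first nucleotide of each byte  */
    uint8x16_t lo = vandq_u8(enc, vdupq_n_u8(15)); /* second nucleotide of each byte */
    /* Look up both halves, then interleave the two results with an in-register zip */
    uint8x16x2_t zipped = vzipq_u8(vqtbl1q_u8(lookup, hi),
                                   vqtbl1q_u8(lookup, lo));
    vst1q_u8(out,      zipped.val[0]);
    vst1q_u8(out + 16, zipped.val[1]);
}
#else
/* Scalar stand-in with identical behaviour, so the sketch compiles anywhere */
static void nibble2base_zip_sketch(uint8_t *out, const uint8_t *in)
{
    for (int i = 0; i < 16; i++) {
        out[2*i]     = nib2char[in[i] >> 4];
        out[2*i + 1] = nib2char[in[i] & 15];
    }
}
#endif
```

The key point is that `vzipq_u8` interleaves two vectors entirely in registers, producing the final character order without a separate interleaved store such as `vst2q_u8`.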

Indicative timings for various versions of the Neon implementation, on ARM and, for interest, an iteration count on x86_64 that makes the nibble2base_default time comparable — numbers are comparable vertically, but not directly horizontally:

Apple M2  i5 SSSE3
 seconds   seconds

 262.389   916.215   nibble2base_single (i.e., scalar code doing one nucleotide at a time)
 186.283   186.778   nibble2base_default (i.e., scalar code doing two nucleotides at a time)
  36.053    46.700   Ruben's nibble2base_ssse3 algorithm
  24.783             interleaved using vst2q_u8
  19.287             interleaved using vzipq_u8

I moved the existing code to the new source file in a pair of commits marked as authored by @rhpvorderman and @daviesrob, so that git blame preserves the authorship of their untouched lines. Hence ideally this PR would be merged via a merge commit rather than squashed.

rhpvorderman and others added 7 commits June 2, 2024 19:32
This function is always called via a function pointer, hence it is
never inlined so need not be in a header.

[Commit marked as authored by RV so git blame attributes the moved
code to Ruben, as it is unchanged from the original code. Temporarily
include simd.c from the header so this commit builds and runs. -JM]
Also add copyright boilerplate to the new file. [Commit marked as
authored by RMD so git blame attributes the boilerplate and moved
code to Rob, as the latter is unchanged from the original code. -JM]
… file

Build simd.c as a source file, so there is one copy of nibble2base_ssse3()
rather than potentially one per translation unit that uses it.

When everything was static, the names needed no prefix. However now that
there is a non-static function pointer, it needs an htslib-specific
prefix to avoid polluting the namespace. (In libhts.so it will usually
be a non-visible symbol, but in libhts.a it needs a prefix.)
Configure checks for __builtin_cpu_supports("ssse3"), so the check
for __attribute__((target)) might as well be there too. This enables
this internal-only infrastructure to be removed from htslib/hts_defs.h,
which is primarily for defines needed in the public headers.

Add a configure check for __attribute__((constructor)) too, as we will
likely be using it on other SIMD-using platforms as well.
Notably the adjustment to seq_vec_end_ptr stops it from reverting
to the scalar code prematurely: at present it drops back to scalar
when <=32 entries remain, rather than <=31.
As ARM CPU capability instructions are privileged, we also need to
add cpu_supports_neon() and implement it to query for Neon/AdvSIMD
in various platform-dependent ways. (On 32-bit ARM, glibc helpfully
renamed the HWCAP_* macros to HWCAP_ARM_*, so accept both on Linux.)

32-bit ARM GCC erroneously does not define vst1q_u8_x2() (bug 71233,
fixed in v14.1 in this case), so avoid it on 32-bit ARM by writing
interleaved via vst2q_u8() instead.
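A runtime capability probe of the kind this commit describes might look like the sketch below. This is an assumption-laden simplification covering only AArch64 Linux plus compile-time fallbacks; the actual code queries Neon/AdvSIMD in various platform-dependent ways, including the 32-bit ARM `HWCAP_ARM_*` names mentioned above.

```c
#include <assert.h>

/* Sketch of a Neon capability probe. On AArch64 Linux, CPU feature bits
   are exposed via the auxiliary vector, since the CPU ID registers
   themselves are privileged. Elsewhere this sketch falls back to
   compile-time knowledge of the target ISA. */
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
static int cpu_supports_neon(void) {
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;  /* AdvSIMD == Neon */
}
#elif defined(__ARM_NEON)
static int cpu_supports_neon(void) { return 1; }  /* Neon is in the target ISA */
#else
static int cpu_supports_neon(void) { return 0; }  /* not ARM, or unknown OS */
#endif
```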
@jkbonfield (Contributor):

That's impressive. On my intel machine SSSE3 appears to be running about 4x faster than the default version. You're getting 8-9x speed up which is great.

PS. The mythical htslib v2.0 will not use nibble encoding in memory. It's a royal pain in the derriere!

@jmarshall (Member, Author) commented Jun 25, 2024

I updated the timings to values from the final code and test code in this PR (previously the timings were from earlier development code), included the 1-nibble-at-a-time naive scalar code timings, and added some x86_64 timings on a different iteration count that makes the times on one row comparable. (Not totally sure I believe that 916 seconds! And I only ran it once. But it's indicative that the difference between the two scalar versions is much bigger on x86_64.)

It's not really meaningful to compare between the columns, but what it does tell you is this: the 2-nibbles-at-a-time scalar code is a much smaller win over 1-nibble-at-a-time on ARM than on x86_64. So this suggests that the apparent 8-9x speed up might not be that astonishing, because there is a lot left on the table by the 2-at-a-time scalar code on ARM, at least as it is written here. Or something — who knows, it's a nice speed up anyway 😄

@jkbonfield (Contributor):

Yep, very nice and thank you.

I hunted down the origin of nibble2base and traced it back to my initial BAM implementation in io_lib in Jan 2013 :-)
It has since been tarted up a bit, avoiding the need for unaligned-access ifdefs by using memcpy, but it's essentially the same deal. It got retrofitted into the BAM handling code in htslib shortly after CRAM was added.

https://github.com/jkbonfield/io_lib/blob/master/io_lib/bam.c#L3650-L3657
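That scalar approach — a single 256-entry table turning each encoded byte into a pair of characters, copied with memcpy to sidestep unaligned 16-bit stores — can be sketched like this. Names are illustrative, not io_lib's or htslib's, and the lazy table initialisation here is not thread-safe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char nib2char[16] = "=ACMGRSVTWYHKDBN";

/* Two nucleotides at a time: expand each encoded byte via a 256-entry
   table of character pairs, writing each pair with memcpy rather than
   an unaligned 16-bit store. */
static void nibble2base_pairs(uint8_t *out, const uint8_t *in, size_t n)
{
    static char pair[256][2];
    static int init = 0;
    if (!init) {                       /* NB: not thread-safe as written */
        for (int b = 0; b < 256; b++) {
            pair[b][0] = nib2char[b >> 4];  /* first nucleotide  */
            pair[b][1] = nib2char[b & 15];  /* second nucleotide */
        }
        init = 1;
    }
    size_t i;
    for (i = 0; i + 1 < n; i += 2)
        memcpy(out + i, pair[in[i / 2]], 2);
    if (i < n)  /* odd length: the final base sits in a high nibble */
        out[i] = nib2char[in[i / 2] >> 4];
}
```

The memcpy of a known 2-byte size compiles down to a single store on mainstream compilers, which is what makes this formulation both portable and fast.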

@rhpvorderman (Contributor):

> It may be possible to apply a similar approach for Intel via the unpack instruction.

I think it would be possible, and possibly faster still. Interesting find!

@rhpvorderman (Contributor) commented Jun 29, 2024

I did a quick test in Sequali; the following code is simpler and all tests pass:

static void
decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
{
    static const uint8_t *nuc_lookup = (const uint8_t *)"=ACMGRSVTWYHKDBN";
    const uint8_t *dest_end_ptr = dest + length;
    uint8_t *dest_cursor = dest;
    const uint8_t *encoded_cursor = encoded_sequence;
    const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i) - 1);
    __m128i nuc_lookup_vec = _mm_lddqu_si128((const __m128i *)nuc_lookup);
    /* Nucleotides are encoded 4-bits per nucleotide and stored in 8-bit bytes 
       as follows: |AB|CD|EF|GH|. The 4-bit codes (going from 0-15) can be used 
       together with the pshufb instruction as a lookup table. The most efficient
       way is to use bitwise AND and shift to create two vectors. One with all 
       the upper codes (|A|C|E|G|) and one with the lower codes (|B|D|F|H|). 
       The lookup can then be performed and the resulting vectors can be 
       interleaved again using the unpack instructions. */
    while (dest_cursor < dest_vec_end_ptr) {
        __m128i encoded = _mm_lddqu_si128((const __m128i *)encoded_cursor);
        __m128i encoded_upper = _mm_srli_epi64(encoded, 4);
        encoded_upper = _mm_and_si128(encoded_upper, _mm_set1_epi8(15));
        __m128i encoded_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, encoded_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, encoded_lower);
        __m128i first_nucleotides = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
        __m128i second_nucleotides = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
        _mm_storeu_si128((__m128i *)dest_cursor, first_nucleotides);
        _mm_storeu_si128((__m128i *)(dest_cursor + 16), second_nucleotides);
        encoded_cursor += sizeof(__m128i);
        dest_cursor += 2 * sizeof(__m128i);
    }
    decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
}

Thanks @jmarshall! This is a great improvement codewise!

EDIT: (Of course I will contribute this after this is merged, or feel free to incorporate it right away).

@daviesrob (Member):

When building on my Mac without using ./configure first, I get:

simd.c:63:19: warning: unused function 'cpu_supports_neon' [-Wunused-function]
static inline int cpu_supports_neon(void) {

This is because use of the function is gated by defined __ARM_NEON but its definition is not. Presumably this warning would also appear when building on an ARM variant that doesn't have NEON.

@jmarshall (Member, Author) commented Jul 1, 2024

Oops. I think it was more because on ARM non-configure builds, it wasn't building nibble2base_neon() because HAVE_ATTRIBUTE_CONSTRUCTOR wasn't defined. Fixed now.

Probably in theory the definition of cpu_supports_neon() should be gated by #if defined BUILDING_SIMD_NIBBLE2BASE || defined BUILDING_SIMD_WHATEVER_ELSE || … but life is short and ARM compilers should be supporting building these functions even if the eventual worker CPU doesn't support running them…

@daviesrob (Member):

Cheers, that's fixed the problem.

@jkbonfield merged commit 1b4cda6 into samtools:develop Jul 4, 2024 (9 checks passed)
@jmarshall deleted the simd branch July 4, 2024