
Refactor SIMD code and add ARM Neon nibble2base() implementation #1795

Merged: 8 commits into samtools:develop, Jul 4, 2024

Conversation

@jmarshall (Member) commented Jun 24, 2024

This PR moves nibble2base_ssse3() from an internal header file to a new source file, simd.c, and adds an ARM implementation nibble2base_neon() alongside. Because these functions are always called via a pointer, they are never inlined so there is nothing gained (other than build complexity!) by having them in a header.

I initially translated the SSSE3 algorithm directly to Neon, during which I noticed some tweaks that could be usefully applied to the SSSE3 version — see the “Minor improvements” commit below. I then realised that the Neon implementation could be simplified by using Neon's rather excellent interleaved load/store instructions, which I then replaced by an in-register zip as it is (sadly) faster (on aarch64 at least). (It may be possible to apply a similar approach for Intel via the unpack instruction.)
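The zip-based approach described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the function and table names are made up, and only the full-vector inner step is shown, without the surrounding loop or the scalar tail handling. A scalar stand-in is included so the sketch compiles on non-ARM targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 4-bit-code lookup table, in the style of htslib's seq_nt16_str */
static const char nib2char[16] = "=ACMGRSVTWYHKDBN";

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
/* Convert 16 encoded bytes (32 nucleotides) via table lookups and a zip */
static void nibble2base_zip_sketch(uint8_t *out, const uint8_t *in)
{
    uint8x16_t lookup = vld1q_u8((const uint8_t *)nib2char);
    uint8x16_t enc = vld1q_u8(in);
    uint8x16_t hi = vshrq_n_u8(enc, 4);            /* first nucleotide of each byte  */
    uint8x16_t lo = vandq_u8(enc, vdupq_n_u8(15)); /* second nucleotide of each byte */
    /* Look up both halves, then interleave the two results with an in-register zip */
    uint8x16x2_t zipped = vzipq_u8(vqtbl1q_u8(lookup, hi),
                                   vqtbl1q_u8(lookup, lo));
    vst1q_u8(out,      zipped.val[0]);
    vst1q_u8(out + 16, zipped.val[1]);
}
#else
/* Scalar stand-in with identical behaviour, so the sketch compiles anywhere */
static void nibble2base_zip_sketch(uint8_t *out, const uint8_t *in)
{
    for (int i = 0; i < 16; i++) {
        out[2*i]     = nib2char[in[i] >> 4];
        out[2*i + 1] = nib2char[in[i] & 15];
    }
}
#endif
```

The key point is that `vzipq_u8` interleaves two vectors entirely in registers, producing the final character order without a separate interleaved store such as `vst2q_u8`.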

Indicative timings for various versions of the Neon implementation, on ARM and, for interest, an iteration count on x86_64 that makes the nibble2base_default time comparable — numbers are comparable vertically, but not directly horizontally:

Apple M2  i5 SSSE3
 seconds   seconds

 262.389   916.215   nibble2base_single (i.e., scalar code doing one nucleotide at a time)
 186.283   186.778   nibble2base_default (i.e., scalar code doing two nucleotides at a time)
  36.053    46.700   Ruben's nibble2base_ssse3 algorithm
  24.783             interleaved using vst2q_u8
  19.287             interleaved using vzipq_u8

I moved the existing code to the new source file in a pair of commits marked as authored by @rhpvorderman and @daviesrob, so that git blame preserves the authorship of their untouched lines. Hence ideally this PR would be merged via a merge commit rather than squashed.

rhpvorderman and others added 7 commits June 2, 2024 19:32
This function is always called via a function pointer, hence it is
never inlined so need not be in a header.

[Commit marked as authored by RV so git blame attributes the moved
code to Ruben, as it is unchanged from the original code. Temporarily
include simd.c from the header so this commit builds and runs. -JM]
Also add copyright boilerplate to the new file. [Commit marked as
authored by RMD so git blame attributes the boilerplate and moved
code to Rob, as the latter is unchanged from the original code. -JM]
… file

Build simd.c as a source file, so there is one copy of nibble2base_ssse3()
rather than potentially one per translation unit that uses it.

When everything was static, the names needed no prefix. However now that
there is a non-static function pointer, it needs an htslib-specific
prefix to avoid polluting the namespace. (In libhts.so it will usually
be a non-visible symbol, but in libhts.a it needs a prefix.)
Configure checks for __builtin_cpu_supports("ssse3"), so the check
for __attribute__((target)) might as well be there too. This enables
this internal-only infrastructure to be removed from htslib/hts_defs.h,
which is primarily for defines needed in the public headers.

Add a configure check for __attribute__((constructor)) too, as we will
likely be using it on other SIMD-using platforms as well.
Notably the adjustment to seq_vec_end_ptr stops it from reverting
to the scalar code prematurely: at present it drops back to scalar
when <=32 entries remain, rather than <=31.
As ARM CPU capability instructions are privileged, we also need to
add cpu_supports_neon() and implement it to query for Neon/AdvSIMD
in various platform-dependent ways. (On 32-bit ARM, glibc helpfully
renamed the HWCAP_* macros to HWCAP_ARM_*, so accept both on Linux.)

32-bit ARM GCC erroneously does not define vst1q_u8_x2() (bug 71233,
fixed in v14.1 in this case), so avoid it on 32-bit ARM by writing
interleaved via vst2q_u8() instead.
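A runtime capability probe of the kind this commit describes might look like the sketch below. This is an assumption-laden simplification covering only AArch64 Linux plus compile-time fallbacks; the actual code queries Neon/AdvSIMD in various platform-dependent ways, including the 32-bit ARM `HWCAP_ARM_*` names mentioned above.

```c
#include <assert.h>

/* Sketch of a Neon capability probe. On AArch64 Linux, CPU feature bits
   are exposed via the auxiliary vector, since the CPU ID registers
   themselves are privileged. Elsewhere this sketch falls back to
   compile-time knowledge of the target ISA. */
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
static int cpu_supports_neon(void) {
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;  /* AdvSIMD == Neon */
}
#elif defined(__ARM_NEON)
static int cpu_supports_neon(void) { return 1; }  /* Neon is in the target ISA */
#else
static int cpu_supports_neon(void) { return 0; }  /* not ARM, or unknown OS */
#endif
```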
@jkbonfield (Contributor):

That's impressive. On my intel machine SSSE3 appears to be running about 4x faster than the default version. You're getting 8-9x speed up which is great.

PS. The mythical htslib v2.0 will not use nibble encoding in memory. It's a royal pain in the derriere!

@jmarshall (Member, Author) commented Jun 25, 2024

I updated the timings to values from the final code and test code in this PR (previously the timings were from earlier development code), included the 1-nibble-at-a-time naive scalar code timings, and added some x86_64 timings on a different iteration count that makes the times on one row comparable. (Not totally sure I believe that 916 seconds! And I only ran it once. But it's indicative that the difference between the two scalar versions is much bigger on x86_64.)

It's not really meaningful to compare between the columns, but what it does tell you is this: the 2-nibbles-at-a-time scalar code is a much smaller win over 1-nibble-at-a-time on ARM than on x86_64. So this suggests that the apparent 8-9x speed up might not be that astonishing, because there is a lot left on the table by the 2-at-a-time scalar code on ARM, at least as it is written here. Or something — who knows, it's a nice speed up anyway 😄

@jkbonfield (Contributor):

Yep, very nice and thank you.

I hunted down the origin of nibble2base and traced it back to my initial BAM implementation in io_lib in Jan 2013 :-)
It has since been tarted up a bit, avoiding the need for unaligned-access ifdefs by using memcpy, but it's essentially the same deal. It got retrofitted into the BAM handling code in htslib shortly after CRAM was added.

https://github.com/jkbonfield/io_lib/blob/master/io_lib/bam.c#L3650-L3657
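That scalar approach — a single 256-entry table turning each encoded byte into a pair of characters, copied with memcpy to sidestep unaligned 16-bit stores — can be sketched like this. Names are illustrative, not io_lib's or htslib's, and the lazy table initialisation here is not thread-safe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static const char nib2char[16] = "=ACMGRSVTWYHKDBN";

/* Two nucleotides at a time: expand each encoded byte via a 256-entry
   table of character pairs, writing each pair with memcpy rather than
   an unaligned 16-bit store. */
static void nibble2base_pairs(uint8_t *out, const uint8_t *in, size_t n)
{
    static char pair[256][2];
    static int init = 0;
    if (!init) {                       /* NB: not thread-safe as written */
        for (int b = 0; b < 256; b++) {
            pair[b][0] = nib2char[b >> 4];  /* first nucleotide  */
            pair[b][1] = nib2char[b & 15];  /* second nucleotide */
        }
        init = 1;
    }
    size_t i;
    for (i = 0; i + 1 < n; i += 2)
        memcpy(out + i, pair[in[i / 2]], 2);
    if (i < n)  /* odd length: the final base sits in a high nibble */
        out[i] = nib2char[in[i / 2] >> 4];
}
```

The memcpy of a known 2-byte size compiles down to a single store on mainstream compilers, which is what makes this formulation both portable and fast.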

@rhpvorderman (Contributor):

> It may be possible to apply a similar approach for Intel via the unpack instruction.

I think it would be possible, and possibly faster still. Interesting find!

@rhpvorderman (Contributor) commented Jun 29, 2024

I did a quick test in Sequali; the following code is simpler and all tests pass:

static void
decode_bam_sequence_ssse3(uint8_t *dest, const uint8_t *encoded_sequence, size_t length)
{
    static const uint8_t *nuc_lookup = (const uint8_t *)"=ACMGRSVTWYHKDBN";
    const uint8_t *dest_end_ptr = dest + length;
    uint8_t *dest_cursor = dest;
    const uint8_t *encoded_cursor = encoded_sequence;
    const uint8_t *dest_vec_end_ptr = dest_end_ptr - (2 * sizeof(__m128i) - 1);
    __m128i nuc_lookup_vec = _mm_lddqu_si128((const __m128i *)nuc_lookup);
    /* Nucleotides are encoded 4-bits per nucleotide and stored in 8-bit bytes 
       as follows: |AB|CD|EF|GH|. The 4-bit codes (going from 0-15) can be used 
       together with the pshufb instruction as a lookup table. The most efficient
       way is to use bitwise AND and shift to create two vectors. One with all 
       the upper codes (|A|C|E|G|) and one with the lower codes (|B|D|F|H|). 
       The lookup can then be performed and the resulting vectors can be 
       interleaved again using the unpack instructions. */
    while (dest_cursor < dest_vec_end_ptr) {
        __m128i encoded = _mm_lddqu_si128((const __m128i *)encoded_cursor);
        __m128i encoded_upper = _mm_srli_epi64(encoded, 4);
        encoded_upper = _mm_and_si128(encoded_upper, _mm_set1_epi8(15));
        __m128i encoded_lower = _mm_and_si128(encoded, _mm_set1_epi8(15));
        __m128i nucs_upper = _mm_shuffle_epi8(nuc_lookup_vec, encoded_upper);
        __m128i nucs_lower = _mm_shuffle_epi8(nuc_lookup_vec, encoded_lower);
        __m128i first_nucleotides = _mm_unpacklo_epi8(nucs_upper, nucs_lower);
        __m128i second_nucleotides = _mm_unpackhi_epi8(nucs_upper, nucs_lower);
        _mm_storeu_si128((__m128i *)dest_cursor, first_nucleotides);
        _mm_storeu_si128((__m128i *)(dest_cursor + 16), second_nucleotides);
        encoded_cursor += sizeof(__m128i);
        dest_cursor += 2 * sizeof(__m128i);
    }
    decode_bam_sequence_default(dest_cursor, encoded_cursor, dest_end_ptr - dest_cursor);
}

Thanks @jmarshall! This is a great improvement codewise!

EDIT: (Of course I will contribute this after this is merged, or feel free to incorporate it right away).

@daviesrob (Member):

When building on my Mac without using ./configure first, I get:

simd.c:63:19: warning: unused function 'cpu_supports_neon' [-Wunused-function]
static inline int cpu_supports_neon(void) {

This is because use of the function is gated by defined __ARM_NEON but its definition is not. Presumably this warning would also appear when building on an ARM variant that doesn't have NEON.

@jmarshall (Member, Author) commented Jul 1, 2024

Oops. I think it was more because on ARM non-configure builds, it wasn't building nibble2base_neon() because HAVE_ATTRIBUTE_CONSTRUCTOR wasn't defined. Fixed now.

Probably in theory the definition of cpu_supports_neon() should be gated by #if defined BUILDING_SIMD_NIBBLE2BASE || defined BUILDING_SIMD_WHATEVER_ELSE || … but life is short and ARM compilers should be supporting building these functions even if the eventual worker CPU doesn't support running them…

@daviesrob (Member):

Cheers, that's fixed the problem.

@jkbonfield merged commit 1b4cda6 into samtools:develop Jul 4, 2024 (9 checks passed)
@jmarshall deleted the simd branch July 4, 2024