
updatehtslib against intel_mods #14


Draft: wants to merge 511 commits into base: intel_mods

Conversation

eneskuluk

No description provided.

ubuntolog and others added 30 commits February 28, 2022 13:55
Following a tidy up in 7981bc9 and 0d83a7b, all format strings
in HTSlib are now literals.  Prevent non-literal strings from
coming back by using -Wformat=2 in one of the CI tests, which
enables -Wformat-nonliteral.  This has to be a gcc test, as gcc
excludes functions that take a va_list from the format-nonliteral
warning while clang doesn't, and we need to be able to pass
non-literal format strings to these functions.

While this might be slightly inconvenient for developers, any
annoyances are far outweighed by being able to automatically
detect a class of nasty and otherwise difficult to spot bugs.
The context parameter was omitted, causing the format string
to be used for it instead, and the next one to be used as the
format.  This resulted in not very useful error messages and
possible issues with incorrect interpretation of varargs
parameters.
Double-check that the file coordinates are 1-based unless tabix -0 is
specified.  We now treat a 0 coordinate as a warning.  The
documentation is also more explicit that -0 implies a half-open
coordinate system.

Also improve region parsing.  The code supports regions such as
"x"/"x:" (all of "x"), "x:10-20" "x:-20" (up to pos 20) and "x:20-"
(position 20 onwards).  However as genome coordinates are 1-based and
internally we have 0-based, we subtract one during parsing.  The
"x:-20" is done by negative value detection, but this was also
triggered with the "x:0" region (treated as "up to 1").

Also, illegal regions such as "x:-10-20" were treated as "x:-20".
This is now a hard error.

Fixes samtools#1409
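The region grammar and the 1-based to 0-based conversion described above can be sketched as follows.  This is a hedged illustration, not htslib's real parser (the function name and the decision to reject a 0 coordinate outright, where tabix merely warns, are simplifications for brevity):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Sketch of the region grammar above ("x", "x:", "x:10-20", "x:-20",
// "x:20-"), converting 1-based input to 0-based internal [beg,end).
// Returns -1 on malformed regions such as "x:-10-20".
static int parse_region(const char *s, int64_t *beg, int64_t *end) {
    *beg = 0; *end = INT64_MAX;
    const char *colon = strchr(s, ':');
    if (!colon || !colon[1]) return 0;        // "x" or "x:" means all of "x"
    const char *p = colon + 1;
    char *q;
    if (*p != '-') {                          // leading position present
        long long b = strtoll(p, &q, 10);
        if (q == p || b < 1) return -1;       // junk or 0 is rejected here
        *beg = b - 1;                         // 1-based to 0-based
        if (!*q) { *end = b; return 0; }      // "x:20" is a single position
        if (*q != '-') return -1;
        p = q + 1;
        if (!*p) return 0;                    // "x:20-" means pos 20 onwards
    } else {
        p++;                                  // "x:-20" means up to pos 20
    }
    long long e = strtoll(p, &q, 10);
    if (q == p || *q || e < 1) return -1;     // "x:-10-20" fails: trailing text
    *end = e;
    return *end > *beg ? 0 : -1;
}
```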
Invalid definitions are fixed internally and warnings such as

    [W::bcf_hdr_register_hrec] The definition of Flag "INFO/SNP" is invalid, forcing Number=0

are printed so that downstream analyses can work (e.g. `bcftools merge`).
However, output VCF headers are not fixed.

This could go one step further and also modify the headers.

See also samtools/bcftools#1685
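The forced-Number rule amounts to the following sketch (the names are stand-ins, not htslib's API; the real logic lives in bcf_hdr_register_hrec):

```c
#include <assert.h>

// A Flag by definition carries no value, so any Number other than 0 in its
// header definition is invalid and gets forced to 0 in memory.
enum { TYPE_FLAG = 0, TYPE_INT = 1 };  // stand-ins for htslib's BCF_HT_* values

static int normalise_number(int type, int number, int *warned) {
    if (type == TYPE_FLAG && number != 0) {
        *warned = 1;  // caller would print the [W::bcf_hdr_register_hrec] warning
        return 0;     // forced Number=0; the output header text is untouched
    }
    return number;
}
```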
 - The name tokeniser now validates the stored length in the data
   stream matches the decoded length.
 - Make data types consistent in rans_compress_bound4x16.
 - Fix an endless loop in arith_dynamic and rans4x16pr involving
   X_STRIPE with 0 stripes.
 - Prevent memcpy(dest, NULL, 0) calls.
 - Reject attempts to duplicate non-existent name tokeniser
   streams.
 - Fix possible use of uninitialised memory in
   rans_uncompress_O1_4x16.
 - Improve error detection from fqzcomp's read_array function.
 - Reject fqzcomp parameters with inconsistent "sel" parameters.
 - Fix uninitialised access to element 4095 in r4x8 O0 arrays.
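On the memcpy(dest, NULL, 0) point: passing NULL to memcpy is undefined behaviour even when no bytes are copied, and UBSan reports it.  The fix pattern can be sketched with a guarded wrapper (hypothetical name, not htslib's):

```c
#include <assert.h>
#include <string.h>

// Skip the memcpy call entirely when there is nothing to copy, so a NULL
// source with n == 0 never reaches the library function.
static void *memcpy_nonnull(void *dest, const void *src, size_t n) {
    if (n > 0)
        memcpy(dest, src, n);
    return dest;
}
```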
The previous code here buffered up 1000 BAM records before dispatching
a job to convert to SAM.  We now buffer up either 1000 BAM records or
(estimated) ~240KB of encoded SAM, whichever is hit first.

Note these figures were previously written in macros NB and NM, which
have been replaced with more meaningful names.

I tested it with data having an average length of 148bp (Illumina),
15949bp (ONT) and 13441bp (PB CCS), using the GIAB HG002 chr1 public
data sets.  Maximum lengths were 148, 1044875 and 19288.

/usr/bin/time real time and maxresident figures, before:

               Illumina              ONT               PB
    -@8     43,128/2m43   6,040,568/1m39   1,593,932/1m15
    -@16    70,880/1m42   8,710,132/1m08   3,026,604/0m52
    -@32    70,272/1m29  11,940,036/0m55   5,722,952/0m47
    -@64   190,584/1m21  17,007,840/0m56  10,835,512/0m48

After:

               Illumina              ONT               PB
    -@16    50,208/1m36     696,276/1m09      63,496/0m57
    -@32    86,044/1m21   1,054,524/0m53     109,696/0m44
    -@64   149,024/1m24   1,616,720/0m55     195,676/0m48

The effect on memory (KB) is vast, although it's still a bit higher on
ONT.  This is probably related to the maximum lengths being used
and the reuse of BAM structs (never shrinking them) rather than the
average, so a long tail on the distribution causes memory growth.

We could address that in later updates, but this is still a huge
improvement.

Obviously as we get to very long records, we'll be dispatching very
commonly (maybe every alignment), but I don't yet know how inefficient
the threading becomes then.  Memory usage will grow as we cannot store
half an alignment, but it won't explode so fast.  We may wish to try
larger values of SAM_NBYTES, but note I tried 100KB to 1MB and this
came out fairly optimal on CPU so there was little reason to trade
more memory for CPU.
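The dual-threshold dispatch rule above can be sketched as follows.  The commit names SAM_NBYTES; SAM_NRECS here is a guess at the companion macro's name:

```c
#include <assert.h>
#include <stddef.h>

#define SAM_NRECS  1000        // max buffered BAM records
#define SAM_NBYTES (240*1024)  // max estimated encoded SAM bytes

// Hand the buffered BAM records to a SAM-encoding job when either limit
// is reached, whichever comes first.
static int should_dispatch(int nrecs, size_t est_sam_bytes) {
    return nrecs >= SAM_NRECS || est_sam_bytes >= SAM_NBYTES;
}
```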
Add -s, --random-seed option and a constant default, and seed srand()
with it. Currently test_bcf_sr_sort() is the only user of rand(),
which it uses to seed test_bcf_sr_sort.pl's random number generator.
This isn't permitted by the BAM specification, but was accepted by
earlier HTSlib releases.  62f9909 added code to check the maximum
length.  This now warns at 2GB and hard-fails at 4GB.

Fixes samtools#1420.  Fixes samtools/samtools#1613
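The warn-at-2GB, fail-at-4GB policy can be sketched as below (hypothetical function name; BAM stores sequence length in a 32-bit field, which is what makes 4GB the hard ceiling):

```c
#include <assert.h>
#include <stdint.h>

// Returns 0 if fine, 1 if the length warrants a warning (beyond the spec's
// signed 32-bit limit), -1 if it cannot be represented in BAM at all.
static int check_seq_length(uint64_t len) {
    if (len >= (1ULL << 32)) return -1;  // error: unrepresentable in BAM
    if (len >= (1ULL << 31)) return 1;   // warn: exceeds the spec's limit
    return 0;
}
```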
These define explicit vs implicit coordinates.  They are now part of
the MM specification, but we don't do anything with this data yet.
This PR simply permits them to be parsed without choking, and ignores
the additional markup.  A subsequent PR will improve on this.

Fixes samtools#1418
If the user specifies the wrong reference, the CRAM slice header
MD5sum checks fail.  We now report the SQ line M5 string too so it is
possible to validate against the whole chr in the ref.fa file.

The error message has also been improved to report the reference name
instead of #num.

Finally, we now hint at the likely cause, which counters the
misleading samtools supplied error of "truncated or corrupt" file.

See samtools/samtools#1640.
It's trying to spot error messages starting with lowercase letters,
but in doing so forbids things like "@sq" as it does not start with a
capital letter.
The previous commit permitted these to exist, but didn't make the data
available to the caller.

This extends the API with additional queries to distinguish the
specifics about the modification types present.
Done to fix a problem where clang fails to install due to an
inconsistency in the apt sources used by the ubuntu kinetic (a.k.a.
devel) Docker image.  This means an update to clang via the
kinetic-proposed source makes it uninstallable on that image
until the proposed change makes it to the kinetic one.

Switching to ubuntu:latest means we won't be quite as leading-edge
but it's less likely to break unexpectedly.
This attempted to grow memory by the maximum amount of space a base
modification would take up, but due to a misunderstanding of kstring
it kept adding this to the original size rather than actually growing
the allocated size.

(Probably) fixes samtools/samtools#1652
gcc-12.1 produced a warning that NULL could be passed to
strcmp() via str_class.  I'm not sure if that can actually
happen, but just in case add a check.
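The defensive pattern from that commit, in isolation (class_matches is a hypothetical helper name, not htslib's code):

```c
#include <assert.h>
#include <string.h>

// Guard a possibly-NULL string before strcmp, silencing gcc-12.1's warning
// without changing behaviour for the non-NULL case.
static int class_matches(const char *str_class, const char *want) {
    return str_class != NULL && strcmp(str_class, want) == 0;
}
```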
Update htscodecs to bring in the rANS 32x16 codecs

Add new htscodecs source files and dependencies into htslib
makefiles.  Some htscodecs functions have changed name slightly
so a couple of cram source files are updated to reflect that.

These changes are enough to build the non-SIMD versions of
the new codecs, but don't enable the accelerated versions yet.
Adds the checks necessary to detect x86_64 SIMD support and
turn it on in htscodecs if it's available.

As ssse3, popcnt and sse4.1 are used together, they're tested
for as a group.
For the benefit of htslib embeds that don't want to create and run
a configure script.  Adds a small script that does a similar
job by probing a few compiler options and then outputs makefile
lines to set variables if they succeed.  These lines are added
to the default 'htscodecs.mk' file that gets built if configure
hasn't already made one.  Adding them here means the probing
will be remembered until the next "make distclean".

The script fragment that builds the default 'config.h' checks
to see if the variables have been set, and if so adds the
appropriate 'HAVE_' lines for the feature.
Mainly to ensure that the fuzzer build doesn't start complaining
about unaligned access.
PR samtools#1406 fixed which arguments were used as printf() format strings
in realn_check_tag(), which is a subroutine of sam_prob_realn().
Correct the name of the function affected, and add a note of which
HTSlib releases were affected.
Under mingw clang requires dllexport to be applied to both function
declarations and function definitions.  (Gcc is happy for the
definition only to have the dllexport attribute.)

Note this exports several "internal" functions, including some
apparently unused anywhere (hts_json_*).   Perhaps they were once used
in htslib-plugins, but no more.

I haven't taken the decision to remove these, but it's worth
considering whenever we next make an ABI-breaking change.  Either
that, or stop pretending that they're internal-only when clearly they
are not, and move their API to the external htslib/*.h files instead.
These appear to be somewhat in limbo right now.

Fixes samtools#1433
daviesrob and others added 30 commits July 25, 2023 09:16
partially parsed ones.

It makes little sense for this to exist as a public API when it's only
capable of handling the internal during-SAM-parse situation, and the
changes are relatively minor.

Also fixes an undocumented assumption that end == &in.

Fixes samtools#1650
For formats like BAM and BCF that are normally compressed, report
it explicitly when encountering such a file in raw uncompressed form.
See samtools/samtools#1884 for motivation.
This simply turns the monolithic vcf_parse_format function into a
series of numbered sub-functions whose primary purpose is to localise
the variables to that code block and to make it easier to see the
structure of the tasks being performed.

There is no code optimisation here and the main algorithm is
unchanged, so this is just moving of code from 1 function to multiple
functions.  However it makes the next commit easier to understand as
we're not trying to see a delta mixed in with restructuring.

An unexpected consequence however of making the variables local to
their blocks is that it also speeds up the code.

The first step in separating this code into functions was simply
adding curly braces around each code segment and moving the
function-global variables into their respective blocks.  The
before/after benchmarks on 100,000 lines of a multi-sample G1K VCF
are ("perf stat" cycle counts):

        ORIG            LOCALISED
gcc7    29335942870     27353443704
gcc13   31974757094     31452452908
clang13 31290989382     29508665020

(Benchmarked again after moving to actual functions, but the
difference was tiny in comparison.)
This is the main meat of the VCF read speedup, following on from the
previous code refactoring.  Combined timings testing a very INFO-heavy
single-sample GNOMAD file, a many-sample (approx. 4000) FORMAT-rich
file for different compilers, and the GIAB HG002 VCF truth set are:

INFO heavy          (15-29% speedup)    (34-39% speedup)
                    dev(s)  PR(s)       dev(s)  PR(s)
    clang13         6.29    5.34        2.84    1.85
    gcc13           6.74    5.22        2.93    1.93
    gcc7            7.96    5.65        3.25    1.98

FORMAT heavy        (6-19% speedup)     (18-22% speedup)
                    dev     PR          dev     PR
    clang13         9.17    8.58        5.45    4.48
    gcc13           9.88    8.04        5.08    3.95
    gcc7            9.12    8.33        4.87    3.98

GIAB HG002          (28-29% speedup)    (33-37% speedup)
                    dev     PR          dev     PR
    clang13         12.88   9.30        5.12    3.29
    gcc13           12.04   8.60        4.74    3.19
    gcc7            12.87   9.37        5.32    3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4 respectively.)

Bigger speedups (see first message in PR) were seen on some older
hardware.

Specific optimisations along with estimates of their benefit include,
in approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h).
  No difference on system gcc (gcc7) and clang13, but a couple percent
  gain on gcc13.

- Remove the need for most calls to hts_str2uint by recognising that
  most GT numbers are single digits.  This was 4-5% saving for gcc and
  9-10% on clang.

- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated
  ks_resize checking.  This is a further ~10% speedup.

- Unrolling in bcf_enc_vint to encourage SIMD.

- Improve speed of bgzf_getline and kstrtok via memchr/strchr.

  In tabix timings indexing VCF, the bgzf_getline change is 9-22%
  quicker with clang 13 and 19-25% quicker with gcc 7.  I did
  investigate a manually unrolled 64-bit search, before I remembered
  the existence of memchr (doh!).  This is often faster on clang
  (17-19%), but marginally slower on gcc.  The actual speed up to this
  function however is considerably more (3-4x quicker).

  For interest, I include the equivalent code here, as it may be useful in
  other contexts:

    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
    // 64-bit unrolled delim detection
    #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)
            // Quicker broadcast on clang than bit shuffling in delim
            union {
                uint64_t d8;
                uint8_t d[8];
            } u;
            memset(u.d, delim, 8);
            const uint64_t d8 = u.d8;

            uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
            const int l8 = (fp->block_length-fp->block_offset)/8;
            for (l = 0; l < (l8 & ~3); l+=4) {
                if (haszero(b8[l+0] ^ d8))
                    break;
                if (haszero(b8[l+1] ^ d8)) {
                    l++;
                    break;
                }
                if (haszero(b8[l+2] ^ d8)) {
                    l+=2;
                    break;
                }
                if (haszero(b8[l+3] ^ d8)) {
                    l+=3;
                    break;
                }
            }
            l *= 8;
            for (l += fp->block_offset;
                 l < fp->block_length && buf[l] != delim;
                 l++);

  The analogous kstrtok change is using strchr+strlen instead of memchr
  as we don't know the string end. This makes kstrtok around 150%
  quicker when parsing a single sample VCF.

  When not finding aux->sep in the string, strchr returns NULL rather
  than end of string, so we need an additional strlen to set aux->p.
  However there is also glibc's strchrnul which solves this in a single
  call.  This makes kstrtok another 40% quicker on this test, but
  overall it's not a big bottleneck any more.
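  The strchr-vs-strchrnul point above can be shown in isolation.  A portable equivalent keeping strchrnul's contract (token_end is a hypothetical name, not the kstrtok code itself):

```c
#include <assert.h>
#include <string.h>

// strchr returns NULL when the delimiter is absent, so finding the token end
// needs a second strlen pass; glibc's strchrnul does both in one scan.
static const char *token_end(const char *p, int delim) {
    const char *e = strchr(p, delim);
    return e ? e : p + strlen(p);   // point at the delimiter or at the NUL
}
```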

- Use strchr in vcf_parse_info.

  This is a major speed increase over manual searching on Linux.
  TODO: is this just glibc?  Eg libmusl speeds, etc?  Other OSes?

  It saves about 33% of time in vcf_parse (vcf_parse_info inlined to it)
  with gcc.  Even more with clang.  The total speed gain on a single
  sample VCF view (GIAB truth set) is 12-19% fewer cycles:

- Minor "GT" check improvement.  This has no real effect on gcc13 and
  clang13, but the system gcc (gcc7) speeds up single-sample VCF
  decoding by 7%.

- Speed up the unknown value check (strcmp(p, ".")).  Helps gcc7 the
  most (9%), with gcc13/clang13 seeing 3-4% gains.

- Speed up vcf_parse_format_max3.

  This is the first parse through the FORMAT fields.  Ideally we'd merge
  this and fill5 (the other parse through), but that is harder due to
  the data pivot / rotate.

  For now we just optimise the existing code path.  Instead of a
  laborious switch character by character, we have an initial tight loop
  to find the first meta-character and then a switch to do char
  dependant code.

  This is 5% to 13% speed up depending on data set.

- Remove kputc and minimise resize for bcf_enc_int1.
  3-8% speedup depending on data / compiler.

- Use memcmp instead of strcmp for "END" and ensure we have room.
  Also memset over explicit nulling of arrays.

- Force BCF header dicts to be larger than needed.

  This is a tactic to reduce hash collisions due to the use of overly
  simple hash functions.  It seems to typically be around 3-8% speed gain.

- Restructure of main vcf_parse function.

  This can speed things up by 6-7% on basic single-sample files.

  The previous loop caused lots of branch prediction misses due to the
  counter 'i' being used to do 8 different parts of code depending on
  token number.  Additionally it's got better error checking now as
  previously running out of tokens early just did a return 0 rather than
  complaining about missing columns.
The two we keep are the internal while loop to find the next ; or =
instead of iterating back in the outer for loop, and memcmp instead of
strcmp for "END".

The strchr changes do help glibc on excessively long INFO tokens, seen
in the GIAB truth set and GNOMAD files, but they have no impact on
most mainstream VCF outputs.  Furthermore, other C libraries, such as
MUSL, are considerably slowed down by the use of strchr.  Hence this
isn't a particularly robust or warranted change.
See samtools/samtools#813

(Will close issue with man page update which needs applying there)
sam_hdr_remove_lines and sam_hdr_remove_except are very slow when
removing large numbers of header lines due to the effort of deleting
hash entries one by one.  This commit instead rebuilds the hashes at
the end of the deletion process.
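The speedup idea in miniature (toy structures, not htslib's khash-based code): mark lines dead during bulk removal, then rebuild the lookup index in one pass at the end instead of deleting hash entries one by one.

```c
#include <assert.h>
#include <string.h>

typedef struct { const char *key; int dead; } line_t;

// One pass over the lines replaces n individual hash-entry deletions.
static int rebuild_index(const line_t *lines, int n, const char **index_out) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (!lines[i].dead)
            index_out[m++] = lines[i].key;
    return m;  // surviving line count
}
```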
As it's difficult for fuzzers to get CRC checksums right, prevent
CRC checks from failing in fuzzing build mode.  This should
expand the number of code paths that the fuzzer can explore.
* Limit max. IDX numbers in VCF to prevent large allocations
* Limit max. sum of shared_len + indiv_len in bcf_read1_core()
* Limit max. header size in bcf_hdr_read()
* Limit max. header size in cram_read_SAM_hdr()
* Limit max. header size in bam_hdr_read()
* Limit max. n_targets in bam_hdr_read()
* Limit max. number of landmarks in cram_read_container()
* Limit max. number of huffman codes in cram_huffman_decode_init()
* Limit max. record size in sam_realloc_bam_data()

Adds a header where the memory limit for fuzzing can be set.  This
involves a bit more work, but there is benefit to having this in
one clearly defined place.
... so that the following code
```
static inline void u32_to_le(uint32_t val, uint8_t *buf) {
    *((uint32_u *) buf) = val;
...
```

will not cause -fsanitize=alignment failures when building with Clang.
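The fix can be sketched as below (the attribute spelling is assumed from typical hts_endian.h style, and only the little-endian-host path is shown): declaring the helper type with alignment 1 makes the store legal at any address, so -fsanitize=alignment stays quiet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Alignment-1, aliasing-safe view of a uint32_t.
typedef uint32_t uint32_u __attribute__((aligned(1), may_alias));

static inline void u32_to_le(uint32_t val, uint8_t *buf) {
    *((uint32_u *) buf) = val;  // byte-safe: no 4-byte alignment assumed
    // (the real code also handles big-endian hosts; omitted here)
}
```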
Enables repeated hopen("-") / hclose() / hopen("-") where previously
the underlying STDIN/OUT_FILENO would have been prematurely closed.

This means that stdout is never really closed and hclose()'s return
value does not reflect closing the underlying fd. Hence particularly
paranoid programs that have written significant data to stdout will
want to close and check it themselves just before the end of main():

  if (fclose(stdout) != 0 && errno != EBADF) perror("closing stdout");

(Ignore EBADF as stdout may already have been closed and checked,
e.g. if the program has been linked to an earlier HTSlib where
hclose() still closes STDOUT_FILENO.)
We don't need to dup(STDOUT_FILENO) now that hclose()/hts_close()
no longer irretrievably close stdout. Instead fclose(stdout)
explicitly just before the end of main() as it is a last chance
to observe I/O errors.
…1671)

If we have an unofficial CRAM v4.0 encoded file and are attempting to
use non-byte based data in an EXTERNAL encoded block, then we didn't
free the cram_codecs structure when failing to parse and initialise.

Credit to OSS-Fuzz
Fixes oss-fuzz 62144
Reimplemented to reduce complexities.
Update to avoid test failure in windows
Updated to work with shared stdin/out hfile
I suspect this was initially hard-wired on, but was later made
something we explicitly enable, while forgetting to add that code to
htslib.

On Illumina it made little difference, but wasn't detrimental.
I need bigger data sets, but they're mostly on unavailable systems
right now.  Small tests demonstrate utility though, specifically on
decode speeds.

For other platforms:

Ultima Genomics
===============

Orig

real    0m25.784s
user    0m24.506s
sys     0m1.189s

real    0m9.155s
user    0m7.775s
sys     0m1.379s

RANS_ORDER_SIMD_AUTO

real    0m24.987s
user    0m23.699s
sys     0m1.219s

real    0m8.097s
user    0m6.635s
sys     0m1.461s

That's 13% quicker decode and 3% quicker encode.

It's mostly QS and tags:

$ ~/samtools/samtools cram-size -v _.cram|grep 32x16
BLOCK       10       617823        77895  12.61% r32x16-o1
BLOCK       12    911236491    188134803  20.65% r32x16-o1R  QS
BLOCK       27       232221        38816  16.72% r32x16-o0   FC
BLOCK       31        54067        10718  19.82% r32x16-o0   BS
BLOCK  7614554    917596491     50148593   5.47% r32x16-o1   t0Z
BLOCK  7630914    931877007    108982153  11.69% r32x16-o1R  tpB

ONT
===

Orig

real    0m3.018s
user    0m2.854s
sys     0m0.130s

real    0m0.578s
user    0m0.538s
sys     0m0.040s

RANS_ORDER_SIMD_AUTO

real    0m2.912s
user    0m2.740s
sys     0m0.120s

real    0m0.500s
user    0m0.430s
sys     0m0.070s

That's 16% quicker decode and 4% quicker encode, but sample size is
admittedly tiny for both tests.

File size changes are under 0.1% growth, mainly due to 32 rANS states
instead of 4.  The RANS_ORDER_SIMD_AUTO flag basically enables the
32-way rANS only if the block is sufficiently large (>50kb), so the
extra 112-byte state overhead isn't significant.
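The AUTO rule described above amounts to a size-gated flag selection, sketched here (the flag value and function name are illustrative, not htscodecs' real constants):

```c
#include <assert.h>

#define DEMO_ORDER_X32 0x04  // stand-in for the real 32-way rANS flag bit

// Switch to the 32-state rANS only when the block is large enough (>50kb)
// to amortise the extra ~112 bytes of rANS state overhead.
static int rans_order_for_block(unsigned long blk_size, int order) {
    return blk_size > 50000 ? (order | DEMO_ORDER_X32) : order;
}
```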