forked from Intel-HLS/htslib
updatehtslib against intel_mods #14
Draft
eneskuluk wants to merge 511 commits into intel_mods from ek_updatehtslib
Conversation
Following a tidy-up in 7981bc9 and 0d83a7b, all format strings in HTSlib are now literals. Prevent non-literal strings from coming back by using -Wformat=2 in one of the CI tests, which enables -Wformat-nonliteral. This has to be a gcc test because gcc excludes functions that take a va_list from the format-nonliteral warning while clang doesn't, and we need to be able to pass non-literal format strings to these functions. While this might be slightly inconvenient for developers, any annoyances are far outweighed by being able to automatically detect a class of nasty and otherwise difficult-to-spot bugs.
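For illustration, a minimal sketch of the va_list exemption (hypothetical function names, not HTSlib API): the va_list consumer below necessarily takes a non-literal format string, which gcc's -Wformat-nonliteral tolerates but clang's does not.

```c
#include <stdarg.h>
#include <stdio.h>

/* va_list consumer: its fmt is inherently non-literal at this point. */
static void verror(const char *fmt, va_list ap)
    __attribute__((format(printf, 1, 0)));

static void verror(const char *fmt, va_list ap) {
    vfprintf(stderr, fmt, ap);
}

/* Variadic wrapper: with -Wformat=2, every direct caller of error()
 * must supply a literal format string or the build fails. */
static void error(const char *fmt, ...)
    __attribute__((format(printf, 1, 2)));

static void error(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    verror(fmt, ap);
    va_end(ap);
}
```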
The context parameter was omitted, causing the format string to be used in its place and the next argument to be used as the format. This resulted in unhelpful error messages and possible issues with incorrect interpretation of varargs parameters.
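The bug pattern, sketched with a hypothetical reporting function:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical signature of the kind of function involved. */
static void report_error(const char *context, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    fprintf(stderr, "[%s] ", context);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}

static void example(const char *key) {
    /* Bug: with the context omitted, the format string is consumed as
     * the context and 'key' is misinterpreted as the format string:
     *     report_error("bad key %s\n", key);
     */
    report_error("hdr_parse", "bad key %s\n", key);   /* fixed call */
}
```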
Double-check that file coordinates are 1-based unless tabix -0 is specified. We treat a 0 coordinate as a warning now. The documentation is also more explicit that -0 also implies a half-open coordinate system. Also improve region parsing. The code supports regions such as "x"/"x:" (all of "x"), "x:10-20", "x:-20" (up to position 20) and "x:20-" (position 20 onwards). However, as genome coordinates are 1-based externally and 0-based internally, we subtract one during parsing. The "x:-20" form is handled by negative-value detection, but this was also triggered by the "x:0" region (treated as "up to 1"). Also, illegal regions such as "x:-10-20" were treated as "x:-20". This is now a hard error. Fixes samtools#1409
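A stand-alone sketch of those rules, assuming a hypothetical helper (the real logic lives in HTSlib's region-parsing code); it converts the text after the colon into 0-based, half-open [beg,end):

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse "10-20", "-20", "20-", "20" or "0"; returns 0 on success. */
static int parse_range(const char *s, long *beg, long *end) {
    char *p;
    if (*s == '-') {                      /* "x:-20" means "up to 20"       */
        *beg = 0;
        *end = strtol(s + 1, &p, 10);
        return *p == 0 ? 0 : -1;          /* "x:-10-20" is now a hard error */
    }
    *beg = strtol(s, &p, 10);
    if (*beg == 0)
        fprintf(stderr, "Warning: 0 coordinate in 1-based region\n");
    else
        (*beg)--;                         /* 1-based input, 0-based internal */
    if (*p == 0)     { *end = *beg + 1;  return 0; }   /* "x:20"  */
    if (*p != '-')   return -1;
    if (p[1] == 0)   { *end = LONG_MAX;  return 0; }   /* "x:20-" */
    *end = strtol(p + 1, &p, 10);
    return *p == 0 ? 0 : -1;
}
```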
Invalid definitions are fixed internally, and warnings such as [W::bcf_hdr_register_hrec] "The definition of Flag "INFO/SNP" is invalid, forcing Number=0" are printed so that downstream analyses can work (e.g. `bcftools merge`). However, output VCF headers are not fixed. This could go one step further and also modify the headers. See also samtools/bcftools#1685
- The name tokeniser now validates the stored length in the data stream matches the decoded length.
- Make data types consistent in rans_compress_bound4x16.
- Fix an endless loop in arith_dynamic and rans4x16pr involving X_STRIPE with 0 stripes.
- Prevent memcpy(dest, NULL, 0) calls.
- Reject attempts to duplicate non-existent name tokeniser streams.
- Fix possible use of uninitialised memory in rans_uncompress_O1_4x16.
- Improve error detection from fqzcomp's read_array function.
- Reject fqzcomp parameters with inconsistent "sel" parameters.
- Fix uninitialised access to element 4095 in r4x8 O0 arrays.
The previous code here buffered up 1000 BAM records before dispatching a job to convert to SAM. We now buffer up either 1000 BAM records or an estimated ~240KB of encoded SAM, whichever is hit first. Note these figures were previously written in macros NB and NM, which have been replaced with more meaningful names.

I tested it with data having average lengths of 148bp (Illumina), 15949bp (ONT) and 13441bp (PB CCS), using the GIAB HG002 chr1 public data sets. Maximum lengths were 148, 1044875 and 19288 respectively.

/usr/bin/time real-time and maxresident figures, before:

```
        Illumina         ONT                PB
-@8      43,128/2m43      6,040,568/1m39     1,593,932/1m15
-@16     70,880/1m42      8,710,132/1m08     3,026,604/0m52
-@32     70,272/1m29     11,940,036/0m55     5,722,952/0m47
-@64    190,584/1m21     17,007,840/0m56    10,835,512/0m48
```

After:

```
        Illumina         ONT                PB
-@16     50,208/1m36        696,276/1m09       63,496/0m57
-@32     86,044/1m21      1,054,524/0m53      109,696/0m44
-@64    149,024/1m24      1,616,720/0m55      195,676/0m48
```

The effect on memory (KB) is vast, although it's still a bit higher on ONT. This is probably related to the maximum lengths being used and the reuse of BAM structs (never shrinking them) rather than the average, so a long tail on the distribution causes memory growth. We could address that in later updates, but this is still a huge improvement. Obviously as we get to very long records we'll be dispatching very commonly (maybe every alignment), but I don't yet know how inefficient the threading becomes then. Memory usage will grow as we cannot store half an alignment, but it won't explode so fast. We may wish to try larger values of SAM_NBYTES, but note I tried 100KB to 1MB and this came out fairly optimal on CPU, so there was little reason to trade more memory for CPU.
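A minimal sketch of the dual-limit batching rule, under assumed names (the text names SAM_NBYTES; SAM_NREC and the helper are hypothetical):

```c
#include <stddef.h>

enum {
    SAM_NREC   = 1000,        /* max BAM records per dispatched batch        */
    SAM_NBYTES = 240 << 10    /* ~max estimated encoded SAM bytes per batch  */
};

/* Accumulate one record; dispatch when either limit is reached. */
static void add_record(int *nrec, size_t *nbytes, size_t rec_sam_estimate) {
    *nrec   += 1;
    *nbytes += rec_sam_estimate;
    if (*nrec >= SAM_NREC || *nbytes >= SAM_NBYTES) {
        /* ...hand the batch to the thread pool for BAM->SAM encoding... */
        *nrec = 0;
        *nbytes = 0;
    }
}
```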
Add -s, --random-seed option and a constant default, and seed srand() with it. Currently test_bcf_sr_sort() is the only user of rand(), which it uses to seed test_bcf_sr_sort.pl's random number generator.
This isn't permitted by the BAM specification, but was accepted by earlier htslib releases. 62f9909 added code to check the maximum length. This now gives a warning at 2GB and a hard failure at 4GB. Fixes samtools#1420. Fixes samtools/samtools#1613
These define explicit vs implicit coordinates. They are now part of the MM specification, but we don't do anything with this data yet. This PR simply permits them to be parsed without choking, and ignores the additional markup. A subsequent PR will improve on this. Fixes samtools#1418
If the user specifies the wrong reference, the CRAM slice header MD5sum checks fail. We now report the SQ line M5 string too, so it is possible to validate against the whole chromosome in the ref.fa file. The error message has also been improved to report the reference name instead of #num. Finally, we now hint at the likely cause, which counters the misleading samtools-supplied error of a "truncated or corrupt" file. See samtools/samtools#1640.
It's trying to spot error messages starting with lowercase letters, but in doing so forbids things like "@sq", as that isn't capitalised.
The previous commit permitted these to exist, but didn't make the data available to the caller. This extends the API with additional queries to distinguish the specifics about the modification types present.
Done to fix a problem where clang fails to install due to an inconsistency in the apt sources used by the Ubuntu kinetic (a.k.a. devel) Docker image. This means an update to clang via the kinetic-proposed source makes it uninstallable on that image until the proposed change makes it into the kinetic one. Switching to ubuntu:latest means we won't be quite as leading-edge, but it's less likely to break unexpectedly.
This attempted to grow memory by the maximum amount of space a base modification would take up, but due to a misunderstanding of kstring it kept adding this to the original size rather than actually growing the allocated size. (Probably) fixes samtools/samtools#1652
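A sketch of the kstring misuse and its fix (MAX_MOD_LEN and the surrounding loop are illustrative; ks_resize() grows the *capacity* to at least its second argument):

```c
#include "htslib/kstring.h"

#define MAX_MOD_LEN 64   /* hypothetical worst-case bytes per base modification */

static int append_mods(kstring_t *ks, int nmods) {
    size_t start = ks->l;
    for (int i = 0; i < nmods; i++) {
        /* The bug: capacity pinned relative to the original length, so
         * later appends could overflow the allocation:
         *     ks_resize(ks, start + MAX_MOD_LEN);
         */
        if (ks_resize(ks, ks->l + MAX_MOD_LEN) < 0)  /* grow past current end */
            return -1;
        /* ...append up to MAX_MOD_LEN bytes here, advancing ks->l... */
    }
    (void)start;
    return 0;
}
```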
gcc-12.1 produced a warning that NULL could be passed to strcmp() via str_class. I'm not sure if that can actually happen, but just in case add a check.
Update htscodecs to bring in the rANS 32x16 codecs. Add the new htscodecs source files and dependencies into the htslib makefiles. Some htscodecs functions have changed name slightly, so a couple of cram source files are updated to reflect that. These changes are enough to build the non-SIMD versions of the new codecs, but don't enable the accelerated versions yet.
Adds the checks necessary to detect x86_64 SIMD support and turn it on in htscodecs if it's available. As ssse3, popcnt and sse4.1 are used together, they're tested for as a group.
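A sketch of such a grouped capability test using the gcc/clang builtins (the function name is illustrative; the real detection also happens at configure/make time):

```c
/* Probe the three features as one unit, since the code uses them together. */
static int have_sse4_group(void) {
    __builtin_cpu_init();   /* needed before __builtin_cpu_supports on older gcc */
    return __builtin_cpu_supports("ssse3")
        && __builtin_cpu_supports("popcnt")
        && __builtin_cpu_supports("sse4.1");
}
```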
For the benefit of htslib embeds that don't want to create and run a configure script. Adds a small script that does a similar job by probing a few compiler options and then outputting makefile lines to set variables if they succeed. These lines are added to the default 'htscodecs.mk' file that gets built if configure hasn't already made one. Adding them here means the probing will be remembered until the next "make distclean". The script fragment that builds the default 'config.h' checks to see if the variables have been set, and if so adds the appropriate 'HAVE_' lines for each feature.
Mainly to ensure that the fuzzer build doesn't start complaining about unaligned access.
PR samtools#1406 fixed which arguments were used as printf() format strings in realn_check_tag(), which is a subroutine of sam_prob_realn(). Correct the name of the function affected, and add a note of which HTSlib releases were affected.
Under mingw, clang requires dllexport to be applied to both function declarations and function definitions. (Gcc is happy for the definition only to have the dllexport attribute.) Note this exports several "internal" functions, including some apparently unused anywhere (hts_json_*). Perhaps they were once used in htslib-plugins, but no more. I haven't taken the decision to remove these, though it's worth considering for whenever we next do an ABI-breaking change. Either that, or stop pretending that they're internal-only when clearly they are not, and move their API to the external htslib/*.h files instead. These appear to be somewhat in limbo right now. Fixes samtools#1433
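A minimal sketch of the pattern (the export macro and function are illustrative, not the real HTSlib macros):

```c
/* Simplified export macro; real builds gate this on _WIN32, shared vs
 * static linkage, and so on. */
#ifdef _WIN32
#define HTS_EXPORT __declspec(dllexport)
#else
#define HTS_EXPORT
#endif

HTS_EXPORT int hts_demo_count(void);     /* clang on mingw needs it here...   */

HTS_EXPORT int hts_demo_count(void) {    /* ...as well as here (gcc is happy
                                            with the definition alone) */
    return 0;
}
```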
partially parsed ones. It makes little sense for this to exist as a public API when it's only capable of handling the internal during-SAM-parse situation, and the changes are relatively minor. Also fixes an undocumented assumption that end == &in. Fixes samtools#1650
For formats like BAM and BCF that are normally compressed, report it explicitly when encountering such a file that is raw and uncompressed. See samtools/samtools#1884 for motivation.
This simply turns the monolithic vcf_parse_format function into a series of smaller sub-functions whose primary purpose is to localise the variables to that code block and to make it easier to see the structure of the tasks being performed. There is no code optimisation here and the main algorithm is unchanged, so this is just movement of code from one function to multiple functions. However it makes the next commit easier to understand, as we're not trying to see a delta mixed in with restructuring.

An unexpected consequence of making the variables local to their blocks is that it also speeds up the code. The first step in separating this code into functions was simply adding curly braces around each code segment and moving the function-global variables into their respective blocks. The before/after benchmarks on 100,000 lines of a multi-sample G1K VCF are ("perf stat" cycle counts):

```
         ORIG         LOCALISED
gcc7     29335942870  27353443704
gcc13    31974757094  31452452908
clang13  31290989382  29508665020
```

Benchmarked again after moving to actual functions, but the difference was tiny in comparison.
This is the main meat of the VCF read speedup, following on from the previous code refactoring. Combined timings testing a very INFO-heavy single-sample GNOMAD file, a many-sample (approx. 4000) FORMAT-rich file, and the GIAB HG002 VCF truth set, for different compilers, are:

```
INFO heavy     (15-29% speedup)    (34-39% speedup)
               dev(s)   PR(s)      dev(s)   PR(s)
clang13         6.29     5.34       2.84     1.85
gcc13           6.74     5.22       2.93     1.93
gcc7            7.96     5.65       3.25     1.98

FORMAT heavy   (6-19% speedup)     (18-22% speedup)
               dev      PR         dev      PR
clang13         9.17     8.58       5.45     4.48
gcc13           9.88     8.04       5.08     3.95
gcc7            9.12     8.33       4.87     3.98

GIAB HG002     (28-29% speedup)    (33-37% speedup)
               dev      PR         dev      PR
clang13        12.88     9.30       5.12     3.29
gcc13          12.04     8.60       4.74     3.19
gcc7           12.87     9.37       5.32     3.34
```

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4 respectively.) Bigger speedups (see first message in PR) were seen on some older hardware.

Specific optimisations, along with estimates of their benefit, include, in approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h). No difference on system gcc (gcc7) and clang13, but a couple of percent gain on gcc13.

- Remove the need for most calls to hts_str2uint by recognising that most GT numbers are single digits. This was a 4-5% saving for gcc and 9-10% on clang.

- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated ks_resize checking. This is a further ~10% speedup.

- Unrolling in bcf_enc_vint to encourage SIMD.

- Improve speed of bgzf_getline and kstrtok via memchr/strchr. In tabix timings indexing VCF, the bgzf_getline change is 9-22% quicker with clang 13 and 19-25% quicker with gcc 7. I did investigate a manually unrolled 64-bit search before I remembered the existence of memchr (doh!). This is often faster on clang (17-19%), but marginally slower on gcc. The actual speed-up to this function however is considerably more (3-4x quicker). For interest, I include the equivalent code here, as it may be useful in other contexts:

  ```c
  #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
      // 64-bit unrolled delim detection
      #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)

      // Quicker broadcast on clang than bit shuffling in delim
      union { uint64_t d8; uint8_t d[8]; } u;
      memset(u.d, delim, 8);
      const uint64_t d8 = u.d8;

      uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
      const int l8 = (fp->block_length - fp->block_offset)/8;
      for (l = 0; l < (l8 & ~3); l+=4) {
          if (haszero(b8[l+0] ^ d8)) break;
          if (haszero(b8[l+1] ^ d8)) { l++;  break; }
          if (haszero(b8[l+2] ^ d8)) { l+=2; break; }
          if (haszero(b8[l+3] ^ d8)) { l+=3; break; }
      }
      l *= 8;
      for (l += fp->block_offset; l < fp->block_length && buf[l] != delim; l++);
  #endif
  ```

  The analogous kstrtok change is using strchr+strlen instead of memchr, as we don't know the string end. This makes kstrtok around 150% quicker when parsing a single-sample VCF. When not finding aux->sep in the string, strchr returns NULL rather than the end of the string, so we need an additional strlen to set aux->p. However there is also glibc's strchrnul, which solves this in a single call. This makes kstrtok another 40% quicker on this test, but overall it's not a big bottleneck any more (a minimal comparison sketch follows after this list).

- Use strchr in vcf_parse_info. This is a major speed increase over manual searching on Linux. TODO: is this just glibc? E.g. libmusl speeds, etc? Other OSes? It saves about 33% of the time in vcf_parse (vcf_parse_info is inlined into it) with gcc; even more with clang. The total speed gain on a single-sample VCF view (GIAB truth set) is 12-19% fewer cycles.

- Minor "GT" check improvement. This has no real effect on gcc13 and clang13, but the system gcc (gcc7) speeds up single-sample VCF decoding by 7%.

- Speed up the unknown-value check (strcmp(p, ".")). Helps gcc7 the most (9%), with gcc13/clang13 in the 3-4% range.

- Speed up vcf_parse_format_max3. This is the first parse through the FORMAT fields. Ideally we'd merge this and fill5 (the other parse through), but that is harder due to the data pivot / rotate, so for now we just optimise the existing code path. Instead of a laborious character-by-character switch, we have an initial tight loop to find the first meta-character and then a switch to do character-dependent code. This is a 5% to 13% speed-up depending on data set.

- Remove kputc and minimise resize for bcf_enc_int1. 3-8% speedup depending on data / compiler.

- Use memcmp instead of strcmp for "END" and ensure we have room. Also memset over explicit nulling of arrays.

- Force BCF header dicts to be larger than needed. This is a tactic to reduce hash collisions due to the use of overly simple hash functions. It seems to typically be around a 3-8% speed gain.

- Restructure of the main vcf_parse function. This can speed things up by 6-7% on basic single-sample files. The previous loop caused lots of branch-prediction misses due to the counter 'i' being used to do 8 different parts of code depending on token number. Additionally it's got better error checking now, as previously running out of tokens early just did a return 0 rather than complaining about missing columns.
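For comparison, a minimal illustration of the strchr+strlen versus strchrnul forms discussed in the list above (strchrnul is a glibc extension; the helper name is illustrative):

```c
#define _GNU_SOURCE          /* for strchrnul */
#include <string.h>

/* Find the end of the token starting at p, delimited by 'delim'. */
static const char *token_end(const char *p, int delim) {
    /* Portable form: a second pass is needed when the delimiter is absent. */
    const char *end = strchr(p, delim);
    if (!end)
        end = p + strlen(p);
    return end;

    /* glibc form: one call, returning a pointer to the delimiter or to
     * the terminating NUL:
     *     return strchrnul(p, delim);
     */
}
```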
The two we keep are the internal while loop to find the next ; or = instead of iterating back in the outer for loop, and memcmp instead of strcmp for "END". The strchr changes do help glibc on excessively long INFO tokens, seen in the GIAB truth set and GNOMAD files, but they have no impact on most mainstream VCF outputs. Furthermore, other C libraries, such as MUSL, are considerably slowed down by the use of strchr. Hence this isn't a particularly robust or warranted change.
See samtools/samtools#813 (Will close issue with man page update which needs applying there)
sam_hdr_remove_lines and sam_hdr_remove_except are very slow when removing large numbers of header lines due to the effort of deleting hash entries one by one. This commit instead rebuilds the hashes at the end of the deletion process.
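A hedged sketch of that strategy using khash (the map type, key layout and function are illustrative, not the actual sam_hdr internals):

```c
#include "htslib/khash.h"

KHASH_MAP_INIT_STR(lines, int)   /* hypothetical: header key -> line index */

/* Rather than kh_del() per removed line (slow for large deletions),
 * clear the hash once and re-insert only the surviving lines. */
static void rebuild_line_hash(khash_t(lines) *h, char **keys,
                              const int *live, int n) {
    kh_clear(lines, h);                  /* drop every entry in one go */
    for (int i = 0; i < n; i++) {
        if (!live[i]) continue;
        int ret;
        khint_t k = kh_put(lines, h, keys[i], &ret);
        kh_val(h, k) = i;
    }
}
```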
As it's difficult for fuzzers to get CRC checksums right, prevent CRC checks from failing in fuzzing build mode. This should expand the number of code paths that the fuzzer can explore.
* Limit max. IDX numbers in VCF to prevent large allocations
* Limit max. sum of shared_len + indiv_len in bcf_read1_core()
* Limit max. header size in bcf_hdr_read()
* Limit max. header size in cram_read_SAM_hdr()
* Limit max. header size in bam_hdr_read()
* Limit max. n_targets in bam_hdr_read()
* Limit max. number of landmarks in cram_read_container()
* Limit max. number of huffman codes in cram_huffman_decode_init()
* Limit max. record size in sam_realloc_bam_data()

Adds a header where the memory limit for fuzzing can be set. This involves a bit more work, but there is benefit to having this in one clearly defined place.
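A minimal sketch of what such a header might look like, assuming illustrative names (OSS-Fuzz builds do define FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION; the guard, macro name and value here are hypothetical):

```c
#ifndef HTS_FUZZ_SETTINGS_H   /* hypothetical header name */
#define HTS_FUZZ_SETTINGS_H

#include <stdint.h>

#ifdef FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION
/* Cap large allocations so the fuzzer exercises error paths, not OOM. */
#define HTS_FUZZ_ALLOC_LIMIT (256 * 1024 * 1024)
#else
#define HTS_FUZZ_ALLOC_LIMIT SIZE_MAX
#endif

#endif /* HTS_FUZZ_SETTINGS_H */
```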
... so that the following code

```c
static inline void u32_to_le(uint32_t val, uint8_t *buf) {
    *((uint32_u *) buf) = val;
    ...
```

will not cause -fsanitize=alignment failures when building with Clang.
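One way to make that cast well-defined is a reduced-alignment typedef, which is the shape of fix hts_endian.h uses (the exact attributes in the real header may differ; may_alias is my addition here):

```c
#include <stdint.h>

/* On a typedef, the aligned attribute may *decrease* alignment, so this
 * declares a uint32_t the compiler treats as 1-byte aligned. */
typedef uint32_t uint32_u __attribute__((aligned(1), may_alias));

static inline void u32_to_le(uint32_t val, uint8_t *buf) {
    *((uint32_u *) buf) = val;   /* no longer an unaligned-access trap */
}
```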
Enables repeated hopen("-") / hclose() / hopen("-") where previously the underlying STDIN/OUT_FILENO would have been prematurely closed. This means that stdout is never really closed and hclose()'s return value does not reflect closing the underlying fd. Hence particularly paranoid programs that have written significant data to stdout will want to close and check it themselves just before the end of main():

```c
if (fclose(stdout) != 0 && errno != EBADF)
    perror("closing stdout");
```

(Ignore EBADF as stdout may already have been closed and checked, e.g. if the program has been linked to an earlier HTSlib where hclose() still closes STDOUT_FILENO.)
We don't need to dup(STDOUT_FILENO) now that hclose()/hts_close() no longer irretrievably close stdout. Instead, fclose(stdout) explicitly just before the end of main(), as it is the last chance to observe I/O errors.
…left out to make test pass for now
…1671) If we have an unofficial CRAM v4.0 encoded file and are attempting to use non-byte-based data in an EXTERNAL-encoded block, then we didn't free the cram_codecs structure when failing to parse and initialise. Credit to OSS-Fuzz. Fixes oss-fuzz 62144.
Reimplemented to reduce complexity. Updated to avoid a test failure on Windows. Updated to work with the shared stdin/out hfile.
I suspect this was initially hard-coded as on, but later made something we explicitly enable, forgetting to add that code into htslib. On Illumina it made little difference, but wasn't detrimental. I need bigger data sets, but they're mostly on unavailable systems right now. Small tests demonstrate utility though, specifically on decode speeds. For other platforms:

Ultima Genomics
===============

```
Orig
real    0m25.784s
user    0m24.506s
sys     0m1.189s

real    0m9.155s
user    0m7.775s
sys     0m1.379s

RANS_ORDER_SIMD_AUTO
real    0m24.987s
user    0m23.699s
sys     0m1.219s

real    0m8.097s
user    0m6.635s
sys     0m1.461s
```

That's 13% quicker decode and 3% quicker encode. It's mostly QS and tags:

```
$ ~/samtools/samtools cram-size -v _.cram | grep 32x16
BLOCK      10     617823      77895  12.61%  r32x16-o1
BLOCK      12  911236491  188134803  20.65%  r32x16-o1R  QS
BLOCK      27     232221      38816  16.72%  r32x16-o0   FC
BLOCK      31      54067      10718  19.82%  r32x16-o0   BS
BLOCK 7614554  917596491   50148593   5.47%  r32x16-o1   t0Z
BLOCK 7630914  931877007  108982153  11.69%  r32x16-o1R  tpB
```

ONT
===

```
Orig
real    0m3.018s
user    0m2.854s
sys     0m0.130s

real    0m0.578s
user    0m0.538s
sys     0m0.040s

RANS_ORDER_SIMD_AUTO
real    0m2.912s
user    0m2.740s
sys     0m0.120s

real    0m0.500s
user    0m0.430s
sys     0m0.070s
```

That's 16% quicker decode and 4% quicker encode, but the sample size is admittedly tiny for both tests. File-size changes are under 0.1% growth, mainly due to 32 rANS states instead of 4. The RANS_ORDER_SIMD_AUTO flag basically enables the 32-way rANS if the block is sufficiently large (>50kb), so the extra 112-byte state overhead isn't significant.