Fix buffer overflow in hts_md5_update()#1068
Merged
daviesrob merged 1 commit intosamtools:developfrom May 15, 2020
Merged
Conversation
Contributor
|
You beat me to it. I was just wrestling vcf a bit more and was going to take a peek this afternoon. Yes the fix looks correct. Although I wonder if it should just be |
Contributor
Author
That would probably be the cleaner solution. |
wstokes
pushed a commit
to GSLBiotech/htslib
that referenced
this pull request
Jan 6, 2021
commit a7a90fe913f8a466f32f6e284cf46653944acd6f
Merge: fd0f895 cd0a84b
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Tue Sep 22 13:15:49 2020 +0100
Release 1.11
commit cd0a84b490e910c085b987e3026f724be98349bf
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Tue Sep 22 13:14:15 2020 +0100
More minor NEWS fixes
commit 80161279b755d83df6e0dbc52942f625df414452
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Fri Sep 18 11:44:04 2020 +0100
Add back assert.h to vcf.h.
This undoes a change in #1060.
It's a philosphical point. On one hand we have an inclusion of a
spurious file that we do not utilise ourselves. On the other hand we
have an API breakage where old software (in this case bcftools 1.10)
attemting to compile against current htslib no longer succeeds because
we've changed which symbols we declare.
This is unclean, but I think backwards compatibity trumps tidyness in
this case.
commit 75e2017bb097cb47722ac8796de10036d21c9625
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Thu Sep 17 09:43:36 2020 +0100
Minor NEWS updates
Reword tabix --separate-regions description. Fix PR#1094 text -- the
original has <space> in the description (edit the PR comment to see it).
Credit @lczech. Trivial formatting fixes. Spelling mistakes PR surely
too trivial to mention :-).
commit 1748a2dd34e5ff4b4800fcf04dbd5860d8c30f22
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Sep 16 10:10:46 2020 +0100
More NEWS updates
commit d50a3a531c597a24644c6878fc0068d0d1cc0a0d
Author: Andrew Whitwham <whitwham@users.noreply.github.com>
Date: Mon Sep 14 20:49:22 2020 +0100
News update Sept 2020.
commit af6dbb49b2d522bfd6d9762bd9329b28aa0b1e70
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Wed Sep 16 10:17:50 2020 +0100
Spelling corrections [minor] (PR #1137)
commit 170047656473a7fadf9f23f3afdfac46b9c52b21
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Mon Aug 17 17:18:27 2020 +0100
Handle q==0.0 case in kt_fisher_exact() and add tests
When some of kt_fisher_exact()'s input arguments are very large,
hypergeo(...) is dominated by the subtracted lbinom() so returns
exp(-750 or less), which underflows, hence hypergeo() returns 0.0.
When q is 0.0, kt_fisher_exact()'s loops never execute and which of
i and j is closer to n11 is indeterminate. To avoid this problem
a special case is added for q == 0.0 which determines the output
values by comparing n11 to the location of the peak of the
hypergeometric distribution.
Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
commit 9067dee60e91eafba657b13e502307d500dcd489
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Fri Sep 11 14:16:28 2020 +0100
Fixed a crash in bcf_record_check.
Commit 3ac8a04 breaks the check if hdr is passed in as NULL. This
never happens in htslib tests, but does in bcftools.
commit 8bab82bdb8c2613e1ca7bd5573d7c12117a2dc02
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Thu Sep 10 07:05:16 2020 +0000
Check against the proper type lenght.
commit 34ba6c726925efceaa2cc995d7d4b0409907f331
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue Sep 8 18:23:24 2020 +0100
Add sam_format_aux1 and bam_aux_get_str APIs. (PR #1134)
sam_format_aux1 is extracted out from sam_format1_append, and
indeed is the lion's share of this function. It is useful as an external
API for anything that is generating its own tags or a partial tag list in
SAM format, such as "samtools fastq". The tag key, type and data
are passed in as separate pointers to aid usage with bam_aux_get
which returns 2 bytes into the in-memory representation of a tag.
It also allows the function to be used in cases where these are not
stored together, for example the internal CRAM representation.
bam_aux_get_str is a wrapper around sam_format_aux1 which
simplifies the case of converting a single tag from a record in
a bam1_t struct.
commit 3ac8a04f8f6071be0901a9ddcda296f58b2bcf0c
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Wed Aug 26 14:29:55 2020 +0100
Sped up bcf_check_record.
This is mainly by speeding up bcf_dec_typed_int1_safe, but also by
some other code in there to streamline the function and keep the
loops tighter.
The effect on clang is slight. On gcc9 reading an uncompressed BCF
file can be a bit faster (up to 15% on one file), bringing the speed
of gcc and clang to similar speeds when using -O3. (Gcc -O2 is still
noticably poorer than clang -O2 on this test though.)
Examples on CHM1_CHM13_2.1.gatk_raw.gvcf.ubcf and gcc9 -O3.
Before:
1549.79 msec task-clock # 0.999 CPUs util
5549055377 cycles # 3.581 GHz
16764915993 instructions # 3.02 insn per
After:
1345.83 msec task-clock # 0.999 CPUs util
4806341286 cycles # 3.571 GHz
13942053336 instructions # 2.90 insn per
Record checking is still around 50% of the entire decode CPU cost though.
commit f4f7f24914251744a928eb6f2aa8e89fa5788d0c
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue Sep 1 12:28:33 2020 +0100
Improvement (personal preference) to previous C99 fix.
In #498 I removed the warning:
vcf.c:3030:26: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
d->allele[i] = tmp.l;
by storing a pointer starting from a fixed "" string, and later on
subtracting it back again to get the length. While that works, it's
pretty icky and I think undefined behaviour as it uses pointer
arithmetic outsides the bounds of an object. (In reality it's fine on
all modern systems as the days of segmented pointer archictures are
long gone.)
Instead I now cast the length into a char* as the size of allele is
never going to grow beyond the size of a pointer, and then cast back
prior to adding to the d->als pointer.
[Ideally we wouldn't need these shenanigans at all and keep it in index
space instead of pointers so growing memory doesn't require a lot of
pointer juggling, but bcf_dec_t is a public structure.]
commit e9447f86ec386798bdf0cc9ee3646f982328e321
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Thu Aug 27 12:22:01 2020 +0100
Remove now unused ks_resize2 function.
NB: This never made it into a release so it isn't an ABI breaking
change and nor does it need to be listed in the NEWS file.
commit e1d314931fe6b2c8d8bc5d0330b989d33adb175d
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Thu Aug 27 11:09:08 2020 +0300
Have realloc and free in the same layer of kstring (#1129)
Bring realloc back to the inline function, so that the allocation and freeing
of memory are done by the calling layer.
Replace the roundup to a power of 2 with a simpler formula that multiplies
the desired realloc size with 1.5.
Try to allocate a maximum of SIZE_MAX/2 bytes.
commit 78648e2a93aaec850ae47fa56cc5c68deb924fa0
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue Aug 25 16:52:59 2020 +0100
Bug fix to the pileup constructor.
We call constructor and destructor for each new bam object that is
added to and removed from the pileup. Pileup also takes its own copy
of the bam1_t struct as sam_read1 loops may be using the same
structure over and over again.
However the constructor was called with the passed in bam1_t instead
of pileups own copy of it (unlike destructor). Hence anything the
constructor attempts to cache, such as the location of an aux tag,
may get written over.
commit 0456cec92e06f4354bbdde26dd3a3b9d3b8bb0ae
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Fri Jul 31 11:16:59 2020 +0100
Clarify that htsFile fields are not to be used directly
commit adeeb8870eadf3e005ae0f17e773d7d90cf58624
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Thu Jul 30 11:27:20 2020 +0100
Add bcf_ prefix to variant_t struct
This struct is used only as the type of a field within bcf_dec_t,
which itself is an internal field of bcf1_t. So user code is very
unlikely to be using this identifier directly.
commit 71b9e4ff763e97f66e0a139a4f3b2974075fb317
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Sat Jul 11 12:34:42 2020 +0100
Rename public HTSlib struct tags to omit leading underscores
File-scope identifiers starting with an underscore are reserved to
the compiler, and there is nothing wrong with a struct tag matching
a typedef name.
In cases where foo_t is actually a typedef to a pointer, rename the
struct tag from __foo_t to foo_s to distinguish it from foo_t.
commit 2fb3b7e178304eff76381c630e54e6408a94f834
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Thu Jul 30 10:53:04 2020 +0100
Remove unused aux_key_t typedef
This was added during PR #608 but accidentally not removed when the
code using the typedef was removed later in the PR's development.
commit 3c6f04be2fa45ee8eb4c2f9a171f2ba8a6032588
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Wed Jul 29 14:09:14 2020 +0100
Add struct tags for public typedeffed structs
This enables user code to forward declare these types with
e.g. "struct htsFile;" without including <htslib/hts.h>.
This commit adds tags for typedefs that previously didn't have one.
commit 7ecaaf90d4515fa9747e7536234722e5d31e514a
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Thu Aug 20 17:43:27 2020 +0100
Ignore errors when closing the input file in the fuzzer
Since 3855184b, hts_close() returns non-zero if earlier
errors were detected on the file handle. As this frequently
happens with fuzz input, it should not be treated as a fuzzing
failure.
commit 4162046b28a7d9d8a104ce28086d9467cc05c212
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Thu Aug 13 15:45:59 2020 +0100
Tidy bcf_hdr_subset() error handling
Gather clean-up code in one place, fixing some
cases that were incompletely handled.
Catch more memory errors directly.
commit 564baf0364762cdd7fadde03ff333c379ec078f8
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Thu Aug 13 12:21:29 2020 +0100
Improve clean-up in hts_tpool_init()
Allocate hts_tpool::t_stack earlier so it's easier to clean up
if it fails.
Add error handling code to tidy up any threads that got created
prior to a pthread_create() failure. Rename variable i to
t_idx so it's easier to see how it links between the thread creation
and the clean-up code. Adds a dependency on htslib/hts_log.h so
it can use hts_log_error().
commit 260c5a35ad855788baba16c5419ae50b6d23061a
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Mon Aug 10 09:59:50 2020 +0100
Improve bcf_hdr_format error checking.
It now handles ksprintf failing, and all the calls to it now also
check for failure and bubble it up one layer. Everything that calls
that layer already had checks, although mostly these are absent in
bcftools.
(That's an entirely new repository and PR though.)
Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
commit 6e759d374751f2743811bedfb761d6feacbced6b
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Mon Aug 10 09:46:04 2020 +0100
Improved return from bcf_hdr_parse_line to enable error detection.
The function returns NULL for both failure and also for running out of
text. Hence it's executed in a loop until it returns NULL, assuming
success.
On successful end it also sets *len to 0, so now on error we
explicitly set *len to non-zero so it is possible to use that as
a side channel for disambiguating EOT from error. Values returned
in *len are documented in the header file.
If bcf_hdr_parse_line() returns NULL due to a malformed line,
bcf_hdr_parse() can use the returned value in *len to skip to
the next one instead of searching for the newline itself.
To stop multiple warning messages from being printed on finding
malformed lines, make hts_hdr_parse_line() responsible for logging
both cases that it traps and remove the corresponding logging from
hts_hdr_parse(). Use hts_strprint() when printing bad lines to
sanitise any unprintable characters.
Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
commit 3855184b58ff51735e9a70282a286702a24ae7f3
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Thu Aug 6 11:11:35 2020 +0100
More mem-checking & threading hardening (write side).
- Failing to allocate the structure to store a thread result now shuts
down all processing queues. This avoids potential deadlock.
- Failure in the thread pool now sets the shutdown signal to 2 instead
of 1, offering a simple way to determine error-induced shutdown
instead of normal shutdown.
- BGZF now checks for errors in bgzf_close. Previously some errors
were caught, but still returned 0 from close and weren't propagated
back to the application.
- khash failure in cram_encode_aux could leave fd->metrics_lock
locked. Delete unused hash entry if cram_new_metrics() fails.
- Improved hts_tpool_process_flush to avoid deadlocks by periodically
checking for shutdown. Also fixed a bug in a previous commit where
a shutdown queue could lead to flush returning before n_processing
hits zero, which in turn could lead to memory being freed for
running jobs.
- Check for string_ndup failure in cram_encode_aux.
- Removed double free in error handling of cram_stats_encoding.
- As with SAM reader, the threaded writer now uses the
hts_tpool_process_shutdown function instead of ..._destroy to avoid
a race condition between that and fd->q = NULL.
- Don't attempt to free h->target_name elements when they havn't been
initialised yet due to error in creation.
- Fix potential double free in sam_format_worker (on error).
commit 8ae25df4bfba9c2bc9b4ac1f6c9fa1de70d5eec9
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Wed Aug 5 12:22:55 2020 +0100
More protection against running out of memory (all formats read side).
These have been found by using a malloc replacement (added to the
Makefile plus header include using CFLAGS="-g -include malloc_fuzz.h")
so that the Nth memory allocation will always fail. We then loop from
1 upwards until we succeed, checking every place for a proper handling
of the error.
Eg:
for i in `seq 1 20000`
do
echo $i
FUZZ=$i ./test/test_view -@4 -B in.bam 2>&1
test "$?" ">" 1 && break
done
Changes include
- bgzf MT reading now sets mt->command = CLOSE and has additional
calls to shutdown the queue on memory errors.
This is checked for in bgzf_check_EOF to avoid deadlock.
- bgzf now has a pool_create() error check.
- Distinguish between errors vs would-block in cram_decode_slice_mt
- Make cram_drain_rqueue cope with failed jobs.
- Improve cram_close to cope with shuting down after failed
initialisation, such as rqueue not being initialised.
- Check for realloc failure in refs_load_fai.
- Fix segfault in cram_decode_slice if cram_get_ref failed.
- Removed abort in load_hfile_plugins. It now returns -1 and is
checked for in find_scheme_handler.
- Documented return values in hts_getline
- Attempt to propagate memory error in kgetline2 to file pointer, but
this only works for hgets (not gets). Fundamentally this is
unsolvable with the current API.
- Fixed memory tear-down in sam_parse_worker so it doesn't claim to
have allocated more than it managed.
- sam_set_thread_pool now copes with failure to create the queue.
- Tabix index_load now checks calloc return.
- Extra allocation return checks in hts_tpool_process_init and
hts_tpool_init.
- Check ksprintf worked in vcf fix_chromosome.
Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
commit 773b6ed3c962c434901e0194607bbc8fc6c2fde2
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Mon Aug 3 11:25:03 2020 +0100
Fixed deadlocks and crashes in threaded SAM parsing.
There was a race with the closing code, caused by early termination
(eg failure during parsing). This has been solved by waiting for the
sam_dispatcher_read to acknowledge and return a SAM_CLOSE_DONE call.
There were also some crashes caused by aborting even earlier, eg
failing to launch all the threads due to running out of memory.
Note extreme low memory may still lead to crashes in this code (but no
more deadlocks). It's challenging to cover even the basics failing to
initialise. In practical scenarios it should be robust.
(Via Rob Davies): sam_dispatcher_read now uses
hts_tpool_process_shutdown instead of hts_tpool_process_destroy. It
also removes any change from SAM_CLOSE_DONE to SAM_CLOSE or SAM_NONE
to avoid race conditions or deadlocks.
Fixed a thread pool bug where worker threads could still launch new
jobs after a queue (process) had been shutdown provided the pool
itself hadn't. We now check both pool and process.
Similarly fixed hts_tpool_process_flush to work after shutdown instead
of wedging. However we don't use this so the change is not strictly
needed for this particular fix.
Finally improved on the error handling so we use errno over EIO if
errno is set, to avoid masking the true source of error.
commit 0e8a6e21e8e30182041cb0b83407de952b521ffd
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Wed Aug 5 10:08:43 2020 +0100
Fix bug in the line reallocation for threaded SAM decoding.
(The +NM/2 on the wrong side of the comparison!)
The +8 changes here may not be necessary, but we did it for some
allocs so I'm just being consistent and making sure it's always got 8
more than it uses. Either it's not necessary in which case those
changes are harmless, or it is necessary in which case we were
sometimes not always leaving at least 8 bytes remaining.
I believe this fixes https://github.com/samtools/samtools/issues/1293
commit a79009b38ce83e39bcbc8f54c00cf203621aa5bb
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Wed Aug 5 10:17:46 2020 +0100
Don't trash errno in bam_tag2cigar.
bam_aux_get sets errno=ENOENT when we look for a tag and don't find
it. Restore it to the old value when failing to find CG tags as
not finding the tag is not an error.
The consequence of samtools/samtools#1296 was that some errors
found while decoding a BAM file got reported as "No such file
or directory" prior to this commit.
Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
commit ab045d8c9d13a023ed873bd4788e851dd52937f2
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Aug 4 15:24:20 2020 +0100
Add missing Makefile dependency [minor]
commit 31f0a76d338c9bf3a6893b71dd437ef5fcaaea0e
Author: Andrew Whitwham <whitwham@users.noreply.github.com>
Date: Wed Jul 22 09:55:16 2020 +0100
Update dates in LICENSE and program versions.
commit f098e4da5728e8ba4ca4351b9a1b945d17dd8982
Author: Andrew Whitwham <whitwham@users.noreply.github.com>
Date: Fri Jul 17 12:01:59 2020 +0100
Update the copyright to 2020 where needed.
commit a2fdf3b2fa29844a2d48c4d786471f05bd7821a9
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Sat May 9 13:42:28 2020 +0100
Consider alignments which consume no reference bases to have length 1
In bam_endpos(), similarly to reads that are unmapped or have no CIGAR,
consider reads whose CIGARs consume no reference bases to have length 1.
Thus consider them to cover one reference position (rather than zero)
for indexing purposes. Fixes samtools/samtools#1240.
Use bam_cigar2rlen() directly in bam_plp_push() to avoid this and other
adjustments. (This function already ignores unmapped reads separately.)
Update code that sets the bin field accordingly: use bam_endpos() where
applicable; when not all fields have been set up, add similar code to
make adjustments when rlen==0.
commit 302843d739d65d3192c52558b154c25f2a00c103
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Tue Jul 28 15:59:14 2020 +0300
[tabix] Add --separate-regions option (#1108)
* Add tabix --separate-regs option.
* Rename option to --separate-regions.
Print the region name on top of the corresponding group.
Add unit test and minor refactoring.
commit 3a359cd61322a2fc546198f72cb36988924ab765
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Jul 28 12:01:52 2020 +0100
Download index files atomically in idx_test_and_fetch() (PR #1112)
* Reuse cram_populate_ref()'s atomic downloading in idx_test_and_fetch()
Extract filename uniquification into a new hts_open_tmpfile() function
declared in hts_internal.h. No code in hts.c currently uses pthreads
directly, so instead of mixing pthread_self() into tmpname, mix in the
pointer value of one of the arguments -- which will vary across threads.
Test whether the local index file already exists using access() rather
than a trial fopen() into local_fp, which is no longer a FILE*.
* Improve cram_populate_ref() download error handling
Clean up the temporary file if any of hwrite/hclose/chmod/rename
fail. Putting hclose() first in the if condition ensures it is always
executed, and putting rename() last ensures that the file is still
at path_tmp in any failing case.
commit da595880d134b29fb62f726b72bdc6302ffc9ccc
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue Jul 21 10:19:56 2020 +0100
Fixed a (harmless) CRAM uninitialised data access.
This was caused by the fix in PR #1094.
The problem there was data without explicit quality scores being set
(CF data series with "preserve qual scores" not set) could be coupled
with explicit Q or q feature codes to add qualities at specific
sites. Without quals, the default is "*" (a run of Q255), but we
can't just amend a few qualities here and there as that leads to a mix
of Q255 and Q-something-else. We memset from Q255 to Q-something-else
the on the first explicit quality change.
The flaw is that this check is applied even on data where the
"preserve qual scores" CF value is set, but we haven't initialised the
quality string to all-Q255 in that scenario.
However it was totally harmless, as if we preserve quality scores,
then the entire array of quality values are then subsequently
overwritten (after applying any per-base qualities - dumb I know, but
it was never intended for that scenario). So irrespective of the
outcome of the uninitialised check and whether or not the memset to
Q30 happened, the data still ends up the same thing.
Note curiously this does not show up with -fsanitize=address, but
valgrind does detect it. Hence the test harness was passing fine.
commit dcd4b7304941a8832fba2d0fc4c1e716e7a4e72c
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Mon Jul 13 11:48:49 2020 +0100
Fix check for VCF record size
The check for excessive record size in vcf_parse_format() only
looked at individual fields. It was therefore possible to
exceed the limit and overflow fmt_aux_t::offset by having
multiple fields with a combined size that went over INT_MAX.
Fix by including the amount of memory used so far in the check.
Credit to OSS-Fuzz
Fixes oss-fuzz 24097
commit df31e599a3d3778804b6a7fda9bb5fdc683ee34c
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Fri Jul 10 16:53:43 2020 +0100
Avoid crash for `bcf_update_alleles(hdr, rec, NULL, 0)`
Updating a bcf1_t to have 0 alleles is not particularly meaningful, but
it is a valid state as that's what bcf_init() and bcf_clear() produce.
Hence it is a valid thing for bcf_update_alleles() to do too, so calling
it with nals==0 should not crash. Instead set rlen the same way that
bcf_init() and bcf_clear() leave it. Fixes #994.
Also fix bcf_get_format_values() allocation failure check [minor].
Hat tip @reedacartwright.
commit c7c7fb56dba6f81a56a5ec5ea20b8ad81ce62a43
Author: daviesrob <rmd+git@sanger.ac.uk>
Date: Wed Jul 15 09:03:25 2020 +0100
Fix hfile_libcurl breakage when using libcurl 7.69.1 or later (#1105)
Curl update https://github.com/curl/curl/pull/5050 (pause: bail out
on bad input) made curl_easy_pause() return an error if passed
a disconnected CURL handle. In hfile_libcurl this could happen
when libcurl_read() or libcurl_close() was called after all
bytes had been read; or all the time in restart_from_position().
Fix by checking for fp->finished in libcurl_read() and
libcurl_close(); and by removing the curl_easy_pause() in
restart_from_position().
Thanks to @jmarshall for tracking down the exact libcurl change
that caused the incompatibility.
commit 9c357445c5dab64db4000497193975fa12f63225
Author: Ilya Vorontsov <vorontsov.i.e@gmail.com>
Date: Wed Jul 8 15:52:24 2020 +0300
typo "Number=R" instead of "Number=G" in error msg
commit 2c49d1913b05e754e1f180fc64d187ba0baba1a8
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Jun 24 18:44:24 2020 +0100
Don't call exit(1) on invalid INFO numbers in vcf_format()
Instead, set errno = EINVAL and return -1, as for other types of
invalid record.
commit 6fc307b777cb20ea52820d88e7e7c2c645edfd4b
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Jun 24 17:02:14 2020 +0100
Catch invalid dictionary indexes in vcf_format()
bcf_record_check() can't check the validity of references to the
header dictionaries when using the iterator interface as it doesn't
have access to the header. Add a check to vcf_format() to catch any
invalid records that make it that far.
Changes the existing, incomplete, check for FORMAT ids so
that it returns -1 instead of calling abort().
Adds 'const' qualifier on 'rec' to bcf_seqname() and
bcf_seqname_safe() they can be used without casts in vcf_format().
commit 1d6b68263a58786b6ab88f579e2866ad9e13c17b
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Jun 24 16:29:12 2020 +0100
Better catch BCF CHROM/FILTER/FORMAT ids absent from the header
It is possible to make a sparse set of ids in the VCF
header using IDX= annotations. This can cause problems
if a BCF record refers to one of the indexes that does
not exist.
Update bcf_record_check() so that these bad records while
files are being read. It already did this for INFO; this
adds similar checks for CHROM, FILTER and FORMAT data.
Credit to OSS-Fuzz
Fixes oss-fuzz 23670
commit 832e4ba7b43b68a53ef2277cfbc643b85c8d32fd
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Tue Jul 7 18:46:05 2020 +0100
Fix paths for Windows tests.
commit 818b271f8be6d71258364aff69075ede97a49054
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Jul 1 18:33:18 2020 +0100
Add synced reader no-index tests
commit 6e0ed10ef42f393628ee4a0e628a8acc5023ed0a
Author: Petr Danecek <pd3@sanger.ac.uk>
Date: Wed Jun 24 14:46:17 2020 +0100
Support for simultaneous reading of unindexed VCF/BCF files.
In order for this to work, the caller must ensure that the input
files have chromosomes in the same order and consistent with the
order of sequences in the header.
Replaces #1076.
commit c9fdcbf2bb1e700f388a3302914c6f272e3f84ff
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Fri Jun 12 16:43:36 2020 +0100
Don't dlclose() plugins in atexit() handler
Prevents a segfault when HTSlib has been imported using dlopen(),
a plugin that itself links against HTSlib is in use and
hts_lib_shutdown() was not called before dlclose(htslib).
If hts_lib_shutdown() is not called, dlclose(htslib) will
not remove libhts.so if any plugins linking it are installed.
In this case the atexit() handler runs at program exit where
it can clean up any memory still in use. There's no need to
remove the plugin shared libraries (which causes the segfault
as it removes libhts.so while code inside it is running) as
the runtime will tidy them up anyway.
If hts_lib_shutdown() has been called, the plugin shared
libraries will already have been removed.
commit 913fccf7edeb4e959428996e040b958645dc0b2b
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Fri Jun 12 12:55:13 2020 +0100
Add --enable-plugins to travis (and fix -ldl configure test breakage)
Using Address Sanitizer on Ubuntu Xenial with gcc-8 somehow
breaks the configure test to see if -ldl is needed, making it
incorrectly decide that it isn't. This causes a link failure
when building test/plugins-dlhts when the compiler decides
that it really did want -ldl after all. Oddly, other
outputs that reference dlopen(), dlsym() etc. have already
built successfully by this point.
Fix by making configure test for dlsym() instead of dlopen(),
which seems to work better.
commit 9a15df9d394777c45abcdebf27dd65fbaf1a8b79
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Jun 23 13:40:45 2020 +0100
Log when calling hts_lib_shutdown/dlclose/etc
Facilitates adding debugging printfs to e.g. hfile_shutdown()
to observe exactly where in the sequence they occur.
commit 00c61d2c8d83324494d4edab42c452c2d3cc26fc
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Jun 16 08:06:49 2020 +0100
Provide separate dlsym() wrappers for functions and data
Avoid -pedantic warnings about data pointer <-> function pointer
conversions (which are required to work on POSIX platforms so that
dlsym() works, but are nonetheless warned about by compilers).
As we're providing wrappers around dlsym() anyway, we can provide
separate function and data wrappers; change load_plugin()'s return
type as it (practically) always returns a function. This uses the
workaround recommended in POSIX Issue 6 dlsym() (since removed in
Issue 7 by [1]; at least for GCC, type-punning through a union
might be preferable). Hat tip Rob Davies.
[1] https://www.austingroupbugs.net/view.php?id=74
commit 2cb229aedbdf81c968a9030521a9b7ed42cce599
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Sat May 16 09:01:12 2020 +0100
Link libhts.so into plugins on Unix&macOS; add hts_lib_shutdown()
When libhts.so/.dylib has been dynamically loaded with RTLD_LOCAL
(which is the default on Linux), plugins that use hts_always_remote()
etc will not be loadable unless they make libhts's symbols available
themselves.
When libhts.so/.dylib has been dynamically loaded, some plugins have been
opened, and dlclose(htslib) is called: htslib is not actually unloaded
(or its atexit functions e.g. hfile_exit() called) as the plugins are still
using libhts symbols. Eventually at program exit, all atexit functions are
called; so the plugins are closed by hfile_exit(), htslib is suddenly
unloaded when the final plugin is unloaded, and a segfault occurs because
this all happens within hfile_exit() -- which has just been unloaded.
To break this cycle, introduce hts_lib_shutdown() which must be called
before dlclose(htslib), if that is called explicitly prior to program exit.
This unloads the plugins so that dlclose() can immediately unload htslib.
Add test exercising libhts via dlopen(). This ensures that all available
hFILE plugins have been loaded, then calls hts_lib_shutdown() and
dlclose(htslib). As this is the first htslib test that tests things
linked to the shared library, we need to set $LD_LIBRARY_PATH etc so
that the freshly-built libhts.so/.dylib/etc is used.
Add with-shlib.sh script that creates a suitable temporary libdir
containing *only* the freshly-built shared libhts. On Windows platforms,
this directory is added to %PATH% so shouldn't contain anything else.
On macOS, having hfile_*.bundle files in $DYLD_LIBRARY_PATH would reduce
the ability to test against different sets of plugins by varying $HTS_PATH.
commit 82d01ba63b20eed8905ea60c2e1cfe0f96c73bab
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Fri Jul 3 14:41:19 2020 +0300
Implement vcf_open_mode method (PR #1096)
Which can detect the opening mode of a file, based on its extension.
commit cb44caaebb25ec334f257b96f9c207cfa6f346c7
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Wed Jun 24 07:32:51 2020 +0100
Check for null pointer in source string.
commit 29d06c2f84dc20fbc5f94ca1b181292433b69a21
Author: daviesrob <rmd+git@sanger.ac.uk>
Date: Thu Jul 2 13:07:21 2020 +0100
Improve bam_aux_update_str() (#1088)
Makes bam_aux_update_str() easier to use by not insisting that
the byte count passed in via the len parameter should include
the NUL-terminator for the string. This makes it less likely
that an invalid tag will be generated by forgetting to include
the NUL, and also allows tags to be made from part of a longer
string. To support this, it is made to handle adding new tags
itself instead of delegating to bam_aux_append().
Where existing tags are replaced, bam_aux_update_str() is
changed to more accurately calculate how much extra space is
required. A call to bam_aux_del() is removed and the tag
shuffling code updated to reduce the number of memmove() calls
needed to get the following tags in the right place.
Adds some extra tests to ensure growing and shrinking tags
both work correctly. Also fixes a bug in the test/sam.c
test_mempolicy() function, which was not adding the correct
number of bytes when appending the ZZ tag to each alignment
record.
commit 33bfcfa37df8beae67d3b0874c7253fe14d8929e
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue Jun 30 17:01:27 2020 +0100
Fix mpileup regression between 1.9 and 1.10.
When RNEXT / PNEXT / TLEN are set to the uninitialised values (* / 0 /
0) the overlap_push function was escaping the overlap removal code.
We now still look for overlaps in this case. It's not as efficient
when it doesn't have these values as we can't optimise the case of
detecting non-overlapping reads, but we do get the correct answer again.
commit 3d55140fead0a42762abc7c62032e3ee1265f30f
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Fri Jun 26 12:29:48 2020 +0100
Fix support for decoding CRAM 'q' feature.
This was incorrectly marching along seq / ref / cig-len, which broke
the CIGAR string and base calls.
Test file: https://github.com/jkbonfield/hts-specs/blob/CRAM_tests/test/cram/passed/1005_qual.cram
commit 146e6a28c948e935add16e106f77482cfc7e383d
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Thu Jun 25 14:18:07 2020 +0100
Set CRAM default qual for lossy qual modes.
If we have lossy quality enabled and couple that with the use of 'B',
'q' or 'Q' features, CRAM starts off with QUAL being all 255 (as per
BAM spec and "*" quality) and then starts to modify individual
qualities as dictated by the specific features.
However that then produces ASCII quality <space> (q=-1) for the
unmodified bases. Instead we use ASCII quality "?" (q=30), as per
htsjdk. We use have quality 255 for sequences with no modifications
at all though.
NB: Neither htslib nor scramble can produce test data to demonstrate
this bug. I used gdb to tweak the CF value (b cram_encode.c:2769 and
then set cr->cram_flags=0) along with using a SAM file using IUPAC
codes in the sequence.
commit 808affd6c25d2df7c59ab8d463dc76015bcbc31a
Author: Damien Zammit <damien@zamaudio.com>
Date: Sat Jun 27 02:17:23 2020 +1000
Fix dirty default build by including latest pkg.m4 instead of using aclocal.m4 (PR #1091)
commit 1e1186923f67d82edb181ce6da9ab7d38a20e3eb
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Thu Jun 25 14:07:30 2020 +0100
Fix a bug, which caused cram_codec_decoder2encoder to return an error code.
commit 313a9fc1e3df139bb32d81d9b660a43a5dc9483a
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Fri Jun 19 15:27:08 2020 +0100
CRAM fix for MD tag generation with "b" FC.
We don't normally generate these features, using "B" instead. However
some hand crafted CRAM files have shown we didn't create the correct
MD tag due to forgetting to clear the distance since last edit, giving
rise to the last numeric in MD being high.
Eg: MD:Z:0A98T0 came out as MD:Z:0A98T98 if that last T edit was a "b"
feature.
commit 16f62c5663bc09fda7c87f78322d1c561f483055
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Fri Jun 5 19:23:19 2020 +0100
Avoid warnings from recent and upcoming Clang and GCC releases
Prevent GCC 10 strncpy() warning by telling the compiler that
cram_file_def::file_id is not a C-style NUL-terminated string.
GCC 10 warns for `int_var = uint32_var = (uint32_t) -1`.
Clang 11 warns that float doesn't have enough precision to represent
RAND_MAX (0x7fffffff) exactly. Recode using uint64_t arithmetic rather
than (32-bit) float arithmetic.
commit 34c15969fa49d78b07c649374fb008fb3ed1e98a
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Fri Jun 5 14:35:42 2020 +0100
Add documentation to hts_itr_t structure.
Add a few minor changes.
commit d3147150b4f90a75746500137357f63bb8530d81
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Wed May 6 15:30:03 2020 +0100
Fix the multi-threaded CRAM multi-region iterator regression.
Fixes #1061
This appears to have been introduced during bfc9f0d and fixed for BAM
only in 6149ea6. The effect on multi-threading decoding for CRAM was
significant.
This fix takes a totally different approach, which not only fixes the
regression but passes it. CRAM can do CRAM_OPT_RANGE queries, used by
the single version of the iterator, which informs the decode of both
start and end range positions. When threading this means we don't
preemptively decode the next container unless it'll actually be used.
This same logic is now used in the multi-region iterator, although
it's complex.
The general strategy is as follows:
- Ensure we know the next container start from index. This needs a
small tweak to cram_index struct as the next container isn't quite
the same as this container + slice offset + slice size (sadly). I
think it's missing the size of the container struct itself.
- When being given a start..end offset, step into the reg_list to find
the corresponding chr:start-end range.
- Produce a new CRAM_OPT_RANGE_NOSEEK option that does the same job
as the old CRAM_OPT_RANGE opt but without the seek. This is
necessary hts.c does the seek for us and the pseek/ptell only work
with each other and not within the cram subdir code itself).
- Identify neighbouring file offsets and merge together their
correponding ranges so a block of adjacent offsets becomes a single
CRAM_OPT_RANGE query. We cache the end offset (.v) in
iter->end so we can avoid duplicating the seek / range request
in subsequent intervals.
- Manage the EOF vs EOR (end of range) return values. For EOR we have
do the incrementing ourselves as we need to restart the loop without
triggering the end-of-file or end-of-multi-iterator logic below it.
- Tweak the region sorting a bit so ties are resolved by the max
value. We want the index into reg_list to also be sorted, so we can
accurately step through them. Note this logic is CRAM specific, as
the sorting wouldn't work on BAI anyway due to the R-tree.
Some benchmarks on ~37,000 regions returning ~110 million seqs across
NA06985.final.cram:
CRAM-1.10 no threads: real 4m37.361s user 4m21.128s sys 0m7.988s
CRAM-1.10 -@16: real 3m14.371s user 28m48.872s sys 4m52.442s
CRAM-dev -@16: real 5m55.670s user 61m0.005s sys 11m23.147s
CRAM-current -@16: real 1m55.701s user 5m54.234s sys 0m51.699s
The increase in user time between unthreaded and 16 threads is modest.
System time increase is considerable, but this appears to primarily be
down to glibc malloc inefficiencies. Using libtcmalloc.so instead gives:
1M CRAM-current -@16: real 1m30.882s user 6m9.103s sys 0m4.960s
We're still nowhere near using 16 threads, but this is because the
average size of each region is significantly smaller than 16
containers worth.
commit 9de45b72b3a3f3df7a0ffda3554aeab1f6da50ba
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue May 12 17:28:21 2020 +0100
Further speed up of VCF parser (formats).
The if elseif checks are now a switch statement and juggled in order a
bit. Also the fmt[j].max_m type code is now f->max_m with f
incremented along with j.
The impact on gcc builds is minor (maybe 1%), but for clang this was
8-9% speed improvement on a 1000 genome multi-sample VCF. Example
timings for clang:
Previous commit
23333.77 msec task-clock # 1.000 CPUs utilized
83667512727 cycles # 3.586 GHz
199145555089 instructions # 2.38 insn per cycle
43099743981 branches # 1847.097 M/sec
665687093 branch-misses # 1.54% of all branches
This commit
75967289857 cycles # 3.585 GHz
195076309580 instructions # 2.57 insn per cycle
43265084488 branches # 2041.736 M/sec
640008186 branch-misses # 1.48% of all branches
commit b1acab6219c12c3a6cc127b37e0b03bcf4b90c3f
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Thu Apr 23 13:23:03 2020 +0100
Implement and use the `hts_str2dbl` method (by @jkbonfield)
For faster conversion of strings to floating point values. It
speeds up the conversion of values which match the regexp
[+-]?[0-9]*[.]?[0-9]* and have at least one and no more than 15
digits in total. Values that don't match (for example they include
an exponent) are handled by passing to strtod().
Co-Authored-By: James Bonfield <jkb@sanger.ac.uk>
Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
commit 6d7da635eaa744f8140e10be90995864da381c58
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Tue Apr 21 11:42:23 2020 +0100
Speed up GT parsing
* Parse numeric GTs as unsigned integers with at most 30 bits.
(no more then 30 bits are needed due to the maximum permitted
value of the field).
* Pull error checking out of the innermost loop.
commit 9c4c64d86a1711a1f67add84a42e8d7a823d33c5
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Wed Feb 19 14:27:37 2020 +0000
Extract functionality from `vcf_parse` into individual parsing methods.
commit 4a783b84276997187c3a168d161429a23a2e1548
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Thu Feb 6 12:58:11 2020 +0000
Use hts_str2int, instead of strtol, in the VCF parser.
commit 08488f7a7910d8fdedf9c0d5ba7a882cdcabedc0
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Thu May 21 00:18:17 2020 +0100
Use including-file-relative paths to top-level private headers
Similarly to the previous commits, includes of these headers from
files in subdirectories can be subverted if $CPATH/etc or $(CFLAGS)
activates a directory containing them. This is unlikely, but the
particularly misguided might perhaps add a previous HTSlib source
directory to $C_INCLUDE_PATH.
To avoid that scenario, and for consistency with all the other
intra-HTSlib #includes, add ../ to #includes of top-level headers
in subdirectories.
With this, the only remaining #includes in HTSlib requiring `-I.` (as
is hard-coded in Makefile) are each *.c file's <config.h> inclusion.
(Using <> for that is as recommended by the autoconf documentation.)
commit 1ce80dd6e006b6554d1387d013bbfba942eaf824
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Jan 7 16:06:15 2020 +0000
Use including-file-relative paths to cram/*.h headers
Similarly to the previous commit, includes of cram/*.h from files in
subdirectories can be subverted if $CPATH/etc or $(CFLAGS) activates
a directory containing cram/*.h files. This is less likely than the
previous case of previously-installed public htslib/*.h headers, but
note that e.g. Debian for a time shipped cram/*.h headers, and the
particularly misguided might perhaps add a previous HTSlib source
directory to $C_INCLUDE_PATH.
Avoid this by removing cram/ from cram/*.h #includes in cram/*.[ch]
(and adding ../ elsewhere) and converting any instances of <> to "".
commit e569d806ea84b70793d08f290b85e294257d7814
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Jan 7 15:32:47 2020 +0000
Use including-file-relative paths to public htslib/*.h headers
The full preprocessor search path for #include "..." is typically:
1) directory of the file containing the #include directive
2) environment variables $CPATH, $C_INCLUDE_PATH, etc
3) command line options -I etc
4) system include directories
When a top-level *.[ch] file #includes "htslib/foo.h", the nearby
header file within the source tree is found due to (1). When a file
in a subdirectory (htslib/*.h, cram/*.[ch], or test/*.c) does the
same #include, the nearby header file is intended to be found due
to (3) via the hard-coded `-I.` in the Makefile. (`.` here being the
compiler process's $PWD, i.e., the top-level HTSlib source directory.)
However the latter can be subverted if there is an "htslib" directory
containing foo.h in $CPATH, $C_INCLUDE_PATH, or on the command line
prior to the hard-coded `-I.`. This will be the case if users have
made a previous HTSlib installation accessible via $CPATH/etc or by
adding a -I option to $(CFLAGS) (rather than $(CPPFLAGS)).
Avoid this by adding ../ to #includes in subdirectories (and always
using "" rather than <>), so that all includes of htslib/*.h within
HTSlib are found due to (1). Fixes #347.
commit e3cdb1df911116978c40c7fbfb3fd69ad20c3c1e
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Wed May 20 07:36:36 2020 +0100
Use the new hts_strprint() instead of dump_char()
commit 3a73df587add7a2bc7600a57a3ab06ff629475f5
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Jan 21 14:07:05 2020 +0000
Improve SAM parser unrecognised RNAME warnings
Fix tid vs mtid typo in RNEXT parsing. Fix "urecognized" typo.
Use __VA_ARGS__ to simplify the _parse_err macros. Use hts_strprint()
to add the unrecognised reference names to the warning messages.
commit 9d5c6533edbdf8eba7f4bc621695a802099ffa37
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue May 19 22:36:16 2020 +0100
Add new hts_strprint() utility function
Prints user-supplied text, optionally with quote marks, truncated
to a sensible length and with unprintable characters escaped.
Thus it is usable on potentially malicious input data.
commit e05d8dcd435876c6ec7635fd3819ca7915e302f3
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Mon May 18 15:27:16 2020 +0100
Fix test data to have "=" for RNEXT when matching RNAME.
commit c0e65a1ec4db767808ffcde1b44305051aea11c2
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Fri May 15 13:41:41 2020 +0100
Mention TBI chromosome length limit in tabix man page
commit 382867a850b74e7285166a67ee3243560cd974ac
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Fri May 15 17:19:35 2020 +0100
Fix reference length check. (PR #1068)
Credit to OSS-Fuzz
Fixes oss-fuzz 22231
commit fc431ebc252166f47f798e20954c5ccf8be83a8c
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Thu May 14 09:28:17 2020 +0100
Take even more care in converting uintN_t to intN_t
This code converts positive and negative values from uintN_t to intN_t
separately, but was being undone by having different types on the two
sides of the colon, and hence their common type being uintN_t -- thus
requiring an implicit conversion to intN_t after all!
(int8_t and int16_t are smaller than int so this matters even less
for these types, but it's good to follow the pattern anyway.)
Uncovered via -Wsign-compare, and grepping the code for /:.*(int/ etc.
Followup to dc3303b6da542f74c965b3f3f50a58776f87104b.
commit ea16e9d736911f757d830b7e718a7281758123ed
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Wed May 13 22:58:19 2020 +0100
Always check for malformed end < beg ranges
Check this too in the case that the "change of chromosome" if-statement
above was also executed.
commit 315d08dca68bdb7fcbd7386108307a4ab48d2246
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Wed May 13 22:43:25 2020 +0100
Prevent -Wsign-compare warnings in public htslib/*.h headers
(Also fix nearby htslib/hfile.h whitespace.)
commit b6d48aaa3809767417422595277eac8fbfe56d34
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Tue May 12 12:44:46 2020 +0100
Tolerate PG lines with invalid PP entries, already present in the file,
but log a warning.
commit cdf5d18768ba29975b70de3194e1fd9ab0c5f062
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Thu May 7 12:57:15 2020 +0100
Fix a test case.
commit e49b6665ec0d1383a5574459959c7a7a925daeed
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Wed May 6 16:30:04 2020 +0100
Compute the PG chain lengths and don't count leaves as starting points.
If there are only leaves (no chain longer than 1), choose the last entry.
Redo the PG links before adding a new PG line.
commit cb8a05138e3904f7993d71026d3ee463e0cc80e9
Author: Petr Danecek <pd3@sanger.ac.uk>
Date: Wed Apr 22 11:11:58 2020 +0100
Print position of offending VCF/BCF records to help with debugging
Adds bcf_seqname_safe which returns "(unknown)" when the sequence
name cannot be determined.
commit 8859b092f23ece485391a251be616feb2422ad70
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Feb 18 11:13:43 2020 +0000
Remove spurious system headers from public HTSlib headers
Reduce unnecessary inclusion of in particular <assert.h> and <stdio.h>.
commit 0344174128c7a234c6a2d2a23e2ae9483ee4b434
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Thu Mar 19 11:11:05 2020 +0000
Use ks_expand() instead of kroundup32+realloc [minor]
Use newer kstring utility function to avoid explicit low-level munging.
Report any memory allocation error in bgzf_getline(); ks_getuntil2()
has no error handling ability, so ignores ks_expand()'s return value.
commit 48c4e9a4e6b7ded874630f4ff84f458c77402266
Author: daviesrob <rmd+git@sanger.ac.uk>
Date: Wed Apr 29 12:35:04 2020 +0100
Add tabix --cache option; make bgzf cache work better with hfile_libcurl (#1053)
* Add tabix bgzf block caching.
* Preserve buffer contents for deferred seek in hfile_libcurl.
Commit ad2e24f added delayed seeks to hfile_libcurl so that
calling hseek() lots of times without an hread() in between would
not cause unnecessary web requests (the bgzf block cache tends to
do this). While it mostly worked, a small amount of data that
has been read but not consumed yet was being dropped by hseek().
Where this data was actually needed after all the hseek()ing,
it would cause hfile_libcurl() to have to rewind slightly resulting
in a new web request. Again, this can frequently happen when
using the bgzf block cache.
To prevent this, make hfile_libcurl take a copy of the data that
would have been lost. If a later read with a deferred seek hits
the preserved region, it can be returned from the cached data
without having to start a new web request. Once this cached
data has been used up, the original web connection can carry
on reading from where it left off.
* Fix libcurl_read() when not all preserved data is used up
fp->preserved_bytes should not be altered.
Assertion at line 780 needs to be changed as it's no longer
true that fp->base.begin and fp->base.end will always point to
the start of the buffer.
commit 2e36fa6ea4d753fc9aac0c6259078bdc9720a930
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Tue Mar 31 15:29:30 2020 +0100
Fix memory leak on failure in cram_decode_slice()
Make the error case decrement the reference count on any used
refs and free the refs array.
Credit to OSS-Fuzz
Fixes oss-fuzz 21393
commit f58a6f3796f94f3b7ed933b24af57c13e0215282
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Tue Mar 31 12:17:03 2020 +0100
Document special END tag handling in bcf_update_info()
commit 88a588a7b083c1a24051ce8b9ef16c70e78b9f65
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Mar 4 11:18:33 2020 +0000
Add tests for VCF file invalid END tag handling
commit e784f64950b91cdba8fc8e6f203218957139fd91
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Thu Feb 27 10:00:21 2020 +0000
Try to handle BCF files with negative rlen fields.
This can happen where VCF files with invalid END tag values (less
than the reference position) have been converted to BCF.
bcf_read1_core() is changed to treat rlen as a signed value (which
is what the BCF2 specification says). This means the maximum
rlen now supported by HTSlib for BCF files is reduced to 2^31.
If rlen is negative, bcf_record_check() will now print out a
warning. It will also attempt to fix up rlen by substituting
the length of the reference allele.
A call to bcf_record_check() is added to bcf_readrec() so
reads using iterators will be protected from bad input data.
Unfortunately checks involving the header have to be disabled
for this interface as it isn't available.
commit 6a14289be5b349afcb99bf730bbcbb38ffa6703a
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Thu Feb 27 09:54:42 2020 +0000
Add check for invalid VCF END tags to tabix
Where END is less than or equal to the reference position, a
warning is printed and the length of the REF allele (which has
already been measured) will be used instead.
commit d704389e051e5fdc8e69530233b10b33c653176f
Author: Petr Danecek <pd3@sanger.ac.uk>
Date: Thu Feb 20 09:46:19 2020 +0000
Make sure rlen is updated only from non-missing END values.
commit db079d5be536d170d9868127857b66834de3caa3
Author: Petr Danecek <pd3@sanger.ac.uk>
Date: Wed Feb 12 13:50:12 2020 +0800
Prevent negative rlen on erroneous INFO/END in bcf_update_info()
This is an extension of ea879e19b8 which added checks for reading
VCF.
Resolves https://github.com/samtools/bcftools/issues/1154
commit 0bff7c1cb60f3465e86e6e5c826f37a5948d8d04
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Mar 11 11:17:43 2020 +0000
Fix missing newline in error messages [minor]
commit b34a1ae1a91f7878eabfd288de92a4c7d29dab2e
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Mar 11 11:06:23 2020 +0000
Make hts_itr_next() return an error on seek failure
Return code needs to be less that -1, or the caller will
incorrectly interpret it as end of data, causing it to silently
finish too early.
Also adds error logging messages.
commit 90d3002401c8813c0ed64c314e0877a65b68bcd2
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Mar 11 10:25:16 2020 +0000
Make bgzf_read_block() print more error messages
To make it easier to find out where BGZF reads have failed.
It prints the offset of the BGZF block where the read / decompress
failed, which makes it much easier to locate bad parts of files.
Unfortunately it can't report the file name as it's not stored
by bgzf or hfile, but hopefully one of the callers will divulge
it.
This introduces a small change of behaviour as block_address is
now updated after an empty block has been read. This may change
the location stored in indexes of files with empty blocks,
however the index should be more efficient as it will now point
to the location where data starts instead of the previous
empty block.
commit 04b71e6b10bdc55cd288a45118a9a563b67add5f
Author: John Marshall <John.W.Marshall@glasgow.ac.uk>
Date: Tue Mar 17 18:52:37 2020 +0000
Move the bulk of ks_resize() to a non-inlined helper function
commit ba8d1a11bed8f660594ffd104149df31bb31ba0c
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Tue Mar 17 12:41:00 2020 +0000
Unify kroundup macros and fix gcc warnings
Replace several kroundup32 and kroundup_size_t macro definitions
with a new unified implementation. The new one will work
correctly for all integer types (both signed and unsigned) up to
64 bits. An attempt to round up that would overflow results
in the maximum value that can be stored in the given type.
The implementation avoids conditionals involving large values
in an attempt to fix some gcc warnings in samtools and bcftools.
They appeared after commit 29c294e which fixed a bug where kroundup
would wrap around to zero on large values. The warnings are
caused by jump threading optimisations that create code paths
for the large value cases, which gcc then warns about. See:
https://developers.redhat.com/blog/2019/03/13/understanding-gcc-warnings-part-2/
Includes extra tests to ensure the new kroundup works as desired.
commit 9a1035586382ef3272e81c2a5f9eeba876986115
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Tue Mar 17 18:07:37 2020 +0000
Prevent a double free in case of failure. (PR #1047)
commit b22e03d805866c04ea02a4b04c9b58a167853418
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Fri Mar 6 17:02:31 2020 +0000
Add crypt4gh file detection
Adds format detection for crypt4gh files (see
https://samtools.github.io/hts-specs/crypt4gh.pdf).
If hts_hopen() finds a crypt4gh file, it will try to re-open it
via the hfile_crypt4gh() plug-in. Detection occurs at the
same point as htsget, so it does both in a loop to allow for
crypt4gh served via htsget.
Actual decryption of the file requires an external plug-in,
available from https://github.com/samtools/htslib-crypt4gh
As this may not be present, a dummy handler is included that
prints an error message indicating that it is needed. This
is overridden by the real one when it is available.
commit c6ce820d5ccfeca1c38e6644d72ca6543da01eb1
Author: Petr Danecek <pd3@sanger.ac.uk>
Date: Wed Mar 11 15:30:28 2020 +0100
Acknowledge the existence of the symbolic <NON_REF> allele produced by GATK (PR #1045)
commit 29c294e6842a56ba3b9a24a24a5f6de1575b0961
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Tue Mar 10 10:31:45 2020 +0000
Fixed a raft of integer overflows in VCF land.
- Cast data into size_t before multiplication to avoid wrapping around
int32.
- Added checks for return values to align_mem and ks_resize
- Simplified the byzantine calculation in align_mem
- Fixed kroundup_size_t and kroundup32 so they cannot wrap around to
zero and turn the realloc into a free.
- Also added a check for ~2Gb on total length of FORMAT fields, which
nullifies the need for some of the above. We may wish to remove
this at some point if we want to cope with truely mammoth
multi-sample data, and the above fixes means doing so will not
expose bugs.
However for now this check adds protection against malformed data
creating excessive memory usage and CPU requirements.
Credit to OSS-Fuzz
Fixes oss-fuzz 21139
Fixes oss-fuzz 20881
commit 8ff8bb328f5d4c2981933ff7847374ddc747d572
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Mon Mar 9 11:45:22 2020 +0000
Improved CRAM error messages in cram_decode_slice (PR #1042)
To help diagnose #1038 (Embedded reference issues) by reporting
which ref_id and region is affected.
commit 3b1ff68575c9db582a122048a7e95aa2b1a47b11
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Thu Mar 5 21:39:48 2020 +0000
Add tabix --verbosity option
commit 31b23540491e114e7cbfde790f5129c26c18eef4
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Wed Mar 4 14:12:20 2020 +0000
Add more error checking to the tabix application
Check for error returns from bcf_itr_next(), hts_getline()
and tbx_itr_next() so failures can be reported properly.
Add checks in other places that could fail, mostly memory
allocations and writes to the output file.
Note that bcf_itr_querys() and tbx_itr_querys() currently returns
NULL for both general failures and those caused by the requested
region not being present. As tabix silently skips missing regions,
it is currently not possible to detect errors in these functions.
commit 9979fac560319507cefe5d5682617304d609b589
Author: Valeriu Ohan <vo2@sanger.ac.uk>
Date: Mon Mar 2 08:16:20 2020 -0800
Free allocated SN strings when encountering an error. (PR #1034)
Upon successful completion of sam_hdr_create, ownership of the
newly created SN strings is given to sam_hdr_t::target_name, which
takes care of freeing them at the end. But if an error occurs, the
handover might fail and the allocated strings need to be cleaned up
by iterating through the hash table instead.
Credit to OSS-Fuzz
Fixes oss-fuzz 20888
commit 6149ea6f779939451b56e6f63b28409baead2990
Author: Rob Davies <rmd+git@sanger.ac.uk>
Date: Mon Feb 24 09:41:15 2020 +0000
Make multi-region iterator better at skipping unneeded data
Revert change to hts_itr_t::off in bfc9f0d so the `max` field
can be used to store information about which interval goes
with each entry.
Don't merge hts_itr_t::off entries for different intervals.
Make hts_itr_multi_next() skip offsets that will return no
data as they are for intervals that have already been completed.
Deal with possible failure of reg2intervals() in
hts_itr_multi_bam().
commit 27b670444a372809586eabd921d1f568c8cc6c30
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Fri Feb 21 11:22:11 2020 +0000
Add BAI linear index usage for multi-region iterator
commit 45715e421e1ce9d15d8880a24454ec7dc42bf4bc
Author: James Bonfield <jkb@sanger.ac.uk>
Date: Thu Feb 20 14:21:12 2020 +0000
Add use of BAI linear index to BGZF querys.
Previously it used the bin "loff" field for BAI, to match the CSI data
structure (which no longer has a linear index). Unfortunately when
coupled with the compress_binning() function this causes BAI to seek
to data earlier than is necessary when bins have been compressed.
This generally only affects low coverage data set and has minimal
impact on deep data as the bins are too large to remove.
For CSI we have no option of using the linear index, so this does not
change behaviour. Sadly this means CSI indices are less efficient to
use than BAI for many small queries on shallow data sets.
Some benchmarks on a shallow 1.9x coverage data set, using 1000
randomly picked regions on chr1 each between 1 bp and 10k bp in size.
Indexer Branch Time(s) BAM I/O(Mb) Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI develop 1.488 164.1 2.43
htsjdk-BAI develop 0.502 53.6 8.27
htslib-BAI this 0.498 53.5 2.43
htsjdk-BAI this 0.530 53.3 8.27
htslib-CSI this 1.532 163.3 0.49
For completeness, CRAM with varying seqs_per_slice of 300, 1000 or the
default of 10000, demonstrating the usefulness of changing this if you
need fine grained random access.
CRAM-s10k this 22.12 602.1 0.11
CRAM-s1k this 6.062 92.7 0.93
CRAM-s300 this 5.115 42.1 2.98
Htslib BAI index, samtools 1.10
-------------------------------
$ io_trace -S -r 9827_2 -- samtools1.10 view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R
File: 9827_2#49.bam
Num. open 1
Num. close 1
Num. seeks 969
Num. read 6040
Bytes read 164144977
File: 9827_2#49.bam.bai
Num. open 1
Num. close 1
Num. read 40
Bytes read 2431088
real 0m1.488s
user 0m1.448s
sys 0m0.036s
Htslib BAI index, this commit
-----------------------------
io_trace -S -r 9827_2 -- …
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The already present check in
cram_decode_slicedid not account for the last byte. This has been corrected.Credit to OSS-Fuzz
Fixes oss-fuzz 22231