Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hts_itr_query fails on empty BCFs with gapped tid blocks #1534

Closed
pd3 opened this issue Dec 8, 2022 · 1 comment · Fixed by #1545
Closed

hts_itr_query fails on empty BCFs with gapped tid blocks #1534

pd3 opened this issue Dec 8, 2022 · 1 comment · Fixed by #1545
Assignees

Comments

@pd3
Copy link
Member

pd3 commented Dec 8, 2022

This is related to #1533 and concerns BCFs with edited headers and missing data records. Consider this example

$ echo -e '##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566,IDX=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr22\t1\t.\tA\tC\t.\t.\t.' > test.vcf
$ bcftools view test.vcf -o test.bcf
$ bcftools index test.bcf

Note the header line contains the field IDX=1 which makes it behave as if BCF was edited and the first chromosome with IDX=0 was removed.

Prior to the commit d64e710 this command fails with

$ bcftools isec -n 2 test.bcf test.bcf
bcftools: vcf.c:2223: bcf_hdr_seqnames: Assertion `tid<m' failed.
Aborted

with the fix applied, it works

$ bcftools-d64e710 isec -C test.bcf test.bcf
chr22	1	A	C	11

However, if there are no data records, hts_itr_query returns an error and the program fails

$ echo -e '##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566,IDX=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > test2.vcf
$ bcftools view test2.vcf -o test2.bcf
$ bcftools index test2.bcf
$ bcftools isec -n 2 test2.bcf test2.bcf
[E::_reader_seek] Could not seek: chr22:1-17592186044415
bcftools: synced_bcf_reader.c:532: _reader_seek: Assertion `0' failed.
Aborted

This problem does not appear when the chromosome tid block has no gaps, i.e. starts with IDX=0

$ echo -e '##fileformat=VCFv4.2\n##contig=<ID=chr22,length=51304566,IDX=0>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' > test3.vcf
$ bcftools view test3.vcf -o test3.bcf
$ bcftools index test3.bcf
$ bcftools isec -n 2 test3.bcf test3.bcf
@daviesrob
Copy link
Member

The difference is due to the index on the file with no data records claiming that there is only one reference, while the one with data records says there are two. In the no-data case, trying to look up the index entry causes hts_itr_query() to return NULL thanks to this check on tid triggering this assertion in _reader_seek().

The initial count of the number of references is made in idx_calc_n_lvls_ids(), which does not check the headers for IDX= values. Once you start adding data records, the problem if fixed up by hts_idx_push() which expands the index if it gets a tid value outside the expected range.

I'm not sure if this counts as an indexing error, or if hts_itr_query() should be more generous when it's given a tid with no corresponding index entry.

As a side note: the idx->bidx[tid] == NULL test in hts_itr_query() only works on refs with no data because idx_read_core() makes bidx[] entries for everything, even if there's nothing in the index for a given reference. The same wouldn't be true for an index you've just built though - for them you only have a bidx entry if some data existed. At the moment trying to look up a reference with no data on an index you've just built would fail.

@daviesrob daviesrob self-assigned this Jan 3, 2023
daviesrob added a commit to daviesrob/htslib that referenced this issue Jan 6, 2023
Instead return one that instantly finishes.

Fixes an edge case (issue samtools#1534) where an index did not include
an entry for a chromosome that was mentioned in the file header
but had no data records.  Normally these would be present but
empty, but it was possible to use the IDX= key in a VCF file
to make an index where the chromosome simply did not appear.  In
this case, rather than an error, we want to return the equivalent
of HTS_IDX_NONE so the iterator produces no data.

Another scenario where this is useful is if you build an index,
and then try to use it immediately without first saving and
reading it back in again.  Such an index will have NULL entries
in bidx[] for any chromosomes with no data.  Again we want to return
an HTS_IDX_NONE iterator if one of those chromosomes is queried.
(This issue didn't usually occur because most programs are loading
in an existing index, and idx_read_core() makes bidx[] entries
for everything even if there's nothing in the index for the
chromosome.)

Note that this changes vcf_loop() in test_view.c so that it
now treats bcf_itr_querys() failures as an error.  The new
behaviour matches sam_loop() and is needed to detect the problem
being fixed here.  All the other tests still work after this change
no nothing was relying on the old behaviour of ignoring the errors.
daviesrob added a commit to daviesrob/htslib that referenced this issue Jan 6, 2023
Instead return one that instantly finishes.

Fixes an edge case (issue samtools#1534) where an index did not include
an entry for a chromosome that was mentioned in the file header
but had no data records.  Normally these would be present but
empty, but it was possible to use the IDX= key in a VCF file
to make an index where the chromosome simply did not appear.  In
this case, rather than an error, we want to return the equivalent
of HTS_IDX_NONE so the iterator produces no data.

Another scenario where this is useful is if you build an index,
and then try to use it immediately without first saving and
reading it back in again.  Such an index will have NULL entries
in bidx[] for any chromosomes with no data.  Again we want to return
an HTS_IDX_NONE iterator if one of those chromosomes is queried.
(This issue didn't usually occur because most programs are loading
in an existing index, and idx_read_core() makes bidx[] entries
for everything even if there's nothing in the index for the
chromosome.)

Note that this changes vcf_loop() in test_view.c so that it
now treats bcf_itr_querys() failures as an error.  The new
behaviour matches sam_loop() and is needed to detect the problem
being fixed here.  All the other tests still work after this change
no nothing was relying on the old behaviour of ignoring the errors.
jkbonfield pushed a commit that referenced this issue Jan 16, 2023
Instead return one that instantly finishes.

Fixes an edge case (issue #1534) where an index did not include
an entry for a chromosome that was mentioned in the file header
but had no data records.  Normally these would be present but
empty, but it was possible to use the IDX= key in a VCF file
to make an index where the chromosome simply did not appear.  In
this case, rather than an error, we want to return the equivalent
of HTS_IDX_NONE so the iterator produces no data.

Another scenario where this is useful is if you build an index,
and then try to use it immediately without first saving and
reading it back in again.  Such an index will have NULL entries
in bidx[] for any chromosomes with no data.  Again we want to return
an HTS_IDX_NONE iterator if one of those chromosomes is queried.
(This issue didn't usually occur because most programs are loading
in an existing index, and idx_read_core() makes bidx[] entries
for everything even if there's nothing in the index for the
chromosome.)

Note that this changes vcf_loop() in test_view.c so that it
now treats bcf_itr_querys() failures as an error.  The new
behaviour matches sam_loop() and is needed to detect the problem
being fixed here.  All the other tests still work after this change
no nothing was relying on the old behaviour of ignoring the errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants