Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BCF header contig IDX #64

Closed
malthesr opened this issue Jan 4, 2022 · 2 comments
Closed

BCF header contig IDX #64

malthesr opened this issue Jan 4, 2022 · 2 comments
Assignees
Labels
bcf bug Something isn't working enhancement New feature or request

Comments

@malthesr
Copy link
Contributor

malthesr commented Jan 4, 2022

I was having problems using a BCF file produced by noodles in bcftools. After a bit of digging, I believe my issue has to do with the handling of the IDX field in the header contig field. To illustrate, running

echo -e '##fileformat=VCFv4.3\n##contig=<ID=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample' | bcftools view -O b > test.bcf.gz

and then

use std::{error, fs, io};

use noodles_bcf as bcf;
use noodles_vcf as vcf;

fn main() -> Result<(), Box<dyn error::Error>> {
    let mut reader = fs::File::open("test.bcf.gz").map(bcf::Reader::new)?;

    reader.read_file_format()?;
    let raw_header = reader.read_header()?;
    let header: vcf::Header = raw_header.parse()?;

    let mut writer = vcf::Writer::new(io::stdout());

    writer.write_header(&header)?;

    Ok(())
}

prints something like

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=1,IDX="0">
##bcftools_viewVersion=1.14-25-gc304cb1+htslib-1.14-9-ge769401
##bcftools_viewCommand=view -O b; Date=Tue Jan  4 13:22:23 2022
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample

Note in particular that the value "0" of the IDX field on the contig record is quoted ##contig=<ID=1,IDX="0">, since it is encoded as a regular entry of a IndexMap<String, String> rather than as a particular Option<usize> field as used e.g. for filters.

When writing BCFs with noodles, this causes issues when interfacing e.g. with bcftools. Modifying the above slightly to write a BCF,

Code
use std::{error, fs, io};

use noodles_bcf as bcf;
use noodles_vcf as vcf;

fn main() -> Result<(), Box<dyn error::Error>> {
    let mut reader = fs::File::open("test.bcf.gz").map(bcf::Reader::new)?;

    reader.read_file_format()?;
    let raw_header = reader.read_header()?;
    let header: vcf::Header = raw_header.parse()?;

    let mut writer = bcf::Writer::new(io::stdout());

    writer.write_file_format()?;
    writer.write_header(&header)?;

    Ok(())
}

and running for instance

cargo run | bcftools view

produces

[W::bcf_hdr_register_hrec] Error parsing the IDX tag, skipping
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##bcftools_viewVersion=1.14-25-gc304cb1+htslib-1.14-9-ge769401
##bcftools_viewCommand=view -O b; Date=Tue Jan  4 13:22:23 2022
##bcftools_viewCommand=view; Date=Tue Jan  4 13:38:29 2022
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample

where the contig entry has been skipped due to an error parsing the string IDX tag.

Digging around in the BCF specs sections 6.2.1-6.2.2, it's not clear to me how, if at all, the IDX field should work for contigs. My impression from reading samtools/hts-specs#591, however, is that contig IDXs should somehow be handled, albeit separately from other header records? I don't understand the details of the BCF spec well enough to be sure that I completely follow that thread, however.

Feel free to close if you think this is a problem in bcftools or with the spec, rather than with noodles - I figured either way it might save someone else a bit of time repeating the above investigation if there's a searchable issue describing the problem.

@zaeleus zaeleus added the bcf label Jan 4, 2022
@zaeleus
Copy link
Owner

zaeleus commented Jan 5, 2022

Thanks for the detailed report. I see the issue here.

It was an unfortunate decision to entangle the format/spec with an implemntation detail. A few changes will have to be made to provide compatibility with htslib.

  • bcf::header::StringMap needs to be reworked to allow nonsequential entries. This means it can no longer be backed by an IndexSet. (4291d88 and 4e39a77)
  • A StringMap will need to be created for contig header records. Given that absolute positions can override relative positions, the order of vcf::header::Contigs cannot be used. (17397d7)
  • Document that a record chromosome ID (chrom) is not an index into vcf::header::Contigs but an index into the contig StringMap (6b73f93).
  • Writing a VCF record needs to use the contig StringMap instead of the VCF header contigs (9da12e7).
  • Converting a BCF record to BCF record needs to the contig StringMap instead of the VCF header contigs (9c97ba8).
  • Parse IDX field from raw record in vcf::header::Contig (22046f0).

I will work on these.

@zaeleus zaeleus self-assigned this Jan 5, 2022
@zaeleus
Copy link
Owner

zaeleus commented Jan 13, 2022

noodles 0.17.0 includes the changes that resolve this issue, mainly parsing the IDX field on VCF header contig records and also building a string map for those records.

The implementation is a breaking change. The user now has to parse a bcf::header::StringMaps, which includes both the dictionary of strings and dictionary of contigs. Typically, if any method will need both maps (e.g., Record::try_into_vcf_record and Writer::write_vcf_record), use the collection; otherwise, use the respective string map (StringMaps::strings or StringMaps::contigs). For example,

let raw_header = reader.read_header()?;
let header = raw_header.parse()?;
let string_maps: StringMaps = raw_header.parse()?;

let record = bcf::Record::default();
let dp = record.info().get(&header, string_maps.strings(), Key::TotalDepth);

let chromosome_id = usize::try_from(record.chromosome_id()).expect("invalid chrom");
let chromosome_name = strings_map.contigs().get_index(chromosome_id);

writer.write_vcf_record(&header, &string_maps, &record)?;

@zaeleus zaeleus closed this as completed Jan 13, 2022
@zaeleus zaeleus added bug Something isn't working enhancement New feature or request labels Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bcf bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants