bcftools can't parse CONTIG ID containing a comma #266

Closed
talshmaya opened this issue May 21, 2015 · 6 comments
@talshmaya

I have vcf files with this line in the header:

##reference=<ID=hs37d5, Source=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz>

bcftools fails to parse the file due to the space after the comma (before Source=). It works when I delete the space.

I get this error:

$ ./bcftools merge file1.vcf.gz file2.vcf.gz
Could not parse the header line: "##reference=<ID=hs37d5, Source=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz>"
Could not parse the header line: "##reference=<ID=hs37d5, Source=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz>"
[W::vcf_parse] INFO 'DPR' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'NREF' is not defined in the header, assuming Type=String
[W::vcf_parse] FILTER 'avr010' is not defined in the header
[W::vcf_parse] INFO 'DPR' is not defined in the header, assuming Type=String
[W::vcf_parse] FILTER 'avr010' is not defined in the header
[W::vcf_parse] INFO 'XRX' is not defined in the header, assuming Type=String
Error: The INFO field is not defined in the header: DPR
mcshane added a commit to mcshane/htslib that referenced this issue Jun 1, 2015
* allow spaces between keys and values when parsing in header lines
* these spaces will be dropped when writing out the header

e.g. `##reference=<ID=hs37d5, Source=blah>` and `##reference=<ID=hs37d5, Source = blah>`
     will become `##reference=<ID=hs37d5,Source=blah>`

Fixes samtools/bcftools#266
mcshane added a commit to mcshane/htslib that referenced this issue Jul 6, 2015
* allow spaces between keys and values when parsing in header lines
* these spaces will be dropped when writing out the header

e.g. `##reference=<ID=hs37d5 , Source=blah>` and `##reference=<ID=hs37d5, Source = blah >`
     will become `##reference=<ID=hs37d5,Source=blah>`

Fixes samtools/bcftools#266
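Until a build containing that change is available, the normalization the commit message describes can be approximated on the header text before it reaches bcftools. The following is a minimal Python sketch, not the htslib implementation: the function name `normalize_header_line` and both regular expressions are assumptions made for illustration. It drops spaces around `,` and `=` inside `<...>` and leaves double-quoted values untouched.

```python
import re

# A double-quoted string (kept verbatim), or a ',' / '=' with optional
# whitespace on either side.
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\s*([,=])\s*')
# A structured header line of the form ##key=<...>.
_STRUCTURED = re.compile(r'^(##[^=<]+=)<(.*)>$')

def normalize_header_line(line: str) -> str:
    """Return `line` with spaces around ',' and '=' removed inside <...>.

    `line` is expected without its trailing newline. Lines that are not
    structured `##key=<...>` header lines are returned unchanged.
    """
    m = _STRUCTURED.match(line)
    if m is None:
        return line
    prefix, body = m.groups()
    # Quoted strings match the first alternative and are emitted as-is;
    # separators match the second and are emitted without the padding.
    body = _TOKEN.sub(lambda t: t.group(1) or t.group(0), body.strip())
    return f"{prefix}<{body}>"

print(normalize_header_line("##reference=<ID=hs37d5 , Source = blah >"))
# prints: ##reference=<ID=hs37d5,Source=blah>
```

Applied to every line of the header, this reproduces the "delete the space" workaround from the original report without editing the file by hand.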
@lskatz

lskatz commented Jan 5, 2016

This is a similar bug that I just found with @andrewdhuang. We tried bcftools concat on a set of VCFs that have a comma at the end of the seqname O103H2084006_A,. This is the error we get:

Could not parse the header line: "##contig=<ID=O103H2084006_A,,IDX=0>"

We believe bcftools is writing the ##contig header into the output file without quoting the ID. Unquoted, the trailing comma is read as a field separator, so the header appears to contain an empty field. I think bcftools should either raise an error for this kind of contig name or quote the contig name somehow.

Please forgive me if this is the wrong place to report this bug.
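The "raise an error" option suggested above could look roughly like the check below. This is only a sketch in Python: the function `check_contig_name` and the exact set of characters it treats as unsafe are assumptions for illustration, not behavior of bcftools or htslib.

```python
import re

# Assumption for this sketch: whitespace, ',', '"', '<' and '>' are treated
# as unsafe because they have structural meaning in a ##contig=<...> line.
_UNSAFE = re.compile(r'[\s,"<>]')

def check_contig_name(name: str) -> None:
    """Raise ValueError if `name` would be ambiguous in an unquoted ##contig line."""
    bad = sorted(set(_UNSAFE.findall(name)))
    if bad:
        raise ValueError(
            f"contig name {name!r} contains characters that are unsafe "
            f"in a VCF header: {bad}"
        )

for name in ["O103H2084006_A", "O103H2084006_A,"]:
    try:
        check_contig_name(name)
        print(f"ok: {name}")
    except ValueError as err:
        print(f"rejected: {err}")
```

Running such a check on the sequence names before calling variants would surface the problem earlier than the header parse failure does.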

@pd3 pd3 self-assigned this Jan 12, 2016
@pd3
Member

pd3 commented Jan 12, 2016

Hi, this is an issue in htslib, but the report here is OK. It is now fixed by pd3/htslib@45379c2 and pd3/htslib@d04b77a.

(This test file in bcftools will need updating after this is merged, because the test fills in sequence names on the fly: https://github.com/samtools/bcftools/blob/develop/test/isec.tab.out)

Cheers

@jmarshall
Member

Original bug auto-closed now that samtools/htslib#214 has been merged; re-opening this issue to track the new comma/quoting issue.

@jmarshall jmarshall reopened this Jan 27, 2016
@jmarshall jmarshall changed the title bcftools can't parse header when spaces are present bcftools can't parse CONTIG ID containing a comma Feb 15, 2016
mcshane pushed a commit to samtools/htslib that referenced this issue Feb 18, 2016
…e,1">

* fixes a parsing problem when comma in the contig name (closes samtools/bcftools#266)
* when injecting contigs in the header from the index, use quoted contig IDs
* some bcftools test output files will need to be updated if this is pulled in
* there are places in at least bcftools/vcfconvert.c and samtools/bam_plcmd.c
  where contig header lines are created. These should also be changed to have quoted IDs

* still an issue: what happens if there is a `"` in the contig name
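To make the open question about `"` concrete, here is a hedged Python sketch of one possible convention: quote the ID and backslash-escape any embedded `"` or `\`, then split fields only on commas outside quotes. The helper names `quote_value` and `split_fields` are hypothetical, and the backslash-escaping rule is an assumption; the thread does not say how htslib handles this case.

```python
def quote_value(value: str) -> str:
    """Wrap `value` in double quotes, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def split_fields(body: str) -> dict:
    """Parse the inside of <...> into a dict, honoring quotes and escapes."""
    fields, key, buf = {}, None, []
    in_quotes = False
    i = 0
    while i < len(body):
        ch = body[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(body):
                buf.append(body[i + 1])  # keep the escaped character literally
                i += 2
                continue
            if ch == '"':
                in_quotes = False
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == "=" and key is None:
            key, buf = "".join(buf), []
        elif ch == ",":
            if key is None:
                raise ValueError(f"field without '=' near position {i}")
            fields[key] = "".join(buf)
            key, buf = None, []
        else:
            buf.append(ch)
        i += 1
    if in_quotes or key is None:
        raise ValueError("unterminated quote or trailing field without '='")
    fields[key] = "".join(buf)
    return fields

body = f"ID={quote_value('O103H2084006_A,')},IDX=0"
print(body)
print(split_fields(body))
```

With the ID quoted, the trailing comma stays part of the name, whereas the unquoted form `ID=O103H2084006_A,,IDX=0` from the earlier report is rejected by this sketch as a field without `=`.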
@pd3
Member

pd3 commented Jan 3, 2018

I believe this is now solved, please reopen if not.

@pd3 pd3 closed this as completed Jan 3, 2018
@zpingfeng

I ran bcftools on VCF files generated by freebayes, but it failed to parse the header lines (see the commands and error messages below). I used bcftools version 1.8, but sorted with samtools version 1.3. Could that be the problem?

bcftools consensus -f ../../Euc_RefSeq.fas -I test.vcf.gz -o test.unmask.fa

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=Egrandis_33521_Branched_chain_aminotransferase_BCAT1,_pyridoxal_phosphate_enzymes_type_IV_superfamily,length=1650>"

[W::bcf_hdr_parse] Could not parse header line: ##contig=<ID=Egrandis_33521_Branched_chain_aminotransferase_BCAT1,_pyridoxal_phosphate_enzymes_type_IV_superfamily,length=1650>

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=Egrandis_contig_40711_Uncharacterized_membrane_protein,_predicted_efflux_pump,length=1488>"

[W::bcf_hdr_parse] Could not parse header line: ##contig=<ID=Egrandis_contig_40711_Uncharacterized_membrane_protein,_predicted_efflux_pump,length=1488>

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=Egrandis_contig_40796__Protein_of_unknown_function_(DUF3675)_Zinc_finger,C3HC4_type(RING_finger),length=810>"

[W::bcf_hdr_parse] Could not parse header line: ##contig=<ID=Egrandis_contig_40796__Protein_of_unknown_function_(DUF3675)_Zinc_finger,C3HC4_type(RING_finger),length=810>

....

Note: the --sample option not given, applying all records

The fasta sequence does not match the REF allele at cpl_Euc_Mauve_Alignment_extraction_ndhF:196:

.vcf: [N]

.vcf: [N] <- (ALT)

.fa: [M]GAGTTCGGTCACTTAATAGATCCACTTACTTCTATTATGTTAATATTAATTACTACTGTTGGAATTTTGGTTCTTTTTTATAGTGACAATTATATGTCTCATGATCAAGGATATTTGAGATTTTTTGCTTATATGAGTTTTTTCAATACTTCCATGTTGGGATTAGTTACTAGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGTTTTTGGTTCACACGACCTAGTGCGGCGACTGCTTGTCAAAAAGCGTTTGTAACGAATCGTGTAGGCGATTTTGGTTTATTATTAGGAATTTTAGGTCTTTATTGGATAACCGGTAGTTTTGAATTTCGGGATTTGTTCCAAATATTGAATAACTTGATTTATAATAATGAGGTTCCTTTTTTATTTCTTACTTTGTGTGCCTTTCTTTTATTTGCAGGTGCAGTTGCGAAATCGGCACAATTCCCCCTTCATGTATGGTTACCTGATGCCATGGAAGGCCCTACTCCCATTTCGGCTCTTATACATGCCGCTACTATGGTAGCAGCGGGCATTTTTCTTGTAGCTCGACTTCTTCCTCTTTTTATAATCATACCTTACATAATGAATTTCATATCTTTAATAGGTATAATAACAGTATTATTAGGAGCTACTTTAGCTCTTGCTCAAAAAGATATTAAAAGAGGTTTAGCTTATTCTACAATGTCTCAATTGGGTTATATGATGTTAGCTCTAGGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCAAAAGCATTGTTGTTTTTAGGATCCGGATCAATTATTCATTCAATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGTCAGAATATGGTTCTTATGGGAGGTTTAAAAAAGCATGTACCAATTACAAAAACTGCTTTTTTAGTAGGTACACTTTCTCTTTGTGGTATTCCCCCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAATGATAGTTGGTTGTATTCACCTATTTTCGCAATAATAGCTTGTTCCACAGCAGGATTAACCGCATTTTATATGTTTCGAATCTATTTMCTTACTTTTGAGGGACATTTCAATGTTCATTTTCAAAATTACAATGGTCAAAAAAGTAGTTCCTGCTATTCAATATCTCTATGGGGAAAAGAAGTGCCAAAAAYGATTAAAAATCATTTTTGTTTATTAAGTTTATTRACAATGAATAATAATGAAAGGRCTTCTTTTTTTTCGAATAARACATATCAAATTGATGGTAATGGAAAAAACAGGATACGYCCTTTTATTACTATTACTMATTTTGTCACTAAAAAWACTTTCTCTTATCCTCATGAATCGGACAATACCATGTTRTTTTCTATGGTTATATTAGTGYYATTTACTTTGTTTGTTGGGGTCGTAGGAATTCCCTTTGCTTTTAATCAAGAAGAAATTCATTTGGATATATTATCTAAATTGTTAAATCCGTCTATAAACCTTTTACATCCGAATTCAAATAATTCGGTGGATTGGTATGAATTTGTGACAAATGCAAGTTTTTCTGTCAGWATAGCTTTTTTCGGAATATTTATAGSGTCTTTTTTATATAASCCTATTTATTCATCTTTACAAAATTTGAACTTACTRAATTCGTTTTCTAAAAGAGGTYCTAATMGAATTTTAGGGGACAGAATAAGAAATGGGATATATGATTGGTCATATAATCGTGGTTACATAGATGCTTTTTATACAATAYCCTTAACTCAGGGTATAAGAGGACTAGCTGAACTAATTCATTTTTTGGATAGACGASTAATTGATGGAATTACGAATGGTYTCG

bcftools consensus -f ../../Euc_RefSeq.fas -I -m test-region.bed test.vcf.gz -o test.mask.fa

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=Egrandis_33521_Branched_chain_aminotransferase_BCAT1,_pyridoxal_phosphate_enzymes_type_IV_superfamily,length=1650>"

[W::bcf_hdr_parse] Could not parse header line: ##contig=<ID=Egrandis_33521_Branched_chain_aminotransferase_BCAT1,_pyridoxal_phosphate_enzymes_type_IV_superfamily,length=1650>

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=Egrandis_contig_40711_Uncharacterized_membrane_protein,_predicted_efflux_pump,length=1488>"

[W::bcf_hdr_parse] Could not parse header line: ##contig=<ID=Egrandis_contig_40711_Uncharacterized_membrane_protein,_predicted_efflux_pump,length=1488>

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=Egrandis_contig_40796__Protein_of_unknown_function_(DUF3675)_Zinc_finger,C3HC4_type(RING_finger),length=810>"
....

Could not parse bed line: cpl_Egrandis_20012500 480

Failed to initialize mask regions

@Paul-rk-cruz

Still issues in 2021...

[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=JF781502 Human rhinovirus B strain HRV-B84_p1098_sR861_2008 polyprotein gene, complete cds,length=6941>"
[E::bcf_hdr_parse_line] Could not parse the header line: "##contig=<ID=KY369881 Human rhinovirus C2 strain SC9735, complete genome,length=7064>"
[E::faidx_adjust_position] The sequence "JQ837720 Human rhinovirus C strain HRV-C17_p1192_sR2967_2009 polyprotein gene, complete cds" was not found
[E::bam_plp_push] The input is not sorted (reads out of order)
[W::bcf_record_check] Bad BCF record at 544: Invalid CONTIG id 277

Any solutions?
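One workaround that is not confirmed anywhere in this thread is to keep such names out of the headers entirely by renaming the reference sequences before alignment and variant calling, so that every downstream file agrees on the names. The Python sketch below assumes that keeping only the first whitespace-delimited token of each FASTA header (usually the accession) and replacing remaining risky characters with `_` is acceptable for your data; `sanitize_name` and `sanitize_fasta` are hypothetical helpers, not part of bcftools or samtools.

```python
import re

# Assumption for this sketch: only letters, digits, '_', '.' and '-' are kept.
_RISKY = re.compile(r"[^0-9A-Za-z_.\-]")

def sanitize_name(header: str) -> str:
    """Reduce the text after '>' to its first token and replace risky characters."""
    parts = header.split()
    if not parts:
        raise ValueError("empty FASTA header")
    return _RISKY.sub("_", parts[0])

def sanitize_fasta(lines):
    """Yield FASTA lines with sanitized '>' headers; fail on name collisions."""
    seen = {}
    for line in lines:
        if line.startswith(">"):
            original = line[1:].rstrip("\n")
            name = sanitize_name(original)
            if name in seen:
                raise ValueError(
                    f"{original!r} and {seen[name]!r} both map to {name!r}"
                )
            seen[name] = original
            yield f">{name}\n"
        else:
            yield line

fasta = [
    ">KY369881 Human rhinovirus C2 strain SC9735, complete genome\n",
    "ACGT\n",
]
print("".join(sanitize_fasta(fasta)), end="")
```

The collision check matters because two long descriptions can share the same first token; in that case the renaming rule needs to be adjusted rather than applied blindly.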
