
[E::bcf_hdr_read] Failed to read BCF header #672

Open
Jolvii85 opened this issue Aug 30, 2017 · 9 comments

Comments

@Jolvii85

Hi,

I am using Samtools and Bcftools to call SNPs with sugar pine as a reference (58407655 scaffolds). I am getting the error message "[E::bcf_hdr_read] Failed to read BCF header". Could anyone help me fix this?

Best,

Wei

@pd3
Member

pd3 commented Aug 30, 2017

The error comes from here: https://github.com/samtools/htslib/blob/develop/vcf.c#L923. Most likely the header is too big to fit in SIZE_MAX bytes. I am not sure this check is a good idea; apparently the standard merely says that the size must be at least 65535 bytes, which is too limiting. As far as I know, the BCF specification does not impose any limit on the header size, and certainly not one this small. @daviesrob, do you want to comment?

@Jolvii85
Author

You are right, the header could be causing this error. With the same data and commands, if I use Pinus taeda as the reference, there are no errors.

Could you help me to fix this problem?

Thank you!

@daviesrob
Member

SIZE_MAX is the maximum value that a size_t can hold, which is either 4294967295 or 18446744073709551615 bytes, depending on whether you have a 32-bit or 64-bit platform. The check is there to ensure that hlen + 1 does not overflow on 32-bit platforms.

BCF does have a limit on the size of the header, as it stores the length in the four bytes following the BCF\2\2 "magic" string (interestingly, this does not seem to be mentioned in the specification). This means you can't have a header longer than 4294967295 bytes in a BCF2 file. Unfortunately, there's no overflow check when this length is written out so currently HTSlib will write BCF headers that can't be read back in again if the header is too long.

Given the number of scaffolds, it's quite possible that this header is longer than the limit. bcf_hdr_read is probably not reading the entire header, and failing because bcf_hdr_parse finds an incomplete record.

I'm afraid the only solutions would be to use VCF instead of BCF, or try to shrink the header somehow. Given the number of records, you would need to use less than 73 bytes per scaffold.

@pd3
Member

pd3 commented Aug 30, 2017

As a quick temporary workaround, try commenting out this line and recompiling: https://github.com/samtools/htslib/blob/develop/vcf.c#L914

**Do not** remove this line. It exists to prevent a buffer overflow on 32-bit platforms, where malloc(0) returns a valid pointer:

  • An attacker prepares a BCF file with hlen encoded as 0xffffffff.
  • malloc(hlen + 1) becomes malloc(0) due to integer wrap-around.
  • malloc(0) returns a non-NULL pointer in htxt.
  • HTSlib tries to read up to 4294967295 bytes into the memory pointed to by htxt.
  • The attacker is able to write any data they want into the heap.

This would be a very bad vulnerability if it were allowed to happen.

@pd3
Member

pd3 commented Aug 30, 2017

@daviesrob My information about SIZE_MAX was based on this thread: https://stackoverflow.com/questions/22514803/maximum-size-of-size-t. I did not investigate whether the assertion in the accepted answer is correct.

The limit on the size of the BCF header is of course part of the implementation, but not of the specification yet.

Perhaps the sanity check should be done differently and not rely on SIZE_MAX. We know exactly what the maximum value of a four-byte integer can be, and the format cannot be platform dependent.

@Jolvii85
Author

I will try commenting out that line and recompiling htslib and bcftools to see whether it works or not. I will report back later. Thank you.

@pd3
Copy link
Member

pd3 commented Aug 30, 2017

Unless you are using some obscure system, I think @daviesrob is right and it will not help; the header is probably too big to be represented in this version of BCF.

@Jolvii85
Author

Hi @daviesrob, what do you mean by "use VCF instead of BCF"? I want to use the Samtools and Bcftools pipeline for my project. I have more than 400 BAM files, so it is hard to shrink the header for each file. Can you give me some suggestions?
Thank you!

@daviesrob
Member

I mean you need to write out the text VCF format instead of the binary BCF format. You can get samtools and bcftools to do this by using the right command-line options, for example:

samtools mpileup --VCF  [ ... other options here ... ]
bcftools mpileup --output-type z [ ... other options here ... ]

For bcftools, --output-type z will give you a compressed VCF file, which is best if you want to save the result; --output-type v will write uncompressed VCF, which you can view directly. When saving, you should also use a .vcf (or .vcf.gz for compressed) suffix instead of .bcf.
