Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BamUtil: recab error: failed to open reference genome #62

Closed
cbostwick87 opened this issue May 3, 2021 · 6 comments
Closed

BamUtil: recab error: failed to open reference genome #62

cbostwick87 opened this issue May 3, 2021 · 6 comments

Comments

@cbostwick87
Copy link

Hello,
I am trying to use the TOPMed Variant Calling Pipeline (latest: Freeze 8) to analyze human genetic data. I am following the instructions on the Readme (https://github.com/statgen/topmed_variant_calling) and have aligned my sequenced reads in BAM format (and indexed them) using bwa to the GRCh38 (full analysis set plus decoy hla) reference genome fasta. (Found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/).

The TOPMed pipeline readme does not mention the steps of deduplicating and recalibrating the BAM files, but as mentioned in the wiki here (https://genome.sph.umich.edu/wiki/GotCloud) and my own knowledge of variant calling pipeline best practices, I believe these steps are highly recommended.

So I would like to use BamUtil to deduplicate and recalibrate my BAM files. Following the instructions here (https://genome.sph.umich.edu/wiki/BamUtil:_recab), my command is:

bamUtil/bin/bam dedup --recab --in input.bam --out input.bam.recab --force --refFile /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa --dbsnp /ref/dbsnp_142.b38.vcf.gz --oneChrom --storeQualTag OQ --maxBaseQual 40

But it fails with the following error message:

Failed to open reference genome /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla-bs.umfa: wrong type of file (expected type 460927905 but got 1096302936)

I also tried the same command as above but using a similar(?) reference genome (hs38DH.fa), which I obtained (following the instructions here: https://github.com/statgen/topmed_variant_calling) using the command:
wget ftp://share.sph.umich.edu/1000genomes/fullProject/hg38_resources/topmed_variant_calling_example_resources.tar.gz

This returned a similar error:

GenomeSequence::open: failed to open file /ref/hs38DH-bs.umfa Failed to open reference genome /ref/hs38DH.fa hs38DH-bs.umfa: wrong type of file (expected type 460927905 but got 1415074904)

I am not sure what it means by "expected type 460927905". I would greatly appreciate any help in solving this issue. Thank you!

@jonathonl
Copy link
Contributor

Which version of bamUtil are you using? Can you try the NonPrimaryDedup branch.

@cbostwick87
Copy link
Author

cbostwick87 commented May 4, 2021

Hi Jonathon,
Thank you for your help! I cloned bamUtil (along with the other tools I am using) from the https://github.com/statgen/topmed_variant_calling Github repo. Therefore, I believe the version is v.1.0.15
When I click on the NonPrimaryDedup branch you linked (https://github.com/statgen/bamUtil/tree/NonPrimaryDedup), it also says it is version v.1.0.15, but I will try to clone it and use this bamUtil branch and report back.

After cloning and making the NonPrimaryDedup branch bamUtil, I tried to use the dedup --recab command as above, but it returned exactly the same errors (trying both reference genomes) as before.

@jonathonl
Copy link
Contributor

Ok. After cloning, you can switch to that branch by running git checkout NonPrimaryDedup.

@cbostwick87
Copy link
Author

cbostwick87 commented May 5, 2021

I edited my previous comment, but I'm not sure it was seen. I used the command:
git clone https://github.com/statgen/bamUtil.git
cd bamUtil
make LIB_PATH_GENERAL=/topmed_variant_calling/libStatGen/ INSTALLDIR=.
to make the NonPrimaryDedup branch bamUtil.
I tried to use the bam dedup --recab command as above, but it returned exactly the same errors (trying both reference genomes) as before.

@jonathonl
Copy link
Contributor

I cannot reproduce (see below). Can you try using test/testFiles/sortedBam1.bam as input? You could also try deleting /ref/hs38DH-bs.umfa. It may be corrupted and will be regenerated automatically once deleted.

$ bin/bam dedup --recab --in test/testFiles/sortedBam1.bam --out sortedBam1.bam.recab --force --refFile ../test_resources/resources/ref/hs38DH.fa --dbsnp ../test_resources/resources/ref/dbsnp_142.b38.vcf.gz --oneChrom --storeQualTag OQ --maxBaseQual 40
Creating FASTA binary cache file '../test_resources/resources/ref/hs38DH-bs.umfa'.
FASTA binary cache file '../test_resources/resources/ref/hs38DH-bs.umfa' created.
Load dbSNP file '../test_resources/resources/ref/dbsnp_142.b38.vcf.gz': (as text file) DONE!

@cbostwick87
Copy link
Author

Hi, thank you and sorry I took so long to respond.
I tested the sortedBam1.bam file and got the same error about the reference genome. I then tried deleting the .umfa file and created a new one using the NonPrimaryDedup branch bamUtil. Using this new .umfa file, I was able to successfully run the dedup and recalibration steps on the test bam and my own files! So it would seem that perhaps the /ref/hs38DH-bs.umfa file was indeed corrupted as you suggested. Thank you so much for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants