BamUtil: recab error: failed to open reference genome #62

cbostwick87 · 2021-05-03T18:59:42Z

Hello,
I am trying to use the TOPMed Variant Calling Pipeline (latest: Freeze 8) to analyze human genetic data. I am following the instructions on the Readme (https://github.com/statgen/topmed_variant_calling) and have aligned my sequenced reads in BAM format (and indexed them) using bwa to the GRCh38 (full analysis set plus decoy hla) reference genome fasta. (Found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/).

The TOPMed pipeline readme does not mention the steps of deduplicating and recalibrating the BAM files, but as mentioned in the wiki here (https://genome.sph.umich.edu/wiki/GotCloud) and my own knowledge of variant calling pipeline best practices, I believe these steps are highly recommended.

So I would like to use BamUtil to deduplicate and recalibrate my BAM files. Following the instructions here (https://genome.sph.umich.edu/wiki/BamUtil:_recab), my command is:

bamUtil/bin/bam dedup --recab --in input.bam --out input.bam.recab --force --refFile /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa --dbsnp /ref/dbsnp_142.b38.vcf.gz --oneChrom --storeQualTag OQ --maxBaseQual 40

But it fails with the following error message:

Failed to open reference genome /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa /ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla-bs.umfa: wrong type of file (expected type 460927905 but got 1096302936)

I also tried the same command as above but using a similar(?) reference genome (hs38DH.fa), which I obtained (following the instructions here: https://github.com/statgen/topmed_variant_calling) using the command:
wget ftp://share.sph.umich.edu/1000genomes/fullProject/hg38_resources/topmed_variant_calling_example_resources.tar.gz

This returned a similar error:

GenomeSequence::open: failed to open file /ref/hs38DH-bs.umfa Failed to open reference genome /ref/hs38DH.fa hs38DH-bs.umfa: wrong type of file (expected type 460927905 but got 1415074904)

I am not sure what it means by "expected type 460927905". I would greatly appreciate any help in solving this issue. Thank you!

The text was updated successfully, but these errors were encountered:

jonathonl · 2021-05-04T22:28:46Z

Which version of bamUtil are you using? Can you try the NonPrimaryDedup branch.

cbostwick87 · 2021-05-04T22:58:19Z

Hi Jonathon,
Thank you for your help! I cloned bamUtil (along with the other tools I am using) from the https://github.com/statgen/topmed_variant_calling Github repo. Therefore, I believe the version is v.1.0.15
When I click on the NonPrimaryDedup branch you linked (https://github.com/statgen/bamUtil/tree/NonPrimaryDedup), it also says it is version v.1.0.15, but I will try to clone it and use this bamUtil branch and report back.

After cloning and making the NonPrimaryDedup branch bamUtil, I tried to use the dedup --recab command as above, but it returned exactly the same errors (trying both reference genomes) as before.

jonathonl · 2021-05-05T00:04:07Z

Ok. After cloning, you can switch to that branch by running git checkout NonPrimaryDedup.

cbostwick87 · 2021-05-05T18:40:19Z

I edited my previous comment, but I'm not sure it was seen. I used the command:
git clone https://github.com/statgen/bamUtil.git
cd bamUtil
make LIB_PATH_GENERAL=/topmed_variant_calling/libStatGen/ INSTALLDIR=.
to make the NonPrimaryDedup branch bamUtil.
I tried to use the bam dedup --recab command as above, but it returned exactly the same errors (trying both reference genomes) as before.

jonathonl · 2021-05-05T20:37:31Z

I cannot reproduce (see below). Can you try using test/testFiles/sortedBam1.bam as input? You could also try deleting /ref/hs38DH-bs.umfa. It may be corrupted and will be regenerated automatically once deleted.

$ bin/bam dedup --recab --in test/testFiles/sortedBam1.bam --out sortedBam1.bam.recab --force --refFile ../test_resources/resources/ref/hs38DH.fa --dbsnp ../test_resources/resources/ref/dbsnp_142.b38.vcf.gz --oneChrom --storeQualTag OQ --maxBaseQual 40
Creating FASTA binary cache file '../test_resources/resources/ref/hs38DH-bs.umfa'.
FASTA binary cache file '../test_resources/resources/ref/hs38DH-bs.umfa' created.
Load dbSNP file '../test_resources/resources/ref/dbsnp_142.b38.vcf.gz': (as text file) DONE!

cbostwick87 · 2021-05-13T17:34:31Z

Hi, thank you and sorry I took so long to respond.
I tested the sortedBam1.bam file and got the same error about the reference genome. I then tried deleting the .umfa file and created a new one using the NonPrimaryDedup branch bamUtil. Using this new .umfa file, I was able to successfully run the dedup and recalibration steps on the test bam and my own files! So it would seem that perhaps the /ref/hs38DH-bs.umfa file was indeed corrupted as you suggested. Thank you so much for your help!

cbostwick87 closed this as completed May 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BamUtil: recab error: failed to open reference genome #62

BamUtil: recab error: failed to open reference genome #62

cbostwick87 commented May 3, 2021

jonathonl commented May 4, 2021

cbostwick87 commented May 4, 2021 •

edited

Loading

jonathonl commented May 5, 2021

cbostwick87 commented May 5, 2021 •

edited

Loading

jonathonl commented May 5, 2021

cbostwick87 commented May 13, 2021

BamUtil: recab error: failed to open reference genome #62

BamUtil: recab error: failed to open reference genome #62

Comments

cbostwick87 commented May 3, 2021

jonathonl commented May 4, 2021

cbostwick87 commented May 4, 2021 • edited Loading

jonathonl commented May 5, 2021

cbostwick87 commented May 5, 2021 • edited Loading

jonathonl commented May 5, 2021

cbostwick87 commented May 13, 2021

cbostwick87 commented May 4, 2021 •

edited

Loading

cbostwick87 commented May 5, 2021 •

edited

Loading