-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trouble converting .cram to .fq #1515
Comments
The cause of your error appears to be a build issue. I don't recognise the -ffile-prefix-map options, so I'm guessing this is some binary install from a distribution somewhere rather than a manual user driven build. It looks like htslib has been compiled without finding a working copy of curl, so it has no access to fetching references via https. That'll be where the "Protocol not supported" line comes from. A standard htslib build would claim:
Meanwhile, this looks like GRCh37 reference, so you can probably download a copy to use locally (eg one of https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/). Edit: sorry I see you already tried that. I'm not sure why it can't read it. Maybe you need to run |
Thanks for your quick reply! I installed samtools using Ubuntu's apt, I will try it again by building samtools from source. |
Discussing this over our coffee break, we realised the most likely cause is you have a If you unset |
The problem finding plug-ins looks like a Debian/Ubuntu packaging problem. It's surprising that no-one has reported it before. Trying on a fresh Ubuntu jammy install I got:
Using
It looks like the same problem may exist in the current Debian htslib package so I guess we should report the problem there. |
The build instructions in the Debian repo are here: https://salsa.debian.org/med-team/htslib/-/blob/master/debian/rules The They should just replace it with |
|
I build samtools 1.16.1 from source and run the cmd:
|
Firstly, the hang is probably simply a networking problem. Either local bandwidth is being throttled, making fetching the remote reference painfully slow, or the EBI are serving data up at an extremely slow rate. You'd probably need to use As for why it's even going offsite to fetch a reference, this is sadly a classic issue of bad data provenance by the data submitters.
Note the chromosomes are named "chr1", not "1". Hence this is not the 1000 genomes hs37d5.fa file, and that is why samtools is attempting to fetch your reference from a remote resource (via the md5sum), because it cannot find it in the specified file. This is perhaps a situation we should report as a warning rather than simply trying harder to solve the mistake. As for what the actual reference is, that's challenging. Seriously I wish the NCBI and ENA archives simply point blank refused to accept submissions that don't tie back to a known registered assembly file. It makes life so hard! A bit of data sleuthing shows this has 86 references. The length of MT is 16569, not 16571 (and MT not M), so it's not hgc19 (as that was one difference between that and grch37 as I recall). Armed with that, I produced an md5sum of all md5sums and compared that to a bunch of known local references and it does indeed match my hs37d5.fa. It's just that someone for some baffling reason thought it'd be a good idea to rename all the chromosomes! (https://www.edvardmunch.org/the-scream.jsp) So from here you have a couple of choices.
The |
I can confirm the glacial rate is due to the EBI's reference server. It was estimating 30minutes to download Human chr1. :-( I'd strongly recommend building your own local directory of pre-cached md5sums per sequence. Although it's more effort to maintain, it makes you immune to the vagarities of people who like to rename chromosomes for giggles, as the MD5sum becomes the access key instead, it affords deduplication when we get differing combinations of references with most of the major ones being the same (GRCh37 vs h37d5), and it's also faster as there's no fasta format parsing necessary. |
Oh, I see, the real problem is the submiter did not use the standard human ref genome to generate cram files! "It make life so hard"! Thanks a lot for your help! |
Eg "samtools view -T human.fa human.cram" but we have a "1" vs "chr1" mix up. It's not a hard error, as we may have a partial reference we wish to use and there are further fallback measures to continue. Similarly we cannot check this at header loading time, as possibly we have a reference file for a single chromosome, a CRAM aligned against that single chromosome, but all SQ lines present in the header. Improves the diagnostics for samtools#1515
Eg "samtools view -T human.fa human.cram" but we have a "1" vs "chr1" mix up. It's not a hard error, as we may have a partial reference we wish to use and there are further fallback measures to continue. Similarly we cannot check this at header loading time, as possibly we have a reference file for a single chromosome, a CRAM aligned against that single chromosome, but all SQ lines present in the header. Improves the diagnostics for #1515
hello,
I download some .cram files from ENA (https://www.ebi.ac.uk/ena/browser/view/PRJEB42969?show=reads), and the human ref genome file(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/ phase2_reference_assembly_sequence/hs37d5.fa.gz). I first want to convert .cram file to .bam file, and then .fq file. I got errors when I running the following cmd:
samtools view -b -T ~/hs37d5.fa -@ 8 -o RADS27LK_DNA.bam ../bulk_DNA_WES/RADS27LK_DNA.cram
and error info is
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:822981-1026969
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:12979-823012
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1026877-1269314
[E::cram_next_slice] Slice decode failure
samtools view: error reading file "../bulk_DNA_WES/RADS27LK_DNA.cram"
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1269215-1452587
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1452511-1653336
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1653238-1902042
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:2487960-3124516
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:3124629-3552717
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1901952-2235974
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:2235878-2488058
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:3552623-3750770
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:3750688-4254076
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:5236370-5748914
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:5748849-6133821
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:4253995-4803687
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6579376-6702774
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:4803613-5236461
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6133722-6311715
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6311616-6579475
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6702723-6961376
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:7822405-7902854
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:7440071-7822462
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6961374-7440105
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:7902755-8057250
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:8370582-8617569
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:8057335-8370672
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:8617470-8853078
samtools version info:
samtools 1.13
Using htslib 1.13+ds
Copyright (C) 2021 Genome Research Ltd.
Samtools compilation details:
Features: build=configure curses=yes
CC: gcc
CPPFLAGS: -frelease -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O2 -ffile-prefix-map=▒BUILDPATH▒=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security
LDFLAGS: -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now
HTSDIR:
LIBS:
CURSES_LIB: -lcurses
HTSlib compilation details:
Features: build=configure plugins=yes, plugin-path=/usr/local/lib/htslib:/usr/local/libexec/htslib:: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1
CC: gcc
CPPFLAGS: -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden
LDFLAGS: -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden
HTSlib URL scheme handlers present:
built-in: preload, data, file
crypt4gh-needed: crypt4gh
mem: mem
System info: Ubuntu 22.04 LTS
The text was updated successfully, but these errors were encountered: