Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trouble converting .cram to .fq #1515

Closed
0106WeiWeiDeng opened this issue Oct 19, 2022 · 10 comments
Closed

Trouble converting .cram to .fq #1515

0106WeiWeiDeng opened this issue Oct 19, 2022 · 10 comments

Comments

@0106WeiWeiDeng
Copy link

hello,

I download some .cram files from ENA (https://www.ebi.ac.uk/ena/browser/view/PRJEB42969?show=reads), and the human ref genome file(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/ phase2_reference_assembly_sequence/hs37d5.fa.gz). I first want to convert .cram file to .bam file, and then .fq file. I got errors when I running the following cmd:

samtools view -b -T ~/hs37d5.fa -@ 8 -o RADS27LK_DNA.bam ../bulk_DNA_WES/RADS27LK_DNA.cram

and error info is

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:822981-1026969

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:12979-823012

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1026877-1269314

[E::cram_next_slice] Slice decode failure
samtools view: error reading file "../bulk_DNA_WES/RADS27LK_DNA.cram"
[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1269215-1452587

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1452511-1653336

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1653238-1902042

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:2487960-3124516

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:3124629-3552717

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:1901952-2235974

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:2235878-2488058

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:3552623-3750770

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:3750688-4254076

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:5236370-5748914

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:5748849-6133821

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:4253995-4803687

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6579376-6702774

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:4803613-5236461

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6133722-6311715

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6311616-6579475

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6702723-6961376

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:7822405-7902854

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:7440071-7822462

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:6961374-7440105

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:7902755-8057250

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:8370582-8617569

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:8057335-8370672

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported
[E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa
[E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa'
[E::cram_get_ref] Failed to populate reference for id 0
[E::cram_decode_slice] Unable to fetch reference #0:8617470-8853078

samtools version info:

samtools 1.13
Using htslib 1.13+ds
Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details:
Features: build=configure curses=yes
CC: gcc
CPPFLAGS: -frelease -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O2 -ffile-prefix-map=▒BUILDPATH▒=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security
LDFLAGS: -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now
HTSDIR:
LIBS:
CURSES_LIB: -lcurses

HTSlib compilation details:
Features: build=configure plugins=yes, plugin-path=/usr/local/lib/htslib:/usr/local/libexec/htslib:: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1
CC: gcc
CPPFLAGS: -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2
CFLAGS: -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden
LDFLAGS: -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden

HTSlib URL scheme handlers present:
built-in: preload, data, file
crypt4gh-needed: crypt4gh
mem: mem

System info: Ubuntu 22.04 LTS

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 19, 2022

The cause of your error appears to be a build issue. I don't recognise the -ffile-prefix-map options, so I'm guessing this is some binary install from a distribution somewhere rather than a manual user driven build. It looks like htslib has been compiled without finding a working copy of curl, so it has no access to fetching references via https. That'll be where the "Protocol not supported" line comes from.

A standard htslib build would claim:

HTSlib URL scheme handlers present:
    built-in:	 preload, data, file
    S3 Multipart Upload:	 s3w, s3w+https, s3w+http
    Amazon S3:	 s3+https, s3+http, s3
    Google Cloud Storage:	 gs+http, gs+https, gs
    libcurl:	 imaps, pop3, http, smb, gopher, ftps, imap, smtp, smtps, rtsp, ftp, telnet, rtmp, ldap, https, ldaps, tftp, pop3s, smbs, dict
    crypt4gh-needed:	 crypt4gh
    mem:	 mem

Meanwhile, this looks like GRCh37 reference, so you can probably download a copy to use locally (eg one of https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/). Edit: sorry I see you already tried that. I'm not sure why it can't read it. Maybe you need to run samtools faidx on it first, although I thought htslib did that automatically.

@0106WeiWeiDeng
Copy link
Author

Thanks for your quick reply! I installed samtools using Ubuntu's apt, I will try it again by building samtools from source.

@jkbonfield
Copy link
Contributor

Discussing this over our coffee break, we realised the most likely cause is you have a HTS_PATH environment variable pointing to /usr/local/lib/htslib:/usr/local/libexec/htslib::. This ought to permit it to also search the default system directories, but it doesn't appear to be doing so.

If you unset HTS_PATH and run samtools version again does it then start listing the curl (etc) plugins?

@daviesrob
Copy link
Member

The problem finding plug-ins looks like a Debian/Ubuntu packaging problem. It's surprising that no-one has reported it before.

Trying on a fresh Ubuntu jammy install I got:

ubuntu@jammy-test:~$ samtools version
samtools 1.13
Using htslib 1.13+ds
Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       -frelease  -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=?BUILDPATH?=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now
    HTSDIR:         
    LIBS:           
    CURSES_LIB:     -lcurses

HTSlib compilation details:
    Features:       build=configure plugins=yes, plugin-path=/usr/local/lib/htslib:/usr/local/libexec/htslib:: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1
    CC:             gcc
    CPPFLAGS:       -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden

HTSlib URL scheme handlers present:
    built-in:	 preload, data, file
    crypt4gh-needed:	 crypt4gh
    mem:	 mem

Using HTS_PATH=/usr/lib/x86_64-linux-gnu/htslib to show where the plug-ins are fixes it:

ubuntu@jammy-test:~$ HTS_PATH=/usr/lib/x86_64-linux-gnu/htslib samtools version
samtools 1.13
Using htslib 1.13+ds
Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       -frelease  -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=?BUILDPATH?=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now
    HTSDIR:         
    LIBS:           
    CURSES_LIB:     -lcurses

HTSlib compilation details:
    Features:       build=configure plugins=yes, plugin-path=/usr/lib/x86_64-linux-gnu/htslib: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1
    CC:             gcc
    CPPFLAGS:       -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden

HTSlib URL scheme handlers present:
    built-in:	 preload, data, file
    S3 Multipart Upload:	 s3w, s3w+https, s3w+http
    Google Cloud Storage:	 gs+http, gs+https, gs
    Amazon S3:	 s3+https, s3+http, s3
    libcurl:	 imaps, pop3, gophers, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, mqtt, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
    crypt4gh-needed:	 crypt4gh
    mem:	 mem

It looks like the same problem may exist in the current Debian htslib package so I guess we should report the problem there.

@jkbonfield
Copy link
Contributor

The build instructions in the Debian repo are here: https://salsa.debian.org/med-team/htslib/-/blob/master/debian/rules

The dh_auto_configure -- --enable-libcurl --enable-gcs --enable-s3 --enable-plugins --with-plugin-dir='$(libdir)/htslib' --with-external-htscodecs --with-plugin-path='/usr/local/lib/htslib:/usr/local/libexec/htslib:$(plugindir)' line indicates that likely $(plugindir) is unset.

They should just replace it with $(libdir)/htslib.

@daviesrob
Copy link
Member

$(plugindir) is a Makefile variable that should have been replaced with the value given for --with-plugin-dir. It works if you do ./configure --with-plugin-path='/usr/local/lib/htslib:/usr/local/libexec/htslib:$(plugindir)', but Debian aren't doing that directly. I suspect a loss of quoting between dh_auto_configure and configure.

@0106WeiWeiDeng
Copy link
Author

I build samtools 1.16.1 from source and run the cmd: samtools view -b -T ~/hs37d5.fa -o RADS27LK_DNA.bam ../bulk_DNA_WES/RADS27LK_DNA.cram, now it did not print the errors, but it stays at 'runing' for a long time(12+ hours now ). I use htop to view the state of samtools cmd, and found that it stays in s. It seems like samtools get stuck at some point. The samtools cmd output info is :

[W::cram_populate_ref] Creating reference cache directory /home/deng/.cache/hts-ref
This may become large; see the samtools(1) manual page REF_CACHE discussion

@jkbonfield
Copy link
Contributor

Firstly, the hang is probably simply a networking problem. Either local bandwidth is being throttled, making fetching the remote reference painfully slow, or the EBI are serving data up at an extremely slow rate. You'd probably need to use strace to really diagnose where it's spending the time.

As for why it's even going offsite to fetch a reference, this is sadly a classic issue of bad data provenance by the data submitters.

$ samtools view -H ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR528/ERR5285421/RADS27LK_DNA.cram | grep '@SQ' | head -1
@SQ	SN:chr1	LN:249250621	M5:1b22b98cdeb4a9304cb5d48026a85128	UR:/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa

Note the chromosomes are named "chr1", not "1". Hence this is not the 1000 genomes hs37d5.fa file, and that is why samtools is attempting to fetch your reference from a remote resource (via the md5sum), because it cannot find it in the specified file. This is perhaps a situation we should report as a warning rather than simply trying harder to solve the mistake.

As for what the actual reference is, that's challenging. Seriously I wish the NCBI and ENA archives simply point blank refused to accept submissions that don't tie back to a known registered assembly file. It makes life so hard!

A bit of data sleuthing shows this has 86 references. The length of MT is 16569, not 16571 (and MT not M), so it's not hgc19 (as that was one difference between that and grch37 as I recall). Armed with that, I produced an md5sum of all md5sums and compared that to a bunch of known local references and it does indeed match my hs37d5.fa. It's just that someone for some baffling reason thought it'd be a good idea to rename all the chromosomes! (https://www.edvardmunch.org/the-scream.jsp)

So from here you have a couple of choices.

  1. Produce a new local copy of hs37d5.fa with "chr" added to all refs. This is probably the easiest strategy.
sed 's/^>/>chr/' hs37d5.fa > NCBI37_DECOY.fa
  1. Produce an MD5sum directory structure so CRAM can use that for direct mmaps of references and avoid parsing the fasta files (untested, but something like this):
$ samtools/misc/seq_cache_populate.pl -root ~/.cache/hts-ref hs37d5.fa
$ export REF_CACHE=~/.cache/hts-ref/%2s/%2s/%s

The REF_CACHE setting here should be the default one so isn't needed if you use that location, but if you take this approach you may wish to change directory to somewhere more appropriate. I chose that as it's simply what htslib is trying to automatically locally populate for you by downloading the references from the EBI.

@jkbonfield
Copy link
Contributor

I can confirm the glacial rate is due to the EBI's reference server. It was estimating 30minutes to download Human chr1. :-(

I'd strongly recommend building your own local directory of pre-cached md5sums per sequence. Although it's more effort to maintain, it makes you immune to the vagarities of people who like to rename chromosomes for giggles, as the MD5sum becomes the access key instead, it affords deduplication when we get differing combinations of references with most of the major ones being the same (GRCh37 vs h37d5), and it's also faster as there's no fasta format parsing necessary.

@0106WeiWeiDeng
Copy link
Author

Oh, I see, the real problem is the submiter did not use the standard human ref genome to generate cram files! "It make life so hard"! Thanks a lot for your help!

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Oct 20, 2022
Eg "samtools view -T human.fa human.cram" but we have a "1" vs "chr1"
mix up.

It's not a hard error, as we may have a partial reference we wish to
use and there are further fallback measures to continue.  Similarly we
cannot check this at header loading time, as possibly we have a
reference file for a single chromosome, a CRAM aligned against that
single chromosome, but all SQ lines present in the header.

Improves the diagnostics for samtools#1515
whitwham pushed a commit that referenced this issue Oct 31, 2022
Eg "samtools view -T human.fa human.cram" but we have a "1" vs "chr1"
mix up.

It's not a hard error, as we may have a partial reference we wish to
use and there are further fallback measures to continue.  Similarly we
cannot check this at header loading time, as possibly we have a
reference file for a single chromosome, a CRAM aligned against that
single chromosome, but all SQ lines present in the header.

Improves the diagnostics for #1515
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants