samtools view command error #1622

Open · skatragadda-nygc opened this issue Mar 24, 2022 · 12 comments

@skatragadda-nygc commented Mar 24, 2022

Are you using the latest version of samtools and HTSlib? If not, please specify.

(run samtools --version)
samtools 1.15
Using htslib 1.15

Please describe your environment.

  • OS (uname -sr): Linux 5.10.0-12-cloud-amd64
  • compiler (gcc --version): gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

samtools view -C -T https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -t https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai -o somatic-test.cram ourfile.bam --verbosity=8

Here is the error message:

[I::cram_next_container] Flush container 1/10059..-1
* Connection #0 to host storage.googleapis.com left intact
[E::easy_errno] Libcurl reported error 16 (Error in the HTTP2 framing layer)
[E::bgzf_read_block] Failed to read uncompressed data at offset 258539632: Input/output error
[E::bgzf_read] Read block operation failed with error 4 after 7093421 of 244615464 bytes
bgzf_read() on reference file: Input/output error
[E::cram_encode_container] Failed to load reference #1
samtools view: writing to "somatic-test.cram" failed: Input/output error
[I::cram_encode_compression_header] Wrote compression block header in 6 bytes
[E::easy_errno] Libcurl reported error 16 (Error in the HTTP2 framing layer)
@whitwham (Contributor)

It looks like a network error. Is your error reproducible?

Does your command work if you download the reference and index and run it locally?

I ran your command on some of our human data and it worked normally, but since I don't know what is in your ourfile.bam, I am not sure it is a good test.

@skatragadda-nygc (Author) commented Mar 24, 2022

Initially it works for some time; after 5 or 10 minutes it stops with the above error.

We are currently running inside a Docker container on a GCP VM. Testing the command with local files works fine.
@whitwham, could you try to reproduce it on your end inside a container? Unfortunately, I cannot share the BAM file.

@whitwham (Contributor)

> Testing the command with local files works fine. @whitwham could you try to reproduce it on your end inside a container?

Not until tomorrow. My current working machine does not allow containers. I would have to set something up.

@skatragadda-nygc (Author) commented Mar 24, 2022

Sounds good. Could you also verify whether gs:// URLs work, as well as local files and HTTPS URLs?
Somatic RNA conversion works fine with HTTPS URLs, but somatic DNA conversion fails with the above errors.
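
For reference, the gs:// form would presumably look like this (bucket path inferred from the HTTPS URL above; this assumes htslib was built with GCS support):

samtools view -C -T gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -t gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai -o somatic-test.cram ourfile.bam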

@whitwham (Contributor)

@skatragadda-nygc I have run it in a docker container using a 1000 Genomes file and it worked for me. The running time was about 30 minutes on both coordinate and name sorted data. HTTPS and GS URLs both worked.

It may be something to do with the GCP VM, but I do not have enough in-depth knowledge to help there.

@xk42 commented Mar 25, 2022

@whitwham I'm @skatragadda-nygc's colleague. We did more debugging, and it turns out our end user has been running samtools with the default single thread. This always triggers the libcurl error at exactly this position:
[I::cram_next_container] Flush container 1/10059..-1

If we increase the thread count to 2 or more, the run completes successfully. I'm wondering if you're running samtools with more threads?
We also tested this without Docker, on a few different non-GCP OSes: Ubuntu 22.04 and 18.04, CentOS 7.9, and even on a Mac.
Originally we thought it might be due to different versions of libcurl4, so we tried newer and older versions; all of them trigger the error when running samtools with 1 thread, and as soon as we increase the thread count the error goes away. Can you confirm whether you're also testing with 1 thread?
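
Concretely, the run that completes is just the original command with samtools' -@ (additional threads) option added, along these lines:

samtools view -@ 2 -C -T https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -t https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai -o somatic-test.cram ourfile.bam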

@whitwham (Contributor)

@xk42 I'm running with only one thread. Pretty much the same as in the original command given by @skatragadda-nygc but with my own input file and no verbosity option.

I took a generic Ubuntu image off Docker and it is using an older Linux kernel. I have a VM with a newer kernel ready, and I will see what running on that does.

@xk42 commented Mar 25, 2022

We tested on two different BAM files, and it does look like file size might be the trigger. One BAM is 25G and we have no issues with it at 1 or more threads; the other is 150G, and that is the one that triggers the error when run with 1 thread.

@whitwham (Contributor)

Okay, the test with the newer kernel worked, though only on a 2.5G file.

I'm running a test on a 340G file, the biggest I have easily available; it may take some time.

@whitwham (Contributor)

Well this is interesting.

time samtools view -C -T https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -t https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai -o somatic-test.cram hg_align.sorted.bam
[E::bgzf_read_block] Failed to read uncompressed data at offset 281411696: Broken pipe
[E::bgzf_read] Read block operation failed with error 4 after 29965485 of 244615464 bytes
bgzf_read() on reference file: Broken pipe
[E::cram_encode_container] Failed to load reference #1
samtools view: writing to "somatic-test.cram" failed: Broken pipe
real 14m25.176s
user 12m16.194s
sys 0m20.698s

This was with the 340G bam file. It failed after 14 minutes, which is less time than the successful runs took. This, as you say, implies size is a factor. It did manage to write out a 2.7G cram file.

I'll need to discuss this with my colleagues after the weekend.

@daviesrob (Member)

I wonder if it's timing-related. Possibly Google is timing out the https connection between samtools reading the first and second chromosome references?

@jkbonfield (Contributor)

I'm not at all familiar with GCS, but it occurs to me that this may be an issue of keeping a file handle open for too long without interim usage. There is a significant difference in how CRAM works when threaded and unthreaded, which hints at this.

When unthreaded, it opens the appropriate reference file and then periodically reads from it (bgzf_read calls as successive CRAM containers march their way along that chromosome). I'm unsure how the cloud I/O is done in htslib, but is it possible that bgzf_open obtains some session cookie (or equivalent concept) which is then used for all subsequent reads? If too much time passes between the open and a read, it could fail.

Now this is where threading dramatically changes things. Threaded CRAM encoding doesn't load the reference in bits as it goes, because it's hard to work out when a reference section has gone out of scope and can be discarded. Instead it loads an entire chromosome at a time and holds it in memory. That puts the open/read calls (assuming separate chromosomes are separate connections) close together in time and avoids time-out issues.

Obviously using 2 threads works around this (maybe even explicitly asking for -@1 does too, but I'm unsure of that). If you wish to stick to 1 thread only, using wget or similar to cache the reference file locally first would also solve the problem, as sketched below.
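
A minimal sketch of that local-cache workaround, using the reference URLs from the original report (filenames are simply what wget saves by default):

wget https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta.fai
samtools view -C -T Homo_sapiens_assembly38.fasta -t Homo_sapiens_assembly38.fasta.fai -o somatic-test.cram ourfile.bam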

I'm unsure how we would solve this properly in htslib, except perhaps with some error catching: if a bgzf_read fails on a network connection, attempt a reopen and seek before giving up.
