Reading gzip files generates a CRC check failed error (version >= 0.7.0) #60

Closed
fjossandon opened this issue Mar 25, 2021 · 4 comments · Fixed by #61

@fjossandon

Hello @rhpvorderman,
Yesterday, the program that other bioinformaticians and I were using (cutadapt) crashed unexpectedly when opening some gzipped files, which had never happened before: marcelm/cutadapt#520

fossandon@ubuntu:~/Documents/download$ cutadapt -a 'AACTTTYARCAAYGGATCTC;max_error_rate=0.1;min_overlap=20' -A 'TGATCCYTCCGCAGGT;max_error_rate=0.5;min_overlap=16' --pair-adapters --pair-filter any --cores 2 --output 94477_R1.fastq --paired-output 94477_R2.fastq 94477_S175_L001_R1_001.fastq.gz 94477_S175_L001_R2_001.fastq.gz
This is cutadapt 3.3 with Python 3.6.9
Command line parameters: -a AACTTTYARCAAYGGATCTC;max_error_rate=0.1;min_overlap=20 -A TGATCCYTCCGCAGGT;max_error_rate=0.5;min_overlap=16 --pair-adapters --pair-filter any --cores 2 --output 94477_R1.fastq --paired-output 94477_R2.fastq 94477_S175_L001_R1_001.fastq.gz 94477_S175_L001_R2_001.fastq.gz
Processing reads on 2 cores in paired-end mode ...
[ 8<---------] 00:00:03        88,831 reads  @     26.0 µs/read;   2.31 M reads/minuteERROR: Traceback (most recent call last):
  File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 556, in run
    dnaio.read_paired_chunks(f, f2, self.buffer_size)):
  File "/home/fossandon/.local/lib/python3.6/site-packages/dnaio/chunks.py", line 118, in read_paired_chunks
    bufend1 = f.readinto(memoryview(buf1)[start1:]) + start1  # type: ignore
  File "/usr/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.6/gzip.py", line 454, in read
    self._read_eof()
  File "/usr/lib/python3.6/gzip.py", line 501, in _read_eof
    hex(self._crc)))
OSError: CRC check failed 0x88b1f != 0x6fe5d9e4

ERROR: Traceback (most recent call last):
  File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 626, in run
    raise e
OSError: CRC check failed 0x88b1f != 0x6fe5d9e4

Traceback (most recent call last):
  File "/home/fossandon/.local/bin/cutadapt", line 8, in <module>
    sys.exit(main_cli())
  File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/__main__.py", line 848, in main_cli
    main(sys.argv[1:])
  File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/__main__.py", line 913, in main
    stats = r.run()
  File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 825, in run
    raise e
OSError: CRC check failed 0x88b1f != 0x6fe5d9e4

However, zcat and "gzip -t" report no errors on these files, and "gzip -d" decompresses them fine. Running the same cutadapt command with the same cutadapt version in different environments (Python 3.6 and 3.8 were both tested) crashed in some environments and not in others. After a long search and tests with a colleague, we figured out that the key difference between crashing and not crashing was the installed version of the isal dependency (a newly created Docker image picks up the latest version). Versions 0.8.0 and 0.7.0 produce the CRC error, but 0.6.1 and 0.5.0 do not, so the bug seems to have been introduced in 0.7.0. Keeping the other dependencies the same and reverting isal to 0.6.1 makes it work:

299	3047	0.0	8	2963 57 11 6 5 2 3
300	8	0.0	8	0 0 0 0 0 3 4 0 1
301	15028	0.0	8	0 14646 270 64 24 15 8 0 1


WARNING:
    One or more of your adapter sequences may be incomplete.
    Please see the detailed output above.
fossandon@ubuntu:~/Documents/temp$ pip3 list | egrep "cutadapt|dnaio|isal|xopen"
cutadapt              3.3
dnaio                 0.5.0               /home/fossandon/.local/lib/python3.6/site-packages
isal                  0.6.1
xopen                 1.1.0

In my case, I was processing a folder in which all the gzipped files came from the same source and were created at the same time, yet some of them crashed consistently while the others did not. To give you a test case, I uploaded the file pair used in the cutadapt example above so that you can reproduce the error yourself. I couldn't find smaller files that reproduce it.
https://drive.google.com/drive/folders/1eTmLbd9WINctLb48pzn57_Ohp1amwZah?usp=sharing

Best regards,

@rhpvorderman
Collaborator

@fjossandon
Thanks for reporting this bug.

I was a bit surprised at first. From 0.6.1 to 0.7.0 I removed a lot of the custom code that was needed to make igzip.py work. I had solved some incompatibilities in isal_zlib, so that isal_zlib and zlib supported the same calls to Decompressobj in the same way. This allowed a massive reduction of code: the read methods of GzipFile in igzip are now basically the same as those in CPython's gzip.py.

Thanks to the reproducing files you provided, I was able to find the error. It was an assumption in gzip.py about the buffer that it uses. I will upload a bugfix release today.

@rhpvorderman
Collaborator

I also found that the error is triggered by every multi-member gzip file. I created a very small reproducer (92K) and will add it to my test suite. Thanks for providing the test files!
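
For readers who want to try this themselves, here is a minimal sketch of how a multi-member gzip file can be built and read back through readinto(), the call that appears in the traceback above. It is not the 92K reproducer mentioned in this comment. It uses only CPython's standard gzip module, and the helper names and the `opener` parameter are hypothetical, chosen for illustration so that another GzipFile-compatible class could be swapped in.

```python
import gzip
import io


def make_multi_member_gzip(chunks):
    """Compress each chunk as its own gzip member and concatenate the members."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def read_with_readinto(data, bufsize=4096, opener=gzip.GzipFile):
    """Decompress `data` by repeatedly calling readinto() on a fixed buffer.

    `opener` is a hypothetical hook: any class that accepts the same
    fileobj/mode arguments as gzip.GzipFile could be passed instead.
    """
    out = bytearray()
    buf = bytearray(bufsize)
    with opener(fileobj=io.BytesIO(data), mode="rb") as f:
        while True:
            n = f.readinto(buf)
            if n == 0:  # end of the last member
                break
            out += buf[:n]
    return bytes(out)


# Two FASTQ-like chunks, each stored as a separate gzip member.
chunks = [
    b"@read1\nACGT\n+\nIIII\n" * 1000,
    b"@read2\nTTGA\n+\nIIII\n" * 1000,
]
payload = make_multi_member_gzip(chunks)
restored = read_with_readinto(payload)

# A correct reader returns the concatenation of all members.
assert restored == b"".join(chunks)
```

A reader that handles member boundaries correctly returns every member's data in order; a CRC error or truncated output at this point would indicate the kind of problem described in this issue.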

@fjossandon
Author

fjossandon commented Mar 29, 2021

Excellent work!!! I just tested 0.8.1 and the error is gone.

299	3047	0.0	8	2963 57 11 6 5 2 3
300	8	0.0	8	0 0 0 0 0 3 4 0 1
301	15028	0.0	8	0 14646 270 64 24 15 8 0 1


WARNING:
    One or more of your adapter sequences may be incomplete.
    Please see the detailed output above.
fossandon@ubuntu:~/Documents/temp$ pip3 list | egrep "cutadapt|dnaio|isal|xopen"
cutadapt              3.3
dnaio                 0.5.0               /home/fossandon/.local/lib/python3.6/site-packages
isal                  0.8.1
xopen                 1.1.0

I will take down the shared files now. I'm glad they helped you with this. Thanks for the fix!
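
The version findings in this thread can be summarised as a small check. This is only a sketch: the thread reports 0.5.0 and 0.6.1 as working, 0.7.0 and 0.8.0 as failing, and 0.8.1 as fixed, and the code assumes every release from 0.7.0 up to but not including 0.8.1 is affected. The function names are hypothetical, and the parser assumes plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '0.8.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Boundaries taken from the versions tested in this thread.
FIRST_AFFECTED = (0, 7, 0)
FIRST_FIXED = (0, 8, 1)


def is_affected(version_text):
    """Return True if the isal version falls in the assumed affected range."""
    return FIRST_AFFECTED <= parse_version(version_text) < FIRST_FIXED


for version in ("0.5.0", "0.6.1", "0.7.0", "0.8.0", "0.8.1"):
    print(version, "affected" if is_affected(version) else "ok")
```

The version string to test can be copied from the `pip3 list` output shown above.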

@rhpvorderman
Collaborator

Many thanks again for writing such an extensive bug report, including reproducing files. You really made it a lot easier for me to fix. Your help is very much appreciated!
