Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gz output writes bgzf #2455

Closed
eseiler opened this issue Mar 17, 2021 · 7 comments · Fixed by #2458, #2495 or #2493
Closed

gz output writes bgzf #2455

eseiler opened this issue Mar 17, 2021 · 7 comments · Fixed by #2458, #2495 or #2493
Assignees
Labels
bug faulty or wrong behaviour of code

Comments

@eseiler
Copy link
Member

eseiler commented Mar 17, 2021

Description

When writing file to gzipped output, we will write bgzf instead of gz.
Output will have, e.g., 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 eb 58 which is a bgzf magic header, instead of the expected 1f 8b 08 magic header.

Even though both compressions are compatible, it's technically incorrect.
The decision happens here:

if ((extension == ".gz") || (extension == ".bgzf") || (extension == ".bam"))

We would need a different case for just ".gz".

This may lead to problems downstream, because the file is not what it seems.
Parsing such files again with seqan will lead to high CPU usage because of:

inline static uint64_t bgzf_thread_count = std::thread::hardware_concurrency();

(= use all threads when using bgzf)

Even if you know that we use all CPUs for bgzf, you would not expect this to happen for gzip.

@eseiler eseiler added the bug faulty or wrong behaviour of code label Mar 17, 2021
@eseiler
Copy link
Member Author

eseiler commented Mar 17, 2021

diff --git a/include/seqan3/io/detail/misc_output.hpp b/include/seqan3/io/detail/misc_output.hpp
index 568b0c59..c0332226 100644
--- a/include/seqan3/io/detail/misc_output.hpp
+++ b/include/seqan3/io/detail/misc_output.hpp
@@ -46,16 +46,24 @@ inline auto make_secondary_ostream(std::basic_ostream<char_t> & primary_stream,
 
     std::string extension = filename.extension().string();
 
-    if ((extension == ".gz") || (extension == ".bgzf") || (extension == ".bam"))
+    if (extension == ".gz")
+    {
+    #ifdef SEQAN3_HAS_ZLIB
+        filename.replace_extension("");
+        return {new contrib::basic_gz_ostream<char_t>{primary_stream}, stream_deleter_default};
+    #else
+        throw file_open_error{"Trying to write a gzipped file, but no ZLIB available."};
+    #endif
+    }
+    else if ((extension == ".bgzf") || (extension == ".bam"))
     {
     #ifdef SEQAN3_HAS_ZLIB
         if (extension != ".bam") // remove extension except for bam
             filename.replace_extension("");
 
-        return {new contrib::basic_bgzf_ostream<char_t>{primary_stream},
-                stream_deleter_default};
+        return {new contrib::basic_bgzf_ostream<char_t>{primary_stream}, stream_deleter_default};
     #else
-        throw file_open_error{"Trying to write a gzipped file, but no ZLIB available."};
+        throw file_open_error{"Trying to write a bgzf-compressed file, but no ZLIB available."};
     #endif
     }
     else if (extension == ".bz2")

would work.
(gzip is around 3-3.5 times slower than bgzf in my use case).

@eseiler
Copy link
Member Author

eseiler commented Mar 21, 2021

Resolution 19-03-2021

  • Apply the proposed patch, should have a more extensive changelog entry because it affects performance.
  • Add a cookbook entry on how to adjust the bgzf_thread_count.
  • Link to the cookbook entry in relevant documentation entities (we have a section about (de)compression).
  • Include seqan3/io/exception.hpp in seqan3/io/detail/misc_{in,out}put.hpp.

In general, we do not like that there is a seqan3::contrib::bgzf_thread_count that needs to be overwritten to adjust the threads.
A better place to do this would be the file itself, especially the options of the file.
Alas, the stream is already constructed before the options can be set.
We also do not want to add/extend the constructor to take an argument for the number of threads to use.
It could also be a property of the format, but this would face similar problems.

@eseiler eseiler self-assigned this Mar 21, 2021
@simonsasse simonsasse linked a pull request Mar 22, 2021 that will close this issue
@eseiler
Copy link
Member Author

eseiler commented Mar 24, 2021

Still needs documentation

@h-2
Copy link
Member

h-2 commented Apr 14, 2021

This was actually intentional. BGZF is standard-conforming Gzip. There is no disadvantage except maybe 1% file size difference. Using multiple threads for compression is an advantage. It should just be configurable through the file interface.

Furthermore, file endings like .sam.gz or .vcf.gz have to be BGZF. Writing those files as plain Gzip will cause errors in downstream applications.

@marehr
Copy link
Member

marehr commented Apr 14, 2021

This was actually intentional. BGZF is standard-conforming Gzip. There is no disadvantage except maybe 1% file size difference. Using multiple threads for compression is an advantage. It should just be configurable through the file interface.

That's not true, I have seen instances where it had significant overheads.

Furthermore, file endings like .sam.gz or .vcf.gz have to be BGZF. Writing those files as plain Gzip will cause errors in downstream applications.

Interesting, there was no documentation that pointed to that fact. Can you give an article that specifies that .sam.gz expects BGZF? Which application does this?

@h-2
Copy link
Member

h-2 commented Apr 16, 2021

That's not true, I have seen instances where it had significant overheads.

Do you mean overhead in file size? That surprising to me, but it may be.
Just to make sure: did you verify this with the official bgzip or with our implementation? Because our implementation does not compress the blocks at regular compression level (Z_DEFAULT_COMPRESSION) and instead uses the "fast" profile (Z_BEST_SPEED). See here:
https://github.com/seqan/seqan3/blob/master/include/seqan3/contrib/stream/bgzf_stream_util.hpp#L116
This means that our bgzf files are typically larger than our plain gz files, but I am not sure about the reference implementations.
FWIW I don't think that this was a good design choice and would prefer if it was simply configurable via regular io configuration.

Interesting, there was no documentation that pointed to that fact.

I didn't know either, but it is a requirement for all indexes to work (fasta/fastq index, sam index, vcf index), so all machines output these formats by default and all tools expect them. It is not required by SAM spec, but if you read through the man pages of e.g. samtools or bcftools, you will see that whenever compression is mentioned, it is always BGZF.

This is the from the bcftools manual:

Most commands accept VCF, bgzipped VCF and BCF with filetype detected automatically even when streaming from a pipe.

Note that whenever these tools mention "bgzipped" formats, they also mean a plain .gz extension. I have never encountered .bgzf in the wild, I think it was an invention of David. Some projects have decided to use .bgz (which we should also support), but even the official bgzip application currently requires a .gz extension. [1]

This is considered a leftover, but it shows how 99% of users expect all bgzf files to end in ".gz". And I think that the reverse is also true, i.e. users almost always want a bzgf-compressed fasta, sam, vcf when they say ".gz". If we find out that there is a size discrepancy also in the official formats, I agree that we should have a way of creating plain-old Gzip, but I think that the default behaviour for sequence and alignment files ending in ".gz" should definitely be the BGZF format.

[1] samtools/htslib#129

@h-2
Copy link
Member

h-2 commented Apr 16, 2021

Do you mean overhead in file size? That surprising to me, but it may be.

I just double-checked bgzip vs gzip (without args) on some files and I do see differences of up to 10%. So I agree that it should be possible to get plain gzip somehow. I still stand with my arguments above that for most of our formatted files BGZIP should be the default gz-implementation, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug faulty or wrong behaviour of code
Projects
None yet
3 participants