gz output writes bgzf #2455

eseiler · 2021-03-17T17:11:07Z

Description

When writing file to gzipped output, we will write bgzf instead of gz.
Output will have, e.g., 1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 eb 58 which is a bgzf magic header, instead of the expected 1f 8b 08 magic header.

Even though both compressions are compatible, it's technically incorrect.
The decision happens here:

seqan3/include/seqan3/io/detail/misc_output.hpp

Line 49 in e0a1a95

if ((extension == ".gz") || (extension == ".bgzf") || (extension == ".bam"))

We would need a different case for just ".gz".

This may lead to problems downstream, because the file is not what it seems.
Parsing such files again with seqan will lead to high CPU usage because of:

seqan3/include/seqan3/contrib/stream/bgzf_stream_util.hpp

Line 40 in f521b69

inline static uint64_t bgzf_thread_count = std::thread::hardware_concurrency();

(= use all threads when using bgzf)

Even if you know that we use all CPUs for bgzf, you would not expect this to happen for gzip.

The text was updated successfully, but these errors were encountered:

eseiler · 2021-03-17T17:44:33Z

diff --git a/include/seqan3/io/detail/misc_output.hpp b/include/seqan3/io/detail/misc_output.hpp
index 568b0c59..c0332226 100644
--- a/include/seqan3/io/detail/misc_output.hpp
+++ b/include/seqan3/io/detail/misc_output.hpp
@@ -46,16 +46,24 @@ inline auto make_secondary_ostream(std::basic_ostream<char_t> & primary_stream,
 
     std::string extension = filename.extension().string();
 
-    if ((extension == ".gz") || (extension == ".bgzf") || (extension == ".bam"))
+    if (extension == ".gz")
+    {
+    #ifdef SEQAN3_HAS_ZLIB
+        filename.replace_extension("");
+        return {new contrib::basic_gz_ostream<char_t>{primary_stream}, stream_deleter_default};
+    #else
+        throw file_open_error{"Trying to write a gzipped file, but no ZLIB available."};
+    #endif
+    }
+    else if ((extension == ".bgzf") || (extension == ".bam"))
     {
     #ifdef SEQAN3_HAS_ZLIB
         if (extension != ".bam") // remove extension except for bam
             filename.replace_extension("");
 
-        return {new contrib::basic_bgzf_ostream<char_t>{primary_stream},
-                stream_deleter_default};
+        return {new contrib::basic_bgzf_ostream<char_t>{primary_stream}, stream_deleter_default};
     #else
-        throw file_open_error{"Trying to write a gzipped file, but no ZLIB available."};
+        throw file_open_error{"Trying to write a bgzf-compressed file, but no ZLIB available."};
     #endif
     }
     else if (extension == ".bz2")

would work.
(gzip is around 3-3.5 times slower than bgzf in my use case).

eseiler · 2021-03-21T12:53:02Z

Resolution 19-03-2021

Apply the proposed patch, should have a more extensive changelog entry because it affects performance.
Add a cookbook entry on how to adjust the bgzf_thread_count.
Link to the cookbook entry in relevant documentation entities (we have a section about (de)compression).
Include seqan3/io/exception.hpp in seqan3/io/detail/misc_{in,out}put.hpp.

In general, we do not like that there is a seqan3::contrib::bgzf_thread_count that needs to be overwritten to adjust the threads.
A better place to do this would be the file itself, especially the options of the file.
Alas, the stream is already constructed before the options can be set.
We also do not want to add/extend the constructor to take an argument for the number of threads to use.
It could also be a property of the format, but this would face similar problems.

eseiler · 2021-03-24T09:21:53Z

Still needs documentation

h-2 · 2021-04-14T17:18:59Z

This was actually intentional. BGZF is standard-conforming Gzip. There is no disadvantage except maybe 1% file size difference. Using multiple threads for compression is an advantage. It should just be configurable through the file interface.

Furthermore, file endings like .sam.gz or .vcf.gz have to be BGZF. Writing those files as plain Gzip will cause errors in downstream applications.

marehr · 2021-04-14T17:29:13Z

This was actually intentional. BGZF is standard-conforming Gzip. There is no disadvantage except maybe 1% file size difference. Using multiple threads for compression is an advantage. It should just be configurable through the file interface.

That's not true, I have seen instances where it had significant overheads.

Furthermore, file endings like .sam.gz or .vcf.gz have to be BGZF. Writing those files as plain Gzip will cause errors in downstream applications.

Interesting, there was no documentation that pointed to that fact. Can you give an article that specifies that .sam.gz expects BGZF? Which application does this?

h-2 · 2021-04-16T13:01:10Z

That's not true, I have seen instances where it had significant overheads.

Do you mean overhead in file size? That surprising to me, but it may be.
Just to make sure: did you verify this with the official bgzip or with our implementation? Because our implementation does not compress the blocks at regular compression level (Z_DEFAULT_COMPRESSION) and instead uses the "fast" profile (Z_BEST_SPEED). See here:
https://github.com/seqan/seqan3/blob/master/include/seqan3/contrib/stream/bgzf_stream_util.hpp#L116
This means that our bgzf files are typically larger than our plain gz files, but I am not sure about the reference implementations.
FWIW I don't think that this was a good design choice and would prefer if it was simply configurable via regular io configuration.

Interesting, there was no documentation that pointed to that fact.

I didn't know either, but it is a requirement for all indexes to work (fasta/fastq index, sam index, vcf index), so all machines output these formats by default and all tools expect them. It is not required by SAM spec, but if you read through the man pages of e.g. samtools or bcftools, you will see that whenever compression is mentioned, it is always BGZF.

This is the from the bcftools manual:

Most commands accept VCF, bgzipped VCF and BCF with filetype detected automatically even when streaming from a pipe.

Note that whenever these tools mention "bgzipped" formats, they also mean a plain .gz extension. I have never encountered .bgzf in the wild, I think it was an invention of David. Some projects have decided to use .bgz (which we should also support), but even the official bgzip application currently requires a .gz extension. [1]

This is considered a leftover, but it shows how 99% of users expect all bgzf files to end in ".gz". And I think that the reverse is also true, i.e. users almost always want a bzgf-compressed fasta, sam, vcf when they say ".gz". If we find out that there is a size discrepancy also in the official formats, I agree that we should have a way of creating plain-old Gzip, but I think that the default behaviour for sequence and alignment files ending in ".gz" should definitely be the BGZF format.

[1] samtools/htslib#129

h-2 · 2021-04-16T14:03:25Z

Do you mean overhead in file size? That surprising to me, but it may be.

I just double-checked bgzip vs gzip (without args) on some files and I do see differences of up to 10%. So I agree that it should be possible to get plain gzip somehow. I still stand with my arguments above that for most of our formatted files BGZIP should be the default gz-implementation, though.

eseiler added the bug faulty or wrong behaviour of code label Mar 17, 2021

eseiler self-assigned this Mar 21, 2021

simonsasse linked a pull request Mar 22, 2021 that will close this issue

[FIX] compression streams writing wrong format #2458

Merged

marehr closed this as completed in #2458 Mar 24, 2021

eseiler reopened this Mar 24, 2021

eseiler mentioned this issue Mar 25, 2021

[DOC] Add changelog entry for #2458 #2495

Merged

This was linked to pull requests Mar 25, 2021

[DOC] Add changelog entry for #2458 #2495

Merged

[DOC] Add cookbook entry for setting compression threads #2493

Merged

marehr closed this as completed in #2495 Mar 25, 2021

marehr mentioned this issue Apr 12, 2021

Reading a BAM file uses up all available CPU cores #1195

Closed

marehr mentioned this issue Apr 18, 2021

io: Make output compression configurable, not just deduction by extension seqan/product_backlog#362

Open

14 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gz output writes bgzf #2455

gz output writes bgzf #2455

eseiler commented Mar 17, 2021

eseiler commented Mar 17, 2021

eseiler commented Mar 21, 2021 •

edited

eseiler commented Mar 24, 2021

h-2 commented Apr 14, 2021

marehr commented Apr 14, 2021

h-2 commented Apr 16, 2021

h-2 commented Apr 16, 2021

gz output writes bgzf #2455

gz output writes bgzf #2455

Comments

eseiler commented Mar 17, 2021

Description

eseiler commented Mar 17, 2021

eseiler commented Mar 21, 2021 • edited

Resolution 19-03-2021

eseiler commented Mar 24, 2021

h-2 commented Apr 14, 2021

marehr commented Apr 14, 2021

h-2 commented Apr 16, 2021

h-2 commented Apr 16, 2021

eseiler commented Mar 21, 2021 •

edited