gz output writes bgzf #2455
Comments
diff --git a/include/seqan3/io/detail/misc_output.hpp b/include/seqan3/io/detail/misc_output.hpp
index 568b0c59..c0332226 100644
--- a/include/seqan3/io/detail/misc_output.hpp
+++ b/include/seqan3/io/detail/misc_output.hpp
@@ -46,16 +46,24 @@ inline auto make_secondary_ostream(std::basic_ostream<char_t> & primary_stream,
std::string extension = filename.extension().string();
- if ((extension == ".gz") || (extension == ".bgzf") || (extension == ".bam"))
+ if (extension == ".gz")
+ {
+ #ifdef SEQAN3_HAS_ZLIB
+ filename.replace_extension("");
+ return {new contrib::basic_gz_ostream<char_t>{primary_stream}, stream_deleter_default};
+ #else
+ throw file_open_error{"Trying to write a gzipped file, but no ZLIB available."};
+ #endif
+ }
+ else if ((extension == ".bgzf") || (extension == ".bam"))
{
#ifdef SEQAN3_HAS_ZLIB
if (extension != ".bam") // remove extension except for bam
filename.replace_extension("");
- return {new contrib::basic_bgzf_ostream<char_t>{primary_stream},
- stream_deleter_default};
+ return {new contrib::basic_bgzf_ostream<char_t>{primary_stream}, stream_deleter_default};
#else
- throw file_open_error{"Trying to write a gzipped file, but no ZLIB available."};
+ throw file_open_error{"Trying to write a bgzf-compressed file, but no ZLIB available."};
#endif
}
else if (extension == ".bz2")
would work. |
Resolution 19-03-2021
In general, we do not like that there is a |
Still needs documentation |
This was actually intentional. BGZF is standard-conforming Gzip. There is no disadvantage except maybe 1% file size difference. Using multiple threads for compression is an advantage. It should just be configurable through the file interface. Furthermore, file endings like |
That's not true, I have seen instances where it had significant overheads.
Interesting, there was no documentation that pointed to that fact. Can you give an article that specifies that |
Do you mean overhead in file size? That is surprising to me, but it may be.
I didn't know either, but it is a requirement for all indexes to work (fasta/fastq index, sam index, vcf index), so all machines output these formats by default and all tools expect them. It is not required by the SAM spec, but if you read through the man pages of e.g. samtools or bcftools, you will see that whenever compression is mentioned, it is always BGZF. This is from the bcftools manual:
Note that whenever these tools mention "bgzipped" formats, they also mean a plain This is considered a leftover, but it shows how 99% of users expect all bgzf files to end in ".gz". And I think that the reverse is also true, i.e. users almost always want a bgzf-compressed fasta, sam, vcf when they say ".gz". If we find out that there is a size discrepancy also in the official formats, I agree that we should have a way of creating plain-old Gzip, but I think that the default behaviour for sequence and alignment files ending in ".gz" should definitely be the BGZF format. |
I just double-checked |
Description
When writing a file to gzipped output, we write bgzf instead of gz.
Output will have, e.g.,
1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 eb 58
which is a bgzf magic header, instead of the expected 1f 8b 08 magic header.
Even though both compressions are compatible, it's technically incorrect.
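To make the difference between the two headers concrete, here is a minimal sketch of a check that tells them apart. The names `gzip_flavour` and `classify_header` are hypothetical and not part of seqan3. The sketch assumes the "BC" extra subfield is the first one after the fixed gzip header, as in the bytes quoted above; it does not walk through several extra subfields.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type, for illustration only (not a seqan3 type).
enum class gzip_flavour { not_gzip, plain_gzip, bgzf };

// Classifies a buffer by its leading bytes.
// Assumption: if a BGZF "BC" subfield is present, it is the first extra subfield.
inline gzip_flavour classify_header(unsigned char const * data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    // Every gzip member starts with ID1 = 0x1f, ID2 = 0x8b, CM = 0x08 (deflate).
    if (size < 3 || data[0] != 0x1f || data[1] != 0x8b || data[2] != 0x08)
        return gzip_flavour::not_gzip;

    // BGZF sets the FEXTRA flag (0x04) in byte 3. The extra field starts at
    // byte 12, after the 10-byte fixed header and the 2-byte XLEN, and holds
    // the subfield 'B' 'C' with a little-endian length of 2.
    bool const has_extra = size >= 16 && (data[3] & 0x04) != 0;
    if (has_extra && data[12] == 'B' && data[13] == 'C' && data[14] == 0x02 && data[15] == 0x00)
        return gzip_flavour::bgzf;

    return gzip_flavour::plain_gzip;
}
```

Passing the 18 bytes quoted above to this function should classify them as bgzf, while a file produced by a plain gzip writer would normally have byte 3 without the 0x04 flag and be classified as plain gzip.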
The decision happens here:
seqan3/include/seqan3/io/detail/misc_output.hpp
Line 49 in e0a1a95
We would need a different case for just ".gz".
This may lead to problems downstream, because the file is not what it seems.
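A separate case for ".gz" could look roughly like the following standalone sketch. It only mirrors the extension handling, not the stream construction; `compression_kind` and `select_compression` are hypothetical names used for illustration, and the real code in misc_output.hpp returns a stream object instead of an enum.

```cpp
#include <filesystem>
#include <string>

// Hypothetical enum for illustration; seqan3 constructs stream objects instead.
enum class compression_kind { none, gzip, bgzf };

// Picks a compression kind from the file extension and strips the extension
// where the compressed suffix is not part of the format name.
inline compression_kind select_compression(std::filesystem::path & filename)
{
    std::string const extension = filename.extension().string();

    if (extension == ".gz") // plain gzip gets its own case
    {
        filename.replace_extension("");
        return compression_kind::gzip;
    }
    if (extension == ".bgzf")
    {
        filename.replace_extension("");
        return compression_kind::bgzf;
    }
    if (extension == ".bam") // BAM is BGZF-compressed, but keeps its extension
        return compression_kind::bgzf;

    return compression_kind::none;
}
```

With this split, "reads.fasta.gz" would select plain gzip and leave "reads.fasta" for format detection, while "reads.fasta.bgzf" and "out.bam" would still select BGZF.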
Parsing such files again with seqan will lead to high CPU usage because of:
seqan3/include/seqan3/contrib/stream/bgzf_stream_util.hpp
Line 40 in f521b69
(= use all threads when using bgzf)
Even if you know that we use all CPUs for bgzf, you would not expect this to happen for gzip.
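One way to express the expectation above is to tie the thread count to the stream kind. The following is only a sketch under assumptions: `stream_kind` and `compression_threads` are hypothetical names, not seqan3 API, and the policy (one thread for plain gzip, all available threads for bgzf unless the caller asks for fewer) is an illustration rather than the library's behaviour.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical stream kinds, for illustration only.
enum class stream_kind { plain_gzip, bgzf };

// Returns how many threads a (de)compression stream should use.
// requested == 0 means "no preference".
inline unsigned compression_threads(stream_kind kind, unsigned requested = 0)
{
    // Plain gzip is a single deflate stream, so one thread is what users expect.
    if (kind == stream_kind::plain_gzip)
        return 1u;

    // hardware_concurrency() may return 0 if the value is not computable.
    unsigned const available = std::max(1u, std::thread::hardware_concurrency());

    // BGZF blocks are independent, so use all threads unless fewer are requested.
    return requested == 0 ? available : std::min(requested, available);
}
```

A caller could then pass an explicit `requested` value to cap CPU usage for bgzf, which matches the wish in the discussion that this be configurable through the file interface.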