Using bgz extension instead of gz for bgziped files #129

Open
hguturu opened this issue Sep 9, 2014 · 13 comments

Comments

@hguturu

hguturu commented Sep 9, 2014

This would make it very clear what type of file is being dealt with. It would also avoid issues with tools like vim, which can read bgzipped files saved with a .gz extension and then resave them as regular gzip.

I believe this needs only minimal changes: bgzip.c would auto-append .bgz instead of .gz, and tabix.c would auto-detect files with a .bgz extension.

I can submit a pull request for these if needed.

@winni2k

winni2k commented Nov 18, 2015

bgzipped files are gzip compliant. Why not tell vim to gzip everything with bgzip instead?
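
A minimal sketch of what "gzip compliant" means here, assuming the BGZF block layout described in the SAM specification (a series of gzip members, each carrying a `BC` extra subfield). The helper name `bgzf_block` is hypothetical and this is not htslib's implementation; it only illustrates that a stock gzip reader can decode such a stream.

```python
import gzip
import struct
import zlib


def bgzf_block(payload: bytes) -> bytes:
    """Wrap a small payload (< 64 KiB) in one BGZF block, i.e. one gzip member."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib wrapper
    cdata = comp.compress(payload) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # total block length minus one
    header = struct.pack(
        "<4BI2BH2BHH",
        0x1F, 0x8B, 8, 4,       # gzip magic, CM = deflate, FLG = FEXTRA
        0,                      # MTIME
        0, 0xFF,                # XFL, OS = unknown
        6,                      # XLEN: one 6-byte extra subfield follows
        ord("B"), ord("C"), 2,  # 'BC' subfield with a 2-byte payload
        bsize,
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


# Two data blocks followed by an empty block, as a bgzipped file would end.
stream = (
    bgzf_block(b"chr1\t100\t200\n")
    + bgzf_block(b"chr1\t300\t400\n")
    + bgzf_block(b"")
)

# A plain gzip reader skips the extra field and concatenates the members.
text = gzip.decompress(stream)
```

The reverse does not hold: output from an ordinary gzip writer has no `BC` subfield and no block structure, so it cannot be indexed the way a bgzipped file can.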

@hguturu
Author

hguturu commented Nov 18, 2015

vim was just an example, not the motivation. The more general motivation is that bgzip is slightly different from gzip, and there doesn't appear to be a reason to conflate the two, especially since sometimes it's preferred to just have pure gzip (no index will be built, or the file has nothing to do with genomics).

@winni2k

winni2k commented Nov 19, 2015

Yep, you're right. Even different implementations of the same standard (for example .lz and .lzma here) appear to get different file endings.

@jrandall
Contributor

For bgzip, we could make ".bgz" an optional (non-default) output filename extension.

For the rest of htslib/samtools/bcftools, we should make sure that ".bgz" files are recognized as bgzip for input and output filenames.

@dprat

dprat commented Apr 18, 2018

.bgz files are still not recognized when using bgzip.
I needed to rename all my files from .bgz to .gz, which is not intuitive, since I haven't found any information about this little detail.
I guess I'm not the only one to download files with a ".bgz" extension from a database (e.g. gnomAD VCFs).
And I guess it's an easy modification to make.

Thank you

@pd3
Member

pd3 commented Apr 18, 2018

Why do you need to rename your files? The suffix name does not matter for the operation of the program.

@dprat

dprat commented Apr 18, 2018

I have to rename them from ".bgz" to ".gz" so bgzip can work; otherwise I get the error "unknown suffix -- ignored".
Like I said, they are named ".bgz" on gnomAD.
But since the tool is called bgzip (and not gzip), it would make sense (to me) that the extension could be ".bgz".

@jkbonfield
Contributor

jkbonfield commented Apr 18, 2018

That's likely a bug in bgzip as it ought to be using the magic number instead of filename.

However, I see similar logic in tabix, which only works on (for example) foo.bed.gz and wouldn't accept foo.bed.bgz. This is why renaming files to your own suffixes is problematic, and I'd be reluctant to tinker with this. Even if we change it in htslib, it'll cause problems for people using old installs, and we have no idea how many other applications out there are assuming .gz instead of .bgz. I agree bgz would have been better, but IMO this ship sailed long ago.
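
A rough sketch of checking the magic number instead of the filename, assuming the gzip header layout from RFC 1952 and the `BC` extra subfield that marks BGZF in the SAM specification. The name `sniff_compression` is made up for illustration and is not an htslib function.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b"


def sniff_compression(head: bytes) -> str:
    """Classify the first bytes of a file as 'bgzf', 'gzip', or 'none'."""
    if len(head) < 2 or head[:2] != GZIP_MAGIC:
        return "none"
    # Without CM = deflate and the FEXTRA flag there can be no 'BC' subfield.
    if len(head) < 12 or head[2] != 8 or not (head[3] & 4):
        return "gzip"
    (xlen,) = struct.unpack_from("<H", head, 10)
    pos, end = 12, min(12 + xlen, len(head))
    # Walk the extra subfields: 2 id bytes, a 2-byte length, then the payload.
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", head, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return "bgzf"
        pos += 4 + slen
    return "gzip"


# The 28-byte empty BGZF block that terminates a bgzipped file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "42430200" "1b00" "0300" "00000000" "00000000"
)

kinds = [
    sniff_compression(BGZF_EOF),
    sniff_compression(gzip.compress(b"chr1\t100\n")),
    sniff_compression(b"chr1\t100\n"),
]
```

With a check like this, the first 18 bytes of `foo.bed.gz` and `foo.bed.bgz` are classified identically, so the extension stops mattering for input.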

@dprat

dprat commented Apr 18, 2018

I agree that it should not use the filename...

My first idea was to have bgzip able to work on ".gz", but also on ".bgz" so old installs as you said would still work on ".gz".

@pd3
Member

pd3 commented Apr 18, 2018

Yes, it should use the magic number. This fails:

bgzip -d test.bgz
[bgzip] test.bgz: unknown suffix -- ignored

As a quick workaround, use

gunzip -c test.bgz

@jmarshall
Member

jmarshall commented Apr 18, 2018

This code in bgzip is checking that the file is compressed, hence in a position to be decompressed. Doing that via filename-extension checking code is ancient, from before we had easy magic-number sniffing infrastructure.

[Edit: the similar logic in tabix.c — in file_type() — is just a shortcut: if the extension heuristic doesn't trigger, the code sniffs the file contents. So I think tabix is fine and would accept foo.bed.bgz just fine; certainly it queries .bgz files downloaded from gnomAD happily.]

We'd now be in a position to move bgzip's is-it-compressed test to after bgzf_open() and use bgzf_compression() instead — compare the tbx.c part of ec1d68e.

The code that strips .gz off the end of the input filename to construct an output filename would also need generalising to handle bgzip -d foo.bed.bgz, but that's not insurmountable.
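
To illustrate the filename point, here is a small sketch that contrasts stripping a fixed three characters with stripping whatever the final extension is. It is written in Python for brevity and is not the bgzip.c code; `decompressed_name` is a hypothetical helper.

```python
import os


def decompressed_name(path: str) -> str:
    """Derive an output name by dropping the final extension, whatever its length."""
    stem, ext = os.path.splitext(path)
    if not ext:
        # Nothing to strip: refuse rather than overwrite the input file.
        raise ValueError(f"cannot derive an output name from {path!r}")
    return stem


# Cutting exactly three characters only works when the suffix is ".gz".
fixed_gz = "foo.bed.gz"[:-3]
fixed_bgz = "foo.bed.bgz"[:-3]  # leaves a stray trailing dot

# Splitting on the last extension handles both suffixes.
general_gz = decompressed_name("foo.bed.gz")
general_bgz = decompressed_name("foo.bed.bgz")
```

A name with no extension raises instead of guessing, which seems a safer default for a tool that writes its output next to the input.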

@dprat

dprat commented Apr 18, 2018

@pd3 yes, in the end it looks like this is the best solution
@jmarshall yes, that's the idea, I'll try to do it on my own

@alam-shahul

It seems like this was never done...? When using bgzip -d after installing the latest version of htslib, I still have to change the extension of the file that I am decompressing from .bgz -> .gz

jmarshall added a commit to jmarshall/htslib that referenced this issue Sep 17, 2019
Check that the file is actually compressed rather than that it ends
in ".gz", and form the output filename by stripping an extension rather
than exactly 3 characters.

Enables e.g. `bgzip -d foo.bgz`; fixes (the non-policy part of) samtools#129.