Work with an extra field of gzip and zip files #61881
Comments
Gzip files can contain an extra field, and some applications use it to extend the gzip format. The current GzipFile implementation ignores this field on input and provides no way to create a new file with an extra field. I propose saving the extra field data as a GzipFile attribute on reading and adding a new parameter to the GzipFile constructor for creating a file with an extra field. |
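For readers who need the field today, here is a minimal sketch of pulling the raw FEXTRA bytes out of a gzip header by hand, following the fixed 10-byte header layout in RFC 1952. The names `read_gzip_extra` and `add_extra` are hypothetical helpers written for this illustration; they are not part of the gzip module or of the patch discussed in this issue.

```python
import gzip
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field follows the fixed header


def read_gzip_extra(data: bytes) -> bytes:
    """Return the raw extra field of the first gzip member, or b'' if absent."""
    if len(data) < 10:
        raise ValueError("truncated gzip header")
    # Fixed header: ID1 ID2, CM, FLG, MTIME (4 bytes), XFL, OS; little-endian.
    magic, method, flags, _mtime, _xfl, _os = struct.unpack("<2sBBIBB", data[:10])
    if magic != b"\x1f\x8b" or method != 8:
        raise ValueError("not a deflate-compressed gzip stream")
    if not flags & FEXTRA:
        return b""
    # XLEN is a 2-byte little-endian length, followed by XLEN bytes of data.
    (xlen,) = struct.unpack("<H", data[10:12])
    extra = data[12:12 + xlen]
    if len(extra) != xlen:
        raise ValueError("truncated extra field")
    return extra


def add_extra(plain_gz: bytes, extra: bytes) -> bytes:
    """Demo-only helper (hypothetical): splice an extra field into a gzip
    stream that has neither FEXTRA nor FHCRC set."""
    header = bytearray(plain_gz[:10])
    header[3] |= FEXTRA  # byte 3 of the header is FLG
    return bytes(header) + struct.pack("<H", len(extra)) + extra + plain_gz[10:]


sample = add_extra(gzip.compress(b"hello"), b"RA\x02\x00\x01\x02")
print(read_gzip_extra(sample))
print(gzip.decompress(sample))  # the standard reader skips the extra field
```

The sketch only looks at the first member of the stream and does no validation of the subfield structure inside the extra field.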
I'll be glad to do it, but I have some questions to discuss first. First, about the FEXTRA format: it consists of a series of subfields [1], and the extra field used in the current Lib/test/test_gzip.py :: test_read_with_extra is slightly incorrect, at least for anyone following the format from RFC 1952. Do you have real samples with an extra field? Next, about GzipFile's public interface: GzipFile(...).extra looks ugly. Should I extend this ticket to support all metadata headers? FNAME, FCOMMENT, FHCRC, etc. are read correctly now, but there is no way to access them from outside (and no way to create a file with FCOMMENT or FHCRC). E.g., something like this: |
I have an almost-ready patch, but I have doubts about the interface, which is open to discussion. ZIP file entries have a similar extra field, and I'm planning to add a similar feature to the zipfile module too. Here are preliminary patches. |
Some examples:
>>> import zipfile
>>> z = zipfile.ZipFile('README.zip')
>>> z.filelist[0].extra
b'UT\x05\x00\x03\xe0\xc3\x87Rux\x0b\x00\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00'
>>> z.filelist[0].extra_map
<zipfile.ExtraMap object at 0xb6fe8bec>
>>> list(z.filelist[0].extra_map.items())
[(21589, b'\x03\xe0\xc3\x87R'), (30837, b'\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00')]
>>> import gzip
>>> gz = gzip.open('README.dz')
>>> gz.extra_bytes
b''
>>> gz.extra_map
<gzip.ExtraMap object at 0xb6fd04ac>
>>> list(gz.extra_map.items())
[]
>>> gz.read(1)
b'T'
>>> gz.extra_bytes
b'RA\x08\x00\x01\x00\xcb\xe3\x01\x00T\x0b'
>>> list(gz.extra_map.items())
[(b'RA', b'\x01\x00\xcb\xe3\x01\x00T\x0b')] |
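The pairs returned by `extra_map.items()` in the session above can be reproduced from the raw bytes with a few lines of parsing. This is a sketch under the assumption that both formats use the layout their specifications describe (a 2-byte ID, a 2-byte little-endian length, then the payload); the function names are hypothetical and do not come from the patch.

```python
import struct


def _split_subfields(extra: bytes):
    """Yield (raw 2-byte id, payload) for each subfield in an extra field.

    Fewer than 4 trailing bytes cannot form a subfield header and are ignored.
    """
    pos = 0
    while pos + 4 <= len(extra):
        raw_id = extra[pos:pos + 2]
        (length,) = struct.unpack_from("<H", extra, pos + 2)
        pos += 4
        payload = extra[pos:pos + length]
        if len(payload) != length:
            raise ValueError("subfield length exceeds the extra field")
        yield raw_id, payload
        pos += length


def split_gzip_extra(extra: bytes):
    """gzip: the ID is the two bytes SI1, SI2, kept here as bytes."""
    return list(_split_subfields(extra))


def split_zip_extra(extra: bytes):
    """ZIP: the header ID is a 2-byte little-endian integer."""
    return [(int.from_bytes(raw_id, "little"), payload)
            for raw_id, payload in _split_subfields(extra)]


# Raw values copied from the interactive session above.
gz_extra = b'RA\x08\x00\x01\x00\xcb\xe3\x01\x00T\x0b'
zip_extra = (b'UT\x05\x00\x03\xe0\xc3\x87R'
             b'ux\x0b\x00\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00')

print(split_gzip_extra(gz_extra))
print(split_zip_extra(zip_extra))
```

Keeping gzip IDs as bytes and ZIP IDs as integers mirrors the output shown in the session; whether the final API should expose them that way is part of the interface question raised in this thread.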
What's needed to get this integrated? It would be great not to have to fork the gzip module. |
Agreed, it would be really nice to integrate these changes. These special fields are found in gzipped .bam files, a common DNA sequence alignment format used in the bioinformatics community. It would be nice to be able to read and write them with the standard library. |
There is a comment field too, which would be nice to support. The Go gzip package has a Header struct that describes all the metadata. I see that mtime was made configurable in 3.8, so hopefully we can add comment and extra. https://golang.org/pkg/compress/gzip/#Header For our purposes we'd like to put arbitrary data in a gzip file, but it is complicated to do so. I might take the patch here and apply it to the Python gzip module, but that feels a bit hackish. |
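As a stopgap that avoids patching the module, a single gzip member carrying an extra field and a comment can be assembled by hand with `zlib` and `struct`. The sketch below assumes the RFC 1952 member layout (header, optional FEXTRA, optional FCOMMENT, raw deflate data, CRC32, ISIZE); `build_gzip_member` is a hypothetical name, not an existing or proposed gzip API.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04    # FLG bit 2: extra field present
FCOMMENT = 0x10  # FLG bit 4: zero-terminated comment present


def build_gzip_member(data: bytes, extra: bytes = b"", comment: str = "") -> bytes:
    """Assemble one gzip member with optional extra field and comment."""
    if len(extra) > 0xFFFF:
        raise ValueError("extra field is limited to 65535 bytes")
    if "\x00" in comment:
        raise ValueError("comment must not contain NUL")

    flags = (FEXTRA if extra else 0) | (FCOMMENT if comment else 0)
    # ID1 ID2, CM=8 (deflate), FLG, MTIME=0 (unset), XFL=0, OS=255 (unknown).
    out = struct.pack("<2sBBIBB", b"\x1f\x8b", 8, flags, 0, 0, 255)
    if extra:
        out += struct.pack("<H", len(extra)) + extra
    if comment:
        # RFC 1952 specifies ISO 8859-1 (Latin-1), terminated by a zero byte.
        out += comment.encode("latin-1") + b"\x00"

    # Negative wbits selects a raw deflate stream without a zlib wrapper.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    out += compressor.compress(data) + compressor.flush()

    # Trailer: CRC32 of the uncompressed data and its size modulo 2**32.
    out += struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data) & 0xFFFFFFFF)
    return out


blob = build_gzip_member(b"payload", extra=b"RA\x01\x00\x07", comment="demo")
print(gzip.decompress(blob))  # the standard reader skips both optional fields
```

The caller is responsible for passing an extra field that is already split into well-formed subfields; the sketch writes the bytes as given and holds the whole payload in memory.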
type Header struct { This is what the header/extra things look like for reference. |