Rework on compressed file based Dataset #91

yongtang · 2019-02-13T18:14:19Z

As the number of file-based Datasets grows, code duplication is starting to appear. The biggest area of duplication is compression support. There are two types of compression:

ZLIB/GZIP, where there is a single compressed entry
ZIP, where there are multiple entries inside (e.g., an npz file is essentially a ZIP archive).

Compression itself can get complicated, for example with recursive compression. The goal of tensorflow-io, though, is to support formats that are commonly used in the machine learning community, so one level of compression is enough.
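
To make the distinction between the two types concrete, here is a minimal sketch using only the Python standard library (not tensorflow-io). The entry names and payload bytes are placeholders invented for illustration.

```python
import gzip
import io
import zipfile

# GZIP: a single compressed stream with no entry names.
gz_bytes = gzip.compress(b"1,2,3\n")
single_payload = gzip.decompress(gz_bytes)

# ZIP: an archive holding several named entries.
# "x.bin" and "y.bin" are placeholder names, not real npz members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("x.bin", b"first entry")
    archive.writestr("y.bin", b"second entry")

# Reading the archive back yields one payload per entry.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    entries = {name: archive.read(name) for name in archive.namelist()}
```

A reader for the first type only has to produce one payload per file, while a reader for the second has to iterate over entries, which is why the two cases tend to be handled by separate code paths.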

We should rework Dataset to provide a CompressedFileDataset-like abstraction.

yongtang · 2019-03-09T11:53:12Z

Looks like https://github.com/libarchive/libarchive could be a decent choice for compression. I had some initial success with the CIFAR dataset and will create a PR soon for an initial check-in.

yongtang · 2019-05-09T16:51:04Z

Our compression and archive support has been baked in. Now any format can add compression and archive support by specifying a filter.
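
As a rough illustration of the "specify a filter" idea, here is a hypothetical sketch in plain Python. It is not the tensorflow-io API or its libarchive-based implementation: the `FILTERS` registry, the filter names, and `read_records` are all assumptions made up for this example.

```python
import gzip
import io
import zipfile
from typing import Callable, Dict, Iterable, Iterator, List


def _read_zip(data: bytes) -> List[bytes]:
    """Return the payload of every entry in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [archive.read(name) for name in archive.namelist()]


# Hypothetical registry: a filter maps raw file bytes to a list of
# decompressed payloads (one per entry). Names are illustrative only.
FILTERS: Dict[str, Callable[[bytes], List[bytes]]] = {
    "none": lambda data: [data],
    "gz": lambda data: [gzip.decompress(data)],
    "zip": _read_zip,
}


def read_records(
    data: bytes,
    parse: Callable[[bytes], Iterable[str]],
    filter_name: str = "none",
) -> Iterator[str]:
    """Apply one level of decompression, then the format's own parser."""
    for payload in FILTERS[filter_name](data):
        yield from parse(payload)


def parse_lines(payload: bytes) -> List[str]:
    """Toy format parser: one record per text line."""
    return payload.decode("utf-8").splitlines()


records = list(read_records(gzip.compress(b"a\nb\n"), parse_lines, "gz"))
```

The point of the design is that the format parser (`parse_lines` here) never sees compressed bytes, so a new format gets compression and archive support without duplicating any decompression code.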

yongtang added the enhancement label Mar 3, 2019

yongtang closed this as completed May 9, 2019
