# The Tarproc Utilities

For many big data applications, it is convenient to process data in record-sequential formats.
One of the most common such formats is `tar` archives.

We adopt the following conventions for storing records in tar archives:

- files are split into a key and a field name
- the key is the directory name plus the file name before the first dot
- the field name is the file name after the first dot
- files with the same key are grouped together and treated as a sample or record
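
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the convention, not code taken from these utilities; the function name `split_key_field` is hypothetical.

```python
import posixpath


def split_key_field(path):
    """Split a tar member path into (key, field name).

    Hypothetical helper that illustrates the convention; it is not part of tarproc.
    """
    dirname, filename = posixpath.split(path)
    # The field name is everything after the first dot of the file name.
    base, _, field = filename.partition(".")
    # The key is the directory name plus the file name before the first dot.
    key = posixpath.join(dirname, base) if dirname else base
    return key, field


print(split_key_field("10.cls"))                      # ('10', 'cls')
print(split_key_field("train/n04380533/10.seg.png"))  # ('train/n04380533/10', 'seg.png')
```

Note that splitting at the first dot means a field name may itself contain dots, as in `seg.png`.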

This convention is followed both by these utilities and by the `webdataset` `DataSet` implementation for PyTorch, available at http://github.com/tmbdev/webdataset

Here is an example of the ImageNet training data for deep learning (the `tar: write error` message appears only because `sed 5q` closes the pipe after five lines):

In [1]:
tar tf testdata/imagenet-000000.tar | sed 5q

10.cls
10.png
10.wnid
10.xml
12.cls
tar: write error


The `tarshow` utility displays images and data from tar files.

In [2]:
tarshow -d 0 'testdata/imagenet-000000.tar#0,3'

__key__             	10
__source__          	testdata/imagenet-000000.tar
cls                 	b'304'
png                 	b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02X\x00\x00\x
wnid                	b'n04380533'
xml                 	b'None'

__key__             	12
__source__          	testdata/imagenet-000000.tar
cls                 	b'551'
png                 	b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\xc8\x00\x0
wnid                	b'n03485407'
xml                 	b'None'

__key__             	13
__source__          	testdata/imagenet-000000.tar
cls                 	b'180'
png                 	b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\x90\x00\x0
wnid                	b'n02088632'
xml                 	b'None'

__key__             	15
__source__          	testdata/imagenet-000000.tar
cls                 	b'165'
png                 	b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xf4\x00\x0
wnid                	b'n02410509'
xml                 	b'<annotation>\n\

The `tarfirst` command outputs the first file matching some specification; this is useful for debugging.

In [3]:
tarfirst -f wnid testdata/imagenet-000000.tar

10.wnid
n04380533

In [4]:
tarfirst testdata/imagenet-000000.tar > _test.image
file _test.image

10.png
_test.image: PNG image data, 600 x 793, 8-bit/color RGB, non-interlaced


The `-S` option selects records with an arbitrary Python expression, in which `_` is a dict that maps field names to file contents.

In [5]:
tarfirst -S 'int(_["cls"]) == 180' -f cls testdata/imagenet-000000.tar 

13.cls
180
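
As a rough illustration of how such a selection expression could be evaluated, the sketch below binds `_` to a sample dict and calls `eval`. How `tarfirst` actually implements `-S` is not shown in this document, so treat the helper `matches` as a hypothetical stand-in.

```python
def matches(expr, sample):
    """Evaluate a selection expression with `_` bound to the sample dict.

    Hypothetical helper; eval runs arbitrary code, so only use trusted expressions.
    """
    return bool(eval(expr, {}, {"_": sample}))


# Two samples with byte-valued fields, as displayed by tarshow above.
samples = [
    {"__key__": "10", "cls": b"304"},
    {"__key__": "13", "cls": b"180"},
]

# Find the first sample whose class label equals 180.
first = next(s for s in samples if matches('int(_["cls"]) == 180', s))
print(first["__key__"])  # 13
```

The expression works on raw bytes because Python's `int` accepts ASCII digits given as `bytes`.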

# Creating Tar Shards

The `tarsplit` utility is useful for creating sharded tar files.

In [6]:
tarsplit -n 20 -o _test testdata/sample.tar

Traceback (most recent call last):
  File "./tarsplit", line 22, in <module>
    from tarproclib import reader, writer, paths
  File "/home/tmb/exp/tarproc/tarproclib/writer.py", line 19
    def __init__(self, fileobj, keep_meta=False, user="bigdata", group="bigdata", mode=0o0444, compress=None, encoder=None, mode=None):
    ^
SyntaxError: duplicate argument 'mode' in function definition


More commonly, `tarsplit` sits at the end of a pipeline, like this:

In [None]:
(cd /mdata/imagenet-raw/train && find . -name '*.JPEG' | tar -T - -cf -) | tarsplit --maxshards=5 -s 1e8 -o _test
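
The idea behind sharding can be sketched as a generator that starts a new shard whenever a record count or byte-size limit would be exceeded. This is a sketch under assumptions: it assumes that limits such as `-n` and `-s` refer to records per shard and bytes per shard, it does not model `--maxshards`, and `shard_records` is a hypothetical name rather than a tarproc function.

```python
def shard_records(records, max_count=None, max_size=None):
    """Yield lists of records, starting a new shard when a limit is reached.

    Each record is a dict mapping field names to bytes. Hypothetical helper.
    """
    shard, size = [], 0
    for record in records:
        record_size = sum(len(value) for value in record.values())
        too_many = max_count is not None and len(shard) >= max_count
        too_big = max_size is not None and len(shard) > 0 and size + record_size > max_size
        if too_many or too_big:
            yield shard
            shard, size = [], 0
        shard.append(record)
        size += record_size
    if shard:
        # Emit the final, possibly smaller, shard.
        yield shard


records = [{"txt": b"x" * 10} for _ in range(50)]
by_count = [len(s) for s in shard_records(records, max_count=20)]
by_size = [len(s) for s in shard_records(records[:5], max_size=25)]
print(by_count)  # [20, 20, 10]
print(by_size)   # [2, 2, 1]
```

A record is never split across shards, so a single record larger than `max_size` still ends up in a shard of its own.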

# Concatenating Tar Files

You can reshard with a combination of `tarscat` and `tarsplit`. (This example uses the same tar file as input twice; in practice, you would use separate shards.)

In [None]:
tarscat testdata/sample.tar testdata/sample.tar | tarsplit -n 60

The `tarscat` utility also lets you specify a downloader command (for accessing object stores) and can expand shard syntax. Downloader commands are specified by setting an environment variable for each URL scheme, as in the following, more complex example.

In [None]:
export GOPEN_GS="gsutil cat '{}'"
export GOPEN_HTTP="curl --silent -L '{}'"
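
To make the lookup concrete, here is a minimal sketch of how a `GOPEN_<SCHEME>` variable could be turned into a command for a given URL. It assumes that `{}` is a placeholder for the URL, as the exports above suggest; the function `downloader_command` is hypothetical and is not the tarproc implementation.

```python
import os
from urllib.parse import urlparse


def downloader_command(url, environ=os.environ):
    """Return the shell command for `url`, or None if no downloader is configured.

    Hypothetical helper illustrating the GOPEN_<SCHEME> lookup.
    """
    scheme = urlparse(url).scheme
    template = environ.get("GOPEN_" + scheme.upper())
    if template is None:
        return None
    # Plain substitution (not str.format), so braces inside the URL are left alone.
    return template.replace("{}", url)


env = {"GOPEN_GS": "gsutil cat '{}'"}
print(downloader_command("gs://lpr-imagenet/imagenet_train-0000.tgz", env))
# gsutil cat 'gs://lpr-imagenet/imagenet_train-0000.tgz'
```

The sketch does no shell quoting of its own, so a URL containing a single quote would break the generated command.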

In [None]:
tarscat -c 10 'gs://lpr-imagenet/imagenet_train-0000.tgz' | tar2tsv -f cls

In [None]:
tarscat --shuffle 100 -c 3 -b 'gs://lpr-imagenet/imagenet_train-{0000..0147}.tgz' > _temp.tar

In [None]:
tarshow -d 0 _temp.tar

In [None]:
tarshow -d 0 'gs://lpr-imagenet/imagenet_train-{0000..0099}.tgz#0,3'
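
The shard syntax used in these commands is a brace-style numeric range. A minimal sketch of the expansion, assuming a single zero-padded `{first..last}` range per name, might look as follows; `expand_shards` is a hypothetical helper, and it does not handle the `#0,3` suffix shown above.

```python
import re


def expand_shards(spec):
    """Expand one `{first..last}` numeric range in `spec` into a list of names.

    Hypothetical helper; names without a range are returned unchanged.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", spec)
    if match is None:
        return [spec]
    first, last = match.group(1), match.group(2)
    width = len(first)  # keep the zero padding of the first bound
    prefix, suffix = spec[:match.start()], spec[match.end():]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(first), int(last) + 1)]


urls = expand_shards("gs://lpr-imagenet/imagenet_train-{0000..0147}.tgz")
print(len(urls), urls[0], urls[-1])
# 148 gs://lpr-imagenet/imagenet_train-0000.tgz gs://lpr-imagenet/imagenet_train-0147.tgz
```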

# Creating Tar Files from TSV Files

You can create `tar` archives from TSV files. The first line is a header that gives the field names; subsequent lines are data. A header starting with "@" causes the corresponding field content to be interpreted as a file name, and that file is read in binary mode and stored in the archive.

This, too, combines with `tarsplit` and the other utilities.

In [None]:
sed 3q testdata/plan.tsv

In [None]:
tarcreate -C testdata testdata/plan.tsv | tarshow -c 3
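
The TSV conversion can be approximated with Python's standard `tarfile` module. The sketch below is written under assumptions: it uses the row index as the record key (how `tarcreate` chooses keys is not described here), and the function name `tsv_to_tar`, its `basedir` parameter, and the sample file `a.bin` are all hypothetical.

```python
import io
import os
import tarfile
import tempfile


def tsv_to_tar(tsv_text, fileobj, basedir="."):
    """Write one tar member per TSV cell, named `<row index>.<field>`.

    Hypothetical helper; using the row index as the key is an assumption.
    """
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for index, line in enumerate(lines[1:]):
            for name, value in zip(header, line.split("\t")):
                if name.startswith("@"):
                    # "@field": the cell is a file name; store that file's bytes.
                    with open(os.path.join(basedir, value), "rb") as stream:
                        content = stream.read()
                    name = name[1:]
                else:
                    content = value.encode("utf-8")
                info = tarfile.TarInfo(f"{index:06d}.{name}")
                info.size = len(content)
                tar.addfile(info, io.BytesIO(content))


# Demonstration with a temporary file standing in for an image.
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "a.bin"), "wb") as stream:
        stream.write(b"\x00\x01")
    buffer = io.BytesIO()
    tsv_to_tar("cls\t@png\n3\ta.bin\n", buffer, basedir=workdir)

buffer.seek(0)
with tarfile.open(fileobj=buffer) as tar:
    names = tar.getnames()
    png_bytes = tar.extractfile("000000.png").read()
print(names)  # ['000000.cls', '000000.png']
```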

# Sorting

You can sort the records (grouped files) in a `tar` archive using `tarsort`.

You can use any content for sorting. Here, we sort on the content of the `cls` field, interpreting it as an `int`.

In [None]:
tarsort --sortkey cls --sorttype int --update testdata/imagenet-000000.tar > _sorted.tar

In [None]:
tar2tsv -c 5 -f "cls wnid" testdata/imagenet-000000.tar
echo
tar2tsv -c 5 -f "cls wnid" _sorted.tar

You can also use `tarsort` for shuffling records.

In [None]:
tarsort --sorttype shuffle < testdata/imagenet-000000.tar > _sorted.tar
tar2tsv -c 5 -f "cls wnid" _sorted.tar
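
Both uses of `tarsort` shown in this section boil down to reordering whole records. A minimal in-memory sketch, assuming records are dicts of byte strings, is given below; `sort_records` and its parameters are hypothetical names that only mirror the command-line options.

```python
import random


def sort_records(records, sortkey=None, sorttype="str", seed=None):
    """Return records sorted by one field, or shuffled if sorttype is "shuffle".

    Hypothetical helper; field values are bytes and are decoded before conversion.
    """
    records = list(records)
    if sorttype == "shuffle":
        random.Random(seed).shuffle(records)
        return records
    convert = {"int": int, "float": float, "str": str}[sorttype]
    return sorted(records, key=lambda r: convert(r[sortkey].decode("utf-8")))


# Keys and class labels as displayed by tarshow earlier in this document.
records = [
    {"__key__": "10", "cls": b"304"},
    {"__key__": "12", "cls": b"551"},
    {"__key__": "13", "cls": b"180"},
    {"__key__": "15", "cls": b"165"},
]
by_class = [r["__key__"] for r in sort_records(records, sortkey="cls", sorttype="int")]
shuffled = sort_records(records, sorttype="shuffle", seed=0)
print(by_class)  # ['15', '13', '10', '12']
```

Interpreting the field as an `int` matters: as strings, `"1000"` would sort before `"165"`.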

# Mapping / Parallel Processing

The `tarproc` utility lets you map command line programs and scripts over the samples in a tar file.

In [None]:
time tarproc -c "gm mogrify -size 256x256 *.png" -o - < testdata/imagenet-000000.tar > _out.tar

You can even parallelize this (somewhat analogous to `xargs`):

In [None]:
time tarproc -p 8 -c "gm mogrify -size 256x256 *.png" -o - < testdata/imagenet-000000.tar > _out.tar
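
The mapping idea can be sketched as follows: unpack each sample into a scratch directory, run the command there, and collect whatever files remain. This is an assumption-laden illustration rather than the `tarproc` implementation; the file naming `sample.<field>`, the thread pool, and the function names `process_sample` and `map_samples` are all hypothetical. The demonstration uses `cp`, so it assumes a POSIX shell.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample, command):
    """Unpack one sample into a scratch directory, run `command`, read files back."""
    with tempfile.TemporaryDirectory() as workdir:
        for field, content in sample.items():
            if field.startswith("__"):
                continue  # metadata such as __key__ is not written to disk
            with open(os.path.join(workdir, f"sample.{field}"), "wb") as stream:
                stream.write(content)
        # Run the user command inside the scratch directory.
        subprocess.run(command, shell=True, cwd=workdir, check=True)
        result = {"__key__": sample["__key__"]}
        for filename in sorted(os.listdir(workdir)):
            _, _, field = filename.partition(".")
            with open(os.path.join(workdir, filename), "rb") as stream:
                result[field] = stream.read()
    return result


def map_samples(samples, command, workers=1):
    """Apply `command` to every sample, using `workers` threads; order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: process_sample(s, command), samples))


samples = [{"__key__": "10", "txt": b"hello"}, {"__key__": "12", "txt": b"world"}]
results = map_samples(samples, "cp sample.txt sample.copy", workers=2)
print([sorted(r) for r in results])
```

Threads are sufficient here because the real work happens in the external processes started by `subprocess.run`.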

# Python Interface

In [None]:
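from itertools import islice

from tarproclib import reader, gopen

# Register a downloader command for gs:// URLs (mirrors the GOPEN_GS setting above).
gopen.handlers["gs"] = "gsutil cat '{}'"

# Iterate over the first 10 samples of a remote shard and print their field names.
for sample in islice(reader.TarIterator("gs://lpr-imagenet/imagenet_train-0000.tgz"), 0, 10):
    print(sample.keys())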
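
For readers without access to `tarproclib`, the grouping of tar members into samples can be approximated with the standard library alone. The sketch below is not the `TarIterator` implementation; `iterate_samples` is a hypothetical function that assumes members with the same key are adjacent in the archive.

```python
import io
import tarfile


def iterate_samples(fileobj):
    """Yield one dict per key from a tar stream whose members are grouped by key."""
    current = None
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, _, filename = member.name.rpartition("/")
            base, _, field = filename.partition(".")
            key = f"{dirname}/{base}" if dirname else base
            if current is None or current["__key__"] != key:
                if current is not None:
                    yield current
                current = {"__key__": key}
            # In stream mode, the data must be read before moving to the next member.
            current[field] = tar.extractfile(member).read()
    if current is not None:
        yield current


# Build a small in-memory archive to exercise the iterator.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, content in [("10.cls", b"304"), ("10.wnid", b"n04380533"), ("12.cls", b"551")]:
        info = tarfile.TarInfo(name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
buffer.seek(0)

samples = list(iterate_samples(buffer))
print([sorted(s.keys()) for s in samples])
```

Reading in stream mode (`r|*`) keeps the access pattern sequential, which is what makes tar archives suitable for pipes and object stores.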