vendor: adding tar-split dependency for graph
tar-split is a facility to disassemble and reassemble tar archives

Signed-off-by: Vincent Batts <vbatts@redhat.com>
vbatts committed Jul 21, 2015
1 parent 1ca7378 commit 5ddec2a
Showing 45 changed files with 4,983 additions and 1 deletion.
3 changes: 2 additions & 1 deletion hack/vendor.sh
@@ -33,8 +33,9 @@ clone git github.com/samuel/go-zookeeper d0e0d8e11f318e000a8cc434616d69e329edc37
clone git github.com/coreos/go-etcd v2.0.0
clone git github.com/hashicorp/consul v0.5.2

# get distribution packages
# get graph and distribution packages
clone git github.com/docker/distribution 419bbc2da637d9b2a812be78ef8436df7caac70d
clone git github.com/vbatts/tar-split v0.9.3

clone git github.com/opencontainers/runc v0.0.2 # libcontainer
# libcontainer deps (see src/github.com/docker/libcontainer/update-vendor.sh)
13 changes: 13 additions & 0 deletions vendor/src/github.com/vbatts/tar-split/.travis.yml
@@ -0,0 +1,13 @@
language: go
go:
- 1.4.2
- 1.3.3

# let us have pretty, fast Docker-based Travis workers!
sudo: false

# we don't need "go get" here <3
install: go get -d ./...

script:
- go test -v ./...
36 changes: 36 additions & 0 deletions vendor/src/github.com/vbatts/tar-split/DESIGN.md
@@ -0,0 +1,36 @@
Flow of TAR stream
==================

Usage of `github.com/vbatts/tar-split/archive/tar` is most similar to that of
the stdlib `archive/tar` package.


Packer interface
----------------

For ease of storage and usage of the raw bytes, there will be a storage
interface that accepts an io.Writer. This way you could pass it an in-memory
buffer or a file handle.

Having a Packer interface can allow configuration of hash.Hash for file payloads
and providing your own io.Writer.

Instead of having a state directory to store all the header information for all
Readers, we will leave that up to the user of the Reader, because we cannot
assume an ID for each Reader with which to keep that information differentiated.



State Directory
---------------

Perhaps we could deduplicate the header info by hashing the raw bytes and
storing them in a directory tree like:

./ac/dc/beef

Then reference the hash of the header info, in the positional records for the
tar stream. Though this could be a future feature, and not required for an
initial implementation. Also, this would imply an owned state directory, rather
than just writing storage info to an io.Writer.

19 changes: 19 additions & 0 deletions vendor/src/github.com/vbatts/tar-split/LICENSE
@@ -0,0 +1,19 @@
Copyright (c) 2015 Vincent Batts, Raleigh, NC, USA

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
181 changes: 181 additions & 0 deletions vendor/src/github.com/vbatts/tar-split/README.md
@@ -0,0 +1,181 @@
tar-split
========

[![Build Status](https://travis-ci.org/vbatts/tar-split.svg?branch=master)](https://travis-ci.org/vbatts/tar-split)

Extend the upstream golang stdlib `archive/tar` library, to expose the raw
bytes of the TAR, rather than just the marshalled headers and file stream.

The goal is that, by preserving the raw bytes of each header, the padding
bytes, and the raw file payload, one can reassemble the original archive.


Docs
----

* https://godoc.org/github.com/vbatts/tar-split/tar/asm
* https://godoc.org/github.com/vbatts/tar-split/tar/storage
* https://godoc.org/github.com/vbatts/tar-split/archive/tar


Caveat
------

Eventually this should detect TARs for which precise reassembly is not possible.

For example, stored sparse files that have "holes" in them will be read as a
contiguous file, though the archive contents may be recorded in sparse format.
Therefore, when adding the file payload to a reassembled tar, the payload would
need to be precisely re-sparsified to achieve identical output. This is not
something I seek to fix immediately, but I would rather have an alert that
precise reassembly is not possible.
(see more http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)


Another caveat: while tar archives support having multiple file entries for the
same path, we will not support this feature. If there is more than one entry
with the same path, expect an err (like `ErrDuplicatePath`) or a resulting tar
stream that does not validate against your original checksum/signature.


Contract
--------

Do not break the API of stdlib `archive/tar` in our fork (ideally find an
upstream mergeable solution)


Std Version
-----------

The version of golang stdlib `archive/tar` is from go1.4.1, and their master branch around [a9dddb53f](https://github.com/golang/go/tree/a9dddb53f)


Example
-------

First we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:

```
git archive --format=tar -o tar-split.tar HEAD .
```

Then build the example main.go:

```
go build ./main.go
```

Now run the example over the archive:

```
$ ./main tar-split.tar
2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
pax_global_header pre: 512 read: 52
.travis.yml pre: 972 read: 374
DESIGN.md pre: 650 read: 1131
LICENSE pre: 917 read: 1075
README.md pre: 973 read: 4289
archive/ pre: 831 read: 0
archive/tar/ pre: 512 read: 0
archive/tar/common.go pre: 512 read: 7790
[...]
tar/storage/entry_test.go pre: 667 read: 1137
tar/storage/getter.go pre: 911 read: 2741
tar/storage/getter_test.go pre: 843 read: 1491
tar/storage/packer.go pre: 557 read: 3141
tar/storage/packer_test.go pre: 955 read: 3096
EOF padding: 1512
Remainder: 512
Size: 215040; Sum: 215040
```

*What are we seeing here?*

* `pre` is the header of a file entry, and potentially the padding from the
end of the prior file's payload. Also with particular tar extensions and pax
attributes, the header can exceed 512 bytes.
* `read` is the size of the file payload from the entry
* `EOF padding` is the expected 1024 null bytes on the end of a tar archive,
plus potential padding from the end of the prior file entry's payload
* `Remainder` is the remaining bytes of an archive. This is typically dead space,
as most tar implementations will return after having reached the end of the
1024 null bytes. Various implementations will include some amount of bytes
here, though, which affects the checksum of the resulting tar archive, so this
must be accounted for as well.
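
The `pre` and `EOF padding` values in the output above can be reproduced with
a little arithmetic, sketched below. This assumes every header is exactly one
512-byte block, which does not hold for entries with pax or other extended
headers; the function names are made up for this example.

```go
package main

import "fmt"

const blockSize = 512 // tar pads payloads to 512-byte blocks

// padding returns the number of null bytes that follow a payload of the
// given size to round it up to a whole block.
func padding(size int64) int64 {
	return (blockSize - size%blockSize) % blockSize
}

// expectedPre is the `pre` value of an entry: padding left over from the
// prior entry's payload plus one plain 512-byte header (assumption).
func expectedPre(prevPayload int64) int64 {
	return padding(prevPayload) + blockSize
}

// expectedEOFPadding is padding from the last payload plus the 1024 null
// bytes that terminate the archive.
func expectedEOFPadding(lastPayload int64) int64 {
	return padding(lastPayload) + 2*blockSize
}

func main() {
	fmt.Println(expectedPre(52))          // .travis.yml after pax_global_header: 972
	fmt.Println(expectedPre(374))         // DESIGN.md after .travis.yml: 650
	fmt.Println(expectedPre(1131))        // LICENSE after DESIGN.md: 917
	fmt.Println(expectedPre(0))           // archive/tar/ after archive/: 512
	fmt.Println(expectedEOFPadding(3096)) // after packer_test.go: 1512
}
```

Each printed value matches the corresponding line of the example run, which
is a quick sanity check that the recorded segments cover every byte.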

Ideally the input tar and the output `*.out` will match:

```
$ sha1sum tar-split.tar*
ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar
ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar.out
```


Stored Metadata
---------------

Since the raw bytes of the headers and padding are stored, you may be wondering
what the size implications are. There are at least 512 bytes of header per
file (sometimes more), at least 1024 null bytes on the end, and then various
padding. With a naive storage implementation, the stored metadata therefore
grows linearly with the number of files.

Reusing our prior example's `tar-split.tar`, let's build the checksize.go example:

```
go build ./checksize.go
```

```
$ ./checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
-- number of files: 50
-- size of metadata uncompressed: 53k
-- size of gzip compressed metadata: 3k
```

So, assuming you've managed the extraction of the archive yourself for reuse of
the file payloads from a relative path, the additional storage cost is as
little as 3 KB.

But let's look at a larger archive, with many files.

```
$ ls -sh ./d.tar
1.4G ./d.tar
$ ./checksize ~/d.tar
inspecting "/home/vbatts/d.tar" (size 1420749k)
-- number of files: 38718
-- size of metadata uncompressed: 43261k
-- size of gzip compressed metadata: 2251k
```

Here, an archive with 38,718 files has a compressed footprint of about 2 MB.

Rolling the null bytes at the end of the archive into the total, we can
estimate a bytes-per-file rate for the storage implications.

| uncompressed | compressed |
| :----------: | :--------: |
| ~1 KB per file | ~0.06 KB per file |
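
For reference, the per-file figures follow directly from the `checksize`
output for `d.tar`. The snippet below is only a worked calculation; the
helper name `perFileKB` is invented for it.

```go
package main

import "fmt"

// perFileKB divides a total metadata size (in KB) by the number of files.
func perFileKB(totalKB float64, files int) float64 {
	return totalKB / float64(files)
}

func main() {
	const files = 38718 // number of files reported for d.tar
	fmt.Printf("uncompressed: %.2f KB per file\n", perFileKB(43261, files)) // 1.12
	fmt.Printf("compressed:   %.2f KB per file\n", perFileKB(2251, files))  // 0.06
}
```

The smaller `tar-split.tar` example gives a similar rate: 53 KB and 3 KB over
50 files is about 1.06 KB and 0.06 KB per file.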


What's Next?
------------

* More implementations of storage Packer and Unpacker
  - could be a redis or mongo backend
* More implementations of FileGetter and FilePutter
  - could be a redis or mongo backend
* cli tooling to assemble/disassemble a provided tar archive
* would be interesting to have an assembler stream that implements `io.Seeker`

License
-------

See LICENSE

