
stdin and stdout #3

Closed
richud opened this issue Jan 17, 2012 · 15 comments

Comments

@richud

richud commented Jan 17, 2012

Hi vasi,
Great program, but could you please add options for processing stdin/stdout? I need to pipe input in from one program and pipe the output to another.
I need a fast multithreaded decompressor for lzma/xz, like pigz!
Many thanks.

@vasi
Owner

vasi commented Jan 17, 2012

Hi, this should already work! E.g.:

pixz < data.tar > compressed.tpxz
pixz -d < compressed.xz > output

Is it not working for you in some way?

@vasi
Owner

vasi commented Jan 17, 2012

PS: Not all files can be decompressed using multiple CPUs. You will only get the speed-up if it was originally compressed with pixz or another tool that compresses by segments.

@richud
Author

richud commented Jan 17, 2012

Thanks for the quick reply. This is a simplified version of what I am trying to do, as I do it with xz:

wget -qO- http://xxx/myimage.xz | xz -c -d | ntfsclone -r --overwrite /dev/sda1

I could not get pixz to do this.
Thanks for any help

@richud richud closed this as completed Jan 17, 2012
@vasi
Owner

vasi commented Jan 17, 2012

pixz can only work on seekable compressed data. This is due to limitations of the .xz format, not anything about pixz itself. I suppose it would be possible to detect that it's un-seekable and just revert to single-CPU mode, but there's no real advantage to using pixz over xz in this case.

In the specific case of HTTP, it would theoretically be possible to make Range requests in lieu of seeking. Maybe there's a FUSE filesystem somewhere that can "mount" an HTTP request as a file? That would be really interesting!
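The fallback decision described here comes down to a simple seekability probe. The sketch below (Python, purely illustrative; `decode_strategy` is a made-up name, not pixz code) shows why a pipe from wget forces the single-CPU path:

```python
import io
import os

def decode_strategy(f):
    # Hypothetical helper, not pixz's actual code: the xz index lives at
    # the end of the file, so parallel decompression needs random access;
    # a pipe only allows forward reads, forcing single-threaded mode.
    return "parallel" if f.seekable() else "single-threaded"

# An in-memory file (like a regular file on disk) supports seeking.
print(decode_strategy(io.BytesIO(b"\xfd7zXZ\x00")))    # parallel

# A pipe -- what `wget -qO- | pixz -d` provides on stdin -- does not.
r, w = os.pipe()
with os.fdopen(r, "rb") as pipe_end:
    print(decode_strategy(pipe_end))                   # single-threaded
os.close(w)
```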

@richud
Author

richud commented Jan 17, 2012

vasi: sorry, I am somewhat confused! [Everything below is on the same hardware, a mid-range Core2Duo.]

I am currently using pigz to multithread-decompress gzip piped from wget, as above. On a gigabit connection, both CPUs load to about 30%, but HDD write speed is actually the limiting factor.
I have tried the above using xz (5.1.1) with a 'normal' xz image, and that also works from wget, but both CPU cores only load to 40-60% and the HDD isn't the limit, yet it takes about 4x longer. I don't understand what the bottleneck is in this case, nor why both cores show load, since as I understand it xz 5.1.1 only multithreads compression, not decompression.
Since it works, I assume a normal xz image doesn't need to be seekable.

Are you saying pixz creates an image that needs seeking, whereas normal xz is streamable?

Thanks!

@richud
Author

richud commented Jan 17, 2012

Hmm, an image created with pixz streams OK through xz when decompressing via wget.

@vasi
Owner

vasi commented Jan 17, 2012

Ok, first I'll explain the CPU usage thing, though it's a bit extraneous. Basically, when something is single-threaded, it can only use one CPU at a time, but which CPU it's using at any moment is arbitrary; that's up to the operating system. So imagine a usage pattern like this:

Time    CPU 1    CPU 2
0       xz       free
10ms    free     xz
20ms    xz       free
30ms    free     xz
40ms    xz       free
etc...

Your CPU usage monitor probably only measures usage once a second, so it will see this as 500ms on CPU1, and 500ms on CPU2, and will show you 50% on each CPU.
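The arithmetic in that last sentence checks out. A toy tally of the 10 ms slices above (purely illustrative) reproduces the 50%-per-CPU reading a once-a-second monitor would show:

```python
# One second of the schedule above, in 10 ms slices: the single xz thread
# alternates between the two CPUs, so each CPU is busy half the time.
slices = ["cpu1" if (t // 10) % 2 == 0 else "cpu2" for t in range(0, 1000, 10)]
busy_ms = {cpu: slices.count(cpu) * 10 for cpu in ("cpu1", "cpu2")}
print(busy_ms)  # {'cpu1': 500, 'cpu2': 500} -> a monitor reports 50% on each
```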

Ok, now back to xz. Data compressed by pixz is entirely compatible with the .xz format. It can be decompressed by xz, and doesn't require seeking to decompress. You can totally do 'pixz < some.data | xz -cd > thesame.data'.

What does require seeking is multi-threaded decompression. If a .xz file (from xz or pixz) contains multiple 'segments' that can be decompressed in parallel, information about those segments is usually at the end of the file. So without seeking, those segments can't be found, and decompression has to be single-threaded.

I realize now that it's theoretically possible to create a .xz file with multiple segments, and with segment information stored inline. This could in fact be both streamed and parallelized. Unfortunately, it's a fair bit more difficult to create files like this.
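The "doesn't require seeking" point is easy to demonstrate with any xz decoder that accepts data incrementally. A small sketch using Python's stdlib `lzma` module as a stand-in for piping through `xz -cd`:

```python
import lzma

# Compress some data (lzma.compress defaults to the .xz container), then
# decompress it in small forward-only chunks, the way a decoder consumes
# a pipe: the format decodes front-to-back, no seeking needed. Only
# *parallel* decoding needs the index stored at the end of the file.
payload = b"hello xz streaming " * 500
compressed = lzma.compress(payload)

dec = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
out = bytearray()
for i in range(0, len(compressed), 64):      # 64-byte chunks, as from a pipe
    out += dec.decompress(compressed[i:i + 64])

assert bytes(out) == payload and dec.eof
```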

@richud
Author

richud commented Jan 17, 2012

Sorry, you are quite right about the CPU/threads, now that I've looked at top more carefully while it was running! That was stupid of me.

I guess I will have to stick with pigz then, as it's so much faster at decompressing :(
(Faster than plain gzip, that is; now I see pigz's decompression is also single-threaded, but the docs say it does other things in separate threads.)

My image shrank a lot with xz: 3.5 GB (gzip) down to 2.6 GB (xz). I like your inline idea; I guess you would get a unique feature and speed up decompression greatly, since most people have at least 2 cores nowadays?

Do you know of anything that's currently multithreaded for decompression and compresses better than .gz?

@vasi vasi reopened this Jan 17, 2012
@vasi
Owner

vasi commented Jan 17, 2012

Maybe threadzip? I'll see if I can modify pixz to produce files that can be streamed and decompressed in parallel, but it won't be right away.

@richud
Author

richud commented Mar 31, 2012

Any joy updating for multithreaded decompression?

@vasi
Owner

vasi commented Oct 13, 2012

Alright, so I'm just documenting what needs to happen to make this work:

For parallel decompression, we have to be able to split the compressed data. There are two cases that allow this:

  • The file is seekable, and an xz index is present at the end of the file.
  • Blocks contain a "compressed size" field in their headers. This allows streaming. Currently pixz compresses files without adding "compressed size" to blocks, but the xz 5.1 alpha does the right thing.

When decompressing, we may encounter the following cases:

  • An index is accessible: Seekable files produced by current pixz or xz 5.1. We should try to use the file-index if present for fast listing/extraction. Decompression should be done in parallel.
  • No index is accessible, but blocks have "compressed size" fields: Streaming files produced by xz 5.1 or future pixz. We can't access the file-index, and should warn if the user requests filtered extraction. Decompression should be done in parallel.
  • No index is accessible, and blocks have no "compressed size": Streaming files produced by current pixz, and all files produced by xz 5.0.x and earlier. We can't access the file-index, see above. We also can't split the compressed data, so we fall back to single-threaded decompression.

But we expect some weird/unfortunate occurrences:

  • XZ files can contain multiple streams, eg: if they're concatenated. We should attempt to find all indexes and combine them. This is incompatible with a pixz file-index, so we should ensure we don't attempt to use one.
  • Some input may be partially parallelizable. Maybe two xz files were concatenated, one from xz 5.1 and one from xz 5.0. We should do blocks with "compressed size" in parallel, but blocks without should be done single-threaded.
  • Some or all blocks may be too large for memory. (This will especially happen with xz 5.0 and earlier, since it doesn't do any splitting at all.) We must force those blocks to be decompressed single-threaded. Currently all decompression is in-memory, so we have to ensure we have a more stream-oriented way available as well.
  • We might encounter an index that disagrees with "compressed size". This is definitely an error in the input, so we should at least warn the user. We could either exit with error, or attempt to continue by choosing either the index or "compressed size".
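The multiple-streams case in the first bullet can be reproduced in miniature with Python's stdlib `lzma` (illustrative only; pixz itself is C on liblzma):

```python
import lzma

# `cat a.xz b.xz > both.xz` yields a valid .xz input containing two
# streams, each carrying its own index at its own end.
both = lzma.compress(b"first stream ") + lzma.compress(b"second stream")

# A correct decoder walks every stream and concatenates the results:
assert lzma.decompress(both) == b"first stream second stream"

# A naive decoder that stops at the first end-of-stream marker would
# return only b"first stream " and silently drop the rest.
```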

The implementation plan:

  1. In the compressor, add the "compressed size" field to blocks. This is done by writing the block header after compressing the block contents.
  2. Support dynamic decompression block sizes, since while streaming we can't precalculate the necessary block size.
  3. Support using "compressed size" as the decompression block size, triggered manually.
  4. Support complete absence of the index, and use that to trigger "compressed size".
  5. Add a streaming mode for decompression, triggered manually. Instead of reading large chunks of data and passing them to the decode threads, the read thread will in this mode do the decoding itself, in small chunks. When it accumulates enough decompressed output, it will send the output directly to the writer thread, and continue the same decompression instead of starting anew. When it reaches the end of a block, it has to make sure it keeps any leftover input data around.
  6. Trigger streaming-mode on a block-by-block basis, when "compressed size" and the index are both not available, or a block is over some size threshold.
  7. Support the presence of multiple streams, including combining multiple indexes.

@vasi
Owner

vasi commented Oct 15, 2012

I've done parts 1, 2, and 3. Gonna re-order the other ones, and implement "streaming mode" next.

Unfortunately it's difficult, because in stream mode we only find out we're at the end of a block when liblzma tells us it's done. So we'll probably have read a bit too far, and have data left over. We need every part of the decompressor to deal with arbitrary amounts of initial data that may or may not already have been read. Ugh :(
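For comparison, Python's stdlib `lzma` surfaces the same over-read problem as `unused_data`: once the decoder reports end-of-stream, the extra bytes it swallowed have to be carried into the next decode. A loop shaped like the streaming mode described above (an illustrative sketch, not pixz's C code):

```python
import io
import lzma

def stream_decode(src, dst, chunk=64):
    # Illustrative sketch of a streaming decoder: read small forward-only
    # chunks, and when a stream ends, carry the over-read bytes
    # (dec.unused_data) into the next stream's decoder.
    leftover = b""
    while True:
        dec = lzma.LZMADecompressor()
        while not dec.eof:
            data = leftover or src.read(chunk)
            leftover = b""
            if not data:
                return                       # input exhausted
            dst.write(dec.decompress(data))
        leftover = dec.unused_data           # bytes read past the stream end

# Two concatenated streams fed through a pipe-like source:
both = lzma.compress(b"abc" * 100) + lzma.compress(b"def" * 100)
out = io.BytesIO()
stream_decode(io.BytesIO(both), out)
assert out.getvalue() == b"abc" * 100 + b"def" * 100
```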

@vasi
Owner

vasi commented Oct 15, 2012

Part 1, writing compressed/uncompressed size into block headers, is committed to master. Any archives you create with pixz should now be streamable when, eventually, I finish the streaming work.

The current progress on streaming is in branch 'stream'. Be aware that this is a temporary branch, I may do amends and rebases.

@vasi
Owner

vasi commented Nov 5, 2012

Ok, branch 'stream' has this feature implemented! :D https://github.com/vasi/pixz/tree/stream

A lot of changes to the codebase were involved, so I would hugely appreciate testing. Let me know how it goes!

@vasi
Owner

vasi commented Nov 19, 2012

A user helped with testing. Merged into master.
