stdin and stdout #3
Hi, this should already work! E.g.: `pixz < data.tar > compressed.tpxz`. Is it not working for you in some way?
PS: Not all files can be decompressed using multiple CPUs. You will only get the speed-up if the file was originally compressed with pixz or another tool that compresses by segments.
Thanks for the quick reply. This is a simplified version of what I am trying to do, as I do it with xz: `wget -qO- http://xxx/myimage.xz | xz -c -d | ntfsclone -r --overwrite /dev/sda1`. I could not get pixz to do this.
pixz can only work on seekable compressed data. This is due to limitations of the .xz format, not anything about pixz itself. I suppose it would be possible to detect that it's un-seekable and just revert to single-CPU mode, but there's no real advantage to using pixz over xz in this case. In the specific case of HTTP, it would theoretically be possible to make Range requests in lieu of seeking. Maybe there's a FUSE filesystem somewhere that can "mount" an HTTP request as a file? That would be really interesting!
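The Range-request idea could be sketched like this (a hypothetical helper, not part of pixz; the name `fetch_tail` is made up for illustration, and it only works against a server that honours the `Range` header):

```python
import urllib.request

def fetch_tail(url, nbytes):
    # Hypothetical helper: read only the last `nbytes` of a remote file,
    # emulating a seek-to-end without downloading the whole resource.
    # The end of a .xz file is where the index of compressed blocks lives.
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{nbytes}"})
    with urllib.request.urlopen(req) as resp:
        # A server that honours Range replies 206 Partial Content;
        # a 200 means it ignored the header and sent everything.
        if resp.status != 206:
            raise RuntimeError("server ignored the Range header")
        return resp.read()
```

A FUSE layer doing the same thing would translate each read()-at-offset into a `bytes=start-end` Range request.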
vasi: sorry, I am somewhat confused! [Below is on the same hardware, a mid-range Core 2 Duo.] I am currently using pigz to multithread-decompress gzip piped from wget as above. Gigabit connection; both CPUs are loaded to about 30%, but HDD write speed is actually the limiting factor. Are you saying pixz creates an image that needs seeking, whereas normal xz is streamable? Thanks!
Hmm, an image created with pixz streams OK using xz, decompressing through wget.
Ok, first I'll explain the CPU usage thing, though it's a bit extraneous. Basically, when something is single-threaded, it can only use one CPU at a time, but which CPU it's using at any moment is arbitrary; that's up to the operating system. So imagine a usage pattern like this:
Your CPU usage monitor probably only measures usage once a second, so it will see this as 500 ms on CPU1 and 500 ms on CPU2, and will show you 50% on each CPU.

Ok, now back to xz. Data compressed by pixz is entirely compatible with the .xz format. It can be decompressed by xz, and doesn't require seeking to decompress. You can totally do `pixz < some.data | xz -cd > thesame.data`.

What does require seeking is multi-threaded decompression. If a .xz file (from xz or pixz) contains multiple "segments" that can be decompressed in parallel, information about those segments is usually at the end of the file. So without seeking, those segments can't be found, and decompression has to be single-threaded.

I realize now that it's theoretically possible to create a .xz file with multiple segments, and with segment information stored inline. Such a file could in fact be both streamed and parallelized. Unfortunately, it's a fair bit more difficult to create files like this.
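The "no seeking needed for single-threaded decompression" point can be seen concretely with a small sketch using Python's standard lzma module (standing in for xz; this is not pixz code): the .xz data is consumed strictly front to back, in pipe-sized chunks, never jumping to the end of the file.

```python
import io
import lzma

original = b"hello, streams! " * 1000
compressed = lzma.compress(original, format=lzma.FORMAT_XZ)

# Feed the compressed bytes in small chunks, as a pipe would deliver
# them. No seeking happens: single-threaded .xz decompression is
# purely sequential, which is why `xz -cd` works fine after wget.
decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
out = bytearray()
stream = io.BytesIO(compressed)
while chunk := stream.read(64):
    out += decomp.decompress(chunk)

assert bytes(out) == original
```

Parallel decompression is the part that needs the block index from the end of the file, which a pipe cannot provide.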
Sorry, you are quite right about CPUs/threads; now I have looked at top more carefully while it was running. That was stupid of me. I guess I will have to stick with pigz then, as it's so much faster at decompressing :( My image shrank a lot with xz: 3.5 GB (gzip) down to 2.6 GB (xz). I like your inline idea; I guess you would have a unique feature and speed up decompression greatly, as most people have at least 2 cores nowadays? Do you know of anything that's currently multithreaded for decompression and compresses better than .gz?
Maybe threadzip? I'll see if I can modify pixz to produce files that can be streamed and decompressed in parallel, but it won't be right away.
Any joy updating for multithreaded decompression?
Alright, so I'm just documenting what needs to happen to make this work: For parallel decompression, we have to be able to split the compressed data. There are two cases that allow this:
When decompressing, we may encounter the following cases:
But we expect some weird/unfortunate occurrences:
The implementation plan:
I've done parts 1, 2, and 3. Gonna re-order the other ones, and implement "streaming mode" next. Unfortunately it's difficult, because in stream mode we only find out we're at the end of a block when liblzma tells us it's done. So we'll probably have read a bit too far, and have data left over. We need every part of the decompressor to deal with arbitrary amounts of initial data that may or may not already have been read. Ugh :(
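The leftover-data problem can be seen with Python's lzma module (an analogy using two concatenated .xz streams, not pixz's actual block-level handling): once the decompressor reports end-of-stream, the bytes we over-read are left over and must seed the next decompressor.

```python
import lzma

# Two independent .xz streams back to back, analogous to
# consecutive compressed blocks in a file.
part1 = lzma.compress(b"first block ", format=lzma.FORMAT_XZ)
part2 = lzma.compress(b"second block", format=lzma.FORMAT_XZ)
blob = part1 + part2

# Naively feed everything to one decompressor: it stops at the end
# of the first stream and stashes the over-read bytes in .unused_data.
d1 = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
out = d1.decompress(blob)
assert d1.eof and d1.unused_data  # we read too far

# The leftover must be handed to the next decompressor; this is the
# bookkeeping every stage of a streaming decompressor has to handle.
d2 = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
out += d2.decompress(d1.unused_data)
print(out)  # b'first block second block'
```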
Part 1, writing compressed/uncompressed size into block headers, is committed to master. Any archives you create with pixz should now be streamable when, eventually, I finish the streaming work. The current progress on streaming is in branch 'stream'. Be aware that this is a temporary branch, I may do amends and rebases. |
Ok, branch 'stream' has this feature implemented! :D https://github.com/vasi/pixz/tree/stream A lot of changes to the codebase were involved, so I would hugely appreciate testing. Let me know how it goes!
A user helped testing. Merged into master. |
Hi vasi,
Great program, but could you please add options for processing stdin/stdout? I need to pipe input in from a program and pipe the output to another program.
I need a fast multithreaded decompressor for lzma/xz, like pigz!
Many thanks.