LZOP compression corrupts output for specific input #20

Closed
alexholmes opened this Issue Apr 12, 2011 · 6 comments


We're using hadoop-lzo 0.4.7 with the patch for the empty-file infinite loop (kevinweil/hadoop-lzo@9d06b25).

For a specific input string, LzopCodec appears to corrupt the data during a compress/decompress round trip. We have a repeatable test case demonstrating this. Its output follows: the first block shows the file contents before compression, and the second block shows the corrupted contents after the file has been compressed and then decompressed:

***************************************************
* Content being compressed
*   /tmp/lzop-test129499022709115218.cleartext
***************************************************
0.5 74  25425
0.9 200 25384
0.95    203 4
0.98    211 2
0.99    219 3
0.995   240 5
***************************************************

***************************************************
* Content after compression/decompression
*   compressed:   /tmp/lzop-test129499022709115218.cleartext.lzop
*   uncompressed: /tmp/lzop-test129499022709115218.cleartext.uncompressed
***************************************************
0.5 74  25425
0.9 200 25384t5 203 ?8  211 2u9H
                                                    9   3
0.995   240 5
***************************************************

If I use the lzop binary (LZOP(1)) to compress and decompress the same file, it works as expected:

$ lzop -c /tmp/lzop-test129499022709115218.cleartext  > output.lzop
$ lzop -d output.lzop
$ cat output
0.5 74  25425
0.9 200 25384
0.95    203 4
0.98    211 2
0.99    219 3
0.995   240 5

One more interesting data point: if we set the LZO buffer size to a small value, the output is not corrupted:

c.set(LzoCodec.LZO_BUFFER_SIZE_KEY, "1024");

This is the test code to reproduce the problem:

import com.hadoop.compression.lzo.LzoCodec;
import com.hadoop.compression.lzo.LzopCodec;
import org.apache.commons.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.*;
import java.net.URI;

public class LzoTestCorruptReduceOutput4 {

    static String inputData = "0.5\t74\t25425\n"+
            "0.9\t200\t25384\n"+
            "0.95\t203\t4\n"+
            "0.98\t211\t2\n"+
            "0.99\t219\t3\n"+
            "0.995\t240\t5";

    static LzopCodec codec = new LzopCodec();
    static RawLocalFileSystem fs = new RawLocalFileSystem();

    public static void main(String[] args) throws Exception {

        Configuration c = new Configuration();
        // uncommenting the following line seems to fix the corruption
        // c.set(LzoCodec.LZO_BUFFER_SIZE_KEY, "1024");

        codec.setConf(c);
        fs.setConf(new Configuration());
        fs.initialize(new URI("file:///"), new Configuration());

        File cleartextFile = File.createTempFile("lzop-test", ".cleartext");
        File cleartextUncompressedFile = new File(cleartextFile.getAbsoluteFile() + ".uncompressed");
        File compressedFile = new File(cleartextFile.getAbsoluteFile() + ".lzop");

        FileUtils.writeStringToFile(cleartextFile, inputData);

        System.out.println("");
        System.out.println("***************************************************");
        System.out.println("* Content being compressed");
        System.out.println("*   " + cleartextFile.getAbsolutePath());
        System.out.println("***************************************************");
        System.out.println(inputData);
        System.out.println("***************************************************");

        compress(cleartextFile, compressedFile);

        decompress(compressedFile, cleartextUncompressedFile);

        System.out.println("");
        System.out.println("***************************************************");
        System.out.println("* Content after compression/decompression");
        System.out.println("*   compressed:   " + compressedFile.getAbsolutePath());
        System.out.println("*   uncompressed: " + cleartextUncompressedFile.getAbsolutePath());
        System.out.println("***************************************************");
        System.out.println(FileUtils.readFileToString(cleartextUncompressedFile));
        System.out.println("***************************************************");
    }

    public static void compress(File input, File output) throws IOException {
        copyStream(
                fs.open(new Path(input.getAbsolutePath())),
                codec.createOutputStream(fs.create(new Path(output.getAbsolutePath()), true)));
    }

    public static void decompress(File input, File output) throws IOException {
        copyStream(
                codec.createInputStream(fs.open(new Path(input.getAbsolutePath()))),
                fs.create(new Path(output.getAbsolutePath()), true));
    }

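    public static void copyStream(InputStream is, OutputStream os) throws IOException {
        try {
            IOUtils.copy(is, os);
        } finally {
            // close both streams, not just the output, so the input handle is not leaked
            IOUtils.closeQuietly(os);
            IOUtils.closeQuietly(is);
        }
    }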
}

Any assistance would be greatly appreciated.

Contributor

miguno commented Apr 13, 2011

I think I have a patch for the issue. I'll add some test cases for it and submit it (pull request) for both Kevin's and Todd's branch.

Contributor

miguno commented Apr 13, 2011

I have a patch available at miguno@72b439b05407e05f9ae53aad4273244cb8eaaf7b. It is based on the HEAD of Todd's branch, i.e. commit toddlipcon@2bd0d5b. The patch includes a test case using the problematic input data reported by @alexholmes.

I have also sent a pull request to Todd to merge the patch into his branch:
toddlipcon#4

bliu commented Apr 13, 2011

Hello,

Do we know whether version 0.4.9 is affected by this?

Regards...

Contributor

miguno commented Apr 14, 2011

Looking just at the code, I'd say yes.

If you want to be 100% sure, my patch (miguno@72b439b) includes a test case for triggering the bug. So you could use the respective input data added by the patch (src/test/data/issue20-lzop.txt) to find out.
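For anyone checking their own version this way, a byte-for-byte comparison is more reliable than eyeballing the printed output. The following is only a minimal sketch of such a round-trip check. The class `RoundTripCheckSketch`, the two wrapper interfaces, and `roundTripsCleanly` are hypothetical names introduced here, and GZIP from the JDK stands in for the codec so the example runs without hadoop-lzo on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class RoundTripCheckSketch {

    // Hypothetical adapters: wrap a raw stream in a compressing / decompressing stream.
    interface OutWrapper { OutputStream wrap(OutputStream raw) throws IOException; }
    interface InWrapper { InputStream wrap(InputStream raw) throws IOException; }

    // Compresses the input in memory, decompresses it again, and reports
    // whether the restored bytes are identical to the original bytes.
    static boolean roundTripsCleanly(byte[] input, OutWrapper enc, InWrapper dec) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (OutputStream out = enc.wrap(compressed)) {
            out.write(input);
        } // closing the wrapper finishes the compressed stream

        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (InputStream in = dec.wrap(new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                restored.write(buf, 0, n);
            }
        }
        return Arrays.equals(input, restored.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Same input string as the repro in this issue.
        byte[] data = ("0.5\t74\t25425\n" + "0.9\t200\t25384\n" + "0.95\t203\t4\n"
                + "0.98\t211\t2\n" + "0.99\t219\t3\n" + "0.995\t240\t5")
                .getBytes(StandardCharsets.UTF_8);

        // GZIP is a stand-in codec here, not the codec under test.
        System.out.println(roundTripsCleanly(data, GZIPOutputStream::new, GZIPInputStream::new));
    }
}
```

With hadoop-lzo available, the same check could presumably be pointed at a configured `LzopCodec` by passing `codec::createOutputStream` and `codec::createInputStream`; that wiring is untested here.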

Contributor

kevinweil commented Apr 24, 2011

Miguno, this fix looks good. Can you send me a pull request as well? Thanks.

rangadi added a commit that referenced this issue Jun 3, 2011

Merge pull request #21 from miguno/master
Fix issue #20: write uncompressed bytes if uncompressed size==compressed size
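Going only by the commit message above, the rule the fix enforces can be sketched as follows. This is an illustration, not the actual hadoop-lzo code: `LzopBlockPayloadSketch`, `choosePayload`, and `readerTreatsAsStored` are hypothetical names. It also assumes that an lzop reader treats a block whose recorded compressed length equals its uncompressed length as stored (uncompressed) data, which would explain why writing compressed bytes in that case corrupts the output.

```java
public class LzopBlockPayloadSketch {

    // Writer side (hypothetical helper): pick the bytes to emit for one block.
    // Compressed bytes are used only when they are strictly smaller; on a tie
    // (or when compression expands the data) the raw bytes are written instead.
    static byte[] choosePayload(byte[] raw, byte[] compressed) {
        return compressed.length < raw.length ? compressed : raw;
    }

    // Reader side (assumption about the lzop block format): equal lengths in
    // the block header signal a stored block that must not be decompressed.
    static boolean readerTreatsAsStored(int uncompressedLen, int compressedLen) {
        return compressedLen == uncompressedLen;
    }

    public static void main(String[] args) {
        byte[] raw = {1, 2, 3, 4};
        byte[] sameSizeCompressed = {9, 9, 9, 9}; // compressor output of equal length

        byte[] payload = choosePayload(raw, sameSizeCompressed);

        // The reader will copy this block verbatim, so the payload has to be the raw bytes.
        System.out.println("reader treats block as stored: "
                + readerTreatsAsStored(raw.length, payload.length));
        System.out.println("writer emitted raw bytes:      " + (payload == raw));
    }
}
```

A writer that emits the compressed bytes on a tie would produce a block the reader copies through unchanged, so the LZO-encoded bytes would end up in the decompressed file. That is consistent with the partially garbled lines in the report, and with the smaller buffer size changing the block sizes so that the tie no longer occurs.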

@alexholmes alexholmes closed this Oct 27, 2011
