Split Bz2 files on Hadoop or on Unix without decompressing the whole file
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


# Copyright (c) 2012 Yahoo! Inc. All rights reserved.
# Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.

Streaming BZip2 Splitter

This tool has been written by me to solve Hadoop's problem. Hadoop cannot split
bz2 files until 0.23. Some folks attempted to backport the patch to Hadoop 20
but its not stable, not available.

This is a simple tool which given one/more bz2 files, uses the bzip tools to
split it into multiple smaller files. The size of the smaller chunks/number
of chunks to split can be configured. During splitting, the tool ensures that
records are properly split among the chunks, i.e. a single record is only
available in one of the chunks and this is specially taken care. It can work
only if the record boundary is '\n'. The splitting happens fast even for
large text files because it does not attempt to decompress the whole file,
instead it only decompresses parts of the file required to handle record

Works with:
	Text files, bz2 compressed and record boundary '\n'

	Smaller files with same data but gzip compressed.

We use bzip2recover and generate smaller files and concatenate them into
larger files. Now Hadoop does not understand such files and reads only a
smaller portion of them. So, to fix them the bzip2 header of the chunks
should be fixed. Thats a little tricky and to get around the problem, I
bunzip them (yes bzip tools understand such formats) and gzip them - both
of them are pretty fast.

Drivers have been written to run the same logic on multiple files using
Hadoop Streaming. The scripts takes a directory of bz2 files, splits them
based on size/chunks, verifies them and launches another job to gzip them.
All along, the number of mappers can be controlled.

See Wiki for more information. Feedback welcome.