Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
# Copyright (c) 2012 Yahoo! Inc. All rights reserved. # Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms. Streaming BZip2 Splitter ------------------------ This tool has been written by me to solve Hadoop's problem. Hadoop cannot split bz2 files until 0.23. Some folks attempted to backport the patch to Hadoop 20 but its not stable, not available. This is a simple tool which given one/more bz2 files, uses the bzip tools to split it into multiple smaller files. The size of the smaller chunks/number of chunks to split can be configured. During splitting, the tool ensures that records are properly split among the chunks, i.e. a single record is only available in one of the chunks and this is specially taken care. It can work only if the record boundary is '\n'. The splitting happens fast even for large text files because it does not attempt to decompress the whole file, instead it only decompresses parts of the file required to handle record boundaries. Works with: Text files, bz2 compressed and record boundary '\n' Generates: Smaller files with same data but gzip compressed. We use bzip2recover and generate smaller files and concatenate them into larger files. Now Hadoop does not understand such files and reads only a smaller portion of them. So, to fix them the bzip2 header of the chunks should be fixed. Thats a little tricky and to get around the problem, I bunzip them (yes bzip tools understand such formats) and gzip them - both of them are pretty fast. Drivers have been written to run the same logic on multiple files using Hadoop Streaming. The scripts takes a directory of bz2 files, splits them based on size/chunks, verifies them and launches another job to gzip them. All along, the number of mappers can be controlled. See Wiki for more information. Feedback welcome. Thanks! Thiruvel @thiruvel