# Copyright (c) 2012 Yahoo! Inc. All rights reserved.
# Copyrights licensed under the New BSD License. See the accompanying LICENSE file for terms.

Streaming BZip2 Splitter
------------------------

I wrote this tool to work around a Hadoop limitation: Hadoop cannot split
bz2 files until 0.23. Some folks attempted to backport the patch to Hadoop
20, but it is neither stable nor generally available.

This is a simple tool which, given one or more bz2 files, uses the bzip
tools to split them into multiple smaller files. The size of the smaller
chunks, or the number of chunks to split into, can be configured. During
splitting, the tool takes special care that records are split cleanly
among the chunks, i.e. each record ends up in exactly one chunk. It works
only if the record boundary is '\n'. Splitting is fast even for large
text files because the tool does not decompress the whole file; it
decompresses only the parts needed to handle record boundaries.

Works with:
Text files, bz2 compressed, with record boundary '\n'

Generates:
Smaller files with the same data, but gzip compressed.

We use bzip2recover to generate smaller files and concatenate them into
larger ones. Hadoop does not understand such concatenated files and reads
only a small portion of them, so the bzip2 headers of the chunks would
have to be fixed. That is a little tricky, so to get around the problem I
bunzip the chunks (yes, the bzip tools understand such formats) and gzip
them; both steps are pretty fast.

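The manual pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scripts: it assumes bzip2, bzip2recover, and gzip are installed, and the working directory and file names are made up for the example.

```shell
#!/bin/sh
set -e
mkdir -p /tmp/bz2split && cd /tmp/bz2split

# Build a sample bz2 file of newline-delimited records.
# -1 uses 100k blocks, so even a small file yields several blocks.
seq 1 100000 > input.txt
bzip2 -1 -k -f input.txt            # writes input.txt.bz2, keeps input.txt

# bzip2recover emits one small .bz2 file per compressed block.
bzip2recover input.txt.bz2          # emits rec*input.txt.bz2 files

# Concatenate block files into a larger chunk. The bzip tools can
# decompress such concatenated streams, but pre-0.23 Hadoop cannot.
cat rec*input.txt.bz2 > chunk1.bz2

# Re-encode the chunk as gzip, which Hadoop reads in full.
bunzip2 -c chunk1.bz2 | gzip > chunk1.gz

# Sanity check: the round-tripped chunk matches the original data.
gzip -dc chunk1.gz | cmp - input.txt && echo "chunk OK"
```

Note that the real tool also cuts chunks on '\n' record boundaries; this sketch concatenates all recovered blocks back into one chunk, so that step does not come into play here.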
Drivers have been written to run the same logic on multiple files using
Hadoop Streaming. The scripts take a directory of bz2 files, split the
files based on size/number of chunks, verify them, and launch another job
to gzip them. All along, the number of mappers can be controlled.
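For flavor, such a driver might launch the per-file work through Hadoop Streaming along these lines. This is an illustrative invocation only, not the project's actual scripts: the paths, the mapper script name split_bz2.sh, and the streaming jar location are all assumptions.

```shell
# Map-only streaming job: each mapper receives bz2 file names on stdin
# and runs the split/re-encode logic on one file. The -D properties
# cap the number of mappers and disable the reduce phase.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=0 \
    -input /user/me/bz2-file-list \
    -output /user/me/split-output \
    -mapper ./split_bz2.sh \
    -file split_bz2.sh
```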
See the Wiki for more information. Feedback welcome.

Thanks!
Thiruvel
@thiruvel