Format change

commit f26708a7b09304db11c47f99bd56c383b6ed79c0 1 parent db24328
@thiruvel authored
Showing with 62 additions and 52 deletions.
  1. +62 −52 README
114 README
@@ -4,13 +4,18 @@
Streaming BZip2 Splitter
------------------------
-This tool has been written by me to solve Hadoop's problem. Hadoop cannot split bz2 files until 0.23. Some folks attempted to backport the patch to
-Hadoop 20 but its not stable, not available.
-
-This is a simple tool which given one/more bz2 files, uses the bzip tools to split it into multiple smaller files. The size of the smaller chunks/number
-of chunks to split can be configured. During splitting, the tool ensures that records are properly split among the chunks, i.e. a single record is only
-available in one of the chunks and this is specially taken care. It can work only if the record boundary is '\n'. The splitting happens fast even for
-large text files because it does not attempt to decompress the whole file, instead it only decompresses parts of the file required to handle record
+This tool has been written by me to solve Hadoop's problem. Hadoop cannot split
+bz2 files until 0.23. Some folks attempted to backport the patch to Hadoop 20
+but its not stable, not available.
+
+This is a simple tool which given one/more bz2 files, uses the bzip tools to
+split it into multiple smaller files. The size of the smaller chunks/number
+of chunks to split can be configured. During splitting, the tool ensures that
+records are properly split among the chunks, i.e. a single record is only
+available in one of the chunks and this is specially taken care. It can work
+only if the record boundary is '\n'. The splitting happens fast even for
+large text files because it does not attempt to decompress the whole file,
+instead it only decompresses parts of the file required to handle record
boundaries.
Works with:
@@ -19,27 +24,32 @@ Works with:
Generates:
Smaller files with the same data, but gzip compressed.
-We use bzip2recover and generate smaller files and concatenate them into larger files. Now Hadoop does not understand such files and reads only a
-smaller portion of them. So, to fix them the bzip2 header of the chunks should be fixed. Thats a little tricky and to get around the problem, I
-bunzip them (yes bzip tools understand such formats) and gzip them - both of them are pretty fast.
+We use bzip2recover and generate smaller files and concatenate them into
+larger files. Now Hadoop does not understand such files and reads only a
+smaller portion of them. So, to fix them the bzip2 header of the chunks
+should be fixed. Thats a little tricky and to get around the problem, I
+bunzip them (yes bzip tools understand such formats) and gzip them - both
+of them are pretty fast.
-Drivers have been written to run the same logic on multiple files using Hadoop Streaming. The scripts takes a directory of bz2 files, splits them
-based on size/chunks, verifies them and launches another job to gzip them. All along, the number of mappers can be controlled.
+Drivers have been written to run the same logic on multiple files using
+Hadoop Streaming. The scripts takes a directory of bz2 files, splits them
+based on size/chunks, verifies them and launches another job to gzip them.
+All along, the number of mappers can be controlled.
Here is a sample usage of the script:
[thiruvel@localhost StreamingBz2Split]$ ./run.sh -h
./run.sh: [-t] [-c <chunk size> | -n <number of chunks>] [-v] [-m no_maps] -i input_dir -o output_dir
- -t - Verify integrity of the input bzip2 files. OFF by default.
- -c - Chunk size of each bzip2 split in MB, final size of gzip files may vary. 4 by default.
- -n - Number of chunks to be generated, mutually exclusive to -c. Disabled by default.
- -v - Verify rowcounts between input and output - OFF by default.
- -m - Number of Maps to be launched, default number of maps = number of files.
- -i - Input dir. The directory should exist and contain bz2 files. Other files will be ignored.
- -o - Output dir. The directory will be cleaned if it exists and the output split files in .gz
- format will be placed here. It will also be used as a scratch directory.
- -h - Print usage
+ -t - Verify integrity of the input bzip2 files. OFF by default.
+ -c - Chunk size of each bzip2 split in MB, final size of gzip files may vary. 4 by default.
+ -n - Number of chunks to be generated, mutually exclusive to -c. Disabled by default.
+ -v - Verify rowcounts between input and output - OFF by default.
+ -m - Number of Maps to be launched, default number of maps = number of files.
+ -i - Input dir. The directory should exist and contain bz2 files. Other files will be ignored.
+ -o - Output dir. The directory will be cleaned if it exists and the output split files in .gz
+ format will be placed here. It will also be used as a scratch directory.
+ -h - Print usage
Example:
@@ -49,7 +59,7 @@ Input:
~~~~~
[thiruvel@localhost StreamingBz2Split]$ hadoop fs -ls /tmp/input
Found 1 items
--rw-r--r-- 1 thiruvel supergroup 66410623 2012-03-16 03:09 /tmp/input/cobaltblue.bz2
+-rw-r--r-- 1 thiruvel supergroup 66410623 2012-03-16 03:09 /tmp/input/CB.bz2
[thiruvel@localhost StreamingBz2Split]$
Execution:
@@ -65,7 +75,7 @@ Sample run:
Deleted hdfs://localhost.:8020/tmp/output
-/tmp/input/cobaltblue.bz2:/tmp/output/bz2out
+/tmp/input/CB.bz2:/tmp/output/bz2out
packageJobJar: [splitFile.sh, splitBzip2.sh, verifyRecordCount.sh, /home/thiruvel/cluster/tmp/hadoop-unjar1287667628951594655/] [] /tmp/streamjob3293762976681085942.jar tmpDir=null
12/03/26 04:15:53 INFO mapred.FileInputFormat: Total input paths to process : 1
@@ -83,23 +93,23 @@ packageJobJar: [splitFile.sh, splitBzip2.sh, verifyRecordCount.sh, /home/thiruve
Deleted hdfs://localhost:8020/tmp/output/hadoop_streaming_todelete
Deleted hdfs://localhost:8020/tmp/output/scratchstreaming
--rw-r--r-- 1 thiruvel supergroup 9096899 2012-03-26 04:16 /tmp/output/bz2out/chunk-1-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 8761934 2012-03-26 04:16 /tmp/output/bz2out/chunk-2-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 8523903 2012-03-26 04:16 /tmp/output/bz2out/chunk-3-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 8869790 2012-03-26 04:16 /tmp/output/bz2out/chunk-4-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 8580745 2012-03-26 04:16 /tmp/output/bz2out/chunk-5-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 8496121 2012-03-26 04:16 /tmp/output/bz2out/chunk-6-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 8854693 2012-03-26 04:16 /tmp/output/bz2out/chunk-7-cobaltblue.bz2
--rw-r--r-- 1 thiruvel supergroup 5272162 2012-03-26 04:16 /tmp/output/bz2out/chunk-8-cobaltblue.bz2
-
-/tmp/output/bz2out/chunk-1-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-2-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-3-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-4-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-5-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-6-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-7-cobaltblue.bz2:/tmp/output
-/tmp/output/bz2out/chunk-8-cobaltblue.bz2:/tmp/output
+-rw-r--r-- 1 thiruvel supergroup 9096899 2012-03-26 04:16 /tmp/output/bz2out/chunk-1-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 8761934 2012-03-26 04:16 /tmp/output/bz2out/chunk-2-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 8523903 2012-03-26 04:16 /tmp/output/bz2out/chunk-3-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 8869790 2012-03-26 04:16 /tmp/output/bz2out/chunk-4-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 8580745 2012-03-26 04:16 /tmp/output/bz2out/chunk-5-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 8496121 2012-03-26 04:16 /tmp/output/bz2out/chunk-6-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 8854693 2012-03-26 04:16 /tmp/output/bz2out/chunk-7-CB.bz2
+-rw-r--r-- 1 thiruvel supergroup 5272162 2012-03-26 04:16 /tmp/output/bz2out/chunk-8-CB.bz2
+
+/tmp/output/bz2out/chunk-1-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-2-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-3-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-4-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-5-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-6-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-7-CB.bz2:/tmp/output
+/tmp/output/bz2out/chunk-8-CB.bz2:/tmp/output
packageJobJar: [createGzipFromBzip.sh, /home/thiruvel/cluster/tmp/hadoop-unjar4063740201930314043/] [] /tmp/streamjob4343584794179189004.jar tmpDir=null
12/03/26 04:16:48 INFO mapred.FileInputFormat: Total input paths to process : 1
@@ -125,14 +135,14 @@ Deleted hdfs://localhost:8020/tmp/output/bz2out
Deleted hdfs://localhost:8020/tmp/output/hadoop_streaming_todelete
Deleted hdfs://localhost:8020/tmp/output/scratchstreaming
--rw-r--r-- 1 thiruvel supergroup 17282201 2012-03-26 04:17 /tmp/output/chunk-1-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 16683234 2012-03-26 04:17 /tmp/output/chunk-2-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 16506643 2012-03-26 04:18 /tmp/output/chunk-3-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 17012086 2012-03-26 04:18 /tmp/output/chunk-4-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 16463777 2012-03-26 04:19 /tmp/output/chunk-5-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 16322551 2012-03-26 04:19 /tmp/output/chunk-6-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 17030241 2012-03-26 04:20 /tmp/output/chunk-7-cobaltblue.gz
--rw-r--r-- 1 thiruvel supergroup 10128594 2012-03-26 04:19 /tmp/output/chunk-8-cobaltblue.gz
+-rw-r--r-- 1 thiruvel supergroup 17282201 2012-03-26 04:17 /tmp/output/chunk-1-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 16683234 2012-03-26 04:17 /tmp/output/chunk-2-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 16506643 2012-03-26 04:18 /tmp/output/chunk-3-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 17012086 2012-03-26 04:18 /tmp/output/chunk-4-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 16463777 2012-03-26 04:19 /tmp/output/chunk-5-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 16322551 2012-03-26 04:19 /tmp/output/chunk-6-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 17030241 2012-03-26 04:20 /tmp/output/chunk-7-CB.gz
+-rw-r--r-- 1 thiruvel supergroup 10128594 2012-03-26 04:19 /tmp/output/chunk-8-CB.gz
[thiruvel@localhost StreamingBz2Split]$
FAQ:
@@ -141,12 +151,12 @@ FAQ:
Set DEBUG env. variable to 0 and run the script.
2. Is there no other way to read bz2 files in Hadoop without this script?
-Pig can read and process bz2 files in parallel, but the limitation is that the data should be on a directory whose name should end with '.bz2'.
-Not many directories on HDFS are in this format and it would have been nice if Pig supports a way to use a regular directory with bz2 files with a knob.
-
+Pig can read and process bz2 files in parallel, but the limitation is that
+the data should be on a directory whose name should end with '.bz2'.
+Not many directories on HDFS are in this format and it would have been nice
+if Pig supports a way to use a regular directory with bz2 files with a knob.
-Start using this script at your own risk - tools provided to validate data and you can use your own validators too.
Please email me your feedback, if any.
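
Note on the bunzip-then-gzip step described in the README: the sketch below
is written in Python, not in the shell scripts this repository ships, and
only illustrates why that step can work. It assumes each recovered piece is
a complete bzip2 stream, so a chunk is simply several streams back to back.
The helper name bz2_chunk_to_gzip and the sample records are hypothetical
and are not part of the tool.

```python
import bz2
import gzip


def bz2_chunk_to_gzip(chunk: bytes) -> bytes:
    """Decompress a chunk of concatenated bzip2 streams and gzip the result."""
    # bz2.decompress accepts several bzip2 streams back to back (Python 3.3+).
    plain = bz2.decompress(chunk)
    return gzip.compress(plain)


# Two independently compressed pieces, each ending on a '\n' record boundary.
# Assumption: this stands in for the pieces produced by the splitting step.
part1 = bz2.compress(b"record-1\nrecord-2\n")
part2 = bz2.compress(b"record-3\n")

# Plain concatenation stands in for joining pieces into one larger chunk.
chunk = part1 + part2

gz = bz2_chunk_to_gzip(chunk)

# Row-count check in the spirit of the -v option: records in == records out.
rows_in = 3
rows_out = gzip.decompress(gz).count(b"\n")
print(rows_in == rows_out)  # prints True
```

Because every piece ends on a newline, decompressing the concatenation
yields whole records only, and the gzip output carries the same row count.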