
Moved content to wiki and trimmed down README.

1 parent f26708a commit 10a407f33f165eaa1ab76cb81d0ae51015071bcb @thiruvel thiruvel committed Jul 4, 2012
Showing with 1 addition and 123 deletions.
  1. +1 −123 README
124 README
@@ -36,129 +36,7 @@ Hadoop Streaming. The script takes a directory of bz2 files, splits them
based on size/chunks, verifies them and launches another job to gzip them.
All along, the number of mappers can be controlled.
-Here is a sample usage of the script:
-
-[thiruvel@localhost StreamingBz2Split]$ ./run.sh -h
-
-./run.sh: [-t] [-c <chunk size> | -n <number of chunks>] [-v] [-m no_maps] -i input_dir -o output_dir
- -t - Verify integrity of the input bzip2 files. OFF by default.
- -c - Chunk size of each bzip2 split in MB, final size of gzip files may vary. 4 by default.
- -n - Number of chunks to be generated, mutually exclusive to -c. Disabled by default.
- -v - Verify rowcounts between input and output - OFF by default.
- -m - Number of Maps to be launched, default number of maps = number of files.
- -i - Input dir. The directory should exist and contain bz2 files. Other files will be ignored.
- -o - Output dir. The directory will be cleaned if it exists and the output split files in .gz
- format will be placed here. It will also be used as a scratch directory.
- -h - Print usage
-
-
-Example:
---------
-
-Input:
-~~~~~
-[thiruvel@localhost StreamingBz2Split]$ hadoop fs -ls /tmp/input
-Found 1 items
--rw-r--r-- 1 thiruvel supergroup 66410623 2012-03-16 03:09 /tmp/input/CB.bz2
-[thiruvel@localhost StreamingBz2Split]$
-
-Execution:
-~~~~~~~~~~
-
-Split into 8 chunks: $./run.sh -i /tmp/input -o /tmp/output -n 8
-Split into chunks of size approx. 4MB: $./run.sh -i /tmp/input -o /tmp/output -c 4
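A combined invocation (an illustrative sketch only, assuming -t, -v and -m can
be freely mixed with -n as the usage text above suggests):

Split with checks: $./run.sh -t -v -m 4 -i /tmp/input -o /tmp/output -n 8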
-
-Sample run:
-~~~~~~~~~~~
-
-[thiruvel@localhost StreamingBz2Split]$ ./run.sh -i /tmp/input -o /tmp/output -n 8
-
-Deleted hdfs://localhost:8020/tmp/output
-
-/tmp/input/CB.bz2:/tmp/output/bz2out
-
-packageJobJar: [splitFile.sh, splitBzip2.sh, verifyRecordCount.sh, /home/thiruvel/cluster/tmp/hadoop-unjar1287667628951594655/] [] /tmp/streamjob3293762976681085942.jar tmpDir=null
-12/03/26 04:15:53 INFO mapred.FileInputFormat: Total input paths to process : 1
-12/03/26 04:15:53 INFO streaming.StreamJob: getLocalDirs(): [/home/thiruvel/cluster/tmp/mapred/local]
-12/03/26 04:15:53 INFO streaming.StreamJob: Running job: job_201201082309_1228
-12/03/26 04:15:53 INFO streaming.StreamJob: To kill this job, run:
-12/03/26 04:15:53 INFO streaming.StreamJob: hadoop-0.20.205.0.3.1112071329/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:50300 -kill job_201201082309_1228
-12/03/26 04:15:53 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201201082309_1228
-12/03/26 04:15:54 INFO streaming.StreamJob: map 0% reduce 0%
-12/03/26 04:16:10 INFO streaming.StreamJob: map 100% reduce 0%
-12/03/26 04:16:43 INFO streaming.StreamJob: map 100% reduce 100%
-12/03/26 04:16:43 INFO streaming.StreamJob: Job complete: job_201201082309_1228
-12/03/26 04:16:43 INFO streaming.StreamJob: Output: /tmp/output/hadoop_streaming_todelete
-
-Deleted hdfs://localhost:8020/tmp/output/hadoop_streaming_todelete
-Deleted hdfs://localhost:8020/tmp/output/scratchstreaming
-
--rw-r--r-- 1 thiruvel supergroup 9096899 2012-03-26 04:16 /tmp/output/bz2out/chunk-1-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 8761934 2012-03-26 04:16 /tmp/output/bz2out/chunk-2-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 8523903 2012-03-26 04:16 /tmp/output/bz2out/chunk-3-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 8869790 2012-03-26 04:16 /tmp/output/bz2out/chunk-4-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 8580745 2012-03-26 04:16 /tmp/output/bz2out/chunk-5-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 8496121 2012-03-26 04:16 /tmp/output/bz2out/chunk-6-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 8854693 2012-03-26 04:16 /tmp/output/bz2out/chunk-7-CB.bz2
--rw-r--r-- 1 thiruvel supergroup 5272162 2012-03-26 04:16 /tmp/output/bz2out/chunk-8-CB.bz2
-
-/tmp/output/bz2out/chunk-1-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-2-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-3-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-4-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-5-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-6-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-7-CB.bz2:/tmp/output
-/tmp/output/bz2out/chunk-8-CB.bz2:/tmp/output
-
-packageJobJar: [createGzipFromBzip.sh, /home/thiruvel/cluster/tmp/hadoop-unjar4063740201930314043/] [] /tmp/streamjob4343584794179189004.jar tmpDir=null
-12/03/26 04:16:48 INFO mapred.FileInputFormat: Total input paths to process : 1
-12/03/26 04:16:49 INFO streaming.StreamJob: getLocalDirs(): [/home/thiruvel/cluster/tmp/mapred/local]
-12/03/26 04:16:49 INFO streaming.StreamJob: Running job: job_201201082309_1229
-12/03/26 04:16:49 INFO streaming.StreamJob: To kill this job, run:
-12/03/26 04:16:49 INFO streaming.StreamJob: hadoop-0.20.205.0.3.1112071329/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:50300 -kill job_201201082309_1229
-12/03/26 04:16:49 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201201082309_1229
-12/03/26 04:16:50 INFO streaming.StreamJob: map 0% reduce 0%
-12/03/26 04:17:05 INFO streaming.StreamJob: map 13% reduce 0%
-12/03/26 04:17:08 INFO streaming.StreamJob: map 25% reduce 0%
-12/03/26 04:18:08 INFO streaming.StreamJob: map 38% reduce 0%
-12/03/26 04:18:11 INFO streaming.StreamJob: map 50% reduce 0%
-12/03/26 04:18:53 INFO streaming.StreamJob: map 63% reduce 0%
-12/03/26 04:18:56 INFO streaming.StreamJob: map 75% reduce 0%
-12/03/26 04:19:47 INFO streaming.StreamJob: map 88% reduce 0%
-12/03/26 04:19:50 INFO streaming.StreamJob: map 100% reduce 0%
-12/03/26 04:20:14 INFO streaming.StreamJob: map 100% reduce 100%
-12/03/26 04:20:14 INFO streaming.StreamJob: Job complete: job_201201082309_1229
-12/03/26 04:20:14 INFO streaming.StreamJob: Output: /tmp/output/hadoop_streaming_todelete
-
-Deleted hdfs://localhost:8020/tmp/output/bz2out
-Deleted hdfs://localhost:8020/tmp/output/hadoop_streaming_todelete
-Deleted hdfs://localhost:8020/tmp/output/scratchstreaming
-
--rw-r--r-- 1 thiruvel supergroup 17282201 2012-03-26 04:17 /tmp/output/chunk-1-CB.gz
--rw-r--r-- 1 thiruvel supergroup 16683234 2012-03-26 04:17 /tmp/output/chunk-2-CB.gz
--rw-r--r-- 1 thiruvel supergroup 16506643 2012-03-26 04:18 /tmp/output/chunk-3-CB.gz
--rw-r--r-- 1 thiruvel supergroup 17012086 2012-03-26 04:18 /tmp/output/chunk-4-CB.gz
--rw-r--r-- 1 thiruvel supergroup 16463777 2012-03-26 04:19 /tmp/output/chunk-5-CB.gz
--rw-r--r-- 1 thiruvel supergroup 16322551 2012-03-26 04:19 /tmp/output/chunk-6-CB.gz
--rw-r--r-- 1 thiruvel supergroup 17030241 2012-03-26 04:20 /tmp/output/chunk-7-CB.gz
--rw-r--r-- 1 thiruvel supergroup 10128594 2012-03-26 04:19 /tmp/output/chunk-8-CB.gz
-[thiruvel@localhost StreamingBz2Split]$
-
-FAQ:
-----
-1. Is there any way to turn off the verbosity?
-Set the DEBUG environment variable to 0 and run the script.
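For example, reusing the sample paths above (this assumes run.sh picks DEBUG up
from the environment, as described):

[thiruvel@localhost StreamingBz2Split]$ DEBUG=0 ./run.sh -i /tmp/input -o /tmp/output -n 8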
-
-2. Is there any other way to read bz2 files in Hadoop without this script?
-Pig can read and process bz2 files in parallel, but the limitation is that
-the data must live in a directory whose name ends with '.bz2'.
-Not many directories on HDFS follow this convention, and it would be nice
-if Pig supported a knob to read bz2 files from a regular directory.
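One possible workaround, sketched here with hypothetical paths (and only if the
data is safe to move), is to place the bz2 files under a directory whose name
ends in '.bz2' before pointing Pig at it:

[thiruvel@localhost StreamingBz2Split]$ hadoop fs -mkdir /tmp/mydata.bz2
[thiruvel@localhost StreamingBz2Split]$ hadoop fs -mv /tmp/input/CB.bz2 /tmp/mydata.bz2/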
-
-
-
-Please email me your feedback, if any.
+See the wiki for more information. Feedback is welcome.
Thanks!
Thiruvel
