
Merge pull request #88 from geota/master

Update README.md to explain DeprecatedLzoTextInputFormat
2 parents 478aa84 + bf18ac9 commit 015e93e6bae89a7ea2ed67c6e3b8e9e23cee299d @rangadi rangadi committed Feb 25, 2014
Showing with 1 addition and 1 deletion.
  1. +1 −1 README.md
2 README.md
@@ -126,6 +126,6 @@ Either way, after 10-20 seconds there will be a file named big_file.lzo.index.
#### Running MR Jobs over Indexed Files
-Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat" (streaming still uses the old APIs, and needs a class that inherits from org.apache.hadoop.mapred.InputFormat). For Pig jobs, email me or check the pig list -- I have custom LZO loader classes that work but are not (yet) contributed back.
+Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat" (streaming still uses the old APIs, and needs a class that inherits from org.apache.hadoop.mapred.InputFormat). Note that to use the DeprecatedLzoTextInputFormat properly with hadoop-streaming, you should also set the jobconf property `stream.map.input.ignoreKey=true`. That will replicate the behavior of the default TextInputFormat by stripping off the byte offset keys from the input lines that get piped to the mapper process. For Pig jobs, email me or check the pig list -- I have custom LZO loader classes that work but are not (yet) contributed back.
Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient.
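The streaming guidance in the added line can be sketched as a command line. This is a minimal illustration, not the project's documented invocation: the helper name `lzo_streaming_cmd`, the streaming jar path, the output directory, and the `/bin/cat` and `/usr/bin/wc` mapper and reducer are all assumptions, as is passing the jobconf property through the generic `-D` option (which has to precede the streaming-specific options). The function only prints the command, so it can be inspected before running it on a cluster where the hadoop-lzo jar is already on the job's classpath.

```shell
#!/usr/bin/env bash
# Sketch: print a hadoop-streaming command for an indexed .lzo input.
# Everything except the -inputformat class and the ignoreKey property
# (both taken from the README text) is a placeholder.

lzo_streaming_cmd() {
  local streaming_jar="$1" input="$2" output="$3"
  # -D is a generic option, so it goes before -inputformat, -input, etc.
  # stream.map.input.ignoreKey=true drops the byte-offset key so the
  # mapper sees only the line text, as with the default TextInputFormat.
  echo hadoop jar "$streaming_jar" \
    -D stream.map.input.ignoreKey=true \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -input "$input" \
    -output "$output" \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
}

# Example: show the command for the big_file.lzo mentioned above.
# The jar path and output directory are hypothetical.
lzo_streaming_cmd /path/to/hadoop-streaming.jar big_file.lzo wordcount_out
```

Without the `-D stream.map.input.ignoreKey=true` setting, each line piped to the mapper would be prefixed by its byte offset and a tab, which is the behavior the README change warns about.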
