Compare: Home

Showing with 17 additions and 1 deletion.
  1. +17 −1 Home.md
@@ -1,5 +1,12 @@
# Welcome to the StreamingBz2Split wiki!
+## TODO:
+
+1. Avoid the second Hadoop job, which generates gz from the bz2 splits.
+
+As this was more of a hack, I just concatenated the smaller bz2 files (rec...) to generate a larger one and didn't fix the header of the split file. Without that fix, the Unix tools (bunzip2 etc.) can read the entire bz2 file, but Hadoop cannot. Another hack is to bunzip2 and gzip. The ideal way is to generate the bz2 header for the split file and append the smaller files, but this was just a start.
+
+
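The "just concatenated" hack works for Unix tools because bunzip2 decompresses back-to-back bz2 streams in one file. A minimal local sketch (file names illustrative):

```shell
# Plain concatenation of bz2 files stays readable by bunzip2,
# which decompresses the concatenated streams one after another.
printf 'part one\n' | bzip2 > a.bz2
printf 'part two\n' | bzip2 > b.bz2
cat a.bz2 b.bz2 > combined.bz2
bunzip2 -c combined.bz2
```

Hadoop's bz2 reader, unlike bunzip2, is what needs the split file's header fixed, as described above.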
## FAQ:
### 1. What's the status of bzip2 splitting support in Hadoop?
@@ -12,4 +19,13 @@ Pig can read and process bz2 files in parallel, but the limitation is that the d
Yes. That's another way.
### 4. How do I know if the data generated by the tool is the same as original data?
-I had written a hadoop-diff to verify this. I am in the process of open sourcing that also.
+I had written a hadoop-diff to verify this. I am in the process of open sourcing that also.
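Until hadoop-diff is released, a simple local stand-in is to compare checksums of the decompressed original against the decompressed regenerated output (file names here are illustrative):

```shell
# Verify the regenerated gz splits decompress to the same bytes as the
# original bz2 file; gunzip handles concatenated gz members.
md5_orig=$(bunzip2 -c original.bz2 | md5sum | cut -d' ' -f1)
md5_new=$(cat splits/*.gz | gunzip -c | md5sum | cut -d' ' -f1)
[ "$md5_orig" = "$md5_new" ] && echo "data matches"
```

Note the glob must list the splits in their original order for the byte streams to line up.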
+
+### 5. Why do we need two jobs? Isn't one sufficient?
+See TODO 1. The second job exists only to parallelize the second step, converting the smaller bz2 files to gz; as a separate job it scales well.
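On a single machine the same bz2-to-gz step can be parallelized without a second Hadoop job, e.g. with xargs (a sketch; paths illustrative):

```shell
# Convert each small bz2 split to gz with 4 parallel workers.
ls splits/*.bz2 | xargs -P 4 -I {} sh -c 'bunzip2 -c "$1" | gzip > "$1.gz"' _ {}
```

This is an alternative for modest data sizes; the second Hadoop job is what makes the conversion scale across a cluster.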
+
+### 6. Can I run these on a Linux/Unix system?
+Yes. Just run splitBzip2.sh after setting the environment variables appropriately.
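The wiki doesn't list the variable names, so one way to find which environment variables the script reads is to grep the script itself:

```shell
# List every $UPPER_CASE variable splitBzip2.sh references.
grep -oE '\$\{?[A-Z_][A-Z0-9_]*\}?' splitBzip2.sh | tr -d '${}' | sort -u
```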
+
+### 7. Are there any limitations?
+Yes. If the bz2 file is large (greater than roughly 3 GB; I don't remember the exact limit), bzip2recover will fail. You have to recompile bzip2 with a larger limit on the number of blocks it can handle and place the recompiled bzip2recover in the same directory as the tool. We have run this tool on 5+ GB bz2 files with a recompiled bzip2recover, and it has worked well for us.
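In stock bzip2 1.0.x, the block limit is the `BZ_MAX_HANDLED_BLOCKS` constant in `bzip2recover.c` (50000). A sketch of the recompile step, run against an unpacked bzip2 source tree (a stand-in file is created here so the commands are self-contained):

```shell
# Stand-in for the real bzip2recover.c from an unpacked bzip2 source tree.
printf '#define BZ_MAX_HANDLED_BLOCKS 50000\n' > bzip2recover.c
# Raise the block limit before rebuilding.
sed -i 's/BZ_MAX_HANDLED_BLOCKS 50000/BZ_MAX_HANDLED_BLOCKS 200000/' bzip2recover.c
grep BZ_MAX_HANDLED_BLOCKS bzip2recover.c   # confirm the new limit
# make bzip2recover    (run in the real source tree, then copy the binary
#                       into the same directory as this tool)
```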