Compare: Home

Showing with 17 additions and 5 deletions.
  1. +17 −5 Home.md
@@ -109,6 +109,10 @@ This is a hacky tool that I wrote and was adopted before I could make it lot mor
As this was more of a hack, I just concatenated the smaller bz2 files (rec...) to generate a larger one and didn't fix the header of the split file. Without that fix, the Unix tools (bunzip2, etc.) can read the entire bz2 file but Hadoop cannot. Another hack is to bunzip and gzip. The ideal way is to generate the bz2 header for the split file and append the smaller files, but this was just a start.
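To make the hack concrete, here is a minimal sketch of the concatenation step, assuming bzip2recover has already split the original file into per-block rec*.bz2 pieces (file names and group size below are hypothetical):

```bash
# Group the per-block rec*.bz2 files produced by bzip2recover and
# concatenate each group into a larger chunk. The result decompresses
# fine with bunzip2 (which handles multiple streams), but it lacks the
# single fixed header described in TODO #1, so Hadoop cannot split it.
i=0
for f in rec*.bz2; do
  chunk=$(printf 'chunk-%04d.bz2' $((i / 100)))   # 100 blocks per chunk (arbitrary)
  cat "$f" >> "$chunk"
  i=$((i + 1))
done
```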
+### 2. Write it entirely in Java.
+
+Currently lots of files are produced on the local file system. Writing it entirely in Java would ensure the smaller files with the bz2 header (see #1) can be generated without writing those temporary files to disk. From conceptualization to implementation took about 2 hours, and shell makes it damn easy to prototype, so it's just been running with that. Writing it in Java would also let it run on Windows.
+
## FAQ:
### 1. What's the status of bzip2 splitting support on Hadoop?
@@ -120,7 +124,7 @@ Pig can read and process bz2 files in parallel, but the limitation is that the d
### 3. Can I not just distcp the data to another folder with a .bz2 extension and process it using Pig?
-Yes. That's another way.
+Yes. That's another way, but I haven't measured the latency for large files.
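For reference, the distcp route in Q3 might look like this (paths are hypothetical; per the question above, the destination folder just needs the .bz2 extension so Pig treats its contents as bzip2 data):

```bash
hadoop distcp /user/me/audit-logs /user/me/audit-logs.bz2
```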
### 4. How do I know if the data generated by the tool is the same as the original data?
@@ -128,12 +132,20 @@ I had written a hadoop-diff to verify this. I am in the process of open sourcing
### 5. Why do we need two jobs? Isn't one sufficient?
-See TODO.1. Just to parallelize the second part - converting smaller bz2 to gz - a second job would scale well.
+See TODO.1. The second job exists just to parallelize the second part - converting the smaller bz2 files to gz - so it scales well. The teams that have adopted this tool haven't complained yet, as it solves a big problem for them. But the second job is not strictly required, and removing it is now in the TODO.
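For illustration, the conversion the second job parallelizes is essentially this per-file pipe (file names hypothetical), run across the cluster rather than in a local loop:

```bash
# Convert each smaller bz2 chunk to gzip without an intermediate file on disk.
for f in chunk-*.bz2; do
  bunzip2 -c "$f" | gzip > "${f%.bz2}.gz"
done
```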
### 6. Can I run these on a Linux/Unix system?
-Yes, just run the splitBzip2.sh and set the ENVIRONMENT variables appropriately.
+Yes, just run splitBzip2.sh and set the ENVIRONMENT variables at the top of the script, or export them before you run it.
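For example (the variable names below are hypothetical placeholders; use whatever is actually listed at the top of splitBzip2.sh):

```bash
# Hypothetical variable names - check the top of splitBzip2.sh for the real ones.
export HADOOP_HOME=/usr/local/hadoop
export INPUT_FILE=/user/me/namenode-audit.bz2
./splitBzip2.sh
```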
+
+
+### 7. What are the input file requirements?
+The line delimiter should be '\n', because tools like 'tail' and 'cat' are used under the hood, and this is how most logs are generated anyway.
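A quick way to sanity-check an input file against this requirement (a generic check, not part of the tool):

```bash
# Decompress a small prefix and confirm it is '\n'-delimited text;
# with cat -A, each line should end with a '$' marker (GNU coreutils).
bunzip2 -c input.bz2 | head -n 3 | cat -A
```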
+
+
+### 8. Are there any limitations?
+Yes, if the bz2 file is large (> 3GB or so, I don't remember exactly), bzip2recover will fail. You have to recompile bzip2 with a larger limit on the number of blocks it can handle and place the recompiled bzip2recover in the same directory as the tool. We have run this tool on 5+GB bz2 files using the recompiled bzip2recover, and that has worked well for us.
-### 7. Are there any limitations?
-Yes, if the bz2 file is large (> 3GB or something I dont remember), bzip2recover would fail. You have to recompile bz2 by increasing the number of blocks it can handle and place the recompiled bzip2recover in the same directory as the tool. We have run this tool with 5+GB bz2 files using the recompiled bzip2recover and that has worked well for us.
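A sketch of the recompile described in #8, assuming a stock bzip2 source tree (the macro name and default limit below are from bzip2 1.0.x's bzip2recover.c; verify them against your version):

```bash
# Raise bzip2recover's block limit and rebuild only that binary.
tar xzf bzip2-1.0.6.tar.gz && cd bzip2-1.0.6
sed -i 's/BZ_MAX_HANDLED_BLOCKS 50000/BZ_MAX_HANDLED_BLOCKS 500000/' bzip2recover.c
make bzip2recover
cp bzip2recover /path/to/the/tool/   # same directory as splitBzip2.sh
```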
+### 9. What was the intention behind the hack?
+I was processing HDFS Namenode logs, and these logs are large even with bz2 compression (up to 10GB on Yahoo!'s largest clusters). Processing a year of such audit logs serially would have held up our projects considerably. Hence this tool was born.