Hadoop troubleshooting

Andrea edited this page May 29, 2013 · 4 revisions

Symptoms / Solutions

Following are some issues that we have run into while running Hadoop, together with some possible solutions.

Each entry below lists a symptom followed by possible solutions.
Tasks failing with the following error:
 java.io.EOFException
	at java.io.DataInputStream.readShort(DataInputStream.java:298)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream
            .createBlockOutputStream(DFSClient.java:3060)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream
            .nextBlockOutputStream(DFSClient.java:2983)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream
           .access$2000(DFSClient.java:2255)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream
            $DataStreamer.run(DFSClient.java:2446)

a) Increase the file descriptor limit for the Hadoop user, e.g. by setting ulimit to 8192;
b) Increase the upper bound on the number of files each DataNode serves at any one time, by setting dfs.datanode.max.xcievers to 4096 in hdfs-site.xml (note that the property name really is spelled "xcievers").
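Solution (b) translates to the following hdfs-site.xml fragment on every DataNode; the value 4096 is the suggestion from the tip above, and the DataNodes must be restarted for it to take effect:

```xml
<!-- hdfs-site.xml: raise the per-DataNode limit on concurrent block transfer threads -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```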
Get the following error while starting a datanode (check [hadoop_home]/logs/hadoop-hduser-datanode-xxx.log):
 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
       java.io.IOException: Incompatible namespaceIDs 

This usually happens after the NameNode has been reformatted while the DataNodes still hold data from the previous namespace. Either reformat the NameNode and restart the cluster (this erases all HDFS data), or edit the namespaceID in the DataNode's current/VERSION file so that it matches the NameNode's.
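The namespaceID lives in the VERSION file under each storage directory. A sketch of illustrative contents (the values and the layoutVersion are placeholders; the path depends on your dfs.data.dir setting):

```
# <dfs.data.dir>/current/VERSION on the DataNode (illustrative)
namespaceID=123456789    <- must match the namespaceID in <dfs.name.dir>/current/VERSION on the NameNode
storageType=DATA_NODE
layoutVersion=-18
```

Stop the DataNode before editing the file and restart it afterwards.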
Tasks failing with the following error:
Too many fetch-failures

That can happen in a number of situations, and it's a bit tricky to debug.
Usually it means that a reducer was unable to fetch map output from another node.
Cleaning up /etc/hosts can help:
- use hostnames instead of IPs
- keep the file in sync across all the nodes
- try commenting out "127.0.0.1 localhost"
Restart the cluster after making these changes.
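A cleaned-up /etc/hosts for a small cluster might look like the sketch below (the hostnames and addresses are placeholders); the same file should be replicated to every node:

```
# /etc/hosts - identical on all nodes
# 127.0.0.1   localhost    <- commented out per the tip above
192.168.0.1   master
192.168.0.2   slave1
192.168.0.3   slave2
```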
Get the following error when putting data into the dfs:
Could only be replicated to 0 nodes, instead of 1
The NameNode does not have any available DataNodes. This can be caused by a wide variety of reasons.
Solution:
Check how many DataNodes the NameNode can actually see with hadoop dfsadmin -report or the DFS health page (http://master:50070/dfshealth.jsp). If none come up and nothing else helps, erase all temporary data along with the NameNode's storage, reformat the NameNode, start everything up again and revisit the health page (note that reformatting destroys all data in HDFS).
Tasks failing with the following error:
java.lang.Throwable: Child Error
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: 
           Task process exit with nonzero status of 137.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
Possible reason: the memory allocated for the task JVMs (the mapred.*.child.java.opts heap settings in mapred-site.xml, times the number of task slots) exceeds the node's actual memory, so the kernel OOM killer terminates tasks (exit status 137 corresponds to SIGKILL, i.e. 128 + 9).
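As a rule of thumb, (map slots x map heap) + (reduce slots x reduce heap) must fit in the node's RAM alongside the Hadoop daemons and the OS. A mapred-site.xml sketch for a node with, say, 8 GB of RAM and 4 map + 2 reduce slots (the slot counts and heap sizes are assumptions to adapt):

```xml
<!-- mapred-site.xml: keep total task heap below physical memory -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<!-- 4 * 512m + 2 * 1024m = 4 GB of task heap, leaving headroom for daemons and the OS -->
```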
Tasks fail during merge operations with an OutOfMemoryError. Reduce mapred.job.shuffle.input.buffer.percent (a MapReduce property, so mapred-site.xml is its natural home) to a value < 0.7; try 0.5 for example.
Sorting is too slow. Increase io.sort.mb and io.sort.factor to enlarge the in-memory sort buffer and the number of files merged at once. Possible values: 200 and 50 respectively.
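The two tuning tips above translate to the following mapred-site.xml fragment; the concrete values are just the suggested starting points:

```xml
<!-- mapred-site.xml: shuffle and sort tuning -->
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.5</value>   <!-- default 0.70; lower it if merges run out of memory -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>   <!-- in-memory sort buffer in MB (default 100) -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>50</value>    <!-- number of streams merged at once (default 10) -->
</property>
```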
Job failing with the following error:

java.io.IOException: Split metadata size exceeded 10000000. 
                     Aborting job job_201204170831_0012
	at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.
            readSplitMetaInfo(SplitMetaInfoReader.java:48)
	at org.apache.hadoop.mapred.JobInProgress.createSplits
            (JobInProgress.java:808)
	at org.apache.hadoop.mapred.JobInProgress.initTasks
            (JobInProgress.java:701)
	at org.apache.hadoop.mapred.JobTracker.initJob
            (JobTracker.java:4210)
	at org.apache.hadoop.mapred.EagerTaskInitializationListener
            $InitJob.run(EagerTaskInitializationListener.java:79)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
            (ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run
            (ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
The job is hitting the default limit on split metadata size (10000000 bytes).
Set the mapreduce.job.split.metainfo.maxsize property in the JobTracker's mapred-site.xml config file to a higher value, or to -1 to disable the check.
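For example (the raised value is arbitrary; -1 disables the check entirely):

```xml
<!-- mapred-site.xml on the JobTracker: raise the split metadata limit -->
<property>
  <name>mapreduce.job.split.metainfo.maxsize</name>
  <value>100000000</value>   <!-- or -1 for no limit -->
</property>
```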

Logging issues

Hadoop depends on slf4j-api-1.4.3. Since no java.util.logging (jul) bridge handler is available for SLF4J 1.4.3 (the jul-to-slf4j module only exists from 1.5.2 onward), incoming jul messages (e.g. from ldspider or silk) are not redirected to the SLF4J API but printed straight to the standard output. See #79 for more details.
Below is a console output example:

[INFO] One time execution enabled
[INFO] Import Job freebase.3 started (crawl / daily)
[INFO] Crawling seed: http://rdf.freebase.com/ns/m/0fpjn6x (with levels=2, limit=100000)
Jan 10, 2012 12:25:13 PM com.ontologycentral.ldspider.hooks.links.LinkFilterSelect <init>
INFO: link predicate is [http://rdf.freebase.com/ns/music.artist.genre]
Jan 10, 2012 12:25:13 PM com.ontologycentral.ldspider.Crawler evaluateBreadthFirst
INFO: freebase.com: 1
Jan 10, 2012 12:25:13 PM com.ontologycentral.ldspider.Crawler evaluateBreadthFirst
INFO: Starting threads round 0 with 1 uris
Jan 10, 2012 12:25:14 PM com.ontologycentral.ldspider.http.LookupThread run
INFO: lookup on http://rdf.freebase.com/ns/m/0fpjn6x status 303 LT-0:http://rdf.freebase.com/ns/m/0fpjn6x
Jan 10, 2012 12:25:14 PM com.ontologycentral.ldspider.http.LookupThread run
INFO: lookup on http://rdf.freebase.com/rdf/m/0fpjn6x status 200 LT-0:http://rdf.freebase.com/rdf/m/0fpjn6x
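Conceptually, a jul bridge handler swaps the root logger's console handler for one that forwards records to another logging API. A JDK-only sketch of that mechanism (it captures messages into a list where the real jul-to-slf4j bridge would call SLF4J; the class and logger names are illustrative):

```java
// Sketch of what a jul bridge does: replace the default console handler on
// the jul root logger with one that forwards LogRecords elsewhere.
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class JulBridgeSketch {
    static final List<String> forwarded = new ArrayList<>();

    public static void install() {
        Logger root = Logger.getLogger(""); // the jul root logger
        for (Handler h : root.getHandlers()) {
            root.removeHandler(h); // drop the console handler that prints to the terminal
        }
        root.addHandler(new Handler() {
            @Override public void publish(LogRecord record) {
                // The real jul-to-slf4j bridge would hand the record to SLF4J here.
                forwarded.add(record.getLoggerName() + ": " + record.getMessage());
            }
            @Override public void flush() {}
            @Override public void close() {}
        });
    }

    public static void main(String[] args) {
        install();
        // jul messages like ldspider's no longer reach the console; they are forwarded instead.
        Logger.getLogger("com.ontologycentral.ldspider").info("lookup on http://example.org/");
        System.out.println(forwarded);
    }
}
```

This only illustrates the redirection mechanism; with Hadoop pinned to slf4j-api-1.4.3, the actual bridge module cannot be used, which is why the jul output above still reaches the console.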

Utilities / References

  • HDFS filesystem checking utility
    • HDFS supports the fsck command to check for various inconsistencies. It is designed to report problems with files, for example missing blocks or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects; normally the NameNode automatically corrects most recoverable failures. By default fsck ignores open files, but provides an option to include all files in the report. The HDFS fsck command is not a Hadoop shell command; it can be run as hadoop fsck, on the whole file system or on a subset of files. For command usage, see fsck.
  • Another Hadoop troubleshooting page