Timberlake Considers Certain Failed Jobs as Running #7

ajsquared · 2014-11-24T16:08:45Z

I noticed an odd thing today. I ran a job that failed when starting the ApplicationMaster:

Application application_1416843883012_0019 failed 2 times due to Error launching appattempt_1416843883012_0019_000002. Got exception: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:209)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:226)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:198)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:108)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
. Failing the application.

The ResourceManager correctly shows the job in the FAILED state. Timberlake, however, says the job is still running, and refreshing the page just resets the duration to 0. After restarting Timberlake, the job is no longer shown as running, but it is also missing from the list of finished jobs.

jbalogh · 2014-11-24T16:15:46Z

Ah yeah, sorry about that. This happens because Timberlake (TL) asks the ResourceManager for running jobs and trusts the HistoryServer to have all the finished jobs. Your job got into the list of running jobs, but then it got stuck there since the HistoryServer didn't know about it.

A previous version would drop the job from the list of running jobs if the ResourceManager didn't know about it. This led to weird issues where the job would disappear for a few seconds until the HistoryServer picked it up.
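
To make the trade-off above concrete, here is a minimal sketch in Go of one way the two sources could be reconciled. It is not Timberlake's actual code and not necessarily the fix that was applied: the `Tracker` type, the `Reconcile` method, and the `rmFinal` map (terminal states reported by the ResourceManager for jobs that are no longer running) are all hypothetical names introduced for illustration.

```go
package main

import "fmt"

// JobState is a simplified, hypothetical job status used only in this sketch.
type JobState string

const (
	Finished JobState = "FINISHED"
	Failed   JobState = "FAILED"
)

// Tracker remembers which job IDs the UI currently shows as running.
type Tracker struct {
	running map[string]bool
}

func NewTracker() *Tracker {
	return &Tracker{running: make(map[string]bool)}
}

// Reconcile applies one polling round and returns the jobs that left the
// running list, keyed by ID, together with the state they ended in.
//
//   rmRunning: IDs the ResourceManager reports as running.
//   rmFinal:   terminal states the ResourceManager reports for jobs that are
//              no longer running (an assumption of this sketch).
//   history:   IDs the HistoryServer knows about.
func (t *Tracker) Reconcile(rmRunning []string, rmFinal map[string]JobState, history map[string]bool) map[string]JobState {
	stillRunning := make(map[string]bool, len(rmRunning))
	for _, id := range rmRunning {
		stillRunning[id] = true
		t.running[id] = true
	}

	done := make(map[string]JobState)
	for id := range t.running {
		if stillRunning[id] {
			continue
		}
		switch {
		case history[id]:
			// Normal path: the HistoryServer has picked the job up.
			done[id] = Finished
			delete(t.running, id)
		case rmFinal[id] == Failed:
			// The job failed before the HistoryServer could learn about it,
			// so waiting on the HistoryServer would leave it stuck forever.
			done[id] = Failed
			delete(t.running, id)
		default:
			// Keep showing the job as running until the HistoryServer
			// catches up, which avoids the job briefly disappearing.
		}
	}
	return done
}

func main() {
	tr := NewTracker()
	tr.Reconcile([]string{"job_0019", "job_0020"}, nil, nil)

	// job_0019 fails while launching the ApplicationMaster; job_0020 ends
	// normally but the HistoryServer has not picked it up yet.
	done := tr.Reconcile(nil, map[string]JobState{"job_0019": Failed}, nil)
	fmt.Println(done["job_0019"], len(tr.running))
}
```

In this sketch a job that vanishes from the running list stays visible until either the HistoryServer knows it or the ResourceManager reports a failure, so it neither flickers out of view nor stays stuck as running.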

Thanks for the report! I'm thinking about how to make this part more reliable.

jbalogh · 2014-12-20T00:46:20Z

Hey Andrew! Sorry it took so long to get this fixed, but now it's done. Thanks again for the report.

ajsquared · 2014-12-20T00:52:57Z

Great, thanks!

jbalogh closed this as completed in bcccc79 Dec 20, 2014
