This repository has been archived by the owner on Mar 14, 2020. It is now read-only.

Timberlake Considers Certain Failed Jobs as Running #7

Closed
ajsquared opened this issue Nov 24, 2014 · 3 comments

Comments

@ajsquared

I noticed an odd thing today. I ran a job that failed when starting the ApplicationMaster:

Application application_1416843883012_0019 failed 2 times due to Error launching appattempt_1416843883012_0019_000002. Got exception: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.security.Credentials.readTokenStorageStream(Credentials.java:209)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.setupTokens(AMLauncher.java:226)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.createAMContainerLaunchContext(AMLauncher.java:198)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:108)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
. Failing the application.

The ResourceManager correctly shows the job in the FAILED state. Timberlake, however, says the job is still running, and refreshing the page just resets the duration to 0. After restarting Timberlake the job is no longer shown as running, but it is not included in the list of finished jobs.

@jbalogh
Contributor

jbalogh commented Nov 24, 2014

Ah yeah, sorry about that. This happens because Timberlake asks the ResourceManager for running jobs and trusts the HistoryServer to have all the finished jobs. Your job got into the list of running jobs, but since it failed while launching the ApplicationMaster, the HistoryServer never knew about it, so it got stuck in that list.

A previous version would drop the job from the list of running jobs if the ResourceManager didn't know about it. That caused a different problem: a job would disappear for a few seconds until the HistoryServer picked it up.

Thanks for the report! I'm thinking about how to make this part more reliable.
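
As an illustration of the trade-off described above, here is a minimal sketch of one way to reconcile the two sources. It is not Timberlake's actual code or the fix that was eventually shipped; the `Tracker` class, the `reconcile` method, and the shape of its inputs are all hypothetical. It assumes the ResourceManager reports a state per application and the HistoryServer reports a final state per finished job.

```python
from dataclasses import dataclass, field

# Terminal YARN application states as reported by the ResourceManager.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


@dataclass
class Tracker:
    """Hypothetical job tracker; not Timberlake's real data model."""
    running: dict = field(default_factory=dict)   # job id -> last RM state
    finished: dict = field(default_factory=dict)  # job id -> final state

    def reconcile(self, rm_states, history_states):
        """Merge one poll of both services.

        rm_states:      job id -> state reported by the ResourceManager
        history_states: job id -> final state reported by the HistoryServer
        """
        # 1. Anything the HistoryServer knows about is finished.
        for job_id, state in history_states.items():
            self.finished[job_id] = state
            self.running.pop(job_id, None)

        # 2. Trust a terminal state from the ResourceManager even when the
        #    HistoryServer has no record (e.g. the ApplicationMaster never
        #    launched), so the job cannot get stuck as "running".
        for job_id, state in rm_states.items():
            if job_id in self.finished:
                continue
            if state in TERMINAL_STATES:
                self.finished[job_id] = state
                self.running.pop(job_id, None)
            else:
                self.running[job_id] = state

        # 3. Jobs missing from both responses are deliberately left in
        #    `running`, which avoids the brief disappearance while the
        #    HistoryServer catches up.
```

With this sketch, a job that the ResourceManager marks FAILED moves to the finished list on the next poll, while a job that has merely dropped out of the ResourceManager response stays visible until the HistoryServer reports it.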

@jbalogh
Contributor

jbalogh commented Dec 20, 2014

Hey Andrew! Sorry it took so long to get this fixed, but now it's done. Thanks again for the report.

@ajsquared
Author

Great, thanks!
