Failed to run TF mnist example in GCP #74
Comments
Oops, there seems to be a bug in https://github.com/linkedin/TonY/blob/master/tony-cli/src/main/java/com/linkedin/tony/cli/ClusterSubmitter.java#L60 that was introduced during the last refactoring. It should be |
Thanks for the quick fix. After modifying the code, my task runs but gets stuck at the following:
It looks like the server is listening on ports 8032 and 10200, but over IPv6 instead of IPv4. I'm not sure if this is normal.
Thanks |
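One way to check whether the IPv6 listing matters is to try connecting over each address family. On many Linux setups a socket reported as IPv6-only by netstat still accepts IPv4 connections, so a quick probe can rule this out. The sketch below is a minimal example, not part of TonY; the function name `listening_families` and the choice of host are assumptions for illustration.

```python
import socket


def listening_families(host, port, timeout=1.0):
    """Return the address families ("IPv4", "IPv6") on which a TCP
    connection to (host, port) succeeds. Hypothetical helper."""
    reachable = []
    for family, name in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        try:
            # Resolve the host for this address family only.
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            # The host has no address in this family.
            continue
        for fam, socktype, proto, _canon, sockaddr in infos:
            try:
                with socket.socket(fam, socktype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(sockaddr)
                reachable.append(name)
                break
            except OSError:
                # Refused, timed out, or family unsupported: try the next address.
                continue
    return reachable
```

For example, calling `listening_families("localhost", 8032)` on the master node would show whether the port from this thread is reachable over IPv4. If it is, the IPv6 listing is probably not the cause of the hang.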
Are you able to run a normal MapReduce job? |
Yes, other MR jobs work fine (I tried the sample GCP wordcount). Before, I was running using the
Not sure which logs to collect/any suggestions? Thanks again |
|
@gogasca could you check |
Log information:
Not much above, but I found some information here:
I will try to access the web page to get more details. |
@gogasca the problem looks obvious from the log :) unexpected EOF while looking for matching `'' |
It seems for some reason there is an extra
|
Thanks, I was able to edit:
And modify:
As you can see, there was a ' by default. I have contacted the Dataproc team to find out if it's a known issue. The job is running now. I will update the thread later. Thanks again. |
Lol, np. It is never easy to get a job running the first time. Let us know if you see more issues. |
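A stray quote like this can be hard to spot by eye. The sketch below is one possible way to scan a Hadoop-style configuration file for property values with an unbalanced quote. It assumes the usual `<configuration>/<property>/<name>/<value>` layout; the helper name and the property names in the usage example are hypothetical, and the thread does not say which property was affected.

```python
import xml.etree.ElementTree as ET


def properties_with_odd_quotes(xml_text):
    """Return (name, value) pairs whose value contains an odd number of
    single or double quotes. Hypothetical helper for illustration."""
    root = ET.fromstring(xml_text)
    suspicious = []
    for prop in root.iter("property"):
        name = prop.findtext("name", default="")
        value = prop.findtext("value", default="")
        # An odd count suggests a quote that is never closed.
        if value.count("'") % 2 == 1 or value.count('"') % 2 == 1:
            suspicious.append((name, value))
    return suspicious


# Made-up sample data: "example.a" has a trailing stray quote.
sample = (
    "<configuration>"
    "<property><name>example.a</name><value>-Xmx512m'</value></property>"
    "<property><name>example.b</name><value>'ok'</value></property>"
    "</configuration>"
)
print(properties_with_odd_quotes(sample))
```

To check a real file, read its contents and pass the text to the function. An odd quote count is only a heuristic, so any hit still needs a manual look.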
Just for the record, my memory configuration was incorrect:
I reduced the value in tony.xml and increased the value in yarn-site.xml, and the task seems to start now. (I will fine-tune it later.)
Now I see the following:
This has been sitting for 30+ minutes now.
I get:
Not sure what would be the expected result now. |
I re-ran the job with the gcloud command, as the job status was not appearing in the Dataproc dashboard.
When I run the command above, I still see the same results: the job keeps running and just displays the following. It looks like it is stuck in:
|
@gogasca The logs from the workers won't be shipped to the AM log. Can you check the logs of the other containers for this training job? You can't directly run mnist_distributed.py because that assumes a |
I looked into the workers and saw this:
Not sure if I'm looking at the right logs. |
@gogasca there should be another two workers:
|
This is what I see in one of the workers. Only 1 container appears, and I'm not sure if I'm looking at the right place:
|
I opened an internal ticket to dig into this. Will keep this thread updated. |
@gogasca do you have access to the ResourceManager's UI? If the worker jobs are scheduled, you should be able to see additional containers on the UI. I'm not sure if you are running on a single node or multiple nodes. If you have more than one machine, the logs might be on other nodes. |
@oliverhu just set it up. Thanks for the tip!
I ended up killing the existing jobs. I tried with a new fresh job:
It did generate 2 containers. In this container I found the following error in the logs:
Questions:
Thanks again |
Ah I see. It seems to be a usability issue in TonY, the code assumes you have
We're going to fix this today. For the second question, the job should fail instead of sitting in "Running". For the container log you pasted, is that all the logs there? Could you paste the log for the other worker as well? |
Oh I understand now, the job was launched from http://tony-dev-w-1.c.dpe-cloud-mle.internal:8042 Logs below: worker0 (tony-dev-w-0.c.dpe-cloud-mle.internal)
worker1 (tony-dev-w-1.c.dpe-cloud-mle.internal)
Complete logs.zip attached. |
@gogasca if you take a look at the screenshot you pasted: It seems your cluster failed to allocate more containers for you. Outstanding Resource Requests means the cluster can't allocate that much resource for your job. |
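To make the "Outstanding Resource Requests" diagnosis concrete, here is a rough sanity check of whether a set of container requests could fit on a cluster at all. It is a simplified sketch, not how YARN actually schedules: it ignores the application master's own container, rounding to allocation increments, and vcores. All parameter names and numbers are hypothetical and would need to be read from your own tony.xml and yarn-site.xml.

```python
def can_schedule(container_mb, num_containers, max_allocation_mb, node_mb, num_nodes):
    """Rough feasibility check for equally sized container requests.

    container_mb      -- memory requested per container (hypothetical input)
    num_containers    -- number of containers requested
    max_allocation_mb -- largest single allocation the scheduler allows
    node_mb           -- memory each node offers to containers
    num_nodes         -- number of worker nodes
    """
    # A single request above the scheduler maximum can never be granted.
    if container_mb > max_allocation_mb:
        return False
    # Otherwise, count how many containers fit across all nodes.
    per_node = node_mb // container_mb
    return per_node * num_nodes >= num_containers


# Made-up numbers: 3 containers of 4 GB on 2 nodes with 6 GB each.
print(can_schedule(4096, 3, 8192, 6144, 2))  # only one 4 GB container fits per node
print(can_schedule(2048, 3, 8192, 6144, 2))  # 2 GB containers fit three per node
```

If a check like this fails, either lowering the per-container request or raising the node limits is the kind of adjustment described earlier in the thread.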
Thanks @oliverhu, the TensorFlow job is now running!
Thank you very much for helping me solve all these issues. |
Greaaaat! I'm so glad we made it work together on GCP!!
Let us know if you have more issues! |
Yes!
I keep getting this error:
This is my folder structure in master, worker0 and worker1. /usr/local/src/jobs/MyJob/
This is a new Dataproc cluster I installed from scratch.
|
My other Dataproc cluster seems to run the TF job without issue using the previous code. |
@gogasca I don't see you having tony.xml in your folder architecture. You need to disable |
Thanks, that was the issue. :) |
I created a guide for installing TonY in GCP, and I would like to share it here if possible. What would be the best location? Thanks |
That's AWESOME! My original thought was to add this to the wiki page, but I don't think you'd have access to that and it is hard to review. How about in the
|
LGTM, thanks @oliverhu |
I'm running into an issue trying to get the example working in GCP. Does anyone have any pointers for debugging?
|
Could you give us some more info (cmd that you use, Hadoop version, etc.) so we can reproduce the bugs? |
@dwu15 Have you tried editing /etc/hadoop/conf.empty/yarn-site.xml and removing the ' at the end? I remember seeing EOF errors. We also published the installation guide: https://github.com/linkedin/TonY/tree/master/tony-examples/tony-in-gcp. Let us know if you still encounter issues. |
Thanks for the response, @gogasca and @pdtran3k6. Yes, I'm currently following the installation guide on a fresh cluster on GCP. I've followed that guide more or less word for word and am still getting these errors. Here is the command that I ended up running in a GCP Cloud Shell for this project.
My |
Regarding the yarn-site.xml fix, I looked at the release notes, and it seemed like they fixed that bug on 12/10. |
@dwu15 could you provide the Hadoop log for the application application_1544551109758_0001? |
These are my log files. I was able to submit without the job failing, but the job now seems to hang indefinitely.
|
Looking at the logs for one of the containers, this is probably the issue, as it can't locate the zip file with the venv:
|
That looks weird, does |
After some debugging, I got a job running. I had to change the folder structure
And specify relative paths in the hadoop job statement. I ran this statement in the parent directory of TFJob
I saw in one of the comments in this chain that there was an issue before with relative paths vs absolute paths, but this bug still seems to exist. |
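Since the failure here came down to files not being found from the directory where the job was submitted, a small pre-flight check before submitting can save a round trip. The sketch below only verifies that a list of relative paths exists under a base directory. The helper name and the file names in the usage note are hypothetical, and this is not part of TonY.

```python
from pathlib import Path


def missing_paths(base_dir, relative_paths):
    """Return the relative paths that do not exist under base_dir.
    Hypothetical helper for illustration."""
    base = Path(base_dir)
    return [p for p in relative_paths if not (base / p).exists()]
```

For example, running `missing_paths(".", ["TFJob/src", "TFJob/venv.zip"])` from the parent directory of TFJob would list whatever the job submission would fail to find. Those two entries are placeholders for the paths you actually pass to the job.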
Darn, we'll take a look again, thanks for reporting. |
Took a look tonight. It turned out to be fairly complicated to fix this issue correctly. Plan to revamp the logic behind handling |
This is fixed in trunk! Check the updated README for instructions: https://github.com/linkedin/TonY/pull/136/files |
Unable to run mnist example in Dataproc.
Java version
Command run:
Directory structure:
tony.xml contents: