
Failed to run TF mnist example in GCP #74

Closed
spicyramen opened this issue Nov 11, 2018 · 46 comments

@spicyramen

Unable to run mnist example in Dataproc.

sudo java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter --python_venv=/usr/local/src/MyJob/venv.zip --src_dir=/usr/local/src/TonY/mnist/ --executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py --conf_file=/usr/local/src/tony.xml --python_binary_path=venv/bin/python3.5
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/11 08:23:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, null/core-site.xml, null/hdfs-site.xml
Nov 11, 2018 8:23:02 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
18/11/11 08:23:03 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/11/11 08:23:03 INFO cli.ClusterSubmitter: Copying /usr/local/src/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-dev-m/user/root/.tony/6665ca2a-fd31-4f61-a947-33f895517302
Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
        at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:60)
hadoop version
Hadoop 2.9.0
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-17T12:00Z
Compiled with protoc 2.5.0
From source with checksum f510b6e8bafb2ddfd660aeb7454e7c30
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar

Java version

java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Command run:

java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/usr/local/src/MyJob/venv.zip \
--src_dir=/usr/local/src/TonY/mnist/ \
--executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py \
--conf_file=/usr/local/src/tony.xml \
--python_binary_path=venv/bin/python3.5

Directory structure:

.
├── src
│   └── mnist_distributed.py
├── tony-cli-0.1.5-all.jar
├── tony.xml
└── venv.zip

tony.xml contents:

<configuration>
  <property>
    <name>tony.application.security.enabled</name>
   <value>false</value>
  </property>    
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>15g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>0</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>
@oliverhu
Member

oliverhu commented Nov 11, 2018

Oops, there seems to be a bug introduced in https://github.com/linkedin/TonY/blob/master/tony-cli/src/main/java/com/linkedin/tony/cli/ClusterSubmitter.java#L60 during the last refactoring. It should be exitCode = TonyClient.start(updatedArgs.stream().toArray(String[]::new)); will update that.

@gogasca
Contributor

gogasca commented Nov 12, 2018

Thanks for the quick fix. After modifying the code, my task runs but gets stuck at the following:

sudo java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter --python_venv=/usr/local/src/MyJob/venv.zip --src_dir=/usr/local/src/MyJob/ --executes=/usr/local/src/MyJob/src/mnist_distributed.py --conf_file=/usr/local/src/MyJob/tony.xml --python_binary_path=venv/bin/python3.5
18/11/12 00:46:15 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/12 00:46:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/12 00:46:15 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, null/core-site.xml, null/hdfs-site.xml
Nov 12, 2018 12:46:15 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
18/11/12 00:46:16 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/11/12 00:46:16 INFO cli.ClusterSubmitter: Copying /usr/local/src/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-dev-m/user/root/.tony/acc293e9-15f5-49a6-9b53-6b62c6e1e57b
18/11/12 00:46:16 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/12 00:46:16 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/12 00:46:16 INFO tony.TonyClient: Starting client..
18/11/12 00:46:16 INFO client.RMProxy: Connecting to ResourceManager at tony-dev-m/10.128.0.2:8032
18/11/12 00:46:16 INFO client.AHSProxy: Connecting to Application History server at tony-dev-m/10.128.0.2:10200

Looks like the server is listening on ports 8032 & 10200, but on IPv6 instead of IPv4; not sure if this is normal.

gogasca@tony-dev-m:~$ sudo netstat -anp | grep 10200
tcp6       0      0 :::10200                :::*                    LISTEN      1887/java           
gogasca@tony-dev-m:~$ sudo netstat -anp | grep 8032
tcp6       0      0 :::8032                 :::*                    LISTEN      1910/java           
tcp6       0      0 10.128.0.2:8032         10.128.0.2:49208        ESTABLISHED 1910/java           
tcp6       0      0 10.128.0.2:49208        10.128.0.2:8032         ESTABLISHED 779/java          

Thanks

@oliverhu
Member

Are you able to run a normal MapReduce job?

@gogasca
Contributor

gogasca commented Nov 12, 2018

Yes, other MR jobs work fine. (I tried the sample GCP wordcount.)

[screenshot]

Before, I was running using java -cp with the hadoop classpath, but I found a GCP command, gcloud dataproc jobs submit hadoop, and now I'm able to submit the job and get the following:

gcloud dataproc jobs submit hadoop --cluster tony-dev \
>   --class com.linkedin.tony.cli.ClusterSubmitter \
>    --jars file:///usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar -- \
>   --executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
>   --python_venv=/usr/local/src/deps/venv/venv.zip \
>   --src_dir=/usr/local/src/jobs/MyJob/ \
>   --conf_file=/usr/local/src/jobs/MyJob/tony.xml \
>   --python_binary_path=tony/bin/python3.5
Job [9db641675f914f65baacb2098537d687] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/12 06:44:04 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
18/11/12 06:44:05 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 9db641675f914f65baacb2098537d687-conf.xml, /etc/hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml
18/11/12 06:44:05 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: hadoop2-1.9.8
18/11/12 06:44:05 INFO cli.ClusterSubmitter: Copying /usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-dev-m/user/root/.tony/1661f515-4711-4030-9a82-6da5c16fb363
18/11/12 06:44:06 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/12 06:44:06 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/12 06:44:06 INFO tony.TonyClient: Starting client..
18/11/12 06:44:06 INFO client.RMProxy: Connecting to ResourceManager at tony-dev-m/10.128.0.2:8032
18/11/12 06:44:06 INFO client.AHSProxy: Connecting to Application History server at tony-dev-m/10.128.0.2:10200
18/11/12 06:44:17 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path tony/bin/python3.5 --python_venv /usr/local/src/deps/venv/venv.zip --executes /usr/local/src/jobs/MyJob/src/mnist_distributed.py --hdfs_classpath hdfs://tony-dev-m/user/root/.tony/1661f515-4711-4030-9a82-6da5c16fb363 --container_env TONY_CONF_PATH=hdfs://tony-dev-m/user/root/.tony/application_1541371308113_0012/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1542005057940 --container_env TONY_CONF_LENGTH=3523 --container_env TONY_ZIP_PATH=hdfs://tony-dev-m/user/root/.tony/application_1541371308113_0012/tony.zip --container_env TONY_ZIP_LENGTH=232431960 --container_env TONY_ZIP_TIMESTAMP=1542005057881 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*' 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
18/11/12 06:44:17 INFO tony.TonyClient: Submitting YARN application
18/11/12 06:44:18 INFO impl.YarnClientImpl: Submitted application application_1541371308113_0012
18/11/12 06:44:18 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://tony-dev-m:8088/proxy/application_1541371308113_0012/
18/11/12 06:44:18 INFO tony.TonyClient: ResourceManager web address for application: http://tony-dev-m:8088/cluster/app/application_1541371308113_0012
18/11/12 06:44:21 INFO tony.TonyClient: Application 12 finished with YarnState=FAILED, DSFinalStatus=FAILED, breaking monitoring loop.
18/11/12 06:44:21 ERROR tony.TonyClient: Application failed to complete successfully
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [9db641675f914f65baacb2098537d687] entered state [ERROR] while waiting for [DONE].

Not sure which logs to collect; any suggestions? Thanks again.

@gogasca
Contributor

gogasca commented Nov 12, 2018

yarn application -list -appStates ALL

18/11/12 07:19:40 INFO client.RMProxy: Connecting to ResourceManager at tony-dev-m/10.128.0.2:8032
18/11/12 07:19:40 INFO client.AHSProxy: Connecting to Application History server at tony-dev-m/10.128.0.2:10200
Total number of applications (application-types: [], states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED] and tags: []):6
                Application-Id      Application-Name    Application-Type    User    Queue      State       Final-State    Progress    Tracking-URL
application_1541371308113_0001      hello-world.py             SPARK        root    default    FINISHED    SUCCEEDED      100%        tony-dev-m:18080/history/application_1541371308113_0001/1
application_1541371308113_0013      TensorFlowApplication      TENSORFLOW   root    default    FAILED      FAILED         0%          http://tony-dev-m:8188/applicationhistory/app/application_1541371308113_0013
application_1541371308113_0012      TensorFlowApplication      TENSORFLOW   root    default    FAILED      FAILED         0%          http://tony-dev-m:8188/applicationhistory/app/application_1541371308113_0012
application_1541371308113_0014      TensorFlowApplication      TENSORFLOW   root    default    FAILED      FAILED         0%          http://tony-dev-m:8188/applicationhistory/app/application_1541371308113_0014
application_1541371308113_0011      TensorFlowApplication      TENSORFLOW   root    default    FAILED      FAILED         0%          http://tony-dev-m:8188/applicationhistory/app/application_1541371308113_0011
application_1541371308113_0005      word count                 MAPREDUCE    root    default    FINISHED    SUCCEEDED      100%        http://tony-dev-m:19888/jobhistory/job/job_1541371308113_0005

@oliverhu
Member

@gogasca could you check yarn logs -applicationId YOUR_APPLICATION_ID?

@gogasca
Contributor

gogasca commented Nov 12, 2018

Log information:

sudo yarn logs -applicationId application_1541371308113_0005

18/11/12 20:07:35 INFO client.RMProxy: Connecting to ResourceManager at tony-dev-m/10.128.0.2:8032
18/11/12 20:07:36 INFO client.AHSProxy: Connecting to Application History server at tony-dev-m/10.128.0.2:10200
File /yarn-logs/root/logs/application_1541371308113_0005 does not exist.
Can not find any log file matching the pattern: [ALL] for the application: application_1541371308113_0005
Can not find the logs for the application: application_1541371308113_0005 with the appOwner: root

Not much above, but I found some information here:

more /hadoop/yarn/system/rmstore/FSRMStateRoot/RMAppRoot/application_1541371308113_0011/application_1541371308113_0011

[mostly binary protobuf content; the readable fragments are the AM launch command and the failure diagnostics below]

{{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path tony/bin/python3.5 --python_venv /usr/local/src/deps/venv/venv.zip --executes /usr/local/src/jobs/MyJob/src/mnist_distributed.py --hdfs_classpath hdfs://tony-dev-m/user/root/.tony/9c664a77-acb7-43e2-86ff-860e24df68a2 --container_env TONY_CONF_PATH=hdfs://tony-dev-m/user/root/.tony/application_1541371308113_0011/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1542004792545 --container_env TONY_CONF_LENGTH=3523 --container_env TONY_ZIP_PATH=hdfs://tony-dev-m/user/root/.tony/application_1541371308113_0011/tony.zip --container_env TONY_ZIP_LENGTH=232431960 --container_env TONY_ZIP_TIMESTAMP=1542004792494 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*' 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log

Application application_1541371308113_0011 failed 2 times due to AM Container for appattempt_1541371308113_0011_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2018-11-12 06:39:55.263]Exception from container-launch.
Container id: container_1541371308113_0011_02_000001
Exit code: 1
[2018-11-12 06:39:55.263]
[2018-11-12 06:39:55.264]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: -c: line 0: unexpected EOF while looking for matching `''
/bin/bash: -c: line 1: syntax error: unexpected end of file
[2018-11-12 06:39:55.264]
[2018-11-12 06:39:55.265]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: -c: line 0: unexpected EOF while looking for matching `''
/bin/bash: -c: line 1: syntax error: unexpected end of file
[2018-11-12 06:39:55.265]
For more detailed output, check the application tracking page: http://tony-dev-m:8188/applicationhistory/app/application_1541371308113_0011 Then click on links to logs of each attempt.
. Failing the application.

I will try to access the web page to get more details.

@oliverhu
Member

oliverhu commented Nov 12, 2018

@gogasca the problem looks obvious from the log :) unexpected EOF while looking for matching `''

@oliverhu
Member

It seems for some reason there is an extra ' appended to your command:

Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path tony/bin/python3.5 --python_venv /usr/local/src/deps/venv/venv.zip --executes /usr/local/src/jobs/MyJob/src/mnist_distributed.py --hdfs_classpath hdfs://tony-dev-m/user/root/.tony/1661f515-4711-4030-9a82-6da5c16fb363 --container_env TONY_CONF_PATH=hdfs://tony-dev-m/user/root/.tony/application_1541371308113_0012/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1542005057940 --container_env TONY_CONF_LENGTH=3523 --container_env TONY_ZIP_PATH=hdfs://tony-dev-m/user/root/.tony/application_1541371308113_0012/tony.zip --container_env TONY_ZIP_LENGTH=232431960 --container_env TONY_ZIP_TIMESTAMP=1542005057881 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*' 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log

There is an extra **'** after `HADOOP_YARN_HOME/lib`

@gogasca
Contributor

gogasca commented Nov 13, 2018

Thanks, I was able to edit:

yarn-site.xml

And modify:

<property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*'</value>
</property>

As you can see, there was a ' by default. I have contacted the Dataproc team to find out if it's a known issue. The job is running now. Will update the thread later. Thanks again.
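For reference, the corrected property, with only the stray trailing quote removed, should look like this:

<property>
    <name>yarn.application.classpath</name>
    <value>$HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*</value>
</property>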

@oliverhu
Member

Lol, np. It is never easy to make a job run the first time. Let us know if you see more issues.

@gogasca
Contributor

gogasca commented Nov 14, 2018

Just for the record, my memory configuration was incorrect:

  • Master and workers have 15.0 GB of RAM: n1-standard-4 (4 vCPU, 15.0 GB memory).
  • tony.xml had 15.0 GB (15360 MB) configured under tony.worker.memory.
  • In the master's yarn-site.xml, the parameters yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb were limited to 12288 MB (12.0 GB).

I reduced the value in tony.xml and increased the value in yarn-site.xml, and the task seems to start now. (Will fine-tune it later.)

2018-11-14 06:55:13 INFO  TonyApplicationMaster:943 - Error org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=15360, maxMemory=12288
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:262)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:227)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:243)
        at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:246)
        at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:214)
        at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:388)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
        at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:868)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:814)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2603)

Now I see the following:

SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/44/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Nov 14, 2018 8:09:38 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
End of LogType:amstderr.log.This log file belongs to a running container (container_1541371308113_0032_01_000001) and so may not be complete.
*****************************************************************************
End of LogType:prelaunch.err.This log file belongs to a running container (container_1541371308113_0032_01_000001) and so may not be complete.
******************************************************************************
Container: container_1541371308113_0032_01_000001 on tony-dev-w-0.c.dpe-cloud-mle.internal:33777
LogAggregationType: LOCAL
================================================================================================
LogType:amstdout.log
LogLastModifiedTime:Wed Nov 14 08:09:54 +0000 2018
LogLength:3165
LogContents:
2018-11-14 08:09:36 INFO  Utils:134 - Unzipping tony.zip to destination ./
2018-11-14 08:09:37 INFO  TonyApplicationMaster:353 - Preparing application master..
2018-11-14 08:09:38 INFO  TonyApplicationMaster:373 - RPC server running at: tony-dev-w-0.c.dpe-cloud-mle.internal:14342
2018-11-14 08:09:38 INFO  TonyApplicationMaster:275 - Final task command: 'venv/tony/bin/python3.5 /usr/local/src/MyJob/src/mnist_distributed.py'
2018-11-14 08:09:38 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 08:09:39 INFO  TonySession:218 - Job ps with allocationRequestId 0
2018-11-14 08:09:39 INFO  TonyApplicationMaster:817 - Requested container ask: Capability[<memory:12288, vCores:1>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-11-14 08:09:39 INFO  TonySession:218 - Job worker with allocationRequestId 1
2018-11-14 08:09:39 INFO  TonyApplicationMaster:817 - Requested container ask: Capability[<memory:12288, vCores:1>]Priority[1]AllocationRequestId[1]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-11-14 08:09:39 INFO  TonySession:218 - Job worker with allocationRequestId 2
2018-11-14 08:09:39 INFO  TonyApplicationMaster:817 - Requested container ask: Capability[<memory:12288, vCores:1>]Priority[1]AllocationRequestId[2]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-11-14 08:09:39 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-14 08:09:39 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 08:09:40 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 08:09:41 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 08:09:42 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 08:09:43 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 08:09:44 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2

This has been sitting for 30+ minutes now.
When I run manually:

venv/tony/bin/python3.5 /usr/local/src/MyJob/src/mnist_distributed.py

I get:

(tony) gogasca@tony-dev-m:/usr/local/src/deps/venv$  tony/bin/python3.5 ../../jobs/MyJob/src/mnist_distributed.py 
Traceback (most recent call last):
  File "../../jobs/MyJob/src/mnist_distributed.py", line 246, in <module>
    tf.app.run()
  File "/usr/local/src/deps/venv/tony/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "../../jobs/MyJob/src/mnist_distributed.py", line 190, in main
    cluster_spec_str = os.environ["CLUSTER_SPEC"]
  File "/usr/local/src/deps/venv/tony/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None
KeyError: 'CLUSTER_SPEC'
(tony) gogasca@tony-dev-m:/usr/local/src/deps/venv$ tony/bin/python3.5 -c 'import tensorflow as tf; print(tf.__version__)'
1.9.0

Not sure what would be the expected result now.

@gogasca
Contributor

gogasca commented Nov 14, 2018

I re-ran the job with the gcloud command, as the job status was not appearing in the Dataproc dashboard.

sudo yarn logs -applicationId application_1541371308113_0033

When I run the command above, I still see the same results: the job keeps running and just displays the following. It looks like it is stuck at:

2018-11-14 17:43:57 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:43:58 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:43:59 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:00 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:01 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:01 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
18/11/14 18:02:33 INFO client.RMProxy: Connecting to ResourceManager at tony-dev-m/10.128.0.2:8032
18/11/14 18:02:33 INFO client.AHSProxy: Connecting to Application History server at tony-dev-m/10.128.0.2:10200
Container: container_1541371308113_0033_01_000001 on tony-dev-w-0.c.dpe-cloud-mle.internal:33777
LogAggregationType: LOCAL
================================================================================================
LogType:amstderr.log
LogLastModifiedTime:Wed Nov 14 17:43:55 +0000 2018
LogLength:601
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/47/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Nov 14, 2018 5:43:55 PM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
End of LogType:amstderr.log.This log file belongs to a running container (container_1541371308113_0033_01_000001) and so may not be complete.
*****************************************************************************
End of LogType:prelaunch.err.This log file belongs to a running container (container_1541371308113_0033_01_000001) and so may not be complete.
******************************************************************************
Container: container_1541371308113_0033_01_000001 on tony-dev-w-0.c.dpe-cloud-mle.internal:33777
LogAggregationType: LOCAL
================================================================================================
LogType:amstdout.log
LogLastModifiedTime:Wed Nov 14 18:02:33 +0000 2018
LogLength:115866
LogContents:
2018-11-14 17:43:53 INFO  Utils:134 - Unzipping tony.zip to destination ./
2018-11-14 17:43:55 INFO  TonyApplicationMaster:353 - Preparing application master..
2018-11-14 17:43:55 INFO  TonyApplicationMaster:373 - RPC server running at: tony-dev-w-0.c.dpe-cloud-mle.internal:12635
2018-11-14 17:43:55 INFO  TonyApplicationMaster:275 - Final task command: 'venv/tony/bin/python3.5 /usr/local/src/jobs/MyJob/src/mnist_distributed.py'
2018-11-14 17:43:56 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:43:56 INFO  TonySession:218 - Job ps with allocationRequestId 0
2018-11-14 17:43:56 INFO  TonyApplicationMaster:817 - Requested container ask: Capability[<memory:12288, vCores:1>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-11-14 17:43:56 INFO  TonySession:218 - Job worker with allocationRequestId 1
2018-11-14 17:43:56 INFO  TonyApplicationMaster:817 - Requested container ask: Capability[<memory:12288, vCores:1>]Priority[1]AllocationRequestId[1]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-11-14 17:43:56 INFO  TonySession:218 - Job worker with allocationRequestId 2
2018-11-14 17:43:56 INFO  TonyApplicationMaster:817 - Requested container ask: Capability[<memory:12288, vCores:1>]Priority[1]AllocationRequestId[2]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-11-14 17:43:56 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-14 17:43:57 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:43:58 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:43:59 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:00 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:01 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:01 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-14 17:44:02 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:03 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:04 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:05 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:06 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 17:44:06 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2

@oliverhu
Member

@gogasca 2018-11-14 17:44:06 INFO TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2 means the tasks have started on the worker nodes and the AM is waiting for the worker nodes to finish training.

The logs from the workers won't be shipped to the AM log. Can you check the logs of the other containers for this training job?

You can't directly run mnist_distributed.py because it assumes a CLUSTER_SPEC environment variable is set in the OS environment.

@gogasca
Contributor

gogasca commented Nov 14, 2018

I looked into the workers and saw this:

root@tony-dev-w-0:/var/log/hadoop-yarn/userlogs/application_1541371308113_0033/container_1541371308113_0033_01_000001# ls -alh
total 1.5M
drwx--x--- 2 yarn yarn 4.0K Nov 14 17:43 .
drwx--x--- 3 yarn yarn 4.0K Nov 14 17:43 ..
-rw-r--r-- 1 yarn yarn  601 Nov 14 17:43 amstderr.log
-rw-r--r-- 1 yarn yarn 1.5M Nov 14 21:47 amstdout.log
-rw-r--r-- 1 yarn yarn    0 Nov 14 17:43 prelaunch.err
-rw-r--r-- 1 yarn yarn   70 Nov 14 17:43 prelaunch.out
root@tony-dev-w-0:/var/log/hadoop-yarn/userlogs/application_1541371308113_0033/container_1541371308113_0033_01_000001# tail -f amstdout.log 
2018-11-14 21:47:45 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:46 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:47 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:47 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-14 21:47:48 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:49 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:50 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:51 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:52 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:52 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-14 21:47:53 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:54 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:55 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:56 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:57 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-14 21:47:57 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-14 21:47:58 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!

Not sure if I'm looking at the right logs.
The job still shows as running after 4+ hours:

[screenshot]

@oliverhu
Member

@gogasca there should be another two workers:

container_1541371308113_0033_01_000002 and container_1541371308113_0033_01_000003

@gogasca
Contributor

gogasca commented Nov 14, 2018

This is what I see in one of the workers; only 1 container appears. Not sure if I'm looking at the right place:

gogasca@tony-dev-w-0:~$ sudo -i
root@tony-dev-w-0:~# ls -al /var/log/hadoop-yarn/userlogs/application_1541371308113_0033
total 12
drwx--x--- 3 yarn yarn 4096 Nov 14 17:43 .
drwxr-xr-x 5 yarn yarn 4096 Nov 14 22:54 ..
drwx--x--- 2 yarn yarn 4096 Nov 14 17:43 container_1541371308113_0033_01_000001
root@tony-dev-w-0:~# 
root@tony-dev-w-0:~# ls -al /var/log/hadoop-yarn/userlogs/application_1541371308113_0033/container_1541371308113_0033_01_000001/
total 1896
drwx--x--- 2 yarn yarn    4096 Nov 14 17:43 .
drwx--x--- 3 yarn yarn    4096 Nov 14 17:43 ..
-rw-r--r-- 1 yarn yarn     601 Nov 14 17:43 amstderr.log
-rw-r--r-- 1 yarn yarn 1918941 Nov 14 22:56 amstdout.log
-rw-r--r-- 1 yarn yarn       0 Nov 14 17:43 prelaunch.err
-rw-r--r-- 1 yarn yarn      70 Nov 14 17:43 prelaunch.out
root@tony-dev-w-0:~# find / -name "container_1541371308113_0033*"
/var/log/hadoop-yarn/userlogs/application_1541371308113_0033/container_1541371308113_0033_01_000001
/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1541371308113_0033/container_1541371308113_0033_01_000001
/hadoop/yarn/nm-local-dir/nmPrivate/application_1541371308113_0033/container_1541371308113_0033_01_000001
/hadoop/yarn/nm-local-dir/nmPrivate/application_1541371308113_0033/container_1541371308113_0033_01_000001/container_1541371308113_0033_01_000001.tokens
/hadoop/yarn/nm-local-dir/nmPrivate/application_1541371308113_0033/container_1541371308113_0033_01_000001/container_1541371308113_0033_01_000001.pid

@gogasca
Contributor

gogasca commented Nov 15, 2018

I opened an internal ticket to dig into this. Will keep this thread updated.

@oliverhu
Member

@gogasca do you have access to the ResourceManager's UI? If the worker jobs are scheduled, you should be able to see additional containers on the UI.

I'm not sure if you are running on a single node or multiple nodes; if you have more than one machine, the logs might be on other nodes.

@gogasca
Contributor

gogasca commented Nov 16, 2018

@oliverhu just set it up. Thanks for the tip!
I have one master node and 2 workers.

  • tony-dev-m
  • tony-dev-w-1
  • tony-dev-w-0

[screenshot]

I ended up killing the existing jobs.

I tried with a new fresh job:

gcloud dataproc jobs submit hadoop --cluster tony-dev \
 --class com.linkedin.tony.cli.ClusterSubmitter \
 --jars file:///usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar \
 -- \
 --python_venv=/usr/local/src/deps/venv/venv.zip \
 --src_dir=/usr/local/src/jobs/MyJob/ \
 --executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
 --conf_file=/usr/local/src/jobs/MyJob/tony.xml \
 --python_binary_path=venv/bin/python3.5

[screenshot]

It did generate 2 containers:

[screenshot]

Container 1: [screenshot]

Container 2: [screenshot]

In this container I found the following error in logs:

2018-11-16 08:31:10 INFO  TaskExecutor:89 - TaskExecutor is running..
2018-11-16 08:31:10 INFO  TaskExecutor:83 - Reserved rpcPort: 38173
2018-11-16 08:31:10 INFO  TaskExecutor:84 - Reserved tbPort: 34039
2018-11-16 08:31:10 INFO  TaskExecutor:85 - Reserved py4j gatewayServerPort: 45559
2018-11-16 08:31:11 INFO  TaskExecutor:175 - Task command: venv/venv/bin/python3.5 /usr/local/src/jobs/MyJob/src/mnist_distributed.py
2018-11-16 08:31:11 INFO  Utils:134 - Unzipping tony.zip to destination ./
2018-11-16 08:31:12 INFO  TaskExecutor:184 - Setting up Rpc client, connecting to: tony-dev-w-1.c.dpe-cloud-mle.internal:10244
2018-11-16 08:31:13 INFO  TaskExecutor:102 - Unpacking Python virtual environment: /usr/local/src/deps/venv/venv.zip
2018-11-16 08:31:13 INFO  Utils:134 - Unzipping /usr/local/src/deps/venv/venv.zip to destination venv
2018-11-16 08:31:13 FATAL Utils:139 - Failed to unzip /usr/local/src/deps/venv/venv.zip
net.lingala.zip4j.exception.ZipException: zip file does not exist
	at net.lingala.zip4j.core.ZipFile.readZipInfo(ZipFile.java:407)
	at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:475)
	at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:451)
	at com.linkedin.tony.Utils.unzipArchive(Utils.java:137)
	at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:103)
2018-11-16 08:31:13 INFO  TaskExecutor:113 - Executor is running task ps 0
2018-11-16 08:31:13 INFO  TaskExecutor:190 - Application Master address : tony-dev-w-1.c.dpe-cloud-mle.internal:10244
2018-11-16 08:31:13 INFO  TaskExecutor:193 - ContainerId is: container_1541371308113_0035_01_000003 HostName is: tony-dev-w-0.c.dpe-cloud-mle.internal
2018-11-16 08:31:13 INFO  TaskExecutor:201 - Connecting to tony-dev-w-1.c.dpe-cloud-mle.internal:10244 to register worker spec: ps 0 tony-dev-w-0.c.dpe-cloud-mle.internal:38173

/usr/local/src/deps/venv/venv.zip only exists on my master node (tony-dev-m)

root@tony-dev-m:/usr/local/src/deps/venv# ls -al /usr/local/src/deps/venv/venv.zip
-rw-r--r-- 1 root staff 103319485 Nov 11 07:29 /usr/local/src/deps/venv/venv.zip

Questions:

  1. What is the right file path I should pass for my virtual env? Should I manually copy the file to my workers?
  2. Shouldn't the job fail after failing to unzip the virtual env, instead of sitting in "Running"?

Thanks again

@oliverhu
Member

Ah, I see. It seems to be a usability issue in TonY: the code assumes you have venv.zip in the folder you're submitting the job from, so basically you can only use a relative path:

--python_venv venv.zip

We're going to fix this today.

For the second question, the job should fail instead of sitting in "Running". For the container log you pasted, is that all of the logs there? Could you paste the log for the other worker as well?

@gogasca
Contributor

gogasca commented Nov 16, 2018

Oh I understand now, the job was launched from http://tony-dev-w-1.c.dpe-cloud-mle.internal:8042

Logs below:

worker0 (tony-dev-w-0.c.dpe-cloud-mle.internal)

2018-11-16 08:31:10 INFO  TaskExecutor:89 - TaskExecutor is running..
2018-11-16 08:31:10 INFO  TaskExecutor:83 - Reserved rpcPort: 38173
2018-11-16 08:31:10 INFO  TaskExecutor:84 - Reserved tbPort: 34039
2018-11-16 08:31:10 INFO  TaskExecutor:85 - Reserved py4j gatewayServerPort: 45559
2018-11-16 08:31:11 INFO  TaskExecutor:175 - Task command: venv/venv/bin/python3.5 /usr/local/src/jobs/MyJob/src/mnist_distributed.py
2018-11-16 08:31:11 INFO  Utils:134 - Unzipping tony.zip to destination ./
2018-11-16 08:31:12 INFO  TaskExecutor:184 - Setting up Rpc client, connecting to: tony-dev-w-1.c.dpe-cloud-mle.internal:10244
2018-11-16 08:31:13 INFO  TaskExecutor:102 - Unpacking Python virtual environment: /usr/local/src/deps/venv/venv.zip
2018-11-16 08:31:13 INFO  Utils:134 - Unzipping /usr/local/src/deps/venv/venv.zip to destination venv
2018-11-16 08:31:13 FATAL Utils:139 - Failed to unzip /usr/local/src/deps/venv/venv.zip
net.lingala.zip4j.exception.ZipException: zip file does not exist
	at net.lingala.zip4j.core.ZipFile.readZipInfo(ZipFile.java:407)
	at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:475)
	at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:451)
	at com.linkedin.tony.Utils.unzipArchive(Utils.java:137)
	at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:103)
2018-11-16 08:31:13 INFO  TaskExecutor:113 - Executor is running task ps 0
2018-11-16 08:31:13 INFO  TaskExecutor:190 - Application Master address : tony-dev-w-1.c.dpe-cloud-mle.internal:10244
2018-11-16 08:31:13 INFO  TaskExecutor:193 - ContainerId is: container_1541371308113_0035_01_000003 HostName is: tony-dev-w-0.c.dpe-cloud-mle.internal
2018-11-16 08:31:13 INFO  TaskExecutor:201 - Connecting to tony-dev-w-1.c.dpe-cloud-mle.internal:10244 to register worker spec: ps 0 tony-dev-w-0.c.dpe-cloud-mle.internal:38173

worker1 (tony-dev-w-1.c.dpe-cloud-mle.internal)

2018-11-16 20:28:35 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-16 20:28:35 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:36 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:37 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:38 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:39 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:40 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-16 20:28:40 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:41 INFO  TonyApplicationMaster:728 - Received registrations from 1 tasks, awaiting registration from 2 tasks.
2018-11-16 20:28:41 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:42 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:43 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:44 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:45 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-16 20:28:45 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:46 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:47 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:48 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:49 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:50 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-16 20:28:50 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:51 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:52 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:53 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:54 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:55 INFO  TonyApplicationMaster:507 - Completed worker tasks: 0, total worker tasks: 2
2018-11-16 20:28:55 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:56 INFO  TonyApplicationMaster:728 - Received registrations from 1 tasks, awaiting registration from 2 tasks.
2018-11-16 20:28:56 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:57 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!
2018-11-16 20:28:58 INFO  TonyApplicationMaster:660 - Client requesting TaskUrls!

Complete logs.zip attached.

@oliverhu
Member

@gogasca if you take a look at the screenshot you pasted:
[screenshot]

It seems your cluster failed to allocate more containers for you. Outstanding Resource Requests means the cluster can't allocate that much resource for your job.

@gogasca
Contributor

gogasca commented Nov 17, 2018

Thanks @oliverhu

The TensorFlow job is now running!

  • Master and workers have 15.0 GB of RAM: n1-standard-4 (4 vCPU, 15.0 GB memory).
  • In tony.xml I changed tony.worker.memory to 4 GB (see the snippet after this list).
  • In the master's yarn-site.xml, the parameters yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb are set to 12288 MB (12.0 GB).
  1. I manually moved /usr/local/src/deps/venv/venv.zip and /usr/local/src/jobs/MyJob/src/mnist_distributed.py to each worker. (Ideally I would like to refer to a common Google Cloud bucket or file:///)
  2. I changed the permissions for my output directories (will test writing to a GCS bucket).
  3. I got: 2018-11-17 03:23:19 WARN TonyApplicationMaster:782 - This Hadoop version doesn't have the YARN-7974 patch, TonY won't register TensorBoard URL with application's tracking URL. I guess Hadoop 3.1+ is ideal?
  4. Will continue to fine-tune memory.
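For reference, a minimal sketch of the adjusted worker memory property in tony.xml, assuming the 4 GB value above is written in the same format as the original 15g setting:

<property>
    <name>tony.worker.memory</name>
    <value>4g</value>
</property>

This keeps the per-worker request under the 12288 MB YARN maximum, so the container ask can actually be satisfied.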

Thank you very much for helping me solve all these issues.

@oliverhu
Member

Greaaaat! I'm so glad we made it work together on GCP!!

  1. You won't have to copy venv.zip to the workers once #89 (TonY assumes python_venv.zip at root folder) is merged. The src/ folder should work regardless of #89; it works with an absolute path.

  2. Permission-wise, the user who submits the job must have permission to write to the output directory. Alternatively, you can impersonate another user.

  3. That is actually an internal patch we used to make TensorBoard accessible with a click. It is a nice-to-have, and it will only be in Hadoop 2.10 (a future release) and Hadoop 3.1.2. Nonetheless, in order to have GPU isolation, it is recommended to use Hadoop 3.1.1 (unless you backported the GPU features from 3.x like we did).

Let us know if you have more issues!

@gogasca
Contributor

gogasca commented Nov 19, 2018

Yes!
Thanks for the tips.
I was preparing an installation guide, installed a new cluster using the latest TonY code today at ~4:00 pm PST, and ran the following command:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars file:///usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar -- \
--python_venv=/usr/local/src/jobs/MyJob/env/tf19.zip \
--src_dir=/usr/local/src/jobs/MyJob/src \
--executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
--task_params='--data_dir /tmp/data/ --working_dir /tmp/output' \
--conf_file=/usr/local/src/jobs/MyJob/tony.xml \
--python_binary_path=tf19/bin/python3.5

I keep getting this error:

Job [9689dc686f264c79be767270fa242a78] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/19 06:10:52 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
18/11/19 06:10:52 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 9689dc686f264c79be767270fa242a78-conf.xml, hdfs-default.xml, hdfs-site.xml, /etc/hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml
18/11/19 06:10:52 INFO cli.ClusterSubmitter: Copying /usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-staging-m/user/root/.tony/d8a5a4de-e9f1-40d4-ac7d-3f450b1e2c1f
18/11/19 06:10:53 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/19 06:10:53 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/19 06:10:53 INFO tony.TonyClient: Starting client..
18/11/19 06:10:53 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/11/19 06:10:53 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:451)
        at org.apache.hadoop.io.Text.set(Text.java:198)
        at org.apache.hadoop.io.Text.<init>(Text.java:88)
        at com.linkedin.tony.TonyClient.getTokens(TonyClient.java:556)
        at com.linkedin.tony.TonyClient.run(TonyClient.java:191)
        at com.linkedin.tony.TonyClient.start(TonyClient.java:686)
        at com.linkedin.tony.cli.ClusterSubmitter.submit(ClusterSubmitter.java:72)
        at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:85)
        ... 5 more
ERROR: (gcloud.dataproc.jobs.submit.hadoop) Job [9689dc686f264c79be767270fa242a78] entered state [ERROR] while waiting for [DONE].

This is my folder structure on the master, worker0 and worker1.

/usr/local/src/jobs/MyJob/

.
├── env
│   └── tf19.zip
├── src
│   └── mnist_distributed.py
└── tony-cli-0.1.5-all.jar

This is a new DataProc cluster I installed from scratch.

Hadoop 2.9.0
Subversion https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r e8ce80c37eebb173fc688e7f5686d7df74d182aa
Compiled by bigtop on 2018-10-25T12:56Z
Compiled with protoc 2.5.0
From source with checksum 1eb388d554db8e1cadcab4c1326ee72
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar

@gogasca
Contributor

gogasca commented Nov 19, 2018

My other Dataproc cluster seems to run the TF job without issue using the previous code.

@oliverhu
Member

@gogasca I don't see a tony.xml in your folder structure. You need to disable secure cluster in tony.xml.
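For reference, this is the property from the tony.xml posted earlier in this thread that disables security:

<property>
  <name>tony.application.security.enabled</name>
  <value>false</value>
</property>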

@gogasca
Contributor

gogasca commented Nov 19, 2018

Thanks, that was the issue. :)

@gogasca
Contributor

gogasca commented Nov 19, 2018

I created a guide to install TonY on GCP. I would like to share it here if possible; what would be the best location? Thanks

@oliverhu
Member

That's AWESOME! My original thought was to add this to the wiki page, but I don't think you'd have access to that, and it is hard to review. How about this: in the TonY Example part, add a section Running TonY on GCP, and then create another entry under the tony-example folder.

TonY Example
Below are examples to run distributed deep learning jobs with TonY:
> Distributed MNIST with TensorFlow
> Distributed MNIST with PyTorch

You can also run this on Google Cloud Platform:
> Distributed MNIST with TensorFlow on GCP

@gogasca
Contributor

gogasca commented Nov 20, 2018

LGTM, thanks @oliverhu

@spicyramen changed the title from "Failed to run mnist example in GCP" to "Failed to run TF mnist example in GCP" on Nov 20, 2018
@dwu15

dwu15 commented Dec 11, 2018

I'm running into an issue trying to get the example in GCP working. Anyone have any pointers to debug?

SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/tmp/8db518ab428a46e0affe15098e4e2134/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/11 18:35:57 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
18/12/11 18:35:57 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 8db518ab428a46e0affe15098e4e2134-conf.xml, hdfs-default.xml, hdfs-site.xml, /etc/hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml
18/12/11 18:35:58 INFO cli.ClusterSubmitter: Copying /tmp/8db518ab428a46e0affe15098e4e2134/tony-cli-0.1.5-all.jar to: hdfs://tony-staging-m/user/root/.tony/842fbdca-b10a-4487-b9ed-a7e1c4dee7b3
18/12/11 18:35:59 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/12/11 18:35:59 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/12/11 18:35:59 INFO tony.TonyClient: Starting client..
18/12/11 18:35:59 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.128.0.3:8032
18/12/11 18:35:59 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.128.0.3:10200
18/12/11 18:35:59 INFO tony.TonyClient: Completed setting up Application Master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --task_params '--data_dir /tmp/data/ --working_dir /tmp/output' --executes unzip tf19.zip && tf19/bin/python3.5 mnist_distributed.py --hdfs_classpath hdfs://tony-staging-m/user/root/.tony/842fbdca-b10a-4487-b9ed-a7e1c4dee7b3 --container_env TONY_CONF_PATH=hdfs://tony-staging-m/user/root/.tony/application_1544551109758_0001/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1544553359849 --container_env TONY_CONF_LENGTH=182629 --container_env TONY_ZIP_PATH=hdfs://tony-staging-m/user/root/.tony/application_1544551109758_0001/tony.zip --container_env TONY_ZIP_LENGTH=3782 --container_env TONY_ZIP_TIMESTAMP=1544553359792 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/* 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
18/12/11 18:35:59 INFO tony.TonyClient: Submitting YARN application
18/12/11 18:36:00 INFO impl.YarnClientImpl: Submitted application application_1544551109758_0001
18/12/11 18:36:00 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://tony-staging-m:8088/proxy/application_1544551109758_0001/
18/12/11 18:36:00 INFO tony.TonyClient: ResourceManager web address for application: http://tony-staging-m:8088/cluster/app/application_1544551109758_0001
18/12/11 18:36:07 INFO tony.TonyClient: AM host: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal
18/12/11 18:36:07 INFO tony.TonyClient: AM RPC port: 12169
18/12/11 18:36:07 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.128.0.3:8032
18/12/11 18:36:07 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.128.0.3:10200
18/12/11 18:36:08 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:48:31 INFO retry.RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "tony-staging-m.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.3"; destination host is: "tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal":12169; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException, while invoking TensorFlowClusterPBClientImpl.getTaskUrls over null. Retrying after sleeping for 30000ms.
18/12/11 18:49:02 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:03 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:04 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:05 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:06 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:07 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:08 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:09 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:10 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:11 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:49:11 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From tony-staging-m.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.3 to tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:12169 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking TensorFlowClusterPBClientImpl.getTaskUrls over null. Retrying after sleeping for 30000ms.
18/12/11 18:49:42 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 0 time(s); retry policy is 
18/12/11 18:49:51 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From tony-staging-m.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.3 to tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:12169 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking TensorFlowClusterPBClientImpl.getTaskUrls over null. Retrying after sleeping for 30000ms.
18/12/11 18:50:31 INFO ipc.Client: Retrying connect to server: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.4:12169. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/12/11 18:50:31 ERROR tony.TonyClient: Failed to run TonyClient
java.net.ConnectException: Call From tony-staging-m.us-central1-f.c.sodium-hangar-225217.internal/10.128.0.3 to tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:12169 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:754)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1500)
	at org.apache.hadoop.ipc.Client.call(Client.java:1442)
	at org.apache.hadoop.ipc.Client.call(Client.java:1352)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy16.getTaskUrls(Unknown Source)
	at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskUrls(TensorFlowClusterPBClientImpl.java:75)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy17.getTaskUrls(Unknown Source)
	at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskUrls(ApplicationRpcClient.java:109)
	at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:647)
	at com.linkedin.tony.TonyClient.run(TonyClient.java:207)
	at com.linkedin.tony.TonyClient.start(TonyClient.java:721)
	at com.linkedin.tony.cli.ClusterSubmitter.submit(ClusterSubmitter.java:72)
	at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:85)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:690)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:793)
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
	at org.apache.hadoop.ipc.Client.call(Client.java:1388)
	... 25 more
18/12/11 18:50:31 ERROR tony.TonyClient: Application failed to complete successfully

@pdtran3k6
Copy link
Contributor

> I'm running into an issue trying to get the example in GCP working. Anyone have any pointers to debug?
>
> [full log quoted from the comment above]

Could you give us some more info (the command you ran, Hadoop version, etc.) so we can reproduce the bug?

@gogasca
Copy link
Contributor

gogasca commented Dec 11, 2018

@dwu15 Have you tried editing /etc/hadoop/conf.empty/yarn-site.xml and removing the stray ' at the end? I remember seeing EOF errors from that. We also published an installation guide: https://github.com/linkedin/TonY/tree/master/tony-examples/tony-in-gcp. Let us know if you still encounter issues.
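For anyone hitting the same EOF error, a quick way to check the end of that file is sketched below (the config path may differ on your cluster, and xmllint is only available if libxml2-utils is installed):

# Look at the tail of the file for a stray quote after </configuration>
tail -n 3 /etc/hadoop/conf.empty/yarn-site.xml

# Optionally verify the file is still well-formed XML
xmllint --noout /etc/hadoop/conf.empty/yarn-site.xml && echo "yarn-site.xml is well-formed"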

@dwu15
Copy link

dwu15 commented Dec 11, 2018

Thanks for the response @gogasca , @pdtran3k6

Yes, I'm currently following the installation guide on a fresh cluster on GCP. I've followed that guide more or less word for word and am still getting these errors.

Here is the command I ended up running in a GCP Cloud Shell for this project:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars file:///usr/local/src/jobs/TFJob/tony-cli-0.1.5-all.jar  -- \
--python_venv=/usr/local/src/jobs/TFJob/env/tf19.zip \
--src_dir=/usr/local/src/jobs/TFJob/src \
--executes=/usr/local/src/jobs/TFJob/src/mnist_distributed.py \
--task_params='--data_dir /tmp/data/ --working_dir /tmp/output/' \
--conf_file=/usr/local/src/jobs/TFJob/tony.xml \
--python_binary_path=tf19/bin/python3.5

My tony.xml is the same. I checked the file paths in the dataproc jobs submit command and they should all exist on the master node. The --task_params directories exist on each worker node as well.

@dwu15
Copy link

dwu15 commented Dec 11, 2018

Regarding the yarn-site.xml fix, I looked at the release notes, and it seemed like they fixed that bug on 12/10.

@oliverhu
Copy link
Member

@dwu15 Could you provide the Hadoop log for the application application_1544551109758_0001? (yarn logs -applicationId application_1544551109758_0001) The good news is the application has started, but for some reason the client host could not talk to the driver host.
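For example, the aggregated logs can be saved to a file on the master node (the output file name here is just an example):

yarn logs -applicationId application_1544551109758_0001 > application_1544551109758_0001.log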

@dwu15
Copy link

dwu15 commented Dec 11, 2018

These are my log files. I was able to submit without the job failing, but the job now seems to hang indefinitely.

18/12/11 23:12:57 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.128.0.3:8032
18/12/11 23:12:57 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.128.0.3:10200
Invalid ApplicationId specified
root@tony-staging-m:~/ecosystem# yarn logs -applicationId application_1544551109758_2640
18/12/11 23:13:28 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.128.0.3:8032
18/12/11 23:13:29 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.128.0.3:10200
Container: container_1544551109758_2640_01_000003 on tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:35659
LogAggregationType: LOCAL
=========================================================================================================================
LogType:stdout
LogLastModifiedTime:Tue Dec 11 21:36:49 +0000 2018
LogLength:1940
LogContents:
2018-12-11 21:36:46 INFO  TaskExecutor:85 - TaskExecutor is running..
2018-12-11 21:36:46 INFO  TaskExecutor:79 - Reserved rpcPort: 45893
2018-12-11 21:36:46 INFO  TaskExecutor:80 - Reserved tbPort: 44941
2018-12-11 21:36:46 INFO  TaskExecutor:81 - Reserved py4j gatewayServerPort: 42281
2018-12-11 21:36:47 INFO  TaskExecutor:179 - Task command: venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output
2018-12-11 21:36:47 INFO  Utils:137 - Unzipping tony.zip to destination ./
2018-12-11 21:36:48 INFO  TaskExecutor:191 - Setting up Rpc client, connecting to: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263
2018-12-11 21:36:49 INFO  TaskExecutor:98 - Unpacking Python virtual environment: tf19.zip
2018-12-11 21:36:49 INFO  Utils:137 - Unzipping tf19.zip to destination venv
2018-12-11 21:36:49 ERROR Utils:142 - Failed to unzip tf19.zip
net.lingala.zip4j.exception.ZipException: zip file does not exist
        at net.lingala.zip4j.core.ZipFile.readZipInfo(ZipFile.java:407)
        at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:475)
        at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:451)
        at com.linkedin.tony.util.Utils.unzipArchive(Utils.java:140)
        at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:99)
2018-12-11 21:36:49 INFO  TaskExecutor:109 - Executor is running task worker 0
2018-12-11 21:36:49 INFO  TaskExecutor:197 - Application Master address : tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263
2018-12-11 21:36:49 INFO  TaskExecutor:200 - ContainerId is: container_1544551109758_2640_01_000003 HostName is: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal
2018-12-11 21:36:49 INFO  TaskExecutor:208 - Connecting to tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263 to register worker spec: worker 0 tony-staging-w-1.us-central1-
f.c.sodium-hangar-225217.internal:45893
End of LogType:stdout.This log file belongs to a running container (container_1544551109758_2640_01_000003) and so may not be complete.
***********************************************************************
End of LogType:prelaunch.err.This log file belongs to a running container (container_1544551109758_2640_01_000003) and so may not be complete.
******************************************************************************
Container: container_1544551109758_2640_01_000003 on tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:35659
LogAggregationType: LOCAL
=========================================================================================================================
LogType:stderr
LogLastModifiedTime:Tue Dec 11 21:36:46 +0000 2018
LogLength:476
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/15/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
End of LogType:stderr.This log file belongs to a running container (container_1544551109758_2640_01_000003) and so may not be complete.
Container: container_1544551109758_2640_01_000002 on tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:40857
LogAggregationType: LOCAL
=========================================================================================================================
LogType:stdout
LogLastModifiedTime:Tue Dec 11 21:36:51 +0000 2018
LogLength:1932
LogContents:
2018-12-11 21:36:47 INFO  TaskExecutor:85 - TaskExecutor is running..
2018-12-11 21:36:47 INFO  TaskExecutor:79 - Reserved rpcPort: 40549
2018-12-11 21:36:47 INFO  TaskExecutor:80 - Reserved tbPort: 46217
2018-12-11 21:36:47 INFO  TaskExecutor:81 - Reserved py4j gatewayServerPort: 46285
2018-12-11 21:36:48 INFO  TaskExecutor:179 - Task command: venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output
2018-12-11 21:36:48 INFO  Utils:137 - Unzipping tony.zip to destination ./
2018-12-11 21:36:49 INFO  TaskExecutor:191 - Setting up Rpc client, connecting to: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263
2018-12-11 21:36:50 INFO  TaskExecutor:98 - Unpacking Python virtual environment: tf19.zip
2018-12-11 21:36:50 INFO  Utils:137 - Unzipping tf19.zip to destination venv
2018-12-11 21:36:50 ERROR Utils:142 - Failed to unzip tf19.zip
net.lingala.zip4j.exception.ZipException: zip file does not exist
        at net.lingala.zip4j.core.ZipFile.readZipInfo(ZipFile.java:407)
        at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:475)
        at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:451)
        at com.linkedin.tony.util.Utils.unzipArchive(Utils.java:140)
        at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:99)
2018-12-11 21:36:50 INFO  TaskExecutor:109 - Executor is running task ps 0
2018-12-11 21:36:50 INFO  TaskExecutor:197 - Application Master address : tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263
2018-12-11 21:36:51 INFO  TaskExecutor:200 - ContainerId is: container_1544551109758_2640_01_000002 HostName is: tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal
2018-12-11 21:36:51 INFO  TaskExecutor:208 - Connecting to tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263 to register worker spec: ps 0 tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:40549
End of LogType:stdout.This log file belongs to a running container (container_1544551109758_2640_01_000002) and so may not be complete.
***********************************************************************


End of LogType:prelaunch.err.This log file belongs to a running container (container_1544551109758_2640_01_000002) and so may not be complete.
******************************************************************************


Container: container_1544551109758_2640_01_000002 on tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:40857
LogAggregationType: LOCAL
=========================================================================================================================
LogType:stderr
LogLastModifiedTime:Tue Dec 11 21:36:47 +0000 2018
LogLength:476
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/15/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
End of LogType:stderr.This log file belongs to a running container (container_1544551109758_2640_01_000002) and so may not be complete.
***********************************************************************
Container: container_1544551109758_2640_01_000002 on tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:40857
LogAggregationType: LOCAL
=========================================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Tue Dec 11 21:36:46 +0000 2018
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container
End of LogType:prelaunch.out.This log file belongs to a running container (container_1544551109758_2640_01_000002) and so may not be complete.
******************************************************************************


Container: container_1544551109758_2640_01_000001 on tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:35659
LogAggregationType: LOCAL
=========================================================================================================================
LogType:amstderr.log
LogLastModifiedTime:Tue Dec 11 21:36:36 +0000 2018
LogLength:476
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/15/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
End of LogType:amstderr.log.This log file belongs to a running container (container_1544551109758_2640_01_000001) and so may not be complete.
*****************************************************************************


Container: container_1544551109758_2640_01_000001 on tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:35659
LogAggregationType: LOCAL
=========================================================================================================================
LogType:amstdout.log
LogLastModifiedTime:Tue Dec 11 23:40:43 +0000 2018
LogLength:198859
LogContents:
2018-12-11 21:36:36 INFO  Utils:137 - Unzipping tony.zip to destination ./
2018-12-11 21:36:40 INFO  TonyApplicationMaster:380 - Preparing application master..
2018-12-11 21:36:41 INFO  TonyApplicationMaster:400 - RPC server running at: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263
2018-12-11 21:36:41 INFO  Utils:446 - Directory /system/tony-history/2018 already exists!
2018-12-11 21:36:41 INFO  Utils:446 - Directory /system/tony-history/2018/12 already exists!
2018-12-11 21:36:41 INFO  Utils:446 - Directory /system/tony-history/2018/12/11 already exists!
2018-12-11 21:36:42 INFO  TonyApplicationMaster:295 - Final task command: 'venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output'
2018-12-11 21:36:42 INFO  TonySession:218 - Job ps with allocationRequestId 0
2018-12-11 21:36:42 INFO  TonyApplicationMaster:883 - Requested container ask: Capability[<memory:4096, vCores:1>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-12-11 21:36:42 INFO  TonySession:218 - Job worker with allocationRequestId 1
2018-12-11 21:36:42 INFO  TonyApplicationMaster:883 - Requested container ask: Capability[<memory:4096, vCores:1>]Priority[1]AllocationRequestId[1]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-12-11 21:36:42 INFO  TonySession:218 - Job worker with allocationRequestId 2
2018-12-11 21:36:42 INFO  TonyApplicationMaster:883 - Requested container ask: Capability[<memory:4096, vCores:1>]Priority[1]AllocationRequestId[2]ExecutionTypeRequest[{Execution Type: GUARANTEED, Enforce Execution Type: false}]
2018-12-11 21:36:42 INFO  Utils:421 - Completed worker tasks: 0 out of 2 worker tasks.
2018-12-11 21:36:44 INFO  TonyApplicationMaster:981 - Allocated: 2 containers.
2018-12-11 21:36:44 INFO  TonyApplicationMaster:983 - Launching a task in container, containerId = container_1544551109758_2640_01_000002, containerNode = tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:40857, resourceRequest = <memory:4096, vCores:1>
2018-12-11 21:36:44 INFO  TonySession:236 - Matched job ps with allocationRequestId 0
2018-12-11 21:36:44 INFO  TonyApplicationMaster:1046 - Setting Container [container_1544551109758_2640_01_000002] for task [ps:0]..
2018-12-11 21:36:44 INFO  TonyApplicationMaster:1068 - Constructed command: [$JAVA_HOME/bin/java -Xmx1536m com.linkedin.tony.TaskExecutor  --am_address tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263 --task_command 'venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output' --venv tf19.zip 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr ]
2018-12-11 21:36:44 INFO  TonyApplicationMaster:1069 - Container environment: {TONY_CONF_TIMESTAMP=1544564193297, TONY_CONF_LENGTH=182332, TASK_INDEX=0, TASK_NUM=2, CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*, ATTEMPT_NUMBER=0, JOB_NAME=ps, TONY_CONF_PATH=hdfs://tony-staging-m/user/root/.tony/application_1544551109758_2640/tony-final.xml, TONY_ZIP_PATH=hdfs://tony-staging-m/user/root/.tony/application_1544551109758_2640/tony.zip, TONY_ZIP_LENGTH=91846570, TONY_ZIP_TIMESTAMP=1544564193223, SESSION_ID=0}
2018-12-11 21:36:44 INFO  TonyApplicationMaster:173 - Logs for ps 0 at: http://tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:8042/node/containerlogs/container_1544551109758_2640_01_000002/yarn
2018-12-11 21:36:44 INFO  TonyApplicationMaster:983 - Launching a task in container, containerId = container_1544551109758_2640_01_000003, containerNode = tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:35659, resourceRequest = <memory:4096, vCores:1>
2018-12-11 21:36:44 INFO  TonySession:236 - Matched job worker with allocationRequestId 1
2018-12-11 21:36:44 INFO  TonyApplicationMaster:1046 - Setting Container [container_1544551109758_2640_01_000003] for task [worker:0]..
2018-12-11 21:36:44 INFO  TonyApplicationMaster:1068 - Constructed command: [$JAVA_HOME/bin/java -Xmx1536m com.linkedin.tony.TaskExecutor  --am_address tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:11263 --task_command 'venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output' --venv tf19.zip 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr ]
2018-12-11 21:36:44 INFO  TonyApplicationMaster:1069 - Container environment: {TONY_CONF_TIMESTAMP=1544564193297, TONY_CONF_LENGTH=182332, TASK_INDEX=0, TASK_NUM=2, CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*, ATTEMPT_NUMBER=0, JOB_NAME=worker, TONY_CONF_PATH=hdfs://tony-staging-m/user/root/.tony/application_1544551109758_2640/tony-final.xml, TONY_ZIP_PATH=hdfs://tony-staging-m/user/root/.tony/application_1544551109758_2640/tony.zip, TONY_ZIP_LENGTH=91846570, TONY_ZIP_TIMESTAMP=1544564193223, SESSION_ID=0}
2018-12-11 21:36:44 INFO  TonyApplicationMaster:173 - Logs for worker 0 at: http://tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:8042/node/containerlogs/container_1544551109758_2640_01_000003/yarn
2018-12-11 21:36:45 INFO  TonyApplicationMaster:907 - Successfully started container container_1544551109758_2640_01_000003
2018-12-11 21:36:45 INFO  TonyApplicationMaster:907 - Successfully started container container_1544551109758_2640_01_000002
2018-12-11 21:36:47 INFO  Utils:421 - Completed worker tasks: 0 out of 2 worker tasks.
2018-12-11 21:36:49 INFO  TonyApplicationMaster:774 - Received cluster spec registration request from task worker:0 with spec: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:45893
2018-12-11 21:36:49 INFO  TonyApplicationMaster:781 - [worker:0] Received Registration for HB !!
2018-12-11 21:36:51 INFO  TonyApplicationMaster:774 - Received cluster spec registration request from task ps:0 with spec: tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:40549
2018-12-11 21:36:51 INFO  TonyApplicationMaster:781 - [ps:0] Received Registration for HB !!
2018-12-11 21:36:52 INFO  Utils:421 - Completed worker tasks: 0 out of 2 worker tasks.
2018-12-11 21:36:55 INFO  TonyApplicationMaster:794 - Received registrations from 2 tasks, awaiting registration from 1 tasks.
2018-12-11 21:36:57 INFO  Utils:421 - Completed worker tasks: 0 out of 2 worker tasks.
2018-12-11 21:37:02 INFO  Utils:421 - Completed worker tasks: 0 out of 2 worker tasks.
2018-12-11 21:37:07 INFO  Utils:421 - Completed worker tasks: 0 out of 2 worker tasks.
....

@dwu15
Copy link

dwu15 commented Dec 12, 2018

Looking at the logs for one of the containers, this seems to be the issue: the executor can't locate the zip file containing the venv.

2018-12-12 01:25:45 INFO  TaskExecutor:85 - TaskExecutor is running..
2018-12-12 01:25:45 INFO  TaskExecutor:79 - Reserved rpcPort: 36535
2018-12-12 01:25:45 INFO  TaskExecutor:80 - Reserved tbPort: 39663
2018-12-12 01:25:45 INFO  TaskExecutor:81 - Reserved py4j gatewayServerPort: 36955
2018-12-12 01:25:46 INFO  TaskExecutor:179 - Task command: venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output/
2018-12-12 01:25:46 INFO  Utils:137 - Unzipping tony.zip to destination ./
2018-12-12 01:25:47 INFO  TaskExecutor:191 - Setting up Rpc client, connecting to: tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:14610
2018-12-12 01:25:49 INFO  TaskExecutor:98 - Unpacking Python virtual environment: tf19.zip
2018-12-12 01:25:50 INFO  Utils:137 - Unzipping tf19.zip to destination venv
2018-12-12 01:25:50 ERROR Utils:142 - Failed to unzip tf19.zip
net.lingala.zip4j.exception.ZipException: zip file does not exist
	at net.lingala.zip4j.core.ZipFile.readZipInfo(ZipFile.java:407)
	at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:475)
	at net.lingala.zip4j.core.ZipFile.extractAll(ZipFile.java:451)
	at com.linkedin.tony.util.Utils.unzipArchive(Utils.java:140)
	at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:99)
2018-12-12 01:25:50 INFO  TaskExecutor:109 - Executor is running task worker 0
2018-12-12 01:25:50 INFO  TaskExecutor:197 - Application Master address : tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:14610
2018-12-12 01:25:50 INFO  TaskExecutor:200 - ContainerId is: container_1544551109758_2663_01_000003 HostName is: tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal
2018-12-12 01:25:50 INFO  TaskExecutor:208 - Connecting to tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:14610 to register worker spec: worker 0 tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:36535
2018-12-12 01:25:50 INFO  Utils:110 - pollTillNonNull function finished within 0 seconds
2018-12-12 01:25:50 INFO  TaskExecutor:116 - Successfully registered and got cluster spec: {"ps":["tony-staging-w-1.us-central1-f.c.sodium-hangar-225217.internal:45307"],"worker":["tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:36535"]}
2018-12-12 01:25:50 INFO  TaskExecutor:218 - TensorBoard address : tony-staging-w-0.us-central1-f.c.sodium-hangar-225217.internal:39663
2018-12-12 01:25:50 INFO  Utils:110 - pollTillNonNull function finished within 60 seconds
2018-12-12 01:25:50 INFO  TaskExecutor:221 - Register TensorBoard response: SUCCEEDED
2018-12-12 01:25:50 INFO  TaskExecutor:130 - Setting up TensorFlow jobs..
2018-12-12 01:25:51 INFO  Utils:238 - Executing command: venv/tf19/bin/python3.5 /usr/local/src/jobs/TFJob/src/mnist_distributed.py --data_dir /tmp/data/ --working_dir /tmp/output/
2018-12-12 01:25:51 INFO  Utils:110 - pollTillNonNull function finished within 60 seconds
2018-12-12 01:25:51 INFO  TaskExecutor:230 - AM response for result execution run: RECEIVED
2018-12-12 01:25:51 INFO  TaskExecutor:159 - Child process exited with exit code 127

@oliverhu
Copy link
Member

That looks weird. Does /usr/local/src/jobs/TFJob/env/tf19.zip exist? @gogasca I remember you also saw this before. Did you solve it without manually copying files around?
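A quick sanity check on the node where the job is submitted, as a minimal sketch (paths follow the command posted above):

# Confirm the archive exists and is a valid zip
ls -lh /usr/local/src/jobs/TFJob/env/tf19.zip
unzip -t /usr/local/src/jobs/TFJob/env/tf19.zip > /dev/null && echo "archive OK"

# Confirm the interpreter referenced by --python_binary_path is inside the archive
unzip -l /usr/local/src/jobs/TFJob/env/tf19.zip | grep tf19/bin/python3.5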

@dwu15
Copy link

dwu15 commented Dec 12, 2018

After some debugging, I got a job running. I had to change the folder structure:

TFJob/
├── src/
│   └── mnist_distributed.py
├── tony.xml
├── tony-cli-0.1.5-all.jar
└── tf19.zip

I also specified relative paths in the Hadoop job command, running the following from the parent directory of TFJob:

sudo java -cp "`hadoop classpath --glob`:TFJob/*:TFJob" \
com.linkedin.tony.cli.ClusterSubmitter \
--python_venv tf19.zip \
--src_dir TFJob/src \
--executes TFJob/src/mnist_distributed.py \
--task_params '--data_dir /tmp/data/ --working_dir /tmp/output' \
--conf_file TFJob/tony.xml \
--python_binary_path tf19/bin/python3.5

I saw in one of the earlier comments in this thread that there was an issue with relative vs. absolute paths, and that bug still seems to exist.
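For context, a minimal sketch of how a tf19.zip archive could be built so that tf19/bin/python3.5 resolves once the executors unzip it into venv/ (the TensorFlow version pin is an assumption):

# Create a Python 3.5 virtualenv named tf19 and install TensorFlow into it
python3.5 -m venv tf19
tf19/bin/pip install --upgrade pip
tf19/bin/pip install tensorflow==1.9.0

# Zip it so the archive root contains the tf19/ directory itself
zip -r tf19.zip tf19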

@oliverhu
Copy link
Member

Darn, we'll take a look again, thanks for reporting.

@gogasca
Copy link
Contributor

gogasca commented Dec 12, 2018

@dwu15 Thanks for reporting. Yes, unfortunately we currently need to copy each of the TF Python files and the environment with the same structure across all the workers. I opened #93 to support GCS so we can place these files in a common location instead. Will take a look at this as it is still open.
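As an illustration of the common-location idea (the bucket name is hypothetical, and reading job files directly from GCS is what #93 proposes, not something TonY supports yet):

# Stage the job artifacts in a bucket that every node can reach
gsutil cp tf19.zip tony.xml gs://my-tony-bucket/jobs/TFJob/
gsutil cp -r src gs://my-tony-bucket/jobs/TFJob/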

@oliverhu
Copy link
Member

Took a look tonight; it turned out to be fairly complicated to fix this issue correctly. The plan is to revamp the logic behind handling hdfs_classpath & src_dir so that everything uses the new resources flag.

@oliverhu
Copy link
Member

This is fixed in trunk!

Check the updated README for instructions: https://github.com/linkedin/TonY/pull/136/files
