MNIST example - Exception in TF background thread #569

Closed
Ipsedo opened this issue Jun 14, 2021 · 2 comments

Ipsedo commented Jun 14, 2021

Environment:

  • Python version 3.6
  • Spark version 2.2.0
  • TensorFlow version 1.9.0
  • TensorFlowOnSpark version 1.4.0
  • Cluster version Hadoop 2.7.3

Describe the bug:
When I try to run the MNIST example, I get the error "Exception in TF background thread" during TFCluster.run.

Logs:

From the notebook:

2021-06-14 16:44:45,996 INFO (MainThread-22437) Reserving TFSparkNodes w/ TensorBoard
2021-06-14 16:44:45,998 INFO (MainThread-22437) cluster_template: {'ps': range(0, 1), 'worker': range(1, 10)}
2021-06-14 16:44:46,002 INFO (MainThread-22437) listening for reservations at ('XXX.XXX.XXX.XXX', XXXXX)
2021-06-14 16:44:46,003 INFO (MainThread-22437) Starting TensorFlow on executors
2021-06-14 16:44:46,180 INFO (MainThread-22437) Waiting for TFSparkNodes to start
2021-06-14 16:44:46,190 INFO (MainThread-22437) waiting for 10 reservations
2021-06-14 16:44:47,192 INFO (MainThread-22437) waiting for 10 reservations
2021-06-14 16:44:47,290 ERROR (Thread-6-22437) Exception in TF background thread
2021-06-14 16:44:48,195 INFO (MainThread-22437) waiting for 10 reservations

From the YARN log:
[...]
21/06/14 16:26:50 INFO YarnAllocator: Received 10 containers from YARN, launching executors on 0 of them.

Then, when I run:

TFCluster.run(sc, mnist_dist.map_fun, args, 10, 1, False, TFCluster.InputMode.SPARK)

I get:
21/06/14 16:27:24 INFO YarnAllocator: Driver requested a total number of 0 executor(s).

Spark Submit Command Line:
Run from a notebook with the following configuration:

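from pyspark.sql import SparkSession

# The surrounding parentheses let the builder chain span several lines.
spark = (
    SparkSession.builder
    .appName("tensorflowonspark")
    .enableHiveSupport()
    .master("yarn")
    .config("spark.yarn.queue", "MY_QUEUE")
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .config("spark.yarn.executor.memoryOverhead", "2048")
    .config("spark.yarn.driver.memoryOverhead", "2048")
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
# sc as passed to TFCluster.run above (assumed to come from this session)
sc = spark.sparkContext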

Did I miss something in the Spark/Hadoop/TensorFlow versions?
I can't find more information about this error in the logs. Does it mean that TensorFlowOnSpark can't "connect" to Spark?

@leewyang
Contributor

Are you able to run standard Spark jobs on this cluster with those same settings? The "Driver requested a total number of 0 executor(s)" message seems to be occurring at the Spark level and not the TFoS level.

Please take a look at the Spark executor logs for any exceptions. You're only showing the driver logs, which don't reveal the root cause of the exception.
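
As a rough aid for that log review, the sketch below scans log text for error lines and for the two YarnAllocator messages quoted earlier in this thread. The name scan_log and the patterns are illustrative assumptions, not part of Spark or TensorFlowOnSpark, and other Spark versions may word these messages differently. On YARN, executor logs can typically be fetched with yarn logs -applicationId <application id> and passed in as text.

```python
import re

# Patterns copied from the log lines quoted in this issue (assumption:
# your Spark version words these messages the same way).
ZERO_LAUNCH = re.compile(r"launching executors on 0 of them")
ZERO_REQUEST = re.compile(r"Driver requested a total number of 0 executor\(s\)")
PROBLEM = re.compile(r"\b(ERROR|Exception|Traceback)\b")


def scan_log(text):
    """Return (line_number, reason, line) tuples for suspicious log lines."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if ZERO_LAUNCH.search(line):
            findings.append((number, "no executors launched", line))
        elif ZERO_REQUEST.search(line):
            findings.append((number, "driver requested 0 executors", line))
        elif PROBLEM.search(line):
            findings.append((number, "error or exception", line))
    return findings


# Sample input assembled from the lines posted in this issue.
sample = """\
21/06/14 16:26:50 INFO YarnAllocator: Received 10 containers from YARN, launching executors on 0 of them.
21/06/14 16:27:24 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
2021-06-14 16:44:47,290 ERROR (Thread-6-22437) Exception in TF background thread
2021-06-14 16:44:48,195 INFO (MainThread-22437) waiting for 10 reservations
"""

for number, reason, line in scan_log(sample):
    print(f"line {number}: {reason}")
```

A scan like this only narrows down where to look; the full stack trace in the executor log is still needed to find the root cause.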

Finally, you are using a very old version of TensorFlow (1.9.0) and you should be using at least tensorflowonspark==1.4.4 (the last version of TFoS intended for the TF 1.x branch). That said, if you are starting from scratch, I'd recommend using TF 2.x along with the latest tensorflowonspark.
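
The version advice above can be summarised in a small helper. This is only a sketch of the pairing described in this comment: suggest_tfos_requirement is a hypothetical name, and the returned strings are pip requirement specifiers, not an official compatibility matrix.

```python
def suggest_tfos_requirement(tf_version):
    """Map a TensorFlow version string to the TFoS requirement advised above."""
    major = int(tf_version.split(".")[0])
    if major == 1:
        # 1.4.4 is described above as the last TFoS release for TF 1.x.
        return "tensorflowonspark==1.4.4"
    if major >= 2:
        # An unpinned requirement lets pip resolve the latest TFoS release.
        return "tensorflowonspark"
    raise ValueError(f"unsupported TensorFlow version: {tf_version}")


for version in ("1.9.0", "1.13.1", "2.4.1"):
    print(version, "->", suggest_tfos_requirement(version))
```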


Ipsedo commented Jun 15, 2021

The problem was a conflict between the .local and conda environment folders when distributing the Python environment.
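
For readers hitting a similar conflict, a minimal diagnostic sketch follows. It assumes that ".local" refers to the per-user site-packages directory under ~/.local, which this thread does not confirm, and the helper names are hypothetical. Run on the driver or inside a Spark task, it reports whether that directory is visible to the interpreter.

```python
import os
import site
import sys


def user_site_report():
    """Summarise whether the per-user site-packages directory can leak in."""
    user_site = site.getusersitepackages()
    return {
        "executable": sys.executable,
        "user_site": user_site,
        # ENABLE_USER_SITE may be True, False, or None; normalise to a bool.
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        "user_site_on_path": user_site in sys.path,
        "PYTHONNOUSERSITE": os.environ.get("PYTHONNOUSERSITE"),
    }


def loaded_from_user_site(module_name):
    """Return True if an already-imported module came from the user site."""
    module = sys.modules.get(module_name)
    path = getattr(module, "__file__", None)
    if not path:
        # Not imported yet, or a built-in module without a file on disk.
        return False
    user_site = os.path.realpath(site.getusersitepackages())
    return os.path.realpath(path).startswith(user_site + os.sep)


for key, value in user_site_report().items():
    print(f"{key}: {value}")
```

If the user site turns out to be on the path, one option worth trying is setting the PYTHONNOUSERSITE environment variable for the Python workers, for example through Spark's spark.executorEnv.PYTHONNOUSERSITE setting. Whether that matches the fix applied here is not stated in the thread.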

I switched to TFoS 1.4.4 with TF 1.13.1, and everything works fine.

Thanks for the quick response. I'm closing the issue.

Ipsedo closed this as completed Jun 15, 2021