MNIST example - Exception in TF background thread #569

Ipsedo · 2021-06-14T15:00:09Z

Environment:

Python version 3.6
Spark version 2.2.0
TensorFlow version 1.9.0
TensorFlowOnSpark version 1.4.0
Cluster version Hadoop 2.7.3

Describe the bug:
When I try to run the MNIST example, TFCluster.run fails with the error "Exception in TF background thread".

Logs:

From the notebook:

2021-06-14 16:44:45,996 INFO (MainThread-22437) Reserving TFSparkNodes w/ TensorBoard
2021-06-14 16:44:45,998 INFO (MainThread-22437) cluster_template: {'ps': range(0, 1), 'worker': range(1, 10)}
2021-06-14 16:44:46,002 INFO (MainThread-22437) listening for reservations at ('XXX.XXX.XXX.XXX', XXXXX)
2021-06-14 16:44:46,003 INFO (MainThread-22437) Starting TensorFlow on executors
2021-06-14 16:44:46,180 INFO (MainThread-22437) Waiting for TFSparkNodes to start
2021-06-14 16:44:46,190 INFO (MainThread-22437) waiting for 10 reservations
2021-06-14 16:44:47,192 INFO (MainThread-22437) waiting for 10 reservations
2021-06-14 16:44:47,290 ERROR (Thread-6-22437) Exception in TF background thread
2021-06-14 16:44:48,195 INFO (MainThread-22437) waiting for 10 reservations

From the YARN log:
[...]
21/06/14 16:26:50 INFO YarnAllocator: Received 10 containers from YARN, launching executors on 0 of them.

Then, when I run:

TFCluster.run(sc, mnist_dist.map_fun, args, 10, 1, False, TFCluster.InputMode.SPARK)

I get:
21/06/14 16:27:24 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
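For readers matching the TFCluster.run call to the cluster_template line in the notebook log: the fourth and fifth positional arguments (10 and 1) are the executor count and the parameter-server count. The sketch below is not TensorFlowOnSpark's own code. It is a hypothetical helper, cluster_template, that reproduces the split shown in the log under the assumption that the first num_ps executor ids become 'ps' nodes and the rest become workers.

```python
def cluster_template(num_executors, num_ps):
    """Illustrative only: split executor ids into 'ps' and 'worker' ranges.

    Assumes the first `num_ps` ids are parameter servers and the remaining
    ids are workers, which matches the template printed in the log above.
    """
    if not 0 <= num_ps < num_executors:
        raise ValueError("num_ps must be >= 0 and smaller than num_executors")
    template = {}
    if num_ps > 0:
        template["ps"] = range(0, num_ps)
    template["worker"] = range(num_ps, num_executors)
    return template


# Same values as in the call above: 10 executors, 1 parameter server.
print(cluster_template(10, 1))
```

With these values the job needs all 10 executors to register (1 ps and 9 workers), which is why the driver keeps logging "waiting for 10 reservations" when no executor starts.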

Spark Submit Command Line:
Run from a notebook with the following configuration:

from pyspark.sql import SparkSession

# Parentheses let the builder chain span several lines in Python.
spark = (
    SparkSession
    .builder
    .appName("tensorflowonspark")
    .enableHiveSupport()
    .master("yarn")
    .config("spark.yarn.queue", "MY_QUEUE")
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .config("spark.yarn.executor.memoryOverhead", "2048")
    .config("spark.yarn.driver.memoryOverhead", "2048")
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
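As a rough check of what this configuration asks the queue for, the sketch below adds up the executor requests. It assumes each executor container is sized as executor memory plus memory overhead, and that a bare number for the overhead means MiB. YARN may round container sizes up further, so treat the totals as a lower bound. The helper names parse_mib and executor_request are hypothetical.

```python
def parse_mib(value):
    """Convert strings such as '8g', '512m' or '2048' to MiB (bare numbers = MiB)."""
    value = value.strip().lower()
    units = {"g": 1024, "m": 1}
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def executor_request(conf):
    """Sum the memory and cores requested for all executors in `conf`."""
    instances = int(conf["spark.executor.instances"])
    per_executor = (parse_mib(conf["spark.executor.memory"])
                    + parse_mib(conf["spark.yarn.executor.memoryOverhead"]))
    return {
        "per_executor_mib": per_executor,
        "total_mib": per_executor * instances,
        "total_cores": int(conf["spark.executor.cores"]) * instances,
    }


# Executor settings copied from the configuration above.
conf = {
    "spark.executor.instances": "10",
    "spark.executor.memory": "8g",
    "spark.yarn.executor.memoryOverhead": "2048",
    "spark.executor.cores": "5",
}
print(executor_request(conf))
```

Under these assumptions the executors alone need about 100 GiB and 50 cores from MY_QUEUE, in addition to the driver.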

Did I miss something in the Spark/Hadoop/TensorFlow versions?
I can't find any more information about this error in the logs. Does it mean that TensorFlowOnSpark can't "connect" to Spark?

leewyang · 2021-06-14T23:12:59Z

Are you able to run standard Spark jobs on this cluster with those same settings? The "Driver requested a total number of 0 executor(s)" message seems to be occurring at the Spark level and not the TFoS level.
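To find these YarnAllocator messages in a longer driver log, a small scan like the following may help. It is only a sketch: the two patterns are copied from the log lines quoted in this issue, the function name zero_executor_events is hypothetical, and a zero count is a hint to investigate rather than proof of a failure.

```python
import re

# Message shapes copied from the YarnAllocator lines quoted in this thread.
LAUNCH_RE = re.compile(
    r"Received (\d+) containers? from YARN, launching executors on (\d+) of them")
REQUEST_RE = re.compile(r"Driver requested a total number of (\d+) executor")


def zero_executor_events(lines):
    """Return (line_number, description) for each message reporting zero executors."""
    events = []
    for number, line in enumerate(lines, start=1):
        launch = LAUNCH_RE.search(line)
        if launch and int(launch.group(2)) == 0:
            events.append(
                (number, "received %s containers, launched 0" % launch.group(1)))
        request = REQUEST_RE.search(line)
        if request and int(request.group(1)) == 0:
            events.append((number, "driver requested 0 executors"))
    return events


sample = [
    "21/06/14 16:26:50 INFO YarnAllocator: Received 10 containers from YARN, "
    "launching executors on 0 of them.",
    "21/06/14 16:27:24 INFO YarnAllocator: Driver requested a total number of "
    "0 executor(s).",
]
for number, description in zero_executor_events(sample):
    print(number, description)
```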

Please take a look at the Spark executor logs for any exceptions, since you're only showing the driver logs, which aren't showing the root cause of the exception.

Finally, you are using a very old version of TensorFlow (1.9.0), and you should be using at least tensorflowonspark==1.4.4 (the last version of TFoS intended for the TF 1.x branch). That said, if you are starting from scratch, I'd recommend using TF 2.x along with the latest tensorflowonspark.

Ipsedo · 2021-06-15T14:45:42Z

The problem was a conflict between the .local and conda environment folders when distributing the Python environment.
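For anyone hitting a similar conflict, one way to see whether per-user packages (typically under ~/.local) are visible to the interpreter from the conda environment is to inspect sys.path. This is a minimal sketch, not the fix applied in this issue: the thread does not say how the conflict was resolved, and the helper name user_site_entries is hypothetical.

```python
import os
import site
import sys


def user_site_entries(paths=None, user_site=None):
    """Return the entries of `paths` located under the per-user site-packages dir."""
    paths = sys.path if paths is None else paths
    user_site = site.getusersitepackages() if user_site is None else user_site
    root = os.path.normpath(user_site)
    hits = []
    for entry in paths:
        if not entry:
            continue  # an empty entry means the current directory
        normalized = os.path.normpath(entry)
        if normalized == root or normalized.startswith(root + os.sep):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    print("interpreter prefix:", sys.prefix)
    print("user site enabled:", site.ENABLE_USER_SITE)
    print("user site entries on sys.path:", user_site_entries())
```

Running this on the driver and inside an executor task shows whether both use the same environment. If user-site entries appear where only the conda environment is expected, setting the standard PYTHONNOUSERSITE environment variable makes Python skip the user site directory.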

I switched to TFoS 1.4.4 with TF 1.13.1, and everything works fine.

Thanks for the quick response. I'm closing the issue.

Ipsedo closed this as completed Jun 15, 2021
