Hi Team,
I understand that in TFNode.py, start_cluster_server() is responsible for setting up the GPU devices. If no free GPU is available, it keeps retrying forever until one becomes free.
Is there a way to let the Spark driver know that no free GPU is available on a node, so that the whole job can be stopped?
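To make the request concrete, here is a rough sketch of the behavior I have in mind: retry a bounded number of times, then raise, so the Spark task fails instead of waiting forever. The names acquire_gpus_or_fail and probe_free_gpus are hypothetical and are not part of TFNode.py; probe_free_gpus stands in for whatever check the library uses internally.

```python
import time


def acquire_gpus_or_fail(probe_free_gpus, num_gpus, max_retries=3, wait_secs=0.0):
    """Return num_gpus free GPU ids, or raise after max_retries failed probes.

    probe_free_gpus is a hypothetical callable that returns the list of
    currently free GPU ids on this node (an assumption, not a real API).
    """
    for attempt in range(1, max_retries + 1):
        free = probe_free_gpus()
        if len(free) >= num_gpus:
            return free[:num_gpus]
        # Wait before the next probe, but not after the final attempt.
        if attempt < max_retries:
            time.sleep(wait_secs)
    # An uncaught exception inside a Spark task marks that task as failed.
    raise RuntimeError(
        "no free GPU after %d attempts (wanted %d)" % (max_retries, num_gpus)
    )
```

As far as I know, an exception raised in a task is reported back to the driver, and the job is aborted once the task has failed spark.task.maxFailures times, so setting that to 1 should stop the job on the first failure.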
Thanks a lot.
Best,
Yi