
Is it possible to migrate a CNN TensorFlow application to TensorFlowOnSpark? #440

Closed
rafaelmarconiramos opened this issue Aug 14, 2019 · 4 comments

Comments

@rafaelmarconiramos

Hello,

I'm not sure whether this repository is the right place to ask questions, but I couldn't find another one. I need to know if it is possible to migrate a TensorFlow CNN to TensorFlowOnSpark. The CNN example uses a well-known case, but in my case I have my own architectures and images.

The conversion guide doesn't make it clear to me how the distribution will work on the Spark cluster. Does anyone know of a how-to for writing a TensorFlowOnSpark application from scratch?

I'm working with Python 3, a Spark cluster without Hadoop, and TensorFlow without GPU.

Thanks
Rafael

@leewyang
Contributor

Hi Rafael, I need to update the conversion guide, since it mostly reflects the older, lower-level TF APIs, which are now being de-emphasized. I do have some examples of conversions of the higher-level TF APIs that you can follow, e.g. mnist/estimator and mnist/keras.

That all said, I generally recommend the following development process:

  1. Get your single-node TF app working on a small subset of your dataset, preferring tf.data and the higher-level APIs like Keras and Estimator. This will sort out any TF-specific errors with a faster iteration cycle, and using these APIs will make it easier for you to convert to distributed TF.
  2. Try to "distribute" your app (on a single host, without TFoS). For the higher-level APIs, this can be as simple as setting the TF_CONFIG env variable correctly in different shell sessions and just running the "nodes" (a minimal sketch follows at the end of this comment). This will help you sort out any errors specific to moving to distributed TF, and it will help you understand how distributed TF works.
  3. Convert to TFoS. Assuming that step 2 works reasonably, you should then be able to move the code fairly quickly to TFoS. At this point, you'll be sorting out TFoS-specific errors and any errors due to scaling out your cluster.

The problem with starting with TFoS is that you'll be debugging all of the various layers simultaneously, which can make it difficult to identify the root causes.
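
For step 2, here's a rough sketch of what a manual, single-host "distributed" run might look like. The ports, the cluster layout (two workers plus one PS), and the idea of launching one process per shell session are illustrative assumptions, not taken verbatim from the TFoS examples:

```python
# Hypothetical sketch: run your existing training script once per shell session,
# giving each process a different role via the TF_CONFIG environment variable.
# Ports and the worker/ps layout below are placeholders for illustration.
import json
import os

cluster = {
    "worker": ["localhost:2222", "localhost:2223"],
    "ps": ["localhost:2224"],
}

# In session 1, before starting training, act as worker 0:
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})

# In session 2 use {"type": "worker", "index": 1}, and in session 3
# {"type": "ps", "index": 0}. Estimator (tf.estimator.train_and_evaluate)
# and the multi-worker Keras strategies read TF_CONFIG at startup to
# discover their role in the cluster.
```

Each "node" is just your normal script started with a different TF_CONFIG, so you can watch how the processes coordinate before Spark is involved.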

@rafaelmarconiramos
Author

Hello @leewyang,

I followed your strategy, but I have some doubts. I'd be grateful if you could help me.

  1. I had an environment with Python 3.7, but I hit a PySpark bug with no available fix, so I migrated to Python 2.7 and restarted the test with Keras.

  2. The training phase of the Keras example in Spark mode worked. My first doubt: the initial tasks run distributed, but once the epochs start, execution happens on only one node. Why?

  3. The inference did not work. When I run inference, the code uses two arguments: --images_labels ${TFoS_HOME}/mnist/tfr/test and --export ${TFoS_HOME}/mnist_model/export/serving/. My doubts are:
    a) Should the labels path be /mnist/csv/test instead?
    b) The training phase saves the model in ./mnist_model/. When I change the path, the program gives an error: "IOError: SavedModel file does not exist at: ./mnist_model." The model is there. Do I need to convert the model files to another format?

I'd appreciate it if you could help me with these issues.

Thanks
Rafael

@leewyang
Contributor

Hi Rafael,

  1. Yes, I've seen that Python 3.7 may not work well with Spark. However, 3.6.9 works fine, so you may be able to stay in a Python 3 environment if you'd like. (Note: Mac OS exhibits this issue... but a fix is available in the comments).

  2. Training should be occurring on all available workers, but if you have two executors, one of them will act as a PS node, so you'll only have one available worker. If you have more than two executors and you still don't see training occurring on all of them, then please send me the worker logs.

3a. That particular example is actually using TFRecords. Specifically, it's using "parallel" inferencing, where each executor just loads the saved_model and runs inference on a slice of the TFRecord files. The example could also use a tf.data.Dataset of text/csv files... you'd just need to modify the parser accordingly (see the sketch at the end of this comment).

3b. Try using absolute paths for the model. Due to the vagaries of distributed systems, the local path notation may not point to where you think it should be.
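
For 3a and 3b, here is a minimal sketch of what a CSV-based tf.data pipeline and an absolute model path might look like. The column layout (784 pixel values followed by one label) and the file/directory names below are assumptions based on the mnist example layout, not verbatim from it:

```python
# Hypothetical sketch: parse CSV lines with tf.data instead of TFRecords (3a),
# and resolve the export directory to an absolute path before loading (3b).
import os
import tensorflow as tf

def parse_csv(line):
    # Assumed layout: 784 comma-separated pixel values followed by an integer label.
    fields = tf.io.decode_csv(line, record_defaults=[[0.0]] * 784 + [[0]])
    image = tf.stack(fields[:784])
    label = fields[784]
    return image, label

dataset = (tf.data.TextLineDataset("mnist/csv/test/part-00000")  # placeholder path
           .map(parse_csv)
           .batch(32))

# 3b: compute an absolute path and pass that to --export, so every executor
# resolves the same location regardless of its working directory.
export_dir = os.path.abspath("mnist_model/export/serving")
print(export_dir)
```

The absolute path matters most once Spark is involved, since the executors' working directories are generally not the directory you launched the job from.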

@leewyang
Contributor

Closing due to inactivity. Feel free to re-open if still an issue.
