Training a model using Cloud ML to serve using TensorFlow Serving #1

Open
martiankuo1 opened this issue Apr 29, 2017 · 2 comments

@martiankuo1

I followed the steps described in "tf_face" using my own training data and then moved on to "Training a model using Cloud ML to serve using TensorFlow Serving". I ran the following command from the tutorial:
"gcloud beta ml jobs submit training my9thmljob --package-path=pubfig_export --module-name=pubfig_export.export_log --region=us-central1 --staging-bucket=gs://cloudml-1001"

but got the following error from the Cloud ML Engine job:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 445, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 240, in main
    train_queue = get_input_queue(FLAGS.train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 98, in get_input_queue
    train_images, train_labels = get_image_label_list(train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 74, in get_image_label_list
    for line in open(image_label_file, "r"):
IOError: [Errno 2] No such file or directory: '/tmp/data/train.txt'

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=99426417043&resource=ml_job%2Fjob_id%2Fmy9thmljob&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22my9thmljob%22

The only part of the message I can understand is "[Errno 2] No such file or directory: '/tmp/data/train.txt'". I double-checked that I had moved the "data" directory to "temp".

Any suggestions?
Your kind help will be deeply appreciated.

CH

@wwoo
Owner

wwoo commented May 3, 2017

Hello,

The example as a whole needs updating now that Cloud ML (now ML Engine) has gone GA; that is on my to-do list. On the specific issue you're seeing, though:

If you're trying ML Engine online training (not local training), you'll need to set 'copy_from_gcs' to True and make sure you've uploaded a gzipped tarball of your 'data' directory (which is where train.txt should live) to GCS. The code downloads the tarball and unpacks it into /tmp, which is how ML Engine gets a copy of the dataset.
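As a rough illustration of the tarball step, here is a minimal Python sketch that packs a local 'data' directory into a gzipped archive. The function name `make_data_tarball` is hypothetical and not part of the example code. The archive layout (a top-level `data/` entry) is an assumption inferred from the `/tmp/data/train.txt` path in the traceback.

```python
import os
import tarfile


def make_data_tarball(data_dir, out_path):
    """Pack data_dir into a gzipped tarball.

    The directory's own name becomes the top-level entry, so unpacking
    the archive into /tmp would yield /tmp/<dirname>/... (assumed layout).
    """
    data_dir = os.path.abspath(data_dir)
    top_level = os.path.basename(data_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname strips the absolute prefix so only "<dirname>/..." is stored.
        tar.add(data_dir, arcname=top_level)
    return out_path
```

Calling `make_data_tarball("data", "out.tar.gz")` and then uploading the result with something like `gsutil cp out.tar.gz gs://<your-bucket>/<your-path>/out.tar.gz` should put the archive where the job can fetch it; the bucket and path are placeholders for your own.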

As-is, ML Engine will download 'gs://wwoo-train/pubfig/out.tar.gz' - you'll need to change this to your path. If the download works, you should see something like this in your logs:

...
14:20:36.915 Recursively copying from gs://wwoo-train/pubfig/out.tar.gz to /tmp/
...

It might be worth checking that this works with local training first. Once you've verified that works, try it with ML Engine.
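Before submitting a job, it can also help to confirm locally that the archive has the layout the training code expects. The sketch below is only an illustration: `archive_has_file` is a hypothetical helper, and the default path `data/train.txt` is an assumption based on the `/tmp/data/train.txt` path in the error above.

```python
import os
import tarfile


def archive_has_file(tarball_path, expected="data/train.txt"):
    """Return True if the gzipped tarball contains the expected member.

    Member names are normalized so that "./data/train.txt" and
    "data/train.txt" are treated the same.
    """
    with tarfile.open(tarball_path, "r:gz") as tar:
        names = {os.path.normpath(name) for name in tar.getnames()}
    return os.path.normpath(expected) in names
```

If this returns False for your archive, unpacking it into /tmp would not produce /tmp/data/train.txt, and the job would likely fail with the same IOError.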

ww

@martiankuo1
Author

martiankuo1 commented May 3, 2017 via email
