Training a model using Cloud ML to serve using TensorFlow Serving #1

martiankuo1 · 2017-04-29T00:22:22Z

I followed the steps described in "tf_face" using my own set of training data and proceeded to "Training a model using Cloud ML to serve using TensorFlow Serving". I issued the following command according to the tutorial:
"gcloud beta ml jobs submit training my9thmljob --package-path=pubfig_export --module-name=pubfig_export.export_log --region=us-central1 --staging-bucket=gs://cloudml-1001"

but got the following error from the Cloud ML Engine job:
"The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 445, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 240, in main
    train_queue = get_input_queue(FLAGS.train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 98, in get_input_queue
    train_images, train_labels = get_image_label_list(train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 74, in get_image_label_list
    for line in open(image_label_file, "r"):
IOError: [Errno 2] No such file or directory: '/tmp/data/train.txt'
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=99426417043&resource=ml_job%2Fjob_id%2Fmy9thmljob&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22my9thmljob%22"

The only part of the message I can understand is "[Errno 2] No such file or directory: '/tmp/data/train.txt'". I double-checked that I had moved the "data" directory to "temp".

Any suggestions?
Your kind help will be deeply appreciated.

CH

wwoo · 2017-05-03T01:03:55Z

Hello,

As a whole, the example needs to be updated now that Cloud ML (or ML Engine) has gone into GA. That is on my to-do list. However, on the specific issue you're seeing:

If you're trying ML Engine online training (not local training), you'll need to set 'copy_from_gcs' to True and make sure you've uploaded a gzipped tarball of your 'data' directory (which is where train.txt should live) to GCS. The code downloads the tarball and unpacks it into /tmp, which is how ML Engine gets a copy of the dataset.
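As a rough illustration of the tarball step, here is a minimal sketch that packs a local 'data' directory so that unpacking it into /tmp would yield /tmp/data/train.txt. The helper name `make_data_tarball` and its arguments are hypothetical and not part of the example's code; it only assumes the layout described above.

```python
import os
import tarfile


def make_data_tarball(data_dir, out_path):
    """Pack data_dir into a gzipped tarball (hypothetical helper).

    The directory's own name is kept as the top-level entry, so unpacking
    the archive into /tmp recreates /tmp/<dir name>/..., for example
    /tmp/data/train.txt when data_dir is named 'data'.
    """
    data_dir = os.path.abspath(data_dir)
    top_level_name = os.path.basename(data_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname strips the parent path so only 'data/...' is stored.
        tar.add(data_dir, arcname=top_level_name)
    return out_path
```

Calling `make_data_tarball("/tmp/data", "out.tar.gz")` and then copying the result to your own bucket (for instance with `gsutil cp`) should give the job something to download; the bucket path is yours to choose.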

As-is, ML Engine will download 'gs://wwoo-train/pubfig/out.tar.gz' - you'll need to change this to your path. If the download works, you should see something like this in your logs:

...
14:20:36.915 Recursively copying from gs://wwoo-train/pubfig/out.tar.gz to /tmp/
...

It might be worth checking that this works with local training first. Once you've verified that works, try it with ML Engine.
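One quick local check before submitting a job is to confirm that the tarball actually contains the file the trainer will look for. The sketch below assumes the archive layout discussed above ('data/train.txt' at the top level); the function name `tarball_has_member` is hypothetical.

```python
import tarfile


def tarball_has_member(tar_path, member="data/train.txt"):
    """Return True if the gzipped tarball contains the given member path.

    The default member is an assumption based on the error message in this
    thread, where the trainer tries to open /tmp/data/train.txt.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        return member in tar.getnames()
```

If this returns False for your archive, the job would most likely fail with the same "No such file or directory" error even when the download itself succeeds.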

ww

martiankuo1 · 2017-05-03T12:07:49Z

Thanks. I did get it working with local training. I will try what you recommended. Your kind help is deeply appreciated.

Cheng-Hua Kuo
CloudMile

