Training a model using Cloud ML to serve using TensorFlow Serving #1
Hello,

As a whole, the example needs to be updated since Cloud ML (or ML Engine) has now gone into GA. I have that on my todo. However, on the specific issue you're seeing:

If you're trying ML Engine online training (not local training), you'll need to set 'copy_from_gcs' to True and make sure you've uploaded a gzipped tarball of your 'data' directory (which is where train.txt should live) to GCS. The code downloads the tarball and unpacks it into /tmp, which is how ML Engine gets a copy of the dataset.

As-is, ML Engine will download 'gs://wwoo-train/pubfig/out.tar.gz' - you'll need to change this to your own path. If the download works, you should see something like this in your logs:

...
14:20:36.915 Recursively copying from gs://wwoo-train/pubfig/out.tar.gz to /tmp/
...

It might be worth checking that this works with local training first. Once you've verified that works, try it with ML Engine.

ww
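The packaging step described above can be sketched as follows. The bucket path and the sample train.txt entry are illustrative only; substitute your own bucket and data:

```shell
# Lay out the dataset the way the script expects: data/train.txt
# (the entry below is a made-up example of an image/label line)
mkdir -p data
printf 'images/person1_01.jpg 0\n' > data/train.txt

# Create a gzipped tarball of the data/ directory
tar -czf out.tar.gz data

# Upload it to your own GCS bucket (requires gcloud auth; path is an example):
# gsutil cp out.tar.gz gs://your-bucket/pubfig/out.tar.gz
```

The tarball must contain the data/ directory at its top level, since the job expects to find /tmp/data/train.txt after extraction.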
Thanks.
I did make it work with local training. I will try what you recommended.
Your kind help is deeply appreciated.
Cheng-Hua Kuo
CloudMile
I followed the steps described in "tf_face" using my own set of training data and proceeded to "Training a model using Cloud ML to serve using TensorFlow Serving". I issued the following command according to the tutorial:
"gcloud beta ml jobs submit training my9thmljob --package-path=pubfig_export --module-name=pubfig_export.export_log --region=us-central1 --staging-bucket=gs://cloudml-1001"
but got the following error from the Cloud ML Engine job:
"The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 445, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 240, in main
    train_queue = get_input_queue(FLAGS.train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 98, in get_input_queue
    train_images, train_labels = get_image_label_list(train_file)
  File "/root/.local/lib/python2.7/site-packages/pubfig_export/export_log.py", line 74, in get_image_label_list
    for line in open(image_label_file, "r"):
IOError: [Errno 2] No such file or directory: '/tmp/data/train.txt'
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=99426417043&resource=ml_job%2Fjob_id%2Fmy9thmljob&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22my9thmljob%22"
The only thing I can understand from the message is "[Errno 2] No such file or directory: '/tmp/data/train.txt'". I double-checked that I had moved the "data" directory to "temp".
Any suggestions?
Your kind help will be deeply appreciated.
CH
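For anyone hitting the same IOError: the job reads /tmp/data/train.txt on the ML Engine worker, not on your local machine, so moving the data directory locally has no effect. The worker-side copy step described earlier in the thread looks roughly like this. The function name fetch_and_unpack is illustrative, not the actual API of export_log.py:

```python
import os
import subprocess
import tarfile

def fetch_and_unpack(gcs_path, dest="/tmp"):
    """Sketch of the worker-side dataset copy: fetch a gzipped tarball,
    then extract it so that dest/data/train.txt exists.
    (Illustrative only -- the real export_log.py may differ.)"""
    local_tarball = os.path.join(dest, os.path.basename(gcs_path))
    if gcs_path.startswith("gs://"):
        # On the worker this is a real GCS download (needs gsutil + auth).
        subprocess.check_call(["gsutil", "cp", gcs_path, local_tarball])
    else:
        # Allow a local tarball path, e.g. when testing local training.
        local_tarball = gcs_path
    with tarfile.open(local_tarball, "r:gz") as tar:
        tar.extractall(dest)  # must yield dest/data/train.txt
    return os.path.join(dest, "data", "train.txt")
```

If the tarball does not contain a top-level data/ directory, the script's open('/tmp/data/train.txt') fails with exactly the IOError shown above.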