TFoS can't export model #57

Closed
terryKing1992 opened this issue Apr 7, 2017 · 6 comments

Comments


terryKing1992 commented Apr 7, 2017

I appended some code to the mnist_dist.py file, like this:

while not sv.should_stop() and step < args.steps:
    # Run a training step asynchronously.
    # See `tf.train.SyncReplicasOptimizer` for additional details on how to
    # perform *synchronous* training.

    # using feed_dict
    batch_xs, batch_ys = feed_dict()
    feed = {x: batch_xs, y_: batch_ys}

    if len(batch_xs) != batch_size:
        print("done feeding")
        break
    else:
        if args.mode == "train":
            _, step = sess.run([train_op, global_step], feed_dict=feed)
            # print accuracy and save model checkpoint to HDFS every 100 steps
            if (step % 100 == 0):
                print("{0} step: {1} accuracy: {2}".format(
                    datetime.now().isoformat(), step,
                    sess.run(accuracy, {x: batch_xs, y_: batch_ys})))
        else:  # args.mode == "inference"
            labels, preds, acc = sess.run([label, prediction, accuracy], feed_dict=feed)

            results = ["{0} Label: {1}, Prediction: {2}".format(datetime.now().isoformat(), l, p)
                       for l, p in zip(labels, preds)]
            TFNode.batch_results(ctx.mgr, results)
            print("acc: {0}".format(acc))

if sv.is_chief:
    print("save model to:{}=======1".format(local_model_dir))
    sess.graph._unsafe_unfinalize()
    classification_inputs = utils.build_tensor_info(x)
    classification_outputs_classes = utils.build_tensor_info(y)
    print('begin exporting!======11111')
    classification_signature = signature_def_utils.build_signature_def(
        inputs={signature_constants.CLASSIFY_INPUTS: classification_inputs},
        outputs={
            signature_constants.CLASSIFY_OUTPUT_CLASSES: classification_outputs_classes
        },
        method_name=signature_constants.CLASSIFY_METHOD_NAME)
    print('begin exporting!======22222')
    tensor_info_x = utils.build_tensor_info(x)
    tensor_info_y = utils.build_tensor_info(y)
    print('begin exporting!======33333')
    prediction_signature = signature_def_utils.build_signature_def(
        inputs={'images': tensor_info_x},
        outputs={'scores': tensor_info_y},
        method_name=signature_constants.PREDICT_METHOD_NAME)
    print('begin exporting!======44444')
    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    print('begin exporting!======55555')
    builder = saved_model_builder.SavedModelBuilder(local_model_dir)
    print('begin exporting!======66666')
    builder.add_meta_graph_and_variables(
        sess, [tag_constants.SERVING],
        signature_def_map={
            'predict_images': prediction_signature,
            signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: classification_signature,
        },
        clear_devices=True, legacy_init_op=legacy_init_op)
    print('begin exporting!======77777')
    builder.save()

But the chief worker only runs as far as the print('begin exporting!======66666') line; the task ends without executing the rest of the code.

The chief worker's log is shown below:

  save model to:/data/mnist/1/0=======1
  begin exporting!======11111
  begin exporting!======22222
  begin exporting!======33333
  begin exporting!======44444
  INFO:tensorflow:No assets to save.
  2017-04-07 16:31:23,479 INFO (Thread-1-30287) No assets to save.
  INFO:tensorflow:No assets to write.
  2017-04-07 16:31:23,479 INFO (Thread-1-30287) No assets to write.
  begin exporting!======55555
  begin exporting!======66666
  2017-04-07 16:31:23,520 INFO (MainThread-30287) Feeding None into output queue
  2017-04-07 16:31:23,528 INFO (MainThread-30287) Setting mgr.state to 'stopped'
  17/04/07 16:31:23 INFO python.PythonRunner: Times: total = 123, boot = -28156, init = 28200, finish = 79
  17/04/07 16:31:23 INFO executor.Executor: Finished task 4.0 in stage 3.0 (TID 36). 2102 bytes result sent to driver
  17/04/07 16:31:29 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
  17/04/07 16:31:29 INFO memory.MemoryStore: MemoryStore cleared
  17/04/07 16:31:29 INFO storage.BlockManager: BlockManager stopped
  17/04/07 16:31:29 INFO util.ShutdownHookManager: Shutdown hook called
  17/04/07 16:31:29 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a281a003-75aa-4d20-8f9d-93e3dbe8874a/executor-35ec5b73-87a6-4fa2-b895-b5783cace0d8/spark-957ff0d0-334b-42cc-98fb-4995ab0f8715
  17/04/07 16:31:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
  17/04/07 16:31:29 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a281a003-75aa-4d20-8f9d-93e3dbe8874a/executor-35ec5b73-87a6-4fa2-b895-b5783cace0d8/spark-957ff0d0-334b-42cc-98fb-4995ab0f8715

How can I get the task to export the model correctly? BTW, this export code works fine with plain distributed TensorFlow. Thanks a lot.


leewyang commented Apr 7, 2017

@terryKing1992 I did a similar experiment a couple weeks ago with code as follows:

      with tf.Session() as sess:
          print("{0} session ready".format(datetime.now().isoformat()))
          print("restoring model")
          saver.restore(sess, model_path)
          print("restored model")

          model_builder = builder.SavedModelBuilder(args.export_path)
          tensor_info_x = utils.build_tensor_info(x)
          tensor_info_y = utils.build_tensor_info(y)
          prediction_signature = sig_util.build_signature_def(
              inputs={'images': tensor_info_x},
              outputs={'scores': tensor_info_y},
              method_name=sig.PREDICT_METHOD_NAME)
          model_builder.add_meta_graph_and_variables(
              sess, [tag.SERVING],
              signature_def_map={
                  sig.DEFAULT_SERVING_SIGNATURE_DEF_KEY: prediction_signature
              })
          print("exporting model")
          model_builder.save()
          print("exported model")

In my case I was restoring from an existing checkpoint and then exporting a SavedModel. Also, I had a time.sleep(60) before the cluster.shutdown() (without it, I think the shutdown was forcing the executors to stop before the save operation completed). I haven't had a chance to revisit this in a while, but hopefully this will unblock you...
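Roughly, the driver-side pattern was the sketch below. This is only a sketch: the TFCluster setup call is omitted since its API varies across TFoS versions, and `dataRDD`, `args`, and `cluster` just follow the MNIST example naming.

    import time

    # ... create the TFoS cluster and launch mnist_dist.map_fun as in the
    # MNIST example (TFCluster setup omitted; it varies between versions) ...

    cluster.train(dataRDD, args.epochs)   # feed the training data from Spark

    # Give the chief worker time to finish writing the export before the
    # Spark executors are torn down; 60 seconds is what I used.
    time.sleep(60)

    cluster.shutdown()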


terryKing1992 commented Apr 10, 2017

@leewyang Thanks for your response. I tried adding time.sleep(60) before cluster.shutdown(), but it still can't export the model. Do you know why the program never executes the following lines?

builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        'predict_images': prediction_signature,
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: classification_signature,
    })
print('begin exporting!======77777')
builder.save()

@leewyang
Contributor

Unfortunately, I don't have that much experience with the SavedModelBuilder; as I mentioned, I was only experimenting with it recently. That said, I had tried something similar to your code but ended up with a version that loads a previously written checkpoint. Note: I had to point to a specific checkpoint file, not the upper-level model directory.
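For example, a minimal sketch of that restore step (the checkpoint directory below is hypothetical; tf.train.latest_checkpoint() resolves the concrete checkpoint prefix from the directory's "checkpoint" index file):

    import tensorflow as tf

    checkpoint_dir = "/data/mnist/model"   # hypothetical checkpoint directory

    # saver.restore() expects a checkpoint *prefix* like .../model.ckpt-1000,
    # not the directory itself; latest_checkpoint() looks up the newest one.
    ckpt_path = tf.train.latest_checkpoint(checkpoint_dir)

    with tf.Session() as sess:
        # Rebuild the graph from the checkpoint's .meta file, then restore weights.
        saver = tf.train.import_meta_graph(ckpt_path + ".meta")
        saver.restore(sess, ckpt_path)

(My snippet above assumed the graph and its `saver` were already constructed in the script; importing the meta graph is just an alternative way to get the same tensors back.)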


terryKing1992 commented Apr 11, 2017

I switched tfspark.zip to the Mar 8 commit. The job sometimes works now, but sometimes it throws errors like this:

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, /data/mnist/1/variables/variables_temp_7de0e7aec37b4ad3bbb04d7ddba6921d/part-00000-of-00001.data-00000-of-00001.tempstate14562722955503715538
          [[Node: save_1/SaveV2 = SaveV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](save_1/ShardedFilename, save_1/SaveV2/tensor_names, save_1/SaveV2/shape_and_slices, Variable, hid_b, hid_b/Adagrad, hid_w, hid_w/Adagrad, sm_b, sm_b/Adagrad, sm_w, sm_w/Adagrad)]]
 
Caused by op u'save_1/SaveV2', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/data/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/data/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/data/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/data/spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/data/spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 794, in func
  File "/root/tensorflow/TensorFlowOnSpark/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 218, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "mnist_dist.py", line 210, in map_fun
    clear_devices=True,legacy_init_op=legacy_init_op)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/saved_model/builder_impl.py", line 432, in add_meta_graph_and_variables
    allow_empty=True)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 669, in build
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 356, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 330, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 271, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 214, in save_op
    tensors)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 779, in save_v2
    tensors=tensors, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

Really strange issue.

@leewyang
Contributor

FWIW, I didn't have much success writing out the SavedModel after training (I was getting all sorts of weird TensorFlow graph/saver errors), which is why I ended up with a version of the code that restores from a saved checkpoint...
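Also, once a SavedModel does get written, a quick sanity check (nothing TFoS-specific here; export_dir is just whatever path was passed to SavedModelBuilder) is to load it back in a plain TensorFlow session:

    import tensorflow as tf
    from tensorflow.python.saved_model import loader, tag_constants

    export_dir = "/data/mnist/export"   # hypothetical SavedModel directory

    with tf.Session(graph=tf.Graph()) as sess:
        # Loads the MetaGraphDef tagged SERVING and restores its variables.
        loader.load(sess, [tag_constants.SERVING], export_dir)
        print("loaded SavedModel from {0}".format(export_dir))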

@terryKing1992
Author

Thanks all the same
