TFoS can't export model #57

Closed
terryKing1992 opened this issue Apr 7, 2017 · 6 comments

Comments


terryKing1992 commented Apr 7, 2017

I appended some code to the mnist_dist.py file, like this:

while not sv.should_stop() and step < args.steps:
    # Run a training step asynchronously.
    # See `tf.train.SyncReplicasOptimizer` for additional details on how to
    # perform *synchronous* training.

    # using feed_dict
    batch_xs, batch_ys = feed_dict()
    feed = {x: batch_xs, y_: batch_ys}

    if len(batch_xs) != batch_size:
        print("done feeding")
        break
    else:
        if args.mode == "train":
            _, step = sess.run([train_op, global_step], feed_dict=feed)
            # print accuracy and save model checkpoint to HDFS every 100 steps
            if (step % 100 == 0):
                print("{0} step: {1} accuracy: {2}".format(
                    datetime.now().isoformat(), step,
                    sess.run(accuracy, {x: batch_xs, y_: batch_ys})))
        else:  # args.mode == "inference"
            labels, preds, acc = sess.run([label, prediction, accuracy], feed_dict=feed)

            results = ["{0} Label: {1}, Prediction: {2}".format(datetime.now().isoformat(), l, p)
                       for l, p in zip(labels, preds)]
            TFNode.batch_results(ctx.mgr, results)
            print("acc: {0}".format(acc))

if sv.is_chief:
    print("save model to:{}=======1".format(local_model_dir))
    sess.graph._unsafe_unfinalize()
    classification_inputs = utils.build_tensor_info(x)
    classification_outputs_classes = utils.build_tensor_info(y)
    print('begin exporting!======11111')
    classification_signature = signature_def_utils.build_signature_def(
        inputs={signature_constants.CLASSIFY_INPUTS: classification_inputs},
        outputs={
            signature_constants.CLASSIFY_OUTPUT_CLASSES: classification_outputs_classes
        },
        method_name=signature_constants.CLASSIFY_METHOD_NAME)
    print('begin exporting!======22222')
    tensor_info_x = utils.build_tensor_info(x)
    tensor_info_y = utils.build_tensor_info(y)
    print('begin exporting!======33333')
    prediction_signature = signature_def_utils.build_signature_def(
        inputs={'images': tensor_info_x},
        outputs={'scores': tensor_info_y},
        method_name=signature_constants.PREDICT_METHOD_NAME)
    print('begin exporting!======44444')
    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    print('begin exporting!======55555')
    builder = saved_model_builder.SavedModelBuilder(local_model_dir)
    print('begin exporting!======66666')
    builder.add_meta_graph_and_variables(
        sess, [tag_constants.SERVING],
        signature_def_map={
            'predict_images': prediction_signature,
            signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: classification_signature,
        },
        clear_devices=True, legacy_init_op=legacy_init_op)
    print('begin exporting!======77777')
    builder.save()

But the chief worker only runs as far as the print('begin exporting!======66666') line; the task ends without executing the rest of the code.

The chief worker's log is shown below:

  save model to:/data/mnist/1/0=======1
  begin exporting!======11111
  begin exporting!======22222
  begin exporting!======33333
  begin exporting!======44444
  INFO:tensorflow:No assets to save.
  2017-04-07 16:31:23,479 INFO (Thread-1-30287) No assets to save.
  INFO:tensorflow:No assets to write.
  2017-04-07 16:31:23,479 INFO (Thread-1-30287) No assets to write.
  begin exporting!======55555
  begin exporting!======66666
  2017-04-07 16:31:23,520 INFO (MainThread-30287) Feeding None into output queue
  2017-04-07 16:31:23,528 INFO (MainThread-30287) Setting mgr.state to 'stopped'
  17/04/07 16:31:23 INFO python.PythonRunner: Times: total = 123, boot = -28156, init = 28200, finish = 79
  17/04/07 16:31:23 INFO executor.Executor: Finished task 4.0 in stage 3.0 (TID 36). 2102 bytes result sent to driver
  17/04/07 16:31:29 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
  17/04/07 16:31:29 INFO memory.MemoryStore: MemoryStore cleared
  17/04/07 16:31:29 INFO storage.BlockManager: BlockManager stopped
  17/04/07 16:31:29 INFO util.ShutdownHookManager: Shutdown hook called
  17/04/07 16:31:29 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a281a003-75aa-4d20-8f9d-93e3dbe8874a/executor-35ec5b73-87a6-4fa2-b895-b5783cace0d8/spark-957ff0d0-334b-42cc-98fb-4995ab0f8715
  17/04/07 16:31:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
  17/04/07 16:31:29 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a281a003-75aa-4d20-8f9d-93e3dbe8874a/executor-35ec5b73-87a6-4fa2-b895-b5783cace0d8/spark-957ff0d0-334b-42cc-98fb-4995ab0f8715

How can I get the task to export the model correctly? BTW, this export code works fine with plain distributed TensorFlow. Thanks a lot.


leewyang commented Apr 7, 2017

@terryKing1992 I did a similar experiment a couple weeks ago with code as follows:

      with tf.Session() as sess:
          print("{0} session ready".format(datetime.now().isoformat()))
          print("restoring model")
          saver.restore(sess, model_path)
          print("restored model")

          model_builder = builder.SavedModelBuilder(args.export_path)
          tensor_info_x = utils.build_tensor_info(x)
          tensor_info_y = utils.build_tensor_info(y)
          prediction_signature = sig_util.build_signature_def(
              inputs={'images': tensor_info_x},
              outputs={'scores': tensor_info_y},
              method_name=sig.PREDICT_METHOD_NAME)
          model_builder.add_meta_graph_and_variables(
              sess, [tag.SERVING],
              signature_def_map={
                  sig.DEFAULT_SERVING_SIGNATURE_DEF_KEY: prediction_signature
              })
          print("exporting model")
          model_builder.save()
          print("exported model")

In my case I was restoring from an existing checkpoint and then exporting a SavedModel. Also, I had a time.sleep(60) before the cluster.shutdown() (without it, I think the shutdown was forcing the executors to stop before the save operation completed). I haven't had a chance to revisit this in a while, but hopefully this will unblock you...
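Roughly, the driver-side pattern was the sketch below. This is only a sketch: the TFCluster setup call is omitted since its API varies across TFoS versions, and `dataRDD`, `args`, and `cluster` just follow the MNIST example naming.

    import time

    # ... create the TFoS cluster and launch mnist_dist.map_fun as in the
    # MNIST example (TFCluster setup omitted; it varies between versions) ...

    cluster.train(dataRDD, args.epochs)   # feed the training data from Spark

    # Give the chief worker time to finish writing the export before the
    # Spark executors are torn down; 60 seconds is what I used.
    time.sleep(60)

    cluster.shutdown()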


terryKing1992 commented Apr 10, 2017

@leewyang Thanks for your response. I tried adding time.sleep(60) before cluster.shutdown(), but it still can't export the model. Do you know why the program never executes the following lines?

builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        'predict_images': prediction_signature,
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: classification_signature,
    })
print('begin exporting!======77777')
builder.save()

@leewyang
Contributor

Unfortunately, I don't have that much experience with the SavedModelBuilder; as I mentioned, I was only experimenting with it recently. That said, I had tried something similar to your code but ended up with a version that loads a previously written checkpoint. Note: I had to point to a specific checkpoint file, not the upper-level model directory.
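For example, a minimal sketch of that restore step (the checkpoint directory below is hypothetical; tf.train.latest_checkpoint() resolves the concrete checkpoint prefix from the directory's "checkpoint" index file):

    import tensorflow as tf

    checkpoint_dir = "/data/mnist/model"   # hypothetical checkpoint directory

    # saver.restore() expects a checkpoint *prefix* like .../model.ckpt-1000,
    # not the directory itself; latest_checkpoint() looks up the newest one.
    ckpt_path = tf.train.latest_checkpoint(checkpoint_dir)

    with tf.Session() as sess:
        # Rebuild the graph from the checkpoint's .meta file, then restore weights.
        saver = tf.train.import_meta_graph(ckpt_path + ".meta")
        saver.restore(sess, ckpt_path)

(My snippet above assumed the graph and its `saver` were already constructed in the script; importing the meta graph is just an alternative way to get the same tensors back.)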


terryKing1992 commented Apr 11, 2017

I switched tfspark.zip to the Mar 8 commit. The job sometimes works now, but sometimes it throws errors like this:

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, /data/mnist/1/variables/variables_temp_7de0e7aec37b4ad3bbb04d7ddba6921d/part-00000-of-00001.data-00000-of-00001.tempstate14562722955503715538
          [[Node: save_1/SaveV2 = SaveV2[dtypes=[DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](save_1/ShardedFilename, save_1/SaveV2/tensor_names, save_1/SaveV2/shape_and_slices, Variable, hid_b, hid_b/Adagrad, hid_w, hid_w/Adagrad, sm_b, sm_b/Adagrad, sm_w, sm_w/Adagrad)]]
 
Caused by op u'save_1/SaveV2', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/data/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in <module>
  File "/data/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "/data/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "/data/spark/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/data/spark/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2407, in pipeline_func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 346, in func
  File "/data/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 794, in func
  File "/root/tensorflow/TensorFlowOnSpark/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 218, in _mapfn
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "mnist_dist.py", line 210, in map_fun
    clear_devices=True,legacy_init_op=legacy_init_op)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/saved_model/builder_impl.py", line 432, in add_meta_graph_and_variables
    allow_empty=True)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1070, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 669, in build
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 356, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 330, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 271, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 214, in save_op
    tensors)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 779, in save_v2
    tensors=tensors, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

Really strange issue.

@leewyang
Contributor

FWIW, I didn't have much success writing out the SavedModel after training (I was getting all sorts of weird TensorFlow graph/saver errors), which is why I ended up with a version of the code that restores from a saved checkpoint...
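Also, once a SavedModel does get written, a quick sanity check (nothing TFoS-specific here; export_dir is just whatever path was passed to SavedModelBuilder) is to load it back in a plain TensorFlow session:

    import tensorflow as tf
    from tensorflow.python.saved_model import loader, tag_constants

    export_dir = "/data/mnist/export"   # hypothetical SavedModel directory

    with tf.Session(graph=tf.Graph()) as sess:
        # Loads the MetaGraphDef tagged SERVING and restores its variables.
        loader.load(sess, [tag_constants.SERVING], export_dir)
        print("loaded SavedModel from {0}".format(export_dir))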

@terryKing1992
Author

Thanks all the same
