
Synchronous Training using SyncReplicasOptimizer #8978

Closed
tushar00jain opened this issue Apr 5, 2017 · 4 comments

@tushar00jain
I'm trying to implement a synchronous distributed Recurrent Neural Network using TensorFlow on multiple servers. Here's the link to my code: https://github.com/tushar00jain/spark-ml/blob/master/rnn-sync.ipynb. I've also provided the relevant part below.

I want the computations within the same batch to happen in parallel, but I think each worker is still computing its own separate RNN and updating the parameters on the parameter server independently. I know this because I print the _current_state variable after running the graph for each batch; also, the _total_loss for the same global step is different on each worker server.

I'm following the instructions provided at these links:
https://www.tensorflow.org/deploy/distributed#replicated_training
https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer

Is this a bug or is there something wrong with my code?

    # Wait for the chief to finish initialization, then start the input queue runners.
    sess = sv.prepare_or_wait_for_session(server.target)
    queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
    sv.start_queue_runners(sess, queue_runners)

    tf.logging.info('Started %d queues for processing input data.',
                    len(queue_runners))

    if is_chief:
        # Only the chief starts the sync queue runner and seeds the token queue.
        sv.start_queue_runners(sess, chief_queue_runners)
        sess.run(init_tokens_op)

    print("{0} session ready".format(datetime.now().isoformat()))
    #####################################################################

    ########################### training loop ###########################
    _current_state = np.zeros((batch_size, state_size))
    for batch_idx in range(args.steps):
        if sv.should_stop() or tf_feed.should_stop():
            break

        batchX, batchY = feed_dict(tf_feed.next_batch(batch_size))

        print('==========================================================')
        print(_current_state)

        if args.mode == "train":
            # Feed the previous RNN state back in and fetch the updated one.
            _total_loss, _train_step, _current_state, _predictions_series, _global_step = sess.run(
                [total_loss, train_step, current_state, predictions_series, global_step],
                feed_dict={
                    batchX_placeholder: batchX,
                    batchY_placeholder: batchY,
                    init_state: _current_state
                })

            print(_global_step, batch_idx)
            print(_current_state)
            print('==========================================================')

            if _global_step % 5 == 0:
                print("Step", _global_step, "Loss", _total_loss)
@asimshankar
Contributor

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. Thanks!

@tushar00jain
Author

Hmm, thanks for the reply. I just thought it could be a bug because I was following the instructions as stated on the website, but I still can't get the training to synchronise.

@jmchen-g
Contributor

jmchen-g commented Apr 5, 2017

This is most likely something wrong with the code. Only the chief is supposed to update the variables, including the global step, once enough gradients have been collected for them. It is also expected that each worker computes its step separately, to keep the speed as close to async runs as possible...

@tushar00jain
Author

Thanks! I'm not sure how to keep the hidden state consistent across all machines since it's not a trainable parameter, but I'll ask on StackOverflow.
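
A minimal sketch of one possible direction (not something resolved in this thread): keep the carried state in a non-trainable variable placed on the parameter servers via replica_device_setter and write it back with tf.assign, so every worker reads the same shared state. The cluster layout and the stand-in current_state below are placeholders:

    import tensorflow as tf

    batch_size, state_size = 32, 128  # placeholder values for the notebook's names

    cluster_spec = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })  # hypothetical cluster layout

    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        # Non-trainable, so SyncReplicasOptimizer never aggregates gradients for it,
        # but it still lives on a parameter server and is visible to every worker.
        shared_state = tf.get_variable(
            "shared_rnn_state", [batch_size, state_size],
            initializer=tf.zeros_initializer(), trainable=False)

        # ... build the RNN with shared_state as its initial state; suppose it
        # yields a tensor `current_state` for the last time step ...
        current_state = shared_state  # stand-in so the sketch builds on its own

        # Assign the new state back; fetch this op together with train_step so each
        # worker starts from the state produced by the previous aggregated step.
        update_state = tf.assign(shared_state, current_state)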
