
Synchronous Training using SyncReplicasOptimizer #8978

Closed
tushar00jain opened this issue Apr 5, 2017 · 4 comments

@tushar00jain
I'm trying to implement a synchronous distributed Recurrent Neural Network using TensorFlow on multiple servers. Here's the link to my code: https://github.com/tushar00jain/spark-ml/blob/master/rnn-sync.ipynb. I've also provided the relevant part below.

I want the computations within the same batch to happen in parallel, but I think each worker is still computing its own separate RNN and updating the parameters on the parameter server independently. I know this because I print the _current_state variable after running the graph for each batch; also, the _total_loss for the same global step is different on each worker server.

I'm following the instructions provided at these links:
https://www.tensorflow.org/deploy/distributed#replicated_training
https://www.tensorflow.org/api_docs/python/tf/train/SyncReplicasOptimizer

Is this a bug or is there something wrong with my code?

    # Wait for the chief to finish initialization, then start the input queue runners.
    sess = sv.prepare_or_wait_for_session(server.target)
    queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
    sv.start_queue_runners(sess, queue_runners)

    tf.logging.info('Started %d queues for processing input data.',
                    len(queue_runners))

    if is_chief:
        # Only the chief starts the sync queue runner and seeds the token queue.
        sv.start_queue_runners(sess, chief_queue_runners)
        sess.run(init_tokens_op)

    print("{0} session ready".format(datetime.now().isoformat()))
    #####################################################################

    ########################### training loop ###########################
    _current_state = np.zeros((batch_size, state_size))
    for batch_idx in range(args.steps):
        if sv.should_stop() or tf_feed.should_stop():
            break

        batchX, batchY = feed_dict(tf_feed.next_batch(batch_size))

        print('==========================================================')
        print(_current_state)

        if args.mode == "train":
            # Feed the previous RNN state back in and fetch the updated one.
            _total_loss, _train_step, _current_state, _predictions_series, _global_step = sess.run(
                [total_loss, train_step, current_state, predictions_series, global_step],
                feed_dict={
                    batchX_placeholder: batchX,
                    batchY_placeholder: batchY,
                    init_state: _current_state
                })

            print(_global_step, batch_idx)
            print(_current_state)
            print('==========================================================')

            if _global_step % 5 == 0:
                print("Step", _global_step, "Loss", _total_loss)
@asimshankar
Contributor

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. Thanks!

@tushar00jain
Author

Hmm, thanks for the reply. I just thought it could be a bug because I was following the instructions as stated on the website, but I still can't get the training to synchronise.

@jmchen-g
Contributor

jmchen-g commented Apr 5, 2017

This is most likely something wrong with the code. Only the chief is supposed to update the variables, including the global step, once enough gradients have been collected for them. It is also expected that each worker computes its step separately, to keep the speed as close to async runs as possible...

@tushar00jain
Author

Thanks! I'm not sure how to keep the hidden state consistent across all machines since it's not a trainable parameter, but I'll ask on StackOverflow.
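
A minimal sketch of one possible direction (not something resolved in this thread): keep the carried state in a non-trainable variable placed on the parameter servers via replica_device_setter and write it back with tf.assign, so every worker reads the same shared state. The cluster layout and the stand-in current_state below are placeholders:

    import tensorflow as tf

    batch_size, state_size = 32, 128  # placeholder values for the notebook's names

    cluster_spec = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })  # hypothetical cluster layout

    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        # Non-trainable, so SyncReplicasOptimizer never aggregates gradients for it,
        # but it still lives on a parameter server and is visible to every worker.
        shared_state = tf.get_variable(
            "shared_rnn_state", [batch_size, state_size],
            initializer=tf.zeros_initializer(), trainable=False)

        # ... build the RNN with shared_state as its initial state; suppose it
        # yields a tensor `current_state` for the last time step ...
        current_state = shared_state  # stand-in so the sketch builds on its own

        # Assign the new state back; fetch this op together with train_step so each
        # worker starts from the state produced by the previous aggregated step.
        update_state = tf.assign(shared_state, current_state)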
