
Add different rnn implementation modes to ptb tutorial #2276

Merged
merged 5 commits into master from ptb on Aug 29, 2017

Conversation

bignamehyp
Member

Add different rnn implementation modes to ptb tutorial using CudnnLSTM and LSTMBlockCell.

rnn_mode == CUDNN

| config | epochs | train | valid  | test   | wps   |
|--------|--------|-------|--------|--------|-------|
| small  | 13     | 40.50 | 116.60 | 112.13 | 49k   |
| medium | 39     | 22.58 | 123.83 | 118.69 | 24.2k |
| large  | 55     | 8.03  | 129.78 | 126.03 | 10.5k |

In fact, the params tensor layout for cudnn_lstm is so different from the canonical LSTM that we need a completely different set of init_scale and learning-rate parameters.

rnn_mode == BLOCK

| config | epochs | train | valid  | test   | wps   |
|--------|--------|-------|--------|--------|-------|
| small  | 13     | 40.55 | 120.70 | 115.52 | 17.2k |
| medium | 39     | 45.68 | 86.97  | 83.47  | 13.6k |
| large  | 55     | 37.94 | 82.75  | 78.49  | 5.0k  |

WPS (words per second) before this CL for the small, medium, and large models was 15.6k, 12.6k, and 5.0k, respectively.

Benchmarking platform: E5-2690 v4, Titan X.
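
For readers following along, here is a minimal sketch (not the code in this PR) of how an rnn_mode switch between the basic and block cells might look with the TF 1.x contrib APIs; the CUDNN mode does not use a drop-in cell and instead goes through the separate tf.contrib.cudnn_rnn.CudnnLSTM layer API.

```python
import tensorflow as tf

BASIC, BLOCK, CUDNN = "basic", "block", "cudnn"

def get_lstm_cell(rnn_mode, hidden_size):
  """Return an LSTM cell implementation for the requested rnn_mode."""
  if rnn_mode == BASIC:
    # Canonical, platform-independent LSTM cell.
    return tf.contrib.rnn.BasicLSTMCell(
        hidden_size, forget_bias=0.0, state_is_tuple=True)
  if rnn_mode == BLOCK:
    # Fused-kernel LSTM cell; faster, but mathematically the same as BASIC.
    return tf.contrib.rnn.LSTMBlockCell(hidden_size, forget_bias=0.0)
  # CUDNN is handled by a separate code path built on tf.contrib.cudnn_rnn.
  raise ValueError("rnn_mode %s is not a cell-based mode" % rnn_mode)
```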

@bignamehyp
Member Author

I just signed the CLA.

@bignamehyp bignamehyp self-assigned this Aug 23, 2017
@bignamehyp
Member Author

Run googlebot again.

Add different rnn implementation modes to ptb tutorial using CudnnLSTM and LSTMBlockCell.
Sync with sequence_loss change.
import reader
from google3.third_party.tensorflow_models.tutorials.rnn.ptb import reader
from google3.third_party.tensorflow_models.tutorials.rnn.ptb import util
from google3.third_party.tensorflow.python.client import device_lib
Contributor

Can you remove the google3 imports? Also, we don't have a util.py file in this folder, and I'm not sure whether we can import device_lib.
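
For context, device_lib is typically imported here to check whether a GPU is present before choosing the CUDNN path; a small sketch using only the standard TensorFlow Python client:

```python
from tensorflow.python.client import device_lib

def gpu_available():
  # list_local_devices() enumerates every device visible to TensorFlow.
  return any(d.device_type == "GPU" for d in device_lib.list_local_devices())
```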

Member Author

Done. PTAL.

@tensorflow tensorflow deleted a comment from googlebot Aug 24, 2017
@tensorflow tensorflow deleted a comment from googlebot Aug 24, 2017
@bignamehyp
Member Author

PTAL.

@nealwu
Contributor

nealwu commented Aug 29, 2017

Looks good at a glance. I'll try to run it on my machine tomorrow and see how it does.

@nealwu
Contributor

nealwu commented Aug 29, 2017

👍 Got the following results with the medium model:

Epoch: 39 Train Perplexity: 45.733
Epoch: 39 Valid Perplexity: 87.722
Test Perplexity: 83.632

@nealwu nealwu merged commit c705568 into master Aug 29, 2017
@nealwu nealwu deleted the ptb branch August 29, 2017 22:02
mark86092 added a commit to mark86092/tensorflow-models that referenced this pull request Sep 1, 2017
@janchorowski

This CL is buggy and should be reverted:

First, the CUDNN backend doesn't make sense:

  1. it doesn't do dropout between LSTM layers, which is required to reach the perplexities reported in the header of this file.
  2. the code doesn't handle initial states properly; it is mandatory that the initial state of the RNN is taken from the last state of the previous iteration.

Second, the CL pulled in old code for the basic LSTM, making it incompatible with the new TensorFlow; see e.g. #8191.

@protoget
Member

Re janchorowski@
First

  1. Cudnn takes the dropout rate as an argument and applies dropout to each layer's input (see the sketch after this comment). I think we tested the convergence before submitting the CL.
  2. This isn't true. For each example, the initial state is always 0 to begin with. The state is carried over to the next step (not the next example); after finishing all the steps you get the final state (for this example).

Second:
Not sure which part is problematic; could you point it out more explicitly?
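
For reference, a rough sketch of what the CUDNN path described above can look like with the TF 1.3-era tf.contrib.cudnn_rnn.CudnnLSTM API (an illustration, not the exact code in this PR; argument names may differ slightly between versions):

```python
import tensorflow as tf

def build_cudnn_lstm(inputs, num_layers, hidden_size, batch_size,
                     keep_prob, is_training):
  """Builds a multi-layer CUDNN LSTM; inputs is time-major:
  [num_steps, batch_size, hidden_size]."""
  # The dropout rate is a constructor argument, applied to each layer's input.
  lstm = tf.contrib.cudnn_rnn.CudnnLSTM(
      num_layers=num_layers,
      num_units=hidden_size,
      input_size=hidden_size,
      dropout=(1.0 - keep_prob) if is_training else 0.0)
  # All CUDNN weights live in one opaque, flat parameter buffer, which is why
  # init_scale behaves differently than with the canonical cells.
  params = tf.get_variable(
      "lstm_params",
      initializer=tf.random_uniform([lstm.params_size()], -0.05, 0.05),
      validate_shape=False)
  # Per example, the initial state starts at zero and is carried across the
  # num_steps of that example; the final (h, c) is returned by the call.
  h = tf.zeros([num_layers, batch_size, hidden_size])
  c = tf.zeros([num_layers, batch_size, hidden_size])
  outputs, h, c = lstm(inputs, h, c, params, is_training)
  return outputs, (h, c)
```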

@janchorowski

Hi,

I apologize for speaking too soon. I had two problems with running the new code and blamed them on this change:

  • the one I described as second, in which this line:
    [attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)
    got changed into: https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py#L222 Note that before the change a new cell object was created for each layer, while after the change the same cell object is reused (see the sketch after this comment). This triggered the error "ValueError: Attempt to reuse RNNCell". I then upgraded TensorFlow to v1.3 and can't reproduce this error; I could install some older versions and see at which versions the error occurs, but unfortunately I didn't record it.
  • the bad perplexities achieved with the cudnn backend. The model still overfits (see the table in the OP message). I blamed it on the lack of dropout in the cudnn wrapper and on incorrect recurrent state preservation. I reread the code and you are right about the hidden states being properly preserved from minibatch to minibatch. If dropout also works for all layers, then why does changing the LSTM implementation have such a drastic impact on the perplexity? The cudnn backend shouldn't require a completely new set of hyperparameters. It is unfortunate to have a fast implementation that gives bad perplexities and a slow one that gives good perplexities.
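
For anyone hitting the same error, a minimal illustrative sketch of the difference being discussed (using the contrib cell API; not the exact tutorial code):

```python
import tensorflow as tf

num_layers, hidden_size = 2, 200

def make_cell():
  # A fresh cell object each time this is called.
  return tf.contrib.rnn.BasicLSTMCell(
      hidden_size, forget_bias=0.0, state_is_tuple=True)

# Before the change: one new cell object per layer.
stacked_fresh = tf.contrib.rnn.MultiRNNCell(
    [make_cell() for _ in range(num_layers)], state_is_tuple=True)

# After the change: the same cell object is reused for every layer. On some
# older TensorFlow versions, calling this stacked cell raised
# "ValueError: Attempt to reuse RNNCell ...".
shared = make_cell()
stacked_reused = tf.contrib.rnn.MultiRNNCell(
    [shared] * num_layers, state_is_tuple=True)
```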

@protoget
Member

Re janchorowski@

Thanks for the details.

You're right that the cell is reused; thanks for catching that. Would you create a PR, since you've invested a lot of time in it? We can review it.
We have no control over the underlying implementation of Cudnn RNN, since it's a closed-source library from Nvidia. I agree it is unfortunate. That said, for the time being it will likely be common for users to use different hyperparameters depending on the implementation; e.g. Nvidia has a tutorial on training with mixed precision for speedup, where you might need to adjust your learning rate in order to achieve convergence.
We have a way to initialize cudnn variables with the same distributions as those used for the platform-independent RNN cells. The change is pending review and will probably make the 1.4 release. We will update the example with the new cudnn APIs and check the perplexity again.

Thank you.

Utumno added a commit to Utumno/models that referenced this pull request Oct 9, 2017
@Utumno

Utumno commented Oct 9, 2017

Does the commit linked above fix the problem with reusing the same cell? If so, I could submit a PR.

@nealwu
Contributor

nealwu commented Oct 9, 2017

@bignamehyp @protoget Does that change by @Utumno seem good to you? See #934 for some context.

@protoget
Member

protoget commented Oct 9, 2017

LGTM if it passes quality tests.
FYI, the new Cudnn RNN layer API has been submitted, so this example will be further refactored later.

@nealwu
Contributor

nealwu commented Oct 11, 2017

@Utumno could you submit your PR and @-mention us?

@Utumno

Utumno commented Oct 11, 2017

Thanks, will do ASAP, but please merge #2403 first - it's needed for this to even run on Python 3, and it would end up in my pull request otherwise.

@nealwu
Contributor

nealwu commented Oct 11, 2017

@Utumno Sure thing, merged.

Utumno added a commit to Utumno/models that referenced this pull request Oct 11, 2017
ajakash pushed a commit to ajakash/models that referenced this pull request Oct 24, 2017
@cocosci

cocosci commented Nov 12, 2017

Can anyone tell me the difference between "canonical LSTM cells" and "block or CUDNN cells"?
My TF version is 1.2.1, which leaves me with canonical LSTM cells only. The keyword "BASIC" doesn't sound so fancy... :)

Correct me if I am wrong:
"All three backend cells are LSTM units that do not support clipping, projection, etc. The major difference lies in the hardware optimization. It's okay to choose any of the three."
