Apply Transformer model for the word language problem in pytorch/examples #555
Conversation
Commits:
- Encoder/decoder embeddings normalized by sqrt(d_model). test loss 3.84 lr=5.0
- Encoder/decoder embeddings normalized by sqrt(d_model). test loss 4.68 lr=20.0
- Remove print out. Revise main.py file.
- Load the best training model through epochs.
- Update README.md file to include the transformer model.
- Update the README.md file.
- Use PositionalEncoding in transformer. test loss 0.30 lr=5.0
Force-pushed from efb5fd7 to 9332536.
- Update generate.py to generate text with transformer.pt model.
- Add CUDA function to generate.py when running transformer model.
- Add generate_subsequent_mask() in Transformer.
- Generate transformer model in main.py.
- Revise generate.py working for both RNN and Transformer models.
- Remove decoder_data.
- Add some changes because of transformer.py.
- Change d_ff to dim_feedforward.
- Remove Embeddings and PositionalEncoder out of transformer.py.
Force-pushed from 9332536 to 70ebca2.
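The commits above reference sqrt(d_model) embedding scaling and a PositionalEncoding module. As context, here is a minimal sketch of the paper's sinusoidal scheme; the dropout and max_len defaults are illustrative, not necessarily the PR's values:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
        self.register_buffer('pe', pe.unsqueeze(1))   # (max_len, 1, d_model)

    def forward(self, x):
        # x: (seq_len, batch, d_model); add positional signal, then dropout.
        return self.dropout(x + self.pe[:x.size(0)])
```

The matching embedding normalization mentioned in the commits would then be something like `embedding(src) * math.sqrt(d_model)` applied before this module.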
A few minor updates. Not using a superclass for the transformer model. Check that the transformer module exists before using it. @cpuhrsch
word_language_model/generate.py
Outdated
```python
with torch.no_grad():  # no tracking history
    for i in range(args.words):
        seq_mask = model.generate_square_subsequent_mask(input.size(0)).to(device)
        output = model(input, input, src_mask=seq_mask, tgt_mask=seq_mask)
```
Since seq_mask is generated only to be fed into the model, maybe we can do this within the model forward?
Generating seq_mask and tgt_mask inside the forward function is doable. Currently they are generated at the beginning of each epoch so they can be re-used across the time steps; I thought this saves effort.
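For illustration, a minimal sketch of what moving the mask into forward could look like, with a length-keyed cache so the epoch-level reuse is kept; the class and attribute names are illustrative, not the PR's exact code:

```python
import torch
import torch.nn as nn

class MaskedTransformer(nn.Module):
    """Illustrative wrapper: the causal mask is built inside forward()."""
    def __init__(self, d_model=16, nhead=2):
        super().__init__()
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead)
        self.seq_mask = None  # cached mask, rebuilt only on length change

    def forward(self, src, tgt):
        size = src.size(0)
        if self.seq_mask is None or self.seq_mask.size(0) != size:
            # -inf above the diagonal blocks attention to future positions.
            self.seq_mask = torch.triu(
                torch.full((size, size), float('-inf'), device=src.device),
                diagonal=1)
        return self.transformer(src, tgt, src_mask=self.seq_mask,
                                tgt_mask=self.seq_mask)

model = MaskedTransformer()
src = torch.rand(5, 3, 16)  # (seq_len, batch, d_model)
out = model(src, src)       # no mask argument needed at the call site
```

generate.py's loop could then call the model without building seq_mask at the call site.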
word_language_model/main.py
Outdated
```python
try:
    from torch.nn import Transformer
except ImportError:
    raise ImportError('Transformer module exists in PyTorch 1.1 and above.')
```
I think it might be better to say "Transformer module does not exist in PyTorch 1.1 or lower".
Move this into model.py
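A minimal sketch of that move, combined with the message wording suggested above; the hasattr-based check is an assumption for illustration, not the PR's exact code:

```python
# model.py
import torch.nn as nn

# Guard the Transformer dependency once, where the model is defined.
if not hasattr(nn, 'Transformer'):
    raise ImportError('Transformer module does not exist in PyTorch 1.1 or lower.')
```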
```python
ntokens = len(corpus.dictionary)
model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied).to(device)
if args.model == 'Transformer':
```
Can we do this within Model? That is, extend RNNModel to contain both and pass a flag to decide which one to use. This would be in line with the RNN settings (GRU, LSTM, etc.).
I was thinking about merging TransformerSeq2Seq and RNNModel, but realized they are very different in architecture.
Let's talk about this in person.
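For reference, a hedged sketch of keeping the two classes separate while main.py chooses one up front; this assumes the same surrounding main.py context as the diffs below, and TransformerSeq2Seq's argument list is illustrative:

```python
# main.py -- choose the architecture once instead of building RNNModel
# and then overwriting it in the Transformer case.
ntokens = len(corpus.dictionary)
if args.model == 'Transformer':
    model = model.TransformerSeq2Seq(ntokens, ntokens, args.transformer_head,
                                     args.nhid, args.nlayers, args.dropout).to(device)
else:
    model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid,
                           args.nlayers, args.dropout, args.tied).to(device)
```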
word_language_model/main.py
Outdated
```python
ntokens = len(corpus.dictionary)
model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied).to(device)
if args.model == 'Transformer':
    model = model.TransformerSeq2Seq(ntokens, ntokens, nhead=args.transformer_head,
```
RNNModel isn't using keyword arguments to pass ntokens, dropout, etc. We could be consistent so it's easier to read.
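Under that suggestion the call would pass everything positionally, mirroring the RNNModel line; the parameter order here is purely illustrative, since the PR's TransformerSeq2Seq signature isn't shown:

```python
# Hypothetical positional call, matching RNNModel's style above.
model = model.TransformerSeq2Seq(ntokens, ntokens, args.transformer_head,
                                 args.nhid, args.nlayers, args.dropout).to(device)
```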
word_language_model/main.py
Outdated
```python
if args.model == 'Transformer':
    if args.bptt > len(data_source) - 1 - i:
        tgt_mask = model.transformer.generate_square_subsequent_mask(len(data_source) - 1 - i).to(device)
```
Is there any way to do this implicitly within the model and cache it via, say, self.tgt_mask?
Should this be len(data) instead of len(data_source)?
When we instantiate the transformer model, we don't know the lengths of tgt and src unless we pass them in as arguments. The masks change depending on the incoming tgt and src sequences, so it's better to generate them at runtime.
No. This is for the case when the remaining data doesn't fill a full args.bptt length.
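To make the length logic concrete, a small runnable sketch of why the mask size varies at the end of an epoch; the sizes are illustrative:

```python
import torch

bptt, n_positions = 35, 103  # illustrative sizes

for i in range(0, n_positions - 1, bptt):
    # The final chunk can be shorter than bptt, so size the causal mask
    # to the actual slice rather than the configured length.
    seq_len = min(bptt, n_positions - 1 - i)
    tgt_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')),
                          diagonal=1)  # same shape generate_square_subsequent_mask gives
    print(i, seq_len, tuple(tgt_mask.shape))  # last chunk: seq_len == 32
```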
word_language_model/main.py
Outdated
```python
if args.model == 'Transformer':
    if args.bptt > len(train_data) - 1 - i:
        tgt_mask = model.transformer.generate_square_subsequent_mask(len(train_data) - 1 - i).to(device)
```
See above.
```python
word_weights = output[-1].squeeze().div(args.temperature).exp().cpu()
word_idx = torch.multinomial(word_weights, 1)[0]
word_tensor = torch.Tensor([[word_idx]]).long().to(device)
input = torch.cat([input, word_tensor], 0)
```
Doesn't this mean that input keeps growing? Is that actually necessary?
Not necessary. We can keep a constant length for the input sequence.
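A minimal sketch of that constant-length alternative, assuming a fixed sliding window; the `window` value is illustrative:

```python
import torch

device = torch.device('cpu')
window = 50  # illustrative cap on the context length

input = torch.randint(0, 100, (1, 1), dtype=torch.long, device=device)
for _ in range(10):
    word_idx = 7  # stand-in for the token sampled via torch.multinomial
    word_tensor = torch.tensor([[word_idx]], dtype=torch.long, device=device)
    # Append the new token, then keep only the most recent `window` rows so
    # the per-step attention cost stays constant instead of growing.
    input = torch.cat([input, word_tensor], 0)[-window:]
```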
Summary: Accidentally rebased the old PR and made it too messy; find it here (#19274). Created this PR for comments. The model is still WIP, but I want to have some feedback before moving too far. The transformer model depends on several modules, like MultiheadAttention (landed). Transformer is implemented based on the paper (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer with self-defined and/or built-in components (i.e. encoder, decoder, encoder_layer, decoder_layer), and can use the Transformer class to build a standard transformer model and modify sub-layers as needed (see the sketch below). Add a few unit tests for the transformer module, as follows:

- TestNN.test_Transformer_cell
- TestNN.test_transformerencoderlayer
- TestNN.test_transformerdecoderlayer
- TestNN.test_transformer_args_check
- TestScript.test_scriptmodule_transformer_cuda

There is another demonstration example applying the transformer module to the word language problem: pytorch/examples#555

Pull Request resolved: #20170
Differential Revision: D15417983
Pulled By: zhangguanheng66
fbshipit-source-id: 7ce771a7e27715acd9a23d60bf44917a90d1d572
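As a minimal illustration of the component flexibility described in the summary (the hyperparameters are the paper's defaults, not mandated by the module):

```python
import torch
import torch.nn as nn

# The standard model in one line...
standard = nn.Transformer(d_model=512, nhead=8)

# ...or assemble a piece from the built-in sub-layer classes.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.rand(10, 32, 512)   # (seq_len, batch, d_model)
memory = encoder(src)           # (10, 32, 512)
```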
…ples (pytorch#555)

* Use append to accelerate data loading process.
* First transformer model working for word language model.
* Work for GPU (all the model and data have to be sent to cuda).
* Transformer model GPU activated: nhead=1 nlayers=1 d_ff=64, test loss 6.55.
* Use lr=5.0, test loss 4.8. Encoder/decoder embeddings normalized by sqrt(d_model): test loss 3.84 at lr=5.0, test loss 4.68 at lr=20.0. Remove print out. Revise main.py file. Load the best training model through epochs. Update README.md file to include the transformer model. Update the README.md file. Use PositionalEncoding in transformer: test loss 0.30 at lr=5.0.
* Update main.py to have mask on source sequences. Update generate.py to generate text with the transformer.pt model. Add CUDA function to generate.py when running the transformer model. Add generate_subsequent_mask() in Transformer. Generate transformer model in main.py. Revise generate.py to work for both RNN and Transformer models. Remove decoder_data. Add some changes because of transformer.py.
* No need to provide Transformer args for generating text. Change d_ff to dim_feedforward. Remove Embeddings and PositionalEncoder out of transformer.py.
* Replace tabs with spaces.
* Update transformer model in model.py.
* Recycle RNN arguments for Transformer model.
* Update README.md file.
* Remove model.generator in main.py.
* Update the warnings in transformer model.
* Fix a small bug in model.py.
* Remove keyword arguments for consistency.
* Create a new function generate_square_subsequent_mask inside TransformerSeq2Seq.
* Remove unnecessary attributes.
* A minor change.
* Move src_mask and tgt_mask to be members of the module.
* Move transformer check to model.py.
* Move masks inside forward function.
* Use TransformerEncoder for word language model.
* Remove Generator module in Transformer.
* Merge RNN and Transformer model in model.py.
* Minor changes.
* Minor changes to address reviewer's comments.
* Remove reset_parameter function.
* Split RNN and Transformer model to keep code readable.
* Minor changes.
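The commits above converge on an encoder-only design ("Use TransformerEncoder for word language model"). A minimal sketch of that shape, with illustrative hyperparameters and the positional encoding from the earlier sketch omitted for brevity; this is not the PR's exact code:

```python
import math
import torch
import torch.nn as nn

class TransformerModel(nn.Module):
    """Encoder-only word-level language model (illustrative)."""
    def __init__(self, ntoken, d_model, nhead, dim_feedforward, nlayers, dropout=0.5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.embed = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, ntoken)  # project back to vocabulary

    def forward(self, src, src_mask):
        # Scale embeddings by sqrt(d_model), as in the commits above;
        # a PositionalEncoding module would normally be applied here too.
        x = self.embed(src) * math.sqrt(self.d_model)
        return self.decoder(self.encoder(x, src_mask))

model = TransformerModel(ntoken=1000, d_model=64, nhead=2,
                         dim_feedforward=256, nlayers=2)
src = torch.randint(0, 1000, (35, 4))  # (seq_len, batch)
mask = torch.triu(torch.full((35, 35), float('-inf')), diagonal=1)
print(model(src, mask).shape)  # torch.Size([35, 4, 1000])
```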
Apply the recently developed PyTorch Transformer model to the word language problem. Any reviews and suggestions are appreciated. Please don't land this PR until the transformer model is fully supported by PyTorch. @cpuhrsch @soumith