
Conversation

zhangguanheng66 (Contributor) commented on May 3, 2019:

Apply the recently developed PyTorch Transformer model to the word language modeling problem. Any reviews and suggestions are appreciated. Please don't land this PR until the Transformer model is fully supported by PyTorch. @cpuhrsch @soumith

zhangguanheng66 changed the title "[WIP] Apply Transformer model word language" → "[WIP] Apply Transformer model for the word language problem in pytorch/examples" on May 3, 2019
Encoder/decoder embeddings normalized by sqrt(d_model). test loss 3.84 lr=5.0
Encoder/decoder embeddings normalized by sqrt(d_model). test loss 4.68 lr=20.0
Remove print out.
Revise main.py file.
Load the best training model through epochs.
Update README.md file to include the transformer model.
Update the README.md file.
Use PositionalEncoding in transformer. test loss 0.30 lr=5.0
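
For reference, the sqrt(d_model) embedding scaling and the sinusoidal PositionalEncoding mentioned in the commits above follow the "Attention Is All You Need" paper. A minimal sketch, assuming the example's (seq_len, batch, d_model) tensor layout:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding, added to embeddings that were
    already scaled, e.g. emb = embedding(x) * math.sqrt(d_model)."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch, d_model)
        return x + self.pe[:x.size(0)]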
zhangguanheng66 force-pushed the transformer_word_language_model branch from efb5fd7 to 9332536 on May 6, 2019 15:20
Guanheng Zhang added 2 commits May 6, 2019 08:22
Update generate.py to generate text with transformer.pt model.
Add CUDA function to generate.py when running transformer model.
Add generate_subsequent_mask() in Transformer
Generate transformer model in main.py.
Revise generate.py working for both RNN and Transformer models.
Remove decoder_data
Add some changes because of transformer.py.
Change d_ff to dim_feedforward.
Remove Embeddings and PositionalEncoder out of transformer.py.
zhangguanheng66 force-pushed the transformer_word_language_model branch from 9332536 to 70ebca2 on May 6, 2019 15:23
zhangguanheng66 (Contributor, Author) commented:

A few minor updates: not using a superclass for the Transformer model, and checking that the Transformer module exists before using it. @cpuhrsch

with torch.no_grad():  # no tracking history
    for i in range(args.words):
        seq_mask = model.generate_square_subsequent_mask(input.size(0)).to(device)
        output = model(input, input, src_mask=seq_mask, tgt_mask=seq_mask)

Review comment:

Since seq_mask is generated only to be fed into the model, maybe we can do this within the model forward?

zhangguanheng66 (Contributor, Author) replied:

Generating seq_mask and tgt_mask inside the forward function is doable. Currently they are generated at the beginning of each epoch so they are re-used across the time steps; I thought this saved some effort.
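
A minimal sketch of the idea being discussed: regenerate the mask inside forward only when the sequence length changes, so it is still reused across time steps. This assumes the module wraps nn.Transformer as self.transformer and initializes self.src_mask = None (the later commit "Move masks inside forward function." suggests roughly this shape):

def forward(self, src, tgt):
    # Rebuild the causal mask only when the input length changes;
    # otherwise the cached mask is reused across time steps.
    if self.src_mask is None or self.src_mask.size(0) != src.size(0):
        self.src_mask = self.transformer.generate_square_subsequent_mask(
            src.size(0)).to(src.device)
    return self.transformer(src, tgt, src_mask=self.src_mask,
                            tgt_mask=self.src_mask)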

try:
    from torch.nn import Transformer
except ImportError:
    raise ImportError('Transformer module exists in PyTorch 1.1 and above.')

Review comment:

I think it might be better to say "Transformer module does not exist in PyTorch 1.1 or lower".
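
With the suggested wording, the check would read:

try:
    from torch.nn import Transformer
except ImportError:
    raise ImportError('Transformer module does not exist in PyTorch 1.1 or lower.')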

Review comment:

Move this into model.py


ntokens = len(corpus.dictionary)
model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied).to(device)
if args.model == 'Transformer':

Review comment:

Can we do this within Model? That is, extend RNNModel to contain both and pass a flag to decide which one to use. This would be in line with the RNN settings (GRU, LSTM, etc.).
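
A rough sketch of the flag-based dispatch being suggested; this wrapper and its argument names are hypothetical, not code from the PR:

import torch.nn as nn

class Seq2SeqModel(nn.Module):
    """Hypothetical wrapper that dispatches on a model-type flag,
    the way RNNModel already dispatches on 'LSTM', 'GRU', etc."""
    def __init__(self, model_type, ntokens, emsize, nhid, nlayers,
                 dropout=0.5, nhead=2):
        super().__init__()
        self.model_type = model_type
        self.embedding = nn.Embedding(ntokens, emsize)
        if model_type == 'Transformer':
            self.core = nn.Transformer(d_model=emsize, nhead=nhead,
                                       num_encoder_layers=nlayers,
                                       num_decoder_layers=nlayers,
                                       dim_feedforward=nhid,
                                       dropout=dropout)
        else:
            # nn.RNN / nn.LSTM / nn.GRU share this constructor signature.
            self.core = getattr(nn, model_type)(emsize, nhid, nlayers,
                                                dropout=dropout)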

zhangguanheng66 (Contributor, Author) replied:

I was thinking about merging TransformerSeq2Seq and RNNModel, but realized they are very different in architecture.

Review comment:

Let's talk about this in person.

ntokens = len(corpus.dictionary)
model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied).to(device)
if args.model == 'Transformer':
    model = model.TransformerSeq2Seq(ntokens, ntokens, nhead=args.transformer_head,

Review comment:

RNNModel isn't using keyword arguments to pass ntokens, dropout, etc. We could be consistent so it's easier to read.
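
A sketch of the consistent positional style; TransformerSeq2Seq's parameter order isn't shown in this diff, so the order below is an assumption for illustration:

model = model.TransformerSeq2Seq(ntokens, ntokens, args.transformer_head,
                                 args.emsize, args.nhid, args.nlayers,
                                 args.dropout).to(device)  # argument order assumed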


if args.model == 'Transformer':
    if args.bptt > len(data_source) - 1 - i:
        tgt_mask = model.transformer.generate_square_subsequent_mask(len(data_source) - 1 - i).to(device)

Review comment:

Is there any way to do this implicitly within the model and cache it via, say, self.tgt_mask?

Review comment:

Should this be len(data) instead of len(data_source)?

zhangguanheng66 (Contributor, Author) replied:

When we instantiate the transformer model, we have no idea of the length of tgt and src unless we pass it as an argument. The masks change depending on the incoming tgt and src sequences, so it's better to generate them at runtime.

zhangguanheng66 (Contributor, Author) replied:

No. This handles the case where the remaining data is shorter than a full args.bptt length.
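
For context, the len(data_source) - 1 - i arithmetic mirrors the example's get_batch helper, where the final chunk can be shorter than args.bptt; a sketch along the lines of the word_language_model code:

def get_batch(source, i):
    # The last chunk may be shorter than args.bptt, hence len(source) - 1 - i.
    seq_len = min(args.bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target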


if args.model == 'Transformer':
    if args.bptt > len(train_data) - 1 - i:
        tgt_mask = model.transformer.generate_square_subsequent_mask(len(train_data) - 1 - i).to(device)

Review comment:

See above.

word_weights = output[-1].squeeze().div(args.temperature).exp().cpu()
word_idx = torch.multinomial(word_weights, 1)[0]
word_tensor = torch.Tensor([[word_idx]]).long().to(device)
input = torch.cat([input, word_tensor], 0)
Review comment:

Doesn't this mean that input keeps on growing larger and larger? Is that actually necessary?

zhangguanheng66 (Contributor, Author) replied:

Not necessary. We can keep the input sequence at a constant length.
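
A one-line sketch of the constant-length variant: keep only the most recent args.bptt tokens as the window slides (illustrative, reusing the variables from the snippet above):

word_tensor = torch.tensor([[word_idx]], dtype=torch.long, device=device)
input = torch.cat([input, word_tensor], 0)[-args.bptt:]  # fixed-size window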

zhangguanheng66 changed the title "[WIP] Apply Transformer model for the word language problem in pytorch/examples" → "Apply Transformer model for the word language problem in pytorch/examples" on Jun 12, 2019
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Jun 12, 2019
Summary:
Accidentally rebased the old PR and made it too messy. Find it here (#19274).

Created a PR for comments. The model is still WIP, but I want to get some feedback before moving too far. The transformer model depends on several modules, such as MultiheadAttention (landed).

Transformer is implemented based on the paper (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer with self-defined and/or built-in components (i.e. encoder, decoder, encoder_layer, decoder_layer). Users can use the Transformer class to build a standard transformer model and modify sub-layers as needed.
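
For illustration, a minimal sketch of both usages (hyperparameters are arbitrary):

import torch.nn as nn

# Standard model straight from the built-in class.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# Or compose from individual components and swap in custom pieces.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048)
custom_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
model = nn.Transformer(d_model=512, nhead=8, custom_encoder=custom_encoder)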

Add a few unit tests for the transformer module, as follows:
TestNN.test_Transformer_cell
TestNN.test_transformerencoderlayer
TestNN.test_transformerdecoderlayer
TestNN.test_transformer_args_check
TestScript.test_scriptmodule_transformer_cuda

There is another demonstration example applying the transformer module to the word language problem: pytorch/examples#555.
Pull Request resolved: #20170

Differential Revision: D15417983

Pulled By: zhangguanheng66

fbshipit-source-id: 7ce771a7e27715acd9a23d60bf44917a90d1d572
soumith merged commit 4581968 into pytorch:master on Aug 9, 2019
YinZhengxun pushed a commit to YinZhengxun/mt-exercise-02 that referenced this pull request Mar 30, 2025
Apply Transformer model for the word language problem in pytorch/examples (pytorch#555)

* Use append to accelerate data loading process.

* First transformer model working for word language model.

* Work for GPU (all the model and data have to be sent to cuda)

* Transformer model GPU activated nhead=1 nlayers=1 d_ff=64 test loss 6.55

* Use lr=5.0 test loss 4.8
Encoder/decoder embeddings normalized by sqrt(d_model). test loss 3.84 lr=5.0
Encoder/decoder embeddings normalized by sqrt(d_model). test loss 4.68 lr=20.0
Remove print out.
Revise main.py file.
Load the best training model through epochs.
Update README.md file to include the transformer model.
Update the README.md file.
Use PositionalEncoding in transformer. test loss 0.30 lr=5.0

* Update main.py to have mask on source sequences.
Update generate.py to generate text with transformer.pt model.
Add CUDA function to generate.py when running transformer model.
Add generate_subsequent_mask() in Transformer
Generate transformer model in main.py.
Revise generate.py working for both RNN and Transformer models.
Remove decoder_data
Add some changes because of transformer.py.

* No need to provide Transformer args for generating text.
Change d_ff to dim_feedforward.
Remove Embeddings and PositionalEncoder out of transformer.py.

* Replace tabs with spaces.

* Update transformer model in model.py.

* Recycle RNN arguments for Transformer model.

* Update README.md file.

* Remove model.generator in main.py.

* Update the warnings in transformer model.

* Fix a small bug in model.py.

* Remove keyword arguments for consistency.

* Create a new function generate_square_subsequent_mask inside the TransformerSeq2Seq

* Remove unnecessary attributes.

* A minor change.

* Move src_mask and tgt_mask as the members of the module.

* Move transformer check to model.py

* Move masks inside forward function.

* Use TransformerEncoder for word language model.

* Remove Generator module in Transformer.

* Merge RNN and Transformer model in model.py

* Minor changes.

* Minor changes to address reviewer's comments.

* Remove reset_parameter function.

* Split RNN and Transformer model to keep code readable.

* Minor changes.