PyTorch implementation of a basic sequence-to-sequence (seq2seq) encoder-decoder architecture for machine translation.
[SS = Source Sentence ; GT = Ground Truth Translation ; PT = Predicted Translation]
SS: da monsieur le president merci pour votre superbe rapport sur les droits de l'homme
GT: da mr president thank you for a superb report on human rights
PT: da mr president thank you for your superb report on human rights

SS: nous avons recu a ce sujet des informations diverses
GT: we have received different information about this
PT: we have received this on different different length

SS: a cet egard le role de la publicite est ambivalent
GT: in this respect the role of advertising is ambivalent
PT: in this respect the role role role of advertising is is ambivalent ambivalent

SS: avant le vote sur le paragraphe < d >
GT: before the vote on paragraph < d >
PT: before the vote on paragraph < d >

SS: nous avions foi dans les objectifs fixes
GT: we believed in those targets
PT: we had faith in the set set set objectives set
Use the command

    python main.py <root_location> <filename(s)> <save_path> -attn=dot -debug=True

from the home folder.

- `<root_location>` is the directory containing the data files.
- `<filename(s)>` is the name of the parallel corpus file(s). It can be either a single file containing the source-target pairs in tab-separated format, or two files containing the source tokens and target tokens respectively, aligned line by line.
- `<save_path>` is the location to save the trained model.
- `-attn` (optional, default='dot') specifies the attention model.
- A `True` or `False` value for `-debug` toggles debug mode.
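For example, with a hypothetical data directory `data/` containing a tab-separated parallel corpus `fra-eng.txt` (the file names here are purely illustrative), training might be launched as:

    python main.py data/ fra-eng.txt saved_models/ -attn=dot -debug=False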
- The architecture follows a pattern similar to Bahdanau et al.: a deep GRU model is used as the encoder, and another deep GRU model is used to decode and output a meaningful translation (a minimal sketch is given after this list).
- Batch processing is supported, and all sentences in a batch are preprocessed to be of almost equal length.
- A masked cross-entropy loss is implemented at the decoder, so that any tokens added as padding for batch processing are excluded from the loss computation (see the sketch after this list).
- Text preprocessing: convert unicode strings to ASCII, convert to lower case, remove punctuation, replace numbers with a special `<d>` token, and replace rare words with a special `<unk>` symbol (a sketch of these steps also follows the list).
- Source sentences longer than 15 word tokens are omitted from the dataset.
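As a rough illustration of the encoder-decoder pattern described above, here is a minimal sketch of a deep GRU encoder and a one-step GRU decoder in PyTorch. The class names, layer sizes, and interfaces are assumptions for illustration and do not mirror the actual classes in this repository.

```python
import torch
import torch.nn as nn

class EncoderGRU(nn.Module):
    """Deep GRU encoder: embeds source tokens, returns per-step outputs and final hidden state."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, n_layers=2, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        embedded = self.embedding(src)           # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)     # outputs: (batch, src_len, hid_dim)
        return outputs, hidden

class DecoderGRU(nn.Module):
    """Deep GRU decoder: consumes one target token at a time, conditioned on the previous hidden state."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, n_layers=2, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, input_tok, hidden):        # input_tok: (batch, 1)
        embedded = self.embedding(input_tok)     # (batch, 1, emb_dim)
        output, hidden = self.gru(embedded, hidden)
        logits = self.out(output.squeeze(1))     # (batch, vocab_size)
        return logits, hidden
```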
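The masked cross-entropy loss can be sketched as follows: compute the per-token loss, then zero out positions that correspond to padding. The padding index and tensor shapes are assumptions, not taken from this repository's code.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, pad_idx=0):
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len) with pad_idx at padded positions."""
    batch, seq_len, vocab = logits.shape
    # Per-token negative log-likelihood, no reduction yet.
    losses = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1), reduction='none')
    losses = losses.view(batch, seq_len)
    # Mask out padded positions so they do not contribute to the loss.
    mask = (targets != pad_idx).float()
    return (losses * mask).sum() / mask.sum().clamp(min=1.0)
```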
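The text preprocessing steps listed above might look roughly like the following; the exact regular expressions and the rare-word threshold are assumptions for illustration only.

```python
import re
import unicodedata
from collections import Counter

def normalize(sentence):
    """Unicode -> ASCII, lower case, strip punctuation, replace numbers with <d>."""
    s = unicodedata.normalize('NFD', sentence)
    s = ''.join(c for c in s if unicodedata.category(c) != 'Mn')  # drop accent marks
    s = s.lower()
    s = re.sub(r'\d+', ' <d> ', s)         # numbers -> special token
    s = re.sub(r"[^a-z<>\s]", ' ', s)      # remove punctuation and other symbols
    return s.split()

def replace_rare(tokenized_corpus, min_count=3):
    """Replace words seen fewer than min_count times with <unk>."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    return [[tok if counts[tok] >= min_count else '<unk>' for tok in sent]
            for sent in tokenized_corpus]
```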
Different kinds of attention models are implemented; the attention type is selected with the `-attn` option described above, and the default is dot attention.
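For reference, a minimal Luong-style dot attention between a decoder hidden state and the encoder outputs could be written as below. This is a generic sketch under assumed tensor shapes, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def dot_attention(dec_hidden, enc_outputs, src_mask=None):
    """dec_hidden: (batch, hid); enc_outputs: (batch, src_len, hid);
    src_mask: optional (batch, src_len) bool, True at real (non-pad) tokens."""
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)   # (batch, src_len)
    if src_mask is not None:
        scores = scores.masked_fill(~src_mask, float('-inf'))             # ignore padding
    weights = F.softmax(scores, dim=1)                                    # attention weights
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)     # weighted sum of encoder states
    return context, weights
```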
Most of the hyperparameters have been given suitable default values in main.py. Some hyperparameters that I think are worth changing for better performance are listed below.
- teacher_forcing_ratio: If this value is 1, the ground-truth words are fed as inputs to the next step of the decoder RNN (teacher forcing). If it is 0, the predictions from the previous time step are fed as inputs instead. Values in between give the probability of using teacher forcing at each iteration (scheduled sampling); a sketch of this decoding loop is given after this list.
- n_layers: Deeper GRUs are known to give better results than shallower ones. Sutskever et al. use 4 layers in both the encoder and the decoder, while Google's NMT uses 8 on each side.
- dropout: Increasing the dropout value reduces overfitting.
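A minimal sketch of how a teacher forcing ratio is typically used in the decoding loop, assuming a one-step decoder with the interface sketched earlier (names and shapes are illustrative, not the repository's):

```python
import random
import torch

def decode_with_targets(decoder, dec_hidden, targets, sos_idx, teacher_forcing_ratio=0.5):
    """Run the decoder over a target sequence, mixing teacher forcing and model predictions."""
    batch, tgt_len = targets.shape
    input_tok = torch.full((batch, 1), sos_idx, dtype=torch.long)    # start-of-sentence token
    all_logits = []
    for t in range(tgt_len):
        logits, dec_hidden = decoder(input_tok, dec_hidden)          # one step of the GRU decoder
        all_logits.append(logits)
        if random.random() < teacher_forcing_ratio:
            input_tok = targets[:, t].unsqueeze(1)                   # feed the ground-truth token
        else:
            input_tok = logits.argmax(dim=1, keepdim=True)           # feed the model's own prediction
    return torch.stack(all_logits, dim=1), dec_hidden                # (batch, tgt_len, vocab)
```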
- Beam Search: Presently, decoding uses only the single best prediction at each time step. Beam search with beam width k can be used instead to keep the k best hypotheses at each time step; k is typically 3-5 (a sketch follows this list).
- Char-Level RNN: To better handle rare words, a character-level RNN can be added on top of the existing architecture.
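A compact sketch of beam search decoding for a single sentence, assuming the one-step decoder interface from the earlier sketch; this is illustrative, not part of the repository's code.

```python
import torch
import torch.nn.functional as F

def beam_search(decoder, dec_hidden, sos_idx, eos_idx, beam_width=3, max_len=15):
    """Decode a single sentence (batch of 1) by keeping the beam_width best hypotheses."""
    # Each hypothesis: (cumulative log prob, token list, hidden state, finished flag).
    beams = [(0.0, [sos_idx], dec_hidden, False)]
    for _ in range(max_len):
        candidates = []
        for log_prob, tokens, hidden, done in beams:
            if done:
                candidates.append((log_prob, tokens, hidden, True))
                continue
            input_tok = torch.tensor([[tokens[-1]]], dtype=torch.long)
            logits, new_hidden = decoder(input_tok, hidden)
            log_probs = F.log_softmax(logits.squeeze(0), dim=-1)
            top_lp, top_idx = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((log_prob + lp, tokens + [idx], new_hidden, idx == eos_idx))
        # Keep only the beam_width best hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1]  # token ids of the best hypothesis
```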
The network was run on the EuroParl dataset for fr-en translation, which consists of over 2,000,000 sentence pairs, reduced to around 400,000 after preprocessing. With a batch size of 64, each epoch took around 27 minutes on a TitanX GPU, and the model achieved a unigram BLEU score of 0.42 on a held-out validation set.