
Machine Translation using Sequence to sequence Deep Learning Models

PyTorch implementation of a basic sequence-to-sequence (seq2seq) encoder-decoder architecture for machine translation.

Sample FR-EN translations:

[SS = Source Sentence ; GT = Ground Truth Translation ; PT = Predicted Translation]

SS: da monsieur le president merci pour votre superbe rapport sur les droits de l'homme
GT: da mr president thank you for a superb report on human rights
PT: da mr president thank you for your superb report on human rights

SS: nous avons recu a ce sujet des informations diverses
GT: we have received different information about this
PT: we have received this on different different length

SS: a cet egard le role de la publicite est ambivalent
GT: in this respect the role of advertising is ambivalent
PT: in this respect the role role role of advertising is is ambivalent ambivalent

SS: avant le vote sur le paragraphe < d >
GT: before the vote on paragraph < d >
PT: before the vote on paragraph < d >

SS: nous avions foi dans les objectifs fixes
GT: we believed in those targets
PT: we had faith in the set set set objectives set

How to run

Use the command python main.py <root_location> <filename(s)> <save_path> -attn=dot -debug=True from the home folder. An example invocation is given after the argument list.

  • <root_location> is the directory containing the data files.
  • <filename(s)> is the name of the parallel corpus file(s). It can be
    • a single file containing the source-target pairs in tab-separated format,
    • or two files containing the source and target sentences respectively, aligned line by line.
  • <save_path> is the location to save the trained model.
  • -attn (optional, default='dot') specifies the attention model.
  • -debug (optional) takes True or False and toggles debug mode.
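For example, if a tab-separated corpus fr-en.txt sits in ./data and the trained model should be saved under ./models (hypothetical paths, not taken from the repository), training with bilinear attention could be started as python main.py ./data fr-en.txt ./models -attn=bilinear -debug=False.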

Architecture Details

  • The architecture follows a pattern similar to Bahdanau et al. A deep GRU model is used as the encoder, and another deep GRU model decodes it into a meaningful translation.
  • Batch processing is supported, and all sentences in a batch are preprocessed to be of roughly equal length.
  • Masked cross-entropy loss is implemented at the decoder: tokens added as padding for ease of batch processing are excluded from the loss computation (a minimal sketch of this idea follows the list).
  • Text preprocessing: convert unicode strings to ASCII, convert to lower case, remove punctuation, replace numbers with a special token <d>, and replace rare words with a special <unk> symbol.
  • Source sentences longer than 15 word tokens are omitted from the dataset.
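The snippet below is a minimal sketch of the masked-loss idea, not the repository's exact code; the tensor shapes and padding convention are assumptions. Padding positions are simply zeroed out of the per-token cross entropy before averaging.

```
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, lengths):
    """Cross entropy over a padded batch, ignoring padded positions.

    logits:  (batch, max_len, vocab_size) raw decoder scores
    targets: (batch, max_len) gold token indices (long tensor)
    lengths: (batch,) true length of each target sequence
    """
    batch, max_len, vocab = logits.size()
    # Per-token negative log-likelihood, kept unreduced so it can be masked.
    nll = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction='none'
    ).view(batch, max_len)
    # mask[i, t] = 1 while t is inside the true sequence, 0 on padding.
    mask = (torch.arange(max_len, device=logits.device)[None, :]
            < lengths[:, None]).float()
    # Average the loss over real tokens only.
    return (nll * mask).sum() / mask.sum()
```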

Attention module

The attention variants implemented are listed below; a scoring sketch follows the list.

  • None : no attention module is used during decoding.
  • dot : dot-product score between the decoder state and each encoder state.
  • linear : a learned linear scoring function over the hidden states.
  • bilinear : a bilinear form between the decoder state and each encoder state.
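One plausible reading of the three scoring functions is sketched below in PyTorch; the exact formulation in the repository (especially for the linear variant) may differ, and the class and argument names here are illustrative.

```
import torch
import torch.nn as nn

class AttentionScore(nn.Module):
    """Scores a decoder state against encoder states: dot, linear, or bilinear."""
    def __init__(self, method, hidden_size):
        super().__init__()
        self.method = method
        if method == 'linear':
            # A small feed-forward layer over the concatenated states.
            self.attn = nn.Linear(2 * hidden_size, 1)
        elif method == 'bilinear':
            # score = dec^T W enc
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state:  (batch, hidden)
        # enc_states: (batch, src_len, hidden)
        if self.method == 'dot':
            scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)
        elif self.method == 'bilinear':
            scores = torch.bmm(enc_states, self.W(dec_state).unsqueeze(2)).squeeze(2)
        else:  # 'linear'
            expanded = dec_state.unsqueeze(1).expand_as(enc_states)
            scores = self.attn(torch.cat([enc_states, expanded], dim=2)).squeeze(2)
        # Softmax over source positions gives the attention weights.
        return torch.softmax(scores, dim=1)
```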

Parameter tuning

Most of the hyperparameters have been provided with suitable default values in main.py. The ones most worth tuning for better performance are listed below.

  • teacher_forcing_ratio : If this value is 1, the ground-truth word is always fed as input to the next decoder step, called teacher forcing. If it is 0, the model's own prediction from the previous step is fed instead. Values in between give the probability of using teacher forcing at each iteration, as in scheduled sampling (a sketch follows this list).
  • n_layers : Deeper GRUs are known to give better results than shallower ones. Sutskever et al. use 4 layers at the encoder and decoder, while Google's NMT uses 8 on each side.
  • dropout : Increasing the dropout value reduces overfitting.
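A minimal sketch of how a teacher_forcing_ratio is typically applied inside the decoding loop is given below; the function and variable names are illustrative and not taken from main.py.

```
import random

def next_decoder_input(targets, logits, t, teacher_forcing_ratio):
    """Choose the input token for decoder step t + 1.

    targets: (batch, max_len) ground-truth token ids
    logits:  (batch, vocab) decoder scores produced at step t
    """
    if random.random() < teacher_forcing_ratio:
        # Teacher forcing: feed the ground-truth token at step t.
        return targets[:, t]
    # Otherwise feed the model's own best guess from step t.
    return logits.argmax(dim=1)
```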

To Do:

  • Beam Search: Presently, decoding is done greedily using the single best prediction at each time step. Beam search with beam width k instead keeps the k best partial hypotheses at each time step; k is typically 3-5 (see the sketch below).
  • Char-Level RNN: To better handle rare words, a character-level RNN can be added on top of the existing architecture.
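Since beam search is not implemented yet, the following is only a sketch of the idea under an assumed decoder interface: the hypothetical decoder(token, hidden) is taken to return per-word log-probabilities of shape (1, vocab) and the new hidden state for a single hypothesis.

```
import torch

def beam_search(decoder, init_hidden, sos_id, eos_id, k=3, max_len=15):
    """Keep the k highest-scoring partial hypotheses at every decoding step."""
    beams = [([sos_id], 0.0, init_hidden)]          # (tokens, log-prob, state)
    for _ in range(max_len):
        candidates = []
        for tokens, score, hidden in beams:
            if tokens[-1] == eos_id:                # finished hypothesis
                candidates.append((tokens, score, hidden))
                continue
            log_probs, new_hidden = decoder(torch.tensor([tokens[-1]]), hidden)
            top_lp, top_ids = log_probs.squeeze(0).topk(k)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp, new_hidden))
        # Keep only the k best hypotheses by total log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(t[-1] == eos_id for t, _, _ in beams):
            break
    return beams[0][0]                              # best token sequence
```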

Sample results for French-English translation

The network was run on the EuroParl dataset for fr-en translation, which consists of over 2,000,000 sentence pairs, reduced to around 400,000 after preprocessing. With a batch size of 64, each epoch took around 27 minutes on a TitanX GPU, and the model achieved a unigram BLEU score of 0.42 on a held-out validation set.
