Pytorch implementation of Neural Machine Translation with seq2seq and attention (en-zh)

Neural Machine Translation

Pytorch implementation of Neural Machine Translation with seq2seq and attention (en-zh) (英汉翻译)

This repo reaches a BLEU score of 10.44 on my test dataset (computed with multi-bleu.perl).


The goal of machine translation is to find the target sentence y that maximizes the conditional probability p(y|x) for a given source sentence x. Because the space of possible sentences is effectively infinite, this probability cannot be estimated directly. Neural networks, which are good at fitting complex functions, are therefore used to model it.
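The objective above is usually written with an autoregressive factorization, which is what a seq2seq decoder models one token at a time. As a sketch (the symbols T, y_t and y_{<t} are notation introduced here, not taken from the repo):

```latex
\hat{y} = \arg\max_{y} \; p(y \mid x), \qquad
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)
```

Here T is the length of the target sentence and y_{<t} denotes the tokens generated before step t.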

Sutskever et al. (2014) proposed a model consisting of an encoder and a decoder, named seq2seq. Once proposed, it set off a wave of research in NMT, and a number of follow-up works appeared (e.g. Cho et al.). One of the most famous is the attention mechanism (Bahdanau et al., 2014).

In this repo, I implemented the seq2seq model with attention in PyTorch for en-zh translation.


Python 3.6

  • PyTorch>=0.4
  • torchtext
  • nltk
  • jieba
  • subword-nmt


First, run


to tokenize and do BPE.

Then run nmt.ipynb for training and testing.

Lastly, the BLEU score is calculated with

perl multi-bleu.perl references.txt < predictions.txt

A pretrained model can be found here (password: ukvh).


neu2017 from the CWMT 2017 corpus

2 million parallel sentences (en-zh)

98% of the data is used for training; the rest is used for validation and testing.
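A minimal sketch of such a split in plain Python. The function name `split_corpus`, the fixed seed, and the even division of the held-out 2% between validation and test are assumptions made for illustration; the repo does not state how the remainder is divided.

```python
import random


def split_corpus(pairs, train_frac=0.98, seed=0):
    """Shuffle sentence pairs and split them into train/valid/test.

    Assumption: the held-out remainder is divided evenly between
    validation and test; the source only says "the rest".
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = round(len(pairs) * train_frac)
    held_out = pairs[n_train:]
    n_valid = len(held_out) // 2
    return pairs[:n_train], held_out[:n_valid], held_out[n_valid:]


# Toy stand-in for the parallel corpus (hypothetical data).
pairs = [(f"en sentence {i}", f"zh sentence {i}") for i in range(1000)]
train, valid, test = split_corpus(pairs)
print(len(train), len(valid), len(test))
```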


  • tokenizer
    • zh: jieba
    • en: nltk.word_tokenize
  • BPE: subword-nmt (For the parameter num_operations, I chose 32000.)

It is worth mentioning that BPE reduced the vocabulary significantly, from 50000+ to 32115.

Besides, the <sos> and <eos> symbols are conventionally added to the start and end of each sentence. OOV words are represented with <unk>.

One problem in my training is that the Chinese vocabulary is so large that I kept only the 50k most frequent words, which leaves many <unk> tokens in my training dataset. Too many <unk> tokens in the training data hurt the model, which then tends to predict <unk> as well. So I simply ignore <unk> when predicting.
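One way to ignore <unk> at prediction time is to exclude it from the candidates before picking the best token. The sketch below is an assumption about how this could be done, not necessarily what the notebook does; `best_token` and the toy vocabulary are hypothetical.

```python
def best_token(scores, vocab, banned=("<unk>",)):
    """Return the highest-scoring token that is not in `banned`."""
    candidates = [(s, tok) for s, tok in zip(scores, vocab) if tok not in banned]
    return max(candidates, key=lambda pair: pair[0])[1]


# Toy example: <unk> has the best raw score but is skipped.
vocab = ["<unk>", "我", "你"]
log_probs = [-0.1, -1.2, -2.3]
print(best_token(log_probs, vocab))
```

With tensors, the same effect is usually obtained by setting the <unk> logit to negative infinity before the argmax or beam expansion.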

Model Architecture (seq2seq)

Similar to Luong et al. (2015).


  • embeddings: GloVe (en) & word2vec (zh) (both 300-dim)
  • encoder: 4-layer Bi-GRU (hidden-size 1000-dim)
  • decoder: 4-layer GRU with attention (hidden-size 1000-dim)
  • attention: bilinear global attention
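Bilinear global attention, in the sense of the "general" score of Luong et al., scores every encoder state against the current decoder state through a learned matrix and then takes a softmax over all source positions. The sketch below uses plain Python lists with toy 2-dimensional states for readability; the function name, the matrix `W`, and the example values are illustrative assumptions, and the repo itself works on PyTorch tensors with much larger hidden sizes.

```python
import math


def bilinear_attention(query, keys, W):
    """Bilinear ("general") attention: score_i = query^T W key_i."""
    scores = []
    for key in keys:
        # Project the encoder state with W, then dot with the decoder state.
        projected = [sum(w * k for w, k in zip(row, key)) for row in W]
        scores.append(sum(q * p for q, p in zip(query, projected)))
    # Numerically stable softmax over all source positions (global attention).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context


decoder_state = [1.0, 0.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity here; learned in a real model
weights, context = bilinear_attention(decoder_state, encoder_states, W)
print(weights, context)
```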

According to Chung et al. (2014), GRU can reach performance similar to LSTM, so I chose GRU.

Training details


  • optim: Adam
  • lr: 1e-4
  • no L2 regularization (Since there is no obvious overfitting)
  • dropout: 0.3
  • clip gradient norm: 0.5
  • warm-up: embeddings are kept fixed during the first epoch
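To illustrate what clipping the gradient norm to 0.5 does, here is a plain-Python sketch of global-norm clipping. In PyTorch this is normally done with `torch.nn.utils.clip_grad_norm_`; the function below is a simplified stand-in written for illustration, with a flat list of floats in place of parameter gradients.

```python
import math


def clip_by_global_norm(grads, max_norm=0.5, eps=1e-6):
    """Rescale gradients so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        # Only shrink; gradients already inside the ball are left unchanged.
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length is capped.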

Validation loss: [figure: validation loss curve] (The periodic bulge appears because I reset the optimizer every epoch, which is not necessary.)

Perplexity reaches 5.5 on the validation dataset.
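For reference, perplexity is commonly computed as the exponential of the mean per-token negative log-likelihood. Assuming that definition (with natural logarithms) is the one used here, a perplexity of 5.5 corresponds to a cross-entropy loss of about ln(5.5) per token. A small sketch, with `perplexity` as a hypothetical helper name:

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: every token is predicted with probability 0.25.
toy_ppl = perplexity([math.log(0.25)] * 4)

# Per-token loss that corresponds to the reported perplexity of 5.5.
loss_at_ppl_5_5 = math.log(5.5)
print(toy_ppl, loss_at_ppl_5_5)
```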

Trained for 231K steps, nearly 4 epochs.

I found that training is a bit slow. However, some parameters, including the attention parameters, cannot be tuned well with a larger learning rate.

Beam Search

According to Wikipedia, beam search is breadth-first search (BFS) with a width constraint: at each step only a fixed number of the best partial hypotheses are kept.
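A minimal sketch of plain beam search over a toy next-token model, to show the idea rather than the notebook's actual implementation. The names `beam_search` and `toy_model`, and the probability table, are all hypothetical; in the real model the next-token log-probabilities would come from the decoder.

```python
import math


def beam_search(next_log_probs, beam_size=3, max_len=10, sos="<sos>", eos="<eos>"):
    """Keep the `beam_size` best partial hypotheses at every step."""
    beams = [([sos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Hypothetical next-token distribution; unknown prefixes emit <eos>."""
    table = {
        ("<sos>",): {"a": 0.6, "b": 0.4},
        ("<sos>", "a"): {"x": 0.5, "y": 0.5},
        ("<sos>", "b"): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2)
print(greedy_tokens, beam_tokens)
```

With a beam of 1 the search is greedy and commits to "a" early; with a beam of 2 it keeps "b" alive and finds the more probable sequence.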

Google's GNMT paper proposed two refinements to the beam search algorithm: a coverage penalty and length normalization. The coverage penalty formula they proposed is so empirical that I use only length normalization. However, I found that this method did not perform very well: I got 9.23 BLEU, which is lower than the 10.44 reached by the plain beam search algorithm.
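For reference, the length normalization in the GNMT paper divides a hypothesis's log-probability by a length penalty lp(Y) = ((5 + |Y|) / 6)^alpha. The sketch below shows that scoring rule in isolation; the value alpha = 0.6 and the example scores are assumptions for illustration, since the repo does not say which alpha was used.

```python
def length_penalty(length, alpha=0.6):
    """GNMT length penalty: ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha=0.6):
    """Log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# Hypothetical hypotheses: (total log-probability, length in tokens).
short_hyp = (-4.0, 3)
long_hyp = (-4.5, 8)

raw_prefers_short = short_hyp[0] > long_hyp[0]
norm_prefers_long = normalized_score(*long_hyp) > normalized_score(*short_hyp)
print(raw_prefers_short, norm_prefers_long)
```

Setting alpha to 0 makes the penalty equal to 1 and recovers plain beam search scoring, so the two variants compared above differ only in this rescoring step.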


Alignment visualizations:




