English-Vietnamese Machine Translation using Transformer

Data:

This project uses two datasets:

IWSLT'15 English-Vietnamese from the Stanford NLP group. You can download it here. This dataset contains 133317 sentence pairs in the training set and 1268 sentence pairs in the test set.
VLSP 2020 dataset, which contains English-Vietnamese sentence pairs in the NEWS domain. You can download it here. The training data consists of parallel corpora in UTF-8 plain text, 1-to-1 sentence aligned, one sentence per line. They fall into two groups: an in-domain NEWS dataset of 20k samples (80% in the training set, 10% in the dev set and 10% in the test set), and out-of-domain parallel datasets of roughly 4M samples in total, namely openSub (3.5M), ted-like (55k), evbcorpus (45k), wiki-alt (20k), and basic (8.8k). This project uses only the evbcorpus, in-domain NEWS and wiki-alt datasets.

Data from the VLSP 2020 dataset is appended to the IWSLT training data to create a larger training set. In the testing phase, the BLEU score on the IWSLT test set is used as the benchmark.
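The merge described above can be sketched as follows. This is a minimal sketch, not the project's code: the helper names `read_parallel` and `merge_corpora` are hypothetical, and it assumes each corpus is stored as two line-aligned UTF-8 files (one English, one Vietnamese).

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(src_path, tgt_path):
    """Load a 1-to-1 aligned corpus as a list of (source, target) pairs.

    Hypothetical helper: assumes one sentence per line in each file.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty; filtering after pairing
        # keeps the remaining sentences aligned.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def merge_corpora(*corpora):
    """Append several lists of sentence pairs into one training set."""
    merged = []
    for corpus in corpora:
        merged.extend(corpus)
    return merged
```

With placeholder paths, a call such as `merge_corpora(read_parallel("iwslt_train.en", "iwslt_train.vi"), read_parallel("vlsp_news.en", "vlsp_news.vi"))` would return the combined pairs. The IWSLT test set stays separate so the benchmark is not affected by the added data.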

Model:

In this project we build a Transformer-based model. The original implementation of the baseline model can be found here. The baseline has 6 layers in both the encoder and the decoder. We also train and test a modified model with 8 layers.

Training and Testing:

Tokenize with spaCy: model 'en' for English and spacy_vi_model (from https://github.com/trungtv/vivi_spacy) for Vietnamese
Vectorize using torchtext
Positional Embedding: Concatenate the positional embedding matrix to the input
Label Smoothing: Move part of the probability mass of the correct class in the one-hot target vector to the remaining classes to reduce overfitting
Train each version for 30 epochs
Test using beam search and NLTK WordNet
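
Two of the steps above, label smoothing and beam search, can be illustrated in plain Python. This is a sketch under stated assumptions, not the project's implementation. The smoothing value `epsilon=0.1` is an assumed default, and spreading the removed mass uniformly over the other classes is one common variant. The beam search applies no length normalization. `toy_step` is a made-up stand-in for a trained decoder.

```python
import math


def smooth_one_hot(target_index, num_classes, epsilon=0.1):
    """Return a smoothed target distribution for a single token.

    The correct class keeps 1 - epsilon; the removed mass epsilon is
    shared equally by the remaining num_classes - 1 classes.
    """
    off_value = epsilon / (num_classes - 1)
    dist = [off_value] * num_classes
    dist[target_index] = 1.0 - epsilon
    return dist


def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict mapping each candidate next token
    to its log-probability given the prefix (hypothetical interface).
    """
    beams = [([start_token], 0.0)]  # unfinished (tokens, log-prob) pairs
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        if not candidates:
            break
        # Keep only the beam_size best expansions across all beams.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Hypotheses that never produced end_token still compete at the end.
    finished.extend(beams)
    return max(finished, key=lambda item: item[1])[0]


def toy_step(prefix):
    """Made-up next-token model used only to exercise beam_search."""
    table = {
        "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
        "a": {"</s>": math.log(0.55), "a": math.log(0.45)},
        "b": {"</s>": math.log(0.9), "a": math.log(0.1)},
    }
    return table[prefix[-1]]
```

In the toy model, a beam of size 1 (greedy decoding) commits to the locally best first token `a`, while a beam of size 2 also keeps `b` alive and finds the sequence with the higher total probability. That is the reason for decoding with a beam rather than greedily.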

Result:

Model	Training set	BLEU Score (%)
Baseline, 6 layers	IWSLT	25.16
Modified, 8 layers	IWSLT	25.95
Modified, 8 layers	IWSLT + VLSP 2020	27.69
Tensor2tensor	IWSLT	29.44
Google API		31.69


vuongdanghuy/nmt_transformer
