Apply state-of-the-art machine translation techniques to Chinese-English translation.
- Download the [MultiUN dataset](http://opus.nlpl.eu/download.php?f=MultiUN/en-zh.txt.zip), for example with `wget`.
- Unzip the file and put 'MultiUN.en-zh.zh' and 'MultiUN.en-zh.en' in the 'corpra' directory.
- Install the Chinese tokenization library jieba by running 'pip install jieba'.
- Run token_zh.py to tokenize the Chinese side of the MultiUN dataset (MultiUN.en-zh.zh).
- Rename the tokenized output to 'MultiUN.en-zh.zh'.
- Run split_data.py to split the dataset into a training set and a test set. Hyperparameters are defined in hyperparams.py.
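
The tokenization and split steps above can be sketched roughly as follows. This is only an illustration, not the actual contents of token_zh.py or split_data.py: the function names `tokenize_zh_line` and `split_parallel`, and the `test_ratio` and `seed` parameters, are assumptions made for this example. The real scripts may read their settings from hyperparams.py instead.

```python
import random


def tokenize_zh_line(line):
    """Segment one Chinese sentence into space-separated tokens (hypothetical helper)."""
    # Imported lazily so split_parallel below can be used without jieba installed.
    import jieba

    return " ".join(jieba.cut(line.strip()))


def split_parallel(src_lines, tgt_lines, test_ratio=0.1, seed=0):
    """Split aligned sentence pairs into (train, test) lists (hypothetical helper).

    test_ratio and seed are illustrative defaults, not values from hyperparams.py.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")

    # Shuffle the pairs together so each English line stays aligned with its Chinese line.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)

    n_test = int(len(pairs) * test_ratio)
    return pairs[n_test:], pairs[:n_test]
```

Shuffling the zipped pairs, rather than each file separately, keeps the two sides of the corpus aligned, and the fixed seed makes the split reproducible.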
TODO List:
- Understand the padding logic of the Transformer network.