<a href="https://colab.research.google.com/github/vietai/dab/blob/master/colab/Interactive_Back_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation by Backtranslation

Authors: [Trieu H. Trinh](https://thtrieu.github.io/), Thang Le, Phat Hoang, [Thang Luong](http://thangluong.com)

**MIT License**

Copyright (c) [2019] [Trieu H. Trinh](https://thtrieu.github.io/)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Introduction

Back-translation is the process of translating a sentence from language A to language B and then back to A. Due to randomness in the translation process, the output is typically a slight variation of the source sentence that keeps the same meaning. Back-translation is therefore a useful technique for augmenting NLP datasets. To see such an example, check out the [Colab here](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG), and also send love <3 and attention to our [GitHub repository](https://github.com/vietai/back_translate) for this project.
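
To make the definition concrete, here is a minimal sketch of the round trip A -> B -> A. The two dictionaries and the `translate` and `back_translate` helpers are hypothetical toys, not the pretrained models used in this Colab. In this toy, the variation comes from two English words sharing one French translation rather than from randomness in decoding.

```python
# Toy word-level "translators" (hypothetical, for illustration only).
EN_TO_FR = {"hello": "bonjour", "hi": "bonjour", "friend": "ami"}
FR_TO_EN = {"bonjour": "hello", "ami": "friend"}


def translate(sentence, table):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())


def back_translate(sentence, forward, backward):
    """Translate A -> B with `forward`, then B -> A with `backward`."""
    intermediate = translate(sentence, forward)
    return translate(intermediate, backward)


paraphrase = back_translate("hi friend", EN_TO_FR, FR_TO_EN)
print(paraphrase)  # hello friend
```

The output differs from the input ("hi" became "hello") while the meaning is preserved, which is the property that makes back-translated sentences usable as extra training examples.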

In this Colab, we give a minimal demonstration of back-translation using our pretrained translation models. The process is simple: first we point to our pretrained models on Google Cloud Storage, then we use them to back-translate interactively. Although we provide only English-Vietnamese and English-French pairs, the code works with any other pair as long as the checkpoints are obtained by training `transformer` on translation problems using `tensor2tensor`.

## Step 1. Specify path to pretrained translation models

You only need to run this step once.

For English - French - English, please use the following settings:

```
model=transformer
hparams_set=transformer_big
from_problem=translate_enfr_wmt32k
to_problem=translate_enfr_wmt32k_rev

from_ckpt=checkpoints/translate_enfr_fren_uda/enfr/model.ckpt-500000
to_ckpt=checkpoints/translate_enfr_fren_uda/fren/model.ckpt-500000

from_data_dir=checkpoints/translate_enfr_fren_uda/
to_data_dir=checkpoints/translate_enfr_fren_uda/
```

For English - Vietnamese - English, please use the following settings:


```
model=transformer
hparams_set=transformer_tiny
from_problem=translate_envi_iwslt32k
to_problem=translate_vien_iwslt32k

from_ckpt=checkpoints/translate_envi_iwslt32k_tiny/avg/
to_ckpt=checkpoints/translate_vien_iwslt32k_tiny/avg/

from_data_dir=data/translate_envi_iwslt32k/
to_data_dir=data/translate_vien_iwslt32k/
```
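
The checkpoint and data paths above are relative to a Google Cloud Storage bucket. As a rough sketch, they could be expanded into full `gs://` URIs as follows. The `settings` dictionary and the `to_gcs_uri` helper are illustrative names, and the bucket name is the one set in the code cell of this step.

```python
import posixpath

# Bucket name taken from the code cell of this step.
bucket = "vien-translation"

# English - Vietnamese - English settings, copied from the block above.
settings = {
    "from_ckpt": "checkpoints/translate_envi_iwslt32k_tiny/avg/",
    "to_ckpt": "checkpoints/translate_vien_iwslt32k_tiny/avg/",
    "from_data_dir": "data/translate_envi_iwslt32k/",
    "to_data_dir": "data/translate_vien_iwslt32k/",
}


def to_gcs_uri(bucket_name, relative_path):
    """Join a bucket name and a relative path into a gs:// URI (hypothetical helper)."""
    # posixpath always joins with "/", whatever the local operating system is.
    return posixpath.join("gs://" + bucket_name, relative_path)


full_paths = {name: to_gcs_uri(bucket, path) for name, path in settings.items()}
for name, uri in full_paths.items():
    print(name, "=", uri)
```

Keeping the relative paths in one dictionary makes it easy to switch language pairs by replacing a single object instead of editing several variables.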

In [0]:
!pip install -q -U tensor2tensor

import os
from tensor2tensor.bin import t2t_decoder
from tensor2tensor.models import transformer
import tensorflow as tf


%cd /content/
src = '/content/dab'
if not os.path.exists(src):
    !git clone https://github.com/vietai/dab.git
else:
    %cd $src
    !git pull

%cd /
!ls $src

# Model, problem, and checkpoint settings (editable as Colab form fields)
model_name = "transformer"  # @param {type:"string"}
hparams_set = "transformer_tiny"  # @param {type: "string"}
from_problem = "translate_envi_iwslt32k"  # @param {type: "string"}
to_problem = "translate_vien_iwslt32k"  # @param {type: "string"}
google_cloud_bucket = 'vien-translation'  # @param {type: "string"}
from_ckpt = 'checkpoints/translate_envi_iwslt32k_tiny/avg/'  # @param {type:"string"}
to_ckpt = 'checkpoints/translate_vien_iwslt32k_tiny/avg/'  # @param {type:"string"}

from_data_dir = 'data/translate_envi_iwslt32k/'  # @param {type:"string"}
to_data_dir = 'data/translate_vien_iwslt32k/'  # @param {type:"string"}

bucket_path = 'gs://' + google_cloud_bucket
from_ckpt = os.path.join(bucket_path, from_ckpt)
to_ckpt = os.path.join(bucket_path, to_ckpt)
from_data_dir = os.path.join(bucket_path, from_data_dir)
to_data_dir = os.path.join(bucket_path, to_data_dir)

# If a checkpoint path is a directory, resolve it to the latest checkpoint it contains
# (tf.gfile is the TensorFlow 1.x file API)
if tf.gfile.IsDirectory(from_ckpt):
  from_ckpt = tf.train.latest_checkpoint(from_ckpt)
if tf.gfile.IsDirectory(to_ckpt):
  to_ckpt = tf.train.latest_checkpoint(to_ckpt)


## Step 2. Run back-translation!

### a. Back-translating an English sentence

In [0]:
beam_size = 2 #@param {type: "integer"}
alpha = 0.6  #@param {type: "number"}

decode_hparams = "beam_size={},alpha={}".format(beam_size, alpha)

# Example session: inputs follow ">>>", outputs follow "Paraphrased:"
# >>> Hi there.
# Paraphrased: Hello .
# >>> How are you doing today?
# Paraphrased: How do you do today ?
# >>> Thank you so much.
# Paraphrased: Thank you very much .
# >>> I used to dream of becoming a soccer player
# Paraphrased: I 've been dreaming to become a football player .
# >>> It is definitely our duty to push the boundary of scientific research.
# Paraphrased: It 's certainly our mission to push the boundaries of science .

!python $src/back_translate.py \
--decode_hparams=$decode_hparams \
--model=$model_name \
--hparams_set=$hparams_set \
--from_problem=$from_problem \
--to_problem=$to_problem \
--from_ckpt=$from_ckpt \
--to_ckpt=$to_ckpt \
--from_data_dir=$from_data_dir \
--to_data_dir=$to_data_dir \
--backtranslate_interactively
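
The two decoding parameters above control beam search: `beam_size` is the number of candidate translations kept at each step, and `alpha` sets the strength of length normalization. The sketch below assumes the GNMT-style length penalty `((5 + length) / 6) ** alpha`, which to our understanding is what Tensor2Tensor's beam search applies. Treat the formula and the helper names `length_penalty` and `normalized_score` as assumptions, not as something stated in this notebook.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form): ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha):
    """Beam score: total log-probability divided by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# Same formatting as in the cell above.
beam_size, alpha = 2, 0.6
decode_hparams = "beam_size={},alpha={}".format(beam_size, alpha)
print(decode_hparams)  # beam_size=2,alpha=0.6

# Two made-up hypotheses: (total log-probability, length in tokens).
short_hyp = (-3.0, 3)
long_hyp = (-4.0, 9)

for a in (0.0, 0.6):
    short_score = normalized_score(short_hyp[0], short_hyp[1], a)
    long_score = normalized_score(long_hyp[0], long_hyp[1], a)
    winner = "short" if short_score > long_score else "long"
    print("alpha={}: preferred hypothesis = {}".format(a, winner))
```

With `alpha=0.0` the raw log-probability decides and the shorter hypothesis wins. With `alpha=0.6` the longer hypothesis is preferred, which illustrates why a larger `alpha` tends to produce longer outputs.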


### b. Back-translating sentences in the intermediate language

In [0]:
beam_size = 2 #@param {type: "integer"}
alpha = 0.6  #@param {type: "number"}

decode_hparams = "beam_size={},alpha={}".format(beam_size, alpha)

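# Swap the two directions so the intermediate language becomes the source.
# Note: re-running this cell swaps them back again.
from_problem, to_problem = to_problem, from_problem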
from_ckpt, to_ckpt = to_ckpt, from_ckpt
from_data_dir, to_data_dir = to_data_dir, from_data_dir

# Example input: Tôi từng ước mơ trở thành cầu thủ bóng đá
# (English: "I used to dream of becoming a soccer player")
!python $src/back_translate.py \
--decode_hparams=$decode_hparams \
--model=$model_name \
--hparams_set=$hparams_set \
--from_problem=$from_problem \
--to_problem=$to_problem \
--from_ckpt=$from_ckpt \
--to_ckpt=$to_ckpt \
--from_data_dir=$from_data_dir \
--to_data_dir=$to_data_dir \
--backtranslate_interactively
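
Because the cell above swaps the `from_*` and `to_*` variables in place, running it a second time restores the original direction. One way to avoid that surprise, sketched here with a hypothetical `reverse_direction` helper and a plain dictionary of settings, is to build the reversed configuration as a new object and leave the original untouched.

```python
def reverse_direction(settings):
    """Return a copy of `settings` with every from_*/to_* pair exchanged."""
    reversed_settings = dict(settings)
    for name in ("problem", "ckpt", "data_dir"):
        reversed_settings["from_" + name] = settings["to_" + name]
        reversed_settings["to_" + name] = settings["from_" + name]
    return reversed_settings


# English -> Vietnamese -> English, using the relative paths from Step 1.
en_vi_en = {
    "from_problem": "translate_envi_iwslt32k",
    "to_problem": "translate_vien_iwslt32k",
    "from_ckpt": "checkpoints/translate_envi_iwslt32k_tiny/avg/",
    "to_ckpt": "checkpoints/translate_vien_iwslt32k_tiny/avg/",
    "from_data_dir": "data/translate_envi_iwslt32k/",
    "to_data_dir": "data/translate_vien_iwslt32k/",
}

# Vietnamese -> English -> Vietnamese; `en_vi_en` itself is not modified.
vi_en_vi = reverse_direction(en_vi_en)
print(vi_en_vi["from_problem"])  # translate_vien_iwslt32k
```

Calling `reverse_direction` any number of times on `en_vi_en` always yields the same reversed configuration, so the result does not depend on how often a cell has been executed.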


## Acknowledgements

This work is made possible by [VietAI](http://vietai.org/).

## References

1. Improving Neural Machine Translation Models with Monolingual Data - Sennrich et al., 2016 ([arxiv](https://arxiv.org/abs/1511.06709))
2. Understanding Back-Translation at Scale - Edunov et al., 2018 ([arxiv](https://arxiv.org/abs/1808.09381))
3. T2T translate vi<->en tiny tpu - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1Bx5HfxbmXnMK7kBLHlmGyhVhQVVrDI0p))
4. Sentiment Analysis + Back translation - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG#scrollTo=7yvhttVKTkZu))
5. Tensor2Tensor Intro - Tensor2Tensor Team ([colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb))
