<a href="https://colab.research.google.com/github/ntkchinh/dab/blob/master/Interactive_Back_Translation_with_Style.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation by Backtranslation

Authors: [Trieu H. Trinh](https://thtrieu.github.io/), Thang Le, Phat Hoang, [Thang Luong](http://thangluong.com)

**MIT License**

Copyright (c) [2019] [Trieu H. Trinh](https://thtrieu.github.io/)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Introduction

Back-translation is the process of translating a sentence from language A to language B and then back to A. Because of randomness in the translation process, the output is usually a slight variation of the source sentence that preserves its meaning. This makes back-translation a useful technique for augmenting NLP datasets. To see an example, check out the [Colab here](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG), and send some love <3 and attention to our [GitHub repository](https://github.com/vietai/back_translate) for this project.
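
As a toy illustration of the round trip described above, the sketch below replaces the neural translators with tiny lookup tables. The names `translate`, `back_translate`, `EN_TO_VI`, and `VI_TO_EN` are made up for this example and are not part of the repository; real models sample or beam-search their output, which is where the variation comes from.

```python
# Toy stand-ins for translation models (hypothetical, for illustration only).
EN_TO_VI = {"thank you so much": "cảm ơn bạn rất nhiều"}
VI_TO_EN = {"cảm ơn bạn rất nhiều": "thank you very much"}


def translate(sentence, table):
    """Look the normalized sentence up in a table; return the input if unknown."""
    return table.get(sentence.lower().strip(" ."), sentence)


def back_translate(sentence, forward=EN_TO_VI, backward=VI_TO_EN):
    """Translate A -> B, then B -> A, and return the round-trip result."""
    intermediate = translate(sentence, forward)
    return translate(intermediate, backward)


paraphrase = back_translate("Thank you so much.")
print(paraphrase)
```

The round trip returns a sentence with the same meaning but different wording, which is the property that makes the output usable as an extra training example.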

In this Colab, we give a minimal demonstration of back-translation using our pretrained translation models. The process is simple: first we point to our pretrained models on Google Cloud Storage, then we use them to back-translate interactively. Although we provide only English-Vietnamese and English-French pairs, the code works with any other pair as long as the checkpoints are obtained by training `transformer` on translation problems using `tensor2tensor`.

## Step 1. Specify path to pretrained translation models

You only need to run this step once.

For English - French - English, please use the following settings:

```
model=transformer
hparams_set=transformer_big
from_problem=translate_enfr_wmt32k
to_problem=translate_enfr_wmt32k_rev

from_ckpt=checkpoints/translate_enfr_fren_uda/enfr/model.ckpt-500000
to_ckpt=checkpoints/translate_enfr_fren_uda/fren/model.ckpt-500000

from_data_dir=checkpoints/translate_enfr_fren_uda/
to_data_dir=checkpoints/translate_enfr_fren_uda/
```

For English - Vietnamese - English, please use the following settings:


```
model=transformer
hparams_set=transformer_tiny
from_problem=translate_envi_iwslt32k
to_problem=translate_vien_iwslt32k

from_ckpt=checkpoints/translate_envi_iwslt32k_tiny/avg/
to_ckpt=checkpoints/translate_vien_iwslt32k_tiny/avg/

from_data_dir=data/translate_envi_iwslt32k/
to_data_dir=data/translate_vien_iwslt32k/
```
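
If you switch between the two language pairs often, one option is to keep both groups of settings in a dictionary and pick one by name. This is only a sketch: `PRESETS` and `select_preset` are hypothetical names that do not exist in the repository, and the values are copied from the two blocks above.

```python
# Settings copied from the two blocks above.
# PRESETS and select_preset are hypothetical helpers, not part of the repository.
PRESETS = {
    "enfr": {
        "model": "transformer",
        "hparams_set": "transformer_big",
        "from_problem": "translate_enfr_wmt32k",
        "to_problem": "translate_enfr_wmt32k_rev",
        "from_ckpt": "checkpoints/translate_enfr_fren_uda/enfr/model.ckpt-500000",
        "to_ckpt": "checkpoints/translate_enfr_fren_uda/fren/model.ckpt-500000",
        "from_data_dir": "checkpoints/translate_enfr_fren_uda/",
        "to_data_dir": "checkpoints/translate_enfr_fren_uda/",
    },
    "envi": {
        "model": "transformer",
        "hparams_set": "transformer_tiny",
        "from_problem": "translate_envi_iwslt32k",
        "to_problem": "translate_vien_iwslt32k",
        "from_ckpt": "checkpoints/translate_envi_iwslt32k_tiny/avg/",
        "to_ckpt": "checkpoints/translate_vien_iwslt32k_tiny/avg/",
        "from_data_dir": "data/translate_envi_iwslt32k/",
        "to_data_dir": "data/translate_vien_iwslt32k/",
    },
}


def select_preset(pair):
    """Return a copy of the settings for a language pair ('enfr' or 'envi')."""
    if pair not in PRESETS:
        raise KeyError("Unknown language pair: {}".format(pair))
    return dict(PRESETS[pair])


settings = select_preset("envi")
print(settings["hparams_set"], settings["from_problem"])
```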

In [None]:
%tensorflow_version 1.x
!pip install -q -U tensor2tensor
!pip install tensorflow-datasets==3.2.1

import os
from tensor2tensor.bin import t2t_decoder
from tensor2tensor.models import transformer
import tensorflow as tf


In [53]:
%cd /content/
src = '/content/dab'
if not os.path.exists(src):
    !git clone https://github.com/ntkchinh/dab.git
else:
    %cd $src
    !git pull

%cd /
!ls $src

/content
/content/dab
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/ntkchinh/dab
   05063ab..df991bd  master     -> origin/master
Updating 05063ab..df991bd
Fast-forward
 decoding.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
/
back_translate.py  gif		__pycache__	t2t_decoder.py
colab		   LICENSE	README.md	t2t_trainer.py
decoding.py	   problems.py	t2t_datagen.py


In [24]:
# Create hparams and the model
model_name = "transformer"  # @param {type:"string"}
hparams_set = "transformer_base"  # @param {type: "string"}
from_problem = "translate_class11_appendtag_envi_iwslt32k"  # @param {type: "string"}
to_problem = "translate_class11_appendtag_vien_iwslt32k"  # @param {type: "string"}
google_cloud_bucket = 'best_vi_translation'  # @param {type: "string"}
from_ckpt = 'checkpoints/translate_class11_appendtag_envi_base_1000k/SAVE/'  # @param {type:"string"}
to_ckpt = 'checkpoints/translate_class11_appendtag_vien_base_1000k/SAVE/'  # @param {type:"string"}

from_data_dir = 'data/translate_class11_appendtag_envi_iwslt32k/'  # @param {type:"string"}
to_data_dir = 'data/translate_class11_appendtag_vien_iwslt32k/'  # @param {type:"string"}

bucket_path = 'gs://' + google_cloud_bucket
from_ckpt_dir = os.path.join(bucket_path, from_ckpt)
to_ckpt_dir = os.path.join(bucket_path, to_ckpt)
from_data_dir = os.path.join(bucket_path, from_data_dir)
to_data_dir = os.path.join(bucket_path, to_data_dir)

# Convert directory into checkpoints
if tf.gfile.IsDirectory(from_ckpt_dir):
  print('yes')
  # from_ckpt = tf.train.latest_checkpoint(from_ckpt_dir)
  from_ckpt = os.path.join(from_ckpt_dir, 'model.ckpt-1000000')  # <- this is not a "dir"
if tf.gfile.IsDirectory(to_ckpt_dir):
  print('yes')
  # to_ckpt = tf.train.latest_checkpoint(to_ckpt_dir)
  to_ckpt = os.path.join(to_ckpt_dir, 'model.ckpt-1000000')
print(from_ckpt_dir)
print(to_ckpt_dir)

print(from_ckpt)
print(to_ckpt)



yes
yes
gs://best_vi_translation/checkpoints/translate_class11_appendtag_envi_base_1000k/SAVE/
gs://best_vi_translation/checkpoints/translate_class11_appendtag_vien_base_1000k/SAVE/
gs://best_vi_translation/checkpoints/translate_class11_appendtag_envi_base_1000k/SAVE/model.ckpt-1000000
gs://best_vi_translation/checkpoints/translate_class11_appendtag_vien_base_1000k/SAVE/model.ckpt-1000000
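
The path handling in the cell above can be sketched without TensorFlow as follows. This is a simplified illustration rather than the notebook's code: `gcs_path` and `resolve_checkpoint` are hypothetical helpers, and a trailing slash is used as a stand-in for the `tf.gfile.IsDirectory` check, which needs access to the bucket.

```python
import posixpath


def gcs_path(bucket, relative_path):
    """Join a bucket name and a relative path into a gs:// URL."""
    return posixpath.join("gs://" + bucket, relative_path)


def resolve_checkpoint(ckpt_path, step):
    """Turn a checkpoint directory into a checkpoint prefix for a given step.

    Assumption: a trailing slash marks a directory. The notebook itself asks
    the file system via tf.gfile.IsDirectory instead.
    """
    if ckpt_path.endswith("/"):
        return posixpath.join(ckpt_path, "model.ckpt-{}".format(step))
    return ckpt_path  # Already a checkpoint prefix, not a directory.


ckpt_dir = gcs_path(
    "best_vi_translation",
    "checkpoints/translate_class11_appendtag_envi_base_1000k/SAVE/",
)
print(resolve_checkpoint(ckpt_dir, 1000000))
```

Note that the result is a checkpoint prefix rather than a file or directory: TensorFlow appends suffixes such as `.index` to it when reading the checkpoint.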


In [46]:

import json
import pprint

def setup_tpu():
  from google.colab import auth
  auth.authenticate_user()

  # Mount the bucket in Colab so that Python's os module can access it.
  # First, install gcsfuse so that Google Cloud Storage can be mounted in Colab.
  print('\nInstalling gcsfuse')
  !echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
  !curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
  !apt -qq update
  !apt -qq install gcsfuse

  bucket = google_cloud_bucket
  print('Mounting bucket {} to local.'.format(bucket))
  mount_point = '/content/{}'.format(bucket)
  if not os.path.exists(mount_point):
    tf.gfile.MakeDirs(mount_point)
  
  !fusermount -u $mount_point
  !gcsfuse --implicit-dirs $bucket $mount_point
  print('\nMount point content:')
  !ls $mount_point

  # Next, connect to the TPU, if this runtime has one.
  tpu_address = ''
  if 'COLAB_TPU_ADDR' in os.environ:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print('TPU address is', tpu_address)
    with tf.Session(tpu_address) as session:
      devices = session.list_devices()
      # Upload credentials to TPU.
      with open('/content/adc.json', 'r') as f:
        auth_info = json.load(f)
      tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

    print('TPU devices:')
    pprint.pprint(devices)

  return mount_point, tpu_address

mount_point, tpu_address = setup_tpu()
  
print('\nMount point: {}'.format(mount_point))
print('TPU address: {}'.format(tpu_address))


Installing gcsfuse
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1974  100  1974    0     0  98700      0 --:--:-- --:--:-- --:--:-- 98700
OK
48 packages can be upgraded. Run 'apt list --upgradable' to see them.
gcsfuse is already the newest version (0.33.2).
0 upgraded, 0 newly installed, 0 to remove and 48 not upgraded.
Mounting bucket best_vi_translation to local.
Using mount point: /content/best_vi_translation
2021/03/15 06:21:26.110260 Opening GCS connection...
2021/03/15 06:21:26.604716 Mounting file system...
2021/03/15 06:21:26.605058 File system has been successfully mounted.

Mount point content:
checkpoints  data  raw

Mount point: /content/best_vi_translation
TPU address: 
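
The empty `TPU address:` line above means the runtime had no TPU attached, so the `COLAB_TPU_ADDR` branch was skipped. The lookup can be isolated in a small function that is easy to check on its own; `tpu_address_from_env` is a hypothetical name used only for this sketch.

```python
import os


def tpu_address_from_env(environ=None):
    """Return the gRPC address of the Colab TPU, or '' if none is attached."""
    environ = os.environ if environ is None else environ
    address = environ.get("COLAB_TPU_ADDR")
    return "grpc://" + address if address else ""


# With no TPU attached the variable is missing and the result is empty.
print(repr(tpu_address_from_env({})))
# The address below is a made-up example value.
print(tpu_address_from_env({"COLAB_TPU_ADDR": "10.0.0.2:8470"}))
```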


## Step 2. Run back-translation!

### a. Back-translating an English sentence

In [None]:
beam_size = 2 #@param {type: "integer"}
alpha = 0.6  #@param {type: "number"}

decode_hparams = "beam_size={},alpha={}".format(beam_size, alpha)

# >>> Hi there., then quietly left as the members of the press swarmed around her .
# Paraphrased: Hello .
# >>> How are you doing today?
# Paraphrased: How do you do today ?
# >>> Thank you so much.
# Paraphrased: Thank you very much .
# >>> I used to dream of becoming a soccer player
# Paraphrased: I 've been dreaming to become a football player .
# >>> It is definitely our duty to push the boundary of scientific research.
# Paraphrased: It 's certainly our mission to push the boundaries of science .

!python $src/back_translate.py \
--decode_hparams=$decode_hparams \
--model=$model_name \
--hparams_set=$hparams_set \
--from_problem=$from_problem \
--to_problem=$to_problem \
--output_dir=$from_ckpt_dir \
--from_ckpt=$from_ckpt \
--to_ckpt=$to_ckpt \
--from_data_dir=$from_data_dir \
--to_data_dir=$to_data_dir \
--backtranslate_interactively










Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:Configuring DataParallelism to replicate the model.
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7eff206ebf10>, '_master': '', '_num_ps_repli
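
To build some intuition for the `alpha` value in `decode_hparams`, the sketch below shows how a length penalty changes which beam-search hypothesis wins. It assumes the GNMT-style penalty `((5 + length) / 6) ** alpha`; that formula is an assumption about the decoder's internals, not something this notebook states, and the scores are invented for illustration.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form, see the note above)."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha):
    """Divide a hypothesis log-probability by its length penalty."""
    return log_prob / length_penalty(length, alpha)


# Two made-up hypotheses: (total log-probability, length in tokens).
short_hyp = (-4.0, 4)
long_hyp = (-5.0, 10)

for alpha in (0.0, 0.6):
    short_score = normalized_score(short_hyp[0], short_hyp[1], alpha)
    long_score = normalized_score(long_hyp[0], long_hyp[1], alpha)
    winner = "short" if short_score > long_score else "long"
    print("alpha={}: {} hypothesis wins".format(alpha, winner))
```

With `alpha = 0` the penalty is 1 and the shorter hypothesis wins on raw log-probability; with `alpha = 0.6` the longer hypothesis is penalized less per token and comes out ahead. Larger values of `alpha` therefore favor longer outputs.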

### b. Back-translating sentences in the intermediate language

In [None]:
beam_size = 2 #@param {type: "integer"}
alpha = 0.6  #@param {type: "number"}

decode_hparams = "beam_size={},alpha={}".format(beam_size, alpha)

from_problem, to_problem = to_problem, from_problem
from_ckpt, to_ckpt = to_ckpt, from_ckpt
from_data_dir, to_data_dir = to_data_dir, from_data_dir

# Tôi từng ước mơ trở thành cầu thủ bóng đá
!python $src/back_translate.py \
--decode_hparams=$decode_hparams \
--model=$model_name \
--hparams_set=$hparams_set \
--from_problem=$from_problem \
--to_problem=$to_problem \
--from_ckpt=$from_ckpt \
--to_ckpt=$to_ckpt \
--from_data_dir=$from_data_dir \
--to_data_dir=$to_data_dir \
--backtranslate_interactively
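
The only difference from section a is that every `from_*` setting is exchanged with its `to_*` counterpart. The sketch below expresses that swap as a function and assembles the same flags as the cell above. `swap_direction` and `build_command` are hypothetical helpers, and the checkpoint and data paths are placeholders, not real locations.

```python
import shlex


def swap_direction(settings):
    """Return a copy of the settings with from_* and to_* entries exchanged."""
    swapped = dict(settings)
    for key in ("problem", "ckpt", "data_dir"):
        swapped["from_" + key] = settings["to_" + key]
        swapped["to_" + key] = settings["from_" + key]
    return swapped


def build_command(script, settings, decode_hparams):
    """Assemble the back_translate.py command line used in the cell above."""
    names = ["model", "hparams_set", "from_problem", "to_problem",
             "from_ckpt", "to_ckpt", "from_data_dir", "to_data_dir"]
    args = ["python", script, "--decode_hparams=" + decode_hparams]
    args += ["--{}={}".format(name, settings[name]) for name in names]
    args.append("--backtranslate_interactively")
    return args


# Placeholder paths (hypothetical); substitute the variables defined in Step 1.
settings = {
    "model": "transformer",
    "hparams_set": "transformer_base",
    "from_problem": "translate_class11_appendtag_envi_iwslt32k",
    "to_problem": "translate_class11_appendtag_vien_iwslt32k",
    "from_ckpt": "path/to/envi/model.ckpt",
    "to_ckpt": "path/to/vien/model.ckpt",
    "from_data_dir": "path/to/envi/data",
    "to_data_dir": "path/to/vien/data",
}
reverse = swap_direction(settings)
command = build_command("back_translate.py", reverse, "beam_size=2,alpha=0.6")
print(" ".join(shlex.quote(arg) for arg in command))
```

Because `swap_direction` returns a copy, running it twice restores the original direction. The in-place tuple swap in the cell above behaves differently: re-running that cell flips the direction each time.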


## Acknowledgements

This work is made possible by [VietAI](http://vietai.org/).

## References

1. Improving Neural Machine Translation Models with Monolingual Data - Sennrich et al., 2016 ([arxiv](https://arxiv.org/abs/1511.06709))
2. Understanding Back-Translation at Scale - Edunov et al., 2018 ([arxiv](https://arxiv.org/abs/1808.09381))
3. T2T translate vi<->en tiny tpu - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1Bx5HfxbmXnMK7kBLHlmGyhVhQVVrDI0p))
4. Sentiment Analysis + Back translation - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG#scrollTo=7yvhttVKTkZu))
5. Tensor2Tensor Intro - Tensor2Tensor Team ([colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb))
