<a href="https://colab.research.google.com/github/vietai/back_translate/blob/master/colabs/Interactive_Back_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation by Backtranslation

Authors: [Trieu H. Trinh](https://thtrieu.github.io/), Thang Le, Phat Hoang, [Thang Luong](http://thangluong.com)

**MIT License**

Copyright (c) 2019 [Trieu H. Trinh](https://thtrieu.github.io/)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Introduction

Back-translation is the process of translating a sentence from language A to language B and then back to A. Due to randomness in the translation process, the output of this back-translation is usually a slight variation of the source sentence with the same meaning. Back-translation is therefore a very useful technique for augmenting NLP datasets. To see such an example, check out the [Colab here](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG), and also send love <3 and attention to our [GitHub repository](https://github.com/vietai/back_translate) for this project.

This Colab is a minimal demonstration of back-translation using our pretrained translation models. The process is simple: first we point to our pretrained models (Vietnamese to English and English to Vietnamese) on Google Cloud Storage, then we use them to back-translate interactively.
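
The round trip described above, and its use for augmentation, can be sketched in a few lines. This is only an illustration under stated assumptions, not the notebook's implementation: `back_translate` and `augment` are hypothetical helpers, and the two toy dictionary "translators" merely stand in for the pretrained models loaded in Step 1.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]
Example = Tuple[str, str]  # (text, label)


def back_translate(sentence: str, a_to_b: Translator, b_to_a: Translator) -> str:
    """Translate A -> B -> A and return the round-trip result."""
    return b_to_a(a_to_b(sentence))


def augment(dataset: List[Example], a_to_b: Translator, b_to_a: Translator) -> List[Example]:
    """Return the dataset plus one paraphrase per example, reusing its label."""
    augmented = list(dataset)
    for text, label in dataset:
        paraphrase = back_translate(text, a_to_b, b_to_a)
        if paraphrase != text:  # skip round trips that changed nothing
            augmented.append((paraphrase, label))
    return augmented


# Toy stand-ins for real translation models (assumption, for illustration only).
_EN_TO_VI = {"hello": "xin chào", "hi": "xin chào", "thanks": "cảm ơn"}
_VI_TO_EN = {"xin chào": "hello", "cảm ơn": "thanks"}


def toy_en_to_vi(sentence: str) -> str:
    return _EN_TO_VI[sentence]


def toy_vi_to_en(sentence: str) -> str:
    return _VI_TO_EN[sentence]


data = [("hi", "greeting"), ("thanks", "gratitude")]
augmented = augment(data, toy_en_to_vi, toy_vi_to_en)
print(augmented)
```

Only "hi" yields a new example here, because "thanks" survives the toy round trip unchanged; filtering such cases avoids adding exact duplicates to the dataset.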

## Step 1. Specify path to pretrained translation models

You only need to run this step once.

In [0]:
print('1. Installing t2t.')
!pip install -q -U tensor2tensor
print('Done.')

print('\n2. Installing gcsfuse')
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

import os

print('\n3. Pull/Clone code from github.com/vietai/back_translate')
src = '/content/back_translate'
if not os.path.exists(src):
    !git clone https://github.com/vietai/back_translate.git
    %cd "back_translate"
else:
    %cd "back_translate"
    !git pull
!ls

from google.colab import auth
from tensor2tensor import problems
from tensor2tensor.utils import registry  # needed below for registry.model(...)
from tensor2tensor.utils import trainer_lib
from tensor2tensor.models import transformer
import numpy as np
import tensorflow as tf

print('\n4. Setup tensorflow.')
# Now we run back_translate/problems.py to import Vi2En problem
%run $src/problems.py

# Enable TF Eager execution
tfe = tf.contrib.eager
tfe.enable_eager_execution()

print('\n5. Authenticate Google User.')
# Authenticate user of this colab.
auth.authenticate_user()

# Create hparams and the model
model_name = "transformer"  # @param {type:"string"}
hparams_set = "transformer_tiny"  # @param {type: "string"}
google_cloud_bucket = 'vien-translation'  # @param {type: "string"}
vien_path = 'checkpoints/translate_vien_iwslt32k_tiny/avg'  # @param {type:"string"}
envi_path = 'checkpoints/translate_envi_iwslt32k_tiny/avg'  # @param {type:"string"}


# Now we mount the Google Cloud Storage bucket at a local directory.
bucket = google_cloud_bucket
print('\n6. Mounting bucket {} to local.'.format(bucket))
mount_point = '/content/{}'.format(bucket)

if not os.path.exists(mount_point):
  tf.gfile.MakeDirs(mount_point)

!fusermount -u $mount_point
!gcsfuse --implicit-dirs $bucket $mount_point
!ls $mount_point

envi_dir = os.path.join(mount_point, envi_path)
vien_dir = os.path.join(mount_point, vien_path)

envi_data_dir = os.path.join(mount_point, "data/translate_envi_iwslt32k")
vien_data_dir = os.path.join(mount_point, "data/translate_vien_iwslt32k")

vien_ckpt_path = os.path.join(vien_dir, "model.ckpt-50000")
envi_ckpt_path = os.path.join(envi_dir, "model.ckpt-50000")

Modes = tf.estimator.ModeKeys


vien_problem = problems.problem("translate_vien_iwslt32k")

# Get the encoders from the problem
vien_encoders = vien_problem.feature_encoders(vien_data_dir)


# Setup helper functions for encoding and decoding
def vien_encode(input_str, output_str=None):
  """Input str to features dict, ready for inference"""
  inputs = vien_encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs, "target_space_id": tf.constant(1, dtype=tf.int32)}

def vien_decode(integers):
  """List of ints to str"""
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]
  return vien_encoders["inputs"].decode(np.squeeze(integers))


hparams = trainer_lib.create_hparams(hparams_set, data_dir=vien_data_dir, problem_name="translate_vien_iwslt32k")

# NOTE: Only create the model once when restoring from a checkpoint; it's a
# Layer and so subsequent instantiations will have different variable scopes
# that will not match the checkpoint.
translate_vien_model = registry.model(model_name)(hparams, Modes.EVAL)


# Restore and translate!
def translate_vien(inputs, beam_size=4, alpha=0.6):
  encoded_inputs = vien_encode(inputs)

  with tfe.restore_variables_on_create(vien_ckpt_path):
    translated_outputs = translate_vien_model.infer(encoded_inputs, beam_size=beam_size, alpha=alpha)
        
  return vien_decode(translated_outputs["outputs"]), translated_outputs["cache"]

envi_problem = problems.problem("translate_envi_iwslt32k")

# Get the encoders from the problem
envi_encoders = envi_problem.feature_encoders(envi_data_dir)

envi_hparams = trainer_lib.create_hparams(hparams_set, data_dir=envi_data_dir, problem_name="translate_envi_iwslt32k")
translate_envi_model = registry.model(model_name)(envi_hparams, Modes.EVAL)


# Setup helper functions for encoding and decoding
def envi_encode(input_str, output_str=None):
  """Input str to features dict, ready for inference"""
  inputs = envi_encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs, "target_space_id": tf.constant(1, dtype=tf.int32)}

def envi_decode(integers):
  """List of ints to str"""
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]
  return envi_encoders["inputs"].decode(np.squeeze(integers))



def translate_envi(inputs, beam_size=4, alpha=0.6):
  encoded_inputs = envi_encode(inputs)

  with tfe.restore_variables_on_create(envi_ckpt_path):
    translated_outputs = translate_envi_model.infer(encoded_inputs, beam_size=beam_size, alpha=alpha)

  return envi_decode(translated_outputs["outputs"]), translated_outputs["cache"]

## Step 2. Back-translate a sentence

You can repeat this step as many times as you wish.

### a. Back-translating a Vietnamese sentence

In [6]:
beam_size = 2 #@param {type: "integer"}
alpha = 0.6
# Tôi từng ước mơ trở thành cầu thủ bóng đá
vi_input_sentence = "Tôi từng ước mơ trở thành cầu thủ bóng đá" #@param {type:"raw"}
en_output_sentence, _ = translate_vien(vi_input_sentence, beam_size=beam_size, alpha=alpha)
vi_output_sentence, _ = translate_envi(en_output_sentence, beam_size=beam_size, alpha=alpha)
print("Paraphrased: {}".format(vi_output_sentence))


Paraphrased: Tôi đã mơ ước là một người chơi bóng đá .
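
Since `beam_size` is exposed as a parameter, one way to gather more than one paraphrase per sentence is to repeat the round trip with several beam sizes and keep the distinct results. The sketch below assumes a hypothetical `paraphrase_fn(sentence, beam_size)` callable; `collect_paraphrases` and the canned stand-in are illustrative names, not part of the notebook.

```python
from typing import Callable, Iterable, List


def collect_paraphrases(
    sentence: str,
    paraphrase_fn: Callable[[str, int], str],
    beam_sizes: Iterable[int] = (1, 2, 4),
) -> List[str]:
    """Return distinct paraphrases of `sentence`, in first-seen order."""
    seen = {sentence}  # never return the source sentence itself
    results = []
    for beam_size in beam_sizes:
        candidate = paraphrase_fn(sentence, beam_size)
        if candidate not in seen:
            seen.add(candidate)
            results.append(candidate)
    return results


# Hypothetical stand-in for a real back-translation call (assumption).
def fake_paraphrase(sentence: str, beam_size: int) -> str:
    canned = {1: sentence, 2: "paraphrase A", 4: "paraphrase A"}
    return canned[beam_size]


paraphrases = collect_paraphrases("source sentence", fake_paraphrase)
print(paraphrases)
```

In this notebook, `paraphrase_fn` could wrap the Step 1 helpers, for example `lambda s, b: translate_envi(translate_vien(s, beam_size=b)[0], beam_size=b)[0]`. Whether different beam sizes actually produce different outputs depends on the model and the sentence.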


### b. Back-translating an English sentence

In [45]:
beam_size = 2 #@param {type: "integer"}
alpha = 0.6
en_input_sentence = "It is definitely our duty to push the boundary of scientific research ." #@param {type:"raw"}
vi_output_sentence, _ = translate_envi(en_input_sentence, beam_size=beam_size, alpha=alpha)
en_output_sentence, _ = translate_vien(vi_output_sentence, beam_size=beam_size, alpha=alpha)
print("Paraphrased: {}".format(en_output_sentence.replace('&apos;', '\'')))


Paraphrased: It 's certainly our mission to push the boundaries of science .
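
The output above is still tokenized (a space before `'s` and before the final period), and the cell only replaces the `&apos;` entity by hand. A slightly more general cleanup can be sketched with the standard library. `clean_translation` is a hypothetical helper, and the regular expressions assume English-style tokenization like the sample shown here; they are not a full detokenizer.

```python
import html
import re


def clean_translation(text: str) -> str:
    """Decode HTML entities and undo simple tokenization spacing."""
    text = html.unescape(text)  # e.g. "&apos;" -> "'"
    # Remove the space before sentence punctuation: "science ." -> "science."
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Reattach common English clitics: "It 's" -> "It's", "do n't" -> "don't"
    text = re.sub(r"\s+('(?:s|re|ve|ll|d|m)\b|n't\b)", r"\1", text)
    return text.strip()


raw = "It &apos;s certainly our mission to push the boundaries of science ."
cleaned = clean_translation(raw)
print(cleaned)
```

Using `html.unescape` also covers other entities such as `&quot;` or `&amp;`, should the model emit them.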


## Acknowledgements

This work is made possible by [VietAI](http://vietai.org/).

## References

1. Improving Neural Machine Translation Models with Monolingual Data - Sennrich et al., 2016 ([arxiv](https://arxiv.org/abs/1511.06709))
2. Understanding Back-Translation at Scale - Edunov et al., 2018 ([arxiv](https://arxiv.org/abs/1808.09381))
3. T2T translate vi<->en tiny tpu - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1Bx5HfxbmXnMK7kBLHlmGyhVhQVVrDI0p))
4. Sentiment Analysis + Back translation - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG#scrollTo=7yvhttVKTkZu))
5. Tensor2Tensor Intro - Tensor2Tensor Team ([colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb))
