<a href="https://colab.research.google.com/github/vietai/dab/blob/master/colab/Vietnamese_Backtranslation_Model_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation for Vietnamese Language by Backtranslation

Author: Thang Le

In this tutorial, we use our trained Transformer models to translate interactively from Vietnamese to English and vice versa. We visualize the attention weights of the translation models to give you some idea of how the Transformer works. Lastly, we use these two translation models to do back-translation, a data augmentation technique for languages with limited labeled data such as Vietnamese. The rest of this tutorial is organized as follows:
> 1. Mount [Google Cloud Storage](https://cloud.google.com/storage/) to access our trained models
> 2. Clone the required source code
> 3. Prepare [tensor2tensor](https://github.com/tensorflow/tensor2tensor) models for inference
> 4. Interactive Translation
> 5. Attention Visualization
> 6. Vietnamese Data Augmentation by Back Translation

## I. Mount to Google Cloud Storage

We have trained **Vi --> En** and **En --> Vi** translation models and placed them on Google Cloud Storage. For inference, we need to install **gcsfuse** to access these models on Google Cloud Storage.

NOTE: In case you want to train your own translation models, here is the [colab](https://colab.research.google.com/drive/1Bx5HfxbmXnMK7kBLHlmGyhVhQVVrDI0p#scrollTo=cTUSADz_ti63) to check out!

In [0]:
print('\nInstalling gcsfuse')
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

Let's import some necessary dependencies

In [0]:
from google.colab import auth
from itertools import product

import collections
import copy
import json
import os
import string
import re
import pprint

import numpy as np
import tensorflow as tf

# Enable TF Eager execution (tf.contrib is only available in TensorFlow 1.x)
tfe = tf.contrib.eager
tfe.enable_eager_execution()

auth.authenticate_user()

In [0]:
# Mount the Google Cloud Storage bucket to a local directory.
bucket = 'vien-translation'
print('Mounting bucket {} to local.'.format(bucket))
mount_point = '/content/{}'.format(bucket)

if not os.path.exists(mount_point):
  tf.gfile.MakeDirs(mount_point)

# Unmount first in case the bucket is already mounted, then mount it again.
!fusermount -u $mount_point
!gcsfuse --implicit-dirs $bucket $mount_point
!ls $mount_point

## II. Pull or Clone the Source Code

To do Vietnamese <--> English translation, we need to make use of Tensor2Tensor problems.

### From VietAI's GitHub repository `vietai/dab`

We clone `vietai/dab` to get the declaration of the ViEn problem.

In [0]:
src = '/content/dab'
if not os.path.exists(src):
    !git clone https://github.com/vietai/dab.git
    %cd $src
else:
    %cd $src
    !git pull
!ls

### From Tensor2Tensor
We then build Tensor2Tensor from source and import it for later use. Because the tensor2tensor library provides only the EnVi problem, we need to run **dab/problems.py** to register the ViEn problem (for translating from Vietnamese to English) in the tensor2tensor registry.

In [0]:
if not os.path.exists("tensor2tensor"):
    !git clone https://github.com/azraelzhor/tensor2tensor.git
    %cd tensor2tensor
else:
    %cd tensor2tensor
    !git pull

!pip install -q -v .

from tensor2tensor import models
from tensor2tensor import problems
from tensor2tensor.layers import common_layers
from tensor2tensor.utils import trainer_lib
from tensor2tensor.utils import t2t_model
from tensor2tensor.utils import registry
from tensor2tensor.utils import metrics
from tensor2tensor.visualization import attention
from tensor2tensor.data_generators import text_encoder, translate_envi
%run ../problems.py
%cd ..

## III. Prepare Tensor2Tensor Models for Inference

### General configuration
Here we set up the paths to the directories that store our trained translation models and their data.

In [0]:
envi_dir = os.path.join(mount_point, "checkpoints/translate_envi_iwslt32k_tiny/avg")
vien_dir = os.path.join(mount_point, "checkpoints/translate_vien_iwslt32k_tiny/avg")

envi_data_dir = os.path.join(mount_point, "data/translate_envi_iwslt32k")
vien_data_dir = os.path.join(mount_point, "data/translate_vien_iwslt32k")

vien_ckpt_path = os.path.join(vien_dir, "model.ckpt-50000")
envi_ckpt_path = os.path.join(envi_dir, "model.ckpt-50000")

Modes = tf.estimator.ModeKeys

# Create hparams and the model
model_name = "transformer"
hparams_set = "transformer_tiny"

### Vi_En Problem

In [0]:
vien_problem = problems.problem("translate_vien_iwslt32k")

# Get the encoders from the problem
vien_encoders = vien_problem.feature_encoders(vien_data_dir)


# Setup helper functions for encoding and decoding
def vien_encode(input_str, output_str=None):
  """Input str to features dict, ready for inference"""
  inputs = vien_encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs, "target_space_id": tf.constant(1, dtype=tf.int32)}

def vien_decode(integers):
  """List of ints to str"""
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]
  return vien_encoders["inputs"].decode(np.squeeze(integers))


hparams = trainer_lib.create_hparams(hparams_set, data_dir=vien_data_dir, problem_name="translate_vien_iwslt32k")

# NOTE: Only create the model once when restoring from a checkpoint; it's a
# Layer and so subsequent instantiations will have different variable scopes
# that will not match the checkpoint.
translate_vien_model = registry.model(model_name)(hparams, Modes.EVAL)


# Restore and translate!
def translate_vien(inputs, beam_size=4, alpha=0.6):
  encoded_inputs = vien_encode(inputs)

  with tfe.restore_variables_on_create(vien_ckpt_path):
    translated_outputs = translate_vien_model.infer(encoded_inputs, beam_size=beam_size, alpha=alpha)
        
  return vien_decode(translated_outputs["outputs"]), translated_outputs["cache"]

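The encode and decode helpers above follow a simple convention: append the EOS id `1` to the input ids, and cut the output at the first EOS. The sketch below illustrates that convention on its own, without Tensor2Tensor. The vocabulary `TOY_VOCAB` and the functions `toy_encode` and `toy_decode` are hypothetical names made up for this illustration; the real models use the subword encoders returned by `feature_encoders`.

```python
EOS_ID = 1  # same end-of-sentence id used by the helpers above

# Hypothetical toy vocabulary, for illustration only; the real models use
# the subword vocabulary stored next to the training data.
TOY_VOCAB = {"<pad>": 0, "<EOS>": 1, "tôi": 2, "là": 3, "giáo": 4, "viên": 5}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}


def toy_encode(sentence):
    """Whitespace-tokenize, map tokens to ids, and append EOS."""
    return [TOY_VOCAB[tok] for tok in sentence.split()] + [EOS_ID]


def toy_decode(ids):
    """Drop everything from the first EOS onward, then map ids back to text."""
    ids = list(ids)
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return " ".join(ID_TO_TOKEN[i] for i in ids)


encoded = toy_encode("tôi là giáo viên")
print(encoded)
# Padding after EOS is ignored when decoding.
print(toy_decode(encoded + [0, 0]))
```

Cutting at the first EOS matters because the decoder output may contain padding or extra tokens after the sentence has ended.
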
### En_Vi Problem

In [0]:
envi_problem = problems.problem("translate_envi_iwslt32k")

# Get the encoders from the problem
envi_encoders = envi_problem.feature_encoders(envi_data_dir)

envi_hparams = trainer_lib.create_hparams(hparams_set, data_dir=envi_data_dir, problem_name="translate_envi_iwslt32k")
translate_envi_model = registry.model(model_name)(envi_hparams, Modes.EVAL)


# Setup helper functions for encoding and decoding
def envi_encode(input_str, output_str=None):
  """Input str to features dict, ready for inference"""
  inputs = envi_encoders["inputs"].encode(input_str) + [1]  # add EOS id
  batch_inputs = tf.reshape(inputs, [1, -1, 1])  # Make it 3D.
  return {"inputs": batch_inputs, "target_space_id": tf.constant(1, dtype=tf.int32)}

def envi_decode(integers):
  """List of ints to str"""
  integers = list(np.squeeze(integers))
  if 1 in integers:
    integers = integers[:integers.index(1)]
  return envi_encoders["inputs"].decode(np.squeeze(integers))



def translate_envi(inputs, beam_size=4, alpha=0.6):
    encoded_inputs = envi_encode(inputs)
    
    with tfe.restore_variables_on_create(envi_ckpt_path):
        translated_outputs = translate_envi_model.infer(encoded_inputs, beam_size=beam_size, alpha=alpha)
        
    return envi_decode(translated_outputs["outputs"]), translated_outputs["cache"]

## IV. Interactive Translation
In this section, you can test the quality of our trained translation models by entering any sentence in Vietnamese or English. The models translate the input sentence from one language and print the output sentence in the other language.

### From Vietnamese to English

In [0]:
beam_size = 4 #@param {type: "integer"}
alpha = 0.6 #@param {type: "number"}
# Tôi là một giáo viên giỏi
vi_input_sentence = "Tôi là một giáo viên giỏi" #@param {type:"raw"}
en_output_sentence, _ = translate_vien(vi_input_sentence, beam_size=beam_size, alpha=alpha)
en_output_sentence = en_output_sentence.replace('&apos;', '\'')
print("The input sentence is translated as: \n{}".format(en_output_sentence))

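The cell above replaces `&apos;` by hand. If other escaped entities (for example `&quot;` or `&amp;`) show up in a translation, Python's standard `html.unescape` is one possible way to handle them all at once. The helper name `clean_translation` is our own and is not part of the notebook's models.

```python
import html


def clean_translation(text):
    """Unescape HTML entities such as &apos; and &quot; left in decoded text."""
    return html.unescape(text)


print(clean_translation("I &apos;m a good teacher"))
print(clean_translation("He said &quot;hello&quot; &amp; left"))
```
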
### From English to Vietnamese

In [0]:
beam_size = 4 #@param {type: "integer"}
alpha = 0.6 #@param {type: "number"}

# I am a good teacher
en_input_sentence = "I am a good teacher" #@param {type:"raw"}
vi_output_sentence, _ = translate_envi(en_input_sentence, beam_size=beam_size, alpha=alpha)
print("The input sentence is translated as: \n{}".format(vi_output_sentence))

## V. Attention Visualization
In this section, we visualize the Transformer's attention weights to give you some insight into how the Transformer encodes and decodes sentences.

In [0]:
inputs = "Tôi là một thầy giáo giỏi"
outputs, cache = translate_vien(inputs, beam_size=1, alpha=0)

print("Inputs: %s" % inputs)
print("Outputs: %s" % outputs.replace('&apos;', '\''))

In [0]:
SIZE = 35

def encode_eval(input_str, output_str, encoders):
  inputs = tf.reshape(encoders["inputs"].encode(input_str) + [1], [1, -1, 1, 1])  # Make it 4D.
  outputs = tf.reshape(encoders["inputs"].encode(output_str) + [1], [1, -1, 1, 1])  # Make it 4D.
  return {"inputs": inputs, "targets": outputs}

def get_att_mats(translate_model, hparams):
  enc_atts = []
  dec_atts = []
  encdec_atts = []

  for i in range(hparams.num_hidden_layers):
    enc_att = translate_model.attention_weights[
      "transformer/body/encoder/layer_%i/self_attention/multihead_attention/dot_product_attention" % i][0]
    dec_att = translate_model.attention_weights[
      "transformer/body/decoder/layer_%i/self_attention/multihead_attention/dot_product_attention" % i][0]
    encdec_att = translate_model.attention_weights[
      "transformer/body/decoder/layer_%i/encdec_attention/multihead_attention/dot_product_attention" % i][0]
    
    enc_atts.append(resize(enc_att))
    dec_atts.append(resize(dec_att))
    encdec_atts.append(resize(encdec_att))

  return enc_atts, dec_atts, encdec_atts

def resize(np_mat):
  # Keep at most SIZE tokens along the query and key axes
  np_mat = np_mat[:, :SIZE, :SIZE]
  # Sum the weights of each row (over the key axis)
  row_sums = np.sum(np_mat, axis=-1)
  # Normalize so that each row sums to 1 again
  layer_mat = np_mat / row_sums[..., np.newaxis]
  lsh = layer_mat.shape
  # Add extra dim for viz code to work.
  layer_mat = np.reshape(layer_mat, (1, lsh[0], lsh[1], lsh[2]))
  return layer_mat

def to_tokens(ids, hparams):
  ids = np.squeeze(ids)
  subtokenizer = hparams.problem_hparams.vocabulary['targets']
  tokens = []
  for _id in ids:
    if _id == 0:
      tokens.append('<PAD>')
    elif _id == 1:
      tokens.append('<EOS>')
    elif _id == -1:
      tokens.append('<NULL>')
    else:
      tokens.append(subtokenizer._subtoken_id_to_subtoken_string(_id))
  return tokens

def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [0]:
# Get normalized attention weights for each layer
enc_atts, dec_atts, encdec_atts = get_att_mats(translate_vien_model, hparams)

dec_atts_revised = [np.repeat(dec_att, dec_att.shape[-1], axis=2) for dec_att in dec_atts]

attention_history = cache["attention_history"]
encdec_atts_revised = [attention_history["layer_0"], attention_history["layer_1"]]

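To make the normalization step in `resize` concrete, here is a small sketch in plain Python. It assumes a single attention head given as a list of rows; `truncate_and_renormalize` and `toy_att` are hypothetical names used only for this example.

```python
def truncate_and_renormalize(att, size):
    """Keep the first `size` rows and columns, then rescale each row to sum to 1."""
    truncated = [row[:size] for row in att[:size]]
    normalized = []
    for row in truncated:
        total = sum(row)
        normalized.append([w / total for w in row])
    return normalized


# Toy 3x3 attention matrix for one head; each row sums to 1 before truncation.
toy_att = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

for row in truncate_and_renormalize(toy_att, size=2):
    print([round(w, 3) for w in row])
```

After truncation the rows no longer sum to 1, which is why the weights are divided by the new row sums. With NumPy, the same row normalization can be written as `mat / mat.sum(axis=-1, keepdims=True)`.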

In [0]:
inp_text = to_tokens(vien_encoders["inputs"].encode(inputs), hparams)
out_text = to_tokens(vien_encoders["inputs"].encode(outputs), hparams)

call_html()

attention.show(inp_text, out_text, enc_atts, dec_atts_revised, encdec_atts_revised)

## VI. Vietnamese Data Augmentation by Backtranslation
In this section, we can input any Vietnamese sentence and get its paraphrases in an interactive way. By changing the **beam size** and **length penalty ($\alpha$)** hyperparameters, we can generate multiple paraphrases for each input sentence.
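
To get a feel for what $\alpha$ does, the sketch below rescores two made-up hypotheses with a GNMT-style length penalty, $lp(Y) = ((5 + |Y|) / 6)^\alpha$. We assume here that the beam search behind `infer` uses this form; the hypotheses, their log-probabilities and the helper names are invented for illustration.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob, length, alpha):
    """Length-normalized score; higher is better."""
    return log_prob / length_penalty(length, alpha)


def best_hypothesis(hypotheses, alpha):
    """Return the (text, log_prob, length) tuple with the highest rescored value."""
    return max(hypotheses, key=lambda h: rescore(h[1], h[2], alpha))


# Hypothetical hypotheses: (text, total log-probability, length in tokens).
hypotheses = [
    ("short output", -4.0, 4),
    ("a somewhat longer output", -4.8, 8),
]

for alpha in (0.0, 0.6, 1.0):
    print(alpha, "->", best_hypothesis(hypotheses, alpha)[0])
```

With $\alpha = 0$ the raw log-probability decides, which favors the shorter hypothesis; a larger $\alpha$ softens the penalty on long outputs, so the longer hypothesis can win. This is one reason different settings can lead to different paraphrases.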




Below are some fun examples that we tried; you can then try it yourself. Have some fun :v


---

* Example 1: Bạn tôi học rất giỏi nhưng bạn không thích học
    * Bạn biết đấy , bạn biết đấy , bạn biết đấy , nhưng bạn không thích học .
    * Bạn rất giỏi ở trường , nhưng bạn không thích học .
---
* Example 2: Tôi là ai, và đây là đâu
    * Tôi là người mà người dân chủ , và đây là nơi nào ?
    * Tôi là ai là người manize , và đây là nơi nào ?
    * Tôi là người có thể tự nhiên , và đó là nơi mà nó ở đâu , và đó là nơi mà nó ở đâu ?
    * Tôi là người mà người có thể tự nhiên , và đó là nơi này ở đâu , và đó là nơi nào ?
---
* Example 3:  VietAI là một tổ chức phi lợi nhuận với mục tiêu thúc đẩy sự phát triển của công nghệ AI tại Việt Nam
    * VietAI là một tổ chức phi lợi nhuận với sự phát triển của AI ở Việt Nam .
---

NOTE: We investigated the effectiveness of Vietnamese data augmented by back-translation on a sentiment analysis task for Foody comments, and the results seem promising :3. For more information, please check the [colab](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG#scrollTo=7yvhttVKTkZu) here.


In [0]:
beam_size = 4 #@param {type: "integer"}
alpha = 0.6 #@param {type: "number"}
# Tôi từng ước mơ trở thành cầu thủ bóng đá
vi_input_sentence = "Tôi từng ước mơ trở thành cầu thủ bóng đá" #@param {type:"raw"}
print("Augmented data:")

en_output_sentence, _ = translate_vien(vi_input_sentence, beam_size=beam_size, alpha=alpha)
vi_output_sentence, _ = translate_envi(en_output_sentence, beam_size=beam_size, alpha=alpha)
print(vi_output_sentence)

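The cell above runs one round trip with a single setting. A possible way to collect several paraphrases is to sweep over a small grid of beam sizes and length penalties. The sketch below assumes translators that take `beam_size` and `alpha` keyword arguments and return a string; `back_translate` and the two stand-in translators are hypothetical and only there so the example runs without the trained models.

```python
from itertools import product


def back_translate(sentence, to_pivot, from_pivot, beam_sizes, alphas):
    """Round-trip `sentence` through a pivot language for every (beam, alpha) pair."""
    paraphrases = []
    for beam_size, alpha in product(beam_sizes, alphas):
        pivot = to_pivot(sentence, beam_size=beam_size, alpha=alpha)
        candidate = from_pivot(pivot, beam_size=beam_size, alpha=alpha)
        # Keep only new paraphrases that differ from the input, in first-seen order.
        if candidate != sentence and candidate not in paraphrases:
            paraphrases.append(candidate)
    return paraphrases


# Stand-in translators (assumptions) so the sketch runs without the real models.
def fake_vi_to_en(text, beam_size, alpha):
    return "en(%s|b=%d)" % (text, beam_size)


def fake_en_to_vi(text, beam_size, alpha):
    return "vi(%s)" % text


print(back_translate("xin chào", fake_vi_to_en, fake_en_to_vi, [1, 4], [0.0, 0.6]))
```

With the real models, the translators could be wrapped as `lambda s, **kw: translate_vien(s, **kw)[0]` and `lambda s, **kw: translate_envi(s, **kw)[0]`, since both functions return a `(text, cache)` pair.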

## Acknowledgements

This work is made possible by [VietAI](http://vietai.org/). Special thanks to [Thang Luong](http://thangluong.com), [Trieu H. Trinh](https://thtrieu.github.io/) and Phat Hoang for collaborating and giving comments.

## References

1. Improving Neural Machine Translation Models with Monolingual Data - Sennrich et al., 2016 ([arxiv](https://arxiv.org/abs/1511.06709))
2. Understanding Back-Translation at Scale - Edunov et al., 2018 ([arxiv](https://arxiv.org/abs/1808.09381))
3. T2T translate vi<->en tiny tpu - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1Bx5HfxbmXnMK7kBLHlmGyhVhQVVrDI0p))
4. Sentiment Analysis + Back translation - Trieu H. Trinh ([colab](https://colab.research.google.com/drive/1_I0KvFlHFyBcTRT3Bfx9BGLJcIHGJNrG#scrollTo=7yvhttVKTkZu))
5. Tensor2Tensor Intro - Tensor2Tensor Team ([colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb))
