# High-Quality Data Augmentation for Low-Resource NMT: Combining Translation Memory, a GAN Generator, and Filtering


#### This notebook is a proof-of-concept demonstration:
1. We use toy data instead of real datasets.

2. The translation outputs are intended only to illustrate the viability of our approach.

In [1]:
!pip install -r requirements.txt

## Integrating Translation Memory into the Input
Translation Memory (TM) is an effective way to improve machine translation: it supplies the model with similar, previously translated sentence pairs.

We therefore integrate the retrieved TM pair into the model's input, using the following notation.

<img src="./figure/TM.png" width="40%" style="margin: 0 auto;">

- $s$: a German sentence on the source side.
- $s_{t}$: a German source-side sentence that is similar to $s$.
- $t_{t}$: the Upper Sorbian target-side sentence corresponding to $s_{t}$.
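
The construction of a TM-augmented input can be sketched as follows. This is a minimal illustration, not the repository's `generate_new` implementation: the helper names `most_similar` and `build_tm_input` are hypothetical, character-level similarity from `difflib` stands in for whatever retrieval the project actually uses, and the target-side strings are placeholders rather than real Upper Sorbian text.

```python
from difflib import SequenceMatcher

SEP = "[SEP]"  # separator token, as used in this notebook


def most_similar(query, memory_src, exclude_self=True):
    """Return the index of the memory sentence most similar to `query`.

    Assumption: difflib's character-level ratio is only a stand-in
    for the real similarity measure.
    """
    best_idx, best_score = None, -1.0
    for i, candidate in enumerate(memory_src):
        # When the query itself is part of the memory (training set),
        # skip the identical sentence so the target is not leaked.
        if exclude_self and candidate == query:
            continue
        score = SequenceMatcher(None, query, candidate).ratio()
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


def build_tm_input(s, memory_src, memory_tgt):
    """Concatenate s, s_t and t_t into one source-side input string."""
    idx = most_similar(s, memory_src)
    if idx is None:  # nothing to retrieve: fall back to the plain source
        return s
    return f"{s} {SEP} {memory_src[idx]} {SEP} {memory_tgt[idx]}"


memory_src = ["Das Haus ist klein.", "Der Hund schläft."]
memory_tgt = ["<hsb translation A>", "<hsb translation B>"]  # placeholders
tm_input = build_tm_input("Das Haus ist groß.", memory_src, memory_tgt)
print(tm_input)
```

The model then translates the concatenated string instead of $s$ alone, so the retrieved pair acts as an in-context hint.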

In [2]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from retrieval.similar_sentences_selection import read_data, generate_new, similar_domain_selection

experiments = "GAN_TM"
# original bilingual corpus: training 50 * 2, validation 5 * 2, test 5 * 2
de = './data/toy_data/train.hsb-de.de'
hsb = './data/toy_data/train.hsb-de.hsb'
val_de = './data/toy_data/devel.hsb-de.de'
val_hsb = './data/toy_data/devel.hsb-de.hsb'
test_de = './data/toy_data/devel_test.hsb-de.de'
test_hsb = './data/toy_data/devel_test.hsb-de.hsb'
# target paths for the new sentence pairs integrated with TM
train_src = './data/{}/train.dehsb_hsb.dehsb'.format(experiments)
train_tgt = './data/{}/train.dehsb_hsb.hsb'.format(experiments)
val_src = './data/{}/val.dehsb_hsb.dehsb'.format(experiments)
val_tgt = './data/{}/val.dehsb_hsb.hsb'.format(experiments)
test_src = './data/{}/test.dehsb_hsb.dehsb'.format(experiments)
test_tgt = './data/{}/test.dehsb_hsb.hsb'.format(experiments)
# separator tokens
S = "[SEP]"
St = "[SEP]"
Tt = "[SEP]"

de_lines = read_data(de)
hsb_lines = read_data(hsb)
val_de_lines = read_data(val_de)
val_hsb_lines = read_data(val_hsb)
test_de_lines = read_data(test_de)
test_hsb_lines = read_data(test_hsb)

# training set: retrieve from the training data itself
generate_new(
    top_k_similar=10, distance=0.5,
    retrieve_dataset=de_lines, query_lines=de_lines,
    retrieve_tar_sen_dataset=hsb_lines, query_tar_sen_dataset=hsb_lines,
    new_src_path=train_src, new_tgt_path=train_tgt,
)
# validation set: retrieve from the training data
generate_new(
    top_k_similar=10, distance=0.5,
    retrieve_dataset=de_lines, query_lines=val_de_lines,
    retrieve_tar_sen_dataset=hsb_lines, query_tar_sen_dataset=val_hsb_lines,
    new_src_path=val_src, new_tgt_path=val_tgt,
)
# test set: retrieve from the training data
generate_new(
    top_k_similar=2, distance=0.5,
    retrieve_dataset=de_lines, query_lines=test_de_lines,
    retrieve_tar_sen_dataset=hsb_lines, query_tar_sen_dataset=test_hsb_lines,
    new_src_path=test_src, new_tgt_path=test_tgt,
)


In [3]:
# select monolingual sentences from a domain similar to the training data
mono_de = "./data/toy_data/news.2007.de.shuffled.deduped"
created_src = './data/{}/created.dehsb_hsb.dehsb'.format(experiments)
mono_src = './data/aug_double/mono.de_hsb.de'
mono_de_lines = read_data(mono_de)
similar_domain_selection(2, mono_de_lines, de_lines, hsb_lines, created_src, mono_src)


In [4]:
# list the newly generated test set as an example
new_test_de_lines = read_data(test_src)
new_test_hsb_lines = read_data(test_tgt)

for s, t in zip(new_test_de_lines[:-1], new_test_hsb_lines[:-1]):
    print("source sentences: ", s, "\n")
    print("target sentence: ", t, "\n")
    print("-------------------------------------------------------------------------------------------------------------------------------------------\n")

## Leveraging a GAN Generator to Implement Back Translation from the Source Side

The GAN structure lets us exploit an additional source-side monolingual corpus to improve the performance of the translation model.

<img src="./figure/GAN.png" width="40%" style="margin: 0 auto;">


In [5]:
# Train the generator (G) in the architecture above.

import train_GAN
train_GAN.begin_training()

# The training log below first prints some important parameters,
# followed by the structures of G and D.
# In our actual experiments, we use early stopping and save the best model.

In [6]:
# translate and evaluate the test set
import translate
translate.main(test=True)

In [7]:
import evaluation
from tabulate import tabulate
table_header = ['Generator', 'BLEU', 'chrF2', 'TER']
bleu, chrF2, ter = evaluation.cal_bleu()
print(tabulate(tabular_data=[('+TM +GAN', bleu, chrF2, ter)], headers=table_header, tablefmt='grid'))
# evaluation.main()


## Conducting Data Augmentation Experiments to Evaluate the High-Quality Filter

Not all sentences translated by the generator are of high quality.

We therefore apply the following process to filter out sentences that deviate significantly from natural-language sentences.

We retain the high-quality translations and demonstrate their effectiveness through data augmentation experiments.

<img src="./figure/filter.png" width="40%" style="margin: 0 auto;">
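
A minimal sketch of how such an interval could be estimated and applied, using only the length ratio as the criterion. The helper names (`length_ratio`, `ratio_interval`, `keep_in_interval`) are hypothetical, the token-count ratio and the sample standard deviation are assumptions, and the toy pairs are invented for illustration. The repository's own filter may compute these quantities differently.

```python
from statistics import mean, stdev


def length_ratio(src, tgt):
    """Target/source length ratio in whitespace tokens (an assumption)."""
    return len(tgt.split()) / max(len(src.split()), 1)


def ratio_interval(natural_pairs):
    """Estimate [mean - std, mean + std] from natural sentence pairs."""
    ratios = [length_ratio(s, t) for s, t in natural_pairs]
    mu = mean(ratios)
    sigma = stdev(ratios) if len(ratios) > 1 else 0.0
    return mu - sigma, mu + sigma


def keep_in_interval(synthetic_pairs, lower, upper):
    """Keep synthetic pairs whose length ratio lies inside the interval."""
    return [(s, t) for s, t in synthetic_pairs
            if lower <= length_ratio(s, t) <= upper]


# Toy "natural" corpus with length ratios 1.0, 1.5 and about 0.67.
natural = [("a b c d", "w x y z"), ("a b", "x y z"), ("a b c", "x y")]
lower, upper = ratio_interval(natural)

# The second synthetic pair is far too short on the target side.
synthetic = [("a b c", "x y z"), ("a b c d", "x")]
kept = keep_in_interval(synthetic, lower, upper)
print(kept)
```

The same pattern extends to any scalar score: estimate the mean and standard deviation on natural data, then discard synthetic pairs that fall outside the resulting interval.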

In [8]:
# Translate the monolingual corpus from the similar domain.
# Here we use a pretrained model for this step, so that the filter below
# is demonstrated on realistic translations.
# In your case, use the generator (G) trained above
# and modify the model_config.py file accordingly.
import translate
translate.main(test=False)

In [9]:
!cp ./result/GAN_TM/translation.txt ./data/aug_double/created.de_hsb.hsb
!cp ./data/toy_data/train.hsb-de.de ./data/aug_double/train.de_hsb.de
!cp ./data/toy_data/train.hsb-de.hsb ./data/aug_double/train.de_hsb.hsb

In [10]:
# estimate the filter intervals from the natural bilingual corpus
import high_quality_procedure.high_quality_filter as filter
ppl_mean, ppl_std, len_mean, len_std = filter.get_original_interval()

The function filter.filter_synthetic() below accepts four parameters that define the filter criteria.

To run comparative experiments, adjust the intervals passed to filter.filter_synthetic() and keep the rest of the procedure unchanged.

For example:

- filter.filter_synthetic(0, 100, len_mean+len_std, len_mean-len_std) filters the synthetic sentences by the length ratio.
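
The comparative setup can be illustrated with a small stand-alone sketch in which one criterion at a time is disabled by opening its bounds. The function `filter_scored` is hypothetical and deliberately does not mirror the signature or argument order of filter.filter_synthetic(); the scores are made-up numbers that stand for precomputed perplexity and length ratios.

```python
import math


def filter_scored(scored, ppl_lower, ppl_upper, len_lower, len_upper):
    """Keep (src, tgt) pairs whose two scores both lie inside their bounds.

    `scored` holds tuples of (src, tgt, ppl_score, len_ratio).
    """
    return [(s, t) for s, t, ppl, ratio in scored
            if ppl_lower <= ppl <= ppl_upper
            and len_lower <= ratio <= len_upper]


# Made-up scores, for illustration only.
scored = [
    ("src 1", "tgt 1", 1.1, 1.0),  # passes both criteria
    ("src 2", "tgt 2", 5.0, 1.0),  # perplexity score out of range
    ("src 3", "tgt 3", 1.0, 3.0),  # length ratio out of range
]

# Both criteria active.
both = filter_scored(scored, 0.5, 2.0, 0.5, 1.5)
# Length ratio only: the perplexity bounds are opened completely.
len_only = filter_scored(scored, -math.inf, math.inf, 0.5, 1.5)
# Perplexity only: the length bounds are opened completely.
ppl_only = filter_scored(scored, 0.5, 2.0, -math.inf, math.inf)

print(len(both), len(len_only), len(ppl_only))
```

Opening one pair of bounds isolates the contribution of the other criterion while the rest of the pipeline stays identical.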

Each comparative run repeats the same steps with a vanilla Transformer model, so we do not show the runs individually.

In [11]:
# filter the synthetic bilingual corpus
filter.filter_synthetic(ppl_mean+ppl_std, ppl_mean-ppl_std, len_mean+len_std, len_mean-len_std)

In [12]:
# augment the original bilingual corpus.
!cat ./data/toy_data/train.hsb-de.de ./data/aug_double/filtered.de_hsb.de > ./data/aug_double/train.de_hsb.de

In [13]:
!cat ./data/toy_data/train.hsb-de.hsb ./data/aug_double/filtered.de_hsb.hsb > ./data/aug_double/train.de_hsb.hsb

In [14]:
# data augmentation experiment.
import transformer
transformer.begin_training()


In [15]:
# translate and evaluate test set.
from model_config import Transformer_Config
config = Transformer_Config()
translate.main(test=True, config=config)
table_header = ['Filter', 'BLEU', 'chrF2', 'TER']
bleu, chrF2, ter = evaluation.cal_bleu(config=config)
print(tabulate(tabular_data=[('+ratio_of_ppl +ratio_of_len', bleu, chrF2, ter)], headers=table_header, tablefmt='grid'))
# evaluation.main(config=config)