# En-De Translation
Generate German translations of the WinoMT dataset with the WMT19 En-De Transformer model.

In [8]:
import torch

PATH="/export/data4/vzhekova/biases-data/En-De"

In [3]:
%cd $PATH

/export/data4/vzhekova/biases-data/En-De


In [4]:
# Extract the sentences (third tab-separated column of en.txt)
!cut -f3 -d'	' en.txt > data.en-de.en

In [5]:
!head data.en-de.en

The developer argued with the designer because she did not like the design.
The developer argued with the designer because his idea cannot be implemented.
The mechanic gave the clerk a present because she won the lottery.
The mechanic gave the clerk a present because it was his birthday.
The mover said thank you to the housekeeper because she is grateful.
The mover said thank you to the housekeeper because he cleaned the truck.
The assistant scanned documents for the analyst because she asked.
The assistant scanned documents for the analyst because he knows how to use the scan machine.
The chief gave the housekeeper a tip because she was satisfied.
The chief gave the housekeeper a tip because he was helpful.
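
The `cut` call above keeps the third tab-separated column of `en.txt`. A minimal Python sketch of the same step follows. The helper name `extract_column` is hypothetical, the first, second and fourth columns in the sample rows are placeholders, and the sketch assumes every line has at least three tab-separated fields.

```python
def extract_column(lines, index=2, sep="\t"):
    """Return the field at `index` (0-based) from each separator-delimited line."""
    return [line.rstrip("\n").split(sep)[index] for line in lines]


# Placeholder rows: only the third column (the sentence) is taken from the data above.
sample = [
    "x\ty\tThe developer argued with the designer because she did not like the design.\tz\n",
    "x\ty\tThe mechanic gave the clerk a present because she won the lottery.\tz\n",
]

for sentence in extract_column(sample):
    print(sentence)
```

Unlike `cut`, this sketch raises an `IndexError` on a line with fewer than three fields, which can help catch malformed rows early.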


# Data Preprocessing

- Tokenization

In [6]:
from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesDetokenizer

mpn = MosesPunctNormalizer()
mt_en = MosesTokenizer(lang='en')
md_en = MosesDetokenizer(lang='en')

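# Normalize punctuation, then tokenize each sentence (one sentence per line)
with open('data.en-de.en', encoding='utf-8') as fin, \
        open('data.en-de.tok.en', 'w', encoding='utf-8') as fout:
    for line in fin:
        tokens = mt_en.tokenize(mpn.normalize(line), return_str=True)
        print(tokens, file=fout)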

print('Finished tokenizing.')

Finished tokenizing.


- Subword tokenization

In [9]:
FASTBPE="/home/vzhekova/fastBPE/fast" # path to the fastBPE tool
# Learning BPE codes on this data fails with more than 4000 merge operations
#!$FASTBPE learnbpe 4000 data.en-de.en > codes

In [11]:
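# Apply the BPE codes shipped with the WMT19 model.
# Note: the input here is the raw file data.en-de.en, not the Moses-tokenized
# data.en-de.tok.en written above, so punctuation stays attached to the last word.
!$FASTBPE applybpe bpe.data.en-de.en data.en-de.en wmt19.en-de.joined-dict.ensemble/bpecodes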

Loading codes from wmt19.en-de.joined-dict.ensemble/bpecodes ...
Read 30000 codes from the codes file.
Loading vocabulary from data.en-de.en ...
Read 51529 words (2288 unique) from text file.
Applying BPE to data.en-de.en ...
Modified 51529 words from text file.


In [None]:
# SentencePiece
# import sentencepiece as spm

# # segment the subwords
# spm.SentencePieceTrainer.train(input="data.en-de.tok.en", 
#                                model_prefix="bpe", 
#                                vocab_size=1698)

# print('Finished training sentencepiece model.')

In [14]:
# Load the trained sentencepiece model
# spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# # Segment the tokenized sentences into subwords
# with open("data.en-de.tok.en", "r") as f_in, open("spm.en-de.en", "w") as f_out:
#     for line in f_in:
#         # Segment into subwords
#         line_segmented = spm_model.encode(line.strip(), out_type=str)
#         # Join the subwords into a string
#         f_out.write(" ".join(line_segmented) + "\n")
        
# print('Finished.')



In [12]:
!head bpe.data.en-de.en

The develop@@ er argued with the designer because she did not like the design@@ .
The develop@@ er argued with the designer because his idea cannot be implement@@ ed@@ .
The mech@@ anic gave the cl@@ er@@ k a present because she won the lot@@ ter@@ y@@ .
The mech@@ anic gave the cl@@ er@@ k a present because it was his birth@@ day@@ .
The mo@@ ver said thank you to the house@@ keeper because she is grat@@ ef@@ ul@@ .
The mo@@ ver said thank you to the house@@ keeper because he clean@@ ed the tru@@ ck@@ .
The assistant sc@@ ann@@ ed documents for the analyst because she as@@ ke@@ d.
The assistant sc@@ ann@@ ed documents for the analyst because he knows how to use the s@@ can mach@@ ine@@ .
The chief gave the house@@ keeper a tip because she was satis@@ fie@@ d.
The chief gave the house@@ keeper a tip because he was hel@@ pf@@ ul@@ .
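
In the output above, `@@` at the end of a piece marks that the word continues in the next piece. The sketch below joins the pieces back into words, which is roughly what `--remove-bpe` does at decoding time. The helper name `merge_bpe` is hypothetical and not part of fastBPE or fairseq.

```python
def merge_bpe(line, marker="@@ "):
    """Join BPE pieces by deleting each continuation marker and the space after it."""
    # The trailing space lets a marker at the very end of the line match as well.
    return (line + " ").replace(marker, "").rstrip()


segmented = "The develop@@ er argued with the designer because she did not like the design@@ ."
print(merge_bpe(segmented))
```

Merging the first line gives back the raw sentence with `design.` as one word, which shows that the period was still attached to the word when BPE was applied.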


- Binarize data

In [13]:
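# Important: copy wmt19.en-de.joined-dict.ensemble/dict.de.txt into data-bin before
# translating; fairseq-generate needs the target dictionary there as well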
!fairseq-preprocess \
    --source-lang en \
    --target-lang de \
    --only-source \
    --testpref bpe.data.en-de \
    --srcdict wmt19.en-de.joined-dict.ensemble/dict.en.txt \
    --tgtdict wmt19.en-de.joined-dict.ensemble/dict.de.txt \
    --destdir data-bin \
    --workers 8

2023-03-27 14:43:56 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=True, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='en', srcdict='wmt19.en-d
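
The comment in the preprocessing cell asks for the German dictionary to be copied into `data-bin`. A minimal sketch of that step follows. The helper `copy_target_dict` is hypothetical, and the sketch assumes the file keeps the name `dict.de.txt` in both directories.

```python
import shutil
from pathlib import Path


def copy_target_dict(model_dir, data_bin, lang="de"):
    """Copy dict.<lang>.txt from the model directory into the binarized data directory."""
    src = Path(model_dir) / f"dict.{lang}.txt"
    dst = Path(data_bin) / f"dict.{lang}.txt"
    dst.parent.mkdir(parents=True, exist_ok=True)  # create data-bin if it is missing
    shutil.copyfile(src, dst)
    return dst


# Usage from the notebook's working directory (paths as in the cells above):
# copy_target_dict("wmt19.en-de.joined-dict.ensemble", "data-bin")
```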

# Translation

In [16]:
MODELS="/export/data4/vzhekova/biases-data/En-De/wmt19.en-de.joined-dict.ensemble"

# Generate translations
!fairseq-generate data-bin  \
    --task translation \
    --source-lang en \
    --target-lang de \
    --path $MODELS/model1.pt:$MODELS/model2.pt:$MODELS/model3.pt:$MODELS/model4.pt \
    --beam 5 \
    --batch-size 128 \
    --memory-efficient-fp16 \
    --remove-bpe > en-de.decode_Beam_5.log

2023-03-27 14:53:27 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': 
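
The file written by `fairseq-generate` mixes several line types, and sentences do not necessarily appear in input order. The sketch below pairs each source sentence with its hypothesis and sorts the pairs by sentence id. The helper `parse_generate_log` is hypothetical; it assumes source lines look like `S-<id><TAB><source>` and hypothesis lines like `H-<id><TAB><score><TAB><hypothesis>`. The ids and scores in the sample are made up for illustration.

```python
def parse_generate_log(lines):
    """Return (id, source, hypothesis) tuples sorted by sentence id."""
    sources, hyps = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, _, idx = fields[0].partition("-")
        if not idx.isdigit():
            continue  # skip logger output and any other line type
        if tag == "S" and len(fields) >= 2:
            sources[int(idx)] = fields[1]
        elif tag == "H" and len(fields) >= 3:
            hyps[int(idx)] = fields[2]
    return [(i, sources[i], hyps[i]) for i in sorted(sources) if i in hyps]


# Made-up ids and scores; the sentences are taken from this notebook's output.
sample_log = [
    "S-7\tThe cook asked the nurse for help because her mom was sick.\n",
    "H-7\t-0.5\tDer Koch bat die Krankenschwester um Hilfe , weil ihre Mutter krank war .\n",
    "S-3\tThe CEO got goods from the cashier and then paid him.\n",
    "H-3\t-0.4\tDer Geschäftsführer holte Waren aus der Kasse und bezahlte sie dann .\n",
]

for idx, src, hyp in parse_generate_log(sample_log):
    print(idx, src, "->", hyp)
```

Sorting by id keeps the hypotheses aligned with the original line order of the input file, which matters when they are later compared against per-sentence annotations.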

In [11]:
# Extract the source sentences (S-*) and the hypotheses (H-*) from the decoding log file
!grep ^S en-de.decode_Beam_5.log | sed 's/^S-//g' | cut -f 2 | sed 's/@@ //g' > ./ref_en.en-de.txt
!grep ^H en-de.decode_Beam_5.log | sed 's/^H-//g' | cut -f 3 | sed 's/@@ //g' > ./hyp.en-de.txt

!head ref_en.en-de.txt
print("..........")
!head hyp.en-de.txt

The CEO got goods from the cashier and then paid him.
The cook asked the nurse for help because her mom was sick.
The assistant hired the driver as she offered an excellent service.
The secretary called the mover and asked her to come.
The mover called the secretary and asked him to come.
The writer interviewed the manager and asked her a few questions.
The assistant helped the cook for preparing food and learned from her.
The laborer called the nurse and asked him to help.
The assistant called the physician and asked her for money.
The chief praised the accountant and offered him a promotion.
..........
Der Geschäftsführer holte Waren aus der Kasse und bezahlte sie dann .
Der Koch bat die Krankenschwester um Hilfe , weil ihre Mutter krank war .
Die Assistentin stellte den Fahrer ein , da sie einen hervorragenden Service bot .
Die Sekretärin rief die Umzugsfirma an und bat sie , zu kommen .
Der Umzugsunternehmer rief die Sekretärin an und bat ihn , zu kommen .
Der Autor interviewte die