# En-Fr Translation
Generate translations on MuST-C dataset with WMT14 En-Fr Transformer model

In [2]:
import torch

# check if we can connect to the GPU with PyTorch
if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(device))
else:
    print('Failed to find GPU. Will use CPU.')
    device = 'cpu'

Current device: GeForce GTX 1080 Ti


In [3]:
PATH="/export/data4/vzhekova/biases-data/En-Fr_MuST-C"

%cd $PATH

/export/data4/vzhekova/biases-data/En-Fr_MuST-C


In [21]:
!echo -e "\nFirst lines of English:\n"
!head tst.en-fr.en
!echo -e "\nFirst lines of French:\n"
!head tst.en-fr.fr


First lines of English:

Back in New York, I am the head of development for a non-profit called Robin Hood.
When I'm not fighting poverty, I'm fighting fires as the assistant captain of a volunteer fire company.
Now in our town, where the volunteers supplement a highly skilled career staff, you have to get to the fire scene pretty early to get in on any action.
I remember my first fire.
I was the second volunteer on the scene, so there was a pretty good chance I was going to get in.
But still it was a real footrace against the other volunteers to get to the captain in charge to find out what our assignments would be.
When I found the captain, he was having a very engaging conversation with the homeowner, who was surely having one of the worst days of her life.
Here it was, the middle of the night, she was standing outside in the pouring rain, under an umbrella, in her pajamas, barefoot, while her house was in flames.
The other volunteer who had arrived just before me — let's call him 

# Data preprocessing

- Tokenization

In [24]:
# Tokenize text
from sacremoses import MosesPunctNormalizer
from sacremoses import MosesTokenizer, MosesDetokenizer
from __future__ import print_function

mpn = MosesPunctNormalizer()

# Preprocess the sentences from train/dev/test sets
for partition in ["train", "dev", "tst"]:
    for lang in ["fr", "en"]:
        mt_fr = MosesTokenizer(lang=lang)
        with open(f"{partition}.en-fr.{lang}") as fin, open(f"tok.{partition}.en-fr.{lang}",'w') as fout:
            for line in fin:
                tokens = mt_en.tokenize(mpn.normalize(line), return_str=True)
                print(tokens, end='\n', file=fout) 

        

print('Finished tokenizing.')

Finished tokenizing.


- Subword tokenization

In [25]:
# Training subword model
!subword-nmt learn-bpe -s 32000 < tok.train.en-fr.en > sw.model.en
!subword-nmt learn-bpe -s 32000 < tok.train.en-fr.fr > sw.model.fr

print('Finished subword training.')

100%|#####################################| 32000/32000 [08:16<00:00, 64.48it/s]
100%|#####################################| 32000/32000 [08:01<00:00, 66.39it/s]
/bin/bash: fin: No such file or directory
/bin/bash: fin: No such file or directory
/bin/bash: fin: No such file or directory
/bin/bash: fin: No such file or directory
/bin/bash: fin: No such file or directory
/bin/bash: fin: No such file or directory
Finished subword.


In [27]:
# Applying subword model
for partition in ["train", "dev", "tst"]:
    for lang in ["fr", "en"]:
        sw = f"sw.model.{lang}"
        fin = f"tok.{partition}.en-fr.{lang}"
        fout = f"sw.{partition}.en-fr.{lang}"
        !subword-nmt apply-bpe -c $sw < $fin > $fout
        
print('Apllied subword model.')

Apllied subword model.


In [28]:
!echo -e "\nFirst lines of tokenized English:\n"
!head sw.train.en-fr.en

!echo -e "\nFirst lines of tokenized French:\n"
!head sw.train.en-fr.fr


First lines of tokenized English:

Thank you so much , Chris . And it &apos;s truly a great honor to have the opportunity to come to this stage twice ; I &apos;m extremely grateful .
I have been blown away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
And I say that sincerely , partly because ( Mo@@ ck so@@ b ) I need that . ( Laughter )
( Laughter ) I flew on Air Force Two for eight years .
( Laughter ) Now I have to take off my shoes or boots to get on an airplane ! ( Laughter ) ( Applause )
I &apos;ll tell you one quick story to illustrate what that &apos;s been like for me . ( Laughter )
It &apos;s a true story - every bit of this is true .
Soon after Ti@@ pper and I left the - ( Mo@@ ck so@@ b ) White House - ( Laughter ) we were driving from our home in Nashville to a little farm we have 50 miles east of Nashville . Dri@@ ving ourselves . ( Laughter )
( Laughter ) I looked in the re@@ ar-@@ view mirror an

- Binarize data

In [None]:
# Binarize the data for training

# map words appearing less than threshold times to unknown 
# reuse model dict
!fairseq-preprocess \
    --source-lang en \
    --target-lang fr \
    --trainpref sw.train.en-fr \
    --validpref sw.dev.en-fr \
    --testpref sw.tst.en-fr \
    --srcdict /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/dict.en.txt \
    --tgtdict /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/dict.fr.txt \
    --destdir data-bin \
    --thresholdtgt 0 \
    --thresholdsrc 0 \
    --workers 8

In [32]:
# Binarize the data for test

# map words appearing less than threshold times to unknown 
# reuse model dict; solution for dictionary size to match test dataset
!fairseq-preprocess \
    --source-lang en \
    --target-lang fr \
    --trainpref sw.train.en-fr \
    --validpref sw.dev.en-fr \
    --testpref sw.tst.en-fr \
    --srcdict /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/dict.en.txt \
    --tgtdict /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/dict.fr.txt \
    --destdir data-bin-test \
    --thresholdtgt 0 \
    --thresholdsrc 0 \
    --workers 8

2023-03-08 14:35:14 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin-test', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='en', srcdict='/exp

# Translation

In [33]:
%cd $PATH

/export/data4/vzhekova/biases-data/En-Fr_MuST-C


- Beam search

In [34]:
# Generate translations
!fairseq-generate data-bin-test  \
    --task translation \
    --source-lang en \
    --target-lang fr \
    --path /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/model.pt \
    --beam 5 \
    --batch-size 256 \
    --memory-efficient-fp16 \
    --remove-bpe=subword_nmt > en-fr.decode.log

2023-03-08 14:38:05 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': 

# Evaluation

In [35]:
# Extract the hypotheses and references from the decoding log file
!grep ^H en-fr.decode.log | sed 's/^H-//g' | cut -f 3 | sed 's/ @@//g' > ./hyp.txt
!grep ^T en-fr.decode.log | sed 's/^T-//g' | cut -f 2 | sed 's/ @@//g' > ./ref.txt

!head ./hyp.txt
print("..........")
!head ./ref.txt

C&apos; est pour compléter l&apos; expérience .
lane : Je peux vous aider .
Man : Cause , ma marque ?
Pourquoi est @-@ il important d&apos; être prudent ?
Quelles sont les autres marques comme ça ?
De Seattle ... après une semaine .
Vous pouvez donc avoir ce nuage .
Vous pouvez zoomer très simplement .
Et apparemment , il était assez populaire .
Je vous remercie de votre attention .
..........
C &apos;est pour compléter l <<unk>> .
<<unk>> : Je peux vous aider .
Man : <<unk>> , ma marque ?
Pourquoi <<unk>> important pour redémarrer ?
Quelles autres marques sont comme ça ?
De Seattle ... après une semaine .
Alors vous pouvez avoir ce <<unk>> .
Vous pouvez <<unk>> dessus très simplement .
<<unk>> , c <<unk>> plutôt populaire .
Mark Bezos : Merci .


In [36]:
# Detokenize text        

md_en = MosesDetokenizer(lang='en')
md_fr = MosesDetokenizer(lang='fr')

with open('hyp.txt', encoding='utf8') as fin, open('hyp_detok.txt','w', encoding='utf8') as fout:
    for line in fin:
        tokens = md_en.detokenize(line.split(), return_str=True)
        print(tokens, end='\n', file=fout)
        
with open('ref.txt', encoding='utf8') as fin, open('ref_detok.txt','w', encoding='utf8') as fout:
    for line in fin:
        tokens = md_fr.detokenize(line.split(), return_str=True)
        print(tokens, end='\n', file=fout)

print('Finished detokenizing.')

Finished detokenizing.


In [37]:
!head ./hyp_detok.txt
print("..........")
!head ./ref_detok.txt

C' est pour compléter l' expérience.
lane: Je peux vous aider.
Man: Cause, ma marque?
Pourquoi est-il important d' être prudent?
Quelles sont les autres marques comme ça?
De Seattle... après une semaine.
Vous pouvez donc avoir ce nuage.
Vous pouvez zoomer très simplement.
Et apparemment, il était assez populaire.
Je vous remercie de votre attention.
..........
C 'est pour compléter l <<unk>>.
<<unk>> : Je peux vous aider.
Man : <<unk>>, ma marque ?
Pourquoi <<unk>> important pour redémarrer ?
Quelles autres marques sont comme ça ?
De Seattle... après une semaine.
Alors vous pouvez avoir ce <<unk>>.
Vous pouvez <<unk>> dessus très simplement.
<<unk>>, c <<unk>> plutôt populaire.
Mark Bezos : Merci.


In [38]:
# Evaluate the model
# BLEU score of 20.3 (beam=5)
!cat ./hyp_detok.txt | sacrebleu ./ref_detok.txt

{
 "name": "BLEU",
 "score": 20.3,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1",
 "verbose_score": "58.4/35.8/24.3/16.6 (BP = 0.668 ratio = 0.713 hyp_len = 57927 ref_len = 81282)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.3.1"
}
[0m

# Finetuning WMT14 En-Fr model on MuST-C

In [None]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin \
    --arch transformer_vaswani_wmt_en_fr_big --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 5 \
    --finetune-from-model /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/model.pt	

2023-03-08 17:16:03 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': N