# En-Fr Translation
Generate translations for the MuST-C dataset with the pretrained WMT14 En-Fr Transformer model, then fine-tune that model on MuST-C and evaluate it again.

In [1]:
import torch

# Check if we can connect to the GPU with PyTorch
if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(device))
else:
    print('Failed to find GPU. Will use CPU.')
    device = 'cpu'

Current device: GeForce GTX 1080 Ti


In [2]:
PATH="/export/data4/vzhekova/biases-data/En-Fr_MuST-C"

%cd $PATH

/export/data4/vzhekova/biases-data/En-Fr_MuST-C


In [3]:
!echo -e "\nFirst lines of English:\n"
!head tst.en-fr.en
!echo -e "\nFirst lines of French:\n"
!head tst.en-fr.fr


First lines of English:

Back in New York, I am the head of development for a non-profit called Robin Hood.
When I'm not fighting poverty, I'm fighting fires as the assistant captain of a volunteer fire company.
Now in our town, where the volunteers supplement a highly skilled career staff, you have to get to the fire scene pretty early to get in on any action.
I remember my first fire.
I was the second volunteer on the scene, so there was a pretty good chance I was going to get in.
But still it was a real footrace against the other volunteers to get to the captain in charge to find out what our assignments would be.
When I found the captain, he was having a very engaging conversation with the homeowner, who was surely having one of the worst days of her life.
Here it was, the middle of the night, she was standing outside in the pouring rain, under an umbrella, in her pajamas, barefoot, while her house was in flames.
The other volunteer who had arrived just before me — let's call him 

# Data preprocessing

- Tokenization

In [4]:
# Tokenize text

from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesDetokenizer

mpn = MosesPunctNormalizer()

# Normalize punctuation and tokenize every sentence of the train/dev/test sets
for partition in ["train", "dev", "tst"]:
    for lang in ["fr", "en"]:
        mt = MosesTokenizer(lang=lang)
        with open(f"{partition}.en-fr.{lang}", encoding='utf8') as fin, \
             open(f"tok.{partition}.en-fr.{lang}", 'w', encoding='utf8') as fout:
            for line in fin:
                tokens = mt.tokenize(mpn.normalize(line), return_str=True)
                print(tokens, file=fout)

        

print('Finished tokenizing.')

Finished tokenizing.
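
The tokenization loop writes one output line per input line, and the later binarization step expects the English and French files of each partition to stay parallel. A minimal sanity check, assuming the `tok.<partition>.en-fr.<lang>` naming used above (the helper names `count_lines` and `line_count_report` are hypothetical), could look like this:

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf8") as f:
        return sum(1 for _ in f)


def line_count_report(directory=".", prefix="tok.", partitions=("train", "dev", "tst")):
    """Return {partition: (n_en, n_fr)} for every partition whose two files exist."""
    report = {}
    for partition in partitions:
        en = Path(directory) / f"{prefix}{partition}.en-fr.en"
        fr = Path(directory) / f"{prefix}{partition}.en-fr.fr"
        if en.exists() and fr.exists():
            report[partition] = (count_lines(en), count_lines(fr))
    return report


# Partitions whose files are missing are skipped silently.
for partition, (n_en, n_fr) in line_count_report().items():
    status = "ok" if n_en == n_fr else "MISMATCH"
    print(f"{partition}: en={n_en} fr={n_fr} {status}")
```

The same check can be repeated with `prefix="sw."` after the subword step.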


- Subword tokenization

In [25]:
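# Training subword model
# Disabled: the next cell applies an existing `bpecodes` file
# instead of BPE codes learned on MuST-C.

#!subword-nmt learn-bpe -s 32000 < tok.train.en-fr.en > sw.model.en
#!subword-nmt learn-bpe -s 32000 < tok.train.en-fr.fr > sw.model.fr

#print('Finished subword training.')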

100%|#####################################| 32000/32000 [08:16<00:00, 64.48it/s]
100%|#####################################| 32000/32000 [08:01<00:00, 66.39it/s]
Finished subword.


In [5]:
# Applying subword model

for partition in ["train", "dev", "tst"]:
    for lang in ["fr", "en"]:
        # BPE codes learned on MuST-C; only used by the disabled command below
        sw = f"sw.model.{lang}"
        fin = f"tok.{partition}.en-fr.{lang}"
        fout = f"sw.{partition}.en-fr.{lang}"
        #!subword-nmt apply-bpe -c $sw < $fin > $fout
        !subword-nmt apply-bpe -c bpecodes < $fin > $fout

print('Applied subword model.')

Applied subword model.


In [6]:
!echo -e "\nFirst lines of tokenized English:\n"
!head sw.train.en-fr.en

!echo -e "\nFirst lines of tokenized French:\n"
!head sw.train.en-fr.fr


First lines of tokenized English:

Thank you so much , Chris . And it &apos;s truly a great hon@@ or to have the opportunity to come to this stage twice ; I &apos;m extremely grateful .
I have been b@@ low@@ n away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
And I say that sincerely , partly because ( M@@ ock so@@ b ) I need that . ( L@@ aughter )
( L@@ aughter ) I fle@@ w on Air Force Two for eight years .
( L@@ aughter ) Now I have to take off my shoes or boo@@ ts to get on an air@@ plane ! ( L@@ aughter ) ( Appl@@ ause )
I &apos;ll tell you one quick story to illustrate what that &apos;s been like for me . ( L@@ aughter )
It &apos;s a true story - every bit of this is true .
So@@ on after T@@ ip@@ per and I left the - ( M@@ ock so@@ b ) White House - ( L@@ aughter ) we were driving from our home in N@@ ash@@ ville to a little farm we have 50 miles east of N@@ ash@@ ville . Dri@@ ving ourselves . ( L@@ augh
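
In the output above, `@@` at the end of a piece marks a word that BPE has split, so `hon@@ or` stands for `honor`. The sketch below shows how such a segmentation can be reversed, assuming the usual subword-nmt convention of `@@` followed by a space; `remove_bpe` is a hypothetical helper written for illustration, not part of subword-nmt or fairseq.

```python
import re

# "@@" followed by a space joins a piece to the next one; a dangling "@@"
# at the end of a line (for example in truncated text) is dropped as well.
BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line: str) -> str:
    """Join BPE pieces back into whole words."""
    return BPE_MARKER.sub("", line)


segmented = "And it &apos;s truly a great hon@@ or to have the opportunity"
print(remove_bpe(segmented))
```

Moses escapes such as `&apos;` are left untouched here; they belong to the tokenization step and are undone by detokenization.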

- Binarize data

In [7]:
# Binarize the data for training and generation,
# reusing the pretrained model's dictionaries (--srcdict / --tgtdict)
!fairseq-preprocess \
    --source-lang en \
    --target-lang fr \
    --trainpref sw.train.en-fr \
    --validpref sw.dev.en-fr \
    --testpref sw.tst.en-fr \
    --srcdict /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/dict.en.txt \
    --tgtdict /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/dict.fr.txt \
    --destdir data-bin \
    --workers 8

2023-03-21 16:43:41 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='en', srcdict='/export/d

# Translation

In [8]:
%cd $PATH

/export/data4/vzhekova/biases-data/En-Fr_MuST-C


- Beam search

In [10]:
# Generate translations
!fairseq-generate data-bin  \
    --task translation \
    --source-lang en \
    --target-lang fr \
    --path /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/model.pt \
    --beam 5 \
    --batch-size 256 \
    --memory-efficient-fp16 \
    --remove-bpe=subword_nmt > en-fr.decode.log

2023-03-21 16:46:33 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': 

# Evaluation

In [11]:
# Extract the hypotheses and references from the decoding log file.
# --remove-bpe already joined the subwords; the final sed on "@@ " is only a safeguard.
!grep ^H en-fr.decode.log | sed 's/^H-//g' | cut -f 3 | sed 's/@@ //g' > ./hyp.txt
!grep ^T en-fr.decode.log | sed 's/^T-//g' | cut -f 2 | sed 's/@@ //g' > ./ref.txt

!head ./hyp.txt
print("..........")
!head ./ref.txt

Troisième technique : la catégorisation .
Et c&apos; est tout . Merci .
Vous pouvez aller jusqu&apos; au bout .
Maintenant , c&apos; est une des images .
Réfléchissons donc aux atomes .
Voici leur couloir d&apos; embâcle .
Et elle frappe un 257 .
Mais que faisons @-@ nous ?
Mark Bezos : Merci .
Jetez un coup d&apos; oeil sur ce qu&apos; elle fait .
..........
Troisième technique : la catégorisation .
Voilà tout . Merci .
On peut faire tout ça .
Voici l&apos; une des photos .
Alors , pensons aux atomes .
Voici leur rayon confiture .
Et elle frappe à 257 .
Mais nous en faisons quoi ?
Mark Bezos : Merci .
Regardez ce qu&apos; elle fait .
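
The extracted lines follow the order of the decoding log, not the order of `tst.en-fr.en`: the first hypothesis above is not a translation of the first test sentence. Hypotheses and references still line up with each other, which is all BLEU needs. If the original sentence order matters, the numeric id after `H-` and `T-` can be used for sorting. The sketch below assumes the tab-separated layout that the `cut` commands above rely on (`H-<id>`, score, text and `T-<id>`, text); `extract_hyp_ref` is a hypothetical helper and the scores in the sample are made up.

```python
from typing import Iterable, List, Tuple


def extract_hyp_ref(log_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Collect hypotheses and references, sorted by sample id."""
    hyps, refs = {}, {}
    for line in log_lines:
        line = line.rstrip("\n")
        if line.startswith("H-"):
            tag, _score, text = line.split("\t", 2)
            hyps[int(tag[2:])] = text
        elif line.startswith("T-"):
            tag, text = line.split("\t", 1)
            refs[int(tag[2:])] = text
    # Keep only ids that have both a hypothesis and a reference.
    ids = sorted(hyps.keys() & refs.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]


# Illustrative log excerpt; the scores are invented for the example.
sample_log = [
    "S-2\tThank you .",
    "T-2\tMerci .",
    "H-2\t-0.21\tMerci .",
    "S-0\tThat is all .",
    "T-0\tVoilà tout .",
    "H-0\t-0.53\tEt c' est tout .",
]
hyps, refs = extract_hyp_ref(sample_log)
print(hyps)
print(refs)
```

On the real log, `extract_hyp_ref(open('en-fr.decode.log', encoding='utf8'))` would return both lists in test-set order.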


In [26]:
# Detokenize text

md_fr = MosesDetokenizer(lang='fr')

# Detokenize hypotheses and references with the same French detokenizer
for name in ['hyp', 'ref']:
    with open(f'{name}.txt', encoding='utf8') as fin, open(f'{name}_detok.txt', 'w', encoding='utf8') as fout:
        for line in fin:
            tokens = md_fr.detokenize(line.split(), return_str=True)
            print(tokens, file=fout)

print('Finished detokenizing.')

Finished detokenizing.


In [27]:
!head ./hyp_detok.txt
print("..........")
!head ./ref_detok.txt

Troisième technique : la catégorisation.
Et c'est tout. Merci.
Vous pouvez aller jusqu'au bout.
Maintenant, c'est une des images.
Réfléchissons donc aux atomes.
Voici leur couloir d'embâcle.
Et elle frappe un 257.
Mais que faisons-nous ?
Mark Bezos : Merci.
Jetez un coup d'oeil sur ce qu'elle fait.
..........
Troisième technique : la catégorisation.
Voilà tout. Merci.
On peut faire tout ça.
Voici l'une des photos.
Alors, pensons aux atomes.
Voici leur rayon confiture.
Et elle frappe à 257.
Mais nous en faisons quoi ?
Mark Bezos : Merci.
Regardez ce qu'elle fait.


In [28]:
# Evaluate the pretrained model (beam=5)
# BLEU score of 20.3 before switching to bpecodes
# BLEU score of 44.5 when using bpecodes
!sacrebleu ./ref_detok.txt -i ./hyp_detok.txt

{
 "name": "BLEU",
 "score": 44.5,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1",
 "verbose_score": "69.8/50.0/38.2/29.5 (BP = 1.000 ratio = 1.002 hyp_len = 55620 ref_len = 55507)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.3.1"
}

# Finetuning WMT14 En-Fr model on MuST-C

In [15]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin \
    --arch transformer_vaswani_wmt_en_fr_big --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 5 \
    --finetune-from-model /export/data4/vzhekova/biases-data/En-Fr/wmt14.en-fr.joined-dict.transformer/model.pt

2023-03-21 16:50:36 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': N

epoch 004: 100%|▉| 2026/2027 [26:47<00:00,  1.24it/s, loss=3.868, nll_loss=2.186
2023-03-21 18:41:42 | INFO | fairseq_cli.train | begin validation on "valid" subset
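
The training command combines `--lr 5e-4`, `--lr-scheduler inverse_sqrt` and `--warmup-updates 4000`, and the progress bar shows 2027 updates per epoch. The sketch below estimates the learning rate at the end of each of the 5 epochs. It is an approximation under stated assumptions: linear warmup starting from 0 (which is what fairseq is understood to use when `--warmup-init-lr` is not set), an update counter that starts at 0 for fine-tuning, and a hypothetical function name `inverse_sqrt_lr`.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_updates:
        # Assumed: warmup is linear and starts from warmup_init_lr = 0.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5


updates_per_epoch = 2027  # taken from the progress bar above
for epoch in range(1, 6):
    step = epoch * updates_per_epoch
    print(f"end of epoch {epoch}: update {step:5d}, lr = {inverse_sqrt_lr(step):.2e}")
```

Under these assumptions the peak rate is reached shortly before the end of epoch 2, and the rate has decayed to roughly 3.1e-4 by the end of epoch 5.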

- Translation

In [16]:
%cd $PATH

/export/data4/vzhekova/biases-data/En-Fr_MuST-C


In [17]:
# Generate translations
!fairseq-generate data-bin  \
    --task translation \
    --source-lang en \
    --target-lang fr \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 \
    --batch-size 256 \
    --memory-efficient-fp16 \
    --remove-bpe=subword_nmt > en-fr.decode_finetuned.log

2023-03-21 20:15:06 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': 

- Evaluation

In [18]:
# Extract the hypotheses and references from the decoding log file.
# --remove-bpe already joined the subwords; the final sed on "@@ " is only a safeguard.
!grep ^H en-fr.decode_finetuned.log | sed 's/^H-//g' | cut -f 3 | sed 's/@@ //g' > ./hyp_finetuned.txt
!grep ^T en-fr.decode_finetuned.log | sed 's/^T-//g' | cut -f 2 | sed 's/@@ //g' > ./ref_finetuned.txt

!head ./hyp_finetuned.txt
print("..........")
!head ./ref_finetuned.txt

La troisième technique : catégorisation .
Et c&apos; est tout . Merci .
Vous pouvez faire tout le chemin .
Voici une des images .
Réfléchissons aux atomes .
Voici leur jam aisle .
Et elle touche une 257ème .
Mais que faites-nous avec ?
Mark Bezos : Merci .
Regardez ce qu&apos; elle fait .
..........
Troisième technique : la catégorisation .
Voilà tout . Merci .
On peut faire tout ça .
Voici l&apos; une des photos .
Alors , pensons aux atomes .
Voici leur rayon confiture .
Et elle frappe à 257 .
Mais nous en faisons quoi ?
Mark Bezos : Merci .
Regardez ce qu&apos; elle fait .


In [29]:
# Detokenize text

md_fr = MosesDetokenizer(lang='fr')

# Detokenize hypotheses and references with the same French detokenizer
for name in ['hyp_finetuned', 'ref_finetuned']:
    with open(f'{name}.txt', encoding='utf8') as fin, open(f'{name}_detok.txt', 'w', encoding='utf8') as fout:
        for line in fin:
            tokens = md_fr.detokenize(line.split(), return_str=True)
            print(tokens, file=fout)

print('Finished detokenizing.')

Finished detokenizing.


In [30]:
# Evaluate the fine-tuned model (beam=5)
# BLEU score of 33.8 before switching to bpecodes
# BLEU score of 45.9 after switching to bpecodes
!sacrebleu ./ref_finetuned_detok.txt -i ./hyp_finetuned_detok.txt

{
 "name": "BLEU",
 "score": 45.9,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1",
 "verbose_score": "72.3/52.9/40.7/31.7 (BP = 0.975 ratio = 0.975 hyp_len = 54117 ref_len = 55507)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.3.1"
}
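
The `verbose_score` fields of the two sacrebleu reports are enough to recompute both scores and to see where the gain comes from. The sketch below uses the standard BLEU definition (brevity penalty times the geometric mean of the 1- to 4-gram precisions) with the numbers printed in the reports; the function names are hypothetical, and because the precisions are rounded to one decimal the result is only approximate.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """BP = 1 if the hypothesis is at least as long as the reference, else exp(1 - ref/hyp)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_stats(precisions_pct, hyp_len, ref_len):
    """Recompute corpus BLEU from n-gram precisions (in percent) and corpus lengths."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


# Numbers copied from the two verbose_score lines in this notebook.
baseline = bleu_from_stats([69.8, 50.0, 38.2, 29.5], hyp_len=55620, ref_len=55507)
finetuned = bleu_from_stats([72.3, 52.9, 40.7, 31.7], hyp_len=54117, ref_len=55507)

print(f"pretrained: BP = {brevity_penalty(55620, 55507):.3f}, BLEU ~ {baseline:.1f}")
print(f"fine-tuned: BP = {brevity_penalty(54117, 55507):.3f}, BLEU ~ {finetuned:.1f}")
```

The recomputed values match the reported 44.5 and 45.9 to within rounding. The fine-tuned model has higher precision at every n-gram order, but its output is about 2.5% shorter than the reference, so the brevity penalty of 0.975 offsets part of that improvement.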