In [11]:
import torch

# check if we can connect to the GPU with PyTorch
if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(device))
else:
    print('Failed to find GPU. Will use CPU.')
    device = 'cpu'

Current device: GeForce GTX 1080 Ti


In [15]:
%cd /home/vzhekova/fairseq/examples/translation

/home/vzhekova/fairseq/examples/translation


In [23]:
# List files in downloaded `sample_data`
!ls -ltr sample_data

!echo -e "\nFirst lines of English:\n"
!head sample_data/train.de-en.en
!echo -e "\nFirst lines of German:\n"
!head sample_data/train.de-en.de

total 48856
-rw-r--r-- 1 vzhekova input   199632 Okt 24 16:30 dev.de-en.en
-rw-r--r-- 1 vzhekova input   223089 Okt 24 16:30 dev.de-en.de
-rw-r--r-- 1 vzhekova input   418647 Okt 24 16:30 tst.de-en.en
-rw-r--r-- 1 vzhekova input   473219 Okt 24 16:30 tst.de-en.de
-rw-r--r-- 1 vzhekova input 18496311 Okt 24 16:30 train.de-en.de
-rw-r--r-- 1 vzhekova input 16663248 Okt 24 16:30 train.de-en.en
-rw-r--r-- 1 vzhekova input 13535860 Okt 24 16:37 data.zip

First lines of English:

It can be a very complicated thing, the ocean.
And it can be a very complicated thing, what human health is.
And bringing those two together might seem a very daunting task, but what I'm going to try to say is that even in that complexity, there's some simple themes that I think, if we understand, we can really move forward.
And those simple themes aren't really themes about the complex science of what's going on, but things that we all pretty well know.
And I'm going to start with this one: If momma ain't happy, ai

In [24]:
import sentencepiece as spm

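# Train a joint subword model on the English and German training data.
# Note: "bpe" is only the output file prefix. model_type is left at its default,
# so a unigram model is trained (see model_type in the training log below).
spm.SentencePieceTrainer.train(input="sample_data/train.de-en.en,sample_data/train.de-en.de",
                               model_prefix="bpe",
                               vocab_size=10000)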

print('Finished training sentencepiece model.')

Finished training sentencepiece model.


sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: sample_data/train.de-en.en
  input: sample_data/train.de-en.de
  input_format: 
  model_prefix: bpe
  model_type: UNIGRAM
  vocab_size: 10000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_pref

In [28]:
# Load the trained sentencepiece model
spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# Segment the train/dev/test sets into subwords, one sentence per line
for partition in ["train", "dev", "tst"]:
    for lang in ["de", "en"]:
        in_path = f"sample_data/{partition}.de-en.{lang}"
        out_path = f"sample_data/spm.{partition}.de-en.{lang}"
        # Explicit UTF-8 so umlauts and the "▁" marker are handled the same on any platform
        with open(in_path, "r", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for line in f_in:
                # Encode the sentence into a list of subword pieces
                pieces = spm_model.encode(line.strip(), out_type=str)
                # Write the pieces as one space-separated line
                f_out.write(" ".join(pieces) + "\n")
        
print('Finished.')

Finished.
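
fairseq-preprocess pairs source and target sentences by line number, so it can be worth confirming that each segmented file pair still has the same number of lines. The following is a minimal sketch, assuming the `spm.<partition>.de-en.<lang>` naming used above; `count_lines` and `check_parallel` are hypothetical helper names, not part of fairseq or sentencepiece.

```python
from pathlib import Path

def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it fully into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_parallel(data_dir, partitions=("train", "dev", "tst"), pair="de-en", prefix="spm."):
    """Return {partition: (n_src_lines, n_tgt_lines)} for pairs whose line counts differ."""
    src, tgt = pair.split("-")
    mismatches = {}
    for partition in partitions:
        base = Path(data_dir) / f"{prefix}{partition}.{pair}"
        n_src = count_lines(f"{base}.{src}")
        n_tgt = count_lines(f"{base}.{tgt}")
        if n_src != n_tgt:
            mismatches[partition] = (n_src, n_tgt)
    return mismatches
```

Calling `check_parallel("sample_data")` should return an empty dict if every pair is still line-aligned.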


In [29]:
!ls -ltr sample_data

!echo -e "\nFirst lines of tokenized English:\n"
!head sample_data/spm.train.de-en.en
!echo -e "\nFirst lines of tokenized German:\n"
!head sample_data/spm.train.de-en.de

total 91812
-rw-r--r-- 1 vzhekova input   199632 Okt 24 16:30 dev.de-en.en
-rw-r--r-- 1 vzhekova input   223089 Okt 24 16:30 dev.de-en.de
-rw-r--r-- 1 vzhekova input   418647 Okt 24 16:30 tst.de-en.en
-rw-r--r-- 1 vzhekova input   473219 Okt 24 16:30 tst.de-en.de
-rw-r--r-- 1 vzhekova input 18496311 Okt 24 16:30 train.de-en.de
-rw-r--r-- 1 vzhekova input 16663248 Okt 24 16:30 train.de-en.en
-rw-r--r-- 1 vzhekova input 28468890 Nov 12 15:55 spm.train.de-en.de
-rw-r--r-- 1 vzhekova input 26971608 Nov 12  2022 spm.train.de-en.en
-rw-r--r-- 1 vzhekova input   343454 Nov 12  2022 spm.dev.de-en.de
-rw-r--r-- 1 vzhekova input   321479 Nov 12  2022 spm.dev.de-en.en
-rw-r--r-- 1 vzhekova input   728203 Nov 12  2022 spm.tst.de-en.de
-rw-r--r-- 1 vzhekova input   678280 Nov 12  2022 spm.tst.de-en.en

First lines of tokenized English:

▁It ▁can ▁be ▁a ▁very ▁complicated ▁thing , ▁the ▁ocean .
▁And ▁it ▁can ▁be ▁a ▁very ▁complicated ▁thing , ▁what ▁human ▁health ▁is .
▁And ▁bring ing ▁those ▁two ▁t

In [43]:
# Directory containing the subword-segmented files
TEXT="/home/vzhekova/fairseq/examples/translation/sample_data"
!echo $TEXT
# Build the dictionaries and binarize the data (source: English, target: German)
!fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/spm.train.de-en \
    --validpref $TEXT/spm.dev.de-en \
    --testpref $TEXT/spm.tst.de-en \
    --destdir data-bin/iwslt14.de-en \
    --thresholdtgt 0 --thresholdsrc 0 \
    --workers 8

/home/vzhekova/fairseq/examples/translation/sample_data
2022-11-12 16:46:42 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/iwslt14.de-en', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_loggi

In [51]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
    /home/vzhekova/fairseq/examples/translation/data-bin/iwslt14.de-en \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 10 \
    --reset-optimizer

2022-11-12 19:13:02 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': N
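
To get a feel for what `--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000` implies, the schedule can be sketched in a few lines. This is an illustration under stated assumptions, not fairseq's own code: `inverse_sqrt_lr` is a hypothetical helper, and it assumes the warmup starts from a learning rate of 0, which may not match the default of every fairseq version.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update: linear warmup, then inverse-sqrt decay."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr (assumed 0 here) up to peak_lr
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # After warmup, the rate decays proportionally to 1 / sqrt(step)
    return peak_lr * (warmup_updates / step) ** 0.5

for step in (1000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the rate peaks at update 4000 and has halved again by update 16000, four times the warmup length.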

In [52]:
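# Source side of the test set (English, since we translate en -> de).
# Not used below: fairseq-generate reads the binarized test set from data-bin.
TEST_INPUT="/home/vzhekova/fairseq/examples/translation/sample_data/spm.tst.de-en.en"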
PRED_LOG="/home/vzhekova/fairseq/examples/translation/en-de.decode.log"

!fairseq-generate /home/vzhekova/fairseq/examples/translation/data-bin/iwslt14.de-en \
      --task translation \
      --source-lang en \
      --target-lang de \
      --path /home/vzhekova/fairseq/examples/translation/checkpoints/checkpoint_best.pt \
      --batch-size 256 \
      --beam 4 \
      --remove-bpe=sentencepiece > $PRED_LOG

2022-11-13 12:58:01 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name'
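
The `--remove-bpe=sentencepiece` flag turns the space-separated pieces back into plain text. A minimal sketch of that post-processing, assuming the piece format shown in the tokenized files above; `remove_sentencepiece` is a hypothetical name for illustration, not a fairseq function.

```python
def remove_sentencepiece(line):
    """Undo space-separated SentencePiece segmentation.

    Drop the spaces between pieces, then turn each word-boundary
    marker "▁" (U+2581) back into a regular space.
    """
    return line.replace(" ", "").replace("\u2581", " ").strip()

print(remove_sentencepiece("▁And ▁bring ing ▁those ▁two"))
```

Applied to the first tokenized English line, this recovers the original sentence from the training data.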

In [53]:
# Extract the hypotheses (H-) and references (T-) from the decoding log.
# H lines are "H-<id> <TAB> <score> <TAB> <text>"; T lines are "T-<id> <TAB> <text>".
# The subword markers were already removed by --remove-bpe=sentencepiece.
!grep '^H-' $PRED_LOG | cut -f 3 > ./hyp.txt
!grep '^T-' $PRED_LOG | cut -f 2 > ./ref.txt
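
The same extraction can be done in Python, which makes the tab-separated layout of the log explicit. This is a sketch, assuming H lines carry id, score and text and T lines carry id and text; `parse_generate_log` is a hypothetical helper, and the toy lines below use made-up ids and scores. Unlike the shell pipeline, it orders the sentences by id, which does not affect corpus-level BLEU as long as hypotheses and references stay paired.

```python
def parse_generate_log(lines):
    """Collect hypotheses and references, keyed by sentence id, from generate-log lines."""
    hyps, refs = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, _, sent_id = fields[0].partition("-")
        if not sent_id.isdigit():
            continue  # skip INFO lines and anything that is not "<tag>-<id>"
        if tag == "H" and len(fields) >= 3:
            hyps[int(sent_id)] = fields[2]  # H-<id> \t <score> \t <text>
        elif tag == "T" and len(fields) >= 2:
            refs[int(sent_id)] = fields[1]  # T-<id> \t <text>
    # Keep only ids that have both a hypothesis and a reference
    ids = sorted(hyps.keys() & refs.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]

# Toy lines in the H/T layout; ids and scores are made up for illustration
toy_log = [
    "T-7\tUnd er sagt...",
    "H-7\t-0.5\tUnd er sagt.",
]
print(parse_generate_log(toy_log))
```

With the real log, the lines would come from `open(PRED_LOG, encoding="utf-8")`.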

In [54]:
!head ./hyp.txt
print("..........")
!head ./ref.txt

Und er sagt.
Sie sind wieder wieder wieder.
Wir nehmen es aus.
Und ich dachte darüber nach.
Was wollen diese Menschen sein?
Ich wollte sie unterstützen.
"Ja", sagte er.
Wer macht das Beste?
Warum stimmt das?
Es macht sie gut."
..........
Und er sagt...
Bewerte sie wieder.
Wir heben ab.
Ich dachte darüber nach.
Was brauchen diese Menschen?
Ich wollte sie unterstützten.
"Ja", sagte er.
Wer ist am erfolgreichsten?
Und warum ist das so?
So fühlen sie sich gut."


In [55]:
# Evaluate the hypotheses against the references with sacreBLEU
!cat ./hyp.txt | sacrebleu ref.txt

{
 "name": "BLEU",
 "score": 15.0,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1",
 "verbose_score": "50.1/21.5/10.8/5.7 (BP = 0.933 ratio = 0.935 hyp_len = 80332 ref_len = 85901)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.3.1"
}
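
The numbers in `verbose_score` are enough to recompute the headline score by hand, which helps when reading the output: BLEU is the brevity penalty times the geometric mean of the 1- to 4-gram precisions. A small sketch using only the values printed above; because the precisions are rounded to one decimal, the result agrees with the reported score only up to rounding.

```python
import math

# Values copied from the "verbose_score" field above
precisions = [50.1, 21.5, 10.8, 5.7]  # 1- to 4-gram precisions, in percent
hyp_len, ref_len = 80332, 85901

# Brevity penalty: penalizes output that is shorter than the reference
bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)

# Geometric mean of the n-gram precisions (as fractions)
geo_mean = math.exp(sum(math.log(p / 100) for p in precisions) / len(precisions))
bleu = 100 * bp * geo_mean

print(f"ratio = {hyp_len / ref_len:.3f}, BP = {bp:.3f}, BLEU = {bleu:.1f}")
```

Here the hypotheses are about 6.5% shorter than the references, so the brevity penalty accounts for roughly one BLEU point.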