<a href="https://colab.research.google.com/github/sakhile-mabunda/Natural-Language-Processing/blob/master/char_starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Umsuka English - isiZulu Parallel Corpus

## Research question: How do tokenization strategies affect machine translation performance?

In [23]:
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')

Mounted at /content/drive


In [24]:
! rm -r joeynmt
! rm -r sample_data
! rm -r dev.bpe.en
! rm -r dev.bpe.zu
! rm -r dev.en
! rm -r dev.zu
! rm -r test.bpe.en
! rm -r test.bpe.zu
! rm -r test.en
! rm -r test.zu
! rm -r train.bpe.en
! rm -r train.bpe.zu
! rm -r train.en
! rm -r train.zu
! rm -r en-zu.eval.csv?download=1
! rm -r en-zu.training.csv?download=1
! rm -r tokenizer-trained.json
! rm -r train.uni.en
! rm -r train.uni.zu
! rm -r dev.uni.en
! rm -r dev.uni.zu
! rm -r test.uni.en
! rm -r test.uni.zu
! rm -r train.char.en
! rm -r train.char.zu
! rm -r test.char.zu
! rm -r test.char.en
! rm -r m.vocab
! rm -r m.model
! rm -r dev.char.en
! rm -r dev.char.zu


rm: cannot remove 'sample_data': No such file or directory
rm: cannot remove 'en-zu.eval.csv?download=1': No such file or directory
rm: cannot remove 'en-zu.training.csv?download=1': No such file or directory
rm: cannot remove 'tokenizer-trained.json': No such file or directory
rm: cannot remove 'train.uni.en': No such file or directory
rm: cannot remove 'train.uni.zu': No such file or directory
rm: cannot remove 'dev.uni.en': No such file or directory
rm: cannot remove 'dev.uni.zu': No such file or directory
rm: cannot remove 'test.uni.en': No such file or directory
rm: cannot remove 'test.uni.zu': No such file or directory
rm: cannot remove 'm.vocab': No such file or directory
rm: cannot remove 'm.model': No such file or directory


In [25]:
import os

source_language = 'en'
target_language = 'zu' 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = 'baseline' # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# This will save it to a folder in our gdrive instead!
!mkdir -p "/content/drive/My Drive/nlp-project/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/My Drive/nlp-project/%s-%s-%s" % (source_language, target_language, tag)

In [26]:
!echo $gdrive_path

/content/drive/My Drive/nlp-project/en-zu-baseline


In [27]:
from google.colab import files

files.upload()

# ! wget https://zenodo.org/record/5035171/files/en-zu.training.csv?download=1

# ! wget https://zenodo.org/record/5035171/files/en-zu.eval.csv?download=1

Saving en-zu.training.csv to en-zu.training (1).csv
Saving en-zu.eval.csv to en-zu.eval (1).csv




In [28]:
import pandas as pd

train = pd.read_csv('en-zu.training.csv')

eval_df = pd.read_csv('en-zu.eval.csv')

## Pre-processing and export

It is generally a good idea to remove duplicate and conflicting translations from the corpus. In practice, public corpora usually contain some of both, so they need to be cleaned.

In addition, we split the data into train/dev/test sets and export them to the filesystem.

In [29]:
# drop duplicate translations
train = train.drop_duplicates()

# Shuffle the data to remove bias in dev set selection.
train = train.sample(frac=1, random_state=seed).reset_index(drop=True)
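
The cell above removes exact duplicate rows only. As a minimal sketch of how conflicting translations could also be handled, the toy example below drops every source sentence that maps to more than one distinct target. The toy data and the "drop all conflicting rows" policy are assumptions for illustration; the `en`/`zu` column names follow this notebook.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical sentences).
toy = pd.DataFrame({
    "en": ["hello", "hello", "thank you", "thank you", "goodbye"],
    "zu": ["sawubona", "sawubona", "ngiyabonga", "siyabonga", "hamba kahle"],
})

# Step 1: drop exact duplicate rows, as in the cell above.
toy = toy.drop_duplicates()

# Step 2 (assumed policy): a source sentence with more than one distinct
# target counts as conflicting, and all of its rows are dropped.
targets_per_source = toy.groupby("en")["zu"].transform("nunique")
clean = toy[targets_per_source == 1].reset_index(drop=True)

print(clean)
```

Dropping every conflicting pair is a strict choice, since legitimate alternative translations are removed too; keeping one translation per source would be a milder option.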

In [30]:
# Install fuzzy wuzzy to remove "almost duplicate" sentences in the
# test and training sets.
! pip install fuzzywuzzy
! pip install python-Levenshtein

import time
from fuzzywuzzy import process
import numpy as np
from os import cpu_count
from functools import partial
from multiprocessing import Pool

# reset the index of the training set after previous filtering
train.reset_index(drop=False, inplace=True)

# Remove samples from the training data set if they "almost overlap" with the
# samples in the test set.

# Filtering function. Adjust pad to narrow down the candidate matches to
# within a certain length of characters of the given sample.
def fuzzfilter(sample, candidates, pad):
  candidates = [x for x in candidates if len(x) <= len(sample)+pad and len(x) >= len(sample)-pad] 
  if len(candidates) > 0:
    return process.extractOne(sample, candidates)[1]
  else:
    return np.nan
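
`fuzzfilter` is defined above but not applied anywhere in this notebook. The sketch below shows one way such a filter could be used to drop training sentences that nearly match a test sentence. It swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy so that it runs without extra installs; the sentences, the `pad` value, and the threshold of 95 are hypothetical.

```python
from difflib import SequenceMatcher

def best_similarity(sample, candidates, pad=5):
    """Best 0-100 similarity between sample and candidates of similar length."""
    # Same length window as fuzzfilter: only compare strings within `pad` characters.
    close = [c for c in candidates if abs(len(c) - len(sample)) <= pad]
    if not close:
        return None
    return max(round(100 * SequenceMatcher(None, sample, c).ratio()) for c in close)

test_sentences = ["The weather is cold today.", "She reads a book."]
train_sentences = [
    "The weather is cold today!",    # near-duplicate of a test sentence
    "He is walking to the market.",  # unrelated
    "She reads a book.",             # exact duplicate of a test sentence
]

threshold = 95  # hypothetical cut-off; tune on real data
kept = []
for sentence in train_sentences:
    score = best_similarity(sentence, test_sentences)
    if score is None or score < threshold:
        kept.append(sentence)

print(kept)
```

On the full corpus this comparison is quadratic in the number of sentences, which is presumably why the cell above imports `Pool` and `partial` for parallel execution.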



In [31]:
# Split the evaluation data into dev and test sets, then write train/dev/test
# as parallel source/target files.
# The last num_dev_patterns rows of the evaluation set become the dev set;
# the remaining rows are the test set.

num_dev_patterns = 600

# lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:
    train["en"] = train["en"].str.lower()
    train["zu"] = train["zu"].str.lower()

# dev and test sets
dev = eval_df.tail(num_dev_patterns) 
test = eval_df.drop(eval_df.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in train.iterrows():
    src_file.write(row["en"]+"\n")
    trg_file.write(row["zu"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["en"]+"\n")
    trg_file.write(row["zu"]+"\n")

with open("test."+source_language, "w") as src_file, open("test."+target_language, "w") as trg_file:
  for index, row in test.iterrows():
    src_file.write(row["en"]+"\n")
    trg_file.write(row["zu"]+"\n")

# Doublecheck the format below. There should be no extra quotation marks or weird characters.
! head train.*
! head dev.*
! head test.*

==> train.dropout.bpe.en <==
T@@ od@@ ay, those using Pl@@ anc@@ k and c@@ os@@ m@@ ic b@@ ack@@ gr@@ ound data to ob@@ tain a val@@ u@@ e fo@@ r the H@@ ub@@ b@@ le con@@ st@@ ant get a f@@ ig@@ ure of 6@@ 7@@ .@@ 4 pl@@ u@@ s or min@@ us 0@@ .@@ 5. By contr@@ ast the local ap@@ pr@@ o@@ ach giv@@ es a f@@ ig@@ ur@@ e of 7@@ 3@@ .@@ 5 pl@@ us or min@@ us 1.@@ 4. Th@@ ese val@@ u@@ es re@@ pres@@ ent the two d@@ if@@ fer@@ ent val@@ u@@ es we have for the exp@@ ans@@ ion of the un@@ iver@@ s@@ e. (@@ Se@@ e "@@ A mat@@ ter of met@@ r@@ ic@@ s," bel@@ ow@@ .@@ )
S@@ l@@ ac@@ k@@ 's st@@ ock has no@@ w f@@ all@@ en nearly 20@@ % from its refer@@ ence pr@@ ice of $@@ 2@@ 4 on the day of its W@@ all Street d@@ eb@@ ut@@ .
A@@ ye@@ sh@@ a Sh@@ ro@@ f@@ f@@ 's latest Instagram ent@@ ry des@@ er@@ ves e@@ ver@@ y@@ on@@ e's att@@ ent@@ ion.
w@@ int@@ er we@@ ather al@@ er@@ ts from we@@ st V@@ ir@@ g@@ in@@ ia all the way up to M@@ ain@@ e.
M@@ r M@@ o@@ ise - who has been in power since 2@@ 
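
Besides inspecting the files with `head`, it can help to confirm that each source/target pair has the same number of lines, because a sentence containing an embedded newline silently shifts the alignment. The following is a small sketch of such a check; the helper names and the toy files are hypothetical.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def is_parallel(src_path, tgt_path):
    """True if both files have the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.en")
    tgt = os.path.join(tmp, "toy.zu")
    with open(src, "w", encoding="utf-8") as f:
        f.write("hello\nthank you\n")
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("sawubona\nngiyabonga\n")
    aligned = is_parallel(src, tgt)

    # One source sentence whose "translation" contains a newline:
    # the target file gains two lines, the source file only one.
    with open(src, "a", encoding="utf-8") as f:
        f.write("one sentence\n")
    with open(tgt, "a", encoding="utf-8") as f:
        f.write("line one\nline two\n")
    misaligned = not is_parallel(src, tgt)

print(aligned, misaligned)
```

In this notebook the same check could be run on pairs such as `train.en`/`train.zu` after the export cell.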



---


## Installation of JoeyNMT

[JoeyNMT](https://joeynmt.readthedocs.io) is a simple, minimalist NMT package.  

In [32]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt;pip3 install .

# Install PyTorch 1.10.0 with GPU (CUDA 10.2) support.
# ! pip uninstall torch
! pip install torch==1.10.0+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Cloning into 'joeynmt'...
remote: Enumerating objects: 3224, done.
remote: Counting objects: 100% (273/273), done.
remote: Compressing objects: 100% (143/143), done.
remote: Total 3224 (delta 155), reused 209 (delta 130), pack-reused 2951
Receiving objects: 100% (3224/3224), 8.18 MiB | 15.18 MiB/s, done.
Resolving deltas: 100% (2184/2184), done.
Processing /content/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done
  Created wheel for joeynmt: filename=joeynmt-1.3-py3-none-any.whl size=86029 sha256=f4ec9914a9add

In [33]:
from os import path
os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language

# Directory inside the JoeyNMT checkout where the prepared data will be copied.
os.environ["data_path"] = path.join("joeynmt", "data", source_language + target_language)

# 1. Preprocessing the data into sentencepiece char tokens

In [34]:
! pip install sentencepiece



In [35]:
import sentencepiece as spm

# Train a character-level sentencepiece model on both languages,
# load it as a segmenter, and encode each split file line by line.
spm.SentencePieceTrainer.train('--input=train.en,train.zu --model_prefix=m_char --vocab_size=2000 --model_type=char')

sp = spm.SentencePieceProcessor(model_file="m_char.model")

with open("train.en","r") as r_train_en, open("train.char."+source_language,"w") as w_train_char_en, \
open("train.zu","r") as r_train_zu, open("train.char."+target_language,"w") as w_train_char_zu:
  for line in r_train_en:
    w_train_char_en.write(" ".join(sp.encode(line, out_type=str))+"\n")
  for line in r_train_zu:
    w_train_char_zu.write(" ".join(sp.encode(line, out_type=str))+"\n")

with open("dev.en","r") as r_dev_en, open("dev.char."+source_language,"w") as w_dev_char_en, \
open("dev.zu","r") as r_dev_zu, open("dev.char."+target_language,"w") as w_dev_char_zu:
  for line in r_dev_en:
    w_dev_char_en.write(" ".join(sp.encode(line, out_type=str))+"\n")
  for line in r_dev_zu:
    w_dev_char_zu.write(" ".join(sp.encode(line, out_type=str))+"\n")

with open("test.en","r") as r_test_en, open("test.char."+source_language,"w") as w_test_char_en, \
open("test.zu","r") as r_test_zu, open("test.char."+target_language,"w") as w_test_char_zu:
  for line in r_test_en:
    w_test_char_en.write(" ".join(sp.encode(line, out_type=str))+"\n")
  for line in r_test_zu:
    w_test_char_zu.write(" ".join(sp.encode(line, out_type=str))+"\n")
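
To make the output format of the cell above concrete, here is a pure-Python approximation of a character-level segmentation in the sentencepiece style, where whitespace is replaced by the marker `▁` and every character becomes one piece. It is an illustration only and not the sentencepiece implementation (it skips normalization and unknown-character handling); the function names are hypothetical.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the marker sentencepiece uses for whitespace

def char_pieces(sentence):
    """Approximate a char-level segmentation: one piece per character."""
    words = sentence.split()
    if not words:
        return []
    text = WORD_BOUNDARY + WORD_BOUNDARY.join(words)
    return list(text)

def join_pieces(pieces):
    """Invert char_pieces: concatenate and turn markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

pieces = char_pieces("Sawubona mhlaba")
line = " ".join(pieces)              # the space-separated form written to *.char.* files
restored = join_pieces(line.split(" "))

print(line)
print(restored)
```

Because the word boundary is carried by `▁`, the space-separated pieces can be mapped back to the original sentence without ambiguity.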


# 2. Preprocessing the Data into Subword BPE Tokens without dropout

In [36]:
# Do subword NMT

# Learn BPEs on the training data.
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

# Apply the BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt
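
The files written above mark every non-final subword with a trailing `@@`. As a short sketch, the segmentation can be reversed with a regular expression that removes these markers. The helper name `undo_bpe` is hypothetical; the example string is taken from the beginning of a `train.dropout.bpe.en` line shown in the `head` output earlier in this notebook.

```python
import re

def undo_bpe(line):
    """Remove '@@ ' continuation markers to restore the original words."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

segmented = "S@@ l@@ ac@@ k@@ 's st@@ ock has no@@ w f@@ all@@ en nearly 20@@ %"
print(undo_bpe(segmented))
```

This kind of post-processing is needed before model output can be compared against untokenized references.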


# 3. Preprocessing the Data into Subword BPE Tokens with dropout


In [37]:
# Do subword NMT
from os import path
os.environ['src'] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ['tgt'] = target_language

# Learn BPEs on the training data.
os.environ['data_path'] = path.join('joeynmt', 'data', source_language + target_language)
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

# Apply BPE to the training data with dropout (p=0.1); the development and
# test data are segmented without dropout.
! subword-nmt apply-bpe --dropout 0.1 -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.dropout.bpe.$src
! subword-nmt apply-bpe --dropout 0.1 -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.dropout.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.dropout.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.dropout.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.dropout.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.dropout.bpe.$tgt
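
The idea behind `--dropout` can be sketched in a few lines: while applying the learned merges to a word, each candidate merge is skipped with some probability, so the same word can be split differently each time it is seen during training. The code below is a simplified illustration and not subword-nmt's implementation (it ignores end-of-word markers and the vocabulary filter); the merge list and function name are hypothetical.

```python
import random

def apply_bpe(word, merges, dropout=0.0, rng=random):
    """Segment one word with ranked merges, skipping each candidate
    merge with probability `dropout`."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are known merges and survive dropout this round.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority (lowest-rank) surviving merge.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # toy merge table, highest priority first

print("@@ ".join(apply_bpe("lower", merges)))               # deterministic segmentation
print("@@ ".join(apply_bpe("lower", merges, dropout=1.0)))  # every merge dropped
print("@@ ".join(apply_bpe("lower", merges, dropout=0.1, rng=random.Random(42))))
```

With `dropout=0.0` the segmentation is deterministic, which matches how the dev and test files are produced in the cell above; only the training files are segmented stochastically.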

# 4. Preprocessing the data into sentencepiece word tokens

In [38]:
# Train a word-level sentencepiece model on both languages,
# load it as a segmenter, and encode each split file line by line.
spm.SentencePieceTrainer.train('--input=train.en,train.zu --model_prefix=m_word --vocab_size=2000 --model_type=word')

sp = spm.SentencePieceProcessor(model_file="m_word.model")

with open("train.en","r") as r_train_en, open("train.word."+source_language,"w") as w_train_word_en, \
open("train.zu","r") as r_train_zu, open("train.word."+target_language,"w") as w_train_word_zu:
  for line in r_train_en:
    w_train_word_en.write(" ".join(sp.encode(line, out_type=str))+"\n")
  for line in r_train_zu:
    w_train_word_zu.write(" ".join(sp.encode(line, out_type=str))+"\n")

with open("dev.en","r") as r_dev_en, open("dev.word."+source_language,"w") as w_dev_word_en, \
open("dev.zu","r") as r_dev_zu, open("dev.word."+target_language,"w") as w_dev_word_zu:
  for line in r_dev_en:
    w_dev_word_en.write(" ".join(sp.encode(line, out_type=str))+"\n")
  for line in r_dev_zu:
    w_dev_word_zu.write(" ".join(sp.encode(line, out_type=str))+"\n")

with open("test.en","r") as r_test_en, open("test.word."+source_language,"w") as w_test_word_en, \
open("test.zu","r") as r_test_zu, open("test.word."+target_language,"w") as w_test_word_zu:
  for line in r_test_en:
    w_test_word_en.write(" ".join(sp.encode(line, out_type=str))+"\n")
  for line in r_test_zu:
    w_test_word_zu.write(" ".join(sp.encode(line, out_type=str))+"\n")

In [39]:
# Create directory, move everything we care about to the correct location
! mkdir -p $data_path
! cp *.en $data_path
! cp *.zu $data_path
! cp bpe.codes.4000 $data_path
! ls $data_path

# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp *.en "$gdrive_path"
! cp *.zu "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.dropout.bpe.$src joeynmt/data/$src$tgt/train.dropout.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.dropout.txt
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.char.$src joeynmt/data/$src$tgt/train.char.$tgt --output_path joeynmt/data/$src$tgt/m_char.vocab
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.word.$src joeynmt/data/$src$tgt/train.word.$tgt --output_path joeynmt/data/$src$tgt/m_word.vocab

# Some output
! echo "BPE Zulu Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt

bpe.codes.4000	    dev.word.zu		 test.word.en	       train.en
dev.bpe.en	    dev.zu		 test.word.zu	       train.word.en
dev.bpe.zu	    test.bpe.en		 test.zu	       train.word.zu
dev.char.en	    test.bpe.zu		 train.bpe.en	       train.zu
dev.char.zu	    test.char.en	 train.bpe.zu	       vocab.en
dev.dropout.bpe.en  test.char.zu	 train.char.en	       vocab.zu
dev.dropout.bpe.zu  test.dropout.bpe.en  train.char.zu
dev.en		    test.dropout.bpe.zu  train.dropout.bpe.en
dev.word.en	    test.en		 train.dropout.bpe.zu
bpe.codes.4000	    test.bpe.en		 train.char.en
dev.bpe.en	    test.bpe.zu		 train.char.zu
dev.bpe.zu	    test.char.en	 train.dropout.bpe.en
dev.char.en	    test.char.zu	 train.dropout.bpe.zu
dev.char.zu	    test.dropout.bpe.en  train.en
dev.dropout.bpe.en  test.dropout.bpe.zu  train.word.en
dev.dropout.bpe.zu  test.en		 train.word.zu
dev.en		    test.word.en	 train.zu
dev.word.en	    test.word.zu	 vocab.en
dev.word.zu	    test.zu		 vocab.zu
dev.zu		    train.bpe.en
models		    tr
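
As a rough picture of what building a joint vocabulary involves, the sketch below counts whitespace-separated tokens across a source and a target file and lists them by descending frequency. It is an approximation for illustration and not the code of `build_vocab.py`; the function name, the tie-breaking rule, and the toy files are assumptions.

```python
import os
import tempfile
from collections import Counter

def build_joint_vocab(paths, min_freq=1):
    """Count tokens over all files and return them by descending frequency."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # Ties are broken alphabetically so that the order is reproducible.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, count in ordered if count >= min_freq]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train.bpe.en")
    tgt = os.path.join(tmp, "train.bpe.zu")
    with open(src, "w", encoding="utf-8") as f:
        f.write("lo@@ w lo@@ w er\n")
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("lo@@ ng er\n")
    vocab = build_joint_vocab([src, tgt])

print(vocab)
```

A single joint vocabulary is what allows the config further down to point `src_vocab` and `trg_vocab` at the same file and to tie the embeddings.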

# Creating the JoeyNMT Config

JoeyNMT requires a yaml config. We provide a template below. We've also set a number of defaults with it, that you may play with!

- We use the Transformer architecture.
- We set dropout reasonably high at 0.3 (recommended in [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021)).

Things worth playing with:
- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (the config below sets 2000; around 30 is enough to check that training works at all)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)

In [40]:
name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.char"
    dev:   "data/{name}/dev.char"
    test:  "data/{name}/test.char"
    level: "char"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/m_char.vocab"
    trg_vocab: "data/{name}/m_char.vocab"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 2000                     # TODO: Decrease when experimenting. Around 30 is sufficient to check whether training works at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
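
The config above combines `scheduling: "plateau"` with `patience: 5`, `decrease_factor: 0.7`, `learning_rate: 0.0003` and a very small `learning_rate_min`. The sketch below simulates how such a schedule behaves on a sequence of validation perplexities. It is a simplification for intuition only, not JoeyNMT's or PyTorch's scheduler; in particular, reducing only once the number of non-improving rounds exceeds `patience` is an assumption, and the scores are made up.

```python
def plateau_schedule(scores, lr=0.0003, patience=5, factor=0.7, lr_min=1e-8):
    """Return the learning rate after each validation round (lower score is better)."""
    best = float("inf")
    bad_rounds = 0
    history = []
    for score in scores:
        if score < best:
            best = score
            bad_rounds = 0
        else:
            bad_rounds += 1
        # Assumed rule: reduce once the stall lasts longer than `patience` rounds.
        if bad_rounds > patience:
            lr = max(lr * factor, lr_min)
            bad_rounds = 0
        history.append(lr)
    return history

# Four improving validations followed by six stalled ones (hypothetical values).
scores = [24.2, 16.7, 14.3, 12.1] + [12.5] * 6
history = plateau_schedule(scores)
print(history[-2], history[-1])
```

With `validation_freq: 1000`, a patience of 5 means the learning rate stays unchanged for several thousand updates after the last improvement, which is worth keeping in mind when shortening a run.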

# Train the Model

This single JoeyNMT command runs training with the config created above.

In [41]:
# Train the model

!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2021-11-24 11:11:04,617 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-11-24 11:11:04,644 - INFO - joeynmt.data - Loading training data...
2021-11-24 11:11:04,858 - INFO - joeynmt.data - Building vocabulary...
2021-11-24 11:11:04,860 - INFO - joeynmt.data - Loading dev data...
2021-11-24 11:11:04,871 - INFO - joeynmt.data - Loading test data...
2021-11-24 11:11:04,879 - INFO - joeynmt.data - Data loaded.
2021-11-24 11:11:04,879 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-11-24 11:11:05,118 - INFO - joeynmt.model - Enc-dec model built.
2021-11-24 11:11:07,655 - INFO - joeynmt.training - Total params: 11095296
2021-11-24 11:11:09,850 - INFO - joeynmt.helpers - cfg.name                           : enzu_transformer
2021-11-24 11:11:09,851 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-11-24 11:11:09,851 - INFO - joeynmt.helpers - cfg.data.trg                       : zu
2021-11-24 11:11:09,851 - INFO - joeynmt.helpers - cfg.data.t

In [42]:
# Copy the created models from the notebook storage to Google Drive for persistent storage
!mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

In [43]:
# Output our validation scores (loss, perplexity, BLEU)
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

Steps: 1000	Loss: 573333.18750	PPL: 24.20686	bleu: 0.00456	LR: 0.00030000	*
Steps: 2000	Loss: 506040.53125	PPL: 16.65345	bleu: 0.00461	LR: 0.00030000	*
Steps: 3000	Loss: 478393.96875	PPL: 14.28137	bleu: 0.00454	LR: 0.00030000	*
Steps: 4000	Loss: 448334.37500	PPL: 12.08399	bleu: 0.00461	LR: 0.00030000	*
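
The `Loss` and `PPL` columns above appear to be related by the usual definition of perplexity as the exponential of the average per-token loss. Assuming that `Loss` is the summed cross-entropy over the dev set (an assumption, not something the log states), the number of dev tokens implied by each row is `Loss / ln(PPL)`, and it should be the same for every row. The sketch below checks this with the values copied from the output above.

```python
import math

# (steps, loss, ppl) copied from the validation output above.
rows = [
    (1000, 573333.18750, 24.20686),
    (2000, 506040.53125, 16.65345),
    (3000, 478393.96875, 14.28137),
    (4000, 448334.37500, 12.08399),
]

implied = []
for steps, loss, ppl in rows:
    # If ppl = exp(loss / n_tokens), then n_tokens = loss / ln(ppl).
    n_tokens = loss / math.log(ppl)
    implied.append(n_tokens)
    print(steps, round(n_tokens))

spread = (max(implied) - min(implied)) / min(implied)
print(f"relative spread: {spread:.6f}")
```

All four rows imply roughly the same count, on the order of 180,000 tokens, which supports reading `PPL` as a per-token quantity while `Loss` is a sum.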


In [44]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2021-11-24 11:53:54,971 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-11-24 11:53:54,972 - INFO - joeynmt.data - Building vocabulary...
2021-11-24 11:53:54,972 - INFO - joeynmt.data - Loading dev data...
2021-11-24 11:53:54,985 - INFO - joeynmt.data - Loading test data...
2021-11-24 11:53:54,994 - INFO - joeynmt.data - Data loaded.
2021-11-24 11:53:55,016 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 3600
2021-11-24 11:53:55,016 - INFO - joeynmt.prediction - Loading model from models/enzu_transformer/4000.ckpt
2021-11-24 11:53:57,689 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-11-24 11:53:57,978 - INFO - joeynmt.model - Enc-dec model built.
2021-11-24 11:53:58,081 - INFO - joeynmt.prediction - Decoding on dev set (data/enzu/dev.char.zu)...
2021-11-24 12:01:46,991 - INFO - joeynmt.prediction -  dev bleu[13a]:   0.00 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-11-24 12:01:46,992 - INFO - joeynm