
# NMT-Keras tutorial
---

This notebook describes, step by step, how to build a neural machine translation model with NMT-Keras. The tutorial is organized in three sections:


1. Create a Dataset instance, in order to properly manage the data. 
2. Create and train the Neural Translation Model on the training data.
3. Apply the trained model on new (unseen) data.

The toolkit runs all of these steps automatically, but following them by hand is a good way to learn and understand the full process.


Let's start by installing the toolkit.

In [1]:
!pip install --upgrade pip
!pip uninstall -y keras  # Avoid crashes with pre-installed packages
!git clone https://github.com/lvapeab/nmt-keras
import os
os.chdir('nmt-keras')
!pip install -e .


Uninstalling Keras-2.3.1:
  Successfully uninstalled Keras-2.3.1
Cloning into 'nmt-keras'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 4736 (delta 0), reused 0 (delta 0), pack-reused 4730
Receiving objects: 100% (4736/4736), 5.69 MiB | 4.46 MiB/s, done.
Resolving deltas: 100% (3215/3215), done.
Obtaining file:///content/nmt-keras
Collecting keras@ https://github.com/MarcBS/keras/archive/master.zip
  Downloading https://github.com/Mar

## 1. Building a Dataset model
First, we create a [Dataset](https://github.com/MarcBS/multimodal_keras_wrapper/keras_wrapper/dataset.py) object (from the [Multimodal Keras Wrapper](https://github.com/MarcBS/multimodal_keras_wrapper) library). This object will be the interface between our data (text files) and the model:

In [2]:
from keras_wrapper.dataset import Dataset, saveDataset
from data_engine.prepare_data import keep_n_captions
ds = Dataset('tutorial_dataset', 'tutorial', silence=False)

Using TensorFlow backend.


Now that we have the empty dataset, we must indicate its inputs and outputs. In our case, we'll have two different inputs and one single output:

1. Outputs:
    - **target_text**: Sentences in our target language.

2. Inputs:
    - **source_text**: Sentences in the source language.
    - **state_below**: Sentences in the target language, but shifted one position to the right (for teacher-forcing training of the model).
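
To make the role of `state_below` concrete, here is a minimal sketch of teacher forcing in plain Python. It is independent of the Dataset class: the `shift_right` helper and the `'<bos>'` padding symbol are hypothetical names chosen for this illustration, and the real library may fill the shifted position differently.

```python
# Illustrative only: '<bos>' is a hypothetical padding symbol for this sketch,
# not necessarily what the Dataset class inserts internally.
def shift_right(tokens, offset=1, pad='<bos>'):
    """Shift a token list `offset` positions to the right, keeping its length."""
    if offset <= 0:
        return list(tokens)
    return ([pad] * offset + list(tokens))[:len(tokens)]


target_text = ['a', 'dog', 'runs', '.']
state_below = shift_right(target_text)

# At step t the decoder reads state_below[t] and must predict target_text[t].
for prev_word, next_word in zip(state_below, target_text):
    print(f'{prev_word!r:>8} -> {next_word!r}')
```

During training, the decoder therefore always sees the correct previous word instead of its own (possibly wrong) prediction.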

To set up the outputs, we use the `setOutput` function with the appropriate parameters. Note that, when building the dataset for the training split, we also build the vocabulary (up to 50000 words, as set by `max_words`).

In [5]:
ds.setOutput('/content/drive/My Drive/parallel-dataset/train.en',
             'train',
             type='text',
             id='target_text',
             tokenization='tokenize_none',
             build_vocabulary=True,
             pad_on_batch=True,
             sample_weights=True,
             max_text_len=30,
             max_words=50000,
             min_occ=0)

ds.setOutput('/content/drive/My Drive/parallel-dataset/val.en',
             'val',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)

[22/06/2020 15:57:56] 	Applying tokenization function: "tokenize_none".
[22/06/2020 15:57:56] Creating vocabulary for data with data_id 'target_text'.
[22/06/2020 15:57:57] 	 Total: 15457 unique words in 29000 sentences with a total of 345022 words.
[22/06/2020 15:57:57] Creating dictionary of 50000 most common words, covering 100.0% of the text.
[22/06/2020 15:57:57] Loaded "train" set outputs of data_type "text" with data_id "target_text" and length 29000.
[22/06/2020 15:57:57] 	Applying tokenization function: "tokenize_none".
[22/06/2020 15:57:57] Loaded "val" set outputs of data_type "text" with data_id "target_text" and length 1014.
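
The log above reports the vocabulary size and the share of the running text it covers. As a rough sketch of how such a frequency-based vocabulary can be built (not the library's actual implementation), consider the following. The `build_vocabulary` helper is hypothetical, it treats `max_words=0` as "no limit" by assumption, and it ignores any special symbols the library may reserve.

```python
from collections import Counter


def build_vocabulary(sentences, max_words=0, min_occ=0):
    """Return (word2idx, coverage) for whitespace-tokenized sentences.

    max_words=0 means "no limit" in this sketch (an assumption).
    """
    counts = Counter(word for s in sentences for word in s.split())
    # Most frequent words first; drop words seen fewer than min_occ times.
    kept = [(w, c) for w, c in counts.most_common() if c >= min_occ]
    if max_words > 0:
        kept = kept[:max_words]
    word2idx = {w: i for i, (w, _) in enumerate(kept)}
    total = sum(counts.values())
    coverage = 100.0 * sum(c for _, c in kept) / total if total else 0.0
    return word2idx, coverage


corpus = ['a dog runs', 'a cat sleeps', 'a dog sleeps']
word2idx, coverage = build_vocabulary(corpus, max_words=50000)
print(len(word2idx), 'unique words, coverage:', coverage)
```

When the limit exceeds the number of unique words, as in our run (15457 unique words against a limit of 50000), every word is kept and the coverage is 100%.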


Similarly, we introduce the source text data with the `setInput` function. Again, when building the training split, we must construct the vocabulary (here, up to 30000 words).

In [None]:
ds.setInput('/content/drive/My Drive/parallel-dataset/train.de',
            'train',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            build_vocabulary=True,
            fill='end',
            max_text_len=30,
            max_words=30000,
            min_occ=0)
ds.setInput('/content/drive/My Drive/parallel-dataset/val.de',
            'val',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            fill='end',
            max_text_len=30,
            min_occ=0)

[20/06/2020 13:40:22] 	Applying tokenization function: "tokenize_none".
[20/06/2020 13:40:22] Creating vocabulary for data with data_id 'source_text'.
[20/06/2020 13:40:22] 	 Total: 24907 unique words in 29000 sentences with a total of 322380 words.
[20/06/2020 13:40:22] Creating dictionary of 30000 most common words, covering 100.0% of the text.
[20/06/2020 13:40:22] Loaded "train" set inputs of data_type "text" with data_id "source_text" and length 29000.
[20/06/2020 13:40:22] 	Applying tokenization function: "tokenize_none".
[20/06/2020 13:40:22] Loaded "val" set inputs of data_type "text" with data_id "source_text" and length 1014.
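
The `pad_on_batch`, `fill` and `max_text_len` arguments control how sentences of different lengths are packed into a batch. The sketch below illustrates the general idea in plain Python. It assumes that `fill='end'` places the padding after the sentence, that longer sentences are truncated to `max_text_len`, and that the padding index is 0; these are assumptions of the illustration, and `pad_batch` is a hypothetical helper, not a library function.

```python
def pad_batch(batch, max_text_len=30, fill='end', pad_id=0):
    """Truncate each sequence to max_text_len and pad to the longest one."""
    clipped = [seq[:max_text_len] for seq in batch]
    width = max(len(seq) for seq in clipped)
    padded = []
    for seq in clipped:
        padding = [pad_id] * (width - len(seq))
        # fill='end' appends the padding; any other value prepends it.
        padded.append(seq + padding if fill == 'end' else padding + seq)
    return padded


batch = [[5, 8, 2], [7, 2], [4, 9, 6, 3, 2]]
for row in pad_batch(batch):
    print(row)
```

Padding only up to the longest sentence of each batch, rather than to a global maximum, keeps the batches as small as possible.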


Note that the file paths used in this section assume that Google Drive is mounted at `/content/drive`. On Colab, run the following cell before loading any data:

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We do the same for the 'state_below' data. Note that: 1) The `offset` flag is set to 1, which means that the text will be shifted one position to the right. 2) At sampling time, we won't have this input. Hence, we 'hack' the dataset model by inserting an artificial input of type 'ghost' for the validation split.

In [None]:
ds.setInput('/content/drive/My Drive/parallel-dataset/train.en',
            'train',
            type='text',
            id='state_below',
            required=False,
            tokenization='tokenize_none',
            pad_on_batch=True,
            build_vocabulary='target_text',
            offset=1,
            fill='end',
            max_text_len=30,
            max_words=30000)
ds.setInput(None,
            'val',
            type='ghost',
            id='state_below',
            required=False)

[20/06/2020 13:40:31] 	Applying tokenization function: "tokenize_none".
[20/06/2020 13:40:31] 	Reusing vocabulary named "target_text" for data with data_id "state_below".
[20/06/2020 13:40:31] Loaded "train" set inputs of data_type "text" with data_id "state_below" and length 29000.
[20/06/2020 13:40:31] Loaded "val" set inputs of data_type "ghost" with data_id "state_below" and length 1014.


We can also keep the literal source words (for replacing unknown words).

In [None]:
for split, input_text_filename in zip(['train', 'val'],
                                      ['/content/drive/My Drive/parallel-dataset/train.de',
                                       '/content/drive/My Drive/parallel-dataset/val.de']):
    ds.setRawInput(input_text_filename,
                   split,
                   type='file-name',
                   id='raw_source_text',
                   overwrite_split=True)

[20/06/2020 13:40:36] Loaded "train" set inputs of type "file-name" with id "raw_source_text".
[20/06/2020 13:40:36] Loaded "val" set inputs of type "file-name" with id "raw_source_text".
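
To see why the raw source words are useful, here is a sketch of one common way to replace unknown words: each unknown target word is replaced by the source word that received the most attention at that step. The `'<unk>'` symbol, the toy attention matrix and the `replace_unknowns` helper are all hypothetical; whether the toolkit's heuristics behave exactly like this is an assumption.

```python
def replace_unknowns(hypothesis, source_words, alignments, unk='<unk>'):
    """Replace each unknown target word by its most-attended source word.

    alignments[t][s] is the attention weight of target position t on
    source position s (toy values in this example).
    """
    output = []
    for t, word in enumerate(hypothesis):
        if word == unk:
            weights = alignments[t]
            best_source_pos = max(range(len(weights)), key=weights.__getitem__)
            word = source_words[best_source_pos]
        output.append(word)
    return output


source = ['Angela', 'lacht', '.']
hypothesis = ['<unk>', 'laughs', '.']
attention = [[0.8, 0.1, 0.1],
             [0.1, 0.8, 0.1],
             [0.1, 0.1, 0.8]]
print(replace_unknowns(hypothesis, source, attention))
```

Copying works well for names and numbers, which are usually identical in both languages.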


We also need to match the references with the inputs. Since we only have one reference per input sample, we set `repeat=1`.

In [None]:
keep_n_captions(ds, repeat=1, n=1, set_names=['val'])


[20/06/2020 13:40:39] Keeping 1 captions per input on the val set.
[20/06/2020 13:40:39] Samples reduced to 1014 in val set.


Finally, we can save our dataset instance for use in other experiments:

In [None]:
saveDataset(ds, 'datasets')


[20/06/2020 13:40:43] <<< creating directory datasets ... >>>
[20/06/2020 13:40:43] <<< Saving Dataset instance to datasets/Dataset_tutorial_dataset.pkl ... >>>
[20/06/2020 13:40:43] <<< Dataset instance saved >>>


## 2. Creating and training a Neural Translation Model
Now, we'll create and train a Neural Machine Translation (NMT) model. Since there is a significant number of hyperparameters, we'll use the default ones, specified in the `config.py` file. Note that almost every parameter hardcoded in this tutorial is automatically set from the config if we run `main.py`.

We'll create an `'AttentionRNNEncoderDecoder'` (an LSTM encoder-decoder with an attention mechanism). Refer to the [`model_zoo.py`](https://github.com/lvapeab/nmt-keras/blob/master/nmt_keras/model_zoo.py) file for other models (e.g. Transformer).

So first, let's import the model and the hyperparameters. We'll also load the dataset we stored in the previous section (not necessary as it is in memory, but as a demonstration):

In [None]:
from config import load_parameters
from nmt_keras.model_zoo import TranslationModel
from keras_wrapper.cnn_model import loadModel
from keras_wrapper.dataset import loadDataset
from keras_wrapper.extra.callbacks import PrintPerformanceMetricOnEpochEndOrEachNUpdates
params = load_parameters()
dataset = loadDataset('datasets/Dataset_tutorial_dataset.pkl')


[20/06/2020 13:40:48] <<< Loading Dataset instance from datasets/Dataset_tutorial_dataset.pkl ... >>>
[20/06/2020 13:40:48] <<< Dataset instance loaded >>>


Since the number of words in the dataset may be unknown beforehand, we must update the params information according to the dataset instance:


In [None]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['source_text']
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len['target_text']

Now, we create a `TranslationModel` instance:


In [None]:
params['MODEL_TYPE'] = 'AttentionRNNEncoderDecoder' #  Supported models: 'AttentionRNNEncoderDecoder' and 'Transformer'.
nmt_model = TranslationModel(params,
                             model_type=params['MODEL_TYPE'], 
                             model_name='tutorial_model',
                             vocabularies=dataset.vocabulary,
                             store_path='trained_models/tutorial_model/',
                             verbose=True)


[20/06/2020 13:40:55] <<< Building AttentionRNNEncoderDecoder Translation_Model >>>





[20/06/2020 13:40:55] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:650: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

[20/06/2020 13:40:55] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4786: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

[20/06/2020 13:40:55] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:157: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

[20/06/2020 13:40:56] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3561: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


-----------------------------------------------------------------------------------
		TranslationModel instance
-----------------------------------------------------------------------------------
_model_type: AttentionRNNEncoderDecoder
name: tutorial_model
model_path: trained_models/tutorial_model/
verbose: True

Params:
	ACCUMULATE_GRADIENTS: 1
	ADDITIONAL_OUTPUT_MERGE_MODE: Add
	ALIGN_FROM_RAW: True
	ALPHA_FACTOR: 0.6
	AMSGRAD: False
	APPLY_DETOKENIZATION: False
	ATTENTION_DROPOUT_P: 0.0
	ATTENTION_MODE: add
	ATTENTION_SIZE: 32
	BATCH_NORMALIZATION_MODE: 1
	BATCH_SIZE: 50
	BEAM_SEARCH: True
	BEAM_SIZE: 6
	BETA_1: 0.9
	BETA_2: 0.999
	BIDIRECTIONAL_DEEP_ENCODER: True
	BIDIRECTIONAL_ENCODER: True
	BIDIRECTIONAL_MERGE_MODE: concat
	BPE_CODES_PATH: examples/EuTrans//training_codes.joint
	CLASSIFIER_ACTIVATION: softmax
	CLIP_C: 5.0
	CLIP_V: 0.0
	COVERAGE_NORM_FACTOR: 0.2
	COVERAGE_PENALTY: False
	DATASET_NAME: EuTrans
	DATASET_STORE_PATH: datasets/
	DATA_AUGMENTATION: False
	DATA_ROOT_PATH

[20/06/2020 13:40:57] From /content/nmt-keras/nmt-keras/nmt_keras/model_zoo.py:213: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

[20/06/2020 13:40:57] Preparing optimizer and compiling. Optimizer configuration: 
	 LR: 0.001
	 LOSS: categorical_crossentropy
	 BETA_1: 0.9
	 BETA_2: 0.999
	 EPSILON: 1e-08





[20/06/2020 13:40:57] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1192: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.



Next, we must define the inputs and outputs mapping from our Dataset instance to our model:


In [None]:

inputMapping = dict()
for i, id_in in enumerate(params['INPUTS_IDS_DATASET']):
    pos_source = dataset.ids_inputs.index(id_in)
    id_dest = nmt_model.ids_inputs[i]
    inputMapping[id_dest] = pos_source
nmt_model.setInputsMapping(inputMapping)

outputMapping = dict()
for i, id_out in enumerate(params['OUTPUTS_IDS_DATASET']):
    pos_target = dataset.ids_outputs.index(id_out)
    id_dest = nmt_model.ids_outputs[i]
    outputMapping[id_dest] = pos_target
nmt_model.setOutputsMapping(outputMapping)
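
To see what these loops produce, we can run the same logic on plain lists. The three id lists below are hypothetical stand-ins for `params['INPUTS_IDS_DATASET']`, `dataset.ids_inputs` and `nmt_model.ids_inputs`; the real values depend on the configuration.

```python
# Hypothetical id lists; the real ones come from params, the dataset and the model.
inputs_ids_dataset = ['source_text', 'state_below']
dataset_ids_inputs = ['source_text', 'state_below', 'raw_source_text']
model_ids_inputs = ['source_text', 'state_below']

# Each model input name is mapped to the position of its data in the dataset.
input_mapping = {
    model_ids_inputs[i]: dataset_ids_inputs.index(id_in)
    for i, id_in in enumerate(inputs_ids_dataset)
}
print(input_mapping)  # {'source_text': 0, 'state_below': 1}
```

The mapping is thus a dictionary from the model's input names to positions in the dataset's input list; the output mapping is built in the same way.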


We can add callbacks to control the training (e.g. sampling every N updates, early stopping, learning rate annealing...). For instance, let's build a sampling callback. After each epoch, it will compute the BLEU score on the development set using the sacreBLEU package. We need to pass some configuration variables to the callback (in the `extra_vars` dictionary):


In [None]:
is_transformer = params.get('ATTEND_ON_OUTPUT', 'transformer' in params['MODEL_TYPE'].lower())
search_params = {
    'language': 'en',
    'tokenize_f': dataset.tokenize_none,
    'beam_size': 12,
    'optimized_search': True,
    'model_inputs': params['INPUTS_IDS_MODEL'],
    'model_outputs': params['OUTPUTS_IDS_MODEL'],
    'dataset_inputs':  params['INPUTS_IDS_DATASET'],
    'dataset_outputs':  params['OUTPUTS_IDS_DATASET'],
    'n_parallel_loaders': 1,
    'maxlen': 50,
    'normalize_probs': True,
    'pos_unk': True and not is_transformer,  # Pos_unk is unimplemented for transformer models
    'heuristic': 0,
    'state_below_maxlen': -1,
    'attend_on_output': is_transformer,
    'val': {'references': dataset.extra_variables['val']['target_text']}
  }

vocab = dataset.vocabulary['target_text']['idx2words']
callbacks = []
input_text_id = params['INPUTS_IDS_DATASET'][0]

callbacks.append(PrintPerformanceMetricOnEpochEndOrEachNUpdates(nmt_model,
                                                                dataset,
                                                                gt_id='target_text',
                                                                metric_name=['sacrebleu'],
                                                                set_name=['val'],
                                                                batch_size=50,
                                                                each_n_epochs=1,
                                                                extra_vars=search_params,
                                                                reload_epoch=0,
                                                                is_text=True,
                                                                input_text_id=input_text_id,
                                                                index2word_y=vocab,
                                                                sampling_type='max_likelihood',
                                                                beam_search=True,
                                                                save_path=nmt_model.model_path,
                                                                start_eval_on_epoch=0,
                                                                write_samples=True,
                                                                write_type='list',
                                                                verbose=True))

Now we are ready to train. Let's set up some training parameters...


In [None]:
training_params = {'n_epochs': 8,
                   'batch_size': 50,
                   'maxlen': 30,
                   'epochs_for_save': 1,
                   'verbose': 1,
                   'eval_on_sets': [], 
                   'n_parallel_loaders': 1,
                   'extra_callbacks': callbacks,
                   'reload_epoch': 0,
                   'epoch_offset': 0}
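
As a quick sanity check on these settings, we can estimate how much work one training run involves. The split sizes are the ones reported in the dataset logs (29000 training and 1014 validation sentences); the sketch assumes every training sentence is used exactly once per epoch.

```python
import math

n_train, n_val = 29000, 1014   # split sizes reported in the dataset logs
batch_size, n_epochs = 50, 8   # values from training_params

updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * n_epochs
val_batches = math.ceil(n_val / batch_size)
print(updates_per_epoch, total_updates, val_batches)  # 580 4640 21
```

Under that assumption, each epoch performs 580 parameter updates, and the callback decodes 21 validation batches after every epoch.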


And train!


In [None]:
nmt_model.trainNet(dataset, training_params)


[20/06/2020 13:41:35] <<< Training model >>>
[20/06/2020 13:41:35] Training parameters: { 
	batch_size: 50
	class_weights: None
	da_enhance_list: []
	da_patch_type: resize_and_rndcrop
	data_augmentation: False
	each_n_epochs: 1
	epoch_offset: 0
	epochs_for_save: 1
	eval_on_epochs: True
	eval_on_sets: []
	extra_callbacks: [<keras_wrapper.extra.callbacks.EvalPerformance object at 0x7f0109ae2b70>]
	homogeneous_batches: False
	initial_lr: 1.0
	joint_batches: 4
	lr_decay: None
	lr_gamma: 0.1
	lr_half_life: 50000
	lr_reducer_exp_base: 0.5
	lr_reducer_type: linear
	lr_warmup_exp: -1.5
	maxlen: 30
	mean_substraction: False
	metric_check: None
	min_delta: 0.0
	min_lr: 1e-09
	n_epochs: 8
	n_gpus: 1
	n_parallel_loaders: 1
	normalization_type: None
	normalize: False
	num_iterations_val: None
	patience: 0
	patience_check_split: val
	reduce_each_epochs: True
	reload_epoch: 0
	shuffle: True
	start_eval_on_epoch: 0
	start_reduction_on_epoch: 0
	tensorboard: False
	tensorboard_params: {'log_dir': 'tens




[20/06/2020 13:41:40] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3315: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

[20/06/2020 13:41:40] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:292: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

[20/06/2020 13:41:40] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:299: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

[20/06/2020 13:41:40] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:312: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

[20/06/2020 13:41:40] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:321: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

[20/06/2020 13:41:40] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:328: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.



Epoch 1/8


[20/06/2020 13:50:05] <<< Saving model to trained_models/tutorial_model/epoch_1 ... >>>





  'TensorFlow optimizers do not '
[20/06/2020 13:50:08] <<< Model saved >>>

[20/06/2020 13:50:08] <<< Predicting outputs of val set >>>




 Total cost: 11301.798203 	 Average cost: 11.145758
The sampling took: 208.869295 secs (Speed: 0.205985 sec/sample)


[20/06/2020 13:53:37] Prediction output 0: target_text (text)
[20/06/2020 13:53:37] Decoding beam search prediction ...
[20/06/2020 13:53:37] Using heuristic 0





[20/06/2020 13:53:37] Evaluating on metric sacrebleu
[20/06/2020 13:53:37] Computing SacreBleu scores on the val split...
[20/06/2020 13:53:37] Bleu_4: 0.34382451706233275
[20/06/2020 13:53:37] Done evaluating on metric sacrebleu
[20/06/2020 13:53:37] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_1.jpg >>>


Epoch 2/8


[20/06/2020 14:02:02] <<< Saving model to trained_models/tutorial_model/epoch_2 ... >>>





[20/06/2020 14:02:03] <<< Model saved >>>

[20/06/2020 14:02:03] <<< Predicting outputs of val set >>>




 Total cost: 18548.678641 	 Average cost: 18.292582
The sampling took: 169.148276 secs (Speed: 0.166813 sec/sample)


[20/06/2020 14:04:52] Prediction output 0: target_text (text)
[20/06/2020 14:04:52] Decoding beam search prediction ...
[20/06/2020 14:04:52] Using heuristic 0
[20/06/2020 14:04:52] Evaluating on metric sacrebleu





[20/06/2020 14:04:52] Computing SacreBleu scores on the val split...
[20/06/2020 14:04:52] Bleu_4: 13.996877467559987
[20/06/2020 14:04:52] Done evaluating on metric sacrebleu
[20/06/2020 14:04:53] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_2.jpg >>>


Epoch 3/8


[20/06/2020 14:13:15] <<< Saving model to trained_models/tutorial_model/epoch_3 ... >>>





[20/06/2020 14:13:16] <<< Model saved >>>

[20/06/2020 14:13:16] <<< Predicting outputs of val set >>>




 Total cost: 14501.026911 	 Average cost: 14.300815
The sampling took: 157.329790 secs (Speed: 0.155158 sec/sample)


[20/06/2020 14:15:53] Prediction output 0: target_text (text)
[20/06/2020 14:15:53] Decoding beam search prediction ...
[20/06/2020 14:15:53] Using heuristic 0
[20/06/2020 14:15:53] Evaluating on metric sacrebleu





[20/06/2020 14:15:54] Computing SacreBleu scores on the val split...
[20/06/2020 14:15:54] Bleu_4: 16.729644635441666
[20/06/2020 14:15:54] Done evaluating on metric sacrebleu
[20/06/2020 14:15:54] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_3.jpg >>>


Epoch 4/8


[20/06/2020 14:24:23] <<< Saving model to trained_models/tutorial_model/epoch_4 ... >>>





[20/06/2020 14:24:24] <<< Model saved >>>

[20/06/2020 14:24:24] <<< Predicting outputs of val set >>>




 Total cost: 12796.827810 	 Average cost: 12.620146
The sampling took: 140.501818 secs (Speed: 0.138562 sec/sample)


[20/06/2020 14:26:44] Prediction output 0: target_text (text)
[20/06/2020 14:26:44] Decoding beam search prediction ...
[20/06/2020 14:26:44] Using heuristic 0
[20/06/2020 14:26:45] Evaluating on metric sacrebleu





[20/06/2020 14:26:45] Computing SacreBleu scores on the val split...
[20/06/2020 14:26:45] Bleu_4: 22.648543921228203
[20/06/2020 14:26:45] Done evaluating on metric sacrebleu
[20/06/2020 14:26:45] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_4.jpg >>>


Epoch 5/8


[20/06/2020 14:35:06] <<< Saving model to trained_models/tutorial_model/epoch_5 ... >>>





[20/06/2020 14:35:07] <<< Model saved >>>

[20/06/2020 14:35:07] <<< Predicting outputs of val set >>>




 Total cost: 12474.627965 	 Average cost: 12.302394
The sampling took: 137.670628 secs (Speed: 0.135770 sec/sample)


[20/06/2020 14:37:25] Prediction output 0: target_text (text)
[20/06/2020 14:37:25] Decoding beam search prediction ...
[20/06/2020 14:37:25] Using heuristic 0
[20/06/2020 14:37:25] Evaluating on metric sacrebleu





[20/06/2020 14:37:25] Computing SacreBleu scores on the val split...
[20/06/2020 14:37:25] Bleu_4: 25.985609476394615
[20/06/2020 14:37:25] Done evaluating on metric sacrebleu
[20/06/2020 14:37:25] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_5.jpg >>>


Epoch 6/8


[20/06/2020 14:45:45] <<< Saving model to trained_models/tutorial_model/epoch_6 ... >>>





[20/06/2020 14:45:46] <<< Model saved >>>

[20/06/2020 14:45:46] <<< Predicting outputs of val set >>>




 Total cost: 10634.221109 	 Average cost: 10.487398
The sampling took: 132.722480 secs (Speed: 0.130890 sec/sample)


[20/06/2020 14:47:59] Prediction output 0: target_text (text)
[20/06/2020 14:47:59] Decoding beam search prediction ...
[20/06/2020 14:47:59] Using heuristic 0





[20/06/2020 14:47:59] Evaluating on metric sacrebleu
[20/06/2020 14:47:59] Computing SacreBleu scores on the val split...
[20/06/2020 14:47:59] Bleu_4: 26.377897676934694
[20/06/2020 14:47:59] Done evaluating on metric sacrebleu
[20/06/2020 14:47:59] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_6.jpg >>>


Epoch 7/8


[20/06/2020 14:56:15] <<< Saving model to trained_models/tutorial_model/epoch_7 ... >>>





[20/06/2020 14:56:15] <<< Model saved >>>

[20/06/2020 14:56:15] <<< Predicting outputs of val set >>>




 Total cost: 10016.882394 	 Average cost: 9.878582
The sampling took: 128.586598 secs (Speed: 0.126811 sec/sample)


[20/06/2020 14:58:24] Prediction output 0: target_text (text)
[20/06/2020 14:58:24] Decoding beam search prediction ...
[20/06/2020 14:58:24] Using heuristic 0
[20/06/2020 14:58:24] Evaluating on metric sacrebleu





[20/06/2020 14:58:24] Computing SacreBleu scores on the val split...
[20/06/2020 14:58:24] Bleu_4: 29.452345864699875
[20/06/2020 14:58:24] Done evaluating on metric sacrebleu
[20/06/2020 14:58:24] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_7.jpg >>>


Epoch 8/8


[20/06/2020 15:06:50] <<< Saving model to trained_models/tutorial_model/epoch_8 ... >>>





[20/06/2020 15:06:51] <<< Model saved >>>

[20/06/2020 15:06:51] <<< Predicting outputs of val set >>>




 Total cost: 9412.687276 	 Average cost: 9.282729
The sampling took: 125.681089 secs (Speed: 0.123946 sec/sample)


[20/06/2020 15:08:56] Prediction output 0: target_text (text)
[20/06/2020 15:08:56] Decoding beam search prediction ...
[20/06/2020 15:08:56] Using heuristic 0
[20/06/2020 15:08:56] Evaluating on metric sacrebleu





[20/06/2020 15:08:57] Computing SacreBleu scores on the val split...
[20/06/2020 15:08:57] Bleu_4: 29.745362978969368
[20/06/2020 15:08:57] Done evaluating on metric sacrebleu
[20/06/2020 15:08:57] 
<<< Progress plot saved in trained_models/tutorial_model/epoch_8.jpg >>>
[20/06/2020 15:08:57] <<< Finished training model >>>


## 3. Decoding with a trained Neural Machine Translation Model

Now, we'll load the model we just trained from disk and use it to translate new text. In this case, we want to translate the 'test' split from our dataset.

Since we want to translate a new data split ('test'), we must add it to the dataset instance, just as we did before (in Section 1). If we also have the references of the test split and want to evaluate on it, we can add them to the dataset as well. Note that this is not mandatory: we could just predict without evaluating.

In [None]:
dataset.setInput('/content/drive/My Drive/parallel-dataset/test_2017_flickr.de',
            'test',
            type='text',
            id='source_text',
            pad_on_batch=True,
            tokenization='tokenize_none',
            fill='end',
            max_text_len=50,
            min_occ=0)

dataset.setInput(None,
            'test',
            type='ghost',
            id='state_below',
            required=False)

dataset.setRawInput('/content/drive/My Drive/parallel-dataset/test_2017_flickr.de',
              'test',
              type='file-name',
              id='raw_source_text',
              overwrite_split=True)


[20/06/2020 15:11:16] 	Applying tokenization function: "tokenize_none".
[20/06/2020 15:11:16] Loaded "test" set inputs of data_type "text" with data_id "source_text" and length 1000.
[20/06/2020 15:11:16] Loaded "test" set inputs of data_type "ghost" with data_id "state_below" and length 1000.
[20/06/2020 15:11:16] Loaded "test" set inputs of type "file-name" with id "raw_source_text".


Now, let's load the translation model. Suppose we want to load the model saved at the end of epoch 8:


In [None]:
params['INPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['INPUTS_IDS_DATASET'][0]]
params['OUTPUT_VOCABULARY_SIZE'] = dataset.vocabulary_len[params['OUTPUTS_IDS_DATASET'][0]]

# Load model
nmt_model = loadModel('trained_models/tutorial_model', 8)


[20/06/2020 15:11:24] <<< Loading model from trained_models/tutorial_model/epoch_8_Model_Wrapper.pkl ... >>>
[20/06/2020 15:11:24] <<< Loading model from trained_models/tutorial_model/epoch_8.h5 ... >>>
[20/06/2020 15:11:27] <<< Loading optimized model... >>>
[20/06/2020 15:11:32] <<< Optimized model loaded. >>>
[20/06/2020 15:11:32] <<< Model loaded in 7.7300 seconds. >>>


Once the model is loaded, we just have to invoke the sampling method (in this case, the beam search algorithm) on the 'test' split:


In [None]:
is_transformer = params.get('ATTEND_ON_OUTPUT', 'transformer' in params['MODEL_TYPE'].lower())

params_prediction = {
    'language': 'en',
    'tokenize_f': dataset.tokenize_none,
    'beam_size': 12,
    'optimized_search': True,
    'model_inputs': params['INPUTS_IDS_MODEL'],
    'model_outputs': params['OUTPUTS_IDS_MODEL'],
    'dataset_inputs':  params['INPUTS_IDS_DATASET'],
    'dataset_outputs':  params['OUTPUTS_IDS_DATASET'],
    'n_parallel_loaders': 1,
    'maxlen': 50,
    'normalize_probs': True,
    'pos_unk': True and not is_transformer,
    'heuristic': 0,
    'state_below_maxlen': -1,
    'predict_on_sets': ['test'],
    'verbose': 0,
    'attend_on_output': is_transformer
  }
predictions = nmt_model.predictBeamSearchNet(dataset, params_prediction)['test']


[20/06/2020 15:11:39] <<< Predicting outputs of test set >>>




 Total cost: 8872.210595 	 Average cost: 8.872211
The sampling took: 117.405943 secs (Speed: 0.117406 sec/sample)
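
For readers unfamiliar with beam search, the following toy example shows the idea behind the `beam_size` parameter. It is a generic sketch, not the toolkit's implementation: `beam_search`, `toy_model` and the probability table are hypothetical, and ranking finished hypotheses by length-normalized log-probability is one common choice that we assume here.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos):
    """Return the best (tokens, score) found by a plain beam search.

    next_log_probs(prefix) -> {token: log_prob} is any scoring function;
    hypotheses are ranked by length-normalized log-probability.
    """
    beams = [((), 0.0)]   # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep the beam_size best expansions; set finished ones aside.
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)   # keep unfinished hypotheses as a fallback
    return max(finished, key=lambda c: c[1] / len(c[0]))


# Toy next-word distributions; any prefix not listed ends the sentence.
TABLE = {
    (): {'a': 0.6, 'the': 0.4},
    ('a',): {'dog': 0.5, 'cat': 0.5},
    ('the',): {'dog': 0.9, 'cat': 0.1},
}


def toy_model(prefix):
    probs = TABLE.get(prefix, {'</s>': 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


best, score = beam_search(toy_model, beam_size=2, max_len=5, eos='</s>')
greedy, _ = beam_search(toy_model, beam_size=1, max_len=5, eos='</s>')
print('beam 2:', best)
print('beam 1:', greedy)
```

With a beam of 1 (greedy search) the locally best first word 'a' leads to a sentence of probability 0.3, whereas a beam of 2 also explores 'the' and finds a sentence of probability 0.36.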


So far, the variable 'predictions' holds the word indices of the hypotheses. We must decode them into words. To do this, we'll use the dictionary stored in the dataset object:


In [None]:
from keras_wrapper.utils import decode_predictions_beam_search
vocab = dataset.vocabulary['target_text']['idx2words']
samples = predictions['samples'] # Get word indices from the samples.

predictions = decode_predictions_beam_search(samples,  
                                             vocab,
                                             verbose=params['VERBOSE'])

[20/06/2020 15:14:17] Decoding beam search prediction ...
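
Conceptually, this decoding step is a dictionary lookup. The sketch below shows the idea on a toy vocabulary; it is not the implementation of `decode_predictions_beam_search`, and both the `decode` helper and the convention that index 0 marks the end of the sentence are assumptions of this example.

```python
# Toy vocabulary; index 0 is assumed to be the end-of-sequence symbol here.
idx2words = {0: '<eos>', 1: '<unk>', 2: 'a', 3: 'dog', 4: 'runs', 5: '.'}


def decode(sample, idx2words, eos_id=0):
    """Map word indices to a sentence, stopping at the first eos_id."""
    words = []
    for idx in sample:
        if idx == eos_id:
            break
        words.append(idx2words[idx])
    return ' '.join(words)


samples = [[2, 3, 4, 5, 0, 0], [2, 3, 0]]
decoded = [decode(s, idx2words) for s in samples]
print(decoded)
```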


Finally, we store the hypotheses:



In [None]:
from keras_wrapper.extra.read_write import list2file
filepath = 'test.pred'
list2file(filepath, predictions)
!head -n 4 test.pred

A sidewalk.
A guard dog and one is getting out of an ice cream truck in a Asian car.
A athlete with brown gear in front of a graffiti covered in front of a tree that is holding a mountain parked by the sun.
A man with a naked pants swings on a music match on a stage.


If we have the references of this split, we can also evaluate the performance of our system on it. First, we must add them to the dataset object:


In [None]:
dataset.setOutput('/content/drive/My Drive/parallel-dataset/test_2017_flickr.en',
             'test',
             type='text',
             id='target_text',
             pad_on_batch=True,
             tokenization='tokenize_none',
             sample_weights=True,
             max_text_len=30,
             max_words=0)
keep_n_captions(dataset, repeat=1, n=1, set_names=['test'])

[20/06/2020 15:14:29] 	Applying tokenization function: "tokenize_none".
[20/06/2020 15:14:29] Loaded "test" set outputs of data_type "text" with data_id "target_text" and length 1000.
[20/06/2020 15:14:29] Keeping 1 captions per input on the test set.
[20/06/2020 15:14:29] Samples reduced to 1000 in test set.


Next, we call the evaluation system (the sacreBLEU package):


In [None]:

from keras_wrapper.extra.evaluation import select
metric = 'sacrebleu'
# Set up the variables required by the metric
extra_vars = dict()
extra_vars['tokenize_f'] = dataset.tokenize_none
extra_vars['language'] = params['TRG_LAN']
extra_vars['test'] = dict()
extra_vars['test']['references'] = dataset.extra_variables['test']['target_text']
metrics = select[metric](pred_list=predictions,
                         verbose=1,
                         extra_vars=extra_vars,
                         split='test')

[20/06/2020 15:14:37] Computing SacreBleu scores on the test split...
[20/06/2020 15:14:37] Bleu_4: 23.304958245214795
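
To give an intuition of what the reported score measures, here is a simplified corpus-level BLEU in plain Python. It is only a sketch: it assumes whitespace tokenization, a single reference per hypothesis and no smoothing, so its numbers will generally differ from those of sacreBLEU. The `ngrams` and `corpus_bleu` helpers are written for this illustration and are not part of any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes hypotheses shorter than their references.
    brevity_penalty = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100.0 * brevity_penalty * math.exp(log_precision)


reference = ['a dog runs in the park']
print(corpus_bleu(['a dog runs in the park'], reference))
print(corpus_bleu(['a dog runs in a park'], reference))
```

A perfect match scores 100, and a single wrong word already lowers the score considerably, because it breaks every n-gram that contains it.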


And that's all!