## German to English Translation Dataset

In [39]:
import string
import re
from pickle import dump
from unicodedata import normalize
from numpy import array
from pickle import load
from numpy.random import rand
from numpy.random import shuffle

#
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


#### Preparing the Text Data

some observations I note from reviewing the raw data:

* There is punctuation.

* The text contains uppercase and lowercase.

* There are special characters in the German.

* There are duplicate phrases in English with different translations in German.

* The file is ordered by sentence length with very long sentences toward the end of the fil

Data preparation is divided into two subsections:

 - Clean Text

 - Split Text

#### 1. Clean Text

In [10]:
#load the data in a way that preserves the Unicode German characters. 
#The function below called load_doc() will load the file as a blob of text.
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, mode='rt', encoding='utf-8')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

Each line contains a single pair of phrases, first English and then German, separated by a tab character.

We must split the loaded text by line and then by phrase. 

The function to_pairs() below will split the loaded text.

In [11]:
# split a loaded document into sentences with pairs having English and corresponing Germain word pair
def to_pairs(doc):
	lines = doc.strip().split('\n')
	pairs = [line.split('\t') for line in  lines]
	return pairs

#### clean each sentence. 

The specific cleaning operations we will perform are as follows:

- Remove all non-printable characters.

- Remove all punctuation characters.

- Normalize all Unicode characters to ASCII (e.g. Latin characters).

- Normalize the case to lowercase.

- Remove any remaining tokens that are not alphabetic.

In [22]:
# clean a list of lines
def clean_pair(lines):
	cleaned = list()
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for pair in lines:
		clean_pair = list()
		for line in pair:
			# normalize unicode characters
			line = normalize('NFD', line).encode('ascii', 'ignore')
			line = line.decode('UTF-8')
			# tokenize on white space
			line = line.split()
			# convert to lowercase
			line = [word.lower() for word in line]
			# remove punctuation from each token
			line = [word.translate(table) for word in line]
			# remove non-printable chars form each token
			line = [re_print.sub('', w) for w in line]
			# remove tokens with numbers in them
			line = [word for word in line if word.isalpha()]
			# store as string
			clean_pair.append(' '.join(line))
		cleaned.append(clean_pair)
	return array(cleaned)

In [17]:
# save a list of clean sentences to file
def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

In [27]:
# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)


# clean sentences
clean_pairs = clean_pair(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
# spot check
for i in range(10):
    print("Unclean \n")
    print("="*30)
    print(pairs[i])
    print("clean \n")
    print("="*30)
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))

Saved: english-german.pkl
Unclean 

['Hi.', 'Hallo!']
clean 

[hi] => [hallo]
Unclean 

['Hi.', 'Grüß Gott!']
clean 

[hi] => [gru gott]
Unclean 

['Run!', 'Lauf!']
clean 

[run] => [lauf]
Unclean 

['Wow!', 'Potzdonner!']
clean 

[wow] => [potzdonner]
Unclean 

['Wow!', 'Donnerwetter!']
clean 

[wow] => [donnerwetter]
Unclean 

['Fire!', 'Feuer!']
clean 

[fire] => [feuer]
Unclean 

['Help!', 'Hilfe!']
clean 

[help] => [hilfe]
Unclean 

['Help!', 'Zu Hülf!']
clean 

[help] => [zu hulf]
Unclean 

['Stop!', 'Stopp!']
clean 

[stop] => [stopp]
Unclean 

['Wait!', 'Warte!']
clean 

[wait] => [warte]


In [14]:
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [13]:
re_print = re.compile('[^%s]' % re.escape(string.printable))
re_print

re.compile(r'[^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\!\"\#\$\%\&\\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^_\`\{\|\}\~\ \\t\\n\\r\\x0b\\x0c]',
re.UNICODE)

In [9]:
#Creates a dictionary with initial values initialized as None for each punctuation symbol
table = str.maketrans('', '', string.punctuation)
table

{33: None,
 34: None,
 35: None,
 36: None,
 37: None,
 38: None,
 39: None,
 40: None,
 41: None,
 42: None,
 43: None,
 44: None,
 45: None,
 46: None,
 47: None,
 58: None,
 59: None,
 60: None,
 61: None,
 62: None,
 63: None,
 64: None,
 91: None,
 92: None,
 93: None,
 94: None,
 95: None,
 96: None,
 123: None,
 124: None,
 125: None,
 126: None}

#### 2. Split Text

The clean data contains a little over 150,000 phrase pairs and some of the pairs toward the end of the file are very long.

We will simplify the problem by reducing the dataset to the first 10,000 examples in the file; these will be the shortest phrases in the dataset.

Further, we will then stake the first 9,000 of those as examples for training and the remaining 1,000 examples to test the fit model.

In [29]:
# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# save a list of clean sentences to file
def save_clean_data(sentences, filename):
	dump(sentences, open(filename, 'wb'))
	print('Saved: %s' % filename)

# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')

# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]

In [30]:
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

Saved: english-german-both.pkl
Saved: english-german-train.pkl
Saved: english-german-test.pkl


Running the example creates three new files: the english-german-both.pkl that contains all of the train and test examples that we can use to define the parameters of the problem, such as max phrase lengths and the vocabulary, and the english-german-train.pkl and english-german-test.pkl files for the train and test dataset.

## Train Neural Translation Model

This involves both loading and preparing the clean text data ready for modeling and defining and training the model on the prepared data.

In [31]:
# load a clean dataset
def load_clean_sentences(filename):
	return load(open(filename, 'rb'))

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

In [32]:
dataset.shape

(10000, 2)

In [33]:
train.shape

(9000, 2)

In [34]:
test.shape

(1000, 2)

#### We can use the Keras Tokenize class to map words to integers, as needed for modeling. 


We will use separate tokenizer for the English sequences and the German sequences. 

The function below-named create_tokenizer() will train a tokenizer on a list of phrases.

In [38]:
# fit a tokenizer
def create_tokenizer(lines):
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

#### the function named max_length() below will find the length of the longest sequence in a list of phrases.

In [36]:
# max sentence length
def max_length(lines):
	return max(len(line.split()) for line in lines)

In [41]:
dataset[:5,]

array([['get out', 'raus'],
       ['i want to play', 'ich will spielen'],
       ['you can rest', 'sie konnen sich ausruhen'],
       ['he relented', 'er gab nach'],
       ['its not enough', 'das reicht nicht aus']], dtype='<U370')

#### Prepare tokenizers, vocabulary sizes, and maximum lengths for both the English and German phrases.

In [40]:
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

English Vocabulary Size: 2309
English Max Length: 5


In [44]:
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
#word index returns  a dictionary of the word and its index position
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

German Vocabulary Size: 3657
German Max Length: 10


#### prepare the training dataset.

- Each input and output sequence must be encoded to integers and padded to the maximum phrase length. 

- This is because we will use a word embedding for the input sequences and one hot encode the output sequences 

The function below named encode_sequences() will perform these operations and return the result.

In [63]:
# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
	# integer encode sequences
	X = tokenizer.texts_to_sequences(lines)
	#print((lines,X))
	# pad sequences with 0 values
	X = pad_sequences(X, maxlen=length, padding='post')
	return X

The output sequence needs to be one-hot encoded. This is because the model will predict the probability of each word in the vocabulary as output.

The function encode_output() below will one-hot encode English output sequences.

In [54]:
# one hot encode target sequence
def encode_output(sequences, vocab_size):
	ylist = list()
	for sequence in sequences:
		encoded = to_categorical(sequence, num_classes=vocab_size)
		ylist.append(encoded)
	y = array(ylist)
	y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
	return y

####  prepare both the train and test dataset ready for training the model.

In [58]:
train[:10,0]

array(['get out', 'i want to play', 'you can rest', 'he relented',
       'its not enough', 'it has to stop', 'come back', 'i like to sing',
       'go away', 'is tom reliable'], dtype='<U370')

In [57]:
test[:10,1]

array(['wir stehen', 'ein mensch muss arbeiten', 'tom ist nicht verloren',
       'sie horten auf', 'macht eure taschen leer', 'er ist mein vater',
       'schauen sie genau hin', 'fasst mit an', 'nimm das',
       'ich hatte unrecht'], dtype='<U370')

In [64]:
# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)
# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

#### We will use an encoder-decoder LSTM model on this problem. 

In this architecture, the input sequence is encoded by a front-end model called the encoder then decoded word by word by a backend model called the decoder.

The function define_model() below defines the model and takes a number of arguments used to configure the model, such as 

- the size of the input and output vocabularies, 

- the maximum length of input and output phrases, and 

- the number of memory units used to configure the model.


The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical loss function because we have framed the prediction problem as multi-class classification.

In [76]:
!pip install graphviz



fbprophet 0.4.post2 requires setuptools-git>=1.2, which is not installed.
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [79]:
import matplotlib.pyplot as plt

In [81]:
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
	model = Sequential()
	model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
	model.add(LSTM(n_units))
	model.add(RepeatVector(tar_timesteps))
	model.add(LSTM(n_units, return_sequences=True))
	model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
	return model

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 10, 256)           936192    
_________________________________________________________________
lstm_17 (LSTM)               (None, 256)               525312    
_________________________________________________________________
repeat_vector_9 (RepeatVecto (None, 5, 256)            0         
_________________________________________________________________
lstm_18 (LSTM)               (None, 5, 256)            525312    
_________________________________________________________________
time_distributed_9 (TimeDist (None, 5, 2309)           593413    
Total params: 2,580,229
Trainable params: 2,580,229
Non-trainable params: 0
_________________________________________________________________
None


#### train the model.

We train the model for 30 epochs and a batch size of 64 examples.

We use checkpointing to ensure that each time the model skill on the test set improves, the model is saved to file.

In [82]:
# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

Train on 9000 samples, validate on 1000 samples
Epoch 1/30
 - 47s - loss: 4.3278 - val_loss: 3.5656

Epoch 00001: val_loss improved from inf to 3.56561, saving model to model.h5
Epoch 2/30
 - 32s - loss: 3.4131 - val_loss: 3.4259

Epoch 00002: val_loss improved from 3.56561 to 3.42591, saving model to model.h5
Epoch 3/30
 - 33s - loss: 3.2560 - val_loss: 3.3179

Epoch 00003: val_loss improved from 3.42591 to 3.31793, saving model to model.h5
Epoch 4/30
 - 34s - loss: 3.1018 - val_loss: 3.2021

Epoch 00004: val_loss improved from 3.31793 to 3.20206, saving model to model.h5
Epoch 5/30
 - 34s - loss: 2.9721 - val_loss: 3.1067

Epoch 00005: val_loss improved from 3.20206 to 3.10666, saving model to model.h5
Epoch 6/30
 - 34s - loss: 2.8132 - val_loss: 2.9849

Epoch 00006: val_loss improved from 3.10666 to 2.98485, saving model to model.h5
Epoch 7/30
 - 33s - loss: 2.6501 - val_loss: 2.8713

Epoch 00007: val_loss improved from 2.98485 to 2.87132, saving model to model.h5
Epoch 8/30
 - 33s 

<keras.callbacks.History at 0x5b8a76d8>

#### the best model saved during training must be loaded

In [84]:
# load model
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu
model = load_model('model.h5')

Evaluation involves two steps: 
    
    - first generating a translated output sequence, and 
    
    - then repeating this process for many input examples and summarizing the skill of the model across multiple cases.

Starting with inference, the model can predict the entire output sequence in a one-shot manner.

In [86]:
# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
# prepare data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
 

This will be a sequence of integers that we can enumerate and lookup in the tokenizer to map back to words.

The function below, named word_for_id(), will perform this reverse mapping.

In [92]:
# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

We can perform this mapping for each integer in the translation and return the result as a string of words.

The function predict_sequence() below performs this operation for a single encoded source phrase.

In [94]:
# generate target given source sequence
def predict_sequence(model, tokenizer, source):
	prediction = model.predict(source, verbose=0)[0]
	integers = [np.argmax(vector) for vector in prediction]
	target = list()
	for i in integers:
		word = word_for_id(i, tokenizer)
		if word is None:
			break
		target.append(word)
	return ' '.join(target)

Next, we can repeat this for each source phrase in a dataset and compare the predicted result to the expected target phrase in English.

We can print some of these comparisons to screen to get an idea of how the model performs in practice.

We will also calculate the BLEU scores to get a quantitative idea of how well the model has performed.

The evaluate_model() function below implements this, calling the above predict_sequence() function for each phrase in a provided dataset

In [90]:
 
# evaluate the skill of the model
def evaluate_model(model, tokenizer, sources, raw_dataset):
	actual, predicted = list(), list()
	for i, source in enumerate(sources):
		# translate encoded source text
		source = source.reshape((1, source.shape[0]))
		translation = predict_sequence(model, eng_tokenizer, source)
		raw_target, raw_src = raw_dataset[i]
		if i < 10:
			print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
		actual.append(raw_target.split())
		predicted.append(translation.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

In [93]:
# test on some training sequences
import numpy as np
print('train')
evaluate_model(model, eng_tokenizer, trainX, train)
# test on some test sequences
print('test')
evaluate_model(model, eng_tokenizer, testX, test)


train
src=[raus], target=[get out], predicted=[get out]
src=[ich will spielen], target=[i want to play], predicted=[i want to raise]
src=[sie konnen sich ausruhen], target=[you can rest], predicted=[you can rest]
src=[er gab nach], target=[he relented], predicted=[he relented]
src=[das reicht nicht aus], target=[its not enough], predicted=[its not enough]
src=[es muss aufhoren], target=[it has to stop], predicted=[it has to stop]
src=[komm wieder], target=[come back], predicted=[come back]
src=[ich singe gern], target=[i like to sing], predicted=[i like to]
src=[mach die sause], target=[go away], predicted=[get away]
src=[ist tom zuverlassig], target=[is tom reliable], predicted=[is tom reliable]


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


BLEU-1: 0.077815
BLEU-2: 0.000000
BLEU-3: 0.000000
BLEU-4: 0.000000
test
src=[wir stehen], target=[were standing], predicted=[were are]
src=[ein mensch muss arbeiten], target=[a man must work], predicted=[my one a fever]
src=[tom ist nicht verloren], target=[tom isnt lost], predicted=[tom isnt]
src=[sie horten auf], target=[they stopped], predicted=[they stood up]
src=[macht eure taschen leer], target=[empty your bags], predicted=[close your bags]
src=[er ist mein vater], target=[hes my father], predicted=[he is my dog]
src=[schauen sie genau hin], target=[look closely], predicted=[watch closely]
src=[fasst mit an], target=[lend us a hand], predicted=[look for]
src=[nimm das], target=[take this], predicted=[take this]
src=[ich hatte unrecht], target=[i was wrong], predicted=[i was wrong]
BLEU-1: 0.081937
BLEU-2: 0.000000
BLEU-3: 0.000000
BLEU-4: 0.000000
