This notebook covers model training. It can be used to train a new model or to recalibrate an existing one.
https://www.analyticsvidhya.com/blog/2019/01/neural-machine-translation-keras/

In [2]:
from sklearn.model_selection import train_test_split
from src.persistence.adapter.data_cleaning_adapter import EngDaDataCleaningAdapter
from src.domain.services.encoding_service import EncodingService

In [3]:
# Clean data
# Encode tokens
# Train model
# Run application with trained model

In [4]:
# Read and clean the data
data = EngDaDataCleaningAdapter().get_clean_data()

In [5]:
# Split data into train and test sets
train, test = train_test_split(data, test_size=0.2, random_state=12)

It’s time to encode the sentences. We encode the Danish sentences as the input sequences and the English sentences as the target sequences. This has to be done for both the train and test datasets.

In [6]:
da_index = 0
en_index = 1

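# Sequence length should be the longest sentence in tokens, not the number of rows
# (len(data[:, i]) would pad every sentence to the size of the dataset).
# Assumption: whitespace splitting approximates the tokenizer used by EncodingService.
da_length = max(len(str(s).split()) for s in data[:, da_index])
en_length = max(len(str(s).split()) for s in data[:, en_index])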

da_encoder = EncodingService(data[:,da_index])
en_encoder = EncodingService(data[:,en_index])

da_vocab_size = len(da_encoder.tokenizer.word_index)+1
en_vocab_size = len(en_encoder.tokenizer.word_index)+1

# Encode training data
trainX = da_encoder.encode_tokens(da_length, train[:, da_index])
trainY = en_encoder.encode_tokens(en_length, train[:, en_index])

# Encode validation data
testX = da_encoder.encode_tokens(da_length, test[:, da_index])
testY = en_encoder.encode_tokens(en_length, test[:, en_index])
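
To make the encoding step concrete, here is a minimal plain-Python sketch of what it roughly amounts to: build a word index, map each sentence to integer ids, and zero-pad to a fixed length. It assumes EncodingService wraps a Keras-style tokenizer (it exposes tokenizer.word_index) and pads at the end of the sequence; build_word_index and encode_and_pad are hypothetical helper names that are not part of the project.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer id, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending frequency; ties are broken alphabetically to stay deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def encode_and_pad(sentences, word_index, length):
    """Turn sentences into fixed-length lists of ids, truncated or zero-padded at the end."""
    encoded = []
    for s in sentences:
        ids = [word_index[w] for w in s.lower().split() if w in word_index]
        ids = ids[:length]
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded

sentences = ["jeg er glad", "jeg er her nu"]  # toy Danish examples
word_index = build_word_index(sentences)
vocab_size = len(word_index) + 1  # +1 for the padding id 0, as in the cell above
padded = encode_and_pad(sentences, word_index, length=4)
print(padded)
```

The `+ 1` on the vocabulary size mirrors the notebook: id 0 never appears in the word index, so the embedding layer needs one extra row for it.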

In [7]:
# Delete objects that are no longer needed to free memory for model training.
# The memory is only reclaimed once garbage collection has run (next cell).
del train
del da_encoder
del en_encoder

In [8]:
import gc
gc.collect()

0

In [9]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector

# build NMT model
def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()
    # Encoder: embed the input token ids and compress the sentence into one vector
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    # Decoder: repeat that vector for every output timestep and predict one token per step
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    return model
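
To see how data flows through this encoder-decoder stack, the sketch below traces the tensor shape after each layer without needing Keras. The function name trace_shapes and the sizes used are illustrative assumptions; the shapes follow the standard behaviour of the layers named in define_model.

```python
def trace_shapes(batch, out_vocab, in_timesteps, out_timesteps, units):
    """Return (layer name, output shape) pairs for the layers used in define_model."""
    return [
        ("input ids", (batch, in_timesteps)),                 # integer token ids
        ("Embedding", (batch, in_timesteps, units)),          # one vector per token
        ("LSTM (encoder)", (batch, units)),                   # last hidden state only
        ("RepeatVector", (batch, out_timesteps, units)),      # same vector at each output step
        ("LSTM (decoder)", (batch, out_timesteps, units)),    # return_sequences=True
        ("Dense softmax", (batch, out_timesteps, out_vocab)), # distribution over target words
    ]

# Made-up sizes, only for illustration.
shapes = trace_shapes(batch=2, out_vocab=60, in_timesteps=8, out_timesteps=10, units=16)
for name, shape in shapes:
    print(f"{name:16s} -> {shape}")
```

The whole input sentence is squeezed into a single vector of size `units` before decoding, which is the main design limitation of this simple architecture.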

We use the RMSprop optimizer in this model, as it is usually a good choice when working with recurrent neural networks.

We train the model for 30 epochs with a batch size of 512 and a validation split of 20%: 80% of the training data is used to fit the model and the remaining 20% to evaluate it after each epoch. Feel free to experiment with these hyperparameters.
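
A small worked example of what these numbers mean in practice. Keras applies validation_split to the arrays passed to fit, so here it splits the training set that train_test_split already produced; overall that is roughly 64% training, 16% validation and 20% test. The sample count below is hypothetical, and fit_split_sizes is an illustrative helper, not part of Keras.

```python
import math

def fit_split_sizes(n_samples, validation_split, batch_size):
    """Samples used for training and validation, and the number of batches per epoch."""
    n_train = int(math.floor(n_samples * (1.0 - validation_split)))
    n_val = n_samples - n_train
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch

# Hypothetical number of rows in trainX.
n_train, n_val, steps = fit_split_sizes(n_samples=20_000, validation_split=0.2, batch_size=512)
print(n_train, n_val, steps)
```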

In [10]:
from keras import optimizers

# model compilation
model = define_model(da_vocab_size, en_vocab_size, da_length, en_length, 512)
# `lr` is deprecated (and rejected by recent Keras versions); use `learning_rate`
rms = optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')
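
The loss used here is the sparse variant of categorical cross-entropy because the targets are integer token ids rather than one-hot vectors. As a rough sketch of what it computes (ignoring masking and batching, with made-up probabilities), it averages the negative log-probability the model assigns to the correct token at each timestep:

```python
import math

def sparse_categorical_crossentropy(target_ids, predicted_probs):
    """Mean negative log-probability of the correct token id at each timestep."""
    losses = [-math.log(probs[t]) for t, probs in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)

target_ids = [1, 2]            # correct token id per timestep
predicted_probs = [
    [0.10, 0.70, 0.20],        # model is fairly confident in id 1
    [0.25, 0.25, 0.50],        # model is less confident in id 2
]
loss = sparse_categorical_crossentropy(target_ids, predicted_probs)
print(round(loss, 4))
```

The more probability mass the softmax puts on the correct word, the closer each term gets to zero.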



In [None]:
from keras.callbacks import ModelCheckpoint

DIR = "../src/persistence/models/"
filename = 'en_da_model.v1'

checkpoint = ModelCheckpoint((DIR + filename),
                             monitor='val_loss',
                             verbose=1,
                             save_best_only=True,
                             mode='min')

# train model
history = model.fit(trainX,
                    trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30,
                    batch_size=512,
                    validation_split=0.2,
                    callbacks=[checkpoint],
                    verbose=1)

Epoch 1/30
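
The checkpoint callback above is configured with save_best_only=True and mode='min', so a model file is only written when the validation loss improves. A minimal sketch of that rule, with invented loss values and a hypothetical helper name:

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a 'save best only, mode=min' rule would save."""
    best = float("inf")
    saved = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:  # strictly better than anything seen so far
            best = val_loss
            saved.append(epoch)
    return saved

val_losses = [2.9, 2.4, 2.5, 2.1, 2.2]  # made-up validation losses
saved = epochs_that_save(val_losses)
print(saved)
```

With this rule the file on disk always holds the weights from the best epoch so far, even if later epochs overfit.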


In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()

In [12]:
testX.shape[0]

5492

- Text: the raw input sentences.
- Tokens: sequences of characters, each a subpart of the original text.
- Embeddings: tokens are encoded as integer arrays, which the embedding layer maps to vectors.
- Transformers: add self-attention on top of the embeddings.

We now have a model that, given an input, predicts the sequence of tokens with the highest probability of matching the desired output.
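
As an illustration of how such a prediction could be turned back into text, the sketch below picks the most probable token id at each timestep and looks it up in an id-to-word mapping. It assumes id 0 is padding and that an index_word dictionary is available from the English tokenizer (which would have to be recreated, since the encoders are deleted earlier in this notebook); greedy_decode and the probabilities are hypothetical.

```python
def greedy_decode(step_probs, index_word):
    """Pick the most probable token id at each timestep and map ids back to words."""
    words = []
    for probs in step_probs:
        token_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token_id == 0:  # assumption: id 0 is padding, so it is skipped
            continue
        words.append(index_word.get(token_id, "<unk>"))
    return " ".join(words)

index_word = {1: "i", 2: "am", 3: "happy"}  # toy vocabulary
step_probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.20],
    [0.00, 0.10, 0.20, 0.70],
    [0.90, 0.05, 0.03, 0.02],  # padding step
]
sentence = greedy_decode(step_probs, index_word)
print(sentence)
```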