**NMT**

**Let's look at a simple neural machine translation model that will translate English sentences to French**

The English sentences are fed to the encoder:
    1. note that the French translations are also used as inputs to the decoder, but shifted back by one step, so at each step the decoder receives the word it should have output at the previous step
    2. for the very first word, the decoder is given the start-of-sequence (SOS) token
    3. the decoder is expected to end the sentence with an end-of-sequence (EOS) token
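
The shift described above can be sketched in plain Python. The token IDs and the helper name `make_decoder_pair` are hypothetical, chosen only for illustration; the notes do not say which IDs are reserved for SOS and EOS.

```python
# Hypothetical special-token IDs (assumptions, not from the notes)
SOS_ID = 1  # start-of-sequence
EOS_ID = 2  # end-of-sequence

def make_decoder_pair(target_ids):
    """Return (decoder_inputs, decoder_targets) for one target sentence."""
    # The decoder must finish the sentence with EOS
    targets = list(target_ids) + [EOS_ID]
    # The decoder inputs are the targets shifted by one step, starting with SOS
    decoder_inputs = [SOS_ID] + targets[:-1]
    return decoder_inputs, targets

french_ids = [17, 42, 8]  # made-up word IDs for a three-word French sentence
dec_in, dec_out = make_decoder_pair(french_ids)
print(dec_in)   # [1, 17, 42, 8]
print(dec_out)  # [17, 42, 8, 2]
```

At every step the input is the word the decoder should have produced one step earlier, which is why the two lists differ only by the leading SOS and the trailing EOS.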

Note:
    1. the English sentences are reversed before they are fed to the encoder
    2. this is useful because the beginning of the English sentence is then fed to the encoder last, and that's generally the first thing that the decoder needs to translate
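
As a small illustration of the reversal, assuming the sentence has already been split into tokens (the helper name `reverse_source` is hypothetical):

```python
def reverse_source(tokens):
    """Reverse the source tokens so the first word is fed to the encoder last."""
    return tokens[::-1]

english = ["I", "drink", "milk"]
reversed_english = reverse_source(english)
print(reversed_english)  # ['milk', 'drink', 'I']
```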

Each word is initially represented by its ID:
    1. next, an embedding layer returns the word embedding
    2. these word embeddings are what is actually fed to the encoder and the decoder
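
Conceptually, an embedding layer is a lookup table indexed by word ID. Here is a minimal NumPy sketch of that idea; the tiny vocabulary and the random matrix are made up for illustration, whereas a real embedding layer learns the matrix during training.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> ID
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "i": 3, "drink": 4, "milk": 5}
embed_size = 4

# One row per word ID; random here, learned in a real model
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(len(vocab), embed_size))

# Words -> IDs -> embeddings (a simple row lookup)
word_ids = np.array([vocab[w] for w in ["i", "drink", "milk"]])
word_embeddings = embedding_matrix[word_ids]

print(word_ids)               # [3 4 5]
print(word_embeddings.shape)  # (3, 4)
```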

At each step the decoder outputs a score for each word in the output vocabulary:
    1. and then the softmax layer turns these scores into probabilities
    2. the word with the highest probability is output
    3. this is much like a regular classification task
    4. so you can train the model using the 'sparse categorical crossentropy' loss
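
To make the scores-to-probabilities-to-loss chain concrete, here is a NumPy sketch with a made-up 3-word vocabulary and two decoding steps. The function names are illustrative and only mirror what the Keras loss computes per step.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities along the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(target_ids, probas):
    """Per-step loss: minus the log-probability assigned to the target word."""
    steps = np.arange(len(target_ids))
    return -np.log(probas[steps, target_ids])

# Made-up decoder scores: 2 time steps, vocabulary of 3 words
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
probas = softmax(scores)
predicted_ids = probas.argmax(axis=-1)  # word with the highest probability
loss = sparse_categorical_crossentropy(np.array([0, 2]), probas)

print(predicted_ids)  # [0 2]
```

The targets are plain integer word IDs rather than one-hot vectors, which is what "sparse" refers to.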

So far we have assumed that all input sequences have a constant length, but obviously sentence lengths vary:
    1. since regular tensors have fixed shapes, they can only contain sentences of the same length
    2. you can use masking to handle this
    3. if the sentences have very different lengths, group them into buckets of similar length (e.g. one bucket for the 1 to 6 word sentences, another for the 7 to 12 word sentences, and so on), padding the shorter sentences so that all sentences in a bucket have the same length
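
A plain-Python sketch of bucketing with padding, assuming a padding ID of 0 and buckets that are 6 words wide (both values, and the helper names, are illustrative choices). In Keras, padding with 0 pairs naturally with `Embedding(..., mask_zero=True)`, which makes downstream layers skip the padded steps.

```python
from collections import defaultdict

PAD_ID = 0        # assumed padding token ID
BUCKET_WIDTH = 6  # assumed bucket width: lengths 1-6, 7-12, ...

def bucket_upper_bound(length, width=BUCKET_WIDTH):
    """Map a sentence length to the longest length allowed in its bucket."""
    return ((length - 1) // width + 1) * width

def bucket_and_pad(sentences):
    """Group sentences by bucket and pad each one to its bucket's length."""
    buckets = defaultdict(list)
    for ids in sentences:
        bound = bucket_upper_bound(len(ids))
        buckets[bound].append(ids + [PAD_ID] * (bound - len(ids)))
    return dict(buckets)

# Made-up sentences of word IDs with lengths 3, 7, 2 and 12
sentences = [[5, 9, 3], [4] * 7, [8, 2], [6] * 12]
batches = bucket_and_pad(sentences)

print(sorted(batches))  # [6, 12]
print(batches[6])       # [[5, 9, 3, 0, 0, 0], [8, 2, 0, 0, 0, 0]]
```

Every sentence within a bucket now has the same length, so each bucket can be stacked into a regular tensor with little wasted padding.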

We want to ignore any output past the EOS token:
    1. so these tokens should not contribute to the loss (they must be masked out)
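
One way to picture this masking is the NumPy sketch below. The EOS ID, the helper names, and the per-step loss values are all made up; the point is only that steps after the first EOS get weight 0 when the loss is averaged.

```python
import numpy as np

EOS_ID = 2  # assumed end-of-sequence ID

def mask_up_to_eos(target_ids, eos_id=EOS_ID):
    """1.0 for steps up to and including the first EOS, 0.0 afterwards."""
    is_eos = (target_ids == eos_id).astype(int)
    # Number of EOS tokens seen strictly before each step
    eos_before = np.cumsum(is_eos, axis=1) - is_eos
    return (eos_before == 0).astype(np.float32)

def masked_mean_loss(per_step_loss, mask):
    """Average the loss over unmasked steps only."""
    return (per_step_loss * mask).sum() / mask.sum()

targets = np.array([[7, 5, 2, 0, 0],
                    [9, 2, 0, 0, 0]])
# Made-up per-step losses: huge values on the last two steps of each row
per_step_loss = np.ones(targets.shape, dtype=np.float32)
per_step_loss[:, 3:] = 100.0

mask = mask_up_to_eos(targets)
print(mask)
print(masked_mean_loss(per_step_loss, mask))  # the 100.0 entries are ignored
```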

When the output vocabulary is large (which is the case here):
    1. outputting a probability for each and every possible word would be terribly slow
    2. to avoid this, one solution is to look only at the logits output by the model for the correct word and for a random sample of incorrect words
    3. then compute an approximation of the loss based only on these logits
    4. this sampled softmax technique can be used during training, with the normal softmax function used at inference time
    5. sampled softmax cannot be used at inference time because it requires knowing the target
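
The idea can be sketched in NumPy for a single decoding step. This is a deliberately simplified version: it samples incorrect words uniformly and applies no correction for the sampling distribution, so it is not equivalent to a production implementation such as TensorFlow's `tf.nn.sampled_softmax_loss`. The function name and all the numbers are illustrative.

```python
import numpy as np

def sampled_softmax_loss(logits, target_id, num_sampled, rng):
    """Approximate cross-entropy from the target logit plus a uniform
    random sample of incorrect-word logits (simplified sketch)."""
    vocab_size = logits.shape[0]
    # Candidate negatives: every word except the correct one
    candidates = np.delete(np.arange(vocab_size), target_id)
    negatives = rng.choice(candidates, size=num_sampled, replace=False)
    # Softmax over the small subset only; the target sits at index 0
    subset = np.concatenate(([logits[target_id]], logits[negatives]))
    subset = subset - subset.max()  # numerical stability
    log_probas = subset - np.log(np.exp(subset).sum())
    return -log_probas[0]

rng = np.random.default_rng(42)
logits = rng.normal(size=10_000)  # made-up logits for a 10,000-word vocabulary
target_id = 123                   # made-up correct word ID

approx = sampled_softmax_loss(logits, target_id, num_sampled=50, rng=rng)
# Full softmax cross-entropy, for comparison
full = -(logits[target_id] - np.log(np.exp(logits).sum()))
print(approx, full)
```

Only 51 logits enter the approximate loss instead of 10,000, which is where the speed-up comes from. The function needs `target_id`, which is exactly why this trick is limited to training.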

In [4]:
# Importing Libraries
import tensorflow as tf
from tensorflow import keras
import numpy as np
import tensorflow_datasets as tfds

In [5]:
tf.random.set_seed(42)

In [6]:
vocab_size = 100
embed_size = 10

In [7]:
import tensorflow_addons as tfa

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# The LSTM keeps a short-term state (h) and a long-term state (c).
# return_state=True makes the layer also return its final h and c states,
# which are passed to the decoder as its initial state.
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

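# TrainingSampler is one of several samplers available in TensorFlow Addons. Their role is to tell the
# decoder at each step what it should pretend the previous output was. During training, that is the
# embedding of the previous target token (hence TrainingSampler); during inference, it would be the
# embedding of the token that was actually output.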
sampler = tfa.seq2seq.sampler.TrainingSampler()

decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)

model = keras.models.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])

In [8]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

In [9]:
X = np.random.randint(100, size=10*1000).reshape(1000, 10)
Y = np.random.randint(100, size=15*1000).reshape(1000, 15)
X_decoder = np.c_[np.zeros((1000, 1)), Y[:, :-1]]
seq_lengths = np.full([1000], 15)

history = model.fit([X, X_decoder, seq_lengths], Y, epochs=2)

Epoch 1/2
Epoch 2/2
