1. What are the pros and cons of using a stateful RNN versus a stateless RNN?

> In a stateful RNN, we do not reset the hidden states of the memory cells after each mini-batch, only between epochs. (The weights are never reset; they are what training updates.) During training, the i-th sample of batch j+1 must be the continuation of the i-th sample of batch j. This potentially allows the RNN to learn patterns longer than a single window. However, it reduces the training data available, because consecutive windows cannot overlap: the shift between windows must equal the window length. Also, the model needs to know the batch size in advance, since it keeps one state per sample in the batch, so to run inference with a different batch size we need to copy the trained weights into another model built for that batch size.

Book's answer:
Stateless RNNs can only capture patterns whose length is less than, or equal to, the size of the windows the RNN is trained on. Conversely, stateful RNNs can capture longer-term patterns. However, implementing a stateful RNN is much harder, especially preparing the dataset properly. Moreover, stateful RNNs do not always work better, in part because consecutive batches are not independent and identically distributed (IID). Gradient Descent is not fond of non-IID datasets.
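
The dataset preparation is the hard part, so here is a minimal sketch of the batch layout in plain Python. The helper name `stateful_batches` is made up for illustration, and the sketch only builds input windows (a real pipeline would also build targets). It assumes the data is one long sequence that gets cut into `batch_size` contiguous chunks.

```python
def stateful_batches(sequence, batch_size, window_length):
    """Split one long sequence into batches suitable for a stateful RNN.

    The sequence is cut into `batch_size` contiguous chunks. Batch j takes
    the j-th non-overlapping window from every chunk, so row i of batch j+1
    is the continuation of row i of batch j.
    """
    chunk_length = len(sequence) // batch_size
    n_batches = chunk_length // window_length
    batches = []
    for j in range(n_batches):
        batch = []
        for i in range(batch_size):
            start = i * chunk_length + j * window_length
            batch.append(sequence[start:start + window_length])
        batches.append(batch)
    return batches


# Toy "time series": the integers 0..23.
batches = stateful_batches(list(range(24)), batch_size=2, window_length=4)
for batch in batches:
    print(batch)
```

In each printed batch, row 0 picks up exactly where row 0 of the previous batch stopped, which is what lets the carried-over state make sense. It also shows the cost: 24 time steps yield only 6 windows, whereas overlapping windows with a shift of 1 would yield many more.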

2. Why do people use encoder–decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

> An encoder-decoder architecture allows the output sequence to have a different length than the input sequence. It also lets the decoder take into account the entire sentence that the encoder saw and encoded, rather than translating one word at a time using only the words up to step t. This matters in many translation contexts, for example when word order differs between the two languages.

3. How can you deal with variable-length input sequences? What about variable-length output sequences?

> For inputs, we can use ragged tensors or masking. Ragged tensors have some rough edges in TensorFlow (e.g. they cannot be used as training targets), so they may not be fully feasible. Masking means that the text vectorization layer's padding token is treated as a mask, and subsequent layers that support masking simply ignore those time steps. For variable-length output sequences, we train the decoder to predict an EOS token, pad the sequences, and mask the decoder's inputs so that the loss function ignores anything after the end of the sequence.
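
To make the "loss ignores padded steps" part concrete, here is a minimal sketch in plain Python rather than TensorFlow. The function name `masked_cross_entropy`, the padding ID 0, and the tiny three-token vocabulary are all assumptions made for illustration.

```python
import math

PAD_ID = 0  # assumed padding token ID


def masked_cross_entropy(target_ids, predicted_probs, pad_id=PAD_ID):
    """Mean cross-entropy over the non-padding steps of one sequence.

    target_ids: list of int token IDs, padded with pad_id.
    predicted_probs: one probability distribution (list of floats) per step.
    """
    total, count = 0.0, 0
    for target, probs in zip(target_ids, predicted_probs):
        if target == pad_id:
            continue  # masked step: contributes nothing to the loss
        total += -math.log(probs[target])
        count += 1
    return total / count if count else 0.0


# Vocabulary (assumed): 0 = padding, 1 = a word, 2 = EOS.
targets = [1, 2, 0, 0]
probs = [
    [0.1, 0.8, 0.1],     # step 0: predicts the word
    [0.1, 0.1, 0.8],     # step 1: predicts EOS
    [0.9, 0.05, 0.05],   # step 2: padding, ignored
    [0.3, 0.3, 0.4],     # step 3: padding, ignored
]
loss = masked_cross_entropy(targets, probs)
print(loss)
```

Whatever the model outputs at the two padded steps has no effect on the loss, so it is never penalized for what it predicts after the end of the sequence.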

4. What is beam search, and why would you use it? What tool can you use to implement it?

> Beam search is a technique where, during inference, we keep a list of the k most promising partial sentences (k is the beam width). We treat the model's output probabilities as conditional probabilities, track the total probability of each candidate sentence, and keep only the top k at every step. This lets us recover from an early choice that looked best locally but leads to a poor sentence, which matters because every prediction builds on the previously predicted tokens. It can be implemented directly in Python as a loop around the model's predictions.

Addition from the book:
You can implement beam search by writing a custom memory cell. Alternatively, TensorFlow Addons's seq2seq API provides an implementation.
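
As a rough illustration of the plain-Python route, here is a minimal sketch of beam search over a toy model. The lookup table `TOY_MODEL`, the helper `toy_next_log_probs`, and the `"<eos>"` token are invented for this example. A real decoder would replace the lookup with a call to the trained model, and the sketch assumes that call always returns at least one candidate token.

```python
import math


def beam_search(next_log_probs, beam_width, max_length, eos):
    """Return (tokens, log_prob) for the best sequence found.

    next_log_probs(prefix) -> dict mapping each candidate next token to its
    conditional log-probability given the tuple `prefix`.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished: carry over
                continue
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + (token,), score + log_p))
        # Keep only the k most probable candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0]


# Toy conditional distributions (hypothetical): "a" looks best at step 1,
# but the most probable full sentence starts with "b".
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_next_log_probs(prefix):
    probs = TOY_MODEL.get(prefix, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy = beam_search(toy_next_log_probs, beam_width=1, max_length=5, eos="<eos>")
beam = beam_search(toy_next_log_probs, beam_width=2, max_length=5, eos="<eos>")
print(greedy[0], math.exp(greedy[1]))
print(beam[0], math.exp(beam[1]))
```

With a beam width of 1 this reduces to greedy decoding, which commits to "a" and ends with total probability 0.3. With a beam width of 2 the search keeps "b" alive and finds the sentence with probability 0.36. The sketch compares raw log-probability sums, which favors shorter sentences when candidate lengths differ; length normalization is left out for brevity.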

5. What is an attention mechanism? How does it help?

> An attention mechanism is essentially a memory retrieval mechanism. Instead of expecting the decoder to rely only on the single vector produced by the encoder, at each step the decoder can attend to the encoder's outputs from any time step. This can increase the accuracy of the translation, because the decoder has direct access to all of the encoder's outputs rather than one fixed-size summary.
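
The retrieval idea can be sketched in a few lines of plain Python. This uses dot-product scoring, which is only one of several ways to score similarity, and it has no learned parameters. The function names and the toy vectors are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_outputs):
    """Dot-product attention for a single decoder step.

    Scores each encoder output by its dot product with the query, turns the
    scores into weights with a softmax, and returns the weighted sum
    (the context vector) together with the weights.
    """
    scores = [sum(q * h for q, h in zip(query, output))
              for output in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * output[d] for w, output in zip(weights, encoder_outputs))
               for d in range(dim)]
    return context, weights


# Three encoder time steps with 2-dimensional outputs (toy values).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 5.0]  # decoder state that "looks for" the second dimension
context, weights = attend(query, encoder_outputs)
print(weights)
print(context)
```

Almost all of the weight lands on the second and third encoder steps, the ones whose outputs align with the query, so the context vector is built mostly from them regardless of where they sit in the input sequence.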

6. What is the most important layer in the transformer architecture? What is its purpose?

> The transformer showed that we don't need RNNs or convolutions to produce accurate translations. This means there is no longer any need to process the sequence one time step at a time; all positions can be processed in parallel (at least during training). The most important layer is the multi-head attention layer. In the encoder, self-attention lets each word attend to all other words; in the decoder, masked self-attention does the same while preserving causality, so each position only sees earlier positions. Both enrich the representations of the inputs plus their positional encodings. Cross-attention then uses the decoder's representation of its inputs to attend to the encoder's output representations, so that the appropriate output can be learned.
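
The causality constraint in masked self-attention is easy to see in a stripped-down sketch. This is a single head with no learned query, key, or value projections (a real multi-head attention layer has all of these), and the function name and toy vectors are invented for illustration.

```python
import math


def causal_self_attention_weights(vectors):
    """Scaled dot-product self-attention weights with a causal mask.

    Row t holds the weights that position t assigns to positions 0..t.
    Later positions are masked out and get weight 0.
    """
    dim = len(vectors[0])
    weights = []
    for t, query in enumerate(vectors):
        # Only positions <= t are visible to position t.
        scores = [sum(q * k for q, k in zip(query, vectors[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(vectors) - t - 1)
        weights.append(row)
    return weights


# Three token representations (toy values).
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
attention = causal_self_attention_weights(tokens)
for row in attention:
    print(row)
```

The resulting matrix is lower triangular: the first position can only attend to itself, while the last one attends to everything before it. Each row is computed independently of the others, which is what allows all positions to be handled in parallel.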

7. When would you need to use sampled softmax?

> When the dimensionality of the output is very large (e.g. vocabularies with 30,000 possible tokens), calculating the full softmax can be expensive. Therefore, when calculating the loss, we compute the softmax over the logit of the correct label plus the logits of a random sample of other words. This can speed up training.

Book's answer:
Sampled softmax is used when training a classification model when there are many classes (e.g., thousands). It computes an approximation of the cross-entropy loss based on the logit predicted by the model for the correct class, and the predicted logits for a sample of incorrect words. This speeds up training considerably compared to computing the softmax over all logits and then estimating the cross-entropy loss. After training, the model can be used normally, using the regular softmax function to compute all the class probabilities based on all the logits.
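
A minimal sketch of the approximation, in plain Python. It assumes negatives are drawn uniformly without replacement, and it skips the correction for the sampling distribution that library implementations usually apply. Both function names are made up for this example.

```python
import math
import random


def full_softmax_loss(logits, label):
    """Exact cross-entropy for one example: log-sum-exp minus the true logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[label]


def sampled_softmax_loss(logits, label, num_sampled, rng):
    """Approximate cross-entropy using the true class plus sampled negatives.

    Assumption: negatives are sampled uniformly without replacement, and no
    correction for the sampling probabilities is applied.
    """
    negatives = [c for c in range(len(logits)) if c != label]
    sampled = rng.sample(negatives, num_sampled)
    subset = [logits[label]] + [logits[c] for c in sampled]
    m = max(subset)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in subset))
    return log_sum_exp - logits[label]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # 1,000 toy classes
label = 3

approx = sampled_softmax_loss(logits, label, num_sampled=20, rng=rng)
exact = full_softmax_loss(logits, label)
print(approx, exact)
```

The sampled version only touches 21 logits instead of 1,000, which is where the training speed-up comes from. Because it sums over a subset of the classes, this uncorrected version never exceeds the exact loss, so it is only suitable as a training signal; at inference time the full softmax is used.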