<a href="https://colab.research.google.com/github/jrgreen7/SYSC4906/blob/master/W2025/Tutorials/T8/Tutorial-8_Seq2Seq.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 8 - seq2seq 

**Semester:** Winter 2025

**Adapted by:** [Kevin Dick](https://kevindick.ai/), [Igor Bogdanov](mailto:igorbogdanov@cmail.carleton.ca)

**Part I adapted from:** [seq2seq Tutorial](https://github.com/lukas/ml-class/blob/master/videos/seq2seq/train.py) originally from this [Keras Blog](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html).

**Part II adapted from:** Clay Woolam <clay@woolam.org> under a BSD License.

---

## Part I(a): seq2seq LSTM Model

The canonical example of a sequence-to-sequence (`seq2seq`) learning task is **language translation**. A sequence representing a sentence in one language is encoded into a latent space (an embedded representation) and then decoded into another language.

**Neither input nor output length is fixed:** A variable-length input sequence of characters from a given alphabet must be converted into an output sequence that is also variable in length and possibly drawn from an altogether different alphabet.

`seq2seq` models generally require **massive amounts** of data to learn their task effectively, so this tutorial focuses on an example for which large amounts of data can be generated on demand:

### Method:

We will generate thousands of **string** representations of math questions (e.g., `"39+3"`) and the **string** representations of their target answers (e.g., `"42"`). These will be vectorized and used to train an LSTM model that will learn the "translation" task of converting a query string from "question-language" into a target string in "answer-language"!
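
As a minimal sketch of the data format described above, the snippet below builds one padded question/answer pair in plain Python. The helper name `make_pair` is hypothetical and not part of the tutorial; the padding widths (`2 * digits + 1` for the question, `digits + 1` for the answer) and the optional reversal follow the conventions used in the data-generation cell of this notebook.

```python
def make_pair(a, b, digits=3, reverse=True):
    """Return (question, answer) strings padded to fixed widths."""
    maxlen = digits + 1 + digits            # e.g. '345+678' has 7 characters
    question = f"{a}+{b}".ljust(maxlen)     # pad on the right with spaces
    answer = str(a + b).ljust(digits + 1)   # e.g. 999+999 = 1998 has 4 characters
    if reverse:
        question = question[::-1]           # the padding ends up on the left
    return question, answer


question, answer = make_pair(39, 3)
print(repr(question), "->", repr(answer))
```

Every question therefore has the same length, and so does every answer, which is what lets them be stacked into fixed-shape arrays later on.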


In [None]:
from keras.models import Sequential
from keras import layers
import numpy as np
import matplotlib.pyplot as plt


class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one-hot integer representation
    + Decode the one-hot or integer representation to their character output
    + Decode a vector of probabilities to their character output
    """

    def __init__(self, chars):
        """Initialize character table.
        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One-hot encode given string C.
        # Arguments
            C: string, to be encoded.
            num_rows: Number of rows in the returned one-hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        """Decode the given vector or 2D array to their character output.
        # Arguments
            x: A vector or a 2D array of probabilities or one-hot representations;
                or a vector of character indices (used with `calc_argmax=False`).
            calc_argmax: Whether to find the character index with maximum
                probability, defaults to `True`.
        """
        if calc_argmax:
            x = x.argmax(axis=-1)
        return "".join(self.indices_char[i] for i in x)


print("CharacterTable utility class defined for encoding/decoding characters.")
print("This will convert between text strings and one-hot encoded matrices.")


class colors:
    ok = "\033[92m"
    fail = "\033[91m"
    close = "\033[0m"
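
To make the encoding concrete, here is a small pure-Python sketch of the same one-hot round trip without NumPy. The names `one_hot` and `from_one_hot` are illustrative stand-ins for `CharacterTable.encode` and `CharacterTable.decode`, not part of the tutorial code.

```python
CHARS = sorted(set("0123456789+ "))   # same alphabet ordering as CharacterTable
CHAR_TO_INDEX = {c: i for i, c in enumerate(CHARS)}


def one_hot(text, num_rows):
    """Encode text as num_rows rows, each with a single 1 in its character's column."""
    rows = [[0] * len(CHARS) for _ in range(num_rows)]
    for i, c in enumerate(text):
        rows[i][CHAR_TO_INDEX[c]] = 1
    return rows


def from_one_hot(rows):
    """Decode by taking the argmax of every row (this also works for probabilities)."""
    return "".join(CHARS[max(range(len(row)), key=row.__getitem__)] for row in rows)


encoded = one_hot("39+3   ", 7)
print(len(encoded), "rows x", len(encoded[0]), "columns")
print(repr(from_one_hot(encoded)))
```

Because decoding is an argmax per row, the same function can be applied to a softmax output, where each row holds probabilities rather than a single 1.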

In [None]:
# Parameters for the model and dataset.
TRAINING_SIZE = 50000
DIGITS = 3
REVERSE = True

# Maximum length of input is 'int + int' (e.g., '345+678'). Maximum length of
# int is DIGITS.
MAXLEN = DIGITS + 1 + DIGITS

# All the numbers, plus sign and space for padding.
chars = '0123456789+ '
ctable = CharacterTable(chars)
questions = []
expected = []
seen = set()

print(f"Model parameters initialized:")
print(f"- Training size: {TRAINING_SIZE} examples")
print(f"- Maximum digits per number: {DIGITS}")
print(f"- Input reversal: {REVERSE}")
print(f"- Maximum input length: {MAXLEN} characters")
print(f"- Character set: '{chars}'")
print("Ready to generate training data...")

In [None]:
print('Generating data...')
while len(questions) < TRAINING_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789'))
                    for i in range(np.random.randint(1, DIGITS + 1))))
    a, b = f(), f()
    # Skip any addition questions we've already seen.
    # Also skip any such that x+y == y+x (hence the sorting).
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    # Pad the data with spaces such that it is always MAXLEN.
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))
    ans = str(a + b)
    # Answers can be of maximum size DIGITS + 1.
    ans += ' ' * (DIGITS + 1 - len(ans))
    if REVERSE:
        # Reverse the query, e.g., '12+345 ' becomes ' 543+21'. (Note the
        # space used for padding.)
        query = query[::-1]
    questions.append(query)
    expected.append(ans)

print('Total addition questions:', len(questions))

print("\nSample data (first 5 examples):")
print("Question (original) | Question (formatted) | Expected Answer")
print("-" * 60)
for i in range(5):
    # Get the original question by removing padding and reversing if needed
    original_q = questions[i].strip()
    if REVERSE:
        original_q = original_q[::-1]

    print(f"{original_q} | {questions[i]} | {expected[i]}")

# Also show some statistics about the data
print("\nData statistics:")
question_lengths = [len(q.strip()) for q in questions]
answer_lengths = [len(a.strip()) for a in expected]
print(
    f"Average question length: {sum(question_lengths)/len(question_lengths):.2f} characters"
)
print(
    f"Average answer length: {sum(answer_lengths)/len(answer_lengths):.2f} characters"
)
print(f"Shortest question: {min(question_lengths)} characters")
print(f"Longest question: {max(question_lengths)} characters")
print(f"Shortest answer: {min(answer_lengths)} characters")
print(f"Longest answer: {max(answer_lengths)} characters")




In [None]:
print("Vectorization...")
x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool)
y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=bool)
for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, MAXLEN)
for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, DIGITS + 1)

# Display example of vectorization
print("\nExample of vectorization:")
print(f"Original question: '{questions[0]}'")
print(f"One-hot encoded shape: {x[0].shape} (sequence_length, num_characters)")

# Show a small portion of the one-hot encoding for the first example
print("\nFirst few positions of the one-hot encoding:")
sample_indices = min(5, MAXLEN)  # Show first 5 positions or fewer if MAXLEN is smaller
sample_chars = min(5, len(chars))  # Show first 5 characters or fewer if chars is smaller
print("Position | " + " ".join(f"{c:^5}" for c in chars[:sample_chars]) + " ...")
for i in range(sample_indices):
    print(f"{i:^8} | " + " ".join(f"{int(x[0][i][j]):^5}" for j in range(sample_chars)) + " ...")

print(f"\nTotal size of training data: {x.shape[0]} examples")
print(f"Input shape: {x.shape}")
print(f"Output shape: {y.shape}")

In [None]:
# Shuffle (x, y) in unison as the later parts of x will almost all be larger
# digits.
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]
# Explicitly set apart 10% for validation data that we never train over.
split_at = len(x) - len(x) // 10
(x_train, x_val) = x[:split_at], x[split_at:]
(y_train, y_val) = y[:split_at], y[split_at:]

print("Data preparation:")
print(f"- Total examples: {len(x)}")
print(f"- Training examples: {len(x_train)} ({len(x_train)/len(x)*100:.1f}%)")
print(f"- Validation examples: {len(x_val)} ({len(x_val)/len(x)*100:.1f}%)")

# Show a few examples from the training set
print("\nSample training examples (after shuffling):")
print("Question | Expected Answer")
print("-" * 30)
for i in range(3):
    # Decode the one-hot encoded data back to strings
    q = ctable.decode(x_train[i])
    a = ctable.decode(y_train[i])
    # Get the original question by removing padding and reversing if needed
    original_q = q.strip()
    if REVERSE:
        original_q = original_q[::-1]
    print(f"{original_q} | {a.strip()}")

print("\nTraining Data Shape:")
print(f"- Input (x_train): {x_train.shape} (examples, sequence_length, features)")
print(f"- Output (y_train): {y_train.shape} (examples, sequence_length, features)")
print("\nValidation Data Shape:")
print(f"- Input (x_val): {x_val.shape}")
print(f"- Output (y_val): {y_val.shape}")
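
The shuffle-then-split step above can be sketched on plain lists. This is an illustrative version under stated assumptions: the helper name `shuffle_and_split`, its `holdout_denominator` parameter, and the toy data are hypothetical, and it uses the standard `random` module rather than NumPy.

```python
import random


def shuffle_and_split(xs, ys, holdout_denominator=10, seed=0):
    """Shuffle two parallel lists with one shared permutation, then hold out the tail."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)   # one permutation, applied to both lists
    xs = [xs[i] for i in indices]
    ys = [ys[i] for i in indices]
    split_at = len(xs) - len(xs) // holdout_denominator
    return (xs[:split_at], ys[:split_at]), (xs[split_at:], ys[split_at:])


toy_questions = [f"{a}+1" for a in range(20)]
toy_answers = [str(a + 1) for a in range(20)]
(train_q, train_a), (val_q, val_a) = shuffle_and_split(toy_questions, toy_answers)
print(len(train_q), len(val_q))
```

Indexing both lists with the same permutation is what keeps each question aligned with its answer; shuffling the two lists independently would silently destroy the labels.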

In [None]:
# Try replacing LSTM with GRU or SimpleRNN.
RNN = layers.LSTM
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1
print("Building the seq2seq model...")
model = Sequential()

# ENCODER
print("\n1. Encoder:")
print(f"   - Using {RNN.__name__} with {HIDDEN_SIZE} hidden units")
print("   - Processes input sequence and produces a fixed-length representation")
# "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE.
# Note: In a situation where your input sequences have a variable length,
# use input_shape=(None, num_feature).
model.add(RNN(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars))))

# BRIDGE
print("\n2. Bridge:")
print(f"   - RepeatVector creates {DIGITS + 1} copies of the encoder output")
print("   - This connects the encoder to the decoder")
# Feed the encoder's final output to the decoder RNN at every time step.
# Repeat it 'DIGITS + 1' times, as that's the maximum length of the output,
# e.g., when DIGITS=3, the largest sum is 999+999=1998 (four characters).
model.add(layers.RepeatVector(DIGITS + 1))

# DECODER
print("\n3. Decoder:")
print(f"   - Using {LAYERS} layer(s) of {RNN.__name__} with {HIDDEN_SIZE} hidden units")
print("   - Generates the output sequence character by character")
# The decoder RNN could be multiple layers stacked or a single layer.
for _ in range(LAYERS):
    # Setting return_sequences=True returns the output at every time step,
    # with shape (num_samples, timesteps, output_dim), instead of only the
    # last one. This is necessary because the TimeDistributed layer below
    # expects a time-step dimension right after the batch dimension.
    model.add(RNN(HIDDEN_SIZE, return_sequences=True))

# OUTPUT LAYER
print("\n4. Output Layer:")
print(f"   - TimeDistributed Dense layer with softmax activation")
print(
    f"   - Produces probability distribution over {len(chars)} possible characters at each time step"
)
# Apply a dense layer to every temporal slice of the input. For each step
# of the output sequence, decide which character should be chosen.
model.add(layers.TimeDistributed(layers.Dense(len(chars), activation="softmax")))

# COMPILATION
print("\n5. Model Compilation:")
print("   - Loss: categorical_crossentropy (standard for classification tasks)")
print("   - Optimizer: adam (adaptive learning rate)")
print("   - Metric: accuracy (percentage of correctly predicted characters)")
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# MODEL SUMMARY
print("\nModel Architecture Summary:")
model.summary()
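
The parameter counts printed by `model.summary()` can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias) and a dense layer with a bias; the helper names are hypothetical. `RepeatVector` has no trainable weights, so it does not appear in the sum.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units


num_chars, hidden = 12, 128             # len('0123456789+ ') and HIDDEN_SIZE
encoder = lstm_params(num_chars, hidden)
decoder = lstm_params(hidden, hidden)   # the decoder input is the repeated encoder output
output = dense_params(hidden, num_chars)
print(encoder, decoder, output, encoder + decoder + output)
```

Note that the decoder is larger than the encoder even though both use 128 units: its input is the 128-dimensional encoder state rather than a 12-dimensional one-hot vector.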


In [None]:
# Store loss and accuracy values for visualization and early stopping
train_loss=[]
val_loss=[]
train_acc=[]
val_acc=[]
patience=3
min_delta=0.001
val_loss_increase=0
iterations = 2

# TRAINING PARAMETERS
print("\nTraining Parameters:")
print(f"- Batch size: {BATCH_SIZE} examples")
print(f"- Maximum iterations: {iterations} (with early stopping)")
print(f"- Early stopping patience: {patience} iterations")
print(f"- Minimum improvement threshold: {min_delta}")

In [None]:

# Train the model each generation and show predictions against the validation
# dataset.
print("\nStarting training process...")
print("Each iteration represents one epoch (full pass through the training data)")
print("The model will train until validation loss stops improving")
print("After each iteration, we'll show example predictions from the validation set")

for iteration in range(1, iterations + 1):  # inclusive upper bound, so all `iterations` epochs run
    print()
    print("-" * 50)
    print(f"Iteration {iteration}/{iterations}")

    # Train for one epoch
    train_history = model.fit(
        x_train,
        y_train,
        batch_size=BATCH_SIZE,
        epochs=1,
        validation_data=(x_val, y_val),
    )

    # Track metrics
    current_train_loss = train_history.history["loss"][0]
    current_val_loss = train_history.history["val_loss"][0]
    current_train_acc = train_history.history["accuracy"][0]
    current_val_acc = train_history.history["val_accuracy"][0]

    train_loss.append(current_train_loss)
    val_loss.append(current_val_loss)
    train_acc.append(current_train_acc)
    val_acc.append(current_val_acc)

    # Print current metrics
    print(f"Training loss: {current_train_loss:.4f}, accuracy: {current_train_acc:.2%}")
    print(f"Validation loss: {current_val_loss:.4f}, accuracy: {current_val_acc:.2%}")

    # Early stopping check
    if iteration > 1:
        loss_improvement = val_loss[iteration - 2] - val_loss[iteration - 1]
        if loss_improvement < min_delta:
            val_loss_increase += 1
            print(
                f"Validation loss not improving. Patience: {val_loss_increase}/{patience}"
            )
            if val_loss_increase >= patience:
                print("Early stopping triggered - validation loss did not improve")
                break
        else:
            print(f"Validation loss improved by {loss_improvement:.6f}")
            val_loss_increase = 0

    # Show example predictions
    print("\nExample predictions:")
    correct_count = 0
    print("Question | Target | Prediction | Result")
    print("-" * 50)
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
        preds = model.predict(rowx, verbose=0)
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=True)

        # Get the original question by removing padding and reversing if needed
        original_q = q.strip()
        if REVERSE:
            original_q = original_q[::-1]

        result = "✓" if correct.strip() == guess.strip() else "✗"
        if correct.strip() == guess.strip():
            correct_count += 1

        print(f"{original_q:10} | {correct.strip():6} | {guess.strip():10} | {result}")

    print(f"\nAccuracy on sample: {correct_count/10:.0%}")

# Plot training history
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(train_loss, label="Training Loss")
plt.plot(val_loss, label="Validation Loss")
plt.title("Loss over iterations")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(train_acc, label="Training Accuracy")
plt.plot(val_acc, label="Validation Accuracy")
plt.title("Accuracy over iterations")
plt.xlabel("Iteration")
plt.ylabel("Accuracy")
plt.legend()

plt.tight_layout()
plt.show()

print("\nTraining complete!")
print(f"Final validation accuracy: {val_acc[-1]:.2%}")
print(f"Total iterations: {len(train_loss)}")
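
The early-stopping rule used in the loop above can be isolated into a small function and tried on a list of validation losses. This is a sketch of the same rule (compare each epoch with the previous one, count consecutive epochs whose improvement is below `min_delta`, reset on improvement); the function name `epochs_until_stop` and the loss values are made up for illustration.

```python
def epochs_until_stop(val_losses, patience=3, min_delta=0.001):
    """Return the 1-based epoch at which training stops, or None if it never stops."""
    stalled = 0
    for epoch in range(1, len(val_losses)):
        improvement = val_losses[epoch - 1] - val_losses[epoch]
        if improvement < min_delta:
            stalled += 1                 # another epoch without enough improvement
            if stalled >= patience:
                return epoch + 1
        else:
            stalled = 0                  # a real improvement resets the counter
    return None


print(epochs_until_stop([1.0, 0.8, 0.7995, 0.7999, 0.7998, 0.79]))
print(epochs_until_stop([1.0, 0.8, 0.6, 0.4]))
```

Keras also provides a `keras.callbacks.EarlyStopping` callback with `monitor`, `min_delta`, and `patience` arguments. It compares against the best value seen so far rather than the previous epoch, so it is related to, but not identical to, the rule sketched here.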

## Part I(b): Subtraction and Variable-Length Input

This second implementation demonstrates **subtraction** and permits **variable length** inputs.

### Key differences from Part I(a):
 
1. **Operation**: Subtraction instead of addition
2. **Character set**: Includes the minus sign "-" in addition to digits, plus, and space
3. **Digits**: Allows up to 5 digits per number (increased from 3)
4. **No reversal**: Input sequences are not reversed in this implementation

This implementation shows how the same seq2seq architecture can be adapted to learn different mathematical operations with minimal changes to the model structure.
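
A minimal sketch of what one subtraction example looks like, assuming the same padding conventions as the addition task. The helper name `make_subtraction_pair` is hypothetical. It illustrates two points: operand order matters (so `12-345` and `345-12` are different questions), and the answer width of `digits + 1` leaves room for a minus sign.

```python
def make_subtraction_pair(a, b, digits=5):
    """Return (question, answer) strings padded to fixed widths."""
    maxlen = digits + 1 + digits            # e.g. '99999-99999' has 11 characters
    question = f"{a}-{b}".ljust(maxlen)
    answer = str(a - b).ljust(digits + 1)   # room for a minus sign, e.g. '-99999'
    return question, answer


print(make_subtraction_pair(12, 345))
print(make_subtraction_pair(345, 12))
```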



In [None]:

from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, RepeatVector, Dense
import numpy as np


class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """

    def __init__(self, chars):
        """Initialize character table.
        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One hot encode given string C.
        # Arguments
            num_rows: Number of rows in the returned one hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        if calc_argmax:
            x = x.argmax(axis=-1)
        return "".join(self.indices_char[i] for i in x)


__training_size = 50000
__digits = 5
__hidden_size = 128
__batch_size = 128
maxlen = __digits + 1 + __digits

# All the digits, the plus and minus signs, and space for padding.
chars = "0123456789+- "
ctable = CharacterTable(chars)

questions = []
expected = []
seen = set()
print("Generating data...")
while len(questions) < __training_size:
    f = lambda: int(
        "".join(
            np.random.choice(list("0123456789"))
            for i in range(np.random.randint(1, __digits + 1))
        )
    )
    a, b = f(), f()
    # Skip any subtraction questions we've already seen.
    # Subtraction is not commutative (a-b != b-a), so the key keeps the
    # operand order instead of sorting.
    key = (a, b)
    if key in seen:
        continue
    seen.add(key)
    # Pad the data with spaces such that it is always MAXLEN.
    q = "{}-{}".format(a, b)
    query = q + " " * (maxlen - len(q))
    ans = str(a - b)
    # Answers have at most __digits + 1 characters: the digits plus a
    # possible minus sign.
    ans += " " * (__digits + 1 - len(ans))

    questions.append(query)
    expected.append(ans)

print("Total subtraction questions:", len(questions))

print("\nSample subtraction data (first 5 examples):")
print("Question | Expected Answer")
print("-" * 40)
for i in range(5):
    print(f"{questions[i].strip()} | {expected[i].strip()}")

# Add data statistics
print("\nSubtraction Data Statistics:")
question_lengths = [len(q.strip()) for q in questions]
answer_lengths = [len(a.strip()) for a in expected]
print(
    f"Average question length: {sum(question_lengths)/len(question_lengths):.2f} characters"
)
print(
    f"Average answer length: {sum(answer_lengths)/len(answer_lengths):.2f} characters"
)
print(f"Shortest question: {min(question_lengths)} characters")
print(f"Longest question: {max(question_lengths)} characters")

print("Vectorization...")
x = np.zeros((len(questions), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(questions), __digits + 1, len(chars)), dtype=bool)
for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, maxlen)
for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, __digits + 1)

indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

split_at = len(x) - len(x) // 10
(x_train, x_val) = x[:split_at], x[split_at:]
(y_train, y_val) = y[:split_at], y[split_at:]

model = Sequential()
model.add(LSTM(__hidden_size, input_shape=(maxlen, len(chars))))
model.add(RepeatVector(__digits + 1))
model.add(LSTM(__hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(len(chars), activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

print("\nComparison with Addition Model:")
print("1. Operation: Subtraction instead of addition")
print(f"2. Character set: '{chars}' (includes minus sign)")
print(f"3. Maximum digits: {__digits} (increased from {DIGITS})")
print("4. No input reversal in this implementation")

for iteration in range(1, 50):
    print()
    print("-" * 50)
    print("Iteration", iteration)
    model.fit(
        x_train,
        y_train,
        batch_size=__batch_size,
        epochs=1,
        validation_data=(x_val, y_val),
    )

    # Show only three examples per iteration with better formatting
    print("\nExample predictions:")
    correct_count = 0
    print("Question | Expected | Prediction | Result")
    print("-" * 50)
    for i in range(3):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
        preds = model.predict(rowx, verbose=0)
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=True)

        # Clean up the strings by removing padding
        q_clean = q.strip()
        correct_clean = correct.strip()
        guess_clean = guess.strip()

        result = "✓" if correct_clean == guess_clean else "✗"
        if correct_clean == guess_clean:
            correct_count += 1

        print(f"{q_clean:12} | {correct_clean:8} | {guess_clean:10} | {result}")

    print(f"\nAccuracy on sample: {correct_count/3:.0%}")
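
The `accuracy` metric reported by `model.fit` is computed per character (as noted in the Part I(a) compilation cell), while the table above counts an answer as correct only if the whole string matches. The sketch below shows the difference; the helper names are hypothetical and the predictions are toy values invented purely for illustration.

```python
def char_accuracy(targets, predictions):
    """Fraction of character positions that match, padding included."""
    pairs = [(t, p) for target, pred in zip(targets, predictions)
             for t, p in zip(target, pred)]
    return sum(t == p for t, p in pairs) / len(pairs)


def exact_match_accuracy(targets, predictions):
    """Fraction of answers in which every character matches."""
    return sum(t == p for t, p in zip(targets, predictions)) / len(targets)


targets = ["-333  ", "333   ", "1200  ", "7     "]
predictions = ["-333  ", "334   ", "1200  ", "7     "]   # toy values, one wrong digit
print(char_accuracy(targets, predictions), exact_match_accuracy(targets, predictions))
```

A single wrong digit barely moves the per-character figure but costs a full answer under exact matching, which is why the two numbers can look quite different during training.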


## Part I(c): Bonus - the Douglas R. Hofstadter `pq`-system as a `seq2seq` task

For those familiar with Dr. Douglas Hofstadter's masterpiece [**Gödel, Escher, Bach: an Eternal Golden Braid (GEB)**](https://www.physixfan.com/wp-content/files/GEBen.pdf), the `pq`-system that Hofstadter leverages heavily throughout the book can also serve as an example of a `seq2seq` task.

More formally, the `pq`-system has only three distinct symbols: `p`, `q`, and `-`. These are combined to generate statements (theorems) of the system, such as:

`--p---q-----`

`-p-q--`

`----------p-q-----------`

When you look at these example strings, can you *discern a meaning* for what the symbols `p`, `q`, and `-` stand for? As humans, we might study a large number of these statements and hope to deduce a rule that allows us to generate new, valid statements within this system.

### Excerpt from GEB (Chapter II: Isomorphisms Induce Meaning):

> Perhaps you have already thought to yourself that the `pq`-theorems are like additions. The string `--p---q-----` is a theorem because 2 plus 3 equals 5. It could even occur to you that the theorem `--p---q-----` is a statement, written in an odd notation, whose meaning is that **2 plus 3 is 5**. Is this a reasonable way to look at things? Well, I deliberately chose 'p' to remind you of 'plus', and 'q' to remind you of 'equals' . . . So, does the string `--p---q-----` actually mean "2 plus 3 equals 5"?

Aside: GEB is **strongly recommended** to those with deep interests at the intersection of *mathematics, artificial intelligence, philosophy, cognition, musical theory, and the arts.*

For the purposes of understanding the utility of `seq2seq` for solving arbitrary **string translation** tasks, this is precisely what a machine learning algorithm must do.

**Presented with thousands of examples of valid query strings and their targets, the model learns an internal representation that allows it to correctly map the meaning of an assembly of symbols into an alternative and valid representation.**

---

Similar to the examples above that generate example mathematical strings and have the model learn to "translate" that input string into its resulting output, we will generate pairs of strings valid in Hofstadter's `pq`-system:

**Example:** Input `x="---p--"` with target `y="q-----"`
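
A minimal sketch of how such pairs can be produced and checked, assuming the "dash count" reading of the system quoted above (a string `x p y q z` is a theorem when the dashes in `x` and `y` together equal the dashes in `z`). The helper names `make_pq_pair` and `is_pq_theorem` are hypothetical and are not used by the tutorial code.

```python
def make_pq_pair(a, b):
    """Return (input, target) for the statement 'a plus b' in pq notation."""
    return "-" * a + "p" + "-" * b, "q" + "-" * (a + b)


def is_pq_theorem(s):
    """True if s has the form x p y q z with len(x) + len(y) == len(z), all dashes."""
    if s.count("p") != 1 or s.count("q") != 1 or set(s) - set("-pq"):
        return False
    x, rest = s.split("p")
    if "q" not in rest:                   # 'q' must come after 'p'
        return False
    y, z = rest.split("q")
    return len(x) > 0 and len(y) > 0 and len(x) + len(y) == len(z)


x, y = make_pq_pair(3, 2)
print(x, y, is_pq_theorem(x + y))
```

Concatenating an input with its target always yields a valid theorem, so the model's task amounts to completing the right-hand side of the statement.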





In [None]:
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, RepeatVector, Dense
import numpy as np

class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """
    def __init__(self, chars):
        """Initialize character table.
        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One hot encode given string C.
        # Arguments
            num_rows: Number of rows in the returned one hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[i] for i in x)

training_size = 50000
digits = 1000 # A "digit" here is a dash and we need to allow up to 1000 in length
hidden_size = 128
batch_size = 128
maxlen = digits + 1 + digits

# The dash for numbers, p for plus sign, q for equals and space for padding.
chars = '-pq '
ctable = CharacterTable(chars)

questions = []
expected = []
seen = set()
print('Generating data...')
while len(questions) < training_size:
    f = lambda: ''.join('-' for i in range(np.random.randint(1, digits + 1)))
    a, b = f(), f()
    # Skip any addition questions we've already seen.
    # Also skip any such that x+y == y+x (hence the sorting).
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    # Pad the data with spaces such that it is always MAXLEN.
    q = '{}p{}'.format(a, b)
    query = q + ' ' * (maxlen - len(q))
    ans = 'q' + '-' * (len(a) + len(b))
    # Answers have at most 1 + 2 * digits characters ('q' plus the dashes),
    # which equals maxlen.
    ans += ' ' * (maxlen - len(ans))

    questions.append(query)
    expected.append(ans)
    
print('Total addition questions:', len(questions))

In [None]:
print('Vectorization...')
x = np.zeros((len(questions), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(questions), maxlen, len(chars)), dtype=bool)

print('Vectorizing questions...')
for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, maxlen)
print('Vectorizing answers...')
for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, maxlen)

indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

print('Splitting into train and validation sets...')
split_at = len(x) - len(x) // 10
(x_train, x_val) = x[:split_at], x[split_at:]
(y_train, y_val) = y[:split_at], y[split_at:]

print(f'Size train: {len(x_train)}\tSize val: {len(x_val)}\nFirst train input: {x_train[0]}\nFirst train answer: {y_train[0]}')
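
With `digits = 1000`, the one-hot arrays built above are large, so it is worth estimating their size before running the cell. The sketch below is plain arithmetic with a hypothetical helper name, `one_hot_bytes`; it assumes one byte per element, which is what NumPy uses for `dtype=bool`.

```python
def one_hot_bytes(num_examples, seq_len, num_chars, bytes_per_element=1):
    """Size in bytes of a dense (num_examples, seq_len, num_chars) array."""
    return num_examples * seq_len * num_chars * bytes_per_element


digits = 1000
maxlen = digits + 1 + digits
size = one_hot_bytes(50000, maxlen, 4)
print(f"{size / 1e6:.1f} MB per array, {2 * size / 1e6:.1f} MB for x and y together")
```

If the arrays are later converted to 32-bit floats, the footprint would be four times larger, so reducing `digits` or `training_size` is a reasonable first step on a memory-constrained runtime.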

In [None]:
model = Sequential()
model.add(LSTM(hidden_size, input_shape=(maxlen, len(chars))))
model.add(RepeatVector(maxlen))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(len(chars), activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

for iteration in range(1, 50):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=1,
              validation_data=(x_val, y_val))

    # Show only three examples per iteration
    for i in range(3):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
        preds = model.predict(rowx, verbose=0)
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=True)
        print('Q', q)
        print('T', correct)
        if correct == guess:
            print('☑')
        else:
            print('☒')
        print(guess)

# Takeaway Messages
* The canonical example of a `seq2seq` learning task is **language translation**: a sequence representing a sentence in one language is encoded into a latent space (an embedded representation) and then decoded into another language.
* In translation, the **input of characters of variable length** and from a **given alphabet** needs to be converted into an **output also variable in length** and possibly from an altogether **different alphabet**.
* `seq2seq` models generally require **massive amounts** of data to effectively learn their task.