# Character-level recurrent sequence-to-sequence translation model

**Author:** Tsz Lung<br>
**Date created:** 2021/09/29<br>
**Last modified:** 2021/09/29<br>
**Description:** Character-level recurrent sequence-to-sequence translation model.

## Introduction

This example demonstrates how to implement a basic character-level
recurrent sequence-to-sequence model. We apply it to translating
single word with accent into single word with standard phonic blending,
character-by-character. Note that it is fairly unusual to
do character-level machine translation, as word-level
models are more common in this domain.

**Summary of the algorithm**

- We start with input sequences from a domain (e.g. single word)
    and corresponding target sequences from another domain
    (e.g. single word ).
- An encoder LSTM turns input sequences to 2 state vectors
    (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    It uses as initial state the state vectors from the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.


## Setup


In [50]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import re
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

## Download the data


In [51]:
#read training data
input_data_list = []
output_data_list = []
with open('wordfun_phonics.txt') as f:
    lines = f.readlines()
    
count = 0    
for line in lines:
    count += 1
    #print(f'line {count}: {line}')    
    input_output = re.split('; |, |\n',line)
    #output = input_output[0]
    #print(output)
    wordCount =0
    output_word = ''
    for word in input_output:
        if wordCount == 0:
            output_word = word
        else:
            if word != '':
                input_data_list.append(word)
                output_data_list.append(output_word)
        
        wordCount += 1
        
print(f'size of training data {len(input_data_list)}')

label_encoder = LabelEncoder()
y_encoded_0 = label_encoder.fit_transform(output_data_list)
y_encoded = to_categorical(y_encoded_0)

total_output_classes = len(label_encoder.classes_)
print(f'total_output_classes {total_output_classes}')
print(f'total y_encoded {len(y_encoded)}')

#generate dummy data by duplicate current data
for class_label in label_encoder.classes_:
    for index in range(50):
        input_data_list.append(class_label)
        output_data_list.append(class_label)

y_encoded_0 = label_encoder.fit_transform(output_data_list)
y_encoded = to_categorical(y_encoded_0)

print(f'size of training data  with dummy data {len(input_data_list)}')
print(f'total y_encoded {len(y_encoded)}')

size of training data 739
total_output_classes 278
total y_encoded 739
size of training data  with dummy data 14639
total y_encoded 14639


## Configuration


In [52]:
batch_size = 64  # Batch size for training.
epochs = 500  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 20000  # Number of samples to train on.
# Path to the data txt file on disk.
# data_path = "yue.txt"


## Prepare the data


In [68]:
# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
# with open(data_path, "r", encoding="utf-8") as f:
#     lines = f.read().split("\n")
# for line in lines[: min(num_samples, len(lines) - 1)]:
#     input_text, target_text, _ = line.split("\t")
#     # We use "tab" as the "start sequence" character
#     # for the targets, and "\n" as "end sequence" character.
#     target_text = "\t" + target_text + "\n"
#     input_texts.append(input_text)
#     target_texts.append(target_text)
#     print(input_text)
#     print(target_text)
#     for char in input_text:
#         if char not in input_characters:
#             input_characters.add(char)
#     for char in target_text:
#         if char not in target_characters:
#             target_characters.add(char)
#read training data
# input_data_list = []
# output_data_list = []
with open('wordfun_phonics.txt') as f:
    lines = f.readlines()
    
count = 0    
for line in lines[: min(num_samples, len(lines) - 1)]:
    count += 1
#     print(f'line {count}: {line}')    
    input_output = re.split('; |, |\n',line)
#     output = input_output[0]
#     print(output)
    wordCount =0
    output_word = ''
    target_text = ''
    for input_text in input_output:
        
        if wordCount == 0:
            target_text = "\t" + input_text.upper() + "\n"
        else:
            if input_text != '':
                input_texts.append(input_text.upper())
#                 temp_target_text = "\t" + target_text + "\n"
                target_texts.append(target_text)
        
        wordCount += 1
    
        for char in input_text.upper():
            if char not in input_characters:
                input_characters.add(char)
        for char in target_text.upper():
            if char not in target_characters:
                target_characters.add(char)
        
print(f'size of training data {len(input_texts)}')


label_encoder = LabelEncoder()
y_encoded_0 = label_encoder.fit_transform(target_texts)
y_encoded = to_categorical(y_encoded_0)

total_output_classes = len(label_encoder.classes_)
print(f'total_output_classes {total_output_classes}')
print(f'total y_encoded {len(y_encoded)}')

#generate dummy data by duplicate current data
for class_label in label_encoder.classes_:
    for index in range(50):
        temp_input = re.split('\t|\n',class_label)
        input_texts.append(temp_input[1])
        target_texts.append(class_label)

y_encoded_0 = label_encoder.fit_transform(output_data_list)
y_encoded = to_categorical(y_encoded_0)

print(f'size of training data  with dummy data {len(input_data_list)}')
print(f'total y_encoded {len(y_encoded)}')


print(input_characters)
print(target_characters)


input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])



print(input_characters)
print(target_characters)

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    # decoder_input_data[i, t + 1 :, target_token_index[' ']] = 1.0
    # decoder_target_data[i, t:, target_token_index[' ']] = 1.0


size of training data 738
total_output_classes 277
total y_encoded 738
size of training data  with dummy data 14639
total y_encoded 14639
{'N', 'Y', ';', 'R', 'O', 'F', 'I', 'E', 'H', 'A', 'D', ' ', 'W', 'C', 'U', 'L', 'X', 'P', 'Z', 'S', 'K', 'J', 'Q', 'M', 'G', 'V', 'B', 'T'}
{'N', 'Y', 'R', 'O', 'F', 'I', 'E', 'H', 'A', '\n', 'D', ' ', 'W', 'C', 'U', 'L', 'X', 'P', '\t', 'Z', 'S', 'K', 'J', 'Q', 'M', 'G', 'V', 'B', 'T'}
[' ', ';', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
['\t', '\n', ' ', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
Number of samples: 14588
Number of unique input tokens: 28
Number of unique output tokens: 29
Max sequence length for inputs: 15
Max sequence length for outputs: 15


In [69]:
target_texts

['\tRED\n',
 '\tRED\n',
 '\tRED\n',
 '\tRED\n',
 '\tGREEN\n',
 '\tGREEN\n',
 '\tGREEN\n',
 '\tBLUE\n',
 '\tBLUE\n',
 '\tWHITE\n',
 '\tWHITE\n',
 '\tWHITE\n',
 '\tWHITE\n',
 '\tBLACK\n',
 '\tBLACK\n',
 '\tORANGE\n',
 '\tORANGE\n',
 '\tPURPLE\n',
 '\tPURPLE\n',
 '\tPURPLE\n',
 '\tYELLOW\n',
 '\tPINK\n',
 '\tPINK\n',
 '\tPINK\n',
 '\tCIRCLE\n',
 '\tCIRCLE\n',
 '\tCIRCLE\n',
 '\tSQUARE\n',
 '\tSQUARE\n',
 '\tSQUARE\n',
 '\tSQUARE\n',
 '\tSTAR\n',
 '\tSTAR\n',
 '\tSTAR\n',
 '\tTRIANGLE\n',
 '\tTRIANGLE\n',
 '\tRECTANGLE\n',
 '\tRECTANGLE\n',
 '\tRECTANGLE\n',
 '\tHEART\n',
 '\tHEART\n',
 '\tHEART\n',
 '\tARROW\n',
 '\tARROW\n',
 '\tARROW\n',
 '\tARROW\n',
 '\tCROSS\n',
 '\tCROSS\n',
 '\tDIAMOND\n',
 '\tDIAMOND\n',
 '\tLEFT\n',
 '\tLEFT\n',
 '\tLEFT\n',
 '\tRIGHT\n',
 '\tRIGHT\n',
 '\tUP\n',
 '\tUP\n',
 '\tUP\n',
 '\tDOWN\n',
 '\tDOWN\n',
 '\tDOWN\n',
 '\tDOWN\n',
 '\tDOWN\n',
 '\tDOWN\n',
 '\tFORWARDS\n',
 '\tFORWARDS\n',
 '\tFORWARDS\n',
 '\tFORWARDS\n',
 '\tFORWARDS\n',
 '\tFORWARDS\n',
 

## Build the model


In [70]:
# Define an input sequence and process it.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))

# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = keras.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)


## Train the model


In [71]:
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)



Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500


Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epoch 78/500
Epoch 79/500
Epoch 80/500
Epoch 81/500
Epoch 82/500
Epoch 83/500
Epoch 84/500
Epoch 85/500
Epoch 86/500
Epoch 87/500
Epoch 88/500
Epoch 89/500
Epoch 90/500
Epoch 91/500
Epoch 92/500
Epoch 93/500
Epoch 94/500
Epoch 95/500
Epoch 96/500
Epoch 97/500
Epoch 98/500
Epoch 99/500
Epoch 100/500
Epoch 101/500
Epoch 102/500
Epoch 103/500
Epoch 104/500
Epoch 105/500
Epoch 106/500
Epoch 107/500
Epoch 108/500
Epoch 109/500
Epoch 110/500
Epoch 111/500
Epoch 112/500


Epoch 113/500
Epoch 114/500
Epoch 115/500
Epoch 116/500
Epoch 117/500
Epoch 118/500
Epoch 119/500
Epoch 120/500
Epoch 121/500
Epoch 122/500
Epoch 123/500
Epoch 124/500
Epoch 125/500
Epoch 126/500
Epoch 127/500
Epoch 128/500
Epoch 129/500
Epoch 130/500
Epoch 131/500
Epoch 132/500
Epoch 133/500
Epoch 134/500
Epoch 135/500
Epoch 136/500
Epoch 137/500
Epoch 138/500
Epoch 139/500
Epoch 140/500
Epoch 141/500
Epoch 142/500
Epoch 143/500
Epoch 144/500
Epoch 145/500
Epoch 146/500
Epoch 147/500
Epoch 148/500
Epoch 149/500
Epoch 150/500
Epoch 151/500
Epoch 152/500
Epoch 153/500
Epoch 154/500
Epoch 155/500
Epoch 156/500
Epoch 157/500
Epoch 158/500
Epoch 159/500
Epoch 160/500
Epoch 161/500
Epoch 162/500
Epoch 163/500
Epoch 164/500
Epoch 165/500
Epoch 166/500
Epoch 167/500
Epoch 168/500


Epoch 169/500
Epoch 170/500
Epoch 171/500
Epoch 172/500
Epoch 173/500
Epoch 174/500
Epoch 175/500
Epoch 176/500
Epoch 177/500

KeyboardInterrupt: 

Save model

In [72]:
# Save model
model.save("alphabet2alphabet")



INFO:tensorflow:Assets written to: alphabet2alphabet/assets


INFO:tensorflow:Assets written to: alphabet2alphabet/assets


## Run inference (sampling)

1. encode input and retrieve initial decoder state
2. run one step of decoder with this initial state
and a "start of sequence" token as target.
Output will be the next target token.
3. Repeat with the current target token and current states


In [73]:
# Define sampling models
# Restore the model and construct the encoder and decoder.
#model = keras.models.load_model("alphabet2alphabet")

encoder_inputs = model.input[0]  # input_1
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output  # lstm_1
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]  # input_2
decoder_state_input_h = keras.Input(shape=(latent_dim,))
decoder_state_input_c = keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if sampled_char == "\n" or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        # Update states
        states_value = [h, c]
    return decoded_sentence



You can now generate decoded sentences as such:


In [74]:
while(True):
    i=0
    input_data = input("Enter your word: ")
    encoder_input_data_1 = np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype="float32")
    for t, char in enumerate(input_data.upper()):
        # print(char)
        # print(t)
        encoder_input_data_1[i, t, input_token_index[char]] = 1.0
    encoder_input_data_1[i, t + 1 :, input_token_index[" "]] = 1.0
    
    # print(encoder_input_data_1[0,1])
    input_seq = encoder_input_data_1[0 :  1]
    # print(input_seq)
    decoded_sentence = decode_sequence(input_seq)
    # decoded_sentence = decode_sequence(encoder_input_data_1[0:1])
    # print("-")
    #print("Input sentence:", input_texts[seq_index])
    decoded_output = re.split('\t|, |\n',decoded_sentence)
    
#     decoded_sentence = ''
#     for ouput in decoded_output:
#         if output != '':
#             decoded_sentence = decoded_sentence + output
    print("Decoded sentence:", decoded_output)


Enter your word: red
Decoded sentence: ['RED', '']


KeyboardInterrupt: Interrupted by user

In [75]:
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    # print(encoder_input_data[seq_index : seq_index + 1])
    input_seq = encoder_input_data[seq_index : seq_index + 1]
#     print(input_seq)
    decoded_sentence = decode_sequence(input_seq)
    print("-")
    print("Input sentence:", input_texts[seq_index])
    print("Decoded sentence:", decoded_sentence)


-
Input sentence: RED
Decoded sentence: RED

-
Input sentence: GLAD
Decoded sentence: RED

-
Input sentence: READ
Decoded sentence: RAD

-
Input sentence: RAT
Decoded sentence: RED

-
Input sentence: GREEN
Decoded sentence: GREEN

-
Input sentence: CLEAN
Decoded sentence: GREEN

-
Input sentence: GREAT
Decoded sentence: GREEN

-
Input sentence: BLUE
Decoded sentence: BLUE

-
Input sentence: HELLO
Decoded sentence: HEAR

-
Input sentence: WHITE
Decoded sentence: WHITE

-
Input sentence: WHY
Decoded sentence: WHITE

-
Input sentence: WHAT
Decoded sentence: WHIT

-
Input sentence: LIKE
Decoded sentence: BIKE

-
Input sentence: BLACK
Decoded sentence: BLACK

-
Input sentence: BREAK
Decoded sentence: BLACK

-
Input sentence: ORANGE
Decoded sentence: ORANGE

-
Input sentence: RAIN
Decoded sentence: VAN

-
Input sentence: PURPLE
Decoded sentence: PURPLE

-
Input sentence: CALYPSO
Decoded sentence: PURPLE

-
Input sentence: HELP
Decoded sentence: HILL

-
Input sentence: YELLOW
Decoded sentence

In [31]:
len(input_texts)

738