# English to Cherokee

# Abstract

We set out to build an English-to-Cherokee translation model using NLP techniques. Cherokee, or Tsalagi, is an endangered-to-moribund Iroquoian language and the native language of the Cherokee people. Because the number of speakers is declining, we wanted a model that makes it easier to translate between the two languages. Note that Cherokee is not written in the Latin alphabet used for English but in its own syllabary (ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ), which Sequoyah invented in the 1810s and 1820s. The characters used to represent the Cherokee language are covered by the Unicode blocks U+13A0 to U+13FF (uppercase letters plus six lowercase letters) and U+AB70 to U+ABBF (the rest of the lowercase letters).
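
As a small illustration of the two Unicode ranges mentioned above, the following sketch checks how much of a string is written in Cherokee characters. The helper names `is_cherokee_char` and `cherokee_ratio` are hypothetical and are not used elsewhere in this notebook.

```python
# Code point ranges quoted in the abstract above.
CHEROKEE_RANGES = ((0x13A0, 0x13FF), (0xAB70, 0xABBF))

def is_cherokee_char(ch):
    """Return True if the single character ch falls in either Cherokee range."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in CHEROKEE_RANGES)

def cherokee_ratio(text):
    """Fraction of non-whitespace characters in text that are Cherokee."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_cherokee_char(ch) for ch in chars) / len(chars)

print(cherokee_ratio("ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ"))
print(cherokee_ratio("Cherokee"))
```

A check like this could be used to spot corpus lines where the English and Cherokee sides were swapped, though we did not run such a check here.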

To this end, our group tried three approaches. After creating a dataloader for English and Cherokee sentences, we built three models: a simple RNN, an encoder-decoder model, and a transformer model. We were unable to get the first and second models fully working, but the third model runs and produces Cherokee sentences from English input.

## Milestone Information

### Team Members:
Rithvik Doshi, Saisriram Gunturu, Ruihang Liu

### Project Description

We aim to build a model that translates English text to Cherokee. Because Cherokee is an endangered language, we want to see how well the machine translation models we learned about in class perform on it.

### Approach
We'll use the following data sources:
- https://github.com/ZhangShiyue/ChrEn/tree/main/data
- https://github.com/CherokeeLanguage/CherokeeEnglishCorpus/tree/master/corpus.aligned/en_chr

Additionally, we will experiment with one or more of the following architectures and approaches to find the best way to translate from English to Cherokee:
- https://github.com/lukysummer/Machine-Translation-Seq2Seq-Keras/tree/master/data
- https://medium.com/@patrickhk/use-keras-to-build-a-english-to-french-translator-with-various-rnn-model-architecture-a374
- https://github.com/LaurentVeyssier/Machine-translation-English-French-with-Deep-neural-Network/blob/main/machine_translation.ipynb
- https://arxiv.org/pdf/2010.04791v1.pdf

### Project Plan:

The project will consist of the following phases:
1. EDA / Data Loading

    a. Concatenate as many data sources as possible to get as large a corpus as we can

    b. Split data into training, testing and validation sets
2. Model Development

    a. Finalize model selection and architecture and build in PyTorch
    
3. Model Training

Run the cell below in Colab the first time you run this notebook:

In [None]:
!pip install keras --upgrade
!pip install Keras-Preprocessing



# EDA

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os

data_dir = "/content/drive/MyDrive/Senior/CS505/Project/corpus.aligned/en_chr" # for Ruihang
# data_dir = "/content/drive/MyDrive/Colab Notebooks/corpus.aligned/en_chr"

In [None]:
def load_input_for(language=".en"):
    """
    Load the input for the given language
    :param language: the language of the input (".en" for English and ".chr" for Cherokee)
    :return: the input
    """
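    # Get all files with the requested extension. Sort them so that the English
    # and Cherokee files are read in the same order (this assumes paired files
    # share a base name), since os.listdir() returns entries in arbitrary order.
    file_list = sorted(file for file in os.listdir(data_dir) if file.endswith(language))

    # Initialize the empty array for the input
    lines_array = []    # structure: [lines in the document]

    for file in file_list:
        file_path = os.path.join(data_dir, file)
        # Read as UTF-8 explicitly so the Cherokee syllabary decodes correctly
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                lines_array.append(line.strip())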

    return lines_array

In [None]:
english_sentences = load_input_for(".en")
cherokee_sentences = load_input_for(".chr")

In [None]:
print(len(english_sentences), len(cherokee_sentences))  # should match

107168 107168


## Data preprocessing

### Tokenizer:

In [None]:
from keras_preprocessing.text import Tokenizer

def tokenize(x):
    x_tk = Tokenizer()
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

In [None]:
# Test our tokenize()
test_text = ["In the beginning God created the heavens and the earth.",
             "And God said, Let there be light: and there was light."]  # just 2 short sentences from our data
test_text_tokenized, test_tokenizer = tokenize(test_text)

test_text_tokenized

[[6, 1, 7, 3, 8, 1, 9, 2, 1, 10], [2, 3, 11, 12, 4, 13, 5, 2, 4, 14, 5]]

In [None]:
test_tokenizer.word_index

{'the': 1,
 'and': 2,
 'god': 3,
 'there': 4,
 'light': 5,
 'in': 6,
 'beginning': 7,
 'created': 8,
 'heavens': 9,
 'earth': 10,
 'said': 11,
 'let': 12,
 'be': 13,
 'was': 14}

We can see that Keras already lowercases the text and strips punctuation, so we don't have to do that ourselves.
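
To make the behaviour above concrete, here is a minimal pure-Python sketch that reproduces the same word index and id sequences for the two test sentences. It is not the Keras implementation: the `FILTERS` string is assumed to match the Tokenizer's default `filters` argument, and the helper names are hypothetical.

```python
from collections import Counter

# Characters assumed to be filtered out by default by the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on spaces."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return [w for w in text.lower().translate(table).split(" ") if w]

def build_word_index(texts):
    """Rank words by frequency (ties keep first-seen order); ids start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Map each text to a list of ids, skipping unknown words."""
    return [[word_index[w] for w in split_words(t) if w in word_index]
            for t in texts]

sample = ["In the beginning God created the heavens and the earth.",
          "And God said, Let there be light: and there was light."]
index = build_word_index(sample)
print(texts_to_ids(sample, index))
```

Because ids start at 1, id 0 stays free for padding, which matters for the size of any embedding layer built on top of this vocabulary.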

Apply tokenizer on our input data:

In [None]:
english_sentences_tokenized, english_tokenizer = tokenize(english_sentences)
cherokee_sentences_tokenized, cherokee_tokenizer = tokenize(cherokee_sentences)

In [None]:
english_vocab_size = len(english_tokenizer.word_index)
cherokee_vocab_size = len(cherokee_tokenizer.word_index)
print("English vocab size = {}, Cherokee vocab size = {}".format(english_vocab_size, cherokee_vocab_size))

English vocab size = 20763, Cherokee vocab size = 72759


### Padding
Pad all sentences to the same length for our input: pad each one to the maximum length with trailing zeros (`padding='post'`).

In [None]:
from keras_preprocessing.sequence import pad_sequences
def pad(x):
    length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

In [None]:
# testing padding function:
test_text_padded = pad(test_text_tokenized)
test_text_padded

array([[ 6,  1,  7,  3,  8,  1,  9,  2,  1, 10,  0],
       [ 2,  3, 11, 12,  4, 13,  5,  2,  4, 14,  5]], dtype=int32)
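
The padding step can be sketched in plain Python as follows. The optional `max_len` cap is an assumption added for illustration (the `pad` function above always pads to the longest sentence), and this sketch keeps the first `max_len` tokens, whereas `pad_sequences` truncates from the front by default.

```python
def pad_post(sequences, max_len=None, pad_id=0):
    """Right-pad each sequence with pad_id; optionally cap the length at max_len."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_len])  # keep the first max_len tokens
        padded.append(clipped + [pad_id] * (max_len - len(clipped)))
    return padded

tokens = [[6, 1, 7, 3, 8, 1, 9, 2, 1, 10],
          [2, 3, 11, 12, 4, 13, 5, 2, 4, 14, 5]]
print(pad_post(tokens))             # pad to the longest sequence
print(pad_post(tokens, max_len=4))  # cap every sequence at 4 tokens
```

Padding to the corpus-wide maximum makes every row as long as the single longest sentence, so a cap may be worth considering for a corpus with a few very long lines.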

In [None]:
# Apply padding to input:
english_sentences_padded = pad(english_sentences_tokenized)
cherokee_sentences_padded = pad(cherokee_sentences_tokenized)

### Write function to map logits back to token label
A function that converts predictions (an array of per-token probabilities) back into a sentence.

In [None]:
import numpy as np

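def logits_to_text(logits, tokenizer):
    # Invert word_index (word -> id) so predicted ids can be mapped back to words
    idx_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    idx_to_words[0] = '<PAD>'  # id 0 is reserved for padding
    return ' '.join([idx_to_words[prediction] for prediction in np.argmax(logits, 1)])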
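
The same idea on a toy example, without NumPy: pick the highest-scoring id at each position and look up its word. The three-word vocabulary and the scores below are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def scores_to_text(scores, word_index):
    """Pick the best-scoring id at each position and join the matching words."""
    id_to_word = {idx: word for word, idx in word_index.items()}
    id_to_word[0] = "<PAD>"
    return " ".join(id_to_word[argmax(row)] for row in scores)

toy_index = {"the": 1, "and": 2, "god": 3}  # hypothetical 3-word vocabulary
toy_scores = [[0.1, 0.2, 0.6, 0.1],         # position 0 -> id 2 ("and")
              [0.1, 0.1, 0.1, 0.7],         # position 1 -> id 3 ("god")
              [0.9, 0.0, 0.1, 0.0]]         # position 2 -> id 0 (padding)
print(scores_to_text(toy_scores, toy_index))  # and god <PAD>
```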

### Make Dataloader

In [None]:
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

class Basic_Dataset(Dataset):

    def __init__(self, X,Y):
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    # return a pair x,y at the index idx in the data set
    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]


In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(english_sentences_padded, cherokee_sentences_padded, test_size=0.2, random_state=42)

# Split the train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

train_dataset = Basic_Dataset(X_train, y_train)
val_dataset = Basic_Dataset(X_val, y_val)
test_dataset = Basic_Dataset(X_test, y_test)

In [None]:
# For torch models:
from torch.utils.data import DataLoader

batch_size = 128

# Data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# Or a loader with all data:
all_loader = DataLoader(Basic_Dataset(english_sentences_padded, cherokee_sentences_padded), batch_size=batch_size, shuffle=True)

# First model (simple RNN; not working locally or on Colab):

In [None]:
print(len(X_train), len(y_train))
print(len(X_val), len(y_val))

68587 68587
17147 17147


In [None]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

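# Define the model architecture
# NOTE: tokenizer.word_index is 1-based and 0 is the padding id, so input_dim
# most likely needs to be english_vocab_size + 1.
model = Sequential()
model.add(Embedding(input_dim=english_vocab_size, output_dim=100))
# NOTE: LSTM(units=128) returns only its final state, so the model emits one
# prediction per sentence, while y_train holds one label per target position.
# This shape mismatch is a likely cause of the ValueError below.
model.add(LSTM(units=128))
model.add(Dense(units=cherokee_vocab_size, activation='softmax'))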

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

Epoch 1/10


ValueError: ignored

# Second model (encoder-decoder, not working):

In [None]:
import random  # used in Seq2Seq.forward for teacher forcing

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import torch.nn as nn
import torch.optim as optim

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers)

    def forward(self, input_seq, input_lengths, hidden=None):
        embedded = self.embedding(input_seq)
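        # NOTE: batches from the DataLoader are batch-first and their lengths are
        # unsorted, while pack_padded_sequence defaults to batch_first=False and
        # enforce_sorted=True. This is a likely source of the RuntimeError below.
        packed = pack_padded_sequence(embedded, input_lengths)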
        outputs, hidden = self.gru(packed, hidden)
        outputs, _ = pad_packed_sequence(outputs)
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers=1):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_seq, hidden):
        embedded = self.embedding(input_seq)
        output, hidden = self.gru(embedded, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_lengths, target_seq, teacher_forcing_ratio=0.5):
        batch_size = input_seq.size(0)
        target_length = target_seq.size(1)
        target_vocab_size = self.decoder.out.out_features

        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_lengths)

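        # NOTE: SOS_token is never defined in this notebook; a start-of-sequence
        # id would have to be reserved in the Cherokee vocabulary first.
        decoder_input = torch.tensor([[SOS_token]] * batch_size, device=device)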
        decoder_hidden = encoder_hidden

        use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

        if use_teacher_forcing:
            for di in range(target_length):
                decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
                decoder_input = target_seq[:, di]
        else:
            for di in range(target_length):
                decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
                topv, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze().detach()

        return decoder_output

# Define hyperparameters
input_size = len(english_tokenizer.word_index) + 1
output_size = len(cherokee_tokenizer.word_index) + 1
hidden_size = 256
num_layers = 2

# Create encoder and decoder instances
encoder = Encoder(input_size, hidden_size, num_layers)
decoder = Decoder(hidden_size, output_size, num_layers)

# Create the Seq2Seq model
model = Seq2Seq(encoder, decoder)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device
model = model.to(device)

# Set the number of epochs
num_epochs = 10

# Training loop
for epoch in range(num_epochs):
    total_loss = 0

    # Set the model to train mode
    model.train()

    # Iterate over the training data
    for input_seq, target_seq in train_loader:
        # Get the input sequence lengths
        input_lengths = torch.sum(input_seq != 0, dim=1)

        # Move the input and target sequences to the device
        input_seq = input_seq.to(device)
        target_seq = target_seq.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        output = model(input_seq, input_lengths, target_seq)

        # Compute the loss
        loss = criterion(output.view(-1, output_size), target_seq.view(-1))

        # Backward pass
        loss.backward()

        # Update the parameters
        optimizer.step()

        # Update the total loss
        total_loss += loss.item()

    # Print the average loss for the epoch
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

RuntimeError: ignored

# Sequence to Sequence Translator

In [None]:
import pathlib
import random
import string
import re
import numpy as np

import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

import keras
from keras import layers
from keras import ops
from keras.layers import TextVectorization

from sklearn.model_selection import train_test_split

In [None]:
eng_len = [len(sentence) for sentence in english_sentences]
print(max(eng_len), min(eng_len))

7287 0


In [None]:
che_len = [len(sentence) for sentence in cherokee_sentences]
print(max(che_len), min(che_len))

4846 0
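
The minimum length of 0 above indicates empty lines, and the maxima indicate lines far longer than the 20-token window used later. A minimal sketch of how such pairs could be filtered out is shown below. We did not apply this filter in the notebook, and the `max_chars=500` threshold is an arbitrary assumption.

```python
def filter_pairs(source_lines, target_lines, max_chars=500):
    """Keep aligned pairs where both sides are non-empty and at most max_chars long."""
    kept = []
    for src, tgt in zip(source_lines, target_lines):
        if 0 < len(src) <= max_chars and 0 < len(tgt) <= max_chars:
            kept.append((src, tgt))
    return kept

# Toy data: one usable pair, one with an empty source, one with an overlong source.
toy_en = ["Hello.", "", "x" * 1000]
toy_chr = ["ᏣᎳᎩ", "ᏣᎳᎩ", "ᏣᎳᎩ"]
print(filter_pairs(toy_en, toy_chr))
```

Filtering both sides together keeps the two lists aligned, which filtering each language separately would not.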


In [None]:
# build text_pair, prepending the token "[start]" and appending "[end]" to each Cherokee sentence
text_pair = []
for i in range(len(english_sentences)):
    text_pair.append([english_sentences[i], "[start] " + cherokee_sentences[i] + " [end]"])

# split the sentence pairs into a training set, a validation set, and a test set.
train_pairs, test_pairs = train_test_split(text_pair, test_size=0.2, random_state=42)
train_pairs, val_pairs = train_test_split(train_pairs, test_size=0.2, random_state=42)
print(f"{len(text_pair)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

107168 total pairs
68587 training pairs
17147 validation pairs
21434 test pairs
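
The counts above follow from applying a 20% hold-out twice. The sketch below reproduces them, assuming the held-out size is rounded up at each split; the function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Sizes for one split: the held-out part is rounded up, the rest is kept."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

total = 107168
train_val, test = split_sizes(total, 0.2)  # first split: hold out the test set
train, val = split_sizes(train_val, 0.2)   # second split: hold out validation
print(train, val, test)
print(round(train / total, 2), round(val / total, 2), round(test / total, 2))
```

So the effective proportions are roughly 64% training, 16% validation and 20% test, not 60/20/20.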


## Vectorizing the text data

Use two TextVectorization layers to vectorize the text data (one for English, one for Cherokee).

In [None]:
import string
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

vocab_size = max(english_vocab_size, cherokee_vocab_size)
sequence_length = 20
batch_size = 64

def custom_standardization(input_string):
    return tf_strings.regex_replace(input_string, "[%s]" % re.escape(strip_chars), "")

eng_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

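che_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    # NOTE: with the default standardization, punctuation (including "[" and "]")
    # is stripped, so "[start]" and "[end]" are stored as "start" and "end".
    # standardize=custom_standardization,
)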

train_eng_texts = [pair[0] for pair in train_pairs]
train_che_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
che_vectorization.adapt(train_che_texts)

In [None]:
# Example usage: transforming a single sentence
sample_eng_sentence = train_eng_texts[0]
sample_che_sentence = train_che_texts[0]

# Applying the vectorization to the sample sentences
eng_vectorized = eng_vectorization([sample_eng_sentence])
che_vectorized = che_vectorization([sample_che_sentence])

# Displaying the results
print(sample_eng_sentence)
print("English vectorized:", eng_vectorized)
print()
print(sample_che_sentence)
print("Cherokee vectorized:", che_vectorized)

And Simon himself had faith and, having had baptism, he went with Philip and, seeing the signs and the great wonders which he did, he was full of surprise.
English vectorized: tf.Tensor(
[[  3 425 137  50 140   3  73  50 737   9  55  22 741   3 360   2 566   3
    2 114]], shape=(1, 20), dtype=int64)

[start] ᏌᏩᏂᏃ ᎾᏍᏉ ᎤᏬᎯᏳᏁᎢ, ᎠᎦᏬᎥᏃ ᎤᏍᏓᏩᏗᏙᎴ ᏈᎵᎩ; ᎠᎪᏩᏗᏍᎬᏃ ᎤᏍᏆᏂᎪᏗ ᎠᎴ ᎤᏰᎸᏛ ᏚᎸᏫᏍᏓᏁᎲᎢ, ᎠᏍᏆᏂᎪᏍᎨᎢ. [end]
Cherokee vectorized: tf.Tensor(
[[    2  1206    14 11149 24367 29106   572 30943   186     4   378   533
  30644     3     0     0     0     0     0     0     0]], shape=(1, 21), dtype=int64)


Now, format the datasets

At each training step, the model will predict target word N + 1 using the source sentence and the target words 0 to N.

Thus, the training dataset will yield (`inputs`, `targets`) where:
* `inputs` - a dictionary with 2 keys:
    - `encoder_inputs` - vectorized source sentence.
    - `decoder_inputs` - the target sentence so far (words 0 to N, used to predict word N + 1 and beyond)
* `targets` - target sentence offset by 1 step - provides the next words in the target sentence — what the model will try to predict
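
The relationship between `decoder_inputs` and `targets` can be sketched on a single sequence of ids. The ids below are hypothetical (2 for the start token, 3 for the end token, 0 for padding), and the helper name is not part of the notebook.

```python
def shift_for_teacher_forcing(target_ids):
    """Split one target sequence into decoder inputs and next-token targets."""
    decoder_inputs = target_ids[:-1]  # everything except the last position
    targets = target_ids[1:]          # the same sequence moved one step left
    return decoder_inputs, targets

# Hypothetical ids: 2 = start token, 3 = end token, 0 = padding.
sequence = [2, 1206, 14, 3, 0, 0]
dec_in, tgt = shift_for_teacher_forcing(sequence)
for step, (seen, wanted) in enumerate(zip(dec_in, tgt)):
    print(f"step {step}: input {seen} -> predict {wanted}")
```

This is why the Cherokee vectorizer produces `sequence_length + 1` ids: dropping one position from each end leaves two sequences of length `sequence_length`.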

In [None]:
def format_dataset(eng, che):
    eng = eng_vectorization(eng)
    che = che_vectorization(che)

    inputs = {
        "encoder_inputs": eng,
        "decoder_inputs": che[:, :-1],
    }

    return (inputs, che[:, 1:])

def make_dataset(pairs):
    eng_texts, che_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    che_texts = list(che_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, che_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.cache().shuffle(2048).prefetch(16)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)


## Building the model:


Our sequence-to-sequence Transformer consists of a TransformerEncoder and a TransformerDecoder chained together. To make the model aware of word order, we also use a PositionalEmbedding layer.

source sequence --> `TransformerEncoder` (output a new representation of the source) --> pass the new representation with the target sequence so far (target words 0 to N) to `TransformerDecoder` --> predict the next words in the target sequence (N + 1 and beyond)

Layers adapted from https://keras.io/examples/nlp/neural_machine_translation_with_transformer/

In [None]:
import keras.ops as ops

In [None]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
        else:
            padding_mask = None

        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config

In [None]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
            padding_mask = ops.minimum(padding_mask, causal_mask)
        else:
            padding_mask = None

        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = ops.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = ops.arange(sequence_length)[:, None]
        j = ops.arange(sequence_length)
        mask = ops.cast(i >= j, dtype="int32")
        mask = ops.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = ops.concatenate(
            [ops.expand_dims(batch_size, -1), ops.convert_to_tensor([1, 1])],
            axis=0,
        )
        return ops.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config
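
The causal mask built by `get_causal_attention_mask` is a lower-triangular matrix (repeated once per batch element). A minimal pure-Python sketch of a single such matrix, for illustration only:

```python
def causal_mask(length):
    """Lower-triangular 0/1 matrix: row i may attend to columns 0..i only."""
    return [[1 if i >= j else 0 for j in range(length)] for i in range(length)]

for row in causal_mask(4):
    print(row)
```

Row `i` has ones only in columns `0..i`, so position `i` of the decoder cannot look at later target words.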

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = ops.shape(inputs)[-1]
        positions = ops.arange(0, length, 1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        else:
            return ops.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


Assemble the end-to-end model:

In [None]:
embed_dim = 256
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

## Training our model

In [None]:
epochs = 3  # 1 for testing

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)

Epoch 1/3
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 143s 114ms/step - accuracy: 0.5970 - loss: 3.9345 - val_accuracy: 0.6393 - val_loss: 2.9544
Epoch 2/3
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 111s 99ms/step - accuracy: 0.6470 - loss: 2.8382 - val_accuracy: 0.6724 - val_loss: 2.4897
Epoch 3/3
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 141s 98ms/step - accuracy: 0.6692 - loss: 2.5346 - val_accuracy: 0.6923 - val_loss: 2.2505


<keras.src.callbacks.history.History at 0x7f1fe00fbe50>

Save the model for later evaluation

In [None]:
transformer.save("sequence_to_sequence_3.keras")

## Decoding test sentences

In [None]:
che_vocab = che_vectorization.get_vocabulary()
che_index_lookup = dict(zip(range(len(che_vocab)), che_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = che_vectorization([decoded_sentence])[:, :-1]
        # print(f"tokenized_target_sentence: {tokenized_target_sentence}")
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        # ops.argmax(predictions[0, i, :]) is not a concrete value for jax here
        sampled_token_index = ops.convert_to_numpy(
            ops.argmax(predictions[0, i, :])
        ).item(0)
        # print(f"sampled_token_index: {sampled_token_index}")
        sampled_token = che_index_lookup[sampled_token_index]
        # print(sampled_token)
        decoded_sentence += " " + sampled_token

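        # The default TextVectorization standardization strips "[" and "]",
        # so the stop token may appear in the vocabulary as "end".
        if sampled_token in ("[end]", "end"):
            break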
    return decoded_sentence
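
The loop in `decode_sequence` is a greedy decoder: at each step it appends the single most likely next token and stops at the end token or at the length limit. The sketch below isolates that control flow. `toy_next_token` is a hypothetical stand-in for the transformer and simply replays a fixed sentence.

```python
def greedy_decode(next_token_fn, start_token, end_token, max_len):
    """Repeatedly append the most likely next token until end_token or max_len."""
    tokens = [start_token]
    for _ in range(max_len):
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == end_token:
            break
    return tokens

def toy_next_token(tokens):
    """Hypothetical stand-in for the model: replays a fixed sentence."""
    script = ["a", "b", "end"]
    return script[min(len(tokens) - 1, len(script) - 1)]

print(greedy_decode(toy_next_token, "start", "end", max_len=20))
# A stop token that never matches makes the loop run for max_len steps.
print(len(greedy_decode(toy_next_token, "start", "[end]", max_len=20)))
```

The second call shows why the stop token must be spelled exactly as it appears in the vocabulary.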

In [None]:
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(10):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequence(input_sentence)

    print(f"Input      = {input_sentence}\nTranslated = {translated}")

Input      = This is a great secret: but my words are about Christ and the church.
Translated = [start] ᎾᏍᎩ ᎢᏳᏍᏗ ᏞᏍᏗ ᎩᎶ ᏂᎯ ᎢᏤᎲ ᎤᏓᏑᏰᏍᏗ ᎤᏢᎨᏍᏗ ᏫᏓᏯᏂᏍᎨᏍᏗ ᏗᎨᎦᏗᎶᏗ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᎤᏁᎳᏅᎯ ᎤᏤᎵᎦ
Input      = Just now went to be spelling.
Translated = [start] ᎾᏍᎩ ᎢᏳᏍᏗ ᏞᏍᏗ ᎩᎶ ᏂᎯ ᎢᏤᎲ ᎤᏓᏑᏰᏍᏗ ᎤᏢᎨᏍᏗ ᏫᏓᏯᏂᏍᎨᏍᏗ ᏗᎨᎦᏗᎶᏗ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᎤᏁᎳᏅᎯ ᎤᏤᎵᎦ
Input      = And Joseph, who was given by the Apostles the name of Barnabas (the sense of which is, Son of comfort), a Levite and a man of Cyprus by birth,
Translated = [start] ᎾᏍᎩ ᎢᏳᏍᏗ ᏞᏍᏗ ᎩᎶ ᏂᎯ ᎢᏤᎲ ᎤᏓᏑᏰᏍᏗ ᎤᏢᎨᏍᏗ ᏫᏓᏯᏂᏍᎨᏍᏗ ᏗᎨᎦᏗᎶᏗ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᎤᏁᎳᏅᎯ ᎤᏤᎵᎦ
Input      = But when they went on with their questions, he got up and said to them, Let him among you who is without sin be the first to send a stone at her.
Translated = [start] ᎾᏍᎩ ᎢᏳᏍᏗ ᏞᏍᏗ ᎩᎶ ᏂᎯ ᎢᏤᎲ ᎤᏓᏑᏰᏍᏗ ᎤᏢᎨᏍᏗ ᏫᏓᏯᏂᏍᎨᏍᏗ ᏗᎨᎦᏗᎶᏗ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᏧᎾᏁᎶᏗ ᎤᎾᏓᏡᎬ ᎤᏁᎳᏅᎯ ᎤᏤᎵᎦ
Input      = took it off a fire

# Future Directions

It may be interesting to compare the performance of English-to-Cherokee and Cherokee-to-English models.

We'd also like to flesh out the RNN and encoder-decoder approaches so that we can compare them with the transformer for English-to-Cherokee translation.

A final avenue that we might pursue is comparing our output with that of an LLM translation and evaluating accuracy across all models.