# Translating Meowian

A cardboard box has appeared at your front door. In it is a shivering cat with a PS Vita around its neck. As you bring the cat into your warm home, it meows insistently at you, and text in Latin letters pops up on the screen. It seems to be translating the cat's meowing! After fidgeting with the bootlegged PS Vita, you notice that, while the audio-to-text functionality works, meowian_txt_to_english.py is partially corrupt. It's up to you to recreate the Meowian-to-English translator to understand what this cat is telling you!

## On Meowian

From perusing the files contained in the PS Vita, you come to learn that Meowian is a language spoken by all cats, but which has many dialects and varieties. Your new cat is a worldly cat and thus speaks many different types of Meowian, including Informal Meowian and Formal Meowian.

In this folder, you have two text files, informal_meowian.txt and formal_meowian.txt, each containing transcriptions of the cat's output in the respective variety. There is also a meowian-english.csv file containing a parallel corpus of Meowian words and their English translations. The developer was even kind enough to leave an English meowian_report.txt file, which describes the structure of Meowian and some of its varieties.

## Task

In order to successfully translate one of the Meowian text files into English, your job is to fix the code below by completing the TODOs. Informal Meowian is easier as its structure is more similar to that of English, but if you want a challenge, you can try your hand at translating the more complex Formal Meowian!

## 1. Preprocessing the Corpus

Your first step is preprocessing the Meowian-English aligned corpus. Preprocessing means preparing the data before the main operation. This can include cleaning the data of unwanted details, lowercasing, removing stop words ("the", "and", "or", etc.), splitting sentences into words, and more!
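
To make these steps concrete, here is a minimal sketch of a preprocessing function. The stop-word list and the `preprocess` helper are hypothetical and only for illustration; they are not part of the workshop files.

```python
import re

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"the", "and", "or", "a", "an"}

def preprocess(sentence, remove_stop_words=True):
    """Lowercase, strip punctuation, split into words, optionally drop stop words."""
    lowered = sentence.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    words = cleaned.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

print(preprocess("The cat and the PS Vita!"))  # ['cat', 'ps', 'vita']
```

For translation you would usually keep stop words, since they need translating too, so `remove_stop_words=False` is likely the better setting for this task.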

First, let's extract the data from the parallel corpus.

In [None]:
import pandas as pd

tokenized_english = []
tokenized_meowian = []

file = pd.read_csv("meowian-english.csv")
english_sentences = file['english'].tolist() # Take the English data
meowian_sentences = None # TODO: Take the Meowian data

print(english_sentences)
print(meowian_sentences)

all_sentences = [english_sentences, meowian_sentences]
all_tokenized = [tokenized_english, tokenized_meowian]

Now, let's tokenize the sentences. Tokenization is the process of breaking text into smaller units called tokens. Depending on the system, a token may represent a word, a subword, or even punctuation.
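
As a rough illustration of the idea, the sketch below splits text into word-level and character-level tokens using only regular expressions. It is a simplified stand-in, not how `nltk.word_tokenize` works internally, and both function names are hypothetical.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split text into individual non-space characters."""
    return [ch for ch in text if not ch.isspace()]

print(simple_word_tokenize("Mew mewe, mewew!"))  # ['Mew', 'mewe', ',', 'mewew', '!']
print(char_tokenize("mew"))                      # ['m', 'e', 'w']
```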

In [None]:
import nltk
nltk.download('punkt_tab')

for i in range(len(all_sentences)):
    for sentence in all_sentences[i]:
        # TODO: use nltk.word_tokenize, then add the result to the appropriate list of tokenized items!
        tokenized_sentence = None  # fix this
        all_tokenized[i].extend(tokenized_sentence)

## 2. Model Training

We can now start setting up our models and training them. If you're unfamiliar with this work, start with the Informal Meowian section, which uses IBM Model 1. If you want a challenge, skip to implementing an RNN for Formal Meowian!

## A. Informal Meowian and IBM Model 1

IBM Model 1 is one of a family of statistical machine translation models from the early 1990s which were at the forefront of machine translation until neural network models gained popularity. To illustrate how IBM Model 1 works in practice, the following code trains the model on a small parallel corpus. Each pair of sentences is wrapped in an AlignedSent object, and then the model is trained for 10 iterations.

In [None]:
from nltk.translate import AlignedSent, IBMModel1

aligned_sentences = []

for source, target in zip(tokenized_meowian, tokenized_english):
    aligned_sentences.append(AlignedSent(source.split(), target.split()))

# Train once, after all sentence pairs have been collected
ibm_model = IBMModel1(aligned_sentences, 10)
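
For intuition about what happens during those iterations, here is a minimal pure-Python sketch of the expectation-maximization loop behind IBM Model 1. The toy corpus is made up, the helper names are hypothetical, and the sketch omits the NULL word of the full model. It stores P(English word | Meowian word), a convention chosen here for readability that may not match how `IBMModel1` indexes its `translation_table`.

```python
from collections import defaultdict

# Hypothetical toy corpus: (Meowian tokens, English tokens)
corpus = [
    (["mew", "mewe"], ["the", "house"]),
    (["mew", "mewew"], ["the", "book"]),
    (["meww", "mewew"], ["a", "book"]),
]

def train_ibm1(corpus, iterations=10):
    """Estimate t[english][meowian] = P(english word | meowian word) with EM."""
    english_vocab = {e for _, english in corpus for e in english}
    # Start from a uniform distribution over English words
    uniform = 1.0 / len(english_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: collect expected alignment counts
        for meowian, english in corpus:
            for e in english:
                norm = sum(t[e][m] for m in meowian)
                for m in meowian:
                    fraction = t[e][m] / norm
                    count[e][m] += fraction
                    total[m] += fraction
        # M-step: re-normalize the counts into probabilities
        t = defaultdict(lambda: defaultdict(float))
        for e, row in count.items():
            for m, c in row.items():
                t[e][m] = c / total[m]
    return t

def best_translation(t, meowian_word):
    """Pick the English word with the highest P(english | meowian_word)."""
    return max(t, key=lambda e: t[e].get(meowian_word, 0.0))

t = train_ibm1(corpus)
print([best_translation(t, m) for m in ["mew", "mewe", "mewew", "meww"]])
# ['the', 'house', 'book', 'a']
```

Note that nothing in the loop looks at word positions: only co-occurrence within a sentence pair matters, which is why the model ignores word order.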

Training IBM Model 1 is that simple! Now, we must preprocess the text to be translated.

In [None]:
tokenized_input = []
translated_words = []

# TODO: Extract the input. Remember read_csv from the first preprocessing step? You can also use it on text files.
input_text = None  # named input_text to avoid shadowing Python's built-in input()
input_text = input_text.to_string()

# TODO: Tokenize the input, as you did for the corpus.
for sentence in nltk.sent_tokenize(input_text):
    tokenized_sentence = None  # TODO: use nltk.word_tokenize
    tokenized_input.extend(tokenized_sentence)

Now for the actual translating!

In [None]:
for source_word in tokenized_input:
    max_prob = 0.0
    translated_word = None
    for target_word in ibm_model.translation_table[source_word]:
        prob = ibm_model.translation_table[source_word][target_word]
        if prob > max_prob:
            max_prob = prob
            translated_word = target_word
    if translated_word is not None:
        translated_words.append(translated_word)
translated_text = ' '.join(translated_words)

print(translated_text)

Wahoo! We've successfully translated Informal Meowian to English! Now, go back and try running the code on Formal Meowian. What do you notice?

## B. Formal Meowian and RNNs

Since IBM Model 1 is rather primitive, ignoring word order among other things, we'll use a more recent type of model to deal with Formal Meowian: Recurrent Neural Networks! RNNs process sequences one element at a time, allowing them to capture ordering and contextual relationships that a word-by-word model like IBM Model 1 simply can’t represent.
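
The sketch below shows why processing one element at a time makes an RNN sensitive to order. It uses a single scalar hidden state and made-up weights, so it is a toy version of the recurrence rather than the Keras `SimpleRNN` layer itself; the function names are hypothetical.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrence step: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), with scalars."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the recurrence and return the final hidden state."""
    h = 0.0
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
    return h

# Same elements, different order: the final hidden states differ
print(run_rnn([1.0, 2.0, 3.0]))
print(run_rnn([3.0, 2.0, 1.0]))
```

Because each hidden state depends on the previous one, swapping two inputs changes everything computed after them.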

First, let's build a tokenizer for each language. A tokenizer, well, tokenizes, as we did before, except these ones are special: they turn the strings into number representations that our model will be able to handle.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, RepeatVector, TimeDistributed
from tensorflow.keras.callbacks import EarlyStopping

np.random.seed(42)
tf.random.set_seed(42)

# First, we create a Tokenizer object
meowian_tokenizer = Tokenizer(oov_token="<UNK>") # sets unknown vocab to be UNK
# Fit the tokenizer to the sentences
meowian_tokenizer.fit_on_texts(meowian_sentences)
# Turn sentences into number representations (sequences)
meowian_sequences = meowian_tokenizer.texts_to_sequences(meowian_sentences)
meowian_max_len = max(len(seq) for seq in meowian_sequences)
# Make sequences all the same length
meowian_x = pad_sequences(meowian_sequences, maxlen=meowian_max_len, padding='post')
meowian_vocab_size = len(meowian_tokenizer.word_index) + 1

# TODO: Do the same for English!
en_tokenizer = None # TODO: create Tokenizer object
#TODO: fit the tokenizer
en_sequences = None # TODO: turn sentences into sequences
english_max_len = None # TODO: find max length
en_y = None # TODO: pad sequences
en_vocab_size = None # TODO: find vocab size

print(f"\nMeowian vocabulary: {meowian_tokenizer.word_index}")
print(f"English vocabulary: {en_tokenizer.word_index}")
print(f"\nPadded Meowian sequences shape: {meowian_x.shape}")
print(f"Padded English sequences shape: {en_y.shape}")
print(f"\nMeowian vocab size: {meowian_vocab_size}")
print(f"English vocab size: {en_vocab_size}")
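
To see roughly what the tokenizer and padding steps produce, here is a minimal pure-Python sketch of the same idea. The helper names are hypothetical and the real Keras classes do more (punctuation filtering, truncation options), so treat this as an approximation.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<UNK>"):
    """Map each word to an integer; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    # More frequent words get smaller indices
    for i, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = i
    return word_index

def texts_to_sequences(sentences, word_index, oov_token="<UNK>"):
    """Replace each word by its index, falling back to the unknown-word index."""
    unk = word_index[oov_token]
    return [[word_index.get(w, unk) for w in s.lower().split()] for s in sentences]

def pad_post(sequences, maxlen):
    """Append zeros so every sequence has length maxlen (assumes none is longer)."""
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]

sentences = ["mew mewe", "mew mewew meww"]  # made-up Meowian
word_index = build_word_index(sentences)
sequences = texts_to_sequences(sentences, word_index)
print(word_index)              # {'<UNK>': 1, 'mew': 2, 'mewe': 3, 'mewew': 4, 'meww': 5}
print(pad_post(sequences, 3))  # [[2, 3, 0], [2, 4, 5]]
print(len(word_index) + 1)     # 6: the +1 accounts for the padding index 0
```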


We can finally build our model! It will be made up of:

1. An embedding layer
2. An encoder
3. A repeat vector
4. A decoder
5. An output layer
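
To keep track of what each of these stages produces, the sketch below traces the output shape at every step. It only does bookkeeping, with no neural network involved; the `trace_shapes` helper and the example sizes are hypothetical.

```python
def trace_shapes(batch, src_len, tgt_len, embed_dim, units, tgt_vocab):
    """Return (stage, output shape) pairs for an embedding/encoder/repeat/decoder/output stack."""
    return [
        ("input", (batch, src_len)),                  # integer word indices
        ("embedding", (batch, src_len, embed_dim)),   # one vector per source word
        ("encoder", (batch, units)),                  # single summary vector per sentence
        ("repeat vector", (batch, tgt_len, units)),   # summary copied for each output step
        ("decoder", (batch, tgt_len, units)),         # one hidden state per output step
        ("output", (batch, tgt_len, tgt_vocab)),      # word probabilities per output step
    ]

# Made-up sizes: 1 sentence, 5 Meowian tokens in, 6 English tokens out, 20 English words
for stage, shape in trace_shapes(1, 5, 6, embed_dim=8, units=8, tgt_vocab=20):
    print(f"{stage:>14}: {shape}")
```

The key design point is the encoder output: the whole source sentence is squeezed into one vector, which the repeat vector then hands to the decoder at every output position.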


In [None]:
model = Sequential([
    # Input layer: Embedding to convert word indices to dense vectors
    Embedding(
        input_dim=meowian_vocab_size,
        output_dim=8,  # Tiny embedding size for our tiny dataset
        input_length=meowian_max_len,
        mask_zero=True,  # Ignore padding
        name="embedding_layer"
    ),
    
    # RNN layer: Processes the sequence
    SimpleRNN(
        units=8,  # Very small for our tiny dataset
        activation='tanh',
        return_sequences=False,  # Only output at the end
        name="rnn_layer"
    ),
    
    # Repeat the RNN output for each time step in output
    RepeatVector(english_max_len, name="repeat_layer"),
    
    # Another RNN for decoding
    SimpleRNN(
        units=8,
        activation='tanh',
        return_sequences=True,  # Output at each time step
        name="decoder_rnn"
    ),
    
    # Output layer: Predict English word at each position
    TimeDistributed(
        Dense(en_vocab_size, activation='softmax'),
        name="output_layer"
    )
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # Use sparse since we have integer labels
    metrics=['accuracy']
)

# Show model architecture
model.build(input_shape=(None, meowian_max_len))
model.summary()

Training time

In [None]:
history = model.fit(
    meowian_x,
    en_y,  # en_y is already the right shape (batch_size, sequence_length)
    epochs=200,
    batch_size=1,
    verbose=1
)

Now for the actual translating!

In [None]:
def translate_with_rnn(text, model, meowian_tokenizer, en_tokenizer, max_len):
    """Translate Meowian to English using the trained RNN"""
    
    # Clean and tokenize input
    text = text.strip()
    
    # Convert to sequence
    sequence = meowian_tokenizer.texts_to_sequences([text])
    
    # Pad sequence
    padded = pad_sequences(sequence, maxlen=max_len, padding='post')
    
    # Get prediction
    predictions = model.predict(padded, verbose=0)[0]  # Shape: (english_max_len, en_vocab_size)
    
    # Convert predictions to words
    translated_words = []
    for time_step in predictions:
        # Get most likely word index
        word_idx = np.argmax(time_step)
        
        # Skip padding (0) and the unknown-word token, if the tokenizer defines one
        word = en_tokenizer.index_word.get(int(word_idx))
        if word_idx > 0 and word is not None and word != en_tokenizer.oov_token:
            translated_words.append(word)
    
    return ' '.join(translated_words) if translated_words else "[No translation]"

translate_with_rnn("mew mewe mewew", model, meowian_tokenizer, en_tokenizer, meowian_max_len)
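
The decoding step inside the function can be seen in isolation with hand-written probabilities. This is a minimal sketch of greedy decoding (taking the most probable word at each position); the `greedy_decode` helper, the vocabulary, and the probability values are all made up for illustration.

```python
def greedy_decode(predictions, index_word):
    """Take the most probable index at each time step and map it back to a word."""
    words = []
    for probs in predictions:
        best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
        if best != 0 and best in index_word:  # index 0 is padding
            words.append(index_word[best])
    return " ".join(words)

# Hypothetical vocabulary and per-step probabilities (each row sums to 1)
index_word = {1: "<UNK>", 2: "hello", 3: "human"}
predictions = [
    [0.10, 0.10, 0.70, 0.10],  # step 1: index 2 wins
    [0.10, 0.10, 0.20, 0.60],  # step 2: index 3 wins
    [0.90, 0.05, 0.03, 0.02],  # step 3: padding wins, so nothing is emitted
]
print(greedy_decode(predictions, index_word))  # hello human
```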


Wahoo! It works! Now try generating Meowian sentences of your own by combining words from meowian-english.csv and put them in the translator. What do you notice?

In [None]:
your_meowian_sentence = ""
english_translation = ""

result = translate_with_rnn(your_meowian_sentence, model, meowian_tokenizer, en_tokenizer, meowian_max_len)

print(your_meowian_sentence)
print(english_translation, result)

Wah! The cat’s output happened to be something the model already saw during training. That doesn’t tell us whether the model actually learned anything, so we still need to test it on unseen data to evaluate real generalization. But for now, let's translate formal_meowian.txt.

In [None]:
# TODO: Extract formal_meowian.txt
input_text = None  # named input_text to avoid shadowing Python's built-in input()
input_text = input_text.to_string()

# TODO: Call the translate_with_rnn function on the input and print the result
result = None
print(result)

This gives us a rough idea of how the model behaves on new text. Proper evaluation will have to wait for another adventure.
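
As a head start for that adventure, here is a minimal sketch of what a held-out evaluation might look like: set some sentence pairs aside before training, then measure how many of them the translator gets exactly right. The data, the helper names, and the dictionary stand-in for the translator are all hypothetical.

```python
import random

def split_pairs(pairs, test_fraction=0.2, seed=42):
    """Shuffle the sentence pairs and hold out a fraction for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def exact_match_accuracy(translate, test_pairs):
    """Fraction of held-out sentences whose translation matches the reference exactly."""
    hits = sum(
        1 for source, reference in test_pairs
        if translate(source).strip().lower() == reference.strip().lower()
    )
    return hits / len(test_pairs)

# Hypothetical data and a stand-in translator, for illustration only
pairs = [("mew", "hello"), ("mewe", "human"), ("mewew", "food"),
         ("meww", "now"), ("mewwe", "please")]
lookup = dict(pairs)
train_pairs, test_pairs = split_pairs(pairs)
print(len(train_pairs), len(test_pairs))  # 4 1
print(exact_match_accuracy(lambda s: lookup.get(s, ""), test_pairs))  # 1.0
```

With the real model, you would fit the tokenizers and train on `train_pairs` only, then pass a function that wraps `translate_with_rnn` as the `translate` argument. Exact match is a strict measure; it is used here only because it is the simplest one to sketch.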

## Conclusion

At last, the PS Vita hums to life with a working Meowian‑to‑English translator. The cat curls up on your lap, finally able to tell you exactly what it’s been meowing about this whole time. Our workshop ends here, but your adventures in feline linguistics are only just beginning.

Congratulations on completing the workshop! You’ve explored statistical models, neural models, and the full pipeline of building a translation system.