#Rhys Cabot's Capstone Final Demo/Code
In this demo I'll show you my working model and some of the steps I've taken to attempt to improve the issue I identified. The issue is that Japanese usually follows a rigid SOV sentence order while English ordinarily uses SVO, so machine translation, which is focused on word-level accuracy, will sometimes mess up the sentence structure. This is especially common with conversation-level text and with weaker models, particularly small ones like the one we use today.
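
To make the word-order problem concrete, here is a toy sketch. The glosses and the S/O/V role labels are hand-written assumptions for illustration; none of this comes from a model.

```python
# Toy illustration of SOV vs. SVO order. The glosses and role labels below
# are hand-written for this example; nothing here is model output.

# "リスは食べ物を食べました" glossed word by word, in Japanese (SOV) order.
sov_gloss = [("Rhys", "S"), ("food", "O"), ("ate", "V")]

def read_in_source_order(gloss):
    """Join the glosses as-is, which keeps the Japanese word order."""
    return " ".join(word for word, _ in gloss)

def reorder_to_svo(gloss):
    """Sort the glosses into subject, verb, object order."""
    rank = {"S": 0, "V": 1, "O": 2}
    return " ".join(word for word, _ in sorted(gloss, key=lambda pair: rank[pair[1]]))

print(read_in_source_order(sov_gloss))  # Rhys food ate
print(reorder_to_svo(sov_gloss))        # Rhys ate food
```

A translation that gets every word right but keeps the source order reads like the first line, which is the kind of output this project is concerned with.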

Data: https://github.com/rpryzant/JESC

Machine Translation: https://www.geeksforgeeks.org/nlp/machine-translation-with-transformer-in-python/

https://huggingface.co/docs/transformers/tasks/translation

I used the above tutorials to guide me in creating the model I used. Below we install the necessary packages. We need to install them every time we launch the virtual environment in Google Colab.

In [25]:
!pip install transformers datasets evaluate sacrebleu wandb



Now with our packages installed, we can move on to loading our data.

#Part 1: Data & Model Training

We'll be formatting our data and training our model so we have a working (though flawed) machine translation model of our own, as opposed to calling a pretrained model.

Note: This decision was made before the decision to cut hyperparameter tuning. If I were starting this from scratch with the knowledge I have now, I'd probably resign myself to a pretrained model to avoid the weeks and weeks of getting all the services running.

In [26]:
import numpy as np
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import re

nltk.download('punkt_tab')

#Here we load the data into a pandas dataframe.
#The English and Japanese are separated by tabs, hence sep="\t"
df = pd.read_csv("train", sep="\t", names=["en", "jp"])
testdf = pd.read_csv("test", sep="\t", names=["en", "jp"])

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


With the data as it is, training the model would take around 300 hours per epoch, so I trim it down for the sake of this project. We take a random sample so that the model is not overly biased towards certain translation samples; because random_state is fixed, the same rows are picked on every run. See the comments in the code block for how to adjust this for testing!

In [27]:
def trim_dataframe(input_df):
    sample_size = max(1, len(input_df) // 100) #Edit the number here to change the ratio
    sampled_df = input_df.sample(n=sample_size, random_state=42)
    return sampled_df

df = trim_dataframe(df)
testdf = trim_dataframe(testdf)
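
As a quick check of the size rule used above, here is a small sketch of the same arithmetic on its own. The row counts are hypothetical inputs chosen for illustration, not the actual sizes of the JESC files.

```python
def sample_size(n_rows, divisor=100):
    """Mirror the size rule in trim_dataframe: keep 1/divisor of the rows, at least one."""
    return max(1, n_rows // divisor)

# Hypothetical row counts, just to show how the rule behaves.
for n_rows in (50, 1_999, 2_694_500):
    print(n_rows, "->", sample_size(n_rows))
# 50 -> 1
# 1999 -> 19
# 2694500 -> 26945
```

The max(1, ...) guard matters for tiny inputs: without it, a file with fewer than 100 rows would yield an empty sample.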

The model expects a DatasetDict object as input, so we need to convert the dataframes to a DatasetDict. There is probably an easier way of loading a file directly into one, but I'm not familiar with DatasetDicts, so I'm playing to my strengths. Thankfully, datasets has a handy function for converting from pandas!

In [28]:
from datasets import Dataset, DatasetDict

def create_dataset_dict_from_pandas(train_df_pandas: pd.DataFrame, test_df_pandas: pd.DataFrame) -> DatasetDict:

    train_dataset = Dataset.from_pandas(train_df_pandas)
    test_dataset = Dataset.from_pandas(test_df_pandas)

    dataset_dict = DatasetDict({
        "train": train_dataset,
        "test": test_dataset
    })
    return dataset_dict

The step below will help us format our data in a way that the model can read, in accordance with the tutorials I used. It slows down considerably with larger datasets.

In [29]:
df = create_dataset_dict_from_pandas(df, testdf)

def format_translation(examples):
  return {"translation": {"ja": examples["jp"], "en": examples["en"]}}

df = df.map(format_translation)

Map:   0%|          | 0/26945 [00:00<?, ? examples/s]

Map:   0%|          | 0/19 [00:00<?, ? examples/s]

With this we can see a sample of our data to make sure it has a properly formatted "translation" dictionary.

In [30]:
df["train"][0]

{'en': 'they understood it, they embraced the technology',
 'jp': '彼らは自ら理解し テクノロジーを受け入れて',
 '__index_level_0__': 489607,
 'translation': {'en': 'they understood it, they embraced the technology',
  'ja': '彼らは自ら理解し テクノロジーを受け入れて'}}

We use a pretrained AutoTokenizer to help process our data in a way the model can digest.

In [31]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

We set up some parameters here, establishing our source language, target language, and prefix, and we create a preprocess function to help the model digest this further.

In [32]:
source_lang = "ja"
target_lang = "en"
prefix = "Translate Japanese to English: " # Corrected prefix

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
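
To show what the two list comprehensions in preprocess_function produce, here is a sketch that runs them on a hand-made batch and skips the tokenizer call. The two example rows and the helper name build_inputs_and_targets are made up for illustration.

```python
# Hypothetical two-row batch in the shape a batched map passes in:
# one key per column, each holding a list with one entry per row.
batch = {
    "translation": [
        {"ja": "こんにちは！", "en": "hello!"},
        {"ja": "ありがとう", "en": "thank you"},
    ]
}

prefix = "Translate Japanese to English: "

def build_inputs_and_targets(examples, source_lang="ja", target_lang="en"):
    """Same list comprehensions as preprocess_function, minus the tokenizer call."""
    inputs = [prefix + pair[source_lang] for pair in examples["translation"]]
    targets = [pair[target_lang] for pair in examples["translation"]]
    return inputs, targets

inputs, targets = build_inputs_and_targets(batch)
print(inputs[0])   # Translate Japanese to English: こんにちは！
print(targets[1])  # thank you
```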

We need to apply the preprocess function to every example in our DatasetDict, in batches, as indicated below.

In [33]:
tokenized_data = df.map(preprocess_function, batched=True)

Map:   0%|          | 0/26945 [00:00<?, ? examples/s]

Map:   0%|          | 0/19 [00:00<?, ? examples/s]

We then import a DataCollator, essentially just a middle man to help the model manage our data.

In [34]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

Below is a general quality metric for translation. It doesn't help us determine whether translations have the issue or not, but it does help us measure the model's overall quality. We want to score somewhere around 20+ at least, though with low training times that's unlikely to occur.

In [35]:
import evaluate

metric = evaluate.load("sacrebleu")
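
For intuition about what this metric rewards, here is a simplified sketch of clipped n-gram precision, one of the ingredients of BLEU. This is not how sacrebleu computes its score (the real metric also handles tokenization, combines several n-gram orders, and applies a brevity penalty); the function names and example sentences are my own.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: matches are capped by the reference counts."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "Rhys went to American University"
good = "Rhys went to American University"
scrambled = "Rhys American University to went"

print(modified_precision(good, reference, 1))       # 1.0
print(modified_precision(scrambled, reference, 1))  # 1.0
print(modified_precision(scrambled, reference, 2))  # 0.25
```

The scrambled sentence keeps every single-word match and only loses on longer n-grams, and a single overall score does not say which kind of mistake caused the loss.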

We define some functions here for output handling and model validation.

In [71]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
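
Two steps in compute_metrics are easy to skim past: swapping -100 for the padding id before decoding, and counting non-padding ids for gen_len. Here is a plain-list sketch of both. The token ids are made up, and the padding id of 0 is an assumption for this example.

```python
PAD_ID = 0           # assumed padding id for this sketch
IGNORE_INDEX = -100  # marker used for label positions the loss should skip

def unmask_labels(label_rows, pad_id=PAD_ID):
    """Swap the ignore marker for the padding id, as np.where does in compute_metrics."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in label_rows]

def mean_generated_length(pred_rows, pad_id=PAD_ID):
    """Average count of non-padding ids per prediction, i.e. the gen_len value."""
    lengths = [sum(1 for tok in row if tok != pad_id) for row in pred_rows]
    return sum(lengths) / len(lengths)

# Made-up ids for illustration.
labels = [[37, 12, 1, -100, -100], [8, 1, -100, -100, -100]]
preds = [[0, 37, 12, 1, 0], [0, 8, 9, 4, 1]]

print(unmask_labels(labels))         # [[37, 12, 1, 0, 0], [8, 1, 0, 0, 0]]
print(mean_generated_length(preds))  # 3.5
```

The swap is needed because -100 is not a real token id, so the tokenizer cannot decode it.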

Importing from transformers again for, you guessed it, more language processing tools!

In [37]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Now we put the model itself together and train it!

Note: Back when I started this project you needed a wandb account to run it. Now I think you can select 3 when it prompts you for a number to avoid using wandb. If you are by chance trying to run this code, please do this! However keep the training time in mind!

In [38]:
training_args = Seq2SeqTrainingArguments(
    output_dir="cabotcapstone",
    eval_strategy="epoch",
    learning_rate=2e-4, #Higher rate speeds up training, but can make the model much less consistent
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16, #Messing with batch sizes can also alter training time and accuracy.
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2, #Higher means better results but more training time.
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 3.5957 | 3.025811 | 0.6734 | 12.6842 |
| 2 | 3.4817 | 3.014218 | 1.2231 | 14.3158 |




TrainOutput(global_step=3370, training_loss=3.562636124698628, metrics={'train_runtime': 11437.9155, 'train_samples_per_second': 4.712, 'train_steps_per_second': 0.295, 'total_flos': 227777608482816.0, 'train_loss': 3.562636124698628, 'epoch': 2.0})
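
As a sanity check, the numbers in TrainOutput can be reproduced from the settings above. This sketch assumes a single device and no gradient accumulation; the inputs are copied from the Map output, the training arguments, and TrainOutput.

```python
import math

train_examples = 26945         # from the Map progress output
batch_size = 16                # per_device_train_batch_size
epochs = 2                     # num_train_epochs
train_runtime_s = 11437.9155   # train_runtime from TrainOutput

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime_s

print(steps_per_epoch)                   # 1685
print(total_steps)                       # 3370
print(round(samples_per_second, 3))      # 4.712
print(round(train_runtime_s / 3600, 2))  # 3.18
```

The step count and throughput match global_step and train_samples_per_second, and the run works out to a little over three hours for two epochs on the trimmed data.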

# Part 2: Post Processing

We start by making a handy translation function to make running the model less of a chore.

In [72]:
def translate_japanese_to_english(japanese_text):
    #Preprocess the text
    input_text = prefix + japanese_text
    print(input_text)
    #Tokenize, then move the ids to the model's device (the model may be on the GPU after training)
    tokenized_input = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)

    #Translating the text itself!
    translated_tokens = model.generate(tokenized_input, max_new_tokens=128)

    #Decoding the output
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

    return translated_text


japanese_sentence = "こんにちは！"
english_translation = translate_japanese_to_english(japanese_sentence)
print("Japanese: " + japanese_sentence)
print("English: " + english_translation)

Translate Japanese to English: こんにちは！
Japanese: こんにちは！
English: i'm not gonna be able to do it!


And...yikes. That sentence is not a real translation; it is what the model falls back on when it can't make sense of its input. Even after lots of bug fixes, I think the underlying issue is that the model is too weak. The focus of this assignment is less on the model itself at this point, and I am unfortunately short on time, so I'll move on to mock-ups.
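
One possible diagnostic, which was not run for this demo, is to check how much of the Japanese input the tokenizer maps to its unknown token. The helper below is a hypothetical sketch with made-up ids; the commented lines show how it might be called with the tokenizer loaded earlier.

```python
def unknown_ratio(token_ids, unk_id):
    """Fraction of ids in token_ids equal to the tokenizer's unknown-token id."""
    if not token_ids:
        return 0.0
    return sum(1 for tok in token_ids if tok == unk_id) / len(token_ids)

# Made-up ids for illustration; 2 stands in for the unknown-token id here.
print(unknown_ratio([3, 2, 2, 2, 1], unk_id=2))  # 0.6

# With the notebook's tokenizer loaded, the call would look like:
# ids = tokenizer("こんにちは！").input_ids
# print(unknown_ratio(ids, tokenizer.unk_token_id))
```

If most ids came back as unknown, the model would be seeing very little of the Japanese text, which would point at tokenizer coverage rather than training time.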

The NLTK function below splits the sentence into words and tags each one with its part of speech.

In [65]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

def get_parts_of_speech_nltk(sentence):
  words = nltk.word_tokenize(sentence)
  pos_tags = nltk.pos_tag(words)
  return pos_tags

print(get_parts_of_speech_nltk("My name is Rhys"))

[('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Rhys', 'NNP')]


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


I then constructed a function that takes said tags and turns them into a dictionary that can more easily be interpreted by humans.

In [66]:
def format_pos(pos_list):
  #Creating an empty dictionary with keys for parts of speech
  my_dict = {
      "subjects": [],
      "objects": [],
      "verbs": [],
      "adjectives": [],
      "adverbs": [],
      "others": []
  }

  #Declaring lists for the nltk tags
  nouns = ["NN", "NNS", "NNP", "NNPS"]
  verbs = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
  adjectives = ["JJ", "JJR", "JJS"]
  adverbs = ["RB", "RBR", "RBS"]
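  pronouns = ["PRP", "PRP$", "WP", "WP$"] #Not used yet: pronouns currently land in "others"
  prepositions = ["IN"] #Not used yet: prepositions currently land in "others"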

  #Now we iterate through and add to the dictionary!
  for item in pos_list:
    if item[1] in nouns:
      if my_dict["subjects"] == []:
        my_dict["subjects"].append(item[0])
      else:
        my_dict["objects"].append(item[0])
    elif item[1] in verbs:
      my_dict["verbs"].append(item[0])
    elif item[1] in adjectives:
      my_dict["adjectives"].append(item[0])
    elif item[1] in adverbs:
      my_dict["adverbs"].append(item[0])
    else:
      my_dict["others"].append(item[0])

  return my_dict
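
To see the bucketing rule in action without needing NLTK, here is a condensed sketch run on the hand-copied tags from the earlier tagger output. The function name bucket is my own, and this shortened version sends adjectives and adverbs to "others".

```python
# Tags copied by hand from the tagger output shown earlier, so this runs without NLTK.
tagged = [("My", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("Rhys", "NNP")]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def bucket(pos_list):
    """Condensed bucketing rule: the first noun is the subject, later nouns are objects."""
    out = {"subjects": [], "objects": [], "verbs": [], "others": []}
    for word, tag in pos_list:
        if tag in NOUN_TAGS:
            key = "objects" if out["subjects"] else "subjects"
        elif tag in VERB_TAGS:
            key = "verbs"
        else:
            key = "others"
        out[key].append(word)
    return out

print(bucket(tagged))
# {'subjects': ['name'], 'objects': ['Rhys'], 'verbs': ['is'], 'others': ['My']}
```

Note that "Rhys" ends up under objects here, because the rule is purely positional: whichever noun comes first is treated as the subject.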

Then I build my translation guide!

In [67]:
def build_translation_guide(english_text, japanese_text, pos_dict):
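  #Fall back to a placeholder when no subject was found, instead of raising an IndexError
  subjects = pos_dict["subjects"][0] if pos_dict["subjects"] else "(none found)"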
  objects = ""
  verbs = ""
  adjectives = ""
  adverbs = ""
  others = ""

  for obj in pos_dict["objects"]:
    objects += obj + ", "
  for verb in pos_dict["verbs"]:
    verbs += verb + ", "
  for adj in pos_dict["adjectives"]:
    adjectives += adj + ", "
  for adv in pos_dict["adverbs"]:
    adverbs += adv + ", "
  for other in pos_dict["others"]:
    others += other + ", "


  print("--------------------------------------")
  print("JPN: " + japanese_text)
  print("ENG: " + english_text)
  print("Subjects: " + subjects)
  print("Objects: " + objects)
  print("Verbs: " + verbs)
  print("Adjectives: " + adjectives)
  print("Adverbs: " + adverbs)
  print("Other words: " + others)
  print("--------------------------------------")



In [68]:
english_text = "Rhys went to American University"
japanese_text = "リスはアメリカン大学に行きました。"

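#Note: this reuses the name format_translation from the data formatting step in Part 1 and replaces that earlier function
def format_translation(japanese_text, english_text):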
  pos_dict = format_pos(get_parts_of_speech_nltk(english_text))
  build_translation_guide(english_text, japanese_text, pos_dict)

#The line below is an example of a correct function call, but it is commented
#out since our translation model is being grumpy
#format_translation(japanese_text, translate_japanese_to_english(japanese_text))

format_translation(japanese_text, english_text)

--------------------------------------
JPN: リスはアメリカン大学に行きました。
ENG: Rhys went to American University
Subjects: Rhys
Objects: American, University, 
Verbs: went, 
Adjectives: 
Adverbs: 
Other words: to, 
--------------------------------------


We thankfully see the output is very fast, but let's try some broken sentences!

In [70]:
english_text = "The Rhys is ate food midnight after"
japanese_text = "リスは十二時の後で食べ物を食べました。"

format_translation(japanese_text, english_text)

--------------------------------------
JPN: リスは十二時の後で食べ物を食べました。
ENG: The Rhys is ate food midnight after
Subjects: Rhys
Objects: food, midnight, 
Verbs: is, 
Adjectives: ate, 
Adverbs: 
Other words: The, after, 
--------------------------------------


Not perfect by any means, but it does work! It does have the key issue of assuming that the first noun is the subject, but since the subject comes first in both sentence orders, it's not a massive issue here. The bigger issue will be if the subject is omitted completely, which I don't currently have a fix for.
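
As one possible next step, the same tags could be used to flag English output that still looks verb-final. This is a crude heuristic sketch, not something the demo currently does: the function name and the two-noun threshold are my own assumptions, and the tags are hand-written (a real tagger may label scrambled sentences differently, as with "ate" above).

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def looks_verb_final(pos_list):
    """Flag sentences where at least two nouns come before the last verb and none after it."""
    verb_positions = [i for i, (_, tag) in enumerate(pos_list) if tag in VERB_TAGS]
    if not verb_positions:
        return False
    last_verb = verb_positions[-1]
    nouns_before = sum(1 for _, tag in pos_list[:last_verb] if tag in NOUN_TAGS)
    nouns_after = sum(1 for _, tag in pos_list[last_verb + 1:] if tag in NOUN_TAGS)
    return nouns_before >= 2 and nouns_after == 0

# Hand-written tags for illustration.
svo = [("Rhys", "NNP"), ("ate", "VBD"), ("food", "NN")]
sov = [("Rhys", "NNP"), ("food", "NN"), ("ate", "VBD")]

print(looks_verb_final(svo))  # False
print(looks_verb_final(sov))  # True
```

It will misfire on sentences such as "Rhys and Mary slept", so it would only be a starting point for deciding which outputs deserve a closer look.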

#This has been Rhys's Final Demonstration! Thank you very much!