# English to Cherokee

## Milestone Information

### Team Members:
Rithvik Doshi, Saisriram Gunturu, Ruihang (Henry) Liu

### Project Description

We aim to build a model that translates English text into Cherokee. Cherokee is an endangered language, so we want to see how well the machine translation models we covered in class perform on it.

### Approach
We'll use the following data sources:
- https://github.com/ZhangShiyue/ChrEn/tree/main/data
- https://github.com/CherokeeLanguage/CherokeeEnglishCorpus/tree/master/corpus.aligned/en_chr

We will also experiment with the following architectures and approaches to find out which works best for translating English to Cherokee:
- https://github.com/lukysummer/Machine-Translation-Seq2Seq-Keras/tree/master/data
- https://medium.com/@patrickhk/use-keras-to-build-a-english-to-french-translator-with-various-rnn-model-architecture-a374
- https://github.com/LaurentVeyssier/Machine-translation-English-French-with-Deep-neural-Network/blob/main/machine_translation.ipynb
- https://arxiv.org/pdf/2010.04791v1.pdf

### Project Plan:

The project will consist of the following phases:
1. EDA / Data Loading
    a. Concatenate as many data sources as possible to build the largest corpus we can
    b. Split the data into training, validation, and test sets
2. Model Development
    a. Finalize model selection and architecture, and build it in PyTorch
3. Model Training
    a. Use available SCC GPUs to train the model.
4. Model Validation
    a. Validation sentences should give us a metric of accuracy.
5. Model Testing
    a. Use ChrEn model to translate back to English to see how we did.
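
One way to make the "metric of accuracy" in step 4 concrete is token-level accuracy that skips padded positions. The sketch below is only an assumption about what such a metric might look like, not the metric the project has settled on. `token_accuracy` and `pad_id` are hypothetical names, and the padding id is assumed to be 0.

```python
def token_accuracy(predicted, reference, pad_id=0):
    """Fraction of non-padding reference tokens that were predicted exactly.

    predicted, reference: equal-shaped lists of token-id lists.
    pad_id: id used for padding (assumed to be 0 here).
    """
    correct = 0
    total = 0
    for pred_sentence, ref_sentence in zip(predicted, reference):
        for pred_token, ref_token in zip(pred_sentence, ref_sentence):
            if ref_token == pad_id:
                continue  # padded positions carry no information
            total += 1
            correct += int(pred_token == ref_token)
    return correct / total if total else 0.0


# Toy data: 7 real tokens in the reference, 6 of them predicted correctly
reference = [[6, 1, 7, 0, 0], [2, 3, 11, 12, 0]]
predicted = [[6, 1, 9, 0, 0], [2, 3, 11, 12, 5]]
print(token_accuracy(predicted, reference))
```

Exact-match accuracy is strict for translation, where several outputs can be equally valid, so it is probably best read as a rough training signal.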

# EDA

## Imports

In [1]:
import math
import numpy as np
from numpy.random import shuffle, seed, choice
from tqdm import tqdm
from collections import defaultdict, Counter
import pandas as pd
import re
import matplotlib.pyplot as plt

import torch
import torch.nn.functional as F
from torch.utils.data import random_split,Dataset,DataLoader,TensorDataset
from torchvision import datasets, transforms
from torch import nn, optim

import torchvision.transforms as T

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import nltk
import pickle
from statistics import mean
from torchtext.vocab import GloVe
from nltk.corpus import gutenberg

from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

In [7]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cpu')

In [8]:
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")

In [9]:
device

device(type='mps')

# Data Loading and pre-processing

In [1]:
import os

data_dir = "./CherokeeEnglishCorpus/corpus.aligned/en_chr/"

In [2]:
def load_input_for(language=".en"):
    """
    Load every line of every corpus file with the given extension
    :param language: the file extension of the input (".en" for English and ".chr" for Cherokee)
    :return: a flat list of stripped lines
    """
    # Get all files with the requested extension. Sort them so that the English
    # and Cherokee files are read in the same order (os.listdir order is arbitrary).
    file_list = sorted(file for file in os.listdir(data_dir) if file.endswith(language))

    # Initialize the empty array for the input
    lines_array = []    # structure: [lines in the document]

    for file in file_list:
        file_path = os.path.join(data_dir, file)
        # The Cherokee syllabary is non-ASCII, so read explicitly as UTF-8
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                lines_array.append(line.strip())

    return lines_array

In [3]:
english_sentences = load_input_for(".en")
cherokee_sentences = load_input_for(".chr")

In [4]:
print(len(english_sentences), len(cherokee_sentences))  # should match

107168 107168
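
Matching totals do not guarantee that every file pair lines up, since an extra line in one file could be offset by a missing line in another. The sketch below checks each pair separately. `check_alignment` is a hypothetical helper, and it assumes that every English file has a Cherokee counterpart with the same base name, which we have not verified for this corpus.

```python
import os
import tempfile


def check_alignment(directory, src_ext=".en", tgt_ext=".chr"):
    """Return (stem, n_src_lines, n_tgt_lines) for every mismatched or unpaired file."""
    def count_lines(path):
        with open(path, "r", encoding="utf-8") as f:
            return sum(1 for _ in f)

    # Collect the base names of all files that carry either extension
    stems = sorted({os.path.splitext(name)[0] for name in os.listdir(directory)
                    if name.endswith((src_ext, tgt_ext))})
    problems = []
    for stem in stems:
        src_path = os.path.join(directory, stem + src_ext)
        tgt_path = os.path.join(directory, stem + tgt_ext)
        n_src = count_lines(src_path) if os.path.exists(src_path) else None
        n_tgt = count_lines(tgt_path) if os.path.exists(tgt_path) else None
        if n_src != n_tgt:
            problems.append((stem, n_src, n_tgt))
    return problems


# Tiny demonstration on throwaway files (not the real corpus)
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.en", "one\ntwo\n"), ("a.chr", "x\ny\n"),
                       ("b.en", "one\n"), ("b.chr", "x\ny\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_problems = check_alignment(tmp)
    print(demo_problems)
```

On the real data this would be called as `check_alignment(data_dir)`; an empty list means every pair has the same number of lines.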


## Data preprocessing

### Tokenizer:

In [5]:
from keras.preprocessing.text import Tokenizer

def tokenize(x):
    x_tk = Tokenizer()
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk



In [6]:
# Test our tokenize()
test_text = ["In the beginning God created the heavens and the earth.",
             "And God said, Let there be light: and there was light."]  # just 2 short sentences from our data
test_text_tokenized, test_tokenizer = tokenize(test_text)

test_text_tokenized

[[6, 1, 7, 3, 8, 1, 9, 2, 1, 10], [2, 3, 11, 12, 4, 13, 5, 2, 4, 14, 5]]

In [7]:
test_tokenizer.word_index

{'the': 1,
 'and': 2,
 'god': 3,
 'there': 4,
 'light': 5,
 'in': 6,
 'beginning': 7,
 'created': 8,
 'heavens': 9,
 'earth': 10,
 'said': 11,
 'let': 12,
 'be': 13,
 'was': 14}

We can see that the Keras tokenizer already lowercases the text and strips punctuation, so we don't have to handle either ourselves.
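
To make that behaviour explicit, here is a dependency-free sketch of what the tokenizer appears to do with its default settings: lowercase the text, replace punctuation with spaces, split on whitespace, and number the words by descending frequency starting at 1. `simple_tokenize` and `FILTERS` are hypothetical names, and the filter string is our assumption of the Keras default, so this is an illustration rather than a drop-in replacement.

```python
from collections import Counter

# Assumed to match the default `filters` argument of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(texts):
    """Return (sequences, word_index) in the style of the tokenize() helper."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    split_texts = [text.lower().translate(table).split() for text in texts]
    counts = Counter(word for words in split_texts for word in words)
    # Most frequent word gets id 1; ties keep first-seen order (sorted is stable)
    ranked = sorted(counts, key=lambda word: -counts[word])
    word_index = {word: i for i, word in enumerate(ranked, start=1)}
    sequences = [[word_index[word] for word in words] for words in split_texts]
    return sequences, word_index


sample = ["In the beginning God created the heavens and the earth.",
          "And God said, Let there be light: and there was light."]
sequences, word_index = simple_tokenize(sample)
print(sequences)
```

On the two test sentences this produces the same ids as the Keras output shown above. Id 0 is never assigned to a word, which is what leaves it free for padding later.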

Apply tokenizer on our input data:

In [8]:
english_sentences_tokenized, english_tokenizer = tokenize(english_sentences)
cherokee_sentences_tokenized, cherokee_tokenizer = tokenize(cherokee_sentences)

In [9]:
english_vocab_size = len(english_tokenizer.word_index)
cherokee_vocab_size = len(cherokee_tokenizer.word_index)
print("English vocab size = {}, Cherokee vocab size = {}".format(english_vocab_size, cherokee_vocab_size))

English vocab size = 20763, Cherokee vocab size = 72759
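
The Cherokee vocabulary is about 3.5 times the size of the English one for the same number of sentences. One way to see how much of that comes from rare words is to measure the share of vocabulary entries that occur only once. The sketch below takes any word-to-count mapping, such as the `word_counts` attribute of a fitted Keras `Tokenizer`. `singleton_share` is a hypothetical name, and the toy counts are made up for illustration.

```python
def singleton_share(word_counts):
    """Share of vocabulary entries that occur exactly once in the corpus."""
    if not word_counts:
        return 0.0
    singletons = sum(1 for count in word_counts.values() if count == 1)
    return singletons / len(word_counts)


# Toy counts, not taken from the real corpus
toy_counts = {"the": 3, "and": 3, "god": 2, "in": 1, "earth": 1}
print(singleton_share(toy_counts))
```

A high share on the Cherokee side would suggest that word-level tokens are very sparse, which is worth knowing before picking an architecture.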


### Padding
Bring all sentences to the same length for our input: pad each one to the length of the longest sentence by appending trailing zeros (`padding='post'`).

In [10]:
from keras_preprocessing.sequence import pad_sequences   # on Apple Silicon the package is "keras_preprocessing"; otherwise "keras.preprocessing"
def pad(x):
    length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

In [11]:
# testing padding function:
test_text_padded = pad(test_text_tokenized)
test_text_padded

array([[ 6,  1,  7,  3,  8,  1,  9,  2,  1, 10,  0],
       [ 2,  3, 11, 12,  4, 13,  5,  2,  4, 14,  5]], dtype=int32)
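
For reference, post-padding can be written without Keras in a few lines. The sketch below also accepts an optional `max_len`, because padding everything to the single longest sentence can waste memory when a few sentences are very long. `pad_post` is a hypothetical helper. Cutting long sentences at the end is a choice made here for illustration and is not something the notebook does.

```python
def pad_post(sequences, max_len=None, pad_id=0):
    """Right-pad (and, if max_len is given, right-truncate) lists of token ids."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq[:max_len])                      # drop anything past max_len
        padded.append(kept + [pad_id] * (max_len - len(kept)))
    return padded


tokenized = [[6, 1, 7, 3, 8, 1, 9, 2, 1, 10],
             [2, 3, 11, 12, 4, 13, 5, 2, 4, 14, 5]]
padded = pad_post(tokenized)
print(padded)
```

Without `max_len` the result has the same values as the `pad()` output above, as plain lists instead of a NumPy array.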

In [12]:
# Apply padding to input:
english_sentences_padded = pad(english_sentences_tokenized)
cherokee_sentences_padded = pad(cherokee_sentences_tokenized)

### Map logits back to token labels
A function that converts predictions (one row of scores over the vocabulary per position) back into a sentence

In [13]:
import numpy as np

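def logits_to_text(logits, tokenizer):
    # Invert word_index (word -> id) into id -> word; id 0 is reserved for padding
    idx_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    idx_to_words[0] = '<PAD>'
    # logits has shape (sequence_length, vocab_size); pick the best id per position
    return ' '.join([idx_to_words[prediction] for prediction in np.argmax(logits, 1)])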
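
A quick way to sanity-check this mapping is to feed it hand-made scores. The sketch below is a dependency-free variant of the same idea that works on plain lists and drops padded positions instead of printing `<PAD>`. `ids_to_text` is a hypothetical name, and it takes the `word_index` dictionary directly rather than a tokenizer object.

```python
def ids_to_text(scores, word_index, pad_id=0):
    """Pick the highest-scoring id per position and join the words, skipping padding."""
    idx_to_word = {idx: word for word, idx in word_index.items()}
    words = []
    for position_scores in scores:
        # argmax over one row of scores
        best_id = max(range(len(position_scores)), key=position_scores.__getitem__)
        if best_id != pad_id:
            words.append(idx_to_word[best_id])
    return " ".join(words)


toy_index = {"the": 1, "and": 2, "god": 3}
toy_scores = [
    [0.1, 0.2, 0.6, 0.1],   # highest score at id 2
    [0.0, 0.1, 0.1, 0.8],   # highest score at id 3
    [0.9, 0.0, 0.1, 0.0],   # highest score at id 0 (padding)
]
decoded = ids_to_text(toy_scores, toy_index)
print(decoded)
```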

### Make Dataloader

In [14]:
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

class Basic_Dataset(Dataset):

    def __init__(self, X,Y):
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.X)

    # return a pair x,y at the index idx in the data set
    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]


In [15]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(english_sentences_padded, cherokee_sentences_padded, test_size=0.2, random_state=42)

# Split the train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

train_dataset = Basic_Dataset(X_train, y_train)
val_dataset = Basic_Dataset(X_val, y_val)
test_dataset = Basic_Dataset(X_test, y_test)
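
Because the second split takes 20% of the remaining 80%, the final proportions are roughly 64% train, 16% validation, and 20% test. The arithmetic can be sketched as follows, assuming the held-out size is rounded up, which is our understanding of how scikit-learn treats a fractional `test_size`. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_fraction=0.2, val_fraction=0.2):
    """Sizes after a test split followed by a validation split of the remainder."""
    n_test = math.ceil(n_samples * test_fraction)      # assumed rounding: up
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_fraction)      # 20% of what is left
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(107168)
print(n_train, n_val, n_test)
print(round(n_train / 107168, 2), round(n_val / 107168, 2), round(n_test / 107168, 2))
```

Comparing these numbers with `len(train_dataset)`, `len(val_dataset)`, and `len(test_dataset)` is a cheap check that the splits came out as intended.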

In [16]:
# For torch models:
from torch.utils.data import DataLoader

batch_size = 128

# Data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Evaluation does not depend on sample order, so validation and test data are not shuffled
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Or a loader with all data:
all_loader = DataLoader(Basic_Dataset(english_sentences_padded, cherokee_sentences_padded), batch_size=batch_size, shuffle=True)

# Model Development

## First Model: