# ISYE 6740 - Project
### Natural Language Processing Methods with Machine Translation

### Step 0. Load Libraries

In [1]:
# load libraries
import keras
import tensorflow as tf
import collections # used to identify counts of words
import pandas as pd # used to load in dataset
import numpy as np
from keras.layers import TextVectorization # for word tokenization
from keras_preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

### Step 1. Load and Inspect Dataset

In [2]:
words_df = (pd.read_csv('pairs.tsv', sep='\t', header=None)).drop([0, 2], axis=1)
words_df.columns = ['English', 'Norwegian']

words_df.head()

| | English | Norwegian |
|---|---|---|
| 0 | Let's try something. | La oss prøve noe. |
| 1 | I have to go to sleep. | Jeg må legge meg. |
| 2 | Today is June 18th and it is Muiriel's birthday! | I dag er det juni den 18. og det er Muiriels b... |
| 3 | Muiriel is 20 now. | Muiriel er 20 nå. |
| 4 | The password is "Muiriel". | Passordet er "Muiriel". |


In [3]:
words_df.shape

(13066, 2)

In [4]:
# get total, unique counts as well as most frequent words for each language
for lang in words_df.columns:
    counts_df = words_df.copy() # create new df for string editing
    counts_df[lang] = counts_df[lang].str.replace(r'[^\w\s]+', '', regex=True) # removes all punctuation
    
    # use Counter function to get counts of each word
    results = collections.Counter()
    counts_df[lang].str.lower().str.split().apply(results.update)
    
    # print results
    print(f'Total {lang} Words: {sum(results.values())}')
    print(f'Unique {lang} Words: {len(results.keys())}')
    print(f'Top 3 {lang} Words: {results.most_common(3)}\n')
  
    counts_df.head()

Total English Words: 82687
Unique English Words: 6704
Top 3 English Words: [('the', 3195), ('i', 2872), ('to', 2462)]

Total Norwegian Words: 81460
Unique Norwegian Words: 8338
Top 3 Norwegian Words: [('jeg', 3663), ('er', 3335), ('det', 2094)]



We can see that the English side of this dataset has only slightly more total words than the Norwegian side (82,687 vs. 81,460); however, Norwegian has 1,634 more unique words. Another interesting insight is that 'jeg', 'er', and 'det' are the most frequent Norwegian words. Intuitively, this makes sense; a neat little thing about Norwegian is that 'er' serves as the linking verb for every subject, singular or plural. For example:
- "He is" = "Han er"
- "They are" = "De er"
- "I am" = "Jeg er"

### Step 2. Tokenize and Pad Data for Modelling

No cleaning is needed for this dataset, as the raw sentences are already grammatically correct. We removed punctuation during data inspection only to get true word counts; punctuation may carry syntactic information that helps the models we build later. For that reason, we leave punctuation characters in the dataset.

Our next step is to tokenize the data, which assigns a unique integer identifier to each word in each language. This is necessary before we can feed the language data into our models. With the TextVectorization layer from the Keras package, we map integer values to all the words in the dataset, separately for each language.
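
As a rough illustration of what tokenization produces, the sketch below builds a word-to-integer mapping by hand. The three sentences and the `word_to_id` name are made up for this example; it mimics, but does not reproduce, what TextVectorization does (lowercasing, whitespace splitting, frequency-ranked IDs with 0 and 1 reserved).

```python
from collections import Counter

# Toy sentences, made up for illustration (not taken from the dataset).
sentences = ["Jeg er her.", "Han er her.", "Jeg må legge meg."]

# Lowercase and split on whitespace only, so punctuation stays attached to words.
tokens = [s.lower().split() for s in sentences]

# Rank words by frequency. IDs start at 2 because 0 is kept for padding
# and 1 for unknown words, the same convention TextVectorization follows.
counts = Counter(word for sent in tokens for word in sent)
word_to_id = {word: i + 2 for i, (word, _) in enumerate(counts.most_common())}

# Replace every word with its integer ID (1 would mark an unseen word).
encoded = [[word_to_id.get(word, 1) for word in sent] for sent in tokens]
print(encoded)
```

Note that "her." keeps its trailing period as part of the token, which is the effect of leaving punctuation in the data.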

Another step we must apply is padding. The models require all inputs to have the same length, but our sentences have different word counts. Padding appends a filler value (here 0) to the shorter sentences until each has as many values as the longest sentence in the dataset. A method often used in tandem with this is truncation, which caps the number of words allowed in a sentence. Since truncation discards data, we prepare our data with padding only.
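
A minimal sketch of the difference between padding and truncation, using made-up token IDs and plain Python lists rather than the Keras utilities:

```python
# Toy token sequences of unequal length (made-up IDs; 0 is reserved for padding).
sequences = [[2, 3, 4], [5, 3], [2, 6, 7, 8, 9]]

# The target width is the length of the longest sequence.
max_len = max(len(seq) for seq in sequences)

# Post-padding: append zeros until every row reaches max_len.
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]

# Truncation, shown only for contrast: keeping 3 tokens drops data from the last row.
truncated = [seq[:3] for seq in sequences]

print(padded)
print(truncated)
```

Every row of `padded` has the same width and no token is lost, whereas `truncated` discards the last two tokens of the longest sequence.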

In [5]:
# Make vectorization layer.
vectorize_layer = keras.layers.TextVectorization(
    max_tokens=None, # no limit on max tokens
    standardize='lower', # standardize lowercase (no stripping of punctuation)
    split='whitespace', # splitting on space between words
    output_mode='int', # getting integer values for tokens
)

lang_dict = {}
lang_vocab = {}
for lang in words_df.columns:
    # Adapt layer on the text-only dataset to create the vocabulary.
    vectorize_layer.adapt(words_df[lang])
    lang_vocab[lang] = vectorize_layer.get_vocabulary(include_special_tokens = False)
    
    # Create the model that uses the text layer.
    model = keras.models.Sequential()

    # Start by creating an explicit input layer. It needs to have a shape of
    # (1,) (because we need to guarantee that there is exactly one string
    # input per batch), and the dtype needs to be 'string'.
    model.add(keras.Input(shape=(1,), dtype=tf.string))

    # The first layer in our model is the vectorization layer. After this
    # layer, we have a tensor of shape (batch_size, max_len) containing vocab
    # indices.
    model.add(vectorize_layer)

    # Get token value array.
    input_data = words_df[lang]
    lang_dict[lang] = np.array(model.predict(input_data).to_tensor())
    
    # Detail the shape of that array.
    max_list = max(lang_dict[lang], key = lambda i: len(i))
    max_len = len(max_list)
    print(f'{lang} Token Array Shape: {lang_dict[lang].shape}\nLength of Longest Sentence: {max_len}')

English Token Array Shape: (13066, 209)
Length of Longest Sentence: 209
Norwegian Token Array Shape: (13066, 213)
Length of Longest Sentence: 213


As expected, both token arrays have the same number of rows as the original dataset (13,066, one row per sentence). We also see that the longest English and Norwegian sentences contain 209 and 213 tokens, respectively. To double check the padding, the last lines of the previous code cell find the longest row in each token array and print its length. That length matches the second dimension of the array, which confirms that every sentence was padded to the same width; we now have two arrays of tokenized values, one per language, which will be used for modelling.

To make sure our models perform correctly, we need both token arrays to have the same dimensions. This requires additional padding for the English array to match the Norwegian array's size.

In [8]:
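pad_width = lang_dict["Norwegian"].shape[1] - lang_dict["English"].shape[1] # 213 - 209 = 4 extra columns
tmp_x = np.pad(lang_dict["English"], ((0, 0), (0, pad_width)), 'constant') # pad English array to Norwegian array size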

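A final step is to write a helper function that we can apply to each model's output. For every position in a sentence, a model returns a score for each vocabulary index, so we take the highest-scoring index at each position and map it back to its word. The function below does just that.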

In [9]:
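def logits_to_text(logits, vocab):
    # vocab: word list from get_vocabulary(include_special_tokens=False), e.g. lang_vocab['Norwegian']
    # TextVectorization reserves index 0 for padding and index 1 for unknown words,
    # so the first real word in vocab has index 2
    index_to_words = {i + 2: word for i, word in enumerate(vocab)}
    index_to_words[0] = '<PAD>'
    index_to_words[1] = '[UNK]'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])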
print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
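
To make the decoding step concrete, here is a small self-contained sketch. The five-entry `id_to_word` list and the score matrix are invented for illustration and stand in for a real vocabulary and real model output.

```python
import numpy as np

# Hypothetical lookup table: index 0 is padding, index 1 is the unknown token.
id_to_word = ['<PAD>', '[UNK]', 'jeg', 'er', 'her.']

# Fake model output for one sentence: 4 time steps x 5 vocabulary scores.
logits = np.array([
    [0.1, 0.0, 0.7, 0.1, 0.1],  # highest score at index 2
    [0.1, 0.0, 0.1, 0.6, 0.2],  # highest score at index 3
    [0.0, 0.1, 0.1, 0.2, 0.6],  # highest score at index 4
    [0.9, 0.0, 0.0, 0.1, 0.0],  # highest score at index 0 (padding)
])

# Pick the best-scoring vocabulary index at each time step, then look up the words.
token_ids = np.argmax(logits, axis=1)
decoded = ' '.join(id_to_word[i] for i in token_ids)
print(decoded)
```

Padded positions decode to `<PAD>`, so a translated sentence is typically followed by a run of padding markers.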


### Step 3. Use SMT and NMT techniques to make models

##### Model 1. RNN

In [None]:
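# get shape details of token array length and word indices
max_norsk_sequence_length = lang_dict["Norwegian"].shape[1]
# vocabulary size = largest token ID + 1, so every label fits in the output layer
# (counting unique IDs undercounts when an ID, e.g. 1 for unknown words, never occurs)
english_vocab_size = int(lang_dict["English"].max()) + 1
norsk_vocab_size = int(lang_dict["Norwegian"].max()) + 1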


def simple_model(input_shape, output_sequence_length, english_vocab_size, norsk_vocab_size):
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences = True)(input_seq)
    logits = TimeDistributed(Dense(norsk_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss = sparse_categorical_crossentropy, 
                 optimizer = Adam(learning_rate), 
                 metrics = ['accuracy'])
    
    return model

#tests.test_simple_model(simple_model)

#tmp_x = pad(lang_dict['English'], lang_dict['Norwegian'].shape[1]) # padding english token vectors to max norwegian
tmp_x = tmp_x.reshape((lang_dict["Norwegian"].shape[0], lang_dict["Norwegian"].shape[1], 1)) # reshape english token vector

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_norsk_sequence_length,
    english_vocab_size,
    norsk_vocab_size)
simple_rnn_model.fit(tmp_x, lang_dict["Norwegian"], batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
#print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))

Epoch 1/10


### Step 4. Compare model accuracies (summary section)