
Udacity Artificial Intelligence Nanodegree

Term 2: Capstone Project

Machine Translation with RNNs

 

Image credit: xiandong79.github.io

 

Goal

In this project, we build a deep neural network that functions as part of a machine translation pipeline. The pipeline accepts English text as input and returns the French translation. The goal is to achieve the highest translation accuracy possible.

 

Background

The ability to communicate with one another is a fundamental part of being human. There are nearly 7,000 different languages worldwide. As our world becomes increasingly connected, language translation provides a critical cultural and economic bridge between people from different countries and ethnic groups. Some of the more obvious use-cases include:

  • business: international trade, investment, contracts, finance
  • commerce: travel, purchase of foreign goods and services, customer support
  • media: accessing information via search, sharing information via social networks, localization of content and advertising
  • education: sharing of ideas, collaboration, translation of research papers
  • government: foreign relations, negotiation

To meet this need, technology companies are investing heavily in machine translation. This investment, paired with recent advancements in deep learning, has yielded major improvements in translation quality. According to Google, switching to deep learning produced a 60% increase in translation accuracy compared to the phrase-based approach used previously. Today, translation applications from Google and Microsoft can translate over 100 different languages and are approaching human-level accuracy for many of them.

However, while machine translation has made lots of progress, it's still not perfect. 😬

Bad translation or extreme carnivorism?

 

Approach

To translate a corpus of English text to French, we need to build a recurrent neural network (RNN). Before diving into the implementation, let's first build some intuition of RNNs and why they're useful for NLP tasks.

RNN Overview

RNNs are designed to take sequences of text as inputs or return sequences of text as outputs, or both. They're called recurrent because the network's hidden layers have a loop in which the output from one time step becomes an input at the next time step. This recurrence serves as a form of memory. It allows contextual information to flow through the network so that relevant outputs from previous time steps can be applied to network operations at the current time step.
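To make the recurrence concrete, here is a minimal NumPy sketch of a single "vanilla" RNN step (the weight names are hypothetical and biases are omitted); real layers like the GRU used later in this project add gating on top of this basic pattern:

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h):
    """One vanilla RNN time step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h)

# The same function is applied at every time step, so the hidden state h carries
# context forward through the sequence:
#   h = np.zeros(hidden_size)
#   for x_t in sequence:
#       h = rnn_step(x_t, h, W_x, W_h)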

This is analogous to how we read. As you read this post, you're storing important pieces of information from previous words and sentences and using them as context to understand each new word and sentence.

Other types of neural networks can't do this. Imagine you're using a convolutional neural network (CNN) to perform object detection in a movie. Currently, there's no way for information from objects detected in previous scenes to inform the model's detection of objects in the current scene. For example, if a courtroom and judge were detected in a previous scene, that information could help correctly classify the judge's gavel in the current scene (instead of misclassifying it as a hammer or mallet). But CNNs don't allow this type of time-series context to flow through the network like RNNs do.

RNN Setup

Depending on the use-case, you'll want to set up your RNN to handle inputs and outputs differently. For this project, we'll use a many-to-many process where the input is a sequence of English words and the output is a sequence of French words (see the fourth model from the left in the diagram below).

Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue, and green vectors hold the RNN's state (more on this soon).

From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case there are no pre-specified constraints on the sequence lengths because the recurrent transformation (green) is fixed and can be applied as many times as we like.

Image and quote source: karpathy.github.io

 

Building the Pipeline

Below is a summary of the various preprocessing and modeling steps. The high-level steps include:

  1. Preprocessing: load and examine data, cleaning, tokenization, padding
  2. Modeling: build, train, and test the model
  3. Prediction: generate specific translations of English to French, and compare the output translations to the ground truth translations
  4. Iteration: iterate on the model, experimenting with different architectures

For a more detailed walkthrough including the source code, check out the Jupyter notebook in the main directory (machine_translation.ipynb).

Toolset

We use Keras for the frontend and TensorFlow for the backend in this project. I prefer using Keras on top of TensorFlow because the syntax is simpler, which makes building the model layers more intuitive. However, there is a trade-off: with Keras you lose some of the ability to make fine-grained customizations. That won't affect the models we're building in this project, though.

 

Preprocessing

Load & Examine Data

Here is a sample of the data. The inputs are sentences in English; the outputs are the corresponding translations in French.

 

When we run a word count, we can see that the vocabulary for the dataset is quite small. This is by design, so that we can train the models in a reasonable amount of time.

Cleaning

No additional cleaning needs to be done at this point. The data has already been converted to lowercase and split so that there are spaces between all words and punctuation.

Note: For other NLP projects you may need to perform additional steps such as: remove HTML tags, remove stop words, remove punctuation or convert to tag representations, label the parts of speech, or perform entity extraction.

Tokenization

Next we need to tokenize the data—i.e., convert the text to numerical values. This allows the neural network to perform operations on the input data. For this project, each word and punctuation mark will be given a unique ID. (For other NLP projects, it might make sense to assign each character a unique ID.)

When we run the tokenizer, it creates a word index, which is then used to convert each sentence to a vector.
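As a rough sketch of this step using the Keras Tokenizer (the notebook's exact settings and the IDs shown in the comment may differ):

from keras.preprocessing.text import Tokenizer

def tokenize(sentences):
    """Fit a word-level tokenizer on the corpus and convert each sentence to a list of word IDs."""
    tokenizer = Tokenizer()   # pass filters='' if punctuation marks should keep their own IDs
    tokenizer.fit_on_texts(sentences)
    return tokenizer.texts_to_sequences(sentences), tokenizer

# e.g. "new jersey is sometimes quiet during autumn" might map to [17, 23, 1, 8, 67, 4, 39]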

Padding

When we feed our sequences of word IDs into the model, each sequence needs to be the same length. To achieve this, padding is added to any sequence that is shorter than the max length (i.e. shorter than the longest sentence).
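A minimal sketch using Keras' pad_sequences, assuming `sequences` is the output of the tokenization step and `max_sequence_length` is the length of the longest sentence (whether padding is added before or after the words is a configuration choice; the notebook's setting may differ):

from keras.preprocessing.sequence import pad_sequences

# Pad every sequence of word IDs with zeros up to the length of the longest sentence.
padded = pad_sequences(sequences, maxlen=max_sequence_length, padding='post')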

One-Hot Encoding (not used)

In this project, our input sequences will be a vector containing a series of integers. Each integer represents an English word (as seen above). However, in other projects, sometimes an additional step is performed to convert each integer into a one-hot encoded vector. We don't use one-hot encoding (OHE) in this project, but you'll see references to it in certain diagrams (like the one below). I just didn't want you to get confused.

One of the advantages of OHE is that the resulting binary vectors are simple and computationally cheap to work with. The other advantage is that OHE better represents categorical data where there is no ordinal relationship between different values. For example, let's say we're classifying animals as either a mammal, reptile, fish, or bird. If we encode them as 1, 2, 3, 4 respectively, our model may assume there is a natural ordering between them, which there isn't. It's not useful to structure our data such that mammal comes before reptile and so forth. This can mislead our model and cause poor results. However, if we then apply one-hot encoding to these integers, changing them to binary representations—1000, 0100, 0010, 0001 respectively—then no ordinal relationship can be inferred by the model.
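For example, here is a quick sketch of one-hot encoding those four animal classes with Keras' to_categorical (note that class IDs start at 0 here rather than the 1-4 used above):

from keras.utils import to_categorical

class_ids = [0, 1, 2, 3]            # mammal, reptile, fish, bird
one_hot = to_categorical(class_ids)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]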

But, one of the drawbacks of OHE is that the vectors can get very long and sparse. The length of the vector is determined by the vocabulary, i.e. the number of unique words in your text corpus. As we saw in the data examination step above, our vocabulary for this project is very small—only 227 English words and 355 French words. By comparison, the Oxford English Dictionary has 172,000 words. But, if we include various proper nouns, verb tenses, and slang, there could be millions of words in each language. For example, Google's word2vec is trained on a vocabulary of 3 million unique words. If we used OHE on this vocabulary, the vector for each word would include one positive value (1) surrounded by 2,999,999 zeros!

And, since we're using embeddings (in the next step) to further encode the word representations, we don't need to bother with OHE. Any efficiency gains aren't worth it on a data set this small.

 

Modeling

First, let's break down the architecture of an RNN at a high level. Referring to the diagram above, there are a few parts of the model we need to be aware of:

  1. Inputs — Input sequences are fed into the model with one word for every time step. Each word is encoded as a unique integer or one-hot encoded vector that maps to the English dataset vocabulary.
  2. Embedding Layers — Embeddings are used to convert each word to a vector. The size of the vector depends on the complexity of the vocabulary.
  3. Recurrent Layers (Encoder) — This is where the context from word vectors in previous time steps is applied to the current word vector.
  4. Dense Layers (Decoder) — These are typical fully connected layers used to decode the encoded input into the correct translation sequence.
  5. Outputs — The outputs are returned as a sequence of integers or one-hot encoded vectors which can then be mapped to the French dataset vocabulary.
 

Embeddings

Embeddings allow us to capture more precise syntactic and semantic word relationships. This is achieved by projecting each word into n-dimensional space. Words with similar meanings occupy similar regions of this space; the closer two words are, the more similar they are. And often the vectors between words represent useful relationships, such as gender, verb tense, or even geopolitical relationships.

Training embeddings on a large dataset from scratch requires a huge amount of data and computation. So, instead of doing it ourselves, we'd normally use a pre-trained embeddings package such as GloVe or word2vec. When used this way, embeddings are a form of transfer learning. However, since our dataset for this project has a small vocabulary and little syntactic variation, we'll use Keras to train the embeddings ourselves.
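In Keras, training our own embeddings amounts to adding an Embedding layer as the first layer of the model. A sketch, assuming `english_vocab_size` and `max_sequence_length` come from the preprocessing steps above (the embedding size of 128 matches the final model shown later):

from keras.layers import Embedding

# Map each word ID in the English vocabulary to a dense 128-dimensional vector.
embedding_layer = Embedding(input_dim=english_vocab_size, output_dim=128,
                            input_length=max_sequence_length)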

 

Encoder & Decoder

Our sequence-to-sequence model links two recurrent networks: an encoder and decoder. The encoder summarizes the input into a context variable, also called the state. This context is then decoded and the output sequence is generated.
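To illustrate the idea, here is a minimal Keras sketch of an encoder-decoder pair where the encoder's final hidden state is handed to the decoder as its starting context. Note that this is not the exact architecture used in this project (the final model below uses RepeatVector to feed the context to the decoder), and the input shapes and variable names are illustrative:

from keras.layers import Input, GRU, Dense
from keras.models import Model

# Encoder: read the input sequence and keep only its final hidden state (the context/state).
encoder_inputs = Input(shape=(None, num_encoder_features))
_, state = GRU(128, return_state=True)(encoder_inputs)

# Decoder: generate the output sequence, starting from the encoder's context.
decoder_inputs = Input(shape=(None, num_decoder_features))
decoder_outputs = GRU(128, return_sequences=True)(decoder_inputs, initial_state=state)
predictions = Dense(french_vocab_size, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], predictions)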

 

Image credit: Udacity

 

Since both the encoder and decoder are recurrent, they have loops which process each part of the sequence at different time steps. To picture this, it's best to unroll the network so we can see what's happening at each time step.

In the example below, it takes four time steps to encode the entire input sequence. At each time step, the encoder "reads" the input word and performs a transformation on its hidden state. Then it passes that hidden state to the next time step. Keep in mind that the hidden state represents the relevant context flowing through the network. The bigger the hidden state, the greater the learning capacity of the model, but also the greater the computation requirements. We'll talk more about the transformations within the hidden state when we cover gated recurrent units (GRU).

Image credit: modified version from Udacity

 

For now, notice that for each time step after the first word in the sequence there are two inputs: the hidden state and a word from the sequence. For the encoder, it's the next word in the input sequence. For the decoder, it's the previous word from the output sequence.

Also, remember that when we refer to a "word," we really mean the vector representation of the word which comes from the embedding layer.

 

Bidirectional Layer

Now that we understand how context flows through the network via the hidden state, let's take it a step further by allowing that context to flow in both directions. This is what a bidirectional layer does.

In the example above, the encoder only has historical context. But, providing future context can result in better model performance. This may seem counterintuitive to the way humans process language, since we only read in one direction. However, humans often require future context to interpret what is being said. In other words, sometimes we don't understand a sentence until an important word or phrase is provided at the end. Happens this does whenever Yoda speaks. 😑 🙏

To implement this, we train two RNN layers simultaneously. The first layer is fed the input sequence as-is and the second is fed a reversed copy.
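In Keras this is a one-line wrapper around the recurrent layer; a minimal sketch (the layer size of 128 matches the final model below):

from keras.layers import GRU, Bidirectional

# One copy of the GRU reads the input sequence as-is and a second copy reads it reversed;
# their outputs are then concatenated.
bidirectional_encoder = Bidirectional(GRU(128))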

 

Hidden Layer — Gated Recurrent Unit (GRU)

Now let's make our RNN a little bit smarter. Instead of allowing all of the information from the hidden state to flow through the network, what if we could be more selective? Perhaps some of the information is more relevant, while other information should be discarded. This is essentially what a gated recurrent unit (GRU) does.

There are two gates in a GRU: an update gate and a reset gate. This article by Simeon Kostadinov explains these in detail. To summarize, the update gate (z) helps the model determine how much information from previous time steps needs to be passed along to the future. Meanwhile, the reset gate (r) decides how much of the past information to forget.
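To make the gating concrete, here is a minimal NumPy sketch of a single GRU time step (biases omitted, weight names hypothetical; different references flip which term z scales, so the exact formulation varies slightly between implementations):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU time step."""
    z = sigmoid(x_t @ W_z + h_prev @ U_z)                  # update gate: how much of the past to carry forward
    r = sigmoid(x_t @ W_r + h_prev @ U_r)                  # reset gate: how much of the past to forget
    h_candidate = np.tanh(x_t @ W_h + (r * h_prev) @ U_h)  # candidate state, built from a "reset" view of the past
    return z * h_prev + (1.0 - z) * h_candidate            # blend the previous state with the candidate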

 

Image Credit: analyticsvidhya.com

 

Final Model

Now that we've discussed the various parts of our model, let's take a look at the code. Again, all of the source code is available here in the notebook.

from keras.models import Sequential
from keras.layers import GRU, Dense, Dropout, Embedding, Bidirectional, RepeatVector, TimeDistributed
from keras.losses import sparse_categorical_crossentropy
from keras.optimizers import Adam

def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a model that incorporates embedding, encoder-decoder, and bidirectional RNN
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.003

    # Build the layers    
    model = Sequential()
    # Embedding
    model.add(Embedding(english_vocab_size, 128, input_length=input_shape[1],
                         input_shape=input_shape[1:]))
    # Encoder
    model.add(Bidirectional(GRU(128)))
    model.add(RepeatVector(output_sequence_length))
    # Decoder
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(TimeDistributed(Dense(512, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model
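A hypothetical usage sketch, assuming `x` and `y` are the padded English and French arrays produced by the preprocessing steps above (the variable names, batch size, and validation split are illustrative; 23 epochs matches the training run reported below):

# x: padded English sequences, shape (num_samples, english_seq_len)
# y: padded French sequences,  shape (num_samples, french_seq_len, 1); integer word IDs,
#    which is what sparse_categorical_crossentropy expects
model = model_final(x.shape, y.shape[1], english_vocab_size, french_vocab_size)
model.fit(x, y, batch_size=1024, epochs=23, validation_split=0.2)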
 

Results

The results from the final model can be found in cell 20 of the notebook.

Validation accuracy: 97.5%

Training: 23 epochs

 

Future Improvements

If I were to expand on this project in the future, here's where I'd start.

  1. Use a proper data split (training, validation, test) — Currently there is no test set, only training and validation sets. Obviously this doesn't follow best practices.
  2. LSTM + attention — This has been the de facto architecture for RNNs over the past few years, although there are some limitations. I didn't use LSTM because I'd already implemented it in TensorFlow in another project (found here), and I wanted to experiment with GRU + Keras for this project.
  3. Train on a larger and more diverse text corpus — The text corpus and vocabulary for this project are quite small with little variation in syntax. As a result, the model is very brittle. To create a model that generalizes better, you'll need to train on a larger dataset with more variability in grammar and sentence structure.
  4. Residual layers — You could add residual layers to a deep LSTM RNN, as described in this paper. Or, use residual layers as an alternative to LSTM and GRU, as described here.
  5. Embeddings — If you're training on a larger dataset, you should definitely use a pre-trained set of embeddings such as word2vec or GloVe. Even better, use ELMo or BERT.
  • Embeddings from Language Models (ELMo) — One of the biggest advances in universal embeddings in 2018 was ELMo, developed by the Allen Institute for AI. One of the major advantages of ELMo is that it addresses the problem of polysemy, in which a single word has multiple meanings. ELMo is context-based (not word-based), so different meanings for a word occupy different vectors within the embedding space. With GloVe and word2vec, each word has only one representation in the embedding space. For example, the word "queen" could refer to the matriarch of a royal family, a bee, a chess piece, or the 1970s rock band. With traditional embeddings, all of these meanings are tied to a single vector for the word queen. With ELMo, these are four distinct vectors, each with a unique set of context words occupying the same region of the embedding space. For example, we'd expect to see words like queen, rook, and pawn in a similar vector space related to the game of chess. And we'd expect to see queen, hive, and honey in a different vector space related to bees. This provides a significant boost in semantic encoding.
  • Bidirectional Encoder Representations from Transformers (BERT). So far in 2019, the biggest advancement in bidirectional embeddings has been BERT, which was open-sourced by Google. How is BERT different?

Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional. —Jacob Devlin and Ming-Wei Chang, Google AI Blog

 

Contact

I hope you found this useful. If you have any feedback, I’d love to hear it. Feel free to post in the comments.

If you’d like to inquire about collaboration or career opportunities you can find me here on LinkedIn or view my portfolio here.

 

Project Starter Code

In case you want to run this project yourself, below is the project starter code.

Setup

The original Udacity repo for this project can be found here.

This project requires GPU acceleration to run efficiently. You can use either of the following two methods to access GPU-enabled cloud computing resources.

Udacity Workspaces (Recommended)

Udacity Workspaces provide remote connection to GPU-enabled instances right from the classroom. Refer to the classroom lesson for this project to find an overview of navigating & using Jupyter notebook Workspaces.

Amazon Web Services (Optional)

Please refer to the Udacity instructions for setting up a GPU instance for this project (link for AIND students), and refer to the project instructions in the classroom for setup. The recommended AMI should include compatible versions of all software and libraries required to complete the project.

Install

  • Python 3
  • NumPy
  • TensorFlow 1.x
  • Keras 2.x

Submission

When you are ready to submit your project, complete the following steps:

  1. Ensure you pass all points on the rubric.
  2. Submit the following in a zip file:
  • helper.py
  • machine_translation.ipynb
  • machine_translation.html

Converting to HTML

There are several ways to generate an HTML copy of the notebook:

  • Running the last cell of the notebook will export an HTML copy

  • Navigating to File -> Download as -> HTML (.html) within the notebook

  • Using nbconvert from the command line

    $ pip install nbconvert
    $ jupyter nbconvert --to html machine_translation.ipynb
