# Machine Translation and the Dataset
:label:`sec_machine_translation`

We have used RNNs to design language models,
which are key to natural language processing.
Another flagship benchmark is *machine translation*,
a central problem domain for *sequence transduction* models
that transform input sequences into output sequences.
Playing a crucial role in various modern AI applications,
sequence transduction models will form the focus of the remainder of this chapter
and :numref:`chap_attention`.
To this end,
this section introduces the machine translation problem
and its dataset that will be used later.


*Machine translation* refers to the
automatic translation of a sequence
from one language to another.
In fact, this field
may date back to the 1940s,
soon after digital computers were invented,
especially considering the use of computers
for cracking language codes in World War II.
For decades,
statistical approaches
had been dominant in this field :cite:`Brown.Cocke.Della-Pietra.ea.1988,Brown.Cocke.Della-Pietra.ea.1990`
before the rise
of
end-to-end learning using
neural networks.
The latter
is often called
*neural machine translation*
to distinguish it from
*statistical machine translation*,
which involves statistical analysis
in components such as
the translation model and the language model.


Emphasizing end-to-end learning,
this book will focus on neural machine translation methods.
Unlike the language modeling problem
in :numref:`sec_language_model`,
whose corpus is in a single language,
machine translation datasets
are composed of pairs of text sequences
that are in
the source language and the target language, respectively.
Thus,
instead of reusing the preprocessing routine
for language modeling,
we need a different way to preprocess
machine translation datasets.
In the following,
we show how to
load the preprocessed data
into minibatches for training.


In [1]:
%use @file[../djl.json]
%use lets-plot
@file:DependsOn("../D2J-1.0-SNAPSHOT.jar")
import jp.live.ugai.d2j.timemachine.RNNModelScratch
import jp.live.ugai.d2j.timemachine.TimeMachine
import jp.live.ugai.d2j.timemachine.TimeMachineDataset
import jp.live.ugai.d2j.timemachine.Vocab
import jp.live.ugai.d2j.RNNModel
import kotlin.random.Random
import kotlin.collections.List
import kotlin.collections.Map
import kotlin.Pair


In [2]:
import java.io.File
import java.nio.charset.*
import java.util.zip.*
import java.util.stream.*
import java.util.Locale

In [3]:
val manager = NDManager.newBaseManager();

## Downloading and Preprocessing the Dataset

To begin with,
we download an English-French dataset
that consists of [bilingual sentence pairs from the Tatoeba Project](http://www.manythings.org/anki/).
Each line in the dataset
is a tab-delimited pair
of an English text sequence
and the translated French text sequence.
Note that each text sequence
can be just one sentence or a paragraph of multiple sentences.
In this machine translation problem
where English is translated into French,
English is the *source language*
and French is the *target language*.


In [4]:
fun readDataNMT(): String? {
    // Download the archive and return the contents of fra.txt,
    // or null if the archive has no such entry
    DownloadUtils.download(
        "http://d2l-data.s3-accelerate.amazonaws.com/fra-eng.zip", "fra-eng.zip")
    // `use` closes the zip file and the entry stream even if reading fails
    ZipFile(File("fra-eng.zip")).use { zipFile ->
        val entries = zipFile.entries()
        while (entries.hasMoreElements()) {
            val entry = entries.nextElement()
            if (entry.name.contains("fra.txt")) {
                return zipFile.getInputStream(entry).use {
                    String(it.readAllBytes(), StandardCharsets.UTF_8)
                }
            }
        }
    }
    return null
}

// Fail here if the archive entry is missing, so that rawText is a non-null String
val rawText = readDataNMT()!!
println(rawText.substring(0, 75))

Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !



After downloading the dataset,
we proceed with several preprocessing steps
for the raw text data.
For instance,
we replace non-breaking space with space,
convert uppercase letters to lowercase ones,
and insert space between words and punctuation marks.


In [5]:
fun noSpace(currChar: Char, prevChar: Char): Boolean {
    // True if currChar is a punctuation mark that is not already preceded by a space
    return currChar in listOf(',', '.', '!', '?') && prevChar != ' '
}


fun preprocessNMT(_text: String): String {
    /* Preprocess the English-French dataset. */
    // Replace non-breaking spaces (U+202F and U+00A0) with a space, and convert
    // uppercase letters to lowercase ones
    val text = _text.replace('\u202f', ' ').replace('\u00a0', ' ').lowercase(Locale.getDefault())

    // Insert space between words and punctuation marks
    val out = StringBuilder()
    for (i in text.indices) {
        val currChar = text[i]
        if (i > 0 && noSpace(currChar, text[i - 1])) {
            out.append(' ')
        }
        out.append(currChar)
    }
    return out.toString()
}


val text = preprocessNMT(rawText);
println(text.substring(0, 80));

go .	va !
hi .	salut !
run !	cours !
run !	courez !
who ?	qui ?
wow !	ça alors !


## Tokenization

Different from character-level tokenization
in :numref:`sec_language_model`,
for machine translation
we prefer word-level tokenization here
(state-of-the-art models may use more advanced tokenization techniques).
The following `tokenizeNMT` function
tokenizes the first `numExamples` text sequence pairs,
where
each token is either a word or a punctuation mark.
This function returns
two lists of token lists: `source` and `target`.
Specifically,
`source.get(i)` is a list of tokens from the
$i^\mathrm{th}$ text sequence in the source language (English here) and `target.get(i)` is that in the target language (French here).


In [6]:
fun tokenizeNMT(
    text: String, numExamples: Int?
): Pair<List<List<String>>, List<List<String>>> {
    /* Tokenize the English-French dataset. */
    val source = mutableListOf<List<String>>()
    val target = mutableListOf<List<String>>()

    for ((i, line) in text.split("\n").withIndex()) {
        // Stop after the first numExamples lines (null means: use every line)
        if (numExamples != null && i >= numExamples) {
            break
        }
        val parts = line.split("\t")
        // Keep only well-formed "source<TAB>target" lines
        if (parts.size == 2) {
            source.add(parts[0].split(" "))
            target.add(parts[1].split(" "))
        }
    }
    return Pair(source, target)
}

val pair = tokenizeNMT(text.toString(), null);
val source = pair.first
val target = pair.second
for (subArr in source.subList(0, 6)) {
    println(subArr)
}


[go, .]
[hi, .]
[run, !]
[run, !]
[who, ?]
[wow, !]


In [7]:
for (subArr in target.subList(0, 6)) {
    println(subArr)
}

[va, !]
[salut, !]
[cours, !]
[courez, !]
[qui, ?]
[ça, alors, !]


Let us plot the histogram of the number of tokens per text sequence.
In this simple English-French dataset,
most of the text sequences have fewer than 20 tokens.


In [34]:
val y1 = source.map { it.size }
val y2 = target.map { it.size }
val x1 = List<String>(source.size){"SOURCE"}
val x2 = List<String>(target.size){"TARGET"}
val data = mapOf( "source" to y1+y2, "color" to x1 + x2 )

var plot = letsPlot(data)
plot += geomHistogram(binWidth=5.0 , position=Pos.dodge){ x = "source" ; fill = "color" }
plot + ggsize(800,500)
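
The observation about sequence lengths can also be checked numerically rather than read off the plot. The following is a minimal sketch in plain Kotlin; `fractionShorterThan` is a hypothetical helper name introduced here for illustration and is not part of the notebook's utilities or of any library.

```kotlin
// Hypothetical helper: share of token lists that have fewer than `limit` tokens.
fun fractionShorterThan(tokenLists: List<List<String>>, limit: Int): Double {
    // Avoid dividing by zero for an empty dataset
    if (tokenLists.isEmpty()) return 0.0
    return tokenLists.count { it.size < limit }.toDouble() / tokenLists.size
}
```

Assuming the `source` and `target` lists returned by `tokenizeNMT` above, `fractionShorterThan(source, 20)` and `fractionShorterThan(target, 20)` would give the share of English and French sequences with fewer than 20 tokens.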


## Vocabulary

Since the machine translation dataset
consists of pairs of text sequences in two languages,
we can build two vocabularies,
one for the source language and
one for the target language.
With word-level tokenization,
the vocabulary size will be significantly larger
than that using character-level tokenization.
To alleviate this,
here we treat infrequent tokens
that appear fewer than 2 times
as the same unknown ("&lt;unk&gt;") token.
Besides that,
we specify additional special tokens
such as for padding ("&lt;pad&gt;") sequences to the same length in minibatches,
and for marking the beginning ("&lt;bos&gt;") or end ("&lt;eos&gt;") of sequences.
Such special tokens are commonly used in
natural language processing tasks.


In [35]:
// Tokens that appear fewer than 2 times are mapped to the unknown token
val srcVocab = Vocab(source, 2, listOf("<pad>", "<bos>", "<eos>"))
println(srcVocab.length())

10012
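
The `Vocab` class used above is imported from the notebook's utility jar, so its implementation is not shown in this section. To make the frequency threshold and the special tokens concrete, here is a minimal sketch of how such a vocabulary could be built. `SimpleVocab` and its members are hypothetical names; the index layout (0 for the unknown token, then the reserved tokens, then the remaining tokens by descending frequency) is an assumption that merely agrees with the indices printed in this section, and the real `Vocab` may differ in its details.

```kotlin
// Hypothetical, simplified stand-in for the notebook's Vocab class.
class SimpleVocab(
    tokenLists: List<List<String>>,
    minFreq: Int,
    reservedTokens: List<String>
) {
    val idxToToken: List<String>
    val tokenToIdx: Map<String, Int>

    init {
        // Count how often each token occurs across all sequences
        val counts = tokenLists.flatten().groupingBy { it }.eachCount()
        // Keep tokens that reach the threshold, most frequent first
        val kept = counts.entries
            .filter { it.value >= minFreq }
            .sortedByDescending { it.value }
            .map { it.key }
        // Assumed layout: "<unk>" at index 0, then reserved tokens, then kept tokens
        idxToToken = (listOf("<unk>") + reservedTokens + kept).distinct()
        tokenToIdx = idxToToken.withIndex().associate { (i, token) -> token to i }
    }

    val size: Int get() = idxToToken.size

    // Tokens below the threshold (or never seen) fall back to index 0, i.e. "<unk>"
    fun getIdx(token: String): Int = tokenToIdx[token] ?: 0

    fun getIdxs(tokens: List<String>): List<Int> = tokens.map { getIdx(it) }
}
```

In this sketch, raising `minFreq` shrinks the vocabulary, at the cost of mapping more words to the single unknown index.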


## Loading the Dataset
:label:`subsec_mt_data_loading`

Recall that in language modeling
each sequence example,
either a segment of one sentence
or a span over multiple sentences,
has a fixed length.
This was specified by the `numSteps`
(number of time steps or tokens) argument in :numref:`sec_language_model`.
In machine translation, each example is
a pair of source and target text sequences,
and text sequences may differ in length.

For computational efficiency,
we can still process a minibatch of text sequences
at one time by *truncation* and *padding*.
Suppose that every sequence in the same minibatch
should have the same length `numSteps`.
If a text sequence has fewer than `numSteps` tokens,
we will keep appending the special "&lt;pad&gt;" token
to its end until its length reaches `numSteps`.
Otherwise,
we will truncate the text sequence
by only taking its first `numSteps` tokens
and discarding the rest.
In this way,
every text sequence
will have the same length
to be loaded in minibatches of the same shape.

The following `truncatePad` function
truncates or pads text sequences as described before.


In [36]:
fun truncatePad(integerLine: List<Int>, numSteps: Int, paddingToken: Int): List<Int> {
    /* Truncate or pad sequences */
    if (integerLine.size > numSteps) {
        return integerLine.subList(0, numSteps) // Truncate
    }
    // Pad with paddingToken until the length reaches numSteps
    return integerLine + List(numSteps - integerLine.size) { paddingToken }
}

val result = truncatePad(srcVocab.getIdxs(source.get(0)), 10, srcVocab.getIdx("<pad>"));
println(result)

[47, 4, 1, 1, 1, 1, 1, 1, 1, 1]


Now we define a function to transform
text sequences into minibatches for training.
We append the special "&lt;eos&gt;" token
to the end of every sequence to indicate the
end of the sequence.
When a model is predicting by
generating a sequence token after token,
the generation
of the "&lt;eos&gt;" token
can suggest that
the output sequence is complete.
Besides,
we also record the length
of each text sequence excluding the padding tokens.
This information will be needed by
some models that
we will cover later.


In [38]:
fun buildArrayNMT(lines: List<List<String>>, vocab: Vocab, numSteps: Int): Pair<NDArray, NDArray> {
    /* Transform text sequences of machine translation into minibatches. */
    val eos = vocab.getIdx("<eos>")
    val pad = vocab.getIdx("<pad>")
    // Convert tokens to indices and append "<eos>" to every sequence
    val linesIntArr = lines.map { vocab.getIdxs(it) + eos }

    // Allocate from the notebook-level `manager` instead of creating a new
    // manager on every call that is never closed
    val arr = manager.create(Shape(linesIntArr.size.toLong(), numSteps.toLong()), DataType.INT32)
    var row = 0
    for (line in linesIntArr) {
        // Each row holds one sequence, truncated or padded to numSteps
        val rowArr = manager.create(truncatePad(line, numSteps, pad).toIntArray())
        arr.set(NDIndex("{}:", row), rowArr)
        row += 1
    }
    // Valid length = number of non-padding tokens in each row
    val validLen = arr.neq(pad).sum(intArrayOf(1))
    return Pair(arr, validLen)
}
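
To see what the valid length means for a single sequence, the same steps can be sketched without DJL arrays. This is a plain-Kotlin illustration under stated assumptions: `encodeWithValidLen` is a hypothetical helper name, and the index values passed for "&lt;eos&gt;" and "&lt;pad&gt;" are supplied by the caller rather than looked up in a vocabulary.

```kotlin
// Hypothetical helper: append <eos>, truncate or pad to numSteps,
// and count the tokens that are not padding.
fun encodeWithValidLen(
    ids: List<Int>, numSteps: Int, eosIdx: Int, padIdx: Int
): Pair<List<Int>, Int> {
    val withEos = ids + eosIdx
    val fixed = if (withEos.size > numSteps) {
        withEos.take(numSteps) // Truncate (this may cut off the <eos> token)
    } else {
        withEos + List(numSteps - withEos.size) { padIdx } // Pad
    }
    val validLen = fixed.count { it != padIdx }
    return Pair(fixed, validLen)
}
```

Because "&lt;eos&gt;" is appended before truncation, a sequence longer than `numSteps - 1` tokens loses its "&lt;eos&gt;" token; in that case the valid length equals `numSteps`.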

## Putting All Things Together

Finally, we define the `loadDataNMT` function
to return the dataset, from which minibatches can be iterated, together with
the vocabularies for both the source language and the target language.


In [43]:
fun loadDataNMT(
    batchSize: Int, numSteps: Int, numExamples: Int
): Pair<ArrayDataset, Pair<Vocab, Vocab>> {
    /* Return the dataset and the vocabularies of the translation dataset. */
    val text = preprocessNMT(readDataNMT()!!)
    val (source, target) = tokenizeNMT(text, numExamples)
    val srcVocab = Vocab(source, 2, listOf("<pad>", "<bos>", "<eos>"))
    val tgtVocab = Vocab(target, 2, listOf("<pad>", "<bos>", "<eos>"))

    val (srcArr, srcValidLen) = buildArrayNMT(source, srcVocab, numSteps)
    val (tgtArr, tgtValidLen) = buildArrayNMT(target, tgtVocab, numSteps)

    // Data: source indices and valid lengths; labels: target indices and valid lengths
    val dataset = ArrayDataset.Builder()
        .setData(srcArr, srcValidLen)
        .optLabels(tgtArr, tgtValidLen)
        .setSampling(batchSize, true) // true: sample examples in random order
        .build()

    return Pair(dataset, Pair(srcVocab, tgtVocab))
}

Let us read the first minibatch from the English-French dataset.


In [44]:
val output = loadDataNMT(2, 8, 600);
val dataset = output.first
val srcVocab = output.second.first
val tgtVocab = output.second.second

val batch = dataset.getData(manager).iterator().next();
val X = batch.getData().get(0);
val xValidLen = batch.getData().get(1);
val Y = batch.getLabels().get(0);
val yValidLen = batch.getLabels().get(1);
println(X);
println(xValidLen);
println(Y);
println(yValidLen);

ND: (2, 8) cpu() int32
[[  9,  82,   4,   3,   1,   1,   1,   1],
 [ 17, 119,   4,   3,   1,   1,   1,   1],
]

ND: (2) cpu() int64
[ 4,  4]

ND: (2, 8) cpu() int32
[[67,  4,  3,  1,  1,  1,  1,  1],
 [11,  0,  4,  3,  1,  1,  1,  1],
]

ND: (2) cpu() int64
[ 3,  4]



## Summary

* Machine translation refers to the automatic translation of a sequence from one language to another.
* Using word-level tokenization, the vocabulary size will be significantly larger than that using character-level tokenization. To alleviate this, we can treat infrequent tokens as the same unknown token.
* We can truncate and pad text sequences so that all of them will have the same length to be loaded in minibatches.


## Exercises

1. Try different values of the `numExamples` argument in the `loadDataNMT` function. How does this affect the vocabulary sizes of the source language and the target language?
1. Text in some languages such as Chinese and Japanese does not have word boundary indicators (e.g., space). Is word-level tokenization still a good idea for such cases? Why or why not?
