
Collect a corpus

David Baines edited this page Jun 10, 2021 · 4 revisions

To train a model that translates between two languages, a parallel corpus is required. For the purposes of our process, a parallel corpus is one in which every line of text in one language has a corresponding line of text in the other language. Ideally, each line is a single sentence, and the corresponding line is a direct, human-generated translation of it into the second language. This ideal is not always achievable, so variations are sometimes necessary.

There are several sources from which you can download parallel corpora. Some of those that we have found useful:

We have also used scriptures, which are available in many languages, as a form of parallel corpus. In this case the alignment between the source and target texts is at the verse level. For example, Genesis, Chapter 1, Verse 1 in an English Bible corresponds to Genesis, Chapter 1, Verse 1 in a French or Spanish Bible. With our tools, each verse occupies a single line in the text file. A verse may contain multiple sentences, and the number of sentences may differ between the source and target texts. The source and target verses are also not direct translations of one another: both are translations that attempt to convey the meaning of the original Hebrew source text. This isn't ideal as a data set, but for lower-resourced languages it is the closest thing we have to a parallel corpus.
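As a rough illustration of verse-level alignment, the sketch below pairs two translations by verse reference and emits one verse per line. The function name `align_verses`, the dictionary input format, and the reference style `"GEN 1:1"` are all assumptions made for this example and are not taken from our tools. Dropping verses that are missing from either translation is likewise just one possible policy.

```python
def align_verses(source_verses, target_verses):
    """Return two equal-length lists of lines, one verse per line.

    Both arguments are hypothetical dicts mapping a verse reference
    (for example "GEN 1:1") to the text of that verse. Only references
    present in both translations are kept (an assumption of this sketch).
    """
    shared = [ref for ref in source_verses if ref in target_verses]
    # Collapse internal whitespace and line breaks so that a verse with
    # several sentences still occupies exactly one line.
    src_lines = [" ".join(source_verses[ref].split()) for ref in shared]
    tgt_lines = [" ".join(target_verses[ref].split()) for ref in shared]
    return src_lines, tgt_lines


english = {
    "GEN 1:1": "In the beginning God created\nthe heaven and the earth.",
    "GEN 1:2": "And the earth was without form, and void.",
}
french = {
    "GEN 1:1": "Au commencement, Dieu créa les cieux et la terre.",
}
src_lines, tgt_lines = align_verses(english, french)
```

Here `src_lines` and `tgt_lines` each contain a single entry for Genesis 1:1, because Genesis 1:2 has no counterpart in the French sample.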

The tools expect to receive parallel data in two text files: one for the source language data and one for the target language data. These files must have exactly the same number of lines, because the alignment between them is based on line number: line N of the source file is paired with line N of the target file. Take care, therefore, not to delete or add lines in either file without performing the same operation on the corresponding file.
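The line-count requirement can be checked before training with a few lines of Python. The following is a minimal sketch, not part of our tools: the helper names (`read_lines`, `write_lines`, `check_parallel`, `drop_lines`) are hypothetical, and it assumes both files are UTF-8 encoded.

```python
def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(path, lines):
    """Write one item per line, ending every line with a newline."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in lines)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the files differ in length."""
    n_src = len(read_lines(src_path))
    n_tgt = len(read_lines(tgt_path))
    if n_src != n_tgt:
        raise ValueError(
            f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}"
        )
    return n_src


def drop_lines(src_path, tgt_path, line_numbers):
    """Delete the same 1-based line numbers from both files, in place."""
    check_parallel(src_path, tgt_path)
    unwanted = set(line_numbers)
    src, tgt = read_lines(src_path), read_lines(tgt_path)
    keep = [i for i in range(len(src)) if i + 1 not in unwanted]
    write_lines(src_path, [src[i] for i in keep])
    write_lines(tgt_path, [tgt[i] for i in keep])
    return len(keep)
```

Removing lines through a single helper such as `drop_lines`, rather than editing each file by hand, keeps the two files aligned because every deletion is applied to both at once.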