# Text Corpora Tutorial

In this notebook, we will demonstrate how to use Machine to load datasets as text corpora.

## Loading Text Files

Let's start with a simple example of loading a set of text files.

In [6]:
from machine.corpora import TextFileTextCorpus

corpus = TextFileTextCorpus("data/en_tok.txt")

It is easy to iterate through the sentences in the corpus. Here we call the `take` method on the corpus to get the first ten rows and print the text of each one.

In [7]:
for row in corpus.take(10):
    print(row.text)

Would you mind giving us the keys to the room , please ?
I have made a reservation for a quiet , double room with a telephone and a tv for Rosario Cabedo .
Would you mind moving me to a quieter room ?
I have booked a room .
I think that there is a problem .
Do you have any rooms with a tv , air conditioning and a safe available ?
Would you mind showing us a room with a tv ?
Does it have a telephone ?
I am leaving on the second at eight in the evening .
How much does a single room cost per week ?
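
The file loaded above holds one tokenized sentence per line. As a rough illustration of what taking the first few rows of such a file involves, here is a standard-library sketch. It does not use Machine, and whether Machine reads files lazily in the same way is not something this notebook states. The helper names `iter_lines` and `take` and the sample file are hypothetical.

```python
import itertools
import tempfile
from pathlib import Path
from typing import Iterator, List


def iter_lines(path: Path) -> Iterator[str]:
    """Yield one line at a time without loading the whole file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def take(rows: Iterator[str], count: int) -> List[str]:
    """Return at most `count` items from the front of an iterator."""
    return list(itertools.islice(rows, count))


# Hypothetical three-line file standing in for a one-sentence-per-line corpus.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "I have booked a room .\nDoes it have a telephone ?\nThird line .\n",
        encoding="utf-8",
    )
    rows = iter_lines(sample)
    first_two = take(rows, 2)
    rows.close()  # closes the underlying file before the directory is removed

print(first_two)
```

Only the lines that are actually requested are read, which is why taking a small sample of a large file stays cheap in this sketch.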


## Loading Scripture

Machine contains classes for loading Scripture in various formats, such as USFM and USX.

### USX

USX is a common XML format for Scripture. Let's take a look at how to load a set of USX files. First, we create an instance of the `UsxFileTextCorpus` class. We ensure that the correct verse references are used by loading the versification file for this translation. If a versification is not provided, then the English versification is used. No tokenizer is applied here, so the verse text is left untokenized.

In [8]:
from machine.corpora import UsxFileTextCorpus
from machine.scripture import Versification

versification = Versification.load("data/WEB-DBL/release/versification.vrs", fallback_name="web")
corpus = UsxFileTextCorpus("data/WEB-DBL/release/USX_1", versification=versification)

Let's iterate through the corpus. You will notice that each text segment in the corpus has an associated reference. In the case of Scripture, these are `VerseRef` objects.

In [9]:
for row in corpus.take(10):
    print(f"{row.ref}: {row.text}")

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all 

You can also iterate through verses in the corpus by book.

In [10]:
for text in corpus.texts:
    print(text.id)
    print("======")
    for row in text.take(3):
        verse_ref = row.ref
        chapter_verse = f"{verse_ref.chapter}:{verse_ref.verse}"
        print(f"{chapter_verse}: {row.text}")
    print()

1JN
======
1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.

2JN
======
1:1: The elder, to the chosen lady and her children, whom I love in truth, and not I only, but also all those who know the truth,
1:2: for the truth’s sake, which remains in us, and it will be with us forever:
1:3: Grace, mercy, and peace will be with us, from God the Father and from the Lord Jesus Christ, the Son of the Father, in truth and love.

3JN
======
1:1: The elder to Gaius the beloved, whom I love in truth.
1:2: Beloved, I pray that you may prosper in all things and be 
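
To make the book-by-book iteration more concrete, the following sketch parses references of the form `1JN 1:1` and groups them by book. `SimpleVerseRef` is a hypothetical stand-in written for this example; it is not Machine's `VerseRef` class and knows nothing about versification.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class SimpleVerseRef:
    """Hypothetical stand-in for a verse reference; not Machine's VerseRef."""

    book: str
    chapter: int
    verse: int

    @classmethod
    def parse(cls, text: str) -> "SimpleVerseRef":
        # Expects the "BOOK chapter:verse" form, e.g. "1JN 1:1".
        book, chapter_verse = text.split(" ")
        chapter, verse = chapter_verse.split(":")
        return cls(book, int(chapter), int(verse))


refs = [
    SimpleVerseRef.parse(s)
    for s in ["1JN 1:1", "1JN 1:2", "2JN 1:1", "3JN 1:1", "3JN 1:2"]
]

# groupby only merges adjacent items, so the rows must already be in book order.
by_book = {
    book: [f"{r.chapter}:{r.verse}" for r in group]
    for book, group in groupby(refs, key=lambda r: r.book)
}
print(by_book)
```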

### Digital Bible Library Bundles

Now, let's load a Digital Bible Library (DBL) bundle. A DBL bundle is a zip archive that contains all of the data that you need for a publishable Bible translation.

In [11]:
import shutil

shutil.make_archive("out/web", "zip", "data/WEB-DBL")
print("DBL bundle created.")

DBL bundle created.
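
Since the bundle is an ordinary zip archive, it can be inspected with the standard library before handing it to Machine. The sketch below repeats the `shutil.make_archive` step on a throwaway directory and lists the archive members. The directory layout and file name are placeholders and do not reflect the real structure of a DBL bundle.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Placeholder contents; a real DBL bundle has its own layout.
    source_dir = root / "bundle"
    (source_dir / "release").mkdir(parents=True)
    (source_dir / "release" / "example.txt").write_text("placeholder", encoding="utf-8")

    # make_archive appends the ".zip" extension and returns the archive path.
    archive_path = shutil.make_archive(str(root / "web"), "zip", str(source_dir))

    # Member names are stored relative to the archived directory.
    with zipfile.ZipFile(archive_path) as zf:
        names = sorted(zf.namelist())

print(Path(archive_path).name)
print(names)
```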


First, we create a `DblBundleTextCorpus` instance. There is no need to specify versification, because the `DblBundleTextCorpus` class takes care of that for us.

In [12]:
from machine.corpora import DblBundleTextCorpus

corpus = DblBundleTextCorpus("out/web.zip")

We can iterate through the corpus just as we did before. All text corpus classes in Machine adhere to the same interface, so it is easy to switch between the various classes.

In [13]:
for row in corpus.take(10):
    print(f"{row.ref}: {row.text}")

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all 

### Paratext Projects

Another useful text corpus class is `ParatextTextCorpus`. This class is used to load a Paratext project. It properly loads the configured encoding and versification for the project.

In [1]:
from machine.corpora import ParatextTextCorpus

corpus = ParatextTextCorpus("data/WEB-PT")

Now, let's iterate through the verses.

In [15]:
for row in corpus.take(10):
    print(f"{row.ref}: {row.text}")

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all 

### Extracting Scripture to BibleNLP text format

[BibleNLP](https://github.com/BibleNLP) uses a simple text format for Scripture that makes it easy to align verses across different translations. Each line contains the text for a specific verse. The sequence of verses is aligned to a canonical list of books, chapters, and verses, which is based on the Original versification. The [vref.txt](https://github.com/BibleNLP/ebible/blob/main/metadata/vref.txt) file contains the list of verse references that align to this sequence. If a corpus contains verse ranges, then all of the text in the verse range will be found on the line corresponding to the first verse. The remaining verses will be marked with the special `<range>` token.
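
The layout rule for verse ranges can be sketched in a few lines of plain Python. This is an illustration of the format as described above, not the code Machine uses. The function name `to_aligned_lines`, the four-entry reference list, and the placeholder texts are made up for the example, and the handling of missing verses (an empty line) is an assumption.

```python
from typing import Dict, List, Tuple

RANGE_TOKEN = "<range>"


def to_aligned_lines(vrefs: List[str], segments: List[Tuple[List[str], str]]) -> List[str]:
    """Lay out segments on one line per canonical verse reference.

    Each segment is (refs_covered, text). The text goes on the line of the
    first reference; the other references in the segment get the range token.
    """
    line_for_ref: Dict[str, str] = {}
    for refs, text in segments:
        line_for_ref[refs[0]] = text
        for ref in refs[1:]:
            line_for_ref[ref] = RANGE_TOKEN
    # Assumption: verses absent from the corpus become empty lines.
    return [line_for_ref.get(ref, "") for ref in vrefs]


# Hypothetical excerpt of a canonical reference list.
vrefs = ["3JN 1:1", "3JN 1:2", "3JN 1:3", "3JN 1:4"]
segments = [
    (["3JN 1:1"], "text of verse 1"),
    (["3JN 1:2", "3JN 1:3"], "text of verses 2-3"),  # a verse range
]

lines = to_aligned_lines(vrefs, segments)
for ref, line in zip(vrefs, lines):
    print(f"{ref}\t{line}")
```

Because every translation is written against the same reference list, line *n* of one file always corresponds to line *n* of another, which is what makes cross-translation alignment a simple line-by-line pairing.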

You can use the `extract_scripture_corpus` function to generate a text file in the BibleNLP format. The function returns the verses aligned with the Original versification, so the extracted verses can be written line by line to a text file. Along with each verse, the function also returns the corresponding Original and corpus verse references.

In [2]:
from machine.corpora import extract_scripture_corpus

output = list(extract_scripture_corpus(corpus))
print(len(output), "verses extracted.")
print("Lines 30608-30617 (1JN 1:1-10):")
for line, _, _ in output[30607:30617]:
    print(line)

41899 verses extracted.
Lines 30608-30617 (1JN 1:1-10):
That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
(and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
And we write these things to you, that our joy may be fulfilled.
This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all sin.
If

## Parallel Text Corpora

So far we have only dealt with monolingual corpora. For many tasks, such as machine translation, parallel corpora are required. Machine provides a corpus class for combining two monolingual corpora into a parallel corpus.

In order to create a parallel text corpus, we must first create the source and target monolingual text corpora. Then, we create the parallel corpus using the `align_rows` method.

In [17]:
source_corpus = ParatextTextCorpus("data/VBL-PT")
target_corpus = ParatextTextCorpus("data/WEB-PT")
parallel_corpus = source_corpus.align_rows(target_corpus)

We can now iterate through the parallel verses.

In [18]:
for row in parallel_corpus.take(5):
    print(row.ref)
    print("Source:", row.source_text)
    print("Target:", row.target_text)

1JN 1:1
Source: Esta carta trata sobre la Palabra de vida que existía desde el principio, que hemos escuchado, que hemos visto con nuestros propios ojos y le hemos contemplado, y que hemos tocado con nuestras manos.
Target: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2
Source: Esta Vida nos fue revelada. La vimos y damos testimonio de ella. Estamos hablándoles de Aquél que es la Vida Eterna, que estaba con el Padre, y que nos fue revelado.
Target: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3
Source: Los que hemos visto y oído eso mismo les contamos, para que también puedan participar de esta amistad junto a nosotros. Esta amistad con el Padre y su Hijo Jesucristo.
Target: that which we have seen and heard we declare to you, that you also
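
Conceptually, pairing rows by reference resembles a merge join over two sequences sorted by reference. The sketch below shows only that basic idea on toy data. It is not how `align_rows` is implemented, and it ignores everything a real aligner must handle, such as versification differences and verse ranges. The name `align_sorted_rows` and the row shape are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Ref = Tuple[int, int]  # (chapter, verse)
Row = Tuple[Ref, str]  # (reference, text)


def align_sorted_rows(source: Iterable[Row], target: Iterable[Row]) -> Iterator[Tuple[Ref, str, str]]:
    """Inner-join two row streams that are both sorted by reference."""
    src_iter, trg_iter = iter(source), iter(target)
    src, trg = next(src_iter, None), next(trg_iter, None)
    while src is not None and trg is not None:
        if src[0] == trg[0]:
            yield src[0], src[1], trg[1]
            src, trg = next(src_iter, None), next(trg_iter, None)
        elif src[0] < trg[0]:
            src = next(src_iter, None)  # source-only verse: skip it
        else:
            trg = next(trg_iter, None)  # target-only verse: skip it


source_rows = [((1, 1), "src 1:1"), ((1, 2), "src 1:2"), ((1, 4), "src 1:4")]
target_rows = [((1, 1), "trg 1:1"), ((1, 3), "trg 1:3"), ((1, 4), "trg 1:4")]

pairs = list(align_sorted_rows(source_rows, target_rows))
for ref, source_text, target_text in pairs:
    print(ref, source_text, target_text)
```

Each input is read once and in order, so this kind of join works on streams without holding either corpus in memory.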

### Hugging Face Datasets

Hugging Face is a popular community and AI platform that provides access to many datasets and models. Machine can convert a `ParallelTextCorpus` to and from a Hugging Face dataset.

You can convert a Hugging Face dataset to a parallel text corpus by using the `from_hf_dataset` class method. In this example, we will load the English-Afrikaans Tatoeba corpus using the Hugging Face Datasets library and then convert it to a `ParallelTextCorpus`. Machine supports any dataset that uses the [`Translation`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Translation) or [`TranslationVariableLanguages`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.TranslationVariableLanguages) feature. The source and target language must be specified when calling `from_hf_dataset`.

In [19]:
from datasets import load_dataset
from machine.corpora import ParallelTextCorpus

ds = load_dataset("tatoeba", lang1="af", lang2="en")
parallel_corpus = ParallelTextCorpus.from_hf_dataset(ds["train"], source_lang="af", target_lang="en")
for row in parallel_corpus.take(5):
    print(row.ref)
    print("Source:", row.source_text)
    print("Target:", row.target_text)

0
Source: Hy skop my!
Target: He's kicking me!
1
Source: Ek is lief vir jou.
Target: I love you.
2
Source: Ek hou van jou.
Target: I love you.
3
Source: Baie geluk!
Target: Congratulations!
4
Source: Ek praat nie goed genoeg Frans nie!
Target: I don't speak French well enough!





You can also convert a parallel text corpus to a Hugging Face dataset. This is useful if you want to use a parallel text corpus in a Hugging Face model.

In this example, we will convert a parallel corpus of Paratext projects to a Hugging Face dataset using the `to_hf_dataset` method. This method returns the dataset as an [`IterableDataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.IterableDataset) object.

In [20]:
import json

source_corpus = ParatextTextCorpus("data/VBL-PT")
target_corpus = ParatextTextCorpus("data/WEB-PT")
parallel_corpus = source_corpus.align_rows(target_corpus)
ds = parallel_corpus.to_hf_dataset(source_lang="es", target_lang="en")
print(json.dumps(next(iter(ds)), indent=2))

{
  "translation": {
    "es": "Esta carta trata sobre la Palabra de vida que exist\u00eda desde el principio, que hemos escuchado, que hemos visto con nuestros propios ojos y le hemos contemplado, y que hemos tocado con nuestras manos.",
    "en": "That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life"
  },
  "text": "1JN",
  "ref": [
    "1JN 1:1"
  ],
  "alignment": null
}
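
The record printed above is a plain dictionary with `translation`, `text`, `ref`, and `alignment` keys. As a rough sketch of that shape, the following standard-library code builds such records from already-aligned rows. It does not use Machine or the Datasets library; `to_translation_records` and the row tuple are hypothetical, and the sample texts are shortened placeholders. It also shows why the output above contains `\u00ed`: `json.dumps` escapes non-ASCII characters unless `ensure_ascii=False` is passed.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

# (text id, verse reference, source text, target text); a hypothetical row shape.
AlignedRow = Tuple[str, str, str, str]


def to_translation_records(
    rows: Iterable[AlignedRow], source_lang: str, target_lang: str
) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per aligned row, mirroring the record printed above."""
    for text_id, ref, source_text, target_text in rows:
        yield {
            "translation": {source_lang: source_text, target_lang: target_text},
            "text": text_id,
            "ref": [ref],
            "alignment": None,
        }


rows = [("1JN", "1JN 1:1", "existía desde el principio", "was from the beginning")]
record = next(to_translation_records(rows, "es", "en"))

escaped = json.dumps(record)  # non-ASCII characters become \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # accented characters kept as-is
print(readable)
```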


## Corpus Processing

Often a text corpus must be processed in some way as part of an AI/ML pipeline. Machine has a set of operations that can be used to process a corpus easily. Lowercasing text is a common pre-processing step, so let's show how to apply the `lowercase` operation.

In [21]:
corpus = ParatextTextCorpus("data/WEB-PT")

for row in corpus.lowercase().take(10):
    print(f"{row.ref}: {row.text}")

1JN 1:1: that which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. yes, and our fellowship is with the father and with his son, jesus christ.
1JN 1:4: and we write these things to you, that our joy may be fulfilled.
1JN 1:5: this is the message which we have heard from him and announce to you, that god is light, and in him is no darkness at all.
1JN 1:6: if we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: but if we walk in the light as he is in the light, we have fellowship with one another, and the blood of jesus christ his son, cleanses us from all 

Multiple operations can be chained together. Here we will tokenize, lowercase, and normalize the corpus.

In [22]:
from machine.tokenization import LatinWordTokenizer

tokenizer = LatinWordTokenizer()

for row in corpus.tokenize(tokenizer).lowercase().nfc_normalize().take(10):
    print(f"{row.ref}: {row.text}")

1JN 1:1: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
1JN 1:2: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
1JN 1:3: that which we have seen and heard we declare to you , that you also may have fellowship with us . yes , and our fellowship is with the father and with his son , jesus christ .
1JN 1:4: and we write these things to you , that our joy may be fulfilled .
1JN 1:5: this is the message which we have heard from him and announce to you , that god is light , and in him is no darkness at all .
1JN 1:6: if we say that we have fellowship with him and walk in the darkness , we lie and don’t tell the truth .
1JN 1:7: but if we walk in the light as he is in the light , we have fellowship with one another , and the blood of jesus christ hi
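
The chaining pattern itself can be illustrated with ordinary Python generators, where each step wraps the previous one and nothing runs until the rows are consumed. This is a simplified sketch, not Machine's implementation. The regular-expression tokenizer is only a stand-in and does not reproduce the rules of `LatinWordTokenizer`; for instance, it would split contractions differently.

```python
import re
import unicodedata
from typing import Iterable, Iterator

# Simplified tokenizer: runs of word characters, or single non-space symbols.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(rows: Iterable[str]) -> Iterator[str]:
    for row in rows:
        yield " ".join(TOKEN_PATTERN.findall(row))


def lowercase(rows: Iterable[str]) -> Iterator[str]:
    for row in rows:
        yield row.lower()


def nfc_normalize(rows: Iterable[str]) -> Iterator[str]:
    for row in rows:
        yield unicodedata.normalize("NFC", row)


rows = ["Does it have a telephone?", "I have booked a room."]

# Same order as the cell above: tokenize, then lowercase, then normalize.
processed = list(nfc_normalize(lowercase(tokenize(rows))))
for row in processed:
    print(row)

# NFC composes a base letter and a combining accent into a single code point.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len("e\u0301"), len(composed))
```

NFC normalization matters for text that mixes precomposed and decomposed accented characters, since the two forms look identical on screen but compare as different strings.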

Corpus processing operations are also available on parallel corpora.

In [23]:
source_corpus = ParatextTextCorpus("data/VBL-PT")
target_corpus = ParatextTextCorpus("data/WEB-PT")
parallel_corpus = source_corpus.align_rows(target_corpus)

for row in parallel_corpus.tokenize(tokenizer).lowercase().nfc_normalize().take(5):
    print(row.ref)
    print("Source:", row.source_text)
    print("Target:", row.target_text)

1JN 1:1
Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
1JN 1:2
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
1JN 1:3
Source: los que hemos visto y oído eso mismo les contamos , para que también puedan participar de esta amistad junto a nosotros . esta amistad con el padre y su hijo jesucristo .
Target: that which we have seen and heard we dec