# Text Corpora Tutorial

In this notebook, we will demonstrate how to use Machine to load datasets as text corpora.

In [1]:
#r "nuget:SIL.Scripture,12.0.1"
#r "../src/SIL.Machine/bin/Debug/netstandard2.0/SIL.Machine.dll"

void WriteLine(string text = "")
{
    Console.Write(text + "\n");
}

## Loading Text Files

Let's start with a simple example: loading a corpus from a plain text file.

In [2]:
using SIL.Machine.Corpora;

var corpus = new TextFileTextCorpus("data/en_tok.txt");

Iterating through the sentences in the corpus is easy: we simply iterate over the corpus object.

In [3]:
foreach (var row in corpus.Take(10))
    WriteLine(row.Text);

Would you mind giving us the keys to the room , please ?
I have made a reservation for a quiet , double room with a telephone and a tv for Rosario Cabedo .
Would you mind moving me to a quieter room ?
I have booked a room .
I think that there is a problem .
Do you have any rooms with a tv , air conditioning and a safe available ?
Would you mind showing us a room with a tv ?
Does it have a telephone ?
I am leaving on the second at eight in the evening .
How much does a single room cost per week ?
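
The output above suggests that each line of the file becomes one row. As a rough mental model only, a line-per-segment reader can be sketched with the base class library. This is not Machine's implementation; the `ReadRows` helper and the sample file are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of Machine: yields one (line number, text) pair per line.
IEnumerable<(int LineNum, string Text)> ReadRows(string path)
{
    int current = 1;
    foreach (string line in File.ReadLines(path))
        yield return (current++, line.Trim());
}

// Write a tiny sample file so the sketch runs on its own.
string samplePath = Path.Combine(Path.GetTempPath(), "sample_tok.txt");
File.WriteAllLines(samplePath, new[] { "I have booked a room .", "Does it have a telephone ?" });

foreach (var (lineNum, text) in ReadRows(samplePath).Take(10))
    Console.WriteLine($"{lineNum}: {text}");
// 1: I have booked a room .
// 2: Does it have a telephone ?
```

The line number plays the role of a reference here, which is the simplest way to tell segments apart when the file carries no other structure.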


## Loading Scripture

Machine contains classes for loading Scripture in various formats, such as USFM and USX.

### USX

USX is a common XML format for Scripture. Let's take a look at how to load a set of USX files. First, we create an instance of the `UsxFileTextCorpus` class. We ensure that the correct verse references are used by loading the versification file for this translation. If a versification is not provided, then the English versification is used.

In [4]:
using SIL.Scripture;

var versification = Versification.Table.Implementation.Load("data/WEB-DBL/release/versification.vrs", "web");
var corpus = new UsxFileTextCorpus("data/WEB-DBL/release/USX_1", versification);

Let's iterate through the corpus. You will notice that each text segment in the corpus has an associated reference. In the case of Scripture, these are `VerseRef` objects.

In [5]:
foreach (var row in corpus.Take(10))
    WriteLine($"{row.Ref}: {row.Text}");

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all 

You can also iterate through verses in the corpus by book.

In [6]:
foreach (var text in corpus.Texts)
{
    WriteLine(text.Id);
    WriteLine("======");
    foreach (var row in text.Take(3))
    {
        var verseRef = (VerseRef)row.Ref;
        WriteLine($"{verseRef.Chapter}:{verseRef.Verse}: {row.Text}");
    }
    WriteLine();
}

1JN
======
1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.

2JN
======
1:1: The elder, to the chosen lady and her children, whom I love in truth, and not I only, but also all those who know the truth,
1:2: for the truth’s sake, which remains in us, and it will be with us forever:
1:3: Grace, mercy, and peace will be with us, from God the Father and from the Lord Jesus Christ, the Son of the Father, in truth and love.

3JN
======
1:1: The elder to Gaius the beloved, whom I love in truth.
1:2: Beloved, I pray that you may prosper in all things and be 

### Digital Bible Library Bundles

Now, let's load a Digital Bible Library (DBL) bundle. A DBL bundle is a zip archive that contains all of the data that you need for a publishable Bible translation.

In [7]:
using System.IO;
using System.IO.Compression;

Directory.CreateDirectory("out");
if (File.Exists("out/web.zip"))
    File.Delete("out/web.zip");
ZipFile.CreateFromDirectory("data/WEB-DBL", "out/web.zip");
WriteLine("DBL bundle created.");

DBL bundle created.
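
To check what ended up in an archive, you can list its entries with `ZipFile.OpenRead` from `System.IO.Compression`. The sketch below is independent of Machine. The `ListEntries` helper is a name made up for illustration, and the demo builds a throwaway archive with placeholder files in the temp directory so that it runs without the tutorial data. Calling `ListEntries("out/web.zip")` instead should list the contents of the bundle created above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper: returns the entry paths stored in a zip archive, sorted by name.
List<string> ListEntries(string zipPath)
{
    using (ZipArchive archive = ZipFile.OpenRead(zipPath))
        return archive.Entries.Select(e => e.FullName).OrderBy(n => n, StringComparer.Ordinal).ToList();
}

// Build a small throwaway archive (placeholder files only) so the sketch is runnable on its own.
string root = Path.Combine(Path.GetTempPath(), "bundle_demo_" + Guid.NewGuid().ToString("N"));
string contentDir = Path.Combine(root, "content");
Directory.CreateDirectory(Path.Combine(contentDir, "release", "USX_1"));
File.WriteAllText(Path.Combine(contentDir, "release", "versification.vrs"), "# placeholder");
File.WriteAllText(Path.Combine(contentDir, "release", "USX_1", "1JN.usx"), "<usx />");

// The archive is written next to the content directory, not inside it.
string demoZip = Path.Combine(root, "demo.zip");
ZipFile.CreateFromDirectory(contentDir, demoZip);

foreach (string name in ListEntries(demoZip))
    Console.WriteLine(name);
```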


First, we create a `DblBundleTextCorpus` instance. There is no need to specify versification, because the `DblBundleTextCorpus` class takes care of that for us.

In [8]:
var corpus = new DblBundleTextCorpus("out/web.zip");

We can iterate through the corpus just as we did before. All text corpus classes in Machine adhere to the same interface, so it is easy to switch between them. Note that the verse text is returned untokenized; tokenization is covered in the Corpus Processing section.

In [9]:
foreach (var row in corpus.Take(10))
    WriteLine($"{row.Ref}: {row.Text}");

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all 

### Paratext Projects

Another useful text corpus class is `ParatextTextCorpus`. This class is used to load a Paratext project. It properly loads the configured encoding and versification for the project.

In [10]:
var corpus = new ParatextTextCorpus("data/WEB-PT");

Now, let's iterate through the segments.

In [11]:
foreach (var row in corpus.Take(10))
    WriteLine($"{row.Ref}: {row.Text}");

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us from all 

## Parallel Text Corpora

So far we have only dealt with monolingual corpora. For many tasks, such as machine translation, parallel corpora are required. Machine provides a corpus class for combining two monolingual corpora into a parallel corpus.

In order to create a parallel text corpus, we must first create the source and target monolingual text corpora. Then, we create the parallel corpus using the `AlignRows` method.

In [12]:
var sourceCorpus = new ParatextTextCorpus("data/VBL-PT");
var targetCorpus = new ParatextTextCorpus("data/WEB-PT");
var parallelCorpus = sourceCorpus.AlignRows(targetCorpus);
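
Conceptually, aligning rows means pairing up segments from the two corpora that share the same reference. The following is a simplified sketch of that idea, not Machine's actual algorithm. It assumes that both sides are already sorted by an integer key, that each key occurs at most once per side, and that only keys present on both sides are kept. `AlignByKey` and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: inner merge join of two key-sorted sequences.
IEnumerable<(int Key, string Source, string Target)> AlignByKey(
    IEnumerable<(int Key, string Text)> source,
    IEnumerable<(int Key, string Text)> target)
{
    using (var src = source.GetEnumerator())
    using (var trg = target.GetEnumerator())
    {
        bool hasSrc = src.MoveNext();
        bool hasTrg = trg.MoveNext();
        while (hasSrc && hasTrg)
        {
            int cmp = src.Current.Key.CompareTo(trg.Current.Key);
            if (cmp < 0)
                hasSrc = src.MoveNext();   // source row has no counterpart; skip it
            else if (cmp > 0)
                hasTrg = trg.MoveNext();   // target row has no counterpart; skip it
            else
            {
                yield return (src.Current.Key, src.Current.Text, trg.Current.Text);
                hasSrc = src.MoveNext();
                hasTrg = trg.MoveNext();
            }
        }
    }
}

// Made-up sample data: keys stand in for verse numbers.
var sampleSource = new[] { (1, "uno"), (2, "dos"), (4, "cuatro") };
var sampleTarget = new[] { (1, "one"), (3, "three"), (4, "four") };

foreach (var (key, sourceText, targetText) in AlignByKey(sampleSource, sampleTarget))
    Console.WriteLine($"{key}: {sourceText} | {targetText}");
// 1: uno | one
// 4: cuatro | four
```

Because both inputs are consumed in a single forward pass, this kind of join never needs to hold an entire corpus in memory.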

We can now iterate through the parallel segments.

In [13]:
foreach (var row in parallelCorpus.Take(5))
{
    WriteLine($"{row.Ref}");
    WriteLine($"Source: {row.SourceText}");
    WriteLine($"Target: {row.TargetText}");
}

1JN 1:1
Source: Esta carta trata sobre la Palabra de vida que existía desde el principio, que hemos escuchado, que hemos visto con nuestros propios ojos y le hemos contemplado, y que hemos tocado con nuestras manos.
Target: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2
Source: Esta Vida nos fue revelada. La vimos y damos testimonio de ella. Estamos hablándoles de Aquél que es la Vida Eterna, que estaba con el Padre, y que nos fue revelado.
Target: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3
Source: Los que hemos visto y oído eso mismo les contamos, para que también puedan participar de esta amistad junto a nosotros. Esta amistad con el Padre y su Hijo Jesucristo.
Target: that which we have seen and heard we declare to you, that you also

## Corpus Processing

A text corpus often must be processed in some way as part of an AI/ML pipeline. Machine has a set of operations that can be used to process a corpus easily. Lowercasing text is a common pre-processing step, so let's show how to apply the "lowercase" operation.

In [14]:
var corpus = new ParatextTextCorpus("data/WEB-PT");

foreach (var row in corpus.Lowercase().Take(10))
    WriteLine($"{row.Ref}: {row.Text}");

1JN 1:1: that which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. yes, and our fellowship is with the father and with his son, jesus christ.
1JN 1:4: and we write these things to you, that our joy may be fulfilled.
1JN 1:5: this is the message which we have heard from him and announce to you, that god is light, and in him is no darkness at all.
1JN 1:6: if we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: but if we walk in the light as he is in the light, we have fellowship with one another, and the blood of jesus christ his son, cleanses us from all 

Multiple operations can be chained together. Here we tokenize, lowercase, and apply Unicode NFC normalization to the corpus.

In [15]:
using SIL.Machine.Tokenization;

foreach (var row in corpus.Tokenize<LatinWordTokenizer>().Lowercase().NfcNormalize().Take(10))
    WriteLine($"{row.Ref}: {row.Text}");

1JN 1:1: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
1JN 1:2: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
1JN 1:3: that which we have seen and heard we declare to you , that you also may have fellowship with us . yes , and our fellowship is with the father and with his son , jesus christ .
1JN 1:4: and we write these things to you , that our joy may be fulfilled .
1JN 1:5: this is the message which we have heard from him and announce to you , that god is light , and in him is no darkness at all .
1JN 1:6: if we say that we have fellowship with him and walk in the darkness , we lie and don’t tell the truth .
1JN 1:7: but if we walk in the light as he is in the light , we have fellowship with one another , and the blood of jesus christ hi
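
To make the three steps concrete, here is a rough stand-in written against the base class library only. It is not how Machine implements these operations. The regular expression is a naive tokenizer that splits off every punctuation mark (unlike the `LatinWordTokenizer` output above, it would also break `don’t` apart), and `Preprocess` is a hypothetical helper.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: naive tokenize -> lowercase -> NFC normalize.
string Preprocess(string text)
{
    // A token is a run of word characters or a single non-space, non-word character.
    var tokens = Regex.Matches(text, @"\w+|[^\w\s]").Cast<Match>().Select(m => m.Value);
    return string.Join(" ", tokens).ToLowerInvariant().Normalize(NormalizationForm.FormC);
}

Console.WriteLine(Preprocess("Yes, and our fellowship is with the Father."));
// yes , and our fellowship is with the father .

// "i" followed by a combining acute accent (U+0301) is composed into the single character U+00ED.
string decomposed = "existi\u0301a";
Console.WriteLine(decomposed.Length);              // 8
Console.WriteLine(Preprocess(decomposed).Length);  // 7
```

NFC normalization matters for text such as the Spanish source corpus: the same accented letter can be encoded in more than one way, and composing it consistently keeps visually identical words from being treated as different strings.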

Corpus processing operations are also available on parallel corpora.

In [16]:
var sourceCorpus = new ParatextTextCorpus("data/VBL-PT");
var targetCorpus = new ParatextTextCorpus("data/WEB-PT");
var parallelCorpus = sourceCorpus.AlignRows(targetCorpus);

foreach (var row in parallelCorpus.Tokenize<LatinWordTokenizer>().Lowercase().NfcNormalize().Take(5))
{
    WriteLine($"{row.Ref}");
    WriteLine($"Source: {row.SourceText}");
    WriteLine($"Target: {row.TargetText}");
}

1JN 1:1
Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
1JN 1:2
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
1JN 1:3
Source: los que hemos visto y oído eso mismo les contamos , para que también puedan participar de esta amistad junto a nosotros . esta amistad con el padre y su hijo jesucristo .
Target: that which we have seen and heard we dec