# Text Corpora Tutorial

In this notebook, we will demonstrate how to use Machine to load datasets as text corpora.

In [None]:
#r "nuget:SIL.Scripture,7.0.0"
#r "../src/SIL.Machine/bin/Debug/netstandard2.0/SIL.Machine.dll"

## Loading Text Files

Let's start with a simple example of loading a set of text files. Every text corpus class requires a tokenizer. Our corpus has already been tokenized, with tokens delimited by whitespace, so we use the `WhitespaceTokenizer`.

In [None]:
using SIL.Machine.Corpora;
using SIL.Machine.Tokenization;

var tokenizer = new WhitespaceTokenizer();
var corpus = new TextFileTextCorpus(tokenizer, "data/en_tok.txt");

It is easy to iterate through the sentences in the corpus. We simply call the `GetSegments` method on the corpus object.

In [None]:
foreach (TextSegment textSegment in corpus.GetSegments().Take(10))
    Console.WriteLine(string.Join(" ", textSegment.Segment));

I would like to book a room until tomorrow , please .
Please wake us up tomorrow at a quarter past seven .
I am leaving today in the afternoon .
Would you mind sending down our luggage to room number oh one three , please ?
Could you give me the key to room number two four four , please ?
Are there a tv , air conditioning and a safe in the rooms ?
We are leaving on the eighth at half past seven in the afternoon .
I want a single room for this week , please .
I would like you to give us the keys to the room .
I have made a reservation for a quiet , single room with a view of the mountain and a shower for Carmen Aguilera .

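To make the idea of whitespace tokenization concrete, here is a minimal sketch that uses only the .NET base library. It is not Machine's `WhitespaceTokenizer` implementation; the sample line and variable names are made up for illustration.

```csharp
using System;

// Illustrative input with irregular spacing (a double space and a tab).
string line = "I would like  to book\ta room .";

// An empty separator array tells Split to break on any whitespace character;
// RemoveEmptyEntries drops the empty strings produced by repeated whitespace.
string[] tokens = line.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(tokens.Length);             // prints 8
Console.WriteLine(string.Join("|", tokens));  // prints I|would|like|to|book|a|room|.
```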

## Loading Scripture

Machine contains classes for loading Scripture in various formats, such as USFM and USX.

### USX

USX is a common XML format for Scripture. Let's take a look at how to load a set of USX files. First, we create an instance of the `UsxFileTextCorpus` class. We ensure that the correct verse references are used by loading the versification file for this translation. If a versification is not provided, then the English versification is used. We want untokenized verse text, so we use the `NullTokenizer`.

In [None]:
using SIL.Scripture;

var tokenizer = new NullTokenizer();
var versification = Versification.Table.Implementation.Load("data/WEB-DBL/release/versification.vrs", "web");
var corpus = new UsxFileTextCorpus(tokenizer, "data/WEB-DBL/release/USX_1", versification);

Let's iterate through the corpus. You will notice that each text segment in the corpus has an associated reference. In the case of Scripture, these are `VerseRef` objects.

In [None]:
foreach (TextSegment textSegment in corpus.GetSegments().Take(10))
{
    var verseRefStr = textSegment.SegmentRef.ToString();
    var verseText = string.Join(" ", textSegment.Segment);
    Console.WriteLine($"{verseRefStr}: {verseText}");
}

1JN 1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1JN 1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1JN 1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.
1JN 1:4: And we write these things to you, that our joy may be fulfilled.
1JN 1:5: This is the message which we have heard from him and announce to you, that God is light, and in him is no darkness at all.
1JN 1:6: If we say that we have fellowship with him and walk in the darkness, we lie and don’t tell the truth.
1JN 1:7: But if we walk in the light as he is in the light, we have fellowship with one another, and the blood of Jesus Christ his Son, cleanses us fro

You can also iterate through verses in the corpus by book.

In [None]:
foreach (IText text in corpus.Texts)
{
    Console.WriteLine(text.Id);
    Console.WriteLine("======");
    foreach (TextSegment textSegment in text.GetSegments().Take(3))
    {
        var verseRef = (VerseRef)textSegment.SegmentRef;
        var chapterVerse = $"{verseRef.Chapter}:{verseRef.Verse}";
        var verseText = string.Join(" ", textSegment.Segment);
        Console.WriteLine($"{chapterVerse}: {verseText}");
    }
    Console.WriteLine();
}

1JN
======
1:1: That which was from the beginning, that which we have heard, that which we have seen with our eyes, that which we saw, and our hands touched, concerning the Word of life
1:2: (and the life was revealed, and we have seen, and testify, and declare to you the life, the eternal life, which was with the Father, and was revealed to us);
1:3: that which we have seen and heard we declare to you, that you also may have fellowship with us. Yes, and our fellowship is with the Father and with his Son, Jesus Christ.

2JN
======
1:1: The elder, to the chosen lady and her children, whom I love in truth, and not I only, but also all those who know the truth,
1:2: for the truth’s sake, which remains in us, and it will be with us forever:
1:3: Grace, mercy, and peace will be with us, from God the Father and from the Lord Jesus Christ, the Son of the Father, in truth and love.

3JN
======
1:1: The elder to Gaius the beloved, whom I love in truth.
1:2: Beloved, I pray that you may prosper in all th

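The grouping of verses by book can be pictured with a small sketch in plain .NET. The `ParseRef` helper and the sample references are hypothetical; Machine's `VerseRef` handles much more than this (versification, verse ranges, and so on), so treat this only as an illustration of the idea.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: splits a reference such as "1JN 1:1" into book, chapter, and verse.
(string Book, int Chapter, int Verse) ParseRef(string reference)
{
    string[] bookAndRest = reference.Split(' ');
    string[] chapterAndVerse = bookAndRest[1].Split(':');
    return (bookAndRest[0], int.Parse(chapterAndVerse[0]), int.Parse(chapterAndVerse[1]));
}

// Illustrative references only; no corpus is loaded here.
string[] references = { "1JN 1:1", "1JN 1:2", "2JN 1:1", "3JN 1:1", "3JN 1:2" };

// GroupBy keeps the books in order of first appearance.
var byBook = references.Select(r => ParseRef(r)).GroupBy(r => r.Book).ToList();

foreach (var group in byBook)
{
    string verses = string.Join(", ", group.Select(r => $"{r.Chapter}:{r.Verse}"));
    Console.WriteLine($"{group.Key}: {verses}");
}
```
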
### Digital Bible Library Bundles

Now, let's load a Digital Bible Library (DBL) bundle. A DBL bundle is a zip archive that contains all of the data needed to publish a Bible translation. The sample data is stored as an unzipped directory, so we first zip it into a bundle.

In [None]:
using System.IO;
using System.IO.Compression;

Directory.CreateDirectory("out");
if (File.Exists("out/web.zip"))
    File.Delete("out/web.zip");
ZipFile.CreateFromDirectory("data/WEB-DBL", "out/web.zip");
Console.WriteLine("DBL bundle created.");

DBL bundle created.

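If you want to check what ended up inside a zip archive like this, the entries can be listed with `ZipFile.OpenRead`. The sketch below is self-contained: it builds a tiny archive in a temporary directory from placeholder files (the file names and contents are assumptions for illustration, not the real bundle layout) and then lists its entries.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Build a small placeholder directory tree in a temporary location.
string root = Path.Combine(Path.GetTempPath(), "bundle-sketch-" + Guid.NewGuid().ToString("N"));
string sourceDir = Path.Combine(root, "bundle");
Directory.CreateDirectory(Path.Combine(sourceDir, "release"));
File.WriteAllText(Path.Combine(sourceDir, "metadata.xml"), "<placeholder/>");
File.WriteAllText(Path.Combine(sourceDir, "release", "versification.vrs"), "# placeholder");

// Zip the directory, just as the cell above does for the sample data.
string zipPath = Path.Combine(root, "bundle.zip");
ZipFile.CreateFromDirectory(sourceDir, zipPath);

// Read the entry names back; separators are normalized to forward slashes.
List<string> entryNames;
using (ZipArchive archive = ZipFile.OpenRead(zipPath))
{
    entryNames = archive.Entries
        .Select(e => e.FullName.Replace('\\', '/'))
        .OrderBy(n => n, StringComparer.Ordinal)
        .ToList();
}

foreach (string name in entryNames)
    Console.WriteLine(name);

// Clean up the temporary files.
Directory.Delete(root, true);
```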

Next, we create a `DblBundleTextCorpus` instance. This time we want to tokenize the text, so we use the `LatinWordTokenizer`, a good default tokenizer for languages with Latin-based scripts. There is no need to specify versification, because the `DblBundleTextCorpus` class takes care of that for us.

In [None]:
var tokenizer = new LatinWordTokenizer();
var corpus = new DblBundleTextCorpus(tokenizer, "out/web.zip");

We can iterate through the corpus just as we did before. All text corpus classes in Machine adhere to the same interface, so it is easy to switch between the various classes. Also, you can see that the verse text is nicely tokenized.

In [None]:
foreach (TextSegment textSegment in corpus.GetSegments().Take(10))
{
    var verseRefStr = textSegment.SegmentRef.ToString();
    var verseText = string.Join(" ", textSegment.Segment);
    Console.WriteLine($"{verseRefStr}: {verseText}");
}

1JN 1:1: That which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the Word of life
1JN 1:2: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the Father , and was revealed to us ) ;
1JN 1:3: that which we have seen and heard we declare to you , that you also may have fellowship with us . Yes , and our fellowship is with the Father and with his Son , Jesus Christ .
1JN 1:4: And we write these things to you , that our joy may be fulfilled .
1JN 1:5: This is the message which we have heard from him and announce to you , that God is light , and in him is no darkness at all .
1JN 1:6: If we say that we have fellowship with him and walk in the darkness , we lie and don’t tell the truth .
1JN 1:7: But if we walk in the light as he is in the light , we have fellowship with one another , and the blood of Jesus Chr

### Paratext Projects

Another useful text corpus class is `ParatextTextCorpus`. This class is used to load a Paratext project. It properly loads the configured encoding and versification for the project.

In [None]:
var corpus = new ParatextTextCorpus(tokenizer, "data/WEB-PT");

Now, let's iterate through the segments.

In [None]:
foreach (TextSegment textSegment in corpus.GetSegments().Take(10))
{
    var verseRefStr = textSegment.SegmentRef.ToString();
    var verseText = string.Join(" ", textSegment.Segment);
    Console.WriteLine($"{verseRefStr}: {verseText}");
}

1JN 1:1: That which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the Word of life
1JN 1:2: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the Father , and was revealed to us ) ;
1JN 1:3: that which we have seen and heard we declare to you , that you also may have fellowship with us . Yes , and our fellowship is with the Father and with his Son , Jesus Christ .
1JN 1:4: And we write these things to you , that our joy may be fulfilled .
1JN 1:5: This is the message which we have heard from him and announce to you , that God is light , and in him is no darkness at all .
1JN 1:6: If we say that we have fellowship with him and walk in the darkness , we lie and don’t tell the truth .
1JN 1:7: But if we walk in the light as he is in the light , we have fellowship with one another , and the blood of Jesus Chr

## Token Processors

Tokenized text often needs further processing as part of an AI/ML pipeline. Machine has a set of token processors that can be used to process text segments easily. Lowercasing text is a common pre-processing step, so let's show how to apply the `TokenProcessors.Lowercase` token processor.

In [None]:
using static SIL.Machine.Corpora.TokenProcessors;

var sentence = "New York is cold in the Winter .".Split();
Console.WriteLine(string.Join(" ", Lowercase.Process(sentence)));

new york is cold in the winter .


Multiple token processors can be applied in sequence using the `TokenProcessors.Pipeline` method. Here we will normalize a segment to NFC and then lowercase it.

In [None]:
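// The name is spelled with combining marks (U+030A, U+0308) so that it stays decomposed however this file is saved.
IReadOnlyList<string> sentence = "Here is a decomposed Swedish name A\u030Astro\u0308m .".Split();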
Console.WriteLine($"The length of decomposed {sentence[6]} is {sentence[6].Length}.");
sentence = Pipeline(NfcNormalize, Lowercase).Process(sentence);
Console.WriteLine($"The length of precomposed {sentence[6]} is {sentence[6].Length}.");

The length of decomposed Åström is 8.
The length of precomposed åström is 6.

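To see what applying processors in sequence amounts to, here is a rough sketch that chains two steps using only the .NET base library. The `TokenStep` alias, the `Chain` helper, and the two lambdas are hypothetical stand-ins, not Machine's `TokenProcessors` implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using TokenStep = System.Func<System.Collections.Generic.IReadOnlyList<string>, System.Collections.Generic.IReadOnlyList<string>>;

// Hypothetical stand-ins for token processors, built on String.Normalize and ToLowerInvariant.
TokenStep nfcNormalize = tokens => tokens.Select(t => t.Normalize(NormalizationForm.FormC)).ToList();
TokenStep lowercase = tokens => tokens.Select(t => t.ToLowerInvariant()).ToList();

// Chains steps left to right: the output of one step becomes the input of the next.
TokenStep Chain(params TokenStep[] steps) =>
    tokens => steps.Aggregate(tokens, (current, step) => step(current));

// The second token uses combining marks, so it is 8 UTF-16 code units long.
IReadOnlyList<string> input = new[] { "New", "A\u030Astro\u0308m" };
IReadOnlyList<string> output = Chain(nfcNormalize, lowercase)(input);

Console.WriteLine($"{input[1].Length} -> {output[1].Length}"); // prints "8 -> 6"
```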

## Parallel Text Corpora

So far we have only dealt with monolingual corpora. For many tasks, such as machine translation, parallel corpora are required. Machine provides a corpus class for combining two monolingual corpora into a parallel corpus.

In order to create a parallel text corpus, we must first create the source and target monolingual text corpora. Then, we can create the `ParallelTextCorpus` object from the monolingual corpus objects.

In [None]:
var sourceCorpus = new ParatextTextCorpus(tokenizer, "data/VBL-PT");
var targetCorpus = new ParatextTextCorpus(tokenizer, "data/WEB-PT");
var parallelCorpus = new ParallelTextCorpus(sourceCorpus, targetCorpus);

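Conceptually, a parallel corpus pairs up source and target segments that share the same reference. The sketch below shows that idea as a simple inner join in plain .NET, with placeholder text. It is an assumption-laden illustration, not how `ParallelTextCorpus` is implemented, and it says nothing about how Machine treats verses that are missing on one side.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder segments keyed by reference; the text is illustrative only.
var source = new List<(string Ref, string Text)>
{
    ("1JN 1:1", "source one"),
    ("1JN 1:2", "source two"),
    ("1JN 1:3", "source three"),
};
var target = new Dictionary<string, string>
{
    ["1JN 1:1"] = "target one",
    ["1JN 1:3"] = "target three",
    ["1JN 1:4"] = "target four",
};

// Keep only references present on both sides, in source order.
var pairs = source
    .Where(s => target.ContainsKey(s.Ref))
    .Select(s => (Ref: s.Ref, Source: s.Text, Target: target[s.Ref]))
    .ToList();

foreach (var pair in pairs)
    Console.WriteLine($"{pair.Ref}: {pair.Source} | {pair.Target}");
```
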
We can now iterate through the parallel segments.

In [None]:
foreach (ParallelTextSegment textSegment in parallelCorpus.GetSegments().Take(5))
{
    var verseRefStr = textSegment.SegmentRef.ToString();
    var sourceVerseText = string.Join(" ", textSegment.SourceSegment);
    var targetVerseText = string.Join(" ", textSegment.TargetSegment);
    Console.WriteLine(verseRefStr);
    Console.WriteLine($"Source: {sourceVerseText}");
    Console.WriteLine($"Target: {targetVerseText}");
}

1JN 1:1
Source: Esta carta trata sobre la Palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: That which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the Word of life
1JN 1:2
Source: Esta Vida nos fue revelada . La vimos y damos testimonio de ella . Estamos hablándoles de Aquél que es la Vida Eterna , que estaba con el Padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the Father , and was revealed to us ) ;
1JN 1:3
Source: Los que hemos visto y oído eso mismo les contamos , para que también puedan participar de esta amistad junto a nosotros . Esta amistad con el Padre y su Hijo Jesucristo .
Target: that which we have seen and hear