# Paratext/USFM Processing Tutorial

Machine provides various classes for parsing and mutating USFM. These classes are used to implement the USFM and Paratext corpus classes. Machine also provides functionality for parsing Paratext project settings. The USFM processing functionality is designed to be as compatible as possible with USFM that is produced by Paratext.

In [None]:
%pip install sil-machine

In [None]:
!git clone https://github.com/sillsdev/machine.py.git
%cd machine.py/samples

## Parsing USFM

Machine provides three options for parsing USFM, described below. The same classes can also be used to make simple changes to the USFM.

First, let's define a simple USFM string and instantiate a USFM stylesheet object using the default stylesheet that is included in Machine. If you are using a custom stylesheet, pass the path to that file instead when you create the stylesheet object.

In [1]:
from machine.corpora import UsfmStylesheet

usfm = """\\id MAT - Test
\\h Matthew
\\mt Matthew
\\ip An introduction to Matthew
\\c 1
\\s Chapter One
\\p
\\v 1 This is verse \\pn one\\pn* of chapter one.
\\v 2 This is verse two\\f + \\fr 1:2: \\ft This is a footnote.\\f* of chapter one. 
""".replace("\n", "\r\n")

stylesheet = UsfmStylesheet("usfm.sty")

### Option 1: UsfmTokenizer

Machine provides a `UsfmTokenizer` class that splits USFM into separate tokens. Each token is represented by a `UsfmToken` object, which carries only minimal metadata about the token. This is the simplest method for parsing USFM: you process the tokens by iterating through them, and you can reconstruct the USFM from the tokens with the `detokenize` method.

In this example, we uppercase all text tokens.

In [2]:
from machine.corpora import UsfmTokenizer, UsfmTokenType

usfm_tokenizer = UsfmTokenizer(stylesheet)
tokens = usfm_tokenizer.tokenize(usfm)
for token in tokens:
  if token.type is UsfmTokenType.TEXT:
    token.text = token.text.upper()

print(usfm_tokenizer.detokenize(tokens))

\id MAT - TEST
\h MATTHEW
\mt MATTHEW
\ip AN INTRODUCTION TO MATTHEW
\c 1
\s CHAPTER ONE
\p
\v 1 THIS IS VERSE \pn ONE\pn* OF CHAPTER ONE.
\v 2 THIS IS VERSE TWO\f + \fr 1:2: \ft THIS IS A FOOTNOTE.\f* OF CHAPTER ONE.
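
To make the tokenize/modify/detokenize round trip more concrete, here is a minimal standard-library sketch of the same idea. It is not how Machine implements `UsfmTokenizer`: the names `MARKER`, `toy_tokenize`, and `toy_detokenize` are hypothetical, the regex ignores the stylesheet, and chapter and verse numbers are simply left inside the text tokens rather than being attached to their markers.

```python
import re

# Hypothetical pattern: a backslash, a marker name, and an optional "*" for end markers.
MARKER = re.compile(r"\\[a-z]+[0-9]*\*?")

def toy_tokenize(usfm):
    """Split USFM into ("marker", s) and ("text", s) pairs without dropping characters."""
    tokens = []
    pos = 0
    for match in MARKER.finditer(usfm):
        if match.start() > pos:
            tokens.append(("text", usfm[pos:match.start()]))
        tokens.append(("marker", match.group()))
        pos = match.end()
    if pos < len(usfm):
        tokens.append(("text", usfm[pos:]))
    return tokens

def toy_detokenize(tokens):
    """Rebuild the USFM string by concatenating the token strings in order."""
    return "".join(value for _, value in tokens)

sample = "\\c 1\n\\p\n\\v 1 This is verse \\pn one\\pn* of chapter one.\n"

# Uppercase only the text tokens; markers are passed through unchanged.
toy_tokens = [
    (kind, value.upper() if kind == "text" else value)
    for kind, value in toy_tokenize(sample)
]
print(toy_detokenize(toy_tokens))
```

Because every character of the input ends up in exactly one token, detokenizing unmodified tokens returns the original string. Treat this only as an illustration of the concept; the real tokenizer should be used for actual USFM.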



### Option 2: UsfmParser

The `UsfmParser` class is a higher-level parsing option that provides more information about the semantics and context of the current token. The `UsfmParserState` instance contains the current context and metadata.

In this example, we use the parser state to uppercase only verse text, leaving headings, introductory material, and footnotes untouched.

In [4]:
from machine.corpora import UsfmParser

usfm_parser = UsfmParser(usfm, stylesheet=stylesheet)
state = usfm_parser.state
while usfm_parser.process_token():
  if state.token.type is UsfmTokenType.TEXT and state.is_verse_text:
    state.token.text = state.token.text.upper()

print(usfm_tokenizer.detokenize(state.tokens))

\id MAT - Test
\h Matthew
\mt Matthew
\ip An introduction to Matthew
\c 1
\s Chapter One
\p
\v 1 THIS IS VERSE \pn ONE\pn* OF CHAPTER ONE.
\v 2 THIS IS VERSE TWO\f + \fr 1:2: \ft This is a footnote.\f* OF CHAPTER ONE.



### Option 3: UsfmParser with UsfmParserHandler

This option is very similar to the previous one, except that an instance of `UsfmParserHandler` is passed to the `UsfmParser`. This class provides a set of callback methods that can be overridden with custom processing logic. The methods are called when certain events are encountered during parsing. For example, there are callbacks for the start/end of a book, the start/end of a paragraph, a chapter/verse milestone, etc.

In this example, we override the `text` callback to uppercase verse text just like in the previous example.

In [5]:
from machine.corpora import UsfmParserHandler

class VerseTextUppercaser(UsfmParserHandler):
  def text(self, state, text):
    if state.is_verse_text:
      state.token.text = text.upper()

usfm_parser = UsfmParser(usfm, VerseTextUppercaser(), stylesheet)
usfm_parser.process_tokens()

print(usfm_tokenizer.detokenize(usfm_parser.state.tokens))

\id MAT - Test
\h Matthew
\mt Matthew
\ip An introduction to Matthew
\c 1
\s Chapter One
\p
\v 1 THIS IS VERSE \pn ONE\pn* OF CHAPTER ONE.
\v 2 THIS IS VERSE TWO\f + \fr 1:2: \ft This is a footnote.\f* OF CHAPTER ONE.



Machine provides a convenience function, `parse_usfm`, for easily parsing USFM using a handler.

In [7]:
from machine.corpora import parse_usfm

class VerseTextCollector(UsfmParserHandler):
  def __init__(self):
    self.verse_texts = []

  def text(self, state, text):
    if state.is_verse_text:
      self.verse_texts.append(text)

collector = VerseTextCollector()
parse_usfm(usfm, collector, stylesheet)
for verse_text in collector.verse_texts:
  print(verse_text)

This is verse 
one
 of chapter one. 
This is verse two
 of chapter one.
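
The output above shows that the `text` callback fires once per text token, so a single verse can arrive as several fragments (verse 1 is split around `\pn ...\pn*`, verse 2 around the footnote). If whole verses are needed, the fragments can be grouped afterwards. The sketch below uses only the standard library and works on hypothetical `(verse_ref, fragment)` pairs; the reference strings are made up for illustration, and how a handler would record them alongside each fragment is left as an assumption.

```python
from collections import defaultdict

def join_fragments(pairs):
    """Concatenate text fragments that share a verse reference, in first-seen order."""
    grouped = defaultdict(list)
    for verse_ref, fragment in pairs:
        grouped[verse_ref].append(fragment)
    # Join the pieces, then collapse whitespace left over from line breaks
    # and marker boundaries.
    return {ref: " ".join("".join(parts).split()) for ref, parts in grouped.items()}

# Hypothetical input: the fragments printed above, tagged with made-up references.
pairs = [
    ("MAT 1:1", "This is verse "),
    ("MAT 1:1", "one"),
    ("MAT 1:1", " of chapter one. "),
    ("MAT 1:2", "This is verse two"),
    ("MAT 1:2", " of chapter one. "),
]
for ref, text in join_fragments(pairs).items():
    print(ref, "->", text)
```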


## Processing Paratext Projects

Machine provides classes for parsing Paratext project settings and updating the text segments in a project.

### Parsing Paratext Settings

First, let's demonstrate how to parse Paratext settings. We use `FileParatextProjectSettingsParser` to read the settings for a project. The parsed settings contain important information about the project, such as the project name, file encoding, versification, USFM stylesheet, USFM file name format, language code, and Biblical terms list.

In [1]:
from machine.corpora import FileParatextProjectSettingsParser

settings_parser = FileParatextProjectSettingsParser("data/WEB-PT")
settings = settings_parser.parse()
print("Name:", settings.name)
print("Full Name:", settings.full_name)
print("Language Code:", settings.language_code)
print("Encoding:", settings.encoding)

Name: engWEB14
Full Name: World English Bible (American Edition)
Language Code: en
Encoding: utf_8_sig
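
The `utf_8_sig` value printed above is the name of a Python codec: UTF-8 with an optional byte order mark (BOM) at the start of the file. The following standard-library sketch shows why that matters if you ever read or write project files by hand; the sample text is made up for illustration and no project file is touched.

```python
import codecs

encoding = "utf_8_sig"  # the value printed for settings.encoding above
text = "\\id 3JN\n"     # made-up file content

# Encoding with utf_8_sig prepends the three BOM bytes.
raw = text.encode(encoding)
print(raw.startswith(codecs.BOM_UTF8))

# Decoding with the same codec strips the BOM again.
print(raw.decode(encoding) == text)

# Decoding as plain UTF-8 keeps the BOM as a leading "\ufeff" character,
# which would end up in front of the first marker.
print(repr(raw.decode("utf-8")[:4]))

# Input without a BOM still decodes correctly with utf_8_sig.
print(text.encode("utf-8").decode(encoding) == text)
```

Passing the encoding from the parsed settings to whatever opens the file avoids a stray `\ufeff` in front of the `\id` marker.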


### Updating a Paratext Book

Machine can replace the text segments in a Paratext project. This is useful for implementing machine translation systems. We will use `FileParatextProjectTextUpdater` to update the text segments. We pass the book ID and a list of pairs, each consisting of a list of Scripture references and the new text for that segment, to the `update_usfm` method. The class can be used to update both verse text and non-Scripture text. By default, `update_usfm` will only update text segments that are blank, so we set the `behavior` parameter to `UpdateUsfmBehavior.PREFER_NEW` to indicate that we want to replace existing segments.

In [2]:
from machine.corpora import FileParatextProjectTextUpdater, ScriptureRef, UpdateUsfmBehavior

new_segments = [
  ([ScriptureRef.parse("3JN 1:0/mt1", settings.versification)], "THIS IS THE MAJOR TITLE OF 3 JOHN."),
  ([ScriptureRef.parse("3JN 1:1", settings.versification)], "THIS IS THE FIRST VERSE OF 3 JOHN."),
  ([ScriptureRef.parse("3JN 1:14", settings.versification)], "THIS IS THE FOURTEENTH VERSE OF 3 JOHN.")
]

updater = FileParatextProjectTextUpdater("data/WEB-PT")
new_usfm = updater.update_usfm("3JN", new_segments, behavior=UpdateUsfmBehavior.PREFER_NEW)
print(new_usfm)

\id 3JN 64-3JN-web.sfm World English Bible (WEB)
\ide UTF-8
\h 3 John
\toc1 John’s Third Letter
\toc2 3 John
\toc3 3 John
\mt1 THIS IS THE MAJOR TITLE OF 3 JOHN.
\c 1
\p
\v 1 THIS IS THE FIRST VERSE OF 3 JOHN.
\p
\v 2 Beloved, I pray that you may prosper in all things and be healthy, even as your soul prospers.
\v 3 For I rejoiced greatly when brothers came and testified about your truth, even as you walk in truth.
\v 4 I have no greater joy than this: to hear about my children walking in truth.
\p
\v 5 Beloved, you do a faithful work in whatever you accomplish for those who are brothers and strangers.
\v 6 They have testified about your love before the assembly. You will do well to send them forward on their journey in a way worthy of God,
\v 7 because for the sake of the Name they went out, taking nothing from the Gentiles.
\v 8 We therefore ought to receive such, that we may be fellow workers for the truth.
\p
\v 9 I wrote to the assembly, but Diotrephes, who loves to be first among

By setting the `behavior` parameter to `UpdateUsfmBehavior.STRIP_EXISTING`, we can remove all existing text segments, leaving only the new segments and the basic USFM structure.

In [3]:
new_usfm = updater.update_usfm("3JN", new_segments, behavior=UpdateUsfmBehavior.STRIP_EXISTING)
print(new_usfm)

\id 3JN
\ide
\h
\toc1
\toc2
\toc3
\mt1 THIS IS THE MAJOR TITLE OF 3 JOHN.
\c 1
\p
\v 1 THIS IS THE FIRST VERSE OF 3 JOHN.
\p
\v 2
\v 3
\v 4
\p
\v 5
\v 6
\v 7
\v 8
\p
\v 9
\v 10
\p
\v 11
\v 12
\p
\v 13
\v 14 THIS IS THE FOURTEENTH VERSE OF 3 JOHN.
\p
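
After stripping, verses that received no new segment are left empty, as the output above shows. A quick way to list them is a line-oriented scan of the returned string. The sketch below uses only the standard library and a hypothetical helper, `empty_verses`; it assumes each verse marker sits on its own line, as in the output above, and it is not a general USFM parser.

```python
import re

def empty_verses(usfm):
    """Return (chapter, verse) pairs for verse lines that carry no text."""
    chapter = None
    result = []
    for line in usfm.splitlines():
        chapter_match = re.match(r"\\c\s+(\d+)", line)
        if chapter_match:
            chapter = int(chapter_match.group(1))
            continue
        # The verse number is kept as a string so that ranges are preserved.
        verse_match = re.match(r"\\v\s+(\S+)\s*(.*)", line)
        if verse_match and not verse_match.group(2).strip():
            result.append((chapter, verse_match.group(1)))
    return result

# A shortened sample in the shape of the stripped output above.
stripped = (
    "\\c 1\n"
    "\\p\n"
    "\\v 1 THIS IS THE FIRST VERSE OF 3 JOHN.\n"
    "\\p\n"
    "\\v 2\n"
    "\\v 3 \n"
    "\\v 14 THIS IS THE FOURTEENTH VERSE OF 3 JOHN.\n"
)
print(empty_verses(stripped))
```

For anything beyond a quick check, the `UsfmParser` options described earlier are the more robust way to inspect the result.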

