Alignment Data Structure: USFM 3

Overview

Choosing a data structure for alignment data involves two questions: the obvious one of what data must be stored and used, and the less obvious, long-term one of how the new data will be stored and managed. We also need to plan how this data will fit into our existing uW ecosystem.

Case for CSV

CSV is easier to edit by hand, but it requires extra measures to keep it in sync with the target language USFM project file. Once the implementation is complete, editing by hand will no longer be necessary, which negates that advantage.

Pros

- CSV is easier to edit by hand
- USFM target files stay lightweight and easier to edit

Cons

- Managing synchronization may cause more issues than it solves
- Added complexity of planning how to store and publish the alignment data
- Needs additional planning that could add months to implementation
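To make the synchronization concern concrete, here is a minimal sketch of the kind of check a CSV approach would need: confirming that a target phrase recorded in the CSV still occurs in the verse text the stated number of times. It assumes occurrences are written as `n/total` and that matching is case-insensitive on whole words; the function names and the sample verse text are hypothetical and not part of any existing uW tool.

```python
import re


def count_phrase(verse_text: str, phrase: str) -> int:
    """Count whole-word, case-insensitive occurrences of phrase in verse_text."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, verse_text, flags=re.IGNORECASE))


def phrase_in_sync(verse_text: str, phrase: str, occurrence: str) -> bool:
    """Return True when an 'n/total' occurrence still agrees with the verse text."""
    n, total = (int(part) for part in occurrence.split("/"))
    # n must be a valid index, and total must equal the actual count in the verse.
    return 1 <= n <= total and total == count_phrase(verse_text, phrase)


# Made-up verse text assembled for illustration only.
verse = "The book of the genealogy of Jesus Christ, son of David, son of Abraham."
in_sync = phrase_in_sync(verse, "son of", "2/2")

# After a hand edit removes one "son of", the same CSV row is stale.
edited = "The book of the genealogy of Jesus Christ, son of David and Abraham."
stale = not phrase_in_sync(edited, "son of", "2/2")
```

A check like this would have to run whenever either file changes, which is part of the extra planning the CSV option carries.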

Case for USFM 3

If USFM 3 is used to store alignment data in the target text's USFM markup, no separate data file is required. The markup is more difficult to edit by hand, but once the implementation is complete, editing by hand will no longer be necessary, which negates that disadvantage.

USFM reference: http://ubsicap.github.io/usfm/characters/index.html?highlight=srcloc

Pros

- Development time that would go to CSV can instead improve existing USFM support
- Synchronization is much easier
- No extra files to manage and publish
- Existing libraries can be used for validation, etc.

Cons

- USFM markup becomes more complex
- USFM becomes harder to edit by hand

Data

Essential

This information is required for alignment data and will cause issues if it is missing.

Source
- Reference
- Word/Phrase
- Occurrence/Occurrences
Target
- Reference
- Word/Phrase
- Occurrence/Occurrences

Supplemental

This information is optional. Storing it avoids having to look it up in the source text.

Source
- Strong's Number
- Morphology
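
The fields above can be pictured as a small record type. This is only a sketch under stated assumptions: the class and field names (`Occurrence`, `AlignedText`, `Alignment`) are hypothetical, and an occurrence is assumed to be written as `n/total`, meaning "the n-th of total occurrences in the verse".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    index: int  # which occurrence this is (1-based)
    total: int  # how many times the word/phrase occurs in the verse

    @classmethod
    def parse(cls, text: str) -> "Occurrence":
        """Parse an 'n/total' string such as '1/2'."""
        index, total = (int(part) for part in text.split("/"))
        if not 1 <= index <= total:
            raise ValueError(f"invalid occurrence: {text!r}")
        return cls(index, total)


@dataclass(frozen=True)
class AlignedText:
    reference: str          # e.g. "mat 1:1"
    phrase: str             # word or phrase
    occurrence: Occurrence


@dataclass(frozen=True)
class Alignment:
    source: AlignedText              # essential
    target: AlignedText              # essential
    strong: Optional[str] = None     # supplemental
    morph: Optional[str] = None      # supplemental


example = Alignment(
    source=AlignedText("mat 1:1", "υἱοῦ", Occurrence.parse("1/2")),
    target=AlignedText("mat 1:1", "son of", Occurrence.parse("1/2")),
)
```

Keeping the supplemental fields optional mirrors the split above: a record is still valid without them, but not without the essential source and target data.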

CSV Examples

Essential Data

book, c:v, source_phrase, source_occurrence, target_phrase, target_occurrence
mat, 1:1, "βίβλος", 1/1, "the book of", 1/1
mat, 1:1, "γενέσεως", 1/1, "the genealogy of", 1/1
mat, 1:1, "ἰησοῦ", 1/1, "jesus", 1/1
mat, 1:1, "χριστοῦ", 1/1, "christ", 1/1
mat, 1:1, "υἱοῦ", 1/2, "son of", 1/2
mat, 1:1, "δαυεὶδ", 1/1, "david", 1/1
mat, 1:1, "υἱοῦ", 2/2, "son of", 2/2
mat, 1:1, "ἀβραάμ", 1/1, "abraham", 1/1
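
As a rough illustration of reading this layout, the sketch below uses Python's standard `csv` module. Because the example rows put a space after each comma, `skipinitialspace=True` is needed so the quoted fields are unquoted correctly. The function name `read_alignments` is hypothetical.

```python
import csv
import io
from typing import Dict, List

SAMPLE = '''book, c:v, source_phrase, source_occurrence, target_phrase, target_occurrence
mat, 1:1, "βίβλος", 1/1, "the book of", 1/1
mat, 1:1, "υἱοῦ", 1/2, "son of", 1/2
'''


def read_alignments(text: str) -> List[Dict[str, str]]:
    """Parse alignment CSV text into one dict per row, keyed by the header names."""
    # skipinitialspace drops the blank after each comma in headers and values.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


rows = read_alignments(SAMPLE)
```

Each row comes back as a plain dictionary, for example `rows[1]["target_phrase"]` holds `son of`.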

Supplemental Data

book, c:v, source_phrase, strong, source_morph, source_occurrence, target_phrase, target_occurrence
mat, 1:1, "βίβλος", G9760, "N./....NFS", 1/1, "the book of", 1/1
mat, 1:1, "γενέσεως", G10780, "N./....GFS", 1/1, "the genealogy of", 1/1
mat, 1:1, "ἰησοῦ", G24240, "N./....GMS", 1/1, "jesus", 1/1
mat, 1:1, "χριστοῦ", G55470, "N./....GMS", 1/1, "christ", 1/1
mat, 1:1, "υἱοῦ", G52070, "N./....GMS", 1/2, "son of", 1/2
mat, 1:1, "δαυεὶδ", G11380, "N./....GMS", 1/1, "david", 1/1
mat, 1:1, "υἱοῦ", G52070, "N./....GMS", 2/2, "son of", 2/2
mat, 1:1, "ἀβραάμ", G110, "N./....GMS", 1/1, "abraham", 1/1

USFM 3 Example

Notes

It is not yet confirmed whether a phrase (more than one word) can be used in the \w tag.
If a phrase cannot be used, the example can be produced with one \w tag for each target word.

Essential Data

\v 1
\w the book of|x-ugnt-phrase="βίβλος" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
\w the genealogy of|x-ugnt-phrase="γενέσεως" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
\w jesus|x-ugnt-phrase="ἰησοῦ" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
\w christ|x-ugnt-phrase="χριστοῦ" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
\w son of|x-ugnt-phrase="υἱοῦ" x-source_occurrence="1/2" x-target-occurrence="1/2" \w*
\w david|x-ugnt-phrase="δαυεὶδ" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
\w son of|x-ugnt-phrase="υἱοῦ" x-source_occurrence="2/2" x-target-occurrence="2/2" \w*
\w abraham|x-ugnt-phrase="ἀβραάμ" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
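
To show that the markup above carries the same fields as the CSV rows, here is a minimal sketch that pulls the target text and attributes out of each `\w ...|... \w*` span with regular expressions. It is not a full USFM parser: it assumes attribute values contain no backslash or double quote, and the attribute names are copied as spelled in the example. The function name `parse_words` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches: \w <text>|<attributes> \w*
WORD_RE = re.compile(r"\\w\s+(?P<text>[^|\\]+)\|(?P<attrs>[^\\]*?)\s*\\w\*")
# Matches: name="value"
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

SAMPLE = r'''\v 1
\w the book of|x-ugnt-phrase="βίβλος" x-source_occurrence="1/1" x-target-occurrence="1/1" \w*
\w son of|x-ugnt-phrase="υἱοῦ" x-source_occurrence="1/2" x-target-occurrence="1/2" \w*
'''


def parse_words(usfm: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (target_text, attributes) for every \\w ... \\w* span in usfm."""
    words = []
    for match in WORD_RE.finditer(usfm):
        attrs = dict(ATTR_RE.findall(match.group("attrs")))
        words.append((match.group("text").strip(), attrs))
    return words


words = parse_words(SAMPLE)
```

In practice an existing USFM library would be preferable to hand-written patterns; the sketch only illustrates that no second file is needed to recover the alignment.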

Supplemental Data

\v 1
\w the book of|x-ugnt-phrase="βίβλος" x-source_occurrence="1/1" strong="G9760:N./....NFS" x-target-occurrence="1/1" \w*
\w the genealogy of|x-ugnt-phrase="γενέσεως" x-source_occurrence="1/1" strong="G10780:N./....GFS" x-target-occurrence="1/1" \w*
\w jesus|x-ugnt-phrase="ἰησοῦ" x-source_occurrence="1/1" strong="G24240:N./....GMS" x-target-occurrence="1/1" \w*
\w christ|x-ugnt-phrase="χριστοῦ" x-source_occurrence="1/1" strong="G55470:N./....GMS" x-target-occurrence="1/1" \w*
\w son of|x-ugnt-phrase="υἱοῦ" x-source_occurrence="1/2" strong="G52070:N./....GMS" x-target-occurrence="1/2" \w*
\w david|x-ugnt-phrase="δαυεὶδ" x-source_occurrence="1/1" strong="G11380:N./....GMS" x-target-occurrence="1/1" \w*
\w son of|x-ugnt-phrase="υἱοῦ" x-source_occurrence="2/2" strong="G52070:N./....GMS" x-target-occurrence="2/2" \w*
\w abraham|x-ugnt-phrase="ἀβραάμ" x-source_occurrence="1/1" strong="G110:N./....GMS" x-target-occurrence="1/1" \w*
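
The examples above appear to pack the Strong's number and the morphology into a single `strong` attribute separated by a colon. Under that assumption, the sketch below renders one record as a `\w` line and splits the combined value back apart. The function names `render_word` and `split_strong` are hypothetical.

```python
from typing import Optional, Tuple


def render_word(target_phrase: str, source_phrase: str,
                source_occurrence: str, target_occurrence: str,
                strong: Optional[str] = None, morph: Optional[str] = None) -> str:
    """Build one \\w ... \\w* line in the attribute order used in the examples."""
    attrs = [
        f'x-ugnt-phrase="{source_phrase}"',
        f'x-source_occurrence="{source_occurrence}"',
    ]
    if strong is not None:
        # Assumed convention: "<strong>:<morph>" when morphology is present.
        value = strong if morph is None else f"{strong}:{morph}"
        attrs.append(f'strong="{value}"')
    attrs.append(f'x-target-occurrence="{target_occurrence}"')
    joined = " ".join(attrs)
    return f"\\w {target_phrase}|{joined} \\w*"


def split_strong(value: str) -> Tuple[str, Optional[str]]:
    """Split 'G9760:N./....NFS' into ('G9760', 'N./....NFS')."""
    strong, sep, morph = value.partition(":")
    return strong, (morph if sep else None)


line = render_word("the book of", "βίβλος", "1/1", "1/1",
                   strong="G9760", morph="N./....NFS")
```

Omitting `strong` and `morph` produces the essential-data form of the same line.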
