Some thoughts on cross-references, especially for non-Roman scripts #82

DavidHaslam opened this Issue Jan 21, 2019 · 1 comment



DavidHaslam commented Jan 21, 2019

I'm facing an unusual yet interesting task that requires converting a source text from OSIS to USFM.

It's unusual in that the software team that did the original digitisation went straight from transcribed text to OSIS by means of a Perl script, such that the OSIS files could be part of a website content management system.

It's interesting in that examining the work a decade later has brought to light numerous mistakes in the OSIS files that can only be attributed to weaknesses in their Perl script, which (alas) is no longer available.

Vernacular references are notoriously difficult to parse at the best of times. I'm therefore not in the least surprised that so many errors were found.

One particular area of interest involves the conversion of OSIS cross-reference notes to USFM for a Bible translation in a language that uses a non-Roman script.

Here's an example:

သုတစံးအၢအီၤတဂ့ၤ.<note><reference osisRef="Exod.22.28">၂မိၤ. ၂၂း၂၈</reference></note>.</p>

Going in the other direction, from USFM to OSIS, the best-known difficulty facing programmers is parsing the references so that the correct osisRef attribute value is inserted into each OSIS reference element. This task is rarely straightforward, because references are punctuated and separated differently from language to language. The difficulty is compounded with a non-Roman script, where not only the book abbreviations but also the numerals differ from those in Latin-based scripts. It is made worse still by inconsistencies in how vernacular references are written, even by the same translator at different times.
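To illustrate just the numerals part of that difficulty, here is a minimal Python sketch. It assumes the vernacular digits are ordinary Unicode decimal digits (as the Myanmar-script digits are), and the function name `vernacular_numbers` is made up for this example. It does nothing about book abbreviations or separators, which are the harder part.

```python
import re


def vernacular_numbers(text: str) -> list[int]:
    """Return every run of decimal digits in `text` as an int.

    In Python 3, `\\d` in a str pattern matches any Unicode decimal
    digit, and int() accepts those digits, so no lookup table is needed.
    """
    return [int(run) for run in re.findall(r"\d+", text)]


# Myanmar digits 2,2 then U+1038 as the separator, then digits 2,8
print(vernacular_numbers("\u1042\u1042\u1038\u1042\u1048"))  # [22, 28]
```

A leading digit that belongs to a book name (as in a "2nd book of ..." abbreviation) would also be picked up, so a real parser would have to split off the book abbreviation first.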

Bible study software has a general requirement that a displayed cross-reference can act as a hyperlink to the passage it refers to, such that the user can readily go to the passage and find out why it was suggested as a useful further passage to read in the first context.

For such hyperlinks to be effective, the reference system has to be standardised. It would be too difficult for Bible software to make sense of every possible variety of Scripture reference in all the world's languages and scripts. That said, it doesn't really matter if OSIS and USFM use different standard reference systems, since a one-to-one correspondence can be mapped between the two.
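As a rough sketch of that one-to-one mapping, the Python below converts an OSIS reference to a USFM-style one. Only a handful of books are listed, the function name `osis_to_usfm_ref` is hypothetical, and it handles only the simple `Book.chapter.verse` shape, not ranges or lists.

```python
# Partial table of OSIS book abbreviations and their USFM book codes.
# A real converter would need the full canon; this is illustrative only.
OSIS_TO_USFM_BOOK = {
    "Gen": "GEN",
    "Exod": "EXO",
    "Lev": "LEV",
    "Num": "NUM",
    "Deut": "DEU",
}

# Because the mapping is one-to-one, it can simply be inverted.
USFM_TO_OSIS_BOOK = {usfm: osis for osis, usfm in OSIS_TO_USFM_BOOK.items()}


def osis_to_usfm_ref(osis_ref: str) -> str:
    """Convert e.g. 'Exod.22.28' to 'EXO 22:28' (single verse only)."""
    book, chapter, verse = osis_ref.split(".")
    return f"{OSIS_TO_USFM_BOOK[book]} {chapter}:{verse}"


print(osis_to_usfm_ref("Exod.22.28"))  # EXO 22:28
```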

In converting a cross-reference note from OSIS to USFM, it soon occurred to me that it made no sense to discard the information provided in the osisRef attribute values. For a programmer from the English-speaking world, these are also easier to read than the vernacular form of the reference, which is the text wrapped between the two reference element tags.

Being already familiar with the extended syntax that USFM 3.0 provides for word level attributes, I came up with the idea that the same kind of syntax could be used to tackle this problem.

Here's what my example might become in USFM.

သုတစံးအၢအီၤတဂ့ၤ.\x + \xt ၂မိၤ. ၂၂း၂၈|oref="Exod.22.28"\xt*.\x*

The vernacular reference becomes the word which then has an attribute following the vertical line.
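A conversion along these lines could be sketched in Python as follows. This assumes every note has exactly the simple shape of my example (one reference element directly inside a note element, no other attributes), and `oref` is only the attribute name proposed in this issue, not something taken from the USFM specification. The function name is made up for the sketch. Punctuation around the note is left wherever it was in the OSIS text.

```python
import re

# Matches <note><reference osisRef="...">vernacular text</reference></note>
NOTE_RE = re.compile(
    r'<note>\s*<reference osisRef="([^"]+)">(.*?)</reference>\s*</note>',
    re.DOTALL,
)


def osis_note_to_usfm(text: str) -> str:
    """Rewrite simple OSIS cross-reference notes as USFM \\x ... \\x* notes,
    keeping the osisRef value in a proposed `oref` attribute."""

    def repl(match: re.Match) -> str:
        osis_ref, vernacular = match.group(1), match.group(2)
        return f'\\x + \\xt {vernacular}|oref="{osis_ref}"\\xt*\\x*'

    # A function is used as the replacement so that the backslashes in
    # the USFM markers are not treated as regex escape sequences.
    return NOTE_RE.sub(repl, text)


sample = 'text<note><reference osisRef="Exod.22.28">2 Mo. 22:28</reference></note>.'
print(osis_note_to_usfm(sample))
```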

Alternatively, we could simply dispense with using an identified attribute, and just record the [OSIS] reference after the vertical line. viz.

သုတစံးအၢအီၤတဂ့ၤ.\x + \xt ၂မိၤ. ၂၂း၂၈|Exod.22.28\xt*.\x*

For this to be valid USFM, the unidentified attribute would require that oref is defined as the default attribute for \xt ...\xt*.

The practical benefit for the project I'm seeking to improve is that retaining all the references in a readable form helped me to go through the process of checking each and every reference in the OSIS file, making corrections in numerous places. This would be a much harder task if only the vernacular references were present in the USFM files.

For the time being, this is merely a provisional idea. It enables me to put off the consideration of how to parse and convert the vernacular references for this translation. The extra content can easily be removed by a regex search and replace operation (whether scripted or otherwise).
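For instance, a scripted removal might look like the Python sketch below. It assumes the only thing ever placed after the vertical line inside `\xt ...\xt*` is the OSIS reference, in either of the two forms shown above. If other attributes were in use on `\xt`, this would strip those as well. The function name `strip_xt_attribute` is hypothetical.

```python
import re

# A vertical line, then anything up to (but not including) the closing
# \xt* marker, provided it contains no backslash or further vertical line.
XT_ATTR_RE = re.compile(r"\|[^\\|]*(?=\\xt\*)")


def strip_xt_attribute(usfm: str) -> str:
    """Remove the '|...' part that sits just before each \\xt* marker."""
    return XT_ATTR_RE.sub("", usfm)


print(strip_xt_attribute('\\x + \\xt 2 Mo. 22:28|oref="Exod.22.28"\\xt*.\\x*'))
print(strip_xt_attribute("\\x + \\xt 2 Mo. 22:28|Exod.22.28\\xt*.\\x*"))
```

Both calls leave `\x + \xt 2 Mo. 22:28\xt*.\x*`, with only the vernacular reference remaining.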

I like to share any creative ideas that I have with the wider community, so that's why I'm recording these thoughts here for further consideration.



DavidHaslam commented Jan 24, 2019

@klassenjm - please review, and feel free to ask any further questions.
