Word Attributes for Compound Words #50
Related to #26. We are looking at encoding the Hebrew text in USFM 3, using word level attributes for
What we are curious about is whether there is a recommended way to encode lemma, Strong's number, and morphological data for the parts of a compound word. For example, this OSIS line uses a forward slash separator so that the morph data for the conjunction can be distinguished from the morph data for the verb.
How can we do this in USFM 3? Is there direct support for this or do we need to use a workaround in our custom attributes?
The upcoming USFM 3.0 spec has defined markers for Ruby annotations (#31). We have found that the existing spec (as documented today) needs some tweaking to be more agnostic about typesetting phrases. What I mean by that is that the spec provides markup for selecting a base text string (\rb...\rb*) and a following Ruby text string (\rt...\rt*).
However, in some cases this forces the project to decide up front what length of string it wishes to annotate. In some cases the project may wish to annotate a whole phrase.
The following updated syntax would allow the use of a colon to separate multiple pieces within a phrase gloss.
This is a significant change. \rb ...\rb* would be used to mark text with Ruby annotation, and the annotations (glosses) would be provided using the standard word level attributes syntax.
I think that a syntax proposed for Ruby will be relevant for your need as well. I would recommend following the use of a colon to separate morphological parts. We can document the use of colon for purposes like this as a standard syntax.
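To make the colon convention concrete, here is a minimal Python sketch of how a tool might read it. It assumes the USFM 3 word level attribute form `\w text|name="value"\w*`; the `parse_word` helper, the `x-morph` custom attribute name, and the sample values are hypothetical illustrations, not taken from the spec or from any existing project.

```python
import re

# One word-level span: \w text|attr="value" attr2="value2"\w*
WORD_RE = re.compile(r'\\w\s+(?P<text>[^|\\]+)\|(?P<attrs>[^\\]*)\\w\*')
# One attribute inside the span, e.g. strong="c:H0559"
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_word(usfm, sep=":"):
    """Return (text, attrs); each attribute value is split on `sep`.

    `sep` defaults to the colon proposed in this thread. Attribute
    names and values are passed through untouched.
    """
    match = WORD_RE.search(usfm)
    if match is None:
        raise ValueError("no \\w ...\\w* span found")
    attrs = {
        name: value.split(sep)
        for name, value in ATTR_RE.findall(match.group("attrs"))
    }
    return match.group("text").strip(), attrs


# Hypothetical compound word: a conjunction prefixed to a verb.
sample = r'\w wayyomer|strong="c:H0559" x-morph="C:Vqw3ms"\w*'
text, attrs = parse_word(sample)
# text  -> 'wayyomer'
# attrs -> {'strong': ['c', 'H0559'], 'x-morph': ['C', 'Vqw3ms']}
```

Splitting every attribute on the same separator keeps the parts positionally aligned: index 0 of each list describes the first morphological part, index 1 the second, and so on.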
What do you think?
Thanks for the poke, @DavidHaslam.
I don't think I follow how the
to USFM3 as:
Note that following this model would require that we preserve the forward slash in the text itself. This would require software that is aware that the forward slash should be a non-printing character when the text is displayed. Perhaps it would be more effective to use a Unicode zero-width space instead of the
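As a rough sketch of what "separator-aware" software would have to do, the Python below pairs each text segment with its attribute part and strips the separator for display. The function names and sample values are hypothetical; the in-text separator is a parameter, so the same code handles either a forward slash or a zero-width space (U+200B).

```python
ZWSP = "\u200b"  # Unicode zero-width space


def align_parts(text, attr_value, text_sep="/", attr_sep=":"):
    """Pair each segment of `text` with the matching attribute part."""
    segments = text.split(text_sep)
    parts = attr_value.split(attr_sep)
    if len(segments) != len(parts):
        raise ValueError(
            f"{len(segments)} text segments but {len(parts)} attribute parts"
        )
    return list(zip(segments, parts))


def display_text(text, text_sep="/"):
    """Remove the in-text separator so it is never shown to the reader."""
    return text.replace(text_sep, "")


# Forward slash as the in-text separator (must be stripped before display).
pairs = align_parts("wa/yyomer", "C:Vqw3ms")
# pairs -> [('wa', 'C'), ('yyomer', 'Vqw3ms')]
shown = display_text("wa/yyomer")
# shown -> 'wayyomer'

# Zero-width space as the in-text separator.
zw_pairs = align_parts("wa" + ZWSP + "yyomer", "C:Vqw3ms", text_sep=ZWSP)
```

With the zero-width space, software that knows nothing about the convention still renders the word without a visible mark, whereas a forward slash shows up unless every consumer strips it.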
@jag3773 @DavidHaslam Jesse - sorry about the extra background on \rb ...\rb* :-). As you concluded, I was not recommending that marker for use in your situation, but just identifying one other markup addition in USFM 3.0 which required a similar type of separator to identify multiple components in the attribute string.
In the case of ruby, the attributes separated by a colon would relate to single ideographs. I see that you would need a separator in the text to identify the relationship to the attribute sequence.
If you can use this : syntax, that would be excellent. I will add this as the recommended syntax for this purpose.
I also want to apologize that the last 10% of the final documentation for USFM 3.0 has been (extremely) delayed. It should have been done ages ago, but things have conspired against that happening. I anticipate this being done in December. I appreciate you accepting the text in these GH issues for now.