Word Attributes for Compound Words #50

jag3773 · 2017-08-11T18:33:28Z

Related to #26. We are looking at encoding the Hebrew text in USFM 3, using word-level attributes for lemma and strongs, plus an x-morph attribute to hold specific morphological data.

What we would like to know is whether there is a recommended way to encode lemma/strongs/morphological data for parts of compound words. For example, this OSIS line uses a forward slash separator so that the morph data for the conjunction can be distinguished from the morph data for the verb.

How can we do this in USFM 3? Is there direct support for this or do we need to use a workaround in our custom attributes?


klassenjm · 2017-08-21T19:08:06Z

Hi Jesse.

The upcoming USFM 3.0 spec has defined markers for Ruby annotations (#31). We have found that the existing spec (as documented today) needs some tweaking to be more agnostic about typesetting phrases. What I mean by that is that the spec provides markup for selecting a base text string (\rb...\rb*) and a following Ruby text string (\rt...\rt*).

However, in some cases this forces the project to decide up front what length of string it wishes to annotate. In some cases the project may wish to annotate a whole phrase.

The following updated syntax would allow the use of a colon to separate multiple pieces within a phrase gloss.

\rb BB|gg:gg\rb*

This is a significant change. \rb ...\rb* would be used to mark text with Ruby annotation, and the annotations (glosses) would be provided using the standard word-level attributes syntax.

I think that the syntax proposed for Ruby will be relevant for your need as well. I would recommend following the use of a colon to separate morphological parts. We can document the use of a colon for purposes like this as a standard syntax.

What do you think?

DavidHaslam · 2017-11-29T10:40:15Z

@jag3773 - @klassenjm still awaits your response.

jag3773 · 2017-11-29T21:13:05Z

Thanks for the poke @DavidHaslam .

I don't think I follow how the \rb code can help, but I do see how using a colon to separate the pieces could work in the attributes if there is a corresponding separator in the text. For example, we could convert this line:

<w lemma="c/1961" n="1.1.1" morph="HC/Vqw3ms">וַ/יְהִ֗י</w>

to USFM3 as:

\w וַ/יְהִ֗י|strong="c:1961" x-morph="HC:Vqw3ms" \w*
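As a rough illustration of that conversion, here is a minimal Python sketch. It assumes the OSIS `<w>` element has no child elements, that `lemma` maps to `strong` and `morph` maps to `x-morph` as in the line above, and that the `n` attribute is dropped. The function name `osis_w_to_usfm` and the `ATTR_MAP` table are hypothetical, not part of any existing converter.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from OSIS attribute names to USFM 3 attribute names,
# taken from the example in this comment (hypothetical, not from the spec).
ATTR_MAP = {"lemma": "strong", "morph": "x-morph"}


def osis_w_to_usfm(osis_xml: str) -> str:
    """Convert one OSIS <w> element into a USFM 3 \\w ...\\w* string."""
    elem = ET.fromstring(osis_xml)
    text = elem.text or ""  # the slash in the word text is preserved as-is
    attrs = []
    for osis_name, usfm_name in ATTR_MAP.items():
        value = elem.get(osis_name)
        if value is not None:
            # Swap the OSIS "/" part separator for ":" inside attribute values.
            colon_value = value.replace("/", ":")
            attrs.append(usfm_name + '="' + colon_value + '"')
    return "\\w " + text + "|" + " ".join(attrs) + "\\w*"
```

Only the attribute values are rewritten here; the word text keeps its forward slash, which is the point raised next.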

Note that following this model would require that we preserve the forward slash in the text itself. Software would then need to know that the forward slash is a non-printing character when the text is displayed. Perhaps it would be more effective to use a Unicode zero-width space instead of the / to separate the word in the text itself? That way the text would display correctly even if the software doesn't recognize the multiple components.
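To make the pairing concrete, here is a small Python sketch of how software might line up the parts of the word with the colon-separated attribute parts, and how the slash could be swapped for a zero-width space (U+200B). The helper names `split_compound` and `hide_separator` are made up for illustration, and the sketch assumes every attribute has exactly one part per part of the word.

```python
ZWSP = "\u200b"  # Unicode ZERO WIDTH SPACE


def split_compound(text, strong, morph, text_sep="/"):
    """Pair each part of a compound word with its strong and morph parts."""
    parts = text.split(text_sep)
    strongs = strong.split(":")
    morphs = morph.split(":")
    # Assumption: the text and both attributes have the same number of parts.
    if not (len(parts) == len(strongs) == len(morphs)):
        raise ValueError("text and attribute part counts differ")
    return list(zip(parts, strongs, morphs))


def hide_separator(text, text_sep="/"):
    """Replace the visible separator in the text with a zero-width space."""
    return text.replace(text_sep, ZWSP)
```

With the zero-width space variant, the same split still works by passing `text_sep=ZWSP`, while software that knows nothing about the components simply renders the word unbroken.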

klassenjm · 2017-11-29T23:06:46Z

@jag3773 @DavidHaslam Jesse - sorry about the extra background on \rb ...\rb* :-). As you concluded, I was not recommending that marker for use in your situation, but just identifying one other markup addition in USFM 3.0 which required a similar type of separator to identify multiple components in the attribute string.

In the case of ruby, the attributes separated by colon would relate to single ideographs. I see that you would need a separator in the text to identify the relationship to the attribute sequence.

If you can use this : syntax, that would be excellent. I will add this as the recommended syntax for this purpose.

I also want to apologize that the last 10% of the final documentation for USFM 3.0 has been (extremely) delayed. It should have been done ages ago, but things have conspired against that happening. I anticipate this being done in December. I appreciate you accepting the text in these GH issues for now.

klassenjm added this to the 3.0.0 milestone Nov 29, 2017

klassenjm added attribute documentation new status-proposed type-char labels Nov 29, 2017

jag3773 mentioned this issue Dec 14, 2017: Convert OSHB OSIS to UHB USFM3 unfoldingWord/translationCore#3414 (Closed, 3 tasks)

klassenjm added a commit that referenced this issue Apr 2, 2018

Clarify use of colon to separate multiple word-level attribute parts #50

4ee850b

klassenjm removed the status-proposed label Apr 2, 2018

klassenjm closed this as completed Apr 2, 2018
