Word Attributes for Compound Words #50

Closed
jag3773 opened this Issue Aug 11, 2017 · 4 comments

jag3773 commented Aug 11, 2017

Related to #26. We are looking at encoding the Hebrew text in USFM 3, using word-level attributes for lemma and strongs, plus an x-morph attribute to hold specific morphological data.

What we are curious about is whether there is a recommended way to encode lemma/strongs/morphological data for the parts of compound words. For example, this OSIS line uses a forward slash separator so that the morph data can be distinguished between the conjunction and the verb.

How can we do this in USFM 3? Is there direct support for this or do we need to use a workaround in our custom attributes?

klassenjm Aug 21, 2017

Contributor

Hi Jesse.

The upcoming USFM 3.0 spec has defined markers for Ruby annotations (#31). We have found that the existing spec (as documented today) needs some tweaking to be more agnostic about typesetting phrases. What I mean by that is that the spec provides markup for selecting a base text string (\rb...\rb*) and a following Ruby text string (\rt...\rt*).

However, in some cases this forces the project to decide up front what length of string it wishes to annotate. In some cases the project may wish to annotate a phrase.

The following updated syntax would allow the use of a colon to separate multiple pieces within a phrase gloss.

\rb BB|gg:gg\rb*

This is a significant change. \rb ...\rb* would be used to mark text with Ruby annotation, and the annotations (glosses) would be provided using the standard word-level attributes syntax.

I think that the syntax proposed for Ruby will be relevant for your need as well. I would recommend following the use of a colon to separate morphological parts. We can document the use of a colon for purposes like this as a standard syntax.

What do you think?
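As a rough illustration of the proposed \rb BB|gg:gg\rb* form, the sketch below splits the gloss on colons and pairs each piece with one base character. It assumes one gloss per base character (counted as one Unicode code point each) and falls back to a single phrase-level gloss when the counts differ. The function name parse_ruby is hypothetical and not part of any USFM tool.

```python
def parse_ruby(usfm_ruby):
    """Pair base characters with colon-separated glosses.

    Assumption (illustrative only): the input is exactly one
    "\\rb BASE|GLOSSES\\rb*" span, and each gloss maps to one
    base character when the counts match.
    """
    prefix, suffix = "\\rb ", "\\rb*"
    if not (usfm_ruby.startswith(prefix) and usfm_ruby.endswith(suffix)):
        raise ValueError("expected a single \\rb ...\\rb* span")
    body = usfm_ruby[len(prefix):-len(suffix)]
    base, _, gloss = body.partition("|")
    glosses = gloss.split(":")
    if len(glosses) == len(base):
        # One gloss per base character.
        return list(zip(base, glosses))
    # Counts differ: treat the whole gloss as annotating the whole phrase.
    return [(base, gloss)]
```

With the placeholder string from this comment, parse_ruby(r"\rb BB|gg:gg\rb*") pairs each "B" with one "gg".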

DavidHaslam Nov 29, 2017

@jag3773 - @klassenjm still awaits your response.

jag3773 Nov 29, 2017

Thanks for the poke @DavidHaslam .

I don't think I follow how the \rb code can help, but I do see how using a colon to separate the pieces could work in the attributes if there is a corresponding separator in the text. For example, we could convert this line:

<w lemma="c/1961" n="1.1.1" morph="HC/Vqw3ms">וַ/יְהִ֗י</w>

to USFM3 as:

\w וַ/יְהִ֗י|strong="c:1961" x-morph="HC:Vqw3ms" \w*

Note that following this model would require that we preserve the forward slash in the text itself, and software would need to know that the forward slash is a non-printing character when the text is displayed. Perhaps it would be more effective to use a Unicode zero-width space (U+200B) instead of the / to separate the word in the text itself? That way the text would display correctly even if the software doesn't recognize the multiple components.
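A minimal sketch of the conversion described above, assuming the OSIS input is a single well-formed <w> element, that "/" separates the parts in OSIS, and that ":" separates them in the USFM attributes. The function name osis_w_to_usfm and the text_sep parameter are hypothetical; text_sep lets the in-text separator be either the original "/" or a zero-width space.

```python
import xml.etree.ElementTree as ET

ZWSP = "\u200b"  # Unicode zero-width space


def osis_w_to_usfm(osis_line, text_sep="/"):
    """Convert one OSIS <w> element to a USFM 3 \\w ...\\w* string.

    Assumptions (not taken from any spec): the lemma attribute feeds
    strong=, morph feeds x-morph=, and other OSIS attributes (such as
    n) are dropped.
    """
    elem = ET.fromstring(osis_line)
    # Keep a separator in the text so parts still line up with attributes.
    text = (elem.text or "").replace("/", text_sep)
    strong = elem.get("lemma", "").replace("/", ":")
    morph = elem.get("morph", "").replace("/", ":")
    return f'\\w {text}|strong="{strong}" x-morph="{morph}" \\w*'


if __name__ == "__main__":
    line = '<w lemma="c/1961" n="1.1.1" morph="HC/Vqw3ms">וַ/יְהִ֗י</w>'
    print(osis_w_to_usfm(line))                 # slash kept in the text
    print(osis_w_to_usfm(line, text_sep=ZWSP))  # zero-width space instead
```

Passing text_sep=ZWSP gives output that renders as an unbroken word in software that knows nothing about the components, at the cost of a separator that is invisible to human editors.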

klassenjm Nov 29, 2017

Contributor

@jag3773 @DavidHaslam Jesse - sorry about the extra background on \rb ...\rb* :-). As you concluded, I was not recommending that marker for use in your situation, but just identifying one other markup addition in USFM 3.0 which required a similar type of separator to identify multiple components in the attribute string.

In the case of ruby, the attributes separated by colons would relate to single ideographs. I see that you would need a separator in the text to identify the relationship to the attribute sequence.

If you can use this : syntax, that would be excellent. I will add this as the recommended syntax for this purpose.

I also want to apologize that the last 10% of the final documentation for USFM 3.0 has been (extremely) delayed. It should have been done ages ago, but things have conspired against that happening. I anticipate this being done in December. I appreciate you accepting the text in these GH issues for now.
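To show how software might consume the colon syntax discussed in this thread, here is a hedged sketch that splits a \w ...\w* span into its components and pairs each text part with the matching piece of every attribute. It assumes the text uses a known separator (by default "/"), that every attribute has the same number of colon-separated pieces as the text has parts, and that attribute values are double-quoted. The name split_compound and both regular expressions are illustrative, not part of any USFM library.

```python
import re

# Matches one "\w TEXT|ATTRS \w*" span (illustrative, not a full USFM parser).
W_PATTERN = re.compile(r'\\w\s+(?P<text>[^|\\]+)\|(?P<attrs>.*?)\s*\\w\*')
# Matches name="value" pairs, allowing hyphens in names such as x-morph.
ATTR_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def split_compound(usfm_word, text_sep="/"):
    """Return one dict per word part, with the matching attribute pieces."""
    match = W_PATTERN.search(usfm_word)
    if match is None:
        raise ValueError("no \\w ...\\w* span found")
    parts = match.group("text").strip().split(text_sep)
    attrs = dict(ATTR_PATTERN.findall(match.group("attrs")))
    components = []
    for index, part in enumerate(parts):
        entry = {"text": part}
        for name, value in attrs.items():
            pieces = value.split(":")
            if len(pieces) != len(parts):
                raise ValueError(
                    f"{name} has {len(pieces)} pieces for {len(parts)} text parts"
                )
            entry[name] = pieces[index]
        components.append(entry)
    return components
```

Raising on a count mismatch is one possible design choice; a more lenient reader could instead attach the whole attribute value to the whole word.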
