Formal grammar #85

Open
jaakristioja opened this Issue Mar 18, 2019 · 6 comments

jaakristioja commented Mar 18, 2019

I'm having trouble understanding the exact grammar of even very basic USFM, and I think the specification is very vague on this. For example, I don't understand whether a USFM file must start immediately with a \ or not; what the overall structure of the file is (whether some part of the file is considered a header, a body, etc.); where exactly certain markers can occur (e.g. can one use \ide to switch the character encoding mid-document?); whether certain identification markers are compulsory or optional; and so on.

Would it be possible to amend the specification with more formal grammar rules, e.g. written in BNF, EBNF or similar? This would make the specification far less ambiguous, and easier for developers like me to write correct parsers.

Thanks!


cmahte commented Mar 18, 2019

If you look at the usfm.sty files that are usually available wherever the spec is, the stylesheet contains a bit more information about where each tag is valid. Each marker has an \OccursUnder field. This should help guide you.

These \OccursUnder fields aren't part of the spec because they are customizable, but if you design for the default, anyone using a custom.sty file typically already knows that customizing the stylesheet puts them outside any formal expectation of full support.
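
For anyone who wants to read those fields programmatically, here is a minimal sketch in Python. It assumes each stylesheet record starts with a \Marker line and that \OccursUnder holds a space-separated list of parent markers. The function name and the sample records are hypothetical and are not copied from the real usfm.sty.

```python
def parse_occurs_under(sty_text):
    """Map each marker to the list of markers it may occur under.

    Assumption: a record begins with a \\Marker line, and \\OccursUnder
    carries a space-separated list of parent markers.
    """
    allowed = {}
    current = None
    for raw in sty_text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a field
        field, _, value = line.partition(" ")
        if field == "\\Marker":
            current = value.strip()
            allowed[current] = []
        elif field.lower() == "\\occursunder" and current is not None:
            allowed[current] = value.split()
    return allowed


# Hypothetical sample records, for illustration only.
SAMPLE = r"""
\Marker id
\Name id - File - Identification
\OccursUnder
\Marker c
\OccursUnder id
\Marker v
\OccursUnder p q
"""

rules = parse_occurs_under(SAMPLE)
print(rules)
```

A parser could then consult `rules` to check whether a marker appears under an allowed parent, with the caveat above that a customized stylesheet changes the answer.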


cmahte commented Mar 18, 2019

But a USFM file does always start with the \id tag, and the \id tag must always be immediately followed by the 3-character book code. This should be (and at one time was) in the specification.

However, you CAN have multiple \id lines in a single USFM file, and the USFM remains valid. The specification doesn't say whether this is allowed or not, but my testing and queries on the subject suggest there is nothing invalid about a 2nd or 66th \id line in a single file.
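
As a rough illustration of the \id rule described above, here is a small Python sketch. It assumes the book code is exactly three uppercase letters or digits and treats everything after the code as free text. The function name and the character class are my own assumptions, not something taken from the specification.

```python
import re

# Assumption: a book code is three uppercase letters or digits (e.g. GEN, MAT).
ID_LINE = re.compile(r"^\\id ([A-Z0-9]{3})(?:\s+(.*))?$")


def parse_id_line(line):
    """Return (book_code, trailing_text), or None if this is not an \\id line."""
    match = ID_LINE.match(line.rstrip())
    if match is None:
        return None
    return match.group(1), match.group(2) or ""


print(parse_id_line(r"\id GEN Some free text after the code"))
print(parse_id_line(r"\h Genesis"))
```

Calling this once per line would also find the 2nd or 66th \id line in a file, since it makes no assumption about where in the file the line sits.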


cmahte commented Mar 18, 2019

I agree with Jaak that at least the introductory tags \id \periph \usfm \ide \h \toc1,2,3 should have a more formal order defined in the USFM specification. There's no reason not to do so, and not having a defined order makes parsing files much more complicated.

jaakristioja (Author) commented Mar 18, 2019

I'm not sure I understand the \OccursUnder logic, nor the exact relation between these style sheets and USFM. The analogy which comes to my mind is HTML and CSS, where CSS only specifies some additional presentational properties for the document. But \Marker, \Name, \Description, \OccursUnder, \Rank etc in these style sheets seem to indicate that these style sheets are more to USFM than CSS is to HTML.

Is there a specification for the style sheets as well? I was unable to locate a reference to it in the USFM spec.

Would something like the following be valid USFM?

\id MAT Doesn't matter what I write here, because
\id GEN the specification doesn't seem to specify
\id GEN a strict format for these strings after the <CODE>.

A good formal grammar for USFM could rectify most such ambiguities (but not all).


cmahte commented Mar 18, 2019

As far as I know, that is valid USFM... unless both GEN sections contain the same chapter number. I think any duplicate pre-chapter-1 material (any tag other than a \c coming after the \id) would make this a duplicate book as well:

\id GEN
\h Genesis
\mt1 Genesis
\id GEN
\c 1
\id GEN
\c 2

is valid, but

\id GEN
\mt1 Genesis
\c 1
\id GEN
\mt1 Genesis 
\c 2 

is not valid, and nor is

\id GEN
\c 1
\v 1
\id GEN
\c 1 
\v 2
etc.

nor is

\id GEN
\c 1
\c 1

Any repeated combination of \id book code and \c chapter number invalidates the file. This includes "chapter zero", the introductory material that comes before the first \c.

However, I don't represent any official USFM body. Any comments that disagree with this likely carry more weight than my understanding.
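
The rule above can be sketched as a small Python check. This only encodes my reading of the examples in this comment, not an official USFM rule, and the function name is hypothetical. It assumes every marker starts its own line, and it treats any marker other than \c that follows an \id as "chapter zero" material.

```python
def find_duplicate_sections(usfm_text):
    """Return the (book, chapter) pairs that occur more than once.

    Chapter 0 stands for introductory material between an \\id line
    and the first \\c line that follows it.
    """
    seen = set()
    duplicates = []
    book = None
    chapter = 0
    intro_recorded = False

    def record(key):
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)

    for raw in usfm_text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # plain text line, not a marker
        parts = line.split()
        marker = parts[0][1:]
        arg = parts[1] if len(parts) > 1 else ""
        if marker == "id":
            # A new \id line starts a fresh section for that book.
            book, chapter, intro_recorded = arg, 0, False
        elif book is None:
            continue  # nothing to attribute this marker to yet
        elif marker == "c" and arg.isdigit():
            chapter = int(arg)
            record((book, chapter))
        elif chapter == 0 and not intro_recorded:
            # First introductory marker after this \id line.
            intro_recorded = True
            record((book, 0))
    return duplicates


valid = "\\id GEN\n\\h Genesis\n\\mt1 Genesis\n\\id GEN\n\\c 1\n\\id GEN\n\\c 2"
invalid = "\\id GEN\n\\mt1 Genesis\n\\c 1\n\\id GEN\n\\mt1 Genesis\n\\c 2"
print(find_duplicate_sections(valid))
print(find_duplicate_sections(invalid))
```

An empty list means no repeated sections were found. Text on the \id line itself is not counted as chapter-zero material here, which matches the earlier three-line \id example being treated as valid.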

klassenjm added a commit that referenced this issue Mar 19, 2019

klassenjm (Contributor) commented Mar 19, 2019

@cmahte Thank you for your helpful responses for Jaak.

I agree that the current documentation is not sufficient as a grammar. As Michael mentions, the usfm.sty stylesheet contains some additional definitions and, as suggested, is more to USFM than CSS is to HTML.

I have added a basic description of the stylesheet properties to README.md in the sty folder. Take a look there.

Also, in case it assists, let me refer you to a more formal grammar for use in checking USFM 3 content which is being developed here.
