Add support for USX 3.0 #39

Merged
merged 31 commits into from
Mar 12, 2021

Conversation

Rolf-Smit
Contributor

@Rolf-Smit Rolf-Smit commented Dec 3, 2020

Implements feature request #38

This adds support for importing and exporting USX 3.0 files.

  • USX 2 > USX 3
  • USX 3 > USX 2 (with some loss of newer features obviously)
  • USX 3 > USFM 2 (with some loss of newer features obviously)
  • USFM 2 > USX 3
  • USFX > USX 3
  • USX 3 > USFX (not supported since USFX export is not supported)

Also the internal format that is used for debugging (ParatextDump) is supported (dump > USX 3 and USX 3 > dump).

I've touched a lot of code, fixed a few bugs, and added many comments. If you have any questions about why I did certain things, please do ask.

The most important thing to mention is that I've chosen to include the end chapter and verse milestones in the internal format, and to make the old formats (USFM 2, USX 2 and USFX) polyfill them, since I think it makes more sense to support the newest format (USX 3) in the best way possible. Also, since there is no perfect way to find out where end chapter/verse milestones must be placed, it is best to preserve them. This polyfilling should not impact conversions such as USX 2 to USFM 2, but if you do an "upgrade" (from USX 2 to 3) it can have some impact, since end chapter/verse milestones are inserted at what the importer thinks is the most appropriate position.
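
To illustrate the polyfill idea, here is a minimal sketch. It is not the code in this PR: the class name, the flat token list and the "v:" / "/v:" markers are hypothetical stand-ins for the real Paratext model, and the placement rule (close a verse right before the next verse start, or at the end of the content) is only an assumed, simplest-possible heuristic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: polyfill verse end milestones for formats that only have verse starts.
class EndMilestonePolyfillSketch {

    // A token "v:<id>" marks a verse start; every other token is treated as text.
    static List<String> insertVerseEnds(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String openVerse = null;
        for (String token : tokens) {
            if (token.startsWith("v:")) {
                // Close the previous verse just before the next one starts.
                if (openVerse != null) {
                    out.add("/v:" + openVerse);
                }
                openVerse = token.substring(2);
            }
            out.add(token);
        }
        // Close the last verse at the end of the content.
        if (openVerse != null) {
            out.add("/v:" + openVerse);
        }
        return out;
    }
}
```

Because the end position is guessed, a file that already carries explicit end milestones keeps more information than one where they had to be polyfilled.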

Few other changes:

  • References now accept chapter ranges and single books as references
  • USFM now always adds the ide attribute/marker, or overrides an existing one with UTF-8 (since this library always writes USFM files in UTF-8).
  • Verse numbers are no longer integers but strings, since they may be in the format 6a or 6-7
  • USFM no longer removes intended whitespace at the end of a line during import (removed \\p{Z} from the regex).

@Rolf-Smit Rolf-Smit mentioned this pull request Dec 3, 2020
4 tasks
@schierlm
Owner

schierlm commented Dec 3, 2020

Hello Rolf,

thank you for the pull request.

Your commits are well structured and the shorter ones are easy to review, yet there are still a few big ones where I'd like to have a closer look and also do some testing :)

I noticed some of the changes are more a matter of taste (e.g. you seem to like to remove implicit modifiers like static for nested enum classes or public for interface methods). I won't reject the pull request for things like this, but they make the diff larger, and I might later (accidentally) add them back :) (I can't tell you why, but I like interface methods to be declared public, but don't mind if they are not declared abstract. Probably dates back to times when I did not always have an IDE available and used grep to find public stuff).

I assume you have made sure the bible snippets you included in the unit tests are covered by a license that allows us to distribute them?

You widened the amount of valid verse references to support even verse numbers that are not covered by Utils.VERSE_REGEX, which will cause exceptions when parsing USFM/USX files that use them. As I would not want to change the regex, probably we should filter the references whenever they "reach" the internal representation format (convert 3a-4b to 3 or 3-4)? Maybe with a warning?

On the other hand, verse numbers which are legal in the internal format cause exceptions when converting to Paratext. Here is an example from Einheitsübersetzung 1980:

Exception in thread "main" java.lang.IllegalArgumentException: Provided verse String is not a valid verse ID: 2/7
    at biblemulticonverter.format.paratext.model.VerseIdentifier.<init>(VerseIdentifier.java:27)
    at biblemulticonverter.format.paratext.model.VerseIdentifier.<init>(VerseIdentifier.java:18)
    at biblemulticonverter.format.paratext.AbstractParatextFormat.doExport(AbstractParatextFormat.java:322)
    at biblemulticonverter.Main.main(Main.java:67)

Which "worked" before your change (no exception thrown, but maybe invalid Paratext output...)

As you noticed correctly, there is no direct support for references to whole books or chapters. Some formats (e.g. Zefania XML or MyBible.Zone) have the convention to rewrite these as e.g. Gen 1.1-999.999 or Gen 3.1-3.999 (i.e. they use chapter or verse 999 to mean "the last one"). Maybe you want to rewrite those references on import the same way (currently you are throwing NullPointerExceptions). It is your decision if you also want to detect these kinds of references on import and rewrite them to the "proper" Paratext format.
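
A minimal sketch of the 999 convention described above, under stated assumptions: the class and method names are hypothetical, the input syntax ("GEN", "GEN 3", "MAT 4-6") is an assumed simplification of what a Paratext reference may look like, and the output simply follows the "Gen 3.1-3.999" notation from the examples. References that already contain verses are passed through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: expand whole-book and whole-chapter references using 999 as "the last one".
class ReferenceExpansionSketch {

    static final int LAST = 999;

    // Matches a single chapter ("3") or a chapter range ("4-6") without verses.
    private static final Pattern CHAPTERS = Pattern.compile("(\\d+)(?:-(\\d+))?");

    static String expand(String reference) {
        String[] parts = reference.trim().split("\\s+", 2);
        String book = parts[0];
        if (parts.length == 1) {
            // Book only: first verse of chapter 1 up to the "last" verse of the "last" chapter.
            return book + " 1.1-" + LAST + "." + LAST;
        }
        Matcher m = CHAPTERS.matcher(parts[1]);
        if (!m.matches()) {
            // The reference already contains verses (or something else); leave it untouched.
            return reference;
        }
        String firstChapter = m.group(1);
        String lastChapter = m.group(2) != null ? m.group(2) : firstChapter;
        return book + " " + firstChapter + ".1-" + lastChapter + "." + LAST;
    }
}
```

For example, expand("GEN 3") yields "GEN 3.1-3.999" and expand("GEN") yields "GEN 1.1-999.999".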

The reason I normalized all \p{Z} to a single space, was to get a simple way of implementing https://ubsicap.github.io/usfm/about/syntax.html#whitespace-normalization
Also some formats will not be able to work with consecutive whitespaces inside a Text element, so you'd have to take care of getting rid of them before you convert to the internal representation (at least), or the AppendVisitor will also throw an exception.

Do you have a real-world example where collapsing \p{Z} causes the wrong result?
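
For readers unfamiliar with the regex class in question, here is a minimal sketch of what collapsing \p{Z} means in Java. The class and method names are hypothetical and this is not the importer's actual code. In Java regexes, \p{Z} covers the Unicode separator categories (ordinary space, no-break space, line and paragraph separators), but not tabs, CR or LF, which are control characters.

```java
// Hypothetical sketch: collapse runs of Unicode separator characters into one space.
class WhitespaceNormalizationSketch {

    static String collapseSeparators(String text) {
        // \p{Z} = Unicode general category "Separator" (Zs, Zl, Zp).
        return text.replaceAll("\\p{Z}+", " ");
    }
}
```

Note that this only collapses; it does not trim. A single trailing space at the end of a text run survives, while a run of several separators is reduced to one space.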

In your test cases, you may also want to test the ParatextDump, that when dumping and restoring the classes in the middle, nothing will get lost.

You may also register your new USX3 format in MainModuleRegistry, otherwise you cannot use it from the command line.

More comments later :)

@Rolf-Smit
Contributor Author

Rolf-Smit commented Dec 4, 2020

I noticed some of the changes are more a matter of taste (e.g. you seem to like to remove implicit modifiers like static for nested enum classes or public for interface methods). I won't reject the pull request for things like this, but they make the diff larger, and I might later (accidentally) add them back :) (I can't tell you why, but I like interface methods to be declared public, but don't mind if they are not declared abstract. Probably dates back to times when I did not always have an IDE available and used grep to find public stuff).

  1. This is true, these are taste changes. My IDE can actually remove those automatically, and warns me about them since interface methods are always public. I can revert these changes if you like. Speaking about taste, I would rather have the whole Paratext part of this library as a separate artefact/module and maybe even use Kotlin instead of Java, mostly because of the null-safety that Kotlin brings to the table. But I guess that's a bit out of scope for this library. However I am considering forking this project to create exactly that, a specialised Paratext converter/library.

I assume you have made sure the bible snippets you included in the unit tests are covered by a license that allows us to distribute them?

  1. Yes they are ours to use, see: May the 001GEN.usx file be freely used in other opensource projects? ubsicap/dbl-archive-validation#47. I can unfortunately not publish the Bibles I'm using, I may distribute those in app form but not the plain USX files.

You widened the amount of valid verse references to support even verse numbers that are not covered by Utils.VERSE_REGEX, which will cause exceptions when parsing USFM/USX files that use them. As I would not want to change the regex, probably we should filter the references whenever they "reach" the internal representation format (convert 3a-4b to 3 or 3-4)? Maybe with a warning?

  1. I do think it makes sense to convert/filter those references when they reach the internal format, I was aiming for lossless conversion between different USFM/USX formats. Let me see what I can do about those.

On the other hand, verse numbers which are legal in the internal format cause exceptions when converting to Paratext. Here is an example from Einheitsübersetzung 1980:

...

Which "worked" before your change (no exception thrown, but maybe invalid Paratext output...)

  1. Maybe we should filter or downgrade these? Just like the previous issue? I prefer not to create invalid Paratext output. I will check how hard it is to fix this.

As you noticed correctly, there is no direct support for references to whole books or chapters. Some formats (e.g. Zefania XML or MyBible.Zone) have the convention to rewrite these as e.g. Gen 1.1-999.999 or Gen 3.1-3.999 (i.e. they use chapter or verse 999 to mean "the last one"). Maybe you want to rewrite those references on import the same way (currently you are throwing NullPointerExceptions). It is your decision if you also want to detect these kinds of references on import and rewrite them to the "proper" Paratext format.

  1. Makes sense to rewrite those if you ask me.

The reason I normalized all \p{Z} to a single space, was to get a simple way of implementing https://ubsicap.github.io/usfm/about/syntax.html#whitespace-normalization
Also some formats will not be able to work with consecutive whitespaces inside a Text element, so you'd have to take care of getting rid of them before you convert to the internal representation (at least), or the AppendVisitor will also throw an exception.

Do you have a real-world example where collapsing \p{Z} causes the wrong result?

  1. Good point, I believe I have a real world sample let me check that. Otherwise we can reverse this. I believe this was causing a difference in one of the ParatextDump compares that I did in one of the tests. Reading the same file in USX 2 vs USFM 2 resulted in a different ParatextDump, because the USFM 2 variant was missing some whitespace.

In your test cases, you may also want to test the ParatextDump, that when dumping and restoring the classes in the middle, nothing will get lost.

  1. I believe I already have something that does this, but an explicit test indeed makes sense.

You may also register your new USX3 format in MainModuleRegistry, otherwise you cannot use it from the command line.

  1. Ah I forgot about that!

More comments later :)

Really appreciated, thanks so far! I'm glad I can add something useful to this library.

@schierlm
Owner

schierlm commented Dec 4, 2020

Maybe we need import and export methods that accept InputStreams/OutputStreams? So we can easily work with resources?

Either that (we already have this for some formats, e.g. Compact or Diffable), or you have to properly handle the exception (IllegalArgumentException I believe) and copy the content into a temp file.

Good suggestion. We do need to make sure that the enum values are in the right order then, and that we keep them that way, or we need to add a simple value ourselves?

I cannot think of any good reason why you would want to sort enum values differently than "by version", but if you can think of one, you'd have to add your own value...

  1. This is true, these are taste changes. My IDE can actually remove those automatically, and warns me about them since interface methods are always public. I can revert these changes if you like. Speaking about taste, I would rather have the whole Paratext part of this library as a separate artefact/module and maybe even use Kotlin instead of Java, mostly because of the null-safety that Kotlin brings to the table. But I guess that's a bit out of scope for this library. However I am considering forking this project to create exactly that, a specialised Paratext converter/library.

There are already separate artefacts for SWORD and SQLite module formats, but mostly because they drag in several megabytes of dependencies which you may not want to download. So the classes/methods should be public enough to actually implement extra formats in a separate module.

Speaking of Kotlin, my dislikes are not technical but political (having effectively a single company in the driving seat, and while they are open source, they are not committed to supporting bootstrapping their compiler if you don't want to use the binaries from them but compile it yourself from source. Not sure, perhaps the latter has changed recently, though). But considering the number of JVM languages that jumped out of the void (Kotlin, Scala, Groovy, Clojure), I am not confident that Kotlin would be the best choice. And the @NonNull support for plain Java in third-party libraries and IDEs also gets better (I'm an Eclipse user, I may see some conflict of interest for JetBrains to add good @NonNull support to IDEA, but I guess that they added it anyway).

But I digress.

I assume you have made sure the bible snippets you included in the unit tests are covered by a license that allows us to distribute them?

  1. Yes they are ours to use, see: ubsicap/dbl-archive-validation#47. I can unfortunately not publish the Bibles I'm using, I may distribute those in app form but not the plain USX files.

Yeah, that is how publishers like to handle it nowadays...

On the other hand, verse numbers which are legal in the internal format cause exceptions when converting to Paratext. Here is an example from Einheitsübersetzung 1980:
...
Which "worked" before your change (no exception thrown, but maybe invalid Paratext output...)

  1. Maybe we should filter or downgrade these? Just like the previous issue? I prefer not to create invalid Paratext output. I will check how hard it is to fix this.

Yes, I agree. Just this time on the "export to Paratext" and not on the "Import from Paratext" path.

  1. Good point, I believe I have a real world sample let me check that. Otherwise we can reverse this. I believe this was causing a difference in one of the ParatextDump compares that I did in one of the tests. Reading the same file in USX 2 vs USFM 2 resulted in a different ParatextDump, because the USFM 2 variant was missing some whitespace.

This is "interesting". I tested how (some old version of) Paratext handles unnormalized whitespace on USFM->USX conversion, and it normalized that. So I assumed that in USX whitespace will always be normalized and decided to do the normalization in the USFM import instead of in AbstractParatextFormat. As you seem to have examples of USX files with unnormalized white space, it may make sense to move the normalization into the "import" path of AbstractParatextFormat (and remove the other normalization in USFM or make it optional via system property, as USFM files are often written in simple text editors and therefore it is more likely for them to contain unnormalized white space than USX). In that case, you should make extra sure to properly sanitize or escape unnormalized whitespace (like tabs or CRLF) when outputting ParatextDump files, otherwise you would have problems importing the file again later.

@Rolf-Smit
Contributor Author

Rolf-Smit commented Dec 8, 2020

Hi Michael,

I'm working on fixing the comments on the PR, regarding the whitespace normalisation:

This is "interesting". I tested how (some old version of) Paratext handles unnormalized whitespace on USFM->USX conversion, and it normalized that. So I assumed that in USX whitespace will always be normalized and decided to do the normalization in the USFM import instead of in AbstractParatextFormat. As you seem to have examples of USX files with unnormalized white space, it may make sense to move the normalization into the "import" path of AbstractParatextFormat (and remove the other normalization in USFM or make it optional via system property, as USFM files are often written in simple text editors and therefore it is more likely for them to contain unnormalized white space than USX). In that case, you should make extra sure to properly sanitize or escape unnormalized whitespace (like tabs or CRLF) when outputting ParatextDump files, otherwise you would have problems importing the file again later.

In the sample file GEN-with-only-USX3-end-milestones.usx you will find the following content:

    <chapter number="1" style="c" sid="GEN 1" />
    <para style="p">
        <verse number="1" style="v" sid="GEN 1:1" />In the beginning, God<note caller="+" style="f"><char style="fr" closed="false">1:1 </char><char style="ft" closed="false">The Hebrew word rendered “God” is “אֱלֹהִ֑ים” (Elohim).</char></note> created the heavens and the earth. <verse eid="GEN 1:1" /><verse number="2" style="v" sid="GEN 1:2" />The earth was formless and empty. Darkness was on the surface of the deep and God’s Spirit was hovering over the surface of the waters.<verse eid="GEN 1:2" /></para>

Right after the "created the heavens and the earth." part you will find a space, just before the verse end-milestone. The USX readers do read this space, and also output the space. However when exporting from USX to USFM 2 and then reading again from USFM 2 you will lose that space. This makes a "roundtrip" from USX to USFM and back produce a different USX output (it misses the space), you can test this by executing the test_USFM2_end_milestone_insertion test while using the \p{Z} option in the Regex.

So maybe it is indeed better to do the normalisation also for USX? Or in the AbstractParatextFormat? Or maybe even while creating ParatextCharacterContent.Text objects?

Edit: I tried "Or maybe even while creating ParatextCharacterContent.Text objects?" and that seems to work perfectly, which makes sense. Advantage is that this way we cannot forget about doing the normalisation since it is always there. I will commit my changes (basically reverting the removal of \p{Z} while importing USFM, and adding normalisation to all Text elements) tomorrow or maybe even this evening.

@schierlm
Owner

schierlm commented Dec 8, 2020

regarding the whitespace normalisation

When looking at the sample file you mentioned, it seems to me (and I have verified with a bunch of regex searches) that this file follows a convention that every verse that is not the last one of a paragraph will contain trailing whitespace. So the whitespace is not actually significant, but it is there to make it possible to convert the text to unstructured formats like HTML or EPUB without having to special-case verses at end of paragraph vs. others.

I do not know how prevalent this convention is, but maybe it even makes sense to enable this on USX export, (maybe based on a system property)? So that if you want to have this style of USX, you can get it from every Bible.

verse number transformation

The general implementation looks good to me (from a quick glance), however there are two points

  1. I believe you should perform the same or similar transformation when handling references / cross-references. But probably you are still working on that point as we also had some discussion about how to handle references to a single book only. [I am not sure if there is a better way to communicate whether the PR is "finished". I know there is a "convert to draft" link for reviewers so they can make clear the PR is not ready, but I'm not sure if that is also accessible to the submitter.]

  2. I am not confident that you should convert verse numbers like 2/7 or 2.7, or 2,7 to 2-7 without a warning, yet throw an exception on 1.3.7 or 10/12G. One could always work around it by first converting to StrippedDiffable ChangeVerseStructure -range (so that in these cases, verses get merged until they form distinct ranges), which would always make it possible to convert the file as long as chapter verses are not out of order, but perhaps just showing a warning in either case (reducing the verse number to the longest valid prefix in the second case) would make more sense.

For the record, while the usage of separators in verse numbers are not really uniform, here are some cases I've seen in practice (and this is how I try to treat these separators in cases where it matters):

  • 2-7: (obviously) the range from verse 2 to verse 7
  • 2.7: verses 2 and 7 (without 3-6 in between)
  • 2/7: either (more common) the same as 2.7, or a verse which is number 2 in one versification and 7 in another one
  • 2G: a verse that only existed in the Greek version of the book (mostly apocrypha/deuterocanonical books), which is different from 2g, which would be the 7th part of verse 2
  • 10/12G: verse 10 in the Hebrew and 12 in the Greek version
  • Isa 40:41,6 (between Isa 40:20 and Isa 40:21): a verse that got reordered into the middle of another chapter. The part before the comma is the original chapter number.

I am aware that it is not 100% possible to represent those in USFM/USX (some may be possible by supporting \va tag, but that case is rare enough that I did not want to spend time on it).

@schierlm
Owner

schierlm commented Dec 8, 2020

Sorry about the force-push, I tried to add CI to the repo and made a typo in the .yaml file. If you did not pull in the last few minutes, you should not be affected.

@Rolf-Smit
Contributor Author

Rolf-Smit commented Dec 9, 2020

I believe you should perform the same or similar transformation when handling references / cross-references. But probably you are still working on that point as we also had some discussion about how to handle references to a single book only. [I am not sure if there is a better way to communicate whether the PR is "finished". I know there is a "convert to draft" link for reviewers so they can make clear the PR is not ready, but I'm not sure if that is also accessible to the submitter.]

This is now implemented, I would like to write some more test cases but I think this is OK for now.

Regarding the PR and when it is finished. I will make sure to give you a clear message to tell you "I'm done" ;) Currently working on this in my spare time.

I am not confident that you should convert verse numbers like 2/7 or 2.7, or 2,7 to 2-7 without a warning, yet throw an exception on 1.3.7 or 10/12G. One could always work around it by first converting to StrippedDiffable ChangeVerseStructure -range (so that in these cases, verses get merged until they form distinct ranges), which would always make it possible to convert the file as long as chapter verses are not out of order, but perhaps just showing a warning in either case (reducing the verse number to the longest valid prefix in the second case) would make more sense.

I will improve the handling of 2/7, 2.7 and 2,7 (and also 2-7), with a warning. Your comments are helpful here, since I know quite a bit about USX/USFM but my knowledge of other formats is lacking.

This seems to be another case where Eclipse and IDEA auto-format differently:

Will fix that!

@schierlm
Owner

schierlm commented Dec 9, 2020

This is now implemented, I would like to write some more test cases but I think this is OK for now.

ChapterRange should also be possible? MAT 4:1-6:999

Currently working on this in my spare time.

Same applies to me :)

@Rolf-Smit
Contributor Author

@schierlm

I want to finish up the verse conversions, but I'm wondering how to handle the different cases, what do you think about this:

  • 2-7 > 2-7 without warning
  • 2.7 or 2/7 > We can either turn these into 2-7 or just 2 both with a warning, but which style do we take?
  • 2G > 2 with a warning
  • 2g > 2g without a warning
  • 10/12G > Either 10-12 or just 10, both with a warning.
  • Isa 40:41,6 > Exception?

@Rolf-Smit
Contributor Author

Rolf-Smit commented Dec 13, 2020

When looking at the sample file you mentioned, it seems to me (and I have verified with a bunch of regex searches) that this file follows a convention that every verse that is not the last one of a paragraph will contain trailing whitespace. So the whitespace is not actually significant, but it is there to make it possible to convert the text to unstructured formats like HTML or EPUB without having to special-case verses at end of paragraph vs. others.

I do not know how prevalent this convention is, but maybe it even makes sense to enable this on USX export, (maybe based on a system property)? So that if you want to have this style of USX, you can get it from every Bible.

I have now applied the whitespace normalisation that was used in USFM 2 to USX 2 and USX 3, so every format now normalises the whitespace, see: 7a234bb

Is adding the option to add a whitespace to every verse that is not at the end of a paragraph something you want to implement in this PR?

@schierlm
Owner

  • 2.7 or 2/7 -> We can either turn these into 2-7 or just 2 both with a warning, but which style do we take?

I would go with 2 only. Same for 2.4.7 -> 2. In general, I think Bible software can cope better with missing verse numbers (7 in this example) than with duplicate verse numbers (3-6).

  • 10/12G > Either 10-12 or just 10, both with a warning.

10

  • Isa 40:41,6 > Exception?

I would prefer no exception, but probably nothing much better is available... I guess the only other option would be 6 with a warning... When converting from a binary format, an exception would mean that you have to do an intermediate conversion to some text-based format to "fix" the number before attempting the conversion again. In case of a warning, you can edit the verse number directly in the output format.
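
The rules discussed so far can be sketched as follows. This is only an illustration under assumptions, not the PR's implementation: the class name, method name and regex are hypothetical. A leading range (a-b) is kept, anything after another separator (. or /) is dropped with a warning, and a trailing G is stripped with a warning. The comma case (41,6) is deliberately rejected with an exception here, because its handling is still open in this discussion.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: reduce an internal verse number to something Paratext accepts.
class VerseNumberSketch {

    // group 1/2: first number and optional "G"; group 3/4: optional range end and "G";
    // group 5: any further ".", "/" or "-" separated parts, which get dropped.
    private static final Pattern VERSE = Pattern.compile(
            "([1-9][0-9]*[a-z]?)(G?)(?:-([1-9][0-9]*[a-z]?)(G?))?((?:[/.-][1-9][0-9]*[a-z]?G?)*)");

    static String downgrade(String internal, List<String> warnings) {
        Matcher m = VERSE.matcher(internal);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported verse number: " + internal);
        }
        StringBuilder result = new StringBuilder(m.group(1));
        if (m.group(3) != null) {
            result.append('-').append(m.group(3));
        }
        // Anything that was stripped (a "G" marker or trailing parts) triggers a warning.
        boolean lossy = !m.group(2).isEmpty()
                || (m.group(4) != null && !m.group(4).isEmpty())
                || !m.group(5).isEmpty();
        if (lossy) {
            warnings.add("Verse number " + internal + " reduced to " + result);
        }
        return result.toString();
    }
}
```

With these rules, 2-7 and 2g pass through unchanged without a warning, while 2.7, 2/7 and 2.4.7 become 2, 2G becomes 2, and 10/12G becomes 10, each with a warning.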

I have now applied the whitespace normalisation that was used in USFM 2 to USX 2 and USX 3, so every format now normalises the whitespace, see: 7a234bb

Is adding the option to add a whitespace to every verse that is not at the end of a paragraph something you want to implement in this PR?

I don't mind. As I understood it, you added the disabling of trimming trailing whitespace so that your test case bible could be roundtrip converted. If you remove this, it would mean that your testcases would not roundtrip convert. So if you want to not break your test cases, you'd have to add that option back in this PR. Also, I have not seen that style in USX 2.0 (but I haven't seen very many USX 2.0 files either), so it would fit into the theme of adding USX 3.0 support. If you prefer to not do it in this PR, I'm fine with it as well.

@Rolf-Smit
Contributor Author

I've implemented the warnings for converting verse numbers from the internal format to Paratext, I also fixed the conversion of 6.8, 6,8 and 6/8 which becomes 6 in Paratext. Also added some extra tests.

The only thing I left out, and which still throws an exception, is Isa 40:41,6, because as I understand it the verse number will be 41,6. How can I tell that apart from two separate verses such as 6,8 (verse 6 and verse 8 combined into one verse)? I guess I need some guidance again XD

@schierlm
Owner

Because how I understand it the verse number will be 41,6. How can I tell that apart from two separate verses such as 6,8 (verse 6 and verse 8 combined into one verse)?

The former would be 41,6 (comma), the latter would be 6.8 (period/dot).

// 5/7
// 5/7/9
// 5.6G
Matcher matcher = Pattern.compile("([1-9][0-9a-zG]*)(?:([,/.-])([1-9][0-9a-zG]*))*").matcher(internalNumber);
Owner

@schierlm schierlm Dec 14, 2020

I am not 100% sure if these capturing groups inside of noncapturing groups behave as you would expect (or whether it is even specified in Java). If they do, please add explicit test cases for 1-4.7 and 1.4-7 (resulting in 1-4 and 1 respectively).

One special case I forgot, which comes from Accordance, is 1/t, referring to the title "verse" of a Psalm (if the psalm starts with verse 1 after the title verse). But rewriting this to 1 should be fine (it would also be fine to special case it to "0" or whatever is usually used in USX for this verse number).
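
On the question of capturing groups inside a repeated group: the java.util.regex.Pattern documentation states that the captured input associated with a group is always the subsequence the group most recently matched. A small sketch to check this with the pattern quoted above (the class and helper names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: which values do groups inside a repeated group hold after matching?
class RepeatedGroupDemo {

    static final Pattern VERSE =
            Pattern.compile("([1-9][0-9a-zG]*)(?:([,/.-])([1-9][0-9a-zG]*))*");

    // Returns group 1 (first number), group 2 (separator) and group 3 (number after separator).
    static String[] groups(String input) {
        Matcher m = VERSE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("No match: " + input);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

For 1-4.7 this returns 1, "." and 7, and for 1.4-7 it returns 1, "-" and 7: groups 2 and 3 only hold the last repetition. So the first separator is not available from group(2) once the input has more than two parts, which is why explicit test cases for 1-4.7 and 1.4-7 are worth having.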

@Rolf-Smit
Contributor Author

Rolf-Smit commented Jan 6, 2021

I'm back from some time off and started working on a problem that I find hard to solve (with my limited JAXB experience):

Sample text:

	<para style="p">
		<verse number="28" style="v" sid="1KI 2:28" />Toen dit gerucht Joab bereikte – Joab had zich immers achter Adonia geschaard, maar achter Absalom had hij zich niet geschaard – vluchtte Joab naar de tent van de <char style="nd">HEERE</char> <note caller="-" style="x"><char style="xo" closed="false">2:28 </char><char style="xt" closed="false"><ref loc="1KI 1:50">1 Kon. 1:50</ref></char></note>en greep de hoorns van het altaar vast.<verse eid="1KI 2:28" /></para>

The resulting Para.getContent() misses one critical piece of information after parsing: the single space between <char style="nd">HEERE</char> and <note caller="-" style="x">.

This space is for obvious reasons quite important and should be preserved, but somehow JAXB is not even returning it as String. I tried to use xml:space="preserve" but that did not seem to have any effect (also I don't like adding this to the document). Could it be that we need to adjust the XSD file in order for whitespace to be preserved within the para element?

I found many topics on the internet about roughly the same thing, but none with an actual solution. A similar issue: https://bugs.eclipse.org/bugs/show_bug.cgi?id=453934

@schierlm
Owner

schierlm commented Jan 6, 2021

This now definitely was a weird issue. I know from my experience that mixed content usually preserves white space text nodes, and other content does not (only exception if you pass JAXB a DOM source where whitespace has already been eliminated). So I pulled your branch and made a small test case to pinpoint the issue. I was surprised when the test passed in the Eclipse debugger (without changes). Then I ran the test in Maven, and it failed.

Turns out I have set my Eclipse launch configuration to not shadow JRE bundled libraries like JAXB. Disabling this option (which I still had enabled from times when I developed Java applets for Java 6...) made the test fail consistently.

So in Eclipse I used the JAXB 2.2.8 bundled in the JRE 8, while Maven used the one in Maven central. And so I learned that they are not the same.

I updated JAXB (a small step) from 2.2.8 to 2.2.11, and the issue is gone for me. Can you please merge/cherry-pick/rebase 729ca73 and re-test?
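
Independently of JAXB, one way to confirm that the space is a text node of its own in the mixed content is a plain DOM parse with the JDK's built-in parser. This is a minimal sketch, not one of the project's tests; the class and method names are hypothetical and the XML passed in would be a shortened version of the sample paragraph.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical check: find the text node sitting between two sibling elements.
class MixedContentWhitespaceCheck {

    // Returns the text between a <before> element and the <after> element that follows it,
    // looking only at direct children of the root element; null if there is no such text node.
    static String textBetween(String xml, String before, String after) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 1; i + 1 < children.getLength(); i++) {
                Node current = children.item(i);
                if (current.getNodeType() == Node.TEXT_NODE
                        && before.equals(children.item(i - 1).getNodeName())
                        && after.equals(children.item(i + 1).getNodeName())) {
                    return current.getNodeValue();
                }
            }
            return null;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling textBetween(xml, "char", "note") on the sample paragraph should return the single space, which shows the information is present in the XML itself and was lost on the JAXB side.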

…tent.Text

In many if not all cases a single whitespace before the text or after the text is to be expected, for example when two text pieces are separated by a note or text style.

#38
@Rolf-Smit
Contributor Author

Thanks for the quick action, I can confirm (with my own unit test) that 729ca73 fixes the issue.

Small update:
2 weeks ago I pushed an update to around 10.000 app users with 3 Bibles in USX 3 format (converted to USFM), this is so far the only issue found by my users (the old USX 2 Bibles did not have these whitespace cases).

I will try to finish up this PR soon, I have added the option to preserve spaces found at the end of lines in USFM, and will push that soon.

@petervdschelde

Thanks for the quick action, I can confirm (with my own unit test) that 729ca73 fixes the issue.

Small update:
2 weeks ago I pushed an update to around 10.000 app users with 3 Bibles in USX 3 format (converted to USFM), this is so far the only issue found by my users (the old USX 2 Bibles did not have these whitespace cases).

I will try to finish up this PR soon, I have added the option to preserve spaces found at the end of lines in USFM, and will push that soon.
@Rolf-Smit See you have added the support for USX3. I need some help with the MultiBibleConvertor to have it running for converting USX3 to USFM as well as the OnLineBible .Exp format. I am well informed about the last format, have some knowledge on USFM, but USX is new for me. The MultiBibleConvertor is new for me as well. I obtained USX3 files from DBL and these need to be converted.
Maybe I do something wrong. I am Dutch, so maybe we can have direct contact by phone? (content@onlinebible.org)

@schierlm
Owner

schierlm commented Mar 5, 2021

@Rolf-Smit Getting back on topic, you wrote that you want to finish up the PR soon, but I did not get any more feedback. So I am not sure at the moment if you believe this PR to be finished already or if you are still doing improvements. Or if you just did not find any time recently to finish it up. Also I'd like to make sure we both believe the same thing to be open.

I don't know all open items off the top of my head, but proper registration in MainModuleRegistry (should be a one-liner) is currently missing, so @petervdschelde would in fact not be able to use it out of the box to convert USX3 to OnLineBible. This is a blocker for me to get this landed.

Also, the combined verse number handling should be clarified before this can be landed.

If you don't think you have time/interest in fixing this in the near future, I can probably find some free time around Easter to fix this up, test it, and merge it.

@Rolf-Smit
Contributor Author

Rolf-Smit commented Mar 6, 2021

Sorry, unfortunately I have been extremely busy in my private life. I could have finished this sooner.

@schierlm from my side the PR is finished:

  • I updated the registry so USX3 can now be used from the CLI;
  • I fixed the combined verse number conversion and added tests for 1-4.7 and 1.4-7;
  • I updated the readme and command line help texts.

If there is anything else, let me know.

@petervdschelde I have seen your email (and your message here, for that matter). We are basically direct competitors (at least in the Dutch market), but since we are both non-profit and working towards the same goals, I'm extremely glad to help out, so expect an email from me soon ;)

The bullet points in Supported Formats should be rather short. Details
about the formats come later in the README.

When there is more than one verse in the same paragraph, the separator
text is added between them. Depending on the input file, you may want to
set this to one or more spaces.

The first and the last branch were actually the same, and can be easily
handled the same way by making the Regexp only match "-".

When a verse contains a chapter number at the start, the current logic
used the chapter number as verse number. Prefer using the verse number
instead. Both are wrong, but the verse number may require a lower amount
of post-editing.

@schierlm
Owner

schierlm commented Mar 7, 2021

Sorry if I rushed you, I did not intend to do so.

I just skimmed over the conversation and fixed the things I wanted to still fix, and pushed those to your branch. In case you find time, feel free to review. If not, you do not have to.

When running a few test conversions, I found an issue where USX3 export creates invalid XML and then dies with a validation exception. This happens if there is a footnote that contains a (cross) reference. I am not sure how to fix it, whether it is better to update the schema to allow this case, or to skip the reference tagging in this case. Do you have any experience with that?

See attached footnote-ref.zip.

Command line was: java -jar BibleMultiConverter.jar Diffable input.bmc USX3 . #-*

However, this is not a regression as USX2 export in current version shows the same behaviour.

So I also believe now that everything should be fine for merging.

I will run a few more test conversions (mainly from other formats to Paratext formats) in the following days and if they do not show any severe regressions, I will merge it.

Thank you again for your work on these formats!

@Rolf-Smit
Contributor Author

I checked your commits, those are some good improvements, code looks good to me.

As for the reference contained within a footnote:
In USX there is not really a distinction between a cross reference and a footnote: both use the <note> element, which may not be nested (since that would create invalid USX). However, we may be able to use the <ref> element to append those references to the footnote text element (ft)?
https://ubsicap.github.io/usx/master/notes.html#ft
https://ubsicap.github.io/usx/master/elements.html#ref

Open a new issue for this?

When importing to a Paratext format, footnote content is generally not
wrapped in \ft or \xt. Therefore synthesize them if the footnote does not
start with a `\f?` or `\x?` tag.

When exporting to a non-Paratext format, if the footnote content starts
with a `\ft?` or `\xt` tag, skip that tag and just export its contents.

This avoids invalid USX3 export when exporting a footnote that contains
references without the footnote text being wrapped as footnote text.
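
The wrap-on-import and strip-on-export idea described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's actual implementation: footnote content is modelled as a plain USFM-like string, and the class name, method names and the `crossReference` flag are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; BibleMultiConverter's real code works on its own data model.
final class FootnoteTagSketch {

    // A leading footnote or cross-reference character tag such as \ft, \fr, \xt or \xo.
    private static final Pattern LEADING_NOTE_TAG = Pattern.compile("^\\\\[fx][a-z]+\\s");

    // A leading \ft or \xt tag together with the whitespace that follows it.
    private static final Pattern LEADING_TEXT_TAG = Pattern.compile("^\\\\[fx]t\\s+");

    /** Towards a Paratext format: add \ft (or \xt) unless the content already starts with a note tag. */
    static String wrapForParatext(String content, boolean crossReference) {
        if (LEADING_NOTE_TAG.matcher(content).find()) {
            return content;
        }
        return (crossReference ? "\\xt " : "\\ft ") + content;
    }

    /** Towards a non-Paratext format: drop a leading \ft or \xt tag and keep its contents. */
    static String unwrapForOtherFormats(String content) {
        return LEADING_TEXT_TAG.matcher(content).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(wrapForParatext("See Gen 1:1", false));
        System.out.println(unwrapForOtherFormats("\\ft See Gen 1:1"));
    }
}
```

With both directions in place, a footnote that passes through a Paratext format and back ends up without an extra wrapper tag.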
@schierlm
Owner

schierlm commented Mar 8, 2021

> However we may be able to use the <ref> element to append those references to the footnote text element (ft)?

Thanks, I must have totally missed that footnotes in USFM/USX do not directly contain the text but generally wrap their text in tags like ft etc.

I now added code when converting to Paratext formats, to synthesize the \ft or \xt tag, and to strip it again in case a footnote starts with that tag while exporting to another format.

My test export that failed previously passes now. So consider this resolved.

@Rolf-Smit
Contributor Author

@schierlm I feel like this PR is now ready to be merged, anything else that needs to be done?

When the input bible contains dictionary entries or unknown CSS
formatting, `ParatextExportVisitor` reused the same visitor, which
resulted in multiple VerseEnd elements in the Paratext content
(resulting in multiple <verse eid="..."> tags in USX3).
@schierlm
Owner

Sorry, it took me longer than expected (I am also doing this in my free time, which sometimes may be a bit more limited than I'd like) to run through my own test Bibles, export them to the affected formats, and have a look at the output. I spotted one more bug in the process, which I fixed, so now this is really ready to merge. In case I find more bugs later, I can fix them outside this PR as well.

Thanks for your patience :-)

@schierlm schierlm merged commit 09dbfd0 into schierlm:master Mar 12, 2021
schierlm added a commit that referenced this pull request Mar 24, 2021
Some locations still expected to see "denormalized" references
(lastChapter == firstChapter instead of lastChapter == -1). Fix those
that I could find quickly.

Also fix NullPointerException when importing a USX file with
references to books not contained in that bible.

See #41.
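
As a rough illustration of the two encodings mentioned in this commit message, the sketch below shows how a single-chapter reference could be normalized so that `lastChapter == -1`, and how code can read the last chapter regardless of encoding. The class, its fields and its methods are hypothetical; the project's real reference class may look quite different.

```java
// Hypothetical value type for illustration only.
final class ChapterRefSketch {

    static final int NO_LAST_CHAPTER = -1;

    final int firstChapter;
    final int lastChapter;

    ChapterRefSketch(int firstChapter, int lastChapter) {
        this.firstChapter = firstChapter;
        this.lastChapter = lastChapter;
    }

    /** Collapses the denormalized form (lastChapter == firstChapter) into lastChapter == -1. */
    ChapterRefSketch normalized() {
        if (lastChapter == firstChapter) {
            return new ChapterRefSketch(firstChapter, NO_LAST_CHAPTER);
        }
        return this;
    }

    /** Last chapter actually covered, whichever encoding is in use. */
    int effectiveLastChapter() {
        return lastChapter == NO_LAST_CHAPTER ? firstChapter : lastChapter;
    }

    public static void main(String[] args) {
        ChapterRefSketch single = new ChapterRefSketch(3, 3).normalized();
        System.out.println(single.lastChapter + " / " + single.effectiveLastChapter());
    }
}
```

Callers that go through an accessor like `effectiveLastChapter()` would not need to care which of the two forms they receive.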
@schierlm schierlm added this to the v0.0.8 milestone Mar 26, 2021
schierlm added a commit that referenced this pull request Mar 31, 2021
In commit a1b60dc (part of #39)
encoding of VerseStart and VerseEnd was enhanced, but not in a
compatible way for import and export.

Dumps created with the broken version may need fixing: VERSE needs one
less tab, VERSE-END needs one more tab.
schierlm added a commit that referenced this pull request Apr 2, 2021
- Verse starts and verse ends may be inside table rows
- Verse end may not be before verse start
- In case the verse start is in an invalid location (e.g. prolog or
  headline), do not try to inject a verse end marker and print a
  warning instead.

Also fix table row and table cell export for USX2/USX3, and add a
missing break; statement when importing chapter marks from ParatextDump.

See #39.
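
The rule that a verse end may not come before its verse start can be checked with a small pass over the milestones. The sketch below is only an assumption-laden illustration: milestones are modelled as strings of the form `start:<id>` and `end:<id>`, which is not the project's internal representation, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical checker for illustration; not part of BibleMultiConverter.
final class VerseMilestoneSketch {

    /** Returns the ids of verse ends that appear without a preceding, still open verse start. */
    static List<String> endsBeforeStart(List<String> milestones) {
        Set<String> open = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (String milestone : milestones) {
            if (milestone.startsWith("start:")) {
                open.add(milestone.substring("start:".length()));
            } else if (milestone.startsWith("end:")) {
                String id = milestone.substring("end:".length());
                // remove() returns false when no matching start has been seen yet.
                if (!open.remove(id)) {
                    problems.add(id);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> milestones = new ArrayList<>();
        milestones.add("start:GEN 1:1");
        milestones.add("end:GEN 1:1");
        milestones.add("end:GEN 1:2");
        milestones.add("start:GEN 1:2");
        System.out.println(endsBeforeStart(milestones));
    }
}
```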
@shadow-light shadow-light mentioned this pull request Jul 11, 2022