-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for USX 3.0 #39
Conversation
Hello Rolf, thank you for the pull request. Your commits are well structured and the shorter ones are easy to review, yet there are still a few big ones where I'd like to have a more closer look and also do some testing :) I noticed some of the changes are more of taste (e.g. you seem to like to remove implicit modifiers like I assume you have made sure the bible snippets you included in the unit tests are covered by a license that allows us to distribute them? You widened the amount of valid verse references to support even verse numbers that are not covered by On the other hand, verse numbers which are legal in the internal format cause exceptions when converting to Paratext. Here is an example from Einheitsübersetzung 1980:
Which "worked" before your change (no exception thrown, but maybe invalid Paratext output...) As you noticed correctly, there is no direct support to references of whole books or chapters. Some formats (e.g. Zefania XML or MyBible.Zone) have the convention to rewrite these as e. g. Gen 1.1-999.999 or Gen 3.1-3.999 (i.e they use chapter or verse 999 to mean "the last one"). Maybe you want to rewrite those references on import the same way (currently you are thowing NullPointerExceptions). It is your decision if you also want to detect these kinds of references on import and rewrite them to the "proper" Paratext format. The reason I normalized all Do you have a real-world example where collapsing \p{Z} causes the wrong result? In your test cases, you may also want to test the ParatextDump, that when dumping and restoring the classes in the middle, nothing will get lost. You may also register your new USX3 format in MainModuleRegistry, otherwise you cannot use it from the command line. More comments later :) |
biblemulticonverter/src/main/java/biblemulticonverter/format/paratext/model/Version.java
Show resolved
Hide resolved
biblemulticonverter/src/test/java/biblemulticonverter/format/paratext/USX3Test.java
Outdated
Show resolved
Hide resolved
Really appreciated, thanks so far! I'm glad I can add something useful to this library. |
Either that (we already have this for some formats, e.g.
I cannot think of any good reason why you would want to sort enum values differently than "by version", but if you can think of one, you'd have to add your own value...
There are already separate artefacts for SWORD and SQLite module formats, but mostly because they drag in several megabytes of dependencies which you may not want to download. So the classes/methods should be public enough to actually implement extra formats in a separate module. Speaking of Kotlin, my dislikes are not technical but political (having effectively a single company in the driving seat, and while they are open source, they are not committed to support bootstrapping their compiler if you don't want to use the binaries from them but compile it yourself from source. Not sure, perhaps the latter has changed recently, though). But considering the number of JVM languages that jumped out of the void (Kotlin, Scala, Groovy, Clojure), I am not confident that Kotlin would be the best choice. And the But I digress.
Yeah, that is how publishers like to handle it nowadays...
Yes, I agree. Just this time on the "export to Paratext" and not on the "Import from Paratext" path.
This is "interesting". I tested how (some old version of) Paratext handles unnormalized whitespace on USFM->USX conversion, and it normalized that. So I assumed that in USX whitespace will always be normalized and decided to do the normalization in the USFM import instead of in AbstractParatextFormat As you seem to have examples of USX files with unnormalized white space, it may make sense to move the normalization into the "import" path of AbstractParatext format (and remove the other normalization in USFM or make it optional via system property, as USFM files are often written in simple text editors and therefore it is more likely for them to contain unnormalized white space than USX). In that case, you should make extra sure to properly sanitize or escape unnormalized whitespace (like tabs or CRLF) when outputting ParatextDump files, otherwise you would have problems importing the file again later. |
Hi Micheal, I'm working on fixing the comments on the PR, regarding the whitespace normalisation:
In the sample file GEN-with-only-USX3-end-milestones.usx you will find the following content:
Right after the "created the heavens and the earth." part you will find a space, just before the verse end-milestone. The USX readers do read this space, and also output the space. However when exporting from USX to USFM 2 and then reading again from USFM 2 you will lose that space. This makes a "roundtrip" from USX to USFM and back produce a different USX output (it misses the space), you can test this by executing the So maybe it is indeed better to do the normalisation also for USX? Or in the Edit: I tried |
regarding the whitespace normalisationWhen looking at the sample file you mentioned, it seems to me (and I have verified with a bunch of regex searches) that this file follows a convention that every verse that is not the last one of a paragraph will contain trailing whitespace. So the whitespace is not actually significant, but it is there to make it possible to convert the text to unstructured formats like HTML or EPUB without having to special-case verses at end of paragraph vs. others. I do not know how prevalent this convention is, but maybe it even makes sense to enable this on USX export, (maybe based on a system property)? So that if you want to have this style of USX, you can get it from every Bible. verse number transformationThe general implementation looks good to me (from a quick glance), however there are two points
For the record, while the usage of separators in verse numbers are not really uniform, here are some cases I've seen in practice (and this is how I try to treat these separators in cases where it matters):
I am aware that it is not 100% possible to represent those in USFM/USX (some may be possible by supporting |
...multiconverter/src/main/java/biblemulticonverter/format/paratext/AbstractParatextFormat.java
Outdated
Show resolved
Hide resolved
Sorry about the force-push, I tried to add CI to the repo and made a typo in the .yaml file. If you did not pull in the last few minutes, you should not be affected. |
This is now implemented, I would like to write some more test cases but I think this is OK for now. Regarding the PR and when it is finished. I will make sure to give you a clear message to tell you "I'm done" ;) Currently working on this in my spare time.
I will improve the handeling of 2/7 2.7 and 2,7 (and also 2-7), with a warning. Your comments are helpful here since I know quite something about USX/USFM but my knowledge about other formats is lacking.
Will fix that! |
ChapterRange should also be possible? MAT 4:1-6:999
Same applies to me :) |
I want to finish up the verse conversions, but I'm wondering how to handle the different cases, what do you think about this:
|
I have now applied the whitespace normalisation that was used in USFM 2 to USX 2 and USX 3, so every format now normalises the whitespace, see: 7a234bb Is adding the option to add a whitespace to every verse that is not at the end of a paragraph something you want to implement in this PR? |
I would go with 2 only. Same for 2.4.7 -> 2. In general, I think Bible software can cope better with missing verse numbers (7 in this example) than with duplicate verse numbers (3-6).
10
I would prefer no exception, but probably nothing much better available... I guess the only other option would be 6 with a warning... When converting from a binary format, exception would mean that you have to do an intermediat conversion to some text-based format to "fix" the number, before attempting the conversion again. In case of a warning, you can edit the verse number directly in the output format.
I don't mind. As I understood it, you added the disabling of trimming trailing whitespace so that your test case bible could be roundtrip converted. If you remove this, it would mean that your testcases would not roundtrip convert. So if you want to not break your test cases, you'd have to add that option back in this PR. Also, I have not seen that style in USX 2.0 (but I haven't seen very many USX 2.0 files either), so it would fit into the theme of adding USX 3.0 support. If you prefer to not do it in this PR, I'm fine with it as well. |
I've implemented the warnings for converting verse numbers from the internal format to Paratext, I also fixed the conversion of Only thing I left out, and still throws an exception is |
The former would be |
// 5/7 | ||
// 5/7/9 | ||
// 5.6G | ||
Matcher matcher = Pattern.compile("([1-9][0-9a-zG]*)(?:([,/.-])([1-9][0-9a-zG]*))*").matcher(internalNumber); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure if these capturing groups inside of noncapturing groups behave as you would expect (or whether it is even specified in Java). If they do, please add explicit test cases for 1-4.7
and 1.4-7
(resulting in 1-4
and 1
respectively).
One special case I forgot which comes from Accordance is 1/t, refering to the title "verse" of a Psalm (if the psalm starts with verse 1 after the title verse). But rewriting this to 1 should be fine (it would also be fine to special case it to "0" or whatever is usually used in USX for this verse number).
I'm back from some time off and started working on a problem that I find hard to solve (with my limited JAXB experience): Sample text:
The resulting Para.getContent() misses one critical piece of information after parsing: the single space between This space is for obvious reasons quite important and should be preserved, but somehow JAXB is not even returning it as String. I tried to use I found many topics on the internet about roughly the same thing, but non with an actual solution, similar issue: https://bugs.eclipse.org/bugs/show_bug.cgi?id=453934 |
This now definitely was a weird issue. I know from my experience that mixed content usually preserves white space text nodes, and other content does not (only exception if you pass JAXB a DOM source where whitespace has already been eliminated). So I pulled your branch and made a small test case to pinpoint the issue. I was surprised when the test passed in the Eclipse debugger (without changes). Then I ran the test in Maven, and it failed. Turns out I have set my Eclipse launch configuration to not shadow JRE bundled libraries like JAXB. Disabling this option (which I still had enabled from times when I developed Java applets for Java 6...) made the test fail consistently. So in Eclipse I used the JAXB 2.2.8 bundled in the JRE 8, while Maven used the one in Maven central. And so I learned that they are not the same. I updated JAXB (a small step) from 2.2.8 to 2.2.11, and the issue is gone for me. Can you please merge/cherry-pick/rebase 729ca73 and re-test? |
…tent.Text In many if not all cases a single whitespace before the text or after te text is to be expected, for example when two text pieces separated by a note or text style. #38
Thanks for the quick action, I can confirm (with my own unit test) that 729ca73 fixes the issue. Small update: I will try to finish up this PR soon, I have added the option to preserve spaces found at the end of lines in USFM, and will push that soon. |
#38 Also includes some whitespace normalisation for USX and additional test cases.
|
@Rolf-Smit Getting back on topic, you wrote that you want to finish up the PR soon, but I did not get any more feedback. So I am not sure at the moment if you believe this PR to be finished already or if you are still doing improvements. Or if you just did not find any time recently to finish it up. Also I'd like to make sure we both believe the same thing to be open. I don't know all open items from the top of my head, but proper registration - should be a one-liner - in Also, the combined verse number handling should be clarified before this can be landed. If you don't think you have time/interest in fixing this in the near future, I can probably find some free time around Easter to fix this up, test it, and merge it. |
Sorry, unfortunately I have been extremely busy in private life. I could have finished this sooner. @schierlm from my side the PR is finished:
If there is anything else, let me know. @petervdschelde I have seen your email (and message here for that matter), we are basically direct-competitors (at least for the Dutch market) but since we are both non-profit and working towards the same shared goals I'm extremely glad to help out, so expect an email from me soon ;) |
The bullet points in Supported Formats should be rather short. Details about the formats come later in the README.
When there is more than one verse in the same paragraph, the separator text is added between them. Depending on the input file, you may want to set this to one or more spaces.
The first and the last branch were actually the same, and can be easily handled the same way by making the Regexp only match "-".
When a verse contains a chapter number at the start, the current logic used the chapter number as verse number. Prefer using the verse number instead. Both are wrong, but the verse number may require a lower amount of post-editing.
Sorry if I rushed you, I did not intend to do so. I just skimmed over the conversation and fixed the things I wanted to still fix, and pushed those to your branch. In case you find time, feel free to review. If not, you do not have to. When running a few test conversions, I found an issue where See attached footnote-ref.zip. Command line was: However, this is not a regression as USX2 export in current version shows the same behaviour. So I also believe now that everything should be fine for merging. I will run a few more test conversions (mainly from other formats to Paratext formats) in the following days and if they do not show any severe regressions, I will merge it. Thank you again for your work on these formats! |
I checked your commits, those are some good improvements, code looks good to me. As for the reference contained within a footnote: Open a new issue for this? |
When importing to a Paratext format, footnote content is generally not wrapped in \ft or \xt. Therefore synthesize them if the foonote does not start with a `\f?` or `\x?` tag. When exporting to a non-Paratext format, if the footnote content starts with a `\ft?` or `\xt` tag, skip that tag and just export its contents. This avoids invalid USX3 export when exporting a footnote that contains references without the footnote text being wrapped as footnote text.
Thanks, I must have totally missed that footnotes in USFM/USX do not directly contain the text but generally wrap their text in tags like ft etc. I now added code when converting to Paratext formats, to synthesize the My text export that failed previously passes now. So consider this resolved. |
@schierlm I feel like this PR is now ready to be merged, anything else that needs to be done? |
When the input bible contains dictionary entries or unknown CSS formatting, `ParatextExportVisitor` reused the same visitor, which resultied into multiple VerseEnd elements in the Paratext content (resulting in multiple <verse eid="..."> tags in USX3.
Sorry, it took me longer to expected (I am also doing this on my free time, which sometimes may be a bit more limited than I'd like) to run through my own test Bibles, export them to the affected formats, and have a look at the output. I spotted one more bug in the process which I fixed, so now this is really ready to merge. In case I find more bugs later, I can fix them outside this PR as well. Thanks for your patience :-) |
Some locations still expected to see "denormalized" references (lastChapter == firstChapter instead of lastChapter == -1). Fix those that I could find quickly. Also fix NullPointerExcception when importing an USX file with references to books not contained in that bible. See #41.
- Verse starts and verse ends may be inside table rows - Verse end may not be before verse start - In case the verse start is in an invalid location (e.g. prolog or headline), do not try to inject a verse end marker and print a warning instead. Also fix table row and table cell export for USX2/USX3, and add a missing break; statement when importing chapter marks from ParatextDump. See #39.
Implements feature request #38
This adds support for importing and exporting USX 3.0 files.
Also the internal format that is used for debugging (
ParatextDump
) is supported (dump > USX 3 and USX 3 > dump).I've touched a lot of code, fixed a few bugs, added many comments, if you have any questions on why I did certain things, please do ask.
Most important thing to mention is that I've chosen to include the end chapter and verse milestones in the internal format, and make the old formats USFM 2, USX 2 and USFX polyfill those. Since I think it makes more sense to support the newest format in the best way possible (which is USX 3). Also since there is no perfect way to find out where end chapter/verse milestones must be placed it is best to preserve them. This polyfilling should not impact USX 2 to USFM 2 etc. but if you do an "upgrade" (from USX 2 to 3) it can have some impact since end chapter/verse milestones are basically inserted in what the importer thinks is the most appropriate position.
Few other changes:
6a
or6-7
\\p{Z}
from the regex).