Handle USFM and Text corpora separately in pre-processing #894
pmachapman wants to merge 1 commit into main
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
+ Coverage   67.94%   67.97%   +0.02%
==========================================
  Files         386      386
  Lines       21205    21224      +19
  Branches     2736     2740       +4
==========================================
+ Hits        14408    14427      +19
  Misses       5812     5812
  Partials      985      985
OK! Yeah, we should add a mixed (text+Paratext) test wherever we already have Paratext/text tests and no such mixed test - including extending the Nmt_Paratext() test.
I think this is the right way to solve it. The other option, like you say, would be to make the refs comparable, but you'd have to adjust the scripture filtering to preserve rows that are either non-scripture-refs or scripture refs that refer to verses - i.e., something like
return parallelTextRow.Ref is not ScriptureRef sr || sr.IsVerse;
and filter all training corpora with that, not just the scripture corpora. I think that could work as well. In some ways, I think that would be cleaner, but I'm sure Damien has an opinion.
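To make that alternative concrete, here is a minimal sketch of the predicate applied to a mixed set of rows. The `ScriptureRef`, `TextRef`, and `Row` types below are simplified stand-ins written for this example, not the real library types, and `KeepForTraining` and `Filter` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in types for illustration only; the real ref and row types differ.
public sealed record ScriptureRef(string Book, int Chapter, int Verse, bool IsVerse);
public sealed record TextRef(string TextId, int Line);
public sealed record Row(object Ref, string Text);

public static class RowFilter
{
    // Keep a row if its ref is not a scripture ref at all,
    // or if it is a scripture ref that points at a verse.
    public static bool KeepForTraining(Row row)
    {
        return row.Ref is not ScriptureRef sr || sr.IsVerse;
    }

    // Apply the same predicate to every row, scripture or not.
    public static List<Row> Filter(IEnumerable<Row> rows)
    {
        return rows.Where(KeepForTraining).ToList();
    }
}
```

With this predicate, a plain text row and a verse row both survive, while a scripture row that does not refer to a verse (a heading, say) is dropped.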
@Enkidu93 reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on ddaspit).
ddaspit left a comment
I'm not sure what the broader implications would be of making different kinds of refs comparable, so I'm okay with this solution.
@ddaspit reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):
sourceInferencingCorpus, targetInferencingCorpus, targetCorpus!,
I don't think targetCorpus can be null, unless I'm missing something.
cbaea8d to cab499a
pmachapman left a comment
@pmachapman made 1 comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on ddaspit and Enkidu93).
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I don't think targetCorpus can be null, unless I'm missing something.
Done. Thank you! (This was a hangover from an earlier attempt I made to fix the bug.)
ddaspit left a comment
@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on pmachapman).
Fixes #893.
I will write an e2e test on Monday, but I thought I should gather feedback on this PR.
This PR fixes a regression from #882 where text and scripture corpora were being combined, resulting in a crash when comparing a MultiKeyRef to a ScriptureKeyRef. I initially tried just updating those structs so that they could be compared with each other; that fixed the crash, but the text corpus rows disappeared because they weren't scripture. So I settled on iterating over the text and scripture corpora separately.
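A minimal sketch of the failure mode and of the "handle each kind separately" idea. The `MultiKeyRef` and `ScriptureKeyRef` types here only borrow the names used in this description; their fields, the comparison behavior (throwing on a mismatched type), and the `SortSeparately` helper are assumptions made for illustration, not the actual implementation in this PR.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in ref types: each one can only be compared with its own kind.
public readonly record struct ScriptureKeyRef(int Book, int Chapter, int Verse) : IComparable
{
    public int CompareTo(object? obj)
    {
        if (obj is not ScriptureKeyRef other)
            throw new ArgumentException("Can only compare to another ScriptureKeyRef.");
        return (Book, Chapter, Verse).CompareTo((other.Book, other.Chapter, other.Verse));
    }
}

public readonly record struct MultiKeyRef(string TextId, int Line) : IComparable
{
    public int CompareTo(object? obj)
    {
        if (obj is not MultiKeyRef other)
            throw new ArgumentException("Can only compare to another MultiKeyRef.");
        int result = string.CompareOrdinal(TextId, other.TextId);
        return result != 0 ? result : Line.CompareTo(other.Line);
    }
}

public static class CorpusOrdering
{
    // Ordering one combined list would compare refs of different kinds and throw.
    // Splitting by kind first keeps every comparison within a single type.
    public static (List<ScriptureKeyRef> Scripture, List<MultiKeyRef> Text) SortSeparately(
        IEnumerable<IComparable> refs)
    {
        List<IComparable> all = refs.ToList();
        var scripture = all.OfType<ScriptureKeyRef>().OrderBy(r => r).ToList();
        var text = all.OfType<MultiKeyRef>().OrderBy(r => r).ToList();
        return (scripture, text);
    }
}
```

In this sketch no text row is lost: text refs are never pushed through scripture-only comparison or filtering, they are simply handled in their own pass.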
@Enkidu93 Please let me know whether I have gone about fixing this bug the right way.