Handle USFM and Text corpora separately in pre-processing by pmachapman · Pull Request #894 · sillsdev/serval

pmachapman · 2026-03-26T04:21:28Z

Fixes #893.

I will write an e2e test on Monday, but I thought I should gather feedback on this PR first.

This PR fixes a regression from #882 in which text and scripture corpora were combined, causing a crash when a MultiKeyRef was compared to a ScriptureKeyRef. I initially tried updating those structs so they could be compared with each other. That fixed the crash, but the text corpora rows disappeared because they weren't scripture, so I settled on iterating over the two kinds of corpora separately.
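
For readers unfamiliar with the failure mode, here is a minimal, language-neutral sketch in Python (not the project's C#). `ScriptureKey`, `TextKey`, `align`, and `align_by_kind` are hypothetical stand-ins invented for illustration; they are not the MultiKeyRef/ScriptureKeyRef types or any method in ParallelCorpusService. The sketch only assumes that alignment orders rows by key, so mixing two key kinds in one pass fails, while handling each kind in its own pass keeps every row.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ScriptureKey:
    """Hypothetical stand-in for a scripture (book/chapter/verse) reference."""
    book: int
    chapter: int
    verse: int


@dataclass(frozen=True, order=True)
class TextKey:
    """Hypothetical stand-in for a plain text-file reference."""
    file_id: str
    line: int


def align(source, target):
    """Pair source and target rows that share a key.

    Rows are (key, text) tuples. Sorting by key raises TypeError when the
    keys are of different kinds, which mirrors the crash described above.
    """
    target_by_key = dict(target)
    ordered = sorted(source, key=lambda row: row[0])
    return [(k, text, target_by_key[k]) for k, text in ordered if k in target_by_key]


def align_by_kind(source, target):
    """Align each kind of corpus in its own pass so keys are never mixed."""
    aligned = []
    for kind in (ScriptureKey, TextKey):
        aligned += align(
            [row for row in source if isinstance(row[0], kind)],
            [row for row in target if isinstance(row[0], kind)],
        )
    return aligned
```

With one scripture row and one text row on each side, `align` raises `TypeError`, whereas `align_by_kind` returns both aligned rows.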

@Enkidu93 Please let me know whether I have gone about fixing this bug the right way.

codecov-commenter · 2026-03-26T04:23:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.97%. Comparing base (c57e306) to head (cab499a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
+ Coverage   67.94%   67.97%   +0.02%     
==========================================
  Files         386      386              
  Lines       21205    21224      +19     
  Branches     2736     2740       +4     
==========================================
+ Hits        14408    14427      +19     
  Misses       5812     5812              
  Partials      985      985

Enkidu93

OK! Yeah, we should add a mixed (text+Paratext) test wherever we already have Paratext/text tests and no such mixed test - including extending the Nmt_Paratext() test.

I think this is the right way to solve it. The other option, like you say, would be to make the refs comparable, but you'd have to adjust the scripture filtering to preserve rows that are either non-scripture-refs or scripture refs that refer to verses - i.e., something like

        return parallelTextRow.Ref is not ScriptureRef sr || sr.IsVerse;

and filter all training corpora with that, not just the scripture corpora. I think that could work as well. In some ways, I think that would be cleaner, but I'm sure Damien has an opinion.
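
To make the semantics of that alternative filter concrete, here is a small sketch in Python (again, not the project's C#). `ScriptureRef`, `TextRef`, and `keep_row` are hypothetical names used only for illustration, and the `is_verse` flag is an assumed simplification of the `IsVerse` property in the one-liner above. The predicate keeps every non-scripture row and drops only scripture rows that do not refer to verses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptureRef:
    """Hypothetical scripture reference; is_verse is False for non-verse content."""
    verse: int
    is_verse: bool


@dataclass(frozen=True)
class TextRef:
    """Hypothetical reference into a plain text corpus."""
    line: int


def keep_row(ref) -> bool:
    """Keep non-scripture refs, and scripture refs that point at verses."""
    return not isinstance(ref, ScriptureRef) or ref.is_verse


refs = [TextRef(1), ScriptureRef(1, True), ScriptureRef(0, False)]
kept = [ref for ref in refs if keep_row(ref)]
```

Here `kept` holds the text ref and the verse ref; only the non-verse scripture ref is filtered out, which is why the text corpora rows would no longer disappear under this approach.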

@Enkidu93 reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on ddaspit).

ddaspit

I'm not sure what the broader implications would be of making different kinds of refs comparable, so I'm okay with this solution.

@ddaspit reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

                sourceInferencingCorpus,
                targetInferencingCorpus,
                targetCorpus!,

I don't think targetCorpus can be null, unless I'm missing something.

pmachapman

@pmachapman made 1 comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on ddaspit and Enkidu93).

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think targetCorpus can be null, unless I'm missing something.

Done. Thank you! (This was a hangover from an earlier attempt I made to fix the bug.)

ddaspit

@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

pmachapman requested review from Enkidu93 and ddaspit March 26, 2026 04:21

pmachapman marked this pull request as ready for review March 26, 2026 04:26

Enkidu93 approved these changes Mar 26, 2026

ddaspit requested changes Mar 26, 2026

Handle USFM and Text corpora separately in pre-processing

cab499a

pmachapman force-pushed the separate-usfm-text-corpora branch from cbaea8d to cab499a Compare March 26, 2026 19:13

pmachapman commented Mar 26, 2026

ddaspit approved these changes Mar 27, 2026

pmachapman wants to merge 1 commit into main from separate-usfm-text-corpora

4 participants

Uh oh!

Conversation

pmachapman commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Enkidu93 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pmachapman commented Mar 26, 2026 •

edited

Loading

codecov-commenter commented Mar 26, 2026 •

edited

Loading

Enkidu93 left a comment •

edited

Loading