Skip to content

Handle USFM and Text corpora separately in pre-processing#894

Open
pmachapman wants to merge 1 commit intomainfrom
separate-usfm-text-corpora
Open

Handle USFM and Text corpora separately in pre-processing#894
pmachapman wants to merge 1 commit intomainfrom
separate-usfm-text-corpora

Conversation

@pmachapman
Copy link
Copy Markdown
Collaborator

@pmachapman pmachapman commented Mar 26, 2026

Fixes #893.

I will write an e2e test on Monday, but I thought I should gather feedback on this PR.

This PR fixes a regression from #882 where text and scripture corpora were being combined, resulting in a crash when comparing a MultiKeyRef to a ScriptureKeyRef. I tried initially just updating those structs to compare to each other, but although that fixed the crash, the text corpora rows disappeared as they weren't scripture. And so I settled on iterating over them separately.

@Enkidu93 Please let me know if I have (or have not) gone about the right way of fixing this bug.


This change is Reviewable

@pmachapman pmachapman requested review from Enkidu93 and ddaspit March 26, 2026 04:21
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.97%. Comparing base (c57e306) to head (cab499a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #894      +/-   ##
==========================================
+ Coverage   67.94%   67.97%   +0.02%     
==========================================
  Files         386      386              
  Lines       21205    21224      +19     
  Branches     2736     2740       +4     
==========================================
+ Hits        14408    14427      +19     
  Misses       5812     5812              
  Partials      985      985              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@pmachapman pmachapman marked this pull request as ready for review March 26, 2026 04:26
Copy link
Copy Markdown
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! Yeah, we should add a mixed (text+Paratext) test wherever we already have Paratext/text tests and no such mixed test - including extending the Nmt_Paratext() test.

I think this is the right way to solve it. The other option, like you say, would be to make the refs comparable, but you'd have to adjust the scripture filtering to preserve rows that are either non-scripture-refs or scripture refs that refer to verses - i.e., something like

        return parallelTextRow.Ref is not ScriptureRef sr || sr.IsVerse;

and filter all training corpora with that, not just the scripture corpora. I think that could work as well. In some ways, I think that would be cleaner, but I'm sure Damien has an opinion.

@Enkidu93 reviewed 1 file and all commit messages, and made 1 comment.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ddaspit).

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the broader implications would be of making different kinds of refs comparable, so I'm okay with this solution.

@ddaspit reviewed 1 file and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

                sourceInferencingCorpus,
                targetInferencingCorpus,
                targetCorpus!,

I don't think targetCorpus can be null, unless I'm missing something.

@pmachapman pmachapman force-pushed the separate-usfm-text-corpora branch from cbaea8d to cab499a Compare March 26, 2026 19:13
Copy link
Copy Markdown
Collaborator Author

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman made 1 comment.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on ddaspit and Enkidu93).


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusService.cs line 284 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think targetCorpus can be null, unless I'm missing something.

Done. Thank you! (This was a hangover from an earlier attempt I made to fix the bug.)

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash when additional training data is used

4 participants