-
-
Notifications
You must be signed in to change notification settings - Fork 17
Key terms support #143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Key terms support #143
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #143 +/- ##
==========================================
+ Coverage 66.06% 66.39% +0.32%
==========================================
Files 425 426 +1
Lines 33033 33208 +175
Branches 4433 4454 +21
==========================================
+ Hits 21824 22047 +223
+ Misses 10149 10093 -56
- Partials 1060 1068 +8 ☔ View full report in Codecov by Sentry. |
|
Should we really crash if there is no keyterms file? That appears excessive. |
|
Previously, johnml1135 (John Lambert) wrote…
It's caught upstream. I think it's appropriate that if you try to make a ParatextKeyTermsCorpus with no key terms file, it throws an exception. |
|
Previously, Enkidu93 (Eli C. Lowry) wrote…
ok - as long as it just keeps going. |
|
There should be a nice set of 20 or so tests to check this. I would recommend breaking out the parsing section as a separate function that takes a list of keyTermsElements and returns rows. That should be more testable than having to create a bunch of xml files. |
|
Previously, johnml1135 (John Lambert) wrote…
The 20 or so tests would be the official documentation of the logic. |
|
Previously, johnml1135 (John Lambert) wrote…
Or, you could have a function that just takes a single element and returns a list of glosses - that may be better. |
|
This needs a comment explaining what it does and a few tests to confirm behavior. |
|
Previously, johnml1135 (John Lambert) wrote…
And the function can be static. |
|
Using keyterms should be default. Possibly you could switch the logic? What would be the best way to do this @ddaspit? |
|
This needs more testing. |
johnml1135
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r1 (raw file):
foreach (XElement element in keyTermsElements) { if ((element.Element("Category")?.Value ?? "") != "PN")
This logic should be documented in a wiki.
|
I'll go ahead and add testing and documentation. The logic in the key terms corpus is taken from SILNLP which I'm realizing is not really well-tested itself (and understandably so). |
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want this functionality for all engine types, including SMT, so the code to create a ParatextKeyTermsCorpus should be moved to CorpusService.CreateTextCorpus. Once this functionality is no longer specific to NMT, it is no longer appropriate to use build options to configure this. Build options are engine-specific options. A new property should be added to StartBuildRequest for the StartBuild gRPC endpoint to toggle this functionality.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic has been moved.
Reviewable status: 0 of 10 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This logic should be documented in a wiki.
This logic has been removed since we're using TermRenderings.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 61 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
And the function can be static.
Done.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 81 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This needs a comment explaining what it does and a few tests to confirm behavior.
Done.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 133 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Using keyterms should be default. Possibly you could switch the logic? What would be the best way to do this @ddaspit?
OK, it's easy enough to make that the default; I just didn't know that it was supposed to be a default feature. I've left it as-is. Please confirm that we'd want this to be the default behavior and I can make that change.
|
The added testing also fixes #144 . |
|
This appears to not have code coverage: https://app.codecov.io/gh/sillsdev/machine/pull/143?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sillsdev |
|
Previously, Enkidu93 (Eli C. Lowry) wrote…
@ddaspit - do you have any objection to making it default? |
johnml1135
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @Enkidu93)
|
Previously, johnml1135 (John Lambert) wrote…
Thank you! Previously, this file had 0% coverage. The generic |
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize, but I am asking you to change the design again. See my comments below.
Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 15 at r2 (raw file):
public string BiblicalTermsType { get; set; } public ParatextKeyTermsCorpus(string projectDir)
I would rename this to fileName.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r2 (raw file):
if (termsFileEntry is null) { throw new ArgumentException(
Instead of throwing an exception, we should just return an empty corpus.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r2 (raw file):
foreach (XElement element in termsElements) { string id = element.Attribute("Id").Value;
We need some way to filter by category. This means that we will need to pull in the Biblical terms list.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 56 at r2 (raw file):
string id = element.Attribute("Id").Value; id = id.Replace("\n", "
"); string gloss = element.Element("Renderings").Value;
I would rename this to rendering since it could be confused with the term gloss.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 60 at r2 (raw file):
rows.Add(new TextRow("KeyTerms", id) { Segment = glosses }); } IText text = new MemoryText("KeyTerms", rows);
If the text id is the Biblical terms list, then we won't need to add an extra check to see if the source and target are using the same list.
If we do this, then we don't need to perform the special check to see if the lists match, so we can treat this just like any other corpus. I think we should change CorpusService.CreateTextCorpus to return a dictionary of corpora, which was one of the options we discussed previously. The key would be an enum that defines the type of data in the corpus, i.e. Text and Terms. You can match up the source and target corpora using the enum values. You can also use the enum value to remove the key terms if they are disabled.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 88 at r2 (raw file):
buildOptionsObject is not null && buildOptionsObject["use_key_terms"] is not null && buildOptionsObject["use_key_terms"]!.ToString() == "true"
I think this can be cast to a bool.
|
OK 😵💫 |
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay - got side-tracked. Done.
Reviewable status: 4 of 11 files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 15 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would rename this to
fileName.
Done. Would you like me to rename "projectDir" -> "fileName" in ParatextTextCorpus as well then? It was already inconsistent language, but "projectDir" seemed the more appropriate of the two.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Instead of throwing an exception, we should just return an empty corpus.
Done.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We need some way to filter by category. This means that we will need to pull in the Biblical terms list.
Done.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 56 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would rename this to
renderingsince it could be confused with the term gloss.
Done.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 60 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
If the text id is the Biblical terms list, then we won't need to add an extra check to see if the source and target are using the same list.
If we do this, then we don't need to perform the special check to see if the lists match, so we can treat this just like any other corpus. I think we should change
CorpusService.CreateTextCorpusto return a dictionary of corpora, which was one of the options we discussed previously. The key would be an enum that defines the type of data in the corpus, i.e.TextandTerms. You can match up the source and target corpora using the enum values. You can also use the enum value to remove the key terms if they are disabled.
Done.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 88 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I think this can be cast to a
bool.
Done.
|
Previously, Enkidu93 (Eli C. Lowry) wrote…
If it is tested in E2E testing (and no logic of note), then I am ok ignoring it. |
johnml1135
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @Enkidu93)
OK, cool. And it is slightly more covered now anyways after the re-redesign. |
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 7 files at r3, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 15 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Done. Would you like me to rename
"projectDir"->"fileName"inParatextTextCorpusas well then? It was already inconsistent language, but "projectDir" seemed the more appropriate of the two.
I think you are referring to ParatextBackupTextCorpus. That class already uses fileName.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Done.
We still need to deal with the case where the project is using one of the predefined Biblical terms lists. I think that means we need to include the predefined XML files as embedded resources in the SIL.Machine assembly. You can see an example of this with the embedded usfm.sty file. We can talk next week, if you want more details.
src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 74 at r3 (raw file):
id = id.Replace("\n", "
"); string rendering = element.Element("Renderings").Value; IReadOnlyList<string> glosses = GetRenderings(rendering);
You missed this variable name.
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 80 at r10 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This is potentially expensive, since it is searching through the whole terms list XML for every term rendering. It would probably be better to create a dictionary from the terms list that maps the term id to the category.
You should use TryGetValue, so you only perform one dictionary lookup.
|
Previously, ddaspit (Damien Daspit) wrote…
Still not passing a test - checking the logic again now. |
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 23 of 26 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 49 at r7 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
For a project terms list, you just need to check if the list type is
Projectand the project id matches theNamefield in the project settings.
Done.
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 84 at r7 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Yes, that looks correct.
Cool.
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 80 at r10 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Still not passing a test - checking the logic again now.
Fixed the test data.
johnml1135
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 61 at r12 (raw file):
{ biblicalTermsDoc = XDocument.Load(keyTermsFile); termIdToCategoryDictionary = biblicalTermsDoc
Can you move this code to a static method so it isn't repeated?
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 69 at r12 (raw file):
else { if (
This can be an else if statement.
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 61 at r12 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Can you move this code to a static method so it isn't repeated?
Done.
src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 69 at r12 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This can be an
else ifstatement.
Done.
johnml1135
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job sticking with this PR.
Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
*Testing *Moving logic (as possible) out of the build job *Added logic to SMT *Stuck with buildOptions for configuring (after further conversation) Issues - Not as much abstraction as I'd like between the CorpusService, *BuildJobs, etc.
dc85c31 to
1ab2ad1
Compare
|
And thank you to you for being persistent in your reviewing. |
Fixes sillsdev/serval#225
Parallel PR: sillsdev/serval#226
This change is