Skip to content

Conversation

@Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Nov 28, 2023

Fixes sillsdev/serval#225

Parallel PR: sillsdev/serval#226


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Nov 28, 2023

Codecov Report

Attention: 51 lines in your changes are missing coverage. Please review.

Comparison is base (e1106b4) 66.06% compared to head (dc85c31) 66.39%.
Report is 10 commits behind head on master.

Files Patch % Lines
...c/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs 73.22% 25 Missing and 9 partials ⚠️
...Machine.AspNetCore/Services/SmtTransferBuildJob.cs 23.80% 16 Missing ⚠️
...chine.AspNetCore/Services/NmtPreprocessBuildJob.cs 96.77% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #143      +/-   ##
==========================================
+ Coverage   66.06%   66.39%   +0.32%     
==========================================
  Files         425      426       +1     
  Lines       33033    33208     +175     
  Branches     4433     4454      +21     
==========================================
+ Hits        21824    22047     +223     
+ Misses      10149    10093      -56     
- Partials     1060     1068       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r1 (raw file):

                if (keyTermsFileEntry is null)
                {
                    throw new ArgumentException(

Should we really crash if there is no keyterms file? That appears excessive.

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Should we really crash if there is no keyterms file? That appears excessive.

It's caught upstream. I think it's appropriate that if you try to make a ParatextKeyTermsCorpus with no key terms file, it throws an exception.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

It's caught upstream. I think it's appropriate that if you try to make a ParatextKeyTermsCorpus with no key terms file, it throws an exception.

ok - as long as it just keeps going.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 61 at r1 (raw file):

                    Regex rx = new Regex(@"\[(.+?)\]", RegexOptions.Compiled);
                    Match match = rx.Match(gloss);
                    if (match.Success)

There should be a nice set of 20 or so tests to check this. I would recommend breaking out the parsing section as a separate function that takes a list of keyTermsElements and returns rows. That should be more testable than having to create a bunch of xml files.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 61 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There should be a nice set of 20 or so tests to check this. I would recommend breaking out the parsing section as a separate function that takes a list of keyTermsElements and returns rows. That should be more testable than having to create a bunch of xml files.

The 20 or so tests would be the official documentation of the logic.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 61 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

The 20 or so tests would be the official documentation of the logic.

Or, you could have a function that just takes a single element and returns a list of glosses - that may be better.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 81 at r1 (raw file):

        }

        private static string StripParens(string termString, char left = '(', char right = ')')

This needs a comment explaining what it does and a few tests to confirm behavior.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 61 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Or, you could have a function that just takes a single element and returns a list of glosses - that may be better.

And the function can be static.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 133 at r1 (raw file):

                if (
                    buildOptionsObject is not null
                    && buildOptionsObject["use_key_terms"] is not null

Using keyterms should be default. Possibly you could switch the logic? What would be the best way to do this @ddaspit?

@johnml1135
Copy link
Collaborator

This needs more testing.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r1 (raw file):

                foreach (XElement element in keyTermsElements)
                {
                    if ((element.Element("Category")?.Value ?? "") != "PN")

This logic should be documented in a wiki.

@Enkidu93
Copy link
Collaborator Author

I'll go ahead and add testing and documentation. The logic in the key terms corpus is taken from SILNLP which I'm realizing is not really well-tested itself (and understandably so).

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want this functionality for all engine types, including SMT, so the code to create a ParatextKeyTermsCorpus should be moved to CorpusService.CreateTextCorpus. Once this functionality is no longer specific to NMT, it is no longer appropriate to use build options to configure this. Build options are engine-specific options. A new property should be added to StartBuildRequest for the StartBuild gRPC endpoint to toggle this functionality.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic has been moved.

Reviewable status: 0 of 10 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This logic should be documented in a wiki.

This logic has been removed since we're using TermRenderings.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 61 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And the function can be static.

Done.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 81 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This needs a comment explaining what it does and a few tests to confirm behavior.

Done.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 133 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Using keyterms should be default. Possibly you could switch the logic? What would be the best way to do this @ddaspit?

OK, it's easy enough to make that the default; I just didn't know that it was supposed to be a default feature. I've left it as-is. Please confirm that we'd want this to be the default behavior and I can make that change.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Dec 1, 2023

The added testing also fixes #144 .

@johnml1135
Copy link
Collaborator

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 133 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, it's easy enough to make that the default; I just didn't know that it was supposed to be a default feature. I've left it as-is. Please confirm that we'd want this to be the default behavior and I can make that change.

@ddaspit - do you have any objection to making it default?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Dec 1, 2023

src/SIL.Machine.AspNetCore/Services/CorpusService.cs line 23 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This appears to not have code coverage: https://app.codecov.io/gh/sillsdev/machine/pull/143?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=sillsdev

Thank you! Previously, this file had 0% coverage. The generic CreateTextCorpus gets tested in the NmtPreprocessBuildJobTests. It would be a bit forced and odd to test this there as well. I could also make it its own CorpusServiceTests but there's just so little to test. This is tested in the Serval Paratext E2E test. If you'd really like me to, I can incorporate a test now for this. It may be worthwhile to open up a separate issue that involves systematically going through the existing code coverage reports and trying to improve our coverage.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologize, but I am asking you to change the design again. See my comments below.

Reviewed 10 of 10 files at r2, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 15 at r2 (raw file):

        public string BiblicalTermsType { get; set; }

        public ParatextKeyTermsCorpus(string projectDir)

I would rename this to fileName.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r2 (raw file):

                if (termsFileEntry is null)
                {
                    throw new ArgumentException(

Instead of throwing an exception, we should just return an empty corpus.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r2 (raw file):

                foreach (XElement element in termsElements)
                {
                    string id = element.Attribute("Id").Value;

We need some way to filter by category. This means that we will need to pull in the Biblical terms list.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 56 at r2 (raw file):

                    string id = element.Attribute("Id").Value;
                    id = id.Replace("\n", "&#xA");
                    string gloss = element.Element("Renderings").Value;

I would rename this to rendering since it could be confused with the term gloss.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 60 at r2 (raw file):

                    rows.Add(new TextRow("KeyTerms", id) { Segment = glosses });
                }
                IText text = new MemoryText("KeyTerms", rows);

If the text id is the Biblical terms list, then we won't need to add an extra check to see if the source and target are using the same list.

If we do this, then we don't need to perform the special check to see if the lists match, so we can treat this just like any other corpus. I think we should change CorpusService.CreateTextCorpus to return a dictionary of corpora, which was one of the options we discussed previously. The key would be an enum that defines the type of data in the corpus, i.e. Text and Terms. You can match up the source and target corpora using the enum values. You can also use the enum value to remove the key terms if they are disabled.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 88 at r2 (raw file):

                    buildOptionsObject is not null
                    && buildOptionsObject["use_key_terms"] is not null
                    && buildOptionsObject["use_key_terms"]!.ToString() == "true"

I think this can be cast to a bool.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Dec 1, 2023

OK 😵‍💫

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay - got side-tracked. Done.

Reviewable status: 4 of 11 files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 15 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rename this to fileName.

Done. Would you like me to rename "projectDir" -> "fileName" in ParatextTextCorpus as well then? It was already inconsistent language, but "projectDir" seemed the more appropriate of the two.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 23 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of throwing an exception, we should just return an empty corpus.

Done.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We need some way to filter by category. This means that we will need to pull in the Biblical terms list.

Done.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 56 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rename this to rendering since it could be confused with the term gloss.

Done.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 60 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

If the text id is the Biblical terms list, then we won't need to add an extra check to see if the source and target are using the same list.

If we do this, then we don't need to perform the special check to see if the lists match, so we can treat this just like any other corpus. I think we should change CorpusService.CreateTextCorpus to return a dictionary of corpora, which was one of the options we discussed previously. The key would be an enum that defines the type of data in the corpus, i.e. Text and Terms. You can match up the source and target corpora using the enum values. You can also use the enum value to remove the key terms if they are disabled.

Done.


src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs line 88 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this can be cast to a bool.

Done.

@johnml1135
Copy link
Collaborator

src/SIL.Machine.AspNetCore/Services/CorpusService.cs line 23 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Thank you! Previously, this file had 0% coverage. The generic CreateTextCorpus gets tested in the NmtPreprocessBuildJobTests. It would be a bit forced and odd to test this there as well. I could also make it its own CorpusServiceTests but there's just so little to test. This is tested in the Serval Paratext E2E test. If you'd really like me to, I can incorporate a test now for this. It may be worthwhile to open up a separate issue that involves systematically going through the existing code coverage reports and trying to improve our coverage.

If it is tested in E2E testing (and no logic of note), then I am ok ignoring it.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Dec 1, 2023

src/SIL.Machine.AspNetCore/Services/CorpusService.cs line 23 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…

If it is tested in E2E testing (and no logic of note), then I am ok ignoring it.

OK, cool. And it is slightly more covered now anyways after the re-redesign.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 7 files at r3, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 15 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Done. Would you like me to rename "projectDir" -> "fileName" in ParatextTextCorpus as well then? It was already inconsistent language, but "projectDir" seemed the more appropriate of the two.

I think you are referring to ParatextBackupTextCorpus. That class already uses fileName.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 54 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Done.

We still need to deal with the case where the project is using one of the predefined Biblical terms lists. I think that means we need to include the predefined XML files as embedded resources in the SIL.Machine assembly. You can see an example of this with the embedded usfm.sty file. We can talk next week, if you want more details.


src/SIL.Machine/Corpora/ParatextKeyTermsCorpus.cs line 74 at r3 (raw file):

                    id = id.Replace("\n", "&#xA");
                    string rendering = element.Element("Renderings").Value;
                    IReadOnlyList<string> glosses = GetRenderings(rendering);

You missed this variable name.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 80 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is potentially expensive, since it is searching through the whole terms list XML for every term rendering. It would probably be better to create a dictionary from the terms list that maps the term id to the category.

You should use TryGetValue, so you only perform one dictionary lookup.

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Dec 5, 2023

src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 80 at r10 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use TryGetValue, so you only perform one dictionary lookup.

Still not passing a test - checking the logic again now.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 23 of 26 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 49 at r7 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

For a project terms list, you just need to check if the list type is Project and the project id matches the Name field in the project settings.

Done.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 84 at r7 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, that looks correct.

Cool.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 80 at r10 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Still not passing a test - checking the logic again now.

Fixed the test data.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r12, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 61 at r12 (raw file):

                    {
                        biblicalTermsDoc = XDocument.Load(keyTermsFile);
                        termIdToCategoryDictionary = biblicalTermsDoc

Can you move this code to a static method so it isn't repeated?


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 69 at r12 (raw file):

                else
                {
                    if (

This can be an else if statement.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 61 at r12 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Can you move this code to a static method so it isn't repeated?

Done.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 69 at r12 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This can be an else if statement.

Done.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job sticking with this PR.

Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93 Enkidu93 merged commit f7ea2e0 into master Dec 6, 2023
@Enkidu93 Enkidu93 deleted the key_terms_support branch December 6, 2023 17:56
@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Dec 6, 2023

And thank you to you for being persistent in your reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Include Keyterms from Paratext Project in Training Data with toggle

5 participants