Bad corpora should be flagged #2

Closed · johnml1135 opened this issue Apr 21, 2023 · 36 comments

Labels: bug (Something isn't working)
@johnml1135
Collaborator

If there is an extraneous tab in a corpus, it will crash the SMT build process with the error "Object must be of type String." This error is cryptic and gives the user no way to debug what is going on. We should either report clearly "this file, on this line, is not correctly formatted - here is the correct format" or clean the files in a deterministic way before they are passed down to the engines. This may need to be specific per engine type.

@johnml1135 johnml1135 added the bug Something isn't working label Apr 21, 2023
@johnml1135 johnml1135 added this to the Improved Mother Tongue milestone Jul 19, 2023
@Enkidu93
Collaborator

Would you prefer a 'silently clean' solution or a 'throw an error from the machine side' solution? The latter seems more straightforward and less likely to cause difficult-to-debug headaches. But if this trailing tab or a similar issue is likely to be common, maybe cleaning is worth it.

@johnml1135
Collaborator Author

@Enkidu93, I would rather throw an intelligent error. If we clean, I would want to define it clearly - I would hate for people's data to be changed unexpectedly.

@Enkidu93
Collaborator

Enkidu93 commented Aug 1, 2023

OK, sounds good. I tried to recreate this by adding a tab with no luck (it completed and worked properly). Where did you add the extra tab?

@johnml1135
Collaborator Author

johnml1135 commented Aug 2, 2023

At least before, this type of file caused errors:
MyFileWithOneTab.txt

my long sentence\t
my other long sentence
my third long sentence

Note the tab (\t) at the end of the first sentence.
I believe the trailing tab specifically causes errors, in that it produces a zero-length segment, as per these two pieces of code:
https://github.com/sillsdev/machine/blob/940ca147dc83638cc9c3da2128b7573af511381d/src/SIL.Machine/Corpora/TextFileAlignmentCollection.cs#L52-L59
https://github.com/sillsdev/machine/blob/940ca147dc83638cc9c3da2128b7573af511381d/src/SIL.Machine/Corpora/TextFileText.cs#L43-L48
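For what it's worth, a minimal sketch of that failure mode (hypothetical standalone code, mirroring the Split-based parsing in the linked files):

    string line = "my long sentence\t";
    string[] columns = line.Split('\t');
    // columns == ["my long sentence", ""] - the trailing tab yields an
    // empty final column, which the key/column-code handling then
    // stumbles over instead of reporting a useful error.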

The idea was to address that, but a bit more generally. We have an odd format: either plain text per line, or keys (separated sometimes by '_' and sometimes by '-') that are then separated by a tab from the rest of the text.
A tab at the beginning, a tab at the end, keys separated by '-', an unintentional tab in the middle of a sentence - all of these would create bad data or crashes.
To remedy this, here is a proposal:

  • When uploading files, if they are of type "text", run a "text" check (see the sketch below):
    • Verify that every line has either one tab or none
  • Change the key-separator behavior to use only underscores "_". Document it in the API. Update the code in TextFileAlignmentCollection.cs to split only on underscores.
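A minimal sketch of what that upload-time check could look like (a hypothetical helper, not existing Serval code), reporting the offending line number so the user can fix the file:

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static IEnumerable<string> CheckTextFile(string path)
    {
        int lineNum = 0;
        foreach (string line in File.ReadLines(path))
        {
            lineNum++;
            int tabCount = line.Count(c => c == '\t');
            if (tabCount > 1)
                yield return $"{path}, line {lineNum}: expected 0 or 1 tabs, found {tabCount}";
        }
    }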

@ddaspit
Contributor

ddaspit commented Aug 3, 2023

What is the issue with dashes in a key?

@ddaspit
Contributor

ddaspit commented Aug 3, 2023

As @Enkidu93 said, we could make the code a bit more fault-tolerant and simply ignore any trailing whitespace, such as tabs, when reading the text file.

@johnml1135
Collaborator Author

(1) The issue with the dashes is that their use is inconsistent (used in one place but not the other) and undocumented. (2) Trimming whitespace sounds fine - unless we get a whole raft of verse references with no sentences! I would be more inclined to have the clients clean the data; then they can establish the intent better. It is really a data format that we are specifying. I wouldn't want to "clean" CSV files, saying "you didn't follow the rules, but I think this is what you intended." I would rather say "here are the rules, and if you don't follow them, I will error out and tell you why."

@ddaspit
Contributor

ddaspit commented Aug 4, 2023

We can document that we support either dashes or underscores to separate different components of a key.

I would like to keep the Machine corpus classes as fault-tolerant as possible, since those are often used by researchers, but I don't have a problem with being more strict in Serval and validating the files when they are uploaded.

@Enkidu93
Collaborator

Enkidu93 commented Aug 9, 2023

So besides updating the documentation in Machine, all edits should be in Serval, right? I.e., we're making it Serval's responsibility to validate file format? Also, are the file formats clearly documented anywhere? (I've figured out what they are myself from fixing the echo engine, but I didn't know if there was an official description somewhere.)

@johnml1135
Collaborator Author

@Enkidu93 - Let's have Serval document and check the file format. The only documentation is what is in this issue - and what we can determine from the code. I am still inclined to use only one separator (and even switch it to a semicolon (;) for added clarity). I don't want verse references (e.g., Mat 1:3-7) to trigger the separator.

@Enkidu93
Collaborator

Enkidu93 commented Aug 9, 2023

As I understand it, a text file can have one of two formats:

  1. Lines of text (no tabs - only spaces, with a newline at the end of each line).
  2. Lines of text with one or two tabs, in one of the following forms: a key, a tab, then text; a key and a tab (empty text); or a key, a tab, text, a tab, and a column code (ss, ir, rr, ...).

So I should be able to validate with a regex pretty easily. @johnml1135, could you verify that the above is correct? Thank you!

@Enkidu93
Collaborator

Enkidu93 commented Aug 9, 2023

Sorry, didn't see your reply until I posted.

@Enkidu93
Collaborator

Enkidu93 commented Aug 9, 2023

@johnml1135 I'm going off of what I learned with Peter Chapman about formats for the echo engine. I don't think verses like those should be a problem - maybe the column codes account for the formatting issues? Semicolons are also present in many verses though - depending on the version - and they're more English-specific, since punctuation varies widely across languages. Seems like a special character like tab might be better, right? I'm up for anything though.

@johnml1135
Collaborator Author

johnml1135 commented Aug 9, 2023 via email

@Enkidu93
Collaborator

Enkidu93 commented Aug 10, 2023

That might work. I know that some languages use a danda, which looks like that, for punctuation, but I believe that's a different Unicode character. I do wonder if it could be confusing or difficult, given that other keyboards may not have it? Why the dislike of the tab? Also, is this assumption about tabs already deeply ingrained in any projects using these tools?

@johnml1135
Collaborator Author

The tab idea is growing on me. Since it is a non-printing character, it will not conflict with any script, whereas any visible punctuation carries the risk of already being used in some script. Each reference would then be separated by a tab, and after the last tab would come the actual segment. There should be no tabs at the beginning or end of the line.

@ddaspit
Contributor

ddaspit commented Aug 10, 2023

For backwards-compatibility reasons, I would prefer that you not make significant changes to the text file format in the Machine libraries. It is currently working as-is with SF, so there is no need to change it. All we need to do is validate the current format in Serval.

@johnml1135
Collaborator Author

johnml1135 commented Aug 10, 2023 via email

@johnml1135
Collaborator Author

One edge case to make sure we cover is where there is a reference but no segment text (and/or there is text and then a trailing tab). My best understanding is that we should just ignore the segment completely - it does not add anything for us.
The other edge case I can imagine is where the source and target references don't align (because they are matched by line number, not by reference). A few ways to handle this:

  • Ignore any target references, and document that this is what is happening (my preferred path)
  • Concatenate the two sets of references (too complex - how is it benefiting the client?)

@Enkidu93
Collaborator

Enkidu93 commented Aug 11, 2023

I think a regex like this should cover it, right? (\S := any non-whitespace character)

"^(((\S| )*\n)+|((\S+\t(\S| )*(\t\S+)?)\n))+$"

or in more readable notation:

((\S + \space)(\S + \space)* + \S\S*\t(\S + \space)(\S + \space)*(\t\S\S* + \empty))\n
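If it's useful, a sketch of how that could be wired up in C# (assuming we validate line-by-line rather than matching the whole file at once; this is the single-line form of the same pattern):

    using System.Text.RegularExpressions;

    // A line is either plain text, or key<TAB>text with an optional
    // trailing <TAB>column-code.
    static readonly Regex LineFormat =
        new Regex(@"^((\S| )*|\S+\t(\S| )*(\t\S+)?)$");

    static bool IsValidLine(string line) => LineFormat.IsMatch(line);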

@Enkidu93
Collaborator

> One edge case to make sure we cover is where there is a reference, but no segment text (and/or there is text and then an ending tab). My best understanding is we should just ignore the segment completely - it does not add anything to us. The other edge case I can imagine is where the source and target references don't align (because they are matched by line number, not reference). A few ways to handle this:
>
> * Ignore any target references, and document that this is what is happening (my preferred path)
> * Concatenate the two sets of references (too complex - how is it benefiting the client?)

So the first issue is if the source line is empty? And the second issue is if there are more target lines than source lines or target lines that have no source reference?

@ddaspit
Contributor

ddaspit commented Aug 14, 2023

@johnml1135 If I am understanding your edge cases correctly, Machine should handle both of these cases gracefully. Empty segments and segments that do not have a matching pair are simply ignored.

@johnml1135
Collaborator Author

johnml1135 commented Aug 15, 2023

If I am correct, then we can add:

Documentation for text files (see the example file below):

  • All text files have one source or target segment per line, with no tabs in the segment
  • Text files can optionally have references associated with each line
  • If references are used, a few rules must be followed:
    • Each line must contain a tab separating the references from the segment, even if the references or the segment text are empty
    • Multiple references can be added, separated by an underscore (_), such as: Ref 1_Ref-2-with-dashes_1 JON 2:18-21
    • All leading and trailing whitespace will be trimmed from references and segments
    • For parallel text, segments are associated by line number AND all references for every line must match between the source and target side, or else the line/segment will be ignored
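For illustration, a hypothetical reference-per-line file following these rules (tab written as <TAB> for visibility) might look like:

    verse_001<TAB>my long sentence
    verse_002<TAB>my other long sentence
    verse_003<TAB>my third long sentence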

Checking and feedback:

  • When creating a corpus, enhance the response to include, for each parallel text (matching textId), the following information:
    • Total number of lines in the files
    • Number of parallel segments successfully read in
    • Number of source-only segments successfully read in
    • Whether or not there are references in the text
    • StatusMessage: either "All lines successfully read in and associated" or an appropriate message based on the malformed inputs
  • The StatusMessages can be concatenated and placed in MongoDB, to be recalled when requesting info about the corpora.

@ddaspit Is this all accurate? Would anything else need to be said?

@Enkidu93
Collaborator

@johnml1135 I'm confused about the multiple-references element. So in the example data I've used, there are references like 40_1 or chapter_01_verse_001. My impression was that these each constituted a single reference, not a set {40, 1} or {chapter, 01, verse, 001}. When the references are used in the textId selection for pretranslation, they are referred to as a unit, right? I.e., 40_1. Is that just parsed as two refs that happen to only correspond to "40_1"? Wouldn't that imply an M:N relationship between source segments and target segments? I'm just a little confused. Where is this logic? Is this documented anywhere?

Otherwise, I'm on board with what you said, and I think my regex should be sufficient for validation (given the tests I've run on sample data). And I think the logic on the machine side already handles the missing refs issue.

@ddaspit
Contributor

ddaspit commented Aug 15, 2023

@johnml1135 As @Enkidu93 said, the underscore/dash does not indicate multiple references. It indicates multiple components of a single reference. It is what I refer to as a MultiKeyRef. The different components/keys are used for sorting purposes. For example, if you wanted to represent a verse ref that sorts correctly, you would construct JOHN 1:1 like this: 40JHN_01_001. The rest seems accurate.
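A rough illustration of why the components matter for sorting (hypothetical code; the actual MultiKeyRef implementation lives in SIL.Machine and may differ):

    using System;

    // Compare keys component by component, so that "40JHN_01_002"
    // sorts after "40JHN_01_001" and before "40JHN_02_001".
    static int CompareKeys(string a, string b)
    {
        string[] ka = a.Split('_');
        string[] kb = b.Split('_');
        for (int i = 0; i < Math.Min(ka.Length, kb.Length); i++)
        {
            int cmp = string.CompareOrdinal(ka[i], kb[i]);
            if (cmp != 0)
                return cmp;
        }
        return ka.Length.CompareTo(kb.Length);
    }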

@Enkidu93
Collaborator

Enkidu93 commented Sep 8, 2023

Hey, folks! I'm running out of MVP issues I can make progress on alone, so I'm moving back to this. I have basic validation working and have updated the docs (see https://github.com/sillsdev/serval/tree/handle_malformatted_files_%232). As for the additional edits @johnml1135 suggested (i.e., checking and feedback), I'm not sure how to go about incorporating those, since - as I understand it - aligning the source and target lines is done Machine-side. I understand the utility of including this information in the AddCorpus and GetCorpus replies, but I'm not sure how to cleanly include it without doing the same work twice (which seems like a bad move in many ways, especially for maintaining consistency) or creating a more complex system of calls between Serval and Machine (and complexity ∝ brokenness).

@Enkidu93
Collaborator

@johnml1135 Thoughts?

@johnml1135
Collaborator Author

@Enkidu93 and @ddaspit - how about doing the checking when actually building the engine? It would be in one place, and either (1) it would fail with an error message about a specific corpus, or (2) it would return "statistics" about the build process as a new field in the Mongo database (time to build, number of matching verses, number pretranslated - as a text output).

@ddaspit
Contributor

ddaspit commented Sep 12, 2023

I think it makes sense to let the engines validate the data. It is already the responsibility of the engine to load/parse the data. This does mean that different engines might validate the data differently. One engine might be more forgiving than another.

The engine model already has a corpusSize property that is filled in by the engine during building. We could add some additional stats if that is useful.

@ddaspit
Contributor

ddaspit commented Sep 12, 2023

I would say that at this point, we should probably only add new features if SF has a specific requirement for it.

@ddaspit
Contributor

ddaspit commented Sep 12, 2023

The Machine engine is already essentially validating the data when it parses it during a build and will return an error if it is unable to do so. We could add a strict parsing mode to the TextFileTextCorpus class by adding a flag to the constructor.
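Something along these lines, from the caller's perspective (hypothetical signature - the existing constructor does not take such a flag):

    // Hypothetical opt-in strict mode: throw a descriptive error on a
    // malformed line instead of silently skipping it or failing later
    // with "Object must be of type String."
    var corpus = new TextFileTextCorpus("corpora/src", strict: true);
    // e.g. throws: "MyFileWithOneTab.txt, line 1: unexpected trailing tab"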

@Enkidu93
Collaborator

> I think it makes sense to let the engines validate the data. It is already the responsibility of the engine to load/parse the data. This does mean that different engines might validate the data differently. One engine might be more forgiving than another.
>
> The engine model already has a corpusSize property that is filled in by the engine during building. We could add some additional stats if that is useful.

That makes sense to me too. The only complaint I would have is that, as a user, I now have to wait until building to see if the text files I uploaded are valid, so it may still make sense to inform the user if individual text files are invalid (not corpus-level issues, just file-level). I can imagine a situation where I upload 100s of files, and they're all malformed. I go through the process only to find at the build stage that there's an issue. But maybe this is better dealt with at the SF-level - I don't know.

> I would say that at this point, we should probably only add new features if SF has a specific requirement for it.

So do you think we should not pursue this? What do you mean by this?

> The Machine engine is already essentially validating the data when it parses it during a build and will return an error if it is unable to do so. We could add a strict parsing mode to the TextFileTextCorpus class by adding a flag to the constructor.

Sure, and the initial issue-posting acknowledges that, but isn't the point to make these errors more descriptive/fixable?

@ddaspit
Contributor

ddaspit commented Sep 12, 2023

> > I think it makes sense to let the engines validate the data. It is already the responsibility of the engine to load/parse the data. This does mean that different engines might validate the data differently. One engine might be more forgiving than another.
> >
> > The engine model already has a corpusSize property that is filled in by the engine during building. We could add some additional stats if that is useful.
>
> That makes sense to me too. The only complaint I would have is that, as a user, I now have to wait until building to see if the text files I uploaded are valid, so it may still make sense to inform the user if individual text files are invalid (not corpus-level issues, just file-level). I can imagine a situation where I upload 100s of files, and they're all malformed. I go through the process only to find at the build stage that there's an issue. But maybe this is better dealt with at the SF-level - I don't know.

Since Serval is a web service for applications and not users, I think we can safely push the responsibility for more user-friendly handling of malformed data to the application. Returning an error at build-time is probably sufficient at this point. Validating the data requires a background job of some type. We can't block a request while we validate a potentially large corpus. This would require a complex design/API that supports asynchronous processing and notification when it is finished validating. We don't know how big of an issue this will be for calling applications, so I would prefer to wait to implement something this complex until we have a specific requirement for it.

> > I would say that at this point, we should probably only add new features if SF has a specific requirement for it.
>
> So do you think we should not pursue this? What do you mean by this?

I think that adding additional statistics isn't necessary at this point. SF doesn't have a specific requirement for it, so we can defer that to a future release.

> > The Machine engine is already essentially validating the data when it parses it during a build and will return an error if it is unable to do so. We could add a strict parsing mode to the TextFileTextCorpus class by adding a flag to the constructor.
>
> Sure, and the initial issue-posting acknowledges that, but isn't the point to make these errors more descriptive/fixable?

I was saying that we could make the errors more descriptive and add stricter validation in the TextFileTextCorpus class. This would deal with the issue without having to make any changes to Serval or the Machine engine service.

@Enkidu93
Collaborator

> I was saying that we could make the errors more descriptive and add stricter validation in the TextFileTextCorpus class. This would deal with the issue without having to make any changes to Serval or the Machine engine service.

OK, I see. I'm realizing, though: is there really any corpus-level validation necessary? I suppose the only thing we might check is whether a corpus is all target content and no source content. Thoughts? What ought to be checked here? If it's the case that we don't need any corpus-level validation, then we can just move the validation I had worked on in Serval to Machine on a per-text basis and, like you said, hide it behind a constructor flag. (And if we need more specific error reporting (i.e., "unexpected token in " rather than just " is not formatted correctly"), we can do something iterative rather than just a regex validation.) How does that sound?

@johnml1135
Collaborator Author

Validation may be as simple as throwing a descriptive error if we are unable to process a file. It would be nice if we could identify the specific line that is failing and why. Beyond that, I don't think we need more validation.

@Enkidu93
Collaborator

OK, I'll go ahead then.

Enkidu93 added a commit that referenced this issue Sep 25, 2023
Fixes Bad corpora should be flagged #2
ddaspit pushed a commit that referenced this issue Jul 5, 2024
Add strict parsing

Fixes #2
johnml1135 added a commit that referenced this issue Jul 9, 2024