-
-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up inconsistent states in MongoDB #158
Comments
Here is a plan:
|
I don't think we can switch all running jobs to pending on Machine startup. For example, a job can be running just fine on ClearML even if Machine restarts. I would really like to avoid adding any gRPC endpoints to deal with this issue. It isn't the responsibility of Serval to deal with the inconsistencies. It should be the sole responsibility of engine. There are many ways that this issue could be dealt with depending on how the engine is implemented. I want to give freedom to the engine to handle the inconsistencies in the best way possible. |
I updated the original description to reference machine-job and to keep jobs running if it can. Only if that restarts should there really be an issue. As for another gRPC endpoint, Serval may not know that a job is complete (even though machine tried to reach it). As far as machine knows, the job is complete but when a user asks for the status, Serval doesn't check with Machine, but just returns it's incorrect status. Do you have another way of resolving or syncing these two data sources? |
The inconsistency can occur, because, when a job completes, Machine needs to update the database and Serval using gRPC. These two operations need to be atomic, but they aren't. This is a common issue for distributed systems. Luckily, there is a pattern to handle this issue, called the transactional outbox pattern. Basically, we perform the database update and write the gRPC message to a database outbox in a single transaction. There is a separate process that monitors the database outbox and actually sends the message to Serval. This guarantees that the message is sent eventually even if Serval is down. |
Yes, the transactional outbox looks a bit more elegant - I'll work on implementing it then instead of the other GPRC endpoint. |
I thought about using Hangfire a bit more and realized that it won't work. We want to be able to update the outbox and the model in the same transaction. If we use Hangfire, the outbox would be the Hangfire queue, which is stored in a separate database. So there is no way to update the Machine database and the Hangfire database in a single transaction. I think this means that we will need to implement our own transactional outbox. |
Here is a sample implementation of the transactional outbox pattern for .NET. |
* Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old
* Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Universal message incrementer * Add testing
* Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Universal message incrementer * Add testing
* Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Universal message incrementer * Add testing
* Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Make outbox truly generic, handling multiple queues * Ensure globally ordered outbox messages * Add "MoveAsync" to SharedStorage * Refactor saving pretranslations file
* retry hangfire jobs - #158 * Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Make outbox truly generic, handling multiple queues * Ensure globally ordered outbox messages * Add "MoveAsync" to SharedStorage * Refactor saving pretranslations file * correctly handle scoped services in background services * abstract file handling of content in outbox services * merge Id and Context in Outbox model * Consistently use strings for outbox message method identifiers * split up tests into true unit tests --------- Co-authored-by: Damien Daspit <damien_daspit@sil.org>
* retry hangfire jobs - sillsdev/machine#158 * Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Make outbox truly generic, handling multiple queues * Ensure globally ordered outbox messages * Add "MoveAsync" to SharedStorage * Refactor saving pretranslations file * correctly handle scoped services in background services * abstract file handling of content in outbox services * merge Id and Context in Outbox model * Consistently use strings for outbox message method identifiers * split up tests into true unit tests --------- Co-authored-by: Damien Daspit <damien_daspit@sil.org>
* Fix DistributedReaderWriterLockTests * Refactor timeout function * Add HostId to RWLock * Separate SMT training from engine runtime * Add separate job server * Add support for webhooks * Rename webhooks controller * Rename creation DTOs * Rename BuildRevision to ModelRevision * Release all distributed locks on startup * Give priority to first writer lock waiter * Fix EngineRuntimeTests * Fix SMT training * Remove "admin" url * Use filename if no name is provided when uploading file * Refactor corpus processing * Further simplify corpus processing * Refactor truecaser interface * Refactor truecaser interface * Add ability to only load certain texts * Include step in progress status * Add ICorpus and IParallelTextCorpus interfaces * Add NMT engine to web API * Add Corpus model * Add more documentation to API * Add separate job queues * Add parent models dir setting * Initial RestSharp testing. I can authenticate with the QA server and perform operations! * Add NMT engine to web API * Add Corpus model * Add more documentation to API * Add separate job queues * Add parent models dir setting * Format code using CSharpier * Format code using CSharpier * Add ClearMLNmtEngineRuntime and ClearMLNmtEngineBuildJob * Fix unit tests * Delete all old pretranslations * Switch default machine.py branch to "release" * Update confidence of NMT engine * Set correct metrics for NMT engine * Use script to start NMT job instead of repository * Make root ClearML project configurable * Fix unit tests * Make BuildService unit tests more reliable * Initial tests kind of working- they are still failing, but it is an issue with the underlying SMT model. Do I understand it correctly? Primary SpecFlow test functioning with Client and local API. * Fix a breaking change from SDK 6.0.1: https://docs.microsoft.com/en-us/dotnet/core/compatibility/sdk/6.0/duplicate-files-in-output * Make a clean switch to var/machine/corpora reorganize settings.json files Note that ASPNETCORE environment variables will not override the *.settings.json files because they are custom named and added after builder initialization. * Add bible.txt to testing Revert back to app.settings.json - by having 2 separate output folders (update k8s, docker, cs, etc.) Allow for large files across ingress (100m) Increase memory for k8s jobs and api * Better fix - the files should be app.settings.json * Implement removing a corpus from a translation engine - closes #27 * Refactor all engine runtime handling into a separate service * Add engine server * Add support for S3 shared files * Refactor ITranslationEngine and ITranslationModel interfaces * Parallelize TranslateBatch for ThotSmtModel * Move data access classes to a separate assembly * Refactor Machine Web API to Serval * Use AsyncEx library * Remove Serval projects * Fix unit tests * Update to csharpier 0.22.1 * Update all unit test projects to .NET 6 * Call platform service when job canceled while pending * Fix compile errors * Refactor engine services to be stateless * Add initial retry policy for gRPC client * Use IIdGenerator * Disable retry for UpdateBuildStatus * Throw exception if unable to cancel current build * Update to latest version of Serval API * Update to latest version of Serval.Grpc * Add health checks * Do not use transaction when starting a build * Refactor translation engines to use strings instead of tokens * Make files directory if not exist Throttle update status to not overwhelm mongo * Use latest version of Serval.Grpc * Correctly handle punctuation in suggestions * Update dependencies * fixing clearml authentication and wierd build bug. * Fix delete filter - needed to explicitly cast to a String * fixes and updates: * allow for choosing to save nmt model * choose nmt docker image * pretranslate all specified corpora, even if the actual translation exists. * fix for deleting a clearml project * Minor updates from code review * redo- authorization token - refresh only every hour. * Update ClearMLService to match Machine.py * Resolve reviewer feedback: * Change ClearML authentication token to IHostedService. * Add using singletons as per: https://stackoverflow.com/questions/52036998/how-do-i-get-a-reference-to-an-ihostedservice-via-dependency-injection-in-asp-ne * REvert the DeleteAsync lambda to a string - it is an error in MongoDB Client as per https://jira.mongodb.org/browse/CSHARP-4522. It should be updated in SIL.DataAccess, not patched here. * Fixes from review: * update ClearML auth from many comments * Update to SIL.DataAccess 0.3.1 to fix mongoDB bug for deleting Nmt engine. * Refactor ClearMLAuthenticationService into a BackgroundService * Convert language tag to NLLB language code - closes #37 * Update Serval.Grpc to 0.8.0 * Ensure that icu.net.dll.config is deployed with servers * #42 - fix Ubuntu 22x and libdl.so issue. * #42 - revert back to ubuntu 20 - This issue needs to be resolved in icu-dotnet as well - see sillsdev/icu-dotnet#186 * #42 - icu.net.dll.config was not copied to the output directory on publish * #42 further fix: - The filepath looking for icu.net.dll.config resolves before it is built, therefore, use the full path. Then it will copy. * Enable a non-root S3 bucket path - see #29 * Fix missing pretranslation refs * Update version to 3.0.0 * * #40 Get SMT job cancellation working * Subscribe to mongoDB for build being cancelled * #36 - RWLock - remove upsert, change lock creation logic * #42 - add BSON attributes * hot-load serval projects when available for debugging data access without building new code * minor fixes and project updates * #32 Fixes for NMT cancellation * Fix "get task by name" from ClearML * Use same sub for cancellation as from SMT - and refractor * Fix tests * Update SIL.dataAccess * Minor error-handling change (most was done on serval side) An attempt at an S3 bucket health check but having some issues with it logging --ECL * Working S3 Bucket Health Check --ECL * S3 bucket health check --ECL * ClearML Health Chech --ECL * Some fixes in response to PR review --ECL * Removed AutoToString --ECL * Changes as per PR review --ECL * Changes as per PR review --ECL * Fix to proj file --ECL * Fixes #46 I tried to keep as much of the original structure as possible while replacing any interfacing with the bucket with API calls. This allowed me to keep the existing in-memory implementation. If this seems janky (i.e. using Stowage and the Amazon SDK - although Stowage is not used for anything to do with Amazon directly), I'm happy to remove the FileStorage layer and have a direct implementation of the SharedFileService - shouldn't be too hard. A few comments have been left in for review. --ECL * Removed Stowage library --ECL * Removed Stowage and implemented & tested in-memory & local implementations (local disk tests are currently commented out; see comment on PR) --ECL * Fixes as per PR review --ECL * Fix after automerging --ECL * Fixes #82 --ECL (#57) Co-authored-by: Enkidu93 <lowryec17@gcc.edu> * #54 - fix race condition * #53 - don't pretranslate things already pretranslated. * Fix #55 and #84 - remove SubscribeForCancellation Remove subscribe for cancellation. What was wrong before, I don't know - but now it is sucessfully cancelling without the extra subscription. * Working S3 bucket uploading (removed bug with extra slash). Switched to high-level API. Can revert if needed. --ECL * #58 - centralize and make directory deleting more robust * Cleanup from review - #63 * #50 auto-restart hangfire * Revert "#50 auto-restart hangfire" This reverts commit c20e38c. * Fix configuration of translation Engines - fixes #67 * Remove unnecessary Mongo attributes * Simplify distributed lock creation * Removed cancellation token passing after line 99. Opted to explicitly pass CancellationToken.None rather than leaving default for consistency with other code and added clarity. Fixes Cancellation token - inconsistent SMT training state #62 --ECL * Fixes #95 --ECL * Fix - still fail job even with S3 bucket failures * Fixes ClearML server is down - and then hangfire crashes #65 --ECL * Simplified logging --ECL * Added lock to token refresh; changed logging format --ECL * Changed name to ...Async --ECL * Fixed typo; removed unused var --ECL * DataAccess - 0.5.0 * #75 - don't fail nmt job on delete failing * Fix from review #78 * Reverted to low-level implementation (removing need for large buffers); fixes Refine S3 bucket usage #61 Also removed synchronous write; fixes S3 remove sync write (async only) #64 Note: Dispose has async calls/logic because "await using" on the StreamWriters in ClearMLNMTBuildJob calls DisposeAsync on the StreamWriters but Dispose on the BaseStream. This is the only way I could find to circumvent this - other solutions are welcome given there's a workaround I'm missing. * Fixes Throw error for unbuilt engine #81 --ECL * Switch to Aborted --ECL * Decoupled engine service from gRPC --ECL * Added error message --ECL * Fix broken test --ECL * Fixes to test --ECL * Fixes Machine S3 health - warn on 1 missing, fail on >86 seconds down? #84 Change interval of health checks to 4 minutes. S3 pings up to three times before returning Unhealthy. (Also, removed redundant logging) --ECL * Added ClearML retry policy (this included checking to see if Sldr was initialized before initializing - not sure if this needs to be locked. Also, what is Sldr being used for?) Undid change to default HealthCheck interval Changed AWS retry policy to default values Changed S3HealthCheck behavior to return Degraded on initial failures before returning Unhealthy Other minor fixes --ECL * Remove debug logging --ECL * Fixes as per PR --ECL * Use ClearMLAuthentication service for token-fetching * Get token each CallAsync * Include status reason in build message * Added open telemetry instrumentation (#88) * Added open telemetry instrumentation * Added gRPC instrumentation & moved package references * Hide console exporter (#90) * Strict parsing #2 (#92) Add strict parsing Fixes #2 * Update ci.yml (#95) * Update ci.yml * add coverlet * Add badge * Fixes RefId's don't get populated for Paratext Projects #93 (#94) * Fixes RefId's don't get populated for Paratext Projects #93 If there is no target file, there are no target refs; in this case, use the source ref. * Damien's fix to ref logic (Note: minimally tested against serval main and verse refs are properly generated) * Added queue name to ClearML health check failure message (#98) * Update dataaccess to most recent version * Refactor ClearML NMT build job - add support for multiple build stages - add support for running build jobs on Hangfire or ClearML - add BuildJobService - categorize build jobs into CPU or GPU jobs - decouple build job runners from translation engines - fix issues with S3FileStorage - fix issues with ClearMLService * Add ClearML Project setting (#106) - create a ClearML project if it doesn't exist * Config passing via build options for post-nmt/hangfire refactor (#109) * Config passing via build options for post-nmt/hangfire refactor * Fix camel/snake case * Remove build options from args passed to machine.py when null * Pass cancellation token * Upgrade to Serval.GRPC 0.9.0 --------- Co-authored-by: John Lambert <john_lambert@sil.org> * Queue depth endpoint + status message (#110) * Support basic queue depth endpoint * Working queue depth endpoint and placement in queue registered in status message * Move queueDepth to build property and clean up logic * environment types (#111) * Addresses #178 * REname from summybuild to staging * Working percent completed (#112) * Use named HTTP clients for ClearML (#115) - move ClearML connection string to "ConnectionStrings" section - fixes #108 * Fixes #121 * Don't delete S3 bucket files if the build faulted so we can debug (#120) Don't cleanup files if job faulted so it can be recreate build * Fix corpus count methods (#121) - add support multiple columns in TextFileText.Count method - remove MissingRowsAllowed property - use standard Count implementation for ParallelTextCorpus * Make lang tag to NLLB lang code conversion more robust (#123) - properly handle languages with an implicit script - normalize Chinese subtags - convert macrolanguage subtag to corresponding standard language subtag * Fixes #118 * Improve default script code lookup (#125) - add LanguageTagService - use langtags.json to improve default script lookup - various code cleanup * Update language code parsing (#126) * Update some logic, flow and documentation of parsing out language codes * Respond to reviewer comments * Updates from Review * Fixes #128 * Only upload when count is greater than 0 * Fixes #202 * Serval.Grpc 0.11.0 * grpc 12 * Fixes #130 (#134) * Fixes #130 * Remove stack trace info * Fixes #133 (#135) * Fixes #133 * Use locks (also fix locking in S3) * Change front of queue to queueDepth = 1 * Fix null reference return error (#140) * Check for key terms and write aligned rows during preprocessing * Working native key terms functionality * Working key terms solution including: *Testing *Moving logic (as possible) out of the build job *Added logic to SMT *Stuck with buildOptions for configuring (after further conversation) Issues - Not as much abstraction as I'd like between the CorpusService, *BuildJobs, etc. * Redesign as per PR review * Minor fixes from PR review * Change SMT default behavior as well * Fix dictionary logic * More PR reviews fixes * Added testing * More PR review fixes * Fix key term ids logic * Remove null check * Fixed test data * Removed unecessary class from Serval and reworked Machine code using it * Change using's order * Revert to using CustomEnumConverter * Build log output (#155) * Log beginning of build log that can be built into metrics. * Don't crash if the build options are bad. * Refactor and fix naming. * Updates as per conversation Adding resolved codes * Passed shared file folder to ClearML NMT job (#157) * Drive health (#156) * Initial stgab at improved health checking * upgrade to net8.0 * Add health check endpoint to GRPC engine proto (from serval) * Combine health reports into rich data * Clean up fixes * Clean up ClearML authentication error reporting. * Update danger level to 1GB hard drive space left. * Remove null forgiving operators. * Fix build error * Updates from review * reviewer comments. * Update serval GRPC interface * Update to latest version of csharpier * Language endpoint (#159) * Add NLLB-200 language code checking support Add SMT stub support Reviewer Comments IsSupportedNatively InternalCode UpdatedName Optional Parameters for language info Minor fixes reviewer comments Updates from review comments Reviewer comments * fix Korean script for NLLB: #290 * Update GRPC from Serval. * Fix tests * Fix formatting. * Cancel build gRPC endpoint returns Aborted when no build is running (#165) - fixes #162 * Added chapter-level filtering (#161) * Added chapter-level filtering; fixes #150 * Move scripture range parsing to Serval * Changes as per review comments * Count() to Count * Revert "Added chapter-level filtering (#161)" This reverts commit 5f80d82. * Chapter level aspnetcore (#167) * Added chapter-level filtering; fixes #150 * Move logic to parallel text corpus * Nmt download (#164) * Presigned URL code cleaning script IsModelPersisted nullable Return IsModelPersistedState to Serval Check for modelRevision + 1 in cleanup and just delete without keeping internal states. * Reviewer comments * update to most recent api * Try 2 for editor config: (#171) * Try 2 for editor config: * Namespace always file level * Always primary constructor * Inline null checks * Enforce naming * Fix spelling mistakes * simplify default initializations * Add static and readonly where valid * Remove unnecessary usings. * No Functional change * Change private static to PascalCase * Add if multiline brackets. Align editorconfig with serval. * TreatWarningsAsErrors * reviewer comments * Refactor models to records (#172) * Better error handling for download endpoint (#173) * Better error handling for download endpoint In e2e test, grab current machine.py version * Fix from reviewer comment * fix e2e test * fix #332 - only require fields used (#175) * Remove required property from optional ClearMLTask attributes * Add support for mixed source corpora (#177) - align all Scripture corpora to Original versification * Fix some logic found during e2e testing (#178) Fix CI * parenthesis for order of operations || && * master, not main * default to major biblical terms (#180) * default to major biblical terms - #357 Update tests - target-1 does not include BiblicalTermsListSetting to invoke default. * fix * Add support for non-verse text segments in Scripture corpora (#179) - add new ScriptureRef corpus ref class - update Scripture corpora classes to use ScriptureRef - add ScriptureRefUsfmParserHandlerBase class to track ScriptureRef in USFM - update UsfmTextUpdater and UsfmTextBase to use ScriptureRefUsfmParserHandlerBase - add support for updating non-Scripture paragraphs and notes - update NmtPreprocessBuildJob to support non-Scripture segments Co-authored-by: John Lambert <john_lambert@sil.org> * Update to fix vulnerability. (#186) * Test coverage (#182) * run test coverage locally * More coverage of ScriptureRef and ScriptureElement. * Reviewer Comments * Updates from reviewer comments * Convert to MongoDB callback API (#184) * Convert to MongoDB callback API #314 * Update to newest data access layer * Allow for write commands that are greater than 5MB. (#188) * Control the pretranslation of existing text (#189) * Control the pretranslation of existing text: #370 * Fix tests * Serval was down until langtags.json was added back. (#194) * Serval was down until langtags.json was added back. Add test to force download. * Reviwer comments * Add logic to preprocess * Add test * Throw appropriate error; clarify langTag logic * Skip parsing excluded books when preprocessing (#207) - fixes #400 * Clarify and fix logic for textId and Chapters. (#209) * Smt on clearml (#200) * SMT on ClearML * Replace CPU, GPU types with just Hangfire vs ClearML as well as engine type * Allow each engine type to have it's own queue and docker image * SMT build defaults on ClearML * NMT local train removed * Download and upload model in factory using tar.gz and the build directory * Preserve changes from sillsdev/machine#205. This reverts commit cf5f45393b4e86b2813ca1beb2c915d88539d159. --------- Co-authored-by: Damien Daspit <damien_daspit@sil.org> * Retry hangfire jobs (#197) * retry hangfire jobs - sillsdev/machine#158 * Fixed auto-retry as per this forum post: https://discuss.hangfire.io/t/recurring-jobs-do-not-automatically-get-retried-after-application-crash-net-core-service/9160 * MongoDB can't handle documents greater than 16MB * Treat messages from one id as a group * Kill failing messages over 4 days old * Make outbox truly generic, handling multiple queues * Ensure globally ordered outbox messages * Add "MoveAsync" to SharedStorage * Refactor saving pretranslations file * correctly handle scoped services in background services * abstract file handling of content in outbox services * merge Id and Context in Outbox model * Consistently use strings for outbox message method identifiers * split up tests into true unit tests --------- Co-authored-by: Damien Daspit <damien_daspit@sil.org> * Fix code style issues * Rename Machine projects * Fix error when loading icu.net * remove unneeded file --------- Co-authored-by: john lambert <john_lambert@sil.org> Co-authored-by: Enkidu93 <lowryec17@gcc.edu> Co-authored-by: Eli C. Lowry <83078660+Enkidu93@users.noreply.github.com>
MongoDB or the job server can crash at any time. We should have a periodic job to clean it up. This is really about build states getting stuck.
The text was updated successfully, but these errors were encountered: