fix(saga): Handle concurrency issue when same op is sent more than once #4229
We experienced an issue with sagas where Orca submitted the same operation to clouddriver twice, a few milliseconds apart. The second request failed with a 500 caused by a SQL integrity constraint violation, even though the other clouddriver instance was already handling the work successfully.
Instead, the second instance should return a reference to the original task, so that Orca does not have to reason about its duplicate submission and can simply carry on monitoring the operation already in progress.
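The intended behaviour can be sketched as an idempotent "insert or return existing" submission. This is a minimal illustration only, not clouddriver's implementation: it assumes the shared SQL store has a uniqueness constraint on some per-operation request identifier, and it uses an in-memory `sqlite3` database as a stand-in. The `tasks` table, the `request_id` column, and the `create_schema` / `submit_operation` helpers are all hypothetical names.

```python
import sqlite3
import uuid


def create_schema(conn):
    # Hypothetical schema: one row per submitted operation. The UNIQUE
    # constraint on request_id lets the database arbitrate between two
    # instances racing to create a task for the same operation.
    conn.execute(
        "CREATE TABLE tasks ("
        " id TEXT PRIMARY KEY,"
        " request_id TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL)"
    )


def submit_operation(conn, request_id):
    """Create a task for request_id, or return the id of the task that a
    concurrent submission of the same operation already created."""
    task_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back if the INSERT raises
            conn.execute(
                "INSERT INTO tasks (id, request_id, status) VALUES (?, ?, ?)",
                (task_id, request_id, "STARTED"),
            )
        return task_id
    except sqlite3.IntegrityError:
        # Another submission won the race: point the caller at the
        # existing task instead of surfacing a 500.
        row = conn.execute(
            "SELECT id FROM tasks WHERE request_id = ?", (request_id,)
        ).fetchone()
        return row[0]
```

Submitting the same request id twice yields the same task id and only one stored row, so a caller like Orca can keep monitoring that task. In a real deployment the two calls would come from different instances sharing one database; the constraint-plus-lookup pattern is the same.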
Also changed how long Kato Tasks are retained. One hour is too short to be notified of an error and diagnose it. Four days seems a reasonable default: if an error occurs at the end of the week or over the weekend, there is still enough time to gather information the following week.
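To make the retention trade-off concrete, here is a small sketch of an age-based cleanup check. The `TASK_RETENTION` constant and `is_expired` helper are hypothetical; the PR does not name the actual clouddriver setting, so this only illustrates the arithmetic of a 4-day window versus the old 1-hour one.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical constant mirroring the new 4-day default described above.
TASK_RETENTION = timedelta(days=4)


def is_expired(completed_at, now, retention=TASK_RETENTION):
    """Return True if a task completed at completed_at is older than the
    retention window at time now and may therefore be cleaned up."""
    return now - completed_at > retention
```

For example, a task that fails on Friday 5 January 2024 at 18:00 UTC is still retained on Monday 8 January at 09:00 (63 hours later) under a 4-day window, whereas a 1-hour window would already have discarded it.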