Skip to content

Conversation

@Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Mar 21, 2025

What changed?

Why?

  • verifying if the versioning feature can handle concurrent requests
  • desired behavior: back-to-back setCurrent/ramping requests with different inputs should not throw weird errors, only throttling or timeout, or successful result.

How did you test it?

  • Ran this suite locally and CI

Potential risks

Documentation

Is hotfix candidate?

  • No

@Shivs11 Shivs11 requested a review from a team as a code owner March 21, 2025 18:46
})
}

func (s *WorkerDeploymentSuite) TestSetCurrent_NoDeploymentVersion() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just moved this test to an area which had all other setCurrent tests - no need to review this.

@carlydf carlydf changed the title concurrent setCurrent/setRamping requests test concurrent setCurrent/setRamping requests Mar 24, 2025
Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had a chance to test these new tests for flakiness? (ie. by running it through the CI action that runs it 50x in cassandra or something)

Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding of the ticket, I think that these tests should not override the MaxInFlightUpdates -- we should test against the default because that's what customers will have. Ideally that will only cause throttling or timeout errors and nothing "weird." If any other error type comes up, that's a bug that should be addressed.


for i := 0; i < versions; i++ {
err := <-errChan
s.NoError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of these requests might be expected to get throttling error or context timeout error. If you decrease the # of WorkflowExecutionMaxInFlightUpdates to the default (which is 10), do you see anything weird, or just throttling + timeout errors?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of s.NoError you could keep the rest of this test the same, and just assert that the error is either:

  • nil
  • timeout
  • throttling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just bringing this up since the ticket said "confirm that concurrent updates with different inputs don't result in any weird errors, only throttling or timeout or success"

Copy link
Member Author

@Shivs11 Shivs11 Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there is one important point to consider - when a user sends in a setCurrent request (for example), there are going to be 2 in-flight updates (inflight updates are those updates that have been admitted but not completed). For each set request (1 in-flight update since it's not completed), we have a sync call request (another in-flight update). Thus, for 10 setCurrent requests, we have to have the limit set to 20. So, setting the limit to 10 shall give you ResourceExhaustedErrors.

Looking at the ticket again, I took a step back and tried understanding what it is asking us to test - It seems like the main motivation behind the ticket is that when we do send in 10 concurrent requests, we should not get any "weird" errors - however, I think we should slightly change what the ticket is asking to achieve it's main purpose.

We should be testing that given enough time for our system to process these 10 concurrent updates with the system being ready to handle these 10 concurrent updates, does anything weird at all happen? This makes it easy to catch for things we wanted to catch from the get-go: any weird errors.

wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I do see the other side of things now too - this ticket could just be asking how the system reacts to these large number of concurrent requests if we don't change the original state of the system! (like increasing the number of concurrent updates)

In that case, a user should not be seeing any of those "weird" errors except ResourceExhaustedErrors or Timeout.

I think this is what you had meant earlier

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it the same way as you put it in your second comment! Are you on board with writing the test that way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on board and have written them like this! also tested each one of them in the CI and they all pass.

}

func (s *WorkerDeploymentSuite) TestSetCurrentVersion_Concurrent_DifferentVersions() {
s.OverrideDynamicConfig(dynamicconfig.WorkflowExecutionMaxInFlightUpdates, 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should use the default value for this (which is 10). What do you think?

}

func (s *WorkerDeploymentSuite) TestSetCurrentVersion_Concurrent_SameVersion() {
s.OverrideDynamicConfig(dynamicconfig.WorkflowExecutionMaxInFlightUpdates, 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, we should be testing this with default config


for i := 0; i < 10; i++ {
err := <-errChan
s.NoError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, throttling or timeout is acceptable

@codecov
Copy link

codecov bot commented Mar 25, 2025

❌ We are unable to process any of the uploaded JUnit XML files. Please ensure your files are in the right format.

Copy link
Contributor

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth stating that your override is simply ensuring that the MaxUpdates value is the default for these tests! Other than that looks great

@Shivs11 Shivs11 merged commit 08c0bee into main Mar 25, 2025
50 checks passed
@Shivs11 Shivs11 deleted the ss/versioning-concurrent-safety-net branch March 25, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants