Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add repeat for wait_for_rebalance #6578

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
31 changes: 23 additions & 8 deletions ocs_ci/ocs/cluster.py
Expand Up @@ -757,24 +757,39 @@ def get_rebalance_status(self):
and states["count"] == total_pg_count
)

def wait_for_rebalance(self, timeout=600):
def wait_for_rebalance(self, timeout=600, repeat=3):
"""
Wait for re-balance to complete

Args:
timeout (int): Time to wait for the completion of re-balance
repeat (int): How many times to repeat the check to make sure, it's
really completed.

Returns:
bool: True if rebalance completed, False otherwise
bool: True if re-balance completed, False otherwise

"""
try:
for rebalance in TimeoutSampler(
timeout=timeout, sleep=10, func=self.get_rebalance_status
):
if rebalance:
logger.info("Re-balance is completed")
return True
start_time = time.time()
for attempt in range(1, repeat + 1):
diff_time = int(time.time() - start_time)
logger.debug(f"Attempt {attempt} out of {repeat} repeats.")
for rebalance in TimeoutSampler(
timeout=timeout - diff_time,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petr-balogh , I see that timeout value could become -ve at some point where diff_time grows larger during repeats. This could lead to quick successive attempts and TimeoutSampler exception. WDYT ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petr-balogh for ex:
let's say timeout=600 and in the first iteration diff_time will be very small because we started just now.
Hence timeout = ~600, assume that rebalance takes 600 seconds and finsihes successfully. since repeat =3 we will have 2 more attempts to be done even though cluster is clean at this point.

In the second iteration of outermost for loop where repeat=2, diff_time will be > 600 as we have already spent 600 seconds in the previous attempt, now timeout will be < 0 (some -ve value). Its because of this line

if self.timeout <= (self.last_sample_time - self.start_time):
, TimeoutSampler raises exception even before calling the get_rebalance_status and then wait_for_rebalance will fail even though there was no error.

Am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional @shylesh . As new behavior is that I would like to have the call of making really sure re-balance is complete don X many times as I have set the repeat value for in the mentioned timeout, in default value 600.

So in 600 seconds my function needs to prove rebalance completed 3 times.

So if it will take for first attempt more than 580 seconds, then it will not complete probably to do it 3 times and should fail which is intentional.

If user wants only one attempt, they need reduce repeat to 1.

sleep=10,
func=self.get_rebalance_status,
):
if rebalance:
logger.info(
f"Attempt {attempt} for re-balance out of {repeat} is completed"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petr-balogh ,This message needs more clarity that we are reattempting rebalance check even though previously we have seen clean status. Currently it sounds as if in the previous attempt cluster was still rebalancing where as that's not the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mentioning that re banalance is completed and the number of the attempts and also out of how many I am going to perform. How you would like to re-phrase this?

)
if repeat == attempt:
return True
else:
logger.debug("Wait 10 seconds between next attempt")
time.sleep(10)
break
except exceptions.TimeoutExpiredError:
logger.error(
f"Data re-balance failed to complete within the given "
Expand Down