New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add repeat for wait_for_rebalance #6578
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -757,24 +757,39 @@ def get_rebalance_status(self): | |
| and states["count"] == total_pg_count | ||
| ) | ||
|
|
||
| def wait_for_rebalance(self, timeout=600): | ||
| def wait_for_rebalance(self, timeout=600, repeat=3): | ||
| """ | ||
| Wait for re-balance to complete | ||
|
|
||
| Args: | ||
| timeout (int): Time to wait for the completion of re-balance | ||
| repeat (int): How many times to repeat the check to make sure, it's | ||
| really completed. | ||
|
|
||
| Returns: | ||
| bool: True if rebalance completed, False otherwise | ||
| bool: True if re-balance completed, False otherwise | ||
|
|
||
| """ | ||
| try: | ||
| for rebalance in TimeoutSampler( | ||
| timeout=timeout, sleep=10, func=self.get_rebalance_status | ||
| ): | ||
| if rebalance: | ||
| logger.info("Re-balance is completed") | ||
| return True | ||
| start_time = time.time() | ||
| for attempt in range(1, repeat + 1): | ||
| diff_time = int(time.time() - start_time) | ||
| logger.debug(f"Attempt {attempt} out of {repeat} repeats.") | ||
| for rebalance in TimeoutSampler( | ||
| timeout=timeout - diff_time, | ||
| sleep=10, | ||
| func=self.get_rebalance_status, | ||
| ): | ||
| if rebalance: | ||
| logger.info( | ||
| f"Attempt {attempt} for re-balance out of {repeat} is completed" | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @petr-balogh ,This message needs more clarity that we are reattempting rebalance check even though previously we have seen clean status. Currently it sounds as if in the previous attempt cluster was still rebalancing where as that's not the case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am mentioning that re banalance is completed and the number of the attempts and also out of how many I am going to perform. How you would like to re-phrase this? |
||
| ) | ||
| if repeat == attempt: | ||
| return True | ||
| else: | ||
| logger.debug("Wait 10 seconds between next attempt") | ||
| time.sleep(10) | ||
| break | ||
| except exceptions.TimeoutExpiredError: | ||
| logger.error( | ||
| f"Data re-balance failed to complete within the given " | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@petr-balogh , I see that timeout value could become -ve at some point where diff_time grows larger during repeats. This could lead to quick successive attempts and TimeoutSampler exception. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@petr-balogh for ex:
let's say timeout=600 and in the first iteration diff_time will be very small because we started just now.
Hence timeout = ~600, assume that rebalance takes 600 seconds and finsihes successfully. since repeat =3 we will have 2 more attempts to be done even though cluster is clean at this point.
In the second iteration of outermost for loop where repeat=2, diff_time will be > 600 as we have already spent 600 seconds in the previous attempt, now timeout will be < 0 (some -ve value). Its because of this line
ocs-ci/ocs_ci/utility/utils.py
Line 938 in 6ecc96c
get_rebalance_statusand thenwait_for_rebalancewill fail even though there was no error.Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional @shylesh . As new behavior is that I would like to have the call of making really sure re-balance is complete don X many times as I have set the
repeatvalue for in the mentioned timeout, in default value 600.So in 600 seconds my function needs to prove rebalance completed 3 times.
So if it will take for first attempt more than 580 seconds, then it will not complete probably to do it 3 times and should fail which is intentional.
If user wants only one attempt, they need reduce repeat to 1.