Process: fix found position bug with wait_output delta_only #3638

Open · wants to merge 1 commit into main

Conversation

@mcgov (Collaborator) commented Feb 7, 2025

Fixes: 603bd27

If a buffer has length N and the matched message 0 does not sit at the very end of the buffer, i.e. at buffer[N-len(message0):N], the current code can miss messages on future calls to wait_output.

wait_output currently stores the total length of the buffer as the search buffer offset, even when the keyword is found earlier in the buffer.

Consider a buffer of 100 characters and a keyword of length 10 found at position 50. The old code records 100 as the last offset. So if your next keyword has length 10 and sits at position 70, it is lost if you call wait_output with delta_only again.

This commit fixes that issue by saving the end position of the slice buffer[start:end] that contains the matched keyword.

If no keyword was found, it still saves the length of the buffer.

Signed-off-by: Matthew G McGovern <mamcgove@microsoft.com>
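
To make the worked example concrete, here is a minimal sketch of the offset bookkeeping in plain Python. Only log_buffer_offset matches a name from the snippet reviewed below; the rest is illustrative, and the real wait_output also blocks until a timeout.

# Minimal sketch of the worked example above; illustrative only, not the real wait_output.
buffer = "x" * 50 + "keyword_01" + "x" * 10 + "keyword_02" + "x" * 20  # 100 characters
offset = 0

# First delta_only search: "keyword_01" starts at position 50 and ends at 60.
found_pos = buffer.find("keyword_01", offset)
old_offset = len(buffer)                    # old behavior: record the full buffer length (100)
new_offset = found_pos + len("keyword_01")  # fixed behavior: record the end of the match (60)

# Second delta_only search: "keyword_02" starts at position 70.
assert buffer.find("keyword_02", old_offset) == -1  # missed with the old offset
assert buffer.find("keyword_02", new_offset) == 70  # found with the fix
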
@mcgov added the 🐞 Bug (Something isn't working) label on Feb 7, 2025
@mcgov marked this pull request as ready for review on February 7, 2025 17:10
if found_pos >= 0:
    # note: find() returns the start index for the slice search_buffer[start:end],
    # so calculate the end, found_pos + len(keyword), as the new offset
    self.log_buffer_offset = found_pos + len(keyword)
@squirrelsc (Member) commented Feb 7, 2025:

It should be the length of the whole log_buffer. Searching multiple times with the same string isn't an intended scenario.

@mcgov (Collaborator, Author):

I explained the bug to you earlier and you said it was a good point... What changed? This isn't about searching for the same string twice; it's about searching more than once. You will lose any log messages that follow the first match. There's no guarantee you'll get one line at a time. And delta_only means the ordering is important and there may be duplicate messages in the logs. So why assume there's nothing important in the rest of the buffer...

I'm not just digging around looking for random bugs to open; I ran into this bug while working on a test case.

@squirrelsc (Member):

I thought the bug was that self.log_buffer.getvalue() is called multiple times, so it doesn't actually present the same string each time. That may cause some problems with multiple threads.

But this place is by design. It is meant to find a signal to take action on, not to find and match strings for other scenarios, because it's expensive to pull the string out of the buffer. The use cases look like: 1) start a process, 2) block waiting for the process to print "started", 3) execute the next step.

@mcgov (Collaborator, Author) commented Feb 10, 2025:

> I thought the bug was that self.log_buffer.getvalue() is called multiple times, so it doesn't actually present the same string each time. That may cause some problems with multiple threads.

Oh right, I guess there were two bugs. Even better!

> Because it's expensive to pull the string out of the buffer.

Using delta_only to look for an ordered series of messages makes a single pass through the buffer. I don't think this is any slower than making one or more passes later. I'm also pretty sure this function searched the entire buffer on every call to wait_output before delta_only was added...

> But this place is by design. It is meant to find a signal to take action on, not to find and match strings for other scenarios.

What if you have to check that events happened in a specific order AND take action afterwards...
  

dmesg = node.tools[Dmesg]
vf_info = node.nics.get_secondary_nic()
primary = node.execute_async("program_name")

# wait for start
primary.wait_output("started")

# toggle sriov
node.features[NetworkInterface].switch_sriov(enable=False)
primary.wait_output("hotplug: remove", delta_only=True)
node.features[NetworkInterface].switch_sriov(enable=True)

# wait for hotplug event
primary.wait_output("hotplug: add", delta_only=True)
dmesg.check_for_kernel_errors()  # check for dump immediately after hotplug

# start a secondary process (here's where things get interesting)
secondary = node.execute_async("program_name --process_type=secondary")
secondary.wait_output("started")

# check info is sent to the secondary
primary.wait_output(f"secondary process: VF started {vf_info.name}", delta_only=True)
# check info is received by the secondary
secondary.wait_output(f"received info from primary: VF {vf_info.name}", delta_only=True)

# check the secondary sent a message to the primary after start
secondary.wait_output("sent primary info: ...", delta_only=True)
# check the second bit of info after the secondary VF start
primary.wait_output("received secondary info:", delta_only=True)

dmesg.check_for_kernel_errors()  # check for dump after secondary start

...  # send traffic from one nic to the other, do some test stuff.

primary_result = primary.wait_result(expected_exit_code=0, ...)
secondary_result = secondary.wait_result(expected_exit_code=0, ...)

for match in some_regex.finditer(primary_result.stdout):
    ...  # do output parsing for things like perf numbers and packet counts
         # where order does not matter

Execute, then wait for output. Switch on a feature, wait for output. Check that thing A happened in process A after that feature was enabled. Check that thing B happened in process B after thing A happened.

Why is this bad?

@squirrelsc (Member):

First, the example above doesn't really convince me it's needed. The waits on primary/secondary are not actually ordered, because they run in parallel. Leaving only the last wait is enough:
primary.wait_output("received secondary info:", delta_only=True)

Anyway, I also discussed a better design with Paul to support both scenarios:

  1. Add an argument called offset_to_end. The default value is True, because in most scenarios a process shouldn't be waited on twice. If it's needed, set it to False, but I haven't seen any case that needs it yet.
  2. Use offset_to_end at L515: when it's True, set the offset to the length of the temp buffer; when it's False, follow the current implementation (a rough sketch follows below).
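
A rough sketch of that design, assuming the existing found_pos / log_buffer_offset bookkeeping; offset_to_end is the name from the comment above, everything else is illustrative and the blocking/timeout loop is omitted.

# Illustrative sketch only; the real wait_output loops until a timeout.
def wait_output(self, keyword: str, delta_only: bool = False, offset_to_end: bool = True) -> None:
    buffer = self.log_buffer.getvalue()
    start = self.log_buffer_offset if delta_only else 0
    found_pos = buffer.find(keyword, start)
    if found_pos >= 0 and not offset_to_end:
        # opt-in: the next delta_only search starts right after this match
        self.log_buffer_offset = found_pos + len(keyword)
    else:
        # default: the next delta_only search starts after everything seen so far
        self.log_buffer_offset = len(buffer)
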

@mcgov (Collaborator, Author):

ok

@mcgov (Collaborator, Author):

I just really think the bug here is that delta_only doesn't look at only the delta. The concept of what the function was 'for' wasn't clear to me based on the name and my expectation of what would be useful. I called this a 'bug fix' because I thought the delta_only flag would do what the name implies it should: look at the delta from the last thing we matched.

The problem is the name doesn't accurately describe the behavior. So maybe we make this explicit for the caller:

class StartAt(Enum):
    EndOfLastMatch = "last_keyword"
    EndOfLastBuffer = "last_buffer"

wait_output(keyword: str, match_type: Optional[StartAt] = None)

# So then callers can use it like this:
process.wait_output(keyword="Some log line", match_type=StartAt.EndOfLastMatch)

@squirrelsc (Member):

OK, it can be changed like this, but it needs to save two positions, last_end and last_matched. It needs to support searching from the beginning too. But don't overuse it: the cost of this method is creating the string from the buffer, not the searching itself. So if a string is already available, searching in it multiple times is much cheaper than calling getvalue on the buffer multiple times. Use it for async signals only, not for logic.

So, it looks like:

class StartAt(Enum):
    Beginning = "beginning"
    LastMatch = "last_match"
    LastEnd = "last_end"

# Set the default to the most useful `LastEnd`.
wait_output(keyword: str, start_at: Optional[StartAt] = StartAt.LastEnd)

# So then callers can use it like this:
process.wait_output(keyword="Some log line", start_at=StartAt.LastMatch)
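
A sketch of the bookkeeping that implies, keeping both saved positions; the names below are illustrative, not a final API.

# Illustrative only: keep both positions so any StartAt mode can be served.
from enum import Enum

class StartAt(Enum):
    Beginning = "beginning"
    LastMatch = "last_match"
    LastEnd = "last_end"

class _SearchState:
    def __init__(self) -> None:
        self.last_end = 0    # buffer length at the end of the previous search
        self.last_match = 0  # end position of the previously matched keyword

    def search(self, buffer: str, keyword: str, start_at: StartAt = StartAt.LastEnd) -> int:
        starts = {
            StartAt.Beginning: 0,
            StartAt.LastMatch: self.last_match,
            StartAt.LastEnd: self.last_end,
        }
        found_pos = buffer.find(keyword, starts[start_at])
        if found_pos >= 0:
            self.last_match = found_pos + len(keyword)
        self.last_end = len(buffer)
        return found_pos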
