Process: fix found position bug with wait_output delta_only #3638

Open · wants to merge 1 commit into base: main
16 changes: 12 additions & 4 deletions — lisa/util/process.py

@@ -498,21 +498,29 @@ def wait_output(
```diff
         # check if stdout buffers contain the string "keyword" to determine if
         # it is running
         start_time = time.time()
+        searched_len = 0
         while time.time() - start_time < timeout:
             # LogWriter only flushes if "\n" is written, so we need to flush
             # manually.
             self._stdout_writer.flush()
             self._stderr_writer.flush()

             # check if buffer contains the keyword
+            search_buffer = self.log_buffer.getvalue()
             find_pos = self.log_buffer_offset if delta_only else 0
-            if self.log_buffer.getvalue().find(keyword, find_pos) >= 0:
-                self.log_buffer_offset = len(self.log_buffer.getvalue())
+            found_pos = search_buffer.find(keyword, find_pos)
+            if found_pos >= 0:
+                # note: find() returns the start index of the match, so the
+                # new offset is the end of the match: found_pos + len(keyword)
+                self.log_buffer_offset = found_pos + len(keyword)
```
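As an editorial aside, the offset arithmetic in the new branch can be checked with plain `io.StringIO` and `str.find` (a standalone sketch; the "svc" messages are made up and no LISA types are involved):

```python
import io

# Mirrors the patched loop: log_buffer is an io.StringIO, and the
# delta_only offset advances to the END of each match (start + len).
log_buffer = io.StringIO()
log_buffer.write("svc: started\nsvc: ready\n")

search_buffer = log_buffer.getvalue()
keyword = "svc: started"
found_pos = search_buffer.find(keyword, 0)
assert found_pos >= 0

# New offset per the fix: found_pos + len(keyword).
log_buffer_offset = found_pos + len(keyword)

# The next delta_only search starts right after the previous match,
# so "svc: ready" (which arrived in the same flush) is still visible.
assert search_buffer.find("svc: ready", log_buffer_offset) >= 0
```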
@squirrelsc (Member) commented on Feb 7, 2025:

It should be the length of the whole log_buffer. Searching multiple times for the same string is not a supported scenario.

@mcgov (Collaborator, Author) replied:

I explained the bug to you earlier and you said it was a good point... What changed? This isn't about searching for the same string twice; it's about searching more than once. You will lose any log messages that follow the first match. There's no guarantee you'll get one line at a time, and delta_only means the ordering is important and there may be duplicate messages in the logs. So why assume there's nothing important in the rest of the buffer?

I'm not just digging around looking for random bugs to open; I ran into this bug while working on a test case.
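The failure mode described here can be reproduced with plain strings (hypothetical hotplug log lines, mirroring the example later in the thread):

```python
# Two keywords arrive in the same flush; the old code's offset update
# (length of the whole buffer) makes the second delta_only search
# miss "hotplug: add".
log = "hotplug: remove\nhotplug: add\n"

# Old behavior: after matching "remove", the offset jumps to end of buffer.
old_offset = len(log)
assert log.find("hotplug: add", old_offset) == -1  # message lost

# Fixed behavior: the offset advances only past the matched keyword.
new_offset = log.find("hotplug: remove") + len("hotplug: remove")
assert log.find("hotplug: add", new_offset) >= 0  # still found
```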

@squirrelsc (Member) replied:

I thought the bug was that self.log_buffer.getvalue() is called multiple times, so it doesn't always present the same string; that could cause problems across multiple threads.

But this place is by design: it's for finding a signal to act on, not for finding and matching strings in other scenarios, because pulling the string out of the buffer is expensive. The use case is: 1) start a process, 2) block waiting for the process to print "started", 3) execute the next step.

@mcgov (Collaborator, Author) commented on Feb 10, 2025:

> I thought the bug is that the self.log_buffer.getvalue() is called multiple times, so actually it doesn't present the same string. It may cause some problems in multi-threads.

Oh right, I guess there were two bugs. Even better!

> Because it's low performance to get string out from a buffer.

Using delta_only to look for an ordered series of messages makes a single pass through the buffer. I don't think that will be any slower than making one or more passes later. I'm also pretty sure this function searched the entire buffer on every call to wait_output before delta_only was added...

> But this place is by design. This is to find a signal to take action, not to find and match strings for other scenarios.

What if you have to check that events happened in a specific order AND take action afterwards?
  

```python
dmesg = node.tools[Dmesg]
vf_info = node.nics.get_secondary_nic()
primary = node.execute_async("program_name")

# wait for start
primary.wait_output("started")

# toggle sriov
node.features[NetworkInterface].switch_sriov(enable=False)
primary.wait_output("hotplug: remove", delta_only=True)
node.features[NetworkInterface].switch_sriov(enable=True)

# wait for hotplug event
primary.wait_output("hotplug: add", delta_only=True)
dmesg.check_for_kernel_errors()  # check for dump immediately after hotplug

# start a secondary process (here's where things get interesting)
secondary = node.execute_async("program_name --process_type=secondary")
secondary.wait_output("started")

# check info is sent to the secondary
primary.wait_output(f"secondary process: VF started {vf_info.name}", delta_only=True)
# check info is received by the secondary
secondary.wait_output(f"received info from primary: VF {vf_info.name}", delta_only=True)

# check secondary sent a message to primary after start
secondary.wait_output("sent primary info: ...", delta_only=True)
# check second bit of info after secondary VF start
primary.wait_output("received secondary info:", delta_only=True)

dmesg.check_for_kernel_errors()  # check for dump after secondary start

...  # send traffic from one nic to the other, do some test stuff

primary_result = primary.wait_result(expected_exit_code=0, ...)
secondary_result = secondary.wait_result(expected_exit_code=0, ...)

for match in some_regex.finditer(primary_result.stdout):
    ...  # parse output for things like perf numbers and packet counts,
         # where order does not matter
```

Execute, then wait for output. Switch on a feature, wait for output. Check thing A happened in process A after the feature was enabled. Check thing B happened in process B after thing A happened.

Why is this bad?

@squirrelsc (Member) replied:

Firstly, the example above doesn't really convince me this is needed. The waits on primary/secondary are not actually ordered, because the processes run in parallel; leaving only the last wait is enough:

```python
primary.wait_output("received secondary info:", delta_only=True)
```

Anyway, I also discussed a better design with Paul that supports both scenarios:

1. Add an argument called offset_to_end. The default value is True, because in most scenarios a process shouldn't be waited on twice. If that's needed, set it to False. But I haven't seen any case that needs it yet.
2. Use offset_to_end at the L515. When it's True, set the offset to the length of the temp buffer; when it's False, follow the current implementation.
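A minimal sketch of how the proposed offset_to_end argument could behave (a hypothetical simplification by the editor; the real wait_output also flushes writers, sleeps between polls, and raises on timeout):

```python
import io


def wait_output_sketch(
    log_buffer: io.StringIO,
    keyword: str,
    offset: int,
    delta_only: bool = False,
    offset_to_end: bool = True,
) -> tuple[bool, int]:
    """One search step; returns (found, new_offset) for the caller to keep."""
    search_buffer = log_buffer.getvalue()
    find_pos = offset if delta_only else 0
    found_pos = search_buffer.find(keyword, find_pos)
    if found_pos < 0:
        return False, len(search_buffer)
    if offset_to_end:
        # default: skip everything seen so far (one signal per wait)
        return True, len(search_buffer)
    # opt-in: resume right after the match, preserving later messages
    return True, found_pos + len(keyword)


buf = io.StringIO()
buf.write("hotplug: remove\nhotplug: add\n")

found, off = wait_output_sketch(
    buf, "hotplug: remove", 0, delta_only=True, offset_to_end=False
)
assert found
# Because offset_to_end was False, "hotplug: add" is still searchable.
found2, _ = wait_output_sketch(buf, "hotplug: add", off, delta_only=True)
assert found2
```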

@mcgov (Collaborator, Author) replied:

ok

@mcgov (Collaborator, Author) replied:

I just really think the bug here is that delta_only doesn't look at only the delta. The concept of what the function was 'for' wasn't clear to me from the name and my expectation of what would be useful. I called this a 'bug fix' because I thought the delta_only flag would do what the name implies: look at the delta from the last thing we matched.

The problem is that the name doesn't accurately describe the behavior. So maybe we make this explicit for the caller:

```python
class StartAt(Enum):
    EndOfLastMatch = "last_keyword"
    EndOfLastBuffer = "last_buffer"


def wait_output(keyword: str, match_type: Optional[StartAt] = None): ...


# So then callers can use it like this:
process.wait_output(keyword="Some log line", match_type=StartAt.EndOfLastMatch)
```

@squirrelsc (Member) replied:

OK, it can be changed like this, but it needs to save two positions, last_end and last_matched, and it needs to support searching from the beginning too. But don't overuse it: the cost of this method is creating a string from the buffer, not the searching itself. So once a string is materialized, searching it multiple times is much cheaper than calling getvalue() on the buffer multiple times. Use it for async signals only, not for logic.

So it looks like:

```python
class StartAt(Enum):
    Beginning = "beginning"
    LastMatch = "last_match"
    LastEnd = "last_end"


# Set the default to the most useful value, LastEnd.
def wait_output(keyword: str, start_at: Optional[StartAt] = StartAt.LastEnd): ...


# So then callers can use it like this:
process.wait_output(keyword="Some log line", start_at=StartAt.LastMatch)
```
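For illustration, the two-position design sketched above (last_end plus last_match) could look like this in isolation (an editor's sketch, not the merged implementation):

```python
import io
from enum import Enum


class StartAt(Enum):
    Beginning = "beginning"
    LastMatch = "last_match"
    LastEnd = "last_end"


class BufferSearcher:
    """Tracks last_end and last_match offsets over a growing StringIO."""

    def __init__(self) -> None:
        self.log_buffer = io.StringIO()
        self._last_end = 0    # end of the buffer after the previous search
        self._last_match = 0  # end of the previously matched keyword

    def wait_output(
        self, keyword: str, start_at: StartAt = StartAt.LastEnd
    ) -> bool:
        search_buffer = self.log_buffer.getvalue()
        if start_at is StartAt.Beginning:
            find_pos = 0
        elif start_at is StartAt.LastMatch:
            find_pos = self._last_match
        else:
            find_pos = self._last_end
        found_pos = search_buffer.find(keyword, find_pos)
        self._last_end = len(search_buffer)
        if found_pos < 0:
            return False
        self._last_match = found_pos + len(keyword)
        return True


s = BufferSearcher()
s.log_buffer.write("hotplug: remove\nhotplug: add\n")

# LastMatch preserves messages that arrived after the first match...
assert s.wait_output("hotplug: remove", StartAt.LastMatch)
assert s.wait_output("hotplug: add", StartAt.LastMatch)

# ...while the LastEnd default treats each wait as a fresh signal.
s.log_buffer.write("done\n")
assert s.wait_output("done")  # only the new delta is searched
assert not s.wait_output("hotplug: remove")  # behind the last_end offset
```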

```diff
                 return True

+            # else, remember the amount we've searched so far
+            searched_len = len(search_buffer)
             time.sleep(interval)

-        self.log_buffer_offset = len(self.log_buffer.getvalue())
+        # if the keyword was not found, still retain the
+        # buffer length for a future delta_only caller
+        self.log_buffer_offset = searched_len

         if error_on_missing:
             raise LisaException(
```