Process: fix found position bug with wait_output delta_only #3638

Open · wants to merge 1 commit into main

Conversation

@mcgov (Collaborator) commented Feb 7, 2025

Fixes: 603bd27

If a buffer has length N and the matched message 0 does not sit at the very end of the buffer, i.e. at buffer[N-len(message0):N], the current code can miss messages on future calls to wait_output.

wait_output currently stores the total length of the buffer as the search buffer offset, even when the keyword is found earlier in the buffer.

Consider a buffer of 100 characters and a keyword of length 10 found at position 50. The old code records 100 as the last offset. So if your next keyword has length 10 and sits at position 70, it is lost if you call wait_output with delta_only again.

This commit fixes that issue by saving the end position of the slice buffer[start:end] that contains the matched keyword.

If no keyword was found, it still saves the length of the buffer.

Signed-off-by: Matthew G McGovern <mamcgove@microsoft.com>
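
To make the worked example concrete, here is a minimal sketch of the offset bookkeeping in plain Python. Only log_buffer_offset matches a name from the snippet reviewed below; the rest is illustrative, and the real wait_output also blocks until a timeout.

# Minimal sketch of the worked example above; illustrative only, not the real wait_output.
buffer = "x" * 50 + "keyword_01" + "x" * 10 + "keyword_02" + "x" * 20  # 100 characters
offset = 0

# First delta_only search: "keyword_01" starts at position 50 and ends at 60.
found_pos = buffer.find("keyword_01", offset)
old_offset = len(buffer)                    # old behavior: record the full buffer length (100)
new_offset = found_pos + len("keyword_01")  # fixed behavior: record the end of the match (60)

# Second delta_only search: "keyword_02" starts at position 70.
assert buffer.find("keyword_02", old_offset) == -1  # missed with the old offset
assert buffer.find("keyword_02", new_offset) == 70  # found with the fix
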
@mcgov added the 🐞 Bug (Something isn't working) label on Feb 7, 2025
@mcgov marked this pull request as ready for review on February 7, 2025 17:10
if found_pos >= 0:
    # note: find() returns the start index for the slice search_buffer[start:end],
    # so calculate the end, found_pos + len(keyword), as the new offset
    self.log_buffer_offset = found_pos + len(keyword)
@squirrelsc (Member) commented Feb 7, 2025:

It should be the length of the whole log_buffer. Searching multiple times with the same string isn't an intended scenario.

@mcgov (Collaborator, Author):

I explained the bug to you earlier and you said it was a good point... What changed? This isn't about searching for the same string twice; it's about searching more than once. You will lose any log messages that follow the first match. There's no guarantee you'll get one line at a time. And delta_only means the ordering is important and there may be duplicate messages in the logs. So why assume there's nothing important in the rest of the buffer...

I'm not just digging around looking for random bugs to open; I ran into this bug while working on a test case.

@squirrelsc (Member):

I thought the bug was that self.log_buffer.getvalue() is called multiple times, so it doesn't actually present the same string each time. That may cause some problems with multiple threads.

But this place is by design. It is meant to find a signal to take action on, not to find and match strings for other scenarios, because it's expensive to pull the string out of the buffer. The use cases look like: 1) start a process, 2) block waiting for the process to print "started", 3) execute the next step.

@mcgov (Collaborator, Author) commented Feb 10, 2025:

> I thought the bug was that self.log_buffer.getvalue() is called multiple times, so it doesn't actually present the same string each time. That may cause some problems with multiple threads.

Oh right, I guess there were two bugs. Even better!

> Because it's expensive to pull the string out of the buffer.

Using delta_only to look for an ordered series of messages makes a single pass through the buffer. I don't think this is any slower than making one or more passes later. I'm also pretty sure this function searched the entire buffer on every call to wait_output before delta_only was added...

> But this place is by design. It is meant to find a signal to take action on, not to find and match strings for other scenarios.

What if you have to check that events happened in a specific order AND take action afterwards...
  

dmesg = node.tools[Dmesg]
vf_info = node.nics.get_secondary_nic()
primary = node.execute_async("program_name")

# wait for start
primary.wait_output("started")

# toggle sriov
node.features[NetworkInterface].switch_sriov(enable=False)
primary.wait_output("hotplug: remove", delta_only=True)
node.features[NetworkInterface].switch_sriov(enable=True)

# wait for hotplug event
primary.wait_output("hotplug: add", delta_only=True)
dmesg.check_for_kernel_errors()  # check for dump immediately after hotplug

# start a secondary process (here's where things get interesting)
secondary = node.execute_async("program_name --process_type=secondary")
secondary.wait_output("started")

# check info is sent to the secondary
primary.wait_output(f"secondary process: VF started {vf_info.name}", delta_only=True)
# check info is received by the secondary
secondary.wait_output(f"received info from primary: VF {vf_info.name}", delta_only=True)

# check the secondary sent a message to the primary after start
secondary.wait_output("sent primary info: ...", delta_only=True)
# check the second bit of info after the secondary VF start
primary.wait_output("received secondary info:", delta_only=True)

dmesg.check_for_kernel_errors()  # check for dump after secondary start

...  # send traffic from one nic to the other, do some test stuff.

primary_result = primary.wait_result(expected_exit_code=0, ...)
secondary_result = secondary.wait_result(expected_exit_code=0, ...)

for match in some_regex.finditer(primary_result.stdout):
    ...  # do output parsing for things like perf numbers and packet counts
         # where order does not matter

Execute, then wait for output. Switch on a feature, wait for output. Check that thing A happened in process A after that feature was enabled. Check that thing B happened in process B after thing A happened.

Why is this bad?

@squirrelsc (Member):

First, the example above doesn't really convince me it's needed. The waits on primary/secondary are not actually ordered, because they run in parallel. Leaving only the last wait is enough:
primary.wait_output("received secondary info:", delta_only=True)

Anyway, I also discussed a better design with Paul to support both scenarios:

  1. Add an argument called offset_to_end. The default value is True, because in most scenarios a process shouldn't be waited on twice. If it's needed, set it to False, but I haven't seen any case that needs it yet.
  2. Use offset_to_end at L515: when it's True, set the offset to the length of the temp buffer; when it's False, follow the current implementation (a rough sketch follows below).
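
A rough sketch of that design, assuming the existing found_pos / log_buffer_offset bookkeeping; offset_to_end is the name from the comment above, everything else is illustrative and the blocking/timeout loop is omitted.

# Illustrative sketch only; the real wait_output loops until a timeout.
def wait_output(self, keyword: str, delta_only: bool = False, offset_to_end: bool = True) -> None:
    buffer = self.log_buffer.getvalue()
    start = self.log_buffer_offset if delta_only else 0
    found_pos = buffer.find(keyword, start)
    if found_pos >= 0 and not offset_to_end:
        # opt-in: the next delta_only search starts right after this match
        self.log_buffer_offset = found_pos + len(keyword)
    else:
        # default: the next delta_only search starts after everything seen so far
        self.log_buffer_offset = len(buffer)
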

@mcgov (Collaborator, Author):

ok

@mcgov (Collaborator, Author):

I just really think the bug here is that delta_only doesn't look at only the delta. The concept of what the function was 'for' wasn't clear to me based on the name and my expectation of what would be useful. I called this a 'bug fix' because I thought the delta_only flag would do what the name implies it should: look at the delta from the last thing we matched.

The problem is the name doesn't accurately describe the behavior. So maybe we make this explicit for the caller:

class StartAt(Enum):
    EndOfLastMatch = "last_keyword"
    EndOfLastBuffer = "last_buffer"

wait_output(keyword: str, match_type: Optional[StartAt] = None)

# So then callers can use it like this:
process.wait_output(keyword="Some log line", match_type=StartAt.EndOfLastMatch)

@squirrelsc (Member):

OK, it can be changed like this, but it needs to save two positions, last_end and last_matched. It needs to support searching from the beginning too. But don't overuse it: the cost of this method is creating the string from the buffer, not the searching itself. So if a string is already available, searching in it multiple times is much cheaper than calling getvalue on the buffer multiple times. Use it for async signals only, not for logic.

So, it looks like:

class StartAt(Enum):
    Beginning = "beginning"
    LastMatch = "last_match"
    LastEnd = "last_end"

# Set the default to the most useful `LastEnd`.
wait_output(keyword: str, start_at: Optional[StartAt] = StartAt.LastEnd)

# So then callers can use it like this:
process.wait_output(keyword="Some log line", start_at=StartAt.LastMatch)
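
A sketch of the bookkeeping that implies, keeping both saved positions; the names below are illustrative, not a final API.

# Illustrative only: keep both positions so any StartAt mode can be served.
from enum import Enum

class StartAt(Enum):
    Beginning = "beginning"
    LastMatch = "last_match"
    LastEnd = "last_end"

class _SearchState:
    def __init__(self) -> None:
        self.last_end = 0    # buffer length at the end of the previous search
        self.last_match = 0  # end position of the previously matched keyword

    def search(self, buffer: str, keyword: str, start_at: StartAt = StartAt.LastEnd) -> int:
        starts = {
            StartAt.Beginning: 0,
            StartAt.LastMatch: self.last_match,
            StartAt.LastEnd: self.last_end,
        }
        found_pos = buffer.find(keyword, starts[start_at])
        if found_pos >= 0:
            self.last_match = found_pos + len(keyword)
        self.last_end = len(buffer)
        return found_pos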
