Contributor

@ranj063 ranj063 commented Jul 24, 2025

This change addresses the following error seen when doing pause with the multiple-pause-resume test:
aplay: do_pause:1586: pause push error: File descriptor in bad state

When an xrun happens during the test, the application tries to recover from the xrun by preparing and restarting the stream. There could be a race between when this happens and when the script tries to pause the stream. To avoid this, make sure that the stream state is RUNNING before going ahead with the pause.
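For illustration, the wait-before-pause idea can be sketched as below. This is a Python model of the logic, not the expect/Tcl code in this PR; `wait_for_running`, `get_state`, and the default of 50 attempts at 1 ms each are assumptions chosen to mirror the "50ms" figure discussed in the review.

```python
import time
from typing import Callable, Tuple


def wait_for_running(get_state: Callable[[], str],
                     max_attempts: int = 50,
                     interval_s: float = 0.001) -> Tuple[bool, int, str]:
    """Poll get_state() until it reports "RUNNING" or the attempts run out.

    Returns (reached_running, attempts_used, last_state).
    get_state is a hypothetical callback standing in for a PCM status query.
    """
    state = get_state()
    attempts = 0
    while state != "RUNNING" and attempts < max_attempts:
        time.sleep(interval_s)  # lower bound only; real delay is usually longer
        attempts += 1
        state = get_state()
    return state == "RUNNING", attempts, state


# Simulated stream that is recovering from an xrun when the pause is due.
states = iter(["XRUN", "PREPARED", "RUNNING"])
print(wait_for_running(lambda: next(states)))
```

In this model, passing `max_attempts=0` skips the wait entirely, so a stream that is not RUNNING is reported immediately instead of being given time to recover.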

Collaborator

@marc-hb marc-hb left a comment

This PR deals with an xrun that happened when in the pause/resume cycle?

Which xrun did you observe, underrun or overrun or both? Can you elaborate?

This looks like a fairly complex issue, I feel like a new https://github.com/thesofproject/sof-test/issues/new/choose would not hurt.

Shouldn't the test fail anyway when an xrun happened? Ideally, with an error message less cryptic than "file descriptor in bad state" but still fail? Doesn't this PR hide the xrun and make the test pass when it shouldn't?

Contributor Author

ranj063 commented Jul 25, 2025

> This PR deals with an xrun that happened when in the pause/resume cycle?
>
> Which xrun did you observe, underrun or overrun or both? Can you elaborate?
>
> This looks like a fairly complex issue, I feel like a new https://github.com/thesofproject/sof-test/issues/new/choose would not hurt.
>
> Shouldn't the test fail anyway when an xrun happened? Ideally, with an error message less cryptic than "file descriptor in bad state" but still fail? Doesn't this PR hide the xrun and make the test pass when it shouldn't?

No, an xrun doesn't mean the test should fail immediately. The test should fail only when the application cannot recover successfully from an xrun. When that happens, you'd see the "input/output error".

Collaborator

marc-hb commented Jul 25, 2025

Thanks, good to know.

The code change makes sense to me.

I still think it would be useful to have a short description of when the xrun happens and why. If not in a new sof-test bug, then in the commit message.

@marc-hb

This comment was marked as off-topic.

Collaborator

marc-hb commented Jul 25, 2025

So it looks like all the tests finally completed... more than 15h after the git push?

5be12d0 was pushed on July 24th, 23:50 UTC.

The screenshot above with unfinished jobs was captured on July 25th, 15:08 UTC

Maybe there was just a very long backlog... I would recommend taking a look at the test logs to make sure.

Contributor

@kv2019i kv2019i left a comment

It's not so clear-cut whether we should fail on xruns, but given that we don't have --fatal-errors in other tests at the moment, this PR is aligned with the other tests we have. I'd say let's proceed with this change.

Collaborator

marc-hb commented Jul 28, 2025

What is (or... should be) the "canonical" way to detect xruns? If it is, for instance, scanning logs (kernel or firmware), then there could be some sort of "universal" toggle that works the same across all tests.

Doesn't this test scan logs already? Most do. If it does, then isn't this PR insufficient?


set _delay [substract_time_since_last_space $_record_for]

# wait 50ms for the PCM status to be RUNNING before pausing
Collaborator

Suggested change
- # wait 50ms for the PCM status to be RUNNING before pausing
+ # poll for approximately 50ms for the PCM status to be RUNNING before pausing

I bet it's quite a lot more than 50ms in practice.

Contributor Author

I'd think it would be a lot less; 50ms is really the worst case.

Collaborator

@marc-hb marc-hb Oct 15, 2025

I'm afraid we are talking past each other. From experience, `after 1` waits at least 1 ms, but a lot more in reality. Hence my suggested change.

Unlike me, I suspect you are talking about audio, not about expect?

BTW take a look at the comment on line 63.
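The "at least, usually more" behaviour of a short sleep is easy to measure. The sketch below is Python, not expect, and `timed_poll` is a hypothetical helper: it requests 50 sleeps of 1 ms and reports the wall-clock time actually spent. The result depends on the machine and scheduler, so no particular figure is implied.

```python
import time


def timed_poll(iterations: int = 50, sleep_s: float = 0.001) -> float:
    """Sleep `iterations` times for `sleep_s` seconds; return elapsed seconds."""
    start = time.monotonic()
    for _ in range(iterations):
        time.sleep(sleep_s)  # guaranteed minimum, not an exact duration
    return time.monotonic() - start


elapsed = timed_poll()
print(f"requested 50.0 ms, spent {elapsed * 1000:.1f} ms")
```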

after 1
}
if {$attempt >= $max_attempts} {
log 0 "ERROR: timeout waiting for PCM status to be RUNNING before pause"
Collaborator

@marc-hb marc-hb Jul 28, 2025

I think this should print some "WARN: likely XRUN" if $attempt > 0

Then this could more easily be converted to an error, either temporarily by editing the code locally, or based on some command line option or similar.
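A rough sketch of that suggestion, written in Python rather than in the script's expect/Tcl. The function name `precheck_result`, the `warn_is_fatal` switch, and the exact message wording are all hypothetical; the point is only that a non-zero attempt count becomes a warning that one flag can promote to an error.

```python
from typing import Tuple


def precheck_result(reached_running: bool, attempts: int, state: str,
                    warn_is_fatal: bool = False) -> Tuple[int, str]:
    """Map the outcome of the wait loop to (exit_code, log_line)."""
    if not reached_running:
        # Timed out: always an error, and report the state we gave up in.
        return 1, ("ERROR: timeout waiting for PCM to be RUNNING before pause "
                   f"(state: {state})")
    if attempts > 0:
        # The stream was not RUNNING at first: possibly an xrun recovery.
        level = "ERROR" if warn_is_fatal else "WARN"
        code = 1 if warn_is_fatal else 0
        return code, (f"{level}: likely XRUN, PCM needed {attempts} poll(s) "
                      "to reach RUNNING")
    return 0, "PCM already RUNNING"


for outcome in [(True, 0, "RUNNING"), (True, 3, "RUNNING"), (False, 50, "XRUN")]:
    print(precheck_result(*outcome))
```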

Contributor

I think the ERROR is appropriate as we fail, but it would be nice to print out the state the PCM is in when we throw our hands in the air.

Contributor Author

@ujfalusi updated with the current status print during the error

Collaborator

marc-hb commented Jul 29, 2025

> but given we don't have --fatal-errors in other tests at the moment

Actually: you can already do this with some tests: #489, #1120, SOF_ALSA_OPTS

Work in progress.

Contributor

kv2019i commented Aug 5, 2025

@marc-hb The application can detect xruns via the ALSA API, and it's up to the app to decide whether an xrun is fatal or not (streaming stops, or it proceeds with a possible glitch in the played/captured audio).

I agree this PR is in effect reducing the coverage of this test case and it can hide failures we had earlier flagged as failures. OTOH, xruns are a lower-impact issue than an IPC timeout (= one might need to reboot the DUT to get any audio functionality back). In practice we have disabled pause in the SOF drivers towards applications, and we primarily use this test case to stress test the pipeline state machines and root out IPC timeout scenarios. In this context, I think this PR makes sense, as it will filter out the xruns (which we are not fixing in the context of pause anyway) and will give a more reliable pass/fail w.r.t. IPC timeouts.

Collaborator

marc-hb commented Aug 5, 2025

> I agree this PR is in effect reducing the coverage of this test case and it can hide fails we had earlier flagged as fails. OTOH, the xruns are a smaller impact issue compared to an IPC timeout

I see nothing wrong with prioritizing some failures above others, but IMHO there should at least be a flag or some other easy way to restore the previous behavior. Or at the very least a WARNING printed = literally just one line of code. Also, your (crystal-clear as usual) explanation of coverage priorities belongs in the commit message and in comments in the source. EDIT: not just lost in a PR comment.

EDIT: a "flag" is a big ask. A comment in the source explaining how to quickly and locally edit it would be enough.

Contributor

ujfalusi commented Oct 8, 2025

@ranj063, do we have any idea why we go to xrun during pause_push/pause_release?

Contributor Author

ranj063 commented Oct 8, 2025

> @ranj063, do we have any idea why we go to xrun during pause_push/pause_release?

@ujfalusi it's not that we're running into xruns during pause/release; I think the problem is the timing in the test. We pause for a very short time, release, and then try to pause again almost immediately. In this sequence, I think it makes sense to wait until the PCM is actually back in the RUNNING state before pausing.

This change addresses the following error seen when doing pause with the
multiple-pause-resume test:
aplay: do_pause:1586: pause push error: File descriptor in bad state

When an xrun happens during the test, the application tries to recover
from the xrun by preparing and restarting the stream. There could be a
race between when this happens and when the script tries to pause the
stream. To avoid this, make sure that the stream state is RUNNING before
going ahead with a subsequent pause.

Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
@ranj063 ranj063 force-pushed the fix/multiple_pause_resume branch from 5be12d0 to 2ce6ddc Compare October 14, 2025 16:13
@ranj063 ranj063 requested review from kv2019i and ujfalusi October 14, 2025 16:14
Contributor

@redzynix redzynix left a comment

Looks good.

@redzynix
Contributor

SOFCI TEST


log 0 "ERROR: timeout waiting for PCM to be in RUNNING state before pause"
log 0 "Current state: $pcm_status"
exit 1
}
Collaborator

Logging the number of attempts (at a high log level) would not hurt.


# wait 50ms for the PCM status to be RUNNING before pausing
# this is to make sure that in the case of an xrun the application
# successfully recovers and restarts the stream.
Collaborator

Suggested change
- # successfully recovers and restarts the stream.
+ # successfully recovers and restarts the stream.
+ # change `max_attempts` to zero when observing xruns is desired.

@redzynix redzynix merged commit a9f04af into thesofproject:main Oct 16, 2025
4 of 8 checks passed