[BUG] pcm_Read error when capturing on TGLU_RVP_NOCODEC #4163

Closed
XiaoyunWu6666 opened this issue May 10, 2021 · 14 comments
Labels
bug (Something isn't working as expected), multicore (Issues observed when not only core#0 is used), TGL (Applies to Tiger Lake)

@XiaoyunWu6666
Contributor

XiaoyunWu6666 commented May 10, 2021

Describe the bug
Found in the May 09 daily test: http://sof-ci.sh.intel.com/#/result/planresultdetail/3845

In the daily test, a read error also occurred in check-capture-all-formats on TGLU_RVP_NOCODEC, but it cannot be reproduced manually.

To Reproduce
TPLG=sof-tgl-nocodec-ci.tplg ~/sof-test/test-case/check-capture.sh -d 1 -l 1 -r 50
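
For reproducing outside the sof-test harness, the loop can be approximated by calling arecord directly. The following is only a minimal sketch: it assumes that `-r 50` means 50 rounds (as the `Round: 19/50` line in the log suggests) and reuses the arecord arguments that appear in the log for `hw:0,0`. The function name and the injectable `run` parameter are hypothetical helpers, not part of sof-test.

```python
import subprocess

# Arguments copied from the arecord line in the CI log; adjust the device as needed.
ARECORD_CMD = ["arecord", "-Dhw:0,0", "-r", "48000", "-c", "2",
               "-f", "S32_LE", "-d", "1", "/dev/null", "-v", "-q"]


def first_failing_round(rounds=50, run=None):
    """Run the capture up to `rounds` times.

    Returns the 1-based number of the first round whose command exits
    with a non-zero status, or None if every round succeeds.
    """
    if run is None:
        # Default runner: execute the command and return its exit status.
        def run(cmd):
            return subprocess.run(cmd).returncode
    for n in range(1, rounds + 1):
        if run(ARECORD_CMD) != 0:
            return n
    return None
```

Calling `first_failing_round()` on the device under test returns the round at which capture first failed, which may be handy when bisecting. The `run` parameter exists only so the loop can be exercised without audio hardware.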

Reproduction Rate
100% on May 10, but another issue (see #4164) may mask it, so you may not see the log shown below when trying to reproduce it.

Environment
Kernel Branch: topic/sof-dev
Kernel Commit: 9101539a
SOF Branch: main
SOF Commit: 47d223c
TPLG=sof-tgl-nocodec-ci.tplg
Device: TGLU_RVP_NOCODEC [jf-tglu-rvp-nocodec-1]
Report ID: 3845

Log

2021-05-09 22:01:45 UTC [REMOTE_INFO] ===== Testing: (Round: 19/50) (PCM: Port0 [hw:0,0]) (Loop: 1/1) =====
2021-05-09 22:01:45 UTC [REMOTE_INFO] no file prefix, use /dev/null as dummy capture output
2021-05-09 22:01:45 UTC [REMOTE_COMMAND] arecord   -Dhw:0,0 -r 48000 -c 2 -f S32_LE -d 1 /dev/null -v -q
Hardware PCM card 0 'sof-nocodec' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 8192
  period_size  : 2048
  period_time  : 42666
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 2048
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 8192
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
arecord: pcm_read:2155: read error: Input/output error
2021-05-09 22:01:45 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2021-05-09 22:01:45 UTC [REMOTE_ERROR]  arecord_opts()  @  /home/ubuntu/sof-test/test-case/../case-lib/lib.sh
2021-05-09 22:01:45 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/check-capture.sh:98
2021-05-09 22:01:45 UTC [REMOTE_INFO] Starting /usr/bin/sof-logger  -l /etc/sof/sof-tgl.ldc -o /home/ubuntu/sof-test/logs/check-capture/2021-05-09-22:00:26-23746/etrace.txt
2021-05-09 22:01:46 UTC [REMOTE_INFO] pkill -TERM sof-logger
Terminated
2021-05-09 22:01:48 UTC [REMOTE_INFO] nlines=4 /home/ubuntu/sof-test/logs/check-capture/2021-05-09-22:00:26-23746/etrace.txt
2021-05-09 22:01:48 UTC [REMOTE_INFO] Test Result: FAIL!
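
To find which rounds failed in a long log like the one above, the output can be scanned for arecord error lines. This is a minimal sketch, assuming the log keeps the `(Round: N/M)` marker and the `arecord: <function>:<line>:` error prefix seen here; `failed_rounds` is a hypothetical helper, not part of sof-test.

```python
import re

# "(Round: 19/50)" marks the start of each round in the test log.
ROUND_RE = re.compile(r"\(Round: (\d+)/(\d+)\)")
# arecord reports failures as "arecord: <function>:<line>: <message>".
ERROR_RE = re.compile(r"\barecord: \w+:\d+:")


def failed_rounds(log_text):
    """Return the sorted round numbers in which arecord logged an error."""
    current = None
    failed = set()
    for line in log_text.splitlines():
        m = ROUND_RE.search(line)
        if m:
            current = int(m.group(1))
        elif current is not None and ERROR_RE.search(line):
            failed.add(current)
    return sorted(failed)
```

Both the `pcm_read` read error and the `set_params` failure mentioned in this report match the error pattern, while the `[REMOTE_COMMAND] arecord ...` line does not.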

Sometimes it fails with the following error instead:
arecord: set_params:1407: Unable to install hw params:

soflogger.txt

@marc-hb
Collaborator

marc-hb commented May 10, 2021

> Reproduction Rate 100%

This is very recent, can someone bisect and find which commit introduced this?

@keyonjie
Contributor

Thanks for reporting, @XiaoyunWu6666.
@marc-hb This looks like a new issue observed by our CI about multi-core, that's why we need the topology change merged as early as possible.

@marc-hb
Collaborator

marc-hb commented May 11, 2021

> Reproduction Rate 100%

Actually, that can't be right because:

> ===== Testing: (Round: 19/50)

> that's why we need the topology change merged as early as possible.

Pretty sure you mean this one: #4153

Based on all the above I'm transferring this from thesofproject/linux to thesofproject/sof

@marc-hb transferred this issue from thesofproject/linux on May 11, 2021
@XiaoyunWu6666 reopened this on May 11, 2021
@XiaoyunWu6666 changed the title from "[BUG] pcm_Read error occurs on TGLU_RVP_NOCODEC when check-capture-50rounds" to "[BUG] pcm_Read error on TGLU_RVP_NOCODEC when check-capture-50rounds" on May 11, 2021
@mengdonglin added the bug, TGL, and multicore labels on May 11, 2021
@slawblauciak
Collaborator

Can you please check if #4089 helps here?

@mengdonglin added this to the v1.8 milestone on May 11, 2021
@plbossart
Member

> Thanks for reporting @XiaoyunWu6666 .
> @marc-hb This looks like a new issue observed by our CI about multi-core, that's why we need the topology change merged as early as possible.

What topology change are you referring to, @keyonjie? I don't see any, and if there is an issue again we probably need to revert the multicore changes again. It seems like the topology change in #4153 was merged too quickly and broke CI twice. We need to be more careful here; CI is not a debug tool for introducing features.

@keyonjie
Contributor

@plbossart what do you mean by breaking CI? Basic support for running a pipeline on a slave core is already claimed as supported, so we need to ask for validation of it.
IMHO, having bugs reported whenever validation of a new feature set is introduced is quite common and expected. We should go and fix them, just as we have been doing for SDW.

@plbossart
Member

@keyonjie I take issue with you merging your own PR before FIRST asking for validation. I routinely ask to run more thorough daily tests before we merge. This is also what we do when moving the kernel to a new -rc1.

Merging and then doing validation is not right, sorry.

@keyonjie
Contributor

keyonjie commented May 12, 2021

@plbossart we did run validation before merging #4153, and if you check the May 9th daily report (the first daily after the PR was merged) http://sof-ci.sh.intel.com/#/result/planresultdetail/3845, only 3 cases failed and 2 of them are stress tests. That is already no worse than TGLU_VOLT_SDW, no?

EDIT:
On the other hand, there are more errors in the May 10th report http://sof-ci.sh.intel.com/#/result/planresultdetail/3868, and it looks like some regression was observed. That's the benefit we get from the new coverage, no?

@keyonjie
Contributor

keyonjie commented May 12, 2021

EDIT:
It looks like the kernel was built from the wrong commit in the May 11th build 3904: it is now 5.12-rc7, whereas it was 5.12-rc8 on May 9th. @keqiaozhang @marc-hb @fredoh9 @aiChaoSONG can you check? Without @kv2019i's fix, even basic multi-core support is not available.

@marc-hb
Collaborator

marc-hb commented May 12, 2021

Do you mean it was not built with 9101539a as reported or just that 9101539a is not the latest commit?

EDIT: only daily build 3904 tested the wrong commit. It tested a 1-month-old commit by accident.

@marc-hb
Collaborator

marc-hb commented May 12, 2021

> (the first daily after the PR merged) http://sof-ci.sh.intel.com/#/result/planresultdetail/3845, only 3 cases failed and 2 of them are stress, it is already not bad than TGLU_VOLT_SDW no?

It simply depends on whether these failures were known before the merge and relatively easy to reproduce. If they were, then they should have been fixed before the merge so these tests are not "lost" for other development.

> @plbossart what do you mean by breaking CI?

It's a very simple idea: no regression.

@keyonjie
Contributor

keyonjie commented May 12, 2021

> Do you mean it was not built with 9101539a as reported or just that 9101539a is not the latest commit?
>
> EDIT: only daily build 3904 tested the wrong commit. It tested a 1-month old commit by accident

You are right: only build 3904 (May 11th) was wrong. Build 3868 was correct, and the issue filed here based on 3845 is valid. I have just corrected my comments above.
My PR #4089 was added to fix this issue. @XiaoyunWu6666, can you please check whether it helps?

@XiaoyunWu6666
Contributor Author

@keyonjie #4089 is effective against this issue (tested on ubuntu@sh-tglu-rvp-nocodec-02).

@XiaoyunWu6666 changed the title from "[BUG] pcm_Read error on TGLU_RVP_NOCODEC when check-capture-50rounds" to "[BUG] pcm_Read error on TGLU_RVP_NOCODEC/HDA when check-capture-50rounds" on May 13, 2021
@XiaoyunWu6666 changed the title from "[BUG] pcm_Read error on TGLU_RVP_NOCODEC/HDA when check-capture-50rounds" to "[BUG] pcm_Read error when capturing" on May 13, 2021
@XiaoyunWu6666 changed the title from "[BUG] pcm_Read error when capturing" to "[BUG] pcm_Read error when capturing on TGLU_RVP_NOCODEC" on May 13, 2021