Broadwell: kernel warning w/ failed hw_params on media pipeline #932

Closed
plbossart opened this issue May 11, 2019 · 3 comments
Labels
BDW Applies to Broadwell platform bug Something isn't working

Comments

@plbossart
Member

plbossart commented May 11, 2019

Bad kernel oops when using device 0,1 (media)

root@galliumos:~# speaker-test -Dhw:0,1 -c2 -r48000 

speaker-test 1.1.0

Playback device is hw:0,1
Stream parameters are 48000Hz, S16_LE, 2 channels
Using 16 octaves of pink noise
Rate set to 48000Hz (requested 48000Hz)
Buffer size range from 96 to 2096640
Period size range from 48 to 65520
Using max buffer size 2096640
Periods = 4
Unable to set hw params for playback: Invalid argument
Setting of hwparams failed: Invalid argument
[   40.252524] sof-audio-acpi INT3438:00: ipc tx: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
[   40.252682] sof-audio-acpi INT3438:00: error: ipc error for 0x60010000 size 20
[   40.252687] sof-audio-acpi INT3438:00: error: hw params ipc failed for stream 0
[   40.252690] sof-audio-acpi INT3438:00: ASoC: INT3438:00 hw params failed: -22
[   40.252694]  Media Playback 1: ASoC: hw_params FE failed -22
[   40.252700] sof-audio-acpi INT3438:00: pcm: free stream 1 dir 0
[   40.252703] sof-audio-acpi INT3438:00: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   40.252839] sof-audio-acpi INT3438:00: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   40.252852] WARNING: CPU: 0 PID: 2841 at /data/pbossart/ktest/sof-dev/kernel/workqueue.c:3030 __flush_work+0x19f/0x1b0
[   40.252853] Modules linked in: snd_soc_sst_bdw_rt5677_mach x86_pkg_temp_thermal intel_powerclamp iwlmvm sof_acpi_dev iwlwifi snd_sof_intel_byt snd_sof_intel_bdw snd_sof_intel_ipc snd_sof snd_sof_xtensa_dsp intel_pch_thermal snd_soc_rt5677 snd_soc_acpi_intel_match snd_soc_rl6231 snd_soc_acpi snd_soc_core snd_pcm snd_soc_rt5677_spi xhci_pci xhci_hcd
[   40.252867] CPU: 0 PID: 2841 Comm: speaker-test Not tainted 5.1.0-test+ #77
[   40.252868] Hardware name: GOOGLE Samus, BIOS  08/17/2016
[   40.252871] RIP: 0010:__flush_work+0x19f/0x1b0
[   40.252873] Code: ff 8b 4d 00 48 8b 55 08 83 e1 08 48 0f ba 6d 00 03 80 c9 f0 eb c6 41 c6 45 00 00 fb 31 c0 e9 e1 fe ff ff 0f 0b e9 da fe ff ff <0f> 0b 31 c0 e9 d1 fe ff ff e8 33 2d fe ff 0f 1f 00 31 f6 e9 49 fe
[   40.252875] RSP: 0018:ffffb39e021bbbd0 EFLAGS: 00010246
[   40.252877] RAX: 0000000000000000 RBX: ffff8afc6c7b2428 RCX: 0000000000000000
[   40.252878] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8afc6c7b2428
[   40.252879] RBP: ffff8afc6c7b2428 R08: 000000000000038b R09: 0000000000000004
[   40.252880] R10: 0000000000000981 R11: 0000000000000001 R12: 0000000000000000
[   40.252881] R13: ffffb39e021bbc68 R14: ffffffffa707c890 R15: 0000000000000000
[   40.252883] FS:  00007f72e4861700(0000) GS:ffff8afc6ea00000(0000) knlGS:0000000000000000
[   40.252884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.252885] CR2: 00007ffeb97f8080 CR3: 0000000453cc2002 CR4: 00000000003606f0
[   40.252886] Call Trace:
[   40.252895]  ? ipc_log_header+0x1a6/0x4b0 [snd_sof]
[   40.252899]  ? sof_ipc_tx_message_unlocked+0x12d/0x2d0 [snd_sof]
[   40.252903]  __cancel_work_timer+0x11f/0x1a0
[   40.252907]  ? sof_ipc_tx_message+0x61/0x80 [snd_sof]
[   40.252910]  sof_pcm_hw_free+0x158/0x170 [snd_sof]
[   40.252919]  soc_pcm_components_hw_free+0x5c/0x80 [snd_soc_core]
[   40.252926]  soc_pcm_hw_free+0x121/0x1d0 [snd_soc_core]
[   40.252932]  dpcm_fe_dai_hw_free+0x73/0x100 [snd_soc_core]
[   40.252937]  snd_pcm_hw_params+0x12d/0x5d0 [snd_pcm]
[   40.252942]  ? _cond_resched+0x10/0x40
[   40.252945]  ? __kmalloc_track_caller+0x130/0x1b0
[   40.252950]  snd_pcm_common_ioctl+0x19e/0xc80 [snd_pcm]
[   40.252954]  snd_pcm_ioctl+0x25/0x30 [snd_pcm]
[   40.252958]  do_vfs_ioctl+0x9f/0x620
[   40.252961]  ksys_ioctl+0x6b/0x80
[   40.252963]  ? __x64_sys_fcntl+0x79/0xa0
[   40.252965]  __x64_sys_ioctl+0x11/0x20
[   40.252968]  do_syscall_64+0x43/0xf0
[   40.252972]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   40.252975] RIP: 0033:0x7f72e3b62f47
[   40.252977] Code: 00 00 00 48 8b 05 51 6f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
[   40.252978] RSP: 002b:00007ffeb97d6058 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   40.252980] RAX: ffffffffffffffda RBX: 0000000000ca8930 RCX: 00007f72e3b62f47
[   40.252981] RDX: 00007ffeb97d6200 RSI: 00000000c2604111 RDI: 0000000000000004
[   40.252982] RBP: 00007ffeb97d6200 R08: 0000000000000000 R09: 0000000000000000
[   40.252983] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000cb0f30
[   40.252984] R13: 00007ffeb97d6084 R14: 000000000000fff0 R15: 000000000000bb80
[   40.252986] ---[ end trace f650cc8d0b7400b0 ]---
[   40.253192] sof-audio-acpi INT3438:00: pcm: close stream 1 dir 0
[   40.253858] speaker-test (2841) used greatest stack depth: 11672 bytes left
[   40.339637] sof-audio-acpi INT3438:00: ipc rx: 0x90020000: GLB_TRACE_MSG
[   40.339648] sof-audio-acpi INT3438:00: ipc rx done: 0x90020000: GLB_TRACE_MSG
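The log above pairs each IPC header value with a decoded name (0x60010000 with GLB_STREAM_MSG: PCM_PARAMS, 0x60030000 with PCM_FREE, 0x90020000 with GLB_TRACE_MSG). A minimal sketch of how such a header might be split is shown here. The bit layout (global type in bits 31..28, command type in bits 27..16) is an assumption that matches these three values, and the macro and function names are made up for illustration, not taken from the driver.

```c
#include <stdint.h>

/* Assumed layout: global message type in bits 31..28, command in bits 27..16. */
#define GLB_TYPE_SHIFT 28
#define GLB_TYPE_MASK  (0xfu << GLB_TYPE_SHIFT)
#define CMD_TYPE_SHIFT 16
#define CMD_TYPE_MASK  (0xfffu << CMD_TYPE_SHIFT)

/* Extract the global message type, e.g. 0x6 for the stream messages in the log. */
static unsigned int ipc_glb_type(uint32_t hdr)
{
	return (hdr & GLB_TYPE_MASK) >> GLB_TYPE_SHIFT;
}

/* Extract the command within that global type, e.g. 0x1 for PCM_PARAMS. */
static unsigned int ipc_cmd_type(uint32_t hdr)
{
	return (hdr & CMD_TYPE_MASK) >> CMD_TYPE_SHIFT;
}
```

Under that assumption, the failing message 0x60010000 splits into type 0x6 and command 0x1, and the PCM_FREE that follows it into type 0x6 and command 0x3.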

dmesg.log

firmware 6eb79af
kernel 35d7759

@plbossart plbossart added BDW Applies to Broadwell platform bug Something isn't working labels May 11, 2019
@plbossart
Member Author

@keyonjie this can be root-caused to your changes for the period_elapsed work queue. When an error occurs in sof_pcm_hw_params(), INIT_WORK() is not executed, but we still call sof_pcm_hw_free(), which calls cancel_work_sync() on an uninitialized work item. That's not so good.

I am not sure how to deal with this except with a flag that indicates the workqueue was successfully initialized.
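To make the sequence concrete, here is a small userspace model of the failure path. It is only a sketch: the `mock_*` types and functions are hypothetical stand-ins, not the kernel's `struct work_struct`, `INIT_WORK()` or `cancel_work_sync()`, and the real driver code is more involved.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct work_struct; not the kernel type. */
struct mock_work {
	bool initialized;	/* set by mock_init_work(), like INIT_WORK() */
};

struct mock_stream {
	struct mock_work period_elapsed_work;
	int warnings;		/* cancels that hit an uninitialized work item */
};

static void mock_init_work(struct mock_work *w)
{
	w->initialized = true;
}

/* Models cancel_work_sync(): complains if the item was never set up. */
static void mock_cancel_work_sync(struct mock_stream *s)
{
	if (!s->period_elapsed_work.initialized) {
		s->warnings++;
		fprintf(stderr, "WARNING: cancel on uninitialized work\n");
	}
}

/* Models sof_pcm_hw_params(): the init is only reached on success. */
static int mock_hw_params(struct mock_stream *s, bool ipc_fails)
{
	if (ipc_fails)
		return -22;	/* -EINVAL, as in the log above */
	mock_init_work(&s->period_elapsed_work);
	return 0;
}

/* Models sof_pcm_hw_free(): always cancels, even after a failed hw_params. */
static void mock_hw_free(struct mock_stream *s)
{
	mock_cancel_work_sync(s);
}
```

Calling `mock_hw_params()` with `ipc_fails` set and then `mock_hw_free()` on a zero-initialized `struct mock_stream` bumps `warnings` once, which mirrors the hw_params failure followed by hw_free in the trace.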

plbossart added a commit to plbossart/sound that referenced this issue May 14, 2019
When the SOF hw_params() fail, typically with an IPC error thrown by
the firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free().

Add a state variable to keep track of status and remove dmesg warnings
thrown by the kernel by conditionally cancelling the work queue.

GitHub issue: thesofproject#932
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
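The state-variable approach described in this commit message can be sketched in userspace as follows. This is an illustration only: `flag_stream`, `work_ready` and the two functions are hypothetical names, and the commit's actual field and code are not shown on this page.

```c
#include <stdbool.h>

struct flag_stream {
	bool work_ready;	/* hypothetical state variable */
	int cancels;		/* how many times the cancel path actually ran */
};

/* Stands in for hw_params(): the flag is set only when init is reached. */
static int flag_hw_params(struct flag_stream *s, bool ipc_fails)
{
	if (ipc_fails)
		return -22;	/* work item never initialized, flag stays false */
	s->work_ready = true;	/* stands in for INIT_WORK() plus setting the flag */
	return 0;
}

/* Stands in for hw_free(): cancel only if the work item was set up. */
static void flag_hw_free(struct flag_stream *s)
{
	if (!s->work_ready)
		return;		/* nothing to cancel after a failed hw_params */
	s->cancels++;		/* stands in for cancel_work_sync() */
	s->work_ready = false;
}
```

The cost of this design is one extra piece of state that every path touching the work item has to keep in sync.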
@plbossart
Member Author

PR #947 fixes the issue:

root@galliumos:~# speaker-test -Dhw:1,1 -c2 -r48000 

speaker-test 1.1.0

Playback device is hw:1,1
Stream parameters are 48000Hz, S16_LE, 2 channels
Using 16 octaves of pink noise
Rate set to 48000Hz (requested 48000Hz)
Buffer size range from 96 to 2096640
Period size range from 48 to 65520
Using max buffer size 2096640
Periods = 4
Unable to set hw params for playback: Invalid argument
Setting of hwparams failed: Invalid argument
[   71.939204] sof-audio-acpi INT3438:00: pcm: open stream 1 dir 0
[   71.939210] sof-audio-acpi INT3438:00: period min 192 max 262144 bytes
[   71.939213] sof-audio-acpi INT3438:00: period count 2 max 32
[   71.939216] sof-audio-acpi INT3438:00: buffer max 8388608 bytes
[   71.939487] sof-audio-acpi INT3438:00: rate_min: 48000 rate_max: 48000
[   71.939489] sof-audio-acpi INT3438:00: channels_min: 2 channels_max: 2
[   71.939492] sof-audio-acpi INT3438:00: rate_min: 48000 rate_max: 48000
[   71.939494] sof-audio-acpi INT3438:00: channels_min: 2 channels_max: 2
[   71.939499] sof-audio-acpi INT3438:00: rate_min: 48000 rate_max: 48000
[   71.939501] sof-audio-acpi INT3438:00: channels_min: 2 channels_max: 2
[   71.939503] sof-audio-acpi INT3438:00: pcm: hw params stream 1 dir 0
[   71.939506] sof-audio-acpi INT3438:00: generating page table for 000000007d5b824f size 0x7ff800 pages 2048
[   71.939517] sof-audio-acpi INT3438:00: stream_tag 0
[   71.939523] sof-audio-acpi INT3438:00: ipc tx: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
[   71.939612] sof-audio-acpi INT3438:00: error: ipc error for 0x60010000 size 20
[   71.939616] sof-audio-acpi INT3438:00: error: hw params ipc failed for stream 0
[   71.939619] sof-audio-acpi INT3438:00: ASoC: INT3438:00 hw params failed: -22
[   71.939621]  Media Playback 1: ASoC: hw_params FE failed -22
[   71.939625] sof-audio-acpi INT3438:00: pcm: free stream 1 dir 0
[   71.939628] sof-audio-acpi INT3438:00: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   71.939695] sof-audio-acpi INT3438:00: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[   71.939776] sof-audio-acpi INT3438:00: pcm: close stream 1 dir 0

@plbossart plbossart changed the title Broadwell: kernel oops on using media pipeline Broadwell: kernel warning w/ failed hw_params on media pipeline May 14, 2019
plbossart added a commit to plbossart/sound that referenced this issue May 17, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: thesofproject#932
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
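A userspace sketch of the alternative described in this commit message, with the initialization moved to the open step so that every later callback sees an initialized work item. The `open_stream` type and the `*_cb` functions are hypothetical stand-ins for the PCM callbacks, not the driver's code.

```c
#include <stdbool.h>

struct open_stream {
	bool work_initialized;	/* models the state INIT_WORK() would set up */
	int bad_cancels;	/* cancels that hit an uninitialized work item */
};

/* Stands in for the .open callback: init happens before anything can fail. */
static void open_cb(struct open_stream *s)
{
	s->work_initialized = true;
}

/* Stands in for hw_params(): may fail, but no longer owns the init. */
static int hw_params_cb(struct open_stream *s, bool ipc_fails)
{
	(void)s;
	return ipc_fails ? -22 : 0;
}

/* Stands in for hw_free(): cancels unconditionally, with no flag needed. */
static void hw_free_cb(struct open_stream *s)
{
	if (!s->work_initialized)
		s->bad_cancels++;
}
```

Because open always runs before hw_params and hw_free, the cancel path needs no flag, even when hw_params fails.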
plbossart added a commit that referenced this issue May 18, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: #932
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit to plbossart/sound that referenced this issue May 21, 2019
@wenqingfu

fixed by #947

plbossart added a commit that referenced this issue May 24, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: #932
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart added a commit that referenced this issue May 24, 2019
plbossart added a commit that referenced this issue Jun 3, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: #932
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit fab4edf)
sathyap-chrome pushed a commit to sathyap-chrome/linux that referenced this issue Jun 25, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: thesofproject#932
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit fab4edf
 git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next)

BUG=b:134688240
TEST=Test Audio use cases for CML with full SOF patch series applied.

Change-Id: I3767f778d7b1accaafc517045970f83a810efc42
Signed-off-by: Ap, Kamal <kamal.ap@intel.corp-partner.google.com>
gs0622 pushed a commit to gs0622/linux that referenced this issue Jun 25, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: thesofproject#932
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
fcicq pushed a commit to fcicq/chromiumos-third_party-kernel that referenced this issue Jun 29, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: thesofproject/linux#932
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit fab4edf
 git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next)

BUG=b:134688240
TEST=Test Audio use cases for CML with full SOF patch series applied.

Signed-off-by: Ap, Kamal <kamal.ap@intel.corp-partner.google.com>
cujomalainey pushed a commit to cujomalainey/linux that referenced this issue Jul 9, 2019
If the SOF hw_params() fail, typically with an IPC error thrown by the
firmware, the period_elapsed workqueue is not initialized, but we
still cancel it in hw_free(), which results in a kernel warning.

Move the initialization to the .open callback. Tested on Broadwell
(Samus) and IceLake.

Fixes: e2803e6 ("ASoC: SOF: PCM: add period_elapsed work to fix
race condition in interrupt context")

GitHub issue: thesofproject#932
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit fab4edf)

BUG=b:135490563
TEST=Test Audio use cases for GLK with full SOF patch series applied.

Signed-off-by: Curtis Malainey <cujomalainey@chromium.org>
kv2019i pushed a commit that referenced this issue Sep 10, 2020
I got the following lockdep splat while testing:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00172-g021118712e59 #932 Not tainted
  ------------------------------------------------------
  btrfs/229626 is trying to acquire lock:
  ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450

  but task is already holding lock:
  ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_scrub_dev+0x11c/0x630
	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
	 btrfs_ioctl+0x2799/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_run_dev_stats+0x49/0x480
	 commit_cowonly_roots+0xb5/0x2a0
	 btrfs_commit_transaction+0x516/0xa60
	 sync_filesystem+0x6b/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0xe/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x29/0x60
	 cleanup_mnt+0xb8/0x140
	 task_work_run+0x6d/0xb0
	 __prepare_exit_to_usermode+0x1cc/0x1e0
	 do_syscall_64+0x5c/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_commit_transaction+0x4bb/0xa60
	 sync_filesystem+0x6b/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0xe/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x29/0x60
	 cleanup_mnt+0xb8/0x140
	 task_work_run+0x6d/0xb0
	 __prepare_exit_to_usermode+0x1cc/0x1e0
	 do_syscall_64+0x5c/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_record_root_in_trans+0x43/0x70
	 start_transaction+0xd1/0x5d0
	 btrfs_dirty_inode+0x42/0xd0
	 touch_atime+0xa1/0xd0
	 btrfs_file_mmap+0x3f/0x60
	 mmap_region+0x3a4/0x640
	 do_mmap+0x376/0x580
	 vm_mmap_pgoff+0xd5/0x120
	 ksys_mmap_pgoff+0x193/0x230
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
	 __might_fault+0x68/0x90
	 _copy_to_user+0x1e/0x80
	 perf_read+0x141/0x2c0
	 vfs_read+0xad/0x1b0
	 ksys_read+0x5f/0xe0
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 perf_event_init_cpu+0x88/0x150
	 perf_event_init+0x1db/0x20b
	 start_kernel+0x3ae/0x53c
	 secondary_startup_64+0xa4/0xb0

  -> #1 (pmus_lock){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 perf_event_init_cpu+0x4f/0x150
	 cpuhp_invoke_callback+0xb1/0x900
	 _cpu_up.constprop.26+0x9f/0x130
	 cpu_up+0x7b/0xc0
	 bringup_nonboot_cpus+0x4f/0x60
	 smp_init+0x26/0x71
	 kernel_init_freeable+0x110/0x258
	 kernel_init+0xa/0x103
	 ret_from_fork+0x1f/0x30

  -> #0 (cpu_hotplug_lock){++++}-{0:0}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 cpus_read_lock+0x39/0xb0
	 alloc_workqueue+0x378/0x450
	 __btrfs_alloc_workqueue+0x15d/0x200
	 btrfs_alloc_workqueue+0x51/0x160
	 scrub_workers_get+0x5a/0x170
	 btrfs_scrub_dev+0x18c/0x630
	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
	 btrfs_ioctl+0x2799/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->scrub_lock);
				 lock(&fs_devs->device_list_mutex);
				 lock(&fs_info->scrub_lock);
    lock(cpu_hotplug_lock);

   *** DEADLOCK ***

  2 locks held by btrfs/229626:
   #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
   #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  stack backtrace:
  CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? alloc_workqueue+0x378/0x450
   cpus_read_lock+0x39/0xb0
   ? alloc_workqueue+0x378/0x450
   alloc_workqueue+0x378/0x450
   ? rcu_read_lock_sched_held+0x52/0x80
   __btrfs_alloc_workqueue+0x15d/0x200
   btrfs_alloc_workqueue+0x51/0x160
   scrub_workers_get+0x5a/0x170
   btrfs_scrub_dev+0x18c/0x630
   ? start_transaction+0xd1/0x5d0
   btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
   btrfs_ioctl+0x2799/0x30a0
   ? do_sigaction+0x102/0x250
   ? lockdep_hardirqs_on_prepare+0xca/0x160
   ? _raw_spin_unlock_irq+0x24/0x30
   ? trace_hardirqs_on+0x1c/0xe0
   ? _raw_spin_unlock_irq+0x24/0x30
   ? do_sigaction+0x102/0x250
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.

Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.

Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info.  We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
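The pattern this commit message describes (allocate with no lock held, take the lock only to publish the result, and drop references through a helper) can be sketched in userspace as follows. This is not the btrfs code: `fs_state`, `workers`, `workers_get()` and `workers_put()` are hypothetical names, and a pthread mutex stands in for the scrub lock.

```c
#include <pthread.h>
#include <stdlib.h>

struct workers { int unused; };	/* stand-in for the scrub workqueues */

struct fs_state {
	pthread_mutex_t lock;	/* stands in for the scrub lock */
	struct workers *workers;
	int refs;		/* users of the shared workers */
};

/* Take a reference, allocating outside the lock. Returns 0 or -1. */
static int workers_get(struct fs_state *fs)
{
	struct workers *w;

	pthread_mutex_lock(&fs->lock);
	if (fs->workers) {
		fs->refs++;
		pthread_mutex_unlock(&fs->lock);
		return 0;
	}
	pthread_mutex_unlock(&fs->lock);

	w = malloc(sizeof(*w));	/* may block or reclaim: no lock held here */
	if (!w)
		return -1;

	pthread_mutex_lock(&fs->lock);
	if (!fs->workers) {
		fs->workers = w;	/* publish our allocation */
		fs->refs = 1;
		w = NULL;
	} else {
		fs->refs++;	/* lost the race: use the existing workers */
	}
	pthread_mutex_unlock(&fs->lock);

	free(w);	/* frees the unused copy; free(NULL) is a no-op */
	return 0;
}

/* Drop a reference; the last user frees the workers outside the lock. */
static void workers_put(struct fs_state *fs)
{
	struct workers *w = NULL;

	pthread_mutex_lock(&fs->lock);
	if (--fs->refs == 0) {
		w = fs->workers;
		fs->workers = NULL;
	}
	pthread_mutex_unlock(&fs->lock);
	free(w);
}
```

The second check under the lock handles the case where another thread published its own allocation while this one was allocating, which is why an unused copy may need to be freed.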