Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v6.7.0-scx5 #7

Closed
wants to merge 32 commits into from
Closed

v6.7.0-scx5 #7

wants to merge 32 commits into from

Conversation

Byte-Lab
Copy link
Contributor

No description provided.

Byte-Lab and others added 30 commits January 29, 2024 13:37
Some schedulers, such as scx_nest, want to have different logic
depending on whether a task was running prior to the scheduler being
enabled. Let's add a flag that they can check to make it easy to find
out whether the task is being forked, or was already running and is
being transitioned to the new scheduler.

Signed-off-by: David Vernet <void@manifault.com>
Let's make sure it's working as expected as well.

Signed-off-by: David Vernet <void@manifault.com>
Printing a test preamble and status for a skipped test is only really
useful for CI. Let's make the default behavior to not print skipped
tests.

Signed-off-by: David Vernet <void@manifault.com>
Add fork field to struct scx_init_task_args
We recently added a "minimal" scheduler testcase that loaded the bare
minimum of fields to exercise the default codepaths for ext.c. Let's add
a "maximal" testcase which defines every single defined sched_ext_ops
callback and verifies that it's successfully able to be loaded.

Signed-off-by: David Vernet <void@manifault.com>
In sched-ext/sched_ext#129, Andrea fixed a bug
where we trip over a NULL pointer deref by trying to load multiple
schedulers at a time. Let's add a stress test that tries to load
multiple schedulers in a tight loop at the same time.

Additionally, do a bit of cleanup in the build system to have testcases
take all BPF progs as dependencies. We don't really gain anything by
artificially coupling the name of testcases to the BPF progs they use.

Signed-off-by: David Vernet <void@manifault.com>
This testcase was mixing tabs and spaces, let's make it use tabs.

Signed-off-by: David Vernet <void@manifault.com>
The source of truth is https://github.com/sched-ext/scx. Let's just leave
the very simple two - scx_simple and scx_qmap.

Signed-off-by: Tejun Heo <tj@kernel.org>
scx: Purge most schedulers from tools/sched_ext
And rearragent stuff a bit for better density.

Signed-off-by: Tejun Heo <tj@kernel.org>
scx: Print p->cpus_ptr when dumping scheduler state
This can be used to implement interlocking against dequeue. e.g, Let's say
there is one dsq per CPU and the scheduler can stall if a task is dispatched
to a dsq which doesn't intersect with p->cpus_ptr. In the dispatch path, if
you do the following:

	if (bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))
		scx_bpf_dispatch(p, target_dsq, slice, 0);

The task can still end up in the wrong dsq because the cpumask can change
between bpf_cpumask_test_cpu() and scx_bpf_dispatch(). However,

	scx_bpf_dispatch(p, target_dsq, slice, 0);
	if (!bpf_cpumask_test_cpu(target_cpu, p->cpus_ptr))
		scx_bpf_dispatch_cancel();

eliminates the race window - either the set_cpumask path sees the task
dispatched or the dispatch path sees the updated cpumask. In both cases, the
dispatch gets canceled as it should.

Note that in schedules which don't implement .dequeue(), depending on the
implementation, there can be multiple enqueued instances of the same task.
However, as long as the scheduler guarantees that there is at least one
dispatch attempt on the task since the last enqueueing and that dispatch
attempt interlocks against the dequeue path as above, it's guaranteed to not
act upon stale information or miss dispatching an enqueued task.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <andrea.righi@canonical.com>
scx: Implement scx_bpf_dispatch_cancel()
We're basically always runnin two CI jobs: one for a remote push, and
another for when a PR is opened. These are essentially measuring the
same thing, so let's save CI bandwidth and just do a PR run. This will
hopefully make things a bit less noisy as well.

Signed-off-by: David Vernet <void@manifault.com>
BPF kfuncs are a bit special in that they have global linkage, even if
they're not called by any other function or compilation unit in the rest
of the kernel. This has caused issues in some other parts of the
codebase, and causes us to fail to compile starting in 6.8rc1. Let's
apply the existing BPF macros to our kfuncs to avoid these issues.

Signed-off-by: David Vernet <void@manifault.com>
scx: Use BPF macros for sched_ext kfuncs
- There's no reason to clear all scx_rq->cpus_* masks each time
  kick_cpus_irq_workfn() is run especially given that other cpumasks won't
  be set if ->cpus_to_kick isn't. Selectively clear bits which are set
  instead.

- Don't access rq->scx.pnt_seq if not needed.

- A bit of restructuring for redability and to avoid scanning ->cpus_to_wait
  unnecessarily.

Signed-off-by: Tejun Heo <tj@kernel.org>
No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Wanting to wake a CPU from idle is not an uncommon operaiton that a BPF
scheduler might want to do. The only way to do it is scx_bpf_kick_cpu();
however, this can lead to unnecessary churns when the target CPU is not idle
and there isn't a good way to synchronize against idle transitions to avoid
spurious kicks.

This patch implements SCX_KICK_IDLE which kicks the target CPU if it's idle.
The flag doesn't guarantee that a CPU will never be kicked unless idle but
it optimizes away spurious kicks in most cases.

Signed-off-by: Tejun Heo <tj@kernel.org>
To avoid crashing the kernel when CONFIG_CPUMASK_OFFSTACK is set.

Signed-off-by: Tejun Heo <tj@kernel.org>
scx: Actually allocate rq->scx.cpus_to_kick_if_idle
Signed-off-by: Tejun Heo <tj@kernel.org>
Trying to load a scheduler while `stress-ng --race-sched 8 --timeout 30` is
running easily triggers the following warning.

  sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[8698]
  WARNING: CPU: 1 PID: 1527 at kernel/sched/ext.c:2378 scx_set_task_state+0xb2/0x1a0
  CPU: 1 PID: 1527 Comm: stress-ng-race- Not tainted 6.7.0-work-00411-gc1c1b3b1133b-dirty #404
  Sched_ext: rustland (enabled+all), task: runnable_at=-76ms
  RIP: 0010:scx_set_task_state+0xb2/0x1a0
  ...
  Call Trace:
   <TASK>
   scx_ops_enable_task+0xd5/0x180
   switching_to_scx+0x17/0xa0
   __sched_setscheduler+0x623/0x810
   do_sched_setscheduler+0xea/0x170
   __x64_sys_sched_setscheduler+0x1c/0x30
   do_syscall_64+0x40/0xe0
   entry_SYSCALL_64_after_hwframe+0x46/0x4e

This race happens when sched_setscheudler() syscall races SCX ops_enable
path on a DEAD task. If the task is DEAD, scx_ops_enable() directly calls
scx_ops_exit_task() instead of enabling it. If another task was already in
the middle of sched_setscheduler() with the task's reference acquired, it
then can proceed to __sched_setscheduler() which will then call
scx_ops_enable_task(). However, the task has already been exited, so it ends
trying to transition the task from SCX_TASK_NONE to SCX_TASK_ENABLED
triggering the above warning.

This is because the ops_enable path is leaving the task in an inconsistent
state - scx_enabled() && scx_switching_all, so scx_should_scx(p) but p is
SCX_TASK_NONE instead of READY or ENABLED. If the task gets freed
afterwards, it's fine, but if there's an attribute change operation which
gets inbetween, that operation ends up trying to enable SCX on the
unprepared dead task.

The reason why both ops_enable and disable paths handle DEAD tasks
specially, I think, is historical. Before scx_tasks was introduced, the
paths used the usual tasklist iteration. However, tasks are removed from
tasklist before the task transitions to DEAD, so depending on the timing, a
DEAD task may show up in one iteration and disappear for the next one e.g.
while disabling is in progress without any way to detect the event. My
memory is hazy, so the details may not be accurate. Anyways, there were
issues around reliably iterating DEAD tasks and thus there was code to skip
them.

With scx_tasks, we're strongly synchronized against tasks being destoryed,
so this is no longer a concern, so it's likely that we can just remove this
special handling and things will just work.

After this patch, the kernel is surviving load/unload torture test but it's
possible that I'm completely misremembering things, in which case, we'll
have to relearn why TASK_DEAD has to be special.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fix invalid state transition while scheduler is loaded
…ops_disable_workfn()

When scx_ops_disable_workfn() invoked to disable a BPF scheduler, it cannot
depend on the scheduler working and thus can't depend on !RT tasks making
forward progress, so it performs a series of non-blocking operations to
restore forward progress guarantee and then kicks out the BPF scheduler.

Watchdog code added cancel_delayed_work_sync() in scx_ops_disable_workfn()
before forward progress guarantee is restored. cancel_delayed_work_sync()
implies flush_work() if the target work item is already executing and that
work item may not be able to run due to malfunctioning scheduling, making
the system stuck and unrecoverable.

There's no need to shutdown the watchdog timer early. Move
cancel_delayed_work_sync() to later in the disable process where all the
critical operaitons are complete and the kernel default scheduling is
restored.

Signed-off-by: Tejun Heo <tj@kernel.org>
scx: Don't wait for a work item before scheduling is restored in scx_…
@Byte-Lab
Copy link
Contributor Author

Byte-Lab commented Feb 13, 2024

Going to be merging future 6.7 fixes through the stable release branch instead

@Byte-Lab Byte-Lab closed this Feb 13, 2024
Byte-Lab pushed a commit that referenced this pull request Feb 13, 2024
[ Upstream commit fc3a553 ]

An issue occurred while reading an ELF file in libbpf.c during fuzzing:

	Program received signal SIGSEGV, Segmentation fault.
	0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
	4206 in libbpf.c
	(gdb) bt
	#0 0x0000000000958e97 in bpf_object.collect_prog_relos () at libbpf.c:4206
	#1 0x000000000094f9d6 in bpf_object.collect_relos () at libbpf.c:6706
	#2 0x000000000092bef3 in bpf_object_open () at libbpf.c:7437
	#3 0x000000000092c046 in bpf_object.open_mem () at libbpf.c:7497
	#4 0x0000000000924afa in LLVMFuzzerTestOneInput () at fuzz/bpf-object-fuzzer.c:16
	#5 0x000000000060be11 in testblitz_engine::fuzzer::Fuzzer::run_one ()
	#6 0x000000000087ad92 in tracing::span::Span::in_scope ()
	#7 0x00000000006078aa in testblitz_engine::fuzzer::util::walkdir ()
	#8 0x00000000005f3217 in testblitz_engine::entrypoint::main::{{closure}} ()
	#9 0x00000000005f2601 in main ()
	(gdb)

scn_data was null at this code(tools/lib/bpf/src/libbpf.c):

	if (rel->r_offset % BPF_INSN_SZ || rel->r_offset >= scn_data->d_size) {

The scn_data is derived from the code above:

	scn = elf_sec_by_idx(obj, sec_idx);
	scn_data = elf_sec_data(obj, scn);

	relo_sec_name = elf_sec_str(obj, shdr->sh_name);
	sec_name = elf_sec_name(obj, scn);
	if (!relo_sec_name || !sec_name)// don't check whether scn_data is NULL
		return -EINVAL;

In certain special scenarios, such as reading a malformed ELF file,
it is possible that scn_data may be a null pointer

Signed-off-by: Mingyi Zhang <zhangmingyi5@huawei.com>
Signed-off-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Changye Wu <wuchangye@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231221033947.154564-1-liuxin350@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Byte-Lab pushed a commit that referenced this pull request Mar 28, 2024
…-maps'

Eduard Zingerman says:

====================
libbpf: type suffixes and autocreate flag for struct_ops maps

Tweak struct_ops related APIs to allow the following features:
- specify version suffixes for stuct_ops map types;
- share same BPF program between several map definitions with
  different local BTF types, assuming only maps with same
  kernel BTF type would be selected for load;
- toggle autocreate flag for struct_ops maps;
- automatically toggle autoload for struct_ops programs referenced
  from struct_ops maps, depending on autocreate status of the
  corresponding map;
- use SEC("?.struct_ops") and SEC("?.struct_ops.link")
  to define struct_ops maps with autocreate == false after object open.

This would allow loading programs like below:

    SEC("struct_ops/foo") int BPF_PROG(foo) { ... }
    SEC("struct_ops/bar") int BPF_PROG(bar) { ... }

    struct bpf_testmod_ops___v1 {
        int (*foo)(void);
    };

    struct bpf_testmod_ops___v2 {
        int (*foo)(void);
        int (*bar)(void);
    };

    /* Assume kernel type name to be 'test_ops' */
    SEC(".struct_ops.link")
    struct test_ops___v1 map_v1 = {
        /* Program 'foo' shared by maps with
         * different local BTF type
         */
        .foo = (void *)foo
    };

    SEC(".struct_ops.link")
    struct test_ops___v2 map_v2 = {
        .foo = (void *)foo,
        .bar = (void *)bar
    };

Assuming the following tweaks are done before loading:

    /* to load v1 */
    bpf_map__set_autocreate(skel->maps.map_v1, true);
    bpf_map__set_autocreate(skel->maps.map_v2, false);

    /* to load v2 */
    bpf_map__set_autocreate(skel->maps.map_v1, false);
    bpf_map__set_autocreate(skel->maps.map_v2, true);

Patch #8 ties autocreate and autoload flags for struct_ops maps and
programs.

Changelog:
- v3 [3] -> v4:
  - changes for multiple styling suggestions from Andrii;
  - patch #5: libbpf log capture now happens for LIBBPF_INFO and
    LIBBPF_WARN messages and does not depend on verbosity flags
    (Andrii);
  - patch #6: fixed runtime crash caused by conflict with newly added
    test case struct_ops_multi_pages;
  - patch #7: fixed free of possibly uninitialized pointer (Daniel)
  - patch #8: simpler algorithm to detect which programs to autoload
    (Andrii);
  - patch #9: added assertions for autoload flag after object load
    (Andrii);
  - patch #12: DATASEC name rewrite in libbpf is now done inplace, no
    new strings added to BTF (Andrii);
  - patch #14: allow any printable characters in DATASEC names when
    kernel validates BTF (Andrii)
- v2 [2] -> v3:
  - moved patch #8 logic to be fully done on load
    (requested by Andrii in offlist discussion);
  - in patch #9 added test case for shadow vars and
    autocreate/autoload interaction.
- v1 [1] -> v2:
  - fixed memory leak in patch #1 (Kui-Feng);
  - improved error messages in patch #2 (Martin, Andrii);
  - in bad_struct_ops selftest from patch #6 added .test_2
    map member setup (David);
  - added utility functions to capture libbpf log from selftests (David)
  - in selftests replaced usage of ...__open_and_load by separate
    calls to ..._open() and ..._load() (Andrii);
  - removed serial_... in selftest definitions (Andrii);
  - improved comments in selftest struct_ops_autocreate
    from patch #7 (David);
  - removed autoload toggling logic incompatible with shadow variables
    from bpf_map__set_autocreate(), instead struct_ops programs
    autoload property is computed at struct_ops maps load phase,
    see patch #8 (Kui-Feng, Martin, Andrii);
  - added support for SEC("?.struct_ops") and SEC("?.struct_ops.link")
    (Andrii).

[1] https://lore.kernel.org/bpf/20240227204556.17524-1-eddyz87@gmail.com/
[2] https://lore.kernel.org/bpf/20240302011920.15302-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/20240304225156.24765-1-eddyz87@gmail.com/
====================

Link: https://lore.kernel.org/r/20240306104529.6453-1-eddyz87@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
htejun added a commit that referenced this pull request Mar 28, 2024
atropos: Remove static libclang dependency
Byte-Lab pushed a commit that referenced this pull request Apr 12, 2024
The driver creates /sys/kernel/debug/dri/0/mob_ttm even when the
corresponding ttm_resource_manager is not allocated.
This leads to a crash when trying to read from this file.

Add a check to create mob_ttm, system_mob_ttm, and gmr_ttm debug file
only when the corresponding ttm_resource_manager is allocated.

crash> bt
PID: 3133409  TASK: ffff8fe4834a5000  CPU: 3    COMMAND: "grep"
 #0 [ffffb954506b3b20] machine_kexec at ffffffffb2a6bec3
 #1 [ffffb954506b3b78] __crash_kexec at ffffffffb2bb598a
 #2 [ffffb954506b3c38] crash_kexec at ffffffffb2bb68c1
 #3 [ffffb954506b3c50] oops_end at ffffffffb2a2a9b1
 #4 [ffffb954506b3c70] no_context at ffffffffb2a7e913
 #5 [ffffb954506b3cc8] __bad_area_nosemaphore at ffffffffb2a7ec8c
 #6 [ffffb954506b3d10] do_page_fault at ffffffffb2a7f887
 #7 [ffffb954506b3d40] page_fault at ffffffffb360116e
    [exception RIP: ttm_resource_manager_debug+0x11]
    RIP: ffffffffc04afd11  RSP: ffffb954506b3df0  RFLAGS: 00010246
    RAX: ffff8fe41a6d1200  RBX: 0000000000000000  RCX: 0000000000000940
    RDX: 0000000000000000  RSI: ffffffffc04b4338  RDI: 0000000000000000
    RBP: ffffb954506b3e08   R8: ffff8fee3ffad000   R9: 0000000000000000
    R10: ffff8fe41a76a000  R11: 0000000000000001  R12: 00000000ffffffff
    R13: 0000000000000001  R14: ffff8fe5bb6f3900  R15: ffff8fe41a6d1200
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffb954506b3e00] ttm_resource_manager_show at ffffffffc04afde7 [ttm]
 #9 [ffffb954506b3e30] seq_read at ffffffffb2d8f9f3
    RIP: 00007f4c4eda8985  RSP: 00007ffdbba9e9f8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000000000037e000  RCX: 00007f4c4eda8985
    RDX: 000000000037e000  RSI: 00007f4c41573000  RDI: 0000000000000003
    RBP: 000000000037e000   R8: 0000000000000000   R9: 000000000037fe30
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f4c41573000
    R13: 0000000000000003  R14: 00007f4c41572010  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
Fixes: af4a25b ("drm/vmwgfx: Add debugfs entries for various ttm resource managers")
Cc: <stable@vger.kernel.org>
Reviewed-by: Zack Rusin <zack.rusin@broadcom.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240312093551.196609-1-jfalempe@redhat.com
Byte-Lab pushed a commit that referenced this pull request Apr 12, 2024
Petr Machata says:

====================
selftests: Fixes for kernel CI

As discussed on the bi-weekly call on Jan 30, and in mailing around
kernel CI effort, some changes are desirable in the suite of forwarding
selftests the better to work with the CI tooling. Namely:

- The forwarding selftests use a configuration file where names of
  interfaces are defined and various variables can be overridden. There
  is also forwarding.config.sample that users can use as a template to
  refer to when creating the config file. What happens a fair bit is
  that users either do not know about this at all, or simply forget, and
  are confused by cryptic failures about interfaces that cannot be
  created.

  In patches #1 - #3 have lib.sh just be the single source of truth with
  regards to which variables exist. That includes the topology variables
  which were previously only in the sample file, and any "tweak
  variables", such as what tools to use, sleep times, etc.

  forwarding.config.sample then becomes just a placeholder with a couple
  examples. Unless specific HW should be exercised, or specific tools
  used, the defaults are usually just fine.

- Several net/forwarding/ selftests (and one net/ one) cannot be run on
  veth pairs, they need an actual HW interface to run on. They are
  generic in the sense that any capable HW should pass them, which is
  why they have been put to net/forwarding/ as opposed to drivers/net/,
  but they do not generalize to veth. The fact that these tests are in
  net/forwarding/, but still complaining when run, is confusing.

  In patches #4 - #6 move these tests to a new directory
  drivers/net/hw.

- The following patches extend the codebase to handle well test results
  other than pass and fail.

  Patch #7 is preparatory. It converts several log_test_skip to XFAIL,
  so that tests do not spuriously end up returning non-0 when they
  are not supposed to.

  In patches #8 - #10, introduce some missing ksft constants, then support
  having those constants in RET, and then finally in EXIT_STATUS.

- The traffic scheduler tests generate a large amount of network traffic
  to test the behavior of the scheduler. This demands a relatively
  high-performance computer. On slow machines, such as with a debugging
  kernel, the test would spuriously fail.

  It can still be useful to "go through the motions" though, to possibly
  catch bugs in setup of the scheduler graph and passing packets around.
  Thus we still want to run the tests, just with lowered demands.

  To that end, in patches #11 - #12, introduce an environment variable
  KSFT_MACHINE_SLOW, with obvious meaning. Tests can then make checks
  more lenient, such as mark failures as XFAIL. A helper, xfail_on_slow,
  is provided to mark performance-sensitive parts of the selftest.

- In patch #13, use a similar mechanism to mark a NH group stats
  selftest to XFAIL HW stats tests when run on VETH pairs.

- All these changes complicate the hitherto straightforward logging and
  checking logic, so in patch #14, add a selftest that checks this
  functionality in lib.sh.

v1 (vs. an RFC circulated through linux-kselftest):
- Patch #9:
    - Clarify intended usage by s/set_ret/ret_set_ksft_status/,
      s/nret/ksft_status/
====================

Link: https://lore.kernel.org/r/cover.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Byte-Lab pushed a commit that referenced this pull request Apr 23, 2024
At current x1e80100 interface table, interface #3 is wrongly
connected to DP controller #0 and interface #4 wrongly connected
to DP controller #2. Fix this problem by connect Interface #3 to
DP controller #0 and interface #4 connect to DP controller #1.
Also add interface #6, #7 and #8 connections to DP controller to
complete x1e80100 interface table.

Changs in V3:
-- add v2 changes log

Changs in V2:
-- add x1e80100 to subject
-- add Fixes

Fixes: e3b1f36 ("drm/msm/dpu: Add X1E80100 support")
Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Reviewed-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Patchwork: https://patchwork.freedesktop.org/patch/585549/
Link: https://lore.kernel.org/r/1711741586-9037-1-git-send-email-quic_khsieh@quicinc.com
Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
Byte-Lab pushed a commit that referenced this pull request Apr 23, 2024
…git/netfilter/nf

netfilter pull request 24-04-11

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patches #1 and #2 add missing rcu read side lock when iterating over
expression and object type list which could race with module removal.

Patch #3 prevents promisc packet from visiting the bridge/input hook
	 to amend a recent fix to address conntrack confirmation race
	 in br_netfilter and nf_conntrack_bridge.

Patch #4 adds and uses iterate decorator type to fetch the current
	 pipapo set backend datastructure view when netlink dumps the
	 set elements.

Patch #5 fixes removal of duplicate elements in the pipapo set backend.

Patch #6 flowtable validates pppoe header before accessing it.

Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
         fails and pppoe packets follow classic path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Byte-Lab pushed a commit that referenced this pull request Apr 23, 2024
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 #12 [ffffa65531497b68] printk at ffffffff89318306
 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 #18 [ffffa65531497f10] kthread at ffffffff892d2e72
 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants