Skip to content

Locating THROTL_GRP_QUANTUM in assembly (and the more general question of feasibility of #16) #30

@whentojump

Description

@whentojump

THROTL_GRP_QUANTUM

TL;DR: The second while loop (https://elixir.bootlin.com/linux/v6.14/source/block/blk-throttle.c#L960) is compiled with certain unrolling optimization

Experiments: directly mutating VALUE below:

		if (nr_writes >= VALUE)
			break;
for value in `seq 1 10`; do
    sed "s|if (nr_writes >= FINDME)|if (nr_writes >= $value)|" \
        < block/blk-throttle.template.c \
        > block/blk-throttle.c
    make block/blk-throttle.o
    objdump -d block/blk-throttle.o > disas/origin.$value.txt
done

Running x=N; diff origin.$x.txt origin.$((x+1)).txt
Only 4 to 5 produce reasonably small changes. All other adjacent pairs differ drastically.

4083c4083
<     3648:     41 83 fd 04             cmp    $0x4,%r13d
---
>     3648:     41 83 fd 05             cmp    $0x5,%r13d

Adding __attribute__((optimize("O0"))) to throtl_dispatch_tg function

1 vs. 2

73,74c73,74
<       c3:     83 7d e4 00             cmpl   $0x0,-0x1c(%rbp)
<       c7:     75 39                   jne    102 <throtl_dispatch_tg+0xf2>
---
>       c3:     83 7d e4 01             cmpl   $0x1,-0x1c(%rbp)
>       c7:     77 39                   ja     102 <throtl_dispatch_tg+0xf2>

All other adjacent pairs are like

73c73
<       c3:     83 7d e4 01             cmpl   $0x1,-0x1c(%rbp)
---
>       c3:     83 7d e4 02             cmpl   $0x2,-0x1c(%rbp)

With -O0, code structure can be retained

Adding __attribute__((optimize("O2","-fno-unroll-loops","-fno-peel-loops","-fno-unswitch-loops")))

$ x=4 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
<     350e:     41 83 fc 04             cmp    $0x4,%r12d
---
>     350e:     41 83 fc 05             cmp    $0x5,%r12d
$ x=5 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
<     350e:     41 83 fc 05             cmp    $0x5,%r12d
---
>     350e:     41 83 fc 06             cmp    $0x6,%r12d
$ x=7 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
<     3512:     41 83 fc 07             cmp    $0x7,%r12d
---
>     3512:     41 83 fc 08             cmp    $0x8,%r12d
$ x=8 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
<     3512:     41 83 fc 08             cmp    $0x8,%r12d
---
>     3512:     41 83 fc 09             cmp    $0x9,%r12d
$ x=9 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
<     3512:     41 83 fc 09             cmp    $0x9,%r12d
---
>     3512:     41 83 fc 0a             cmp    $0xa,%r12d

All other adjacent pairs exhibit huge differences. E.g. from 6 to 7, the cmp is moved to a different address

$ for x in `seq 1 9`; do echo "x=$x"; diff no-unroll.$x.txt no-unroll.$((x+1)).txt | wc -l; done
x=1
3349
x=2
3322
x=3
89
x=4
4
x=5
4
x=6
99
x=7
4
x=8
4
x=9
4

Cancelling more optimization make things better but still not like -O0

__attribute__((optimize(
	"O2",
	"-fno-unroll-loops",
	"-fno-peel-loops",
	"-fno-unswitch-loops",
	"-fno-reorder-blocks",
	"-fno-guess-branch-probability",
	"-fno-reorder-functions"
)))
$ for x in `seq 1 9`; do echo "x=$x"; diff no-reorder.$x.txt no-reorder.$((x+1)).txt | wc -l; done
x=1
3304
x=2
3344
x=3
4
x=4
4
x=5
4
x=6
4
x=7
4
x=8
4
x=9
4

Dump optimization record

gcc -Wp,-MMD,block/.blk-throttle.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -I./ubuntu/include -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix -mfunction-return=thunk-extern -fno-jump-tables -mharden-sls=all -fpatchable-function-entry=16,16 -fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls -ftrivial-auto-var-init=zero -fno-stack-clash-protection -fzero-call-used-regs=used-gpr -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -fmin-function-alignment=16 -fstrict-flex-arrays=3 -fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs -Wno-frame-address -Wno-address-of-packed-member -Wmissing-declarations -Wmissing-prototypes -Wframe-larger-than=1024 -Wno-main -Wno-dangling-pointer -Wvla -Wno-pointer-sign -Wcast-function-type -Wno-stringop-overflow -Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wextra -Wunused -Wno-unused-but-set-variable -Wno-unused-const-variable -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation -Wno-stringop-truncation -Wno-override-init -Wno-missing-field-initializers -Wno-type-limits -Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -Wno-unused-parameter -g -gdwarf-5  -fsanitize=bounds-strict -fsanitize=shift -fsanitize=bool -fsanitize=enum  -fsanitize=signed-integer-overflow    -DKBUILD_MODFILE='"block/blk-throttle"' -DKBUILD_BASENAME='"blk_throttle"' -DKBUILD_MODNAME='"blk_throttle"' -D__KBUILD_MODNAME=kmod_blk_throttle \
-fopt-info-all-optall \
-fsave-optimization-record \
-c -o block/blk-throttle.o block/blk-throttle.c 
block/blk-throttle.c:954:56: note: considering unrolling loop 4 at BB 39
considering unrolling loop with constant number of iterations
considering unrolling loop with runtime-computable number of iterations

which confirms the theory again.

Feasibility of #16

See chat history https://chatgpt.com/share/68b53b44-6f98-8013-9258-dff728a75001
Especially the last question --

  1. Forget about compilers for now. If I want to achieve something semantically equivalent to "changing a constant in source from A to B", is it always possible to do so by only modifying assembly operands (i.e. not changing assembly code structure or instruction boundaries), even via heroic manual binary rewrite? If not, give counter examples
ChatGPT answer 3) Can you *always* emulate “A→B” by **only** changing existing operands (no instruction-length/control-flow/layout changes)?

No. Many real cases require changing instruction forms, duplicating/removing blocks, or editing data sections. Counterexamples:

  1. Immediate size overflows
    Original fits imm8, new needs imm32:
    cmp $127,%eax (3 bytes) → cmp $300,%eax needs the imm32 form (6 bytes). You can’t keep instruction boundaries unchanged.

  2. Algorithm selection changes
    Division/mod by constant:
    x/8sar $3,%eax (1 instruction).
    x/3 → sequence with a 64-bit multiply by magic + shifts/adds.
    Going to or from a power-of-two divisor crosses between one instruction and a sequence. No operand-only patch can do that.

  3. Loop unrolling/peeling/if-conversion driven by the constant
    A bound moving from 2→3 may flip heuristic thresholds: the compiler might (un)roll or (un)peel. The number of replicated basic blocks changes; you would have to add or remove blocks, not just immediates.

  4. Switch lowering & jump-table sizing
    Changing case 0..A to case 0..B generally requires resizing the jump table, adjusting the range check and bias, and possibly emitting more case bodies. Simply bumping a compare immediate will index past the last table entry — wrong semantics.

  5. Compile-time only constants
    Anything in #if / static_assert / template parameters / if constexpr affects what code exists. The “other side” of the branch was never generated; you can’t conjure it by operand edits.

  6. Layout-affecting constants
    Array/struct sizes or alignment constants change field offsets and stack frame layout throughout a function (and sometimes across TUs). Achieving semantic equivalence would require editing dozens of displacements, and may still miss ABI effects. Some changes also alter section sizes → cascading address shifts; you said “no structure change.”

  7. Call vs inline cutoff
    A size threshold might flip an inline into a library call (or vice versa). Operand-only edits can’t insert/remove a call sequence.

  8. Architectural encoding limits
    Some instructions simply lack an immediate form for the needed width (e.g., cmp r/m64, imm sign-extends imm32; if you needed a distinct 64-bit constant outside the sign-extendable range, there is no encoding to keep length identical).

  9. Constant folded away
    If A makes a branch statically dead (DCE removes the whole block) and B would keep it alive, the “missing” code isn’t present to re-enable with an operand tweak.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions