Merge new vantomkernel #7
Commits on Mar 31, 2021
sched/fair: fix implementation of is_min_capacity_cpu()
With CONFIG_SCHED_WALT disabled, is_min_capacity_cpu() is defined to always return true, which breaks the intended behavior of task_fits_max(). Revise is_min_capacity_cpu() to return correct results. An earlier version of this patch failed to handle the case when min_cap_orig_cpu == -1 while sched domains are being updated due to hotplug. Add a check for this case. Test: trace shows increased top-app placement on medium cores Bug: 117499098 Bug: 128477368 Bug: 130756111 Change-Id: Ia2b41aa7c57f071c997bcd0e9cdfd0808f6a2bf9 Signed-off-by: Connor O'Brien <connoro@google.com>
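A minimal sketch of the corrected helper, assuming this tree (like other Android 4.x kernels) keeps a min_cap_orig_cpu field in the root domain; the code is illustrative, not the literal patch:

    static inline bool is_min_capacity_cpu(int cpu)
    {
        int min_cap_cpu = cpu_rq(cpu)->rd->min_cap_orig_cpu;

        /* Sched domains may be mid-rebuild due to hotplug. */
        if (min_cap_cpu == -1)
            return false;

        return capacity_orig_of(cpu) == capacity_orig_of(min_cap_cpu);
    }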
Commit: 2b44633
sched: delete unused & buggy function definitions
None of these functions does what its name implies when CONFIG_SCHED_WALT=n. While all are currently unused, future patches could introduce subtle bugs by calling any of them from non-WALT-specific code. Delete the functions so it's obvious if new callers are added. Test: build kernel Change-Id: Ib7552afb5668b48fe2ae56307016e98716e00e63 Signed-off-by: Connor O'Brien <connoro@google.com>
Commit: e964f8e
cpufreq: schedutil: fix check for stale utilization values
Part of the fix from commit d86ab9c ("cpufreq: schedutil: use now as reference when aggregating shared policy requests") is reversed in commit 05d2ca2 ("cpufreq: schedutil: Ignore CPU load older than WALT window size") due to a porting mistake. Restore it while keeping the relevant change from the latter patch. Bug: 117438867 Test: build & boot Change-Id: I21399be760d7c8e2fff6c158368a285dc6261647 Signed-off-by: Connor O'Brien <connoro@google.com>
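Roughly, the restored aggregation loop in sugov_next_freq_shared() looks like this (a hedged sketch; field and variable names follow the Android schedutil sources, and `time` is the "now" timestamp passed in by the caller):

    for_each_cpu(j, policy->cpus) {
        struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
        s64 delta_ns = time - j_sg_cpu->last_update; /* vs. now, not j's own clock */

        /* Discard per-CPU utilization older than one WALT window. */
        if (delta_ns > sched_ravg_window) {
            j_sg_cpu->iowait_boost = 0;
            continue;
        }
        /* ...otherwise aggregate j_sg_cpu's util and flags... */
    }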
Commit: 30f7a60
sched: fair: balance for single core cluster
Android will unset SD_LOAD_BALANCE for single-core cluster domains, and some products do have a single-core cluster, so the MC domain lacks the SD_LOAD_BALANCE flag there. This breaks the select_task_rq_fair logic for that core, and a task can spin forever on it. Bug: 141334320 Test: boot and see task on core7 scheduled correctly Change-Id: I7c2845b1f7bc1d4051eb3ad6a5f9838fb0b1ba04 Signed-off-by: Wei Wang <wvw@google.com>
Commit: d107486
sched/fair: fix misfit with PELT
Commit 20017f3 ("sched/fair: Only kick nohz balance when runqueue has more than 1 task") disabled the nohz kick for LB when a rq has a misfit task. The assumption is that this would be addressed in the forced up-migration path. However, this path is WALT-specific, so disabling the nohz kick breaks PELT. Fix it by re-enabling the nohz_kick when there is a misfit task on the rq. Bug: 143472450 Bug: 145190765 Test: 10/10 iterations of eas_small_to_big ended up up-migrating Fixes: 20017f3 ("sched/fair: Only kick nohz balance when runqueue has more than 1 task") Signed-off-by: Quentin Perret <qperret@google.com> Change-Id: I9f708eb7661a9e82afdd4e99b878995c33703a45 (cherry picked from commit 2639b055d2526603ec8c12a2fe2413bf445c3def) Signed-off-by: Kyle Lin <kylelin@google.com>
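The shape of the fix, paraphrased against the 4.x nohz kick path (not the literal diff):

    /* Kick an idle CPU for balancing if we're overloaded, or if a
     * misfit task is stuck on this rq (needed for PELT up-migration). */
    if (rq->nr_running >= 2 || rq->misfit_task_load)
        return true;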
Commit: e5e1afd
kernel: sched: fix build breakage when PELT enabled
mark_reserved() is used by WALT but called from load_balance(), which leads to build breakage when WALT is disabled. Execute the function only if CONFIG_SCHED_WALT is enabled. Bug: 144142283 Test: Build and boot to home Change-Id: I5cc3e3ece6a28c6cdabbe6964f6a6032ff2ea809 Signed-off-by: Kyle Lin <kylelin@google.com>
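The fix amounts to compiling the call out when WALT is off, along these lines (illustrative):

    #ifdef CONFIG_SCHED_WALT
        mark_reserved(this_cpu);
    #endif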
Commit: cccdf0e
Revert "sched: fair: Always try to use energy efficient cpu for wakeups"
This reverts commit 63c2750. Conflicts: kernel/sched/fair.c Bug: 117438867 Test: Tracing confirms EAS is no longer always used Change-Id: If321547a86592527438ac21c3734a9f4decda712 Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Commit: 7090eb4
sched: Add back defines in sched.h which got removed in 04717e4
* Breaks CONFIG_SCHED_WALT=n compilation Signed-off-by: Subhajeet Muhuri <subhajeet.muhuri@gmail.com>
Commit: 9c61c60
sched: walt: fix build breakage with CONFIG_SCHED_WALT=n
Define trace_sched_load_balance_skip_tasks() when WALT is disabled. Bug: 79886566 Change-Id: I73103fc5dc3b4635ed24297b8fda42c0f71c8eb4 Signed-off-by: Steve Muckle <smuckle@google.com>
Commit: 9cbdefc
attribute page lock and waitqueue functions as sched
trace_sched_blocked_trace in CFS is really useful for debugging via trace because it tells where the process was stuck on the callstack. For example,

    <...>-6143 ( 6136) [005] d..2 50.278987: sched_blocked_reason: pid=6136 iowait=0 caller=SyS_mprotect+0x88/0x208
    <...>-6136 ( 6136) [005] d..2 50.278990: sched_blocked_reason: pid=6142 iowait=0 caller=do_page_fault+0x1f4/0x3b0
    <...>-6142 ( 6136) [006] d..2 50.278996: sched_blocked_reason: pid=6144 iowait=0 caller=SyS_prctl+0x52c/0xb58
    <...>-6144 ( 6136) [006] d..2 50.279007: sched_blocked_reason: pid=6136 iowait=0 caller=vm_mmap_pgoff+0x74/0x104

However, it sometimes gives pointless information like this:

    RenderThread-2322 ( 1805) [006] d.s3 50.319046: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220
    logd.writer-594 ( 587) [002] d.s3 50.334011: sched_blocked_reason: pid=6126 iowait=1 caller=wait_on_page_bit+0x194/0x208
    kworker/u16:13-333 ( 333) [007] d.s4 50.343161: sched_blocked_reason: pid=6136 iowait=1 caller=__lock_page_killable+0x17c/0x220

Entries such as wait_on_page_bit and __lock_page_killable are pointless because they carry no higher-level information that identifies the callstack. The reason is that page lock and waitqueue are special synchronization methods, unlike normal locks (mutex, spinlock). Let's mark them as "__sched" so that get_wchan, which is used by trace_sched_blocked_trace, can detect and skip them. This produces more meaningful callstack functions, like this:

    <...>-2867 ( 1068) [002] d.h4 124.209701: sched_blocked_reason: pid=329 iowait=0 caller=worker_thread+0x378/0x470
    <...>-2867 ( 1068) [002] d.s3 124.209763: sched_blocked_reason: pid=8454 iowait=1 caller=__filemap_fdatawait_range+0xa0/0x104
    <...>-2867 ( 1068) [002] d.s4 124.209803: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
    ScreenDecoratio-2364 ( 1867) [002] d.s3 124.209973: sched_blocked_reason: pid=8454 iowait=1 caller=f2fs_wait_on_page_writeback+0x84/0xcc
    ScreenDecoratio-2364 ( 1867) [002] d.s4 124.209986: sched_blocked_reason: pid=869 iowait=0 caller=worker_thread+0x378/0x470
    <...>-329 ( 329) [000] d..3 124.210435: sched_blocked_reason: pid=538 iowait=0 caller=worker_thread+0x378/0x470
    kworker/u16:13-538 ( 538) [007] d..3 124.210450: sched_blocked_reason: pid=6 iowait=0 caller=worker_thread+0x378/0x470

Bug: 144713689
Change-Id: I30397400c5d056946bdfbc86c9ef5f4d7e6c98fe
Signed-off-by: Minchan Kim <minchan@google.com>
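For reference, __sched works by placing the function in the .sched.text section, which the wchan stack walker skips; marking a waiter is just an annotation on its definition (illustrative declarations matching the mm/filemap.c signatures):

    void __sched wait_on_page_bit(struct page *page, int bit_nr);
    int __sched __lock_page_killable(struct page *page);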
Commit: d43a5de
File list: kernel/sched/* Bug: 147204230 Bug: 146759211 Change-Id: I683c02c82967654aac2fb97999e021032e9b79ef Signed-off-by: Wilson Sung <wilsonsung@google.com> Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Commit: 97b4035
ANDROID: increase limit on sched-tune boost groups
Some devices need an additional sched-tune boost group to optimize performance for key tasks Bug: 150302001 Change-Id: I392c8cc05a8851f1d416c381b4a27242924c2c27 Signed-off-by: Todd Kjos <tkjos@google.com>
Commit: 6642071
sched: separate boost signal from placement hint
Test: build and boot Bug: 144451857 Bug: 147785606 Change-Id: Ib2d86a72cad12971a99c7105813473211a7fbd76 Signed-off-by: Wei Wang <wvw@google.com>
Commit: 865d8d1
sched: adopt capacity margin 20% from EAS PELT's capacity_margin
We have seen cases where tasks were stuck on little cores without migration. This also aligns with the is_packing_eligible capacity margin calculation. Bug: 143472450 Bug: 147785606 Change-Id: I8650e6acc541172eb6a981c6b4b4c875d44dbb28 Signed-off-by: Wei Wang <wvw@google.com>
Commit: 760f270
sched: bump big core down migration further to 31%
Some UI tasks are struggling to meet their deadlines on the smaller cores. As those tasks have strict deadlines and their IPC is very different from what we put in the EM, bump the capacity margin in the hope of lowering the chance of those tasks migrating to smaller cores. Bug: 144451857 Test: jank test Change-Id: Ifb53289103b2f92a7f30f456375839c4b147b106 Signed-off-by: Wei Wang <wvw@google.com>
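For context, these percentages map onto the kernel's capacity_margin constant, which is expressed against SCHED_CAPACITY_SCALE (1024): a task fits a CPU when util * margin < capacity * 1024, so 20% corresponds to 1280 and 31% to roughly 1484. A hedged illustration (the exact constant used by this tree is an assumption):

    /* Illustrative fit check; 1484 ~= 1024 * 100 / 69 for a 31% margin. */
    static inline bool task_fits(unsigned long util, unsigned long capacity)
    {
        return capacity * 1024 > util * 1484;
    }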
Commit: 0815629
sched/fair: check if mid capacity cpu exists
A mid-capacity cpu was first introduced in kernel 4.14 for floral. However, other 4.14 platforms, such as sunfish, may not have a mid-capacity cpu, so we need to check that it exists before using it. Bug: 142551658 Test: boot to home Change-Id: I9b7f5b94b337167b9790def4953854baab96eaa2 Signed-off-by: Rick Yiu <rickyiu@google.com> [fixed merge conflict] Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Commit: f2b789a
sched: bump big core down migration further to 35%
Further tuning based on more UI benchmarks and devices. Bug: 144451857 Test: jank test Signed-off-by: Wei Wang <wvw@google.com> Change-Id: I35f7730d06acbf4493b6506150ed54f73ab4537d
Commit: fa75821
sched: separate capacity margin for boosted tasks
With the introduction of the placement hint patch, boosted tasks will not always be scheduled on big cores. We tune the capacity margin to let important boosted tasks get scheduled on big cores. However, the capacity margin affects all groups of tasks, so non-boosted tasks also get more chances to be scheduled on big cores. This is solved by separating the capacity margin for boosted tasks. Bug: 147785606 Test: margin set correctly Signed-off-by: Rick Yiu <rickyiu@google.com> Change-Id: I2b02e138e36a6844afbc1ade60fe86a001814b30
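One plausible shape for the separation, sketched with assumed names (schedtune_task_boost() exists in Android kernels; capacity_margin_boosted is hypothetical):

    /* Pick the fit margin based on the task's boost status. */
    static inline unsigned long capacity_margin_of(struct task_struct *p)
    {
        return schedtune_task_boost(p) > 0 ? capacity_margin_boosted
                                           : capacity_margin;
    }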
Commit: 0b1f5cc
sched/fair: Skip cpu if task does not fit in
Previously we skipped the util check on an idle cpu if the task prefers idle, but we still need to make sure the task fits that cpu after considering the capacity margin (on little cores only). Bug: 147785606 Test: cpu skipped as expected Signed-off-by: Rick Yiu <rickyiu@google.com> Change-Id: I7c85768ceda94b44052c7c9428fd50088268edad
Commit: 549613d
sched: change capacity margin down to 20% for non-boost task
Bug: 153761742 Test: Build Signed-off-by: Wei Wang <wvw@google.com> Change-Id: I21289f1c920b59364b74ba2408e2ca87cf57ea49
Commit: a39da8b
sched: restrict iowait boost to tasks with prefer_idle
Currently iowait boost doesn't distinguish background from foreground tasks, and we have seen cases where a device runs at high frequency unnecessarily while doing some background I/O. This patch limits iowait boost to tasks with prefer_idle only. Specifically, on Pixel, those are foreground and top-app tasks. Bug: 130308826 Test: Boot and trace Change-Id: I2d892beeb4b12b7e8f0fb2848c23982148648a10 Signed-off-by: Wei Wang <wvw@google.com>
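Paraphrased, the gate in the schedutil update hook becomes something like the following (schedtune_prefer_idle() is the Android helper; the exact call site and signature here are assumptions):

    /* Only prefer_idle tasks (foreground/top-app on Pixel) may arm
     * the iowait boost. */
    if ((flags & SCHED_CPUFREQ_IOWAIT) && schedtune_prefer_idle(current))
        sugov_set_iowait_boost(sg_cpu, time);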
Commit: a9a8d53
Bug: 144142283 Test: build and boot to home Change-Id: I63128831127f52bd233f1ce99650ccaceb25bf5e Signed-off-by: Kyle Lin <kylelin@google.com>
Commit: bca4d9d
defconfig: remove CONFIG_CPU_BOOST
Bug: 115684360 Bug: 113594604 Test: Build Change-Id: I9141b9bac316604730f0e277ca0212e86df3a90d Signed-off-by: Wei Wang <wvw@google.com>
Commit: 34ca89c
defconfig: turn off CONFIG_SCHED_CORE_CTL
This functionality is unused on this platform. Disable it to prevent incurring unnecessary overhead. Change-Id: Ia52ab5fb9a7119ba4495879fa755c846fdde498e Signed-off-by: Steve Muckle <smuckle@google.com>
Commit: ad31a2d
defconfig: Shorten PELT ramp/delay halflife to 16 ms
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: dcadd93
include/linux/ (device-mapper/kernel) : Fix Clang-12 warnings
I get tons of warnings when compiling the kernel with Clang-12, like the one below:

    In file included from arch/arm64/kernel/asm-offsets.c:22:
    In file included from ./include/linux/mm.h:477:
    In file included from ./include/linux/huge_mm.h:7:
    In file included from ./include/linux/fs.h:8:
    ./include/linux/dcache.h:509:9: warning: '(' and '{' tokens introducing statement expression are separated by whitespace [-Wcompound-token-split-by-space]
            return mult_frac(val, sysctl_vfs_cache_pressure, 100);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ./include/linux/kernel.h:161:35: note: expanded from macro 'mult_frac'
    #define mult_frac(x, numer, denom)( \
                                      ^
    ./include/linux/dcache.h:509:9: warning: '}' and ')' tokens terminating statement expression are separated by whitespace [-Wcompound-token-split-by-space]
            return mult_frac(val, sysctl_vfs_cache_pressure, 100);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ./include/linux/kernel.h:165:50: note: expanded from macro 'mult_frac'
            (quot * (numer)) + ((rem * (numer)) / (denom)); \
                                                           ^
    2 warnings generated.

Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: DarkDampSquib <andrin.geiger1998@gmail.com>
Commit: ad899c5
drivers: cpufreq: Remove performance cpu governor dependency
Without this dependency, the performance governor can be omitted. Signed-off-by: tytydraco <tylernij@gmail.com>
Commit: 3376d5b
mm: Do not periodically flush dirty pages
Let dirty_ratio handle this; do not use a constant timer to flush data to the disk. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit: 8310172
sched: Allow realtime tasks to consume entire sched periods
If the scenario is right, we can run realtime tasks for 5% longer. This also disables lockup protection from unhandled realtime tasks. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
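For context, the stock defaults reserve 5% of every period for non-RT work; this change effectively removes that reserve. The knobs involved (actual upstream defaults from kernel/sched/core.c):

    /* RT tasks may consume 950 ms of every 1 s period by default;
     * setting the runtime to -1 (or equal to the period) disables the
     * throttle, and with it the stuck-RT-task safety net. */
    unsigned int sysctl_sched_rt_period = 1000000;  /* us */
    int sysctl_sched_rt_runtime = 950000;           /* us */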
Commit: 5a93852
mm: compaction: avoid 100% CPU usage during compaction when a task is killed
"howaboutsynergy" reported via kernel Bugzilla number 204165 that compact_zone_order was consuming 100% CPU during a stress test for prolonged periods of time. Specifically, the following command, which should exit in 10 seconds, was taking an excessive time to finish while the CPU was pegged at 100%:

    stress -m 220 --vm-bytes 1000000000 --timeout 10

Tracing indicated a pattern as follows:

    stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
    stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0

Note that compaction is entered in rapid succession while scanning and isolating nothing. The problem is that when a task that is compacting receives a fatal signal, it retries indefinitely instead of exiting, while making no progress, as a fatal signal is pending. It's not easy to trigger this condition, although enabling zswap helps on the basis that the timing is altered. A very small window has to be hit for the problem to occur (signal delivered while compacting and isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX). This was reproduced locally (16G single-socket system, 8G swap, 30% zswap configured, vm-bytes 22000000000, using Colin King's stress-ng implementation from GitHub running in a loop until the problem hit). Tracing recorded the problem occurring almost 200K times in a short window. With this patch, the problem hit 4 times, but the task exited normally instead of consuming CPU. This problem has existed for some time but was made worse by commit cf66f07 ("mm, compaction: do not consider a need to reschedule as contention"). Before that commit, if the same condition was hit then locks would quickly be contended and compaction would exit that way.

Change-Id: I67b546921390d17b393c1f3f2f195db9de499255
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
Fixes: cf66f07 ("mm, compaction: do not consider a need to reschedule as contention")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [5.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-Commit: 670105a
Git-Repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
[vinmenon@codeaurora.org: trivial conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Commit: fa93d7e
Optimized Console FrameBuffer for up to 70% increase in performance
Signed-off-by: Joe Maples <joe@frap129.org>
Commit: 7f499c6
random: prevent add_input from doing anything
Change-Id: I17e734057ba567f084ecd6eaa3ef282720608d66 Signed-off-by: Alex Naidis <alex.naidis@linux.com> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Commit: ea082b7
workqueue: Implement delayed_work_busy()
work_busy() was available for regular work only; however, it is useful for delayed work too. This implements a variant of work_busy() for delayed work. Signed-off-by: Alex Naidis <alex.naidis@linux.com> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Commit: 8329796
printk: use buffer from the stack space
Android uses way too much printk. Avoid allocating heap memory for lower overheads. Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Commit: 64a1eb1
sched/fair: use actual cpu capacity to calculate boosted util
Currently, when calculating boosted util for a cpu, a fixed value of 1024 is used. So when top-app tasks are moved to LC, which has much lower capacity than BC, the calculated freq will be high even when the cpu util is low. This results in higher power consumption, especially on arches that have more little cores than big cores. Replacing the fixed value of 1024 with the actual cpu capacity reduces the freq calculated on LC. Bug: 152925197 Test: boosted util reduced on little cores Signed-off-by: Rick Yiu <rickyiu@google.com> Change-Id: I80cdd08a2c7fa5e674c43bfc132584d85c14622b
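The gist of the change in the schedtune margin computation, paraphrased (the real code uses a reciprocal divide; names follow the Android sources):

    /* was: margin = boost% * (SCHED_CAPACITY_SCALE - signal) / 100 */
    static unsigned long
    schedtune_margin(unsigned long capacity, unsigned long signal, long boost)
    {
        /* Scale the headroom by the cpu's own capacity, not 1024. */
        return boost * (capacity - signal) / 100;
    }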
Commit: 4235264
sched/fair: do not use boosted margin for prefer_high_cap case
For the prefer_high_cap case, the search already starts from the mid/max cpu, so there is no need to use the boosted margin for task placement. Bug: 160082718 Test: tasks scheduled as expected Signed-off-by: Rick Yiu <rickyiu@google.com> Change-Id: I4df27b1e468484f5d9aedfa57ee444f397a8da81 Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> [UtsavBalar1231] Fix Conflict: kernel/sched/fair.c
Commit: b71a343
sched/fair: schedule lower priority tasks from little cores
With scheduler placement hint, there still could be several boosted tasks contending for big cores. On chipset with fewer big cores, it might cause problems like jank. To improve it, schedule tasks of prio >= DEFAULT_PRIO from little cores if they could fit, even for tasks that prefer high capacity cpus, since such prio means they are less important. Bug: 158936596 Test: tasks scheduled as expected Signed-off-by: Rick Yiu <rickyiu@google.com> Change-Id: Ic0cc06461818944e3e97ec0493c0d9c9f1a5e217 [backported to 4.14] Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Commit: 15bca9f
Revert "sched: use rq_clock if WALT is not enabled"
This reverts commit 6a6e344.
Commit: 94679c8
sched: fix issue of cpu freq running at max always
All cpus were running at max freq. The reason is that in the sugov_up_down_rate_limit() check in cpufreq_schedutil.c, the time passed in is always 0, so the check is always true and the freq will not be updated. This is caused by sched_ktime_clock() returning 0 if CONFIG_SCHED_WALT is not set. Fix it by replacing sched_ktime_clock() with ktime_get_ns(). Bug: 119932718 Test: cpu freq could change after fix Change-Id: I62a0b35208dcd7a1d23da27f909cce3e59208d1f Signed-off-by: Rick Yiu <rickyiu@google.com>
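The fix itself is essentially a one-line clock swap (sketch):

    u64 now;

    /* sched_ktime_clock() returns 0 with CONFIG_SCHED_WALT=n, so the
     * rate-limit delta was permanently 0; use an always-advancing clock. */
    now = ktime_get_ns();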
Commit: 1522bed
msm: camera: isp: Use boot clock for recording start time
* Our camera HAL uses boot time for buffer timestamp, rather than system monotonic time. This leads to issues as framework uses system monotonic time as reference start time for timestamp adjustment. * This patch is taken from stock kernel source. Change-Id: Ia4fac1d48e2206a7befd0030585776371dd8c3a9
Commit: 770323f
sched: fair: placement optimization for heavy load
Previously we used the pure CFS wakeup path in the overutilized case. This is a tweaked version that activates the path only for important tasks. Bug: 161190988 Bug: 160883639 Test: boot and systrace Signed-off-by: Wei Wang <wvw@google.com> Change-Id: I2a27f241b3ba32a04cf6f88deb483d6636440dcf
Commit: fadf163
Revert "Revert "sched/core: fix userspace affining threads incorrectly""
This reverts commit dcfd5b9.
Commit: 692a195
Revert "Revert "sched/core: Fix use after free issue in is_sched_lib_…
…based_app()"" This reverts commit efea863.
Commit: 9ba25f7
Revert "Revert "sched: Improve the scheduler""
This reverts commit 9112371.
Commit: b3e822b
sched/core: fix userspace affining threads incorrectly by task name.
To identify certain apps that request max cpu freq or affine their tasks to specific cpus, the task name, besides the lib name, is another attribute by which we can identify a suspicious task. Test: build and test the 'Perfect Kick 2' game. Bug: 163293825 Bug: 161324271 Change-Id: I4359859db743b4c9122e9df40af0b109370e8f1f Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
Commit: 3142f49
../kernel/sched/fair.c:5897:17: error: ISO C90 forbids mixing declarations and code [-Werror,-Wdeclaration-after-statement]
        struct cfs_rq *cfs_rq;
        ^
1 error generated.
Commit: 84616e9
Revert "nl80211: fix non-split wiphy information"
This reverts commit 1718088.
Commit: c745ff4
sched/boost: Inhibit boosting functionality
Inhibit the actual boosting functionality while preserving the value store/show functionality of the node. This allows us to be free of CAF's suboptimal boosting (which causes mass task migrations and thus higher latency as well as unnecessarily high power usage) while not breaking perfd or kernel managers. Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: ae6c178
cpuidle: Change idle policy to always enter C1 first
Change the cpuidle policy to always enter C1 first and set a timer with a duration of half the C1_max_residency. Once the timer expired, it can then enter C4. Bug: 143480592 Test: Verify idle behavior and check power & perf on several use cases. Change-Id: I3c54e34b7af2f262bd029f48fef4e00620536586 Signed-off-by: Jia-yi Chen <jychen@google.com>
Commit: f3649e2
tcp_output: set initial TCP window size to 64K (speed improvement)
Signed-off-by: engstk <eng.stk@sapo.pt> Signed-off-by: Joe Maples <joe@frap129.org>
Commit: 8a83c33
Qualcomm's LLCC controller does not have an error IRQ line and instead polls to check memory banks for errors every 5 seconds, which is inefficient and will add to system jitter. The generic Kryo CPU cache controller does have error IRQ lines so it doesn't need to use polling, but EDAC in general is fairly useless in its current state anyway because Google disabled the option to panic on uncorrectable error. Let's follow their decision and just disable EDAC entirely, as well as its placeholder RAS dependency. Change-Id: I861cf0a31b2a8798545d55400900b1c504f653da Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 4b919ca
f2fs: set ioprio of GC kthread to idle
GC should run as conservatively as possible to reduce latency spikes for the user. Setting the ioprio to the idle class allows the kernel to schedule the GC thread's I/O so that it does not affect any other process's I/O requests. Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 05e2297
f2fs: Demote GC thread to idle scheduler class
We don't want the background GC work causing UI jitter should it ever collide with periods of user activity. Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 66113ff
arm64: dts: sdmmagpie: Power off DSI PHY during idle PC
On command mode panels, we can power off the DSI PHY entirely during idle PC to save more power than ULPS. Change-Id: Ie566b8990acb77d837d078582b35f2a2b6ea98b6 Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 35c4db1
ARM64: configs: Enable support for USB ACM devices
Enabled upon user request for using Proxmark devices over USB. Change-Id: Ibdba0a32e5f4a32cd0eeabf25595680acac0eca6 Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 0a7f553
arm64: dts: sdmmagpie: Don't ratelimit userspace kmsg logging
Ratelimiting messages from init and fsck.f2fs makes early userspace boot debugging much more difficult for no good reason. Change-Id: I024ec6549a6374ceecec04642b679f43ec559fd9 Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: fec3d62
hwrng: core - Freeze khwrng thread during suspend
The hwrng_fill() function can run while devices are suspending and resuming. If the hwrng is behind a bus such as i2c or SPI and that bus is suspended, the hwrng may hang the bus while attempting to add some randomness. It's been observed on ChromeOS devices with suspend-to-idle (s2idle) and an i2c based hwrng that this kthread may run and ask the hwrng device for randomness before the i2c bus has been resumed. Let's make this kthread freezable so that we don't try to touch the hwrng during suspend/resume. This ensures that we can't cause the hwrng backing driver to get into a bad state because the device is guaranteed to be resumed before the hwrng kthread is thawed. Cc: Andrey Pronin <apronin@chromium.org> Cc: Duncan Laurie <dlaurie@chromium.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Alexander Steffen <Alexander.Steffen@infineon.com> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: celtare21 <celtare21@gmail.com>
Commit: 2ee3a17
random: Support freezable kthreads in add_hwgenerator_randomness()
The kthread calling this function is freezable after commit 03a3bb7 ("hwrng: core - Freeze khwrng thread during suspend") is applied. Unfortunately, this function uses wait_event_interruptible() but doesn't check for the kthread being woken up by the fake freezer signal. When a user suspends the system, this kthread will wake up, and if it fails the entropy size check it will immediately go back to sleep and not go into the freezer. Eventually, suspend will fail because the task never froze and a warning message like this may appear:

    PM: suspend entry (deep)
    Filesystems sync: 0.000 seconds
    Freezing user space processes ... (elapsed 0.001 seconds) done.
    OOM killer disabled.
    Freezing remaining freezable tasks ...
    Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
    hwrng R running task 0 289 2 0x00000020
    [<c08c64c4>] (__schedule) from [<c08c6a10>] (schedule+0x3c/0xc0)
    [<c08c6a10>] (schedule) from [<c05dbd8c>] (add_hwgenerator_randomness+0xb0/0x100)
    [<c05dbd8c>] (add_hwgenerator_randomness) from [<bf1803c8>] (hwrng_fillfn+0xc0/0x14c [rng_core])
    [<bf1803c8>] (hwrng_fillfn [rng_core]) from [<c015abec>] (kthread+0x134/0x148)
    [<c015abec>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)

Check for a freezer signal here and skip adding any randomness if the task wakes up because it was frozen. This should make the kthread freeze properly and suspend work again.

Fixes: 03a3bb7 ("hwrng: core - Freeze khwrng thread during suspend")
Reported-by: Keerthy <j-keerthy@ti.com>
Tested-by: Keerthy <j-keerthy@ti.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: celtare21 <celtare21@gmail.com>
Commit: 5a5ee18
random: Use wait_event_freezable() in add_hwgenerator_randomness()
Sebastian reports that after commit ff29629 ("random: Support freezable kthreads in add_hwgenerator_randomness()") we can call might_sleep() when the task state is TASK_INTERRUPTIBLE (state=1). This leads to the following warning:

    do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000349d1489>] prepare_to_wait_event+0x5a/0x180
    WARNING: CPU: 0 PID: 828 at kernel/sched/core.c:6741 __might_sleep+0x6f/0x80
    Modules linked in:
    CPU: 0 PID: 828 Comm: hwrng Not tainted 5.3.0-rc7-next-20190903+ #46
    RIP: 0010:__might_sleep+0x6f/0x80
    Call Trace:
     kthread_freezable_should_stop+0x1b/0x60
     add_hwgenerator_randomness+0xdd/0x130
     hwrng_fillfn+0xbf/0x120
     kthread+0x10c/0x140
     ret_from_fork+0x27/0x50

We shouldn't call kthread_freezable_should_stop() from deep within the wait_event code because the task state is still set as TASK_INTERRUPTIBLE instead of TASK_RUNNING and kthread_freezable_should_stop() will try to call into the freezer with the task in the wrong state. Use wait_event_freezable() instead so that it calls schedule() in the right place and tries to enter the freezer when the task state is TASK_RUNNING instead.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Keerthy <j-keerthy@ti.com>
Fixes: ff29629 ("random: Support freezable kthreads in add_hwgenerator_randomness()")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: celtare21 <celtare21@gmail.com>
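The core of the fix is a single call-site change in add_hwgenerator_randomness(), roughly as it landed upstream:

    wait_event_freezable(random_write_wait,
            kthread_should_stop() ||
            ENTROPY_BITS(&input_pool) <= random_write_wakeup_bits);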
Commit: cf8852d
icnss: Fix log spam caused by wrong paired PM operation for ICNSS
Because of using the wrong pair of PM operations, the device was not resuming if suspend failed. The device also gives errors in dmesg like below:

    [ 2420.308491] dpm_run_callback(): icnss_pm_suspend_noirq+0x0/0xc0 returns -11
    [ 2420.308510] PM: Device 18800000.qcom,icnss failed to suspend async: error -11
    [ 2420.312069] PM: noirq suspend of devices failed
    [ 2423.384002] dpm_run_callback(): icnss_pm_suspend_noirq+0x0/0xc0 returns -11
    [ 2423.384020] PM: Device 18800000.qcom,icnss failed to suspend async: error -11
    [ 2423.317523] PM: noirq suspend of devices failed
    [ 2426.444164] dpm_run_callback(): icnss_pm_suspend_noirq+0x0/0xc0 returns -11
    [ 2426.444181] PM: Device 18800000.qcom,icnss failed to suspend async: error -11
    [ 2426.447813] PM: noirq suspend of devices failed
    [ 2428.915643] dpm_run_callback(): icnss_pm_suspend_noirq+0x0/0xc0 returns -11
    [ 2428.915659] PM: Device 18800000.qcom,icnss failed to suspend async: error -11
    [ 2428.919208] PM: noirq suspend of devices failed
    [ 2429.529067] dpm_run_callback(): icnss_pm_suspend_noirq+0x0/0xc0 returns -11
    [ 2429.529086] PM: Device 18800000.qcom,icnss failed to suspend async: error -11
    [ 2423.532786] PM: noirq suspend of devices failed

Add changes to use the correct set of PM operations and fix the log spam.

Signed-off-by: atndko <z1281552865@gmail.com>
Commit: edf5397
drm/msm: do not notify events when the system is shutting down
It's been observed that a panic could occur due to a race with a MSM_DRM_BLANK_POWERDOWN handler and a driver removal function upon shutdown while the screen is on. Since this shouldn't happen, send DRM events only when the system is running. This doesn't seem to be easy to detect as the events are sent after the userspace drm process terminates, which is before init binary notifies the kernel about the shutdown. Workaround this by detecting shutdown upon sysrq kill-all-tasks(i), sent from the init binary. Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Commit: 9511237
drm/msm/dsi-staging: allow physical power-off
This saves power when displaying static images. Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Commit: 66f43cb
drm/msm/dsi-staging: enable ULPS
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Commit: e8da234
qos: Don't allow userspace to impose restrictions on CPU idle levels
Giving userspace intimate control over CPU latency requirements is nonsense. Userspace can't even stop itself from being preempted, so there's no reason for it to have access to a mechanism primarily used to eliminate CPU delays on the order of microseconds. Remove userspace's ability to send pm_qos requests so that it can't hurt power consumption. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: f5e47e7
sched: features: Disable EAS_PREFER_IDLE
This option tends to assign tasks to the best (in energy terms) idle CPU, but if there is no totally idle core, then the task will be assigned to a very high power core, generating a lot of heat and a lot of power consumption for literally NO REASON.
Commit: 9a68b56
selinux: Allow audit to be disabled
SELinux doesn't actually need audit support. In fact, auditing is not particularly useful on production kernels, where minimal debugging is needed. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit: b10b28b
Commit: c2ad3be
drivers: Increase data processing limit
Signed-off-by: Raphiel Rollerscaperers <raphielscape@outlook.com>
Commit: 3f067e0
drivers: gpu: msm: use funroll-loops and ffast-math
We don't need accurate math in the GPU driver. Signed-off-by: Raphiel Rollerscaperers <raphielscape@outlook.com> Signed-off-by: Pulkit077 <pulkitagarwal2k1@gmail.com>
Commit: 91c7d39
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Change-Id: I574693c868322c6181c72804052932b8e1769324
Commit: 914625e
ARM64: configs: {davinci,toco,tucana,phoenix}: Enable exFAT
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Change-Id: I2b362439afc303be1cec6c76e666091e8aec2613
Commit: a3bc116
drm: sde: add delay before and after fod hbm command for eb panels
WIP Change-Id: Ib8c82877333f7099ba5f4fd390b80b02ff83757b
Commit: aa0ef43
Commit: 5974fa7
Commit: 43f8452
leds: vibrator_ldo: export vmax_mv as vtg_level
Signed-off-by: Alexander <YaAlex@yaalex.tk>
Commit: 8db4030
tcp: avoid min-RTT overestimation from delayed ACKs
This patch avoids having the TCP sender or congestion control overestimate the min RTT by orders of magnitude. This happens when all the samples in the windowed filter are one-packet transfers, like small requests and health-check chit-chat, which is fairly common for applications using persistent connections. This patch conservatively labels and skips RTT samples obtained from this type of workload. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: c3c2eea
tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBR
A persistent connection may send a tiny amount of data (e.g. health-checks) for a long period of time. BBR's windowed min RTT filter may only see RTT samples from delayed ACKs, causing BBR to grossly over-estimate the path delay depending on how much the ACK was delayed at the receiver. This patch skips RTT samples that are likely coming from delayed ACKs. Note that it is possible the sender never obtains a valid measure to set the min RTT. In this case BBR will continue to set cwnd to the initial window, which seems fine because the connection is a thin stream. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com> Change-Id: Ie79dd514069980bf5f2305484bf4fd533c833f1a
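The BBR side of the series gates min-RTT filter refreshes on the new rate-sample flag; paraphrased from bbr_update_min_rtt():

    filter_expired = after(tcp_jiffies32,
                   bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
    if (rs->rtt_us >= 0 &&
        (rs->rtt_us <= bbr->min_rtt_us ||
         (filter_expired && !rs->is_ack_delayed))) {
        bbr->min_rtt_us = rs->rtt_us;
        bbr->min_rtt_stamp = tcp_jiffies32;
    }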
Commit: 9364df5
tcp_bbr: better deal with suboptimal GSO (II)
This is the second part of dealing with suboptimal device gso parameters. In the first patch (350c9f4 "tcp_bbr: better deal with suboptimal GSO") we dealt with devices having low gso_max_segs. Some devices lower gso_max_size from 64KB to 16KB (r8152 is an example). In order to probe an optimal cwnd, we want BBR not to be sensitive to whatever GSO constraint a device can have. This patch removes the tso_segs_goal() CC callback in favor of min_tso_segs() for CCs wanting to override sysctl_tcp_min_tso_segs. The next patch will remove bbr->tso_segs_goal since it does not have to be persistent. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
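The replacement callback, approximately as it landed upstream:

    static u32 bbr_min_tso_segs(struct sock *sk)
    {
        return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
    }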
Commit: 9166c72
tcp_bbr: remove bbr->tso_segs_goal
Its value is computed then immediately used, there is no need to store it. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: a0944bf
net-tcp_bbr: set tp->snd_ssthresh to BDP upon STARTUP exit
Set tp->snd_ssthresh to BDP upon STARTUP exit. This allows us to check if a BBR flow exited STARTUP and the BDP at the time of STARTUP exit with SCM_TIMESTAMPING_OPT_STATS. Since BBR does not use snd_ssthresh this fix has no impact on BBR's behavior. Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
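The change itself is small; paraphrased from bbr_check_drain():

    if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
        bbr->mode = BBR_DRAIN; /* drain the queue we created */
        tcp_sk(sk)->snd_ssthresh =
                bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
    }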
Commit: 4bb4013
tcp_bbr: fix bbr pacing rate for internal pacing
This commit makes BBR use only the MSS (without any headers) to calculate pacing rates when internal TCP-layer pacing is used. This is necessary to achieve the correct pacing behavior in this case, since tcp_internal_pacing() uses only the payload length to calculate pacing delays. Signed-off-by: Kevin Yang <yyd@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: 2e8e052
tcp_bbr: add bbr_check_probe_rtt_done() helper
This patch adds a helper function bbr_check_probe_rtt_done() to: 1) check the condition to see if bbr should exit probe_rtt mode; 2) process the logic of exiting probe_rtt mode. Fixes: 0f8782e ("tcp_bbr: add BBR congestion control") Signed-off-by: Kevin Yang <yyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
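The helper, approximately as it landed upstream:

    static void bbr_check_probe_rtt_done(struct sock *sk)
    {
        struct tcp_sock *tp = tcp_sk(sk);
        struct bbr *bbr = inet_csk_ca(sk);

        if (!(bbr->probe_rtt_done_stamp &&
              after(tcp_jiffies32, bbr->probe_rtt_done_stamp)))
            return;

        bbr->min_rtt_stamp = tcp_jiffies32; /* wait a while until PROBE_RTT */
        tp->snd_cwnd = max(tp->snd_cwnd, bbr->prior_cwnd);
        bbr_reset_mode(sk);
    }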
Commit: 7a0ed42
tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
This patch fixes the case where BBR does not exit PROBE_RTT mode when it restarts from idle. When BBR restarts from idle and is in PROBE_RTT mode, it should check if it's time to exit PROBE_RTT. If yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its full value. Fixes: 0f8782e ("tcp_bbr: add BBR congestion control") Signed-off-by: Kevin Yang <yyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: 4830a12
tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0
This commit fixes a corner case where TCP BBR would enter PROBE_RTT mode but not reduce its cwnd. If a TCP receiver ACKed less than one full segment, the number of delivered/acked packets was 0, so that bbr_set_cwnd() would short-circuit and exit early, without cutting cwnd to the value we want for PROBE_RTT. The fix is to instead make sure that even when 0 full packets are ACKed, we do apply all the appropriate caps, including the cap that applies in PROBE_RTT mode. Fixes: 0f8782e ("tcp_bbr: add BBR congestion control") Signed-off-by: Kevin Yang <yyd@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
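Paraphrased, the fix turns the early return in bbr_set_cwnd() into a jump past the growth logic but still through the caps:

    if (!acked)
        goto done; /* no packet fully ACKed; still apply the caps */
    /* ... normal cwnd growth logic ... */
    done:
        tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
        if (bbr->mode == BBR_PROBE_RTT) /* drain queue, refresh min_rtt */
            tp->snd_cwnd = min(tp->snd_cwnd, bbr_cwnd_min_target);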
Commit: 1e2d6c5
tcp_bbr: centralize code to set gains
Centralize the code that sets gains used for computing cwnd and pacing rate. This simplifies the code and makes it easier to change the state machine or (in the future) dynamically change the gain values and ensure that the correct gain values are always used. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: f02460d
tcp_bbr: refactor bbr_target_cwnd() for general inflight provisioning
Because bbr_target_cwnd() is really a general-purpose BBR helper for computing some volume of inflight data as a function of the estimated BDP, refactor it into following helper functions: - bbr_bdp() - bbr_quantization_budget() - bbr_inflight() Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
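The resulting composition, with signatures paraphrased from the commit message:

    static u32 bbr_inflight(struct sock *sk, u32 bw, int gain)
    {
        u32 inflight;

        inflight = bbr_bdp(sk, bw, gain); /* gain-scaled estimated BDP */
        inflight = bbr_quantization_budget(sk, inflight, gain);

        return inflight;
    }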
Commit: 8a63d42
tcp_bbr: adapt cwnd based on ack aggregation estimation
Aggregation effects are extremely common with wifi, cellular, and cable modem link technologies, ACK decimation in middleboxes, and LRO and GRO in receiving hosts. The aggregation can happen in either direction, data or ACKs, but in either case the aggregation effect is visible to the sender in the ACK stream. Previously BBR's sending was often limited by cwnd under severe ACK aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets were acked in bursts after long delays (e.g. one ACK acking 5*BDP after 5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving the bottleneck idle for potentially long periods. Note that loss-based congestion control does not have this issue because when facing aggregation it continues increasing cwnd after bursts of ACKs, growing cwnd until the buffer is full. To achieve good throughput in the presence of aggregation effects, this algorithm allows the BBR sender to put extra data in flight to keep the bottleneck utilized during silences in the ACK stream that it has evidence to suggest were caused by aggregation. A summary of the algorithm: when a burst of packets are acked by a stretched ACK or a burst of ACKs or both, BBR first estimates the expected amount of data that should have been acked, based on its estimated bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max filter to estimate the recent level of observed ACK aggregation. Then cwnd is increased by the ACK aggregation estimate. The larger cwnd avoids BBR being cwnd-limited in the face of ACK silences that recent history suggests were caused by aggregation. As a sanity check, the ACK aggregation degree is upper-bounded by the cwnd (at the time of measurement) and a global max of BW * 100ms. The algorithm is further described by the following presentation: https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00 In our internal testing, we observed a significant increase in BBR throughput (measured using netperf), in a basic wifi setup. - Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi) - 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps - 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps Also, this code is running globally on YouTube TCP connections and produced significant bandwidth increases for YouTube traffic. This is based on Ian Swett's max_ack_height_ algorithm from the QUIC BBR implementation. Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
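The estimator at the heart of the algorithm, paraphrased from the upstream patch:

    /* Excess data acked beyond what the bw estimate predicted for this
     * epoch; a windowed max of this surplus then pads the cwnd. */
    expected_acked = (u64)bbr_bw(sk) * epoch_us / BW_UNIT;
    if (bbr->ack_epoch_acked > expected_acked) {
        extra_acked = bbr->ack_epoch_acked - expected_acked;
        extra_acked = min(extra_acked, tp->snd_cwnd);
        if (extra_acked > bbr->extra_acked[bbr->extra_acked_win_idx])
            bbr->extra_acked[bbr->extra_acked_win_idx] = extra_acked;
    }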
Commit: 6fe0b5c
FROMGIT: net-tcp_bbr: broaden app-limited rate sample detection
This commit is a bug fix for the Linux TCP app-limited logic used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers, so this commit checks for bubbles of app-limited silence upon ACKs or timers.

Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. While processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd.

Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68 (cherry-picked from google/bbr@80d039a) Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: fcd33b1
BACKPORT: FROMGIT: net-tcp_rate: consolidate inflight tracking approaches in TCP
In order to track CE marks per rate sample (one round trip), we'll need to snap the starting tcp delivered_ce count in the packet meta header (tcp_skb_cb). But there's not enough space. The good news is that the "last_in_flight" field in the header, used by NV congestion control, is almost equivalent to "delivered". In fact "delivered" is better because it additionally accounts for out-of-order packets. Therefore we can remove it to make room for the CE tracking. This makes delayed ACK detection slightly less accurate, but the impact is negligible since it's not used for any critical control.

Effort: net-tcp_rate Origin-9xx-SHA1: ddcd46ec85d5f1c4454258af0c54b3254c0d64a7 Change-Id: I1a184aad6d101c981ac7f2f275aa9417ff856910 (cherry-picked from google/bbr@3de61ba) Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
Commit: 7c0f332
BACKPORT: FROMGIT: net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes
Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits, so they can only hold time intervals up to roughly 2^12 = 4096 seconds. But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations, so this is not introducing a new limit. And these should not be a limitation for the foreseeable future.

Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c (cherry-picked from google/bbr@6cabc5b) Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Signed-off-by: Anirudh Gupta <anirudhgupta109@gmail.com>
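A quick sanity check of the arithmetic (2^32 us ≈ 4295 s ≈ 2^12 s), and of why u32 stamps remain safe: unsigned subtraction still yields the correct delta across wraparound, as this self-contained snippet demonstrates:

<snip>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* u32 microsecond stamps wrap every ~4295 s, but the delta of two
     * stamps less than 2^32 us apart is still exact under wraparound. */
    uint32_t t1 = 4294967000u;           /* just before the wrap */
    uint32_t t2 = t1 + 2000u;            /* wraps to a small value */
    printf("delta = %u us\n", t2 - t1);  /* prints 2000 */
    return 0;
}
<snip>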
Commit: 2468577
ARM64: configs: {davinci,phoenix}: Enable CONFIG_TCP_CONG_BBR and set as default
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com> Change-Id: I875f68bd82a28cdc6ae046f4fc84cd6a0c5eca86
Commit: 6674e35
ARM64: configs: {davinci,phoenix}: Disable IKHEADERS
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Change-Id: Id6ab509daa28533cad3f6cb1c820860e24ffe00f
Commit: 69ded70
scripts: Don't append "+" to localversion
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Commit: 2cb2f5c
Commit: 4be0291
kernel: Use the stock defconfig for /proc/config.gz
Userspace reads /proc/config.gz and spits out an error message after boot finishes when it doesn't like the kernel's configuration. In order to preserve our freedom to customize the kernel however we'd like, show userspace the stock defconfig so that it never complains about our kernel configuration. Signed-off-by: Sultan Alsawaf <sultanxda@gmail.com>
Commit: 151647d
mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA
This patch repeats the original one from David S. Miller: 2dca699 ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA"), but for the missed vmalloc_32_user() case, which also requires correct alignment of the virtual address on the kernel side to avoid D-cache aliases.

A bit of copy-paste from the original patch to recall what this is all about: when a vmalloc'd area is mmap'd into userspace, some kind of co-ordination is necessary for this to work on platforms with cpu D-caches which can have aliases. Otherwise kernel side writes won't be seen properly in userspace and vice versa. If the kernel side mapping and the user side one have the same alignment, modulo SHMLBA, this can work as long as the VMA is shared with VM_SHARED, and for all current users this is true. VM_SHARED will force SHMLBA alignment of the user side mmap on platforms where D-cache aliasing matters.

  -- David S. Miller

> What are the user-visible runtime effects of this change?

In simple words: proper alignment avoids a possible difference in data seen through different virtual mappings, userspace and kernel in our case. I.e. userspace reads cache line A, kernel writes to cache line B. Both cache lines correspond to the same physical memory (thus aliases). So this should fix data corruption for archs with VIVT and VIPT caches, e.g. armv6. Personally I've never worked with these archs; I just spotted the strange difference in code: for one case we do alignment, for the other we do not. I have a strong feeling that David simply missed the vmalloc_32_user() case.

> Is a -stable backport needed?

No, I do not think so. The only user of vmalloc_32_user() is the virtual frame buffer device drivers/video/fbdev/vfb.c, whose description says "The main use of this frame buffer device is testing and debugging the frame buffer subsystem. Do NOT enable it for normal systems!". And it seems to me that vfb.c does not need 32-bit addressable pages (the vmalloc_32_user() case), because it is a virtual device and should not care about things like dma32 zones, etc. It would probably be better to clean up the code, switch vfb.c from vmalloc_32_user() to vmalloc_user(), and wipe vmalloc_32_user() out of vmalloc.c completely. But I'm not sure that is worth doing; it is so minor that we can leave it as is.

Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Michal Hocko <mhocko@suse.com> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
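The aliasing rule can be expressed as a congruence check (an illustrative sketch; the SHMLBA value below is an assumption, since in reality it is arch-specific):

<snip>
#include <stdint.h>
#include <stdio.h>

#define SHMLBA 0x10000UL  /* assumed arch value; must be a power of two */

/* Kernel and user mappings of the same physical memory only play nicely
 * with aliasing VIVT/VIPT D-caches if the two virtual addresses are
 * congruent modulo SHMLBA. */
static int alias_safe(uintptr_t kaddr, uintptr_t uaddr)
{
    return ((kaddr ^ uaddr) & (SHMLBA - 1)) == 0;
}

int main(void)
{
    printf("%d\n", alias_safe(0xffff0000, 0x7f000000)); /* 1: congruent */
    printf("%d\n", alias_safe(0xffff0000, 0x7f008000)); /* 0: may alias */
    return 0;
}
<snip>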
Commit: a581724
mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range()
vmalloc_user*() calls differ from normal vmalloc() only in that they set VM_USERMAP flags for the area. Over the history of vmalloc.c changes it has become possible to simply pass the VM_USERMAP flags directly to the __vmalloc_node_range() call, instead of finding the area (which obviously takes time) after the allocation. Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 217f6a0
vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE
Export the __vmalloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is enabled. Some test cases in the vmalloc test suite module require and make use of that function. Please note that it is not supposed to be used for other purposes; we need it only for performance analysis, stressing and stability checking of the vmalloc allocator. Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: b706018
mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()
Commit 763b218 ("mm: add preempt points into __purge_vmap_area_lazy()") introduced some preempt points, one of which makes an allocation more prioritized over lazy freeing of vmap areas. Prioritizing an allocation over freeing does not work well all the time; it should rather be a compromise.

1) The number of lazy pages directly influences the busy list length, and thus operations like allocation, lookup, unmap, remove, etc.

2) Under heavy stress of the vmalloc subsystem, I ran into a situation where memory usage kept increasing, hitting an out_of_memory -> panic state, due to complete blocking of the logic that frees vmap areas in the __purge_vmap_area_lazy() function.

Establish a threshold beyond which freeing is prioritized back over allocation, creating a balance between the two.

Using the vmalloc test driver in "stress mode", i.e. when all available test cases are run simultaneously on all online CPUs, applying pressure on the vmalloc subsystem, my HiKey 960 board runs out of memory because the __purge_vmap_area_lazy() logic simply is not able to free pages in time.

How I run it:
1) Build your kernel with CONFIG_TEST_VMALLOC=m
2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

During this test, "vmap_lazy_nr" pages go far beyond the acceptable lazy_max_pages() threshold, which leads to an enormous busy list size and other problems, including allocation time and so on.

Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
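The balancing idea reduces to a predicate like this (a sketch of the concept, not the kernel code; the names and the threshold value are stand-ins):

<snip>
#include <stdbool.h>
#include <stdio.h>

static long vmap_lazy_nr;   /* stand-in: lazily-freed pages outstanding */

static long lazy_max_pages(void)
{
    return 32 * 1024;       /* assumed threshold for the sketch */
}

/* The purge loop may yield to a waiting allocator, but only while the
 * lazy backlog is below the threshold; past it, freeing wins so the
 * backlog cannot grow without bound and push the system into OOM. */
static bool purge_should_yield(bool alloc_waiting)
{
    return alloc_waiting && vmap_lazy_nr < lazy_max_pages();
}

int main(void)
{
    vmap_lazy_nr = 100000;                       /* above the threshold */
    printf("%d\n", purge_should_yield(true));    /* 0: keep freeing */
    return 0;
}
<snip>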
Commit: 6ed6757
mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t
The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer on both 32- and 64-bit systems. lazy_max_pages() deals with "unsigned long", which is 8 bytes on 64-bit systems, thus vmap_lazy_nr should be 8 bytes on 64 bit as well. Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Joel Fernandes <joelaf@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 570f067
mm/vmalloc.c: keep track of free blocks for vmap allocation
Patch series "improve vmap allocation", v3.

Objective
---------

Please have a look at the description at https://lkml.org/lkml/2018/10/19/786, but let me also summarize it a bit here. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation times. When I say "long" I mean milliseconds.

Description
-----------

This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done over free-area lookups, instead of finding a hole between two busy blocks. It allows us to have a lower number of objects representing the free space, and therefore a less fragmented memory allocator, because free blocks are always as large as possible.

It uses the augmented tree where all free areas are sorted in ascending order of va->va_start address, paired with a linked list that provides O(1) access to prev/next elements. Since the tree is augmented, we also maintain the "subtree_max_size" of a VA that reflects the maximum available free block in its left or right sub-tree. Knowing that, we can easily traverse toward the lowest (left-most) free area.

Allocation: ~O(log(N)) complexity. It is a sequential allocation method, so it tends to maximize locality. The search is done until the first suitable block is large enough to encompass the requested parameters. Bigger areas are split.

I copy-paste here the description of how an area is split, since I described it in https://lkml.org/lkml/2018/10/19/786:

<snip>
A free block can be split in three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how the requested size and alignment fit the free block.

FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Compared with the current design there is extra work with rb-tree updating.

LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case all we do is cut the free block. It is as fast as the current design. Most of the vmalloc allocations just end up in this case, because the edge is always aligned to 1.

NE_FIT_TYPE - a much less common case. Basically it happens when the requested size and alignment fit neither the left nor the right edge, i.e. the block is between them. In this case, during splitting, we have to build a remaining left free area and place it back into the free list/tree. Compared with the current design there are two extra steps: first, we have to allocate a new vmap_area structure; second, we have to insert that remaining free block into the address-sorted list/tree.

In order to optimize the first case there is a cache of free_vmap objects: instead of allocating from slab we just take an object from the cache and reuse it. The second one is pretty well optimized: since we know a start point in the tree, we do not search from the top; instead, the traversal begins from the rb-tree node we split.
<snip>

De-allocation: ~O(log(N)) complexity. An area is not inserted straight away into the tree/list; instead we identify the spot first, checking if it can be merged with its neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check. Summarizing: if merged, large coalesced areas are created; if not, the area is just linked, making more fragments.

There is one more thing that I should mention here. After modification of a VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. Apart from that, the update can also be propagated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and its description.

Tests and stressing
-------------------

I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it.

Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding the last one, I do not have physical access to a NUMA system, therefore I emulated it. The stressing time is days.

If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1, so please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

After massive testing, I have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good.

Performance analysis
--------------------

I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz, the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel, whereas the HiKey960 uses 4.15. Both systems could run 5.1-rc1 as well, but those results were not ready by the time of writing.

Currently the suite consists of 8 tests. Three of them correspond to the different types of splitting described above (to compare with the default); another 5 do allocations under different conditions.

a) sudo ./test_vmalloc.sh performance

When the test driver is run in "performance" mode, it runs all available tests pinned to the first online CPU with sequential test execution order. We do it in order to get stable and repeatable results. Take a look at the time difference in "long_busy_list_alloc_test"; it is not surprising, because the worst case is O(N).

How many cycles all tests took:
CPU0=646919905370 (default) cycles vs CPU0=193290498550 (patched) cycles
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

How many cycles all tests took:
CPU0=3478683207 cycles vs CPU0=463767978 cycles
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

b) time sudo ./test_vmalloc.sh test_repeat_count=1

With this configuration, all tests are run on all available online CPUs. Before running, each CPU shuffles its test execution order. This gives random allocation behaviour, so it is a rough comparison, but it certainly puts things in the picture.

<default>           vs  <patched>
real 101m22.813s        real 0m56.805s
user 0m0.011s           user 0m0.015s
sys  0m5.076s           sys  0m0.023s
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

<default>           vs  <patched>
real unknown            real 4m25.214s
user unknown            user 0m0.011s
sys  unknown            sys  0m0.670s

I did not manage to complete this test on the "default HiKey960" kernel version. After 24 hours it was still running, therefore I had to cancel it. That is why real/user/sys are "unknown".

This patch (of 3):

Currently an allocation of a new vmap area is done over busy-list iteration (complexity O(n)) until a suitable hole is found between two busy areas. Each new allocation therefore causes the list to grow. Due to an over-fragmented list and different permissive parameters, an allocation can take a long time. For example on embedded devices it is milliseconds.

This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augmented red-black tree that keeps blocks sorted by their offsets, paired with a linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in their left or right sub-tree. That allows us to make a decision and traverse toward the block that will fit and will have the lowest start address, i.e. it is sequential allocation.

Allocation: to allocate a new block, a search is done over the tree until the suitable lowest (left-most) block is large enough to encompass the requested size, alignment and vstart point. If the block is bigger than the requested size, it is split.

De-allocation: when a busy vmap area is freed, it can either be merged or inserted into the tree. The red-black tree allows efficiently finding a spot, whereas the linked list provides constant-time access to the previous and next blocks to check if merging can be done. In case of merging de-allocated memory chunks, a large coalesced area is created.

Complexity: ~O(log(N))

[urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
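The heart of the new allocator is the augmented search. A compact sketch of the idea (plain binary tree, no balancing or companion list, unlike the kernel's rb-tree version):

<snip>
#include <stddef.h>

struct free_area {
    unsigned long va_start, size;
    unsigned long subtree_max_size;  /* largest free block below here */
    struct free_area *left, *right;  /* sorted by va_start */
};

static unsigned long max_size(const struct free_area *n)
{
    return n ? n->subtree_max_size : 0;
}

/* Lowest-address free area of at least 'size' bytes: descend left
 * whenever the left subtree can satisfy the request, else try this node,
 * else go right -- O(height) instead of an O(n) busy-list walk. */
static struct free_area *find_lowest_fit(struct free_area *n,
                                         unsigned long size)
{
    while (n) {
        if (max_size(n->left) >= size)
            n = n->left;
        else if (n->size >= size)
            return n;
        else
            n = n->right;
    }
    return NULL;
}

int main(void)
{
    struct free_area lo = { 0x1000, 0x2000, 0x2000, NULL, NULL };
    struct free_area root = { 0x8000, 0x1000, 0x2000, &lo, NULL };
    return find_lowest_fit(&root, 0x1800) == &lo ? 0 : 1;
}
<snip>

Insert and remove must then refresh subtree_max_size up the spine, which is what __augment_tree_propagate_from() does in the real code.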
Commit: fb808a6
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
This macro adds some debug code to check that the augmented tree is maintained correctly, meaning that every node contains a valid subtree_max_size value. By default this option is set to 0 and not active; set it to 1 and recompile the kernel to activate it. [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 6a7e2f4
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
This macro adds some debug code to check that vmap allocations happen in ascending order. By default this option is set to 0 and not active; set it to 1 and recompile the kernel to activate it. [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 9e3d7c5
mm/vmalloc.c: fix typo in comment
Reported-by: Nicholas Joll <najoll@posteo.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: e669461
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
gcc gets confused in pcpu_get_vm_areas() because there are too many branches that affect whether 'lva' was initialized before it gets used:

mm/vmalloc.c: In function 'pcpu_get_vm_areas':
mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    insert_vmap_area_augment(lva, &va->rb_node,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     &free_vmap_area_root, &free_vmap_area_list);
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/vmalloc.c:916:20: note: 'lva' was declared here
  struct vmap_area *lva;
                    ^~~

Add an initialization to NULL, and check whether this has changed before the first use.

[akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de Fixes: 68ad4a3 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 90cb045
mm/vmalloc.c: remove "node" argument
Patch series "Some cleanups for the KVA/vmalloc", v5.

This patch (of 4):

Remove the unused argument from the __alloc_vmap_area() function.

Link: http://lkml.kernel.org/r/20190606120411.8298-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 60470c5
mm/vmalloc.c: preload a CPU with one object for split purpose
Refactor the NE_FIT_TYPE split case when it comes to an allocation of one extra object. We need it in order to build a remaining space. The preload is done per CPU in non-atomic context with GFP_KERNEL flags. More permissive parameters can be beneficial for systems which suffer from high memory pressure or low memory conditions.

For example, on my KVM system (4 CPUs, no swap, 256MB RAM) I can simulate a page allocation failure with GFP_NOWAIT flags. Using the "stress-ng" tool and starting N workers spinning on fork() and exit(), I can trigger the trace below:

<snip>
[  179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[  179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003
[  179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  179.815171] Call Trace:
[  179.815178]  dump_stack+0x5c/0x7b
[  179.815182]  warn_alloc+0x108/0x190
[  179.815187]  __alloc_pages_slowpath+0xdc7/0xdf0
[  179.815191]  __alloc_pages_nodemask+0x2de/0x330
[  179.815194]  cache_grow_begin+0x77/0x420
[  179.815197]  fallback_alloc+0x161/0x200
[  179.815200]  kmem_cache_alloc+0x1c9/0x570
[  179.815202]  alloc_vmap_area+0x32c/0x990
[  179.815206]  __get_vm_area_node+0xb0/0x170
[  179.815208]  __vmalloc_node_range+0x6d/0x230
[  179.815211]  ? _do_fork+0xce/0x3d0
[  179.815213]  copy_process.part.46+0x850/0x1b90
[  179.815215]  ? _do_fork+0xce/0x3d0
[  179.815219]  _do_fork+0xce/0x3d0
[  179.815226]  ? __do_page_fault+0x2bf/0x4e0
[  179.815229]  do_syscall_64+0x55/0x130
[  179.815231]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  179.815234] RIP: 0033:0x7fedec4c738b
...
[  179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b
[  179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0
[  179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000
[  179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  179.815245] Mem-Info:
[  179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0
 active_file:502 inactive_file:61 isolated_file:70
 unevictable:2 dirty:0 writeback:0 unstable:0
 slab_reclaimable:2380 slab_unreclaimable:7520
 mapped:15069 shmem:14813 pagetables:10833 bounce:0
 free:1922 free_pcp:229 free_cma:0
<snip>

Link: http://lkml.kernel.org/r/20190606120411.8298-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Roman Gushchin <guro@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 92a399e
mm/vmalloc.c: get rid of one single unlink_va() when merge
It does not make sense to try to "unlink" a node that is definitely not linked into a list or tree. On the first merge step, the VA just points to the previously disconnected busy area. On the second step, check if the node has been merged and do "unlink" if so, because now it points to an object that must be linked. Link: http://lkml.kernel.org/r/20190606120411.8298-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hillf Danton <hdanton@sina.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 7971c35
mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
Trigger a warning if an object that is about to be freed is already detached. We used to have a BUG_ON() here, but even though this is considered faulty behaviour, it is not a good reason to bring down the system. Link: http://lkml.kernel.org/r/20190606120411.8298-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 4db408d
arm64: Select ARCH_HAS_FAST_MULTIPLIER
It is probably safe to assume that all Armv8-A implementations have a multiplier whose efficiency is comparable or better than a sequence of three or so register-dependent arithmetic instructions. Select ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the few dusty old corners which care. In a contrived benchmark calling hweight64() in a loop, this does indeed turn out to be a small win overall, with no measurable impact on Cortex-A57 but about 5% performance improvement on Cortex-A53. Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 9c00a41
procfs: add seq_put_hex_ll to speed up /proc/pid/maps
seq_put_hex_ll() prints a number in hexadecimal notation and works faster than seq_printf().

== test.py
num = 0
with open("/proc/1/maps") as f:
    while num < 10000 :
        data = f.read()
        f.seek(0, 0)
        num = num + 1
==

== Before patch ==
$ time python test.py
real 0m1.561s
user 0m0.257s
sys  0m1.302s

== After patch ==
$ time python test.py
real 0m0.986s
user 0m0.279s
sys  0m0.707s

$ perf -g record python test.py:

== Before patch ==
-   67.42%  2.82%  python  [kernel.kallsyms]  [k] show_map_vma.isra.22
   - 64.60% show_map_vma.isra.22
      - 44.98% seq_printf
         - seq_vprintf
            - vsnprintf
               + 14.85% number
               + 12.22% format_decode
                 5.56% memcpy_erms
      + 15.06% seq_path
      + 4.42% seq_pad
   + 2.45% __GI___libc_read

== After patch ==
-   47.35%  3.38%  python  [kernel.kallsyms]  [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
           10.55% seq_put_hex_ll
         + 2.65% seq_put_decimal_ull
           0.95% seq_putc
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

[avagin@openvz.org: use unsigned int instead of int where it is suitable] Link: http://lkml.kernel.org/r/20180214025619.4005-1-avagin@openvz.org [avagin@openvz.org: v2] Link: http://lkml.kernel.org/r/20180117082050.25406-1-avagin@openvz.org Link: http://lkml.kernel.org/r/20180112185812.7710-1-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
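The win comes from emitting hex digits straight into the output buffer instead of round-tripping through vsnprintf() format parsing. Schematically (a user-space sketch, not the kernel function):

<snip>
#include <stdio.h>

/* Write 'v' in hex into buf (no format-string parsing), return length.
 * Digits are produced low-to-high into a small scratch, then reversed. */
static int put_hex_ll(char *buf, unsigned long long v)
{
    static const char hex[] = "0123456789abcdef";
    char tmp[16];
    int i = 0, n = 0;

    do {
        tmp[i++] = hex[v & 0xf];
        v >>= 4;
    } while (v);
    while (i)
        buf[n++] = tmp[--i];
    buf[n] = '\0';
    return n;
}

int main(void)
{
    char buf[17];
    put_hex_ll(buf, 0x7f5427ecba00ull);
    printf("%s\n", buf);    /* 7f5427ecba00 */
    return 0;
}
<snip>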
Commit: 09801c8
procfs: optimize seq_pad() to speed up /proc/pid/maps
seq_printf() is slow and it can be replaced by memset() in this case.

== test.py
num = 0
with open("/proc/1/maps") as f:
    while num < 10000 :
        data = f.read()
        f.seek(0, 0)
        num = num + 1
==

== Before patch ==
$ time python test.py
real 0m0.986s
user 0m0.279s
sys  0m0.707s

== After patch ==
$ time python test.py
real 0m0.932s
user 0m0.261s
sys  0m0.669s

$ perf record -g python test.py

== Before patch ==
-   47.35%  3.38%  python  [kernel.kallsyms]  [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

== After patch ==
-   44.01%  0.34%  python  [kernel.kallsyms]  [k] show_pid_map
   - 43.67% show_pid_map
      - 42.91% show_map_vma.isra.23
         + 21.55% seq_path
         - 15.68% show_vma_header_prefix
         + 2.08% seq_pad
           0.55% seq_putc

Link: http://lkml.kernel.org/r/20180112185812.7710-2-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
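The replacement amounts to padding with one memset() rather than having seq_printf() parse a "%*s"-style format. A trivial sketch:

<snip>
#include <string.h>
#include <stdio.h>

/* Pad 'buf' (already holding 'len' chars) out to 'width' with spaces. */
static int pad_to_width(char *buf, int len, int width)
{
    if (len < width) {
        memset(buf + len, ' ', width - len);
        len = width;
    }
    buf[len] = '\0';
    return len;
}

int main(void)
{
    char buf[32] = "7f00";
    printf("[%s]\n", pad_to_width(buf, 4, 12));
    return 0;
}
<snip>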
Commit: 2a7c904
proc: get rid of task lock/unlock pair to read umask for the "status" file
get_task_umask() locks/unlocks the task on its own, and the only caller does the same thing immediately after. Utilize the fact that the task has to be locked anyway and just do it once. Since there are no other users and the code is short, fold it in. Link: http://lkml.kernel.org/r/1517995608-23683-1-git-send-email-mguzik@redhat.com Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 4b19899
proc: do less stuff under ->pde_unload_lock
Commit ca469f3 ("deal with races between remove_proc_entry() and proc_reg_release()") moved too much stuff under ->pde_unload_lock, making the problem described in the series "[PATCH v5] procfs: Improve Scaling in proc" worse. While RCU is being figured out, move kfree() out of ->pde_unload_lock.

On my potato, the difference is only a 0.5% speedup with concurrent open+read+close of /proc/cmdline, but the effect should be more noticeable on more capable machines.

$ perf stat -r 16 -- ./proc-j 16

 Performance counter stats for './proc-j 16' (16 runs):

     130569.502377      task-clock (msec)    # 15.872 CPUs utilized    ( +- 0.05% )
            19,169      context-switches     #  0.147 K/sec            ( +- 0.18% )
                15      cpu-migrations       #  0.000 K/sec            ( +- 3.27% )
               437      page-faults          #  0.003 K/sec            ( +- 1.25% )
   300,172,097,675      cycles               #  2.299 GHz              ( +- 0.05% )
    96,793,267,308      instructions         #  0.32 insn per cycle    ( +- 0.04% )
    22,798,342,298      branches             # 174.607 M/sec           ( +- 0.04% )
       111,764,687      branch-misses        #  0.49% of all branches  ( +- 0.47% )

       8.226574400 seconds time elapsed                                ( +- 0.05% )
       ^^^^^^^^^^^

$ perf stat -r 16 -- ./proc-j 16

 Performance counter stats for './proc-j 16' (16 runs):

     129866.777392      task-clock (msec)    # 15.869 CPUs utilized    ( +- 0.04% )
            19,154      context-switches     #  0.147 K/sec            ( +- 0.66% )
                14      cpu-migrations       #  0.000 K/sec            ( +- 1.73% )
               431      page-faults          #  0.003 K/sec            ( +- 1.09% )
   298,556,520,546      cycles               #  2.299 GHz              ( +- 0.04% )
    96,525,366,833      instructions         #  0.32 insn per cycle    ( +- 0.04% )
    22,730,194,043      branches             # 175.027 M/sec           ( +- 0.04% )
       111,506,074      branch-misses        #  0.49% of all branches  ( +- 0.18% )

       8.183629778 seconds time elapsed                                ( +- 0.04% )
       ^^^^^^^^^^^

Link: http://lkml.kernel.org/r/20180213132911.GA24298@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
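The change follows the classic unlink-under-the-lock, free-after-unlock pattern. A user-space pthread sketch with a simplified stand-in for struct pde_opener:

<snip>
#include <pthread.h>
#include <stdlib.h>

struct pde_opener {                /* simplified stand-in */
    struct pde_opener *next;
};

static pthread_mutex_t pde_unload_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pde_opener *openers;

static void close_pdeo(struct pde_opener *pdeo)
{
    struct pde_opener **p;

    pthread_mutex_lock(&pde_unload_lock);
    for (p = &openers; *p; p = &(*p)->next) {
        if (*p == pdeo) {
            *p = pdeo->next;       /* unlink under the lock */
            break;
        }
    }
    pthread_mutex_unlock(&pde_unload_lock);
    free(pdeo);                    /* free outside the lock: shorter
                                    * critical section for everyone else */
}

int main(void)
{
    struct pde_opener *o = calloc(1, sizeof(*o));
    openers = o;
    close_pdeo(o);
    return 0;
}
<snip>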
Commit: 982fcb2
proc: move /proc/sysvipc creation to where it belongs
Move the proc_mkdir() call within the sysvipc subsystem such that we avoid polluting proc_root_init() with petty cpp. [dave@stgolabs.net: contributed changelog] Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: be0405e
proc: faster open/close of files without ->release hook
The whole point of the code in fs/proc/inode.c is to make sure the ->release hook is called either at close() or at rmmod time. All of it is unnecessary if there is no ->release hook, so save the allocation and list manipulations under the spinlock in that case. Link: http://lkml.kernel.org/r/20180214063033.GA15579@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 956fed7
/proc/self inode numbers, value of proc_inode_cache and st_nlink of /proc/$TGID are fixed constants. Link: http://lkml.kernel.org/r/20180103184707.GA31849@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 33e5ce3
proc: move "struct pde_opener" to kmem cache
"struct pde_opener" is fixed size and we can have more granular approach to debugging. For those who don't know, per cache SLUB poisoning and red zoning don't work if there is at least one object allocated which is hopeless in case of kmalloc-64 but not in case of standalone cache. Although systemd opens 2 files from the get go, so it is hopeless after all. Link: http://lkml.kernel.org/r/20180214082306.GB17157@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 86aa3c6
proc: account "struct pde_opener"
The allocation is in fact persistent, as any fool can open a file in /proc and sit on it. Link: http://lkml.kernel.org/r/20180214082409.GC17157@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: dd3d671
On a machine with 5-level paging support, a process can allocate a significant amount of memory and stay unnoticed by the oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables. We don't account PUD page tables, only PMD and PTE.

We already addressed the same issue for PMD page tables, see commit dc6c9a3 ("mm: account pmd page tables to the process"). Introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to the PUD level.

[kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 609d019
mm: introduce wrappers to access mm->nr_ptes
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd and nr_pud. The patch also makes nr_ptes accounting dependent on CONFIG_MMU: page table accounting doesn't make sense if you don't have page tables. This is preparation for consolidation of the page-table counters in mm_struct. Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 6ec3b32
mm: consolidate page table accounting
Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of the total memory allocated to page tables for the oom_badness calculation. We also provide the information to userspace, but it has dubious value there too.

This patch switches page table accounting to a single counter. mm->pgtables_bytes is now used to account all page table levels. We use bytes because the page table size may differ between levels of the page table tree.

The change has a user-visible effect: we no longer report VmPMD and VmPUD in /proc/[pid]/status. Not sure if anybody uses them. (As an alternative, we could always report 0 kB for them.) The OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds.

Apart from reducing the number of counters per-mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different sizes of page tables depending on the level, or where page tables are less than a page in size. The only downside is debuggability, because we no longer know which page table level could leak. But I do not remember many bugs that would be caught by separate counters, so I wouldn't lose sleep over this.

[akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
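In shape, the consolidated accounting is just wrappers over one atomic byte counter. A sketch with a stand-in mm struct (PTE_TABLE_SIZE is an assumed constant; real table sizes vary by arch and level, which is exactly why bytes are counted):

<snip>
#include <stdatomic.h>
#include <stdio.h>

#define PTE_TABLE_SIZE 4096L   /* assumed; differs by arch and level */

struct mm_lite {                    /* stand-in for mm_struct */
    atomic_long pgtables_bytes;     /* one counter for all levels */
};

/* Counting bytes (not tables) lets every level, including sub-page
 * tables, feed the same counter uniformly. */
static void mm_inc_nr_ptes(struct mm_lite *mm)
{
    atomic_fetch_add(&mm->pgtables_bytes, PTE_TABLE_SIZE);
}

static void mm_dec_nr_ptes(struct mm_lite *mm)
{
    atomic_fetch_sub(&mm->pgtables_bytes, PTE_TABLE_SIZE);
}

static long mm_pgtables_bytes(struct mm_lite *mm)
{
    return atomic_load(&mm->pgtables_bytes);  /* feeds oom_badness() */
}

int main(void)
{
    struct mm_lite mm = { 0 };

    mm_inc_nr_ptes(&mm);
    mm_inc_nr_ptes(&mm);
    mm_dec_nr_ptes(&mm);
    printf("pgtables_bytes = %ld\n", mm_pgtables_bytes(&mm));
    return 0;
}
<snip>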
Commit: 8d1211b
fs/proc/task_mmu.c: do not show VmExe bigger than total executable virtual memory
If the start_code / end_code pointers are screwed, then "VmExe" can be bigger than the total executable virtual memory and "VmLib" becomes negative:

VmExe:  294320 kB
VmLib:  18446744073709327564 kB

VmExe and VmLib are documented as text segment and shared library code size. Now their sum will always be equal to mm->exec_vm, which sums the sizes of areas that are executable, not writable, and not stack.

I've seen this for a huge (>2Gb) statically linked binary which has the whole world inside. For it, the start_code .. end_code range also covers one of the rodata sections. Probably this is a bug in a customized linker, the elf loader, or both. Anyway, CONFIG_CHECKPOINT_RESTORE allows changing these pointers, thus we cannot trust them without validation.

Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 4dade88
proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a specified minimal field width. It is the equivalent of seq_printf(m, "%s%*d", str, width, val), but it works much faster.

== test_smaps.py
num = 0
with open("/proc/1/smaps") as f:
    for x in xrange(10000):
        data = f.read()
        f.seek(0, 0)
==

== Before patch ==
$ time python test_smaps.py
real 0m4.593s
user 0m0.398s
sys  0m4.158s

== After patch ==
$ time python test_smaps.py
real 0m3.828s
user 0m0.413s
sys  0m3.408s

$ perf -g record python test_smaps.py

== Before patch ==
-   79.01%  3.36%  python  [kernel.kallsyms]  [k] show_smap.isra.33
   - 75.65% show_smap.isra.33
      + 48.85% seq_printf
      + 15.75% __walk_page_range
      + 9.70% show_map_vma.isra.23
        0.61% seq_puts

== After patch ==
-   75.51%  4.62%  python  [kernel.kallsyms]  [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_w
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

[akpm@linux-foundation.org: fix drivers/of/unittest.c build] Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 12ba19c
proc: replace seq_printf with seq_putc to speed up /proc/pid/smaps
seq_putc() works much faster than seq_printf().

== Before patch ==
$ time python test_smaps.py
real 0m3.828s
user 0m0.413s
sys  0m3.408s

== After patch ==
$ time python test_smaps.py
real 0m3.405s
user 0m0.401s
sys  0m3.003s

== Before patch ==
-   75.51%  4.62%  python  [kernel.kallsyms]  [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_aligned
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

== After patch ==
-   69.16%  5.70%  python  [kernel.kallsyms]  [k] show_smap.isra.33
   - 63.46% show_smap.isra.33
      + 25.98% seq_put_decimal_ull_aligned
      + 20.90% __walk_page_range
      + 12.60% show_map_vma.isra.23
        1.56% seq_putc
      + 1.55% seq_puts

Link: http://lkml.kernel.org/r/20180212074931.7227-2-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: a12de6b
proc: optimize single-symbol delimiters to speed up seq_put_decimal_ull
A delimiter is a string which is printed before a number. A single-symbol delimiter can be printed by seq_putc(), and this works faster than printing via seq_puts().

== test_proc.c
int main(int argc, char **argv)
{
    int n, i, fd;
    char buf[16384];

    n = atoi(argv[1]);
    for (i = 0; i < n; i++) {
        fd = open(argv[2], O_RDONLY);
        if (fd < 0)
            return 1;
        if (read(fd, buf, sizeof(buf)) <= 0)
            return 1;
        close(fd);
    }
    return 0;
}
==

$ time ./test_proc 1000000 /proc/1/stat

== Before patch ==
real 0m3.820s
user 0m0.337s
sys  0m3.394s

== After patch ==
real 0m3.110s
user 0m0.324s
sys  0m2.700s

Link: http://lkml.kernel.org/r/20180212074931.7227-3-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 7d4cedb
proc: replace seq_printf by seq_put_smth to speed up /proc/pid/status
seq_printf() works slower than seq_puts(), seq_putc(), etc.

== test_proc.c
int main(int argc, char **argv)
{
    int n, i, fd;
    char buf[16384];

    n = atoi(argv[1]);
    for (i = 0; i < n; i++) {
        fd = open(argv[2], O_RDONLY);
        if (fd < 0)
            return 1;
        if (read(fd, buf, sizeof(buf)) <= 0)
            return 1;
        close(fd);
    }
    return 0;
}
==

$ time ./test_proc 1000000 /proc/1/status

== Before patch ==
real 0m5.171s
user 0m0.328s
sys  0m4.783s

== After patch ==
real 0m4.761s
user 0m0.334s
sys  0m4.366s

Link: http://lkml.kernel.org/r/20180212074931.7227-4-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 184cdfb
proc: check permissions earlier for /proc/*/wchan
get_wchan() accesses stack page before permissions are checked, let's not play this game. Link: http://lkml.kernel.org/r/20180217071923.GA16074@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 9ff32ea
proc: use seq_puts() at /proc/*/wchan
Link: http://lkml.kernel.org/r/20180217072011.GB16074@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: 27c6495
fs/proc/proc_sysctl.c: remove redundant link check in proc_sys_link_fill_cache()
proc_sys_link_fill_cache() does not need to check whether we're called for a link - that's already done by scan(). Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit a809667
proc: fix /proc/*/map_files lookup some more
I totally forgot that _parse_integer() accepts an arbitrary amount of leading zeroes, leading to the following lookups:

OK
# readlink /proc/1/map_files/56427ecba000-56427eddc000
/lib/systemd/systemd

bogus
# readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000
/lib/systemd/systemd
# readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000
/lib/systemd/systemd

Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
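A hedged sketch of the stricter parsing this implies, written as a stand-alone userspace program (parse_map_files_name() is a hypothetical helper; the in-kernel check lives in the map_files lookup path):

== strict_parse.c (hypothetical) ==
#include <stdio.h>
#include <stdlib.h>

/* Parse "start-end" in hex, rejecting leading zeroes so that only
 * one string maps to a given VMA ("0" on its own stays valid).
 */
static int parse_map_files_name(const char *s,
				unsigned long long *start,
				unsigned long long *end)
{
	char *p;

	if (s[0] == '0' && s[1] != '-')
		return -1;		/* leading zero in start */
	*start = strtoull(s, &p, 16);
	if (*p != '-')
		return -1;
	p++;
	if (p[0] == '0' && p[1] != '\0')
		return -1;		/* leading zero in end */
	*end = strtoull(p, &p, 16);
	return *p == '\0' ? 0 : -1;
}

int main(void)
{
	unsigned long long s, e;

	/* prints 0 (accepted), then -1 (leading zeroes rejected) */
	printf("%d\n", parse_map_files_name("56427ecba000-56427eddc000", &s, &e));
	printf("%d\n", parse_map_files_name("00000000000056427ecba000-56427eddc000", &s, &e));
	return 0;
}
==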
Commit b3d22ec
proc: register filesystem last
As soon as register_filesystem() exits, filesystem can be mounted. It is better to present fully operational /proc. Of course it doesn't matter because /proc is not modular but do it anyway. Drop error check, it should be handled by panicking. Link: http://lkml.kernel.org/r/20180309222709.GA3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit f521a78
Use seq_puts() and skip format string processing. Link: http://lkml.kernel.org/r/20180309222948.GB3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit 2e56c96
proc: do mmput ASAP for /proc/*/map_files
mm_struct is not needed while printing as all the data was already extracted. Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit a8656b1
proc: reject "." and ".." as filenames
Various subsystems can create files and directories in /proc with names directly controlled by userspace. Which means "/", "." and ".." are no-no. "/" split is already taken care of, do the other 2 prohibited names. Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
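A minimal sketch of the added validation (hypothetical helper name; the real check sits in the /proc entry creation path next to the existing "/" split handling):

== name check (hypothetical) ==
#include <string.h>

/* "/" is rejected by the existing name-splitting logic; "." and ".."
 * would alias the current and parent directories, so reject them too.
 */
static int proc_name_is_valid(const char *name)
{
	if (strchr(name, '/'))
		return 0;
	if (!strcmp(name, ".") || !strcmp(name, ".."))
		return 0;
	return 1;
}
==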
Commit 8c3af76
mm, page_alloc: double zone's batchsize
To improve page allocator's performance for order-0 pages, each CPU has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed, the PCP will be checked first before asking for pages from Buddy. When the PCP is used up, a batch of pages will be fetched from Buddy to improve performance, and the size of that batch can affect performance.

The zone's batch size was last doubled by commit ba56e91 ("mm: page_alloc: increase size of per-cpu-pages") over ten years ago. Since then, CPUs have evolved a lot and their cache sizes have also increased.

Dave Hansen is concerned the current batch size doesn't fit well with modern hardware and suggested doing two things: first, use a page allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find out how performance changes with different batch sizes on various machines and then choose a new default batch size; second, see how this new batch size works with other workloads.

In the first test, we saw performance gains on high-core-count systems and little to no effect on older systems with more modest core counts. From this phase's test data, two candidates were chosen: 63 and 127.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability and more will-it-scale sub-tests were run to see how these two candidates work with these workloads, and a new default was decided according to their results.

Most test results are flat. will-it-scale/page_fault2 process mode has a 10%-18% performance increase on 4-socket Skylake and Broadwell. vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for 4-socket servers, while for 2-socket servers it caused a 3%-8% performance drop. Further analysis showed that, with a larger pcp->batch and thus a larger pcp->high (the relationship pcp->high = 6 * pcp->batch is maintained in this patch), zone lock contention shifted to LRU add side lock contention and that caused the performance drop. This performance drop might be mitigated by others' work on optimizing the LRU lock.

Another downside of increasing pcp->batch is that when the PCP is used up and a batch of pages needs to be fetched from Buddy, that refill can take longer than before since the batch is larger. My understanding is this doesn't affect the slowpath, where direct reclaim and compaction dominate. For the fastpath, throughput is a win (according to will-it-scale/page_fault1) but worst-case latency can be larger now.

Overall, I think doubling the batch size from 31 to 63 is relatively safe and provides a good performance boost for high-core-count systems.

The two phases' test results are listed below (all tests were done with THP disabled).

Phase one (will-it-scale/page_fault1) test results:

Skylake-EX: the increased batch size has a good effect on zone->lock contention, though LRU contention will rise at the same time, which limits the final performance increase.

batch  score     change   zone_contention  lru_contention  total_contention
31     15345900  +0.00%   64%              8%              72%
53     17903847  +16.67%  32%              38%             70%
63     17992886  +17.25%  24%              45%             69%
73     18022825  +17.44%  10%              61%             71%
119    18023401  +17.45%  4%               66%             70%
127    18029012  +17.48%  3%               66%             69%
137    18036075  +17.53%  4%               66%             70%
165    18035964  +17.53%  2%               67%             69%
188    18101105  +17.95%  2%               67%             69%
223    18130951  +18.15%  2%               67%             69%
255    18118898  +18.07%  2%               67%             69%
267    18101559  +17.96%  2%               67%             69%
299    18160468  +18.34%  2%               68%             70%
320    18139845  +18.21%  2%               67%             69%
393    18160869  +18.34%  2%               68%             70%
424    18170999  +18.41%  2%               68%             70%
458    18144868  +18.24%  2%               68%             70%
467    18142366  +18.22%  2%               68%             70%
498    18154549  +18.30%  1%               68%             69%
511    18134525  +18.17%  1%               69%             70%

Broadwell-EX: similar pattern as Skylake-EX.

batch  score     change   zone_contention  lru_contention  total_contention
31     16703983  +0.00%   67%              7%              74%
53     18195393  +8.93%   43%              28%             71%
63     18288885  +9.49%   38%              33%             71%
73     18344329  +9.82%   35%              37%             72%
119    18535529  +10.96%  24%              46%             70%
127    18513596  +10.83%  23%              48%             71%
137    18514327  +10.84%  23%              48%             71%
165    18511840  +10.82%  22%              49%             71%
188    18593478  +11.31%  17%              53%             70%
223    18601667  +11.36%  17%              52%             69%
255    18774825  +12.40%  12%              58%             70%
267    18754781  +12.28%  9%               60%             69%
299    18892265  +13.10%  7%               63%             70%
320    18873812  +12.99%  8%               62%             70%
393    18891174  +13.09%  6%               64%             70%
424    18975108  +13.60%  6%               64%             70%
458    18932364  +13.34%  8%               62%             70%
467    18960891  +13.51%  5%               65%             70%
498    18944526  +13.41%  5%               64%             69%
511    18960839  +13.51%  5%               64%             69%

Skylake-EP: although the increased batch reduced zone->lock contention, the effect is not as good as on EX: zone->lock contention is still as high as 20% with a very high batch value, instead of 1% on Skylake-EX or 5% on Broadwell-EX. Also, total_contention actually decreased with a higher batch, but that doesn't translate into a performance increase.

batch  score     change  zone_contention  lru_contention  total_contention
31     9554867   +0.00%  66%              3%              69%
53     9855486   +3.15%  63%              3%              66%
63     9980145   +4.45%  62%              4%              66%
73     10092774  +5.63%  62%              5%              67%
119    10310061  +7.90%  45%              19%             64%
127    10342019  +8.24%  42%              19%             61%
137    10358182  +8.41%  42%              21%             63%
165    10397060  +8.81%  37%              24%             61%
188    10341808  +8.24%  34%              26%             60%
223    10349135  +8.31%  31%              27%             58%
255    10327189  +8.08%  28%              29%             57%
267    10344204  +8.26%  27%              29%             56%
299    10325043  +8.06%  25%              30%             55%
320    10310325  +7.91%  25%              31%             56%
393    10293274  +7.73%  21%              31%             52%
424    10311099  +7.91%  21%              32%             53%
458    10321375  +8.02%  21%              32%             53%
467    10303881  +7.84%  21%              32%             53%
498    10332462  +8.14%  20%              33%             53%
511    10325016  +8.06%  20%              32%             52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure performance doesn't increase and they successfully managed to keep total contention at 70%.

batch  score     change  zone_contention  lru_contention  total_contention
31     10121178  +0.00%  19%              50%             69%
53     10142366  +0.21%  6%               63%             69%
63     10117984  -0.03%  11%              58%             69%
73     10123330  +0.02%  7%               63%             70%
119    10108791  -0.12%  2%               67%             69%
127    10166074  +0.44%  3%               66%             69%
137    10141574  +0.20%  3%               66%             69%
165    10154499  +0.33%  2%               68%             70%
188    10124921  +0.04%  2%               67%             69%
223    10137399  +0.16%  2%               67%             69%
255    10143289  +0.22%  0%               68%             68%
267    10123535  +0.02%  1%               68%             69%
299    10140952  +0.20%  0%               68%             68%
320    10163170  +0.41%  0%               68%             68%
393    10000633  -1.19%  0%               69%             69%
424    10087998  -0.33%  0%               69%             69%
458    10187116  +0.65%  0%               69%             69%
467    10146790  +0.25%  0%               69%             69%
498    10197958  +0.76%  0%               69%             69%
511    10152326  +0.31%  0%               69%             69%

Haswell-EP: similar to Broadwell-EP.

batch  score     change  zone_contention  lru_contention  total_contention
31     10442205  +0.00%  14%              48%             62%
53     10442255  +0.00%  5%               57%             62%
63     10452059  +0.09%  6%               57%             63%
73     10482349  +0.38%  5%               59%             64%
119    10454644  +0.12%  3%               60%             63%
127    10431514  -0.10%  3%               59%             62%
137    10423785  -0.18%  3%               60%             63%
165    10481216  +0.37%  2%               61%             63%
188    10448755  +0.06%  2%               61%             63%
223    10467144  +0.24%  2%               61%             63%
255    10480215  +0.36%  2%               61%             63%
267    10484279  +0.40%  2%               61%             63%
299    10466450  +0.23%  2%               61%             63%
320    10452578  +0.10%  2%               61%             63%
393    10499678  +0.55%  1%               62%             63%
424    10481454  +0.38%  1%               62%             63%
458    10473562  +0.30%  1%               62%             63%
467    10484269  +0.40%  0%               62%             62%
498    10505599  +0.61%  0%               62%             62%
511    10483395  +0.39%  0%               62%             62%

Westmere-EP: contention is pretty small, so not interesting. Note that too high a batch value could hurt performance.

batch  score    change  zone_contention  lru_contention  total_contention
31     4831523  +0.00%  2%               3%              5%
53     4834086  +0.05%  2%               4%              6%
63     4834262  +0.06%  2%               3%              5%
73     4832851  +0.03%  2%               4%              6%
119    4830534  -0.02%  1%               3%              4%
127    4827461  -0.08%  1%               4%              5%
137    4827459  -0.08%  1%               3%              4%
165    4820534  -0.23%  0%               4%              4%
188    4817947  -0.28%  0%               3%              3%
223    4809671  -0.45%  0%               3%              3%
255    4802463  -0.60%  0%               4%              4%
267    4801634  -0.62%  0%               3%              3%
299    4798047  -0.69%  0%               3%              3%
320    4793084  -0.80%  0%               3%              3%
393    4785877  -0.94%  0%               3%              3%
424    4782911  -1.01%  0%               3%              3%
458    4779346  -1.08%  0%               3%              3%
467    4780306  -1.06%  0%               3%              3%
498    4780589  -1.05%  0%               3%              3%
511    4773724  -1.20%  0%               3%              3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch  score    change  zone_contention  lru_contention  total_contention
31     3906608  +0.00%  2%               3%              5%
53     3940164  +0.86%  2%               3%              5%
63     3937289  +0.79%  2%               3%              5%
73     3940201  +0.86%  2%               3%              5%
119    3933240  +0.68%  2%               3%              5%
127    3930514  +0.61%  2%               4%              6%
137    3938639  +0.82%  0%               3%              3%
165    3908755  +0.05%  0%               3%              3%
188    3905621  -0.03%  0%               3%              3%
223    3903015  -0.09%  0%               4%              4%
255    3889480  -0.44%  0%               3%              3%
267    3891669  -0.38%  0%               4%              4%
299    3898728  -0.20%  0%               4%              4%
320    3894547  -0.31%  0%               4%              4%
393    3875137  -0.81%  0%               4%              4%
424    3874521  -0.82%  0%               3%              3%
458    3880432  -0.67%  0%               4%              4%
467    3888715  -0.46%  0%               3%              3%
498    3888633  -0.46%  0%               4%              4%
511    3875305  -0.80%  0%               5%              5%

Haswell-Desktop: zone->lock contention is pretty low, as on the other desktops, though lru contention is higher than on the other desktops.

batch  score    change  zone_contention  lru_contention  total_contention
31     3511158  +0.00%  2%               5%              7%
53     3555445  +1.26%  2%               6%              8%
63     3561082  +1.42%  2%               6%              8%
73     3547218  +1.03%  2%               6%              8%
119    3571319  +1.71%  1%               7%              8%
127    3549375  +1.09%  0%               6%              6%
137    3560233  +1.40%  0%               6%              6%
165    3555176  +1.25%  2%               6%              8%
188    3551501  +1.15%  0%               8%              8%
223    3531462  +0.58%  0%               7%              7%
255    3570400  +1.69%  0%               7%              7%
267    3532235  +0.60%  1%               8%              9%
299    3562326  +1.46%  0%               6%              6%
320    3553569  +1.21%  0%               8%              8%
393    3539519  +0.81%  0%               7%              7%
424    3549271  +1.09%  0%               8%              8%
458    3528885  +0.50%  0%               8%              8%
467    3526554  +0.44%  0%               7%              7%
498    3525302  +0.40%  0%               9%              9%
511    3527556  +0.47%  0%               8%              8%

Sandybridge-Desktop: the 0% contention isn't accurate; it is caused by the dropped fractional part. Since the contentions of multiple contention paths are all under 1% here, with arithmetic operations like addition the final deviation could be as large as 3%.

batch  score    change  zone_contention  lru_contention  total_contention
31     1744495  +0.00%  0%               0%              0%
53     1755341  +0.62%  0%               0%              0%
63     1758469  +0.80%  0%               0%              0%
73     1759626  +0.87%  0%               0%              0%
119    1770417  +1.49%  0%               0%              0%
127    1768252  +1.36%  0%               0%              0%
137    1767848  +1.34%  0%               0%              0%
165    1765088  +1.18%  0%               0%              0%
188    1766918  +1.29%  0%               0%              0%
223    1767866  +1.34%  0%               0%              0%
255    1768074  +1.35%  0%               0%              0%
267    1763187  +1.07%  0%               0%              0%
299    1765620  +1.21%  0%               0%              0%
320    1767603  +1.32%  0%               0%              0%
393    1764612  +1.15%  0%               0%              0%
424    1758476  +0.80%  0%               0%              0%
458    1758593  +0.81%  0%               0%              0%
467    1757915  +0.77%  0%               0%              0%
498    1753363  +0.51%  0%               0%              0%
511    1755548  +0.63%  0%               0%              0%

Phase two test results:

Note: all percent changes are against the base (batch=31).

ebizzy.throughput (higher is better)
machine       batch=31    batch=63           batch=127
lkp-skl-4sp1  2410037±7%  2600451±2% +7.9%   2602878 +8.0%
lkp-bdw-ex1   1493328     1489243 -0.3%      1492145 -0.1%
lkp-skl-2sp2  1329674     1345891 +1.2%      1351056 +1.6%
lkp-bdw-ep2   711511      711511 0.0%        710708 -0.1%
lkp-wsm-ep2   75750       75528 -0.3%        75441 -0.4%
lkp-skl-d01   264126      262791 -0.5%       264113 +0.0%
lkp-hsw-d01   176601      176328 -0.2%       176368 -0.1%
lkp-sb02      98937       98937 +0.0%        99030 +0.1%

kbuild.buildtime (less is better)
machine       batch=31  batch=63        batch=127
lkp-skl-4sp1  107.00    107.67 +0.6%    107.11 +0.1%
lkp-bdw-ex1   97.33     97.33 +0.0%     97.42 +0.1%
lkp-skl-2sp2  180.00    179.83 -0.1%    179.83 -0.1%
lkp-bdw-ep2   178.17    179.17 +0.6%    177.50 -0.4%
lkp-wsm-ep2   737.00    738.00 +0.1%    738.00 +0.1%
lkp-skl-d01   642.00    653.00 +1.7%    653.00 +1.7%
lkp-hsw-d01   1310.00   1316.00 +0.5%   1311.00 +0.1%

netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
machine       batch=31  batch=63        batch=127
lkp-skl-4sp1  948790    947144 -0.2%    948333 -0.0%
lkp-bdw-ex1   904224    904366 +0.0%    904926 +0.1%
lkp-skl-2sp2  239731    239607 -0.1%    239565 -0.1%
lkp-bdw-ep2   365764    365933 +0.0%    365951 +0.1%
lkp-wsm-ep2   93736     93803 +0.1%     93808 +0.1%
lkp-skl-d01   77314     77303 -0.0%     77375 +0.1%
lkp-hsw-d01   58617     60387 +3.0%     60208 +2.7%
lkp-sb02      29990     30137 +0.5%     30103 +0.4%

oltp.transactions (higher is better)
machine       batch=31    batch=63            batch=127
lkp-bdw-ex1   9073276     9100377 +0.3%       9036344 -0.4%
lkp-skl-2sp2  8898717     8852054 -0.5%       8894459 -0.0%
lkp-bdw-ep2   13426155    13384654 -0.3%      13333637 -0.7%
lkp-hsw-ep2   13146314    13232784 +0.7%      13193163 +0.4%
lkp-wsm-ep2   5035355     5019348 -0.3%       5033418 -0.0%
lkp-skl-d01   418485      4413339 -0.1%       4419039 +0.0%
lkp-hsw-d01   3517817±5%  3396120±3% -3.5%    3455138±3% -1.8%

pigz.throughput (higher is better)
machine       batch=31   batch=63          batch=127
lkp-skl-4sp1  1.513e+08  1.507e+08 -0.4%   1.511e+08 -0.2%
lkp-bdw-ex1   2.060e+08  2.052e+08 -0.4%   2.044e+08 -0.8%
lkp-skl-2sp2  8.836e+08  8.845e+08 +0.1%   8.836e+08 -0.0%
lkp-bdw-ep2   8.275e+08  8.464e+08 +2.3%   8.330e+08 +0.7%
lkp-wsm-ep2   2.224e+08  2.221e+08 -0.2%   2.218e+08 -0.3%
lkp-skl-d01   1.177e+08  1.177e+08 -0.0%   1.176e+08 -0.1%
lkp-hsw-d01   1.154e+08  1.154e+08 +0.1%   1.154e+08 -0.0%
lkp-sb02      0.633e+08  0.633e+08 +0.1%   0.633e+08 +0.0%

will-it-scale.malloc1.processes (higher is better)
machine       batch=31  batch=63         batch=127
lkp-skl-4sp1  620181    620484 +0.0%     620240 +0.0%
lkp-bdw-ex1   1403610   1401201 -0.2%    1417900 +1.0%
lkp-skl-2sp2  1288097   1284145 -0.3%    1283907 -0.3%
lkp-bdw-ep2   1427879   1427675 -0.0%    1428266 +0.0%
lkp-hsw-ep2   1362546   1353965 -0.6%    1354759 -0.6%
lkp-wsm-ep2   2099657   2107576 +0.4%    2100226 +0.0%
lkp-skl-d01   1476835   1476358 -0.0%    1474487 -0.2%
lkp-hsw-d01   1308810   1303429 -0.4%    1301299 -0.6%
lkp-sb02      589286    589284 -0.0%     588101 -0.2%

will-it-scale.malloc1.threads (higher is better)
machine       batch=31    batch=63        batch=127
lkp-skl-4sp1  21289       21125 -0.8%     21241 -0.2%
lkp-bdw-ex1   28114       28089 -0.1%     28007 -0.4%
lkp-skl-2sp2  91866       91946 +0.1%     92723 +0.9%
lkp-bdw-ep2   37637       37501 -0.4%     37317 -0.9%
lkp-hsw-ep2   43673       43590 -0.2%     43754 +0.2%
lkp-wsm-ep2   28577       28298 -1.0%     28545 -0.1%
lkp-skl-d01   175277      173343 -1.1%    173082 -1.3%
lkp-hsw-d01   130303      129566 -0.6%    129250 -0.8%
lkp-sb02      113742±3%   116911 +2.8%    116417±3% +2.4%

will-it-scale.malloc2.processes (higher is better)
machine       batch=31   batch=63          batch=127
lkp-skl-4sp1  1.206e+09  1.206e+09 -0.0%   1.206e+09 +0.0%
lkp-bdw-ex1   1.319e+09  1.319e+09 -0.0%   1.319e+09 +0.0%
lkp-skl-2sp2  8.000e+08  8.021e+08 +0.3%   7.995e+08 -0.1%
lkp-bdw-ep2   6.582e+08  6.634e+08 +0.8%   6.513e+08 -1.1%
lkp-hsw-ep2   6.671e+08  6.669e+08 -0.0%   6.665e+08 -0.1%
lkp-wsm-ep2   1.805e+08  1.806e+08 +0.0%   1.804e+08 -0.1%
lkp-skl-d01   1.611e+08  1.611e+08 -0.0%   1.610e+08 -0.0%
lkp-hsw-d01   1.333e+08  1.332e+08 -0.0%   1.332e+08 -0.0%
lkp-sb02      82485104   82478206 -0.0%    82473546 -0.0%

will-it-scale.malloc2.threads (higher is better)
machine       batch=31   batch=63          batch=127
lkp-skl-4sp1  1.574e+09  1.574e+09 -0.0%   1.574e+09 -0.0%
lkp-bdw-ex1   1.737e+09  1.737e+09 +0.0%   1.737e+09 -0.0%
lkp-skl-2sp2  9.161e+08  9.162e+08 +0.0%   9.181e+08 +0.2%
lkp-bdw-ep2   7.856e+08  8.015e+08 +2.0%   8.113e+08 +3.3%
lkp-hsw-ep2   6.908e+08  6.904e+08 -0.1%   6.907e+08 -0.0%
lkp-wsm-ep2   2.409e+08  2.409e+08 +0.0%   2.409e+08 -0.0%
lkp-skl-d01   1.199e+08  1.199e+08 -0.0%   1.199e+08 -0.0%
lkp-hsw-d01   1.029e+08  1.029e+08 -0.0%   1.029e+08 +0.0%
lkp-sb02      68081213   68061423 -0.0%    68076037 -0.0%

will-it-scale.page_fault2.processes (higher is better)
machine       batch=31     batch=63           batch=127
lkp-skl-4sp1  14509125±4%  16472364 +13.5%    17123117 +18.0%
lkp-bdw-ex1   14736381     16196588 +9.9%     16364011 +11.0%
lkp-skl-2sp2  6354925      6435444 +1.3%      6436644 +1.3%
lkp-bdw-ep2   8749584      8834422 +1.0%      8827179 +0.9%
lkp-hsw-ep2   8762591      8845920 +1.0%      8825697 +0.7%
lkp-wsm-ep2   3036083      3030428 -0.2%      3021741 -0.5%
lkp-skl-d01   2307834      2304731 -0.1%      2286142 -0.9%
lkp-hsw-d01   1806237      1800786 -0.3%      1795943 -0.6%
lkp-sb02      842616       837844 -0.6%       833921 -1.0%

will-it-scale.page_fault2.threads
machine       batch=31  batch=63            batch=127
lkp-skl-4sp1  1623294   1615132±2% -0.5%    1656777 +2.1%
lkp-bdw-ex1   1995714   2025948 +1.5%       2113753±3% +5.9%
lkp-skl-2sp2  2346708   2415591 +2.9%       2416919 +3.0%
lkp-bdw-ep2   2342564   2344882 +0.1%       2300206 -1.8%
lkp-hsw-ep2   1820658   1831681 +0.6%       1844057 +1.3%
lkp-wsm-ep2   1725482   1733774 +0.5%       1740517 +0.9%
lkp-skl-d01   1832833   1823628 -0.5%       1806489 -1.4%
lkp-hsw-d01   1427913   1427287 -0.0%       1420226 -0.5%
lkp-sb02      750626    748615 -0.3%        746621 -0.5%

will-it-scale.page_fault3.processes (higher is better)
machine       batch=31  batch=63          batch=127
lkp-skl-4sp1  24382726  24400317 +0.1%    24668774 +1.2%
lkp-bdw-ex1   35399750  35683124 +0.8%    35829492 +1.2%
lkp-skl-2sp2  28136820  28068248 -0.2%    28147989 +0.0%
lkp-bdw-ep2   37269077  37459490 +0.5%    37373073 +0.3%
lkp-hsw-ep2   36224967  36114085 -0.3%    36104908 -0.3%
lkp-wsm-ep2   16820457  16911005 +0.5%    16968596 +0.9%
lkp-skl-d01   7721138   7725904 +0.1%     7756740 +0.5%
lkp-hsw-d01   7611979   7650928 +0.5%     7651323 +0.5%
lkp-sb02      3781546   3796502 +0.4%     3796827 +0.4%

will-it-scale.page_fault3.threads (higher is better)
machine       batch=31    batch=63            batch=127
lkp-skl-4sp1  1865820±3%  1900917±2% +1.9%    1826245±4% -2.1%
lkp-bdw-ex1   3094060     3148326 +1.8%       3150036 +1.8%
lkp-skl-2sp2  3952940     3953898 +0.0%       3989360 +0.9%
lkp-bdw-ep2   3420373±3%  3643964 +6.5%       3644910±5% +6.6%
lkp-hsw-ep2   2609635±2%  2582310±3% -1.0%    2780459 +6.5%
lkp-wsm-ep2   4395001     4417196 +0.5%       4432499 +0.9%
lkp-skl-d01   5363977     5400003 +0.7%       5411370 +0.9%
lkp-hsw-d01   5274131     5311294 +0.7%       5319359 +0.9%
lkp-sb02      2917314     2913004 -0.1%       2935286 +0.6%

will-it-scale.read1.processes (higher is better)
machine       batch=31      batch=63             batch=127
lkp-skl-4sp1  73762279±14%  69322519±10% -6.0%   69349855±13% -6.0% (result unstable)
lkp-bdw-ex1   1.701e+08     1.704e+08 +0.1%      1.705e+08 +0.2%
lkp-skl-2sp2  63111570      63113953 +0.0%       63836573 +1.1%
lkp-bdw-ep2   79247409      79424610 +0.2%       78012656 -1.6%
lkp-hsw-ep2   67677026      68308800 +0.9%       67539106 -0.2%
lkp-wsm-ep2   13339630      13939817 +4.5%       13766865 +3.2%
lkp-skl-d01   10969487      10972650 +0.0%       no data
lkp-hsw-d01   9857342±2%    10080592±2% +2.3%    10131560 +2.8%
lkp-sb02      5189076       5197473 +0.2%        5163253 -0.5%

will-it-scale.read1.threads (higher is better)
machine       batch=31      batch=63            batch=127
lkp-skl-4sp1  62468045±12%  73666726±7% +17.9%  79553123±12% +27.4% (result unstable)
lkp-bdw-ex1   1.62e+08      1.624e+08 +0.3%     1.614e+08 -0.3%
lkp-skl-2sp2  58319780      59181032 +1.5%      59821353 +2.6%
lkp-bdw-ep2   74057992      75698171 +2.2%      74990869 +1.3%
lkp-hsw-ep2   63672959      63639652 -0.1%      64387051 +1.1%
lkp-wsm-ep2   13489943      13526058 +0.3%      13259032 -1.7%
lkp-skl-d01   10297906      10338796 +0.4%      10407328 +1.1%
lkp-hsw-d01   9636721       9667376 +0.3%       9341147 -3.1%
lkp-sb02      4801938       4804496 +0.1%       4802290 +0.0%

will-it-scale.write1.processes (higher is better)
machine       batch=31    batch=63             batch=127
lkp-skl-4sp1  1.111e+08   1.104e+08±2% -0.7%   1.122e+08±2% +1.0%
lkp-bdw-ex1   1.392e+08   1.399e+08 +0.5%      1.397e+08 +0.4%
lkp-skl-2sp2  59369233    58994841 -0.6%       58715168 -1.1%
lkp-bdw-ep2   61820979    CPU throttle         63593123 +2.9%
lkp-hsw-ep2   57897587    57435605 -0.8%       56347450 -2.7%
lkp-wsm-ep2   7814203     7918017±2% +1.3%     7669068 -1.9%
lkp-skl-d01   8886557     8971422 +1.0%        8818366 -0.8%
lkp-hsw-d01   9171001±5%  9189915 +0.2%        9483909 +3.4%
lkp-sb02      4475406     4475294 -0.0%        4501756 +0.6%

will-it-scale.write1.threads (higher is better)
machine       batch=31      batch=63            batch=127
lkp-skl-4sp1  1.058e+08     1.055e+08±2% -0.2%  1.065e+08 +0.7%
lkp-bdw-ex1   1.316e+08     1.300e+08 -1.2%     1.308e+08 -0.6%
lkp-skl-2sp2  54492421      56086678 +2.9%      55975657 +2.7%
lkp-bdw-ep2   59360449      59003957 -0.6%      58101262 -2.1%
lkp-hsw-ep2   53346346±2%   52530876 -1.5%      52902487 -0.8%
lkp-wsm-ep2   7774006       7800092±2% +0.3%    7558833 -2.8%
lkp-skl-d01   8346174       8235695 -1.3%       no data
lkp-hsw-d01   8636244       8655731 +0.2%       8658868 +0.3%
lkp-sb02      4181820       4204107 +0.5%       4182992 +0.0%

vm-scalability.anon-r-rand.throughput (higher is better)
machine       batch=31     batch=63            batch=127
lkp-skl-4sp1  11933873±3%  12356544±2% +3.5%   12188624 +2.1%
lkp-bdw-ex1   7114424±2%   7330949±2% +3.0%    7392419 +3.9%
lkp-skl-2sp2  6773277±5%   6492332±8% -4.1%    6543962 -3.4%
lkp-bdw-ep2   7133846±4%   7233508 +1.4%       7013518±3% -1.7%
lkp-hsw-ep2   4576626      4527098 -1.1%       4551679 -0.5%
lkp-wsm-ep2   2583599      2592492 +0.3%       2588039 +0.2%
lkp-hsw-d01   998199±2%    1028311 +3.0%       1006460±2% +0.8%
lkp-sb02      570572       567854 -0.5%        568449 -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)
machine       batch=31     batch=63            batch=127
lkp-skl-4sp1  1789419      1787830 -0.1%       1788208 -0.1%
lkp-bdw-ex1   3492595±2%   3554966±2% +1.8%    3558835±3% +1.9%
lkp-skl-2sp2  3856238±2%   3975403±4% +3.1%    3994600 +3.6%
lkp-bdw-ep2   3726963±11%  3809292±6% +2.2%    3871924±4% +3.9%
lkp-hsw-ep2   2131760±3%   2033578±4% -4.6%    2130727±6% -0.0%
lkp-wsm-ep2   2369731      2368384 -0.1%       2370252 +0.0%
lkp-skl-d01   1207128      1206220 -0.1%       1205801 -0.1%
lkp-hsw-d01   964317       992329±2% +2.9%     992099±2% +2.9%
lkp-sb02      567137       567346 +0.0%        566144 -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)
machine       batch=31      batch=63             batch=127
lkp-skl-4sp1  19560469±6%   23018999 +17.7%      23418800 +19.7%
lkp-bdw-ex1   17769135±14%  26141676±3% +47.1%   26284723±5% +47.9%
lkp-skl-2sp2  14056512      13578884 -3.4%       13146214 -6.5%
lkp-bdw-ep2   15336542      14737654 -3.9%       14088159 -8.1%
lkp-hsw-ep2   16275498      15756296 -3.2%       15018090 -7.7%
lkp-wsm-ep2   11272160      11237231 -0.3%       11310047 +0.3%
lkp-skl-d01   7322119       7324569 +0.0%        7184148 -1.9%
lkp-hsw-d01   6449234       6404542 -0.7%        6356141 -1.4%
lkp-sb02      3517943       3520668 +0.1%        3527309 +0.3%

vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
machine       batch=31  batch=63         batch=127
lkp-skl-4sp1  1689052   1697553 +0.5%    1698726 +0.6%
lkp-bdw-ex1   1675246   1699764 +1.5%    1712226 +2.2%
lkp-skl-2sp2  1800533   1799749 -0.0%    1800581 +0.0%
lkp-bdw-ep2   1807422   1807758 +0.0%    1804932 -0.1%
lkp-hsw-ep2   1809807   1808781 -0.1%    1807811 -0.1%
lkp-wsm-ep2   1800198   1802434 +0.1%    1801236 +0.1%
lkp-skl-d01   696689    695537 -0.2%     694106 -0.4%
lkp-hsw-d01   698364    698666 +0.0%     696686 -0.2%
lkp-sb02      258939    258787 -0.1%     258199 -0.3%

Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
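To make the sizing relationship mentioned above concrete, a tiny stand-alone sketch of the PCP arithmetic this patch maintains (the 6x factor between pcp->high and pcp->batch):

== pcp_sizing.c (illustration) ==
#include <stdio.h>

int main(void)
{
	int batch = 63;		/* doubled from 31 by this patch */
	int high = 6 * batch;	/* pcp->high = 6 * pcp->batch */

	printf("pcp->batch = %d pages, pcp->high = %d pages\n", batch, high);
	return 0;
}
==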
Commit 553e94a
mm/vmalloc: do not keep unpurged areas in the busy tree
The busy tree can be quite big; even though an area is freed or unmapped, it still stays there until the "purge" logic removes it.

1) Optimize and reduce the size of the "busy" tree by removing a node from it right away as soon as the user triggers free paths. It is possible to do so because the allocation is done using another augmented tree. The vmalloc test driver shows the difference; for example the "fix_size_alloc_test" is ~11% better compared with the default configuration:

sudo ./test_vmalloc.sh performance

<default>
Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec
<default>

<this patch>
Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec
Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec
Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec
<this patch>

2) Since the busy tree now contains allocated areas only and does not interfere with lazily freed nodes, introduce the new function show_purge_info() that dumps "unpurged" areas, propagated through "/proc/vmallocinfo".

3) Eliminate the VM_LAZY_FREE flag.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit 243b2cb
mm/vmalloc.c: fix percpu free VM area search criteria
Recent changes to the vmalloc code by commit 68ad4a3 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can cause spurious percpu allocation failures. These, in turn, can result in panic()s in the slub code. One such possible panic was reported by Dave Hansen in the following link: https://lkml.org/lkml/2019/6/19/939. Another related panic observed is:

RIP: 0033:0x7f46f7441b9b
Call Trace:
 dump_stack+0x61/0x80
 pcpu_alloc.cold.30+0x22/0x4f
 mem_cgroup_css_alloc+0x110/0x650
 cgroup_apply_control_enable+0x133/0x330
 cgroup_mkdir+0x41b/0x500
 kernfs_iop_mkdir+0x5a/0x90
 vfs_mkdir+0x102/0x1b0
 do_mkdirat+0x7d/0xf0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly uses two lists (vmap_area_list & free_vmap_area_list) to track the used and free VM areas in VMALLOC space. The pcpu_get_vm_areas(offsets[], sizes[], nr_vms, align) function is used for allocating congruent VM areas for the percpu memory allocator. In order to not conflict with VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the VMALLOC space. So the search for a free vm_area for the given requirement starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a3, the search for a free vm_area in pcpu_get_vm_areas() involved the following two main steps.

Step 1: Find an aligned "base" address near VMALLOC_END.
        va = free vm area near VMALLOC_END

Step 2: Loop through the number of requested vm_areas and check:
        Step 2.1: if (base < VMALLOC_START)
                  1. fail with error
        Step 2.2: // end is offsets[area] + sizes[area]
                  if (base + end > va->vm_end)
                  1. Move the base downwards and repeat Step 2
        Step 2.3: if (base + start < va->vm_start)
                  1. Move to the previous free vm_area node, find an aligned base address and repeat Step 2

But commit 68ad4a3 removed Step 2.2 and modified Step 2.3 as below:

        Step 2.3: if (base + start < va->vm_start || base + end > va->vm_end)
                  1. Move to the previous free vm_area node, find an aligned base address and repeat Step 2

The above change is the root cause of the spurious percpu memory allocation failures. For example, consider a case where a relatively large vm_area (~30 TB) was ignored in the free vm_area search because it did not pass the base + end < vm->vm_end boundary check. Ignoring such large free vm_areas would lead to not finding a free vm_area within the boundary of VMALLOC_START to VMALLOC_END, which in turn leads to allocation failures.

So modify the search algorithm to include Step 2.2.

Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
Fixes: 68ad4a3 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit e25f795
mm/vmalloc: modify struct vmap_area to reduce its size
Objective
---------
The current implementation of struct vmap_area wastes space. After applying this commit, sizeof(struct vmap_area) has been reduced from 11 words to 8 words.

Description
-----------
1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem because
   A) "subtree_max_size" is only used when vmap_area is in the "free" tree
   B) "vm" is only used when vmap_area is in the "busy" tree
   C) "purge_list" is only used when vmap_area is in vmap_purge_list

2) Eliminate "flags". Since only one flag, VM_VM_AREA, is being used, and the same thing can be done by checking whether "vm" is NULL, the "flags" field can be eliminated.

Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
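The packing in 1) and 2) amounts to folding three mutually exclusive fields into a single union, roughly like this (a sketch of the resulting layout; field types follow the vmalloc code of this era):

== struct vmap_area (sketch) ==
struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;

	struct rb_node rb_node;		/* address-sorted rbtree */
	struct list_head list;		/* address-sorted list */

	/*
	 * These three are never used at the same time:
	 *  - subtree_max_size: only while in the "free" tree
	 *  - vm:               only while in the "busy" tree
	 *  - purge_list:       only while in vmap_purge_list
	 */
	union {
		unsigned long subtree_max_size;
		struct vm_struct *vm;
		struct llist_node purge_list;
	};
};
==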
Commit cc5bb10
mm/vmalloc: remove preempt_disable/enable when doing preloading
Some background: preemption was previously disabled to guarantee that a preloaded object is available for the CPU it was stored for. The aim was to not allocate in atomic context when the spinlock is taken later, for regular vmap allocations. But that approach conflicts with the CONFIG_PREEMPT_RT philosophy: calling spin_lock() with preemption disabled is forbidden in a CONFIG_PREEMPT_RT kernel.

Therefore, get rid of preempt_disable() and preempt_enable() when the preload is done for splitting purposes. As a result we no longer guarantee that a CPU is preloaded; instead, with this change, we minimize the cases when it is not.

For example, I ran a special test case that follows the preload pattern and path. 20 "unbind" threads run it and each does 1000000 allocations. Only 3.5 times among 1000000 was a CPU not preloaded. So it can happen, but the number is negligible.

V2 -> V3:
- update the commit message

V1 -> V2:
- move the __this_cpu_cmpxchg check to when spin_lock is taken, as proposed by Andrew Morton
- add more explanation in regard to preloading
- adjust and move some comments

Fixes: 82dd23e ("mm/vmalloc.c: preload a CPU with one object for split purpose")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
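A hedged sketch of the resulting preload pattern (condensed; names follow the upstream ne_fit_preload_node per-CPU variable, and details vary by kernel version):

== preload pattern (sketch) ==
	struct vmap_area *pva = NULL;

	/*
	 * No preempt_disable() anymore: we may migrate between the
	 * check and the cmpxchg below, so preloading is best effort.
	 */
	if (!this_cpu_read(ne_fit_preload_node))
		pva = kmem_cache_alloc_node(vmap_area_cachep,
					    GFP_KERNEL, node);

	spin_lock(&vmap_area_lock);

	/* Another task may have preloaded this CPU in the meantime. */
	if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
		kmem_cache_free(vmap_area_cachep, pva);
==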
Commit 528923b
mm/vmalloc: respect passed gfp_mask when doing preloading
alloc_vmap_area() is given a gfp_mask for the page allocator. Let's respect that mask and consider it even in the case when doing regular CPU preloading, i.e. where a context can sleep.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit 069a713
mm/vmalloc: add more comments to the adjust_va_to_fit_type()
When the fit type is NE_FIT_TYPE, one extra object is needed. Usually the "ne_fit_preload_node" per-CPU variable has it, so there is no need for a GFP_NOWAIT allocation, but there are exceptions. This commit just adds more explanation, answering questions about when it can occur, how often, under which conditions, and what happens if the GFP_NOWAIT allocation fails.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit 8cd96c5
mm/vmalloc: rework vmap_area_lock
With the new allocation approach introduced in the 5.2 kernel, it becomes possible to get rid of one global spinlock. By doing that we can further improve the KVA allocator from a performance point of view.

Basically we can have two independent locks, one for the allocation part and another one for deallocation, because of two different entities: "free data structures" and "busy data structures". As a result, allocation/deallocation operations can still interfere with each other when running simultaneously on different CPUs, which means there is still a dependency, but with two locks it becomes lower.

Summarizing:

- it reduces the high lock contention
- it allows performing operations on the "free" and "busy" trees in parallel on different CPUs. Please note it does not solve the scalability issue.

Test results: in order to evaluate this patch, we can run the "vmalloc test driver" to see how many CPU cycles it takes to complete all test cases running sequentially. All online CPUs run it, so it will cause a high lock contention.

HiKey 960, ARM64, 8xCPUs, big.LITTLE:

<snip>
sudo ./test_vmalloc.sh sequential_test_order=1
<snip>

<default>
[ 390.950557] All test took CPU0=457126382 cycles
[ 391.046690] All test took CPU1=454763452 cycles
[ 391.128586] All test took CPU2=454539334 cycles
[ 391.222669] All test took CPU3=455649517 cycles
[ 391.313946] All test took CPU4=388272196 cycles
[ 391.410425] All test took CPU5=384036264 cycles
[ 391.492219] All test took CPU6=387432964 cycles
[ 391.578433] All test took CPU7=387201996 cycles
<default>

<patched>
[ 304.721224] All test took CPU0=391521310 cycles
[ 304.821219] All test took CPU1=393533002 cycles
[ 304.917120] All test took CPU2=392243032 cycles
[ 305.008986] All test took CPU3=392353853 cycles
[ 305.108944] All test took CPU4=297630721 cycles
[ 305.196406] All test took CPU5=297548736 cycles
[ 305.288602] All test took CPU6=297092392 cycles
[ 305.381088] All test took CPU7=297293597 cycles
<patched>

The patched variant is ~14%-23% better.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
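A rough sketch of the resulting split (simplified; assumes the free/busy tree layout this series builds on, with helper names following the upstream code):

== two-lock split (sketch) ==
static DEFINE_SPINLOCK(free_vmap_area_lock);	/* protects the "free" tree */
static DEFINE_SPINLOCK(vmap_area_lock);		/* protects the "busy" tree */

/* Allocation: carve a region out of the free tree... */
spin_lock(&free_vmap_area_lock);
addr = __alloc_vmap_area(size, align, vstart, vend);
spin_unlock(&free_vmap_area_lock);

/* ...then publish it in the busy tree under the other lock,
 * so frees on other CPUs no longer serialize against allocations.
 */
spin_lock(&vmap_area_lock);
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
spin_unlock(&vmap_area_lock);
==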
Commit 3b19d61
blkdev: switch to SSD mode, disable IO stats and entropy gathering
Optimize for flash storage. Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
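A hedged sketch of what this kind of change usually touches at the block layer (mainline flag names; the exact helpers and call sites depend on the kernel version):

== request queue flags (sketch) ==
/* Present the device as non-rotational and drop per-request
 * bookkeeping that costs CPU time on fast flash storage.
 */
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);	/* "SSD mode" */
blk_queue_flag_clear(QUEUE_FLAG_IO_STAT, q);	/* no IO statistics */
blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);	/* no entropy gathering */
==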
Commit 87199f8
qcom-rpmh-mailbox: silence, driver!
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Commit 2255e07
BACKPORT: sched/fair: Fall back to sched-idle CPU if idle CPU isn't found
We try to find an idle CPU to run the next task, but in case we don't find an idle CPU it is better to pick a CPU which will run the task the soonest, for performance reasons.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it should be a good target based on this criteria, as any normal fair task will most likely preempt the currently running SCHED_IDLE task immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one shall give better results, as it should be able to run the task sooner than an idle CPU (which requires to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/eeafa25fdeb6f6edd5b2da716bc8f0ba7708cbcf.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
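The heart of the backport is a small helper along these lines (shown roughly as merged upstream), which treats a CPU whose runqueue holds only SCHED_IDLE activity as an acceptable wake-up target:

== sched_idle_cpu() (upstream form, for reference) ==
/* The runqueue is busy, but everything on it is SCHED_IDLE. */
static int sched_idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}
==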
Commit 26e3e45
BACKPORT: sched/fair: Make sched-idle CPU selection consistent throughout
There are instances where we keep searching for an idle CPU despite already having a sched-idle CPU (in find_idlest_group_cpu(), select_idle_smt() and select_idle_cpu()), and then there are places where we don't necessarily do that and return a sched-idle CPU as soon as we find one (in select_idle_sibling()). This looks a bit inconsistent and it may be worth having the same policy everywhere.

On the other hand, choosing a sched-idle CPU over an idle one shall be beneficial from a performance and power point of view as well, as we don't need to bring the CPU online from a deep idle state, which wastes quite a lot of time and energy and delays the scheduling of the newly woken task.

This patch tries to simplify the code around sched-idle CPU selection and make it consistent throughout.

Testing is done with the help of rt-app on a hikey board (ARM64 octa-core, 2 clusters, 0-3 and 4-7). The cpufreq governor was set to performance to avoid any side effects from CPU frequency. The following tests were performed:

Test 1: 1-cfs-task: A single SCHED_NORMAL task is pinned to CPU5 which runs for 2333 us out of 7777 us (so gives time for the cluster to go into a deep idle state).

Test 2: 1-cfs-1-idle-task: A single SCHED_NORMAL task is pinned on CPU5 and a single SCHED_IDLE task is pinned on CPU6 (to make sure cluster 1 doesn't go into a deep idle state).

Test 3: 1-cfs-8-idle-task: A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE tasks are created which run forever (not pinned anywhere, so they run on all CPUs). Checked with kernelshark that as soon as the NORMAL task sleeps, a SCHED_IDLE task starts running on CPU5.

And here are the results on mean latency (in us), using the "st" tool:

$ st 1-cfs-task/rt-app-cfs_thread-0.log
N    min  max  sum     mean     stddev
642  90   592  197180  307.134  109.906

$ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log
N    min  max  sum     mean     stddev
642  67   311  113850  177.336  41.4251

$ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log
N    min  max  sum     mean     stddev
643  29   173  41364   64.3297  13.2344

The mean latency when we need to:
- wake up from a deep idle state is 307 us
- wake up from a shallow idle state is 177 us
- preempt a SCHED_IDLE task is 64 us

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/b90cbcce608cef4e02a7bbfe178335f76d201bab.1573728344.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Commit f064f3d
BACKPORT: sched/fair: Load balance aggressively for SCHED_IDLE CPUs
The fair scheduler performs periodic load balance on every CPU to check if it can pull some tasks from other busy CPUs. The duration of this periodic load balance is set to sd->balance_interval for the idle CPUs and is calculated by multiplying the sd->balance_interval with the sd->busy_factor (set to 32 by default) for the busy CPUs. The multiplication is done for busy CPUs to avoid doing load balance too often and rather spend more time executing actual task. While that is the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE CPU instead of other busy or idle CPUs. The same reasoning should be applied to the load balancer as well to make it migrate tasks more aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the migrated (SCHED_OTHER) tasks. This patch makes minimal changes to the fair scheduler to do the next load balance soon after the last non SCHED_IDLE task is dequeued from a runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is ignored while calculating the balance_interval for such CPUs. This is done to avoid delaying the periodic load balance by few hundred milliseconds for SCHED_IDLE CPUs. This is tested on ARM64 Hikey620 platform (octa-core) with the help of rt-app and it is verified, using kernel traces, that the newly SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE and pulls tasks from other busy CPUs. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Commit 7b1cb35
power: supply: Classify Battery Monitor Systems as batteries
CAF's new fuel gauge drivers report POWER_SUPPLY_TYPE_BMS (Battery Monitor System) instead of POWER_SUPPLY_TYPE_BATTERY (battery), and rightfully so because it describes their purpose more accurately. Update the power_supply_is_system_supplied function to recognize BMS power supplies as batteries to prevent it from attempting to query the POWER_SUPPLY_PROP_ONLINE property on our fuel gauge drivers. Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: K A R T H I K <karthik.lal558@gmail.com> Signed-off-by: AtomicXZ <achubadyal4@gmail.com>
Commit 74811a5
power: Stabilize fluctuation in charging speed
1) The 35-36 degree threshold for toggling fast charge mode is bad, since that is the ambient temperature in most places. Increased it to 39-40 degrees. 45-46 degrees remains the cutoff threshold.

2) system_temp_level is mostly between 10-14. Thermal mitigation votes in for crossing a threshold of 3-4!! Increased this threshold value to 14-15.

3) Added debug for quick_charge_type activation. Let's monitor. QUICK_CHARGE_TURBE is activated when pd_verifed = 1, charger type = 9ish and class_ab = 1. With the POCO charger pd_verifed is always 0, charger type = 13ish and class_ab = 1. Need to check these parameters on the stock ROM. While we get very good rapid / fast charging, turbo charging is still missing. Why?

Signed-off-by: Pranav Vashi <neobuddy89@gmail.com>
Signed-off-by: AtomicXZ <achubadyal4@gmail.com>
Commit 502760b
diag: Add timer to make sure wakelock doesn't get stuck
Bug: 142000171
cherry-pick from I81666dcc18c584dbcbd9f588f187ae87a377bcb0
Signed-off-by: Jie Song <jies@google.com>
Change-Id: I7bb84dec4d352ef731b2457dc15590a29ed57a90
Commit 0bcd354
power: suspend: Add suspend timeout handler
Add a suspend timeout handler to prevent the device from getting stuck during the suspend/resume process. The suspend timeout handler will dump disk-sleep tasks at the first-round timeout and trigger a kernel panic at the second-round timeout. The default timer for each round is 30 seconds.

Note: the following commands can be used to simulate a suspend hang for testing.

adb shell echo 1 > /sys/power/pm_hang
adb shell echo mem > /sys/power/state

Bug: 132733612
Change-Id: Ifc8aa4ee9145187d14511e29529ba50a4b19324e
Signed-off-by: josephjang <josephjang@google.com>
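A hedged sketch of such a two-round watchdog (hypothetical names; assumes a plain kernel timer plus the existing show_state_filter() helper for dumping tasks in disk sleep):

== suspend watchdog (sketch) ==
#include <linux/timer.h>
#include <linux/sched/debug.h>

#define SUSPEND_TIMEOUT		(30 * HZ)	/* one round */

static struct timer_list suspend_timer;	/* armed at suspend entry */
static int suspend_timer_rounds;

static void suspend_timeout_fn(struct timer_list *t)
{
	if (++suspend_timer_rounds == 1) {
		/* Round one: dump tasks stuck in disk sleep. */
		show_state_filter(TASK_UNINTERRUPTIBLE);
		mod_timer(&suspend_timer, jiffies + SUSPEND_TIMEOUT);
		return;
	}
	/* Round two: give up so a ramdump can be collected. */
	panic("suspend/resume watchdog expired twice");
}
==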
Commit 8280b3a
arm64: kernel: Allow perf events without tracing
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit e6d5f6a
drivers: char: mem: Reroute random fops to urandom
Arter has done a similar commit, where the random fops routed the read hook to the urandom_read method. However, that leads to a warning about random_read being unused, as well as leaving the poll hook still linked to random_poll. This commit should solve both of those issues at the root.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit fb4c4d3
cnss: Do not mandate TESTMODE for netlink driver
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit dafc5d0
Red Hat Linux documentation states that vmstat updates can be mildly expensive. Increase the interval from 1 second to 1 minute.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
[ghostrider-reborn] Reduce to 10s
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Commit 0fdf195
sched: Process new forks before processing their parent
This should let brand new tasks launch marginally faster. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit 56a709c
sched: Enable NEXT_BUDDY for better cache locality
By scheduling the last woken task first, we can increase cache locality since that task is likely to touch the same data as before. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
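In mainline terms this is a one-line default flip in kernel/sched/features.h; the definition reads roughly:

== kernel/sched/features.h (sketch) ==
/*
 * Prefer to schedule the task we woke last (assuming it is still
 * runnable), since it is likely to touch the same cache-hot data.
 */
SCHED_FEAT(NEXT_BUDDY, true)
==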
Commit fb773a2
sched: Reduce the opportunity for sleepers to preempt
Require the current task to be surpassing the new task by 5ms instead of 1ms before preemption occurs. Reduce jitter due to less frequent task interruptions. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Commit b2cf589
sched: Ideally use 10ms scheduling periods
Also, set the minimum task granularity to 1ms. Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
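A sketch of the CFS tunable defaults this implies (values in nanoseconds; these are the standard sysctl_sched_* variables in kernel/sched/fair.c, shown with the values the commit message describes):

== CFS tunables (sketch) ==
/* Target preemption latency ("scheduling period"): 10 ms. */
unsigned int sysctl_sched_latency		= 10000000UL;

/* Minimal preemption granularity per task: 1 ms. */
unsigned int sysctl_sched_min_granularity	= 1000000UL;
==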
Commit 04da7b2
Revert "sched/core: Fix PI boosting between RT and DEADLINE tasks"
This reverts commit ea8e50a.
Commit a239651
arm64: dts: Allow big cluster to idle in USB perf mode
This will save a lot of power should the USB IRQ ever get affined to cpu6 or cpu7, whether it's for testing or balancing. The extra 6 µs for the big cluster's WFI state won't hurt. Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit d73db68
cpu: Silence log spam when a CPU is brought up
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: celtare21 <celtare21@gmail.com> Signed-off-by: Carlos Jimenez (JavaShin-X) <javashin1986@gmail.com>
Commit 9763a2a
workqueue: Schedule workers on CPU0 or 0-5 by default
For regular bound workers that don't request to be queued onto a specific CPU, just use CPU0 to save power. Additionally, adjust the CPU affinity of unbound workqueues to force their workers onto the power cluster (CPU0/CPU1) to further improve power consumption. Signed-off-by: Sultanxda <sultanxda@gmail.com> Signed-off-by: Panchajanya1999 <rsk52959@gmail.com> Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Commit a19c680
sched: Do not give sleepers 50% more runtime
Sleepers are tasks that spend most of their time asleep. When sleepers are placed onto the runqueue, their vruntime is reduced by one full scheduling latency period so that interactive tasks are given priority over sleepers. The sched feature GENTLE_FAIR_SLEEPERS attempts to reduce the penalty given to sleeper tasks by only deducting half of a scheduling latency period, rather than a full one. This allows the sleeper to catch up in terms of vruntime with other tasks until it is preempted once again.

A forked child task is not considered a sleeper task, so app launch performance will not regress. This commit should give interactive tasks more runtime while background tasks are awoken.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
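For reference, the mechanism being tuned lives in place_entity() in kernel/sched/fair.c; in upstream form the sleeper credit looks roughly like this, and the GENTLE_FAIR_SLEEPERS feature decides whether the full scheduling latency period or only half of it is deducted:

== place_entity() excerpt (upstream form) ==
if (!initial) {
	unsigned long thresh = sysctl_sched_latency;

	/*
	 * Halve their sleep time's effect, to allow
	 * for a gentler effect of sleepers:
	 */
	if (sched_feat(GENTLE_FAIR_SLEEPERS))
		thresh >>= 1;

	vruntime -= thresh;
}
==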
Commit 5bb8ae5
mm: compaction: Add automatic compaction mechanism
This adds a simple implementation of a scheduled compaction mechanism. The scheduled compaction is executed compaction_soff_delay_ms (default 3000ms) after the screen gets turned off, unless the compaction_timeout_ms (default 900000ms) check determines that a memory compaction is not yet required. The workqueue is freezable to indicate that this shouldn't be running during suspend. The parameters for the screen-off delay and the forced timeout are configurable.

Signed-off-by: Alex Naidis <alex.naidis@linux.com>
Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com>
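A hedged sketch of the mechanism described above (hypothetical names; assumes a freezable workqueue plus a screen-state notifier, with compact_nodes() standing in for the same path the vm.compact_memory sysctl takes):

== scheduled compaction (sketch) ==
static struct workqueue_struct *compact_wq;	/* alloc_workqueue(..., WQ_FREEZABLE, 0) */
static struct delayed_work compact_work;
static unsigned long last_compact_jiffies;

static unsigned long compaction_soff_delay_ms = 3000;	/* delay after screen off */
static unsigned long compaction_timeout_ms = 900000;	/* minimum interval */

static void compact_work_fn(struct work_struct *work)
{
	/* Skip if the previous compaction is recent enough. */
	if (time_before(jiffies, last_compact_jiffies +
			msecs_to_jiffies(compaction_timeout_ms)))
		return;

	compact_nodes();
	last_compact_jiffies = jiffies;
}

/* Called from the display notifier when the screen turns off. */
static void on_screen_off(void)
{
	queue_delayed_work(compact_wq, &compact_work,
			   msecs_to_jiffies(compaction_soff_delay_ms));
}
==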
Commit dd0ed00
mm: compaction: switch FB notifier API to MSM DRM notifier
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com>
Commit 0d3273f
Commit b6b9c16
Commit 95c6e76
irqchip/qcom: mpm: Remove IRQF_NO_SUSPEND
* IRQF_NO_SUSPEND and enable_irq_wake should not be used simultaneously, as stated by Google in their previous commits
* fixes misconfigured mpm irq causing too many wakeups from suspend
Commit 29c682b
This change is for general scheduler improvement. Change-Id: I6cb5de5a2da4bac8651a35ab9e162f290f07a51b Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> (cherry picked from commit 7b9a9c98e1d329f134393c0076d87bba059798c3)
Commit 70f9b27
sched: fair: Fix load balancing for big tasks
cpu_capacity() returns maximum capacity when WALT is disabled, hence we couldn't take advantage of CAF's optimization. Return the CPU's original capacity instead to make it usable.

Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
(cherry picked from commit 44de7d59e0267ad72e3e3d741e3a9428b19aa1d7)
Commit b09b99b
sched/fair: Don't remove important task migration logic from PELT
This task migration logic is guarded by a WALT #ifdef even though it has nothing specific to do with WALT. The result is that, with PELT, boosted tasks can be migrated to the little cluster, causing visible stutters. Move the WALT #ifdef so PELT can benefit from this logic as well. Thanks to Zachariah Kennedy <zkennedy87@gmail.com> and Danny Lin <danny@kdrag0n.dev> for discovering this issue and creating this fix. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> (cherry picked from commit 99bbfad46762cd9da629764e19916541188c4a81)
Commit caf5881
drivers: misc: power: implement usb2 fast charge mode
echo 0 > /sys/kernel/fast_charge/force_fast_charge (disable)
echo 1 > /sys/kernel/fast_charge/force_fast_charge (enable)

Enables force charging up to 900mA in USB2 mode.

Signed-off-by: engstk <eng.stk@sapo.pt>
Commit a800bb4
drivers: fastchg: Force fast charge by default
For some ROMs that do not expose the toggle in userspace, users cannot switch it unless rooted. Rather, keep it enabled by default.

Change-Id: Iff4f755f31123b82fc258342d3d0f71f0e286716
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Commit: 54b8f4f
thermal: core: Custom thermal limits handling
Change-Id: If9e10d0f59e268240946a18a4a3015581559ae15
Commit: c0e0141
thermal: core: Use qcom drm notifier
Change-Id: Ib184c28c0441847399df567466a04b48b9b68c46
Commit: 5179664
thermal: cpu_cooling: Simplify cpu_limits_set_level
Signed-off-by: Yaroslav Furman <yaro330@gmail.com> Change-Id: I802cd25298f98e4731b667b2db87895e1091de21
Commit: 76e6054
thermal: core: Import Xiaomi board_sensor and thermal message changes
Change-Id: Ia19f1ed5fd5d9c0e71dc6f8bef53f6e7757a678f
Commit: aa263f5
Fixes the following clang warning: ../drivers/thermal/thermal_core.c:2694:8: error: 'snprintf' size argument is too large; destination buffer has size 128, but size argument is 4096 [-Werror,-Wfortify-source] ret = snprintf(boost_buf, PAGE_SIZE, buf); ^ 1 error generated. Change-Id: I41c56cb39f0ac508fd0e54fabfc8844c64f66d09 Signed-off-by: Khusika Dhamar Gusti <mail@khusika.com>
Commit: 9fd5448
msm: tsens: fix broken loggers
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Commit: 137fcce
Fixes the following clang warning: drivers/thermal/thermal_core.c:1773:8: warning: 'snprintf' size argument is too large; destination buffer has size 128, but size argument is 4096 [-Wfortify-source] snprintf(board_sensor_temp, PAGE_SIZE, buf); ^ 1 warning generated.
Commit: 1760a39
thermal: core: skip update disabled thermal zones after suspend
It is unnecessary to update disabled thermal zones after suspend, and doing so sometimes leads to errors/warnings in badly behaved thermal drivers. Bug: 129435616 Change-Id: If5d3bfe84879779ec1ee024c0cf388ea3b4be2ea Signed-off-by: Wei Wang <wvw@google.com>
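The gist of the change, as a sketch of a post-suspend update loop. The get_mode callback and thermal_zone_device_update() follow the 4.x thermal core API, but the surrounding function and the list parameter are illustrative:

```c
#include <linux/thermal.h>
#include <linux/list.h>

static void thermal_update_zones_after_suspend(struct list_head *tz_list)
{
	struct thermal_zone_device *tz;
	enum thermal_device_mode mode;

	list_for_each_entry(tz, tz_list, node) {
		/* skip zones whose driver reports them as disabled */
		if (tz->ops->get_mode &&
		    !tz->ops->get_mode(tz, &mode) &&
		    mode == THERMAL_DEVICE_DISABLED)
			continue;
		thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
	}
}
```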
Commit: 006cad6
drivers: thermal: Ignore spurious BCL interrupts whenever BCL is in polling
Whenever a BCL interrupt triggers, it notifies the thermal framework. The framework disables the BCL interrupt and initiates passive polling to monitor the clear threshold. But BCL peripheral interrupts are lazily disabled by default: even after BCL has initiated the interrupt disable, the hardware may take some time to actually disable it, and during that window the hardware can trigger the interrupt again. The BCL driver then assumes it is a spurious interrupt and disables the interrupt again, causing permanent disablement of that interrupt. If a BCL interrupt triggers again after BCL has disabled the interrupt, just ignore it to avoid nested interrupt disablement. In this scenario BCL is already in polling mode, so ignoring the spurious interrupt doesn't cause any issue. Bug: 118493676 Change-Id: Ia77fc66eaf66f97bacee96906cc6a5735a6ed158 Signed-off-by: Manaf Meethalavalappu Pallikunhi <manafm@codeaurora.org> Signed-off-by: Wei Wang <wvw@google.com>
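A sketch of the guard the commit describes (illustrative, not the actual BCL driver): once a threshold has switched to passive polling, a late hardware retrigger is acknowledged and dropped instead of being disabled a second time, which would unbalance the IRQ disable depth:

```c
#include <linux/interrupt.h>
#include <linux/workqueue.h>

/* hypothetical per-threshold state */
struct bcl_threshold {
	bool polling;			/* passive polling in progress */
	struct work_struct notify_work;	/* notifies the thermal framework */
};

static irqreturn_t bcl_irq_handler(int irq, void *data)
{
	struct bcl_threshold *thresh = data;

	/* a retrigger racing with the lazy hardware disable: drop it
	 * rather than disabling again, which would nest the disable and
	 * leave the interrupt off permanently */
	if (thresh->polling)
		return IRQ_HANDLED;

	thresh->polling = true;
	disable_irq_nosync(irq);	/* lazy: hw may still fire once more */
	schedule_work(&thresh->notify_work);
	return IRQ_HANDLED;
}
```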
Commit: 93a828c
drivers: arch_topology: wire up thermal limit for arch_scale_max_freq_capacity
Before patch, with "echo 50000 > /sys/class/thermal/tz-by-name/sdm-therm/emul_temp": com.android.uibench.janktests.UiBenchJankTests#testInvalidateTree: PASSED (02m6.247s) -- gfx-avg-jank: 43.678, gfx-max-jank: 96.67, gfx-avg-frame-time-50: 9.6, gfx-avg-frame-time-90: 11.0, gfx-avg-frame-time-95: 11.0, gfx-avg-frame-time-99: 11.8, gfx-max-frame-time-50: 12, gfx-max-frame-time-90: 13, gfx-max-frame-time-95: 13, gfx-max-frame-time-99: 13, gfx-avg-high-input-latency: 74.25, gfx-max-high-input-latency: 99.87, gfx-avg-num-frame-deadline-missed: 1.6, gfx-max-num-frame-deadline-missed: 3, gfx-avg-slow-ui-thread: 0.071, gfx-max-slow-ui-thread: 0.133, gfx-avg-total-frames: 2250, gfx-min-total-frames: 2250, gfx-max-total-frames: 2251; all slow-draw, slow-bitmap-uploads and missed-vsync metrics: 0.0
After patch, with the same emul_temp setting: google/perf/jank/UIBench/UIBench (1 Test) com.android.uibench.janktests.UiBenchJankTests#testInvalidateTree: PASSED (02m7.027s) -- gfx-avg-jank: 0.0, gfx-max-jank: 0.0, gfx-avg-frame-time-50: 7.0, gfx-avg-frame-time-90: 7.2, gfx-avg-frame-time-95: 7.8, gfx-avg-frame-time-99: 8.0, gfx-max-frame-time-50: 7, gfx-max-frame-time-90: 8, gfx-max-frame-time-95: 8, gfx-max-frame-time-99: 8, gfx-avg-high-input-latency: 11.54, gfx-max-high-input-latency: 41.16, gfx-avg-num-frame-deadline-missed: 0.0, gfx-max-num-frame-deadline-missed: 0, gfx-avg-slow-ui-thread: 0.0, gfx-max-slow-ui-thread: 0.0, gfx-avg-total-frames: 2250, gfx-min-total-frames: 2250, gfx-max-total-frames: 2250; all slow-draw, slow-bitmap-uploads and missed-vsync metrics: 0.0
Bug: 143162654 Test: use emul_temp to change thermal condition and see capacity changed Change-Id: Idbf943f9c831c288db40d820682583ade3bbf05e Signed-off-by: Wei Wang <wvw@google.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit: ab53fa0
msm: kgsl: Don't allocate memory dynamically for drawobj sync structs
The memory allocated dynamically here is just used to store a single instance of a struct. Allocate both possible structs on the stack instead of allocating them dynamically to improve performance. These structs also do not need to be zeroed out. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: celtare21 <celtare21@gmail.com>
Commit: 4248f34
msm: kgsl: Don't allocate memory dynamically for temp command buffers
The temporary command buffer in _set_pagetable_gpu is only the size of a single page, and _set_pagetable_gpu is never executed concurrently. It is therefore easy to replace the dynamic command buffer allocation with a static one to improve performance by avoiding the latency of dynamic memory allocation. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: celtare21 <celtare21@gmail.com>
Commit: fd8e2f4
msm: kgsl: Avoid dynamically allocating small command buffers
Most command buffers here are rather small (fewer than 256 words); it's a waste of time to dynamically allocate memory for such a small buffer when it could easily fit on the stack. Conditionally using an on-stack command buffer when the size is small enough eliminates the need for using a dynamically-allocated buffer most of the time, reducing GPU command submission latency. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: celtare21 <celtare21@gmail.com>
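The three kgsl commits above share one pattern; here is a sketch of it (buffer threshold and the submit hook are illustrative): use an on-stack buffer for the common small case and fall back to the allocator only for genuinely large command streams:

```c
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>

#define SMALL_CMD_WORDS 256	/* small enough to live on the kernel stack */

long ringbuffer_issue(const u32 *cmds, size_t dwords);	/* hypothetical */

static long submit_cmds(const u32 *cmds, size_t dwords)
{
	u32 onstack[SMALL_CMD_WORDS];
	u32 *buf = onstack;
	long ret;

	/* only pay the allocation latency for genuinely large streams */
	if (dwords > SMALL_CMD_WORDS) {
		buf = kmalloc_array(dwords, sizeof(*buf), GFP_KERNEL);
		if (!buf)
			return -ENOMEM;
	}

	memcpy(buf, cmds, dwords * sizeof(*buf));
	ret = ringbuffer_issue(buf, dwords);

	if (buf != onstack)
		kfree(buf);
	return ret;
}
```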
Commit: 2776d51
cpumask: Add cpumasks for big and LITTLE CPU clusters
Add cpu_lp_mask and cpu_perf_mask to represent the CPUs that belong to each cluster in a dual-cluster, heterogeneous system. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
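A sketch of how such masks could be populated at boot, assuming a fixed topology; the 6+2 CPU split is illustrative, and the real commit derives the masks from the hardware rather than hard-coding them:

```c
#include <linux/cpumask.h>
#include <linux/init.h>

static struct cpumask lp_mask, perf_mask;

const struct cpumask *const cpu_lp_mask = &lp_mask;
const struct cpumask *const cpu_perf_mask = &perf_mask;

static int __init init_cluster_masks(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		/* illustrative dual-cluster split: CPUs 0-5 little, 6-7 big */
		if (cpu < 6)
			cpumask_set_cpu(cpu, &lp_mask);
		else
			cpumask_set_cpu(cpu, &perf_mask);
	}
	return 0;
}
early_initcall(init_cluster_masks);
```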
Commit: 05c8819
ARM64: configs: {davinci,phoenix,toco,tucana}: Declare cpu masks
Change-Id: I07980dd72f06d74d1f989d5524802776642a9d83
Commit: b69ba8f
kernel: Add bi-cluster API to affine IRQs and kthreads to fast CPUs
On devices with a CPU that contains heterogeneous cores (e.g., big.LITTLE), it can be beneficial to place some performance-critical IRQs and kthreads onto the fast CPU clusters in order to improve performance. This commit adds the following APIs:
- kthread_run_perf_critical() to create and start a perf-critical kthread and affine it to the given performance CPUs
- irq_set_perf_affinity() to mark an active IRQ as perf-critical and affine it to the given performance CPUs
- IRQF_PERF_AFFINE to schedule an IRQ and any threads it may have onto the given performance CPUs
- PF_PERF_CRITICAL to mark a process (mainly a kthread) as performance critical (this is used by kthread_run_perf_critical())
In order to accommodate this new API, the following changes are made:
- Performance-critical IRQs are distributed evenly among the online CPUs available in the given performance CPU mask
- Performance-critical IRQs have their affinities reaffined upon exit from suspend (since the affinities are broken when non-boot CPUs are disabled)
- Performance-critical IRQs and their threads have their affinities reset upon entering suspend, so that upon immediate suspend exit (when only the boot CPU is online), interrupts can be processed and interrupt threads can be scheduled onto an online CPU (otherwise we'd hit a kernel BUG)
- __set_cpus_allowed_ptr() is modified to enforce a performance-critical kthread's affinity
- Perf-critical IRQs aren't allowed to have their affinity changed by userspace
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
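A hedged usage sketch of the APIs listed above. The exact prototypes are inferred from the description (a cpumask argument for the kthread helper, IRQF_PERF_AFFINE as a request_irq() flag), so treat the signatures as assumptions rather than the commit's literal interface:

```c
#include <linux/kthread.h>
#include <linux/interrupt.h>
#include <linux/cpumask.h>
#include <linux/err.h>

static int render_worker(void *data)
{
	while (!kthread_should_stop())
		schedule();	/* placeholder work loop */
	return 0;
}

static int setup_perf_critical(int irq, irq_handler_t handler, void *dev)
{
	struct task_struct *t;

	/* assumed signature: create a kthread affined to the fast cluster */
	t = kthread_run_perf_critical(cpu_perf_mask, render_worker, NULL,
				      "render_worker");
	if (IS_ERR(t))
		return PTR_ERR(t);

	/* IRQF_PERF_AFFINE keeps the IRQ (and its thread) on the fast CPUs */
	return request_irq(irq, handler, IRQF_PERF_AFFINE, "render", dev);
}
```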
Commit: 4c2caf5
kernel: Extend the perf-critical API to little CPUs
It's helpful to be able to affine low-priority kthreads to the little CPU, such as for deferred memory cleanup. Extend the perf-critical API to make this possible. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Commit: c14e921
kernel: Warn when an IRQ's affinity notifier gets overwritten
An IRQ affinity notifier getting overwritten can point to some annoying issues which need to be resolved, like multiple pm_qos objects being registered to the same IRQ. Print out a warning when this happens to aid debugging. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Commit: cd6246c
workqueue: Affine unbound workqueues to little CPUs by default
Although unbound workqueues are eligible to run their workers on any CPU, the workqueue subsystem prefers scheduling workers onto the CPU which queues them. This results in unbound workers consuming valuable CPU time on the big and prime CPU clusters. We can alleviate the burden of kernel housekeeping on the more important CPUs by moving the unbound workqueues to the little CPU cluster by default. This may also reduce power consumption, which is a plus. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
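A sketch of the effect, not the literal patch: the actual change seeds the default value of wq_unbound_cpumask, whereas this snippet applies the same mask at boot through workqueue_set_unbound_cpumask(), the in-kernel setter behind the workqueue sysfs knob. cpu_lp_mask comes from the cpumask commit above:

```c
#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/init.h>

static int __init affine_unbound_workqueues(void)
{
	cpumask_var_t mask;
	int ret;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	/* steer all unbound workers onto the little cluster */
	cpumask_copy(mask, cpu_lp_mask);
	ret = workqueue_set_unbound_cpumask(mask);

	free_cpumask_var(mask);
	return ret;
}
late_initcall(affine_unbound_workqueues);
```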
Commit: 157cb86
drm/msm: Affine important kthreads to the big CPU cluster
These kthreads (particularly crtc_event and crtc_commit) play a major role in rendering frames to the display, so affine them to the big CPU cluster, matching the DRM IRQ. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Change-Id: I3087f4437a309bb57fdae94fb139a68e749983ca
Commit: 09224e2
drm/msm/sde: Remove pm_qos usage
Change-Id: I49e2d258b25d0cbb66554d4dc7bc6d67c2600903 Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Commit: 71ac9d7
ARM: dts: msm: Remove pm-qos values
Change-Id: I24affcddfa4395ae983c7ac143fbee289ef21bdb Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Commit: 00ff3af
techpack: sm6150: Remove qos usage
Change-Id: I69e6010b20f895faf61951d97e9028c8909bada7 Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Commit: 56d0d90
sde: rotator: Remove pm_qos usage
Change-Id: I5258f76f641807a2fb88251e30da413a554ab3eb Signed-off-by: celtare21 <celtare21@gmail.com> Signed-off-by: Kailash <kailash.sudhakar@gmail.com> Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Commit: 96141f8
drm: Reduce latency while processing atomic ioctls
Unfortunately, our display driver must occasionally sleep inside start_atomic(), which causes the CPU running the ioctl to schedule and potentially idle. Depending on how deeply the running CPU idles, there can be quite a bit of latency added to processing the "atomic" ioctl as a result, which hurts display rendering latency. We can mitigate this effect somewhat by forcing the CPU running the ioctl to stick to shallow idle states. This is done by forcing the task running the ioctl to stick to its current CPU and then using PM QoS to keep said CPU from entering idle states that take more than 100 us to exit. We can also further reduce the rendering latency by elevating the priority of the task running the ioctl, so that after it goes to sleep, it will wake up sooner when the wait inside start_atomic() is over. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
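A sketch of the latency-pinning half of this, using the kernel's classic PM QoS request API around a hypothetical ioctl body. Note this global PM_QOS_CPU_DMA_LATENCY request is a simplification; the commit also pins the task to its current CPU and raises its priority:

```c
#include <linux/pm_qos.h>

long drm_atomic_ioctl_body(void *dev, void *data);	/* hypothetical */

static long drm_atomic_ioctl_lowlat(void *dev, void *data)
{
	struct pm_qos_request req;
	long ret;

	/* forbid idle states with an exit latency above 100 us while the
	 * ioctl runs, so the sleep inside the commit path wakes quickly */
	pm_qos_add_request(&req, PM_QOS_CPU_DMA_LATENCY, 100);

	ret = drm_atomic_ioctl_body(dev, data);

	pm_qos_remove_request(&req);
	return ret;
}
```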
Commit: 146ebbc
msm: kgsl: Stop slab shrinker when no more pages can be reclaimed
do_shrink_slab() scans each shrinker in batches of at most batch_size (128) pages at a time until total_scan pages are scanned or until shrinker returns SHRINK_STOP. Under heavy memory pressure total_scan can be large (in thousands) and kgsl_pool_shrink_scan_objects() ends up returning 0 after all pages that were reclaimable are reclaimed. This results in multiple calls to kgsl_pool_shrink_scan_objects() that do not reclaim any memory. To prevent this kgsl_pool_shrink_scan_objects() is modified to return SHRINK_STOP as soon as no more memory can be reclaimed. Bug: 69931996 Test: tested using alloc-stress with additional traces Change-Id: Ia48fc2c0d888c54ec9642c0b0962a70ca3cb4c5e Signed-off-by: Suren Baghdasaryan <surenb@google.com>
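The shape of the fix, sketched against the shrinker API; the pool-reduction helper is a stand-in for the driver's internals:

```c
#include <linux/shrinker.h>

unsigned long kgsl_pool_reduce(unsigned long nr);	/* stand-in */

static unsigned long
kgsl_pool_shrink_scan_objects(struct shrinker *shrinker,
			      struct shrink_control *sc)
{
	unsigned long freed = kgsl_pool_reduce(sc->nr_to_scan);

	/* nothing left to reclaim: stop do_shrink_slab() from rescanning
	 * this shrinker in further batch_size chunks */
	return freed ? freed : SHRINK_STOP;
}
```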
Commit: 703c400
drivers: gpu: drm: sde_sb: Specify sync probe for sde_wb
Bug: 115776306 Bug: 77146523 Change-Id: I1ecd92b74552f3bd4ecc105605886bd61289eb03 Signed-off-by: Miguel de Dios <migueldedios@google.com>
Commit: 6867c75
drivers: gpu: drm: msm_smmu: Specify sync probe for msmdrm_smmu
Bug: 115776306 Bug: 77146523 Change-Id: I6cabdc15cd66f734a3ddd4fa116111a0ee3325a6 Signed-off-by: Miguel de Dios <migueldedios@google.com>
Commit: 886e036
ion: adjust system heap pool orders
MIUI-1567325 As part of the change to fix unmovable block migration over time, we need to reduce the orders requested by ion to under order-5 allocations. In a future CL, we can add fixed-size larger-order pools to improve performance when allocating large buffers. Change-Id: Iabb6af6911935256e8335a77f95ce1edc64b78a7 Signed-off-by: wangdong12 <wangdong12@xiaomi.com> (cherry picked from commit 82e48cf5c576e7c0286c5ce92555ec4b1941ab07)
Commit: 40df068
mm: improve migration heuristic
MIUI-1567325 This change adjusts the page migration heuristic to reduce unmovable allocations Change-Id: I6d4681e4d349a8cf78c5aaceea8f83bad4c46e09 Signed-off-by: wangdong12 <wangdong12@xiaomi.com> (cherry picked from commit ea159061eb8daf96226a7ffed28d447172c5d67f)
Commit: 6ad8409
proc: some optimizations for reclaim
BSPSYS-10620 This change includes: - fix a reclaim count calculation issue - only reclaim un-shared pages - avoid burning CPU once nr_to_reclaim pages have been reclaimed - export the reclaim_pte_range() API for rtmm Change-Id: I04e77a2811f5ac4cae6d261c3eaaf88e26eee2ce Signed-off-by: fangqinwu <fangqinwu@xiaomi.com> (cherry picked from commit 7e67f10971240af85c0f1bee0ea38f639014261e)
Commit: d472c36
MIUI-1428085 Change-Id: I7c910321b66c6877cbc5656b3b3e426557dc3314 Signed-off-by: xiongping1 <xiongping1@xiaomi.com>
Commit: 5e878cb
f2fs: avoid needless checkpoint during fsync
MIUI-1728660 This patch enhances the performance of fsync. Doing a checkpoint every time an xattr is set on a directory can seriously affect fsync performance. By adding the inode to the tracking list when setting an xattr, we can avoid the checkpoint when the xattr of the parent directory is not set. Change-Id: I0ccf34212f77fb686dd86b705b78554c29bd38f8 Signed-off-by: liuchao12 <liuchao12@xiaomi.com> (cherry picked from commit fd0c10e2600e817347bc708ad6b76b3c9bd61d18)
Commit: dea7587
Revert "arm64: vdso: Fix clock_getres() for CLOCK_REALTIME"
This reverts commit fcc6b24. Signed-off-by: Khusika Dhamar Gusti <khusikadhamar@gmail.com> Change-Id: Ib983c72f8d0868c4c2d50e31d0e5a48fb4f4354e
Commit: 03f01a7
Revert "ARM: vdso: Remove dependency with the arch_timer driver inter…
…nals" This reverts commit a66a739. Signed-off-by: Khusika Dhamar Gusti <khusikadhamar@gmail.com> Change-Id: Ic7d585615698ee688b5a3999da07d3118563bb78
Commit: 882536e
FROMLIST: [PATCH v3 1/3] arm64: compat: Split the sigreturn trampolines and kuser helpers (C sources)
(cherry picked from url http://lkml.iu.edu/hypermail/linux/kernel/1709.1/01901.html) AArch32 processes are currently given a special [vectors] page that contains the sigreturn trampolines and the kuser helpers, at the fixed address mandated by the kuser helpers ABI. Having both functionalities in the same page has become problematic, because: * It makes it impossible to disable the kuser helpers (the sigreturn trampolines cannot be removed), which is possible on arm. * A future 32-bit vDSO would provide the sigreturn trampolines itself, making those in [vectors] redundant. This patch addresses the problem by moving the sigreturn trampolines to a separate [sigpage] page, mirroring [sigpage] on arm. Even though [vectors] has always been a misnomer on arm64/compat, as there is no AArch32 vector there (and now only the kuser helpers), its name has been left unchanged, for compatibility with arm (there are reports of software relying on [vectors] being there as the last mapping in /proc/maps). mm->context.vdso used to point to the [vectors] page, which is unnecessary (as its address is fixed). It now points to the [sigpage] page (whose address is randomized like a vDSO). Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Bug: 9674955 Bug: 63737556 Bug: 20045882 Change-Id: I52a56ea71d7326df8c784f90eb73b5c324fe9d20 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: a631a89
FROMLIST: [PATCH v3 2/3] arm64: compat: Split the sigreturn trampolines and kuser helpers (assembler sources)
(cherry picked from url http://lkml.iu.edu/hypermail/linux/kernel/1709.1/01902.html) AArch32 processes are currently given a special [vectors] page that contains the sigreturn trampolines and the kuser helpers, at the fixed address mandated by the kuser helpers ABI. Having both functionalities in the same page has become problematic, because: * It makes it impossible to disable the kuser helpers (the sigreturn trampolines cannot be removed), which is possible on arm. * A future 32-bit vDSO would provide the sigreturn trampolines itself, making those in [vectors] redundant. This patch addresses the problem by moving the sigreturn trampoline sources to their own file. Wrapped the comments to reduce the wrath of checkpatch.pl. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Bug: 9674955 Bug: 63737556 Bug: 20045882 Change-Id: I1d7b96e7cfbe979ecf4cb4996befd1f3ae0e64fd Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: b86f4c0
FROMLIST: [PATCH v3 3/3] arm64: compat: Add CONFIG_KUSER_HELPERS
(cherry picked from url http://lkml.iu.edu/hypermail/linux/kernel/1709.1/01903.html) Make it possible to disable the kuser helpers by adding a KUSER_HELPERS config option (enabled by default). When disabled, all kuser helpers-related code is removed from the kernel and no mapping is done at the fixed high address (0xffff0000); any attempt to use a kuser helper from a 32-bit process will result in a segfault. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Bug: 9674955 Bug: 63737556 Bug: 20045882 Change-Id: Ie8c543301d39bfe88ef71fb6a669e571914b117b Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: c5affb6
FROMLIST: [PATCH v5 01/12] arm: vdso: rename vdso_datapage variables
(cherry picked from url https://patchwork.kernel.org/patch/10044505/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Rename seq_count to tb_seq_count. Rename tk_is_cntvct to use_syscall. Rename cs_mult to cs_mono_mult. All to align with the variables in the arm64 vdso datapage. Rework vdso_read_begin() and vdso_read_retry() functions to reflect modern access patterns for tb_seq_count field. Update copyright message to reflect the start of the contributions in this series. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I13f16e71b1ecba3d72b999caafef72e3c7f48dfe Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 2a0ae16
FROMLIST: [PATCH v5 02/12] arm: vdso: add include file defining __get_datapage()
(cherry picked from url https://patchwork.kernel.org/patch/10044481/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Define the prototype for __get_datapage() in a local datapage.h header. Rename all vdata variables that point to the datapage to the shorter vd, to reflect a consistent and concise style. Make sure that all references to the datapage in vdso operations are readonly (const). Make sure the datapage is the first parameter to all subroutines, also for consistency. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I9512b49d36d53ca1b71d3ff82219a7c64e0fc613 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 44dbd47
FROMLIST: [PATCH v5 03/12] arm: vdso: inline assembler operations to compiler.h
(cherry picked from commit https://patchwork.kernel.org/patch/10044507/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Move compiler-specific code to a local compiler.h file: - CONFIG_AEABI dependency check. - System call fallback functions standardized into a DEFINE_FALLBACK macro. - Replace arch_counter_get_cntvct() with arch_vdso_read_counter. - Deal with architecture specific unresolved references emitted by GCC. - Optimize handling of fallback calls in callers. - For time functions that always return success, do not waste time checking the return value for a switch to the fallback. - Optimize the unlikely nullptr check in __vdso_gettimeofday: if tv is null, there is no need to proceed to the fallback, as the vdso is still capable of filling in the tv values. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I468e4c32b5136d199982bf25df8967321e384d90 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 44c8f58
FROMLIST: [PATCH v5 04/12] arm: vdso: do calculations outside reader loops
(cherry picked from url https://patchwork.kernel.org/patch/10044477/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. In variable timer reading loops, pick up just the values until all are synchronized, then outside of the loop pick up cntvct and perform the calculations to determine the final offset and the shifted, multiplied output value. This replaces get_ns with get_clock_shifted_nsec as the cntvct reader. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I8008197f08485ef89b267128e41624ff69c33f6b Signed-off-by: khusika <khusikadhamar@gmail.com>
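A sketch of the reader-loop discipline this patch settles on, using the series' renamed fields (tb_seq_count, cs_cycle_last, cs_mono_mult). The struct is abbreviated and the counter read is a per-arch stand-in; only the loop structure is the point:

```c
#include <linux/compiler.h>
#include <linux/types.h>
#include <asm/barrier.h>

/* abbreviated datapage layout; field names follow the series' renames */
struct vdso_data {
	u32 tb_seq_count;	/* odd while the kernel updates the page */
	u64 cs_cycle_last;	/* counter value at the last tick */
	u32 cs_mono_mult;	/* clocksource multiplier */
	u32 cs_shift;		/* clocksource shift */
};

u64 arch_vdso_read_counter(void);	/* per-arch, e.g. cntvct on arm */

static notrace u64 get_clock_shifted_nsec_sketch(const struct vdso_data *vd)
{
	u64 cycle_last, mult, cycle_now;
	u32 seq;

	do {
		/* spin while an update is in progress (odd count) */
		do {
			seq = READ_ONCE(vd->tb_seq_count);
		} while (seq & 1);
		smp_rmb();			/* seq before data reads */
		cycle_last = vd->cs_cycle_last;
		mult = vd->cs_mono_mult;
		smp_rmb();			/* data reads before re-check */
	} while (READ_ONCE(vd->tb_seq_count) != seq);

	/* counter read and arithmetic happen once, outside the loop; the
	 * caller adds the base time and applies >> cs_shift */
	cycle_now = arch_vdso_read_counter();
	return (cycle_now - cycle_last) * mult;
}
```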
Commit: 5720473
FROMLIST: [PATCH v6 05/12] arm: vdso: Add support for CLOCK_MONOTONIC_RAW
(cherry pick from url https://patchwork.kernel.org/patch/10052099/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Add a case for CLOCK_MONOTONIC_RAW to match up with support that is available in arm64's vdso. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: If9c09d131e236ba4a483dbc122e6b876f471df72 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 4207e62
FROMLIST: [PATCH v5 06/12] arm: vdso: add support for clock_getres
(cherry picked from url https://patchwork.kernel.org/patch/10044545/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Add clock_getres vdso support to match up with existing support in the arm64's vdso. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: Ie37bf76d2992027f06a2cdd001d8654a860d2aac Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: af5a1d8
FROMLIST: [PATCH v5 07/12] arm: vdso: disable profiling
(cherry pick from url https://patchwork.kernel.org/patch/10044491/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Make sure kasan and ubsan profiling, and kcov instrumentation, is turned off for VDSO code. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I2b44c1edd81665b8bb235a65ba642767c35f1e61 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 4af60e2
FROMLIST: [PATCH v5 08/12] arm: vdso: Add ARCH_CLOCK_FIXED_MASK
(cherry picked from url https://patchwork.kernel.org/patch/10044543/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Add ARCH_CLOCK_FIXED_MASK as an efficiency since arm64 has no purpose for cs_mask vdso_data variable. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: Iadf94bed6166d2ee43bb46bdf54636618e4b8854 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: a3e0ffe
FROMLIST: [PATCH v5 09/12] arm: vdso: move vgettimeofday.c to lib/vdso/
(cherry pick from url https://patchwork.kernel.org/patch/10044497/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Declare arch/arm/vdso/vgettimeofday.c to be a candidate for a global implementation of the vdso timer calls. The hope is that new architectures can take advantage of the current unification of arm and arm64 implementations. We urge future efforts to merge their implementations into the global vgettimeofday.c file and thus provide functional parity. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: If7da1d8144684d52ed9520a581e6023c623df931 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: d3d3716
FROMLIST: [PATCH v5 10/12] arm64: vdso: replace gettimeofday.S with global vgettimeofday.C
(cherry picked from url https://patchwork.kernel.org/patch/10044501/) Take an effort from the previous 9 patches to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. apinski@cavium.com makes the following claims in the original patch: This allows the compiler to optimize the divide by 1000 and remove the other divides. On ThunderX, gettimeofday improves by 32%. On ThunderX 2, gettimeofday improves by 18%. Note I noticed a bug in the old (arm64) implementation of __kernel_clock_getres; it was checking only the lower 32 bits of the pointer; this would work for most cases but could fail in a few. <end of claim> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I71ff27ff5bfa323354fda6867b01ec908d8d6cbd Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 2b4a657
FROMLIST: [PATCH v5 11/12] lib: vdso: Add support for CLOCK_BOOTTIME
(cherry pick from url https://patchwork.kernel.org/patch/10044503/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. Add a case for CLOCK_BOOTTIME as it is popular for measuring relative time on systems expected to suspend() or hibernate(). Android uses CLOCK_BOOTTIME for all relative time measurements and timeouts. Switching to vdso reduced CPU utilization and improves accuracy. There is also a desire by some partners to switch all logging over to CLOCK_BOOTTIME, and thus this operation alone would contribute to a near percentile CPU load. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I76c26b054baf7f1100e03c65d6b16fe649b883b1 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 1b142bd
FROMLIST: [PATCH v5 12/12] lib: vdso: do not expose gettimeofday, if no arch supported timer
(cherry pick from url https://patchwork.kernel.org/patch/10044539/) Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. But instead of landing it in arm64, land the result into lib/vdso and unify both implementations to simplify future maintenance. If ARCH_PROVIDES_TIMER is not defined, do not expose gettimeofday. libc will default directly to syscall. Also ifdef clock_gettime switch cases and stubs if not supported and other unused components. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: James Morse <james.morse@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Gross <andy.gross@linaro.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Andrew Pinski <apinski@cavium.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Bug: 63737556 Bug: 20045882 Change-Id: I362a7114db0aac800e16eb90d14a8739e18f42e4 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 6fca3fe
FROMLIST: [PATCH] arm64: compat: Expose offset to registers in sigframes
(cherry picked from url https://patchwork.kernel.org/patch/10006025/) This will be needed to provide unwinding information in compat sigreturn trampolines, part of the future compat vDSO. There is no obvious header the compat_sig* struct's should be moved to, so let's put them in signal32.h. Also fix minor style issues reported by checkpatch. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andy Gross <andy.gross@linaro.org> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: I9c23dd6b56ca48c0953cbf78ccb7b49ded906052 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: ab615f3
FROMLIST: lib: vdso: add support for time
(cherry pick from url https://patchwork.kernel.org/patch/10053549/) Add time() vdso support to match up with existing support in the x86's vdso. Currently benefitting arm and arm64 which uses the common vgettimeofday.c implementation. On arm provides about a ~14 fold improvement in speed over the straight syscall, and about a ~5 fold improvement in speed over an alternate library implementation that relies on the vdso call to gettimeofday to fulfill the request. We can provide __vdso_time even if we can not provide a speed enhanced __vdso_gettimeofday. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Bug: 63737556 Bug: 20045882 Change-Id: I0bb3c6bafe57f9ed69350e2dd54edaae58316e8f Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 39251b0
FROMLIST: [PATCH 1/6] arm64: compat: Use vDSO sigreturn trampolines if available
(cherry pick from url https://patchwork.kernel.org/patch/10060449/) If the compat vDSO is enabled, it replaces the sigreturn page. Therefore, we use the sigreturn trampolines the vDSO provides instead. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: Ic0933741e321e1bf66409b7e190a776f12948024 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 2fc5bfd
FROMLIST: [PATCH 2/6] arm64: elf: Set AT_SYSINFO_EHDR in compat processes
(cherry pick from url https://patchwork.kernel.org/patch/10060431/) If the compat vDSO is enabled, we need to set AT_SYSINFO_EHDR in the auxiliary vector of compat processes to the address of the vDSO code page, so that the dynamic linker can find it (just like the regular vDSO). Note that we cast context.vdso to Elf64_Off, instead of elf_addr_t, because elf_addr_t is Elf32_Off in compat_binfmt_elf.c, and casting context.vdso to u32 would trigger a pointer narrowing warning. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: I5d0b191d3b2f4c0b2ec31fe9faef0246253635ce Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 365ef88
FROMLIST: BACKPORT: [PATCH 3/6] arm64: Refactor vDSO init/setup
(cherry pick from url https://patchwork.kernel.org/patch/10060439/) Move the logic for setting up mappings and pages for the vDSO into static functions. This makes the vDSO setup code more consistent with the compat side and will allow to reuse it for the future compat vDSO. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: I13e84479591091669190360f2a7f4d04462e6344 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: c717961
FROMLIST: [PATCH 4/6] arm64: compat: Add a 32-bit vDSO
(cherry pick from url https://patchwork.kernel.org/patch/10060445/) Provide the files necessary for building a compat (AArch32) vDSO in kernel/vdso32. This is mostly an adaptation of the arm vDSO. The most significant change in vgettimeofday.c is the use of the arm64 vdso_data struct, allowing the vDSO data page to be shared between the 32 and 64-bit vDSOs. Additionally, a different set of barrier macros is used (see aarch32-barrier.h), as we want to support old 32-bit compilers that may not support ARMv8 and its new barrier arguments (*ld). In addition to the time functions, sigreturn trampolines are also provided, aiming at replacing those in the sigreturn page as the latter don't provide any unwinding information (and it's easier to have just one "user code" page). arm-specific unwinding directives are used, based on glibc's implementation. Symbol offsets are made available to the kernel using the same method as the 64-bit vDSO. There is unfortunately an important caveat: we cannot get away with hand-coding 32-bit instructions like in kernel/kuser32.S, this time we really need a 32-bit compiler. The compat vDSO Makefile relies on CROSS_COMPILE_ARM32 to provide a 32-bit compiler, appropriate logic will be added to the arm64 Makefile later on to ensure that an attempt to build the compat vDSO is made only if this variable has been set properly. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Take an effort to recode the arm64 vdso code from assembler to C previously submitted by Andrew Pinski <apinski@cavium.com>, rework it for use in both arm and arm64, overlapping any optimizations for each architecture. Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: I3fb9d21b29bd9fec1408f2274d090e6def546b0d Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: c3fc4af
FROMLIST: [PATCH 5/6] arm64: compat: 32-bit vDSO setup
(cherry pick from url https://patchwork.kernel.org/patch/10060459/) If the compat vDSO is enabled, install it in compat processes. In this case, the compat vDSO replaces the sigreturn page (it provides its own sigreturn trampolines). Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: Ia6acf4c3ffea636bc750ac00853ea762c182e5b5 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 2a9ac36
FROMLIST: BACKPORT: [PATCH 6/6] arm64: Wire up and expose the new compat vDSO
(cherry pick from url https://patchwork.kernel.org/patch/10060447/) Expose the new compat vDSO via the COMPAT_VDSO config option. The option is not enabled in defconfig because we really need a 32-bit compiler this time, and we rely on the user to provide it themselves by setting CROSS_COMPILE_ARM32. Therefore enabling the option by default would make little sense, since the user must explicitly set a non-standard environment variable anyway. CONFIG_COMPAT_VDSO is not directly used in the code, because we want to ignore it (build as if it were not set) if the user didn't set CROSS_COMPILE_ARM32. If the variable has been set to a valid prefix, CONFIG_VDSO32 will be set; this is the option that the code and Makefiles test. For more flexibility, like CROSS_COMPILE, CROSS_COMPILE_ARM32 can also be set via CONFIG_CROSS_COMPILE_ARM32 (the environment variable overrides the config option, as expected). Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: Mark Salyzyn <salyzyn@android.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Bug: 63737556 Bug: 20045882 Change-Id: Ie8a7d6c2b5ba3edca591a9a953ce99ec792da882
Commit: 6c23be5
ANDROID: CROSS_COMPILE_ARM32 must work if CONFIG_COMPAT_VDSO
Prevent surprise loss of vdso32 support. Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 63737556 Bug: 20045882 Bug: 19198045 Change-Id: I8b381f7649b95b298ea9e1a99aa3794c7bc08d09 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit: 222ab60
ANDROID: clock_gettime(CLOCK_BOOTTIME,) slows down >20x
clock_gettime(CLOCK_BOOTTIME,) slows down after significant accumulation of suspend time, creating a large offset between it and CLOCK_MONOTONIC time. __iter_div_u64_rem() is intended for adding a few second+nanosecond times while saving cycles on the more expensive remainder and division operations, but it iterates one second at a time, which quickly gets out of scale in CLOCK_BOOTTIME's case since the offset was specified in nanoseconds only. The fix is to split off the seconds from the boot time and cap the nanoseconds so that __iter_div_u64_rem() does not iterate. Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 72406285 Change-Id: Ia647ef1e76b7ba3b0c003028d4b3b955635adabb Signed-off-by: khusika <khusikadhamar@gmail.com>
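A sketch of the arithmetic after the fix; the point is that __iter_div_u64_rem() loops once per second of its dividend, so the dividend must be kept under about two seconds. The helper name and the pre-split boot offset are assumptions for illustration:

```c
#include <linux/math64.h>
#include <linux/time64.h>

/* boot offset pre-split into whole seconds and sub-second nanoseconds */
static void add_boot_offset(struct timespec64 *ts, u64 boot_sec, u32 boot_nsec)
{
	/* both nsec terms are < NSEC_PER_SEC, so the sum is < 2 seconds
	 * and __iter_div_u64_rem() iterates at most once */
	u64 nsec = (u64)ts->tv_nsec + boot_nsec;

	ts->tv_sec += boot_sec;
	ts->tv_sec += __iter_div_u64_rem(nsec, NSEC_PER_SEC, &nsec);
	ts->tv_nsec = nsec;
}
```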
Commit: 402866b
FROMLIST: arm64: vdso32: Use full path to Clang instead of relying on PATH
Currently, in order to build the compat VDSO with Clang, this format has to be used: PATH=${BIN_FOLDER}:${PATH} make CC=clang Prior to the addition of this file, this format would also be acceptable: make CC=${BIN_FOLDER}/clang This is because the vdso32 Makefile uses cc-name instead of CC. After this patch, CC will still evaluate to clang for the first case as expected, but now the second case will use the specified Clang rather than the host's copy, which may not be compatible, as shown below. /usr/bin/as: unrecognized option '-mfloat-abi=soft' clang-6.0: error: assembler command failed with exit code 1 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> (cherry picked from https://patchwork.kernel.org/patch/10419665) Bug: 80184372 Change-Id: If90a5a4edbc2b5883b4c78161081ebeafbebdcde Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit def7eac
vdso32: Invoke clang with correct path to GCC toolchain
Clang needs access to a GCC toolchain, which we advertise using the command line option --gcc-toolchain=. Clang previously picked the wrong toolchain, which resulted in the following error message: /..//bin/as: unrecognized option '-EL' Bug: 123422077 Signed-off-by: Daniel Mentz <danielmentz@google.com> Change-Id: I3e339dd446b71e2c75eb9e2c186eba715b3771cd Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit 8604267
ANDROID: turn on VCT access from 32-bit applications
Deal with the regression from commit 7b4edf240be4a86ede06af51caf056fb1e80682e ("clocksource: arch_timer: make virtual counter access configurable") by selecting ARM_ARCH_TIMER_VCT_ACCESS if COMPAT_VDSO is selected. Signed-off-by: Mark Salyzyn <salyzyn@google.com> Bug: 72417836 Change-Id: Ie11498880941977a8014adb8b8a3b07a6ef82e27 Signed-off-by: khusika <khusikadhamar@gmail.com>
Commit e1b4d7c
FROMLIST: arm64: Build vDSO with -ffixed-x18
The vDSO needs to be built with x18 reserved in order to accommodate userspace platform ABIs built on top of Linux that use the register to carry inter-procedural state, as provided for by the AAPCS. An example of such a platform ABI is the one that will be used by an upcoming version of Android. Although this change is currently a no-op, because the vDSO is currently implemented in pure assembly on arm64, it is necessary in order to prepare for another change [1] that will add C code to the vDSO. [1] https://patchwork.kernel.org/patch/10044501/ Change-Id: Icaac4b1c9127d81d754d3b8688274e9afc781760 Signed-off-by: Peter Collingbourne <pcc@google.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Mark Salyzyn <salyzyn@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Signed-off-by: Khusika Dhamar Gusti <khusikadhamar@gmail.com>
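For illustration only (this is not part of the patch): with -ffixed-x18 in effect for every object in the process, including the vDSO, the compiler never allocates x18, so a platform ABI can pin its own state there. A sketch using a GNU global register variable, assuming an aarch64 compiler invoked with -ffixed-x18:

```c
#include <stdint.h>

/* aarch64 only, and safe only if every translation unit is built with
 * -ffixed-x18; otherwise ordinary generated code would clobber x18. */
register uintptr_t platform_state __asm__("x18");

uintptr_t read_platform_state(void)
{
	/* Compiles to a direct read of x18, with no load from memory. */
	return platform_state;
}
```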
Commit 75c3476
arm64: vdso32: Fix '--prefix=' value for newer versions of clang
Newer versions of clang only look for $(COMPAT_GCC_TOOLCHAIN_DIR)as [1], rather than $(COMPAT_GCC_TOOLCHAIN_DIR)$(CROSS_COMPILE_COMPAT)as, resulting in the following build error: $ make -skj"$(nproc)" ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- \ CROSS_COMPILE_COMPAT=arm-linux-gnueabi- LLVM=1 O=out/aarch64 distclean \ defconfig arch/arm64/kernel/vdso32/ ... /home/nathan/cbl/toolchains/llvm-binutils/bin/as: unrecognized option '-EL' clang-12: error: assembler command failed with exit code 1 (use -v to see invocation) make[3]: *** [arch/arm64/kernel/vdso32/Makefile:181: arch/arm64/kernel/vdso32/note.o] Error 1 ... Appending the value of CROSS_COMPILE_COMPAT (using notdir to account for CROSS_COMPILE_COMPAT being a full path) fixes this issue, matching the solution used in the main Makefile [2]. [1]: llvm/llvm-project@3452a0d [2]: https://lore.kernel.org/lkml/20200721173125.1273884-1-maskray@google.com/ Change-Id: Icbfcface887f655310c247b17fe39ea4a3372d74 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Cc: stable@vger.kernel.org Link: ClangBuiltLinux/linux#1099 Link: https://lore.kernel.org/r/20200723041509.400450-1-natechancellor@gmail.com Signed-off-by: Will Deacon <will@kernel.org> [dl: Backported to 4.14, depends on commit 38253a0ed057c49dde77588eef05fdcb4008ce0b ("vdso32: Invoke clang with correct path to GCC toolchain") from the Pixel 4 kernel] Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Commit ba370fb
ARM64: configs: {davinci,phoenix,toco,tucana}: Enable VDSO
Bug: 119630636 Change-Id: Ibee0c22092130342fbd4c5360f5541d4cd814532 Signed-off-by: Miguel de Dios <migueldedios@google.com>
Commit d38cde4
Revert "drm: msm: dsi-staging: implement doze mode"
This reverts commit 949e5855f7468e94a00d14fe3f21fa549fd7519c.
Commit 03daace
Commits on Apr 1, 2021
-
power: maxim: Fix invalid casting from a pointer to integer.
This fixes the following warning: ../drivers/power/supply/maxim/onewire_gpio.c:460:84: warning: cast to smaller integer type 'uint32_t' (aka 'unsigned int') from 'void *' [-Wvoid-pointer-to-int-cast] ow_log("onewire_data->gpio_cfg66_reg is %x; onewire_data->gpio_in_out_reg is %x", (uint32_t)(onewire_data->gpio_cfg66_reg), (uint32_t)(onewire_data->gpio_in_out_reg));
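The conventional fix is to convert through uintptr_t (or simply print the pointer with %p) so the cast matches the pointer's width. A hedged sketch, with printf standing in for the driver's ow_log() wrapper and the driver's structures elided:

```c
#include <stdint.h>
#include <stdio.h>

static void log_reg(const char *name, void *reg)
{
	/* (uint32_t)reg truncates pointers on 64-bit targets, which is
	 * exactly what -Wvoid-pointer-to-int-cast flags. uintptr_t is
	 * guaranteed wide enough to round-trip any object pointer. */
	printf("%s is %lx\n", name, (unsigned long)(uintptr_t)reg);
}

int main(void)
{
	int reg;

	log_reg("onewire_data->gpio_cfg66_reg", &reg);
	return 0;
}
```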
Commit a218a7a
dm: bow, req-crypt: Remove SECTOR_SIZE macro definition
After commit d24c407b0f7a from v4.9.236, the following two warnings appear during compilation: /mnt/ssd/kernels/mido-4.9/drivers/md/dm-req-crypt.c:54:9: warning: 'SECTOR_SIZE' macro redefined [-Wmacro-redefined] #define SECTOR_SIZE 512 /mnt/ssd/kernels/mido-4.9/include/linux/blkdev.h:873:9: note: previous definition is here #define SECTOR_SIZE (1 << SECTOR_SHIFT) 1 warning generated. /mnt/ssd/kernels/mido-4.9/drivers/md/dm-bow.c:15:9: warning: 'SECTOR_SIZE' macro redefined [-Wmacro-redefined] #define SECTOR_SIZE 512 /mnt/ssd/kernels/mido-4.9/include/linux/blkdev.h:873:9: note: previous definition is here #define SECTOR_SIZE (1 << SECTOR_SHIFT) 1 warning generated. Since the macro is now defined globally, just remove the local definitions from these drivers. Signed-off-by: Albert I <kras@raphielgang.org>
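For reference, the clash and its resolution in miniature (a sketch, not the driver sources verbatim):

```c
/* include/linux/blkdev.h provides the macro globally since d24c407b0f7a: */
#define SECTOR_SHIFT 9
#define SECTOR_SIZE  (1 << SECTOR_SHIFT)	/* == 512 */

/* The driver-local line that -Wmacro-redefined complained about was
 * simply "#define SECTOR_SIZE 512"; the value is identical, so deleting
 * it changes nothing at runtime. */
```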
Commit b19014c
Revert "drm/msm/dsi-staging: enable ULPS"
This reverts commit e8da234.
Commit bfa0471
proc: cmdline: Patch SafetyNet flags
Userspace parses androidboot.* flags from /proc/cmdline and sets the ro.boot.* props accordingly, which in turn trips SafetyNet when the reported values are incorrect. Patch the cmdline flags checked by SafetyNet to prevent it from failing a device. These flags were found by extracting the latest snet.jar and searching for 'ro.boot.' strings. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: celtare21 <celtare21@gmail.com>
Commit 7c41c23
proc: cmdline: Patch vbmeta device state flag for SafetyNet
This is another flag that SafetyNet checks now. Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: celtare21 <celtare21@gmail.com>
Commit 5d7561d
cmdline: replace instead of remove for SafetyNet CTS pass
Replace the flags with fake "locked" values, in the fashion of Magisk, so that phones without Magisk pass the CTS SafetyNet check again. Signed-off-by: sreekfreak995 <sreekfreak995@gmail.com>
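A runnable sketch of the replace-instead-of-remove idea. The patch_flag() helper and its buffer handling are hypothetical (the real patch rewrites the kernel's saved cmdline before it is exposed through /proc/cmdline); the flag names come from the commits above:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: overwrite the value of "key" with "val" in place.
 * Assumes the buffer can absorb the new value; here both replacements
 * are shorter than the originals, so the string only shrinks. */
static void patch_flag(char *cmdline, const char *key, const char *val)
{
	char *p = strstr(cmdline, key);
	if (!p)
		return;
	p += strlen(key);
	size_t old_len = strcspn(p, " ");	/* value runs to the next space */
	size_t new_len = strlen(val);

	memmove(p + new_len, p + old_len, strlen(p + old_len) + 1);
	memcpy(p, val, new_len);
}

int main(void)
{
	char cmdline[128] =
	    "androidboot.verifiedbootstate=orange "
	    "androidboot.vbmeta.device_state=unlocked";

	patch_flag(cmdline, "androidboot.verifiedbootstate=", "green");
	patch_flag(cmdline, "androidboot.vbmeta.device_state=", "locked");
	puts(cmdline);	/* both flags now report the "locked" fake values */
	return 0;
}
```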
Commit 7a6f9b5
cmdline: extend replace for veritymode 'enforcing' SafetyNet
Signed-off-by: sreekfreak995 <sreekfreak995@gmail.com>
Commit 7268e97
Signed-off-by: Pranav Vashi <neobuddy89@gmail.com> Signed-off-by: sreekfreak995 <sreekfreak995@gmail.com>
Commit 13390d7