Commits on Nov 1, 2018
  1. Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux…

    torvalds committed Nov 1, 2018
    …/kernel/git/kees/linux
    
    Pull stackleak gcc plugin from Kees Cook:
     "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
      was ported from grsecurity by Alexander Popov. It provides efficient
      stack content poisoning at syscall exit. This creates a defense
      against at least two classes of flaws:
    
       - Uninitialized stack usage. (We continue to work on improving the
         compiler to do this in other ways: e.g. unconditional zero init was
         proposed to GCC and Clang, and more plugin work has started too).
    
       - Stack content exposure. By greatly reducing the lifetime of valid
         stack contents, exposures via either direct read bugs or unknown
         cache side-channels become much more difficult to exploit. This
         complements the existing buddy and heap poisoning options, but
         provides the coverage for stacks.
    
      The x86 hooks are included in this series (which have been reviewed by
      Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
      been merged through the arm64 tree (written by Laura Abbott and
      reviewed by Mark Rutland and Will Deacon).
    
      With VLAs having been removed this release, there is no need for
      alloca() protection, so it has been removed from the plugin"
    
    * tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
      arm64: Drop unneeded stackleak_check_alloca()
      stackleak: Allow runtime disabling of kernel stack erasing
      doc: self-protection: Add information about STACKLEAK feature
      fs/proc: Show STACKLEAK metrics in the /proc file system
      lkdtm: Add a test for STACKLEAK
      gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
      x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
Commits on Oct 26, 2018
  1. memcg: remove memcg_kmem_skip_account

    shakeelb authored and torvalds committed Oct 26, 2018
    The flag memcg_kmem_skip_account was added during the era of opt-out kmem
    accounting.  There is no need for such flag in the opt-in world as there
    aren't any __GFP_ACCOUNT allocations within memcg_create_cache_enqueue().
    
    Link: http://lkml.kernel.org/r/20180919004501.178023-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. psi: pressure stall information for CPU, memory, and IO

    hnaz authored and torvalds committed Oct 26, 2018
    When systems are overcommitted and resources become contended, it's hard
    to tell exactly the impact this has on workload productivity, or how close
    the system is to lockups and OOM kills.  In particular, when machines work
    multiple jobs concurrently, the impact of overcommit in terms of latency
    and throughput on the individual job can be enormous.
    
    In order to maximize hardware utilization without sacrificing individual
    job health or risk complete machine lockups, this patch implements a way
    to quantify resource pressure in the system.
    
    A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
    expose the percentage of time the system is stalled on CPU, memory, or IO,
    respectively.  Stall states are aggregate versions of the per-task delay
    accounting delays:
    
           cpu: some tasks are runnable but not executing on a CPU
           memory: tasks are reclaiming, or waiting for swapin or thrashing cache
           io: tasks are waiting for io completions
    
    These percentages of walltime can be thought of as pressure percentages,
    and they give a general sense of system health and productivity loss
    incurred by resource overcommit.  They can also indicate when the system
    is approaching lockup scenarios and OOMs.
    
    To do this, psi keeps track of the task states associated with each CPU
    and samples the time they spend in stall states.  Every 2 seconds, the
    samples are averaged across CPUs - weighted by the CPUs' non-idle time to
    eliminate artifacts from unused CPUs - and translated into percentages of
    walltime.  A running average of those percentages is maintained over 10s,
    1m, and 5m periods (similar to the loadaverage).
    
    [hannes@cmpxchg.org: doc fixlet, per Randy]
      Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
    [hannes@cmpxchg.org: code optimization]
      Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
    [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
      Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
    [hannes@cmpxchg.org: fix build]
      Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Daniel Drake <drake@endlessm.com>
    Tested-by: Suren Baghdasaryan <surenb@google.com>
    Cc: Christopher Lameter <cl@linux.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Johannes Weiner <jweiner@fb.com>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Enderborg <peter.enderborg@sony.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vinayak Menon <vinmenon@codeaurora.org>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Oct 24, 2018
  1. Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/ke…

    torvalds committed Oct 24, 2018
    …rnel/git/ebiederm/user-namespace
    
    Pull siginfo updates from Eric Biederman:
     "I have been slowly sorting out siginfo and this is the culmination of
      that work.
    
      The primary result is in several ways the signal infrastructure has
      been made less error prone. The code has been updated so that manually
      specifying SEND_SIG_FORCED is never necessary. The conversion to the
      new siginfo sending functions is now complete, which makes it
      difficult to send a signal without filling in the proper siginfo
      fields.
    
      At the tail end of the patchset comes the optimization of decreasing
      the size of struct siginfo in the kernel from 128 bytes to about 48
      bytes on 64bit. The fundamental observation that enables this is by
      definition none of the known ways to use struct siginfo uses the extra
      bytes.
    
      This comes at the cost of a small user space observable difference.
      For the rare case of siginfo being injected into the kernel only what
      can be copied into kernel_siginfo is delivered to the destination, the
      rest of the bytes are set to 0. For cases where the signal and the
      si_code are known this is safe, because we know those bytes are not
      used. For cases where the signal and si_code combination is unknown
      the bits that won't fit into struct kernel_siginfo are tested to
      verify they are zero, and the send fails if they are not.
    
      I made an extensive search through userspace code and I could not find
      anything that would break because of the above change. If it turns out
      I did break something it will take just the revert of a single change
      to restore kernel_siginfo to the same size as userspace siginfo.
    
      Testing did reveal dependencies on preferring the signo passed to
      sigqueueinfo over si->signo, so bit the bullet and added the
      complexity necessary to handle that case.
    
      Testing also revealed bad things can happen if a negative signal
      number is passed into the system calls. Something no sane application
      will do but something a malicious program or a fuzzer might do. So I
      have fixed the code that performs the bounds checks to ensure negative
      signal numbers are handled"
    
    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
      signal: Guard against negative signal numbers in copy_siginfo_from_user32
      signal: Guard against negative signal numbers in copy_siginfo_from_user
      signal: In sigqueueinfo prefer sig not si_signo
      signal: Use a smaller struct siginfo in the kernel
      signal: Distinguish between kernel_siginfo and siginfo
      signal: Introduce copy_siginfo_from_user and use it's return value
      signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
      signal: Fail sigqueueinfo if si_signo != sig
      signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
      signal/unicore32: Use force_sig_fault where appropriate
      signal/unicore32: Generate siginfo in ucs32_notify_die
      signal/unicore32: Use send_sig_fault where appropriate
      signal/arc: Use force_sig_fault where appropriate
      signal/arc: Push siginfo generation into unhandled_exception
      signal/ia64: Use force_sig_fault where appropriate
      signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
      signal/ia64: Use the generic force_sigsegv in setup_frame
      signal/arm/kvm: Use send_sig_mceerr
      signal/arm: Use send_sig_fault where appropriate
      signal/arm: Use force_sig_fault where appropriate
      ...
Commits on Oct 23, 2018
  1. Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm…

    torvalds committed Oct 23, 2018
    …/linux/kernel/git/tip/tip
    
    Pull locking and misc x86 updates from Ingo Molnar:
     "Lots of changes in this cycle - in part because locking/core attracted
      a number of related x86 low level work which was easier to handle in a
      single tree:
    
       - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
         McKenney, Andrea Parri)
    
       - lockdep scalability improvements and micro-optimizations (Waiman
         Long)
    
       - rwsem improvements (Waiman Long)
    
       - spinlock micro-optimization (Matthew Wilcox)
    
       - qspinlocks: Provide a liveness guarantee (more fairness) on x86.
         (Peter Zijlstra)
    
       - Add support for relative references in jump tables on arm64, x86
         and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)
    
       - Be a lot less permissive on weird (kernel address) uaccess faults
         on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
         Horn)
    
       - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
         Amit)
    
       - ... and a handful of other smaller changes as well"
    
    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
      locking/lockdep: Make global debug_locks* variables read-mostly
      locking/lockdep: Fix debug_locks off performance problem
      locking/pvqspinlock: Extend node size when pvqspinlock is configured
      locking/qspinlock_stat: Count instances of nested lock slowpaths
      locking/qspinlock, x86: Provide liveness guarantee
      x86/asm: 'Simplify' GEN_*_RMWcc() macros
      locking/qspinlock: Rework some comments
      locking/qspinlock: Re-order code
      locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
      x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
      futex: Replace spin_is_locked() with lockdep
      locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
      x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
      x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
      x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
      x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
      x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
      x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
      x86/refcount: Work around GCC inlining bug
      x86/objtool: Use asm macros to work around GCC inlining bugs
      ...
Commits on Oct 3, 2018
  1. signal: Distinguish between kernel_siginfo and siginfo

    ebiederm committed Sep 25, 2018
    Linus recently observed that if we did not worry about the padding
    member in struct siginfo it is only about 48 bytes, and 48 bytes is
    much nicer than 128 bytes for allocating on the stack and copying
    around in the kernel.
    
    The obvious thing of only adding the padding when userspace is
    including siginfo.h won't work as there are sigframe definitions in
    the kernel that embed struct siginfo.
    
    So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
    traditional name for the userspace definition.  While the version that
    is used internally to the kernel and ultimately will not be padded to
    128 bytes is called kernel_siginfo.
    
    The definition of struct kernel_siginfo I have put in include/signal_types.h
    
    A set of buildtime checks has been added to verify the two structures have
    the same field offsets.
    
    To make it easy to verify the change kernel_siginfo retains the same
    size as siginfo.  The reduction in size comes in a following change.
    
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Commits on Sep 4, 2018
  1. fs/proc: Show STACKLEAK metrics in the /proc file system

    a13xp0p0v authored and kees committed Aug 16, 2018
    Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
    tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
    shows the maximum kernel stack consumption for the current and previous
    syscalls. Although this information is not precise, it can be useful for
    estimating the STACKLEAK performance impact for your workloads.
    
    Suggested-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Alexander Popov <alex.popov@linux.com>
    Tested-by: Laura Abbott <labbott@redhat.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
  2. x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

    a13xp0p0v authored and kees committed Aug 16, 2018
    The STACKLEAK feature (initially developed by PaX Team) has the following
    benefits:
    
    1. Reduces the information that can be revealed through kernel stack leak
       bugs. The idea of erasing the thread stack at the end of syscalls is
       similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
       crypto, which all comply with FDP_RIP.2 (Full Residual Information
       Protection) of the Common Criteria standard.
    
    2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
       CVE-2010-2963). That kind of bugs should be killed by improving C
       compilers in future, which might take a long time.
    
    This commit introduces the code filling the used part of the kernel
    stack with a poison value before returning to userspace. Full
    STACKLEAK feature also contains the gcc plugin which comes in a
    separate commit.
    
    The STACKLEAK feature is ported from grsecurity/PaX. More information at:
      https://grsecurity.net/
      https://pax.grsecurity.net/
    
    This code is modified from Brad Spengler/PaX Team's code in the last
    public patch of grsecurity/PaX based on our understanding of the code.
    Changes or omissions from the original code are ours and don't reflect
    the original grsecurity/PaX code.
    
    Performance impact:
    
    Hardware: Intel Core i7-4770, 16 GB RAM
    
    Test #1: building the Linux kernel on a single core
            0.91% slowdown
    
    Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
            4.2% slowdown
    
    So the STACKLEAK description in Kconfig includes: "The tradeoff is the
    performance impact: on a single CPU system kernel compilation sees a 1%
    slowdown, other systems and workloads may vary and you are advised to
    test this feature on your expected workload before deploying it".
    
    Signed-off-by: Alexander Popov <alex.popov@linux.com>
    Acked-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
    Acked-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
Commits on Sep 3, 2018
  1. x86/fault: BUG() when uaccess helpers fault on kernel addresses

    thejh authored and Thomas Gleixner committed Aug 28, 2018
    There have been multiple kernel vulnerabilities that permitted userspace to
    pass completely unchecked pointers through to userspace accessors:
    
     - the waitid() bug - commit 96ca579 ("waitid(): Add missing
       access_ok() checks")
     - the sg/bsg read/write APIs
     - the infiniband read/write APIs
    
    These don't happen all that often, but when they do happen, it is hard to
    test for them properly; and it is probably also hard to discover them with
    fuzzing. Even when an unmapped kernel address is supplied to such buggy
    code, it just returns -EFAULT instead of doing a proper BUG() or at least
    WARN().
    
    Try to make such misbehaving code a bit more visible by refusing to do a
    fixup in the pagefault handler code when a userspace accessor causes a #PF
    on a kernel address and the current context isn't whitelisted.
    
    Signed-off-by: Jann Horn <jannh@google.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Kees Cook <keescook@chromium.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: kernel-hardening@lists.openwall.com
    Cc: dvyukov@google.com
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
    Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Borislav Petkov <bp@alien8.de>
    Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
Commits on Aug 30, 2018
  1. rcu: Remove now-unused ->b.exp_need_qs field from the rcu_special union

    paulmck committed Jun 28, 2018
    The ->b.exp_need_qs field is now set only to false, so this commit
    removes it.  The job this field used to do is now done by the rcu_data
    structure's ->deferred_qs field, which is a consequence of a better
    split between task-based (the rcu_node structure's ->exp_tasks field) and
    CPU-based (the aforementioned rcu_data structure's ->deferred_qs field)
    tracking of quiescent states for RCU-preempt expedited grace periods.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commits on Aug 22, 2018
  1. Merge branch 'akpm' (patches from Andrew)

    torvalds committed Aug 22, 2018
    Merge more updates from Andrew Morton:
    
     - the rest of MM
    
     - procfs updates
    
     - various misc things
    
     - more y2038 fixes
    
     - get_maintainer updates
    
     - lib/ updates
    
     - checkpatch updates
    
     - various epoll updates
    
     - autofs updates
    
     - hfsplus
    
     - some reiserfs work
    
     - fatfs updates
    
     - signal.c cleanups
    
     - ipc/ updates
    
    * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
      ipc/util.c: update return value of ipc_getref from int to bool
      ipc/util.c: further variable name cleanups
      ipc: simplify ipc initialization
      ipc: get rid of ids->tables_initialized hack
      lib/rhashtable: guarantee initial hashtable allocation
      lib/rhashtable: simplify bucket_table_alloc()
      ipc: drop ipc_lock()
      ipc/util.c: correct comment in ipc_obtain_object_check
      ipc: rename ipcctl_pre_down_nolock()
      ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
      ipc: reorganize initialization of kern_ipc_perm.seq
      ipc: compute kern_ipc_perm.id under the ipc lock
      init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
      fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
      adfs: use timespec64 for time conversion
      kernel/sysctl.c: fix typos in comments
      drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
      fork: don't copy inconsistent signal handler state to child
      signal: make get_signal() return bool
      signal: make sigkill_pending() return bool
      ...
  2. kernel/hung_task.c: allow to set checking interval separately from ti…

    dvyukov authored and torvalds committed Aug 22, 2018
    …meout
    
    Currently task hung checking interval is equal to timeout, as the result
    hung is detected anywhere between timeout and 2*timeout.  This is fine for
    most interactive environments, but this hurts automated testing setups
    (syzbot).  In an automated setup we need to strictly order CPU lockup <
    RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
    is not detected as task hung and task hung is not detected as silent
    machine loss.  The large variance in task hung detection timeout requires
    setting silent machine loss timeout to a very large value (e.g.  if task
    hung is 3 mins, then silent loss need to be set to ~7 mins).  The
    additional 3 minutes significantly reduce testing efficiency because
    usually we crash kernel within a minute, and this can add hours to bug
    localization process as it needs to do dozens of tests.
    
    Allow setting checking interval separately from timeout.  This allows to
    set timeout to, say, 3 minutes, but checking interval to 10 secs.
    
    The interval is controlled via a new hung_task_check_interval_secs sysctl,
    similar to the existing hung_task_timeout_secs sysctl.  The default value
    of 0 results in the current behavior: checking interval is equal to
    timeout.
    
    [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
    Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Aug 21, 2018
  1. Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/ke…

    torvalds committed Aug 21, 2018
    …rnel/git/ebiederm/user-namespace
    
    Pull core signal handling updates from Eric Biederman:
     "It was observed that a periodic timer in combination with a
      sufficiently expensive fork could prevent fork from every completing.
      This contains the changes to remove the need for that restart.
    
      This set of changes is split into several parts:
    
       - The first part makes PIDTYPE_TGID a proper pid type instead
         something only for very special cases. The part starts using
         PIDTYPE_TGID enough so that in __send_signal where signals are
         actually delivered we know if the signal is being sent to a a group
         of processes or just a single process.
    
       - With that prep work out of the way the logic in fork is modified so
         that fork logically makes signals received while it is running
         appear to be received after the fork completes"
    
    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
      signal: Don't send signals to tasks that don't exist
      signal: Don't restart fork when signals come in.
      fork: Have new threads join on-going signal group stops
      fork: Skip setting TIF_SIGPENDING in ptrace_init_task
      signal: Add calculate_sigpending()
      fork: Unconditionally exit if a fatal signal is pending
      fork: Move and describe why the code examines PIDNS_ADDING
      signal: Push pid type down into complete_signal.
      signal: Push pid type down into __send_signal
      signal: Push pid type down into send_signal
      signal: Pass pid type into do_send_sig_info
      signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
      signal: Pass pid type into group_send_sig_info
      signal: Pass pid and pid type into send_sigqueue
      posix-timers: Noralize good_sigevent
      signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
      pid: Implement PIDTYPE_TGID
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct
      kvm: Don't open code task_pid in kvm_vcpu_ioctl
      pids: Compute task_tgid using signal->leader_pid
      ...
Commits on Aug 17, 2018
  1. mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CO…

    tkhai authored and torvalds committed Aug 17, 2018
    …NFIG_SLOB
    
    Introduce new config option, which is used to replace repeating
    CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
    memcg+kmem related code, so let's keep the defines more clearly.
    
    Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
    Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
    Tested-by: Shakeel Butt <shakeelb@google.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Guenter Roeck <linux@roeck-us.net>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Josef Bacik <jbacik@fb.com>
    Cc: Li RongQing <lirongqing@baidu.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Matthias Kaehlcke <mka@chromium.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Philippe Ombredanne <pombredanne@nexb.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Sahitya Tummala <stummala@codeaurora.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Waiman Long <longman@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. memcg, oom: move out_of_memory back to the charge path

    Michal Hocko authored and torvalds committed Aug 17, 2018
    Commit 3812c8c ("mm: memcg: do not trap chargers with full
    callstack on OOM") has changed the ENOMEM semantic of memcg charges.
    Rather than invoking the oom killer from the charging context it delays
    the oom killer to the page fault path (pagefault_out_of_memory).  This
    in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
    the corresponding memcg hits the hard limit and the memcg is is OOM.
    This is behavior is inconsistent with !memcg case where the oom killer
    is invoked from the allocation context and the allocator keeps retrying
    until it succeeds.
    
    The difference in the behavior is user visible.  mmap(MAP_POPULATE)
    might result in not fully populated ranges while the mmap return code
    doesn't tell that to the userspace.  Random syscalls might fail with
    ENOMEM etc.
    
    The primary motivation of the different memcg oom semantic was the
    deadlock avoidance.  Things have changed since then, though.  We have an
    async oom teardown by the oom reaper now and so we do not have to rely
    on the victim to tear down its memory anymore.  Therefore we can return
    to the original semantic as long as the memcg oom killer is not handed
    over to the users space.
    
    There is still one thing to be careful about here though.  If the oom
    killer is not able to make any forward progress - e.g.  because there is
    no eligible task to kill - then we have to bail out of the charge path
    to prevent from same class of deadlocks.  We have basically two options
    here.  Either we fail the charge with ENOMEM or force the charge and
    allow overcharge.  The first option has been considered more harmful
    than useful because rare inconsistencies in the ENOMEM behavior is hard
    to test for and error prone.  Basically the same reason why the page
    allocator doesn't fail allocations under such conditions.  The later
    might allow runaways but those should be really unlikely unless somebody
    misconfigures the system.  E.g.  allowing to migrate tasks away from the
    memcg to a different unlimited memcg with move_charge_at_immigrate
    disabled.
    
    Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Greg Thelen <gthelen@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Shakeel Butt <shakeelb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. fs: fsnotify: account fsnotify metadata to kmemcg

    shakeelb authored and torvalds committed Aug 17, 2018
    Patch series "Directed kmem charging", v8.
    
    The Linux kernel's memory cgroup allows limiting the memory usage of the
    jobs running on the system to provide isolation between the jobs.  All
    the kernel memory allocated in the context of the job and marked with
    __GFP_ACCOUNT will also be included in the memory usage and be limited
    by the job's limit.
    
    The kernel memory can only be charged to the memcg of the process in
    whose context kernel memory was allocated.  However there are cases
    where the allocated kernel memory should be charged to the memcg
    different from the current processes's memcg.  This patch series
    contains two such concrete use-cases i.e.  fsnotify and buffer_head.
    
    The fsnotify event objects can consume a lot of system memory for large
    or unlimited queues if there is either no or slow listener.  The events
    are allocated in the context of the event producer.  However they should
    be charged to the event consumer.  Similarly the buffer_head objects can
    be allocated in a memcg different from the memcg of the page for which
    buffer_head objects are being allocated.
    
    To solve this issue, this patch series introduces mechanism to charge
    kernel memory to a given memcg.  In case of fsnotify events, the memcg
    of the consumer can be used for charging and for buffer_head, the memcg
    of the page can be charged.  For directed charging, the caller can use
    the scope API memalloc_[un]use_memcg() to specify the memcg to charge
    for all the __GFP_ACCOUNT allocations within the scope.
    
    This patch (of 2):
    
    A lot of memory can be consumed by the events generated for the huge or
    unlimited queues if there is either no or slow listener.  This can cause
    system level memory pressure or OOMs.  So, it's better to account the
    fsnotify kmem caches to the memcg of the listener.
    
    However the listener can be in a different memcg than the memcg of the
    producer and these allocations happen in the context of the event
    producer.  This patch introduces remote memcg charging API which the
    producer can use to charge the allocations to the memcg of the listener.
    
    There are seven fsnotify kmem caches and among them allocations from
    dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
    inotify_inode_mark_cachep happens in the context of syscall from the
    listener.  So, SLAB_ACCOUNT is enough for these caches.
    
    The objects from fsnotify_mark_connector_cachep are not accounted as
    they are small compared to the notification mark or events and it is
    unclear whom to account connector to since it is shared by all events
    attached to the inode.
    
    The allocations from the event caches happen in the context of the event
    producer.  For such caches we will need to remote charge the allocations
    to the listener's memcg.  Thus we save the memcg reference in the
    fsnotify_group structure of the listener.
    
    This patch has also moved the members of fsnotify_group to keep the size
    same, at least for 64 bit build, even with additional member by filling
    the holes.
    
    [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
      Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Amir Goldstein <amir73il@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Aug 14, 2018
  1. Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

    torvalds committed Aug 14, 2018
    Pull block updates from Jens Axboe:
     "First pull request for this merge window, there will also be a
      followup request with some stragglers.
    
      This pull request contains:
    
       - Fix for a thundering heard issue in the wbt block code (Anchal
         Agarwal)
    
       - A few NVMe pull requests:
          * Improved tracepoints (Keith)
          * Larger inline data support for RDMA (Steve Wise)
          * RDMA setup/teardown fixes (Sagi)
          * Effects log suppor for NVMe target (Chaitanya Kulkarni)
          * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
          * TP4004 (ANA) support (Christoph)
          * Various NVMe fixes
    
       - Block io-latency controller support. Much needed support for
         properly containing block devices. (Josef)
    
       - Series improving how we handle sense information on the stack
         (Kees)
    
       - Lightnvm fixes and updates/improvements (Mathias/Javier et al)
    
       - Zoned device support for null_blk (Matias)
    
       - AIX partition fixes (Mauricio Faria de Oliveira)
    
       - DIF checksum code made generic (Max Gurtovoy)
    
       - Add support for discard in iostats (Michael Callahan / Tejun)
    
       - Set of updates for BFQ (Paolo)
    
       - Removal of async write support for bsg (Christoph)
    
       - Bio page dirtying and clone fixups (Christoph)
    
       - Set of bcache fix/changes (via Coly)
    
       - Series improving blk-mq queue setup/teardown speed (Ming)
    
       - Series improving merging performance on blk-mq (Ming)
    
       - Lots of other fixes and cleanups from a slew of folks"
    
    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
      blkcg: Make blkg_root_lookup() work for queues in bypass mode
      bcache: fix error setting writeback_rate through sysfs interface
      null_blk: add lock drop/acquire annotation
      Blk-throttle: reduce tail io latency when iops limit is enforced
      block: paride: pd: mark expected switch fall-throughs
      block: Ensure that a request queue is dissociated from the cgroup controller
      block: Introduce blk_exit_queue()
      blkcg: Introduce blkg_root_lookup()
      block: Remove two superfluous #include directives
      blk-mq: count the hctx as active before allocating tag
      block: bvec_nr_vecs() returns value for wrong slab
      bcache: trivial - remove tailing backslash in macro BTREE_FLAG
      bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
      bcache: set max writeback rate when I/O request is idle
      bcache: add code comments for bset.c
      bcache: fix mistaken comments in request.c
      bcache: fix mistaken code comments in bcache.h
      bcache: add a comment in super.c
      bcache: avoid unncessary cache prefetch bch_btree_node_get()
      bcache: display rate debug parameters to 0 when writeback is not running
      ...
Commits on Aug 13, 2018
  1. Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm…

    torvalds committed Aug 13, 2018
    …/linux/kernel/git/tip/tip
    
    Pull locking/atomics update from Thomas Gleixner:
     "The locking, atomics and memory model brains delivered:
    
       - A larger update to the atomics code which reworks the ordering
         barriers, consolidates the atomic primitives, provides the new
         atomic64_fetch_add_unless() primitive and cleans up the include
         hell.
    
       - Simplify cmpxchg() instrumentation and add instrumentation for
         xchg() and cmpxchg_double().
    
       - Updates to the memory model and documentation"
    
    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
      locking/atomics: Rework ordering barriers
      locking/atomics: Instrument cmpxchg_double*()
      locking/atomics: Instrument xchg()
      locking/atomics: Simplify cmpxchg() instrumentation
      locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
      tools/memory-model: Rename litmus tests to comply to norm7
      tools/memory-model/Documentation: Fix typo, smb->smp
      sched/Documentation: Update wake_up() & co. memory-barrier guarantees
      locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
      sched/core: Use smp_mb() in wake_woken_function()
      tools/memory-model: Add informal LKMM documentation to MAINTAINERS
      locking/atomics/Documentation: Describe atomic_set() as a write operation
      tools/memory-model: Make scripts executable
      tools/memory-model: Remove ACCESS_ONCE() from model
      tools/memory-model: Remove ACCESS_ONCE() from recipes
      locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
      MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
      tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
      tools/memory-model: Add litmus test for full multicopy atomicity
      locking/refcount: Always allow checked forms
      ...
Commits on Jul 25, 2018
  1. sched/numa: Remove redundant field

    srikard authored and Ingo Molnar committed Jun 20, 2018
    'numa_entry' is a struct list_head defined in task_struct, but never used.
    
    No functional change.
    
    Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Rik van Riel <riel@surriel.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/1529514181-9842-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commits on Jul 21, 2018
  1. pid: Implement PIDTYPE_TGID

    ebiederm committed Jun 4, 2017
    Everywhere except in the pid array we distinguish between a tasks pid and
    a tasks tgid (thread group id).  Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
    
    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array.  Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.
    
    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.
    
    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to it's fullest.  The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.
    
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  2. pids: Move the pgrp and session pid pointers from task_struct to sign…

    ebiederm committed Sep 26, 2017
    …al_struct
    
    To access these fields the code always has to go to group leader so
    going to signal struct is no loss and is actually a fundamental simplification.
    
    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.
    
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  3. pids: Compute task_tgid using signal->leader_pid

    ebiederm committed Sep 26, 2017
    The cost is the the same and this removes the need
    to worry about complications that come from de_thread
    and group_leader changing.
    
    __task_pid_nr_ns has been updated to take advantage of this change.
    
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Commits on Jul 17, 2018
  1. sched/Documentation: Update wake_up() & co. memory-barrier guarantees

    Andrea Parri Ingo Molnar
    Andrea Parri authored and Ingo Molnar committed Jul 16, 2018
    Both the implementation and the users' expectation [1] for the various
    wakeup primitives have evolved over time, but the documentation has not
    kept up with these changes: brings it into 2018.
    
    [1] http://lkml.kernel.org/r/20180424091510.GB4064@hirez.programming.kicks-ass.net
    
    Also applied feedback from Alan Stern.
    
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Akira Yokosawa <akiyks@gmail.com>
    Cc: Alan Stern <stern@rowland.harvard.edu>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Daniel Lustig <dlustig@nvidia.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Jade Alglave <j.alglave@ucl.ac.uk>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Luc Maranget <luc.maranget@inria.fr>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: linux-arch@vger.kernel.org
    Cc: parri.andrea@gmail.com
    Link: http://lkml.kernel.org/r/20180716180605.16115-12-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commits on Jul 9, 2018
  1. blkcg: add generic throttling mechanism

    Josef Bacik authored and axboe committed Jul 3, 2018
    Since IO can be issued from literally anywhere it's almost impossible to
    do throttling without having some sort of adverse effect somewhere else
    in the system because of locking or other dependencies.  The best way to
    solve this is to do the throttling when we know we aren't holding any
    other kernel resources.  Do this by tracking throttling in a per-blkg
    basis, and if we require throttling flag the task that it needs to check
    before it returns to user space and possibly sleep there.
    
    This is to address the case where a process is doing work that is
    generating IO that can't be throttled, whether that is directly with a
    lot of REQ_META IO, or indirectly by allocating so much memory that it
    is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
    don't want to induce a memory allocation in the IO path, so simply
    saving the request queue in the task and flagging it to do the
    notify_resume thing achieves the same result without the overhead of a
    memory allocation.
    
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commits on Jul 3, 2018
  1. kthread, sched/core: Fix kthread_parkme() (again...)

    Peter Zijlstra Ingo Molnar
    Peter Zijlstra authored and Ingo Molnar committed Jun 7, 2018
    Gaurav reports that commit:
    
      85f1abe ("kthread, sched/wait: Fix kthread_parkme() completion issue")
    
    isn't working for him. Because of the following race:
    
    > controller Thread                               CPUHP Thread
    > takedown_cpu
    > kthread_park
    > kthread_parkme
    > Set KTHREAD_SHOULD_PARK
    >                                                 smpboot_thread_fn
    >                                                 set Task interruptible
    >
    >
    > wake_up_process
    >  if (!(p->state & state))
    >                 goto out;
    >
    >                                                 Kthread_parkme
    >                                                 SET TASK_PARKED
    >                                                 schedule
    >                                                 raw_spin_lock(&rq->lock)
    > ttwu_remote
    > waiting for __task_rq_lock
    >                                                 context_switch
    >
    >                                                 finish_lock_switch
    >
    >
    >
    >                                                 Case TASK_PARKED
    >                                                 kthread_park_complete
    >
    >
    > SET Running
    
    Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
    handling is buggered because the TASK_DEAD thing is done with
    preemption disabled, the current code can still complete early on
    preemption :/
    
    So basically revert that earlier fix and go with a variant of the
    alternative mentioned in the commit. Promote TASK_PARKED to special
    state to avoid the store-store issue on task->state leading to the
    WARN in kthread_unpark() -> __kthread_bind().
    
    But in addition, add wait_task_inactive() to kthread_park() to ensure
    the task really is PARKED when we return from kthread_park(). This
    avoids the whole kthread still gets migrated nonsense -- although it
    would be really good to get this done differently.
    
    Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 85f1abe ("kthread, sched/wait: Fix kthread_parkme() completion issue")
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commits on Jun 22, 2018
  1. rseq: Avoid infinite recursion when delivering SIGSEGV

    wildea01 authored and Thomas Gleixner committed Jun 22, 2018
    When delivering a signal to a task that is using rseq, we call into
    __rseq_handle_notify_resume() so that the registers pushed in the
    sigframe are updated to reflect the state of the restartable sequence
    (for example, ensuring that the signal returns to the abort handler if
    necessary).
    
    However, if the rseq management fails due to an unrecoverable fault when
    accessing userspace or certain combinations of RSEQ_CS_* flags, then we
    will attempt to deliver a SIGSEGV. This has the potential for infinite
    recursion if the rseq code continuously fails on signal delivery.
    
    Avoid this problem by using force_sigsegv() instead of force_sig(), which
    is explicitly designed to reset the SEGV handler to SIG_DFL in the case
    of a recursive fault. In doing so, remove rseq_signal_deliver() from the
    internal rseq API and have an optional struct ksignal * parameter to
    rseq_handle_notify_resume() instead.
    
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: peterz@infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Cc: boqun.feng@gmail.com
    Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
Commits on Jun 21, 2018
  1. rseq/cleanup: Do not abort rseq c.s. in child on fork()

    compudj authored and Ingo Molnar committed Jun 19, 2018
    Considering that we explicitly forbid system calls in rseq critical
    sections, it is not valid to issue a fork or clone system call within a
    rseq critical section, so rseq_fork() is not required to restart an
    active rseq c.s. in the child process.
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Ben Maurer <bmaurer@fb.com>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Chris Lameter <cl@linux.com>
    Cc: Dave Watson <davejwatson@fb.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Russell King <linux@arm.linux.org.uk>
    Cc: Shuah Khan <shuahkh@osg.samsung.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: linux-api@vger.kernel.org
    Cc: linux-kselftest@vger.kernel.org
    Link: https://lore.kernel.org/lkml/20180619133230.4087-4-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commits on Jun 14, 2018
  1. sched/core / kcov: avoid kcov_area during task switch

    Mark Rutland authored and torvalds committed Jun 14, 2018
    During a context switch, we first switch_mm() to the next task's mm,
    then switch_to() that new task.  This means that vmalloc'd regions which
    had previously been faulted in can transiently disappear in the context
    of the prev task.
    
    Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
    during this window, and as the fault handling code is instrumented, this
    results in a recursive fault.
    
    We must avoid accessing any kcov_area during this window.  We can do so
    with a new flag in kcov_mode, set prior to switching the mm, and cleared
    once the new task is live.  Since task_struct::kcov_mode isn't always a
    specific enum kcov_mode value, this is made an unsigned int.
    
    The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
    which are empty for !CONFIG_KCOV kernels.
    
    The code uses macros because I can't use static inline functions without a
    circular include dependency between <linux/sched.h> and <linux/kcov.h>,
    since the definition of task_struct uses things defined in <linux/kcov.h>
    
    Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables

    torvalds committed Jun 14, 2018
    The changes to automatically test for working stack protector compiler
    support in the Kconfig files removed the special STACKPROTECTOR_AUTO
    option that picked the strongest stack protector that the compiler
    supported.
    
    That was all a nice cleanup - it makes no sense to have the AUTO case
    now that the Kconfig phase can just determine the compiler support
    directly.
    
    HOWEVER.
    
    It also meant that doing "make oldconfig" would now _disable_ the strong
    stackprotector if you had AUTO enabled, because in a legacy config file,
    the sane stack protector configuration would look like
    
      CONFIG_HAVE_CC_STACKPROTECTOR=y
      # CONFIG_CC_STACKPROTECTOR_NONE is not set
      # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
      # CONFIG_CC_STACKPROTECTOR_STRONG is not set
      CONFIG_CC_STACKPROTECTOR_AUTO=y
    
    and when you ran this through "make oldconfig" with the Kbuild changes,
    it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
    been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
    CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
    used to be disabled (because it was really enabled by AUTO), and would
    disable it in the new config, resulting in:
    
      CONFIG_HAVE_CC_STACKPROTECTOR=y
      CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
      CONFIG_CC_STACKPROTECTOR=y
      # CONFIG_CC_STACKPROTECTOR_STRONG is not set
      CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
    
    That's dangerously subtle - people could suddenly find themselves with
    the weaker stack protector setup without even realizing.
    
    The solution here is to just rename not just the old RECULAR stack
    protector option, but also the strong one.  This does that by just
    removing the CC_ prefix entirely for the user choices, because it really
    is not about the compiler support (the compiler support now instead
    automatially impacts _visibility_ of the options to users).
    
    This results in "make oldconfig" actually asking the user for their
    choice, so that we don't have any silent subtle security model changes.
    The end result would generally look like this:
    
      CONFIG_HAVE_CC_STACKPROTECTOR=y
      CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
      CONFIG_STACKPROTECTOR=y
      CONFIG_STACKPROTECTOR_STRONG=y
      CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
    
    where the "CC_" versions really are about internal compiler
    infrastructure, not the user selections.
    
    Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Jun 12, 2018
  1. Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

    torvalds committed Jun 12, 2018
    Pull KVM updates from Paolo Bonzini:
     "Small update for KVM:
    
      ARM:
       - lazy context-switching of FPSIMD registers on arm64
       - "split" regions for vGIC redistributor
    
      s390:
       - cleanups for nested
       - clock handling
       - crypto
       - storage keys
       - control register bits
    
      x86:
       - many bugfixes
       - implement more Hyper-V super powers
       - implement lapic_timer_advance_ns even when the LAPIC timer is
         emulated using the processor's VMX preemption timer.
       - two security-related bugfixes at the top of the branch"
    
    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits)
      kvm: fix typo in flag name
      kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
      KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
      KVM: x86: introduce linear_{read,write}_system
      kvm: nVMX: Enforce cpl=0 for VMX instructions
      kvm: nVMX: Add support for "VMWRITE to any supported field"
      kvm: nVMX: Restrict VMX capability MSR changes
      KVM: VMX: Optimize tscdeadline timer latency
      KVM: docs: nVMX: Remove known limitations as they do not exist now
      KVM: docs: mmu: KVM support exposing SLAT to guests
      kvm: no need to check return value of debugfs_create functions
      kvm: Make VM ioctl do valloc for some archs
      kvm: Change return type to vm_fault_t
      KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
      kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
      KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
      KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
      KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
      KVM: introduce kvm_make_vcpus_request_mask() API
      KVM: x86: hyperv: do rep check for each hypercall separately
      ...
Commits on Jun 6, 2018
  1. rseq: Introduce restartable sequences system call

    compudj authored and Thomas Gleixner committed Jun 2, 2018
    Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.
    
    * Restartable sequences (per-cpu atomics)
    
    Restartables sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.
    
    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.
    
    Here are benchmarks of various rseq use-cases.
    
    Test hardware:
    
    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
    
    The following benchmarks were all performed on a single thread.
    
    * Per-CPU statistic counter increment
    
                    getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:                344.0                 31.4          11.0
    x86-64:                15.3                  2.0           7.7
    
    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                 per-cpu buffer
    
                    getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:               2502.0                 2250.0         1.1
    x86-64:               117.4                   98.0         1.2
    
    * liburcu percpu: lock-unlock pair, dereference, read/compare word
    
                    getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:                751.0                 128.5          5.8
    x86-64:                53.4                  28.6          1.9
    
    * jemalloc memory allocator adapted to use rseq
    
    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):
    
    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.
    
    * Reading the current CPU number
    
    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up do date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.
    
    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over current mechanisms available to read
    the current CPU number, which has the following benefits over
    alternative approaches:
    
    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
      executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
      assembly, which makes it a useful building block for restartable
      sequences.
    - The approach of reading the cpu id through memory mapping shared
      between kernel and user-space is portable (e.g. ARM), which is not the
      case for the lsl-based x86 vdso.
    
    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.
    
    Benchmarking various approaches for reading the current CPU number:
    
    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop):                                    8.4 ns
    - Read CPU from rseq cpu_id:                               16.7 ns
    - Read CPU from rseq cpu_id (lazy register):               19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
    - getcpu system call:                                     234.9 ns
    
    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop):                                    0.8 ns
    - Read CPU from rseq cpu_id:                                0.8 ns
    - Read CPU from rseq cpu_id (lazy register):                0.8 ns
    - Read using gs segment selector:                           0.8 ns
    - "lsl" inline assembly:                                   13.0 ns
    - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
    - getcpu system call:                                      53.9 ns
    
    - Speed (benchmark taken on v8 of patchset)
    
    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:
    
    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.
    
    * CONFIG_RSEQ=n
    
    avg.:      41.37 s
    std.dev.:   0.36 s
    
    * CONFIG_RSEQ=y
    
    avg.:      40.46 s
    std.dev.:   0.33 s
    
    - Size
    
    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.
    
    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
    
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Dave Watson <davejwatson@fb.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: "H . Peter Anvin" <hpa@zytor.com>
    Cc: Chris Lameter <cl@linux.com>
    Cc: Russell King <linux@arm.linux.org.uk>
    Cc: Andrew Hunter <ahh@google.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Paul Turner <pjt@google.com>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Ben Maurer <bmaurer@fb.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
Commits on Jun 5, 2018
  1. Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/l…

    torvalds committed Jun 5, 2018
    …inux/kernel/git/tip/tip
    
    Pull scheduler updates from Ingo Molnar:
    
     - power-aware scheduling improvements (Patrick Bellasi)
    
     - NUMA balancing improvements (Mel Gorman)
    
     - vCPU scheduling fixes (Rohit Jain)
    
    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      sched/fair: Update util_est before updating schedutil
      sched/cpufreq: Modify aggregate utilization to always include blocked FAIR utilization
      sched/deadline/Documentation: Add overrun signal and GRUB-PA documentation
      sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
      sched/wait: Include <linux/wait.h> in <linux/swait.h>
      sched/numa: Stagger NUMA balancing scan periods for new threads
      sched/core: Don't schedule threads on pre-empted vCPUs
      sched/fair: Avoid calling sync_entity_load_avg() unnecessarily
      sched/fair: Rearrange select_task_rq_fair() to optimize it
Commits on Jun 4, 2018
  1. Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/lin…

    torvalds committed Jun 4, 2018
    …ux/kernel/git/tip/tip
    
    Pull RCU updates from Ingo Molnar:
    
     - updates to the handling of expedited grace periods
    
     - updates to reduce lock contention in the rcu_node combining tree
    
       [ These are in preparation for the consolidation of RCU-bh,
         RCU-preempt, and RCU-sched into a single flavor, which was
         requested by Linus in response to a security flaw whose root cause
         included confusion between the multiple flavors of RCU ]
    
     - torture-test updates that save their users some time and effort
    
     - miscellaneous fixes
    
    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
      rcu/x86: Provide early rcu_cpu_starting() callback
      torture: Make kvm-find-errors.sh find build warnings
      rcutorture: Abbreviate kvm.sh summary lines
      rcutorture: Print end-of-test state in kvm.sh summary
      rcutorture: Print end-of-test state
      torture: Fold parse-torture.sh into parse-console.sh
      torture: Add a script to edit output from failed runs
      rcu: Update list of rcu_future_grace_period() trace events
      rcu: Drop early GP request check from rcu_gp_kthread()
      rcu: Simplify and inline cpu_needs_another_gp()
      rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp()
      rcu: Make rcu_start_this_gp() check for out-of-range requests
      rcu: Add funnel locking to rcu_start_this_gp()
      rcu: Make rcu_start_future_gp() caller select grace period
      rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp()
      rcu: Clear request other than RCU_GP_FLAG_INIT at GP end
      rcu: Cleanup, don't put ->completed into an int
      rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs()
      rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition
      rcu: Make rcu_migrate_callbacks wake GP kthread when needed
      ...
Commits on Jun 1, 2018
  1. Merge tag 'kvmarm-for-v4.18' of git://git.kernel.org/pub/scm/linux/ke…

    bonzini committed Jun 1, 2018
    …rnel/git/kvmarm/kvmarm into HEAD
    
    KVM/ARM updates for 4.18
    
    - Lazy context-switching of FPSIMD registers on arm64
    - Allow virtual redistributors to be part of two or more MMIO ranges
Commits on May 25, 2018
  1. thread_info: Add update_thread_flag() helpers

    Dave Martin Marc Zyngier
    Dave Martin authored and Marc Zyngier committed Apr 11, 2018
    There are a number of bits of code sprinkled around the kernel to
    set a thread flag if a certain condition is true, and clear it
    otherwise.
    
    To help make those call sites terser and less cumbersome, this
    patch adds a new family of thread flag manipulators
    
    	update*_thread_flag([...,] flag, cond)
    
    which do the equivalent of:
    
    	if (cond)
    		set*_thread_flag([...,] flag);
    	else
    		clear*_thread_flag([...,] flag);
    
    Signed-off-by: Dave Martin <Dave.Martin@arm.com>
    Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
    Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Acked-by: Marc Zyngier <marc.zyngier@arm.com>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>